gnomad_qc.v5.variant_qc.create_truth_samples_vds
Script to create a VDS of the 8 Genomes-in-a-Bottle (GiaB) truth samples from their gVCFs.
The GiaB gVCFs were sequenced with the same protocol as the AoU v8 data and live in the
AoU control-samples bucket
(gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux/qc/control_samples/).
They are already reblocked, so they are passed straight into Hail’s VDS combiner (no
reblocking step).
The combiner needs the per-gVCF paths up front. The truth-sample bucket cannot be
listed and the sample IDs might be sensitive, so neither the paths nor the IDs are stored
in this repo. Instead, the script reads a single-column TSV manifest of gVCF paths
(truth_samples_gvcf_paths) directly from its known object path.
This is intended to run in the batch environment (Hail Batch in the AoU authorization
domain), since that is where the AoU truth-sample gVCFs are readable.
usage: gnomad_qc.v5.variant_qc.create_truth_samples_vds.py [-h] [--overwrite]
[--test]
[--manifest-path MANIFEST_PATH]
[--create-truth-samples-vds]
[--validate-truth-samples-vds]
[--gcp-billing-project GCP_BILLING_PROJECT]
[--experimental]
[--app-name APP_NAME]
[--driver-cores DRIVER_CORES]
[--driver-memory DRIVER_MEMORY]
[--jvm-heap-size JVM_HEAP_SIZE]
[--worker-cores WORKER_CORES]
[--worker-memory WORKER_MEMORY]
Named Arguments
- --overwrite
Overwrite any stale saved combiner plan. Does not overwrite an existing VDS (the combiner will fail if the output VDS already exists, regardless of this flag).
Default: False
- --test
Cheap end-to-end test: combine only the first gVCF in the manifest over a single small interval (chr1:55039447-55064852) and write to a temporary path.
Default: False
- --manifest-path
GCS path to the single-column TSV listing each truth-sample gVCF path (one ‘gs://…’ path per line).
Default: “gs://fc-11093c2b-590e-424a-91ac-0cc040d562fc/v5.0/variant_qc/genomes/aou/truth_samples/truth_samples_gvcf_paths.tsv”
- --create-truth-samples-vds
Run the VDS combiner to create the truth-samples VDS.
Default: False
- --validate-truth-samples-vds
Validate the combined VDS (structural invariants + sample/row sanity counts). Can be run in the same invocation as --create-truth-samples-vds.
Default: False
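As an illustration of how the named arguments combine, here is a minimal argparse sketch. `build_parser` is a hypothetical stand-in, not the script's own parser; it covers only the five named arguments listed above and omits the batch options.

```python
import argparse

# Default copied from the --manifest-path entry above.
DEFAULT_MANIFEST = (
    "gs://fc-11093c2b-590e-424a-91ac-0cc040d562fc/v5.0/variant_qc/genomes/aou/"
    "truth_samples/truth_samples_gvcf_paths.tsv"
)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the script's parser (named arguments only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--manifest-path", default=DEFAULT_MANIFEST)
    parser.add_argument("--create-truth-samples-vds", action="store_true")
    parser.add_argument("--validate-truth-samples-vds", action="store_true")
    return parser


# A cheap test run that creates and then validates the VDS in one invocation.
args = build_parser().parse_args(
    ["--test", "--create-truth-samples-vds", "--validate-truth-samples-vds"]
)
```

All boolean flags default to False, so a run with no step flags would parse successfully but select neither the create nor the validate step.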
batch configuration
Optional parameters for the batch/QoB backend.
- --gcp-billing-project
Google Cloud billing project for reading requester pays buckets.
Default: “broad-mpg-gnomad”
- --experimental
Route the QoB init through hl.experimental.init instead of hl.init and attach the QoB driver to an existing Hail Batch. Requires HAIL_BATCH_ID to be set in the env (Hail Batch injects this inside batch jobs); raises if not. Without this flag, each invocation creates its own Hail Batch.
Default: False
- --app-name
Job name for batch/QoB backend.
- --driver-cores
Number of cores for the driver node. Pass a power of two between 0.25 and 16 (as a string, e.g. ‘2’ or ‘0.5’).
- --driver-memory
Memory type for driver node (e.g., ‘highmem’).
- --jvm-heap-size
Max JVM heap size (-Xmx) for the in-process QoB driver under --experimental, e.g. ‘5g’ or ‘2500m’. Plumbed to hl.experimental.init(jvm_heap_size=…). Set to ~50-70% of container memory; the rest is for native off-heap (RegionPool), Python, and OS overhead. Ignored without --experimental.
- --worker-cores
Number of cores per worker node. Pass a power of two between 0.25 and 16 (as a string, e.g. ‘1’ or ‘0.5’).
- --worker-memory
Memory type for worker nodes (e.g., ‘highmem’).
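The constraints on the core counts and the heap-size guidance above can be checked before submitting a job. The sketch below is illustrative only: `is_valid_cores` and `heap_size_bounds` are hypothetical helpers, not part of the script, and the heap bounds simply apply the 50-70% rule of thumb to a container size given in MiB.

```python
from fractions import Fraction
from typing import Tuple


def is_valid_cores(value: str) -> bool:
    """Return True if `value` is a power of two between 0.25 and 16."""
    try:
        cores = Fraction(value)
    except (ValueError, ZeroDivisionError):
        return False
    if not (Fraction(1, 4) <= cores <= 16):
        return False
    # A positive fraction in lowest terms is a power of two exactly when
    # both its numerator and denominator are powers of two.
    n, d = cores.numerator, cores.denominator
    return (n & (n - 1)) == 0 and (d & (d - 1)) == 0


def heap_size_bounds(container_mib: int) -> Tuple[str, str]:
    """Return (low, high) -Xmx strings at 50% and 70% of container memory."""
    low = container_mib * 50 // 100
    high = container_mib * 70 // 100
    return f"{low}m", f"{high}m"
```

For example, `is_valid_cores("0.5")` accepts a half core while `is_valid_cores("3")` rejects a non-power-of-two, and `heap_size_bounds(8192)` gives a range to choose `--jvm-heap-size` from for an 8 GiB container.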
Module Functions
- read_and_check_gvcf_paths(manifest_path): Read and validate the truth-sample gVCF paths from a single-column TSV manifest.
- validate_vds(vds_path, test, manifest_path): Verify a combined truth-samples VDS looks correct.
- Create a VDS of the GiaB truth samples from their gVCFs.
- Get script argument parser.
- gnomad_qc.v5.variant_qc.create_truth_samples_vds.read_and_check_gvcf_paths(manifest_path)
Read and validate the truth-sample gVCF paths from a single-column TSV manifest.
Lines that do not start with gs:// (e.g. a header line, blank lines, or comments) are ignored, so the manifest can optionally have a header. Raises if the manifest does not contain exactly N_TRUTH_SAMPLES paths.
- Parameters:
manifest_path (str) – GCS path to the single-column TSV of gVCF paths.
- Return type:
List[str]
- Returns:
List of gVCF GCS paths.
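The filtering and count check described above can be sketched in plain Python. This is not the function's actual implementation: `filter_gvcf_paths` is a hypothetical helper that works on already-read lines rather than a GCS object, the value 8 for `N_TRUTH_SAMPLES` is assumed from the "8 GiaB truth samples" in the module description, and stripping whitespace before the prefix check is also an assumption.

```python
from typing import Iterable, List

# Assumed value, taken from "the 8 GiaB truth samples" in the module description.
N_TRUTH_SAMPLES = 8


def filter_gvcf_paths(
    lines: Iterable[str], expected: int = N_TRUTH_SAMPLES
) -> List[str]:
    """Keep only lines starting with gs:// and require exactly `expected` paths."""
    paths = []
    for line in lines:
        stripped = line.strip()
        # Headers, blank lines, and comments do not start with gs:// and are skipped.
        if stripped.startswith("gs://"):
            paths.append(stripped)
    if len(paths) != expected:
        raise ValueError(
            f"Expected {expected} gVCF paths in the manifest, found {len(paths)}."
        )
    return paths
```

Because non-`gs://` lines are skipped rather than rejected, a manifest with a header row and a manifest without one produce the same list.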
- gnomad_qc.v5.variant_qc.create_truth_samples_vds.validate_vds(vds_path, test, manifest_path)
Verify a combined truth-samples VDS looks correct.
Runs Hail’s VariantDataset.validate() (the canonical structural check) plus cheap sanity counts: non-empty data, expected sample count, and (in test mode) that all variant loci fall within TEST_INTERVAL. In test mode it additionally cross-checks the variant calls and reference blocks against the source gVCF via _verify_test_vds_against_gvcf().
- Parameters:
vds_path (str) – Path to the VDS to validate.
test (bool) – Whether the VDS was produced by a --test run (1 gVCF, one interval).
manifest_path (str) – Manifest path. Only used in test mode, to locate the single source gVCF for the cross-check; a full run validates against N_TRUTH_SAMPLES.
- Return type:
None
- Returns:
None.
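The test-mode check that every variant locus falls within TEST_INTERVAL can be illustrated without Hail. The sketch below is an assumption-laden stand-in for that one check: `parse_interval` and `loci_outside_interval` are hypothetical helpers, loci are modelled as plain `(contig, position)` tuples, and both interval ends are treated as inclusive, which may differ from the interval semantics the script actually uses.

```python
from typing import Iterable, List, Tuple

# Interval quoted in the --test description.
TEST_INTERVAL = "chr1:55039447-55064852"

Locus = Tuple[str, int]


def parse_interval(interval: str) -> Tuple[str, int, int]:
    """Split 'contig:start-end' into (contig, start, end)."""
    contig, span = interval.split(":")
    start, end = span.split("-")
    return contig, int(start), int(end)


def loci_outside_interval(
    loci: Iterable[Locus], interval: str = TEST_INTERVAL
) -> List[Locus]:
    """Return the loci that fall outside `interval` (ends assumed inclusive)."""
    contig, start, end = parse_interval(interval)
    return [
        (c, pos) for c, pos in loci if c != contig or not (start <= pos <= end)
    ]
```

An empty result means the check passes; any returned locus points at a variant that should not be present in a `--test` VDS.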