gnomad_qc.v5.variant_qc.create_truth_samples_vds

Script to create a VDS of the 8 Genomes-in-a-Bottle (GiaB) truth samples from their gVCFs.

The GiaB samples were sequenced with the same protocol as the AoU v8 data, and their gVCFs live in the AoU control-samples bucket (gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux/qc/control_samples/). The gVCFs are already reblocked, so they are passed straight into Hail’s VDS combiner (no reblocking step).

The combiner needs the per-gVCF paths up front. The truth-sample bucket cannot be listed and the sample IDs might be sensitive, so neither the paths nor the IDs are stored in this repo. Instead, the script reads a single-column TSV manifest of gVCF paths (truth_samples_gvcf_paths) from a known object path.

This is intended to run in the batch environment (Hail Batch in the AoU authorization domain), since that is where the AoU truth-sample gVCFs are readable.

usage: gnomad_qc.v5.variant_qc.create_truth_samples_vds.py [-h] [--overwrite]
                                                           [--test]
                                                           [--manifest-path MANIFEST_PATH]
                                                           [--create-truth-samples-vds]
                                                           [--validate-truth-samples-vds]
                                                           [--gcp-billing-project GCP_BILLING_PROJECT]
                                                           [--experimental]
                                                           [--app-name APP_NAME]
                                                           [--driver-cores DRIVER_CORES]
                                                           [--driver-memory DRIVER_MEMORY]
                                                           [--jvm-heap-size JVM_HEAP_SIZE]
                                                           [--worker-cores WORKER_CORES]
                                                           [--worker-memory WORKER_MEMORY]

Named Arguments

--overwrite

Overwrite any stale saved combiner plan. Does not overwrite an existing VDS (the combiner will fail if the output VDS already exists, regardless of this flag).

Default: False

--test

Cheap end-to-end test: combine only the first gVCF in the manifest over a single small interval (chr1:55039447-55064852) and write to a temporary path.

Default: False

--manifest-path

GCS path to the single-column TSV listing each truth-sample gVCF path (one ‘gs://…’ path per line).

Default: “gs://fc-11093c2b-590e-424a-91ac-0cc040d562fc/v5.0/variant_qc/genomes/aou/truth_samples/truth_samples_gvcf_paths.tsv”

--create-truth-samples-vds

Run the VDS combiner to create the truth-samples VDS.

Default: False

--validate-truth-samples-vds

Validate the combined VDS (structural invariants + sample/row sanity counts). Can be run in the same invocation as --create-truth-samples-vds.

Default: False
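
As an illustration of how the flags above combine, the following is a minimal argparse sketch that mirrors the documented named arguments and defaults. It is a reconstruction for illustration only, not the script's actual parser (see get_script_argument_parser() for that); the function name build_parser_sketch is hypothetical.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented named arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI.")
    # Boolean flags all default to False, as documented.
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--test", action="store_true")
    parser.add_argument(
        "--manifest-path",
        default=(
            "gs://fc-11093c2b-590e-424a-91ac-0cc040d562fc/v5.0/variant_qc/"
            "genomes/aou/truth_samples/truth_samples_gvcf_paths.tsv"
        ),
    )
    parser.add_argument("--create-truth-samples-vds", action="store_true")
    parser.add_argument("--validate-truth-samples-vds", action="store_true")
    return parser


# Create and validate in a single (test-mode) invocation.
args = build_parser_sketch().parse_args(
    ["--test", "--create-truth-samples-vds", "--validate-truth-samples-vds"]
)
print(args.test, args.create_truth_samples_vds, args.validate_truth_samples_vds)
```

Passing both --create-truth-samples-vds and --validate-truth-samples-vds in one call corresponds to the documented "same invocation" usage.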

batch configuration

Optional parameters for the batch/QoB backend.

--gcp-billing-project

Google Cloud billing project for reading requester-pays buckets.

Default: “broad-mpg-gnomad”

--experimental

Route the QoB init through hl.experimental.init instead of hl.init and attach the QoB driver to an existing Hail Batch. Requires HAIL_BATCH_ID to be set in the env (Hail Batch injects this inside batch jobs); raises if not. Without this flag, each invocation creates its own Hail Batch.

Default: False

--app-name

Job name for batch/QoB backend.

--driver-cores

Number of cores for the driver node. Pass a power of two between 0.25 and 16 (as a string, e.g. ‘2’ or ‘0.5’).

--driver-memory

Memory type for driver node (e.g., ‘highmem’).

--jvm-heap-size

Max JVM heap size (-Xmx) for the in-process QoB driver under --experimental, e.g. ‘5g’ or ‘2500m’. Passed through to hl.experimental.init(jvm_heap_size=…). Set to ~50-70% of container memory; the rest is for native off-heap (RegionPool), Python, and OS overhead. Ignored without --experimental.

--worker-cores

Number of cores per worker node. Pass a power of two between 0.25 and 16 (as a string, e.g. ‘1’ or ‘0.5’).

--worker-memory

Memory type for worker nodes (e.g., ‘highmem’).
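
The constraints on --driver-cores, --worker-cores and --jvm-heap-size can be checked with a few lines of arithmetic. The sketch below is not part of the script: check_cores and suggest_heap_range_mib are hypothetical helpers, and the container memory figure must be supplied by the caller because this page does not state it.

```python
from typing import Tuple

# Powers of two between 0.25 and 16, per the --driver-cores/--worker-cores help text.
VALID_CORES = {0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0}


def check_cores(value: str) -> str:
    """Return the core-count string unchanged if it is an allowed power of two."""
    if float(value) not in VALID_CORES:
        raise ValueError(f"cores must be a power of two between 0.25 and 16, got {value!r}")
    return value


def suggest_heap_range_mib(container_memory_mib: int) -> Tuple[str, str]:
    """Return the 50% and 70% marks of container memory as '-Xmx'-style strings."""
    low = container_memory_mib * 50 // 100
    high = container_memory_mib * 70 // 100
    return f"{low}m", f"{high}m"


print(check_cores("0.5"))            # accepted
print(suggest_heap_range_mib(8192))  # range for an assumed 8 GiB container
```

For an assumed 8 GiB (8192 MiB) container, this suggests a heap between 4096m and 5734m, leaving the remainder for off-heap memory, Python, and OS overhead.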

Module Functions

gnomad_qc.v5.variant_qc.create_truth_samples_vds.read_and_check_gvcf_paths(...)

Read and validate the truth-sample gVCF paths from a single-column TSV manifest.

gnomad_qc.v5.variant_qc.create_truth_samples_vds.validate_vds(...)

Verify a combined truth-samples VDS looks correct.

gnomad_qc.v5.variant_qc.create_truth_samples_vds.main(args)

Create a VDS of the GiaB truth samples from their gVCFs.

gnomad_qc.v5.variant_qc.create_truth_samples_vds.get_script_argument_parser()

Get script argument parser.

gnomad_qc.v5.variant_qc.create_truth_samples_vds.read_and_check_gvcf_paths(manifest_path)[source]

Read and validate the truth-sample gVCF paths from a single-column TSV manifest.

Lines that do not start with gs:// (e.g. a header line, blank lines, or comments) are ignored, so the manifest can optionally have a header. Raises if the manifest does not contain exactly N_TRUTH_SAMPLES paths.

Parameters:

manifest_path (str) – GCS path to the single-column TSV of gVCF paths.

Return type:

List[str]

Returns:

List of gVCF GCS paths.
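
The filtering and count check described for read_and_check_gvcf_paths can be sketched in plain Python as follows. This is an illustrative approximation, not the function's source: it operates on an in-memory list of lines rather than a GCS object, filter_gvcf_paths is a hypothetical name, stripping whitespace before the prefix test is an assumption, and N_TRUTH_SAMPLES = 8 is inferred from the "8 GiaB truth samples" in the module description. The bucket and file names in the example are placeholders.

```python
from typing import Iterable, List

N_TRUTH_SAMPLES = 8  # assumption: inferred from the "8 GiaB truth samples" description


def filter_gvcf_paths(lines: Iterable[str], expected: int = N_TRUTH_SAMPLES) -> List[str]:
    """Keep only lines starting with 'gs://' and require exactly `expected` of them."""
    # Headers, blank lines, and comments do not start with gs:// and are dropped.
    paths = [line.strip() for line in lines if line.strip().startswith("gs://")]
    if len(paths) != expected:
        raise ValueError(f"expected {expected} gVCF paths, found {len(paths)}")
    return paths


# Placeholder manifest content with a header, a comment, and a blank line.
manifest_lines = [
    "gvcf_path\n",
    "# placeholder comment\n",
    "gs://example-bucket/sample_a.g.vcf.gz\n",
    "\n",
    "gs://example-bucket/sample_b.g.vcf.gz\n",
]
paths = filter_gvcf_paths(manifest_lines, expected=2)
print(paths)
```

Because non-gs:// lines are silently skipped, the exact-count check is what catches a truncated or malformed manifest.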

gnomad_qc.v5.variant_qc.create_truth_samples_vds.validate_vds(vds_path, test, manifest_path)[source]

Verify a combined truth-samples VDS looks correct.

Runs Hail’s VariantDataset.validate() (the canonical structural check) plus cheap sanity counts: non-empty data, expected sample count, and (in test mode) that all variant loci fall within TEST_INTERVAL. In test mode it additionally cross-checks the variant calls and reference blocks against the source gVCF via _verify_test_vds_against_gvcf().

Parameters:
  • vds_path (str) – Path to the VDS to validate.

  • test (bool) – Whether the VDS was produced by a --test run (1 gVCF, one interval).

  • manifest_path (str) – Manifest path. Used only in test mode, to locate the single source gVCF for the cross-check; a full run instead checks the sample count against N_TRUTH_SAMPLES.

Return type:

None

Returns:

None.
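
The test-mode check that all variant loci fall within TEST_INTERVAL can be pictured with a pure-Python sketch. The real function works on a Hail VariantDataset; the version below only illustrates the containment logic on (contig, position) tuples. The helper names are hypothetical, TEST_INTERVAL is assumed to equal the interval documented for --test, and treating both endpoints as inclusive is an assumption that may differ from the interval semantics the script uses.

```python
from typing import Iterable, List, Tuple

# Assumption: same interval as documented for --test.
TEST_INTERVAL = "chr1:55039447-55064852"


def parse_interval(interval: str) -> Tuple[str, int, int]:
    """Split 'contig:start-end' into its three parts."""
    contig, span = interval.split(":")
    start, end = span.split("-")
    return contig, int(start), int(end)


def loci_outside_interval(
    loci: Iterable[Tuple[str, int]], interval: str = TEST_INTERVAL
) -> List[Tuple[str, int]]:
    """Return loci on another contig or outside [start, end] (both ends inclusive here)."""
    contig, start, end = parse_interval(interval)
    return [(c, p) for c, p in loci if c != contig or not (start <= p <= end)]


example_loci = [("chr1", 55039447), ("chr1", 55064853), ("chr2", 55040000)]
offenders = loci_outside_interval(example_loci)
print(offenders)
```

An empty result corresponds to a passing check; any returned locus would indicate that the test VDS contains data outside the requested interval.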

gnomad_qc.v5.variant_qc.create_truth_samples_vds.main(args)[source]

Create a VDS of the GiaB truth samples from their gVCFs.