gnomad_qc.v4.variant_qc.vqsr

Script to run VQSR on an AS-annotated Sites VCF.

usage: gnomad_qc.v4.variant_qc.vqsr.py [-h] [-o] [--test] --out-bucket
                                       OUT_BUCKET --out-vcf-name OUT_VCF_NAME
                                       --model-id MODEL_ID
                                       [--header-path HEADER_PATH]
                                       [--n-partitions N_PARTITIONS]
                                       [--array-elements-required]
                                       [--gatk-image GATK_IMAGE] --resources
                                       RESOURCES
                                       [--scatter-count SCATTER_COUNT]
                                       [--snp-hard-filter SNP_HARD_FILTER]
                                       [--indel-hard-filter INDEL_HARD_FILTER]
                                       [--run-mode {small,standard,large}]
                                       --batch-billing-project
                                       BATCH_BILLING_PROJECT
                                       --gcp-billing-project
                                       GCP_BILLING_PROJECT
                                       [--compute-info-method {AS,quasi,set_long_AS_missing}]
                                       [--adj] [--transmitted-singletons]
                                       [--sibling-singletons]
                                       [--interval-qc-filter]
                                       [--calling-interval-filter]
                                       [--snp-features SNP_FEATURES [SNP_FEATURES ...]]
                                       [--indel-features INDEL_FEATURES [INDEL_FEATURES ...]]
                                       [--overlap-skip]
                                       [--batch-suffix BATCH_SUFFIX]
                                       [--load-vqsr]

Named Arguments

-o, --overwrite

Whether to overwrite data already present in the output Table.

Default: False

--test

If the dataset should be filtered to chr22 for testing.

Default: False

--out-bucket

Bucket to store VQSR outputs in.

--out-vcf-name

Required prefix for VQSR outputs.

--model-id

Model ID for the variant QC result HT.

--header-path

Optional path to a header file to use for importing the variant QC result VCF.

--n-partitions

Number of desired partitions for output Table.

Default: 5000

--array-elements-required

Pass if you would like array elements required in import_vcf to be true.

Default: False

--gatk-image

GATK Image.

Default: “us.gcr.io/broad-gatk/gatk:4.2.6.1”

--resources

Path to .json file containing paths and information for the VQSR pipeline.

--scatter-count

Number of intervals to scatter across.

Default: 100

--snp-hard-filter

Hard filter cutoff for SNPs.

Default: 99.7

--indel-hard-filter

Hard filter cutoff for INDELs.

Default: 99.0

--run-mode

Possible choices: small, standard, large

Option to pass so that the mode/resources fit the size of the database. This affects the size of the clusters and, if –large is set, will run in a scattered mode (one job for each partition).

Default: “standard”

--batch-billing-project

Hail Batch billing project.

--gcp-billing-project

Google Cloud billing project for reading requester pays buckets.

--compute-info-method

Possible choices: AS, quasi, set_long_AS_missing

Method of computing the INFO score to use for the variant QC features. Default is ‘AS’.

Default: “quasi”

--adj

Use adj genotypes for transmitted/sibling singletons.

Default: False

--transmitted-singletons

Include transmitted singletons in training.

Default: False

--sibling-singletons

Include sibling singletons in training.

Default: False

--interval-qc-filter

Whether interval QC should be applied for training.

Default: False

--calling-interval-filter

Whether to filter to the intersection of Broad/DSP calling intervals with 50 bp of padding for training.

Default: False

--snp-features

Features to use in the SNP VQSR model.

Default: [‘AS_QD’, ‘AS_MQRankSum’, ‘AS_ReadPosRankSum’, ‘AS_FS’, ‘AS_MQ’]

--indel-features

Features to use in the indel VQSR model.

Default: [‘AS_QD’, ‘AS_MQRankSum’, ‘AS_ReadPosRankSum’, ‘AS_FS’]

--overlap-skip

Skip code to address overlaps, output sharded VCF.

Default: False

--batch-suffix

String to add to end of batch name.

Default: “”

--load-vqsr

Run import_variant_qc_vcf() to load VQSR VCFs as HT

Default: False

Module Functions

`gnomad_qc.v4.variant_qc.vqsr.split_intervals`(b, ...)	Split genome into intervals to parallelize VQSR for large sample sizes.
`gnomad_qc.v4.variant_qc.vqsr.snps_variant_recalibrator_create_model`(b, ...)	First step of VQSR for SNPs: run VariantRecalibrator to subsample variants and produce a file of the VQSR model.
`gnomad_qc.v4.variant_qc.vqsr.snps_variant_recalibrator`(b, ...)	Second step of VQSR for SNPs: run VariantRecalibrator scattered to apply the VQSR model file to each genomic interval.
`gnomad_qc.v4.variant_qc.vqsr.indels_variant_recalibrator_create_model`(b, ...)	First step of VQSR for INDELs: run VariantRecalibrator to subsample variants and produce a file of the VQSR model.
`gnomad_qc.v4.variant_qc.vqsr.indels_variant_recalibrator`(b, ...)	Second step of VQSR for INDELs: run VariantRecalibrator scattered to apply the VQSR model file to each genomic interval.
`gnomad_qc.v4.variant_qc.vqsr.gather_tranches`(b, ...)	Third step of VQSR for SNPs: run GatherTranches to gather scattered per-interval tranches outputs.
`gnomad_qc.v4.variant_qc.vqsr.apply_recalibration`(b, ...)	Apply a score cutoff to filter variants based on a recalibration table.
`gnomad_qc.v4.variant_qc.vqsr.gather_vcfs`(b, ...)	Combine recalibrated VCFs into a single VCF & saves the output VCF to a bucket out_bucket.
`gnomad_qc.v4.variant_qc.vqsr.make_vqsr_jobs`(b, ...)	Add jobs to Batch that perform the allele-specific VQSR variant QC.
`gnomad_qc.v4.variant_qc.vqsr.main`(args)	Run VQSR variant qc workflow.
`gnomad_qc.v4.variant_qc.vqsr.get_script_argument_parser`()	Get script argument parser.

Script to run VQSR on an AS-annotated Sites VCF.

gnomad_qc.v4.variant_qc.vqsr.split_intervals(b, utils, calling_interval_path, gcp_billing_project, gatk_image, scatter_count=100, interval_padding=150)[source]

Split genome into intervals to parallelize VQSR for large sample sizes.

Parameters:

b (Batch) – Batch object to add jobs to.
utils (Dict) – Dictionary containing resources (file paths and arguments) to be used to split genome.
calling_interval_path (str) – Path to the calling interval list with no padding.
gcp_billing_project (str) – GCP billing project for requester-pays buckets.
gatk_image (str) – GATK docker image.
scatter_count (int) – Number of intervals to split the dataset into for scattering.
interval_padding (int) – Number of bases to pad each interval by.

Return type:

Job

Returns:

Job object with a single output j.intervals of type ResourceGroup.

gnomad_qc.v4.variant_qc.vqsr.snps_variant_recalibrator_create_model(b, sites_only_vcf, utils, use_as_annotations, gcp_billing_project, gatk_image, features, singletons_resource_vcf=None, evaluation_interval_path=None, out_bucket=None, is_small_callset=False, is_large_callset=False, max_gaussians=6)[source]

First step of VQSR for SNPs: run VariantRecalibrator to subsample variants and produce a file of the VQSR model.

To support cohorts with more than 10,000 WGS samples, the SNP recalibration process is broken down across genomic regions for parallel processing, and done in 3 steps:

Run the recalibrator with the following additional arguments: –sample-every-Nth-variant <downsample_factor> –output-model <model_file>

Apply the resulting model to each genomic interval with, running the recalibrator with the same base parameters, plus: –input-model <model-file> –output-tranches-for-scatter

Collate the resulting per-interval tranches with GatherTranches

The –max-gaussians parameter sets the expected number of clusters in modeling. If a dataset gives fewer distinct clusters, e.g. as can happen for smaller data, then the tool will tell you there is insufficient data with a No data found error message. In this case, try decrementing the –max-gaussians value.

Parameters:

b (Batch) – Batch object to add jobs to.
sites_only_vcf (str) – Sites only VCF file to be used to build the model.
utils (Dict) – a dictionary containing resources (file paths) to be used to create the model.
use_as_annotations (bool) – If set, Allele-Specific variant recalibrator will be used.
gcp_billing_project (str) – GCP billing project for requester-pays buckets.
gatk_image (str) – GATK docker image.
features (List[str]) – List of features to be used in the SNP VQSR model.
singletons_resource_vcf (Optional[str]) – Optional singletons VCF to be used in building the model.
evaluation_interval_path (Optional[str]) – Optional path to the evaluation interval list.
out_bucket (str) – Full path to output bucket to write model and plots to.
is_small_callset (bool) – Whether the dataset is small. Used to set number of CPUs for the job.
is_large_callset (bool) – Whether the dataset is huge. Used to set number of CPUs for the job.
max_gaussians (int) – Maximum number of Gaussians for the positive model.

Return type:

Job

Returns:

Job object with 2 outputs: j.model_file and j.snp_rscript_file.

gnomad_qc.v4.variant_qc.vqsr.snps_variant_recalibrator(b, sites_only_vcf, utils, out_bucket, use_as_annotations, gcp_billing_project, gatk_image, features, interval=None, tranche_idx=None, model_file=None, singletons_resource_vcf=None, max_gaussians=4)[source]

Second step of VQSR for SNPs: run VariantRecalibrator scattered to apply the VQSR model file to each genomic interval.

To support cohorts with more than 10,000 WGS samples, the SNP recalibration process is broken down across genomic regions for parallel processing, and done in 3 steps:

Run the recalibrator with the following additional arguments: –sample-every-Nth-variant <downsample_factor> –output-model <model_file>

Apply the resulting model to each genomic interval with, running the recalibrator with the same base parameters, plus: –input-model <model-file> –output-tranches-for-scatter

Collate the resulting per-interval tranches with GatherTranches

The –max-gaussians parameter sets the expected number of clusters in modeling. If a dataset gives fewer distinct clusters, e.g. as can happen for smaller data, then the tool will tell you there is insufficient data with a No data found error message. In this case, try decrementing the –max-gaussians value.

Parameters:

b (Batch) – Batch object to add jobs to.
sites_only_vcf (str) – Sites only VCF file to be used to build the model.
model_file (Optional[ResourceFile]) – Model file to be applied.
utils (Dict) – Dictionary containing resources (file paths).
out_bucket (str) – Full path to output bucket to write model and plots to.
tranche_idx (Optional[int]) – Index for the tranches file.
use_as_annotations (bool) – If set, Allele-Specific variant recalibrator will be used.
gcp_billing_project – GCP billing project for requester-pays buckets.
gatk_image (str) – GATK docker image.
features (List[str]) – List of features to be used in the SNP VQSR model.
singletons_resource_vcf (Optional[str]) – Optional singletons VCF to include in VariantRecalibrator.
interval (Optional[ResourceGroup]) – Genomic interval to apply the model to.
max_gaussians (int) – Maximum number of Gaussians for the positive model.

Return type:

Job

Returns:

Job object with 2 outputs: j.recalibration (ResourceGroup) and j.tranches.

gnomad_qc.v4.variant_qc.vqsr.indels_variant_recalibrator_create_model(b, sites_only_vcf, utils, use_as_annotations, gcp_billing_project, gatk_image, features, singletons_resource_vcf=None, evaluation_interval_path=None, out_bucket=None, is_small_callset=False, max_gaussians=4)[source]

First step of VQSR for INDELs: run VariantRecalibrator to subsample variants and produce a file of the VQSR model.

To support cohorts with more than 10,000 WGS samples, the INDEL recalibration process is broken down across genomic regions for parallel processing, and done in 3 steps:

Run the recalibrator with the following additional arguments: –sample-every-Nth-variant <downsample_factor> –output-model <model_file>

Apply the resulting model to each genomic interval with, running the recalibrator with the same base parameters, plus: –input-model <model-file> –output-tranches-for-scatter

Collate the resulting per-interval tranches with GatherTranches

The –max-gaussians parameter sets the expected number of clusters in modeling. If a dataset gives fewer distinct clusters, e.g. as can happen for smaller data, then the tool will tell you there is insufficient data with a No data found error message. In this case, try decrementing the –max-gaussians value

Parameters:

b (Batch) – Batch object to add jobs to.
sites_only_vcf (str) – Sites only VCF file to be used to build the model.
utils (Dict) – Dictionary containing resources (file paths).
use_as_annotations (bool) – If set, Allele-Specific variant recalibrator will be used.
gcp_billing_project (str) – GCP billing project for requester-pays buckets.
gatk_image (str) – GATK docker image.
features (List[str]) – List of features to be used in the INDEL VQSR model.
singletons_resource_vcf (str) – Optional singletons VCF to be used in building the model.
evaluation_interval_path (Optional[str]) – Optional path to the evaluation interval list.
out_bucket (str) – Full path to output bucket to write model and plots to.
is_small_callset (bool) – Whether the dataset is small. Used to set number of CPUs for the job.
max_gaussians (int) – Maximum number of Gaussians for the positive model.

Return type:

Job

Returns:

Job object with 2 outputs: j.model_file and j.indel_rscript_file. The latter is useful to produce the optional tranche plot.

gnomad_qc.v4.variant_qc.vqsr.indels_variant_recalibrator(b, sites_only_vcf, utils, out_bucket, use_as_annotations, gcp_billing_project, gatk_image, features, interval=None, tranche_idx=None, model_file=None, singletons_resource_vcf=None, max_gaussians=4)[source]

Second step of VQSR for INDELs: run VariantRecalibrator scattered to apply the VQSR model file to each genomic interval.

To support cohorts with more than 10,000 WGS samples, the SNP recalibration process is broken down across genomic regions for parallel processing, and done in 3 steps:

Run the recalibrator with the following additional arguments: –sample-every-Nth-variant <downsample_factor> –output-model <model_file>

Apply the resulting model to each genomic interval with, running the recalibrator with the same base parameters, plus: –input-model <model-file> –output-tranches-for-scatter

Collate the resulting per-interval tranches with GatherTranches

The –max-gaussians parameter sets the expected number of clusters in modeling. If a dataset gives fewer distinct clusters, e.g. as can happen for smaller data, then the tool will tell you there is insufficient data with a No data found error message. In this case, try decrementing the –max-gaussians value.

Parameters:

b (Batch) – Batch object to add jobs to.
sites_only_vcf (str) – Sites only VCF file to be used to build the model.
model_file (Optional[ResourceFile]) – Model file to be applied.
utils (Dict) – Dictionary containing resources (file paths).
out_bucket (str) – Full path to output bucket to write model and plots to.
tranche_idx (Optional[int]) – Index for the tranches file.
use_as_annotations (bool) – If set, Allele-Specific variant recalibrator will be used.
gcp_billing_project (str) – GCP billing project for requester-pays buckets.
gatk_image (str) – GATK docker image.
features (List[str]) – List of features to be used in the INDEL VQSR model.
singletons_resource_vcf (Optional[str]) – Optional singletons VCF to include in VariantRecalibrator.
interval (Optional[ResourceGroup]) – Genomic interval to apply the model to.
max_gaussians (int) – Maximum number of Gaussians for the positive model.

Return type:

Job

Returns:

Job object with 2 outputs: j.recalibration (ResourceGroup) and j.tranches.

gnomad_qc.v4.variant_qc.vqsr.gather_tranches(b, tranches, mode, disk_size, gcp_billing_project)[source]

Third step of VQSR for SNPs: run GatherTranches to gather scattered per-interval tranches outputs.

To support cohorts with more than 10,000 WGS samples, the SNP recalibration process is broken down across genomic regions for parallel processing, and done in 3 steps:

Run the recalibrator with the following additional arguments: --sample-every-Nth-variant <downsample_factor> --output-model <model_file>

Apply the resulting model to each genomic interval with, running the recalibrator with the same base parameters, plus: --input-model <model-file> --output-tranches-for-scatter

Collate the resulting per-interval tranches with GatherTranches

Parameters:

b (Batch) – Batch object to add jobs to.
tranches (List[ResourceFile]) – Index for the tranches file.
mode (str) – Recalibration mode to employ, either SNP or INDEL.
disk_size (int) – Disk size to be used for the job.
gcp_billing_project (str) –

Return type:

Job

Returns:

Job object with one output j.out_tranches.

gnomad_qc.v4.variant_qc.vqsr.apply_recalibration(b, input_vcf, out_vcf_name, indels_recalibration, indels_tranches, snps_recalibration, snps_tranches, snp_hard_filter, indel_hard_filter, disk_size, use_as_annotations, gcp_billing_project, scatter=None, interval=None, out_bucket=None, overlap_skip=False)[source]

Apply a score cutoff to filter variants based on a recalibration table.

Runs ApplyVQSR twice to apply first indel, then SNP recalibrations.

Targets indel_filter_level and snp_filter_level sensitivities. The tool matches them internally to a VQSLOD score cutoff based on the model’s estimated sensitivity to a set of true variants.

The filter determination is not just a pass/fail process. The tool evaluates for each variant which “tranche”, or slice of the dataset, it falls into in terms of sensitivity to the truthset. Variants in tranches that fall below the specified truth sensitivity filter level have their FILTER field annotated with the corresponding tranche level. This results in a callset that is filtered to the desired level but retains the information necessary to increase sensitivity if needed.

Parameters:

b (Batch) – Batch object to add jobs to.
input_vcf (str) – Sites only VCF file to be used.
out_vcf_name (str) – Output vcf filename.
indels_recalibration (ResourceGroup) – Input recal file (ResourceGroup) for INDELs.
indels_tranches (ResourceFile) – Input tranches file (ResourceFile) for INDELs.
snps_recalibration (ResourceGroup) – Input recal file (ResourceGroup) for SNPs.
snps_tranches (ResourceFile) – Input tranches file (ResourceFile) for SNPs.
snp_hard_filter (float) – SNP hard filter level.
indel_hard_filter (float) – INDEL hard filter level.
disk_size (int) – Disk size to be used for the job.
use_as_annotations (bool) – If set, Allele-Specific variant recalibrator will be used.
gcp_billing_project (str) – GCP billing project for requester-pays buckets.
scatter (Optional[int]) – Scatter index to be used in output VCF filename if running in scattered mode.
interval (Optional[ResourceGroup]) – Genomic interval to apply the model to.
out_bucket (Optional[str]) – Full path to output bucket to write output(s) to.
overlap_skip (bool) –

Return type:

Job

Returns:

Job object with one ResourceGroup output j.output_vcf, corresponding to a VCF with tranche annotated in the FILTER field.

gnomad_qc.v4.variant_qc.vqsr.gather_vcfs(b, input_vcfs, out_vcf_name, disk_size, gcp_billing_project, gatk_image, out_bucket=None)[source]

Combine recalibrated VCFs into a single VCF & saves the output VCF to a bucket out_bucket.

Parameters:

b (Batch) – Batch object to add jobs to.
input_vcfs (List[ResourceGroup]) – List of VCFs to be gathered.
out_vcf_name (str) – Output vcf filename.
disk_size (int) – Disk size to be used for the job.
gcp_billing_project (str) – GCP billing project for requester-pays buckets.
gatk_image (str) – GATK docker image.
out_bucket (str) – Full path to output bucket to write the gathered VCF to.

Return type:

Job

Returns:

Job object with one ResourceGroup output j.output_vcf.

gnomad_qc.v4.variant_qc.vqsr.make_vqsr_jobs(b, sites_only_vcf, is_small_callset, is_large_callset, output_vcf_name, utils, out_bucket, intervals, use_as_annotations, gcp_billing_project, gatk_image, snp_features, indel_features, snp_hard_filter, indel_hard_filter, scatter_count=100, singleton_vcf_path=None, evaluation_interval_path=None, overlap_skip=False)[source]

Add jobs to Batch that perform the allele-specific VQSR variant QC.

Parameters:

b (Batch) – Batch object to add jobs to.
sites_only_vcf (str) – Path to a sites only VCF created using gnomAD default_compute_info().
is_small_callset (bool) – For small callsets, we reduce the resources per job from our default.
is_large_callset (bool) – For large callsets, we increase resources per job from our default and shard our VCF to launch jobs in a scattered mode.
output_vcf_name (str) – Name, without extension, to use for the output VCF file(s).
utils (Dict) – Dictionary containing resource files and parameters to be used in VQSR.
out_bucket (str) – Path to write, plots, evaluation results, and recalibrated VCF to.
intervals (Dict) – ResourceGroup object with intervals to scatter.
use_as_annotations (bool) – Whether to use allele-specific annotation for VQSR.
gcp_billing_project (str) – GCP billing project for requester-pays buckets.
gatk_image (str) – GATK docker image.
snp_features (List[str]) – List of features to be used in the SNP VQSR model.
indel_features (List[str]) – List of features to be used in the INDEL VQSR model.
snp_hard_filter (float) – SNP hard filter level.
indel_hard_filter (float) – INDEL hard filter level.
scatter_count (int) – Number of intervals to scatter over.
singleton_vcf_path (Optional[str]) – Full path to transmitted and/or sibling singletons VCF file and its index.
evaluation_interval_path (Optional[str]) – Optional full path to evaluation intervals file.
overlap_skip (bool) –

Return type:

None

Returns:

None.

gnomad_qc.v4.variant_qc.vqsr.main(args)[source]: Run VQSR variant qc workflow.