gnomad_qc.v4.create_release.validate_and_export_vcf

Script to validate and export gnomAD VCFs.

usage: gnomad_qc.v4.create_release.validate_and_export_vcf.py
       [-h] [--validate-release-ht] [--verbose] [--overwrite] [--test]
       [-d {exomes,genomes,joint}] [--prepare-vcf-header] [--export-vcf]
       [--contig CONTIG] [--joint-included]

Named Arguments

--validate-release-ht

Run release HT validation

Default: False

--verbose

Log successes in addition to failures during validation

Default: False

--overwrite

Overwrite all data and start from raw inputs

Default: False

--test

For validation, test will run on chr20, chrX, and chrY. For VCF export, test runs on PCSK9 region. Outputs to test bucket.

Default: False

-d, --data-type

Possible choices: exomes, genomes, joint

Data type to run validity checks on.

Default: “exomes”

--prepare-vcf-header

Prepare VCF header dict.

Default: False

--export-vcf

Export VCF.

Default: False

--contig

Contig to export VCF for.

--joint-included

Whether joint frequency data is included in the release HT.

Default: False

Module Functions

gnomad_qc.v4.create_release.validate_and_export_vcf.get_export_resources([...])

Get export resources.

gnomad_qc.v4.create_release.validate_and_export_vcf.filter_to_test(ht)

Filter Table to num_partitions partitions on chr20, chrX, and chrY for testing.

gnomad_qc.v4.create_release.validate_and_export_vcf.select_type_from_joint_ht(ht, ...)

Select all fields from the joint HT that are relevant to data_type.

gnomad_qc.v4.create_release.validate_and_export_vcf.unfurl_nested_annotations(ht)

Create dictionary keyed by the variant annotation labels to be extracted from variant annotation arrays.

gnomad_qc.v4.create_release.validate_and_export_vcf.make_info_expr(t)

Make Hail expression for variant annotations to be included in VCF INFO field.

gnomad_qc.v4.create_release.validate_and_export_vcf.prepare_ht_for_validation(ht)

Prepare HT for validity checks and export.

gnomad_qc.v4.create_release.validate_and_export_vcf.populate_subset_info_dict(...)

Call make_info_dict to populate INFO dictionary for the requested subset.

gnomad_qc.v4.create_release.validate_and_export_vcf.populate_info_dict(...)

Call make_info_dict and make_hist_dict to populate INFO dictionary.

gnomad_qc.v4.create_release.validate_and_export_vcf.prepare_vcf_header_dict(ht, ...)

Prepare VCF header dictionary.

gnomad_qc.v4.create_release.validate_and_export_vcf.get_downsamplings_fields(ht)

Get downsampling specific annotations from info struct.

gnomad_qc.v4.create_release.validate_and_export_vcf.format_validated_ht_for_export(ht)

Format validated HT for export.

gnomad_qc.v4.create_release.validate_and_export_vcf.process_vep_csq_header([...])

Process VEP CSQ header string, delimited by '|', to remove polyphen and sift annotations.

gnomad_qc.v4.create_release.validate_and_export_vcf.check_globals_for_retired_terms(ht)

Check list of dictionaries to see if the keys in the dictionaries contain either ‘pop’ or ‘oth’.

gnomad_qc.v4.create_release.validate_and_export_vcf.get_joint_filters(ht)

Transform exomes and genomes filters to joint filters.

gnomad_qc.v4.create_release.validate_and_export_vcf.main(args)

Validate release Table and export VCFs.

gnomad_qc.v4.create_release.validate_and_export_vcf.get_script_argument_parser()

Get script argument parser.

Script to validate and export gnomAD VCFs.

gnomad_qc.v4.create_release.validate_and_export_vcf.get_export_resources(overwrite=False, data_type='exomes', test=False, contig=None)[source]

Get export resources.

Parameters:
  • overwrite (bool) – Whether to overwrite existing files.

  • data_type (str) – Data type to get resources for. One of “exomes” or “genomes”. Default is “exomes”.

  • test (Optional[bool]) – Whether to use test resources.

  • contig (Optional[str]) – Contig to get resources for. Default is None.

Return type:

PipelineResourceCollection

Returns:

Export resources.

gnomad_qc.v4.create_release.validate_and_export_vcf.filter_to_test(ht, num_partitions=2)[source]

Filter Table to num_partitions partitions on chr20, chrX, and chrY for testing.

Parameters:
  • ht (Table) – Input Table to filter.

  • num_partitions (int) – Number of partitions to grab from each chromosome.

Return type:

Table

Returns:

Input Table filtered to num_partitions on chr20, chrX, and chrY.
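The real function filters a Hail Table; a plain-Python analogue of the partition-grabbing idea (assuming a mapping from contig to its list of partitions; names here are illustrative, not the gnomad_qc implementation):

```python
def take_test_partitions(partitions_by_contig, num_partitions=2,
                         contigs=("chr20", "chrX", "chrY")):
    """Keep the first num_partitions partitions for each test contig."""
    return {
        contig: partitions_by_contig[contig][:num_partitions]
        for contig in contigs
        if contig in partitions_by_contig
    }

parts = {"chr20": [0, 1, 2], "chrX": [3], "chr1": [9]}
print(take_test_partitions(parts))  # {'chr20': [0, 1], 'chrX': [3]}
```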

gnomad_qc.v4.create_release.validate_and_export_vcf.select_type_from_joint_ht(ht, data_type)[source]

Select all fields from the joint HT that are relevant to data_type.

Parameters:
  • ht (Table) – Joint release HT.

  • data_type (str) – Data type to select in joint HT. One of “exomes”, “genomes”, or “joint”.

Return type:

Table

Returns:

Joint HT with fields relevant to data_type.

gnomad_qc.v4.create_release.validate_and_export_vcf.unfurl_nested_annotations(ht, entries_to_remove=None, data_type='exomes', joint_included=False, hist_prefix='', freq_comparison_included=False, for_joint_validation=False)[source]

Create dictionary keyed by the variant annotation labels to be extracted from variant annotation arrays.

The values of the returned dictionary are Hail Expressions describing how to access the corresponding values.

Parameters:
  • ht (Table) – Table containing the nested variant annotation arrays to be unfurled.

  • entries_to_remove (Set[str]) – Optional Set of frequency entries to remove for vcf_export.

  • data_type (str) – Data type to unfurl nested annotations for. One of “exomes”, “genomes”, or “joint”. Default is “exomes”.

  • joint_included (bool) – Whether joint frequency data is included in the exome or genome HT. Default is False.

  • hist_prefix (str) – Prefix to use for histograms. Default is “”.

  • freq_comparison_included (bool) – Whether frequency comparison data is included in the HT. Default is False.

  • for_joint_validation (bool) – Whether to prepare HT for joint validation. Default is False.

Return type:

Tuple[StructExpression, Set[str], Dict[str, str]]

Returns:

StructExpression containing variant annotations and their corresponding expressions and updated entries, set of frequency entries to remove from the VCF, and a dict of fields to rename when for_joint_validation is True.
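A plain-Python sketch of the unfurling idea: a release HT stores freq as an array aligned with a freq_meta global, and each meta dict determines a flat label such as AC_afr_XX. The real function builds Hail expressions; field and meta key names below follow gnomAD conventions but are illustrative:

```python
def unfurl_freq(freq, freq_meta, label_delimiter="_"):
    """Flatten a freq array into {flat_label: value} keyed by freq_meta labels."""
    out = {}
    for entry, meta in zip(freq, freq_meta):
        # Build the label suffix from meta values, e.g. 'afr_XX'; the 'adj'
        # group is the unlabeled default, while 'raw' keeps its label.
        parts = [v for k, v in meta.items() if not (k == "group" and v == "adj")]
        suffix = (label_delimiter + label_delimiter.join(parts)) if parts else ""
        out["AC" + suffix] = entry["AC"]
        out["AN" + suffix] = entry["AN"]
        out["AF" + suffix] = entry["AF"]
        out["nhomalt" + suffix] = entry["homozygote_count"]
    return out

labels = unfurl_freq(
    [{"AC": 5, "AN": 10, "AF": 0.5, "homozygote_count": 1},
     {"AC": 2, "AN": 4, "AF": 0.5, "homozygote_count": 0}],
    [{"group": "adj"}, {"group": "adj", "gen_anc": "afr", "sex": "XX"}],
)
print(sorted(labels))
```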

gnomad_qc.v4.create_release.validate_and_export_vcf.make_info_expr(t, data_type='exomes', for_joint_validation=False)[source]

Make Hail expression for variant annotations to be included in VCF INFO field.

Parameters:
  • t (Table) – Table containing variant annotations to be reformatted for VCF export.

  • data_type (str) – Data type to make info expression for. One of “exomes”, “genomes”, or “joint”. Default is “exomes”.

  • for_joint_validation (bool) – Whether to prepare HT for joint validation. Default is False.

Return type:

Dict[str, Expression]

Returns:

Dictionary containing Hail expressions for relevant INFO annotations.

gnomad_qc.v4.create_release.validate_and_export_vcf.prepare_ht_for_validation(ht, data_type='exomes', freq_entries_to_remove=None, vcf_info_reorder=None, joint_included=False, for_joint_validation=True, freq_comparison_included=False)[source]

Prepare HT for validity checks and export.

Parameters:
  • ht (Table) – Release Hail Table.

  • data_type (str) – Data type to prepare HT for. One of “exomes”, “genomes”, or “joint”. Default is “exomes”.

  • freq_entries_to_remove (Optional[List[str]]) – List of entries to remove from freq.

  • vcf_info_reorder (Optional[List[str]]) – Order of VCF INFO fields.

  • joint_included (bool) – Whether joint frequency data is included in the HT. Default is False.

  • for_joint_validation (bool) – Whether to prepare HT for joint validation. Default is True.

  • freq_comparison_included (bool) – Whether frequency comparison data is included in the HT. Default is False.

Return type:

Table

Returns:

Hail Table prepared for validity checks and export and a dictionary of fields to rename when for_joint_validation is True.

gnomad_qc.v4.create_release.validate_and_export_vcf.populate_subset_info_dict(subset, description_text, data_type='exomes', pops={'afr': 'African/African-American', 'amr': 'Admixed American', 'asj': 'Ashkenazi Jewish', 'eas': 'East Asian', 'fin': 'Finnish', 'mid': 'Middle Eastern', 'nfe': 'Non-Finnish European', 'remaining': 'Remaining individuals', 'sas': 'South Asian'}, faf_pops={'v3': ['afr', 'amr', 'eas', 'nfe', 'sas'], 'v4': ['afr', 'amr', 'eas', 'mid', 'nfe', 'sas']}, sexes=['XX', 'XY'], label_delimiter='_', freq_comparison_included=False)[source]

Call make_info_dict to populate INFO dictionary for the requested subset.

Creates:
  • INFO fields for AC, AN, AF, nhomalt for each combination of sample genetic ancestry group and sex, for both adj and raw data.

  • INFO fields for filtering allele frequency (faf) annotations.

Parameters:
  • subset (str) – Sample subset in dataset. “” is used as a placeholder for the full dataset.

  • description_text (str) – Text describing the sample subset that should be added to the INFO description.

  • data_type (str) – One of “exomes”, “genomes”, or “joint”. Default is “exomes”.

  • pops (Dict[str, str]) – Dict of sample global genetic ancestry names for the gnomAD data type.

  • faf_pops (Dict[str, List[str]]) – Dict with gnomAD version (keys) and faf genetic ancestry group names (values). Default is FAF_POPS.

  • sexes (List[str]) – gnomAD sample sexes used in VCF export. Default is SEXES.

  • label_delimiter (str) – String to use as delimiter when making group label combinations. Default is ‘_’.

  • freq_comparison_included (bool) – Whether frequency comparison data is included in the HT.

Return type:

Dict[str, Dict[str, str]]

Returns:

Dictionary containing subset-specific INFO header fields.
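A hedged sketch of how such header entries can be assembled for one field. The real function delegates to make_info_dict in the gnomad package; the helper name and description wording below are illustrative:

```python
def make_subset_ac_entries(subset, pops, sexes, description_text="", delim="_"):
    """Build VCF INFO header entries for AC per ancestry-group/sex combo."""
    entries = {}
    for pop_key, pop_name in pops.items():
        for sex in sexes:
            # Empty subset ("" = full dataset) contributes no label component.
            label = delim.join(p for p in ("AC", subset, pop_key, sex) if p)
            entries[label] = {
                "Number": "A",
                "Description": (
                    f"Alternate allele count for {sex} samples of "
                    f"{pop_name} genetic ancestry{description_text}"
                ),
            }
    return entries

d = make_subset_ac_entries(
    "non_ukb",
    {"afr": "African/African-American"},
    ["XX", "XY"],
    description_text=" in the non-UKB subset",
)
print(sorted(d))  # ['AC_non_ukb_afr_XX', 'AC_non_ukb_afr_XY']
```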

gnomad_qc.v4.create_release.validate_and_export_vcf.populate_info_dict(info_fields, bin_edges, age_hist_distribution=None, info_dict={'AS_SB_TABLE': {'Description': 'Allele-specific forward/reverse read counts for strand bias tests', 'Number': '.'}, 'AS_pab_max': {'Description': 'Maximum p-value over callset for binomial test of observed allele balance for a heterozygous genotype, given expectation of 0.5', 'Number': 'A'}, 'BaseQRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference base qualities'}, 'FS': {'Description': "Phred-scaled p-value of Fisher's exact test for strand bias"}, 'InbreedingCoeff': {'Description': 'Inbreeding coefficient, the excess heterozygosity at a variant site, computed as 1 - (the number of heterozygous genotypes)/(the number of heterozygous genotypes expected under Hardy-Weinberg equilibrium)', 'Number': 'A'}, 'MQ': {'Description': 'Root mean square of the mapping quality of reads across all samples'}, 'MQRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference read mapping qualities'}, 'NEGATIVE_TRAIN_SITE': {'Description': 'Variant was used to build the negative training set of low-quality variants for VQSR'}, 'POSITIVE_TRAIN_SITE': {'Description': 'Variant was used to build the positive training set of high-quality variants for VQSR'}, 'QD': {'Description': 'Variant call confidence normalized by depth of sample reads supporting a variant'}, 'QUALapprox': {'Description': 'Sum of PL[0] values; used to approximate the QUAL score', 'Number': '1'}, 'ReadPosRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. 
reference read position bias'}, 'SOR': {'Description': 'Strand bias estimated by the symmetric odds ratio test'}, 'VQSLOD': {'Description': 'Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model'}, 'VarDP': {'Description': 'Depth over variant genotypes (does not include depth of reference samples)'}, 'allele_type': {'Description': 'Allele type (snv, insertion, deletion, or mixed)'}, 'culprit': {'Description': 'Worst-performing annotation in the VQSR Gaussian mixture model'}, 'decoy': {'Description': 'Variant falls within a reference decoy region'}, 'fail_interval_qc': {'Description': 'Less than 85 percent of samples meet 20X coverage if variant is in autosomal or PAR regions or 10X coverage for non-PAR regions of chromosomes X and Y.'}, 'has_star': {'Description': 'Variant locus coincides with a spanning deletion (represented by a star) observed elsewhere in the callset'}, 'inbreeding_coeff': {'Description': 'Inbreeding coefficient, the excess heterozygosity at a variant site, computed as 1 - (the number of heterozygous genotypes)/(the number of heterozygous genotypes expected under Hardy-Weinberg equilibrium)', 'Number': 'A'}, 'lcr': {'Description': 'Variant falls within a low complexity region'}, 'monoallelic': {'Description': 'All samples are homozygous alternate for the variant'}, 'n_alt_alleles': {'Description': 'Total number of alternate alleles observed at variant locus', 'Number': '1'}, 'negative_train_site': {'Description': 'Variant was used to build the negative training set of low-quality variants for VQSR'}, 'non_par': {'Description': 'Variant (on sex chromosome) falls outside a pseudoautosomal region'}, 'nonpar': {'Description': 'Variant (on sex chromosome) falls outside a pseudoautosomal region'}, 'only_het': {'Description': 'All samples are heterozygous for the variant'}, 'original_alleles': {'Description': 'Alleles before splitting multiallelics'}, 'outside_broad_capture_region': 
{'Description': 'Variant falls outside of Broad exome capture regions.'}, 'outside_ukb_capture_region': {'Description': 'Variant falls outside of UK Biobank exome capture regions.'}, 'positive_train_site': {'Description': 'Variant was used to build the positive training set of high-quality variants for VQSR'}, 'rf_label': {'Description': 'Random forest training label'}, 'rf_negative_label': {'Description': 'Variant was labelled as a negative example for training of random forest model'}, 'rf_positive_label': {'Description': 'Variant was labelled as a positive example for training of random forest model'}, 'rf_tp_probability': {'Description': 'Probability of a called variant being a true variant as determined by random forest model'}, 'rf_train': {'Description': 'Variant was used in training random forest model'}, 'segdup': {'Description': 'Variant falls within a segmental duplication region'}, 'sibling_singleton': {'Description': 'Variant was a callset-wide doubleton that was present only in two siblings (i.e., a singleton amongst unrelated samples in cohort).'}, 'transmitted_singleton': {'Description': 'Variant was a callset-wide doubleton that was transmitted within a family from a parent to a child (i.e., a singleton amongst unrelated samples in cohort)'}, 'variant_type': {'Description': 'Variant type (snv, indel, multi-snv, multi-indel, or mixed)'}, 'was_mixed': {'Description': 'Variant type was mixed'}}, subset_list=['non_ukb'], pops={'afr': 'African/African-American', 'amr': 'Admixed American', 'asj': 'Ashkenazi Jewish', 'eas': 'East Asian', 'fin': 'Finnish', 'mid': 'Middle Eastern', 'nfe': 'Non-Finnish European', 'remaining': 'Remaining individuals', 'sas': 'South Asian'}, faf_pops={'v3': ['afr', 'amr', 'eas', 'nfe', 'sas'], 'v4': ['afr', 'amr', 'eas', 'mid', 'nfe', 'sas']}, sexes=['XX', 'XY'], in_silico_dict={'cadd_phred': {'Description': "Cadd Phred-like scores ('scaled C-scores') ranging from 1 to 99, based on the rank of each variant relative to all 
possible 8.6 billion substitutions in the human reference genome. Larger values are more deleterious.", 'Number': '1'}, 'cadd_raw_score': {'Description': "Raw CADD scores are interpretable as the extent to which the annotation profile for a given variant suggests that the variant is likely to be 'observed' (negative values) vs 'simulated' (positive values). Larger values are more deleterious.", 'Number': '1'}, 'pangolin_largest_ds': {'Description': "Pangolin's largest delta score across 2 splicing consequences, which reflects the probability of the variant being splice-altering", 'Number': '1'}, 'phylop': {'Description': 'Base-wise conservation score across the 241 placental mammals in the Zoonomia project. Score ranges from -20 to 9.28, and reflects acceleration (faster evolution than expected under neutral drift, assigned negative scores) as well as conservation (slower than expected evolution, assigned positive scores).', 'Number': '1'}, 'polyphen_max': {'Description': 'Score that predicts the possible impact of an amino acid substitution on the structure and function of a human protein, ranging from 0.0 (tolerated) to 1.0 (deleterious).  We prioritize max scores for MANE Select transcripts where possible and otherwise report a score for the canonical transcript.', 'Number': '1'}, 'revel_max': {'Description': "The maximum REVEL score at a site's MANE Select or canonical transcript. It's an ensemble score for predicting the pathogenicity of missense variants (based on 13 other variant predictors). Scores ranges from 0 to 1. Variants with higher scores are predicted to be more likely to be deleterious.", 'Number': '1'}, 'sift_max': {'Description': 'Score reflecting the scaled probability of the amino acid substitution being tolerated, ranging from 0 to 1. Scores below 0.05 are predicted to impact protein function. 
We prioritize max scores for MANE Select transcripts where possible and otherwise report a score for the canonical transcript.', 'Number': '1'}, 'spliceai_ds_max': {'Description': "Illumina's SpliceAI max delta score; interpreted as the probability of the variant being splice-altering.", 'Number': '1'}}, vrs_fields_dict={'VRS_Allele_IDs': {'Description': 'The computed identifiers for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_Ends': {'Description': 'Interresidue coordinates used as the location ends for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_Starts': {'Description': 'Interresidue coordinates used as the location starts for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_States': {'Description': 'The literal sequence states used for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': '.'}}, label_delimiter='_', data_type='exomes', freq_comparison_included=False, extra_suffix=None, extra_description_text=None)[source]

Call make_info_dict and make_hist_dict to populate INFO dictionary.

Used during VCF export.

Creates:
  • INFO fields for age histograms (bin freq, n_smaller, and n_larger for heterozygous and homozygous variant carriers).

  • INFO fields for grpmax AC, AN, AF, nhomalt, and grpmax genetic ancestry group.

  • INFO fields for AC, AN, AF, nhomalt for each combination of sample genetic ancestry group and sex, for both adj and raw data.

  • INFO fields for filtering allele frequency (faf) annotations.

  • INFO fields for variant histograms (hist_bin_freq for each histogram and hist_n_larger for DP histograms).

Parameters:
  • info_fields (List[str]) – List of info fields to add to the info dict. Default is None.

  • bin_edges (Dict[str, str]) – Dictionary of variant annotation histograms and their associated bin edges.

  • age_hist_distribution (str) – Pipe-delimited string of overall age histogram bin frequency.

  • info_dict (Dict[str, Dict[str, str]]) – INFO dict to be populated.

  • subset_list (List[str]) – List of sample subsets in dataset. Default is SUBSETS[“exomes”].

  • pops (Dict[str, str]) – Dict of sample global genetic ancestry names for the gnomAD data type.

  • faf_pops (Dict[str, List[str]]) – Dict with gnomAD version (keys) and faf genetic ancestry group names (values). Default is FAF_POPS.

  • sexes (List[str]) – gnomAD sample sexes used in VCF export. Default is SEXES.

  • in_silico_dict (Dict[str, Dict[str, str]]) – Dictionary of in silico predictor score descriptions.

  • vrs_fields_dict (Dict[str, Dict[str, str]]) – Dictionary with VRS annotations.

  • label_delimiter (str) – String to use as delimiter when making group label combinations.

  • data_type (str) – Data type to populate info dict for. One of “exomes” or “genomes”. Default is “exomes”.

  • freq_comparison_included (bool) – Whether frequency comparison data is included in the HT.

  • extra_suffix (str) – Suffix to add to INFO field.

  • extra_description_text (str) – Extra description text to add to INFO field.

Return type:

Dict[str, Dict[str, str]]

Returns:

Updated INFO dictionary for VCF export.

gnomad_qc.v4.create_release.validate_and_export_vcf.prepare_vcf_header_dict(ht, validated_ht, info_fields, bin_edges, age_hist_distribution, subset_list, pops, data_type='exomes', joint_included=False, freq_comparison_included=False, extra_suffix=None, extra_description_text=None)[source]

Prepare VCF header dictionary.

Parameters:
  • ht (Table) – Input Table.

  • validated_ht (Optional[Table]) – Validated HT with unfurled info fields.

  • info_fields (List[str]) – List of info fields to add to the info dict.

  • bin_edges (Dict[str, str]) – Dictionary of variant annotation histograms and their associated bin edges.

  • age_hist_distribution (str) – Pipe-delimited string of overall age histogram bin frequency.

  • subset_list (List[str]) – List of sample subsets in dataset.

  • pops (Dict[str, str]) – List of sample global genetic ancestry group names for gnomAD data type.

  • data_type (str) – Data type to prepare VCF header for. One of “exomes” or “genomes”. Default is “exomes”.

  • joint_included (bool) – Whether joint frequency data is included in the HT. Default is False.

  • freq_comparison_included (bool) – Whether frequency comparison data is included in the HT.

  • extra_suffix (str) – Suffix to add to INFO field.

  • extra_description_text (str) – Extra description text to add to INFO field.

Return type:

Dict[str, Dict[str, str]]

Returns:

Prepared VCF header dictionary.

gnomad_qc.v4.create_release.validate_and_export_vcf.get_downsamplings_fields(ht)[source]

Get downsampling specific annotations from info struct.

Note

This retrieves any annotation that contains any downsampling value in its name.

Parameters:

ht (Table) – Input Table.

Return type:

List[str]

Returns:

List of downsampling specific annotations to drop from info struct.
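A plain-Python sketch of the note above: keep any info field whose label contains a downsampling value (a sample count such as 1000 or 10000). Field names and values here are illustrative:

```python
def get_downsampling_fields(info_fields, downsampling_values):
    """Return info fields whose label contains a downsampling value."""
    ds = {str(v) for v in downsampling_values}
    # A field matches if any underscore-delimited part equals a downsampling.
    return [f for f in info_fields if any(p in ds for p in f.split("_"))]

fields = ["AC_afr_10000", "AC_afr", "AN_1000"]
print(get_downsampling_fields(fields, [1000, 10000]))  # ['AC_afr_10000', 'AN_1000']
```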

gnomad_qc.v4.create_release.validate_and_export_vcf.format_validated_ht_for_export(ht, data_type='exomes', vcf_info_reorder=['AC', 'AN', 'AF', 'grpmax', 'fafmax_faf95_max', 'fafmax_faf95_max_gen_anc'], info_fields_to_drop=None)[source]

Format validated HT for export.

Drop downsampling frequency stats from info, rearrange the info fields, and make sure fields are VCF compatible.

Parameters:
  • ht (Table) – Validated HT.

  • data_type (str) – Data type to format validated HT for. One of “exomes” or “genomes”. Default is “exomes”.

  • vcf_info_reorder (Optional[List[str]]) – Order of VCF INFO fields. These will be placed in front of all other fields in the order specified.

  • info_fields_to_drop (Optional[List[str]]) – List of info fields to drop from the info struct.

Return type:

Tuple[Table, List[str]]

Returns:

Formatted HT and a list of renamed row annotations.
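A hedged sketch of the vcf_info_reorder step: the requested fields go first, in the given order, followed by the remaining fields in their original order. Plain dicts stand in for the HT's info struct; absent requested fields are simply skipped:

```python
def reorder_info(info, front):
    """Move `front` keys (those present) to the front of the info mapping."""
    rest = [k for k in info if k not in front]
    return {k: info[k] for k in [f for f in front if f in info] + rest}

info = {"FS": 1.0, "AC": 5, "AN": 10, "QD": 2.0}
print(list(reorder_info(info, ["AC", "AN", "AF"])))  # ['AC', 'AN', 'FS', 'QD']
```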

gnomad_qc.v4.create_release.validate_and_export_vcf.process_vep_csq_header(vep_csq_header='Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|UNIPROT_ISOFORM|SOURCE|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|LoF|LoF_filter|LoF_flags|LoF_info')[source]

Process VEP CSQ header string, delimited by ‘|’, to remove polyphen and sift annotations.

Parameters:

vep_csq_header (str) – VEP CSQ header.

Return type:

str

Returns:

Processed VEP CSQ header.
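Because this is pure string processing, the documented behavior can be sketched directly. The helper name below is hypothetical; the real implementation in gnomad_qc may differ in detail:

```python
def strip_sift_polyphen(vep_csq_header):
    """Remove SIFT and PolyPhen from a '|'-delimited VEP CSQ format string."""
    prefix, fmt = vep_csq_header.split("Format: ")
    fields = [f for f in fmt.split("|") if f not in ("SIFT", "PolyPhen")]
    return prefix + "Format: " + "|".join(fields)

header = "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SIFT|PolyPhen|LoF"
print(strip_sift_polyphen(header))
```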

gnomad_qc.v4.create_release.validate_and_export_vcf.check_globals_for_retired_terms(ht)[source]

Check list of dictionaries to see if the keys in the dictionaries contain either ‘pop’ or ‘oth’.

Parameters:

ht (Table) – Input Table.

Return type:

None
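A plain-Python sketch of the retired-term scan: look through global annotation dictionaries for keys built from the retired labels 'pop' or 'oth' (superseded in gnomAD v4 by 'gen_anc' and 'remaining'). The helper name is hypothetical:

```python
RETIRED_TERMS = ("pop", "oth")

def find_retired_terms(global_dicts):
    """Return dict keys that use a retired term as a label component."""
    flagged = []
    for d in global_dicts:
        for key in d:
            # Compare underscore-delimited parts so 'pop' does not
            # spuriously match substrings of unrelated labels.
            if any(term in key.split("_") for term in RETIRED_TERMS):
                flagged.append(key)
    return flagged

print(find_retired_terms([{"AC_pop_afr": "0", "AC_gen_anc_afr": "1"}]))
```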

gnomad_qc.v4.create_release.validate_and_export_vcf.get_joint_filters(ht)[source]

Transform exomes and genomes filters to joint filters.

Parameters:

ht (Table) – Input Table.

Return type:

Table

Returns:

Table with joint filters transformed from exomes and genomes filters.
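A hedged sketch (plain Python, not Hail) of one plausible reading of the transform: record which data type(s) failed variant QC. The labels EXOMES_FILTERED and GENOMES_FILTERED are illustrative, not necessarily the exact gnomAD filter names:

```python
def joint_filters(exomes_filters, genomes_filters):
    """Combine per-data-type filter sets into a joint filters set."""
    out = set()
    if exomes_filters:
        out.add("EXOMES_FILTERED")
    if genomes_filters:
        out.add("GENOMES_FILTERED")
    # An empty combined set means the variant passes in both data types.
    return out or {"PASS"}

print(joint_filters({"AC0"}, set()))  # {'EXOMES_FILTERED'}
```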

gnomad_qc.v4.create_release.validate_and_export_vcf.main(args)[source]

Validate release Table and export VCFs.