gnomad.utils.vcf

gnomad.utils.vcf.SORT_ORDER

Order to sort subgroupings during VCF export.

gnomad.utils.vcf.GROUPS

Group names used to generate labels for high quality genotypes and all raw genotypes.

gnomad.utils.vcf.HISTS

Quality histograms used in VCF export.

gnomad.utils.vcf.FAF_POPS

Global populations that are included in filtering allele frequency (faf) calculations.

gnomad.utils.vcf.SEXES

Sample sexes used in VCF export.

gnomad.utils.vcf.AS_FIELDS

Allele-specific variant annotations.

gnomad.utils.vcf.SITE_FIELDS

Site level variant annotations.

gnomad.utils.vcf.ALLELE_TYPE_FIELDS

Allele-type annotations.

gnomad.utils.vcf.REGION_FLAG_FIELDS

Annotations about variant region type.

gnomad.utils.vcf.JOINT_REGION_FLAG_FIELDS

Annotations about variant region type that are specifically created for joint dataset of exomes and genomes from gnomAD v4.1.

gnomad.utils.vcf.RF_FIELDS

Annotations specific to the variant QC using a random forest model.

gnomad.utils.vcf.AS_VQSR_FIELDS

Allele-specific VQSR annotations.

gnomad.utils.vcf.VQSR_FIELDS

Annotations specific to VQSR.

gnomad.utils.vcf.INFO_DICT

Dictionary used during VCF export to export row (variant) annotations.

gnomad.utils.vcf.IN_SILICO_ANNOTATIONS_INFO_DICT

Dictionary with in silico score descriptions to include in the VCF INFO header.

gnomad.utils.vcf.VRS_FIELDS_DICT

Dictionary with VRS annotations to include in the VCF INFO field and VCF header.

gnomad.utils.vcf.ENTRIES

Densified entries to be selected during VCF export.

gnomad.utils.vcf.SPARSE_ENTRIES

Sparse entries to be selected and densified during VCF export.

gnomad.utils.vcf.FORMAT_DICT

Dictionary used during VCF export to export MatrixTable entries.

gnomad.utils.vcf.adjust_vcf_incompatible_types(ht)

Create a Table ready for VCF export.

gnomad.utils.vcf.make_label_combos(label_groups)

Make combinations of all possible labels for a supplied dictionary of label groups.

gnomad.utils.vcf.index_globals(...[, ...])

Create a dictionary keyed by the specified label groupings with values describing the corresponding index of each grouping entry in the globals_array annotation.

gnomad.utils.vcf.make_combo_header_text(...)

Programmatically generate text to populate the VCF header description for a given variant annotation with specific groupings and subset.

gnomad.utils.vcf.create_label_groups(pops[, ...])

Generate a list of label group dictionaries needed to populate info dictionary.

gnomad.utils.vcf.make_info_dict([prefix, ...])

Generate dictionary of Number and Description attributes of VCF INFO fields.

gnomad.utils.vcf.add_as_info_dict([...])

Update info dictionary with allele-specific terms and their descriptions.

gnomad.utils.vcf.make_vcf_filter_dict([...])

Generate dictionary of Number and Description attributes to be used in the VCF header, specifically for FILTER annotations.

gnomad.utils.vcf.make_hist_bin_edges_expr(ht)

Create dictionaries containing variant histogram annotations and their associated bin edges, formatted into a string separated by pipe delimiters.

gnomad.utils.vcf.make_hist_dict(bin_edges, adj)

Generate dictionary of Number and Description attributes to be used in the VCF header, specifically for histogram annotations.

gnomad.utils.vcf.set_female_y_metrics_to_na(t)

Set AC, AN, and nhomalt chrY variant annotations for females to NA (instead of 0).

gnomad.utils.vcf.build_vcf_export_reference(name)

Create export reference based on reference genome defined by build.

gnomad.utils.vcf.rekey_new_reference(t, ...)

Re-key Table or MatrixTable with a new reference genome.

gnomad.utils.vcf.SORT_ORDER = ['subset', 'downsampling', 'popmax', 'grpmax', 'pop', 'gen_anc', 'subpop', 'sex', 'group']

Order to sort subgroupings during VCF export. Ensures that INFO labels in VCF are in desired order (e.g., raw_AC_afr_female).

gnomad.utils.vcf.GROUPS = ['adj', 'raw']

Group names used to generate labels for high quality genotypes and all raw genotypes. Used in VCF export.

gnomad.utils.vcf.HISTS = ['gq_hist_alt', 'gq_hist_all', 'dp_hist_alt', 'dp_hist_all', 'ab_hist_alt']

Quality histograms used in VCF export.

gnomad.utils.vcf.FAF_POPS = {'v3': ['afr', 'amr', 'eas', 'nfe', 'sas'], 'v4': ['afr', 'amr', 'eas', 'mid', 'nfe', 'sas']}

Global populations that are included in filtering allele frequency (faf) calculations. Used in VCF export.

gnomad.utils.vcf.SEXES = ['XX', 'XY']

Sample sexes used in VCF export.

Used to stratify frequency annotations (AC, AN, AF) for each sex. Note that sample sexes in gnomAD v3 and earlier were ‘male’ and ‘female’.

gnomad.utils.vcf.AS_FIELDS = ['AS_FS', 'AS_MQ', 'AS_MQRankSum', 'AS_pab_max', 'AS_QUALapprox', 'AS_QD', 'AS_ReadPosRankSum', 'AS_SB_TABLE', 'AS_SOR', 'AS_VarDP', 'InbreedingCoeff']

Allele-specific variant annotations.

gnomad.utils.vcf.SITE_FIELDS = ['FS', 'MQ', 'MQRankSum', 'QUALapprox', 'QD', 'ReadPosRankSum', 'SB', 'SOR', 'VarDP']

Site level variant annotations.

gnomad.utils.vcf.ALLELE_TYPE_FIELDS = ['allele_type', 'has_star', 'n_alt_alleles', 'original_alleles', 'variant_type', 'was_mixed']

Allele-type annotations.

gnomad.utils.vcf.REGION_FLAG_FIELDS = ['decoy', 'lcr', 'nonpar', 'non_par', 'segdup']

Annotations about variant region type.

Note

decoy resource files do not currently exist for GRCh38/hg38.

gnomad.utils.vcf.JOINT_REGION_FLAG_FIELDS = ['fail_interval_qc', 'outside_broad_capture_region', 'outside_ukb_capture_region', 'outside_broad_calling_region', 'outside_ukb_calling_region', 'not_called_in_exomes', 'not_called_in_genomes']

Annotations about variant region type that are specifically created for joint dataset of exomes and genomes from gnomAD v4.1.

gnomad.utils.vcf.RF_FIELDS = ['rf_positive_label', 'rf_negative_label', 'rf_label', 'rf_train', 'rf_tp_probability']

Annotations specific to the variant QC using a random forest model.

gnomad.utils.vcf.AS_VQSR_FIELDS = ['AS_culprit', 'AS_VQSLOD']

Allele-specific VQSR annotations.

gnomad.utils.vcf.VQSR_FIELDS = ['AS_culprit', 'AS_VQSLOD', 'NEGATIVE_TRAIN_SITE', 'POSITIVE_TRAIN_SITE']

Annotations specific to VQSR.

gnomad.utils.vcf.INFO_DICT = {'AS_SB_TABLE': {'Description': 'Allele-specific forward/reverse read counts for strand bias tests', 'Number': '.'}, 'AS_pab_max': {'Description': 'Maximum p-value over callset for binomial test of observed allele balance for a heterozygous genotype, given expectation of 0.5', 'Number': 'A'}, 'BaseQRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference base qualities'}, 'FS': {'Description': "Phred-scaled p-value of Fisher's exact test for strand bias"}, 'InbreedingCoeff': {'Description': 'Inbreeding coefficient, the excess heterozygosity at a variant site, computed as 1 - (the number of heterozygous genotypes)/(the number of heterozygous genotypes expected under Hardy-Weinberg equilibrium)', 'Number': 'A'}, 'MQ': {'Description': 'Root mean square of the mapping quality of reads across all samples'}, 'MQRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference read mapping qualities'}, 'NEGATIVE_TRAIN_SITE': {'Description': 'Variant was used to build the negative training set of low-quality variants for VQSR'}, 'POSITIVE_TRAIN_SITE': {'Description': 'Variant was used to build the positive training set of high-quality variants for VQSR'}, 'QD': {'Description': 'Variant call confidence normalized by depth of sample reads supporting a variant'}, 'QUALapprox': {'Description': 'Sum of PL[0] values; used to approximate the QUAL score', 'Number': '1'}, 'ReadPosRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias'}, 'SOR': {'Description': 'Strand bias estimated by the symmetric odds ratio test'}, 'VQSLOD': {'Description': 'Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model'}, 'VarDP': {'Description': 'Depth over variant genotypes (does not include depth of reference samples)'}, 'allele_type': {'Description': 'Allele type (snv, insertion, deletion, or mixed)'}, 'culprit': {'Description': 'Worst-performing annotation in the VQSR Gaussian mixture model'}, 'decoy': {'Description': 'Variant falls within a reference decoy region'}, 'fail_interval_qc': {'Description': 'Less than 85 percent of samples meet 20X coverage if variant is in autosomal or PAR regions or 10X coverage for non-PAR regions of chromosomes X and Y.'}, 'has_star': {'Description': 'Variant locus coincides with a spanning deletion (represented by a star) observed elsewhere in the callset'}, 'inbreeding_coeff': {'Description': 'Inbreeding coefficient, the excess heterozygosity at a variant site, computed as 1 - (the number of heterozygous genotypes)/(the number of heterozygous genotypes expected under Hardy-Weinberg equilibrium)', 'Number': 'A'}, 'lcr': {'Description': 'Variant falls within a low complexity region'}, 'monoallelic': {'Description': 'All samples are homozygous alternate for the variant'}, 'n_alt_alleles': {'Description': 'Total number of alternate alleles observed at variant locus', 'Number': '1'}, 'negative_train_site': {'Description': 'Variant was used to build the negative training set of low-quality variants for VQSR'}, 'non_par': {'Description': 'Variant (on sex chromosome) falls outside a pseudoautosomal region'}, 'nonpar': {'Description': 'Variant (on sex chromosome) falls outside a pseudoautosomal region'}, 'only_het': {'Description': 'All samples are heterozygous for the variant'}, 'original_alleles': {'Description': 'Alleles before splitting multiallelics'}, 'outside_broad_capture_region': {'Description': 'Variant falls outside of Broad exome capture regions.'}, 'outside_ukb_capture_region': {'Description': 'Variant falls outside of UK Biobank exome capture regions.'}, 'positive_train_site': {'Description': 'Variant was used to build the positive training set of high-quality variants for VQSR'}, 'rf_label': {'Description': 'Random forest training label'}, 'rf_negative_label': {'Description': 'Variant was labelled as a negative example for training of random forest model'}, 'rf_positive_label': {'Description': 'Variant was labelled as a positive example for training of random forest model'}, 'rf_tp_probability': {'Description': 'Probability of a called variant being a true variant as determined by random forest model'}, 'rf_train': {'Description': 'Variant was used in training random forest model'}, 'segdup': {'Description': 'Variant falls within a segmental duplication region'}, 'sibling_singleton': {'Description': 'Variant was a callset-wide doubleton that was present only in two siblings (i.e., a singleton amongst unrelated samples in cohort).'}, 'transmitted_singleton': {'Description': 'Variant was a callset-wide doubleton that was transmitted within a family from a parent to a child (i.e., a singleton amongst unrelated samples in cohort)'}, 'variant_type': {'Description': 'Variant type (snv, indel, multi-snv, multi-indel, or mixed)'}, 'was_mixed': {'Description': 'Variant type was mixed'}}

Dictionary used during VCF export to export row (variant) annotations.

gnomad.utils.vcf.IN_SILICO_ANNOTATIONS_INFO_DICT = {'cadd_phred': {'Description': "Cadd Phred-like scores ('scaled C-scores') ranging from 1 to 99, based on the rank of each variant relative to all possible 8.6 billion substitutions in the human reference genome. Larger values are more deleterious.", 'Number': '1'}, 'cadd_raw_score': {'Description': "Raw CADD scores are interpretable as the extent to which the annotation profile for a given variant suggests that the variant is likely to be 'observed' (negative values) vs 'simulated' (positive values). Larger values are more deleterious.", 'Number': '1'}, 'pangolin_largest_ds': {'Description': "Pangolin's largest delta score across 2 splicing consequences, which reflects the probability of the variant being splice-altering", 'Number': '1'}, 'phylop': {'Description': 'Base-wise conservation score across the 241 placental mammals in the Zoonomia project. Score ranges from -20 to 9.28, and reflects acceleration (faster evolution than expected under neutral drift, assigned negative scores) as well as conservation (slower than expected evolution, assigned positive scores).', 'Number': '1'}, 'polyphen_max': {'Description': 'Score that predicts the possible impact of an amino acid substitution on the structure and function of a human protein, ranging from 0.0 (tolerated) to 1.0 (deleterious).  We prioritize max scores for MANE Select transcripts where possible and otherwise report a score for the canonical transcript.', 'Number': '1'}, 'revel_max': {'Description': "The maximum REVEL score at a site's MANE Select or canonical transcript. It's an ensemble score for predicting the pathogenicity of missense variants (based on 13 other variant predictors). Scores ranges from 0 to 1. Variants with higher scores are predicted to be more likely to be deleterious.", 'Number': '1'}, 'sift_max': {'Description': 'Score reflecting the scaled probability of the amino acid substitution being tolerated, ranging from 0 to 1. Scores below 0.05 are predicted to impact protein function. We prioritize max scores for MANE Select transcripts where possible and otherwise report a score for the canonical transcript.', 'Number': '1'}, 'spliceai_ds_max': {'Description': "Illumina's SpliceAI max delta score; interpreted as the probability of the variant being splice-altering.", 'Number': '1'}}

Dictionary with in silico score descriptions to include in the VCF INFO header.

gnomad.utils.vcf.VRS_FIELDS_DICT = {'VRS_Allele_IDs': {'Description': 'The computed identifiers for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_Ends': {'Description': 'Interresidue coordinates used as the location ends for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_Starts': {'Description': 'Interresidue coordinates used as the location starts for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_States': {'Description': 'The literal sequence states used for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': '.'}}

Dictionary with VRS annotations to include in the VCF INFO field and VCF header.

gnomad.utils.vcf.ENTRIES = ['GT', 'GQ', 'DP', 'AD', 'MIN_DP', 'PGT', 'PID', 'PL', 'SB']

Densified entries to be selected during VCF export.

gnomad.utils.vcf.SPARSE_ENTRIES = ['END', 'DP', 'GQ', 'LA', 'LAD', 'LGT', 'LPGT', 'LPL', 'MIN_DP', 'PID', 'RGQ', 'SB']

Sparse entries to be selected and densified during VCF export.

gnomad.utils.vcf.FORMAT_DICT = {'AD': {'Description': 'Allelic depths for the ref and alt alleles in the order listed', 'Number': 'R', 'Type': 'Integer'}, 'DP': {'Description': 'Approximate read depth (reads with MQ=255 or with bad mates are filtered)', 'Number': '1', 'Type': 'Integer'}, 'GQ': {'Description': 'Phred-scaled confidence that the genotype assignment is correct. Value is the difference between the second lowest PL and the lowest PL (always normalized to 0).', 'Number': '1', 'Type': 'Integer'}, 'GT': {'Description': 'Genotype', 'Number': '1', 'Type': 'String'}, 'MIN_DP': {'Description': 'Minimum DP observed within the GVCF block', 'Number': '1', 'Type': 'Integer'}, 'PGT': {'Description': 'Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another', 'Number': '1', 'Type': 'String'}, 'PID': {'Description': 'Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group', 'Number': '1', 'Type': 'String'}, 'PL': {'Description': 'Normalized, phred-scaled likelihoods for genotypes as defined in the VCF specification', 'Number': 'G', 'Type': 'Integer'}, 'SB': {'Description': "Per-sample component statistics which comprise the Fisher's exact test to detect strand bias. Values are: depth of reference allele on forward strand, depth of reference allele on reverse strand, depth of alternate allele on forward strand, depth of alternate allele on reverse strand.", 'Number': '4', 'Type': 'Integer'}}

Dictionary used during VCF export to export MatrixTable entries.

gnomad.utils.vcf.adjust_vcf_incompatible_types(ht, pipe_delimited_annotations=['AS_QUALapprox', 'AS_VarDP', 'AS_MQ_DP', 'AS_RAW_MQ', 'AS_SB_TABLE'])

Create a Table ready for VCF export.

In particular, the following conversions are done:
  • All int64 are coerced to int32

  • Fields specified by pipe_delimited_annotations are converted from arrays to pipe-delimited strings

Parameters:
  • ht (Table) – Input Table.

  • pipe_delimited_annotations (List[str]) – List of info fields (they must be fields of the ht.info Struct).

Return type:

Table

Returns:

Table ready for VCF export.
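Both conversions can be illustrated without Hail. The following is a minimal plain-Python sketch of the idea, not the library's implementation: the helper names fits_int32 and to_pipe_delimited are hypothetical, and rendering a missing array element as an empty string is an assumption.

```python
from typing import Optional, Sequence

# Bounds of a 32-bit signed integer.
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def fits_int32(value: int) -> bool:
    """Return True if value can be stored as a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX


def to_pipe_delimited(values: Sequence[Optional[int]]) -> str:
    """Join array elements with '|'.

    Missing elements (None) become empty strings; this is an assumption
    made for the sketch, not documented library behaviour.
    """
    return "|".join("" if v is None else str(v) for v in values)


print(to_pipe_delimited([10, 3, 7, 12]))  # 10|3|7|12
print(fits_int32(2**31))  # False
```

A value that fails the fits_int32 check would not survive coercion to int32 unchanged, so it may be worth checking value ranges before export.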

gnomad.utils.vcf.make_label_combos(label_groups, sort_order=['subset', 'downsampling', 'popmax', 'grpmax', 'pop', 'gen_anc', 'subpop', 'sex', 'group'], label_delimiter='_')

Make combinations of all possible labels for a supplied dictionary of label groups.

For example, if label_groups is {“sex”: [“male”, “female”], “pop”: [“afr”, “nfe”, “amr”]}, this function will return [“afr_male”, “afr_female”, “nfe_male”, “nfe_female”, “amr_male”, “amr_female”].

Parameters:
  • label_groups (Dict[str, List[str]]) – Dictionary containing an entry for each label group, where key is the name of the grouping, e.g. “sex” or “pop”, and value is a list of all possible values for that grouping (e.g. [“male”, “female”] or [“afr”, “nfe”, “amr”]).

  • sort_order (List[str]) – List containing order to sort label group combinations. Default is SORT_ORDER.

  • label_delimiter (str) – String to use as delimiter when making group label combinations.

Return type:

List[str]

Returns:

List of all possible combinations of values for the supplied label groupings.
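The ordering behaviour described above can be reproduced with the standard library alone. This is a hedged sketch rather than the library's code: sketch_label_combos is a hypothetical name, and the SORT_ORDER list is copied from the value documented for gnomad.utils.vcf.SORT_ORDER.

```python
from itertools import product
from typing import Dict, List

# Copied from the documented value of gnomad.utils.vcf.SORT_ORDER.
SORT_ORDER = [
    "subset", "downsampling", "popmax", "grpmax", "pop",
    "gen_anc", "subpop", "sex", "group",
]


def sketch_label_combos(
    label_groups: Dict[str, List[str]],
    sort_order: List[str] = SORT_ORDER,
    label_delimiter: str = "_",
) -> List[str]:
    """Combine one value from each label group, ordering groups by sort_order."""
    # Order the grouping names by their position in sort_order.
    keys = sorted(label_groups, key=sort_order.index)
    # Take the Cartesian product of the values in that order and join them.
    return [
        label_delimiter.join(combo)
        for combo in product(*(label_groups[k] for k in keys))
    ]


print(sketch_label_combos({"sex": ["male", "female"], "pop": ["afr", "nfe", "amr"]}))
# ['afr_male', 'afr_female', 'nfe_male', 'nfe_female', 'amr_male', 'amr_female']
```

Because “pop” precedes “sex” in the sort order, the population label comes first in every combination, regardless of the key order in the input dictionary.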

gnomad.utils.vcf.index_globals(globals_array, label_groups, label_delimiter='_')

Create a dictionary keyed by the specified label groupings with values describing the corresponding index of each grouping entry in the globals_array annotation.

Parameters:
  • globals_array (List[Dict[str, str]]) – Ordered list containing dictionary entries describing all the grouping combinations contained in the globals_array annotation. Keys are the grouping type (e.g., ‘group’, ‘pop’, ‘sex’) and values are the grouping attribute (e.g., ‘adj’, ‘eas’, ‘XY’).

  • label_groups (Dict[str, List[str]]) – Dictionary containing an entry for each label group, where key is the name of the grouping, e.g. “sex” or “pop”, and value is a list of all possible values for that grouping (e.g. [“male”, “female”] or [“afr”, “nfe”, “amr”]).

  • label_delimiter (str) – String used as delimiter when making group label combinations.

Return type:

Dict[str, int]

Returns:

Dictionary keyed by specified label grouping combinations, with values describing the corresponding index of each grouping entry in globals_array.
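The lookup described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: sketch_index_globals is a hypothetical helper that takes ready-made combination labels (the real function takes label_groups), and the matching rule, comparing the set of label parts with the set of values in each entry, is an assumption.

```python
from typing import Dict, List


def sketch_index_globals(
    globals_array: List[Dict[str, str]],
    combos: List[str],
    label_delimiter: str = "_",
) -> Dict[str, int]:
    """Map each combination label to the index of its entry in globals_array."""
    index_dict = {}
    for combo in combos:
        # Split the label back into its grouping values.
        wanted = set(combo.split(label_delimiter))
        for i, entry in enumerate(globals_array):
            # An entry matches when it holds exactly the same values.
            if set(entry.values()) == wanted:
                index_dict[combo] = i
    return index_dict


globals_array = [
    {"group": "adj"},
    {"group": "raw"},
    {"group": "adj", "pop": "eas"},
    {"group": "adj", "pop": "eas", "sex": "XY"},
]
print(sketch_index_globals(globals_array, ["adj", "raw", "eas_adj", "eas_XY_adj"]))
# {'adj': 0, 'raw': 1, 'eas_adj': 2, 'eas_XY_adj': 3}
```

One limitation of this sketch: splitting on the delimiter would mis-handle grouping values that themselves contain the delimiter.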

gnomad.utils.vcf.make_combo_header_text(preposition, combo_dict, pop_names)

Programmatically generate text to populate the VCF header description for a given variant annotation with specific groupings and subset.

For example, if preposition is “for”, combo_dict is {“group”: “adj”, “pop”: “afr”, “sex”: “female”}, and pop_names describes “afr” as “African-American/African”, this function will return the string “ for female samples in the African-American/African genetic ancestry group”.

Parameters:
  • preposition (str) – Relevant preposition to precede automatically generated text.

  • combo_dict (Dict[str, str]) – Dict with grouping types as keys and values for grouping type as values. This function generates text for these values. Possible grouping types are: “group”, “pop”, “sex”, and “subpop”. Example input: {“pop”: “afr”, “sex”: “female”}

  • pop_names (Dict[str, str]) – Dict with global population names (keys) and population descriptions (values).

Return type:

str

Returns:

String with automatically generated description text for a given set of combo fields.

gnomad.utils.vcf.create_label_groups(pops, sexes=['XX', 'XY'], all_groups=['adj', 'raw'], pop_sex_groups=['adj'])

Generate a list of label group dictionaries needed to populate info dictionary.

Label dictionaries are passed as input to make_info_dict.

Parameters:
  • pops (List[str]) – List of population names.

  • sexes (List[str]) – List of sample sexes.

  • all_groups (List[str]) – List of data types (raw, adj). Default is GROUPS, which is [“adj”, “raw”].

  • pop_sex_groups (List[str]) – List of data types (raw, adj) to populate with pops and sexes. Default is [“adj”].

Return type:

List[Dict[str, List[str]]]

Returns:

List of label group dictionaries.
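To make the return type concrete, the sketch below builds one plausible list of label group dictionaries. The exact set of dictionaries is an assumption made for illustration, and sketch_label_groups is a hypothetical name; consult the library source for the authoritative structure.

```python
from typing import Dict, List


def sketch_label_groups(
    pops: List[str],
    sexes: List[str] = ["XX", "XY"],
    all_groups: List[str] = ["adj", "raw"],
    pop_sex_groups: List[str] = ["adj"],
) -> List[Dict[str, List[str]]]:
    """Build label group dictionaries (assumed structure, for illustration)."""
    return [
        {"group": list(all_groups)},  # overall adj and raw
        {"group": list(pop_sex_groups), "sex": list(sexes)},  # by sex
        {"group": list(pop_sex_groups), "pop": list(pops)},  # by population
        {"group": list(pop_sex_groups), "pop": list(pops), "sex": list(sexes)},
    ]


for label_group in sketch_label_groups(["afr", "nfe"]):
    print(label_group)
```

Each dictionary in the list has the Dict[str, List[str]] shape that the label_groups parameter of make_info_dict expects.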

gnomad.utils.vcf.make_info_dict(prefix='', suffix='', prefix_before_metric=True, pop_names={'afr': 'African/African-American', 'ami': 'Amish', 'amr': 'Admixed American', 'asj': 'Ashkenazi Jewish', 'bgr': 'Bulgarian (Eastern European)', 'consanguineous': 'South Asian (F > 0.05)', 'eas': 'East Asian', 'est': 'Estonian', 'eur': 'European', 'exac': 'ExAC', 'fin': 'Finnish', 'gbr': 'British', 'jpn': 'Japanese', 'kor': 'Korean', 'mde': 'Middle Eastern', 'mid': 'Middle Eastern', 'nfe': 'Non-Finnish European', 'nwe': 'North-Western European', 'oea': 'Other East Asian', 'oeu': 'Other European', 'onf': 'Other Non-Finnish European', 'oth': 'Other', 'remaining': 'Remaining individuals', 'sas': 'South Asian', 'sas_non_consang': 'South Asian (F < 0.05)', 'seu': 'Southern European', 'sgp': 'Singaporean', 'swe': 'Swedish', 'uniform': 'Uniform', 'unk': 'Unknown'}, label_groups=None, label_delimiter='_', bin_edges=None, faf=False, popmax=False, grpmax=False, fafmax=False, callstats=False, freq_ctt=False, freq_cmh=False, freq_stat_union=False, description_text='', age_hist_distribution=None, sort_order=['subset', 'downsampling', 'popmax', 'grpmax', 'pop', 'gen_anc', 'subpop', 'sex', 'group'])

Generate dictionary of Number and Description attributes of VCF INFO fields.

Used to populate the INFO fields of the VCF header during export.

Creates:
  • INFO fields for age histograms (bin freq, n_smaller, and n_larger for heterozygous and homozygous variant carriers)

  • INFO fields for popmax AC, AN, AF, nhomalt, and popmax population

  • INFO fields for AC, AN, AF, nhomalt for each combination of sample population, sex, and subpopulation, both for adj and raw data

  • INFO fields for filtering allele frequency (faf) annotations

Parameters:
  • prefix (str) – Prefix string for data, e.g. “gnomAD”. Default is empty string.

  • suffix (str) – Suffix string for data, e.g. “gnomAD”. Default is empty string.

  • prefix_before_metric (bool) – Whether prefix should be added before the metric (AC, AN, AF, nhomalt, faf95, faf99) in INFO field. Default is True.

  • pop_names (Dict[str, str]) – Dict with global population names (keys) and population descriptions (values). Default is POP_NAMES.

  • label_groups (Dict[str, List[str]]) – Dictionary containing an entry for each label group, where key is the name of the grouping, e.g. “sex” or “pop”, and value is a list of all possible values for that grouping (e.g. [“male”, “female”] or [“afr”, “nfe”, “amr”]).

  • label_delimiter (str) – String to use as delimiter when making group label combinations.

  • bin_edges (Dict[str, str]) – Dictionary keyed by annotation type, with values that reflect the bin edges corresponding to the annotation.

  • faf (bool) – If True, use alternate logic to auto-populate dictionary values associated with filtering allele frequency annotations.

  • popmax (bool) – If True, use alternate logic to auto-populate dictionary values associated with popmax annotations.

  • grpmax (bool) – If True, use alternate logic to auto-populate dictionary values associated with grpmax annotations.

  • fafmax (bool) – If True, use alternate logic to auto-populate dictionary values associated with fafmax annotations.

  • callstats (bool) – If True, use alternate logic to auto-populate dictionary values associated with callstats annotations.

  • freq_ctt (bool) – If True, use alternate logic to auto-populate dictionary values associated with frequency contingency table test (CTT) annotations.

  • freq_cmh (bool) – If True, use alternate logic to auto-populate dictionary values associated with frequency Cochran-Mantel-Haenszel (CMH) annotations.

  • freq_stat_union (bool) – If True, use alternate logic to auto-populate dictionary values associated with the union of the contingency table and Cochran-Mantel-Haenszel tests.

  • description_text (str) – Optional text to append to the end of descriptions. Needs to start with a space if specified.

  • age_hist_distribution (str) – Pipe-delimited string of overall age distribution.

  • sort_order (List[str]) – List containing order to sort label group combinations. Default is SORT_ORDER.

Return type:

Dict[str, Dict[str, str]]

Returns:

Dictionary keyed by VCF INFO annotations, where values are dictionaries of Number and Description attributes.
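To show how a dictionary of Number and Description attributes relates to VCF header text, the sketch below renders a single entry as an ##INFO line. It is an illustration only: info_header_line is a hypothetical helper, and since the dictionaries documented here carry no Type attribute, the type is passed in separately and the fallback Number of “.” is an assumption.

```python
from typing import Dict


def info_header_line(
    field_id: str,
    attrs: Dict[str, str],
    vcf_type: str,
    default_number: str = ".",
) -> str:
    """Render one INFO dictionary entry as a VCF ##INFO header line."""
    # Fall back to an unspecified Number when the entry has none (assumption).
    number = attrs.get("Number", default_number)
    # Escape double quotes so the Description stays a valid quoted string.
    description = attrs["Description"].replace('"', '\\"')
    return (
        f"##INFO=<ID={field_id},Number={number},Type={vcf_type},"
        f'Description="{description}">'
    )


# Entry copied from the documented INFO_DICT value.
entry = {
    "Description": "Sum of PL[0] values; used to approximate the QUAL score",
    "Number": "1",
}
print(info_header_line("QUALapprox", entry, "Integer"))
# ##INFO=<ID=QUALapprox,Number=1,Type=Integer,Description="Sum of PL[0] values; used to approximate the QUAL score">
```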

gnomad.utils.vcf.add_as_info_dict(info_dict={'AS_SB_TABLE': {'Description': 'Allele-specific forward/reverse read counts for strand bias tests', 'Number': '.'}, 'AS_pab_max': {'Description': 'Maximum p-value over callset for binomial test of observed allele balance for a heterozygous genotype, given expectation of 0.5', 'Number': 'A'}, 'BaseQRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference base qualities'}, 'FS': {'Description': "Phred-scaled p-value of Fisher's exact test for strand bias"}, 'InbreedingCoeff': {'Description': 'Inbreeding coefficient, the excess heterozygosity at a variant site, computed as 1 - (the number of heterozygous genotypes)/(the number of heterozygous genotypes expected under Hardy-Weinberg equilibrium)', 'Number': 'A'}, 'MQ': {'Description': 'Root mean square of the mapping quality of reads across all samples'}, 'MQRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference read mapping qualities'}, 'NEGATIVE_TRAIN_SITE': {'Description': 'Variant was used to build the negative training set of low-quality variants for VQSR'}, 'POSITIVE_TRAIN_SITE': {'Description': 'Variant was used to build the positive training set of high-quality variants for VQSR'}, 'QD': {'Description': 'Variant call confidence normalized by depth of sample reads supporting a variant'}, 'QUALapprox': {'Description': 'Sum of PL[0] values; used to approximate the QUAL score', 'Number': '1'}, 'ReadPosRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias'}, 'SOR': {'Description': 'Strand bias estimated by the symmetric odds ratio test'}, 'VQSLOD': {'Description': 'Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model'}, 'VarDP': {'Description': 'Depth over variant genotypes (does not include depth of reference samples)'}, 'allele_type': {'Description': 'Allele type (snv, insertion, deletion, or mixed)'}, 'culprit': {'Description': 'Worst-performing annotation in the VQSR Gaussian mixture model'}, 'decoy': {'Description': 'Variant falls within a reference decoy region'}, 'fail_interval_qc': {'Description': 'Less than 85 percent of samples meet 20X coverage if variant is in autosomal or PAR regions or 10X coverage for non-PAR regions of chromosomes X and Y.'}, 'has_star': {'Description': 'Variant locus coincides with a spanning deletion (represented by a star) observed elsewhere in the callset'}, 'inbreeding_coeff': {'Description': 'Inbreeding coefficient, the excess heterozygosity at a variant site, computed as 1 - (the number of heterozygous genotypes)/(the number of heterozygous genotypes expected under Hardy-Weinberg equilibrium)', 'Number': 'A'}, 'lcr': {'Description': 'Variant falls within a low complexity region'}, 'monoallelic': {'Description': 'All samples are homozygous alternate for the variant'}, 'n_alt_alleles': {'Description': 'Total number of alternate alleles observed at variant locus', 'Number': '1'}, 'negative_train_site': {'Description': 'Variant was used to build the negative training set of low-quality variants for VQSR'}, 'non_par': {'Description': 'Variant (on sex chromosome) falls outside a pseudoautosomal region'}, 'nonpar': {'Description': 'Variant (on sex chromosome) falls outside a pseudoautosomal region'}, 'only_het': {'Description': 'All samples are heterozygous for the variant'}, 'original_alleles': {'Description': 'Alleles before splitting multiallelics'}, 'outside_broad_capture_region': {'Description': 'Variant falls outside of Broad exome capture regions.'}, 'outside_ukb_capture_region': {'Description': 'Variant falls outside of UK Biobank exome capture regions.'}, 'positive_train_site': {'Description': 'Variant was used to build the positive training set of high-quality variants for VQSR'}, 'rf_label': {'Description': 'Random forest training label'}, 'rf_negative_label': {'Description': 'Variant was labelled as a negative example for training of random forest model'}, 'rf_positive_label': {'Description': 'Variant was labelled as a positive example for training of random forest model'}, 'rf_tp_probability': {'Description': 'Probability of a called variant being a true variant as determined by random forest model'}, 'rf_train': {'Description': 'Variant was used in training random forest model'}, 'segdup': {'Description': 'Variant falls within a segmental duplication region'}, 'sibling_singleton': {'Description': 'Variant was a callset-wide doubleton that was present only in two siblings (i.e., a singleton amongst unrelated samples in cohort).'}, 'transmitted_singleton': {'Description': 'Variant was a callset-wide doubleton that was transmitted within a family from a parent to a child (i.e., a singleton amongst unrelated samples in cohort)'}, 'variant_type': {'Description': 'Variant type (snv, indel, multi-snv, multi-indel, or mixed)'}, 'was_mixed': {'Description': 'Variant type was mixed'}}, as_fields=['AS_FS', 'AS_MQ', 'AS_MQRankSum', 'AS_pab_max', 'AS_QUALapprox', 'AS_QD', 'AS_ReadPosRankSum', 'AS_SB_TABLE', 'AS_SOR', 'AS_VarDP', 'InbreedingCoeff'])

Update info dictionary with allele-specific terms and their descriptions.

Used in VCF export.

Parameters:
  • info_dict (Dict[str, Dict[str, str]]) – Dictionary containing site-level annotations and their descriptions. Default is INFO_DICT.

  • as_fields (List[str]) – List containing allele-specific fields to be added to info_dict. Default is AS_FIELDS.

Return type:

Dict[str, Dict[str, str]]

Returns:

Dictionary with allele specific annotations, their descriptions, and their VCF number field.

gnomad.utils.vcf.make_vcf_filter_dict(snp_cutoff=None, indel_cutoff=None, inbreeding_cutoff=None, variant_qc_filter='RF', joint=False)

Generate dictionary of Number and Description attributes to be used in the VCF header, specifically for FILTER annotations.

Generates descriptions for:
  • AC0 filter

  • InbreedingCoeff filter

  • Variant QC filter (RF or AS_VQSR)

  • PASS (passed all variant filters)

Parameters:
  • snp_cutoff (Optional[float]) – Minimum SNP cutoff score from random forest model.

  • indel_cutoff (Optional[float]) – Minimum indel cutoff score from random forest model.

  • inbreeding_cutoff (Optional[float]) – Inbreeding coefficient hard cutoff.

  • variant_qc_filter (str) – Method used for variant QC filter. One of ‘RF’ or ‘AS_VQSR’. Default is ‘RF’.

  • joint (bool) – Whether the filter dictionary is for the joint release. Default is False.

Return type:

Dict[str, str]

Returns:

Dictionary keyed by VCF FILTER annotations, where values are dictionaries of Number and Description attributes.

gnomad.utils.vcf.make_hist_bin_edges_expr(ht, hists=['gq_hist_alt', 'gq_hist_all', 'dp_hist_alt', 'dp_hist_all', 'ab_hist_alt'], ann_with_hists=None, prefix='', label_delimiter='_', include_age_hists=True)

Create dictionaries containing variant histogram annotations and their associated bin edges, formatted into a string separated by pipe delimiters.

Parameters:
  • ht (Table) – Table containing histogram variant annotations.

  • hists (List[str]) – List of variant histogram annotations. Default is HISTS.

  • ann_with_hists (Optional[str]) – Name of row annotation containing histogram data. In the exomes or genomes release HT, histograms is a top-level row annotation; in the joint release HT, it is nested under the exomes, genomes, or joint row annotation.

  • prefix (str) – Prefix text for age histogram bin edges. Default is empty string.

  • label_delimiter (str) – String used as delimiter between prefix and histogram annotation.

  • include_age_hists (bool) – Include age histogram annotations.

Return type:

Dict[str, str]

Returns:

Dictionary keyed by histogram annotation name, with corresponding reformatted bin edges for values.

gnomad.utils.vcf.make_hist_dict(bin_edges, adj, hist_metric_list=['gq_hist_alt', 'gq_hist_all', 'dp_hist_alt', 'dp_hist_all', 'ab_hist_alt'], label_delimiter='_', drop_n_smaller_larger=False, prefix='', suffix='', description_text='')

Generate dictionary of Number and Description attributes to be used in the VCF header, specifically for histogram annotations.

Parameters:
  • bin_edges (Dict[str, Dict[str, str]]) – Dictionary keyed by histogram annotation name, with corresponding string-reformatted bin edges for values.

  • adj (bool) – Whether to create a header dict for adj quality histograms (True) or raw quality histograms (False).

  • hist_metric_list (List[str]) – List of hists for which to build hist info dict.

  • label_delimiter (str) – String used as delimiter in values stored in hist_metric_list.

  • drop_n_smaller_larger (bool) – Whether to drop n_smaller and n_larger annotations from header dict. Default is False.

  • prefix (str) – Prefix text for histogram annotations. Default is empty string.

  • suffix (str) – Suffix text for histogram annotations. Default is empty string.

  • description_text (str) – Optional text to append to the end of descriptions. Needs to start with a space if specified.

Return type:

Dict[str, str]

Returns:

Dictionary keyed by VCF INFO annotations, where values are dictionaries of Number and Description attributes.

gnomad.utils.vcf.set_female_y_metrics_to_na(t)

Set AC, AN, and nhomalt chrY variant annotations for females to NA (instead of 0).

Parameters:

t (Union[Table, MatrixTable]) – Table/MatrixTable containing female variant annotations.

Return type:

Dict[str, Int32Expression]

Returns:

Dictionary with reset annotations.
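The intent of this reset can be sketched on a plain dictionary of annotations. This is not the library's Hail implementation: sketch_female_y_to_na is a hypothetical helper, identifying female metrics by the “_XX” or “_female” substring is an assumption, and the sketch treats the whole of chrY alike.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def sketch_female_y_to_na(
    contig: str, info: Dict[str, Number]
) -> Dict[str, Optional[Number]]:
    """Return female AC, AN, and nhomalt annotations, set to None on chrY."""

    def is_female_count(name: str) -> bool:
        # Name-based matching is an assumption made for this sketch.
        is_female = "_XX" in name or "_female" in name
        is_count = any(metric in name for metric in ("AC", "AN", "nhomalt"))
        return is_female and is_count

    on_y = contig in ("chrY", "Y")
    return {
        name: (None if on_y else value)
        for name, value in info.items()
        if is_female_count(name)
    }


info = {"AC_XX": 0, "AN_XX": 0, "nhomalt_XX": 0, "AC_XY": 3, "AF_XX": 0.0}
print(sketch_female_y_to_na("chrY", info))
# {'AC_XX': None, 'AN_XX': None, 'nhomalt_XX': None}
```

Reporting a missing value rather than 0 distinguishes “not applicable” from “observed zero times”.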

gnomad.utils.vcf.build_vcf_export_reference(name, build='GRCh38', keep_contigs=['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY'], keep_chrM=True)

Create export reference based on reference genome defined by build.

By default this will return a new reference with all non-standard contigs eliminated. Keeps chr 1-22, Y, X, and M.

An example of a non-standard contig is: ##contig=<ID=chr3_GL000221v1_random,length=155397,assembly=GRCh38>

Parameters:
  • name (str) – Name to use for new reference.

  • build (str) – Reference genome build to use as starting reference genome.

  • keep_contigs (List[str]) – Contigs to keep from reference genome defined by build. Default is autosomes and sex chromosomes.

  • keep_chrM (bool) – Whether to keep chrM. Default is True.

Return type:

ReferenceGenome

Returns:

Reference genome for VCF export containing only contigs in keep_contigs.
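The contig filtering can be illustrated on a plain mapping of contig names to lengths. This is a sketch of the selection logic only, not the Hail-based implementation: sketch_keep_contigs is a hypothetical helper, and the example uses just three GRCh38 contigs.

```python
from typing import Dict, List


def sketch_keep_contigs(
    contig_lengths: Dict[str, int],
    keep_contigs: List[str],
    keep_chrM: bool = True,
) -> Dict[str, int]:
    """Keep only the requested contigs, plus chrM when keep_chrM is True."""
    wanted = list(keep_contigs) + (["chrM"] if keep_chrM else [])
    return {c: contig_lengths[c] for c in wanted if c in contig_lengths}


# Three GRCh38 contigs, including the non-standard contig mentioned above.
contig_lengths = {
    "chr1": 248956422,
    "chrM": 16569,
    "chr3_GL000221v1_random": 155397,
}
print(sketch_keep_contigs(contig_lengths, ["chr1"]))
# {'chr1': 248956422, 'chrM': 16569}
```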

gnomad.utils.vcf.rekey_new_reference(t, reference)

Re-key Table or MatrixTable with a new reference genome.

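Parameters:
  • t (Union[Table, MatrixTable]) – Input Table/MatrixTable.

  • reference (ReferenceGenome) – Reference genome to re-key with.

Return type: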

Union[Table, MatrixTable]

Returns:

Re-keyed Table/MatrixTable