gnomad.utils.vcf
| gnomad.utils.vcf.SORT_ORDER | Order to sort subgroupings during VCF export. |
| gnomad.utils.vcf.GROUPS | Group names used to generate labels for high quality genotypes and all raw genotypes. |
| gnomad.utils.vcf.HISTS | Quality histograms used in VCF export. |
| gnomad.utils.vcf.FAF_GEN_ANC_GROUPS | Genetic ancestry groups that are included in filtering allele frequency (faf) calculations. |
| gnomad.utils.vcf.SEXES | Sample sexes used in VCF export. |
| gnomad.utils.vcf.AS_FIELDS | Allele-specific variant annotations. |
| gnomad.utils.vcf.SITE_FIELDS | Site level variant annotations. |
| gnomad.utils.vcf.ALLELE_TYPE_FIELDS | Allele-type annotations. |
| gnomad.utils.vcf.REGION_FLAG_FIELDS | Annotations about variant region type. |
| gnomad.utils.vcf.JOINT_REGION_FLAG_FIELDS | Annotations about variant region type that are specifically created for joint dataset of exomes and genomes from gnomAD v4.1. |
| gnomad.utils.vcf.RF_FIELDS | Annotations specific to the variant QC using a random forest model. |
| gnomad.utils.vcf.AS_VQSR_FIELDS | Allele-specific VQSR annotations. |
| gnomad.utils.vcf.VQSR_FIELDS | Annotations specific to VQSR. |
| gnomad.utils.vcf.INFO_DICT | Dictionary used during VCF export to export row (variant) annotations. |
| gnomad.utils.vcf.IN_SILICO_ANNOTATIONS_INFO_DICT | Dictionary with in silico score descriptions to include in the VCF INFO header. |
| gnomad.utils.vcf.VRS_FIELDS_DICT | Dictionary with VRS annotations to include in the VCF INFO field and VCF header. |
| gnomad.utils.vcf.ENTRIES | Densified entries to be selected during VCF export. |
| gnomad.utils.vcf.SPARSE_ENTRIES | Sparse entries to be selected and densified during VCF export. |
| gnomad.utils.vcf.FORMAT_DICT | Dictionary used during VCF export to export MatrixTable entries. |
| gnomad.utils.vcf.adjust_vcf_incompatible_types(...) | Create a Table ready for VCF export. |
| gnomad.utils.vcf.make_label_combos(...) | Make combinations of all possible labels for a supplied dictionary of label groups. |
| gnomad.utils.vcf.index_globals(...) | Create a dictionary keyed by the specified label groupings with values describing the corresponding index of each grouping entry in the meta_array annotation. |
| gnomad.utils.vcf.make_combo_header_text(...) | Programmatically generate text to populate the VCF header description for a given variant annotation with specific groupings and subset. |
| gnomad.utils.vcf.create_label_groups(...) | Generate a list of label group dictionaries needed to populate info dictionary. |
| gnomad.utils.vcf.make_info_dict(...) | Generate dictionary of Number and Description attributes of VCF INFO fields. |
| gnomad.utils.vcf.add_as_info_dict(...) | Update info dictionary with allele-specific terms and their descriptions. |
| gnomad.utils.vcf.make_vcf_filter_dict(...) | Generate dictionary of Number and Description attributes to be used in the VCF header, specifically for FILTER annotations. |
| gnomad.utils.vcf.make_hist_bin_edges_expr(...) | Create dictionaries containing variant histogram annotations and their associated bin edges, formatted into a string separated by pipe delimiters. |
| gnomad.utils.vcf.make_hist_dict(...) | Generate dictionary of Number and Description attributes to be used in the VCF header, specifically for histogram annotations. |
| gnomad.utils.vcf.set_xx_y_metrics_to_na(...) | Set AC, AN, and nhomalt chrY variant annotations for XX samples to NA (instead of 0). |
| gnomad.utils.vcf.build_vcf_export_reference(...) | Create export reference based on reference genome defined by build. |
| gnomad.utils.vcf.rekey_new_reference(...) | Re-key Table or MatrixTable with a new reference genome. |
- gnomad.utils.vcf.SORT_ORDER = ['subset', 'downsampling', 'grpmax', 'popmax', 'gen_anc', 'pop', 'subgrp', 'subpop', 'sex', 'group']
- Order to sort subgroupings during VCF export. Ensures that INFO labels in VCF are in desired order (e.g., raw_AC_afr_XX). 
- gnomad.utils.vcf.GROUPS = ['adj', 'raw']
- Group names used to generate labels for high quality genotypes and all raw genotypes. Used in VCF export. 
- gnomad.utils.vcf.HISTS = ['gq_hist_alt', 'gq_hist_all', 'dp_hist_alt', 'dp_hist_all', 'ab_hist_alt']
- Quality histograms used in VCF export. 
- gnomad.utils.vcf.FAF_GEN_ANC_GROUPS = {'v3': ['afr', 'amr', 'eas', 'nfe', 'sas'], 'v4': ['afr', 'amr', 'eas', 'mid', 'nfe', 'sas']}
- Genetic ancestry groups that are included in filtering allele frequency (faf) calculations. Used in VCF export. 
- gnomad.utils.vcf.SEXES = ['XX', 'XY']
- Sample sexes used in VCF export. Used to stratify frequency annotations (AC, AN, AF) for each sex. Note that sample sexes in gnomAD v3 and earlier were 'male' and 'female'.
- gnomad.utils.vcf.AS_FIELDS = ['AS_FS', 'AS_MQ', 'AS_MQRankSum', 'AS_pab_max', 'AS_QUALapprox', 'AS_QD', 'AS_ReadPosRankSum', 'AS_SB_TABLE', 'AS_SOR', 'AS_VarDP', 'InbreedingCoeff']
- Allele-specific variant annotations. 
- gnomad.utils.vcf.SITE_FIELDS = ['FS', 'MQ', 'MQRankSum', 'QUALapprox', 'QD', 'ReadPosRankSum', 'SB', 'SOR', 'VarDP']
- Site level variant annotations. 
- gnomad.utils.vcf.ALLELE_TYPE_FIELDS = ['allele_type', 'has_star', 'n_alt_alleles', 'original_alleles', 'variant_type', 'was_mixed']
- Allele-type annotations. 
- gnomad.utils.vcf.REGION_FLAG_FIELDS = ['decoy', 'lcr', 'nonpar', 'non_par', 'segdup']
- Annotations about variant region type. Note: decoy resource files do not currently exist for GRCh38/hg38.
- gnomad.utils.vcf.JOINT_REGION_FLAG_FIELDS = ['fail_interval_qc', 'outside_broad_capture_region', 'outside_ukb_capture_region', 'outside_broad_calling_region', 'outside_ukb_calling_region', 'not_called_in_exomes', 'not_called_in_genomes']
- Annotations about variant region type that are specifically created for joint dataset of exomes and genomes from gnomAD v4.1. 
- gnomad.utils.vcf.RF_FIELDS = ['rf_positive_label', 'rf_negative_label', 'rf_label', 'rf_train', 'rf_tp_probability']
- Annotations specific to the variant QC using a random forest model. 
- gnomad.utils.vcf.AS_VQSR_FIELDS = ['AS_culprit', 'AS_VQSLOD']
- Allele-specific VQSR annotations. 
- gnomad.utils.vcf.VQSR_FIELDS = ['AS_culprit', 'AS_VQSLOD', 'NEGATIVE_TRAIN_SITE', 'POSITIVE_TRAIN_SITE']
- Annotations specific to VQSR. 
- gnomad.utils.vcf.INFO_DICT = {'AS_SB_TABLE': {'Description': 'Allele-specific forward/reverse read counts for strand bias tests', 'Number': '.'}, 'AS_pab_max': {'Description': 'Maximum p-value over callset for binomial test of observed allele balance for a heterozygous genotype, given expectation of 0.5', 'Number': 'A'}, 'BaseQRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference base qualities'}, 'FS': {'Description': "Phred-scaled p-value of Fisher's exact test for strand bias"}, 'InbreedingCoeff': {'Description': 'Inbreeding coefficient, the excess heterozygosity at a variant site, computed as 1 - (the number of heterozygous genotypes)/(the number of heterozygous genotypes expected under Hardy-Weinberg equilibrium)', 'Number': 'A'}, 'MQ': {'Description': 'Root mean square of the mapping quality of reads across all samples'}, 'MQRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference read mapping qualities'}, 'NEGATIVE_TRAIN_SITE': {'Description': 'Variant was used to build the negative training set of low-quality variants for VQSR'}, 'POSITIVE_TRAIN_SITE': {'Description': 'Variant was used to build the positive training set of high-quality variants for VQSR'}, 'QD': {'Description': 'Variant call confidence normalized by depth of sample reads supporting a variant'}, 'QUALapprox': {'Description': 'Sum of PL[0] values; used to approximate the QUAL score', 'Number': '1'}, 'ReadPosRankSum': {'Description': 'Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias'}, 'SOR': {'Description': 'Strand bias estimated by the symmetric odds ratio test'}, 'VQSLOD': {'Description': 'Log-odds ratio of being a true variant versus being a false positive under the trained VQSR Gaussian mixture model'}, 'VarDP': {'Description': 'Depth over variant genotypes (does not include depth of reference samples)'}, 'allele_type': {'Description': 'Allele type (snv, insertion, deletion, or mixed)'}, 'culprit': {'Description': 'Worst-performing annotation in the VQSR Gaussian mixture model'}, 'decoy': {'Description': 'Variant falls within a reference decoy region'}, 'fail_interval_qc': {'Description': 'Less than 85 percent of samples meet 20X coverage if variant is in autosomal or PAR regions or 10X coverage for non-PAR regions of chromosomes X and Y.'}, 'has_star': {'Description': 'Variant locus coincides with a spanning deletion (represented by a star) observed elsewhere in the callset'}, 'inbreeding_coeff': {'Description': 'Inbreeding coefficient, the excess heterozygosity at a variant site, computed as 1 - (the number of heterozygous genotypes)/(the number of heterozygous genotypes expected under Hardy-Weinberg equilibrium)', 'Number': 'A'}, 'lcr': {'Description': 'Variant falls within a low complexity region'}, 'monoallelic': {'Description': 'All samples are homozygous alternate for the variant'}, 'n_alt_alleles': {'Description': 'Total number of alternate alleles observed at variant locus', 'Number': '1'}, 'negative_train_site': {'Description': 'Variant was used to build the negative training set of low-quality variants for VQSR'}, 'non_par': {'Description': 'Variant (on sex chromosome) falls outside a pseudoautosomal region'}, 'nonpar': {'Description': 'Variant (on sex chromosome) falls outside a pseudoautosomal region'}, 'only_het': {'Description': 'All samples are heterozygous for the variant'}, 'original_alleles': {'Description': 'Alleles before splitting multiallelics'}, 'outside_broad_capture_region': {'Description': 'Variant falls outside of Broad exome capture regions.'}, 'outside_ukb_capture_region': {'Description': 'Variant falls outside of UK Biobank exome capture regions.'}, 'positive_train_site': {'Description': 'Variant was used to build the positive training set of high-quality variants for VQSR'}, 'rf_label': {'Description': 'Random forest training label'}, 'rf_negative_label': {'Description': 'Variant was labelled as a negative example for training of random forest model'}, 'rf_positive_label': {'Description': 'Variant was labelled as a positive example for training of random forest model'}, 'rf_tp_probability': {'Description': 'Probability of a called variant being a true variant as determined by random forest model'}, 'rf_train': {'Description': 'Variant was used in training random forest model'}, 'segdup': {'Description': 'Variant falls within a segmental duplication region'}, 'sibling_singleton': {'Description': 'Variant was a callset-wide doubleton that was present only in two siblings (i.e., a singleton amongst unrelated samples in cohort).'}, 'transmitted_singleton': {'Description': 'Variant was a callset-wide doubleton that was transmitted within a family from a parent to a child (i.e., a singleton amongst unrelated samples in cohort)'}, 'variant_type': {'Description': 'Variant type (snv, indel, multi-snv, multi-indel, or mixed)'}, 'was_mixed': {'Description': 'Variant type was mixed'}}
- Dictionary used during VCF export to export row (variant) annotations. 
- gnomad.utils.vcf.IN_SILICO_ANNOTATIONS_INFO_DICT = {'cadd_phred': {'Description': "Cadd Phred-like scores ('scaled C-scores') ranging from 1 to 99, based on the rank of each variant relative to all possible 8.6 billion substitutions in the human reference genome. Larger values are more deleterious.", 'Number': '1'}, 'cadd_raw_score': {'Description': "Raw CADD scores are interpretable as the extent to which the annotation profile for a given variant suggests that the variant is likely to be 'observed' (negative values) vs 'simulated' (positive values). Larger values are more deleterious.", 'Number': '1'}, 'pangolin_largest_ds': {'Description': "Pangolin's largest delta score across 2 splicing consequences, which reflects the probability of the variant being splice-altering", 'Number': '1'}, 'phylop': {'Description': 'Base-wise conservation score across the 241 placental mammals in the Zoonomia project. Score ranges from -20 to 9.28, and reflects acceleration (faster evolution than expected under neutral drift, assigned negative scores) as well as conservation (slower than expected evolution, assigned positive scores).', 'Number': '1'}, 'polyphen_max': {'Description': 'Score that predicts the possible impact of an amino acid substitution on the structure and function of a human protein, ranging from 0.0 (tolerated) to 1.0 (deleterious). We prioritize max scores for MANE Select transcripts where possible and otherwise report a score for the canonical transcript.', 'Number': '1'}, 'revel_max': {'Description': "The maximum REVEL score at a site's MANE Select or canonical transcript. It's an ensemble score for predicting the pathogenicity of missense variants (based on 13 other variant predictors). Scores ranges from 0 to 1. Variants with higher scores are predicted to be more likely to be deleterious.", 'Number': '1'}, 'sift_max': {'Description': 'Score reflecting the scaled probability of the amino acid substitution being tolerated, ranging from 0 to 1. Scores below 0.05 are predicted to impact protein function. We prioritize max scores for MANE Select transcripts where possible and otherwise report a score for the canonical transcript.', 'Number': '1'}, 'spliceai_ds_max': {'Description': "Illumina's SpliceAI max delta score; interpreted as the probability of the variant being splice-altering.", 'Number': '1'}}
- Dictionary with in silico score descriptions to include in the VCF INFO header. 
- gnomad.utils.vcf.VRS_FIELDS_DICT = {'VRS_Allele_IDs': {'Description': 'The computed identifiers for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_Ends': {'Description': 'Interresidue coordinates used as the location ends for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_Starts': {'Description': 'Interresidue coordinates used as the location starts for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': 'R'}, 'VRS_States': {'Description': 'The literal sequence states used for the GA4GH VRS Alleles corresponding to the values in the REF and ALT fields', 'Number': '.'}}
- Dictionary with VRS annotations to include in the VCF INFO field and VCF header. 
- gnomad.utils.vcf.ENTRIES = ['GT', 'GQ', 'DP', 'AD', 'MIN_DP', 'PGT', 'PID', 'PL', 'SB']
- Densified entries to be selected during VCF export. 
- gnomad.utils.vcf.SPARSE_ENTRIES = ['END', 'DP', 'GQ', 'LA', 'LAD', 'LGT', 'LPGT', 'LPL', 'MIN_DP', 'PID', 'RGQ', 'SB']
- Sparse entries to be selected and densified during VCF export. 
- gnomad.utils.vcf.FORMAT_DICT = {'AD': {'Description': 'Allelic depths for the ref and alt alleles in the order listed', 'Number': 'R', 'Type': 'Integer'}, 'DP': {'Description': 'Approximate read depth (reads with MQ=255 or with bad mates are filtered)', 'Number': '1', 'Type': 'Integer'}, 'GQ': {'Description': 'Phred-scaled confidence that the genotype assignment is correct. Value is the difference between the second lowest PL and the lowest PL (always normalized to 0).', 'Number': '1', 'Type': 'Integer'}, 'GT': {'Description': 'Genotype', 'Number': '1', 'Type': 'String'}, 'MIN_DP': {'Description': 'Minimum DP observed within the GVCF block', 'Number': '1', 'Type': 'Integer'}, 'PGT': {'Description': 'Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another', 'Number': '1', 'Type': 'String'}, 'PID': {'Description': 'Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group', 'Number': '1', 'Type': 'String'}, 'PL': {'Description': 'Normalized, phred-scaled likelihoods for genotypes as defined in the VCF specification', 'Number': 'G', 'Type': 'Integer'}, 'RGQ': {'Description': 'Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)', 'Number': '1', 'Type': 'Integer'}, 'SB': {'Description': "Per-sample component statistics which comprise the Fisher's exact test to detect strand bias. Values are: depth of reference allele on forward strand, depth of reference allele on reverse strand, depth of alternate allele on forward strand, depth of alternate allele on reverse strand.", 'Number': '4', 'Type': 'Integer'}}
- Dictionary used during VCF export to export MatrixTable entries. 
- gnomad.utils.vcf.adjust_vcf_incompatible_types(ht, pipe_delimited_annotations=['AS_QUALapprox', 'AS_VarDP', 'AS_MQ_DP', 'AS_RAW_MQ', 'AS_SB_TABLE'])
- Create a Table ready for VCF export.
- In particular, the following conversions are done:
- All int64 are coerced to int32
- Fields specified by pipe_delimited_annotations are converted from arrays to pipe-delimited strings
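The two conversions listed above can be illustrated without Hail. The following is a minimal plain-Python sketch, not the library's implementation: the helper names `to_pipe_delimited` and `fits_int32` are hypothetical, and rendering missing elements as empty strings is an assumption.

```python
from typing import Optional, Sequence

# Bounds of a signed 32-bit integer, the widest integer a VCF field can carry here.
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def to_pipe_delimited(values: Sequence[Optional[object]]) -> str:
    """Join array elements with '|' (hypothetical helper; missing -> empty string)."""
    return "|".join("" if v is None else str(v) for v in values)


def fits_int32(value: int) -> bool:
    """Report whether an integer survives a 64-bit to 32-bit coercion unchanged."""
    return INT32_MIN <= value <= INT32_MAX


# Example row with two of the default pipe_delimited_annotations fields.
row = {"AS_QUALapprox": [0, 1250, 310], "AS_SB_TABLE": [12, 9, None, 4]}
converted = {name: to_pipe_delimited(vals) for name, vals in row.items()}
```

Checking `fits_int32` before coercion is one way to notice values that would overflow; whether the library checks or silently truncates is not stated here.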
 
 
- gnomad.utils.vcf.make_label_combos(label_groups, sort_order=['subset', 'downsampling', 'grpmax', 'popmax', 'gen_anc', 'pop', 'subgrp', 'subpop', 'sex', 'group'], label_delimiter='_')
- Make combinations of all possible labels for a supplied dictionary of label groups.
- For example, if label_groups is {"sex": ["XY", "XX"], "gen_anc": ["afr", "nfe", "amr"]}, this function will return ["afr_XY", "afr_XX", "nfe_XY", "nfe_XX", "amr_XY", "amr_XX"].
- Parameters:
- label_groups (Dict[str, List[str]]) – Dictionary containing an entry for each label group, where the key is the name of the grouping (e.g. "sex" or "gen_anc") and the value is a list of all possible values for that grouping (e.g. ["XY", "XX"] or ["afr", "nfe", "amr"]).
- sort_order (List[str]) – List containing the order in which to sort label group combinations. Default is SORT_ORDER.
- label_delimiter (str) – String to use as delimiter when making group label combinations.

- Return type:
- List[str]
- Returns:
- List of all possible combinations of values for the supplied label groupings.
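The documented example can be reproduced with a short plain-Python sketch. This is an illustration under assumptions, not the library's code: `label_combos_sketch` is a hypothetical name, and it assumes every grouping name in `label_groups` appears in `sort_order`.

```python
from itertools import product
from typing import Dict, List

# Copy of the documented SORT_ORDER constant.
SORT_ORDER = ["subset", "downsampling", "grpmax", "popmax", "gen_anc",
              "pop", "subgrp", "subpop", "sex", "group"]


def label_combos_sketch(
    label_groups: Dict[str, List[str]],
    sort_order: List[str] = SORT_ORDER,
    label_delimiter: str = "_",
) -> List[str]:
    # Order the grouping names by their position in sort_order.
    ordered_keys = sorted(label_groups, key=sort_order.index)
    # Take the Cartesian product of the value lists in that key order.
    value_lists = [label_groups[key] for key in ordered_keys]
    return [label_delimiter.join(combo) for combo in product(*value_lists)]


combos = label_combos_sketch({"sex": ["XY", "XX"], "gen_anc": ["afr", "nfe", "amr"]})
```

Because "gen_anc" precedes "sex" in the sort order, the ancestry group comes first in each label regardless of the key order in the input dictionary.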
 
- gnomad.utils.vcf.index_globals(globals_array, label_groups, label_delimiter='_')
- Create a dictionary keyed by the specified label groupings with values describing the corresponding index of each grouping entry in the meta_array annotation.
- Parameters:
- globals_array (List[Dict[str, str]]) – Ordered list containing dictionary entries describing all the grouping combinations contained in the globals_array annotation. Keys are the grouping type (e.g., 'group', 'gen_anc', 'sex') and values are the grouping attribute (e.g., 'adj', 'eas', 'XY').
- label_groups (Dict[str, List[str]]) – Dictionary containing an entry for each label group, where the key is the name of the grouping (e.g. "sex" or "gen_anc") and the value is a list of all possible values for that grouping (e.g. ["XY", "XX"] or ["afr", "nfe", "amr"]).
- label_delimiter (str) – String used as delimiter when making group label combinations.

- Return type:
- Dict[str, int]
- Returns:
- Dictionary keyed by specified label grouping combinations, with values describing the corresponding index of each grouping entry in the globals.
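To make the index lookup concrete, here is a minimal plain-Python sketch. It is an assumption-laden illustration rather than the library's logic: `index_globals_sketch` is a hypothetical name, labels are built in the key order of `label_groups` (the library may instead apply its sort order), and an entry matches only when its keys and values equal the requested combination exactly.

```python
from itertools import product
from typing import Dict, List


def index_globals_sketch(
    globals_array: List[Dict[str, str]],
    label_groups: Dict[str, List[str]],
    label_delimiter: str = "_",
) -> Dict[str, int]:
    keys = list(label_groups)
    index_dict: Dict[str, int] = {}
    # Enumerate every requested combination of grouping values.
    for values in product(*(label_groups[key] for key in keys)):
        wanted = dict(zip(keys, values))
        # Record the position of the matching entry, if there is one.
        for i, entry in enumerate(globals_array):
            if entry == wanted:
                index_dict[label_delimiter.join(values)] = i
    return index_dict


globals_array = [
    {"group": "adj"},
    {"group": "raw"},
    {"group": "adj", "gen_anc": "afr"},
    {"group": "adj", "gen_anc": "eas"},
    {"group": "adj", "sex": "XY"},
]
indices = index_globals_sketch(globals_array, {"gen_anc": ["afr", "eas"], "group": ["adj"]})
```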
 
- gnomad.utils.vcf.make_combo_header_text(preposition, combo_dict, gen_anc_names)
- Programmatically generate text to populate the VCF header description for a given variant annotation with specific groupings and subset.
- For example, if preposition is "for", combo_dict is {"group": "adj", "gen_anc": "afr", "sex": "XX"}, and gen_anc_names maps "afr" to "African/African-American", this function will return the string " for XX samples in the African/African-American genetic ancestry group".
- Parameters:
- preposition (str) – Relevant preposition to precede automatically generated text.
- combo_dict (Dict[str, str]) – Dict with grouping types as keys and values for grouping type as values. This function generates text for these values. Possible grouping types are: "group", "gen_anc", "sex", and "subgroup". Example input: {"gen_anc": "afr", "sex": "XX"}
- gen_anc_names (Dict[str, str]) – Dict with global genetic ancestry group names (keys) and genetic ancestry group descriptions (values).

- Return type:
- str
- Returns:
- String with automatically generated description text for a given set of combo fields.
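The text generation for the sex and genetic ancestry groupings can be sketched in a few lines. This is a simplified illustration, not the library's implementation: `combo_header_text_sketch` is a hypothetical name, and the handling of the "group" and "subgroup" grouping types is not described above, so it is left out.

```python
from typing import Dict


def combo_header_text_sketch(
    preposition: str,
    combo_dict: Dict[str, str],
    gen_anc_names: Dict[str, str],
) -> str:
    # Start with a leading space so the text can be appended to a description.
    text = " " + preposition
    if "sex" in combo_dict:
        text += " " + combo_dict["sex"]
    text += " samples"
    if "gen_anc" in combo_dict:
        # Look up the human-readable description of the ancestry group.
        text += f" in the {gen_anc_names[combo_dict['gen_anc']]} genetic ancestry group"
    return text


header_text = combo_header_text_sketch(
    "for",
    {"group": "adj", "gen_anc": "afr", "sex": "XX"},
    {"afr": "African/African-American"},
)
```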
 
- gnomad.utils.vcf.create_label_groups(gen_ancs, sexes=['XX', 'XY'], all_groups=['adj', 'raw'], gen_anc_sex_groups=['adj'])
- Generate a list of label group dictionaries needed to populate info dictionary.
- Label dictionaries are passed as input to make_info_dict.
- Parameters:
- gen_ancs (List[str]) – List of genetic ancestry group names.
- sexes (List[str]) – List of sample sexes.
- all_groups (List[str]) – List of data types (raw, adj). Default is GROUPS, which is ["adj", "raw"].
- gen_anc_sex_groups (List[str]) – List of data types (raw, adj) to populate with gen_ancs and sexes. Default is ["adj"].

- Return type:
- List[Dict[str, List[str]]]
- Returns:
- List of label group dictionaries.
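As a rough picture of the return value, the sketch below builds one dictionary per stratification level. The exact set and order of dictionaries is an assumption made for illustration, and `label_groups_sketch` is a hypothetical name, not the library function.

```python
from typing import Dict, List, Sequence


def label_groups_sketch(
    gen_ancs: Sequence[str],
    sexes: Sequence[str] = ("XX", "XY"),
    all_groups: Sequence[str] = ("adj", "raw"),
    gen_anc_sex_groups: Sequence[str] = ("adj",),
) -> List[Dict[str, List[str]]]:
    # Assumed layout: overall, by sex, by ancestry group, and by both.
    return [
        {"group": list(all_groups)},
        {"group": list(gen_anc_sex_groups), "sex": list(sexes)},
        {"group": list(gen_anc_sex_groups), "gen_anc": list(gen_ancs)},
        {"group": list(gen_anc_sex_groups), "gen_anc": list(gen_ancs), "sex": list(sexes)},
    ]


groups = label_groups_sketch(["afr", "nfe"])
```

Each dictionary in the list has the `Dict[str, List[str]]` shape that `label_groups` parameters elsewhere in this module expect.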
 
- gnomad.utils.vcf.make_info_dict(prefix='', suffix='', prefix_before_metric=True, gen_anc_names={'afr': 'African/African-American', 'ami': 'Amish', 'amr': 'Admixed American', 'asj': 'Ashkenazi Jewish', 'bgr': 'Bulgarian (Eastern European)', 'consanguineous': 'South Asian (F > 0.05)', 'eas': 'East Asian', 'est': 'Estonian', 'eur': 'European', 'exac': 'ExAC', 'fin': 'Finnish', 'gbr': 'British', 'jpn': 'Japanese', 'kor': 'Korean', 'mde': 'Middle Eastern', 'mid': 'Middle Eastern', 'nfe': 'Non-Finnish European', 'nwe': 'North-Western European', 'oea': 'Other East Asian', 'oeu': 'Other European', 'onf': 'Other Non-Finnish European', 'oth': 'Other', 'remaining': 'Remaining individuals', 'sas': 'South Asian', 'sas_non_consang': 'South Asian (F < 0.05)', 'seu': 'Southern European', 'sgp': 'Singaporean', 'swe': 'Swedish', 'uniform': 'Uniform', 'unk': 'Unknown'}, label_groups=None, label_delimiter='_', bin_edges=None, faf=False, grpmax=False, fafmax=False, callstats=False, freq_ctt=False, freq_cmh=False, freq_stat_union=False, description_text='', age_hist_distribution=None, sort_order=['subset', 'downsampling', 'grpmax', 'popmax', 'gen_anc', 'pop', 'subgrp', 'subpop', 'sex', 'group'])
- Generate dictionary of Number and Description attributes of VCF INFO fields.
- Used to populate the INFO fields of the VCF header during export.
- Creates:
- INFO fields for age histograms (bin freq, n_smaller, and n_larger for heterozygous and homozygous variant carriers)
- INFO fields for grpmax AC, AN, AF, nhomalt, and grpmax genetic ancestry group
- INFO fields for AC, AN, AF, nhomalt for each combination of sample genetic ancestry group, sex, and subgroup, both for adj and raw data
- INFO fields for filtering allele frequency (faf) annotations

- Parameters:
- prefix (str) – Prefix string for data, e.g. "gnomAD". Default is empty string.
- suffix (str) – Suffix string for data, e.g. "gnomAD". Default is empty string.
- prefix_before_metric (bool) – Whether prefix should be added before the metric (AC, AN, AF, nhomalt, faf95, faf99) in INFO field. Default is True.
- gen_anc_names (Dict[str, str]) – Dict with global genetic ancestry group names (keys) and genetic ancestry group descriptions (values). Default is GEN_ANC_NAMES.
- label_groups (Dict[str, List[str]]) – Dictionary containing an entry for each label group, where the key is the name of the grouping (e.g. "sex" or "gen_anc") and the value is a list of all possible values for that grouping (e.g. ["XY", "XX"] or ["afr", "nfe", "amr"]).
- label_delimiter (str) – String to use as delimiter when making group label combinations.
- bin_edges (Dict[str, str]) – Dictionary keyed by annotation type, with values that reflect the bin edges corresponding to the annotation.
- faf (bool) – If True, use alternate logic to auto-populate dictionary values associated with filtering allele frequency (faf) annotations.
- grpmax (bool) – If True, use alternate logic to auto-populate dictionary values associated with grpmax annotations.
- fafmax (bool) – If True, use alternate logic to auto-populate dictionary values associated with fafmax annotations.
- callstats (bool) – If True, use alternate logic to auto-populate dictionary values associated with callstats annotations.
- freq_ctt (bool) – If True, use alternate logic to auto-populate dictionary values associated with frequency contingency table test (CTT) annotations.
- freq_cmh (bool) – If True, use alternate logic to auto-populate dictionary values associated with frequency Cochran-Mantel-Haenszel (CMH) annotations.
- freq_stat_union (bool) – If True, use alternate logic to auto-populate dictionary values associated with the union of the contingency table and Cochran-Mantel-Haenszel tests.
- description_text (str) – Optional text to append to the end of descriptions. Needs to start with a space if specified.
- age_hist_distribution (str) – Pipe-delimited string of overall age distribution.
- sort_order (List[str]) – List containing the order in which to sort label group combinations. Default is SORT_ORDER.

- Return type:
- Dict[str, Dict[str, str]]
- Returns:
- Dictionary keyed by VCF INFO annotations, where values are dictionaries of Number and Description attributes.
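The shape of the returned dictionary for the AC, AN, AF, and nhomalt fields can be sketched as follows. This is a simplified illustration under assumptions: `freq_info_dict_sketch` is a hypothetical name, the description wording is invented for the example, and the field-name layout for `prefix_before_metric` is a guess based on the parameter description. The Number values follow the usual VCF convention ("A" for per-alternate-allele values, "1" for a single value).

```python
from typing import Dict, List


def freq_info_dict_sketch(
    labels: List[str],
    prefix: str = "",
    prefix_before_metric: bool = True,
    label_delimiter: str = "_",
    description_text: str = "",
) -> Dict[str, Dict[str, str]]:
    # Metric name -> (VCF Number, illustrative description stem).
    metrics = {
        "AC": ("A", "Alternate allele count"),
        "AN": ("1", "Total number of alleles"),
        "AF": ("A", "Alternate allele frequency"),
        "nhomalt": ("A", "Count of homozygous individuals"),
    }
    info: Dict[str, Dict[str, str]] = {}
    for label in labels:
        for metric, (number, stem) in metrics.items():
            # Place the prefix before or after the metric, skipping empty parts.
            parts = [prefix, metric, label] if prefix_before_metric else [metric, prefix, label]
            field = label_delimiter.join(part for part in parts if part)
            info[field] = {
                "Number": number,
                "Description": f"{stem} for label {label}{description_text}",
            }
    return info


info = freq_info_dict_sketch(["afr_XX"], prefix="gnomAD")
```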
 
- gnomad.utils.vcf.add_as_info_dict(info_dict=INFO_DICT, as_fields=AS_FIELDS)
- Update info dictionary with allele-specific terms and their descriptions.
- Used in VCF export.
- Parameters:
- info_dict (Dict[str, Dict[str, str]]) – Dictionary containing site-level annotations and their descriptions. Default is INFO_DICT.
- as_fields (List[str]) – List containing allele-specific fields to be added to info_dict. Default is AS_FIELDS.

- Return type:
- Dict[str, Dict[str, str]]
- Returns:
- Dictionary with allele-specific annotations, their descriptions, and their VCF number field.
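One plausible way to derive allele-specific entries from site-level ones is sketched below. It is an illustration under assumptions, not the library's code: `add_as_info_sketch` is a hypothetical name, the "Allele-specific" description prefix and the `Number` of "A" are assumed, and fields without a site-level counterpart are simply skipped.

```python
from typing import Dict, List


def add_as_info_sketch(
    info_dict: Dict[str, Dict[str, str]],
    as_fields: List[str],
) -> Dict[str, Dict[str, str]]:
    as_dict: Dict[str, Dict[str, str]] = {}
    for field in as_fields:
        # Strip the "AS_" prefix to find the site-level counterpart.
        site_field = field[len("AS_"):] if field.startswith("AS_") else field
        if site_field not in info_dict:
            continue  # Assumption: skip fields with no site-level description.
        desc = info_dict[site_field]["Description"]
        as_dict[field] = {
            "Number": "A",  # One value per alternate allele.
            "Description": "Allele-specific " + desc[0].lower() + desc[1:],
        }
    return as_dict


site_info = {"FS": {"Description": "Phred-scaled p-value of Fisher's exact test for strand bias"}}
as_info = add_as_info_sketch(site_info, ["AS_FS", "AS_MQ"])
```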
 
- gnomad.utils.vcf.make_vcf_filter_dict(snp_cutoff=None, indel_cutoff=None, inbreeding_cutoff=None, variant_qc_filter='RF', joint=False)
- Generate dictionary of Number and Description attributes to be used in the VCF header, specifically for FILTER annotations.
- Generates descriptions for:
- AC0 filter
- InbreedingCoeff filter
- Variant QC filter (RF or AS_VQSR)
- PASS (passed all variant filters)

- Parameters:
- snp_cutoff (Optional[float]) – Minimum SNP cutoff score from random forest model.
- indel_cutoff (Optional[float]) – Minimum indel cutoff score from random forest model.
- inbreeding_cutoff (Optional[float]) – Inbreeding coefficient hard cutoff.
- variant_qc_filter (str) – Method used for variant QC filter. One of 'RF' or 'AS_VQSR'. Default is 'RF'.
- joint (bool) – Whether the filter dictionary is for the joint release. Default is False.

- Return type:
- Dict[str, str]
- Returns:
- Dictionary keyed by VCF FILTER annotations, where values are dictionaries of Number and Description attributes.
 
- gnomad.utils.vcf.make_hist_bin_edges_expr(ht, hists=['gq_hist_alt', 'gq_hist_all', 'dp_hist_alt', 'dp_hist_all', 'ab_hist_alt'], ann_with_hists=None, prefix='', label_delimiter='_', include_age_hists=True)
- Create dictionaries containing variant histogram annotations and their associated bin edges, formatted into a string separated by pipe delimiters.
- Parameters:
- ht (Table) – Table containing histogram variant annotations.
- hists (List[str]) – List of variant histogram annotations. Default is HISTS.
- ann_with_hists (Optional[str]) – Name of row annotation containing histogram data. In the exomes or genomes release HT, histograms is a top-level row annotation, but in the joint release HT it is nested under the exomes, genomes, or joint row annotation.
- prefix (str) – Prefix text for age histogram bin edges. Default is empty string.
- label_delimiter (str) – String used as delimiter between prefix and histogram annotation.
- include_age_hists (bool) – Include age histogram annotations.

- Return type:
- Dict[str, str]
- Returns:
- Dictionary keyed by histogram annotation name, with corresponding reformatted bin edges for values.
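The pipe-delimited formatting of bin edges can be shown in isolation. The sketch below is plain Python, not the Hail-based library function: `format_bin_edges` is a hypothetical name, the bin edge values are made up, and the `:g` number formatting is an assumption about how edges are rendered.

```python
from typing import Dict, List


def format_bin_edges(bin_edges_by_hist: Dict[str, List[float]]) -> Dict[str, str]:
    # Render each edge compactly (5.0 -> "5") and join the edges with "|".
    return {
        name: "|".join(f"{edge:g}" for edge in edges)
        for name, edges in bin_edges_by_hist.items()
    }


edges = format_bin_edges(
    {"gq_hist_alt": [0.0, 5.0, 10.0], "ab_hist_alt": [0.0, 0.05, 0.1]}
)
```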
 
- gnomad.utils.vcf.make_hist_dict(bin_edges, adj, hist_metric_list=['gq_hist_alt', 'gq_hist_all', 'dp_hist_alt', 'dp_hist_all', 'ab_hist_alt'], label_delimiter='_', drop_n_smaller_larger=False, prefix='', suffix='', description_text='')
- Generate dictionary of Number and Description attributes to be used in the VCF header, specifically for histogram annotations.
- Parameters:
- bin_edges (Dict[str, Dict[str, str]]) – Dictionary keyed by histogram annotation name, with corresponding string-reformatted bin edges for values.
- adj (bool) – Whether to create a header dict for raw or adj quality histograms.
- hist_metric_list (List[str]) – List of hists for which to build the hist info dict.
- label_delimiter (str) – String used as delimiter in values stored in hist_metric_list.
- drop_n_smaller_larger (bool) – Whether to drop n_smaller and n_larger annotations from header dict. Default is False.
- prefix (str) – Prefix text for histogram annotations. Default is empty string.
- suffix (str) – Suffix text for histogram annotations. Default is empty string.
- description_text (str) – Optional text to append to the end of descriptions. Needs to start with a space if specified.

- Return type:
- Dict[str, str]
- Returns:
- Dictionary keyed by VCF INFO annotations, where values are dictionaries of Number and Description attributes.
 
- gnomad.utils.vcf.set_xx_y_metrics_to_na(t)
- Set AC, AN, and nhomalt chrY variant annotations for XX samples to NA (instead of 0).
- Parameters:
- t (Union[Table, MatrixTable]) – Table/MatrixTable containing XX variant annotations.
- Return type:
- Dict[str, Int32Expression]
- Returns:
- Dictionary with reset annotations.
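The intent of this reset can be illustrated on a plain dictionary of INFO values. The sketch below is not the Hail-based library function: `xx_y_metrics_to_na_sketch` is a hypothetical name, and selecting fields by splitting their names on "_" is an assumption about how XX-specific AC, AN, and nhomalt fields are identified.

```python
from typing import Dict, Optional


def xx_y_metrics_to_na_sketch(
    contig: str, info: Dict[str, float]
) -> Dict[str, Optional[float]]:
    metrics = ("AC", "AN", "nhomalt")
    reset: Dict[str, Optional[float]] = {}
    for name, value in info.items():
        parts = name.split("_")
        # Only XX-stratified AC, AN, and nhomalt fields are affected.
        if "XX" in parts and any(metric in parts for metric in metrics):
            # On chrY the value becomes missing (None) instead of 0.
            reset[name] = None if contig == "chrY" else value
    return reset


site_info = {"AC_XX": 0, "AN_afr_XX": 0, "AC_XY": 3, "AF_XX": 0.0}
on_chr_y = xx_y_metrics_to_na_sketch("chrY", site_info)
on_chr_1 = xx_y_metrics_to_na_sketch("chr1", site_info)
```

Like the documented function, the sketch returns only the reset annotations rather than the full set of fields.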
 
- gnomad.utils.vcf.build_vcf_export_reference(name, build='GRCh38', keep_contigs=['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY'], keep_chrM=True)
- Create export reference based on reference genome defined by build.
- By default, this returns a new reference with all non-standard contigs eliminated, keeping chr1-22, chrX, chrY, and chrM.
- An example of a non-standard contig is: ##contig=<ID=chr3_GL000221v1_random,length=155397,assembly=GRCh38>
- Parameters:
- name (str) – Name to use for new reference.
- build (str) – Reference genome build to use as starting reference genome.
- keep_contigs (List[str]) – Contigs to keep from reference genome defined by build. Default is autosomes and sex chromosomes.
- keep_chrM (bool) – Whether to keep chrM. Default is True.

- Return type:
- ReferenceGenome
- Returns:
- Reference genome for VCF export containing only contigs in keep_contigs.
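The contig filtering described above can be sketched on a plain mapping of contig names to lengths. This is not the Hail-based library function: `keep_standard_contigs` is a hypothetical helper, and the three-contig mapping is a small stand-in for a full reference.

```python
from typing import Dict, List


def keep_standard_contigs(
    contig_lengths: Dict[str, int],
    keep_contigs: List[str],
    keep_chrM: bool = True,
) -> Dict[str, int]:
    # Append chrM to the requested contigs when it should be retained.
    wanted = list(keep_contigs) + (["chrM"] if keep_chrM else [])
    # Keep only requested contigs that exist in the starting reference.
    return {contig: contig_lengths[contig] for contig in wanted if contig in contig_lengths}


reference_contigs = {
    "chr1": 248956422,
    "chr3_GL000221v1_random": 155397,
    "chrM": 16569,
}
export_contigs = keep_standard_contigs(reference_contigs, ["chr1"])
```

The non-standard contig `chr3_GL000221v1_random` from the example above is dropped because it is not in the list of contigs to keep.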
 
- gnomad.utils.vcf.rekey_new_reference(t, reference)
- Re-key Table or MatrixTable with a new reference genome.
- Parameters:
- t (Union[Table, MatrixTable]) – Input Table/MatrixTable.
- reference (ReferenceGenome) – Reference genome to re-key with.

- Return type:
- Union[Table, MatrixTable]
- Returns:
- Re-keyed Table/MatrixTable.