gnomad_qc.v4.annotations.generate_freq
Script to generate the frequency data annotations across v4 exomes.
This script first splits the v4 VDS into multiple VDSs, which are then densified and annotated with frequency data and histograms. The VDSs are then merged back together in a Hail Table. Next, given the AF threshold, the script uses a high AB het array to correct existing annotations for the high AB heterozygous GATK artifact. The script then computes the inbreeding coefficient using the raw call stats. Finally, it computes the filtering allele frequency and grpmax with the AB-adjusted frequencies.
usage: gnomad_qc.v4.annotations.generate_freq.py [-h] [--use-test-dataset]
[--test-gene]
[--test-n-partitions [TEST_N_PARTITIONS]]
[--chrom CHROM] [--overwrite]
[--slack-channel SLACK_CHANNEL]
[--write-split-vds-and-downsampling-ht]
[--run-freq-and-dense-annotations]
[--ukb-only] [--non-ukb-only]
[--combine-freq-hts]
[--correct-for-high-ab-hets]
[--ab-cutoff AB_CUTOFF]
[--af-threshold AF_THRESHOLD]
[--finalize-freq-ht]
Named Arguments
- --use-test-dataset
Runs a test on the gnomad test dataset.
Default: False
- --test-gene
Runs a test on the DRD2 gene in the gnomad test dataset.
Default: False
- --test-n-partitions
Use only N partitions of the VDS as input for testing purposes. Defaults to 2 if passed without a value.
- --chrom
If passed, script will only run on passed chromosome.
- --overwrite
Overwrites existing files.
Default: False
- --slack-channel
Slack channel to post results and notifications to.
- --write-split-vds-and-downsampling-ht
Write split VDS and downsampling HT.
Default: False
- --run-freq-and-dense-annotations
Calculate frequencies, histograms, and high AB sites per sample grouping.
Default: False
- --ukb-only
Only run frequency and histogram calculations for UKB samples.
Default: False
- --non-ukb-only
Only run frequency and histogram calculations for non-UKB samples.
Default: False
- --combine-freq-hts
Combine frequency and histogram Tables for UKB and non-UKB samples into a single Table.
Default: False
- --correct-for-high-ab-hets
Correct each frequency entry to account for homozygous alternate depletion present in GATK versions released prior to 4.1.4.1 and run chosen downstream annotations.
Default: False
- --ab-cutoff
Allele balance threshold to use when adjusting heterozygous calls to homozygous alternate calls at sites for samples that used GATK versions released prior to 4.1.4.1.
Default: 0.9
- --af-threshold
Threshold at which to adjust site group frequencies at sites for homozygous alternate depletion present in GATK versions released prior to 4.1.4.1.
Default: 0.01
- --finalize-freq-ht
Finalize frequency Table by dropping unnecessary fields and renaming remaining fields.
Default: False
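The flag behaviour described above can be illustrated with a small `argparse` sketch. This is not the script's actual parser: `build_parser` is a hypothetical helper that re-creates only a handful of the documented flags, and it assumes that "defaults to 2 if passed without a value" corresponds to the standard `nargs="?"` plus `const=2` pattern.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical re-creation of a few of the documented flags."""
    parser = argparse.ArgumentParser()
    # Boolean switches default to False and become True when passed.
    parser.add_argument("--use-test-dataset", action="store_true")
    parser.add_argument("--correct-for-high-ab-hets", action="store_true")
    # Assumption: nargs="?" with const=2 yields 2 when the flag has no value.
    parser.add_argument("--test-n-partitions", nargs="?", const=2, type=int)
    parser.add_argument("--chrom")
    # Documented defaults: 0.9 for the AB cutoff, 0.01 for the AF threshold.
    parser.add_argument("--ab-cutoff", type=float, default=0.9)
    parser.add_argument("--af-threshold", type=float, default=0.01)
    return parser


# "--test-n-partitions" is followed by another flag, so it falls back to const.
args = build_parser().parse_args(["--test-n-partitions", "--chrom", "chr22"])
print(args.test_n_partitions, args.chrom, args.ab_cutoff, args.af_threshold)
```

When `--test-n-partitions` is omitted entirely, the sketch leaves the value as `None`, which a caller could treat as "use all partitions".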
Module Functions
- AGE_HISTS: Age histograms to compute and keep on the frequency Table.
- QUAL_HISTS: Quality histograms to compute and keep on the frequency Table.
- FREQ_HIGH_AB_HET_ROW_FIELDS: List of top level row and global annotations relating to the high allele balance heterozygote correction that we want on the frequency HT before deciding on the AF cutoff.
- FREQ_ROW_FIELDS: List of top level row and global annotations with no high allele balance heterozygote correction that we want on the frequency HT.
- ALL_FREQ_ROW_FIELDS: List of final top level row and global annotations created from dense data that we want on the frequency HT before deciding on the AF cutoff.
- FREQ_GLOBAL_FIELDS: List of final global annotations created from dense data that we want on the frequency HT before deciding on the AF cutoff.
- SUBSET_DICT: Dictionary for accessing the annotations with subset specific annotations such as age_hists, popmax, and faf.
- get_freq_resources: Get frequency resources.
- get_vds_for_freq: Prepare VDS for frequency calculation by filtering to release samples and only adding necessary annotations.
- annotate_freq_index_dict: Create frequency index dictionary.
- filter_freq_arrays_for_non_ukb_subset: Filter frequency arrays by metadata.
- high_ab_het: Determine if a call is considered a high allele balance heterozygous call.
- generate_freq_ht: Generate frequency Table.
- densify_and_prep_vds_for_freq: Densify VDS and select necessary annotations for frequency and histogram calculations.
- get_downsampling_ht: Get Table with downsampling groups for all samples or the non-UKB subset.
- update_non_ukb_freq_ht: Update non-UKB subset frequencies to be ready for combining with the frequencies of other samples.
- mt_hists_fields: Annotate quality metrics histograms and age histograms onto MatrixTable.
- combine_freq_hts: Combine frequency HTs into a single HT.
- correct_for_high_ab_hets: Correct for high allele balance heterozygous calls in call statistics and histograms.
- generate_faf_grpmax: Compute filtering allele frequencies ('faf'), 'grpmax', and 'gen_anc_faf_max' with the AB-adjusted frequencies.
- compute_inbreeding_coeff: Compute inbreeding coefficient using raw call stats.
- create_final_freq_ht: Create final freq Table with only desired annotations.
- split_vds: Split a VDS into multiple VDSs based on strata_expr.
- Script to generate frequency and dense dependent annotations on v4 exomes.
- Get script argument parser.
- gnomad_qc.v4.annotations.generate_freq.AGE_HISTS = ['age_hist_het', 'age_hist_hom']
Age histograms to compute and keep on the frequency Table.
- gnomad_qc.v4.annotations.generate_freq.QUAL_HISTS = ['gq_hist_all', 'dp_hist_all', 'gq_hist_alt', 'dp_hist_alt', 'ab_hist_alt']
Quality histograms to compute and keep on the frequency Table.
- gnomad_qc.v4.annotations.generate_freq.FREQ_HIGH_AB_HET_ROW_FIELDS = ['high_ab_hets_by_group', 'high_ab_het_adjusted_ab_hists', 'high_ab_het_adjusted_age_hists']
List of top level row and global annotations relating to the high allele balance heterozygote correction that we want on the frequency HT before deciding on the AF cutoff.
- gnomad_qc.v4.annotations.generate_freq.FREQ_ROW_FIELDS = ['freq', 'qual_hists', 'raw_qual_hists', 'age_hists']
List of top level row and global annotations with no high allele balance heterozygote correction that we want on the frequency HT.
- gnomad_qc.v4.annotations.generate_freq.ALL_FREQ_ROW_FIELDS = ['freq', 'qual_hists', 'raw_qual_hists', 'age_hists', 'high_ab_hets_by_group', 'high_ab_het_adjusted_ab_hists', 'high_ab_het_adjusted_age_hists']
List of final top level row and global annotations created from dense data that we want on the frequency HT before deciding on the AF cutoff.
- gnomad_qc.v4.annotations.generate_freq.FREQ_GLOBAL_FIELDS = ['downsamplings', 'freq_meta', 'age_distribution', 'freq_index_dict', 'freq_meta_sample_count']
List of final global annotations created from dense data that we want on the frequency HT before deciding on the AF cutoff.
- gnomad_qc.v4.annotations.generate_freq.SUBSET_DICT = {'gnomad': 0, 'non_ukb': 1}
Dictionary for accessing the annotations with subset specific annotations such as age_hists, popmax, and faf.
- gnomad_qc.v4.annotations.generate_freq.get_freq_resources(overwrite=False, test=False, chrom=None, ukb=False, non_ukb=False)
Get frequency resources.
- Parameters:
overwrite (bool) – Whether to overwrite existing files.
test (Optional[bool]) – Whether to use test resources.
chrom (Optional[str]) – Chromosome used in freq calculations.
ukb (bool) – Whether to get frequency resources for UKB subset.
non_ukb (bool) – Whether to get frequency resources for non UKB subset.
- Return type:
PipelineResourceCollection
- Returns:
Frequency resources.
- gnomad_qc.v4.annotations.generate_freq.get_vds_for_freq(use_test_dataset=False, test_gene=False, test_n_partitions=None, chrom=None)
Prepare VDS for frequency calculation by filtering to release samples and only adding necessary annotations.
- Parameters:
use_test_dataset (bool) – Whether to use test dataset.
test_gene (bool) – Whether to filter to DRD2 for testing purposes.
test_n_partitions (Optional[int]) – Number of partitions to use for testing.
chrom (Optional[int]) – Chromosome to filter to.
- Return type:
- Returns:
Hail VDS with only necessary annotations.
- gnomad_qc.v4.annotations.generate_freq.annotate_freq_index_dict(ht)
Create frequency index dictionary.
The keys are the strata over which frequency aggregations were calculated and the values are the strata’s index in the frequency array.
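A minimal pure-Python sketch of the idea, independent of Hail: each entry of a `freq_meta`-style list becomes a dictionary key whose value is that entry's position in the frequency array. The helper name `make_index_dict` and the label format (metadata values joined with `_` in sorted key order) are assumptions for illustration, not the format the module is documented to use.

```python
from typing import Dict, List


def make_index_dict(freq_meta: List[Dict[str, str]]) -> Dict[str, int]:
    """Map each stratum label to its position in the frequency array."""
    index: Dict[str, int] = {}
    for i, meta in enumerate(freq_meta):
        # Hypothetical label format: values joined by "_" in sorted key order.
        label = "_".join(meta[key] for key in sorted(meta))
        index[label] = i
    return index


freq_meta = [
    {"group": "adj"},
    {"group": "raw"},
    {"group": "adj", "gen_anc": "afr"},
]
index = make_index_dict(freq_meta)
print(index)
```

With such a dictionary, a caller can look up the array position of a stratum by label instead of scanning the metadata list.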
- gnomad_qc.v4.annotations.generate_freq.filter_freq_arrays_for_non_ukb_subset(ht, items_to_filter, keep=True, combine_operator='and', annotations=('freq', 'high_ab_hets_by_group'), remove_subset_from_meta=False)
Filter frequency arrays by metadata.
Filter ‘annotations’ and ‘freq_meta’ array fields to only items_to_filter by using the ‘freq_meta’ array field values.
If remove_subset_from_meta is True, update the ‘freq_meta’ dicts by removing items with the key “subset”. If False, update the ‘freq_meta’ dicts to include a “subset” key with the value “non_ukb”.
Also rename the ‘downsamplings’ global field to “non_ukb_downsamplings” so we can merge them later without losing the non-UKB downsampling information.
- Parameters:
ht (Table) – Input Table.
items_to_filter (Union[List[str], Dict[str, List[str]]]) – Items to filter by.
keep (bool) – Whether to keep or remove items. Default is True.
combine_operator (str) – Operator (“and” or “or”) to use when combining items in ‘items_to_filter’. Default is “and”.
annotations (Union[List[str], Tuple[str]]) – Annotations in ‘ht’ to filter by items_to_filter.
remove_subset_from_meta (bool) – Whether to remove the “subset” key from ‘freq_meta’ or add “subset” key with “non_ukb” value. Default is False.
- Return type:
- Returns:
Table with filtered ‘annotations’ and ‘freq_meta’ array fields.
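The filtering semantics can be sketched on plain Python lists. This is an illustration under assumptions, not the Hail implementation: `filter_by_meta` is a hypothetical helper, a list in `items_to_filter` is assumed to match on key presence, and a dict is assumed to match when the metadata value for a key is one of the listed values.

```python
from typing import Dict, List, Tuple, Union


def filter_by_meta(
    freq: List[float],
    freq_meta: List[Dict[str, str]],
    items_to_filter: Union[List[str], Dict[str, List[str]]],
    keep: bool = True,
    combine_operator: str = "and",
) -> Tuple[List[float], List[Dict[str, str]]]:
    """Filter parallel 'freq' and 'freq_meta' lists by metadata content."""
    if combine_operator not in ("and", "or"):
        raise ValueError("combine_operator must be 'and' or 'or'")
    combine = all if combine_operator == "and" else any

    def matches(meta: Dict[str, str]) -> bool:
        if isinstance(items_to_filter, dict):
            # Dict form: the key must be present with one of the given values.
            checks = [meta.get(k) in v for k, v in items_to_filter.items()]
        else:
            # List form: the key only needs to be present.
            checks = [k in meta for k in items_to_filter]
        return combine(checks)

    # keep=True retains matching entries; keep=False removes them.
    kept = [(f, m) for f, m in zip(freq, freq_meta) if matches(m) == keep]
    return [f for f, _ in kept], [m for _, m in kept]


freq = [0.10, 0.20, 0.30]
freq_meta = [
    {"group": "adj"},
    {"group": "adj", "gen_anc": "afr"},
    {"group": "adj", "downsampling": "100"},
]
no_ds_freq, no_ds_meta = filter_by_meta(freq, freq_meta, ["downsampling"], keep=False)
afr_freq, afr_meta = filter_by_meta(freq, freq_meta, {"gen_anc": ["afr"]})
```

Filtering the value array and the metadata array together keeps their indices aligned, which is what any index dictionary built from the metadata relies on.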
- gnomad_qc.v4.annotations.generate_freq.high_ab_het(entry, col)
Determine if a call is considered a high allele balance heterozygous call.
High allele balance heterozygous calls were introduced in certain GATK versions. Track how many calls appear at each site to correct them to homozygous alternate calls downstream in frequency calculations and histograms.
- Assumes the following annotations are present in entry struct:
GT
adj
_high_ab_het_ref
- Assumes the following annotations are present in col struct:
fixed_homalt_model
- Parameters:
entry (StructExpression) – Entry struct.
col (StructExpression) – Column struct.
- Return type:
- Returns:
1 if high allele balance heterozygous call, else 0.
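As a rough illustration of the 1/0 flag described above, the sketch below combines the four listed annotations with plain Python booleans. The exact combination is an assumption inferred from the required fields (a het call that passes adj, is marked high AB, and comes from a sample not processed with the fixed hom-alt model); `Entry`, `Col`, and `high_ab_het_flag` are hypothetical stand-ins for the Hail structs and function.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    is_het_ref: bool       # stands in for GT being a ref/alt heterozygous call
    adj: bool              # call passes the 'adj' quality filter
    high_ab_het_ref: bool  # stands in for the '_high_ab_het_ref' annotation


@dataclass
class Col:
    fixed_homalt_model: bool  # sample was called with the corrected GATK model


def high_ab_het_flag(entry: Entry, col: Col) -> int:
    """Return 1 for a high allele balance het call, else 0 (assumed logic)."""
    return int(
        entry.is_het_ref
        and entry.adj
        and entry.high_ab_het_ref
        and not col.fixed_homalt_model
    )


affected = high_ab_het_flag(Entry(True, True, True), Col(fixed_homalt_model=False))
fixed = high_ab_het_flag(Entry(True, True, True), Col(fixed_homalt_model=True))
print(affected, fixed)
```

Returning an integer rather than a boolean makes the flag easy to sum per site, which matches the stated goal of tracking how many such calls appear at each site.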
- gnomad_qc.v4.annotations.generate_freq.generate_freq_ht(mt, ds_ht, meta_ht, non_ukb_ds_ht=None)
Generate frequency Table.
- Assumes all necessary annotations are present:
- mt annotations:
GT
adj
_high_ab_het_ref
fixed_homalt_model
- ds_ht annotations:
downsampling
downsamplings
ds_pop_counts
- meta_ht annotations:
pop
sex_karyotype
gatk_version
age
ukb_sample
- Parameters:
mt (MatrixTable) – Input MatrixTable.
ds_ht (Table) – Table with downsampling annotations.
meta_ht (Table) – Table with sample metadata annotations.
non_ukb_ds_ht (Optional[Table]) – Optional Table with non-UKB downsampling annotations.
- Return type:
- Returns:
Hail Table with frequency annotations.
- gnomad_qc.v4.annotations.generate_freq.densify_and_prep_vds_for_freq(vds, ab_cutoff=0.9)
Densify VDS and select necessary annotations for frequency and histogram calculations.
Select entry annotations required for downstream work. ‘DP’, ‘GQ’, and ‘_het_ab’ are required for histograms. ‘GT’ and ‘adj’ are required for frequency calculations. ‘_high_ab_het_ref’ is required for high AB call corrections in frequency and histogram annotations.
- Assumes all necessary annotations are present:
adj
_het_non_ref
GT
GQ
AD
DP
sex_karyotype
- Parameters:
vds (VariantDataset) – Input VDS.
ab_cutoff (float) – Allele balance cutoff to use for high AB het annotation.
- Return type:
- Returns:
Dense MatrixTable with only necessary entry annotations.
- gnomad_qc.v4.annotations.generate_freq.get_downsampling_ht(mt, non_ukb=False)
Get Table with downsampling groups for all samples or the non-UKB subset.
- Parameters:
mt (MatrixTable) – Input MatrixTable.
non_ukb (bool) – Whether to get downsampling groups for the non-UKB subset. Default is False.
- Return type:
- Returns:
Table with downsampling groups.
- gnomad_qc.v4.annotations.generate_freq.update_non_ukb_freq_ht(freq_ht)
Update non-UKB subset frequencies to be ready for combining with the frequencies of other samples.
Duplicates frequency info for all groups except the non-UKB specific downsamplings and adds “subset” annotation to ‘freq_meta’ for one of the duplicates.
This allows for the non-UKB subset frequencies to be merged with the frequencies of the other samples to provide full dataset frequencies while keeping the non-UKB subset specific frequency information.
- gnomad_qc.v4.annotations.generate_freq.mt_hists_fields(mt)
Annotate quality metrics histograms and age histograms onto MatrixTable.
- Parameters:
mt (MatrixTable) – Input MatrixTable.
- Return type:
- Returns:
Struct with quality metrics histograms and age histograms.
- gnomad_qc.v4.annotations.generate_freq.combine_freq_hts(freq_hts, row_annotations, global_annotations, age_hists=['age_hist_het', 'age_hist_hom'], qual_hists=['gq_hist_all', 'dp_hist_all', 'gq_hist_alt', 'dp_hist_alt', 'ab_hist_alt'])
Combine frequency HTs into a single HT.
- Parameters:
freq_hts (Dict[str, Table]) – Dictionary of frequency HTs.
row_annotations (List[str]) – List of annotations to put onto one hail Table.
global_annotations (List[str]) – List of global annotations to put onto one hail Table.
age_hists (List[str]) – List of age histogram annotations to merge.
qual_hists (List[str]) – List of quality histogram annotations to merge.
- Return type:
- Returns:
HT with all freq_hts annotations.
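One way to picture combining frequencies from two disjoint sample sets is to sum the counts for the same stratum and recompute the allele frequency. The sketch below does this on plain dictionaries; it is an assumption about how such a merge can work in general, not a description of this function's internals, and `merge_call_stats` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Union

CallStats = Dict[str, Union[int, float, None]]


def merge_call_stats(stats: List[Dict[str, int]]) -> CallStats:
    """Sum AC, AN, and homozygote counts, then recompute AF from the totals."""
    ac = sum(s["AC"] for s in stats)
    an = sum(s["AN"] for s in stats)
    hom = sum(s["homozygote_count"] for s in stats)
    # AF is undefined when no alleles were called in any input.
    af: Optional[float] = ac / an if an > 0 else None
    return {"AC": ac, "AN": an, "AF": af, "homozygote_count": hom}


ukb = {"AC": 3, "AN": 100, "homozygote_count": 1}
non_ukb = {"AC": 7, "AN": 300, "homozygote_count": 2}
merged = merge_call_stats([ukb, non_ukb])
print(merged)
```

Recomputing AF from the summed counts, rather than averaging the per-subset AFs, weights each subset by the number of alleles it contributes.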
- gnomad_qc.v4.annotations.generate_freq.correct_for_high_ab_hets(ht, af_threshold=0.01)
Correct for high allele balance heterozygous calls in call statistics and histograms.
High allele balance GTs were being called heterozygous instead of homozygous alternate in GATK until version 4.1.4.1. This corrects for those calls by adjusting them to homozygous alternate within our call statistics and histograms when a site’s allele frequency is greater than the passed af_threshold. Raw data is not adjusted.
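The arithmetic of the adjustment can be sketched for a single stratum. Recoding a heterozygous call as homozygous alternate adds one alternate allele and one homozygote, while the number of called alleles stays the same. The helper `correct_call_stats` below is hypothetical and works on plain numbers; how the real function applies this across arrays and histograms is not shown here.

```python
from typing import Tuple


def correct_call_stats(
    ac: int,
    an: int,
    homozygote_count: int,
    n_high_ab_hets: int,
    af_threshold: float = 0.01,
) -> Tuple[int, int, float, int]:
    """Return (AC, AN, AF, homozygote_count) after the high AB het correction."""
    af = ac / an if an > 0 else 0.0
    # Sites at or below the threshold are left unchanged.
    if af <= af_threshold:
        return ac, an, af, homozygote_count
    # Each het -> hom-alt recode adds one alt allele and one homozygote.
    new_ac = ac + n_high_ab_hets
    new_hom = homozygote_count + n_high_ab_hets
    return new_ac, an, new_ac / an, new_hom


common = correct_call_stats(ac=50, an=1000, homozygote_count=5, n_high_ab_hets=3)
rare = correct_call_stats(ac=5, an=1000, homozygote_count=0, n_high_ab_hets=3)
print(common, rare)
```

In this sketch the common site (AF 0.05) is adjusted and the rare site (AF 0.005) is returned as-is, mirroring the documented default threshold of 0.01.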
- gnomad_qc.v4.annotations.generate_freq.generate_faf_grpmax(ht)
Compute filtering allele frequencies (‘faf’), ‘grpmax’, and ‘gen_anc_faf_max’ with the AB-adjusted frequencies.
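For orientation, the 'grpmax' idea can be sketched as picking the group with the highest allele frequency among a set of genetic ancestry groups. The sketch below is an assumption-laden illustration: the helper `grpmax`, the `exclude` parameter, and the choice to ignore groups with AF of zero are all hypothetical, and the filtering allele frequency calculation is not covered.

```python
from typing import Dict, Iterable, Optional, Tuple


def grpmax(
    af_by_group: Dict[str, float], exclude: Iterable[str] = ()
) -> Optional[Tuple[str, float]]:
    """Return (group, AF) for the group with the highest non-zero AF."""
    excluded = set(exclude)
    candidates = {
        group: af
        for group, af in af_by_group.items()
        if group not in excluded and af > 0
    }
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


result = grpmax({"afr": 0.02, "nfe": 0.05, "fin": 0.08}, exclude=("fin",))
print(result)
```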
- gnomad_qc.v4.annotations.generate_freq.compute_inbreeding_coeff(ht)
Compute inbreeding coefficient using raw call stats.
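A common definition of the site-level inbreeding coefficient at a bi-allelic site is F = 1 - (observed hets) / (expected hets under Hardy-Weinberg), with expected hets equal to 2pq times the number of samples. The sketch below derives the genotype counts from AC, AN, and the homozygote count. It assumes diploid calls and this standard formula; whether the module uses exactly this form is not stated in the text above, and `inbreeding_coeff` is a hypothetical helper.

```python
from typing import Optional


def inbreeding_coeff(ac: int, an: int, homozygote_count: int) -> Optional[float]:
    """Compute F = 1 - n_het / (2 * p * q * n) from raw call stats."""
    if an == 0:
        return None
    n = an / 2                       # number of diploid samples with a call
    n_het = ac - 2 * homozygote_count
    q = ac / an                      # alternate allele frequency
    p = 1 - q                        # reference allele frequency
    expected_het = 2 * p * q * n
    # F is undefined at monomorphic sites (no expected heterozygotes).
    if expected_het == 0:
        return None
    return 1 - n_het / expected_het


excess_hom = inbreeding_coeff(ac=20, an=200, homozygote_count=5)
hwe = inbreeding_coeff(ac=20, an=200, homozygote_count=1)
print(excess_hom, hwe)
```

In the first call, 100 samples carry 10 hets against 18 expected, giving F of about 0.44; in the second, the 18 observed hets match expectation and F is approximately 0.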
- gnomad_qc.v4.annotations.generate_freq.create_final_freq_ht(ht)
Create final freq Table with only desired annotations.
- gnomad_qc.v4.annotations.generate_freq.split_vds(vds, strata_expr)
Split a VDS into multiple VDSs based on strata_expr.
- Parameters:
vds (VariantDataset) – Input VDS.
strata_expr (Expression) – Expression on VDS variant_data MT columns that will be used to determine if a sample belongs to a certain split or subset of the VDS.
- Return type:
Dict[str, VariantDataset]
- Returns:
Dictionary where strata value is key and VDS is value.
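The shape of the returned dictionary can be illustrated without Hail by grouping sample IDs by a per-sample stratum value. This sketch only mirrors the "strata value is key" structure; `split_samples` and the sample IDs are hypothetical, and a real split would filter the VDS to each group of samples rather than return ID lists.

```python
from collections import defaultdict
from typing import Dict, List


def split_samples(strata_by_sample: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sample IDs by stratum value, one dictionary entry per stratum."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sample_id, stratum in strata_by_sample.items():
        groups[stratum].append(sample_id)
    return dict(groups)


# Hypothetical strata: whether each sample belongs to the UKB subset.
strata = {"s1": "ukb", "s2": "non_ukb", "s3": "ukb"}
splits = split_samples(strata)
print(splits)
```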