gnomad_qc.v4.annotations.generate_freq

Script to generate the frequency data annotations across v4 exomes.

This script first splits the v4 VDS into multiple VDSs, which are then densified and annotated with frequency data and histograms. The VDSs are then merged back together into a Hail Table. Next, given an AF threshold, the script uses a high AB het array to correct existing annotations for the high AB heterozygous GATK artifact. The script then computes the inbreeding coefficient using the raw call stats. Finally, it computes the filtering allele frequency and grpmax with the AB-adjusted frequencies.

usage: gnomad_qc.v4.annotations.generate_freq.py [-h] [--use-test-dataset]
                                                 [--test-gene]
                                                 [--test-n-partitions [TEST_N_PARTITIONS]]
                                                 [--chrom CHROM] [--overwrite]
                                                 [--slack-channel SLACK_CHANNEL]
                                                 [--write-split-vds-and-downsampling-ht]
                                                 [--run-freq-and-dense-annotations]
                                                 [--ukb-only] [--non-ukb-only]
                                                 [--combine-freq-hts]
                                                 [--correct-for-high-ab-hets]
                                                 [--ab-cutoff AB_CUTOFF]
                                                 [--af-threshold AF_THRESHOLD]
                                                 [--finalize-freq-ht]

Named Arguments

--use-test-dataset

Runs a test on the gnomad test dataset.

Default: False

--test-gene

Runs a test on the DRD2 gene in the gnomad test dataset.

Default: False

--test-n-partitions

Use only N partitions of the VDS as input for testing purposes. Defaults to 2 if passed without a value.

--chrom

If passed, the script runs only on the specified chromosome.

--overwrite

Overwrites existing files.

Default: False

--slack-channel

Slack channel to post results and notifications to.

--write-split-vds-and-downsampling-ht

Write split VDS and downsampling HT.

Default: False

--run-freq-and-dense-annotations

Calculate frequencies, histograms, and high AB sites per sample grouping.

Default: False

--ukb-only

Only run frequency and histogram calculations for UKB samples.

Default: False

--non-ukb-only

Only run frequency and histogram calculations for non-UKB samples.

Default: False

--combine-freq-hts

Combine frequency and histogram Tables for UKB and non-UKB samples into a single Table.

Default: False

--correct-for-high-ab-hets

Correct each frequency entry to account for homozygous alternate depletion present in GATK versions released prior to 4.1.4.1 and run chosen downstream annotations.

Default: False

--ab-cutoff

Allele balance threshold to use when adjusting heterozygous calls to homozygous alternate calls at sites for samples that used GATK versions released prior to 4.1.4.1.

Default: 0.9

--af-threshold

Threshold at which to adjust site group frequencies at sites for homozygous alternate depletion present in GATK versions released prior to 4.1.4.1.

Default: 0.01

--finalize-freq-ht

Finalize frequency Table by dropping unnecessary fields and renaming remaining fields.

Default: False
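
The optional-value behaviour described above for --test-n-partitions (falls back to 2 when the flag is passed bare) can be reproduced with argparse's nargs="?" and const. The following is a minimal sketch, not the script's actual parser, which is returned by get_script_argument_parser(); the selection of flags shown here is an assumption made for illustration.

```python
import argparse

# Illustrative parser covering a few of the flags listed above.
parser = argparse.ArgumentParser()
parser.add_argument("--overwrite", action="store_true")
parser.add_argument("--chrom")
parser.add_argument("--ab-cutoff", type=float, default=0.9)
parser.add_argument("--af-threshold", type=float, default=0.01)
# nargs="?" makes the value optional: const is used when the flag is passed
# without a value, and the default (None) when the flag is absent.
parser.add_argument("--test-n-partitions", nargs="?", const=2, type=int)

bare = parser.parse_args(["--test-n-partitions"])
explicit = parser.parse_args(["--test-n-partitions", "5", "--chrom", "chr20"])
absent = parser.parse_args([])
print(bare.test_n_partitions, explicit.test_n_partitions, absent.test_n_partitions)
```

With this pattern, downstream code can treat None as "do not restrict partitions".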

Module Functions

gnomad_qc.v4.annotations.generate_freq.AGE_HISTS

Age histograms to compute and keep on the frequency Table.

gnomad_qc.v4.annotations.generate_freq.QUAL_HISTS

Quality histograms to compute and keep on the frequency Table.

gnomad_qc.v4.annotations.generate_freq.FREQ_HIGH_AB_HET_ROW_FIELDS

List of top level row and global annotations relating to the high allele balance heterozygote correction that we want on the frequency HT before deciding on the AF cutoff.

gnomad_qc.v4.annotations.generate_freq.FREQ_ROW_FIELDS

List of top level row and global annotations with no high allele balance heterozygote correction that we want on the frequency HT.

gnomad_qc.v4.annotations.generate_freq.ALL_FREQ_ROW_FIELDS

List of final top level row and global annotations created from dense data that we want on the frequency HT before deciding on the AF cutoff.

gnomad_qc.v4.annotations.generate_freq.FREQ_GLOBAL_FIELDS

List of final global annotations created from dense data that we want on the frequency HT before deciding on the AF cutoff.

gnomad_qc.v4.annotations.generate_freq.SUBSET_DICT

Dictionary for accessing the annotations with subset specific annotations such as age_hists, popmax, and faf.

gnomad_qc.v4.annotations.generate_freq.get_freq_resources([...])

Get frequency resources.

gnomad_qc.v4.annotations.generate_freq.get_vds_for_freq([...])

Prepare VDS for frequency calculation by filtering to release samples and only adding necessary annotations.

gnomad_qc.v4.annotations.generate_freq.annotate_freq_index_dict(ht)

Create frequency index dictionary.

gnomad_qc.v4.annotations.generate_freq.filter_freq_arrays_for_non_ukb_subset(ht, ...)

Filter frequency arrays by metadata.

gnomad_qc.v4.annotations.generate_freq.high_ab_het(...)

Determine if a call is considered a high allele balance heterozygous call.

gnomad_qc.v4.annotations.generate_freq.generate_freq_ht(mt, ...)

Generate frequency Table.

gnomad_qc.v4.annotations.generate_freq.densify_and_prep_vds_for_freq(vds)

Densify VDS and select necessary annotations for frequency and histogram calculations.

gnomad_qc.v4.annotations.generate_freq.get_downsampling_ht(mt)

Get Table with downsampling groups for all samples or the non-UKB subset.

gnomad_qc.v4.annotations.generate_freq.update_non_ukb_freq_ht(freq_ht)

Update non-UKB subset frequencies to be ready for combining with the frequencies of other samples.

gnomad_qc.v4.annotations.generate_freq.mt_hists_fields(mt)

Annotate quality metrics histograms and age histograms onto MatrixTable.

gnomad_qc.v4.annotations.generate_freq.combine_freq_hts(...)

Combine frequency HTs into a single HT.

gnomad_qc.v4.annotations.generate_freq.correct_for_high_ab_hets(ht)

Correct for high allele balance heterozygous calls in call statistics and histograms.

gnomad_qc.v4.annotations.generate_freq.generate_faf_grpmax(ht)

Compute filtering allele frequencies ('faf'), 'grpmax', and 'gen_anc_faf_max' with the AB-adjusted frequencies.

gnomad_qc.v4.annotations.generate_freq.compute_inbreeding_coeff(ht)

Compute inbreeding coefficient using raw call stats.

gnomad_qc.v4.annotations.generate_freq.create_final_freq_ht(ht)

Create final freq Table with only desired annotations.

gnomad_qc.v4.annotations.generate_freq.split_vds(...)

Split a VDS into multiple VDSs based on strata_expr.

gnomad_qc.v4.annotations.generate_freq.main(args)

Script to generate frequency and dense dependent annotations on v4 exomes.

gnomad_qc.v4.annotations.generate_freq.get_script_argument_parser()

Get script argument parser.

gnomad_qc.v4.annotations.generate_freq.AGE_HISTS = ['age_hist_het', 'age_hist_hom']

Age histograms to compute and keep on the frequency Table.

gnomad_qc.v4.annotations.generate_freq.QUAL_HISTS = ['gq_hist_all', 'dp_hist_all', 'gq_hist_alt', 'dp_hist_alt', 'ab_hist_alt']

Quality histograms to compute and keep on the frequency Table.

gnomad_qc.v4.annotations.generate_freq.FREQ_HIGH_AB_HET_ROW_FIELDS = ['high_ab_hets_by_group', 'high_ab_het_adjusted_ab_hists', 'high_ab_het_adjusted_age_hists']

List of top level row and global annotations relating to the high allele balance heterozygote correction that we want on the frequency HT before deciding on the AF cutoff.

gnomad_qc.v4.annotations.generate_freq.FREQ_ROW_FIELDS = ['freq', 'qual_hists', 'raw_qual_hists', 'age_hists']

List of top level row and global annotations with no high allele balance heterozygote correction that we want on the frequency HT.

gnomad_qc.v4.annotations.generate_freq.ALL_FREQ_ROW_FIELDS = ['freq', 'qual_hists', 'raw_qual_hists', 'age_hists', 'high_ab_hets_by_group', 'high_ab_het_adjusted_ab_hists', 'high_ab_het_adjusted_age_hists']

List of final top level row and global annotations created from dense data that we want on the frequency HT before deciding on the AF cutoff.

gnomad_qc.v4.annotations.generate_freq.FREQ_GLOBAL_FIELDS = ['downsamplings', 'freq_meta', 'age_distribution', 'freq_index_dict', 'freq_meta_sample_count']

List of final global annotations created from dense data that we want on the frequency HT before deciding on the AF cutoff.

gnomad_qc.v4.annotations.generate_freq.SUBSET_DICT = {'gnomad': 0, 'non_ukb': 1}

Dictionary for accessing the annotations with subset specific annotations such as age_hists, popmax, and faf.

gnomad_qc.v4.annotations.generate_freq.get_freq_resources(overwrite=False, test=False, chrom=None, ukb=False, non_ukb=False)[source]

Get frequency resources.

Parameters:
  • overwrite (bool) – Whether to overwrite existing files.

  • test (Optional[bool]) – Whether to use test resources.

  • chrom (Optional[str]) – Chromosome used in freq calculations.

  • ukb (bool) – Whether to get frequency resources for UKB subset.

  • non_ukb (bool) – Whether to get frequency resources for non-UKB subset.

Return type:

PipelineResourceCollection

Returns:

Frequency resources.

gnomad_qc.v4.annotations.generate_freq.get_vds_for_freq(use_test_dataset=False, test_gene=False, test_n_partitions=None, chrom=None)[source]

Prepare VDS for frequency calculation by filtering to release samples and only adding necessary annotations.

Parameters:
  • use_test_dataset (bool) – Whether to use test dataset.

  • test_gene (bool) – Whether to filter to DRD2 for testing purposes.

  • test_n_partitions (Optional[int]) – Number of partitions to use for testing.

  • chrom (Optional[int]) – Chromosome to filter to.

Return type:

VariantDataset

Returns:

Hail VDS with only necessary annotations.

gnomad_qc.v4.annotations.generate_freq.annotate_freq_index_dict(ht)[source]

Create frequency index dictionary.

The keys are the strata over which frequency aggregations were calculated and the values are the strata’s index in the frequency array.

Parameters:

ht (Table) – Input Table.

Return type:

Table

Returns:

Table with ‘freq_index_dict’ global field.
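
To illustrate the mapping described above, here is a plain-Python sketch that builds such a dictionary from a list of strata. It does not use Hail, and the label scheme (stratum values joined in sorted-key order) and the helper name build_freq_index_dict are assumptions for illustration, not necessarily the key format the function produces.

```python
from typing import Dict, List


def build_freq_index_dict(freq_meta: List[Dict[str, str]]) -> Dict[str, int]:
    """Map a label for each stratum to its position in the frequency array."""
    index_dict = {}
    for i, meta in enumerate(freq_meta):
        # Hypothetical label scheme: stratum values joined in sorted-key order.
        label = "_".join(meta[key] for key in sorted(meta))
        index_dict[label] = i
    return index_dict


freq_meta = [
    {"group": "adj"},
    {"group": "raw"},
    {"group": "adj", "gen_anc": "afr"},
]
freq_index_dict = build_freq_index_dict(freq_meta)
print(freq_index_dict)
```

A lookup such as freq[freq_index_dict[label]] then retrieves the call statistics for one stratum without scanning 'freq_meta'.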

gnomad_qc.v4.annotations.generate_freq.filter_freq_arrays_for_non_ukb_subset(ht, items_to_filter, keep=True, combine_operator='and', annotations=('freq', 'high_ab_hets_by_group'), remove_subset_from_meta=False)[source]

Filter frequency arrays by metadata.

Filter ‘annotations’ and freq_meta array fields to only items_to_filter by using the ‘freq_meta’ array field values.

If remove_subset_from_meta is True, update ‘freq_meta’ dicts by removing items with the key “subset”. If False, update ‘freq_meta’ dicts to include a “subset” key with “non_ukb” value.

Also rename the ‘downsamplings’ global field to “non_ukb_downsamplings” so we can merge them later without losing the non-UKB downsampling information.

Parameters:
  • ht (Table) – Input Table.

  • items_to_filter (Union[List[str], Dict[str, List[str]]]) – Items to filter by.

  • keep (bool) – Whether to keep or remove items. Default is True.

  • combine_operator (str) – Operator (“and” or “or”) to use when combining items in ‘items_to_filter’. Default is “and”.

  • annotations (Union[List[str], Tuple[str]]) – Annotations in ‘ht’ to filter by items_to_filter.

  • remove_subset_from_meta (bool) – Whether to remove the “subset” key from ‘freq_meta’ or add “subset” key with “non_ukb” value. Default is False.

Return type:

Table

Returns:

Table with filtered ‘annotations’ and ‘freq_meta’ array fields.
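
The interaction of keep and combine_operator can be sketched in plain Python on parallel lists standing in for 'freq_meta' and 'freq'. This is an illustration under assumptions, not the function's implementation: filter_by_meta is a hypothetical helper, only the Dict form of items_to_filter is handled, and the "subset" rewriting and downsamplings rename described above are left out.

```python
from typing import Dict, List, Sequence, Tuple


def filter_by_meta(
    freq_meta: List[Dict[str, str]],
    freq: Sequence,
    items_to_filter: Dict[str, List[str]],
    keep: bool = True,
    combine_operator: str = "and",
) -> Tuple[List[Dict[str, str]], list]:
    """Filter parallel metadata and value lists by metadata key/value pairs."""
    if combine_operator not in ("and", "or"):
        raise ValueError("combine_operator must be 'and' or 'or'")
    combine = all if combine_operator == "and" else any
    kept_meta, kept_freq = [], []
    for meta, value in zip(freq_meta, freq):
        matches = combine(
            meta.get(key) in values for key, values in items_to_filter.items()
        )
        # keep=True retains matching entries; keep=False retains the rest.
        if matches == keep:
            kept_meta.append(meta)
            kept_freq.append(value)
    return kept_meta, kept_freq


freq_meta = [
    {"group": "adj"},
    {"group": "adj", "subset": "non_ukb"},
    {"group": "raw", "subset": "non_ukb"},
]
freq = [10, 4, 5]
subset_meta, subset_freq = filter_by_meta(freq_meta, freq, {"subset": ["non_ukb"]})
print(subset_meta, subset_freq)
```

Filtering the metadata and the value arrays in the same pass keeps their indices aligned, which is the property the frequency index dictionary relies on.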

gnomad_qc.v4.annotations.generate_freq.high_ab_het(entry, col)[source]

Determine if a call is considered a high allele balance heterozygous call.

High allele balance heterozygous calls were introduced in certain GATK versions. Track how many calls appear at each site to correct them to homozygous alternate calls downstream in frequency calculations and histograms.

Assumes the following annotations are present in entry struct:
  • GT

  • adj

  • _high_ab_het_ref

Assumes the following annotations are present in col struct:
  • fixed_homalt_model

Parameters:
  • entry – Entry struct.

  • col – Column struct.

Return type:

Int32Expression

Returns:

1 if high allele balance heterozygous call, else 0.
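
As a rough illustration of how the listed annotations could combine into the 0/1 flag, consider the plain-Python sketch below. The exact boolean combination is an assumption (a heterozygous call that passes adj, is flagged as high AB, and comes from a sample not called with a fixed GATK version); the real function operates on Hail expressions, and high_ab_het_flag is a hypothetical name.

```python
def high_ab_het_flag(
    is_het: bool,
    adj: bool,
    high_ab_het_ref: bool,
    fixed_homalt_model: bool,
) -> int:
    """Return 1 for a high allele balance het call that needs correction."""
    # Assumed logic: samples called with the fixed hom-alt model never count.
    return int(is_het and adj and high_ab_het_ref and not fixed_homalt_model)


# Per-site tally over three illustrative calls; only the first one counts.
calls = [
    (True, True, True, False),
    (True, True, True, True),   # sample already uses the fixed model
    (True, False, True, False),  # fails adj
]
n_high_ab_hets = sum(high_ab_het_flag(*call) for call in calls)
print(n_high_ab_hets)
```

Returning an integer rather than a boolean lets the flag be summed directly per site and per sample grouping.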

gnomad_qc.v4.annotations.generate_freq.generate_freq_ht(mt, ds_ht, meta_ht, non_ukb_ds_ht=None)[source]

Generate frequency Table.

Assumes all necessary annotations are present:
mt annotations:
  • GT

  • adj

  • _high_ab_het_ref

  • fixed_homalt_model

ds_ht annotations:
  • downsampling

  • downsamplings

  • ds_pop_counts

meta_ht annotations:
  • pop

  • sex_karyotype

  • gatk_version

  • age

  • ukb_sample

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • ds_ht (Table) – Table with downsampling annotations.

  • meta_ht (Table) – Table with sample metadata annotations.

  • non_ukb_ds_ht (Optional[Table]) – Optional Table with non-UKB downsampling annotations.

Return type:

Table

Returns:

Hail Table with frequency annotations.

gnomad_qc.v4.annotations.generate_freq.densify_and_prep_vds_for_freq(vds, ab_cutoff=0.9)[source]

Densify VDS and select necessary annotations for frequency and histogram calculations.

Select entry annotations required for downstream work. ‘DP’, ‘GQ’, and ‘_het_ab’ are required for histograms. ‘GT’ and ‘adj’ are required for frequency calculations. ‘_high_ab_het_ref’ is required for high AB call corrections in frequency and histogram annotations.

Assumes all necessary annotations are present:
  • adj

  • _het_non_ref

  • GT

  • GQ

  • AD

  • DP

  • sex_karyotype

Parameters:
  • vds (VariantDataset) – Input VDS.

  • ab_cutoff (float) – Allele balance cutoff to use for high AB het annotation.

Return type:

MatrixTable

Returns:

Dense MatrixTable with only necessary entry annotations.
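
The role of ab_cutoff can be sketched in plain Python. The definition of allele balance used here (alternate allele depth divided by the sum of ‘AD’), the strict comparison against the cutoff, and the exclusion of het non-ref calls are all assumptions made for illustration; both helper names are hypothetical.

```python
from typing import Optional, Sequence


def het_allele_balance(ad: Sequence[int]) -> Optional[float]:
    """Fraction of reads supporting the alternate allele (assumed definition)."""
    total = sum(ad)
    return ad[1] / total if total > 0 else None


def is_high_ab_het_ref(
    is_het: bool,
    het_non_ref: bool,
    ad: Sequence[int],
    ab_cutoff: float = 0.9,
) -> bool:
    """Flag a ref/alt het call whose allele balance exceeds the cutoff."""
    ab = het_allele_balance(ad)
    return bool(is_het and not het_non_ref and ab is not None and ab > ab_cutoff)


print(is_high_ab_het_ref(True, False, (1, 19)))   # 19 of 20 reads are alternate
print(is_high_ab_het_ref(True, False, (10, 10)))  # balanced het
```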

gnomad_qc.v4.annotations.generate_freq.get_downsampling_ht(mt, non_ukb=False)[source]

Get Table with downsampling groups for all samples or the non-UKB subset.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • non_ukb (bool) – Whether to get downsampling groups for the non-UKB subset. Default is False.

Return type:

Table

Returns:

Table with downsampling groups.

gnomad_qc.v4.annotations.generate_freq.update_non_ukb_freq_ht(freq_ht)[source]

Update non-UKB subset frequencies to be ready for combining with the frequencies of other samples.

Duplicates frequency info for all groups except the non-UKB specific downsamplings and adds “subset” annotation to ‘freq_meta’ for one of the duplicates.

This allows for the non-UKB subset frequencies to be merged with the frequencies of the other samples to provide full dataset frequencies while keeping the non-UKB subset specific frequency information.

Parameters:

freq_ht (Table) – Non-UKB frequency Table.

Return type:

Table

Returns:

Restructured non-UKB frequency Table.

gnomad_qc.v4.annotations.generate_freq.mt_hists_fields(mt)[source]

Annotate quality metrics histograms and age histograms onto MatrixTable.

Parameters:

mt (MatrixTable) – Input MatrixTable.

Return type:

StructExpression

Returns:

Struct with quality metrics histograms and age histograms.

gnomad_qc.v4.annotations.generate_freq.combine_freq_hts(freq_hts, row_annotations, global_annotations, age_hists=['age_hist_het', 'age_hist_hom'], qual_hists=['gq_hist_all', 'dp_hist_all', 'gq_hist_alt', 'dp_hist_alt', 'ab_hist_alt'])[source]

Combine frequency HTs into a single HT.

Parameters:
  • freq_hts (Dict[str, Table]) – Dictionary of frequency HTs.

  • row_annotations (List[str]) – List of annotations to put onto one hail Table.

  • global_annotations (List[str]) – List of global annotations to put onto one hail Table.

  • age_hists (List[str]) – List of age histogram annotations to merge.

  • qual_hists (List[str]) – List of quality histogram annotations to merge.

Return type:

Table

Returns:

HT with all freq_hts annotations.
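
Conceptually, combining frequencies from disjoint sample sets amounts to summing call statistics stratum by stratum and recomputing AF. The plain-Python sketch below illustrates that idea only; merge_freq is a hypothetical helper, the labels and counts are made up, and the actual function works on Hail Tables and also merges histograms.

```python
from typing import Dict, Optional, Union

Stats = Dict[str, Union[int, Optional[float]]]


def merge_freq(freq_a: Dict[str, Stats], freq_b: Dict[str, Stats]) -> Dict[str, Stats]:
    """Sum per-stratum call stats from two {label: stats} mappings."""
    labels = list(freq_a) + [label for label in freq_b if label not in freq_a]
    merged = {}
    for label in labels:
        ac = an = hom = 0
        for freq in (freq_a, freq_b):
            stats = freq.get(label)
            if stats is not None:  # a stratum may exist in only one input
                ac += stats["AC"]
                an += stats["AN"]
                hom += stats["homozygote_count"]
        merged[label] = {
            "AC": ac,
            "AN": an,
            "homozygote_count": hom,
            "AF": ac / an if an else None,  # recomputed, never averaged
        }
    return merged


ukb = {"adj": {"AC": 4, "AN": 100, "homozygote_count": 1}}
non_ukb = {
    "adj": {"AC": 6, "AN": 300, "homozygote_count": 0},
    "non_ukb_adj": {"AC": 6, "AN": 300, "homozygote_count": 0},
}
merged = merge_freq(ukb, non_ukb)
print(merged["adj"])
```

Recomputing AF from the summed AC and AN avoids the bias that averaging per-subset AFs would introduce when subsets differ in size.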

gnomad_qc.v4.annotations.generate_freq.correct_for_high_ab_hets(ht, af_threshold=0.01)[source]

Correct for high allele balance heterozygous calls in call statistics and histograms.

High allele balance GTs were called heterozygous instead of homozygous alternate by GATK versions released prior to 4.1.4.1. This function corrects for those calls by adjusting them to homozygous alternate within our call statistics and histograms when a site’s allele frequency is greater than the passed af_threshold. Raw data is not adjusted.

Parameters:
  • ht (Table) – Table with frequency and histogram annotations for correction as well as high AB het annotations.

  • af_threshold (float) – Allele frequency threshold for high AB adjustment. Default is 0.01.

Return type:

Table

Returns:

Table with corrected call statistics and histograms.
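
The arithmetic behind the correction can be illustrated in plain Python: converting a heterozygous call to homozygous alternate adds one alternate allele and one homozygote while leaving AN unchanged. This is a sketch under assumptions, not the function's implementation; in particular, which AF is compared against af_threshold is assumed here to be the site's uncorrected AF, and correct_call_stats is a hypothetical name.

```python
from typing import Tuple


def correct_call_stats(
    ac: int,
    an: int,
    homozygote_count: int,
    n_high_ab_hets: int,
    af_threshold: float = 0.01,
) -> Tuple[int, int, int]:
    """Return (AC, AN, homozygote_count) after the high AB het adjustment."""
    af = ac / an if an > 0 else 0.0
    if af <= af_threshold:
        # At or below the threshold the call statistics are left untouched.
        return ac, an, homozygote_count
    # Each het -> hom alt conversion adds one alternate allele and one homozygote.
    return ac + n_high_ab_hets, an, homozygote_count + n_high_ab_hets


common = correct_call_stats(ac=50, an=1000, homozygote_count=5, n_high_ab_hets=3)
rare = correct_call_stats(ac=5, an=1000, homozygote_count=0, n_high_ab_hets=1)
print(common, rare)
```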

gnomad_qc.v4.annotations.generate_freq.generate_faf_grpmax(ht)[source]

Compute filtering allele frequencies (‘faf’), ‘grpmax’, and ‘gen_anc_faf_max’ with the AB-adjusted frequencies.

Parameters:

ht (Table) – Hail Table containing ‘freq’, ‘ab_adjusted_freq’, ‘high_ab_het’ annotations.

Return type:

Table

Returns:

Hail Table with ‘faf’, ‘grpmax’, and ‘gen_anc_faf_max’ annotations.
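
The ‘grpmax’ idea (the group with the highest allele frequency at a site) can be sketched in plain Python. The group labels and counts are made up, skipping groups with AC of 0 is an assumption, and the filtering allele frequency calculation is not shown.

```python
from typing import Dict, Optional


def grpmax(freq_by_group: Dict[str, Dict[str, int]]) -> Optional[Dict[str, object]]:
    """Return call stats for the group with the highest AF, or None."""
    best = None
    for group, stats in freq_by_group.items():
        if stats["AC"] == 0 or stats["AN"] == 0:
            continue  # assumed: groups without the allele are not candidates
        af = stats["AC"] / stats["AN"]
        if best is None or af > best["AF"]:
            best = {"gen_anc": group, "AC": stats["AC"], "AN": stats["AN"], "AF": af}
    return best


result = grpmax(
    {
        "afr": {"AC": 3, "AN": 1000},
        "nfe": {"AC": 10, "AN": 20000},
        "eas": {"AC": 0, "AN": 500},
    }
)
print(result)
```

Note that the group with the largest AC is not necessarily the one with the largest AF, which is why the comparison is made on AC / AN.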

gnomad_qc.v4.annotations.generate_freq.compute_inbreeding_coeff(ht)[source]

Compute inbreeding coefficient using raw call stats.

Parameters:

ht (Table) – Hail Table containing ‘freq’ array with struct entries of ‘AC’, ‘AN’, and ‘homozygote_count’.

Return type:

Table

Returns:

Hail Table with inbreeding coefficient annotation.
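
For a bi-allelic site, an inbreeding coefficient can be derived from the three call stats named above as one minus the ratio of observed to expected heterozygotes. The sketch below is plain Python under assumptions (all samples diploid, expected heterozygotes from Hardy-Weinberg proportions); it is not necessarily the exact expression the function evaluates.

```python
from typing import Optional


def inbreeding_coeff(ac: int, an: int, homozygote_count: int) -> Optional[float]:
    """Return 1 - observed_hets / expected_hets for a bi-allelic site."""
    if an == 0:
        return None
    n_samples = an / 2  # assumed diploid: two alleles per called sample
    q = ac / an  # alternate allele frequency
    p = 1 - q
    observed_hets = ac - 2 * homozygote_count
    expected_hets = 2 * p * q * n_samples
    if expected_hets == 0:
        return None  # monomorphic site: the coefficient is undefined
    return 1 - observed_hets / expected_hets


hwe = inbreeding_coeff(ac=100, an=200, homozygote_count=25)
all_hom = inbreeding_coeff(ac=20, an=200, homozygote_count=10)
print(hwe, all_hom)
```

A value near 0 is consistent with Hardy-Weinberg proportions, positive values indicate a deficit of heterozygotes, and negative values indicate an excess.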

gnomad_qc.v4.annotations.generate_freq.create_final_freq_ht(ht)[source]

Create final freq Table with only desired annotations.

Parameters:

ht (Table) – Hail Table containing all annotations.

Return type:

Table

Returns:

Hail Table with final annotations.

gnomad_qc.v4.annotations.generate_freq.split_vds(vds, strata_expr)[source]

Split a VDS into multiple VDSs based on strata_expr.

Parameters:
  • vds (VariantDataset) – Input VDS.

  • strata_expr (Expression) – Expression on VDS variant_data MT columns that will be used to determine if a sample belongs to a certain split or subset of the VDS.

Return type:

Dict[str, VariantDataset]

Returns:

Dictionary where strata value is key and VDS is value.
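
The shape of the returned dictionary can be illustrated without Hail by grouping sample IDs on their strata value. In this sketch, lists of sample IDs stand in for the per-stratum VDSs, and the helper name, sample IDs, and strata values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def split_samples_by_strata(strata_by_sample: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sample IDs by strata value, mirroring a per-stratum split."""
    groups = defaultdict(list)
    for sample, stratum in strata_by_sample.items():
        groups[stratum].append(sample)
    return dict(groups)


splits = split_samples_by_strata({"s1": "ukb", "s2": "non_ukb", "s3": "ukb"})
print(splits)
```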

gnomad_qc.v4.annotations.generate_freq.main(args)[source]

Script to generate frequency and dense dependent annotations on v4 exomes.