gnomad_qc.v4.annotations.fix_freq_an
Updates the v4.0 freq HT with the correct AN for the v4.1 release.
There is an error in the v4.0 freq HT where the AN can be incorrect for variants that are only present in one of the UKB or non-UKB subsets. This is because the frequency data was computed on each subset separately and then combined. In order to filter the dataset to only samples that are in the UKB or non-UKB subsets, the hail.vds.filter_samples was used. Unfortunately, the behavior of this function was not the expected behavior due to the following line of code:
variant_data = variant_data.filter_rows(hl.agg.count() > 0)
This line of code filters any row that has no genotypes in the sample subset. When the VariantDataset is densified prior to the frequency calculations this results in no reference data being filled in for that row, unless a row is part of multiallelic site that is not exclusive to the sample subset, and therefore a missing AN for that row. When combining the frequency data for the UKB and non-UKB subsets, the AN for these rows will only have the AN for the subset that has non-ref genotypes for that row.
This script performs the following steps:
Generates allele number and GQ/DP histograms for all sites in the exome calling interval.
For raw genotypes, the AN is just the sum of the ploidy (more specifics on this below) for all genotypes.
For adj genotypes, the AN is the sum of the ploidy for all genotypes that pass the GQ, DP, and allele balance adj filters. Since this is for a site, rather than a specific variant, this number will not always be the same as the AN for a specific variant at that site. This AN can be smaller if it is a multi-allelic site where there are non-ref genotypes for another variant that fails the adj filters.
For raw genotypes on non-PAR regions of the X or Y chromosome, the AN is the sum of the ploidy after adjusting for sex karyotype, BUT NOT taking into account the genotype when adjusting the ploidy. For instance, a het genotype for an XY sample will still have a ploidy of 1, even though the genotype is a het and is therefore set to missing for the frequency calculations.
Generates allele number and GQ/DP histograms for the frequency correction.
Module Functions
|
Prepare VDS for all sites stats. |
|
Compute allele number and histograms for het fail adj ab and non-PAR XY het. |
|
Compute allele number per reference site and histograms for frequency correction. |
|
Compute allele number and histograms for het non ref adjustments. |
Subtract sub histograms from histograms. |
|
|
Adjust allele number and histograms for frequency correction. |
|
Adjust all sites allele number and histograms Table. |
|
Update frequency HT AN field and histograms with the correct AN from the AN HT. |
|
Drop GATK groupings from the frequency HT. |
Script to generate all sites AN and v4.0 exomes frequency fix. |
Updates the v4.0 freq HT with the correct AN for the v4.1 release.
There is an error in the v4.0 freq HT where the AN can be incorrect for variants that are only present in one of the UKB or non-UKB subsets. This is because the frequency data was computed on each subset separately and then combined. In order to filter the dataset to only samples that are in the UKB or non-UKB subsets, the hail.vds.filter_samples was used. Unfortunately, the behavior of this function was not the expected behavior due to the following line of code:
variant_data = variant_data.filter_rows(hl.agg.count() > 0)
This line of code filters any row that has no genotypes in the sample subset. When the VariantDataset is densified prior to the frequency calculations this results in no reference data being filled in for that row, unless a row is part of multiallelic site that is not exclusive to the sample subset, and therefore a missing AN for that row. When combining the frequency data for the UKB and non-UKB subsets, the AN for these rows will only have the AN for the subset that has non-ref genotypes for that row.
This script performs the following steps:
Generates allele number and GQ/DP histograms for all sites in the exome calling interval.
For raw genotypes, the AN is just the sum of the ploidy (more specifics on this below) for all genotypes.
For adj genotypes, the AN is the sum of the ploidy for all genotypes that pass the GQ, DP, and allele balance adj filters. Since this is for a site, rather than a specific variant, this number will not always be the same as the AN for a specific variant at that site. This AN can be smaller if it is a multi-allelic site where there are non-ref genotypes for another variant that fails the adj filters.
For raw genotypes on non-PAR regions of the X or Y chromosome, the AN is the sum of the ploidy after adjusting for sex karyotype, BUT NOT taking into account the genotype when adjusting the ploidy. For instance, a het genotype for an XY sample will still have a ploidy of 1, even though the genotype is a het and is therefore set to missing for the frequency calculations.
Generates allele number and GQ/DP histograms for the frequency correction.
- gnomad_qc.v4.annotations.fix_freq_an.prep_vds_for_all_sites_stats(vds)[source]
Prepare VDS for all sites stats.
Adds the following annotations to the VDS:
‘ploidy’: The ploidy of the genotype adjusted for sex karyotype.
‘adj’: The adj filter for GQ and DP.
The following steps are performed on the reference data MT:
Segments the reference data MT at the PAR boundaries on chrX and chrY.
This is done to ensure that entries are annotated with the correct ‘ploidy’ field when adjusting for sex karyotype.
If the END field of an entry spans a PAR boundary, the entry will be split into two separate entries so that a locus ‘in_non_par’ annotation can be used to correctly annotate if the full reference block defined by an entry is non-PAR or not.
Annotates each entry with the ploidy of the reference block adjusting for sex karyotype.
chrY non-PAR XX entries are set to missing ploidy.
chrX and chrY non-PAR XY entries are set to ploidy 1.
All other entries are set to ploidy 2.
Annotates each entry with the adj filter for GQ and DP taking into account sex karyotype for DP.
The following steps are performed on the variant data MT:
Annotates each entry with the ploidy of the genotype adjusting for sex karyotype.
chrY non-PAR XX entries are set to missing ploidy.
chrX and chrY non-PAR XY entries are set to ploidy 1 if LGT is defined.
This deviates from our standard ploidy adjustment where if the genotype is a het, the ploidy is set to missing. This is because we are adjusting ploidy before splitting multi-allelic sites and this missing annotation will be propagated to all split entries including the reference genotypes causing discrepancies in adjusting ploidy before splitting and after splitting. Since we are using this VDS downstream to compute a reference AN, we want to keep this as a ploidy of 1 even if the genotype is a het.
All other entries are set to the ploidy of the genotype.
Annotates each entry with the adj filter for GQ and DP taking into account sex karyotype for DP.
The sex karyotype and locus are used instead of the genotype to determine if the genotype is haploid or not. This is for the same reason as described above for the ploidy adjustment of non-PAR XY het genotypes.
- Parameters:
vds (
VariantDataset
) – Hail VDS to annotate adj onto variant data.- Return type:
- Returns:
Hail VDS with adj annotation.
- gnomad_qc.v4.annotations.fix_freq_an.compute_an_and_hists_het_fail_adj_ab(vds, group_membership_ht, freq_ht)[source]
Compute allele number and histograms for het fail adj ab and non-PAR XY het.
This module computes the allele number and quality histograms for only the genotypes that are het and fail the adj allele balance or are non-PAR XY het genotypes.
- Parameters:
vds (
VariantDataset
) – Input VariantDataset.group_membership_ht (
Table
) – Table of samples group memberships.freq_ht (
Table
) – Table of frequency data.
- Return type:
- Returns:
Table of allele number and histograms for het fail adj ab.
- gnomad_qc.v4.annotations.fix_freq_an.compute_allele_number_per_ref_site(vds, reference_ht, interval_ht, group_membership_ht)[source]
Compute allele number per reference site and histograms for frequency correction.
- Parameters:
vds (
VariantDataset
) – Input VariantDataset.reference_ht (
Table
) – Table of reference sites.interval_ht (
Table
) – Table of intervals.group_membership_ht (
Table
) – Table of samples group memberships.
- Return type:
- Returns:
Table of AN and GQ/DP hists per reference site.
- gnomad_qc.v4.annotations.fix_freq_an.get_all_sites_nonref_adjustment(vds, group_membership_ht)[source]
Compute allele number and histograms for het non ref adjustments.
The adjustment numbers computed in compute_an_and_hists_het_fail_adj_ab are correct for per allele AN and histogram adjustments for the frequency data, but they are not correct for the all sites AN and histogram adjustments. This is because when computing the sum of all per allele adjustment values, the heterozygous non-ref genotypes can be counted 2 times if both alleles are adjusted.
This function will compute the number of het non-ref genotypes that are being counted twice so the all sites AN and histograms can be adjusted correctly.
- Parameters:
vds (
VariantDataset
) – Input VariantDataset.group_membership_ht (
Table
) – Table of samples group memberships.
- Return type:
- Returns:
Table of allele number and histograms for het fail adj ab.
- gnomad_qc.v4.annotations.fix_freq_an.get_sub_hist_expr(hist_expr, sub_hist_expr)[source]
Subtract sub histograms from histograms.
- Parameters:
hist_expr (
StructExpression
) – Histograms to subtract from.sub_hist_expr (
StructExpression
) – Histograms to subtract.
- Return type:
- Returns:
Subtracted histograms.
- gnomad_qc.v4.annotations.fix_freq_an.adjust_per_site_an_and_hists_for_frequency(ht, freq_ht, het_fail_adj_ab_ht)[source]
Adjust allele number and histograms for frequency correction.
The all sites values on ht are only the reference AN and quality histograms at the site level (keyed by locus) and don’t take into account the differences in the alleles. This function will adjust the all sites AN and quality histograms by the het_fail_adj_ab_ht (computed by compute_an_and_hists_het_fail_adj_ab) so they can be used for the variant frequency correction.
The reasons a variant’s AN and quality histograms can be different from the all sites values are:
The reference AN adj filter doesn’t take into account the allele balance adj filter, so the variant AN can be smaller than the reference AN if there are samples with het genotypes that fail the adj allele balance filter.
The reference AN doesn’t remove the non-PAR het genotypes in XY samples, so the variant AN can be smaller than the reference AN if there are XY samples with a het genotype in a non-PAR region on chrX or chrY.
- gnomad_qc.v4.annotations.fix_freq_an.adjust_all_sites_an_and_hists(ht, het_fail_adj_ab_ht, het_nonref_ht)[source]
Adjust all sites allele number and histograms Table.
- Parameters:
- Return type:
- Returns:
Adjusted all sites AN and GQ/DP hists Table.
- gnomad_qc.v4.annotations.fix_freq_an.update_freq_an_and_hists(freq_ht, freq_correction_ht)[source]
Update frequency HT AN field and histograms with the correct AN from the AN HT.
This module updates the freq_ht AN field which is an array of ints with the correct AN from the AN HT integer array annotation. It uses the freq_correction_ht dictionary and the freq_ht dictionary to match the indexing of array annotations so each group is correctly updated. It then recomputes AF, AC/AN.