gnomad_qc.v4.annotations.generate_freq_genomes
Script to create frequencies HT for v4.0 genomes.
This script is written specifically for the v4.0 genomes release to handle the addition and subtraction of samples from the HGDP + 1KG subset. The HGDP + 1KG subset has updated sample QC annotations. The addition of new samples from the densified MT will introduce new variants, which requires recomputing AN for the v3.1 release samples and merging it with the call stats of the v3.1.2 release sites HT and of the updated HGDP + 1KG subset. In addition, the code also computes the other call stats related annotations, including filtering allele frequencies (‘faf’), ‘grpmax’, ‘gen_anc_faf_max’ and ‘InbreedingCoeff’, as done in v4.0 exomes.
This script creates the v4.0 genomes release HT.
usage: gnomad_qc.v4.annotations.generate_freq_genomes.py [-h] [--overwrite]
[--slack-channel SLACK_CHANNEL]
[--test-pcsk9 | --test-x-gene]
[--get-related-to-nonsubset]
[--get-hgdp-tgp-v4-exome-duplicates]
[--update-annotations]
[--get-callstats-for-updated-samples]
[--join-callstats-for-update]
[--compute-allele-number-for-new-variants]
[--compute-allele-number-for-pop-diff]
[--update-release-callstats]
[--apply-patch-to-freq-ht]
Named Arguments
- --overwrite
Overwrite data
Default: False
- --slack-channel
Slack channel to post results and notifications to.
- --test-pcsk9
Test on a subset of variants in PCSK9 gene.
Default: False
- --test-x-gene
Test on a subset of variants in TRPC5 on chrX.
Default: False
- --get-related-to-nonsubset
Get the relatedness to nonsubset samples.
Default: False
- --get-hgdp-tgp-v4-exome-duplicates
Get the HGDP + 1KG subset samples that are duplicates of samples in the v4.0 exomes.
Default: False
- --update-annotations
Update HGDP + 1KG sample QC annotations.
Default: False
- --get-callstats-for-updated-samples
Get the call stats for the genomes with an updated release status for v4.0 compared to v3.1.
Default: False
- --join-callstats-for-update
Join the call stats Tables for the updated samples with the release call stats Table.
Default: False
- --compute-allele-number-for-new-variants
Compute the allele number for variants that are in the updated samples but not in the release HT.
Default: False
- --compute-allele-number-for-pop-diff
Compute the allele number of all v4.0 variants for the samples that have different pop labels in v4.0.
Default: False
- --update-release-callstats
Update the release call stats by merging the call stats for the updated samples.
Default: False
- --apply-patch-to-freq-ht
Apply a patch to the final freq HT.
Default: False
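The `[--test-pcsk9 | --test-x-gene]` notation in the usage line means the two test flags are mutually exclusive. A minimal sketch of how such a parser could be declared with `argparse` follows. `build_parser` is a hypothetical name, only a few of the flags are reproduced, and this is not the script's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a few of the flags listed above."""
    parser = argparse.ArgumentParser(description="Sketch of the script's CLI.")
    parser.add_argument("--overwrite", action="store_true", help="Overwrite data")
    parser.add_argument("--slack-channel", help="Slack channel for notifications.")
    # The usage line shows [--test-pcsk9 | --test-x-gene]: at most one may be set.
    test_group = parser.add_mutually_exclusive_group()
    test_group.add_argument("--test-pcsk9", action="store_true")
    test_group.add_argument("--test-x-gene", action="store_true")
    parser.add_argument("--update-annotations", action="store_true")
    return parser


args = build_parser().parse_args(["--test-pcsk9", "--update-annotations"])
print(args.test_pcsk9, args.test_x_gene, args.update_annotations)
```

Passing both test flags at once makes `argparse` exit with a usage error, which is the behavior the usage line implies.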
Module Functions
Map of HGDP populations that need to be renamed in the v4.0 genomes release.
Frequency Tables to join for creation of the v4.0 genomes release sites HT.
Global annotations on the frequency Table.
Filter to a region in PCSK9 or TRPC5 in Table or MatrixTable for testing purposes.
Get the samples in the HGDP + 1KG subset that were related to samples outside the subset in the v3.1 release and were not included in the v3.1 release.
Get the samples in the HGDP + 1KG subset that are duplicates of an exome in the v4.0 release.
Add updated sample QC annotations to the HGDP + 1KG subset.
Get the samples in the HGDP + 1KG subset that will be added, subtracted, or have different pop labels in the v4.0 release compared to the v3.1 release.
Calculate the call stats for samples in samples_ht.
Concatenate annotations on multiple Tables into a single Table.
Calculate call stats for samples in samples_ht grouped by all stratifications in the v3.1 release call stats.
Use filter_arrays_by_meta to filter annotations and globals by 'freq_meta' and annotate them back on ht.
Get the freq sample counts from the v3.1 subsets.
Update the population labels in the 'freq_meta' annotation on ht.
Join the release HT with the call stats for added and subtracted samples.
Generate a Table with a 'group_membership' array for each sample indicating whether the sample belongs to specific stratification groups.
Compute the allele number for new variants in the v4.0 release by call stats group membership.
Generate the call stats for the v4.0 genomes release.
Finalize the call stats for the v4.0 genomes release.
Get histograms for the v4.0 genomes release from the v3.1 release.
Set the downsampling call stats for new variants in ht to missing.
Get the updated 'age_distribution' annotation for v4.0 genomes.
Get the group membership for samples in pop_diff for the v3.1 VDS.
Patch the call stats for some inconsistent variant calls found during validity checks.
Get PipelineResourceCollection for all resources needed to create the gnomAD v4.0 genomes release.
Create the v4.0 genomes release Table.
Get script argument parser.
- gnomad_qc.v4.annotations.generate_freq_genomes.SUBSETS = ['non_v2', 'non_topmed', 'non_cancer', 'controls_and_biobanks', 'non_neuro']
List of subsets in the v3.1/v4.0 genomes release.
- gnomad_qc.v4.annotations.generate_freq_genomes.POP_MAP = {'bantusafrica': 'bantusouthafrica', 'biakaPygmy': 'biaka', 'italian': 'bergamoitalian', 'mbutiPygmy': 'mbuti', 'melanesian': 'bougainville', 'miaozu': 'miao', 'mongola': 'mongolian', 'oth': 'remaining', 'yizu': 'yi'}
Map of HGDP populations that need to be renamed in the v4.0 genomes release. Also includes the renaming of ‘oth’ to ‘remaining’.
- gnomad_qc.v4.annotations.generate_freq_genomes.JOIN_FREQS = ['release', 'pop_diff', 'added', 'subtracted']
Frequency Tables to join for creation of the v4.0 genomes release sites HT.
- gnomad_qc.v4.annotations.generate_freq_genomes.FREQ_GLOBALS = ('freq_meta', 'freq_meta_sample_count')
Global annotations on the frequency Table.
- gnomad_qc.v4.annotations.generate_freq_genomes.filter_to_test(t, gene_on_chrx=False)[source]
Filter to a region in PCSK9 or TRPC5 in Table or MatrixTable for testing purposes.
- Parameters:
t (Union[Table, MatrixTable, VariantDataset]) – Table or MatrixTable to filter.
gene_on_chrx (bool) – Whether to filter to TRPC5 (in the non-PAR region), instead of PCSK9, for testing chrX.
- Return type:
Union[Table, MatrixTable, VariantDataset]
- Returns:
Table or MatrixTable filtered to a region in PCSK9 or TRPC5.
Get the samples in the HGDP + 1KG subset that were related to samples outside the subset in the v3.1 release and were not included in the v3.1 release.
- Parameters:
- Return type:
- Returns:
Table with the samples in the HGDP + 1KG subset that are related to samples outside the subset in the v3 release.
- gnomad_qc.v4.annotations.generate_freq_genomes.get_hgdp_tgp_v4_exome_duplicates(v3_meta_ht, v4_meta_ht, rel_ht)[source]
Get the samples in the HGDP + 1KG subset that are duplicates of an exome in the v4.0 release.
The duplicated samples are defined as samples that were HGDP + 1KG subset samples in the v3.1 release and are also in the v4.0 exomes release. The duplicated samples have to be removed because we will have combined frequencies from v4.0 exomes and genomes.
- Parameters:
- Return type:
- Returns:
Table with the samples in the HGDP + 1KG subset that are duplicates of an exome in the v4.0 exomes release.
- gnomad_qc.v4.annotations.generate_freq_genomes.add_updated_sample_qc_annotations(ht)[source]
Add updated sample QC annotations to the HGDP + 1KG subset.
Note
The following annotations need to be updated for the v4.0 genomes release based on the latest sample QC results of the subset:
hgdp_tgp_meta.subcontinental_pca.outlier: to apply the updated pop outlier filter implemented by Alicia Martin’s group.
gnomad_sample_filters.hard_filtered: to apply the recomputed freemix filter for HGDP samples.
gnomad_sample_filters.related_to_nonsubset: to further filter out samples that are related to samples outside the subset but were not included in the v3 release.
gnomad_sample_filters.v4_exome_duplicate: to further filter out the samples in the HGDP + 1KG subset that are duplicates of an exome in the v4.0 release.
relatedness_inference.related: to apply the updated relatedness inference implemented by Alicia Martin’s group.
- gnomad_qc.v4.annotations.generate_freq_genomes.get_updated_release_samples(ht)[source]
Get the samples in the HGDP + 1KG subset that will be added, subtracted, or have different pop labels in the v4.0 release compared to the v3.1 release.
- Three sets of samples will be obtained:
samples that will have different pop labels in the v4.0 genomes release: samples in the to-be-split ‘Han’ and ‘Papuan’ populations AND their ‘gnomad_release’ status hasn’t changed.
samples that will be added to the v4.0 genomes release: samples where ‘gnomad_release’ status has changed and ‘gnomad_release’ is now True.
samples that will be removed from the v4.0 genomes release: samples where ‘gnomad_release’ status has changed and ‘gnomad_release’ is now False.
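The three sets can be illustrated with plain Python records. This is a sketch under stated assumptions: `classify_sample`, `SPLIT_POPS`, and the field names `v3_release`, `v4_release`, and `pop` are hypothetical, and the real function operates on a Hail Table rather than a dict. The pop_diff rule follows the description above literally.

```python
from typing import Dict, List, Optional

# Hypothetical lowercase labels for the populations that are being split.
SPLIT_POPS = {"han", "papuan"}


def classify_sample(v3_release: bool, v4_release: bool, pop: str) -> Optional[str]:
    """Return 'added', 'subtracted', 'pop_diff', or None for an unchanged sample."""
    if v3_release != v4_release:
        # Release status changed: the new status decides the set.
        return "added" if v4_release else "subtracted"
    if pop.lower() in SPLIT_POPS:
        # Release status unchanged, but the population is being split.
        return "pop_diff"
    return None


def group_updated_samples(samples: Dict[str, dict]) -> Dict[str, List[str]]:
    """Collect sample IDs into the three sets described above."""
    groups: Dict[str, List[str]] = {"pop_diff": [], "added": [], "subtracted": []}
    for sample_id, s in samples.items():
        label = classify_sample(s["v3_release"], s["v4_release"], s["pop"])
        if label is not None:
            groups[label].append(sample_id)
    return groups


samples = {
    "s1": {"v3_release": True, "v4_release": True, "pop": "Han"},
    "s2": {"v3_release": False, "v4_release": True, "pop": "San"},
    "s3": {"v3_release": True, "v4_release": False, "pop": "French"},
    "s4": {"v3_release": True, "v4_release": True, "pop": "French"},
}
groups = group_updated_samples(samples)
print(groups)
```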
- gnomad_qc.v4.annotations.generate_freq_genomes.calculate_callstats_for_selected_samples(mt, samples_ht, subsets=None)[source]
Calculate the call stats for samples in samples_ht.
- Parameters:
mt (MatrixTable) – MatrixTable with the HGDP + 1KG subset.
samples_ht (Table) – Table with selected samples and their metadata.
subsets (Optional[List[str]]) – Optional List of subsets to be included in the frequency metadata ‘freq_meta’ annotation returned by annotate_freq. e.g. [‘hgdp’], [‘tgp’], [‘hgdp’, ‘tgp’].
- Return type:
- Returns:
Table with the call stats for the selected samples.
- gnomad_qc.v4.annotations.generate_freq_genomes.concatenate_annotations(hts, global_field_names=('freq_meta', 'freq_meta_sample_count'), row_field_names=('freq',))[source]
Concatenate annotations on multiple Tables into a single Table.
- Parameters:
hts (List[Table]) – List of Tables with annotations to be concatenated.
global_field_names (Union[List[str], Tuple[str]]) – Global field names to concatenate.
row_field_names (Union[List[str], Tuple[str]]) – Row field names to concatenate.
- Return type:
- Returns:
Table with concatenated annotations.
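The concatenation can be pictured with plain Python containers. This is a sketch, not the Hail implementation: `concatenate_annotations_sketch` is a hypothetical name, each "Table" is modelled as a dict with `globals` and `rows`, and padding a missing variant with `None` entries is an assumption made so that row arrays stay parallel to the concatenated globals.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical stand-in for a Table: {"globals": {...}, "rows": {variant: {...}}}
TableLike = Dict[str, Any]


def concatenate_annotations_sketch(
    tables: List[TableLike],
    global_field_names: Sequence[str] = ("freq_meta", "freq_meta_sample_count"),
    row_field_names: Sequence[str] = ("freq",),
) -> TableLike:
    out: TableLike = {"globals": {}, "rows": {}}
    # Global arrays are concatenated in table order.
    for name in global_field_names:
        out["globals"][name] = [x for t in tables for x in t["globals"][name]]
    # Outer join on variant: every variant seen in any table gets a row.
    variants = sorted({v for t in tables for v in t["rows"]})
    for v in variants:
        row = {}
        for name in row_field_names:
            merged: List[Any] = []
            for t in tables:
                # Assumption: the first global field is parallel to the row arrays.
                width = len(t["globals"][global_field_names[0]])
                merged.extend(t["rows"].get(v, {}).get(name, [None] * width))
            row[name] = merged
        out["rows"][v] = row
    return out


a = {
    "globals": {"freq_meta": [{"group": "adj"}], "freq_meta_sample_count": [10]},
    "rows": {"v1": {"freq": [{"AC": 1, "AN": 20}]}},
}
b = {
    "globals": {
        "freq_meta": [{"group": "adj", "subset": "hgdp"}],
        "freq_meta_sample_count": [4],
    },
    "rows": {"v1": {"freq": [{"AC": 0, "AN": 8}]}, "v2": {"freq": [{"AC": 2, "AN": 8}]}},
}
merged = concatenate_annotations_sketch([a, b])
print(merged["rows"]["v2"]["freq"])
```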
- gnomad_qc.v4.annotations.generate_freq_genomes.get_hgdp_tgp_callstats_for_selected_samples(mt, samples_ht, compute_freq_all=True)[source]
Calculate call stats for samples in samples_ht grouped by all stratifications in the v3.1 release call stats.
- This function calculates call stats for:
all samples provided in the samples_ht (if compute_freq_all is True).
samples in only the HGDP subset.
samples in only the 1KG (tgp) subset.
The call stats in these three Tables are then concatenated to create a single Table with all call stats.
- Parameters:
mt (MatrixTable) – MatrixTable with the HGDP + 1KG subset.
samples_ht (Table) – Table with the samples to filter to before computing call stats.
compute_freq_all (bool) – Whether to compute the call stats for all samples in samples_ht instead of only the subset call stats. If False, only the HGDP and 1KG subset call stats are computed. Default is True.
- Return type:
- Returns:
Table with the call stats for the selected samples.
- gnomad_qc.v4.annotations.generate_freq_genomes.filter_freq_arrays(ht, items_to_filter, keep=True, combine_operator='and', annotations=('freq',))[source]
Use filter_arrays_by_meta to filter annotations and globals by ‘freq_meta’ and annotate them back on ht.
Filter annotations, ‘freq_meta’ and ‘freq_meta_sample_count’ fields to only items_to_filter by using the ‘freq_meta’ array field values.
- Parameters:
ht (Table) – Input Table.
items_to_filter (Union[List[str], Dict[str, List[str]]]) – Items to filter by.
keep (bool) – Whether to keep or remove items. Default is True.
combine_operator (str) – Operator (“and” or “or”) to use when combining items in ‘items_to_filter’. Default is “and”.
annotations (Union[List[str], Tuple[str]]) – Annotations in ‘ht’ to filter by items_to_filter.
- Return type:
- Returns:
Table with filtered ‘annotations’ and ‘freq_meta’ array fields.
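The idea of filtering parallel arrays by their metadata can be sketched in plain Python. The semantics below are a simplified assumption, not the behavior of the gnomAD `filter_arrays_by_meta` utility: a list in `items_to_filter` matches on key presence, a dict matches on key and value, and `filter_freq_arrays_sketch` is a hypothetical name.

```python
from typing import Any, Dict, List, Tuple, Union

Items = Union[List[str], Dict[str, List[str]]]


def _matches(meta: Dict[str, str], items: Items, combine_operator: str) -> bool:
    if isinstance(items, dict):
        checks = [meta.get(key) in values for key, values in items.items()]
    else:
        checks = [key in meta for key in items]
    return all(checks) if combine_operator == "and" else any(checks)


def filter_freq_arrays_sketch(
    freq_meta: List[Dict[str, str]],
    arrays: Dict[str, List[Any]],
    items_to_filter: Items,
    keep: bool = True,
    combine_operator: str = "and",
) -> Tuple[List[Dict[str, str]], Dict[str, List[Any]]]:
    if combine_operator not in ("and", "or"):
        raise ValueError("combine_operator must be 'and' or 'or'")
    # Indices whose metadata matches (keep=True) or does not match (keep=False).
    idx = [
        i
        for i, meta in enumerate(freq_meta)
        if _matches(meta, items_to_filter, combine_operator) == keep
    ]
    filtered = {name: [arr[i] for i in idx] for name, arr in arrays.items()}
    return [freq_meta[i] for i in idx], filtered


freq_meta = [
    {"group": "adj"},
    {"group": "adj", "pop": "afr"},
    {"group": "adj", "subset": "hgdp"},
    {"group": "adj", "downsampling": "10"},
]
arrays = {"freq": [0, 1, 2, 3], "freq_meta_sample_count": [100, 40, 9, 10]}

# Remove every downsampling group from all parallel arrays.
meta_no_ds, arrays_no_ds = filter_freq_arrays_sketch(
    freq_meta, arrays, ["downsampling"], keep=False
)
print(arrays_no_ds["freq"])
```

Because every array is indexed by the same list of positions, the filtered annotations stay aligned with the filtered ‘freq_meta’.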
- gnomad_qc.v4.annotations.generate_freq_genomes.annotate_v3_subsets_sample_count(ht)[source]
Get the freq sample counts from the v3.1 subsets.
The freq sample counts on the v3.1.2 release sites HT only contain the counts for the non-subset groupings. This function adds the freq sample counts from the v3.1 subset frequency groups to the v3.1.2 sample count array to give a sample count array matching the v3.1.2 ‘freq_meta’ global annotation.
- gnomad_qc.v4.annotations.generate_freq_genomes.update_pop_labels(ht, pop_map)[source]
Update the population labels in the ‘freq_meta’ annotation on ht.
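As an illustration of the renaming, the sketch below applies the POP_MAP values listed earlier on this page to a ‘freq_meta’-like list of dicts. It assumes the population label is stored under a `pop` key and works on plain Python lists; `update_pop_labels_sketch` is a hypothetical name, and the real function updates a global annotation on a Hail Table.

```python
from typing import Dict, List

# Values copied from the POP_MAP constant documented above.
POP_MAP = {
    "bantusafrica": "bantusouthafrica",
    "biakaPygmy": "biaka",
    "italian": "bergamoitalian",
    "mbutiPygmy": "mbuti",
    "melanesian": "bougainville",
    "miaozu": "miao",
    "mongola": "mongolian",
    "oth": "remaining",
    "yizu": "yi",
}


def update_pop_labels_sketch(
    freq_meta: List[Dict[str, str]], pop_map: Dict[str, str]
) -> List[Dict[str, str]]:
    """Rename 'pop' values found in pop_map; leave everything else untouched."""
    return [
        {k: (pop_map.get(v, v) if k == "pop" else v) for k, v in meta.items()}
        for meta in freq_meta
    ]


freq_meta = [
    {"group": "adj", "pop": "oth"},
    {"group": "adj", "pop": "nfe"},
    {"group": "adj", "subset": "hgdp", "pop": "miaozu"},
]
updated = update_pop_labels_sketch(freq_meta, POP_MAP)
print(updated)
```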
- gnomad_qc.v4.annotations.generate_freq_genomes.join_release_ht_with_subsets(ht, freq_hts)[source]
Join the release HT with the call stats for added and subtracted samples.
- gnomad_qc.v4.annotations.generate_freq_genomes.get_group_membership_ht_for_an(ht, only_pop_diff=False)[source]
Generate a Table with a ‘group_membership’ array for each sample indicating whether the sample belongs to specific stratification groups.
- ht must have the following annotations:
‘meta.population_inference.pop’: population label.
‘meta.sex_imputation.sex_karyotype’: sex label.
‘meta.project_meta.project_subpop’: subpopulation label.
‘meta.subsets’: dictionary with subset labels as keys and boolean values.
- Parameters:
ht (Table) – Table with the sample metadata.
only_pop_diff (bool) – Whether to only include the population stratification groups for samples in pop_diff. Default is False.
- Return type:
- Returns:
Table with the group membership for each sample to be used for computing allele number (AN) per group.
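A ‘group_membership’ array can be thought of as one boolean per stratification group. The sketch below shows the idea on a single flattened sample record. The keys `pop`, `sex_karyotype`, `project_subpop`, and `subset`, the example groups, and the name `group_membership_sketch` are illustrative assumptions; the real Table nests these fields under ‘meta’ as listed above.

```python
from typing import Any, Dict, List


def group_membership_sketch(
    sample: Dict[str, Any], groups: List[Dict[str, str]]
) -> List[bool]:
    """Return one boolean per group: True if the sample satisfies every key."""

    def in_group(group: Dict[str, str]) -> bool:
        for key, value in group.items():
            if key == "subset":
                # 'subsets' maps subset label -> bool for this sample.
                if not sample["subsets"].get(value, False):
                    return False
            elif sample.get(key) != value:
                return False
        return True  # an empty group dict means "all samples"

    return [in_group(g) for g in groups]


sample = {
    "pop": "afr",
    "sex_karyotype": "XX",
    "project_subpop": "san",
    "subsets": {"hgdp": True, "tgp": False},
}
groups = [
    {},
    {"pop": "afr"},
    {"sex_karyotype": "XX"},
    {"pop": "afr", "sex_karyotype": "XX"},
    {"subset": "hgdp"},
    {"subset": "hgdp", "project_subpop": "san"},
    {"pop": "nfe"},
]
membership = group_membership_sketch(sample, groups)
print(membership)
```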
- gnomad_qc.v4.annotations.generate_freq_genomes.compute_an_by_group_membership(vds, group_membership_ht, variant_filter_ht)[source]
Compute the allele number for new variants in the v4.0 release by call stats group membership.
- Parameters:
vds (VariantDataset) – VariantDataset with all v3.1 samples (including non-release).
group_membership_ht (Table) – Table with the group membership for each sample. This is generated by get_group_membership_ht_for_an.
variant_filter_ht (Table) – Table with all variants that need AN to be computed.
- Return type:
- Returns:
Table with the allele number for new variants in the v4.0 release.
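Conceptually, the allele number for a group at one variant is the sum of the ploidy of every called genotype among the samples in that group. The sketch below shows that aggregation for a single variant with plain dicts. It ignores how reference and variant data are stored in a VariantDataset, and `an_by_group_sketch` and its inputs are hypothetical.

```python
from typing import Dict, List, Optional


def an_by_group_sketch(
    ploidy_by_sample: Dict[str, Optional[int]],
    membership: Dict[str, List[bool]],
) -> List[int]:
    """Sum called-genotype ploidy per group; None means no call for that sample."""
    if not membership:
        return []
    n_groups = len(next(iter(membership.values())))
    an = [0] * n_groups
    for sample_id, ploidy in ploidy_by_sample.items():
        if ploidy is None:
            continue  # missing genotype contributes nothing to AN
        for i, is_member in enumerate(membership[sample_id]):
            if is_member:
                an[i] += ploidy
    return an


membership = {
    "s1": [True, True, False],
    "s2": [True, False, True],
    "s3": [True, True, False],
}
# s1 is diploid at this site, s2 is haploid (e.g. XY on non-PAR chrX), s3 has no call.
ploidy_by_sample = {"s1": 2, "s2": 1, "s3": None}
an = an_by_group_sketch(ploidy_by_sample, membership)
print(an)
```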
- gnomad_qc.v4.annotations.generate_freq_genomes.generate_v4_genomes_callstats(ht, an_ht, pop_diff_an_ht)[source]
Generate the call stats for the v4.0 genomes release.
Merge call stats from the v3.1 release, the v3.1 AN of new v4.0 variants, samples with updated population labels, added samples, and removed samples.
Also, compute filtering allele frequencies (‘faf’), ‘grpmax’, ‘gen_anc_faf_max’ and ‘InbreedingCoeff’ on the merged call stats.
- Parameters:
ht (Table) – Table returned by join_release_ht_with_subsets.
an_ht (Table) – Table with the allele number for new variants in the v4.0 release.
pop_diff_an_ht (Table) – Table with the allele number for samples with updated pop labels and variants in the v4.0 release but not present in the callstats of this subset.
- Return type:
- Returns:
Table with the updated call stats for the v4.0 genomes release.
- gnomad_qc.v4.annotations.generate_freq_genomes.finalize_v4_genomes_callstats(ht, v3_sites_ht, v3_meta_ht, updated_meta_ht)[source]
Finalize the call stats for the v4.0 genomes release.
- The following is done to create the final v4.0 genomes call stats:
Compute filtering allele frequencies (‘faf’), ‘grpmax’, ‘gen_anc_faf_max’ and ‘inbreeding_coeff’ on the v4.0 genomes call stats.
Drop downsamplings from the call stats.
Get the quality histograms from the v3.1 release.
Get the age distribution for the v4.0 genomes release.
- Parameters:
- Return type:
- Returns:
Table with the finalized call stats for the v4.0 genomes release.
- gnomad_qc.v4.annotations.generate_freq_genomes.get_histograms(ht, v3_sites_ht)[source]
Get histograms for the v4.0 genomes release from the v3.1 release.
- gnomad_qc.v4.annotations.generate_freq_genomes.set_downsampling_freq_missing(ht, new_variants_ht)[source]
Set the downsampling call stats for new variants in ht to missing.
- gnomad_qc.v4.annotations.generate_freq_genomes.get_age_distribution(v3_meta_ht, subset_updated_meta_ht)[source]
Get the updated ‘age_distribution’ annotation for v4.0 genomes.
- Parameters:
- Return type:
- Returns:
Expression with the age distribution for samples in subset_updated_meta.
- gnomad_qc.v4.annotations.generate_freq_genomes.get_pop_diff_v3_vds_group_membership(v3_vds, meta_ht)[source]
Get the group membership for samples in pop_diff for the v3.1 VDS.
- Parameters:
v3_vds (VariantDataset) – VDS with the v3.1 release samples.
meta_ht (Table) – Table with the updated HGDP + 1KG sample metadata.
- Return type:
- Returns:
Table with the group membership for samples in pop_diff.
- gnomad_qc.v4.annotations.generate_freq_genomes.patch_v4_genomes_callstats(freq_ht)[source]
Patch the call stats for some inconsistent variant calls found during validity checks.
14 variants were found to have inconsistent calls between the v3.1 and v4.0 genomes releases. We determined the reason for each of these by manual inspection, and this function applies the needed patches to these variants.
Note
This is a temporary fix until we can recompute the call stats for all the samples.
- gnomad_qc.v4.annotations.generate_freq_genomes.get_v4_genomes_release_resources(test, overwrite)[source]
Get PipelineResourceCollection for all resources needed to create the gnomAD v4.0 genomes release.
- Parameters:
test (bool) – Whether to gather all resources for testing.
overwrite (bool) – Whether to overwrite resources if they exist.
- Return type:
PipelineResourceCollection
- Returns:
PipelineResourceCollection containing resources for all steps of the gnomAD v4.0 genomes release pipeline.
- gnomad_qc.v4.annotations.generate_freq_genomes.main(args)[source]
Create the v4.0 genomes release Table.
Update call stats to include samples from the HGDP + 1KG subset that were unintentionally excluded. The excluded samples made up whole populations within the HGDP + 1KG subset that were the most genetically unique and had small sample sizes (more specifically: San, Mbuti Pygmy, Biaka Pygmy, Bougainville, and Papuan) compared to other populations within the gnomAD v3.1 callset.
In order to avoid re-calculating the call stats for the whole subset / whole release, we calculate the call stats for the samples that will be added and subtracted, then merge the call stats with the old call stats in the release HT.
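The merge described above amounts to simple arithmetic on additive call stats: counts for added samples are summed into the release counts, counts for subtracted samples are removed, and the frequency is recomputed. A minimal sketch for one variant and one frequency group, assuming call stats are dicts with `AC`, `AN`, and `homozygote_count` (`merge_callstats_sketch` is a hypothetical name, and the numbers are made up for illustration):

```python
from typing import Dict, Optional, Union

CallStats = Dict[str, Union[int, float, None]]


def merge_callstats_sketch(
    release: CallStats, added: CallStats, subtracted: CallStats
) -> CallStats:
    """Return release + added - subtracted for additive fields, then recompute AF."""
    merged: CallStats = {}
    for field in ("AC", "AN", "homozygote_count"):
        merged[field] = release[field] + added[field] - subtracted[field]
    an = merged["AN"]
    af: Optional[float] = merged["AC"] / an if an > 0 else None
    merged["AF"] = af
    return merged


release = {"AC": 10, "AN": 200, "homozygote_count": 1}
added = {"AC": 3, "AN": 20, "homozygote_count": 1}
subtracted = {"AC": 1, "AN": 10, "homozygote_count": 0}
merged = merge_callstats_sketch(release, added, subtracted)
print(merged["AC"], merged["AN"], merged["homozygote_count"])
```

AF is recomputed from the merged AC and AN rather than combined directly, because frequencies are not additive across sample sets.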
- Changes compared to the v3 release:
Some small updates to samples that are hard filtered.
Use a population PC outlier approach to filter the HGDP + 1KG samples instead of the sample QC metric outlier filtering approach used on the full dataset that caused some samples to be unintentionally excluded.
HGDP + 1KG release samples are determined using relatedness (pc_relate) run on only samples within the subset as well as relatedness to the rest of the release.
Some population labels were updated.
The Han and Papuan populations were split into more specific groupings.