gnomad_qc.v4.annotations.generate_freq_genomes

Script to create frequencies HT for v4.0 genomes.

This script is written specifically for the v4.0 genomes release to handle the addition and subtraction of samples from the HGDP + 1KG subset, which has updated sample QC annotations. Adding new samples from the densified MT introduces new variants, which requires recomputing AN for the v3.1 release samples and merging it with the call stats of the v3.1.2 release sites HT and of the updated HGDP + 1KG subset. The code also computes the other call-stats-related annotations, including filtering allele frequencies (‘faf’), ‘grpmax’, ‘gen_anc_faf_max’, and ‘InbreedingCoeff’, as done for the v4.0 exomes.

This script creates the v4.0 genomes release HT.
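
The merging step described above comes down to per-variant arithmetic on call stats. The following is a minimal plain-Python sketch of that idea, not the script's Hail implementation: the `CallStats` class, the `merge_callstats` helper, and the example counts are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallStats:
    """Call stats for one variant in one group (hypothetical container)."""

    ac: int  # allele count
    an: int  # allele number
    hom: int  # homozygote count

    @property
    def af(self):
        # Allele frequency is undefined when no alleles were called.
        return self.ac / self.an if self.an > 0 else None


def merge_callstats(release, added, subtracted):
    """Add the counts of newly included samples and remove those of dropped samples."""
    merged = CallStats(
        ac=release.ac + added.ac - subtracted.ac,
        an=release.an + added.an - subtracted.an,
        hom=release.hom + added.hom - subtracted.hom,
    )
    if min(merged.ac, merged.an, merged.hom) < 0:
        raise ValueError("Subtraction removed more alleles than were present.")
    return merged


# Made-up counts for a single variant.
release = CallStats(ac=10, an=1000, hom=1)
added = CallStats(ac=2, an=40, hom=0)
subtracted = CallStats(ac=1, an=20, hom=0)
merged = merge_callstats(release, added, subtracted)
```

A variant that appears only in the added samples has no AN in the existing release call stats, so a sum like this is only valid once AN has been computed for the v3.1 release samples at that site.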

usage: gnomad_qc.v4.annotations.generate_freq_genomes.py [-h] [--overwrite]
                                                         [--slack-channel SLACK_CHANNEL]
                                                         [--test-pcsk9 | --test-x-gene]
                                                         [--get-related-to-nonsubset]
                                                         [--get-hgdp-tgp-v4-exome-duplicates]
                                                         [--update-annotations]
                                                         [--get-callstats-for-updated-samples]
                                                         [--join-callstats-for-update]
                                                         [--compute-allele-number-for-new-variants]
                                                         [--compute-allele-number-for-pop-diff]
                                                         [--update-release-callstats]
                                                         [--apply-patch-to-freq-ht]

Named Arguments

--overwrite

Overwrite data

Default: False

--slack-channel

Slack channel to post results and notifications to.

--test-pcsk9

Test on a subset of variants in PCSK9 gene.

Default: False

--test-x-gene

Test on a subset of variants in TRPC5 on chrX.

Default: False

--get-related-to-nonsubset

Get the HGDP + 1KG subset samples that are related to samples outside the subset.

Default: False

--get-hgdp-tgp-v4-exome-duplicates

Get the HGDP + 1KG subset samples that are duplicates of an exome in the v4.0 release.

Default: False

--update-annotations

Update HGDP + 1KG sample QC annotations.

Default: False

--get-callstats-for-updated-samples

Get the call stats for the genomes with an updated release status for v4.0 compared to v3.1.

Default: False

--join-callstats-for-update

Join the call stats Tables for the updated samples with the release call stats Table.

Default: False

--compute-allele-number-for-new-variants

Compute the allele number for variants that are in the updated samples but not in the release HT.

Default: False

--compute-allele-number-for-pop-diff

Compute the allele number for all v4.0 variants in the samples whose population labels differ in v4.0.

Default: False

--update-release-callstats

Update the release call stats by merging the call stats for the updated samples.

Default: False

--apply-patch-to-freq-ht

Apply a patch to the final freq HT.

Default: False
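
The flags above are all boolean switches that default to False, and the usage line shows `--test-pcsk9` and `--test-x-gene` as mutually exclusive. A minimal sketch of how such a parser could be declared with `argparse` follows. It covers only a few of the flags, and `build_parser` is a hypothetical name, not the module's `get_script_argument_parser`.

```python
import argparse


def build_parser():
    """Build a parser for a small, illustrative subset of the script's flags."""
    parser = argparse.ArgumentParser(
        description="Illustrative subset of the generate_freq_genomes flags."
    )
    parser.add_argument("--overwrite", action="store_true", help="Overwrite data.")
    parser.add_argument(
        "--slack-channel",
        help="Slack channel to post results and notifications to.",
    )
    # The two test modes cannot be combined, matching the usage line.
    test_group = parser.add_mutually_exclusive_group()
    test_group.add_argument(
        "--test-pcsk9",
        action="store_true",
        help="Test on a subset of variants in PCSK9 gene.",
    )
    test_group.add_argument(
        "--test-x-gene",
        action="store_true",
        help="Test on a subset of variants in TRPC5 on chrX.",
    )
    parser.add_argument(
        "--update-annotations",
        action="store_true",
        help="Update HGDP + 1KG sample QC annotations.",
    )
    return parser


args = build_parser().parse_args(
    ["--overwrite", "--test-pcsk9", "--update-annotations"]
)
```

With this layout, passing both test flags at once makes `argparse` exit with a usage error rather than silently picking one.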

Module Functions

`gnomad_qc.v4.annotations.generate_freq_genomes.POP_MAP`	Map of HGDP populations that need to be renamed in the v4.0 genomes release.
`gnomad_qc.v4.annotations.generate_freq_genomes.JOIN_FREQS`	Frequency Tables to join for creation of the v4.0 genomes release sites HT.
`gnomad_qc.v4.annotations.generate_freq_genomes.FREQ_GLOBALS`	Global annotations on the frequency Table.
`gnomad_qc.v4.annotations.generate_freq_genomes.filter_to_test`(t)	Filter to a region in PCSK9 or TRPC5 in Table or MatrixTable for testing purposes.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_hgdp_tgp_related_to_nonsubset`(...)	Get the samples in the HGDP + 1KG subset that were related to samples outside the subset in the v3.1 release and were not included in the v3.1 release.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_hgdp_tgp_v4_exome_duplicates`(...)	Get the samples in the HGDP + 1KG subset that are duplicates of an exome in the v4.0 release.
`gnomad_qc.v4.annotations.generate_freq_genomes.add_updated_sample_qc_annotations`(ht)	Add updated sample QC annotations to the HGDP + 1KG subset.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_updated_release_samples`(ht)	Get the samples in the HGDP + 1KG subset that will be added, subtracted, or have different pop labels in the v4.0 release compared to the v3.1 release.
`gnomad_qc.v4.annotations.generate_freq_genomes.calculate_callstats_for_selected_samples`(mt, ...)	Calculate the call stats for samples in samples_ht.
`gnomad_qc.v4.annotations.generate_freq_genomes.concatenate_annotations`(hts)	Concatenate annotations on multiple Tables into a single Table.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_hgdp_tgp_callstats_for_selected_samples`(mt, ...)	Calculate call stats for samples in samples_ht grouped by all stratifications in the v3.1 release call stats.
`gnomad_qc.v4.annotations.generate_freq_genomes.filter_freq_arrays`(ht, ...)	Use filter_arrays_by_meta to filter annotations and globals by 'freq_meta' and annotate them back on ht.
`gnomad_qc.v4.annotations.generate_freq_genomes.annotate_v3_subsets_sample_count`(ht)	Get the freq sample counts from the v3.1 subsets.
`gnomad_qc.v4.annotations.generate_freq_genomes.update_pop_labels`(ht, ...)	Update the population labels in the 'freq_meta' annotation on ht.
`gnomad_qc.v4.annotations.generate_freq_genomes.join_release_ht_with_subsets`(ht, ...)	Join the release HT with the call stats for added and subtracted samples.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_group_membership_ht_for_an`(ht)	Generate a Table with a 'group_membership' array for each sample indicating whether the sample belongs to specific stratification groups.
`gnomad_qc.v4.annotations.generate_freq_genomes.compute_an_by_group_membership`(...)	Compute the allele number for new variants in the v4.0 release by call stats group membership.
`gnomad_qc.v4.annotations.generate_freq_genomes.generate_v4_genomes_callstats`(ht, ...)	Generate the call stats for the v4.0 genomes release.
`gnomad_qc.v4.annotations.generate_freq_genomes.finalize_v4_genomes_callstats`(ht, ...)	Finalize the call stats for the v4.0 genomes release.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_histograms`(ht, ...)	Get histograms for the v4.0 genomes release from the v3.1 release.
`gnomad_qc.v4.annotations.generate_freq_genomes.set_downsampling_freq_missing`(ht, ...)	Set the downsampling call stats for new variants in ht to missing.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_age_distribution`(...)	Get the updated 'age_distribution' annotation for v4.0 genomes.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_pop_diff_v3_vds_group_membership`(...)	Get the group membership for samples in pop_diff for the v3.1 VDS.
`gnomad_qc.v4.annotations.generate_freq_genomes.patch_v4_genomes_callstats`(freq_ht)	Patch the call stats for some inconsistent variant calls found during validity checks.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_v4_genomes_release_resources`(...)	Get PipelineResourceCollection for all resources needed to create the gnomAD v4.0 genomes release.
`gnomad_qc.v4.annotations.generate_freq_genomes.main`(args)	Create the v4.0 genomes release Table.
`gnomad_qc.v4.annotations.generate_freq_genomes.get_script_argument_parser`()	Get script argument parser.
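
To illustrate the 'group_membership' idea behind `get_group_membership_ht_for_an` and `compute_an_by_group_membership` listed above, here is a plain-Python sketch under stated assumptions. Each sample carries one boolean per stratification group, and the allele number of a group is the number of called alleles summed over its members. The function name, data layout, and group labels are hypothetical; the real functions operate on Hail Tables.

```python
def allele_number_by_group(group_membership, called_alleles):
    """Sum called alleles per stratification group.

    group_membership: dict mapping sample ID to a list of booleans, one per group.
    called_alleles: dict mapping sample ID to the number of alleles called at the
        site (0 when the genotype is missing).
    """
    if not group_membership:
        return []
    n_groups = len(next(iter(group_membership.values())))
    an = [0] * n_groups
    for sample, flags in group_membership.items():
        n_called = called_alleles.get(sample, 0)
        for i, in_group in enumerate(flags):
            if in_group:
                an[i] += n_called
    return an


# Hypothetical groups, in the same order as each sample's membership flags.
groups = ["all", "pop=afr", "pop=nfe"]
group_membership = {
    "s1": [True, True, False],
    "s2": [True, False, True],
    "s3": [True, False, True],
}
# s3 has no call at this site, so it contributes no alleles.
called_alleles = {"s1": 2, "s2": 2, "s3": 0}
an = allele_number_by_group(group_membership, called_alleles)
```

Storing membership as a fixed-order boolean array lets every group be accumulated in one pass over the samples.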

gnomad_qc.v4.annotations.generate_freq_genomes.SUBSETS = ['non_v2', 'non_topmed', 'non_cancer', 'controls_and_biobanks', 'non_neuro']: List of subsets in the v3.1/v4.0 genomes release.

gnomad_qc.v4.annotations.generate_freq_genomes.POP_MAP = {'bantusafrica': 'bantusouthafrica', 'biakaPygmy': 'biaka', 'italian': 'bergamoitalian', 'mbutiPygmy': 'mbuti', 'melanesian': 'bougainville', 'miaozu': 'miao', 'mongola': 'mongolian', 'oth': 'remaining', 'yizu': 'yi'}: Map of HGDP populations that need to be renamed in the v4.0 genomes release. Also includes the renaming of ‘oth’ to ‘remaining’.
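
As a rough illustration of how a map like this can be applied, the sketch below renames population labels in a list of metadata dictionaries shaped like 'freq_meta'. It is plain Python rather than Hail, `rename_pops` is a hypothetical helper, and the `"pop"` key name is an assumption about the metadata layout.

```python
POP_MAP = {
    "bantusafrica": "bantusouthafrica",
    "biakaPygmy": "biaka",
    "italian": "bergamoitalian",
    "mbutiPygmy": "mbuti",
    "melanesian": "bougainville",
    "miaozu": "miao",
    "mongola": "mongolian",
    "oth": "remaining",
    "yizu": "yi",
}


def rename_pops(freq_meta, pop_map, key="pop"):
    """Return a copy of freq_meta with values under `key` renamed via pop_map.

    Labels that are not in pop_map are left unchanged. The "pop" key name is an
    assumption made for this sketch.
    """
    return [
        {k: (pop_map.get(v, v) if k == key else v) for k, v in meta.items()}
        for meta in freq_meta
    ]


freq_meta = [
    {"group": "adj"},
    {"group": "adj", "pop": "oth"},
    {"group": "adj", "pop": "miaozu", "sex": "XX"},
    {"group": "adj", "pop": "nfe"},
]
updated = rename_pops(freq_meta, POP_MAP)
```

Because the helper builds new dictionaries, the original `freq_meta` list is left untouched.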

gnomad_qc.v4.annotations.generate_freq_genomes.JOIN_FREQS = ['release', 'pop_diff', 'added', 'subtracted']: Frequency Tables to join for creation of the v4.0 genomes release sites HT.

gnomad_qc.v4.annotations.generate_freq_genomes.FREQ_GLOBALS = ('freq_meta', 'freq_meta_sample_count'): Global annotations on the frequency Table.
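
These two globals are index-aligned with the per-variant frequency array, so any filtering has to be applied to all of them together. The sketch below shows that bookkeeping in plain Python. It is an analogue of the idea only, not the `filter_arrays_by_meta` API: `filter_by_meta`, the dictionary layout, and the example values are assumptions.

```python
FREQ_GLOBALS = ("freq_meta", "freq_meta_sample_count")


def filter_by_meta(freq_globals, freq, keep):
    """Keep only the positions whose metadata satisfies `keep`.

    freq_globals: dict holding the index-aligned global lists named in FREQ_GLOBALS.
    freq: per-variant list aligned with freq_globals["freq_meta"].
    keep: predicate applied to each metadata dictionary.
    """
    idx = [i for i, meta in enumerate(freq_globals["freq_meta"]) if keep(meta)]
    new_globals = {
        name: [freq_globals[name][i] for i in idx] for name in FREQ_GLOBALS
    }
    new_freq = [freq[i] for i in idx]
    return new_globals, new_freq


# Made-up globals and one variant's frequency array.
freq_globals = {
    "freq_meta": [
        {"group": "adj"},
        {"group": "raw"},
        {"group": "adj", "subset": "non_v2"},
    ],
    "freq_meta_sample_count": [100, 110, 60],
}
freq = [{"AC": 5, "AN": 200}, {"AC": 6, "AN": 220}, {"AC": 3, "AN": 120}]

# Drop every entry that belongs to a subset.
new_globals, new_freq = filter_by_meta(
    freq_globals, freq, lambda meta: "subset" not in meta
)
```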

gnomad_qc.v4.annotations.generate_freq_genomes.filter_to_test(t, gene_on_chrx=False)

Filter a Table, MatrixTable, or VariantDataset to a region in PCSK9 or TRPC5 for testing purposes.

Parameters:

t (Union[Table, MatrixTable, VariantDataset]) – Table, MatrixTable, or VariantDataset to filter.
gene_on_chrx (bool) – Whether to filter to TRPC5 (in the non-PAR region of chrX) instead of PCSK9, for testing chrX.

Return type:

Union[Table, MatrixTable, VariantDataset]

Returns:

Table, MatrixTable, or VariantDataset filtered to a region in PCSK9 or TRPC5.

gnomad_qc.v4.annotations.generate_freq_genomes.get_hgdp_tgp_related_to_nonsubset(v3_meta_ht, rel_ht)

Get the samples in the HGDP + 1KG subset that were related to samples outside the subset in the v3.1 release and were not included in the v3.1 release.

Parameters:

v3_meta_ht (Table) – Table with the v3.1 release metadata.
rel_ht (Table) – Table with the v3.1 release relatedness; here, the pc_relate results for the full dataset are used.

Return type: