gnomad_qc.v4.assessment.calculate_per_sample_stats
Script to get per-sample variant counts and aggregate sample statistics.
The following per-sample variant counts (including heterozygous, homozygous, non-ref, singletons etc.) can be calculated:
Total number of variants
Number of variants that pass all variant qc filters
Number of variants in UK Biobank capture regions
Number of variants in Broad capture regions
Number of variants in the intersect of UK Biobank and Broad capture regions
Number of variants in the union of UK Biobank and Broad capture regions
Number of rare variants (adj AF <0.1%)
Number of loss-of-function variants
Number of missense variants
Number of synonymous variants
The following aggregate sample stats of all of the above per-sample counts can be computed:
Mean
Quantiles (0.0, 0.25, 0.5, 0.75, 1.0)
Aggregated statistics can also be computed by ancestry.
Module Functions
|
Dictionary of default filter settings for summary stats. |
|
Dictionary of common filter settings to use for most summary stats. |
|
List of common variant filter combinations to use for summary stats. |
|
Dictionary of an additional filter group to use for loss-of-function filter combinations. |
|
List of loss-of-function consequence combinations to use for summary stats. |
|
Dictionary to rename keys in SUM_STAT_FILTERS, COMMON_FILTERS, or LOF_FILTERS_FOR_COMBO to final metadata keys. |
|
Process test arguments for the per-sample stats pipeline. |
|
Get filter expressions for UK Biobank and Broad capture regions. |
|
Create Table annotated with an array of booleans indicating whether a variant belongs to certain filter groups. |
|
Load VDS variant data and prepare MatrixTable for computing per-sample stats counts. |
|
Annotate the per-sample stats Table with ratios of other per-sample stats. |
|
Create Table of Hail's sample_qc output broken down by requested variant groupings. |
|
Create intermediate MatrixTable for computing per-sample stats counts. |
|
Create Table of sample QC metrics broken down by requested variant groupings. |
|
Combine autosomal and sex chromosome per-sample stats Tables. |
|
Compute aggregate statistics for per-sample QC metrics. |
|
Get PipelineResourceCollection for all resources needed in the per-sample stats pipeline. |
|
Collect per-sample variant counts and aggregate sample statistics. |
Script to get per-sample variant counts and aggregate sample statistics.
The following per-sample variant counts (including heterozygous, homozygous, non-ref, singletons etc.) can be calculated:
Total number of variants
Number of variants that pass all variant qc filters
Number of variants in UK Biobank capture regions
Number of variants in Broad capture regions
Number of variants in the intersect of UK Biobank and Broad capture regions
Number of variants in the union of UK Biobank and Broad capture regions
Number of rare variants (adj AF <0.1%)
Number of loss-of-function variants
Number of missense variants
Number of synonymous variants
The following aggregate sample stats of all of the above per-sample counts can be computed:
Mean
Quantiles (0.0, 0.25, 0.5, 0.75, 1.0)
Aggregated statistics can also be computed by ancestry.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.SUM_STAT_FILTERS = {'capture': ['ukb', 'broad', 'ukb_broad_intersect', 'ukb_broad_union'], 'csq': ['missense_variant', 'synonymous_variant', 'intron_variant', 'intergenic_variant'], 'csq_set': ['lof', 'coding', 'non_coding'], 'lof_csq': {'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}, 'max_af': [0.0001, 0.001, 0.01], 'variant_qc': ['none', 'pass']}
Dictionary of default filter settings for summary stats.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.COMMON_FILTERS = {'capture': ['ukb_broad_intersect'], 'variant_qc': ['pass']}
Dictionary of common filter settings to use for most summary stats.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.COMMON_FILTER_COMBOS = [['variant_qc'], ['variant_qc', 'capture']]
List of common variant filter combinations to use for summary stats.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.LOF_FILTERS_FOR_COMBO = {'lof_csq_set': ['lof'], 'loftee_HC': ['HC'], 'loftee_flags': ['no_flags', 'with_flags'], 'loftee_label': ['HC', 'LC']}
Dictionary of an additional filter group to use for loss-of-function filter combinations.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.LOF_FILTER_COMBOS = [['lof_csq', 'loftee_label'], ['lof_csq_set', 'loftee_label'], ['lof_csq_set', 'loftee_HC', 'loftee_flags'], ['lof_csq', 'loftee_HC', 'loftee_flags']]
List of loss-of-function consequence combinations to use for summary stats.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.MAP_FILTER_FIELD_TO_META = {'lof_csq': 'csq', 'lof_csq_set': 'csq_set', 'loftee_HC': 'loftee_label'}
Dictionary to rename keys in SUM_STAT_FILTERS, COMMON_FILTERS, or LOF_FILTERS_FOR_COMBO to final metadata keys.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.process_test_args(data_type, test_dataset=False, test_gene=False, test_partitions=None, test_difficult_partitions=False, create_filter_group=False, create_intermediate=False, create_per_sample_counts=False, use_intermediate=False, sex_chr_only=False, autosomes_only=False)[source]
Process test arguments for the per-sample stats pipeline.
Return the intervals and partitions to filter to for testing or raise a ValueError if the test argument combination is invalid.
The test argument combination is invalid if:
create_filter_group is True, and:
test_partitions and create_per_sample_counts are both True because the partitions in the release HT and the raw VDS are different.
test_difficult_partitions is True because difficult partitions are only relevant for testing per-sample counts.
test_dataset is True and create_per_sample_counts is False because the test dataset is only relevant when testing per-sample counts.
test_difficult_partitions is True, and:
data_type is “genomes” because difficult partitions are only relevant for the raw exome VDS.
test_dataset or test_partitions is True because difficult partitions only apply to the full exomes VDS.
test_gene is True, and test_partitions or test_difficult_partitions is True because there is no (or likely no) overlap in partitions between the gene test intervals and the test partitions or difficult partitions.
- Parameters:
data_type (
str
) – Data type of the dataset.test_dataset (
bool
) – Boolean indicating whether to use the test dataset instead of the full dataset. Default is False.test_gene (
bool
) – Boolean indicating whether to filter the dataset to PCSK9 and/or TRPC5 and ZFY (depends on values of sex_chr_only and autosomes_only because PCSK9 is on chr1 and TRPC5 is on chrX and ZFY is on chrY). Default is False.test_partitions (
Optional
[List
[int
]]) – Optional list of partitions to test on. Default is None.test_difficult_partitions (
bool
) – Boolean indicating whether to test on difficult partitions. For exomes only. Default is False.create_filter_group (
bool
) – Boolean indicating whether the filter group HT is being created. Default is False.create_intermediate (
bool
) – Boolean indicating whether the intermediate MT is being created. Default is False.create_per_sample_counts (
bool
) – Boolean indicating whether the per-sample counts HT is being created. Default is False.use_intermediate (
bool
) – Boolean indicating whether the intermediate MT is being used. Default is False.sex_chr_only (
bool
) – Boolean indicating whether the dataset is being filtered to sex chromosomes only. Default is False.autosomes_only (
bool
) – Boolean indicating whether the dataset is being filtered to autosomes only. Default is False.
- Return type:
Tuple
[Optional
[List
[tinterval
]],Optional
[List
[int
]]]- Returns:
Tuple of filter intervals and partitions to filter to for testing.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.get_capture_filter_exprs(ht, ukb_capture=True, broad_capture=True)[source]
Get filter expressions for UK Biobank and Broad capture regions.
- Parameters:
ht (
Table
) – Table containing variant annotations. The following annotations are required: ‘region_flags’.ukb_capture (
bool
) – Expression for variants that are in UKB capture intervals. Default is True.broad_capture (
bool
) – Expression for variants that are in Broad capture intervals. Default is True.
- Return type:
Dict
[str
,BooleanExpression
]- Returns:
Dictionary of filter expressions for UK Biobank and Broad capture regions.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.get_summary_stats_filter_groups_ht(ht, capture_regions=False, vep_canonical=True, vep_mane=False, rare_variants_afs=None, rare_variants_grpmax=None, rare_variant_mode=False)[source]
Create Table annotated with an array of booleans indicating whether a variant belongs to certain filter groups.
A ‘filter_groups’ annotation is added to the Table containing an ArrayExpression of BooleanExpressions for the following filter groups:
All variants
Variants that pass all variant QC filters
Variants in UK Biobank capture regions
Variants in Broad capture regions
Variants in the intersect of UK Biobank and Broad capture regions
Variants in the union of UK Biobank and Broad capture regions
Variant consequence: loss-of-function, missense, and synonymous
Rare variants with allele frequencies less than the thresholds in rare_variants_afs
A ‘filter_group_meta’ global annotation is added to the Table containing an array of dictionaries detailing the filters used in each filter group.
- Parameters:
ht (
Table
) – Table containing variant annotations. The following annotations are required: ‘freq’, ‘filters’, ‘region_flags’, and ‘vep’.capture_regions (
bool
) – Whether to include filtering groups for variants in UK Biobank and Broad capture regions. Default is False.vep_canonical (
bool
) – Whether to filter to only canonical transcripts. If trying count variants in all transcripts, set it to False. Default is True.vep_mane (
bool
) – Whether to filter to only MANE transcripts. Default is False.rare_variants_afs (
Optional
[List
[float
]]) – Optional list of allele frequency thresholds to use for rare variant filtering.rare_variants_grpmax (
Optional
[List
[float
]]) – Optional list of grpmax thresholds to use for rare variant filtering.rare_variant_mode (
bool
) – Whether to only include the rare variant filter groups and variant_qc as the common filter groups. Default is False.
- Return type:
- Returns:
Table containing an ArrayExpression of filter groups for summary stats.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.load_mt_for_sample_counts(data_type, filter_group_ht, adjust_ploidy=False, **kwargs)[source]
Load VDS variant data and prepare MatrixTable for computing per-sample stats counts.
This function loads the VDS for the requested data type and prepares the variant data MatrixTable for computing per-sample stats counts. It does the following:
Filters to non-ref genotypes (likely unnecessary if the MT is already a variant data MT).
Annotates the rows with the filter group metadata from filter_group_ht.
Performs the high AB het -> hom alt adjustment of GT, and adds a ‘high_ab_het_ref’ annotation.
Filters to adj genotypes and selects only the necessary entries for downstream processing.
- Parameters:
data_type (
str
) – Data type of the MatrixTable to return. Either ‘genomes’ or ‘exomes’.filter_group_ht (
Table
) – Table containing filter group metadata from get_summary_stats_filter_groups_ht.adjust_ploidy (
bool
) – Whether to adjust ploidy. If data_type is ‘exomes’, ploidy is adjusted before determining the adj annotation. If data_type is ‘genomes’, the ploidy adjustment is done after determining the adj annotation. This difference is only added for consistency with v3.1, where we added the adj annotation before adjusting ploidy, but the correct thing to do is to adjust ploidy before determining the adj annotation. Default is False.kwargs – Additional keyword arguments to pass to the VDS loading function.
- Return type:
- Returns:
MatrixTable prepared for computing per-sample stats counts.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.annotate_per_sample_stat_combinations(ht, sums={'n_indel': ['n_insertion', 'n_deletion'], 'n_snp': ['n_transition', 'n_transversion']}, diffs={}, ratios={'r_het_hom_var': ('n_het', 'n_hom_var'), 'r_insertion_deletion': ('n_insertion', 'n_deletion'), 'r_ti_tv': ('n_transition', 'n_transversion'), 'r_ti_tv_singleton': ('n_singleton_ti', 'n_singleton_tv')}, additional_stat_combos={})[source]
Annotate the per-sample stats Table with ratios of other per-sample stats.
- Parameters:
ht (
Table
) – Input Table containing per-sample stats.sums (
Dict
[str
,List
[str
]]) – Dictionary of per-sample stats to sum. The key is the name of the sum and the value is a List of the stats to sum. Default is to sum the number of insertions and deletions to get the total number of indels, and the number of transitions and transversions to get the total number of transitions.diffs (
Dict
[str
,Tuple
[str
,str
]]) – Dictionary of per-sample stats to subtract. The key is the name of the difference and the value is a tuple of the stats to subtract, where the second stat is subtracted from the first. Default is {}.ratios (
Dict
[str
,Tuple
[str
,str
]]) – Dictionary of ratios to compute. The key is the name of the ratio and the value is a tuple of the numerator and denominator per-sample stats. Default is to compute the transition/transversion ratio, the singleton transition/transversion ratio, the heterozygous/homozygous variant ratio, and the insertion/deletion ratio.additional_stat_combos (
Optional
[Dict
[str
,Callable
]]) – Optional dictionary of additional per-sample stat combinations to compute. The key is the name of the stat and the value is a function that takes the per-sample stats struct and returns the computed stat. Default is {}.
- Return type:
- Returns:
Table containing per-sample stats annotated with the requested ratios.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.create_per_sample_counts_ht(mt, gq_bins=(60,), dp_bins=(20, 30))[source]
Create Table of Hail’s sample_qc output broken down by requested variant groupings.
Useful for finding the number of variants per sample, either all variants, or variants fall into specific capture regions, or variants that are rare (adj AF <0.1%), or variants categorized by predicted consequences in all, canonical or mane transcripts.
- Parameters:
mt (
MatrixTable
) – Input MatrixTable containing variant data. Must have multi-allelic sites split.gq_bins (
Tuple
[int
]) – Tuple of GQ bins to use for filtering. Default is (60,).dp_bins (
Tuple
[int
]) – Tuple of DP bins to use for filtering. Default is (20, 30).
- Return type:
- Returns:
Table containing per-sample variant counts.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.create_intermediate_mt_for_sample_counts(mt, gq_bins=(60,), dp_bins=(20, 30))[source]
Create intermediate MatrixTable for computing per-sample stats counts.
This function creates an intermediate Table for use in create_per_sample_counts_ht. It does the following:
Converts the MT to a HT with a sample_idx_by_stat row annotation that is a struct of arrays of sample indices for each genotype level stat. The stats are: non_ref, het, hom_var, high_ab_het_ref, over_gq_{gq}, and over_dp_{dp}.
Changes the filter_groups annotation from an array of booleans to an array of structs where group_filter is the original filter group boolean value, mapping to the filter_group_meta, and adds boolean annotations to the struct indicating whether the variant is included in each of the following variant level stats: singleton, singleton_ti, singleton_tv, insertion, deletion, transition, and transversion.
This structure creates a large intermediate Table that allows for memory efficient computation of per-sample stats counts. Smaller datasets do not require this intermediate step and can compute per-sample stats counts directly from the MT and with the use of Hail’s vmt_sample_qc function.
- Parameters:
mt (
MatrixTable
) – Input MatrixTable containing variant data. Must have multi-allelic sites split.gq_bins (
Tuple
[int
]) – Tuple of GQ bins to use for filtering. Default is (60,).dp_bins (
Tuple
[int
]) – Tuple of DP bins to use for filtering. Default is (20, 30).
- Return type:
- Returns:
Intermediate Table for computing per-sample stats counts.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.create_per_sample_counts_from_intermediate_ht(ht)[source]
Create Table of sample QC metrics broken down by requested variant groupings.
Warning
This function is memory intensive and should be run on a cluster with n1-highmem-8 workers. It also requires shuffling, so a cluster with all workers is recommended.
Takes an intermediate Table (returned by create_intermediate_mt_for_sample_counts) with an array of variant level stats by filter groups and an array of sample indices for each genotype level stat.
This function restructures the Table a few times to get around memory errors that occur when trying to aggregate the Table directly.
The following steps are taken:
The Table is grouped by filter group (including all variant level stats) and aggregated to get the count of variants for each sample in each filter group. Leads to a Table where the number of rows is approximately the number of filter groups times the possible permutations of variant level stats.
Since this number can sometimes be too few partitions to get around the memory errors, we approximate a reasonable number of partitions as the number required to split the table into approximately 10,000 variants per partition.
Then we artificially increase the number of aggregation groups by assigning a random group number to each variant, so that when the Table is grouped by both the filter group and the random group instead of only the filter group, the number of rows will be approximately the number of desired partitions.
Therefore, the number of random groups is the ideal number of partitions divided by the number of possible filter groups.
Sample IDs are mapped to the sample indices and the Table is exploded so that each row is counts for a sample and a filter group. The number of rows in the Table is approximately the number of samples times the number of rows in the previous Table.
Group the Table by sample to get a struct of the count of variants by sample QC metric for each sample in each filter group.
Add sample QC stats that are combinations of other stats.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.combine_autosome_and_sex_chr_stats(autosome_ht, sex_chr_ht)[source]
Combine autosomal and sex chromosome per-sample stats Tables.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.compute_agg_sample_stats(ht, meta_ht=None, by_ancestry=False, by_subset=False)[source]
Compute aggregate statistics for per-sample QC metrics.
- Parameters:
ht (
Table
) – Table containing sample QC metrics.meta_ht (
Optional
[Table
]) – Optional Table containing sample metadata. Required if by_ancestry is True.by_ancestry (
bool
) – Boolean indicating whether to stratify by ancestry.by_subset (
bool
) – Boolean indicating whether to stratify by subset. This is only working on “exomes” data.
- Return type:
- Returns:
Struct of aggregate statistics for per-sample QC metrics.
- gnomad_qc.v4.assessment.calculate_per_sample_stats.get_pipeline_resources(data_type, test, test_gene, autosomes_only, sex_chr_only, use_intermediate_mt_for_sample_counts, by_ancestry, by_subset, overwrite, custom_suffix=None)[source]
Get PipelineResourceCollection for all resources needed in the per-sample stats pipeline.
- Parameters:
data_type (
str
) – Data type of the dataset.test (
bool
) – Whether to gather all resources for testing.test_gene (
bool
) – Whether the test is being performed on specific gene(s).autosomes_only (
bool
) – Whether to gather resources for autosomes only.sex_chr_only (
bool
) – Whether to gather resources for sex chromosomes only.use_intermediate_mt_for_sample_counts (
bool
) – Whether to use an intermediate MT for computing per-sample counts.by_ancestry (
bool
) – Whether to return resource for aggregate stats stratified by ancestry.by_subset (
bool
) – Whether to return resource for aggregate stats stratified by subset.overwrite (
bool
) – Whether to overwrite resources if they exist.custom_suffix (
Optional
[str
]) – Optional custom suffix to add to the resource names.
- Return type:
PipelineResourceCollection
- Returns:
PipelineResourceCollection containing resources for all steps of the per-sample stats pipeline.