gnomad.assessment.summary_stats
freq_bin_expr | Return frequency string annotations based on input AC or AF.
get_summary_counts_dict | Return dictionary containing counts of multiple variant categories.
get_summary_ac_dict | Return dictionary containing total allele counts for variant categories.
get_summary_counts | Generate a struct with summary counts across variant categories.
get_an_criteria | Generate criteria to filter samples based on allele number (AN).
get_tx_expression_expr | Pull appropriate transcript expression annotation struct given a specific locus and alleles (provided in key_expr).
default_generate_gene_lof_matrix | Generate loss-of-function gene matrix.
get_het_hom_summary_dict | Generate dictionary containing summary counts.
default_generate_gene_lof_summary | Generate summary counts for loss-of-function (LoF), missense, and synonymous variants.
- gnomad.assessment.summary_stats.freq_bin_expr(freq_expr, index=0)
Return frequency string annotations based on input AC or AF.
Note
Default index is 0 because function assumes freq_expr was calculated with annotate_freq.
Frequency index 0 from annotate_freq is frequency for all pops calculated on adj genotypes only.
- Parameters:
freq_expr (ArrayExpression) – Array of structs containing frequency information.
index (int) – Which index of freq_expr to use for annotation. Default is 0.
- Return type:
StringExpression
- Returns:
StringExpression containing bin name based on input AC or AF.
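To make the binning idea concrete, here is a minimal pure-Python sketch of how one frequency struct might be mapped to a bin label. It does not use Hail, and the names `AF_BIN_EDGES` and `freq_bin`, the bin edges, and the label strings are all illustrative assumptions rather than the values the library uses.

```python
from typing import Optional

# Hypothetical bin edges and labels, chosen only for illustration.
AF_BIN_EDGES = [
    (1e-4, "<0.01%"),
    (1e-3, "0.01%-0.1%"),
    (1e-2, "0.1%-1%"),
    (1e-1, "1%-10%"),
]


def freq_bin(ac: int, af: float) -> Optional[str]:
    """Return a bin label for one variant, or None when the variant is absent (AC == 0)."""
    if ac == 0:
        return None
    # Assumption: the rarest variants are labelled by allele count, not frequency.
    if ac == 1:
        return "singleton"
    if ac == 2:
        return "doubleton"
    for upper, label in AF_BIN_EDGES:
        if af < upper:
            return label
    return ">=10%"


# Mirror the `index` argument: pick one struct out of an array of frequency structs.
freq = [{"AC": 5, "AF": 0.0004}, {"AC": 1, "AF": 0.0001}]
index = 0
label = freq_bin(freq[index]["AC"], freq[index]["AF"])
```

With `index=0` the sketch reads the first struct, which matches the note above that index 0 holds the frequency for all pops on adj genotypes.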
- gnomad.assessment.summary_stats.get_summary_counts_dict(locus_expr, allele_expr, lof_expr, no_lof_flags_expr, most_severe_csq_expr, prefix_str='')
Return dictionary containing counts of multiple variant categories.
- Categories are:
Number of variants
Number of indels
Number of SNVs
Number of LoF variants
Number of LoF variants that pass LOFTEE
Number of LoF variants that pass LOFTEE without any flags
Number of LoF variants annotated as ‘other splice’ (OS) by LOFTEE
Number of LoF variants that fail LOFTEE
Number of missense variants
Number of synonymous variants
Number of autosomal variants
Number of allosomal variants
Warning
Assumes allele_expr contains only two alleles (multi-allelics have been split).
- Parameters:
locus_expr (LocusExpression) – LocusExpression.
allele_expr (ArrayExpression) – ArrayExpression containing alleles.
lof_expr (StringExpression) – StringExpression containing LOFTEE annotation.
no_lof_flags_expr (BooleanExpression) – BooleanExpression indicating whether LoF variant has any flags.
most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence annotation.
prefix_str (str) – Desired prefix string for category names. Default is empty str.
- Return type:
Dict[str, Int64Expression]
- Returns:
Dict of categories and counts per category.
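The categories listed above can be illustrated with a small pure-Python counter over bi-allelic variant records. This is only a sketch: it does not use Hail, the dictionary keys are hypothetical, and it assumes LOFTEE labels of "HC" (pass), "OS" (other splice), and "LC" (fail). Treating every non-SNV as an indel and every contig other than X or Y as autosomal are simplifications.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# (contig, [ref, alt], lof, no_lof_flags, most_severe_csq)
Variant = Tuple[str, List[str], Optional[str], bool, str]


def summary_counts(variants: List[Variant], prefix_str: str = "") -> Dict[str, int]:
    """Count variants per category; key names here are illustrative only."""
    counts: Counter = Counter()
    for contig, alleles, lof, no_lof_flags, csq in variants:
        ref, alt = alleles  # relies on multi-allelics having been split
        is_snv = len(ref) == 1 and len(alt) == 1
        on_autosome = contig.replace("chr", "") not in {"X", "Y"}
        membership = {
            "variants": True,
            "indels": not is_snv,
            "snvs": is_snv,
            "lof": lof is not None,
            "lof_pass_loftee": lof == "HC",
            "lof_pass_loftee_no_flags": lof == "HC" and no_lof_flags,
            "lof_os": lof == "OS",
            "lof_fail_loftee": lof == "LC",
            "missense": csq == "missense_variant",
            "synonymous": csq == "synonymous_variant",
            "autosomal": on_autosome,
            "allosomal": not on_autosome,
        }
        for name, hit in membership.items():
            counts[prefix_str + name] += int(hit)
    return dict(counts)


example = [
    ("chr1", ["A", "T"], None, False, "missense_variant"),
    ("chr1", ["AT", "A"], "HC", True, "frameshift_variant"),
    ("chrX", ["G", "C"], "LC", False, "stop_gained"),
]
counts = summary_counts(example, prefix_str="n_")
```

The `prefix_str` argument only changes the key names, which is convenient when several such dictionaries are merged into one struct.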
- gnomad.assessment.summary_stats.get_summary_ac_dict(ac_expr, lof_expr, no_lof_flags_expr, most_severe_csq_expr)
Return dictionary containing total allele counts for variant categories.
- Categories are:
All variants
LoF variants
LoF variants that pass LOFTEE
LoF variants that pass LOFTEE without any flags
LoF variants that are annotated as ‘other splice’ (OS) by LOFTEE
LoF variants that fail LOFTEE
Missense variants
Synonymous variants
Warning
Assumes multi-allelic variants have been split.
- Parameters:
ac_expr (Int64Expression) – Int64Expression containing allele counts.
lof_expr (StringExpression) – StringExpression containing LOFTEE annotation.
no_lof_flags_expr (BooleanExpression) – BooleanExpression indicating whether LoF variant has any flags.
most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence annotation.
- Return type:
Dict[str, Int64Expression]
- Returns:
Dict of variant categories and their total allele counts.
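Unlike a variant count, a total allele count sums AC over the variants in each category. A minimal pure-Python sketch of that conditional sum follows; the key names and the `CATEGORIES` table are hypothetical, and the LOFTEE labels "HC", "OS", and "LC" are assumed.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (ac, lof, no_lof_flags, most_severe_csq)
Row = Tuple[int, Optional[str], bool, str]

# One predicate per category; names are illustrative, not the library's keys.
CATEGORIES: Dict[str, Callable[[Optional[str], bool, str], bool]] = {
    "all": lambda lof, no_flags, csq: True,
    "lof": lambda lof, no_flags, csq: lof is not None,
    "lof_pass_loftee": lambda lof, no_flags, csq: lof == "HC",
    "lof_pass_loftee_no_flags": lambda lof, no_flags, csq: lof == "HC" and no_flags,
    "lof_os": lambda lof, no_flags, csq: lof == "OS",
    "lof_fail_loftee": lambda lof, no_flags, csq: lof == "LC",
    "missense": lambda lof, no_flags, csq: csq == "missense_variant",
    "synonymous": lambda lof, no_flags, csq: csq == "synonymous_variant",
}


def summary_ac(rows: List[Row]) -> Dict[str, int]:
    """Sum allele counts over the rows that fall in each category."""
    return {
        name: sum(ac for ac, lof, no_flags, csq in rows if keep(lof, no_flags, csq))
        for name, keep in CATEGORIES.items()
    }


rows = [
    (10, None, False, "missense_variant"),
    (3, "HC", True, "stop_gained"),
    (2, "HC", False, "frameshift_variant"),
    (1, "LC", False, "stop_gained"),
    (7, None, False, "synonymous_variant"),
]
totals = summary_ac(rows)
```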
- gnomad.assessment.summary_stats.get_summary_counts(ht, freq_field='freq', filter_field='filters', filter_decoy=False, canonical_only=True, mane_select_only=False, index=0)
Generate a struct with summary counts across variant categories.
- Summary counts:
Number of variants
Number of indels
Number of SNVs
Number of LoF variants
Number of LoF variants that pass LOFTEE (including with LoF flags)
Number of LoF variants that pass LOFTEE without LoF flags
Number of OS (other splice) variants annotated by LOFTEE
Number of LoF variants that fail LOFTEE filters
Also annotates Table’s globals with total variant counts.
- Before calculating summary counts, function:
Filters out low confidence regions
Uses the most severe consequence
Filters to canonical transcripts (if canonical_only is True) or MANE Select transcripts (if mane_select_only is True)
- Assumes that:
Input HT is annotated with VEP.
Multiallelic variants have been split and/or input HT contains bi-allelic variants only.
freq_expr was calculated with annotate_freq.
(Frequency index 0 from annotate_freq is frequency for all pops calculated on adj genotypes only.)
- Parameters:
ht (Table) – Input Table.
freq_field (str) – Name of field in HT containing frequency annotation (array of structs). Default is “freq”.
filter_field (str) – Name of field in HT containing variant filter information. Default is “filters”.
canonical_only (bool) – Whether to filter to canonical transcripts. Default is True.
mane_select_only (bool) – Whether to filter to MANE Select transcripts. Default is False.
filter_decoy (bool) – Whether to filter decoy regions. Default is False.
index (int) – Which index of freq_expr to use for annotation. Default is 0.
- Return type:
Table
- Returns:
Table grouped by frequency bin and aggregated across summary count categories.
- gnomad.assessment.summary_stats.get_an_criteria(mt, samples_by_sex=None, meta_root='meta', sex_field='sex_imputation.sex_karyotype', xy_str='XY', xx_str='XX', freq_field='freq', freq_index=0, an_proportion_cutoff=0.8)
Generate criteria to filter samples based on allele number (AN).
Uses allele number as proxy for call rate.
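As a worked example of using AN as a call-rate proxy, the sketch below computes the minimum AN a site would need in order to pass. It is pure Python rather than Hail, and it rests on an assumption not stated in this page: that the maximum possible AN is the number of chromosome copies expected in the region (two per sample on autosomes and PAR, one per XY plus two per XX sample on non-PAR chrX, one per XY sample on non-PAR chrY). The function names and region labels are hypothetical.

```python
from typing import Dict


def an_threshold(
    region: str,
    samples_by_sex: Dict[str, int],
    xy_str: str = "XY",
    xx_str: str = "XX",
    an_proportion_cutoff: float = 0.8,
) -> float:
    """Return the minimum AN needed to pass, given the region's expected ploidy."""
    n_xy = samples_by_sex.get(xy_str, 0)
    n_xx = samples_by_sex.get(xx_str, 0)
    if region == "autosome_or_par":
        max_an = 2 * sum(samples_by_sex.values())
    elif region == "x_nonpar":
        max_an = n_xy + 2 * n_xx
    elif region == "y_nonpar":
        max_an = n_xy
    else:
        raise ValueError(f"unknown region: {region}")
    return an_proportion_cutoff * max_an


def passes_an(an: int, region: str, samples_by_sex: Dict[str, int], **kwargs) -> bool:
    """True when the observed AN meets the region-specific threshold."""
    return an >= an_threshold(region, samples_by_sex, **kwargs)


samples = {"XX": 60, "XY": 40}
autosome_min = an_threshold("autosome_or_par", samples)  # 0.8 * 200
x_min = an_threshold("x_nonpar", samples)                # 0.8 * (40 + 120)
y_min = an_threshold("y_nonpar", samples)                # 0.8 * 40
```

For 60 XX and 40 XY samples with the default cutoff of 0.8, this gives thresholds of roughly 160, 128, and 32 alleles respectively.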
- Parameters:
mt (MatrixTable) – Input MatrixTable.
samples_by_sex (Optional[Dict[str, int]]) – Optional Dictionary containing number of samples (value) for each sample sex (key).
meta_root (str) – Name of field in MatrixTable containing sample metadata information. Default is ‘meta’.
sex_field (str) – Name of field in MatrixTable containing sample sex assignment. Default is ‘sex_imputation.sex_karyotype’.
xy_str (str) – String marking whether a sample has XY sex. Default is ‘XY’.
xx_str (str) – String marking whether a sample has XX sex. Default is ‘XX’.
freq_field (str) – Name of field in MT that contains frequency information. Default is ‘freq’.
freq_index (int) – Which index of frequency struct to use. Default is 0.
an_proportion_cutoff (float) – Desired allele number proportion cutoff. Default is 0.8.
- Return type:
- gnomad.assessment.summary_stats.get_tx_expression_expr(key_expr, tx_ht, csq_expr, gene_field='ensg', csq_field='csq', tx_struct='tx_annotation')
Pull appropriate transcript expression annotation struct given a specific locus and alleles (provided in key_expr).
Assumes that key_expr contains a locus and alleles. Assumes that multi-allelic variants have been split in both tx_ht and key_expr.
- Parameters:
key_expr (StructExpression) – StructExpression containing locus and alleles to search in tx_ht.
tx_ht (Table) – Input Table containing transcript expression information.
csq_expr (StructExpression) – Input StructExpression that contains VEP consequence information.
gene_field (str) – Field in csq_expr that contains gene ID.
csq_field (str) – Field in csq_expr that contains most_severe_consequence annotation.
tx_struct (str) – StructExpression that contains transcript expression information.
- Return type:
StructExpression
- Returns:
StructExpression that contains transcript expression information for given gene ID in csq_expr.
- gnomad.assessment.summary_stats.default_generate_gene_lof_matrix(mt, tx_ht, high_expression_cutoff=0.9, low_expression_cutoff=0.1, filter_field='filters', freq_field='freq', freq_index=0, additional_csq_set={'missense_variant', 'synonymous_variant'}, all_transcripts=False, filter_an=False, filter_to_rare=False, pre_loftee=False, lof_csq_set={'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}, remove_ultra_common=False)
Generate loss-of-function gene matrix.
Used to generate summary metrics on LoF variants.
- Parameters:
mt (MatrixTable) – Input MatrixTable.
tx_ht (Optional[Table]) – Optional Table containing expression levels per transcript.
high_expression_cutoff (float) – Minimum mean proportion expressed cutoff for a transcript to be considered highly expressed. Default is 0.9.
low_expression_cutoff (float) – Upper mean proportion expressed cutoff for a transcript to be considered lowly expressed. Default is 0.1.
filter_field (str) – Name of field in MT that contains variant filters. Default is ‘filters’.
freq_field (str) – Name of field in MT that contains frequency information. Default is ‘freq’.
freq_index (int) – Which index of frequency struct to use. Default is 0.
additional_csq_set (Set[str]) – Set of additional consequences to keep. Default is {‘missense_variant’, ‘synonymous_variant’}.
all_transcripts (bool) – Whether to use all transcripts instead of just the transcript with most severe consequence. Default is False.
filter_an (bool) – Whether to filter using allele number as proxy for call rate. Default is False.
filter_to_rare (bool) – Whether to filter to rare (AF < 5%) variants. Default is False.
pre_loftee (bool) – Whether LoF consequences have been annotated with LOFTEE. Default is False.
lof_csq_set (Set[str]) – Set of LoF consequence strings. Default is {“splice_acceptor_variant”, “splice_donor_variant”, “stop_gained”, “frameshift_variant”}.
remove_ultra_common (bool) – Whether to remove ultra common (AF > 95%) variants. Default is False.
- Return type:
MatrixTable
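The frequency and expression cutoffs in the parameter list can be read as simple threshold tests. The pure-Python sketch below shows one plausible reading; it does not use Hail, the helper names and the "medium" label are hypothetical, and whether each boundary is inclusive or exclusive in the library is an assumption here.

```python
def keep_by_frequency(
    af: float,
    filter_to_rare: bool = False,
    remove_ultra_common: bool = False,
) -> bool:
    """Apply the optional AF filters described for filter_to_rare and remove_ultra_common."""
    if filter_to_rare and not af < 0.05:  # keep only rare variants (AF < 5%)
        return False
    if remove_ultra_common and af > 0.95:  # drop ultra common variants (AF > 95%)
        return False
    return True


def expression_class(
    mean_proportion: float,
    high_expression_cutoff: float = 0.9,
    low_expression_cutoff: float = 0.1,
) -> str:
    """Classify a transcript by its mean proportion expressed (boundaries assumed inclusive)."""
    if mean_proportion >= high_expression_cutoff:
        return "high"
    if mean_proportion <= low_expression_cutoff:
        return "low"
    return "medium"  # hypothetical label for everything in between
```

With both filters left at their default of False, every variant is kept regardless of AF, which matches the documented defaults.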
- gnomad.assessment.summary_stats.get_het_hom_summary_dict(csq_set, most_severe_csq_expr, defined_sites_expr, num_homs_expr, num_hets_expr, pop_expr)
Generate dictionary containing summary counts.
- Summary counts are:
Number of sites with defined genotype calls
Number of samples with heterozygous calls
Number of samples with homozygous calls
Function has option to generate counts by population.
- Parameters:
csq_set (Set[str]) – Set containing transcript consequence string(s).
most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence.
defined_sites_expr (Int64Expression) – Int64Expression containing number of sites with defined genotype calls.
num_homs_expr (Int64Expression) – Int64Expression containing number of samples with homozygous genotype calls.
num_hets_expr (Int64Expression) – Int64Expression containing number of samples with heterozygous genotype calls.
pop_expr (StringExpression) – StringExpression containing sample population labels.
- Return type:
Dict[str, Int64Expression]
- Returns:
Dictionary of summary annotation names and their values.
- gnomad.assessment.summary_stats.default_generate_gene_lof_summary(mt, collapse_indels=False, tx=False, lof_csq_set={'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}, meta_root='meta', pop_field='pop', filter_loftee=False)
Generate summary counts for loss-of-function (LoF), missense, and synonymous variants.
Also calculates p, the proportion of haplotypes carrying a putative LoF (pLoF) variant, and the observed/expected (OE) ratio of samples with homozygous pLoF variant calls.
- Summary counts are (all per gene):
Number of samples with no pLoF variants.
Number of samples with heterozygous pLoF variants.
Number of samples with homozygous pLoF variants.
Total number of sites with genotype calls.
All of the above stats grouped by population.
Assumes MT was created using default_generate_gene_lof_matrix.
Note
Assumes LoF variants in MT were filtered (LOFTEE pass and no LoF flag only). If LoF variants have not been filtered and filter_loftee is True, expects MT has the row annotation vep.
- Parameters:
mt (MatrixTable) – Input MatrixTable.
collapse_indels (bool) – Whether to collapse indels. Default is False.
tx (bool) – Whether input MT has transcript expression data. Default is False.
lof_csq_set (Set[str]) – Set containing LoF transcript consequence strings. Default is LOF_CSQ_SET.
meta_root (str) – String indicating top level name for sample metadata. Default is ‘meta’.
pop_field (str) – String indicating field with sample population assignment information. Default is ‘pop’.
filter_loftee (bool) – Filters to LOFTEE pass variants (and no LoF flags) only. Default is False.
- Return type:
Table
- Returns:
Table with het/hom summary counts.
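One way to derive p and the OE ratio from the per-gene counts is sketched below in pure Python. It assumes the two haplotypes of a sample carry pLoF variants independently, so that the fraction of samples with no pLoF variant is (1 - p)^2; the formulas and the names `plof_stats`, `no_lofs`, `obs_hom_lof`, and `defined` are assumptions for illustration and may not match the library's implementation.

```python
import math
from typing import Optional, Tuple


def plof_stats(
    no_lofs: int, obs_hom_lof: int, defined: int
) -> Tuple[float, float, Optional[float]]:
    """Return (p, expected homozygous pLoF samples, observed/expected ratio).

    no_lofs: samples with no pLoF variants in the gene
    obs_hom_lof: samples with a homozygous pLoF variant call
    defined: samples with defined genotype calls
    """
    # (1 - p)^2 = no_lofs / defined  =>  p = 1 - sqrt(no_lofs / defined)
    p = 1 - math.sqrt(no_lofs / defined)
    # Expected samples with pLoF on both haplotypes under independence.
    exp_hom_lof = defined * p * p
    oe = obs_hom_lof / exp_hom_lof if exp_hom_lof > 0 else None
    return p, exp_hom_lof, oe


# 8,100 of 10,000 samples carry no pLoF variant, and 50 are homozygous.
p, exp_hom_lof, oe = plof_stats(no_lofs=8100, obs_hom_lof=50, defined=10000)
```

In this example p is about 0.1, about 100 homozygous samples are expected, and the OE ratio is about 0.5, meaning half as many homozygous pLoF samples are observed as the independence assumption predicts.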