gnomad.assessment.summary_stats

gnomad.assessment.summary_stats.freq_bin_expr(...)

Return frequency string annotations based on input AC or AF.

gnomad.assessment.summary_stats.get_summary_counts_dict(...)

Return dictionary containing counts of multiple variant categories.

gnomad.assessment.summary_stats.get_summary_ac_dict(...)

Return dictionary containing total allele counts for variant categories.

gnomad.assessment.summary_stats.get_summary_counts(ht)

Generate a struct with summary counts across variant categories.

gnomad.assessment.summary_stats.get_an_criteria(mt)

Generate criteria to filter samples based on allele number (AN).

gnomad.assessment.summary_stats.get_tx_expression_expr(...)

Pull appropriate transcript expression annotation struct given a specific locus and alleles (provided in key_expr).

gnomad.assessment.summary_stats.default_generate_gene_lof_matrix(mt, ...)

Generate loss-of-function gene matrix.

gnomad.assessment.summary_stats.get_het_hom_summary_dict(...)

Generate dictionary containing summary counts.

gnomad.assessment.summary_stats.default_generate_gene_lof_summary(mt)

Generate summary counts for loss-of-function (LoF), missense, and synonymous variants.

gnomad.assessment.summary_stats.freq_bin_expr(freq_expr, index=0)

Return frequency string annotations based on input AC or AF.

Note

  • Default index is 0 because function assumes freq_expr was calculated with annotate_freq.

  • Frequency index 0 from annotate_freq is frequency for all pops calculated on adj genotypes only.

Parameters:
  • freq_expr (ArrayExpression) – Array of structs containing frequency information.

  • index (int) – Which index of freq_expr to use for annotation. Default is 0.

Return type:

StringExpression

Returns:

StringExpression containing bin name based on input AC or AF.
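
The binning idea can be sketched in plain Python, without Hail. This is only an illustration: the bin labels and thresholds below are assumptions chosen for the example, not the values the library uses, and `freq` is a hypothetical list of dicts standing in for the array of frequency structs.

```python
from typing import Dict, List, Optional


def freq_bin(freq: List[Dict[str, Optional[float]]], index: int = 0) -> Optional[str]:
    """Return an illustrative bin label for the AC/AF stored at freq[index]."""
    entry = freq[index]
    ac, af = entry.get("AC"), entry.get("AF")
    if ac is None or af is None:
        return None  # missing frequency information gives a missing bin
    # Very low counts are binned on AC; everything else on AF.
    if ac == 0:
        return "Not found"
    if ac == 1:
        return "Singleton"
    if ac == 2:
        return "Doubleton"
    # Hypothetical AF thresholds, for illustration only.
    if af < 0.001:
        return "<0.1%"
    if af < 0.01:
        return "0.1%-1%"
    if af < 0.95:
        return "1%-95%"
    return ">95%"


# Index 0 plays the role of the "all pops, adj genotypes" entry.
freq = [{"AC": 1, "AF": 0.00001}, {"AC": 40, "AF": 0.004}]
print(freq_bin(freq))
print(freq_bin(freq, index=1))
```

Passing a different `index` selects a different entry of the frequency array, which is the role the `index` parameter plays above.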

gnomad.assessment.summary_stats.get_summary_counts_dict(locus_expr, allele_expr, lof_expr, no_lof_flags_expr, most_severe_csq_expr, prefix_str='')

Return dictionary containing counts of multiple variant categories.

Categories are:
  • Number of variants

  • Number of indels

  • Number of SNVs

  • Number of LoF variants

  • Number of LoF variants that pass LOFTEE

  • Number of LoF variants that pass LOFTEE without any flags

  • Number of LoF variants annotated as ‘other splice’ (OS) by LOFTEE

  • Number of LoF variants that fail LOFTEE

  • Number of missense variants

  • Number of synonymous variants

  • Number of autosomal variants

  • Number of allosomal variants

Warning

Assumes allele_expr contains only two alleles (multi-allelics have been split).

Parameters:
  • locus_expr (LocusExpression) – LocusExpression.

  • allele_expr (ArrayExpression) – ArrayExpression containing alleles.

  • lof_expr (StringExpression) – StringExpression containing LOFTEE annotation.

  • no_lof_flags_expr (BooleanExpression) – BooleanExpression indicating whether LoF variant has any flags.

  • most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence annotation.

  • prefix_str (str) – Desired prefix string for category names. Default is empty str.

Return type:

Dict[str, Int64Expression]

Returns:

Dict of categories and counts per category.
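
The categories above can be illustrated with a plain-Python counter over split variants. This is a sketch under stated assumptions, not the library's Hail aggregation: the `Variant` record, the category key names, the use of LOFTEE labels "HC", "LC" and "OS" for pass, fail and other splice, and the contig-based autosome/allosome test (which ignores pseudoautosomal regions) are all choices made for the example.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

AUTOSOMES = {f"chr{i}" for i in range(1, 23)}
ALLOSOMES = {"chrX", "chrY"}  # simplification: PAR regions are not handled


class Variant(NamedTuple):
    """Hypothetical record for one bi-allelic (split) variant."""
    contig: str
    alleles: Tuple[str, str]   # (ref, alt)
    lof: Optional[str]         # assumed LOFTEE label: "HC", "LC", "OS" or None
    no_lof_flags: bool
    most_severe_csq: str


def is_snv(alleles: Tuple[str, str]) -> bool:
    ref, alt = alleles
    return len(ref) == 1 and len(alt) == 1


def summary_counts(variants: Iterable[Variant], prefix_str: str = "") -> Dict[str, int]:
    """Count variants per category; key names are illustrative."""
    categories = {
        "num_variants": lambda v: True,
        "indels": lambda v: not is_snv(v.alleles),
        "snvs": lambda v: is_snv(v.alleles),
        "lof": lambda v: v.lof is not None,
        "pass_loftee": lambda v: v.lof == "HC",
        "pass_loftee_no_flag": lambda v: v.lof == "HC" and v.no_lof_flags,
        "loftee_os": lambda v: v.lof == "OS",
        "fail_loftee": lambda v: v.lof == "LC",
        "missense": lambda v: v.most_severe_csq == "missense_variant",
        "synonymous": lambda v: v.most_severe_csq == "synonymous_variant",
        "autosomal": lambda v: v.contig in AUTOSOMES,
        "allosomal": lambda v: v.contig in ALLOSOMES,
    }
    variants = list(variants)
    return {
        prefix_str + name: sum(1 for v in variants if test(v))
        for name, test in categories.items()
    }


variants = [
    Variant("chr1", ("A", "T"), None, False, "missense_variant"),
    Variant("chr2", ("AT", "A"), "HC", True, "frameshift_variant"),
    Variant("chrX", ("G", "A"), "LC", False, "stop_gained"),
]
counts = summary_counts(variants, prefix_str="rare_")
print(counts)
```

The `prefix_str` argument only changes the dictionary keys, which is useful when the same counts are computed for several subsets and stored side by side.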

gnomad.assessment.summary_stats.get_summary_ac_dict(ac_expr, lof_expr, no_lof_flags_expr, most_severe_csq_expr)

Return dictionary containing total allele counts for variant categories.

Categories are:
  • All variants

  • LoF variants

  • LoF variants that pass LOFTEE

  • LoF variants that pass LOFTEE without any flags

  • LoF variants that are annotated as ‘other splice’ (OS) by LOFTEE

  • LoF variants that fail LOFTEE

  • Missense variants

  • Synonymous variants

Warning

Assumes multi-allelic variants have been split, so each variant has only two alleles.

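Parameters:
  • ac_expr – Expression containing allele counts (AC).

  • lof_expr (StringExpression) – StringExpression containing LOFTEE annotation.

  • no_lof_flags_expr (BooleanExpression) – BooleanExpression indicating whether LoF variant has any flags.

  • most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence annotation.
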
Return type:

Dict[str, Int64Expression]

Returns:

Dict of variant categories and their total allele counts.

gnomad.assessment.summary_stats.get_summary_counts(ht, freq_field='freq', filter_field='filters', filter_decoy=False, canonical_only=True, mane_select_only=False, index=0)

Generate a struct with summary counts across variant categories.

Summary counts:
  • Number of variants

  • Number of indels

  • Number of SNVs

  • Number of LoF variants

  • Number of LoF variants that pass LOFTEE (including with LoF flags)

  • Number of LoF variants that pass LOFTEE without LoF flags

  • Number of OS (other splice) variants annotated by LOFTEE

  • Number of LoF variants that fail LOFTEE filters

Also annotates Table’s globals with total variant counts.

Before calculating summary counts, function:
  • Filters out low confidence regions

  • Uses the most severe consequence

  • Filters to canonical transcripts (if canonical_only is True) or MANE Select transcripts (if mane_select_only is True)

Assumes that:
  • Input HT is annotated with VEP.

  • Multiallelic variants have been split and/or input HT contains bi-allelic variants only.

  • The frequency annotation in freq_field was calculated with annotate_freq.

  • (Frequency index 0 from annotate_freq is frequency for all pops calculated on adj genotypes only.)

Parameters:
  • ht (Table) – Input Table.

  • freq_field (str) – Name of field in HT containing frequency annotation (array of structs). Default is “freq”.

  • filter_field (str) – Name of field in HT containing variant filter information. Default is “filters”.

  • canonical_only (bool) – Whether to filter to canonical transcripts. Default is True.

  • mane_select_only (bool) – Whether to filter to MANE Select transcripts. Default is False.

  • filter_decoy (bool) – Whether to filter decoy regions. Default is False.

  • index (int) – Which index of the frequency array in freq_field to use for annotation. Default is 0.

Return type:

Table

Returns:

Table grouped by frequency bin and aggregated across summary count categories.

gnomad.assessment.summary_stats.get_an_criteria(mt, samples_by_sex=None, meta_root='meta', sex_field='sex_imputation.sex_karyotype', xy_str='XY', xx_str='XX', freq_field='freq', freq_index=0, an_proportion_cutoff=0.8)

Generate criteria to filter samples based on allele number (AN).

Uses allele number as proxy for call rate.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • samples_by_sex (Optional[Dict[str, int]]) – Optional Dictionary containing number of samples (value) for each sample sex (key).

  • meta_root (str) – Name of field in MatrixTable containing sample metadata information. Default is ‘meta’.

  • sex_field (str) – Name of field in MatrixTable containing sample sex assignment. Default is ‘sex_imputation.sex_karyotype’.

  • xy_str (str) – String marking whether a sample has XY sex. Default is ‘XY’.

  • xx_str (str) – String marking whether a sample has XX sex. Default is ‘XX’.

  • freq_field (str) – Name of field in MT that contains frequency information. Default is ‘freq’.

  • freq_index (int) – Which index of frequency struct to use. Default is 0.

  • an_proportion_cutoff (float) – Desired allele number proportion cutoff. Default is 0.8.

Return type:

BooleanExpression
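
The arithmetic behind an AN-based call rate filter can be sketched in plain Python. The sketch assumes the maximum possible AN is two alleles per sample on autosomes, one allele per XY sample plus two per XX sample on non-PAR chrX, and one allele per XY sample on non-PAR chrY. That ploidy model, the contig names, and the helper `passes_an_cutoff` are assumptions made for the example, not a description of the library's code.

```python
from typing import Dict


def passes_an_cutoff(
    an: int,
    contig: str,
    samples_by_sex: Dict[str, int],
    xy_str: str = "XY",
    xx_str: str = "XX",
    an_proportion_cutoff: float = 0.8,
    in_par: bool = False,
) -> bool:
    """Return True if AN reaches the cutoff proportion of the maximum possible AN."""
    n_xy = samples_by_sex[xy_str]
    n_xx = samples_by_sex[xx_str]
    if contig == "chrX" and not in_par:
        max_an = n_xy + 2 * n_xx      # XY samples contribute one X allele
    elif contig == "chrY" and not in_par:
        max_an = n_xy                 # only XY samples contribute
    else:
        max_an = 2 * (n_xy + n_xx)    # diploid everywhere else
    return an >= an_proportion_cutoff * max_an


samples_by_sex = {"XY": 400, "XX": 600}
print(passes_an_cutoff(1700, "chr1", samples_by_sex))  # cutoff is 0.8 * 2000
print(passes_an_cutoff(1300, "chrX", samples_by_sex))  # cutoff is 0.8 * 1600
print(passes_an_cutoff(300, "chrY", samples_by_sex))   # cutoff is 0.8 * 400
```

With 400 XY and 600 XX samples and the default cutoff of 0.8, a site needs an AN of at least 1600 on autosomes, 1280 on non-PAR chrX, and 320 on non-PAR chrY under this model.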

gnomad.assessment.summary_stats.get_tx_expression_expr(key_expr, tx_ht, csq_expr, gene_field='ensg', csq_field='csq', tx_struct='tx_annotation')

Pull appropriate transcript expression annotation struct given a specific locus and alleles (provided in key_expr).

Assumes that key_expr contains a locus and alleles. Assumes that multi-allelic variants have been split in both tx_ht and key_expr.

Parameters:
  • key_expr (StructExpression) – StructExpression containing locus and alleles to search in tx_ht.

  • tx_ht (Table) – Input Table containing transcript expression information.

  • csq_expr (StructExpression) – Input StructExpression that contains VEP consequence information.

  • gene_field (str) – Field in csq_expr that contains gene ID.

  • csq_field (str) – Field in csq_expr that contains most_severe_consequence annotation.

  • tx_struct (str) – StructExpression that contains transcript expression information.

Return type:

Float64Expression

Returns:

StructExpression that contains transcript expression information for given gene ID in csq_expr.

gnomad.assessment.summary_stats.default_generate_gene_lof_matrix(mt, tx_ht, high_expression_cutoff=0.9, low_expression_cutoff=0.1, filter_field='filters', freq_field='freq', freq_index=0, additional_csq_set={'missense_variant', 'synonymous_variant'}, all_transcripts=False, filter_an=False, filter_to_rare=False, pre_loftee=False, lof_csq_set={'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}, remove_ultra_common=False)

Generate loss-of-function gene matrix.

Used to generate summary metrics on LoF variants.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • tx_ht (Optional[Table]) – Optional Table containing expression levels per transcript.

  • high_expression_cutoff (float) – Minimum mean proportion expressed cutoff for a transcript to be considered highly expressed. Default is 0.9.

  • low_expression_cutoff (float) – Upper mean proportion expressed cutoff for a transcript to be considered lowly expressed. Default is 0.1.

  • filter_field (str) – Name of field in MT that contains variant filters. Default is ‘filters’.

  • freq_field (str) – Name of field in MT that contains frequency information. Default is ‘freq’.

  • freq_index (int) – Which index of frequency struct to use. Default is 0.

  • additional_csq_set (Set[str]) – Set of additional consequences to keep. Default is {‘missense_variant’, ‘synonymous_variant’}.

  • all_transcripts (bool) – Whether to use all transcripts instead of just the transcript with most severe consequence. Default is False.

  • filter_an (bool) – Whether to filter using allele number as proxy for call rate. Default is False.

  • filter_to_rare (bool) – Whether to filter to rare (AF < 5%) variants. Default is False.

  • pre_loftee (bool) – Whether LoF consequences have been annotated with LOFTEE. Default is False.

  • lof_csq_set (Set[str]) – Set of LoF consequence strings. Default is {“splice_acceptor_variant”, “splice_donor_variant”, “stop_gained”, “frameshift_variant”}.

  • remove_ultra_common (bool) – Whether to remove ultra common (AF > 95%) variants. Default is False.

Return type:

MatrixTable
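
How the consequence sets and the two frequency switches might combine into a per-variant keep/drop decision can be sketched in plain Python. This is a hypothetical helper, not the library's implementation: `keep_variant` and its arguments are invented for the example, and it assumes a variant passes the filter field when its filter set is empty. Only the 5% and 95% thresholds and the default consequence sets come from the parameter descriptions above.

```python
from typing import AbstractSet, FrozenSet

LOF_CSQ_SET: FrozenSet[str] = frozenset(
    {"frameshift_variant", "splice_acceptor_variant", "splice_donor_variant", "stop_gained"}
)
ADDITIONAL_CSQ_SET: FrozenSet[str] = frozenset({"missense_variant", "synonymous_variant"})


def keep_variant(
    most_severe_csq: str,
    af: float,
    filters: AbstractSet[str],
    *,
    lof_csq_set: AbstractSet[str] = LOF_CSQ_SET,
    additional_csq_set: AbstractSet[str] = ADDITIONAL_CSQ_SET,
    filter_to_rare: bool = False,
    remove_ultra_common: bool = False,
) -> bool:
    """Decide whether a variant would be kept under the illustrated rules."""
    if filters:                      # assumption: any variant filter drops the row
        return False
    if most_severe_csq not in (set(lof_csq_set) | set(additional_csq_set)):
        return False                 # keep only LoF and the additional consequences
    if filter_to_rare and not af < 0.05:
        return False                 # rare means AF < 5%
    if remove_ultra_common and af > 0.95:
        return False                 # ultra common means AF > 95%
    return True


print(keep_variant("stop_gained", 0.2, set()))
print(keep_variant("stop_gained", 0.2, set(), filter_to_rare=True))
print(keep_variant("missense_variant", 0.99, set(), remove_ultra_common=True))
```

Both frequency switches default to off, so with the defaults only the variant filters and the consequence sets restrict which variants are kept in this sketch.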

gnomad.assessment.summary_stats.get_het_hom_summary_dict(csq_set, most_severe_csq_expr, defined_sites_expr, num_homs_expr, num_hets_expr, pop_expr)

Generate dictionary containing summary counts.

Summary counts are:
  • Number of sites with defined genotype calls

  • Number of samples with heterozygous calls

  • Number of samples with homozygous calls

Function has option to generate counts by population.

Parameters:
  • csq_set (Set[str]) – Set containing transcript consequence string(s).

  • most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence.

  • defined_sites_expr (Int64Expression) – Int64Expression containing number of sites with defined genotype calls.

  • num_homs_expr (Int64Expression) – Int64Expression containing number of samples with homozygous genotype calls.

  • num_hets_expr (Int64Expression) – Int64Expression containing number of samples with heterozygous genotype calls.

  • pop_expr (StringExpression) – StringExpression containing sample population labels.

Return type:

Dict[str, Int64Expression]

Returns:

Dictionary of summary annotation names and their values.

gnomad.assessment.summary_stats.default_generate_gene_lof_summary(mt, collapse_indels=False, tx=False, lof_csq_set={'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}, meta_root='meta', pop_field='pop', filter_loftee=False)

Generate summary counts for loss-of-function (LoF), missense, and synonymous variants.

Also calculates p, the proportion of haplotypes carrying a putative LoF (pLoF) variant, and the observed/expected (OE) ratio of samples with homozygous pLoF variant calls.

Summary counts are (all per gene):
  • Number of samples with no pLoF variants.

  • Number of samples with heterozygous pLoF variants.

  • Number of samples with homozygous pLoF variants.

  • Total number of sites with genotype calls.

  • All of the above stats grouped by population.

Assumes MT was created using default_generate_gene_lof_matrix.

Note

Assumes LoF variants in MT were filtered (LOFTEE pass and no LoF flag only). If LoF variants have not been filtered and filter_loftee is True, expects MT has the row annotation vep.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • collapse_indels (bool) – Whether to collapse indels. Default is False.

  • tx (bool) – Whether input MT has transcript expression data. Default is False.

  • lof_csq_set (Set[str]) – Set containing LoF transcript consequence strings. Default is LOF_CSQ_SET.

  • meta_root (str) – String indicating top level name for sample metadata. Default is ‘meta’.

  • pop_field (str) – String indicating field with sample population assignment information. Default is ‘pop’.

  • filter_loftee (bool) – Filters to LOFTEE pass variants (and no LoF flags) only. Default is False.

Return type:

Table

Returns:

Table with het/hom summary counts.
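
One way to estimate p and the OE ratio from the per-gene counts can be sketched in plain Python. The formulas are an assumption based on Hardy-Weinberg proportions and are not stated on this page: if a fraction (1 - p)^2 of samples carry no pLoF variant, then p = 1 - sqrt(no_lofs / defined), and the expected number of homozygous samples is defined * p^2. The function and argument names are hypothetical.

```python
import math
from typing import Dict


def lof_summary(no_lofs: int, obs_hom_lof: int, defined: int) -> Dict[str, float]:
    """Estimate p, expected homozygotes, and OE for one gene (illustrative)."""
    # Under Hardy-Weinberg, P(no pLoF) = (1 - p)^2, so solve for p.
    p = 1 - math.sqrt(no_lofs / defined)
    exp_hom_lof = defined * p * p
    # Avoid dividing by zero when no pLoF haplotypes are expected.
    oe = obs_hom_lof / exp_hom_lof if exp_hom_lof > 0 else float("nan")
    return {"p": p, "exp_hom_lof": exp_hom_lof, "oe": oe}


# 10,000 samples with defined calls, 8,100 of them without any pLoF variant.
stats = lof_summary(no_lofs=8100, obs_hom_lof=50, defined=10000)
print(stats)
```

In this example p is about 0.1, so roughly 100 homozygous samples are expected; observing 50 gives an OE ratio of about 0.5.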