gnomad.assessment.summary_stats

gnomad.assessment.summary_stats.freq_bin_expr(...)

Return frequency string annotations based on input AC or AF.

gnomad.assessment.summary_stats.get_summary_counts_dict(...)

Return dictionary containing counts of multiple variant categories.

gnomad.assessment.summary_stats.get_summary_ac_dict(...)

Return dictionary containing total allele counts for variant categories.

gnomad.assessment.summary_stats.get_summary_counts(ht)

Generate a struct with summary counts across variant categories.

gnomad.assessment.summary_stats.get_an_criteria(mt)

Generate criteria to filter samples based on allele number (AN).

gnomad.assessment.summary_stats.get_tx_expression_expr(...)

Pull appropriate transcript expression annotation struct given a specific locus and alleles (provided in key_expr).

gnomad.assessment.summary_stats.get_summary_stats_variant_filter_expr(t)

Generate variant filtering expression for summary stats.

gnomad.assessment.summary_stats.get_summary_stats_csq_filter_expr(t)

Generate consequence filtering expression for summary stats.

gnomad.assessment.summary_stats.generate_filter_combinations(combos)

Generate list of all possible filter combinations from a list of filter options.

gnomad.assessment.summary_stats.get_summary_stats_filter_group_meta(...)

Generate list of filter group combination metadata for summary stats.

gnomad.assessment.summary_stats.default_generate_gene_lof_matrix(mt, ...)

Generate loss-of-function gene matrix.

gnomad.assessment.summary_stats.get_het_hom_summary_dict(...)

Generate dictionary containing summary counts.

gnomad.assessment.summary_stats.default_generate_gene_lof_summary(mt)

Generate summary counts for loss-of-function (LoF), missense, and synonymous variants.

gnomad.assessment.summary_stats.freq_bin_expr(freq_expr, index=0)

Return frequency string annotations based on input AC or AF.

Note

  • Default index is 0 because function assumes freq_expr was calculated with annotate_freq.

  • Frequency index 0 from annotate_freq is frequency for all pops calculated on adj genotypes only.

Parameters:
  • freq_expr (ArrayExpression) – Array of structs containing frequency information.

  • index (int) – Which index of freq_expr to use for annotation. Default is 0.

Return type:

StringExpression

Returns:

StringExpression containing bin name based on input AC or AF.
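The binning idea can be illustrated in plain Python. This is only a sketch: the helper name `freq_bin`, the bin labels, and the AF boundaries are hypothetical, since this section does not list the bins the library actually uses, and the real function operates on Hail expressions rather than Python lists.

```python
from typing import Dict, List


def freq_bin(freq: List[Dict[str, float]], index: int = 0) -> str:
    """Return a frequency-bin label for one variant (illustrative bins only)."""
    # Index 0 is assumed to hold the all-groups, adj-genotype frequency.
    entry = freq[index]
    ac, af = entry["AC"], entry["AF"]
    if ac == 1:
        return "singleton"
    if ac == 2:
        return "doubleton"
    # Hypothetical AF boundaries, checked from rarest to most common.
    for upper, label in [(0.001, "<0.1%"), (0.01, "0.1%-1%"), (0.1, "1%-10%")]:
        if af < upper:
            return label
    return ">=10%"


example_freq = [{"AC": 1, "AF": 0.00001}, {"AC": 0, "AF": 0.0}]
label = freq_bin(example_freq)
```

Passing a different `index` selects another entry of the frequency array, mirroring the `index` parameter described above.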

gnomad.assessment.summary_stats.get_summary_counts_dict(locus_expr, allele_expr, lof_expr, no_lof_flags_expr, most_severe_csq_expr, prefix_str='')

Return dictionary containing counts of multiple variant categories.

Categories are:
  • Number of variants

  • Number of indels

  • Number of SNVs

  • Number of LoF variants

  • Number of LoF variants that pass LOFTEE

  • Number of LoF variants that pass LOFTEE without any flags

  • Number of LoF variants annotated as ‘other splice’ (OS) by LOFTEE

  • Number of LoF variants that fail LOFTEE

  • Number of missense variants

  • Number of synonymous variants

  • Number of autosomal variants

  • Number of allosomal variants

Warning

Assumes allele_expr contains only two alleles (multi-allelics have been split).

Parameters:
  • locus_expr (LocusExpression) – LocusExpression.

  • allele_expr (ArrayExpression) – ArrayExpression containing alleles.

  • lof_expr (StringExpression) – StringExpression containing LOFTEE annotation.

  • no_lof_flags_expr (BooleanExpression) – BooleanExpression indicating whether LoF variant has any flags.

  • most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence annotation.

  • prefix_str (str) – Desired prefix string for category names. Default is empty str.

Return type:

Dict[str, Int64Expression]

Returns:

Dict of categories and counts per category.
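The categories above can be tallied in plain Python to make the bookkeeping concrete. This is a minimal sketch, not the library's Hail aggregation: the record layout, the key names, and the contig naming (`chrX`, `chrY`) are assumptions made for illustration. It relies on the LOFTEE labels HC (high confidence, pass), LC (low confidence, fail), and OS (other splice).

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# One record per bi-allelic variant (hypothetical layout):
# (contig, ref, alt, loftee_label, no_lof_flags, most_severe_csq)
Variant = Tuple[str, str, str, Optional[str], bool, str]


def summary_counts(variants: Iterable[Variant], prefix_str: str = "") -> Dict[str, int]:
    """Count variants per category; key names are illustrative."""
    counts: Counter = Counter()
    for contig, ref, alt, lof, no_flags, csq in variants:
        # Simplification: anything that is not a single-base substitution
        # is treated as an indel here.
        is_snv = len(ref) == 1 and len(alt) == 1
        membership = {
            "num_variants": True,
            "indels": not is_snv,
            "snvs": is_snv,
            "lof": lof is not None,
            "pass_loftee": lof == "HC",
            "pass_loftee_no_flag": lof == "HC" and no_flags,
            "loftee_os": lof == "OS",
            "fail_loftee": lof == "LC",
            "num_missense": csq == "missense_variant",
            "num_synonymous": csq == "synonymous_variant",
            "num_autosomal": contig not in ("chrX", "chrY"),
            "num_allosomal": contig in ("chrX", "chrY"),
        }
        for name, hit in membership.items():
            counts[prefix_str + name] += int(hit)
    return dict(counts)


example_variants = [
    ("chr1", "A", "T", "HC", True, "stop_gained"),
    ("chr1", "AT", "A", "LC", False, "frameshift_variant"),
    ("chrX", "G", "C", None, False, "missense_variant"),
]
example_counts = summary_counts(example_variants)
```

The `prefix_str` argument is prepended to every key, which is how one set of categories can be reported side by side for different subsets.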

gnomad.assessment.summary_stats.get_summary_ac_dict(ac_expr, lof_expr, no_lof_flags_expr, most_severe_csq_expr)

Return dictionary containing total allele counts for variant categories.

Categories are:
  • All variants

  • LoF variants

  • LoF variants that pass LOFTEE

  • LoF variants that pass LOFTEE without any flags

  • LoF variants that are annotated as ‘other splice’ (OS) by LOFTEE

  • LoF variants that fail LOFTEE

  • Missense variants

  • Synonymous variants

Warning

Assumes multi-allelic variants have been split, so that each variant has only two alleles.

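Parameters:
  • ac_expr – Expression containing allele counts (AC).

  • lof_expr (StringExpression) – StringExpression containing LOFTEE annotation.

  • no_lof_flags_expr (BooleanExpression) – BooleanExpression indicating whether LoF variant has any flags.

  • most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence annotation.

Return type: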

Dict[str, Int64Expression]

Returns:

Dict of variant categories and their total allele counts.

gnomad.assessment.summary_stats.get_summary_counts(ht, freq_field='freq', filter_field='filters', filter_decoy=False, canonical_only=True, mane_select_only=False, index=0)

Generate a struct with summary counts across variant categories.

Summary counts:
  • Number of variants

  • Number of indels

  • Number of SNVs

  • Number of LoF variants

  • Number of LoF variants that pass LOFTEE (including with LoF flags)

  • Number of LoF variants that pass LOFTEE without LoF flags

  • Number of OS (other splice) variants annotated by LOFTEE

  • Number of LoF variants that fail LOFTEE filters

Also annotates Table’s globals with total variant counts.

Before calculating summary counts, function:
  • Filters out low confidence regions

  • Uses the most severe consequence

  • Filters to canonical transcripts (if canonical_only is True) or MANE Select transcripts (if mane_select_only is True)

Assumes that:
  • Input HT is annotated with VEP.

  • Multiallelic variants have been split and/or input HT contains bi-allelic variants only.

  • freq_expr was calculated with annotate_freq.

  • (Frequency index 0 from annotate_freq is frequency for all pops calculated on adj genotypes only.)

Parameters:
  • ht (Table) – Input Table.

  • freq_field (str) – Name of field in HT containing frequency annotation (array of structs). Default is “freq”.

  • filter_field (str) – Name of field in HT containing variant filter information. Default is “filters”.

  • canonical_only (bool) – Whether to filter to canonical transcripts. Default is True.

  • mane_select_only (bool) – Whether to filter to MANE Select transcripts. Default is False.

  • filter_decoy (bool) – Whether to filter decoy regions. Default is False.

  • index (int) – Which index of freq_expr to use for annotation. Default is 0.

Return type:

Table

Returns:

Table grouped by frequency bin and aggregated across summary count categories.

gnomad.assessment.summary_stats.get_an_criteria(mt, samples_by_sex=None, meta_root='meta', sex_field='sex_imputation.sex_karyotype', xy_str='XY', xx_str='XX', freq_field='freq', freq_index=0, an_proportion_cutoff=0.8)

Generate criteria to filter samples based on allele number (AN).

Uses allele number as proxy for call rate.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • samples_by_sex (Optional[Dict[str, int]]) – Optional Dictionary containing number of samples (value) for each sample sex (key).

  • meta_root (str) – Name of field in MatrixTable containing sample metadata information. Default is ‘meta’.

  • sex_field (str) – Name of field in MatrixTable containing sample sex assignment. Default is ‘sex_imputation.sex_karyotype’.

  • xy_str (str) – String marking whether a sample has XY sex. Default is ‘XY’.

  • xx_str (str) – String marking whether a sample has XX sex. Default is ‘XX’.

  • freq_field (str) – Name of field in MT that contains frequency information. Default is ‘freq’.

  • freq_index (int) – Which index of frequency struct to use. Default is 0.

  • an_proportion_cutoff (float) – Desired allele number proportion cutoff. Default is 0.8.

Return type:

BooleanExpression
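Using allele number as a proxy for call rate amounts to comparing the observed AN with the largest AN the cohort could produce at that locus. The plain-Python sketch below illustrates one way to do that, assuming diploid calls on autosomes and pseudoautosomal regions, and haploid calls for XY samples on the non-PAR sex chromosomes. The helper name `passes_an_cutoff` and the `region` labels are hypothetical, and this section does not spell out the library's exact per-region rules.

```python
from typing import Dict


def passes_an_cutoff(
    an: int,
    region: str,
    samples_by_sex: Dict[str, int],
    an_proportion_cutoff: float = 0.8,
    xy_str: str = "XY",
    xx_str: str = "XX",
) -> bool:
    """Return True if AN reaches the requested share of the maximum possible AN."""
    n_xy, n_xx = samples_by_sex[xy_str], samples_by_sex[xx_str]
    if region == "x_nonpar":
        # Assumed: XY samples contribute one allele, XX samples two.
        max_an = n_xy + 2 * n_xx
    elif region == "y_nonpar":
        # Assumed: only XY samples contribute, one allele each.
        max_an = n_xy
    else:
        # Autosomes and PAR: two alleles per sample.
        max_an = 2 * (n_xy + n_xx)
    return an >= an_proportion_cutoff * max_an


cohort = {"XY": 40, "XX": 60}
autosomal_ok = passes_an_cutoff(170, "autosome_or_par", cohort)
```

With 100 samples and the default cutoff of 0.8, an autosomal site needs an AN of at least 160 out of a possible 200 to pass.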

gnomad.assessment.summary_stats.get_tx_expression_expr(key_expr, tx_ht, csq_expr, gene_field='ensg', csq_field='csq', tx_struct='tx_annotation')

Pull appropriate transcript expression annotation struct given a specific locus and alleles (provided in key_expr).

Assumes that key_expr contains a locus and alleles. Assumes that multi-allelic variants have been split in both tx_ht and key_expr.

Parameters:
  • key_expr (StructExpression) – StructExpression containing locus and alleles to search in tx_ht.

  • tx_ht (Table) – Input Table containing transcript expression information.

  • csq_expr (StructExpression) – Input StructExpression that contains VEP consequence information.

  • gene_field (str) – Field in csq_expr that contains gene ID.

  • csq_field (str) – Field in csq_expr that contains most_severe_consequence annotation.

  • tx_struct (str) – Name of the struct that contains transcript expression information.

Return type:

Float64Expression

Returns:

StructExpression that contains transcript expression information for given gene ID in csq_expr.

gnomad.assessment.summary_stats.get_summary_stats_variant_filter_expr(t, filter_lcr=False, filter_expr=None, freq_expr=None, grpmax_expr=None, max_af=None, max_grpmax=None, min_an_proportion=None)

Generate variant filtering expression for summary stats.

The possible filtering groups are:

  • ‘no_lcr’ if filter_lcr is True.

  • ‘variant_qc_pass’ if filter_expr is provided.

  • ‘max_af’ as a struct with a field for each af in max_af if max_af is provided.

  • ‘max_grpmax’ as a struct with a field for each grpmax in max_grpmax if max_grpmax is provided.

  • ‘min_an’ as a struct with a field for each an_proportion in min_an_proportion if min_an_proportion is provided.

Parameters:
  • t (Union[Table, MatrixTable]) – Input Table/MatrixTable.

  • filter_lcr (bool) – Whether to filter out low confidence regions. Default is False.

  • filter_expr (SetExpression) – SetExpression containing variant filters. Default is None.

  • freq_expr (Float64Expression) – Float64Expression containing frequency information. Default is None.

  • grpmax_expr (Float64Expression) – Float64Expression containing group max frequency information. Default is None.

  • max_af (Union[float, List[float], None]) – Maximum allele frequency cutoff(s). Can be a single float or a list of floats. Default is None.

  • max_grpmax (Union[float, List[float], None]) – Maximum genetic ancestry group max frequency cutoff(s). Can be a single float or a list of floats. Default is None.

  • min_an_proportion (Union[float, List[float], None]) – Minimum allele number proportion (used as a proxy for call rate). Default is None.

Return type:

Dict[str, Union[BooleanExpression, StructExpression]]

Returns:

Dict of BooleanExpressions or StructExpressions for filtering variants.
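The shape of the returned dictionary, with plain booleans for some groups and one field per cutoff for others, can be mimicked in plain Python. The sketch below covers only three of the groups and evaluates them for a single variant. The helper name, the per-variant inputs, the use of a strict `<` comparison, and the reading of "variant QC pass" as an empty filter set are all assumptions rather than the library's documented behavior.

```python
from typing import Dict, List, Optional, Set, Union


def variant_filter_groups(
    af: float,
    in_lcr: bool,
    filters: Optional[Set[str]] = None,
    filter_lcr: bool = False,
    max_af: Union[float, List[float], None] = None,
) -> Dict[str, Union[bool, Dict[str, bool]]]:
    """Build a dict of filter groups for one variant (illustrative only)."""
    groups: Dict[str, Union[bool, Dict[str, bool]]] = {}
    if filter_lcr:
        groups["no_lcr"] = not in_lcr
    if filters is not None:
        # Assumed: a variant passes QC when its filter set is empty.
        groups["variant_qc_pass"] = len(filters) == 0
    if max_af is not None:
        cutoffs = [max_af] if isinstance(max_af, (int, float)) else max_af
        # One field per cutoff, keyed by the cutoff's string form.
        groups["max_af"] = {str(c): af < c for c in cutoffs}
    return groups


groups = variant_filter_groups(
    af=0.005, in_lcr=False, filters=set(), filter_lcr=True, max_af=[0.001, 0.01]
)
```

Groups that were not requested are simply absent from the result, so a call with all defaults yields an empty dictionary.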

gnomad.assessment.summary_stats.get_summary_stats_csq_filter_expr(t, lof_csq_set=None, lof_label_set=None, lof_no_flags=False, lof_any_flags=False, additional_csq_sets=None, additional_csqs=None)

Generate consequence filtering expression for summary stats.

Note

  • Assumes that input Table/MatrixTable/StructExpression contains the required annotations for the requested filtering groups.

    • ‘lof’ annotation for lof_csq_set.

    • ‘no_lof_flags’ annotation for lof_no_flags and lof_any_flags.

The possible filtering groups are:

  • ‘lof’ if lof_csq_set is provided.

  • ‘loftee_no_flags’ if lof_no_flags is True.

  • ‘loftee_with_flags’ if lof_any_flags is True.

  • ‘loftee_label’ as a struct with a field for each lof_label in lof_label_set if provided.

  • ‘csq’ as a struct with a field for each consequence in additional_csqs and lof_csq_set if provided.

  • ‘csq_set’ as a struct with a field for each consequence set in additional_csq_sets if provided. This will also have an lof field if lof_csq_set is provided.

Parameters:
  • t (Union[Table, MatrixTable, StructExpression]) – Input Table/MatrixTable/StructExpression.

  • lof_csq_set (Optional[Set[str]]) – Set of LoF consequence strings. Default is None.

  • lof_label_set (Optional[Set[str]]) – Set of LoF consequence labels. Default is None.

  • lof_no_flags (bool) – Whether to filter to variants with no flags. Default is False.

  • lof_any_flags (bool) – Whether to filter to variants with any flags. Default is False.

  • additional_csq_sets (Optional[Dict[str, Set[str]]]) – Dictionary containing additional consequence sets. Default is None.

  • additional_csqs (Optional[Set[str]]) – Set of additional consequences to keep. Default is None.

Return type:

Dict[str, Union[BooleanExpression, StructExpression]]

Returns:

Dict of BooleanExpressions or StructExpressions for filtering by consequence.

gnomad.assessment.summary_stats.generate_filter_combinations(combos, combo_options=None)

Generate list of all possible filter combinations from a list of filter options.

Example input:

[
    {'pass_filters': [False, True]},
    {'pass_filters': [False, True], 'capture': ['ukb', 'broad']}
]

Example output:

[
    {'pass_filters': False},
    {'pass_filters': True},
    {'pass_filters': False, 'capture': 'ukb'},
    {'pass_filters': False, 'capture': 'broad'},
    {'pass_filters': True, 'capture': 'ukb'},
    {'pass_filters': True, 'capture': 'broad'},
]

Parameters:
  • combos (List[Union[List[str], Dict[str, List[str]]]]) – List of filter groups and their options.

  • combo_options (Optional[Dict[str, List[str]]]) – Dictionary of filter groups and their options that can be supplied if combos is a list of lists.

Return type:

List[Dict[str, str]]

Returns:

List of all possible filter combinations for each filter group.
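The example input and output above correspond to taking a Cartesian product of the options within each filter group. A minimal plain-Python sketch of that idea follows. The helper name `expand_filter_combos` is hypothetical, and it handles only the dictionary form of combos, not the list-of-lists form that relies on combo_options.

```python
from itertools import product
from typing import Any, Dict, List


def expand_filter_combos(combos: List[Dict[str, List[Any]]]) -> List[Dict[str, Any]]:
    """Expand each group of filter options into every possible combination."""
    expanded = []
    for group in combos:
        keys = list(group)
        # product() yields one tuple of values per combination, in key order.
        for values in product(*(group[k] for k in keys)):
            expanded.append(dict(zip(keys, values)))
    return expanded


combos = [
    {"pass_filters": [False, True]},
    {"pass_filters": [False, True], "capture": ["ukb", "broad"]},
]
result = expand_filter_combos(combos)
```

For this input, `result` holds the six dictionaries listed in the example output, in the same order.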

gnomad.assessment.summary_stats.get_summary_stats_filter_group_meta(all_sum_stat_filters, common_filter_combos=None, common_filter_override=None, lof_filter_combos=None, lof_filter_override=None, filter_key_rename=None)

Generate list of filter group combination metadata for summary stats.

This function combines various filter settings for summary statistics and generates all possible filter combinations. It ensures that the generated combinations include both common filters and specific loss-of-function (LOF) filters.

Note

  • The “variant_qc” filter group is removed if the value is “none”, which can lead to a filter group of {} (no filters).

  • The filter_key_rename parameter can be used to rename keys in the all_sum_stat_filters, common_filter_override, or lof_filter_override after creating all combinations.

Example:

Given the following input:

all_sum_stat_filters = {
    "variant_qc": ["none", "pass"],
    "capture": ["1", "2"],
    "max_af": [0.01],
    "lof_csq": ["stop_gained"],
    "lof_csq_set": ["lof"],
}
common_filter_combos = [["variant_qc"], ["variant_qc", "capture"]]
common_filter_override = {"variant_qc": ["pass"], "capture": ["1"]}
lof_filter_combos = [
    ["lof_csq_set", "loftee_HC"],
    ["lof_csq_set", "loftee_HC", "loftee_flags"],
    ["lof_csq", "loftee_HC", "loftee_flags"],
]
lof_filter_override = {"loftee_HC": ["HC"], "loftee_flags": ["with_flags"]}
filter_key_rename = {
    "lof_csq": "csq",
    "loftee_HC": "loftee_labels",
    "lof_csq_set": "csq_set",
}

The function will generate the following filter combinations:

[
   # Combinations of all common filter keys and their possible values.
    {},
    {'capture': '1'},
    {'capture': '2'},
    {'variant_qc': 'pass'},
    {'variant_qc': 'pass', 'capture': '1'},
    {'variant_qc': 'pass', 'capture': '2'},

    # Combinations of all requested common filter combinations with all
    # possible other filter keys and values.
    {'variant_qc': 'pass', 'max_af': '0.01'},
    {'variant_qc': 'pass', 'csq': 'stop_gained'},
    {'variant_qc': 'pass', 'csq_set': 'lof'},
    {'variant_qc': 'pass', 'capture': '1', 'max_af': '0.01'},
    {'variant_qc': 'pass', 'capture': '1', 'csq': 'stop_gained'},
    {'variant_qc': 'pass', 'capture': '1', 'csq_set': 'lof'},

    # Combinations of all requested common filter combinations with all
    # requested LOF filter combination keys and their requested values.
    {'variant_qc': 'pass', 'csq_set': 'lof', 'loftee_labels': 'HC'},
    {
        'variant_qc': 'pass', 'csq_set': 'lof', 'loftee_labels': 'HC',
        'loftee_flags': 'with_flags'
    },
    {
        'variant_qc': 'pass', 'csq': 'stop_gained', 'loftee_labels': 'HC',
        'loftee_flags': 'with_flags'
    },
    {
        'variant_qc': 'pass', 'capture': '1', 'csq_set': 'lof',
        'loftee_labels': 'HC'
    },
    {
        'variant_qc': 'pass', 'capture': '1', 'csq_set': 'lof',
        'loftee_labels': 'HC', 'loftee_flags': 'with_flags'
    },
    {
        'variant_qc': 'pass', 'capture': '1', 'csq': 'stop_gained',
        'loftee_labels': 'HC', 'loftee_flags': 'with_flags'
    }
]

Parameters:
  • all_sum_stat_filters (Dict[str, List[str]]) – Dictionary of all possible filter types.

  • common_filter_combos (List[List[str]]) – Optional list of lists of common filter keys to use for creating common filter combinations.

  • common_filter_override (Dict[str, List[str]]) – Optional dictionary of filter groups and their options to override the values in all_sum_stat_filters for use with values in common_filter_combos. This is only used if common_filter_combos is not None.

  • lof_filter_combos (Optional[List[List[str]]]) – Optional list of loss-of-function keys in all_sum_stat_filters to use for creating filter combinations.

  • lof_filter_override (Dict[str, List[str]]) – Optional dictionary of filter groups and their options to override the values in all_sum_stat_filters for use with values in lof_filter_combos. This is only used if lof_filter_combos is not None.

  • filter_key_rename (Dict[str, str]) – Optional dictionary to rename keys in all_sum_stat_filters, common_filter_override, or lof_filter_override to final metadata keys.

Return type:

List[Dict[str, str]]

Returns:

List of dictionaries, each mapping filter field to metadata.
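The two points in the note (dropping a "variant_qc" value of "none" and renaming keys after the combinations are built) can be illustrated on a single filter group. This is a plain-Python sketch of that post-processing step only. The helper name `finalize_filter_group` is hypothetical, and the sketch does not reproduce how the function builds the combinations themselves.

```python
from typing import Dict, Optional


def finalize_filter_group(
    group: Dict[str, str],
    filter_key_rename: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Drop variant_qc == 'none' and rename keys to their final metadata names."""
    rename = filter_key_rename or {}
    return {
        rename.get(key, key): value
        for key, value in group.items()
        if not (key == "variant_qc" and value == "none")
    }


filter_key_rename = {
    "lof_csq": "csq",
    "loftee_HC": "loftee_labels",
    "lof_csq_set": "csq_set",
}
no_filters = finalize_filter_group({"variant_qc": "none"}, filter_key_rename)
renamed = finalize_filter_group(
    {"variant_qc": "pass", "lof_csq_set": "lof", "loftee_HC": "HC"},
    filter_key_rename,
)
```

Here `no_filters` is the empty group `{}` from the example output, and `renamed` matches the entry that combines 'variant_qc', 'csq_set', and 'loftee_labels'.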

gnomad.assessment.summary_stats.default_generate_gene_lof_matrix(mt, tx_ht, high_expression_cutoff=0.9, low_expression_cutoff=0.1, filter_field='filters', freq_field='freq', freq_index=0, additional_csq_set={'missense_variant', 'synonymous_variant'}, all_transcripts=False, filter_an=False, filter_to_rare=False, pre_loftee=False, lof_csq_set={'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}, remove_ultra_common=False)

Generate loss-of-function gene matrix.

Used to generate summary metrics on LoF variants.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • tx_ht (Optional[Table]) – Optional Table containing expression levels per transcript.

  • high_expression_cutoff (float) – Minimum mean proportion expressed cutoff for a transcript to be considered highly expressed. Default is 0.9.

  • low_expression_cutoff (float) – Upper mean proportion expressed cutoff for a transcript to be considered lowly expressed. Default is 0.1.

  • filter_field (str) – Name of field in MT that contains variant filters. Default is ‘filters’.

  • freq_field (str) – Name of field in MT that contains frequency information. Default is ‘freq’.

  • freq_index (int) – Which index of frequency struct to use. Default is 0.

  • additional_csq_set (Set[str]) – Set of additional consequences to keep. Default is {‘missense_variant’, ‘synonymous_variant’}.

  • all_transcripts (bool) – Whether to use all transcripts instead of just the transcript with most severe consequence. Default is False.

  • filter_an (bool) – Whether to filter using allele number as proxy for call rate. Default is False.

  • filter_to_rare (bool) – Whether to filter to rare (AF < 5%) variants. Default is False.

  • pre_loftee (bool) – Whether LoF consequences have been annotated with LOFTEE. Default is False.

  • lof_csq_set (Set[str]) – Set of LoF consequence strings. Default is {“splice_acceptor_variant”, “splice_donor_variant”, “stop_gained”, “frameshift_variant”}.

  • remove_ultra_common (bool) – Whether to remove ultra common (AF > 95%) variants. Default is False.

Return type:

MatrixTable

gnomad.assessment.summary_stats.get_het_hom_summary_dict(csq_set, most_severe_csq_expr, defined_sites_expr, num_homs_expr, num_hets_expr, pop_expr)

Generate dictionary containing summary counts.

Summary counts are:
  • Number of sites with defined genotype calls

  • Number of samples with heterozygous calls

  • Number of samples with homozygous calls

Function has option to generate counts by population.

Parameters:
  • csq_set (Set[str]) – Set containing transcript consequence string(s).

  • most_severe_csq_expr (StringExpression) – StringExpression containing most severe consequence.

  • defined_sites_expr (Int64Expression) – Int64Expression containing number of sites with defined genotype calls.

  • num_homs_expr (Int64Expression) – Int64Expression containing number of samples with homozygous genotype calls.

  • num_hets_expr (Int64Expression) – Int64Expression containing number of samples with heterozygous genotype calls.

  • pop_expr (StringExpression) – StringExpression containing sample population labels.

Return type:

Dict[str, Int64Expression]

Returns:

Dictionary of summary annotation names and their values.

gnomad.assessment.summary_stats.default_generate_gene_lof_summary(mt, collapse_indels=False, tx=False, lof_csq_set={'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}, meta_root='meta', pop_field='pop', filter_loftee=False)

Generate summary counts for loss-of-function (LoF), missense, and synonymous variants.

Also calculates p, the proportion of haplotypes carrying a putative LoF (pLoF) variant, and the observed/expected (OE) ratio of samples with homozygous pLoF variant calls.

Summary counts are (all per gene):
  • Number of samples with no pLoF variants.

  • Number of samples with heterozygous pLoF variants.

  • Number of samples with homozygous pLoF variants.

  • Total number of sites with genotype calls.

  • All of the above stats grouped by population.

Assumes MT was created using default_generate_gene_lof_matrix.

Note

Assumes LoF variants in MT were filtered (LOFTEE pass and no LoF flag only). If LoF variants have not been filtered and filter_loftee is True, expects MT has the row annotation vep.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • collapse_indels (bool) – Whether to collapse indels. Default is False.

  • tx (bool) – Whether input MT has transcript expression data. Default is False.

  • lof_csq_set (Set[str]) – Set containing LoF transcript consequence strings. Default is LOF_CSQ_SET.

  • meta_root (str) – String indicating top level name for sample metadata. Default is ‘meta’.

  • pop_field (str) – String indicating field with sample population assignment information. Default is ‘pop’.

  • filter_loftee (bool) – Filters to LOFTEE pass variants (and no LoF flags) only. Default is False.

Return type:

Table

Returns:

Table with het/hom summary counts.
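One way to derive p and the OE ratio from the per-gene counts is to assume Hardy-Weinberg proportions: if a fraction (1 - p)^2 of samples carries no pLoF variant, then p = 1 - sqrt(no_lofs / defined), and the expected number of homozygous samples is defined * p^2. The plain-Python sketch below follows that reasoning. The helper name and argument names are hypothetical, `defined` is taken to be the number of samples with genotype calls in the gene, and this section does not confirm that the library uses exactly this formula.

```python
from math import sqrt
from typing import Optional, Tuple


def lof_summary_ratios(
    no_lofs: int, obs_hom_lof: int, defined: int
) -> Tuple[float, float, Optional[float]]:
    """Return (p, expected homozygotes, OE ratio) under Hardy-Weinberg assumptions."""
    # Share of samples without a pLoF is assumed to equal (1 - p) ** 2.
    p = 1 - sqrt(no_lofs / defined)
    exp_hom_lof = defined * p * p
    # OE is undefined when no homozygotes are expected.
    oe = obs_hom_lof / exp_hom_lof if exp_hom_lof > 0 else None
    return p, exp_hom_lof, oe


p, exp_hom_lof, oe = lof_summary_ratios(no_lofs=81, obs_hom_lof=2, defined=100)
```

With 81 of 100 samples carrying no pLoF variant, p is about 0.1, roughly one homozygous sample is expected, and two observed homozygotes give an OE ratio of about 2.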