gnomad.utils.constraint
Script containing generic constraint functions that may be used in the constraint pipeline.
Minimum median exome coverage differentiating high coverage sites from low coverage sites. |
|
Annotate SNP mutation rate for the input Table. |
|
Count number of observed or possible variants by context, ref, alt, and optionally methylation_level. |
|
Get indices of dictionaries in meta dictionaries that only have the "downsampling" key with specified genetic_ancestry_label and "variant_quality" values. |
|
Return an aggregation expression to compute an array of counts of all downsamplings found in freq_expr where specified criteria is met. |
|
Explode observed and expected downsampling counts for each genetic ancestry group and metric. |
|
Annotate mutation types. |
|
Trim heptamer context to create trimer context. |
|
Return the deduplicated context by collapsing DNA strands. |
|
|
Build coverage and plateau models. |
Build plateau models to calibrate mutation rate to compute predicted proportion observed value. |
|
Build coverage model. |
|
Get the minimum length of observed variant counts array for each population downsampling. |
|
Collect annotations used for constraint groupings. |
|
|
Annotate Table with annotations used for constraint groupings. |
Apply plateau models for all sites and for a population (if specified) to compute predicted proportion observed ratio and expected variant counts. |
|
Get aggregation expressions to compute the observed:expected ratio for rows defined by filter_expr. |
|
|
Compute the pLI score using the observed and expected variant counts. |
Determine the confidence interval around the observed:expected ratio. |
|
Compute the signed raw z-score using observed and expected variant counts. |
|
Determine the constraint flags that define why constraint will not be calculated. |
|
Calculate the standard deviation of the raw z-score. |
|
|
Add GENCODE annotations to Table based on transcript id. |
Script containing generic constraint functions that may be used in the constraint pipeline.
- gnomad.utils.constraint.COVERAGE_CUTOFF = 30
Minimum median exome coverage differentiating high coverage sites from low coverage sites.
Low coverage sites require an extra calibration when computing the proportion of expected variation.
- gnomad.utils.constraint.annotate_with_mu(ht, mutation_ht, mu_annotation='mu_snp')[source]
Annotate SNP mutation rate for the input Table.
Note
Function expects that`ht` includes`mutation_ht`’s key fields. Note that these annotations don’t need to be the keys of ht.
- gnomad.utils.constraint.count_variants_by_group(ht, freq_expr=None, freq_meta_expr=None, count_singletons=False, count_downsamplings=(), downsamplings=None, additional_grouping=(), partition_hint=100, omit_methylation=False, use_table_group_by=False, singleton_expr=None, max_af=None)[source]
Count number of observed or possible variants by context, ref, alt, and optionally methylation_level.
Performs variant count aggregations based on specified criteria (count_singletons, count_downsamplings, and max_af), and grouped by: ‘context’, ‘ref’, ‘alt’, ‘methylation_level’ (optional), and all annotations provided in additional_grouping.
If variant allele frequency information is required based on other parameter selections (described in detail below) and freq_expr is not supplied, freq_expr defaults to ht.freq if it exists.
freq_expr should be an ArrayExpression of Structs with ‘AC’ and ‘AF’ annotations. This is the same format as the freq annotation that is created using annotate_freq().
- Variant allele frequency information is needed when:
max_af is not None - freq_expr[0].AF is used to filter to only variants with a maximum allele frequency of max_af prior to counting variants. In the standard freq ArrayExpression annotated by annotate_freq(), this first element corresponds to the allele frequency information for high quality genotypes (adj).
count_singletons is True and singleton_expr is None - If singleton counts are requested and no expression is specified to determine whether a variant is a singleton, singleton_expr defaults to freq_expr[0].AC == 1. In the standard freq ArrayExpression annotated by annotate_freq(), this corresponds to allele count of only 1 in the callset after filtering to high quality genotypes.
count_downsamplings is not empty - When downsampling counts are requested, freq_expr needs to contain frequency information for downsamplings within each population requested. In addition to needing freq_expr, this also requires the use of freq_meta_expr. If freq_meta_expr is None, freq_meta_expr it defaults to ht.freq_meta if it exists. Similar to freq_expr, freq_meta_expr is expected to have the same format as the freq_meta global annotation that is created using annotate_freq(). freq_meta_expr is used to determine the index of allele frequency information within freq_expr for each population requested and it’s downsamplings.
This function will return a Table with annotations used for grouping (‘context’, ‘ref’, ‘alt’, ‘methylation_level’ (optional), additional_grouping) and ‘variant_count’ annotation.
Note
- The following annotations should be present in ht:
ref - the reference allele
alt - the alternate base
context - trinucleotide genomic context
methylation_level - methylation level (optional if omit_methylation==True)
freq - allele frequency information (AC, AN, AF, homozygote count; not required if freq_expr is given)
freq_meta - an ordered list containing the frequency aggregation group for each element of the freq array row annotation (not required if freq_meta_expr is given)
- Parameters:
ht (
Table
) – Input Hail Table.freq_expr (
Optional
[ArrayExpression
]) – ArrayExpression of Structs with ‘AC’ and ‘AF’ annotations. If freq_expr is None and any of count_downsamplings, max_af, and count_singletons is True, freq_expr would be ht.freq.freq_meta_expr (
Optional
[ArrayExpression
]) – ArrayExpression of meta dictionaries corresponding to freq_expr. If count_downsamplings and freq_meta_expr is None, freq_meta_expr would be ht.freq_meta.count_singletons (
bool
) – Whether to count singletons (defined by singleton_expr). Default is False.count_downsamplings (
Tuple
[str
]) – Tuple of populations to use for downsampling counts. Default is ().downsamplings (
Optional
[List
[int
]]) – Optional List of integers specifying what downsampling indices to obtain. Default is None, which will return all downsampling counts.additional_grouping (
Tuple
[str
]) – Additional features to group by. e.g. ‘exome_coverage’. Default is ().partition_hint (
int
) – Target number of partitions for aggregation. Default is 100.omit_methylation (
bool
) – Whether to omit ‘methylation_level’ from the grouping when counting variants. Default is False.use_table_group_by (
bool
) – Whether to group ht before aggregating the variant counts. If use_table_group_by is False, function will return a hl. StructExpression. Default is False.singleton_expr (
Optional
[BooleanExpression
]) – Expression for defining a singleton. When count_singletons is True and singleton_expr is None, singleton_expression would be freq_expr [0].AC == 1. Default is None.max_af (
Optional
[float
]) – Maximum variant allele frequency to keep. By default, no cutoff is applied.
- Return type:
Union
[Table
,Any
]- Returns:
Table including ‘variant_count’ annotation and if requested, singleton_count and downsampling counts.
- gnomad.utils.constraint.get_downsampling_freq_indices(freq_meta_expr, pop='global', variant_quality='adj', genetic_ancestry_label=None, subset=None, downsamplings=None)[source]
Get indices of dictionaries in meta dictionaries that only have the “downsampling” key with specified genetic_ancestry_label and “variant_quality” values.
- Parameters:
freq_meta_expr (
ArrayExpression
) – ArrayExpression containing the set of groupings for each element of the freq_expr array (e.g., [{‘group’: ‘adj’}, {‘group’: ‘adj’, ‘pop’: ‘nfe’}, {‘downsampling’: ‘5000’, ‘group’: ‘adj’, ‘pop’: ‘global’}]).pop (
str
) – Population to use for filtering by the genetic_ancestry_label key in freq_meta_expr. Default is ‘global’.variant_quality (
str
) – Variant quality to use for filtering by the ‘group’ key in freq_meta_expr. Default is ‘adj’.genetic_ancestry_label (
Optional
[str
]) – Label defining the genetic ancestry groups. If None, “gen_anc” or “pop” is used (in that order of preference) if present. Default is None.subset (
Optional
[str
]) – Subset to use for filtering by the ‘subset’ key in freq_meta_expr. Default is None, which will return all downsampling indices without a ‘subset’ key in freq_meta_expr.downsamplings (
Optional
[List
[int
]]) – Optional List of integers specifying what downsampling indices to obtain. Default is None, which will return all downsampling indices.
- Return type:
- Returns:
ArrayExpression of indices of dictionaries in freq_meta_expr that only have the “downsampling” key with specified genetic_ancestry_label and “variant_quality” values.
- gnomad.utils.constraint.downsampling_counts_expr(freq_expr, freq_meta_expr, pop='global', variant_quality='adj', singleton=False, max_af=None, genetic_ancestry_label=None, subset=None, downsamplings=None)[source]
Return an aggregation expression to compute an array of counts of all downsamplings found in freq_expr where specified criteria is met.
The frequency metadata (freq_meta_expr) should be in a similar format to the freq_meta annotation added by annotate_freq(). Each downsampling should have ‘group’, genetic_ancestry_label, and ‘downsampling’ keys. Included downsamplings are those where ‘group’ == variant_quality and genetic_ancestry_label == pop.
- Parameters:
freq_expr (
ArrayExpression
) – ArrayExpression of Structs with ‘AC’ and ‘AF’ annotations.freq_meta_expr (
ArrayExpression
) – ArrayExpression containing the set of groupings for each element of the freq_expr array (e.g., [{‘group’: ‘adj’}, {‘group’: ‘adj’, ‘pop’: ‘nfe’}, {‘downsampling’: ‘5000’, ‘group’: ‘adj’, ‘pop’: ‘global’}]).pop (
str
) – Population to use for filtering by the genetic_ancestry_label key in freq_meta_expr. Default is ‘global’.variant_quality (
str
) – Variant quality to use for filtering by the ‘group’ key in freq_meta_expr. Default is ‘adj’.singleton (
bool
) – Whether to filter to only singletons before counting (AC == 1). Default is False.max_af (
Optional
[float
]) – Maximum variant allele frequency to keep. By default no allele frequency cutoff is applied.genetic_ancestry_label (
Optional
[str
]) – Label defining the genetic ancestry groups. If None, “gen_anc” or “pop” is used (in that order of preference) if present. Default is None.subset (
Optional
[str
]) – Subset to use for filtering by the ‘subset’ key in freq_meta_expr. Default is None, which will return all downsampling counts without a ‘subset’ key in freq_meta_expr. If specified, only downsamplings with the specified subset will be included.downsamplings (
Optional
[List
[int
]]) – Optional List of integers specifying what downsampling indices to obtain. Default is None, which will return all downsampling counts.
- Return type:
- Returns:
Aggregation Expression for an array of the variant counts in downsamplings for specified population.
- gnomad.utils.constraint.explode_downsamplings_oe(ht, downsampling_meta, metrics=['syn', 'lof', 'mis'])[source]
Explode observed and expected downsampling counts for each genetic ancestry group and metric.
The input ht must contain struct of downsampling information for genetic ancestry groups under each metric name. For example: ‘lof’: struct {gen_anc_exp: struct {global: array<float64>}.
- Parameters:
ht (
Table
) – Input Table.metrics (
List
[str
]) – List of metrics to explode. Default is ‘[‘syn’, ‘lof’, ‘mis’]’.downsampling_meta (
Dict
[str
,List
[str
]]) – Dictionary containing downsampling metadata. Keys are the genetic ancestry group names and values are the list of downsamplings for that genetic ancestry group. Example: {‘global’: [‘5000’, ‘10000’], ‘afr’: [‘5000’, ‘10000’]}.
- Return type:
- Returns:
Table with downsampling counts exploded so that observed and expected metric counts for each pair of genetic ancestry groups and downsamplings is a separate row.
- gnomad.utils.constraint.annotate_mutation_type(t, context_length=None, num_scan_context_length=100)[source]
Annotate mutation types.
- The following annotations are added to the output Table:
cpg
transition
mutation_type - one of “CpG”, “non-CpG transition”, or “transversion”
mutation_type_model
..note:
This function uses the term ‘mutation_type’ because ‘variant_type’ is already used in this repo to indicate a variant’s multiallelic and SNP/indel status.
- Parameters:
t (
Union
[MatrixTable
,Table
]) – Input Table or MatrixTable.context_length (
Optional
[int
]) – Length of the ‘context’ annotation in ‘t’. If this is not specified, the value will be determined by examining the first num_scan_context_length values of the ‘context’ annotation. Default is None.num_scan_context_length (
Optional
[int
]) – Number of values in the ‘context’ annotation to use for determining context_length if it is not specified. If set to None, all values in ‘context’ will be used. Default is 100.
- Return type:
Union
[MatrixTable
,Table
]- Returns:
Table with mutation type annotations added.
- gnomad.utils.constraint.trimer_from_heptamer(t)[source]
Trim heptamer context to create trimer context.
- Parameters:
t (
Union
[MatrixTable
,Table
]) – Input MatrixTable or Table with context annotation.- Return type:
Union
[MatrixTable
,Table
]- Returns:
MatrixTable or Table with trimer context annotated.
- gnomad.utils.constraint.collapse_strand(t)[source]
Return the deduplicated context by collapsing DNA strands.
Function returns the reverse complement for ‘ref, ‘alt’, and ‘context’ if the reference allele is either ‘G’ or ‘T’.
- The following annotations are added to the output Table:
was_flipped - whether the ‘ref, ‘alt’, and ‘context’ were flipped (reverse complement taken)
- Parameters:
ht – Input Table.
t (
Union
[Table
,MatrixTable
]) –
- Return type:
Union
[Table
,MatrixTable
]- Returns:
Table with deduplicated context annotation (ref, alt, context, was_flipped).
- gnomad.utils.constraint.build_models(coverage_ht, coverage_expr, weighted=False, pops=(), keys=('context', 'ref', 'alt', 'methylation_level', 'mu_snp'), high_cov_definition=30, upper_cov_cutoff=None, skip_coverage_model=False, log10_coverage=True)[source]
Build coverage and plateau models.
This function builds models (plateau_models) using linear regression to calibrate mutation rate estimates against the proportion observed of each substitution, context, and methylation level in coverage_ht.
Two plateau models are fit, one for CpG transitions, and one for the remainder of sites (transversions and non CpG transitions).
The plateau models only consider high coverage sites, or sites above a median coverage of high_cov_definition and median coverage below upper_cov_cutoff.
Plateau model: adjusts proportion of expected variation based on location in the genome and CpG status. The x and y of the plateau models: - x: mu_snp - mutation rate - y: proportion observed (‘observed_variants’ or ‘observed_{pop}’ / ‘possible_variants’)
This function also builds models (coverage models) to calibrate the proportion of expected variation at low coverage sites (sites below high_cov_definition).
The coverage models are built by creating a scaling factor across all high coverage sites, applying this ratio to the low coverage sites, and running a linear regression.
Coverage model: corrects proportion of expected variation at low coverage sites. Low coverage sites are defined as sites with median coverage < high_cov_definition.
The x and y of the coverage model: - x: log10 groupings of exome coverage at low coverage sites - y: sum(‘observed_variants’)/ (high_coverage_scale_factor * sum(‘possible_variants’ * ‘mu_snp’) at low coverage sites
- high_coverage_scale_factor = sum(‘observed_variants’) /
sum(‘possible_variants’ * ‘mu_snp’) at high coverage sites
Note
This function expects that the input Table(coverage_ht) was created using get_proportion_observed_by_coverage, which means that coverage_ht should contain only high quality synonymous variants below 0.1% frequency.
This function also expects that the following fields are present in coverage_ht: - context - trinucleotide genomic context - ref - the reference allele - alt - the alternate allele - methylation_level - methylation level - cpg - whether the site is CpG site - observed_variants - the number of observed variants in the dataset for each variant. Note that the term “variant” here refers to a specific substitution, context, methylation level, and coverage combination - downsampling_counts_{pop} (optional) - array of observed variant counts per population after downsampling. Used only when pops is specified. - mu_snp - mutation rate - possible_variants - the number of possible variants in the dataset for each variant
- Parameters:
coverage_ht (
Table
) – Input coverage Table.coverage_expr (
Int32Expression
) – Expression that defines the coverage metric.weighted (
bool
) – Whether to weight the plateau models (a linear regression model) by ‘possible_variants’. Default is False.pops (
Tuple
[str
]) – List of populations used to build plateau models. Default is ().keys (
Tuple
[str
]) – Annotations used to group observed and possible variant counts. Default is (“context”, “ref”, “alt”, “methylation_level”, “mu_snp”).high_cov_definition (
int
) – Lower median coverage cutoff. Sites with coverage above this cutoff are considered well covered. Default is COVERAGE_CUTOFF.upper_cov_cutoff (
Optional
[int
]) – Upper median coverage cutoff. Sites with coverage above this cutoff are excluded from the high coverage Table. Default is None.skip_coverage_model (
bool
) – Whether to skip generating the coverage model. If set to True, None is returned instead of the coverage model. Default is False.log10_coverage (
bool
) – Whether to convert coverage sites with log10 when building the coverage model. Default is True.
- Return type:
Tuple
[Optional
[Tuple
[float
,float
]],StructExpression
]- Returns:
Coverage model and plateau models.
- gnomad.utils.constraint.build_plateau_models(cpg_expr, mu_snp_expr, observed_variants_expr, possible_variants_expr, pops_observed_variants_array_expr=[], weighted=False)[source]
Build plateau models to calibrate mutation rate to compute predicted proportion observed value.
The x and y of the plateau models: - x: mu_snp_expr - y: observed_variants_expr / possible_variants_expr or pops_observed_variants_array_expr`[index] / `possible_variants_expr if pops is specified
- Parameters:
cpg_expr (
BooleanExpression
) – BooleanExpression noting whether a site is a CPG site.mu_snp_expr (
Float64Expression
) – Float64Expression of the mutation rate.observed_variants_expr (
Int64Expression
) – Int64Expression of the observed variant counts.possible_variants_expr (
Int64Expression
) – Int64Expression of the possible variant counts.pops_observed_variants_array_expr (
List
[ArrayExpression
]) – Nested ArrayExpression with all observed variant counts ArrayNumericExpressions for specified populations. e.g., [[1,1, 1],[1,1,1]]. Default is None.weighted (
bool
) – Whether to generalize the model to weighted least squares using ‘possible_variants’. Default is False.
- Return type:
Dict
[str
,Union
[Dict
[bool
,ArrayExpression
],ArrayExpression
]]- Returns:
A dictionary of intercepts and slopes of plateau models. The keys are ‘total’ (for all sites) and ‘pop’ (optional; for populations). The values for ‘total’ is a dictionary (e.g., <DictExpression of type dict<bool, array<float64>>>), and the value for ‘pop’ is a nested list of dictionaries (e. g., <ArrayExpression of type array<array<dict<bool, array<float64>>>>>). The key of the dictionary in the nested list is CpG status (BooleanExpression), and the value is an ArrayExpression containing intercept and slope values.
- gnomad.utils.constraint.build_coverage_model(low_coverage_oe_expr, coverage_expr)[source]
Build coverage model.
This function uses linear regression to build a model of coverage to correct proportion of expected variation at low coverage sites.
The x and y of the coverage model: - x: coverage_expr - y: low_coverage_oe_expr
- Parameters:
low_coverage_oe_expr (
Float64Expression
) – The Float64Expression of observed:expected ratio for a given coverage level.coverage_expr (
Float64Expression
) – The Float64Expression of the coverage expression.
- Return type:
- Returns:
StructExpression with intercept and slope of the model.
- gnomad.utils.constraint.get_all_pop_lengths(ht, pops, obs_expr)[source]
Get the minimum length of observed variant counts array for each population downsampling.
The observed variant counts for each population in pops are specified by annotations on the obs_expr expression.
The function also performs a check that arrays of variant counts within population downsamplings all have the same lengths.
- Parameters:
ht (
Table
) – Input Table containing obs_expr.pops (
Tuple
[str
]) – Populations used to categorize observed variant counts in downsamplings.obs_expr (
StructExpression
) – Expression for the population observed variant counts. Should be a struct containing an array for each pop in pops.
- Return type:
List
[Tuple
[str
,str
]]- Returns:
A Dictionary with the minimum array length for each population.
- gnomad.utils.constraint.get_constraint_grouping_expr(vep_annotation_expr, coverage_expr=None, include_transcript_group=True, include_canonical_group=True, include_mane_select_group=False)[source]
Collect annotations used for constraint groupings.
- Function collects the following annotations:
annotation - ‘most_severe_consequence’ annotation in vep_annotation_expr
- modifier - classic lof annotation from ‘lof’ annotation in
vep_annotation_expr, LOFTEE annotation from ‘lof’ annotation in vep_annotation_expr, PolyPhen annotation from ‘polyphen_prediction’ in vep_annotation_expr, or “None” if neither is defined
gene - ‘gene_symbol’ annotation inside vep_annotation_expr
coverage - exome coverage if coverage_expr is specified
- transcript - id from ‘transcript_id’ in vep_annotation_expr (added when
include_transcript_group is True)
- canonical from vep_annotation_expr (added when include_canonical_group is
True)
- mane_select from vep_annotation_expr (added when include_mane_select_group is
True)
Note
This function expects that the following fields are present in vep_annotation_expr: - lof - polyphen_prediction - most_severe_consequence - gene_symbol - transcript_id (if include_transcript_group is True) - canonical (if include_canonical_group is True) - mane_select (if include_mane_select_group is True)
- Parameters:
vep_annotation_expr (
StructExpression
) – StructExpression of VEP annotation.coverage_expr (
Optional
[Int32Expression
]) – Optional Int32Expression of exome coverage. Default is None.include_transcript_group (
bool
) – Whether to include the transcript annotation in the groupings. Default is True.include_canonical_group (
bool
) – Whether to include canonical annotation in the groupings. Default is True.include_mane_select_group (
bool
) – Whether to include mane_select annotation in the groupings. Default is False.
- Return type:
Dict
[str
,Union
[StringExpression
,Int32Expression
,BooleanExpression
]]- Returns:
A dictionary with keys as annotation names and values as actual annotations.
- gnomad.utils.constraint.annotate_exploded_vep_for_constraint_groupings(ht, coverage_expr, vep_annotation='transcript_consequences', include_canonical_group=True, include_mane_select_group=False)[source]
Annotate Table with annotations used for constraint groupings.
- Function explodes the specified VEP annotation (vep_annotation) and adds the following annotations:
annotation -‘most_severe_consequence’ annotation in vep_annotation
- modifier - classic lof annotation from ‘lof’ annotation in
vep_annotation, LOFTEE annotation from ‘lof’ annotation in vep_annotation, PolyPhen annotation from ‘polyphen_prediction’ in vep_annotation, or “None” if neither is defined
gene - ‘gene_symbol’ annotation inside vep_annotation
- transcript - id from ‘transcript_id’ in vep_annotation (added when
include_transcript_group is True)
- canonical from vep_annotation (added when include_canonical_group is
True)
- mane_select from vep_annotation (added when include_mane_select_group is
True)
Note
This function expects that the following annotations are present in ht: - vep - exome_coverage
- Parameters:
ht (
Table
) – Input Table or MatrixTable.coverage_expr (
Int32Expression
) – Expression that defines the coverage metric.vep_annotation (
str
) – Name of annotation in ‘vep’ annotation (one of “transcript_consequences” and “worst_csq_by_gene”) that will be used for obtaining constraint annotations. Default is “transcript_consequences”.include_canonical_group (
bool
) – Whether to include ‘canonical’ annotation in the groupings. Default is True. Ignored unless vep_annotation is “transcript_consequences”.include_mane_select_group (
bool
) – Whether to include ‘mane_select’ annotation in the groupings. Default is False. Ignored unless vep_annotation is “transcript_consequences”.
- Return type:
Tuple
[Union
[Table
,MatrixTable
],Tuple
[str
]]- Returns:
A tuple of input Table or MatrixTable with grouping annotations added and the names of added annotations.
- gnomad.utils.constraint.compute_expected_variants(ht, plateau_models_expr, mu_expr, cov_corr_expr, possible_variants_expr, cpg_expr, pop=None)[source]
Apply plateau models for all sites and for a population (if specified) to compute predicted proportion observed ratio and expected variant counts.
- Parameters:
ht (
Table
) – Input Table.plateau_models_expr (
StructExpression
) – Linear models (output of build_models(), with the values of the dictionary formatted as a StructExpression of intercept and slope, that calibrates mutation rate to proportion observed for high coverage exomes. It includes models for CpG, non-CpG sites, and each population if specified.mu_expr (
Float64Expression
) – Float64Expression of mutation rate.possible_variants_expr (
Int64Expression
) – Int64Expression of possible variant counts.cov_corr_expr (
Float64Expression
) – Float64Expression of corrected coverage expression.cpg_expr (
BooleanExpression
) – BooleanExpression noting whether a site is a CPG site.pop (
Optional
[str
]) – Optional population to use when applying plateau model. Default is None.
- Return type:
Dict
[str
,Union
[Float64Expression
,Int64Expression
]]- Returns:
A dictionary with predicted proportion observed ratio and expected variant counts.
- gnomad.utils.constraint.oe_aggregation_expr(ht, filter_expr, pops=(), exclude_mu_sum=False)[source]
Get aggregation expressions to compute the observed:expected ratio for rows defined by filter_expr.
Return a Struct containing aggregation expressions to sum the number of observed variants, possible variants, expected variants, and mutation rate (if exclude_mu_sum is not True) for rows defined by filter_expr. The Struct also includes an aggregation expression for the observed:expected ratio.
- The following annotations are in the returned StructExpression:
obs - the sum of observed variants filtered to filter_expr.
mu - the sum of mutation rate of variants filtered to filter_expr.
possible - possible number of variants filtered to filter_expr.
exp - expected number of variants filtered to filter_expr.
oe - observed:expected ratio of variants filtered to filter_expr.
- If pops is specified:
gen_anc_exp - Struct with the expected number of variants per population (for all pop in pops) filtered to filter_expr.
gen_anc_obs - Struct with the observed number of variants per population (for all pop in pops) filtered to filter_expr.
Note
- The following annotations should be present in ht:
observed_variants
mu
possible_variants
expected_variants
- If pops is specified, the following annotations should also be present:
expected_variants_{pop} for all pop in pops
downsampling_counts_{pop} for all pop in pops
- Parameters:
ht (
Table
) – Input Table to create observed:expected ratio aggregation expressions for.filter_expr (
BooleanExpression
) – Boolean expression used to filter ht before aggregation.pops (
Tuple
[str
]) – List of populations to compute constraint metrics for. Default is ().exclude_mu_sum (
bool
) – Whether to exclude mu sum aggregation expression from returned struct. Default is False.
- Return type:
- Returns:
StructExpression with observed:expected ratio aggregation expressions.
- gnomad.utils.constraint.compute_pli(ht, obs_expr, exp_expr, expected_values=None, min_diff_convergence=0.001)[source]
Compute the pLI score using the observed and expected variant counts.
Full details on pLI can be found in the ExAC paper: Lek, M., Karczewski, K., Minikel, E. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
pLI is the probability of being loss-of-function intolerant, and this function computes that probability using the expectation-maximization (EM) algorithm.
We assume a 3 state model, where each gene fits into one of three categories with respect loss-of-function variation sensitivity:
Null: where protein truncating variation is completely tolerated by natural selection.
Recessive (Rec): where heterozygous pLoFs are tolerated but homozygous pLoFs are not.
Haploinsufficient (LI): where heterozygous pLoFs are not tolerated.
The function requires the expected amount of loss-of-function depletion for each of these states. The default provided is based on the observed depletion of protein-truncating variation in the Blekhman autosomal recessive and ClinGen dosage sensitivity gene sets (Supplementary Information Table 12 of the above reference):
Null: 1.0, assume tolerant genes have the expected amount of truncating variation.
Rec: 0.463, derived from the empirical mean observed/expected rate of truncating variation for recessive disease genes (0.463).
LI: 0.089, derived from the empirical mean observed/expected rate of truncating variation for severe haploinsufficient genes.
The output StructExpression will include the following annotations:
pLI: Probability of loss-of-function intolerance; probability that transcript falls into distribution of haploinsufficient genes.
pNull: Probability that transcript falls into distribution of unconstrained genes.
pRec: Probability that transcript falls into distribution of recessive genes.
- Parameters:
ht (
Table
) – Input Table containing obs_expr and exp_expr.obs_expr (
Int64Expression
) – Expression for the number of observed variants on each gene or transcript in ht.exp_expr (
Float64Expression
) – Expression for the number of expected variants on each gene or transcript in ht.expected_values (
Optional
[Dict
[str
,float
]]) – Dictionary containing the expected values for ‘Null’, ‘Rec’, and ‘LI’ to use as starting values.min_diff_convergence (
float
) – Minimum iteration change in LI to consider the EM model convergence criteria as met. Default is 0.001.
- Return type:
- Returns:
StructExpression for pLI scores.
- gnomad.utils.constraint.oe_confidence_interval(obs_expr, exp_expr, alpha=0.05)[source]
Determine the confidence interval around the observed:expected ratio.
For a given pair of observed (obs_expr) and expected (exp_expr) values, the function computes the density of the Poisson distribution (performed using Hail’s dpois module) with fixed k (x in dpois is set to the observed number of variants) over a range of lambda (lamb in dpois) values, which are given by the expected number of variants times a varying parameter ranging between 0 and 2 (the observed:expected ratio is typically between 0 and 1, so we want to extend the upper bound of the confidence interval to capture this). The cumulative density function of the Poisson distribution density is computed and the value of the varying parameter is extracted at points corresponding to alpha (defaults to 5%) and 1-alpha (defaults to 95%) to indicate the lower and upper bounds of the confidence interval.
- The following annotations are in the output StructExpression:
lower - the lower bound of confidence interval
upper - the upper bound of confidence interval
- Parameters:
obs_expr (
Int64Expression
) – Expression for the observed variant counts of pLoF, missense, or synonymous variants in ht.exp_expr (
Float64Expression
) – Expression for the expected variant counts of pLoF, missense, or synonymous variants in ht.alpha (
float
) – The significance level used to compute the confidence interval. Default is 0.05.
- Return type:
- Returns:
StructExpression for the confidence interval lower and upper bounds.
- gnomad.utils.constraint.calculate_raw_z_score(obs_expr, exp_expr)[source]
Compute the signed raw z-score using observed and expected variant counts.
The raw z-scores are positive when the transcript had fewer variants than expected, and are negative when transcripts had more variants than expected.
- Parameters:
obs_expr (
Int64Expression
) – Observed variant count expression.exp_expr (
Float64Expression
) – Expected variant count expression.
- Return type:
- Returns:
StructExpression for the raw z-score.
- gnomad.utils.constraint.get_constraint_flags(exp_expr, raw_z_expr, raw_z_lower_threshold=-5.0, raw_z_upper_threshold=5.0, flag_postfix='')[source]
Determine the constraint flags that define why constraint will not be calculated.
- Flags which are added:
“no_exp_{flag_postfix}” - for genes that have missing or zero expected variants.
“outlier_{flag_postfix}” - for genes that are raw z-score outliers: (raw_z_expr < raw_z_lower_threshold) or (raw_z_expr > raw_z_upper_threshold).
- Parameters:
exp_expr (
Float64Expression
) – Expression for the expected variant counts of pLoF, missense, or synonymous variants.raw_z_expr (
Float64Expression
) – Expression for the signed raw z-score of pLoF, missense, or synonymous variants.raw_z_lower_threshold (
Optional
[float
]) – Lower threshold for the raw z-score. When raw_z_expr is less than this threshold it is considered an ‘outlier’. Default is -5.0.raw_z_upper_threshold (
Optional
[float
]) – Upper threshold for the raw z-score. When raw_z_expr is greater than this threshold it is considered an ‘outlier’. Default is 5.0.flag_postfix (
str
) – Postfix to add to the end of the constraint flag names.
- Return type:
Dict
[str
,Expression
]- Returns:
Dictionary containing expressions for constraint flags.
- gnomad.utils.constraint.calculate_raw_z_score_sd(raw_z_expr, flag_expr, mirror_neg_raw_z=True)[source]
Calculate the standard deviation of the raw z-score.
When using mirror_neg_raw_z is True, all the negative raw z-scores (defined by raw_z_expr) are combined with those same z-scores multiplied by -1 (to create a mirrored distribution).
- Parameters:
raw_z_expr (
Float64Expression
) – Expression for the raw z-score.flag_expr (
StringExpression
) – Expression for the constraint flags. z-score will not be calculated if flags are present.mirror_neg_raw_z (
bool
) – Whether the standard deviation should be computed using a mirrored distribution of negative raw_z_expr.
- Return type:
- Returns:
StructExpression containing standard deviation of the raw z-score and the z-score.
- gnomad.utils.constraint.add_gencode_transcript_annotations(ht, gencode_ht, annotations=['level', 'transcript_type'])[source]
Add GENCODE annotations to Table based on transcript id.
Note
Added annotations by default are: - level - transcript_type
Computed annotations are: - chromosome - cds_length - num_coding_exons
- Parameters:
- Return type:
- Returns:
Table with transcript annotations from GENCODE added.