gnomad.utils.constraint

Script containing generic constraint functions that may be used in the constraint pipeline.

`gnomad.utils.constraint.COVERAGE_CUTOFF`	Minimum median exome coverage differentiating high coverage sites from low coverage sites.
`gnomad.utils.constraint.get_mu_annotation_expr`(ht, ...)	Get mutation rate annotation expression from the mutation rate Table.
`gnomad.utils.constraint.annotate_with_mu`(ht, ...)	Annotate SNP mutation rate for the input Table.
`gnomad.utils.constraint.count_variants_by_group`(ht)	Count number of observed or possible variants by context, ref, alt, and optionally methylation_level.
`gnomad.utils.constraint.get_downsampling_freq_indices`(...)	Get indices of dictionaries in meta dictionaries that only have the "downsampling" key with specified genetic_ancestry_label and "variant_quality" values.
`gnomad.utils.constraint.downsampling_counts_expr`(...)	Return an aggregation expression to compute an array of counts of all downsamplings found in freq_expr where specified criteria is met.
`gnomad.utils.constraint.explode_downsamplings_oe`(ht, ...)	Explode observed and expected downsampling counts for each genetic ancestry group and metric.
`gnomad.utils.constraint.annotate_mutation_type`(t)	Annotate mutation types.
`gnomad.utils.constraint.trimer_from_heptamer`(t)	Trim heptamer context to create trimer context.
`gnomad.utils.constraint.collapse_strand`(t)	Return the deduplicated context by collapsing DNA strands.
`gnomad.utils.constraint.transform_methylation_level`([...])	Transform methylation level into a 0 to 2 scale.
`gnomad.utils.constraint.assemble_constraint_context_ht`(ht)	Assemble context Table with necessary annotations for constraint calculations.
`gnomad.utils.constraint.build_models`(...[, ...])	Build coverage and plateau models.
`gnomad.utils.constraint.build_plateau_models`(...)	Build plateau models to calibrate mutation rate to compute predicted proportion observed value.
`gnomad.utils.constraint.build_coverage_model`(...)	Build coverage model.
`gnomad.utils.constraint.get_all_gen_anc_lengths`(ht, ...)	Get the minimum length of observed variant counts array for each genetic ancestry group downsampling.
`gnomad.utils.constraint.get_constraint_grouping_expr`(...)	Collect annotations used for constraint groupings.
`gnomad.utils.constraint.annotate_exploded_vep_for_constraint_groupings`(ht, ...)	Annotate Table with annotations used for constraint groupings.
`gnomad.utils.constraint.compute_expected_variants`(ht, ...)	Apply plateau models for all sites and for a genetic ancestry group (if specified) to compute predicted proportion observed ratio and expected variant counts.
`gnomad.utils.constraint.oe_aggregation_expr`(ht, ...)	Get aggregation expressions to compute the observed:expected ratio for rows defined by filter_expr.
`gnomad.utils.constraint.compute_pli`(ht, ...)	Compute the pLI score using the observed and expected variant counts.
`gnomad.utils.constraint.oe_confidence_interval`(...)	Determine the confidence interval around the observed:expected ratio.
`gnomad.utils.constraint.calculate_raw_z_score`(...)	Compute the signed raw z-score using observed and expected variant counts.
`gnomad.utils.constraint.get_constraint_flags`(...)	Determine the constraint flags that define why constraint will not be calculated.
`gnomad.utils.constraint.calculate_raw_z_score_sd`(...)	Calculate the standard deviation of the raw z-score.
`gnomad.utils.constraint.add_gencode_transcript_annotations`(ht, ...)	Add GENCODE annotations to Table based on transcript id.

Script containing generic constraint functions that may be used in the constraint pipeline.

gnomad.utils.constraint.COVERAGE_CUTOFF = 30

Minimum median exome coverage differentiating high coverage sites from low coverage sites.

Low coverage sites require an extra calibration when computing the proportion of expected variation.

gnomad.utils.constraint.get_mu_annotation_expr(ht, mutation_ht, mu_expr='mu_snp')[source]

Get mutation rate annotation expression from the mutation rate Table.

Note

Function expects that ht includes mutation_ht’s key fields. Note that these annotations don’t need to be the keys of ht. The standard keys used are: ‘context’, ‘ref’, ‘alt’, and ‘methylation_level’.

This function converts the mutation rate Table into a dictionary and uses the mutation rate HT key fields in the input Table to get the mutation rate annotation expression. The dictionary is used instead of joining the mutation rate Table to the input Table to avoid unnecessary shuffling given the small size of the mutation rate Table.

Parameters:

ht (Table) – Input Table that will be annotated with the returned mutation rate expression.
mutation_ht (Table) – Mutation rate Table.
mu_expr (Union[str, Float64Expression]) – Mutation rate expression or annotation name in mutation_ht. Default is ‘mu_snp’.

Return type:

Float64Expression

Returns:

Mutation rate annotation expression.

gnomad.utils.constraint.annotate_with_mu(ht, mutation_ht, mu_annotation='mu_snp')[source]

Annotate SNP mutation rate for the input Table.

Note

Function expects that ht includes mutation_ht’s key fields. Note that these annotations don’t need to be the keys of ht. The standard keys used are: ‘context’, ‘ref’, ‘alt’, and ‘methylation_level’.

Parameters:

ht (Table) – Input Table to annotate.
mutation_ht (Table) – Mutation rate Table.
mu_annotation (str) – The name of mutation rate annotation in mutation_ht. Default is ‘mu_snp’.

Return type:

Table

Returns:

Table with mutational rate annotation added.

gnomad.utils.constraint.count_variants_by_group(ht, freq_expr=None, freq_meta_expr=None, count_singletons=False, count_downsamplings=(), downsamplings=None, additional_grouping=(), partition_hint=100, omit_methylation=False, use_table_group_by=False, singleton_expr=None, max_af=None)[source]

Count number of observed or possible variants by context, ref, alt, and optionally methylation_level.

Performs variant count aggregations based on specified criteria (count_singletons, count_downsamplings, and max_af), and grouped by: ‘context’, ‘ref’, ‘alt’, ‘methylation_level’ (optional), and all annotations provided in additional_grouping.

If variant allele frequency information is required based on other parameter selections (described in detail below) and freq_expr is not supplied, freq_expr defaults to ht.freq if it exists.

freq_expr should be an ArrayExpression of Structs with ‘AC’ and ‘AF’ annotations. This is the same format as the freq annotation that is created using annotate_freq().

Variant allele frequency information is needed when:

max_af is not None - freq_expr[0].AF is used to filter to only variants with a maximum allele frequency of max_af prior to counting variants. In the standard freq ArrayExpression annotated by annotate_freq(), this first element corresponds to the allele frequency information for high quality genotypes (adj).
count_singletons is True and singleton_expr is None - If singleton counts are requested and no expression is specified to determine whether a variant is a singleton, singleton_expr defaults to freq_expr[0].AC == 1. In the standard freq ArrayExpression annotated by annotate_freq(), this corresponds to allele count of only 1 in the callset after filtering to high quality genotypes.
count_downsamplings is not empty - When downsampling counts are requested, freq_expr needs to contain frequency information for downsamplings within each genetic ancestry group requested. In addition to needing freq_expr, this also requires the use of freq_meta_expr. If freq_meta_expr is None, freq_meta_expr it defaults to ht.freq_meta if it exists. Similar to freq_expr, freq_meta_expr is expected to have the same format as the freq_meta global annotation that is created using annotate_freq(). freq_meta_expr is used to determine the index of allele frequency information within freq_expr for each genetic ancestry group requested and it’s downsamplings.

This function will return a Table with annotations used for grouping (‘context’, ‘ref’, ‘alt’, ‘methylation_level’ (optional), additional_grouping) and ‘variant_count’ annotation.

Note

The following annotations should be present in ht:

ref - the reference allele
alt - the alternate base
context - trinucleotide genomic context
methylation_level - methylation level (optional if omit_methylation==True)
freq - allele frequency information (AC, AN, AF, homozygote count; not required if freq_expr is given)
freq_meta - an ordered list containing the frequency aggregation group for each element of the freq array row annotation (not required if freq_meta_expr is given)

Parameters:

ht (Table) – Input Hail Table.
freq_expr (Optional[ArrayExpression]) – ArrayExpression of Structs with ‘AC’ and ‘AF’ annotations. If freq_expr is None and any of count_downsamplings, max_af, and count_singletons is True, freq_expr would be ht.freq.
freq_meta_expr (Optional[ArrayExpression]) – ArrayExpression of meta dictionaries corresponding to freq_expr. If count_downsamplings and freq_meta_expr is None, freq_meta_expr would be ht.freq_meta.
count_singletons (bool) – Whether to count singletons (defined by singleton_expr). Default is False.
count_downsamplings (Tuple[str]) – Tuple of genetic ancestry groups to use for downsampling counts. Default is ().
downsamplings (Optional[List[int]]) – Optional List of integers specifying what downsampling indices to obtain. Default is None, which will return all downsampling counts.
additional_grouping (Tuple[str]) – Additional features to group by. e.g. ‘exome_coverage’. Default is ().
partition_hint (int) – Target number of partitions for aggregation. Default is 100.
omit_methylation (bool) – Whether to omit ‘methylation_level’ from the grouping when counting variants. Default is False.
use_table_group_by (bool) – Whether to group ht before aggregating the variant counts. If use_table_group_by is False, function will return a hl. StructExpression. Default is False.
singleton_expr (Optional[BooleanExpression]) – Expression for defining a singleton. When count_singletons is True and singleton_expr is None, singleton_expression would be freq_expr [0].AC == 1. Default is None.
max_af (Optional[float]) – Maximum variant allele frequency to keep. By default, no cutoff is applied.

Return type:

Union[Table, Any]

Returns:

Table including ‘variant_count’ annotation and if requested, singleton_count and downsampling counts.

gnomad.utils.constraint.get_downsampling_freq_indices(freq_meta_expr, gen_anc='global', variant_quality='adj', genetic_ancestry_label=None, subset=None, downsamplings=None)[source]

Get indices of dictionaries in meta dictionaries that only have the “downsampling” key with specified genetic_ancestry_label and “variant_quality” values.

Parameters:

freq_meta_expr (ArrayExpression) – ArrayExpression containing the set of groupings for each element of the freq_expr array (e.g., [{‘group’: ‘adj’}, {‘group’: ‘adj’, ‘gen_anc’: ‘nfe’}, {‘downsampling’: ‘5000’, ‘group’: ‘adj’, ‘gen_anc’: ‘global’}]).
gen_anc (str) – Genetic ancestry group to use for filtering by the genetic_ancestry_label key in freq_meta_expr. Default is ‘global’.
variant_quality (str) – Variant quality to use for filtering by the ‘group’ key in freq_meta_expr. Default is ‘adj’.
genetic_ancestry_label (Optional[str]) – Label defining the genetic ancestry groups. If None, “gen_anc” or “gen_anc” is used (in that order of preference) if present. Default is None.
subset (Optional[str]) – Subset to use for filtering by the ‘subset’ key in freq_meta_expr. Default is None, which will return all downsampling indices without a ‘subset’ key in freq_meta_expr.
downsamplings (Optional[List[int]]) – Optional List of integers specifying what downsampling indices to obtain. Default is None, which will return all downsampling indices.

Return type:

ArrayExpression

Returns:

ArrayExpression of indices of dictionaries in freq_meta_expr that only have the “downsampling” key with specified genetic_ancestry_label and “variant_quality” values.

gnomad.utils.constraint.downsampling_counts_expr(freq_expr, freq_meta_expr, gen_anc='global', variant_quality='adj', singleton=False, max_af=None, genetic_ancestry_label=None, subset=None, downsamplings=None)[source]

Return an aggregation expression to compute an array of counts of all downsamplings found in freq_expr where specified criteria is met.

The frequency metadata (freq_meta_expr) should be in a similar format to the freq_meta annotation added by annotate_freq(). Each downsampling should have ‘group’, genetic_ancestry_label, and ‘downsampling’ keys. Included downsamplings are those where ‘group’ == variant_quality and genetic_ancestry_label == gen_anc.

Parameters:

freq_expr (ArrayExpression) – ArrayExpression of Structs with ‘AC’ and ‘AF’ annotations.
freq_meta_expr (ArrayExpression) – ArrayExpression containing the set of groupings for each element of the freq_expr array (e.g., [{‘group’: ‘adj’}, {‘group’: ‘adj’, ‘gen_anc’: ‘nfe’}, {‘downsampling’: ‘5000’, ‘group’: ‘adj’, ‘gen_anc’: ‘global’}]).
gen_anc (str) – Genetic ancestry group to use for filtering by the genetic_ancestry_label key in freq_meta_expr. Default is ‘global’.
variant_quality (str) – Variant quality to use for filtering by the ‘group’ key in freq_meta_expr. Default is ‘adj’.
singleton (bool) – Whether to filter to only singletons before counting (AC == 1). Default is False.
max_af (Optional[float]) – Maximum variant allele frequency to keep. By default no allele frequency cutoff is applied.
genetic_ancestry_label (Optional[str]) – Label defining the genetic ancestry groups. If None, “gen_anc” or “pop” is used (in that order of preference) if present. Default is None.
subset (Optional[str]) – Subset to use for filtering by the ‘subset’ key in freq_meta_expr. Default is None, which will return all downsampling counts without a ‘subset’ key in freq_meta_expr. If specified, only downsamplings with the specified subset will be included.
downsamplings (Optional[List[int]]) – Optional List of integers specifying what downsampling indices to obtain. Default is None, which will return all downsampling counts.

Return type:

ArrayExpression

Returns:

Aggregation Expression for an array of the variant counts in downsamplings for specified genetic ancestry group.

gnomad.utils.constraint.explode_downsamplings_oe(ht, downsampling_meta, metrics=['syn', 'lof', 'mis'])[source]

Explode observed and expected downsampling counts for each genetic ancestry group and metric.

The input ht must contain struct of downsampling information for genetic ancestry groups under each metric name. For example: ‘lof’: struct {gen_anc_exp: struct {global: array<float64>}.

Parameters:

ht (Table) – Input Table.
metrics (List[str]) – List of metrics to explode. Default is ‘[‘syn’, ‘lof’, ‘mis’]’.
downsampling_meta (Dict[str, List[str]]) – Dictionary containing downsampling metadata. Keys are the genetic ancestry group names and values are the list of downsamplings for that genetic ancestry group. Example: {‘global’: [‘5000’, ‘10000’], ‘afr’: [‘5000’, ‘10000’]}.

Return type:

Table

Returns:

Table with downsampling counts exploded so that observed and expected metric counts for each pair of genetic ancestry groups and downsamplings is a separate row.

gnomad.utils.constraint.annotate_mutation_type(t, context_length=None, num_scan_context_length=100)[source]

Annotate mutation types.

The following annotations are added to the output Table:

cpg
transition
mutation_type - one of “CpG”, “non-CpG transition”, or “transversion”
mutation_type_model

..note:

This function uses the term ‘mutation_type’ because ‘variant_type’ is already used in this repo to indicate a variant’s multiallelic and SNP/indel status.

Parameters:

t (Union[MatrixTable, Table]) – Input Table or MatrixTable.
context_length (Optional[int]) – Length of the ‘context’ annotation in ‘t’. If this is not specified, the value will be determined by examining the first num_scan_context_length values of the ‘context’ annotation. Default is None.
num_scan_context_length (Optional[int]) – Number of values in the ‘context’ annotation to use for determining context_length if it is not specified. If set to None, all values in ‘context’ will be used. Default is 100.

Return type:

Union[MatrixTable, Table]

Returns:

Table with mutation type annotations added.

gnomad.utils.constraint.trimer_from_heptamer(t)[source]

Trim heptamer context to create trimer context.

Parameters:: t (Union[MatrixTable, Table]) – Input MatrixTable or Table with context annotation.
Return type:: Union[MatrixTable, Table]
Returns:: MatrixTable or Table with trimer context annotated.

gnomad.utils.constraint.collapse_strand(t)[source]

Return the deduplicated context by collapsing DNA strands.

Function returns the reverse complement for ‘ref, ‘alt’, and ‘context’ if the reference allele is either ‘G’ or ‘T’.

The following annotations are added to the output Table:

was_flipped - whether the ‘ref, ‘alt’, and ‘context’ were flipped (reverse complement taken)

Parameters:

ht – Input Table.
t (Union[Table, MatrixTable]) –

Return type:

Union[Table, MatrixTable]

Returns:

Table with deduplicated context annotation (ref, alt, context, was_flipped).

gnomad.utils.constraint.transform_methylation_level(methylation_expr='methylation_level', methylation_cutoffs=(0, 5), ht=None)[source]

Transform methylation level into a 0 to 2 scale.

The methylation level is transformed to a 0-2 scale based on the provided cutoffs. The methylation level is assigned a value of 2 if it is greater than the second cutoff, 1 if it is greater than the first cutoff (but less than or equal to the second), and 0 otherwise.

Parameters:

methylation_expr (Union[str, NumericExpression]) – Methylation level expression or annotation name in ht. If methylation_expr is a string, ht must be provided.
methylation_cutoffs (Tuple[Union[int, float], Union[int, float]]) – Tuple of two integers/floats representing the cutoffs for the methylation level transformation. Default is (0, 5).
ht (Optional[Table]) – Input Table. Default is None.

Return type:

Union[Table, NumericExpression]

Returns:

Table with methylation level annotation added or methylation level expression.

gnomad.utils.constraint.assemble_constraint_context_ht(ht, methylation_ht=None, gerp_ht=None, coverage_hts=None, an_hts=None, freq_hts=None, filter_hts=None, transformation_funcs=None)[source]

Assemble context Table with necessary annotations for constraint calculations.

Note

Checks for ‘was_split’ annotation in Table. If not present, splits multiallelic sites.

The following annotations are added to the output Table:

ref: Reference allele.

alt: Alternate allele.

context: Trimer context.

annotations added by annotate_mutation_type(), collapse_strand(), and add_most_severe_csq_to_tc_within_vep_root().

Depending on the annotations provided, the following annotations may also be added:

‘methylation_level’: Methylation level annotation will be added if methylation_ht is provided.

‘gerp’: GERP annotation will be added if gerp_ht is provided.

‘coverage’: Coverage annotations will be added if coverage_hts is provided.

‘AN’: Allele number annotations will be added if an_hts is provided.

‘freq’: Frequency annotations will be added if freq_hts is provided.

‘filters’: Filter annotations will be added if filter_hts is provided.

The transformation_funcs parameter can be used to transform the HTs before adding annotations. The keys should be the annotation names (‘coverage’, ‘gerp’, etc.) and the values should be functions that take the annotation Table and the ht that is being annotated as input and return the transformed and keyed annotation. If not provided, the following default transformations are used:

‘methylation_level’: Uses the ‘methylation_level’ annotation in the Methylation Table after transforming the methylation level to a 0-2 scale using transform_grch37_methylation() or transform_grch38_methylation().

‘gerp’: Uses the ‘S’ annotation in the GERP Table. If ‘S’ is missing, it defaults to 0.

‘coverage’: If necessary, pulls out the first element of coverage statistics (which includes all samples). Relevant to v4, where coverage stats include additional elements to stratify by UK Biobank subset and platforms.

‘AN’: Uses the ‘AN’ annotation in the allele number Table.

‘freq’: Uses the ‘freq’ annotation in the frequency Table.

‘filters’: Uses the ‘filters’ annotation in the filter Table.

The following global annotations may also be added to the output Table:

‘an_globals’: Global allele number annotations ‘strata_sample_count’ and ‘strata_meta’ will be added if an_hts is provided.

‘freq_globals’: Global frequency annotations ‘freq_meta_sample_count’ and ‘freq_meta’ will be added if freq_hts is provided.

Parameters:

ht (Table) – Input context Table with VEP annotation.
methylation_ht (Table) – Optional Table with methylation annotation. Default is None.
gerp_ht (Table) – Optional Table with GERP annotation. Default is None.
coverage_hts (Dict[str, Table]) – An optional Dictionary with key as one of ‘exomes’ or ‘genomes’ and values as corresponding coverage Tables. Default is None.
an_hts (Dict[str, Table]) – A Dictionary with key as one of ‘exomes’ or ‘genomes’ and values as corresponding allele number Tables. Default is None.
transformation_funcs (Optional[Dict[str, Callable]]) – An optional Dictionary to transform the HTs before adding annotations. Default is None, which will use some default transformations as described above.
freq_hts (Dict[str, Table]) –
filter_hts (Dict[str, Table]) –

Return type:

Table

Returns:

Table with sites split and necessary annotations.

gnomad.utils.constraint.build_models(coverage_ht, coverage_expr, weighted=False, gen_ancs=(), keys=('context', 'ref', 'alt', 'methylation_level', 'mu_snp'), high_cov_definition=30, upper_cov_cutoff=None, skip_coverage_model=False, log10_coverage=True)[source]

Build coverage and plateau models.

This function builds models (plateau_models) using linear regression to calibrate mutation rate estimates against the proportion observed of each substitution, context, and methylation level in coverage_ht.

Two plateau models are fit, one for CpG transitions, and one for the remainder of sites (transversions and non CpG transitions).

The plateau models only consider high coverage sites, or sites above a median coverage of high_cov_definition and median coverage below upper_cov_cutoff.

Plateau model: adjusts proportion of expected variation based on location in the genome and CpG status. The x and y of the plateau models: - x: mu_snp - mutation rate - y: proportion observed (‘observed_variants’ or ‘observed_{gen_anc}’ / ‘possible_variants’)

This function also builds models (coverage models) to calibrate the proportion of expected variation at low coverage sites (sites below high_cov_definition).

The coverage models are built by creating a scaling factor across all high coverage sites, applying this ratio to the low coverage sites, and running a linear regression.

Coverage model: corrects proportion of expected variation at low coverage sites. Low coverage sites are defined as sites with median coverage < high_cov_definition.

The x and y of the coverage model: - x: log10 groupings of exome coverage at low coverage sites - y: sum(‘observed_variants’)/ (high_coverage_scale_factor * sum(‘possible_variants’ * ‘mu_snp’) at low coverage sites

high_coverage_scale_factor = sum(‘observed_variants’) /: sum(‘possible_variants’ * ‘mu_snp’) at high coverage sites

Note

This function expects that the input Table(coverage_ht) was created using get_proportion_observed_by_coverage, which means that coverage_ht should contain only high quality synonymous variants below 0.1% frequency.

This function also expects that the following fields are present in coverage_ht: - context - trinucleotide genomic context - ref - the reference allele - alt - the alternate allele - methylation_level - methylation level - cpg - whether the site is CpG site - observed_variants - the number of observed variants in the dataset for each variant. Note that the term “variant” here refers to a specific substitution, context, methylation level, and coverage combination - downsampling_counts_{gen_anc} (optional) - array of observed variant counts per genetic ancestry group after downsampling. Used only when gen_ancs is specified. - mu_snp - mutation rate - possible_variants - the number of possible variants in the dataset for each variant

Parameters:

coverage_ht (Table) – Input coverage Table.
coverage_expr (Int32Expression) – Expression that defines the coverage metric.
weighted (bool) – Whether to weight the plateau models (a linear regression model) by ‘possible_variants’. Default is False.
gen_ancs (Tuple[str]) – List of genetic ancestry groups used to build plateau models. Default is ().
keys (Tuple[str]) – Annotations used to group observed and possible variant counts. Default is (“context”, “ref”, “alt”, “methylation_level”, “mu_snp”).
high_cov_definition (int) – Lower median coverage cutoff. Sites with coverage above this cutoff are considered well covered. Default is COVERAGE_CUTOFF.
upper_cov_cutoff (Optional[int]) – Upper median coverage cutoff. Sites with coverage above this cutoff are excluded from the high coverage Table. Default is None.
skip_coverage_model (bool) – Whether to skip generating the coverage model. If set to True, None is returned instead of the coverage model. Default is False.
log10_coverage (bool) – Whether to convert coverage sites with log10 when building the coverage model. Default is True.

Return type:

Tuple[Optional[Tuple[float, float]], StructExpression]

Returns:

Coverage model and plateau models.

gnomad.utils.constraint.build_plateau_models(cpg_expr, mu_snp_expr, observed_variants_expr, possible_variants_expr, gen_ancs_observed_variants_array_expr=[], weighted=False)[source]

Build plateau models to calibrate mutation rate to compute predicted proportion observed value.

The x and y of the plateau models: - x: mu_snp_expr - y: observed_variants_expr / possible_variants_expr or gen_ancs_observed_variants_array_expr`[index] / `possible_variants_expr if gen_ancs is specified

Parameters:

cpg_expr (BooleanExpression) – BooleanExpression noting whether a site is a CPG site.
mu_snp_expr (Float64Expression) – Float64Expression of the mutation rate.
observed_variants_expr (Int64Expression) – Int64Expression of the observed variant counts.
possible_variants_expr (Int64Expression) – Int64Expression of the possible variant counts.
gen_ancs_observed_variants_array_expr (List[ArrayExpression]) – Nested ArrayExpression with all observed variant counts ArrayNumericExpressions for specified genetic ancestry groups. e.g., [[1,1, 1],[1,1,1]]. Default is None.
weighted (bool) – Whether to generalize the model to weighted least squares using ‘possible_variants’. Default is False.

Return type:

Dict[str, Union[Dict[bool, ArrayExpression], ArrayExpression]]

Returns:

A dictionary of intercepts and slopes of plateau models. The keys are ‘total’ (for all sites) and ‘gen_anc’ (optional; for genetic ancestry groups). The values for ‘total’ is a dictionary (e.g., <DictExpression of type dict<bool, array<float64>>>), and the value for ‘gen_anc’ is a nested list of dictionaries (e. g., <ArrayExpression of type array<array<dict<bool, array<float64>>>>>). The key of the dictionary in the nested list is CpG status (BooleanExpression), and the value is an ArrayExpression containing intercept and slope values.

gnomad.utils.constraint.build_coverage_model(low_coverage_oe_expr, coverage_expr)[source]

Build coverage model.

This function uses linear regression to build a model of coverage to correct proportion of expected variation at low coverage sites.

The x and y of the coverage model: - x: coverage_expr - y: low_coverage_oe_expr

Parameters:

low_coverage_oe_expr (Float64Expression) – The Float64Expression of observed:expected ratio for a given coverage level.
coverage_expr (Float64Expression) – The Float64Expression of the coverage expression.

Return type:

StructExpression

Returns:

StructExpression with intercept and slope of the model.

gnomad.utils.constraint.get_all_gen_anc_lengths(ht, gen_ancs, obs_expr)[source]

Get the minimum length of observed variant counts array for each genetic ancestry group downsampling.

The observed variant counts for each genetic ancestry group in gen_ancs are specified by annotations on the obs_expr expression.

The function also performs a check that arrays of variant counts within genetic ancestry group downsamplings all have the same lengths.

Parameters:

ht (Table) – Input Table containing obs_expr.
gen_ancs (Tuple[str]) – Genetic ancestry groups used to categorize observed variant counts in downsamplings.
obs_expr (StructExpression) – Expression for the genetic ancestry group observed variant counts. Should be a struct containing an array for each gen_anc in gen_ancs.

Return type:

List[Tuple[str, str]]

Returns:

A Dictionary with the minimum array length for each genetic ancestry group.

gnomad.utils.constraint.get_constraint_grouping_expr(vep_annotation_expr, coverage_expr=None, include_transcript_group=True, include_canonical_group=True, include_mane_select_group=False)[source]

Collect annotations used for constraint groupings.

Function collects the following annotations:

annotation - ‘most_severe_consequence’ annotation in vep_annotation_expr
modifier - classic lof annotation from ‘lof’ annotation in
vep_annotation_expr, LOFTEE annotation from ‘lof’ annotation in vep_annotation_expr, PolyPhen annotation from ‘polyphen_prediction’ in vep_annotation_expr, or “None” if neither is defined
gene - ‘gene_symbol’ annotation inside vep_annotation_expr
coverage - exome coverage if coverage_expr is specified
transcript - id from ‘transcript_id’ in vep_annotation_expr (added when
include_transcript_group is True)
canonical from vep_annotation_expr (added when include_canonical_group is
True)
mane_select from vep_annotation_expr (added when include_mane_select_group is
True)

Note

This function expects that the following fields are present in vep_annotation_expr: - lof - polyphen_prediction - most_severe_consequence - gene_symbol - transcript_id (if include_transcript_group is True) - canonical (if include_canonical_group is True) - mane_select (if include_mane_select_group is True)

Parameters:

vep_annotation_expr (StructExpression) – StructExpression of VEP annotation.
coverage_expr (Optional[Int32Expression]) – Optional Int32Expression of exome coverage. Default is None.
include_transcript_group (bool) – Whether to include the transcript annotation in the groupings. Default is True.
include_canonical_group (bool) – Whether to include canonical annotation in the groupings. Default is True.
include_mane_select_group (bool) – Whether to include mane_select annotation in the groupings. Default is False.

Return type:

Dict[str, Union[StringExpression, Int32Expression, BooleanExpression]]

Returns:

A dictionary with keys as annotation names and values as actual annotations.

gnomad.utils.constraint.annotate_exploded_vep_for_constraint_groupings(ht, coverage_expr, vep_annotation='transcript_consequences', include_canonical_group=True, include_mane_select_group=False)[source]

Annotate Table with annotations used for constraint groupings.

Function explodes the specified VEP annotation (vep_annotation) and adds the following annotations:

annotation -‘most_severe_consequence’ annotation in vep_annotation
modifier - classic lof annotation from ‘lof’ annotation in
vep_annotation, LOFTEE annotation from ‘lof’ annotation in vep_annotation, PolyPhen annotation from ‘polyphen_prediction’ in vep_annotation, or “None” if neither is defined
gene - ‘gene_symbol’ annotation inside vep_annotation
transcript - id from ‘transcript_id’ in vep_annotation (added when
include_transcript_group is True)
canonical from vep_annotation (added when include_canonical_group is
True)
mane_select from vep_annotation (added when include_mane_select_group is
True)

Note

This function expects that the following annotations are present in ht: - vep - exome_coverage

Parameters:

ht (Table) – Input Table or MatrixTable.
coverage_expr (Int32Expression) – Expression that defines the coverage metric.
vep_annotation (str) – Name of annotation in ‘vep’ annotation (one of “transcript_consequences” and “worst_csq_by_gene”) that will be used for obtaining constraint annotations. Default is “transcript_consequences”.
include_canonical_group (bool) – Whether to include ‘canonical’ annotation in the groupings. Default is True. Ignored unless vep_annotation is “transcript_consequences”.
include_mane_select_group (bool) – Whether to include ‘mane_select’ annotation in the groupings. Default is False. Ignored unless vep_annotation is “transcript_consequences”.

Return type:

Tuple[Union[Table, MatrixTable], Tuple[str]]

Returns:

A tuple of input Table or MatrixTable with grouping annotations added and the names of added annotations.

gnomad.utils.constraint.compute_expected_variants(ht, plateau_models_expr, mu_expr, cov_corr_expr, possible_variants_expr, cpg_expr, gen_anc=None)[source]

Apply plateau models for all sites and for a genetic ancestry group (if specified) to compute predicted proportion observed ratio and expected variant counts.

Parameters:

ht (Table) – Input Table.
plateau_models_expr (StructExpression) – Linear models (output of build_models(), with the values of the dictionary formatted as a StructExpression of intercept and slope, that calibrates mutation rate to proportion observed for high coverage exomes. It includes models for CpG, non-CpG sites, and each genetic ancestry group if specified.
mu_expr (Float64Expression) – Float64Expression of mutation rate.
possible_variants_expr (Int64Expression) – Int64Expression of possible variant counts.
cov_corr_expr (Float64Expression) – Float64Expression of corrected coverage expression.
cpg_expr (BooleanExpression) – BooleanExpression noting whether a site is a CPG site.
gen_anc (Optional[str]) – Optional genetic ancestry group to use when applying plateau model. Default is None.

Return type:

Dict[str, Union[Float64Expression, Int64Expression]]

Returns:

A dictionary with predicted proportion observed ratio and expected variant counts.

gnomad.utils.constraint.oe_aggregation_expr(ht, filter_expr, gen_ancs=(), exclude_mu_sum=False)[source]

Get aggregation expressions to compute the observed:expected ratio for rows defined by filter_expr.

Return a Struct containing aggregation expressions to sum the number of observed variants, possible variants, expected variants, and mutation rate (if exclude_mu_sum is not True) for rows defined by filter_expr. The Struct also includes an aggregation expression for the observed:expected ratio.

The following annotations are in the returned StructExpression:

obs - the sum of observed variants filtered to filter_expr.
mu - the sum of mutation rate of variants filtered to filter_expr.
possible - possible number of variants filtered to filter_expr.
exp - expected number of variants filtered to filter_expr.
oe - observed:expected ratio of variants filtered to filter_expr.

If gen_ancs is specified:

gen_anc_exp - Struct with the expected number of variants per genetic ancestry group (for all gen_anc in gen_ancs) filtered to filter_expr.
gen_anc_obs - Struct with the observed number of variants per genetic ancestry group (for all gen_anc in gen_ancs) filtered to filter_expr.

Note

The following annotations should be present in ht:

observed_variants
mu
possible_variants
expected_variants

If gen_ancs is specified, the following annotations should also be present:

expected_variants_{gen_anc} for all gen_anc in gen_ancs
downsampling_counts_{gen_anc} for all gen_anc in gen_ancs

Parameters:

ht (Table) – Input Table to create observed:expected ratio aggregation expressions for.
filter_expr (BooleanExpression) – Boolean expression used to filter ht before aggregation.
gen_ancs (Tuple[str]) – List of genetic ancestry groups to compute constraint metrics for. Default is ().
exclude_mu_sum (bool) – Whether to exclude mu sum aggregation expression from returned struct. Default is False.

Return type:

StructExpression

Returns:

StructExpression with observed:expected ratio aggregation expressions.

gnomad.utils.constraint.compute_pli(ht, obs_expr, exp_expr, expected_values=None, min_diff_convergence=0.001)[source]

Compute the pLI score using the observed and expected variant counts.

Full details on pLI can be found in the ExAC paper: Lek, M., Karczewski, K., Minikel, E. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

pLI is the probability of being loss-of-function intolerant, and this function computes that probability using the expectation-maximization (EM) algorithm.

We assume a 3 state model, where each gene fits into one of three categories with respect loss-of-function variation sensitivity:

Null: where protein truncating variation is completely tolerated by natural selection.

Recessive (Rec): where heterozygous pLoFs are tolerated but homozygous pLoFs are not.

Haploinsufficient (LI): where heterozygous pLoFs are not tolerated.

The function requires the expected amount of loss-of-function depletion for each of these states. The default provided is based on the observed depletion of protein-truncating variation in the Blekhman autosomal recessive and ClinGen dosage sensitivity gene sets (Supplementary Information Table 12 of the above reference):

Null: 1.0, assume tolerant genes have the expected amount of truncating variation.

Rec: 0.463, derived from the empirical mean observed/expected rate of truncating variation for recessive disease genes (0.463).

LI: 0.089, derived from the empirical mean observed/expected rate of truncating variation for severe haploinsufficient genes.

The output StructExpression will include the following annotations:

pLI: Probability of loss-of-function intolerance; probability that transcript falls into distribution of haploinsufficient genes.

pNull: Probability that transcript falls into distribution of unconstrained genes.

pRec: Probability that transcript falls into distribution of recessive genes.

Parameters:

ht (Table) – Input Table containing obs_expr and exp_expr.
obs_expr (Int64Expression) – Expression for the number of observed variants on each gene or transcript in ht.
exp_expr (Float64Expression) – Expression for the number of expected variants on each gene or transcript in ht.
expected_values (Optional[Dict[str, float]]) – Dictionary containing the expected values for ‘Null’, ‘Rec’, and ‘LI’ to use as starting values.
min_diff_convergence (float) – Minimum iteration change in LI to consider the EM model convergence criteria as met. Default is 0.001.

Return type:

StructExpression

Returns:

StructExpression for pLI scores.

gnomad.utils.constraint.oe_confidence_interval(obs_expr, exp_expr, alpha=0.05)[source]

Determine the confidence interval around the observed:expected ratio.

For a given pair of observed (obs_expr) and expected (exp_expr) values, the function computes the density of the Poisson distribution (performed using Hail’s dpois module) with fixed k (x in dpois is set to the observed number of variants) over a range of lambda (lamb in dpois) values, which are given by the expected number of variants times a varying parameter ranging between 0 and 2 (the observed:expected ratio is typically between 0 and 1, so we want to extend the upper bound of the confidence interval to capture this). The cumulative density function of the Poisson distribution density is computed and the value of the varying parameter is extracted at points corresponding to alpha (defaults to 5%) and 1-alpha (defaults to 95%) to indicate the lower and upper bounds of the confidence interval.

The following annotations are in the output StructExpression:

lower - the lower bound of confidence interval
upper - the upper bound of confidence interval

Parameters:

obs_expr (Int64Expression) – Expression for the observed variant counts of pLoF, missense, or synonymous variants in ht.
exp_expr (Float64Expression) – Expression for the expected variant counts of pLoF, missense, or synonymous variants in ht.
alpha (float) – The significance level used to compute the confidence interval. Default is 0.05.

Return type:

StructExpression

Returns:

StructExpression for the confidence interval lower and upper bounds.

gnomad.utils.constraint.calculate_raw_z_score(obs_expr, exp_expr)[source]

Compute the signed raw z-score using observed and expected variant counts.

The raw z-scores are positive when the transcript had fewer variants than expected, and are negative when transcripts had more variants than expected.

Parameters:

obs_expr (Int64Expression) – Observed variant count expression.
exp_expr (Float64Expression) – Expected variant count expression.

Return type:

StructExpression

Returns:

StructExpression for the raw z-score.

gnomad.utils.constraint.get_constraint_flags(exp_expr, raw_z_expr, raw_z_lower_threshold=-5.0, raw_z_upper_threshold=5.0, flag_postfix='')[source]

Determine the constraint flags that define why constraint will not be calculated.

Flags which are added:

“no_exp_{flag_postfix}” - for genes that have missing or zero expected variants.
“outlier_{flag_postfix}” - for genes that are raw z-score outliers: (raw_z_expr < raw_z_lower_threshold) or (raw_z_expr > raw_z_upper_threshold).

Parameters:

exp_expr (Float64Expression) – Expression for the expected variant counts of pLoF, missense, or synonymous variants.
raw_z_expr (Float64Expression) – Expression for the signed raw z-score of pLoF, missense, or synonymous variants.
raw_z_lower_threshold (Optional[float]) – Lower threshold for the raw z-score. When raw_z_expr is less than this threshold it is considered an ‘outlier’. Default is -5.0.
raw_z_upper_threshold (Optional[float]) – Upper threshold for the raw z-score. When raw_z_expr is greater than this threshold it is considered an ‘outlier’. Default is 5.0.
flag_postfix (str) – Postfix to add to the end of the constraint flag names.

Return type:

Dict[str, Expression]

Returns:

Dictionary containing expressions for constraint flags.

gnomad.utils.constraint.calculate_raw_z_score_sd(raw_z_expr, flag_expr, mirror_neg_raw_z=True)[source]

Calculate the standard deviation of the raw z-score.

When using mirror_neg_raw_z is True, all the negative raw z-scores (defined by raw_z_expr) are combined with those same z-scores multiplied by -1 (to create a mirrored distribution).

Parameters:

raw_z_expr (Float64Expression) – Expression for the raw z-score.
flag_expr (StringExpression) – Expression for the constraint flags. z-score will not be calculated if flags are present.
mirror_neg_raw_z (bool) – Whether the standard deviation should be computed using a mirrored distribution of negative raw_z_expr.

Return type:

Expression

Returns:

StructExpression containing standard deviation of the raw z-score and the z-score.

gnomad.utils.constraint.add_gencode_transcript_annotations(ht, gencode_ht, annotations=['level', 'transcript_type'])[source]

Add GENCODE annotations to Table based on transcript id.

Note

Added annotations by default are: - level - transcript_type

Computed annotations are: - chromosome - cds_length - num_coding_exons

Parameters:

ht (Table) – Input Table.
gencode_ht (Table) – Table with GENCODE annotations.
annotations (List[str]) – List of GENCODE annotations to add. Default is [“level”, “transcript_type”]. Added annotations also become keys for the group by when computing “cds_length” and “num_coding_exons”.

Return type:

Table

Returns:

Table with transcript annotations from GENCODE added.