gnomad.utils.annotations

gnomad.utils.annotations.pop_max_expr(freq, …)

Create an expression containing the frequency information about the population that has the highest AF in freq_meta.

gnomad.utils.annotations.project_max_expr(…)

Create an expression that computes allele frequency information by project for the n_projects with the largest AF at this row.

gnomad.utils.annotations.faf_expr(freq, …)

Calculate the filtering allele frequency (FAF) for each threshold specified in faf_thresholds.

gnomad.utils.annotations.qual_hist_expr([…])

Return a struct expression with genotype quality histograms based on the arguments given (dp, gq, ad).

gnomad.utils.annotations.age_hists_expr(…)

Return a StructExpression with the age histograms for hets and homs.

gnomad.utils.annotations.annotate_freq(mt[, …])

Annotate mt with stratified allele frequencies.

gnomad.utils.annotations.get_lowqual_expr(…)

Compute lowqual threshold expression for either split or unsplit alleles based on QUALapprox or AS_QUALapprox.

gnomad.utils.annotations.get_annotations_hists(ht, …)

Create histograms for variant metrics in ht.info.

gnomad.utils.annotations.create_frequency_bins_expr(AC, AF)

Create bins for frequencies in preparation for aggregating QUAL by frequency bin.

gnomad.utils.annotations.get_adj_expr(…[, …])

Get adj genotype annotation.

gnomad.utils.annotations.annotate_adj(mt[, …])

Annotate genotypes with adj criteria (assumes diploid).

gnomad.utils.annotations.add_variant_type(…)

Get Struct of variant_type and n_alt_alleles from ArrayExpression of Strings (all alleles).

gnomad.utils.annotations.annotation_type_is_numeric(t)

Given an annotation type, return whether it is a numerical type or not.

gnomad.utils.annotations.annotation_type_in_vcf_info(t)

Given an annotation type, returns whether that type can be natively exported to a VCF INFO field.

gnomad.utils.annotations.bi_allelic_site_inbreeding_expr(call)

Return the site inbreeding coefficient as an expression to be computed on a MatrixTable.

gnomad.utils.annotations.fs_from_sb(sb[, …])

Compute FS (Fisher strand balance) annotation from the SB (strand balance table) field.

gnomad.utils.annotations.sor_from_sb(sb)

Compute SOR (Symmetric Odds Ratio test) annotation from the SB (strand balance table) field.

gnomad.utils.annotations.bi_allelic_expr(t)

Return a boolean expression selecting bi-allelic sites only, accounting for whether the input MT/HT was split.

gnomad.utils.annotations.unphase_call_expr(…)

Generate unphased version of a call expression (which can be phased or not).

gnomad.utils.annotations.region_flag_expr(t)

Create a region_flag struct that contains flags for problematic regions (i.e., LCR, decoy, segdup, and nonpar regions).

gnomad.utils.annotations.missing_callstats_expr()

Create a missing callstats struct for insertion into frequency annotation arrays when data is missing.

gnomad.utils.annotations.set_female_y_metrics_to_na_expr(t)

Set Y-variant frequency callstats for female-specific metrics to missing structs.

gnomad.utils.annotations.hemi_expr(locus, …)

Return whether genotypes are hemizygous.

gnomad.utils.annotations.pop_max_expr(freq, freq_meta, pops_to_exclude=None)[source]

Create an expression containing the frequency information about the population that has the highest AF in freq_meta.

Populations specified in pops_to_exclude are excluded and only frequencies from adj populations are considered.

The resulting struct contains the following fields:

  • AC: int32

  • AF: float64

  • AN: int32

  • homozygote_count: int32

  • pop: str

Parameters
  • freq (ArrayExpression) – ArrayExpression of Structs with fields [‘AC’, ‘AF’, ‘AN’, ‘homozygote_count’]

  • freq_meta (ArrayExpression) – ArrayExpression of meta dictionaries corresponding to freq (as returned by annotate_freq)

  • pops_to_exclude (Optional[Set[str]]) – Set of populations to skip for popmax calculation

Return type

StructExpression

Returns

Popmax struct
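
As an illustration of the selection rule described above, here is a minimal plain-Python sketch (it does not use Hail and is not the library's implementation). The function name pop_max and the metadata keys 'group' and 'pop' are assumptions made for this example; the source only states that freq_meta holds dictionaries in the format produced by annotate_freq.

```python
from typing import Dict, List, Optional, Set


def pop_max(
    freq: List[Dict],
    freq_meta: List[Dict[str, str]],
    pops_to_exclude: Optional[Set[str]] = None,
) -> Optional[Dict]:
    """Pick the adj per-population entry with the highest AF (hypothetical helper)."""
    excluded = pops_to_exclude or set()
    best = None
    for stats, meta in zip(freq, freq_meta):
        # Assumption: a per-population adj entry has exactly the keys 'group' and 'pop'.
        if set(meta) != {"group", "pop"} or meta["group"] != "adj":
            continue
        if meta["pop"] in excluded:
            continue
        if best is None or stats["AF"] > best["AF"]:
            # Copy the call stats and attach the population label.
            best = dict(stats, pop=meta["pop"])
    return best
```

The returned dictionary mirrors the field list above (AC, AF, AN, homozygote_count, pop); None is returned when no population qualifies.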

gnomad.utils.annotations.project_max_expr(project_expr, gt_expr, alleles_expr, n_projects=5)[source]

Create an expression that computes allele frequency information by project for the n_projects with the largest AF at this row.

Will return an array with one element per non-reference allele.

Each of these elements is itself an array of structs with the following fields:

  • AC: int32

  • AF: float64

  • AN: int32

  • homozygote_count: int32

  • project: str

Note

Only projects with AF > 0 are returned. In case of ties, the project ordering is not guaranteed, and at most n_projects are returned.

Parameters
  • project_expr (StringExpression) – column expression containing the project

  • gt_expr (CallExpression) – entry expression containing the genotype

  • alleles_expr (ArrayExpression) – row expression containing the alleles

  • n_projects (int) – Maximum number of projects to return for each row

Return type

ArrayExpression

Returns

projectmax expression
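
The rule in the note above (keep only projects with AF > 0, then return at most n_projects with the largest AF) can be sketched in plain Python for a single alternate allele. This is an illustrative sketch, not the Hail expression the function builds; the helper name project_max and the list-of-dicts input are assumptions.

```python
from typing import Dict, List


def project_max(per_project: List[Dict], n_projects: int = 5) -> List[Dict]:
    """Return up to n_projects entries with the largest non-zero AF (hypothetical helper)."""
    # Drop projects in which the allele was not observed.
    observed = [p for p in per_project if p["AF"] > 0]
    # Sort by AF, largest first. The order among tied AFs is left to the sort
    # here and should not be relied upon.
    observed.sort(key=lambda p: p["AF"], reverse=True)
    return observed[:n_projects]
```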

gnomad.utils.annotations.faf_expr(freq, freq_meta, locus, pops_to_exclude=None, faf_thresholds=[0.95, 0.99])[source]

Calculate the filtering allele frequency (FAF) for each threshold specified in faf_thresholds.

See http://cardiodb.org/allelefrequencyapp/ for more information.

The FAF is computed for each of the following population stratifications if found in freq_meta:

  • All samples, with adj criteria

  • For each population, with adj criteria

  • For all sex/population on the non-PAR regions of sex chromosomes (will be missing on autosomes and PAR regions of sex chromosomes)

Each FAF entry is a struct with one float64 field per threshold specified in faf_thresholds.

This returns a tuple with two expressions:

  1. An array of FAF expressions as described above

  2. An array of dict containing the metadata for each of the array elements, in the same format as that produced by annotate_freq.

Parameters
  • freq (ArrayExpression) – ArrayExpression of call stats structs (typically generated by hl.agg.call_stats)

  • freq_meta (ArrayExpression) – ArrayExpression of meta dictionaries corresponding to freq (typically generated using annotate_freq)

  • locus (LocusExpression) – locus

  • pops_to_exclude (Optional[Set[str]]) – Set of populations to exclude from faf calculation (typically bottlenecked or consanguineous populations)

  • faf_thresholds (List[float]) – List of FAF thresholds to compute

Return type

Tuple[ArrayExpression, List[Dict[str, str]]]

Returns

(FAF expression, FAF metadata)

gnomad.utils.annotations.qual_hist_expr(gt_expr=None, gq_expr=None, dp_expr=None, ad_expr=None, adj_expr=None)[source]

Return a struct expression with genotype quality histograms based on the arguments given (dp, gq, ad).

Note

  • If gt_expr is provided, will return histograms for non-reference samples only as well as all samples.

  • gt_expr is required for the allele-balance histogram, as it is only computed on het samples.

  • If adj_expr is provided, additional histograms are computed using only adj samples.

Parameters
  • gt_expr (Optional[CallExpression]) – Entry expression containing genotype

  • gq_expr (Optional[NumericExpression]) – Entry expression containing genotype quality

  • dp_expr (Optional[NumericExpression]) – Entry expression containing depth

  • ad_expr (Optional[ArrayNumericExpression]) – Entry expression containing allelic depth (bi-allelic here)

  • adj_expr (Optional[BooleanExpression]) – Entry expression containing adj (high quality) genotype status

Return type

StructExpression

Returns

Genotype quality histograms expression

gnomad.utils.annotations.age_hists_expr(adj_expr, gt_expr, age_expr, lowest_boundary=30, highest_boundary=80, n_bins=10)[source]

Return a StructExpression with the age histograms for hets and homs.

Parameters
  • adj_expr (BooleanExpression) – Entry expression containing whether a genotype is high quality (adj) or not

  • gt_expr (CallExpression) – Entry expression containing the genotype

  • age_expr (NumericExpression) – Col expression containing the sample’s age

  • lowest_boundary (int) – Lowest bin boundary (any younger sample will be binned in n_smaller)

  • highest_boundary (int) – Highest bin boundary (any older sample will be binned in n_larger)

  • n_bins (int) – Total number of bins

Return type

StructExpression

Returns

A struct with age_hist_het and age_hist_hom
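
To make the binning parameters concrete, the following plain-Python sketch counts ages into n_bins equal-width bins between lowest_boundary and highest_boundary, with out-of-range samples counted in n_smaller and n_larger. It is an illustration only: the helper name age_hist, the output keys, and the choice to include highest_boundary in the last bin are assumptions, not a statement of the library's behavior.

```python
from typing import Dict, Iterable


def age_hist(
    ages: Iterable[float],
    lowest_boundary: int = 30,
    highest_boundary: int = 80,
    n_bins: int = 10,
) -> Dict:
    """Equal-width histogram with overflow counters (hypothetical helper)."""
    width = (highest_boundary - lowest_boundary) / n_bins
    bin_freq = [0] * n_bins
    n_smaller = 0
    n_larger = 0
    for age in ages:
        if age < lowest_boundary:
            n_smaller += 1
        elif age > highest_boundary:
            n_larger += 1
        else:
            # Assumption: the upper boundary falls into the last bin.
            idx = min(int((age - lowest_boundary) / width), n_bins - 1)
            bin_freq[idx] += 1
    edges = [lowest_boundary + i * width for i in range(n_bins + 1)]
    return {
        "bin_edges": edges,
        "bin_freq": bin_freq,
        "n_smaller": n_smaller,
        "n_larger": n_larger,
    }
```

With the defaults, each bin spans five years (30 to 35, 35 to 40, and so on).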

gnomad.utils.annotations.annotate_freq(mt, sex_expr=None, pop_expr=None, subpop_expr=None, additional_strata_expr=None, downsamplings=None)[source]

Annotate mt with stratified allele frequencies.

The output MatrixTable will include:
  • row annotation freq containing the stratified allele frequencies

  • global annotation freq_meta with metadata

  • global annotation freq_sample_count with sample count information

Note

Currently this only supports bi-allelic sites. The input mt needs to have the following entry fields:

  • GT: a CallExpression containing the genotype

  • adj: a BooleanExpression containing whether the genotype is of high quality or not

All expression arguments need to be expressions on the input mt.

freq row annotation

The freq row annotation is an Array of Struct, with each Struct containing the following fields:

  • AC: int32

  • AF: float64

  • AN: int32

  • homozygote_count: int32

Each element of the array corresponds to a stratification of the data, and the metadata about these annotations is stored in the globals.

Global freq_meta metadata annotation

The global annotation freq_meta is added to the input mt. It is a list of dict. Each element of the list contains metadata on a frequency stratification and the index in the list corresponds to the index of that frequency stratification in the freq row annotation.

Global freq_sample_count annotation

The global annotation freq_sample_count is added to the input mt. This is a sample count per sample grouping defined in the freq_meta global annotation.
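
Because freq, freq_meta, and freq_sample_count are parallel (the same index refers to the same stratification), a lookup can be sketched in plain Python as below. The helper name lookup_stratum and the metadata keys 'group' and 'pop' are assumptions for illustration; inspect the actual freq_meta global for the keys used in a given dataset.

```python
from typing import Dict, List, Tuple


def lookup_stratum(
    freq: List[Dict],
    freq_meta: List[Dict[str, str]],
    freq_sample_count: List[int],
    wanted: Dict[str, str],
) -> Tuple[Dict, int]:
    """Return (call stats, sample count) for one stratification (hypothetical helper)."""
    # The position of the metadata dict is the position of its frequency entry.
    idx = freq_meta.index(wanted)
    return freq[idx], freq_sample_count[idx]
```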

The downsamplings parameter

If the downsamplings parameter is used, frequencies will be computed for all samples and by population (if pop_expr is specified) by downsampling the number of samples without replacement to each of the numbers specified in the downsamplings array, provided that there are enough samples in the dataset. In addition, if pop_expr is specified, a downsampling to each of the exact number of samples present in each population is added. Note that samples are randomly sampled only once, meaning that the lower downsamplings are subsets of the higher ones.
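
The nesting property described above (samples are shuffled once, so smaller downsamplings are subsets of larger ones) can be illustrated with a short plain-Python sketch. The helper name nested_downsamplings and the fixed seed are assumptions; this is not how the library selects samples internally.

```python
import random
from typing import Dict, List, Sequence, Set


def nested_downsamplings(
    sample_ids: Sequence[str], downsamplings: List[int], seed: int = 0
) -> Dict[int, Set[str]]:
    """Draw nested sample subsets without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    # Shuffle exactly once; every downsampling is a prefix of this ordering.
    rng.shuffle(shuffled)
    return {
        size: set(shuffled[:size])
        for size in sorted(downsamplings)
        # Sizes larger than the dataset are skipped.
        if size <= len(shuffled)
    }
```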

Parameters
  • mt (MatrixTable) – Input MatrixTable

  • sex_expr (Optional[StringExpression]) – When specified, frequencies are stratified by sex. If pop_expr is also specified, then a pop/sex stratification is added.

  • pop_expr (Optional[StringExpression]) – When specified, frequencies are stratified by population. If sex_expr is also specified, then a pop/sex stratification is added.

  • subpop_expr (Optional[StringExpression]) – When specified, frequencies are stratified by sub-continental population. Note that pop_expr is required as well when using this option.

  • additional_strata_expr (Optional[Dict[str, StringExpression]]) – When specified, frequencies are stratified by the given additional strata found in the dict. This can e.g. be used to stratify by platform.

  • downsamplings (Optional[List[int]]) – When specified, frequencies are computed by downsampling the data to the number of samples given in the list. Note that if pop_expr is specified, downsamplings by population are also computed.

Return type

MatrixTable

Returns

MatrixTable with freq annotation

gnomad.utils.annotations.get_lowqual_expr(alleles, qual_approx_expr, snv_phred_threshold=30, snv_phred_het_prior=30, indel_phred_threshold=30, indel_phred_het_prior=39)[source]

Compute lowqual threshold expression for either split or unsplit alleles based on QUALapprox or AS_QUALapprox.

Note

When this lowqual annotation is run using QUALapprox, it differs from the GATK LowQual filter. This is because GATK computes this annotation at the site level, which uses the least stringent prior for mixed sites. When run using AS_QUALapprox, this implementation can thus be more stringent for certain alleles at mixed sites.

Parameters
  • alleles (ArrayExpression) – Array of alleles

  • qual_approx_expr (Union[ArrayNumericExpression, NumericExpression]) – QUALapprox or AS_QUALapprox

  • snv_phred_threshold (int) – Phred-scaled SNV “emission” threshold (similar to GATK emission threshold)

  • snv_phred_het_prior (int) – Phred-scaled SNV heterozygosity prior (30 = 1/1000 bases, GATK default)

  • indel_phred_threshold (int) – Phred-scaled indel “emission” threshold (similar to GATK emission threshold)

  • indel_phred_het_prior (int) – Phred-scaled indel heterozygosity prior (39 ≈ 1/8000 bases, GATK default)

Return type

Union[BooleanExpression, ArrayExpression]

Returns

lowqual expression (BooleanExpression if qual_approx_expr is Numeric, Array[BooleanExpression] if qual_approx_expr is ArrayNumeric)
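
The heterozygosity priors above are given on the phred scale. As a reminder of how that scale maps to probabilities, here is a small plain-Python sketch; the helper names are hypothetical and not part of the library.

```python
from math import log10


def phred_to_prob(q: float) -> float:
    """Convert a phred-scaled value to a probability: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Convert a probability to the phred scale: q = -10 * log10(p)."""
    return -10 * log10(p)
```

For example, a phred value of 30 corresponds to a probability of 1/1000, and a probability of 1/8000 corresponds to a phred value of about 39.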

gnomad.utils.annotations.get_annotations_hists(ht, annotations_hists, log10_annotations=['DP'])[source]

Create histograms for variant metrics in ht.info.

Used when creating site quality distribution json files.

Parameters
  • ht (Table) – Table with variant metrics

  • annotations_hists (Dict[str, Tuple]) – Dictionary of metrics names and their histogram values (start, end, bins)

  • log10_annotations (List[str]) – List of metrics to log scale

Returns

Dictionary of metrics and their histograms

Return type

Dict[str, hl.expr.StructExpression]

gnomad.utils.annotations.create_frequency_bins_expr(AC, AF)[source]

Create bins for frequencies in preparation for aggregating QUAL by frequency bin.

Bins:
  • singleton

  • doubleton

  • 0.00005

  • 0.0001

  • 0.0002

  • 0.0005

  • 0.001

  • 0.002

  • 0.005

  • 0.01

  • 0.02

  • 0.05

  • 0.1

  • 0.2

  • 0.5

  • 1

NOTE: Frequencies should be frequencies from raw data. Used when creating site quality distribution json files.

Parameters
  • AC (NumericExpression) – Field in input that contains the allele count information

  • AF (NumericExpression) – Field in input that contains the allele frequency information

Returns

Expression containing bin name

Return type

hl.expr.StringExpression
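
The bin list above can be read as: singletons and doubletons get their own bins, and every other variant falls into a frequency bin. A plain-Python sketch of such a binning follows. It is illustrative only: the helper name frequency_bin, the returned label strings, the handling of AC below 1, and the assumption that each numeric bin is named by its upper bound (lower bound inclusive) are not confirmed by the text above.

```python
from bisect import bisect_right
from typing import Optional

# Numeric bin bounds from the list above, kept as floats for comparison
# and as strings for labels.
BIN_BOUNDS = [0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005,
              0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]
BIN_LABELS = ["0.00005", "0.0001", "0.0002", "0.0005", "0.001", "0.002",
              "0.005", "0.01", "0.02", "0.05", "0.1", "0.2", "0.5", "1"]


def frequency_bin(ac: int, af: float) -> Optional[str]:
    """Assign a variant to a frequency bin (hypothetical helper)."""
    if ac < 1:
        return None  # Assumption: unobserved alleles get no bin.
    if ac == 1:
        return "singleton"
    if ac == 2:
        return "doubleton"
    # Assumption: a bin named b holds AFs below b and at or above the previous bound.
    idx = min(bisect_right(BIN_BOUNDS, af), len(BIN_BOUNDS) - 1)
    return BIN_LABELS[idx]
```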

gnomad.utils.annotations.get_adj_expr(gt_expr, gq_expr, dp_expr, ad_expr, adj_gq=20, adj_dp=10, adj_ab=0.2, haploid_adj_dp=5)[source]

Get adj genotype annotation.

Defaults correspond to gnomAD values.

Parameters
Return type

BooleanExpression
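
The threshold parameters in the signature suggest a per-genotype test along the following lines. This plain-Python sketch is an assumption about how the thresholds combine (minimum GQ, minimum depth with a lower bar for haploid calls, and a minimum allele balance for heterozygous calls); it handles only ref/alt heterozygotes and is not the library's Hail expression. The helper name is_adj and its boolean inputs are hypothetical.

```python
from typing import Sequence


def is_adj(
    gq: int,
    dp: int,
    ad: Sequence[int],
    is_het: bool,
    is_haploid: bool = False,
    adj_gq: int = 20,
    adj_dp: int = 10,
    adj_ab: float = 0.2,
    haploid_adj_dp: int = 5,
) -> bool:
    """Return True if a genotype passes the adj thresholds (hypothetical helper)."""
    min_dp = haploid_adj_dp if is_haploid else adj_dp
    if gq < adj_gq or dp < min_dp:
        return False
    if is_het:
        # ad is assumed to be [ref depth, alt depth]; allele balance is the
        # fraction of reads supporting the alternate allele.
        return ad[1] / dp >= adj_ab
    return True
```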

gnomad.utils.annotations.annotate_adj(mt, adj_gq=20, adj_dp=10, adj_ab=0.2, haploid_adj_dp=5)[source]

Annotate genotypes with adj criteria (assumes diploid).

Defaults correspond to gnomAD values.

Parameters
  • mt (MatrixTable) –

  • adj_gq (int) –

  • adj_dp (int) –

  • adj_ab (float) –

  • haploid_adj_dp (int) –

Return type

MatrixTable

gnomad.utils.annotations.add_variant_type(alt_alleles)[source]

Get Struct of variant_type and n_alt_alleles from ArrayExpression of Strings (all alleles).

Parameters

alt_alleles (ArrayExpression) –

Return type

StructExpression

gnomad.utils.annotations.annotation_type_is_numeric(t)[source]

Given an annotation type, return whether it is a numerical type or not.

Parameters

t (Any) – Type to test

Return type

bool

Returns

If the input type is numeric

gnomad.utils.annotations.annotation_type_in_vcf_info(t)[source]

Given an annotation type, returns whether that type can be natively exported to a VCF INFO field.

Note

Types that aren’t natively exportable to VCF will be converted to String on export.

Parameters

t (Any) – Type to test

Return type

bool

Returns

If the input type can be exported to VCF

gnomad.utils.annotations.bi_allelic_site_inbreeding_expr(call)[source]

Return the site inbreeding coefficient as an expression to be computed on a MatrixTable.

This is implemented based on the GATK InbreedingCoeff metric: https://software.broadinstitute.org/gatk/documentation/article.php?id=8032

Note

The computation is run based on the counts of alternate alleles and thus should only be run on bi-allelic sites.

Parameters

call (CallExpression) – Expression giving the calls in the MT

Return type

Float32Expression

Returns

Site inbreeding coefficient expression
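
For intuition, an inbreeding coefficient of this kind compares the observed number of heterozygotes with the number expected under Hardy-Weinberg equilibrium. The plain-Python sketch below works from genotype counts at a bi-allelic site; it is an illustration under that assumption rather than the Hail expression itself, and the helper name site_inbreeding_coeff is hypothetical.

```python
from typing import Optional


def site_inbreeding_coeff(
    n_hom_ref: int, n_het: int, n_hom_var: int
) -> Optional[float]:
    """1 - observed hets / expected hets under Hardy-Weinberg (hypothetical helper)."""
    n = n_hom_ref + n_het + n_hom_var
    if n == 0:
        return None
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = (2 * n_hom_var + n_het) / (2 * n)  # alternate allele frequency
    expected_het = 2 * p * q * n
    if expected_het == 0:
        return None  # Monomorphic site: the coefficient is undefined.
    return 1 - n_het / expected_het
```

A value near 0 indicates genotype counts consistent with Hardy-Weinberg proportions, positive values indicate a deficit of heterozygotes, and negative values an excess.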

gnomad.utils.annotations.fs_from_sb(sb, normalize=True, min_cell_count=200, min_count=4, min_p_value=1e-320)[source]

Compute FS (Fisher strand balance) annotation from the SB (strand balance table) field.

FS is the phred-scaled value of the double-sided Fisher exact test on strand balance.

Using default values will have the same behavior as the GATK implementation, that is:

  • If sum(counts) > 2*min_cell_count (default to GATK value of 200), they are normalized

  • If sum(counts) < min_count (default to GATK value of 4), returns missing

  • Any p-value < min_p_value (default to GATK value of 1e-320) is truncated to that value

In addition to the default GATK behavior, setting normalize to False will perform a chi-squared test for large counts (> min_cell_count) instead of normalizing the cell values.

Note

This function can either take:

  • an array of length four containing the forward and reverse strands’ counts of ref and alt alleles: [ref fwd, ref rev, alt fwd, alt rev]

  • a two dimensional array with arrays of length two, containing the counts: [[ref fwd, ref rev], [alt fwd, alt rev]]

GATK code here: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/FisherStrand.java

Parameters
  • sb (Union[ArrayNumericExpression, ArrayExpression]) – Count of ref/alt reads on each strand

  • normalize (bool) – Whether to normalize counts if sum(counts) > min_cell_count (normalize=True), or use a chi-squared test instead of FET (normalize=False)

  • min_cell_count (int) – Maximum count for performing a FET

  • min_count (int) – Minimum total count to output FS (otherwise null is output)

  • min_p_value (float) –

Return type

Int64Expression

Returns

FS value
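
To illustrate the definition above (FS as the phred-scaled p-value of a two-sided Fisher exact test on the strand table), here is a plain-Python sketch. It deliberately omits the normalization and chi-squared branches, so it is only meaningful for small counts; the helper names fisher_two_sided and fs_sketch are hypothetical, and the two-sided p-value is computed in the common way, by summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb, log10
from typing import Optional, Sequence


def fisher_two_sided(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher exact p-value for a 2x2 table (hypothetical helper)."""
    row1 = ref_fwd + ref_rev
    row2 = alt_fwd + alt_rev
    col1 = ref_fwd + alt_fwd
    denom = comb(row1 + row2, col1)

    def prob(a: int) -> float:
        # Hypergeometric probability of the table whose top-left cell is a.
        return comb(row1, a) * comb(row2, col1 - a) / denom

    observed = prob(ref_fwd)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at most as likely as the observed one
    # (small relative tolerance for floating-point ties).
    p = sum(prob(a) for a in range(lowest, highest + 1)
            if prob(a) <= observed * (1 + 1e-7))
    return min(p, 1.0)


def fs_sketch(sb: Sequence, min_count: int = 4,
              min_p_value: float = 1e-320) -> Optional[float]:
    """Phred-scaled Fisher strand p-value, without normalization (sketch)."""
    # Accept either [a, b, c, d] or [[a, b], [c, d]].
    nested = isinstance(sb[0], (list, tuple))
    counts = [c for pair in sb for c in pair] if nested else list(sb)
    if sum(counts) < min_count:
        return None
    p = max(fisher_two_sided(*counts), min_p_value)
    return max(0.0, -10 * log10(p))
```

A balanced table such as [10, 10, 10, 10] gives a value near 0, while a table with all ref reads on one strand and all alt reads on the other gives a large value.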

gnomad.utils.annotations.sor_from_sb(sb)[source]

Compute SOR (Symmetric Odds Ratio test) annotation from the SB (strand balance table) field.

Note

This function can either take:

  • an array of length four containing the forward and reverse strands’ counts of ref and alt alleles: [ref fwd, ref rev, alt fwd, alt rev]

  • a two dimensional array with arrays of length two, containing the counts: [[ref fwd, ref rev], [alt fwd, alt rev]]

GATK code here: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/StrandOddsRatio.java

Parameters

sb (Union[ArrayNumericExpression, ArrayExpression]) – Count of ref/alt reads on each strand

Return type

Float64Expression

Returns

SOR value
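
For orientation, the sketch below shows one way a symmetric odds ratio can be computed from the strand table in plain Python. It follows my understanding of the GATK StrandOddsRatio annotation linked above (a pseudocount of 1 added to every cell, a symmetric ratio scaled by the ref and alt strand ratios, then a natural log); treat those details as assumptions to check against the GATK source. The helper name sor_sketch is hypothetical.

```python
from math import log
from typing import Sequence


def sor_sketch(sb: Sequence) -> float:
    """Symmetric odds ratio of a strand table (sketch under stated assumptions)."""
    # Accept either [a, b, c, d] or [[a, b], [c, d]].
    nested = isinstance(sb[0], (list, tuple))
    counts = [c for pair in sb for c in pair] if nested else list(sb)
    # Assumption: add a pseudocount of 1 to each cell to avoid division by zero.
    ref_fwd, ref_rev, alt_fwd, alt_rev = (c + 1.0 for c in counts)
    # Odds ratio plus its inverse, so that swapping strands gives the same value.
    ratio = (ref_fwd / ref_rev) * (alt_rev / alt_fwd)
    ratio += (ref_rev / ref_fwd) * (alt_fwd / alt_rev)
    ref_ratio = min(ref_fwd, ref_rev) / max(ref_fwd, ref_rev)
    alt_ratio = min(alt_fwd, alt_rev) / max(alt_fwd, alt_rev)
    return log(ratio * ref_ratio / alt_ratio)
```

Under these assumptions, a perfectly balanced table yields ln(2), and strand-biased tables yield larger values.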

gnomad.utils.annotations.bi_allelic_expr(t)[source]

Return a boolean expression selecting bi-allelic sites only, accounting for whether the input MT/HT was split.

Parameters

t (Union[Table, MatrixTable]) – Input HT/MT

Return type

BooleanExpression

Returns

Boolean expression selecting only bi-allelic sites

gnomad.utils.annotations.unphase_call_expr(call_expr)[source]

Generate unphased version of a call expression (which can be phased or not).

Parameters

call_expr (CallExpression) – Input call expression

Return type

CallExpression

Returns

unphased call expression

gnomad.utils.annotations.region_flag_expr(t, non_par=True, prob_regions=None)[source]

Create a region_flag struct that contains flags for problematic regions (i.e., LCR, decoy, segdup, and nonpar regions).

Note

No hg38 resources for decoy or self chain are available yet.

Parameters
  • t (Union[Table, MatrixTable]) – Input Table/MatrixTable

  • non_par (bool) – If True, flag loci that occur within non-pseudoautosomal (non-PAR) regions on sex chromosomes

  • prob_regions (Optional[Dict[str, Table]]) – If supplied, flag loci that occur within regions defined in Hail Table(s)

Return type

StructExpression

Returns

region_flag struct row annotation

gnomad.utils.annotations.missing_callstats_expr()[source]

Create a missing callstats struct for insertion into frequency annotation arrays when data is missing.

Return type

StructExpression

Returns

Hail Struct with missing values for each callstats element

gnomad.utils.annotations.set_female_y_metrics_to_na_expr(t)[source]

Set Y-variant frequency callstats for female-specific metrics to missing structs.

Note

Requires freq, freq_meta, and freq_index_dict annotations to be present in Table or MatrixTable

Parameters

t (Union[Table, MatrixTable]) – Table or MatrixTable for which to adjust female metrics

Return type

ArrayExpression

Returns

Hail array expression to set female Y-variant metrics to missing values

gnomad.utils.annotations.hemi_expr(locus, sex_expr, gt, male_str='XY')[source]

Return whether genotypes are hemizygous.

Return missing expression if locus is not in chrX/chrY non-PAR regions.

Parameters
  • locus (LocusExpression) – Input locus.

  • sex_expr (StringExpression) – Input StringExpression indicating whether sample is XX or XY.

  • gt (CallExpression) – Input genotype.

  • male_str (str) – String indicating whether sample is XY. Default is “XY”.

Return type

BooleanExpression

Returns

BooleanExpression indicating whether genotypes are hemizygous.