gnomad.utils.annotations<a class="headerlink" href="#gnomad-utils-annotations" title="Permalink to this heading">

Returns:

Genetic ancestry max struct

gnomad.utils.annotations.project_max_expr(project_expr, gt_expr, alleles_expr, n_projects=5)[source]

Create an expression that computes allele frequency information by project for the n_projects with the largest AF at this row.

Will return an array with one element per non-reference allele.

Each of these elements is itself an array of structs with the following fields:

AC: int32

AF: float64

AN: int32

homozygote_count: int32

project: str

Note

Only projects with AF > 0 are returned. In case of ties, the project ordering is not guaranteed, and at most n_projects are returned.

Parameters:

project_expr (StringExpression) – column expression containing the project
gt_expr (CallExpression) – entry expression containing the genotype
alleles_expr (ArrayExpression) – row expression containing the alleles
n_projects (int) – Maximum number of projects to return for each row

Return type:

ArrayExpression

Returns:

projectmax expression

gnomad.utils.annotations.faf_expr(freq, freq_meta, locus, gen_anc_groups_to_exclude=None, faf_thresholds=[0.95, 0.99], gen_anc_label='gen_anc')[source]

Calculate the filtering allele frequency (FAF) for each threshold specified in faf_thresholds.

See http://cardiodb.org/allelefrequencyapp/ for more information.

The FAF is computed for each of the following genetic ancestry group stratification if found in freq_meta:

All samples, with adj criteria

For each genetic ancestry group, with adj criteria

For all sex/genetic ancestry group on the non-PAR regions of sex chromosomes (will be missing on autosomes and PAR regions of sex chromosomes)

Each of the FAF entry is a struct with one entry per threshold specified in faf_thresholds of type float64.

This returns a tuple with two expressions:

An array of FAF expressions as described above

An array of dict containing the metadata for each of the array elements, in the same format as that produced by annotate_freq.

Parameters:

freq (ArrayExpression) – ArrayExpression of call stats structs (typically generated by hl.agg.call_stats)
freq_meta (ArrayExpression) – ArrayExpression of meta dictionaries corresponding to freq (typically generated using annotate_freq)
locus (LocusExpression) – locus
gen_anc_groups_to_exclude (Optional[Set[str]]) – Set of genetic ancestry groups to exclude from faf calculation (typically bottlenecked or consanguineous genetic ancestry groups)
faf_thresholds (List[float]) – List of FAF thresholds to compute
gen_anc_label (str) – Label of the genetic ancestry group field in the meta dictionary

Return type:

Tuple[ArrayExpression, List[Dict[str, str]]]

Returns:

(FAF expression, FAF metadata)

gnomad.utils.annotations.gen_anc_faf_max_expr(faf, faf_meta, gen_anc_label='gen_anc')[source]

Retrieve the maximum FAF and corresponding genetic ancestry for each of the thresholds in faf.

This resulting struct contains the following fields:

faf95_max: float64

faf95_max_gen_anc: str

faf99_max: float64

faf99_max_gen_anc: str

Parameters:

faf (ArrayExpression) – ArrayExpression of Structs of FAF thresholds previously computed. When faf_expr is used, contains fields ‘faf95’ and ‘faf99’.
faf_meta (ArrayExpression) – ArrayExpression of meta dictionaries corresponding to faf (as returned by faf_expr)
gen_anc_label (str) – Label of the genetic ancestry group field in the meta dictionary

Return type:

Returns:

Genetic ancestry group struct for FAF max

gnomad.utils.annotations.qual_hist_expr(gt_expr=None, gq_expr=None, dp_expr=None, ad_expr=None, adj_expr=None, ab_expr=None, split_adj_and_raw=False)[source]

Return a struct expression with genotype quality histograms based on the arguments given (dp, gq, ad, ab).

Note

If gt_expr is provided, will return histograms for non-reference samples only as well as all samples.
gt_expr is required for the allele-balance histogram, as it is only computed on het samples.
If ab_expr is provided, the allele-balance histogram is computed using this expression instead of the ad_expr.
If adj_expr is provided, additional histograms are computed using only adj samples.

Parameters:

gt_expr (Optional[CallExpression]) – Entry expression containing genotype.
gq_expr (Optional[NumericExpression]) – Entry expression containing genotype quality.
dp_expr (Optional[NumericExpression]) – Entry expression containing depth.
ad_expr (Optional[ArrayNumericExpression]) – Entry expression containing allelic depth (bi-allelic here).
adj_expr (Optional[BooleanExpression]) – Entry expression containing adj (high quality) genotype status.
ab_expr (Optional[NumericExpression]) – Entry expression containing allele balance (bi-allelic here).
split_adj_and_raw (bool) – Whether to split the adj and raw histograms into separate fields in the returned struct expr.

Return type:

Returns:

Genotype quality histograms expression.

gnomad.utils.annotations.age_hists_expr(adj_expr, gt_expr, age_expr, lowest_boundary=30, highest_boundary=80, n_bins=10)[source]

Return a StructExpression with the age histograms for hets and homs.

Parameters:

adj_expr (BooleanExpression) – Entry expression containing whether a genotype is high quality (adj) or not
gt_expr (CallExpression) – Entry expression containing the genotype
age_expr (NumericExpression) – Col expression containing the sample’s age
lowest_boundary (int) – Lowest bin boundary (any younger sample will be binned in n_smaller)
highest_boundary (int) – Highest bin boundary (any older sample will be binned in n_larger)
n_bins (int) – Total number of bins

Return type:

Returns:

A struct with age_hist_het and age_hist_hom

gnomad.utils.annotations.get_lowqual_expr(alleles, qual_approx_expr, snv_phred_threshold=30, snv_phred_het_prior=30, indel_phred_threshold=30, indel_phred_het_prior=39)[source]

Compute lowqual threshold expression for either split or unsplit alleles based on QUALapprox or AS_QUALapprox.

Note

When running This lowqual annotation using QUALapprox, it differs from the GATK LowQual filter. This is because GATK computes this annotation at the site level, which uses the least stringent prior for mixed sites. When run using AS_QUALapprox, this implementation can thus be more stringent for certain alleles at mixed sites.

Parameters:

alleles (ArrayExpression) – Array of alleles
qual_approx_expr (Union[ArrayNumericExpression, NumericExpression]) – QUALapprox or AS_QUALapprox
snv_phred_threshold (int) – Phred-scaled SNV “emission” threshold (similar to GATK emission threshold)
snv_phred_het_prior (int) – Phred-scaled SNV heterozygosity prior (30 = 1/1000 bases, GATK default)
indel_phred_threshold (int) – Phred-scaled indel “emission” threshold (similar to GATK emission threshold)
indel_phred_het_prior (int) – Phred-scaled indel heterozygosity prior (30 = 1/1000 bases, GATK default)

Return type:

Union[BooleanExpression, ArrayExpression]

Returns:

lowqual expression (BooleanExpression if qual_approx_expr`is Numeric, Array[BooleanExpression] if `qual_approx_expr is ArrayNumeric)

gnomad.utils.annotations.get_annotations_hists(ht, annotations_hists, log10_annotations=['DP'])[source]

Create histograms for variant metrics in ht.info.

Used when creating site quality distribution json files.

Parameters:

ht (Table) – Table with variant metrics
annotations_hists (Dict[str, Tuple]) – Dictionary of metrics names and their histogram values (start, end, bins)
log10_annotations (List[str]) – List of metrics to log scale

Returns:

Dictionary of merics and their histograms

Return type:

Dict[str, hl.expr.StructExpression]

gnomad.utils.annotations.create_frequency_bins_expr(AC, AF)[source]

Create bins for frequencies in preparation for aggregating QUAL by frequency bin.

Bins:

singleton
doubleton
0.00005
0.0001
0.0002
0.0005
0.001,
0.002
0.005
0.01
0.02
0.05
0.1
0.2
0.5
1

NOTE: Frequencies should be frequencies from raw data. Used when creating site quality distribution json files.

Parameters:

AC (NumericExpression) – Field in input that contains the allele count information
AF (NumericExpression) – Field in input that contains the allele frequency information

Returns:

Expression containing bin name

Return type:

hl.expr.StringExpression

gnomad.utils.annotations.annotate_and_index_source_mt_for_sex_ploidy(locus_expr, karyotype_expr, xy_karyotype_str='XY', xx_karyotype_str='XX')[source]

Prepare relevant ploidy annotations for downstream calculations on a matrix table.

This method is used as an optimization for the get_is_haploid_expr and adjusted_sex_ploidy_expr methods.

This method annotates the locus_expr source matrix table with the following fields:

xy: Boolean indicating if the sample is XY.

xx: Boolean indicating if the sample is XX.

in_non_par: Boolean indicating if the locus is in a non-PAR region.

x_nonpar: Boolean indicating if the locus is in a non-PAR region of the X chromosome.

y_par: Boolean indicating if the locus is in a PAR region of the Y chromosome.

y_nonpar: Boolean indicating if the locus is in a non-PAR region of the Y chromosome.

Parameters:

locus_expr (LocusExpression) – Locus expression.
karyotype_expr (StringExpression) – Karyotype expression.
xy_karyotype_str (str) – String representing XY karyotype. Default is “XY”.
xx_karyotype_str (str) – String representing XX karyotype. Default is “XX”.

Return type:

Tuple[StructExpression, StructExpression]

Returns:

Tuple of index expressions for columns and rows.

gnomad.utils.annotations.get_is_haploid_expr(gt_expr=None, locus_expr=None, karyotype_expr=None, xy_karyotype_str='XY', xx_karyotype_str='XX')[source]

Determine if a genotype or locus and karyotype combination is haploid.

Note

One of gt_expr or locus_expr and karyotype_expr is required.

Parameters:

gt_expr (Optional[CallExpression]) – Optional genotype expression.
locus_expr (Optional[LocusExpression]) – Optional locus expression.
karyotype_expr (Optional[StringExpression]) – Optional sex karyotype expression.
xy_karyotype_str (str) – String representing XY karyotype. Default is “XY”.
xx_karyotype_str (str) – String representing XX karyotype. Default is “XX”.

Return type:

Returns:

Boolean expression indicating if the genotype is haploid.

gnomad.utils.annotations.get_gq_dp_adj_expr(gq_expr, dp_expr, gt_expr=None, locus_expr=None, karyotype_expr=None, adj_gq=20, adj_dp=10, haploid_adj_dp=5)[source]

Get adj annotation using only GQ and DP.

Default thresholds correspond to gnomAD values.

Note

This function can be used to annotate adj taking into account only GQ and DP. It is useful for cases where the GT field is not available, such as in the reference data of a VariantDataset.

Note

One of gt_expr or locus_expr and karyotype_expr is required.

Parameters:

gq_expr (Union[Int32Expression, Int64Expression]) – GQ expression.
dp_expr (Union[Int32Expression, Int64Expression]) – DP expression.
gt_expr (Optional[CallExpression]) – Optional genotype expression.
locus_expr (Optional[LocusExpression]) – Optional locus expression.
karyotype_expr (Optional[StringExpression]) – Optional sex karyotype expression.
adj_gq (int) – GQ threshold for adj. Default is 20.
adj_dp (int) – DP threshold for adj. Default is 10.
haploid_adj_dp (int) – Haploid DP threshold for adj. Default is 5.

Return type:

Returns:

Boolean expression indicating adj filter.

gnomad.utils.annotations.get_het_ab_adj_expr(gt_expr, dp_expr, ad_expr, adj_ab=0.2)[source]

Get adj het AB annotation.

Parameters:

gt_expr (CallExpression) – Genotype expression.
dp_expr (Union[Int32Expression, Int64Expression]) – DP expression.
ad_expr (ArrayNumericExpression) – AD expression.
adj_ab (float) – AB threshold for adj. Default is 0.2.

Return type:

Returns:

Boolean expression indicating adj het AB filter.

gnomad.utils.annotations.get_adj_expr(gt_expr, gq_expr, dp_expr, ad_expr, adj_gq=20, adj_dp=10, adj_ab=0.2, haploid_adj_dp=5)[source]

Get adj genotype annotation.

Defaults correspond to gnomAD values.

Parameters:

gt_expr (CallExpression) –
gq_expr (Union[Int32Expression, Int64Expression]) –
dp_expr (Union[Int32Expression, Int64Expression]) –
ad_expr (ArrayNumericExpression) –
adj_gq (int) –
adj_dp (int) –
adj_ab (float) –
haploid_adj_dp (int) –

Return type:

gnomad.utils.annotations.annotate_adj(mt, adj_gq=20, adj_dp=10, adj_ab=0.2, haploid_adj_dp=5)[source]

Annotate genotypes with adj criteria (assumes diploid).

Defaults correspond to gnomAD values.

Parameters:

mt (MatrixTable) –
adj_gq (int) –
adj_dp (int) –
adj_ab (float) –
haploid_adj_dp (int) –

Return type:

MatrixTable

gnomad.utils.annotations.add_variant_type(alt_alleles)[source]

Get Struct of variant_type and n_alt_alleles from ArrayExpression of Strings.

Parameters:: alt_alleles (ArrayExpression) –
Return type:: StructExpression

gnomad.utils.annotations.annotate_allele_info(ht)[source]

Return bi-allelic sites Table with an ‘allele_info’ annotation.

Note

This function requires that the input ht is unsplit and returns a split ht.

‘allele_info’ is a struct with the following information:

variant_type: Variant type (snv, indel, multi-snv, multi-indel, or mixed).
n_alt_alleles: Total number of alternate alleles observed at variant locus.
has_star: True if the variant contains a star allele.
allele_type: Allele type (snv, insertion, deletion, or mixed).
was_mixed: True if the variant was mixed (i.e. contained both SNVs and indels).
nonsplit_alleles: Array of alleles before splitting.

Parameters:

ht (Table) – Unsplit input Table.
ht –

Return type:

Returns:

Split Table with allele data annotation added,

gnomad.utils.annotations.annotation_type_is_numeric(t)[source]

Given an annotation type, return whether it is a numerical type or not.

Parameters:: t (Any) – Type to test
Return type:: bool
Returns:: If the input type is numeric

gnomad.utils.annotations.annotation_type_in_vcf_info(t)[source]

Given an annotation type, returns whether that type can be natively exported to a VCF INFO field.

Note

Types that aren’t natively exportable to VCF will be converted to String on export.

Parameters:: t (Any) – Type to test
Return type:: bool
Returns:: If the input type can be exported to VCF

gnomad.utils.annotations.bi_allelic_site_inbreeding_expr(call=None, callstats_expr=None)[source]

Return the site inbreeding coefficient as an expression to be computed on a MatrixTable.

This is implemented based on the GATK InbreedingCoeff metric: https://software.broadinstitute.org/gatk/documentation/article.php?id=8032

Note

The computation is run based on the counts of alternate alleles and thus should only be run on bi-allelic sites.

Parameters:

call (Optional[CallExpression]) – Expression giving the calls in the MT
callstats_expr (Optional[StructExpression]) – StructExpression containing only alternate allele AC, AN, and homozygote_count as integers. If passed, used to create expression in place of GT calls.

Return type:

Float32Expression

Returns:

Site inbreeding coefficient expression

gnomad.utils.annotations.fs_from_sb(sb, normalize=True, min_cell_count=200, min_count=4, min_p_value=1e-320)[source]

Compute FS (Fisher strand balance) annotation from the SB (strand balance table) field.

FS is the phred-scaled value of the double-sided Fisher exact test on strand balance.

Using default values will have the same behavior as the GATK implementation, that is: - If sum(counts) > 2*`min_cell_count` (default to GATK value of 200), they are normalized - If sum(counts) < min_count (default to GATK value of 4), returns missing - Any p-value < min_p_value (default to GATK value of 1e-320) is truncated to that value

In addition to the default GATK behavior, setting normalize to False will perform a chi-squared test for large counts (> min_cell_count) instead of normalizing the cell values.

Note

This function can either take - an array of length four containing the forward and reverse strands’ counts of ref and alt alleles: [ref fwd, ref rev, alt fwd, alt rev] - a two dimensional array with arrays of length two, containing the counts: [[ref fwd, ref rev], [alt fwd, alt rev]]

GATK code here: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/FisherStrand.java

Parameters:

sb (Union[ArrayNumericExpression, ArrayExpression]) – Count of ref/alt reads on each strand
normalize (bool) – Whether to normalize counts is sum(counts) > min_cell_count (normalize=True), or use a chi sq instead of FET (normalize=False)
min_cell_count (int) – Maximum count for performing a FET
min_count (int) – Minimum total count to output FS (otherwise null it output)
min_p_value (float) –

Return type:

Int64Expression

Returns:

FS value

gnomad.utils.annotations.sor_from_sb(sb)[source]

Compute SOR (Symmetric Odds Ratio test) annotation from the SB (strand balance table) field.

Note

This function can either take - an array of length four containing the forward and reverse strands’ counts of ref and alt alleles: [ref fwd, ref rev, alt fwd, alt rev] - a two dimensional array with arrays of length two, containing the counts: [[ref fwd, ref rev], [alt fwd, alt rev]]

GATK code here: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/StrandOddsRatio.java

Parameters:: sb (Union[ArrayNumericExpression, ArrayExpression]) – Count of ref/alt reads on each strand
Return type:: Float64Expression
Returns:: SOR value

gnomad.utils.annotations.pab_max_expr(gt_expr, ad_expr, la_expr=None, n_alleles_expr=None)[source]

Compute the maximum p-value of the binomial test for the alternate allele balance (PAB) for each allele.

Note

This function can take a gt_expr and ad_expr that use local or global alleles. If they use local alleles, la_expr and n_alleles_expr should be provided to transform gt_expr and ad_expr to global alleles.

Parameters:

gt_expr (CallExpression) – Genotype call expression.
ad_expr (ArrayExpression) – Allele depth expression.
la_expr (Optional[ArrayExpression]) – Allele local index expression. When provided gt_expr and ad_expr are transformed from using local alleles to global alleles using la_expr.
n_alleles_expr (Optional[Int32Expression]) – Number of alleles expression. Required when ‘la_expr’ is provided.

Return type:

ArrayExpression

Returns:

Array expression of maximum p-values.

gnomad.utils.annotations.bi_allelic_expr(t)[source]

Return a boolean expression selecting bi-allelic sites only, accounting for whether the input MT/HT was split.

Parameters:: t (Union[Table, MatrixTable]) – Input HT/MT
Return type:: BooleanExpression
Returns:: Boolean expression selecting only bi-allelic sites

gnomad.utils.annotations.unphase_call_expr(call_expr)[source]

Generate unphased version of a call expression (which can be phased or not).

Parameters:: call_expr (CallExpression) – Input call expression
Return type:: CallExpression
Returns:: unphased call expression

gnomad.utils.annotations.region_flag_expr(t, non_par=True, prob_regions=None)[source]

Create a region_flag struct that contains flags for problematic regions (i.e., LCR, decoy, segdup, and nonpar regions).

Note

No hg38 resources for decoy or self chain are available yet.

Parameters:

t (Union[Table, MatrixTable]) – Input Table/MatrixTable
non_par (bool) – If True, flag loci that occur within pseudoautosomal regions on sex chromosomes
prob_regions (Dict[str, Table]) – If supplied, flag loci that occur within regions defined in Hail Table(s)

Return type:

Returns:

region_flag struct row annotation

gnomad.utils.annotations.missing_callstats_expr()[source]

Create a missing callstats struct for insertion into frequency annotation arrays when data is missing.

Return type:: StructExpression
Returns:: Hail Struct with missing values for each callstats element

gnomad.utils.annotations.set_xx_y_metrics_to_na_expr(t, freq_expr='freq', freq_meta_expr='freq_meta', freq_index_dict_expr='freq_index_dict')[source]

Set Y-variant frequency callstats for XX-specific metrics to missing structs.

Parameters:

t (Union[Table, MatrixTable]) – Table or MatrixTable for which to adjust XX metrics.
freq_expr (Union[ArrayExpression, str]) – Array expression or string annotation name for the frequency array. Default is “freq”.
freq_meta_expr (Union[ArrayExpression, str]) – Array expression or string annotation name for the frequency metadata. Default is “freq_meta”.
freq_index_dict_expr (Union[DictExpression, str]) – Dict expression or string annotation name for the frequency metadata index dictionary. Default is “freq_index_dict”.

Return type:

ArrayExpression

Returns:

Hail array expression to set XX Y-variant metrics to missing values.

gnomad.utils.annotations.hemi_expr(locus, sex_expr, gt, male_str='XY')[source]

Return whether genotypes are hemizygous.

Return missing expression if locus is not in chrX/chrY non-PAR regions.

Parameters:

locus (LocusExpression) – Input locus.
sex_expr (StringExpression) – Input StringExpression indicating whether sample is XX or XY.
gt (CallExpression) – Input genotype.
xy_str – String indicating whether sample is XY. Default is “XY”.
male_str (str) –

Return type:

Returns:

BooleanExpression indicating whether genotypes are hemizygous.

gnomad.utils.annotations.merge_array_expressions(arrays, meta, operation='sum', set_negatives_to_zero=False, struct_fields=None, count_arrays=None)[source]

Merge a list of array expressions based on the supplied operation.

This function can handle both: 1. Arrays of integers (e.g., allele numbers). 2. Arrays of structs containing integer fields (e.g., frequency structs with AC, AN, etc.).

Warning

Arrays must be on the same Table.

Note

Arrays do not have to contain the same groupings or order of groupings but the array indices for an array in arrays must be the same as its associated metadata index in meta i.e., arrays = [arr1, arr2] then meta must equal [meta1, meta2] where meta1 contains the metadata information for arr1.

If operation is set to “sum”, groups in the merged array will be the union of groupings found within the arrays’ metadata and all arrays will be summed by grouping. If operation is set to “diff”, the merged array will contain groups only found in the first array of meta. Any array containing any of these groups will have their values subtracted from the values of the first array.

Parameters:

arrays (List[ArrayExpression]) – List of arrays to merge. First entry in the list is the primary array to which other arrays will be added or subtracted. All arrays must be on the same Table.
meta (List[List[Dict[str, str]]]) – List of metadata for arrays being merged.
operation (str) – Merge operation to perform. Options are “sum” and “diff”. If “diff” is passed, the first array in the list will have the other arrays subtracted from it.
set_negatives_to_zero (bool) – If True, set negative array values to 0. If False, raise a ValueError. Default is False.
struct_fields (Optional[List[str]]) – List of field names to merge if arrays contain structs. If None, arrays are treated as integer arrays. Default is None.
count_arrays (Optional[Dict[str, List[ArrayExpression]]]) – Dictionary of Lists of arrays containing counts to merge using the passed operation. Must use the same group indexing as meta. Keys are the descriptor names, values are Lists of arrays to merge. Default is None.

Return type:

Union[Tuple[ArrayExpression, List[Dict[str, int]]], Tuple[ArrayExpression, List[Dict[str, int]], Dict[str, ArrayExpression]]]

Returns:

Tuple of merged array, metadata list and if count_arrays is not None, a dictionary of merged count arrays.

gnomad.utils.annotations.merge_freq_arrays(farrays, fmeta, operation='sum', set_negatives_to_zero=False, count_arrays=None)[source]

Merge a list of frequency arrays based on the supplied operation.

Warning

Arrays must be on the same Table.

Note

Arrays do not have to contain the same groupings or order of groupings but the array indices for a freq array in farrays must be the same as its associated frequency metadata index in fmeta i.e., farrays = [freq1, freq2] then fmeta must equal [fmeta1, fmeta2] where fmeta1 contains the metadata information for freq1.

If operation is set to “sum”, groups in the merged array will be the union of groupings found within the arrays’ metadata and all arrays with be summed by grouping. If operation is set to “diff”, the merged array will contain groups only found in the first array of fmeta. Any array containing any of these groups will have their values subtracted from the values of the first array.

Parameters:

farrays (List[ArrayExpression]) – List of frequency arrays to merge. First entry in the list is the primary array to which other arrays will be added or subtracted. All arrays must be on the same Table.
fmeta (List[List[Dict[str, str]]]) – List of frequency metadata for arrays being merged.
operation (str) – Merge operation to perform. Options are “sum” and “diff”. If “diff” is passed, the first freq array in the list will have the other arrays subtracted from it.
set_negatives_to_zero (bool) – If True, set negative array values to 0 for AC, AN, AF, and homozygote_count. If False, raise a ValueError. Default is False.
count_arrays (Optional[Dict[str, List[ArrayExpression]]]) – Dictionary of Lists of arrays containing counts to merge using the passed operation. Must use the same group indexing as fmeta. Keys are the descriptor names, values are Lists of arrays to merge. Default is None.

Return type:

Union[Tuple[ArrayExpression, List[Dict[str, int]]], Tuple[ArrayExpression, List[Dict[str, int]], Dict[str, List[ArrayExpression]]]]

Returns:

Tuple of merged frequency array, frequency metadata list and if count_arrays is not None, a dictionary of merged count arrays.

gnomad.utils.annotations.merge_histograms(hists, operation='sum', set_negatives_to_zero=False)[source]

Merge a list of histogram annotations.

This function merges a list of histogram annotations by summing or subtracting the arrays in an element-wise fashion. It keeps one ‘bin_edge’ annotation but merges the ‘bin_freq’, ‘n_smaller’, and ‘n_larger’ annotations by summing or subtracting them.

Note

Bin edges are assumed to be the same for all histograms, and lengths of bin edges and frequencies are assumed to be the same for all histograms.

Parameters:

hists (List[StructExpression]) – List of histogram structs to merge.
operation (str) – Merge operation to perform. Options are “sum” and “diff”. If “diff” is passed, the first histogram will have the other histograms subtracted from it. Default is “sum”.
set_negatives_to_zero (bool) – If True, set negative values to 0. If False, raise a ValueError. Default is False.

Return type:

Expression

Returns:

Merged histogram struct.

gnomad.utils.annotations.annotate_freq(mt, sex_expr=None, gen_anc_expr=None, subgrp_expr=None, additional_strata_expr=None, downsamplings=None, downsampling_expr=None, ds_gen_anc_counts=None, entry_agg_funcs=None, annotate_mt=True)[source]

Annotate mt with stratified allele frequencies.

The output Matrix table will include:

row annotation freq containing the stratified allele frequencies
global annotation freq_meta with metadata
global annotation freq_meta_sample_count with sample count information

Note

Currently this only supports bi-allelic sites.

The input mt needs to have the following entry fields:

GT: a CallExpression containing the genotype
adj: a BooleanExpression containing whether the genotype is of high quality or not.

All expressions arguments need to be expression on the input mt.

freq row annotation

The freq row annotation is an Array of Structs, with each Struct containing the following fields:

AC: int32

AF: float64

AN: int32

homozygote_count: int32

Each element of the array corresponds to a stratification of the data, and the metadata about these annotations is stored in the globals.

Global freq_meta metadata annotation

The global annotation freq_meta is added to the input mt. It is a list of dict. Each element of the list contains metadata on a frequency stratification and the index in the list corresponds to the index of that frequency stratification in the freq row annotation.

Global freq_meta_sample_count annotation

The global annotation freq_meta_sample_count is added to the input mt. This is a sample count per sample grouping defined in the freq_meta global annotation.

The additional_strata_expr parameter

If the additional_strata_expr parameter is used, frequencies will be computed for each of the strata dictionaries across all values. For example, if additional_strata_expr is set to [{‘platform’: mt.platform}, {‘platform’:mt.platform, ‘gen_anc’: mt.gen_anc}, {‘age_bin’: mt.age_bin}], then frequencies will be computed for each of the values of mt.platform, each of the combined values of mt.platform and mt.gen_anc, and each of the values of mt.age_bin.

The downsamplings parameter

If the downsamplings parameter is used without the downsampling_expr, frequencies will be computed for all samples and by genetic ancestry group (if gen_anc_expr is specified) by downsampling the number of samples without replacement to each of the numbers specified in the downsamplings array, provided that there are enough samples in the dataset. In addition, if gen_anc_expr is specified, a downsampling to each of the exact number of samples present in each genetic ancestry group is added. Note that samples are randomly sampled only once, meaning that the lower downsamplings are subsets of the higher ones. If the downsampling_expr parameter is used with the downsamplings parameter, the downsamplings parameter informs the function which downsampling groups were already created and are to be used in the frequency calculation.

The downsampling_expr and ds_gen_anc_counts parameters

If the downsampling_expr parameter is used, downsamplings must also be set and frequencies will be computed for all samples and by genetic ancestry group (if gen_anc_expr is specified) using the downsampling indices to each of the numbers specified in the downsamplings array. The function expects a ‘global_idx’, and if gen_anc_expr is used, a ‘gen_anc_idx’ within the downsampling_expr to be used to determine if a sample belongs within a certain downsampling group, i.e. the index is less than the group size. The function `annotate_downsamplings can be used to to create the downsampling_expr, downsamplings, and ds_gen_anc_counts expressions.

The entry_agg_funcs parameter

If the entry_agg_funcs parameter is used, the output MatrixTable will also contain the annotations specified in the entry_agg_funcs parameter. The keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function. For example, if entry_agg_funcs is set to {‘adj_samples’: (get_adj_expr, hl.agg.sum)}`, then the output MatrixTable will contain an annotation adj_samples which is an array of the number of adj samples per strata in each row.

Parameters:

mt (MatrixTable) – Input MatrixTable
sex_expr (Optional[StringExpression]) – When specified, frequencies are stratified by sex. If gen_anc_expr is also specified, then a genetic ancestry group/sex stratification is added.
gen_anc_expr (Optional[StringExpression]) – When specified, frequencies are stratified by genetic ancestry group. If sex_expr is also specified, then a genetic ancestry group/sex stratification is added.
subgrp_expr (Optional[StringExpression]) – When specified, frequencies are stratified by genetic ancestry subgroup. Note that gen_anc_expr is required as well when using this option.
additional_strata_expr (Union[List[Dict[str, StringExpression]], Dict[str, StringExpression], None]) – When specified, frequencies are stratified by the given additional strata. This can e.g. be used to stratify by platform, platform-gen-anc, platform-gen-anc-sex.
downsamplings (Optional[List[int]]) – When specified, frequencies are computed by downsampling the data to the number of samples given in the list. Note that if gen_anc_expr is specified, downsamplings by genetic ancestry group is also computed.
downsampling_expr (Optional[StructExpression]) – When specified, frequencies are computed using the downsampling indices in the provided StructExpression. Note that if gen_anc_idx is specified within the struct, downsamplings by genetic ancestry group is also computed.
ds_gen_anc_counts (Optional[Dict[str, int]]) – When specified, frequencies are computed by downsampling the data to the number of samples per genetic ancestry group in the dict. The key is the genetic ancestry group and the value is the number of samples.
entry_agg_funcs (Optional[Dict[str, Tuple[Callable, Callable]]]) – When specified, additional annotations are added to the output Table/MatrixTable. The keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function.
annotate_mt (bool) – Whether to return the full MatrixTable with annotations added instead of only a Table with freq and other annotations. Default is True.

Return type:

Union[Table, MatrixTable]

Returns:

MatrixTable or Table with freq annotation.

gnomad.utils.annotations.annotate_downsamplings(t, downsamplings, gen_anc_expr=None)[source]

Annotate MatrixTable or Table with downsampling groups.

Parameters:

t (Union[MatrixTable, Table]) – Input MatrixTable or Table.
downsamplings (List[int]) – List of downsampling sizes.
gen_anc_expr (Optional[StringExpression]) – Optional expression for genetic ancestry group. When provided, genetic ancestry group downsamplings will be computed for each genetic ancestry group.

Return type:

Union[MatrixTable, Table]

Returns:

MatrixTable or Table with downsampling annotations.

gnomad.utils.annotations.build_freq_stratification_list(sex_expr=None, gen_anc_expr=None, subgrp_expr=None, additional_strata_expr=None, downsampling_expr=None)[source]

Build a list of stratification groupings to be used in frequency calculations based on supplied parameters.

Note

This function is primarily used through annotate_freq but can be used independently if desired. The returned list of stratifications can be passed to generate_freq_group_membership_array.

Parameters:

sex_expr (Optional[StringExpression]) – When specified, the returned list contains a stratification for sex. If gen_anc_expr is also specified, then the returned list also contains a stratification for sex/genetic ancestry group.
gen_anc_expr (Optional[StringExpression]) – When specified, the returned list contains a stratification for genetic ancestry group.
subgrp_expr (Optional[StringExpression]) – When specified, the returned list contains a stratification for genetic ancestry subgroup. Note that gen_anc_expr is required as well when using this option.
additional_strata_expr (Union[List[Dict[str, StringExpression]], Dict[str, StringExpression], None]) – When specified, the returned list contains a stratification for each of the additional strata. This can e.g. be used to stratify by platform, platform-gen-anc, platform-gen-anc-sex.
downsampling_expr (Optional[StructExpression]) – When specified, the returned list contains a stratification for downsampling. If gen_anc_expr is also specified, then the returned list also contains a downsampling/gen-anc stratification.

Return type:

List[Dict[str, StringExpression]]

Returns:

List of dictionaries specifying stratification groups where the keys of each dictionary are strings and the values are corresponding expressions that define the values to stratify frequency calculations by.

gnomad.utils.annotations.generate_freq_group_membership_array(ht, strata_expr, downsamplings=None, ds_gen_anc_counts=None, remove_zero_sample_groups=False, no_raw_group=False)[source]

Generate a Table with a ‘group_membership’ array for each sample indicating whether the sample belongs to specific stratification groups.

Note

This function is primarily used through annotate_freq but can be used independently if desired. Please see the annotate_freq function for more complete documentation.

The following global annotations are added to the returned Table:

freq_meta: Each element of the list contains metadata on a stratification group.
freq_meta_sample_count: sample count per grouping defined in freq_meta.
If downsamplings or ds_gen_anc_counts are specified, they are also added as global annotations on the returned Table.

Each sample is annotated with a ‘group_membership’ array indicating whether the sample belongs to specific stratification groups. All possible value combinations are determined for each stratification grouping in the strata_expr list.

Parameters:

ht (Table) – Input Table that contains Expressions specified by strata_expr.
strata_expr (List[Dict[str, StringExpression]]) – List of dictionaries specifying stratification groups where the keys of each dictionary are strings and the values are corresponding expressions that define the values to stratify frequency calculations by.
downsamplings (Optional[List[int]]) – List of downsampling values to include in the stratifications.
ds_gen_anc_counts (Optional[Dict[str, int]]) – Dictionary of genetic ancestry group counts for each downsampling value.
remove_zero_sample_groups (bool) – Whether to remove groups with a sample count of 0. Default is False.
no_raw_group (bool) – Whether to remove the raw group from the ‘group_membership’ annotation and the ‘freq_meta’ and ‘freq_meta_sample_count’ global annotations. Default is False.

Return type:

Returns:

Table with the ‘group_membership’ array annotation.

gnomad.utils.annotations.compute_freq_by_strata(mt, entry_agg_funcs=None, select_fields=None, group_membership_includes_raw_group=True)[source]

Compute call statistics and, when passed, entry aggregation function(s) by strata.

The computed call statistics are AC, AF, AN, and homozygote_count. The entry aggregation functions are applied to the MatrixTable entries and aggregated. The MatrixTable must contain a ‘group_membership’ annotation (like the one added by generate_freq_group_membership_array) that is a list of bools to aggregate the columns by.

Note

This function is primarily used through annotate_freq but can be used independently if desired. Please see the annotate_freq function for more complete documentation.

Parameters:

mt (MatrixTable) – Input MatrixTable.
entry_agg_funcs (Optional[Dict[str, Tuple[Callable, Callable]]]) – Optional dict of entry aggregation functions. When specified, additional annotations are added to the output Table/MatrixTable. The keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function.
select_fields (Optional[List[str]]) – Optional list of row fields from mt to keep on the output Table.
group_membership_includes_raw_group (bool) – Whether the ‘group_membership’ annotation includes an entry for the ‘raw’ group, representing all samples. If False, the ‘raw’ group is inserted as the second element in all added annotations using the same ‘group_membership’, resulting in array lengths of ‘group_membership’+1. If True, the second element of each added annotation is still the ‘raw’ group, but the group membership is determined by the values in the second element of ‘group_membership’, and the output annotations will be the same length as ‘group_membership’. Default is True.

Return type:

Returns:

Table or MatrixTable with allele frequencies by strata.

gnomad.utils.annotations.agg_by_strata(mt, entry_agg_funcs, select_fields=None, group_membership_ht=None, entry_agg_group_membership=None)[source]

Get row expression for annotations of each entry aggregation function(s) by strata.

The entry aggregation functions are applied to the MatrixTable entries and aggregated. If no group_membership_ht (like the one returned by generate_freq_group_membership_array) is supplied, mt must contain a ‘group_membership’ annotation that is a list of bools to aggregate the columns by.

Parameters:

mt (MatrixTable) – Input MatrixTable.
entry_agg_funcs (Dict[str, Tuple[Callable, Callable]]) – Dict of entry aggregation functions where the keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function.
select_fields (Optional[List[str]]) – Optional list of row fields from mt to keep on the output Table.
group_membership_ht (Optional[Table]) – Optional Table containing group membership annotations to stratify the aggregations by. If not provided, the ‘group_membership’ annotation is expected to be present on mt.
entry_agg_group_membership (Optional[Dict[str, List[dict]]]) – Optional dict indicating the subset of group strata in ‘freq_meta’ to run the entry aggregation functions on. The keys of the dict can be any of the keys in entry_agg_funcs and the values are lists of dicts. Each dict in the list contains the strata in ‘freq_meta’ to use for the corresponding entry aggregation function. If provided, ‘freq_meta’ must be present in group_membership_ht or mt and represent the same strata as those in ‘group_membership’. If not provided, all entries of the ‘group_membership’ annotation will have the entry aggregation functions applied to them.

Return type:

Returns:

Table with annotations of stratified aggregations.

gnomad.utils.annotations.update_structured_annotations(ht, annotation_update_exprs, annotation_update_label=None)[source]

Update highly structured annotations on a Table.

This function recursively updates annotations defined by annotation_update_exprs and if annotation_update_label is supplied, it checks if the sample annotations are different from the input and adds a flag to the Table, indicating which annotations have been updated for each sample.

Parameters:

ht (Table) – Input Table with structured annotations to update.
annotation_update_exprs (Dict[str, Expression]) – Dictionary of annotations to update, structured as they are structured on the input ht.
annotation_update_label (Optional[str]) – Optional string of the label to use for an annotation indicating which annotations have been updated. Default is None, so no annotation is added.

Return type:

Returns:

Table with updated annotations and optionally a flag indicating which annotations were changed.

gnomad.utils.annotations.fill_missing_key_combinations(ht, fill_values, key_values=None)[source]

Fill missing key combinations with requested fill values.

This function fills missing key combinations in the input Table with requested fill values. The fill values are specified in the fill_values dictionary. The unique values for each key field are collected from the input Table unless specified in the key_values dictionary.

This is useful when you want to ensure that all possible key combinations are present in a Table, even if they are missing from the input Table. This can happen when you are aggregating data and want to ensure that all possible key combinations are present in the output Table, not only the ones that are present.

Example:

.. code-block:: python

    ht = hl.Table.parallelize(
        [
            {'key1': 'A', 'key2': 1, 'value': 10},
            {'key1': 'A', 'key2': 2, 'value': 20},
            {'key1': 'B', 'key2': 1, 'value': 30},
        ],
        hl.tstruct(key1=hl.tstr, key2=hl.tint32, value=hl.tint32),
        key=['key1', 'key2'],
    )
    fill_values = {'value': hl.missing(hl.tint32)}
    filled_ht = fill_missing_key_combinations(ht, fill_values)
    filled_ht.show()

+------+------+
| key1 | key2 | value |
+------+------+
|  A   |  1   |  10   |
|  A   |  2   |  20   |
|  B   |  1   |  30   |
|  B   |  2   | null  |
+------+------+

In this example, the input table is missing the combination (B, 2). After applying fill_missing_key_combinations, the missing key combination (B, 2) is filled with the specified fill value for ‘value’ (null in this case).

Parameters:

ht (Table) – Input Table containing key fields.
fill_values (Dict[str, Expression]) – Dictionary of fill values to use for missing key combinations.
key_values (Optional[Dict[str, List]]) – Optional dictionary of unique values to use for each key field. Default is None. If None, the unique values for each key field are collected from the input Table.

Return type: