gnomad.utils.annotations
Annotate a MatrixTable/Table with additional annotations from another Table. |
|
|
Create an expression containing the frequency information about the population that has the highest AF in freq_meta. |
Create an expression that computes allele frequency information by project for the n_projects with the largest AF at this row. |
|
|
Calculate the filtering allele frequency (FAF) for each threshold specified in faf_thresholds. |
Retrieve the maximum FAF and corresponding genetic ancestry for each of the thresholds in faf. |
|
Return a struct expression with genotype quality histograms based on the arguments given (dp, gq, ad, ab). |
|
Return a StructExpression with the age histograms for hets and homs. |
|
Compute lowqual threshold expression for either split or unsplit alleles based on QUALapprox or AS_QUALapprox. |
|
Create histograms for variant metrics in ht.info. |
|
Create bins for frequencies in preparation for aggregating QUAL by frequency bin. |
|
|
Prepare relevant ploidy annotations for downstream calculations on a matrix table. |
Determine if a genotype or locus and karyotype combination is haploid. |
|
Get adj annotation using only GQ and DP. |
|
Get adj het AB annotation. |
|
|
Get adj genotype annotation. |
|
Annotate genotypes with adj criteria (assumes diploid). |
Get Struct of variant_type and n_alt_alleles from ArrayExpression of Strings. |
|
Return bi-allelic sites Table with an 'allele_info' annotation. |
|
Given an annotation type, return whether it is a numerical type or not. |
|
Given an annotation type, returns whether that type can be natively exported to a VCF INFO field. |
|
|
Return the site inbreeding coefficient as an expression to be computed on a MatrixTable. |
|
Compute FS (Fisher strand balance) annotation from the SB (strand balance table) field. |
Compute SOR (Symmetric Odds Ratio test) annotation from the SB (strand balance table) field. |
|
|
Compute the maximum p-value of the binomial test for the alternate allele balance (PAB) for each allele. |
Return a boolean expression selecting bi-allelic sites only, accounting for whether the input MT/HT was split. |
|
Generate unphased version of a call expression (which can be phased or not). |
|
Create a region_flag struct that contains flags for problematic regions (i.e., LCR, decoy, segdup, and nonpar regions). |
|
Create a missing callstats struct for insertion into frequency annotation arrays when data is missing. |
|
Set Y-variant frequency callstats for female-specific metrics to missing structs. |
|
|
Return whether genotypes are hemizygous. |
Merge a list of frequency arrays based on the supplied operation. |
|
Merge a list of histogram annotations. |
|
|
Annotate mt with stratified allele frequencies. |
Annotate MatrixTable or Table with downsampling groups. |
|
|
Build a list of stratification groupings to be used in frequency calculations based on supplied parameters. |
|
Generate a Table with a 'group_membership' array for each sample indicating whether the sample belongs to specific stratification groups. |
Compute call statistics and, when passed, entry aggregation function(s) by strata. |
|
Get row expression for annotations of each entry aggregation function(s) by strata. |
|
|
Update highly structured annotations on a Table. |
|
Fill missing key combinations with requested fill values. |
Create a struct of missing values corresponding to each field and type in the input struct. |
|
Generate a dictionary containing VRS information from a given locus and struct of VRS information. |
|
|
Generate Python dictionary containing GKS VA annotations. |
- gnomad.utils.annotations.annotate_with_ht(t, annotation_ht, fields=None, annotate_cols=False, filter_missing=False)[source]
Annotate a MatrixTable/Table with additional annotations from another Table.
- Parameters:
t (
Union
[MatrixTable
,Table
]) – MatrixTable/Table to be annotated.annotation_ht (
Table
) – Table containing additional annotations to be joined on t.fields (
Optional
[List
]) – Optional list of fields to select from annotation_ht and add to tannotate_cols (
bool
) – If True, annotate columns instead of rows. Default is False.filter_missing (
bool
) – If True, filter out missing rows/cols in t that are not present in annotation_ht. Default is False.
- Return type:
Union
[MatrixTable
,Table
]- Returns:
Annotated MatrixTable/Table.
- gnomad.utils.annotations.pop_max_expr(freq, freq_meta, pops_to_exclude=None, pop_label='pop')[source]
Create an expression containing the frequency information about the population that has the highest AF in freq_meta.
Populations specified in pops_to_exclude are excluded and only frequencies from adj populations are considered.
This resulting struct contains the following fields:
AC: int32
AF: float64
AN: int32
homozygote_count: int32
pop: str
- Parameters:
freq (
ArrayExpression
) – ArrayExpression of Structs with fields [‘AC’, ‘AF’, ‘AN’, ‘homozygote_count’]freq_meta (
ArrayExpression
) – ArrayExpression of meta dictionaries corresponding to freq (as returned by annotate_freq)pops_to_exclude (
Optional
[Set
[str
]]) – Set of populations to skip for popmax calcluationpop_label (
str
) – Label of the population field in the meta dictionary
- Return type:
- Returns:
Popmax struct
- gnomad.utils.annotations.project_max_expr(project_expr, gt_expr, alleles_expr, n_projects=5)[source]
Create an expression that computes allele frequency information by project for the n_projects with the largest AF at this row.
Will return an array with one element per non-reference allele.
Each of these elements is itself an array of structs with the following fields:
AC: int32
AF: float64
AN: int32
homozygote_count: int32
project: str
Note
Only projects with AF > 0 are returned. In case of ties, the project ordering is not guaranteed, and at most n_projects are returned.
- Parameters:
project_expr (
StringExpression
) – column expression containing the projectgt_expr (
CallExpression
) – entry expression containing the genotypealleles_expr (
ArrayExpression
) – row expression containing the allelesn_projects (
int
) – Maximum number of projects to return for each row
- Return type:
- Returns:
projectmax expression
- gnomad.utils.annotations.faf_expr(freq, freq_meta, locus, pops_to_exclude=None, faf_thresholds=[0.95, 0.99], pop_label='pop')[source]
Calculate the filtering allele frequency (FAF) for each threshold specified in faf_thresholds.
See http://cardiodb.org/allelefrequencyapp/ for more information.
The FAF is computed for each of the following population stratification if found in freq_meta:
All samples, with adj criteria
For each population, with adj criteria
For all sex/population on the non-PAR regions of sex chromosomes (will be missing on autosomes and PAR regions of sex chromosomes)
Each of the FAF entry is a struct with one entry per threshold specified in faf_thresholds of type float64.
This returns a tuple with two expressions:
An array of FAF expressions as described above
An array of dict containing the metadata for each of the array elements, in the same format as that produced by annotate_freq.
- Parameters:
freq (
ArrayExpression
) – ArrayExpression of call stats structs (typically generated by hl.agg.call_stats)freq_meta (
ArrayExpression
) – ArrayExpression of meta dictionaries corresponding to freq (typically generated using annotate_freq)locus (
LocusExpression
) – locuspops_to_exclude (
Optional
[Set
[str
]]) – Set of populations to exclude from faf calculation (typically bottlenecked or consanguineous populations)faf_thresholds (
List
[float
]) – List of FAF thresholds to computepop_label (
str
) – Label of the population field in the meta dictionary
- Return type:
Tuple
[ArrayExpression
,List
[Dict
[str
,str
]]]- Returns:
(FAF expression, FAF metadata)
- gnomad.utils.annotations.gen_anc_faf_max_expr(faf, faf_meta, pop_label='pop')[source]
Retrieve the maximum FAF and corresponding genetic ancestry for each of the thresholds in faf.
This resulting struct contains the following fields:
faf95_max: float64
faf95_max_gen_anc: str
faf99_max: float64
faf99_max_gen_anc: str
- Parameters:
faf (
ArrayExpression
) – ArrayExpression of Structs of FAF thresholds previously computed. When faf_expr is used, contains fields ‘faf95’ and ‘faf99’.faf_meta (
ArrayExpression
) – ArrayExpression of meta dictionaries corresponding to faf (as returned by faf_expr)pop_label (
str
) – Label of the population field in the meta dictionary
- Return type:
- Returns:
Genetic ancestry group struct for FAF max
- gnomad.utils.annotations.qual_hist_expr(gt_expr=None, gq_expr=None, dp_expr=None, ad_expr=None, adj_expr=None, ab_expr=None, split_adj_and_raw=False)[source]
Return a struct expression with genotype quality histograms based on the arguments given (dp, gq, ad, ab).
Note
If gt_expr is provided, will return histograms for non-reference samples only as well as all samples.
gt_expr is required for the allele-balance histogram, as it is only computed on het samples.
If ab_expr is provided, the allele-balance histogram is computed using this expression instead of the ad_expr.
If adj_expr is provided, additional histograms are computed using only adj samples.
- Parameters:
gt_expr (
Optional
[CallExpression
]) – Entry expression containing genotype.gq_expr (
Optional
[NumericExpression
]) – Entry expression containing genotype quality.dp_expr (
Optional
[NumericExpression
]) – Entry expression containing depth.ad_expr (
Optional
[ArrayNumericExpression
]) – Entry expression containing allelic depth (bi-allelic here).adj_expr (
Optional
[BooleanExpression
]) – Entry expression containing adj (high quality) genotype status.ab_expr (
Optional
[NumericExpression
]) – Entry expression containing allele balance (bi-allelic here).split_adj_and_raw (
bool
) – Whether to split the adj and raw histograms into separate fields in the returned struct expr.
- Return type:
- Returns:
Genotype quality histograms expression.
- gnomad.utils.annotations.age_hists_expr(adj_expr, gt_expr, age_expr, lowest_boundary=30, highest_boundary=80, n_bins=10)[source]
Return a StructExpression with the age histograms for hets and homs.
- Parameters:
adj_expr (
BooleanExpression
) – Entry expression containing whether a genotype is high quality (adj) or notgt_expr (
CallExpression
) – Entry expression containing the genotypeage_expr (
NumericExpression
) – Col expression containing the sample’s agelowest_boundary (
int
) – Lowest bin boundary (any younger sample will be binned in n_smaller)highest_boundary (
int
) – Highest bin boundary (any older sample will be binned in n_larger)n_bins (
int
) – Total number of bins
- Return type:
- Returns:
A struct with age_hist_het and age_hist_hom
- gnomad.utils.annotations.get_lowqual_expr(alleles, qual_approx_expr, snv_phred_threshold=30, snv_phred_het_prior=30, indel_phred_threshold=30, indel_phred_het_prior=39)[source]
Compute lowqual threshold expression for either split or unsplit alleles based on QUALapprox or AS_QUALapprox.
Note
When running This lowqual annotation using QUALapprox, it differs from the GATK LowQual filter. This is because GATK computes this annotation at the site level, which uses the least stringent prior for mixed sites. When run using AS_QUALapprox, this implementation can thus be more stringent for certain alleles at mixed sites.
- Parameters:
alleles (
ArrayExpression
) – Array of allelesqual_approx_expr (
Union
[ArrayNumericExpression
,NumericExpression
]) – QUALapprox or AS_QUALapproxsnv_phred_threshold (
int
) – Phred-scaled SNV “emission” threshold (similar to GATK emission threshold)snv_phred_het_prior (
int
) – Phred-scaled SNV heterozygosity prior (30 = 1/1000 bases, GATK default)indel_phred_threshold (
int
) – Phred-scaled indel “emission” threshold (similar to GATK emission threshold)indel_phred_het_prior (
int
) – Phred-scaled indel heterozygosity prior (30 = 1/1000 bases, GATK default)
- Return type:
Union
[BooleanExpression
,ArrayExpression
]- Returns:
lowqual expression (BooleanExpression if qual_approx_expr`is Numeric, Array[BooleanExpression] if `qual_approx_expr is ArrayNumeric)
- gnomad.utils.annotations.get_annotations_hists(ht, annotations_hists, log10_annotations=['DP'])[source]
Create histograms for variant metrics in ht.info.
Used when creating site quality distribution json files.
- Parameters:
ht (
Table
) – Table with variant metricsannotations_hists (
Dict
[str
,Tuple
]) – Dictionary of metrics names and their histogram values (start, end, bins)log10_annotations (
List
[str
]) – List of metrics to log scale
- Returns:
Dictionary of merics and their histograms
- Return type:
Dict[str, hl.expr.StructExpression]
- gnomad.utils.annotations.create_frequency_bins_expr(AC, AF)[source]
Create bins for frequencies in preparation for aggregating QUAL by frequency bin.
- Bins:
singleton
doubleton
0.00005
0.0001
0.0002
0.0005
0.001,
0.002
0.005
0.01
0.02
0.05
0.1
0.2
0.5
1
NOTE: Frequencies should be frequencies from raw data. Used when creating site quality distribution json files.
- Parameters:
AC (
NumericExpression
) – Field in input that contains the allele count informationAF (
NumericExpression
) – Field in input that contains the allele frequency information
- Returns:
Expression containing bin name
- Return type:
hl.expr.StringExpression
- gnomad.utils.annotations.annotate_and_index_source_mt_for_sex_ploidy(locus_expr, karyotype_expr, xy_karyotype_str='XY', xx_karyotype_str='XX')[source]
Prepare relevant ploidy annotations for downstream calculations on a matrix table.
This method is used as an optimization for the get_is_haploid_expr and adjusted_sex_ploidy_expr methods.
This method annotates the locus_expr source matrix table with the following fields:
xy: Boolean indicating if the sample is XY.
xx: Boolean indicating if the sample is XX.
in_non_par: Boolean indicating if the locus is in a non-PAR region.
x_nonpar: Boolean indicating if the locus is in a non-PAR region of the X chromosome.
y_par: Boolean indicating if the locus is in a PAR region of the Y chromosome.
y_nonpar: Boolean indicating if the locus is in a non-PAR region of the Y chromosome.
- Parameters:
locus_expr (
LocusExpression
) – Locus expression.karyotype_expr (
StringExpression
) – Karyotype expression.xy_karyotype_str (
str
) – String representing XY karyotype. Default is “XY”.xx_karyotype_str (
str
) – String representing XX karyotype. Default is “XX”.
- Return type:
Tuple
[StructExpression
,StructExpression
]- Returns:
Tuple of index expressions for columns and rows.
- gnomad.utils.annotations.get_is_haploid_expr(gt_expr=None, locus_expr=None, karyotype_expr=None, xy_karyotype_str='XY', xx_karyotype_str='XX')[source]
Determine if a genotype or locus and karyotype combination is haploid.
Note
One of gt_expr or locus_expr and karyotype_expr is required.
- Parameters:
gt_expr (
Optional
[CallExpression
]) – Optional genotype expression.locus_expr (
Optional
[LocusExpression
]) – Optional locus expression.karyotype_expr (
Optional
[StringExpression
]) – Optional sex karyotype expression.xy_karyotype_str (
str
) – String representing XY karyotype. Default is “XY”.xx_karyotype_str (
str
) – String representing XX karyotype. Default is “XX”.
- Return type:
- Returns:
Boolean expression indicating if the genotype is haploid.
- gnomad.utils.annotations.get_gq_dp_adj_expr(gq_expr, dp_expr, gt_expr=None, locus_expr=None, karyotype_expr=None, adj_gq=20, adj_dp=10, haploid_adj_dp=5)[source]
Get adj annotation using only GQ and DP.
Default thresholds correspond to gnomAD values.
Note
This function can be used to annotate adj taking into account only GQ and DP. It is useful for cases where the GT field is not available, such as in the reference data of a VariantDataset.
Note
One of gt_expr or locus_expr and karyotype_expr is required.
- Parameters:
gq_expr (
Union
[Int32Expression
,Int64Expression
]) – GQ expression.dp_expr (
Union
[Int32Expression
,Int64Expression
]) – DP expression.gt_expr (
Optional
[CallExpression
]) – Optional genotype expression.locus_expr (
Optional
[LocusExpression
]) – Optional locus expression.karyotype_expr (
Optional
[StringExpression
]) – Optional sex karyotype expression.adj_gq (
int
) – GQ threshold for adj. Default is 20.adj_dp (
int
) – DP threshold for adj. Default is 10.haploid_adj_dp (
int
) – Haploid DP threshold for adj. Default is 5.
- Return type:
- Returns:
Boolean expression indicating adj filter.
- gnomad.utils.annotations.get_het_ab_adj_expr(gt_expr, dp_expr, ad_expr, adj_ab=0.2)[source]
Get adj het AB annotation.
- Parameters:
gt_expr (
CallExpression
) – Genotype expression.dp_expr (
Union
[Int32Expression
,Int64Expression
]) – DP expression.ad_expr (
ArrayNumericExpression
) – AD expression.adj_ab (
float
) – AB threshold for adj. Default is 0.2.
- Return type:
- Returns:
Boolean expression indicating adj het AB filter.
- gnomad.utils.annotations.get_adj_expr(gt_expr, gq_expr, dp_expr, ad_expr, adj_gq=20, adj_dp=10, adj_ab=0.2, haploid_adj_dp=5)[source]
Get adj genotype annotation.
Defaults correspond to gnomAD values.
- Parameters:
gt_expr (
CallExpression
) –gq_expr (
Union
[Int32Expression
,Int64Expression
]) –dp_expr (
Union
[Int32Expression
,Int64Expression
]) –ad_expr (
ArrayNumericExpression
) –adj_gq (
int
) –adj_dp (
int
) –adj_ab (
float
) –haploid_adj_dp (
int
) –
- Return type:
- gnomad.utils.annotations.annotate_adj(mt, adj_gq=20, adj_dp=10, adj_ab=0.2, haploid_adj_dp=5)[source]
Annotate genotypes with adj criteria (assumes diploid).
Defaults correspond to gnomAD values.
- Parameters:
mt (
MatrixTable
) –adj_gq (
int
) –adj_dp (
int
) –adj_ab (
float
) –haploid_adj_dp (
int
) –
- Return type:
- gnomad.utils.annotations.add_variant_type(alt_alleles)[source]
Get Struct of variant_type and n_alt_alleles from ArrayExpression of Strings.
- Parameters:
alt_alleles (
ArrayExpression
) –- Return type:
- gnomad.utils.annotations.annotate_allele_info(ht)[source]
Return bi-allelic sites Table with an ‘allele_info’ annotation.
Note
This function requires that the input ht is unsplit and returns a split ht.
- ‘allele_info’ is a struct with the following information:
variant_type: Variant type (snv, indel, multi-snv, multi-indel, or mixed).
n_alt_alleles: Total number of alternate alleles observed at variant locus.
has_star: True if the variant contains a star allele.
allele_type: Allele type (snv, insertion, deletion, or mixed).
was_mixed: True if the variant was mixed (i.e. contained both SNVs and indels).
nonsplit_alleles: Array of alleles before splitting.
- gnomad.utils.annotations.annotation_type_is_numeric(t)[source]
Given an annotation type, return whether it is a numerical type or not.
- Parameters:
t (
Any
) – Type to test- Return type:
bool
- Returns:
If the input type is numeric
- gnomad.utils.annotations.annotation_type_in_vcf_info(t)[source]
Given an annotation type, returns whether that type can be natively exported to a VCF INFO field.
Note
Types that aren’t natively exportable to VCF will be converted to String on export.
- Parameters:
t (
Any
) – Type to test- Return type:
bool
- Returns:
If the input type can be exported to VCF
- gnomad.utils.annotations.bi_allelic_site_inbreeding_expr(call=None, callstats_expr=None)[source]
Return the site inbreeding coefficient as an expression to be computed on a MatrixTable.
This is implemented based on the GATK InbreedingCoeff metric: https://software.broadinstitute.org/gatk/documentation/article.php?id=8032
Note
The computation is run based on the counts of alternate alleles and thus should only be run on bi-allelic sites.
- Parameters:
call (
Optional
[CallExpression
]) – Expression giving the calls in the MTcallstats_expr (
Optional
[StructExpression
]) – StructExpression containing only alternate allele AC, AN, and homozygote_count as integers. If passed, used to create expression in place of GT calls.
- Return type:
- Returns:
Site inbreeding coefficient expression
- gnomad.utils.annotations.fs_from_sb(sb, normalize=True, min_cell_count=200, min_count=4, min_p_value=1e-320)[source]
Compute FS (Fisher strand balance) annotation from the SB (strand balance table) field.
FS is the phred-scaled value of the double-sided Fisher exact test on strand balance.
Using default values will have the same behavior as the GATK implementation, that is: - If sum(counts) > 2*`min_cell_count` (default to GATK value of 200), they are normalized - If sum(counts) < min_count (default to GATK value of 4), returns missing - Any p-value < min_p_value (default to GATK value of 1e-320) is truncated to that value
In addition to the default GATK behavior, setting normalize to False will perform a chi-squared test for large counts (> min_cell_count) instead of normalizing the cell values.
Note
This function can either take - an array of length four containing the forward and reverse strands’ counts of ref and alt alleles: [ref fwd, ref rev, alt fwd, alt rev] - a two dimensional array with arrays of length two, containing the counts: [[ref fwd, ref rev], [alt fwd, alt rev]]
GATK code here: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/FisherStrand.java
- Parameters:
sb (
Union
[ArrayNumericExpression
,ArrayExpression
]) – Count of ref/alt reads on each strandnormalize (
bool
) – Whether to normalize counts is sum(counts) > min_cell_count (normalize=True), or use a chi sq instead of FET (normalize=False)min_cell_count (
int
) – Maximum count for performing a FETmin_count (
int
) – Minimum total count to output FS (otherwise null it output)min_p_value (
float
) –
- Return type:
- Returns:
FS value
- gnomad.utils.annotations.sor_from_sb(sb)[source]
Compute SOR (Symmetric Odds Ratio test) annotation from the SB (strand balance table) field.
Note
This function can either take - an array of length four containing the forward and reverse strands’ counts of ref and alt alleles: [ref fwd, ref rev, alt fwd, alt rev] - a two dimensional array with arrays of length two, containing the counts: [[ref fwd, ref rev], [alt fwd, alt rev]]
GATK code here: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/StrandOddsRatio.java
- Parameters:
sb (
Union
[ArrayNumericExpression
,ArrayExpression
]) – Count of ref/alt reads on each strand- Return type:
- Returns:
SOR value
- gnomad.utils.annotations.pab_max_expr(gt_expr, ad_expr, la_expr=None, n_alleles_expr=None)[source]
Compute the maximum p-value of the binomial test for the alternate allele balance (PAB) for each allele.
Note
This function can take a gt_expr and ad_expr that use local or global alleles. If they use local alleles, la_expr and n_alleles_expr should be provided to transform gt_expr and ad_expr to global alleles.
- Parameters:
gt_expr (
CallExpression
) – Genotype call expression.ad_expr (
ArrayExpression
) – Allele depth expression.la_expr (
Optional
[ArrayExpression
]) – Allele local index expression. When provided gt_expr and ad_expr are transformed from using local alleles to global alleles using la_expr.n_alleles_expr (
Optional
[Int32Expression
]) – Number of alleles expression. Required when ‘la_expr’ is provided.
- Return type:
- Returns:
Array expression of maximum p-values.
- gnomad.utils.annotations.bi_allelic_expr(t)[source]
Return a boolean expression selecting bi-allelic sites only, accounting for whether the input MT/HT was split.
- Parameters:
t (
Union
[Table
,MatrixTable
]) – Input HT/MT- Return type:
- Returns:
Boolean expression selecting only bi-allelic sites
- gnomad.utils.annotations.unphase_call_expr(call_expr)[source]
Generate unphased version of a call expression (which can be phased or not).
- Parameters:
call_expr (
CallExpression
) – Input call expression- Return type:
- Returns:
unphased call expression
- gnomad.utils.annotations.region_flag_expr(t, non_par=True, prob_regions=None)[source]
Create a region_flag struct that contains flags for problematic regions (i.e., LCR, decoy, segdup, and nonpar regions).
Note
No hg38 resources for decoy or self chain are available yet.
- Parameters:
t (
Union
[Table
,MatrixTable
]) – Input Table/MatrixTablenon_par (
bool
) – If True, flag loci that occur within pseudoautosomal regions on sex chromosomesprob_regions (
Dict
[str
,Table
]) – If supplied, flag loci that occur within regions defined in Hail Table(s)
- Return type:
- Returns:
region_flag struct row annotation
- gnomad.utils.annotations.missing_callstats_expr()[source]
Create a missing callstats struct for insertion into frequency annotation arrays when data is missing.
- Return type:
- Returns:
Hail Struct with missing values for each callstats element
- gnomad.utils.annotations.set_female_y_metrics_to_na_expr(t, freq_expr='freq', freq_meta_expr='freq_meta', freq_index_dict_expr='freq_index_dict')[source]
Set Y-variant frequency callstats for female-specific metrics to missing structs.
- Parameters:
t (
Union
[Table
,MatrixTable
]) – Table or MatrixTable for which to adjust female metrics.freq_expr (
Union
[ArrayExpression
,str
]) – Array expression or string annotation name for the frequency array. Default is “freq”.freq_meta_expr (
Union
[ArrayExpression
,str
]) – Array expression or string annotation name for the frequency metadata. Default is “freq_meta”.freq_index_dict_expr (
Union
[DictExpression
,str
]) – Dict expression or string annotation name for the frequency metadata index dictionary. Default is “freq_index_dict”.
- Return type:
- Returns:
Hail array expression to set female Y-variant metrics to missing values.
- gnomad.utils.annotations.hemi_expr(locus, sex_expr, gt, male_str='XY')[source]
Return whether genotypes are hemizygous.
Return missing expression if locus is not in chrX/chrY non-PAR regions.
- Parameters:
locus (
LocusExpression
) – Input locus.sex_expr (
StringExpression
) – Input StringExpression indicating whether sample is XX or XY.gt (
CallExpression
) – Input genotype.xy_str – String indicating whether sample is XY. Default is “XY”.
male_str (
str
) –
- Return type:
- Returns:
BooleanExpression indicating whether genotypes are hemizygous.
- gnomad.utils.annotations.merge_freq_arrays(farrays, fmeta, operation='sum', set_negatives_to_zero=False, count_arrays=None)[source]
Merge a list of frequency arrays based on the supplied operation.
Warning
Arrays must be on the same Table.
Note
Arrays do not have to contain the same groupings or order of groupings but the array indices for a freq array in farrays must be the same as its associated frequency metadata index in fmeta i.e., farrays = [freq1, freq2] then fmeta must equal [fmeta1, fmeta2] where fmeta1 contains the metadata information for freq1.
If operation is set to “sum”, groups in the merged array will be the union of groupings found within the arrays’ metadata and all arrays with be summed by grouping. If operation is set to “diff”, the merged array will contain groups only found in the first array of fmeta. Any array containing any of these groups will have thier values subtracted from the values of the first array.
- Parameters:
farrays (
List
[ArrayExpression
]) – List of frequency arrays to merge. First entry in the list is the primary array to which other arrays will be added or subtracted. All arrays must be on the same Table.fmeta (
List
[List
[Dict
[str
,str
]]]) – List of frequency metadata for arrays being merged.operation (
str
) – Merge operation to perform. Options are “sum” and “diff”. If “diff” is passed, the first freq array in the list will have the other arrays subtracted from it.set_negatives_to_zero (
bool
) – If True, set negative array values to 0 for AC, AN, AF, and homozygote_count. If False, raise a ValueError. Default is False.count_arrays (
Optional
[Dict
[str
,List
[ArrayExpression
]]]) – Dictionary of Lists of arrays containing counts to merge using the passed operation. Must use the same group indexing as fmeta. Keys are the descriptor names, values are Lists of arrays to merge. Default is None.
- Return type:
Union
[Tuple
[ArrayExpression
,List
[Dict
[str
,int
]]],Tuple
[ArrayExpression
,List
[Dict
[str
,int
]],Dict
[str
,List
[ArrayExpression
]]]]- Returns:
Tuple of merged frequency array, frequency metadata list and if count_arrays is not None, a dictionary of merged count arrays.
- gnomad.utils.annotations.merge_histograms(hists)[source]
Merge a list of histogram annotations.
This function merges a list of histogram annotations by summing the arrays in an element-wise fashion. It keeps one ‘bin_edge’ annotation but merges the ‘bin_freq’, ‘n_smaller’, and ‘n_larger’ annotations by summing them.
Note
Bin edges are assumed to be the same for all histograms.
- Parameters:
hists (
List
[StructExpression
]) – List of histogram structs to merge.- Return type:
- Returns:
Merged histogram struct.
- gnomad.utils.annotations.annotate_freq(mt, sex_expr=None, pop_expr=None, subpop_expr=None, additional_strata_expr=None, downsamplings=None, downsampling_expr=None, ds_pop_counts=None, entry_agg_funcs=None, annotate_mt=True)[source]
Annotate mt with stratified allele frequencies.
- The output Matrix table will include:
row annotation freq containing the stratified allele frequencies
global annotation freq_meta with metadata
global annotation freq_meta_sample_count with sample count information
Note
Currently this only supports bi-allelic sites.
- The input mt needs to have the following entry fields:
GT: a CallExpression containing the genotype
adj: a BooleanExpression containing whether the genotype is of high quality or not.
All expressions arguments need to be expression on the input mt.
freq row annotation
The freq row annotation is an Array of Structs, with each Struct containing the following fields:
AC: int32
AF: float64
AN: int32
homozygote_count: int32
Each element of the array corresponds to a stratification of the data, and the metadata about these annotations is stored in the globals.
Global freq_meta metadata annotation
The global annotation freq_meta is added to the input mt. It is a list of dict. Each element of the list contains metadata on a frequency stratification and the index in the list corresponds to the index of that frequency stratification in the freq row annotation.
Global freq_meta_sample_count annotation
The global annotation freq_meta_sample_count is added to the input mt. This is a sample count per sample grouping defined in the freq_meta global annotation.
The additional_strata_expr parameter
If the additional_strata_expr parameter is used, frequencies will be computed for each of the strata dictionaries across all values. For example, if additional_strata_expr is set to [{‘platform’: mt.platform}, {‘platform’:mt.platform, ‘pop’: mt.pop}, {‘age_bin’: mt.age_bin}], then frequencies will be computed for each of the values of mt.platform, each of the combined values of mt.platform and mt.pop, and each of the values of mt.age_bin.
The downsamplings parameter
If the downsamplings parameter is used without the downsampling_expr, frequencies will be computed for all samples and by population (if pop_expr is specified) by downsampling the number of samples without replacement to each of the numbers specified in the downsamplings array, provided that there are enough samples in the dataset. In addition, if pop_expr is specified, a downsampling to each of the exact number of samples present in each population is added. Note that samples are randomly sampled only once, meaning that the lower downsamplings are subsets of the higher ones. If the downsampling_expr parameter is used with the downsamplings parameter, the downsamplings parameter informs the function which downsampling groups were already created and are to be used in the frequency calculation.
The downsampling_expr and ds_pop_counts parameters
If the downsampling_expr parameter is used, downsamplings must also be set and frequencies will be computed for all samples and by population (if pop_expr is specified) using the downsampling indices to each of the numbers specified in the downsamplings array. The function expects a ‘global_idx’, and if pop_expr is used, a ‘pop_idx’ within the downsampling_expr to be used to determine if a sample belongs within a certain downsampling group, i.e. the index is less than the group size. The function `annotate_downsamplings can be used to to create the downsampling_expr, downsamplings, and ds_pop_counts expressions.
The entry_agg_funcs parameter
If the entry_agg_funcs parameter is used, the output MatrixTable will also contain the annotations specified in the entry_agg_funcs parameter. The keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function. For example, if entry_agg_funcs is set to {‘adj_samples’: (get_adj_expr, hl.agg.sum)}`, then the output MatrixTable will contain an annotation adj_samples which is an array of the number of adj samples per strata in each row.
- Parameters:
mt (
MatrixTable
) – Input MatrixTablesex_expr (
Optional
[StringExpression
]) – When specified, frequencies are stratified by sex. If pop_expr is also specified, then a pop/sex stratifiction is added.pop_expr (
Optional
[StringExpression
]) – When specified, frequencies are stratified by population. If sex_expr is also specified, then a pop/sex stratifiction is added.subpop_expr (
Optional
[StringExpression
]) – When specified, frequencies are stratified by sub-continental population. Note that pop_expr is required as well when using this option.additional_strata_expr (
Union
[List
[Dict
[str
,StringExpression
]],Dict
[str
,StringExpression
],None
]) – When specified, frequencies are stratified by the given additional strata. This can e.g. be used to stratify by platform, platform-pop, platform-pop-sex.downsamplings (
Optional
[List
[int
]]) – When specified, frequencies are computed by downsampling the data to the number of samples given in the list. Note that if pop_expr is specified, downsamplings by population is also computed.downsampling_expr (
Optional
[StructExpression
]) – When specified, frequencies are computed using the downsampling indices in the provided StructExpression. Note that if pop_idx is specified within the struct, downsamplings by population is also computed.ds_pop_counts (
Optional
[Dict
[str
,int
]]) – When specified, frequencies are computed by downsampling the data to the number of samples per pop in the dict. The key is the population and the value is the number of samples.entry_agg_funcs (
Optional
[Dict
[str
,Tuple
[Callable
,Callable
]]]) – When specified, additional annotations are added to the output Table/MatrixTable. The keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function.annotate_mt (
bool
) – Whether to return the full MatrixTable with annotations added instead of only a Table with freq and other annotations. Default is True.
- Return type:
Union
[Table
,MatrixTable
]- Returns:
MatrixTable or Table with freq annotation.
- gnomad.utils.annotations.annotate_downsamplings(t, downsamplings, pop_expr=None)[source]
Annotate MatrixTable or Table with downsampling groups.
- Parameters:
t (
Union
[MatrixTable
,Table
]) – Input MatrixTable or Table.downsamplings (
List
[int
]) – List of downsampling sizes.pop_expr (
Optional
[StringExpression
]) – Optional expression for population group. When provided, population sample sizes are added as values to downsamplings.
- Return type:
Union
[MatrixTable
,Table
]- Returns:
MatrixTable or Table with downsampling annotations.
- gnomad.utils.annotations.build_freq_stratification_list(sex_expr=None, pop_expr=None, subpop_expr=None, additional_strata_expr=None, downsampling_expr=None)[source]
Build a list of stratification groupings to be used in frequency calculations based on supplied parameters.
Note
This function is primarily used through annotate_freq but can be used independently if desired. The returned list of stratifications can be passed to generate_freq_group_membership_array.
- Parameters:
sex_expr (
Optional
[StringExpression
]) – When specified, the returned list contains a stratification for sex. If pop_expr is also specified, then the returned list also contains a pop/sex stratification.pop_expr (
Optional
[StringExpression
]) – When specified, the returned list contains a stratification for population. If sex_expr is also specified, then the returned list also contains a pop/sex stratification.subpop_expr (
Optional
[StringExpression
]) – When specified, the returned list contains a stratification for sub-continental population. Note that pop_expr is required as well when using this option.additional_strata_expr (
Union
[List
[Dict
[str
,StringExpression
]],Dict
[str
,StringExpression
],None
]) – When specified, the returned list contains a stratification for each of the additional strata. This can e.g. be used to stratify by platform, platform-pop, platform-pop-sex.downsampling_expr (
Optional
[StructExpression
]) – When specified, the returned list contains a stratification for downsampling. If pop_expr is also specified, then the returned list also contains a downsampling/pop stratification.
- Return type:
List
[Dict
[str
,StringExpression
]]- Returns:
List of dictionaries specifying stratification groups where the keys of each dictionary are strings and the values are corresponding expressions that define the values to stratify frequency calculations by.
- gnomad.utils.annotations.generate_freq_group_membership_array(ht, strata_expr, downsamplings=None, ds_pop_counts=None, remove_zero_sample_groups=False, no_raw_group=False)[source]
Generate a Table with a ‘group_membership’ array for each sample indicating whether the sample belongs to specific stratification groups.
Note
This function is primarily used through annotate_freq but can be used independently if desired. Please see the annotate_freq function for more complete documentation.
- The following global annotations are added to the returned Table:
freq_meta: Each element of the list contains metadata on a stratification group.
freq_meta_sample_count: sample count per grouping defined in freq_meta.
If downsamplings or ds_pop_counts are specified, they are also added as global annotations on the returned Table.
Each sample is annotated with a ‘group_membership’ array indicating whether the sample belongs to specific stratification groups. All possible value combinations are determined for each stratification grouping in the strata_expr list.
- Parameters:
ht (
Table
) – Input Table that contains Expressions specified by strata_expr.strata_expr (
List
[Dict
[str
,StringExpression
]]) – List of dictionaries specifying stratification groups where the keys of each dictionary are strings and the values are corresponding expressions that define the values to stratify frequency calculations by.downsamplings (
Optional
[List
[int
]]) – List of downsampling values to include in the stratifications.ds_pop_counts (
Optional
[Dict
[str
,int
]]) – Dictionary of population counts for each downsampling value.remove_zero_sample_groups (
bool
) – Whether to remove groups with a sample count of 0. Default is False.no_raw_group (
bool
) – Whether to remove the raw group from the ‘group_membership’ annotation and the ‘freq_meta’ and ‘freq_meta_sample_count’ global annotations. Default is False.
- Return type:
- Returns:
Table with the ‘group_membership’ array annotation.
- gnomad.utils.annotations.compute_freq_by_strata(mt, entry_agg_funcs=None, select_fields=None, group_membership_includes_raw_group=True)[source]
Compute call statistics and, when passed, entry aggregation function(s) by strata.
The computed call statistics are AC, AF, AN, and homozygote_count. The entry aggregation functions are applied to the MatrixTable entries and aggregated. The MatrixTable must contain a ‘group_membership’ annotation (like the one added by generate_freq_group_membership_array) that is a list of bools to aggregate the columns by.
Note
This function is primarily used through annotate_freq but can be used independently if desired. Please see the annotate_freq function for more complete documentation.
- Parameters:
mt (
MatrixTable
) – Input MatrixTable.entry_agg_funcs (
Optional
[Dict
[str
,Tuple
[Callable
,Callable
]]]) – Optional dict of entry aggregation functions. When specified, additional annotations are added to the output Table/MatrixTable. The keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function.select_fields (
Optional
[List
[str
]]) – Optional list of row fields from mt to keep on the output Table.group_membership_includes_raw_group (
bool
) – Whether the ‘group_membership’ annotation includes an entry for the ‘raw’ group, representing all samples. If False, the ‘raw’ group is inserted as the second element in all added annotations using the same ‘group_membership’, resulting in array lengths of ‘group_membership’+1. If True, the second element of each added annotation is still the ‘raw’ group, but the group membership is determined by the values in the second element of ‘group_membership’, and the output annotations will be the same length as ‘group_membership’. Default is True.
- Return type:
- Returns:
Table or MatrixTable with allele frequencies by strata.
- gnomad.utils.annotations.agg_by_strata(mt, entry_agg_funcs, select_fields=None, group_membership_ht=None, entry_agg_group_membership=None)[source]
Get row expression for annotations of each entry aggregation function(s) by strata.
The entry aggregation functions are applied to the MatrixTable entries and aggregated. If no group_membership_ht (like the one returned by generate_freq_group_membership_array) is supplied, mt must contain a ‘group_membership’ annotation that is a list of bools to aggregate the columns by.
- Parameters:
mt (
MatrixTable
) – Input MatrixTable.entry_agg_funcs (
Dict
[str
,Tuple
[Callable
,Callable
]]) – Dict of entry aggregation functions where the keys of the dict are the names of the annotations and the values are tuples of functions. The first function is used to transform the mt entries in some way, and the second function is used to aggregate the output from the first function.select_fields (
Optional
[List
[str
]]) – Optional list of row fields from mt to keep on the output Table.group_membership_ht (
Optional
[Table
]) – Optional Table containing group membership annotations to stratify the aggregations by. If not provided, the ‘group_membership’ annotation is expected to be present on mt.entry_agg_group_membership (
Optional
[Dict
[str
,List
[dict
]]]) – Optional dict indicating the subset of group strata in ‘freq_meta’ to run the entry aggregation functions on. The keys of the dict can be any of the keys in entry_agg_funcs and the values are lists of dicts. Each dict in the list contains the strata in ‘freq_meta’ to use for the corresponding entry aggregation function. If provided, ‘freq_meta’ must be present in group_membership_ht or mt and represent the same strata as those in ‘group_membership’. If not provided, all entries of the ‘group_membership’ annotation will have the entry aggregation functions applied to them.
- Return type:
- Returns:
Table with annotations of stratified aggregations.
- gnomad.utils.annotations.update_structured_annotations(ht, annotation_update_exprs, annotation_update_label=None)[source]
Update highly structured annotations on a Table.
This function recursively updates annotations defined by annotation_update_exprs and if annotation_update_label is supplied, it checks if the sample annotations are different from the input and adds a flag to the Table, indicating which annotations have been updated for each sample.
- Parameters:
ht (
Table
) – Input Table with structured annotations to update.annotation_update_exprs (
Dict
[str
,Expression
]) – Dictionary of annotations to update, structured as they are structured on the input ht.annotation_update_label (
Optional
[str
]) – Optional string of the label to use for an annotation indicating which annotations have been updated. Default is None, so no annotation is added.
- Return type:
- Returns:
Table with updated annotations and optionally a flag indicating which annotations were changed.
- gnomad.utils.annotations.fill_missing_key_combinations(ht, fill_values, key_values=None)[source]
Fill missing key combinations with requested fill values.
This function fills missing key combinations in the input Table with requested fill values. The fill values are specified in the fill_values dictionary. The unique values for each key field are collected from the input Table unless specified in the key_values dictionary.
This is useful when you want to ensure that all possible key combinations are present in a Table, even if they are missing from the input Table. This can happen when you are aggregating data and want to ensure that all possible key combinations are present in the output Table, not only the ones that are present.
Example:
.. code-block:: python ht = hl.Table.parallelize( [ {'key1': 'A', 'key2': 1, 'value': 10}, {'key1': 'A', 'key2': 2, 'value': 20}, {'key1': 'B', 'key2': 1, 'value': 30}, ], hl.tstruct(key1=hl.tstr, key2=hl.tint32, value=hl.tint32), key=['key1', 'key2'], ) fill_values = {'value': hl.missing(hl.tint32)} filled_ht = fill_missing_key_combinations(ht, fill_values) filled_ht.show() +------+------+ | key1 | key2 | value | +------+------+ | A | 1 | 10 | | A | 2 | 20 | | B | 1 | 30 | | B | 2 | null | +------+------+
In this example, the input table is missing the combination (B, 2). After applying fill_missing_key_combinations, the missing key combination (B, 2) is filled with the specified fill value for ‘value’ (null in this case).
- Parameters:
ht (
Table
) – Input Table containing key fields.fill_values (
Dict
[str
,Expression
]) – Dictionary of fill values to use for missing key combinations.key_values (
Optional
[Dict
[str
,List
]]) – Optional dictionary of unique values to use for each key field. Default is None. If None, the unique values for each key field are collected from the input Table.
- Return type:
- Returns:
Table with missing key combinations filled with requested fill values.
- gnomad.utils.annotations.missing_struct_expr(dtypes)[source]
Create a struct of missing values corresponding to each field and type in the input struct.
- Parameters:
dtypes (
tstruct
) – StructExpression containing the field names and missing values.- Return type:
- gnomad.utils.annotations.add_gks_vrs(input_locus, input_vrs)[source]
Generate a dictionary containing VRS information from a given locus and struct of VRS information.
Dict will have GA4GH GKS VRS structure.
- Parameters:
input_locus (
locus
) – Locus field from a struct (locus of result of running .collect() on a Hail table).input_vrs (
struct
) – VRS struct (such as from a ht.info.vrs field).
- Return type:
dict
- Returns:
Python dictionary conforming to GA4GH GKS VRS structure.
- gnomad.utils.annotations.add_gks_va(input_struct, label_name='gnomAD', label_version='3.1.2', ancestry_groups=None, ancestry_groups_dict=None, by_sex=False, freq_index_dict=None)[source]
Generate Python dictionary containing GKS VA annotations.
Populate the dictionary with frequency information conforming to the GKS VA frequency schema. If ancestry_groups or by_sex is provided, also include subcohort schemas for each cohort. If input_struct has mean_depth, it is added to ancillaryResults. This annotation is added under the gks_va_freq_dict field of the table. The focusAllele field is not populated, and must be filled in by the caller.
- Parameters:
input_struct (
struct
) – Hail struct for a desired variant (such as result of running .collect()[0] on a Table).label_name (
str
) – Label name to use within the returned dictionary. Example: “gnomAD”.label_version (
str
) – String listing the version of the table being used. Example: “3.1.2”.ancestry_groups (
list
) – List of strings of shortened names of cohorts to return results for. Example: [‘afr’,’fin’,’nfe’]. Default is None.ancestry_groups_dict (
dict
) – Dict mapping shortened genetic ancestry group names to full names. Example: {‘afr’:’African/African American’}. Default is None.by_sex (
bool
) – Boolean to include breakdown of cohorts by inferred sex (XX and XY) as well. Default is None.freq_index_dict (
dict
) –
- Freq_index_dict:
Dict mapping groups to their index for freq info in ht.freq_index_dict[0]. Default is None.
- Return type:
dict
- Returns:
Tuple containing a dictionary containing GKS VA frequency information, (split by ancestry groups and sex if desired) for the specified variant.