gnomad.utils.filtering

`gnomad.utils.filtering.filter_to_adj`(mt)	Filter genotypes to adj criteria.
`gnomad.utils.filtering.filter_by_frequency`(t, ...)	Filter MatrixTable or Table with gnomAD-format frequency data (assumed bi-allelic/split).
`gnomad.utils.filtering.combine_functions`(...)	Combine a list of boolean functions to an Expression using the specified operator.
`gnomad.utils.filtering.low_conf_regions_expr`(...)	Create an expression to filter low confidence regions.
`gnomad.utils.filtering.filter_low_conf_regions`(t, ...)	Filter low-confidence regions.
`gnomad.utils.filtering.filter_to_autosomes`(t)	Filter the Table or MatrixTable to autosomes only.
`gnomad.utils.filtering.add_filters_expr`(filters)	Create an expression to create or add filters.
`gnomad.utils.filtering.subset_samples_and_variants`(...)	Subset the MatrixTable or VariantDataset to the provided list of samples and their variants.
`gnomad.utils.filtering.filter_to_clinvar_pathogenic`(t)	Return a MatrixTable or Table that filters the clinvar data to pathogenic and likely pathogenic variants.
`gnomad.utils.filtering.filter_gencode_ht`([...])	Filter a Gencode Table to specified criteria.
`gnomad.utils.filtering.filter_by_intervals`(t, ...)	Filter Table/MatrixTable by interval(s).
`gnomad.utils.filtering.filter_by_gencode_intervals`(t)	Filter a Table/MatrixTable based on Gencode Table annotations.
`gnomad.utils.filtering.filter_to_gencode_cds`(t, ...)	Filter a Table/MatrixTable to only Gencode CDS regions in protein coding transcripts.
`gnomad.utils.filtering.remove_fields_from_constant`(...)	Remove fields from a list and display any field(s) missing from the original list.
`gnomad.utils.filtering.filter_x_nonpar`(t)	Filter to loci that are in non-PAR regions on chromosome X.
`gnomad.utils.filtering.filter_y_nonpar`(t)	Filter to loci that are in non-PAR regions on chromosome Y.
`gnomad.utils.filtering.filter_by_numeric_expr_range`(t, ...)	Filter rows in the Table/MatrixTable based on the range of a numeric expression.
`gnomad.utils.filtering.filter_for_mu`(ht[, ...])	Filter to non-coding annotations and remove GERP outliers.
`gnomad.utils.filtering.split_vds_by_strata`(...)	Split a VDS into multiple VDSs based on strata_expr.
`gnomad.utils.filtering.filter_arrays_by_meta`(...)	Filter both the metadata array expression and the metadata-indexed expressions by items_to_filter.

gnomad.utils.filtering.filter_to_adj(mt)

Filter genotypes to adj criteria.

Parameters: mt (MatrixTable) – Input MatrixTable.
Return type: MatrixTable

gnomad.utils.filtering.filter_by_frequency(t, direction, frequency=None, allele_count=None, gen_anc=None, subgrp=None, downsampling=None, keep=True, adj=True)

Filter MatrixTable or Table with gnomAD-format frequency data (assumed bi-allelic/split).

gnomAD frequency data format expectation is: Array[Struct(Array[AC], Array[AF], AN, homozygote_count, meta)].

At least one of frequency or allele_count is required.

Subgroup can be specified without a genetic ancestry group if desired.

Parameters:

t (Union[MatrixTable, Table]) – Input MatrixTable or Table
direction (str) – One of “above”, “below”, and “equal” (how to apply the filter)
frequency (float) – Frequency to filter by (one of frequency or allele_count is required)
allele_count (int) – Allele count to filter by (one of frequency or allele_count is required)
gen_anc (str) – Genetic ancestry group in which to filter frequency
subgrp (str) – Subgroup in which to filter frequency
downsampling (int) – Downsampling in which to filter frequency
keep (bool) – Whether to keep rows passing this frequency (passed to filter_rows)
adj (bool) – Whether to use adj frequency

Return type:

Union[MatrixTable, Table]

Returns:

Filtered MatrixTable or Table
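The interaction between direction and the two thresholds can be illustrated with a small plain-Python sketch. It does not use Hail: the helper passes_frequency_filter and the dict-based entry are hypothetical stand-ins for one element of the freq array, and it assumes that both thresholds must pass when both are supplied.

```python
import operator

# Map each documented direction onto a comparison operator.
_COMPARATORS = {"above": operator.gt, "below": operator.lt, "equal": operator.eq}


def passes_frequency_filter(entry, direction, frequency=None, allele_count=None):
    """Return True if one frequency entry passes the requested thresholds.

    `entry` is a hypothetical dict with "AF" and "AC" keys, not a Hail struct.
    """
    if frequency is None and allele_count is None:
        raise ValueError("At least one of frequency or allele_count is required")
    if direction not in _COMPARATORS:
        raise ValueError('direction must be one of "above", "below", "equal"')

    compare = _COMPARATORS[direction]
    checks = []
    if frequency is not None:
        checks.append(compare(entry["AF"], frequency))
    if allele_count is not None:
        checks.append(compare(entry["AC"], allele_count))
    # Assumption: when both thresholds are given, both must pass.
    return all(checks)


entry = {"AC": 12, "AF": 0.004, "AN": 3000}
rare = passes_frequency_filter(entry, "below", frequency=0.01)
common = passes_frequency_filter(entry, "above", frequency=0.01)
```

In the real function the matching rows are then kept or dropped according to keep.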

gnomad.utils.filtering.combine_functions(func_list, x, operator_func=operator.iand)

Combine a list of boolean functions to an Expression using the specified operator.

Note

The operator_func is applied cumulatively from left to right of the func_list.

Parameters:

func_list (List[Callable[[bool], bool]]) – A list of boolean functions that can be applied to x.
x (StructExpression) – Expression to be passed to each function in func_list.
operator_func (Callable[[bool, bool], bool]) – Operator function to combine the functions in func_list. Default is operator.iand.

Return type:

bool

Returns:

A boolean from the combined operations.
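The left-to-right combination can be sketched in plain Python with functools.reduce. This is an illustration only: combine_predicates is a hypothetical stand-in that works on ordinary Python booleans rather than Hail expressions, and it assumes func_list is not empty.

```python
import operator
from functools import reduce


def combine_predicates(func_list, x, operator_func=operator.iand):
    """Apply every function in func_list to x and fold the results left to right."""
    results = [func(x) for func in func_list]
    return reduce(operator_func, results)


record = {"AF": 0.002, "AN": 5000}
checks = [
    lambda r: r["AF"] < 0.01,   # passes
    lambda r: r["AN"] > 1000,   # passes
    lambda r: r["AF"] > 0.005,  # fails
]

all_pass = combine_predicates(checks, record)                # AND of the three checks
any_pass = combine_predicates(checks, record, operator.ior)  # OR of the three checks
```

With the default operator every check must pass; passing operator.ior instead requires only one.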

gnomad.utils.filtering.low_conf_regions_expr(locus_expr, filter_lcr=True, filter_decoy=True, filter_segdup=True, filter_exome_low_coverage_regions=False, filter_telomeres_and_centromeres=False, high_conf_regions=None)

Create an expression to filter low confidence regions.

Parameters:

locus_expr (LocusExpression) – Locus expression to use for filtering.
filter_lcr (bool) – Whether to filter LCR regions
filter_decoy (bool) – Whether to filter decoy regions
filter_segdup (bool) – Whether to filter Segdup regions
filter_exome_low_coverage_regions (bool) – Whether to filter exome low coverage regions
filter_telomeres_and_centromeres (bool) – Whether to filter telomeres and centromeres
high_conf_regions (Optional[List[str]]) – Paths to set of high confidence regions to restrict to (union of regions)

Return type:

BooleanExpression

Returns:

Boolean expression that is True for loci that are not in a low-confidence region and False for loci that are.
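The logic of the returned expression can be sketched without Hail as follows. Everything here is hypothetical: loci are (contig, position) tuples, each region set is a list of (contig, start, end) tuples with inclusive ends, and the coordinates are made up for the example.

```python
def in_any(locus, intervals):
    """Return True if (contig, pos) falls inside any (contig, start, end) interval."""
    contig, pos = locus
    return any(c == contig and start <= pos <= end for c, start, end in intervals)


def is_not_low_conf(locus, low_conf_sets, high_conf_sets=None):
    """True when the locus is outside every low-confidence set and, if
    high-confidence sets are given, inside at least one of them (their union)."""
    keep = all(not in_any(locus, intervals) for intervals in low_conf_sets)
    if high_conf_sets:
        keep = keep and any(in_any(locus, intervals) for intervals in high_conf_sets)
    return keep


# Made-up regions for illustration only.
lcr = [("chr1", 100, 200)]
segdup = [("chr1", 500, 600)]
high_conf = [[("chr1", 1, 400)], [("chr2", 1, 1000)]]

outside_all = is_not_low_conf(("chr1", 300), [lcr, segdup])
inside_lcr = is_not_low_conf(("chr1", 150), [lcr, segdup])
not_high_conf = is_not_low_conf(("chr1", 450), [lcr, segdup], high_conf)
```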

gnomad.utils.filtering.filter_low_conf_regions(t, **kwargs)

Filter low-confidence regions.

Parameters:

t (Union[MatrixTable, Table]) – MatrixTable or Table to filter.
kwargs – Keyword arguments to pass to low_conf_regions_expr.

Return type:

Union[MatrixTable, Table]

Returns:

MatrixTable or Table with low confidence regions removed.

gnomad.utils.filtering.filter_to_autosomes(t)

Filter the Table or MatrixTable to autosomes only.

This assumes that the input contains a field named locus of type Locus.

Parameters: t (Union[MatrixTable, Table]) – Input MT/HT
Return type: Union[MatrixTable, Table]
Returns: MT/HT filtered to autosomes

gnomad.utils.filtering.add_filters_expr(filters, current_filters=None)

Create an expression to create or add filters.

For each entry in the filters dictionary, if the value evaluates to True, then the key is added as a filter name.

Current filters are kept if provided using current_filters.

Parameters:

filters (Dict[str, BooleanExpression]) – The filters and their expressions
current_filters (SetExpression) – The set of current filters

Return type:

SetExpression

Returns:

An expression that can be used to annotate the filters
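The behaviour described above can be mimicked with ordinary Python sets. This is only an analogue: add_filters is a hypothetical helper operating on plain booleans and sets, whereas the real function builds a Hail SetExpression, and the filter names are example values.

```python
def add_filters(filters, current_filters=None):
    """Return the current filters plus every key in `filters` whose value is True."""
    new_filters = {name for name, is_set in filters.items() if is_set}
    return set(current_filters or ()) | new_filters


# Example filter names; only "AC0" evaluates to True here.
combined = add_filters({"AC0": True, "InbreedingCoeff": False}, current_filters={"RF"})
fresh = add_filters({"AC0": False})
```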

gnomad.utils.filtering.subset_samples_and_variants(mtds, sample_path, header=True, table_key='s', sparse=False, gt_expr='GT', remove_dead_alleles=False)

Subset the MatrixTable or VariantDataset to the provided list of samples and their variants.

Parameters:

mtds (Union[MatrixTable, VariantDataset]) – Input MatrixTable or VariantDataset
sample_path (str) – Path to a file with list of samples
header (bool) – Whether file with samples has a header. Default is True
table_key (str) – Key to sample Table. Default is “s”
sparse (bool) – Whether the MatrixTable is sparse. Default is False
gt_expr (str) – Name of field in MatrixTable containing genotype expression. Default is “GT”
remove_dead_alleles (bool) – Remove alleles observed in no samples. This option is currently only relevant when mtds is a VariantDataset. Default is False

Return type:

Union[MatrixTable, VariantDataset]

Returns:

MatrixTable or VariantDataset subsetted to specified samples and their variants

gnomad.utils.filtering.filter_to_clinvar_pathogenic(t, clnrevstat_field='CLNREVSTAT', clnsig_field='CLNSIG', clnsigconf_field='CLNSIGCONF', remove_no_assertion=True, remove_conflicting=True)

Return a MatrixTable or Table that filters the clinvar data to pathogenic and likely pathogenic variants.

Example use:

from gnomad.resources.grch38.reference_data import clinvar
from gnomad.utils.filtering import filter_to_clinvar_pathogenic
clinvar_ht = clinvar.ht()
clinvar_ht = filter_to_clinvar_pathogenic(clinvar_ht)

Parameters:

t (Union[MatrixTable, Table]) – Input dataset that contains clinvar data; either a MatrixTable or a Table.
clnrevstat_field (str) – The field string for the expression that contains the review status of the clinical significance of clinvar variants.
clnsig_field (str) – The field string for the expression that contains the clinical significance of the clinvar variant.
clnsigconf_field (str) – The field string for the expression that contains the conflicting clinical significance values for the variant. For variants with no conflicting significance, this field should be undefined.
remove_no_assertion (bool) – Flag for removing entries whose review status (clnrevstat) has no assertion (zero stars).
remove_conflicting (bool) – Flag for removing entries with conflicting clinical interpretations.

Return type:

Union[MatrixTable, Table]

Returns:

Filtered MatrixTable or Table

gnomad.utils.filtering.filter_gencode_ht(gencode_ht=None, reference_genome='GRCh38', version=None, protein_coding=False, feature=None, genes=None, by_gene_symbol=True)

Filter a Gencode Table to specified criteria.

Note

If no Gencode Table is provided, a reference_genome Gencode Table resource will be used. If version is not provided, the default version of the Gencode Table resource will be used.

Parameters:

gencode_ht (Optional[Table]) – Gencode Table to use for filtering the input Table/MatrixTable to CDS regions. Default is None, which will use the default version of the Gencode Table resource.
reference_genome (Optional[str]) – Reference genome build of Gencode Table to use if none is provided. Default is “GRCh38”.
version (Optional[str]) – Version of the Gencode Table to use if none is provided. Default is None.
protein_coding (bool) – Whether to filter to only intervals where “transcript_type” is “protein_coding”. Default is False.
feature (Union[str, List[str]]) – Optional feature(s) to filter to. Can be a single feature string or list of features. Default is None.
genes (Union[str, List[str], None]) – Optional gene(s) to filter to. Can be a single gene string or list of genes. Default is None.
by_gene_symbol (bool) – Whether to filter by gene symbol. Default is True. If False, will filter by gene ID.

Return type:

Table

Returns:

Gencode Table filtered to specified criteria.

gnomad.utils.filtering.filter_by_intervals(t, intervals, padding_bp=0, max_collect_intervals=3000, reference_genome=None)

Filter Table/MatrixTable by interval(s).

Parameters:

t (Union[MatrixTable, Table]) – Input Table/MatrixTable to filter.
intervals (Union[str, List[str], IntervalExpression, Interval, List[Interval]]) – Interval(s) to filter by. Can be a string, list of strings, IntervalExpression, Interval, or list of Intervals. Strings must use the format “contig:start-end”, e.g., “1:1000-2000” (GRCh37) or “chr1:1000-2000” (GRCh38).
padding_bp (int) – Number of bases to pad the intervals by. Default is 0.
max_collect_intervals (int) – Maximum number of intervals for the use of hl.filter_intervals for filtering. When the number of intervals to filter is greater than this number, filter/filter_rows will be used instead. The reason for this is that hl.filter_intervals is faster, but when the number of intervals is too large, this can cause memory errors. Default is 3000.
reference_genome (Optional[str]) – Reference genome build to use for parsing the intervals if the intervals are strings. Default is None.

Return type:

Table

Returns:

Table/MatrixTable filtered by interval(s).
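The interval string format, the padding, and the max_collect_intervals switch can be illustrated with a plain-Python sketch. The helpers parse_and_pad and choose_strategy are hypothetical and do not call Hail. Clamping the padded start at 1 is an assumption of this sketch, and the padded end is not checked against the contig length.

```python
import re

# "contig:start-end", e.g. "1:1000-2000" (GRCh37) or "chr1:1000-2000" (GRCh38).
_INTERVAL_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_and_pad(interval, padding_bp=0):
    """Parse an interval string and widen it by padding_bp on both sides."""
    match = _INTERVAL_RE.match(interval.strip())
    if match is None:
        raise ValueError(f"Expected 'contig:start-end', got {interval!r}")
    start = max(1, int(match["start"]) - padding_bp)  # assumption: clamp at 1
    end = int(match["end"]) + padding_bp
    return match["contig"], start, end


def choose_strategy(n_intervals, max_collect_intervals=3000):
    """Name the filtering approach described for max_collect_intervals."""
    if n_intervals <= max_collect_intervals:
        return "hl.filter_intervals"
    return "filter/filter_rows"


padded = parse_and_pad("chr1:1000-2000", padding_bp=50)
clamped = parse_and_pad("1:10-20", padding_bp=100)
```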

gnomad.utils.filtering.filter_by_gencode_intervals(t, gencode_ht=None, protein_coding=False, feature=None, genes=None, by_gene_symbol=True, padding_bp=0, max_collect_intervals=3000)

Filter a Table/MatrixTable based on Gencode Table annotations.

Note

If no Gencode Table is provided, the default version of the Gencode Table resource for the genome build of the input Table/MatrixTable will be used.

Parameters:

t (Union[MatrixTable, Table]) – Input Table/MatrixTable to filter.
gencode_ht (Optional[Table]) – Gencode Table to use for filtering the input Table/MatrixTable. Default is None, which will use the default version of the Gencode Table resource.
protein_coding (bool) – Whether to filter to only intervals where “transcript_type” is “protein_coding”. Default is False.
feature (Union[str, List[str]]) – Optional feature(s) to filter to. Can be a single feature string or list of features. Default is None.
genes (Union[str, List[str], None]) – Optional gene(s) to filter to. Can be a single gene string or list of genes. Default is None.
by_gene_symbol (bool) – Whether to filter by gene symbol. Default is True. If False, will filter by gene ID.
padding_bp (int) – Number of bases to pad the CDS intervals by. Default is 0.
max_collect_intervals (int) – Maximum number of intervals for the use of hl.filter_intervals for filtering. When the number of intervals to filter is greater than this number, filter/filter_rows will be used instead. The reason for this is that hl.filter_intervals is faster, but when the number of intervals is too large, this can cause memory errors. Default is 3000.

Return type:

Table

Returns:

Table/MatrixTable filtered to loci in requested Gencode intervals.

gnomad.utils.filtering.filter_to_gencode_cds(t, **kwargs)

Filter a Table/MatrixTable to only Gencode CDS regions in protein coding transcripts.

Example use:

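from gnomad.resources.grch37.reference_data import gencode
from gnomad.utils.filtering import filter_to_gencode_cds
gencode_ht = gencode.ht()
# ht is the GRCh37 Table/MatrixTable to filter (defined elsewhere).
ht = filter_to_gencode_cds(ht, gencode_ht=gencode_ht)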

Note

If no Gencode Table is provided, the default version of the Gencode Table resource for the genome build of the input Table/MatrixTable will be used.

Warning

This Gencode CDS interval filter does not take into account the transcript_id, it filters to any locus that is found in a CDS interval for any protein coding transcript. Therefore, if downstream analyses require filtering to CDS intervals by transcript, an additional step must be taken. For example, when filtering VEP transcript consequences, there may be cases where a variant is retained with this filter, but is considered outside the CDS intervals of the transcript per the VEP predicted consequence of the variant.

Parameters:

t (Union[MatrixTable, Table]) – Input Table/MatrixTable to filter.
kwargs – Additional keyword arguments to pass to filter_gencode_ht.

Return type:

Table

Returns:

Table/MatrixTable filtered to loci in Gencode CDS intervals.

gnomad.utils.filtering.remove_fields_from_constant(constant, fields_to_remove)

Remove fields from a list and display any field(s) missing from the original list.

Parameters:

constant (List[str]) – List of fields
fields_to_remove (List[str]) – List of fields to remove from constant

Return type:

List[str]
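A plain-Python sketch of this behaviour follows. The helper remove_fields is hypothetical: it returns a new list and reports missing fields through logging, and whether the real function copies or modifies constant in place is not stated here.

```python
import logging

logger = logging.getLogger(__name__)


def remove_fields(constant, fields_to_remove):
    """Return `constant` without `fields_to_remove`, logging any field not found."""
    missing = [field for field in fields_to_remove if field not in constant]
    if missing:
        logger.info("Fields missing from the original list: %s", missing)
    return [field for field in constant if field not in fields_to_remove]


# Example field names for illustration.
fields = ["FS", "MQ", "QD", "SOR"]
trimmed = remove_fields(fields, ["MQ", "VarDP"])  # "VarDP" is reported as missing
```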

gnomad.utils.filtering.filter_x_nonpar(t)

Filter to loci that are in non-PAR regions on chromosome X.

Parameters: t (Union[Table, MatrixTable]) – Input Table or MatrixTable.
Return type: Union[Table, MatrixTable]
Returns: Filtered Table or MatrixTable.

gnomad.utils.filtering.filter_y_nonpar(t)

Filter to loci that are in non-PAR regions on chromosome Y.

Parameters: t (Union[Table, MatrixTable]) – Input Table or MatrixTable.
Return type: Union[Table, MatrixTable]
Returns: Filtered Table or MatrixTable.

gnomad.utils.filtering.filter_by_numeric_expr_range(t, filter_expr, filter_range, keep_between=True, inclusive=True)

Filter rows in the Table/MatrixTable based on the range of a numeric expression.

Parameters:

t (Union[MatrixTable, Table]) – Input Table/MatrixTable.
filter_expr (NumericExpression) – NumericExpression to apply filter_range to.
filter_range (tuple) – Range of values to apply to filter_expr.
keep_between (bool) – Whether to keep the values between filter_range instead of keeping values outside filter_range. Default is True.
inclusive (bool) – Whether or not to include the filter_range values themselves. Default is True.

Return type:

Union[MatrixTable, Table]

Returns:

Table/MatrixTable filtered to rows with specified criteria.
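The four combinations of keep_between and inclusive can be spelled out with a plain-Python predicate. The helper in_requested_range is hypothetical and works on ordinary numbers. The reading of inclusive when keep_between is False (the boundary values count as "outside") is an assumption of this sketch.

```python
def in_requested_range(value, filter_range, keep_between=True, inclusive=True):
    """Return True if `value` would be kept for the given range settings."""
    low, high = filter_range
    if keep_between:
        # Keep values inside the range, with or without the endpoints.
        return low <= value <= high if inclusive else low < value < high
    # Keep values outside the range; assumption: endpoints are kept when inclusive.
    return (value <= low or value >= high) if inclusive else (value < low or value > high)


bounds = (0.2, 0.8)
edge_inclusive = in_requested_range(0.2, bounds)
edge_exclusive = in_requested_range(0.2, bounds, inclusive=False)
outside = in_requested_range(0.9, bounds, keep_between=False)
```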

gnomad.utils.filtering.filter_for_mu(ht, gerp_lower_cutoff=-3.9885, gerp_upper_cutoff=2.6607)

Filter to non-coding annotations and remove GERP outliers.

Note

Values for gerp_lower_cutoff and gerp_upper_cutoff default to -3.9885 and 2.6607, respectively. These values were precalculated on the GRCh37 context table and define the 5th and 95th percentiles.

Parameters:

ht (Table) – Input Table.
gerp_lower_cutoff (float) – Minimum GERP score for variant to be included. Default is -3.9885.
gerp_upper_cutoff (float) – Maximum GERP score for variant to be included. Default is 2.6607.

Return type:

Table

Returns:

Table filtered to intron or intergenic variants with GERP outliers removed.

gnomad.utils.filtering.split_vds_by_strata(vds, strata_expr)

Split a VDS into multiple VDSs based on strata_expr.

Parameters:

vds (VariantDataset) – Input VDS.
strata_expr (Expression) – Expression on VDS variant_data MT to split on.

Return type:

Dict[str, VariantDataset]

Returns:

Dictionary where strata value is key and VDS is value.

gnomad.utils.filtering.filter_arrays_by_meta(meta_expr, meta_indexed_exprs, items_to_filter, keep=True, combine_operator='and', exact_match=False)

Filter both the metadata array expression and the metadata-indexed expressions by items_to_filter.

The items_to_filter can be used to filter in the following ways based on meta_expr items:

- By a list of keys, e.g. [“sex”, “downsampling”].
- By specific key: value pairs, e.g. to filter where ‘gen_anc’ is ‘han’ or ‘papuan’ {“gen_anc”: [“han”, “papuan”]}, or where ‘gen_anc’ is ‘afr’ and/or ‘sex’ is ‘XX’ {“gen_anc”: [“afr”], “sex”: [“XX”]}.

The items can be kept or removed from meta_indexed_exprs and meta_expr based on the value of keep. For example, if meta_indexed_exprs is {‘freq’: ht.freq, ‘freq_meta_sample_count’: ht.index_globals().freq_meta_sample_count} and meta_expr is ht.freq_meta, then when keep is True the items specified by items_to_filter, such as ‘gen_anc’ = ‘han’, are kept and all other items are removed from freq, freq_meta_sample_count, and freq_meta. meta_indexed_exprs can also be a single array expression such as ht.freq.

The filtering can also be applied such that all criteria must be met (combine_operator = “and”) by the meta_expr item in order to be filtered, or at least one of the specified criteria must be met (combine_operator = “or”) by the meta_expr item in order to be filtered.

The exact_match parameter can be used to apply the keep parameter to only items specified in the items_to_filter parameter. For example, by default, if keep is True, combine_operator is “and”, and items_to_filter is [“sex”, “downsampling”], then all items in meta_expr with both “sex” and “downsampling” as keys will be kept. However, if exact_match is True, then the items in meta_expr will only be kept if “sex” and “downsampling” are the only keys in the meta dict.

Parameters:

meta_expr (ArrayExpression) – Metadata expression that describes the elements of meta_indexed_exprs. The most commonly used expression is freq_meta, which indexes into a ‘freq’ array.
meta_indexed_exprs (Union[Dict[str, ArrayExpression], ArrayExpression]) – Either a Dictionary whose keys are expression names and whose values are expressions indexed by meta_expr (such as a ‘freq’ array), or a single expression indexed by meta_expr.
items_to_filter (Union[Dict[str, List[str]], List[str]]) – Items to filter by, either a list or a dictionary.
keep (bool) – Whether to keep or remove the items specified by items_to_filter.
combine_operator (str) – Whether to use “and” or “or” to combine the items specified by items_to_filter.
exact_match (bool) – Whether to apply the keep parameter to only the items specified in the items_to_filter parameter or to all items in meta_expr. See the example above for more details. Default is False.

Return type:

Tuple[ArrayExpression, Union[Dict[str, ArrayExpression], ArrayExpression]]

Returns:

A Tuple of the filtered metadata expression and either a dictionary of filtered metadata-indexed expressions (when meta_indexed_exprs is a Dictionary) or a single filtered array expression (when meta_indexed_exprs is a single array expression).
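The options described above (keep, combine_operator, exact_match) can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the Hail implementation: filter_by_meta is a hypothetical helper, it only handles the dictionary form of meta_indexed_exprs, and it treats exact_match as "the meta dict has exactly the requested keys".

```python
def filter_by_meta(meta, indexed, items_to_filter, keep=True,
                   combine_operator="and", exact_match=False):
    """Filter a list of meta dicts and the parallel arrays indexed by it."""
    combine = {"and": all, "or": any}[combine_operator]

    def matches(m):
        if isinstance(items_to_filter, dict):
            # key: value form, e.g. {"gen_anc": ["afr"], "sex": ["XX"]}
            hits = [m.get(key) in values for key, values in items_to_filter.items()]
        else:
            # list-of-keys form, e.g. ["sex", "downsampling"]
            hits = [key in m for key in items_to_filter]
        matched = combine(hits)
        if exact_match:
            # Assumption: no keys other than the requested ones may be present.
            matched = matched and set(m) == set(items_to_filter)
        return matched

    kept = [i for i, m in enumerate(meta) if matches(m) == keep]
    filtered_meta = [meta[i] for i in kept]
    filtered_indexed = {name: [arr[i] for i in kept] for name, arr in indexed.items()}
    return filtered_meta, filtered_indexed


# Toy metadata; "f0".."f3" stand in for the matching freq structs.
meta = [
    {"group": "adj"},
    {"group": "adj", "sex": "XX"},
    {"sex": "XX", "downsampling": "100"},
    {"sex": "XY", "downsampling": "100", "gen_anc": "afr"},
]
indexed = {"freq": ["f0", "f1", "f2", "f3"]}

_, by_keys = filter_by_meta(meta, indexed, ["sex", "downsampling"])
_, exact = filter_by_meta(meta, indexed, ["sex", "downsampling"], exact_match=True)
_, either = filter_by_meta(
    meta, indexed, {"gen_anc": ["afr"], "sex": ["XX"]}, combine_operator="or"
)
```

Setting keep to False inverts the selection, so the same call returns the entries that did not match.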