gnomad.utils.filtering
filter_to_adj | Filter genotypes to adj criteria. |
filter_by_frequency | Filter MatrixTable or Table with gnomAD-format frequency data (assumed bi-allelic/split). |
combine_functions | Combine a list of boolean functions to an Expression using the specified operator. |
low_conf_regions_expr | Create an expression to filter low confidence regions. |
filter_low_conf_regions | Filter low-confidence regions. |
filter_to_autosomes | Filter the Table or MatrixTable to autosomes only. |
add_filters_expr | Create an expression to create or add filters. |
subset_samples_and_variants | Subset the MatrixTable or VariantDataset to the provided list of samples and their variants. |
filter_to_clinvar_pathogenic | Return a MatrixTable or Table that filters the clinvar data to pathogenic and likely pathogenic variants. |
filter_to_gencode_cds | Filter a Table/MatrixTable to only Gencode CDS regions in protein coding transcripts. |
remove_fields_from_constant | Remove fields from a list and display any field(s) missing from the original list. |
filter_x_nonpar | Filter to loci that are in non-PAR regions on chromosome X. |
filter_y_nonpar | Filter to loci that are in non-PAR regions on chromosome Y. |
filter_by_numeric_expr_range | Filter rows in the Table/MatrixTable based on the range of a numeric expression. |
filter_for_mu | Filter to non-coding annotations and remove GERP outliers. |
split_vds_by_strata | Split a VDS into multiple VDSs based on strata_expr. |
filter_arrays_by_meta | Filter both metadata array expression and meta data indexed expression by items_to_filter. |
- gnomad.utils.filtering.filter_to_adj(mt)[source]
Filter genotypes to adj criteria.
- Parameters:
mt (MatrixTable) –
- Return type:
- gnomad.utils.filtering.filter_by_frequency(t, direction, frequency=None, allele_count=None, population=None, subpop=None, downsampling=None, keep=True, adj=True)[source]
Filter MatrixTable or Table with gnomAD-format frequency data (assumed bi-allelic/split).
gnomAD frequency data format expectation is: Array[Struct(Array[AC], Array[AF], AN, homozygote_count, meta)].
At least one of frequency or allele_count is required.
Subpop can be specified without a population if desired.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table
direction (str) – One of “above”, “below”, and “equal” (how to apply the filter)
frequency (float) – Frequency to filter by (one of frequency or allele_count is required)
allele_count (int) – Allele count to filter by (one of frequency or allele_count is required)
population (str) – Population in which to filter frequency
subpop (str) – Sub-population in which to filter frequency
downsampling (int) – Downsampling in which to filter frequency
keep (bool) – Whether to keep rows passing this frequency (passed to filter_rows)
adj (bool) – Whether to use adj frequency
- Return type:
Union[MatrixTable, Table]
- Returns:
Filtered MatrixTable or Table
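The threshold rules above (a direction, plus at least one of frequency or allele_count) can be illustrated without Hail. This is a plain-Python sketch and not the library's implementation: `passes_frequency_filter` is a hypothetical helper that looks at a single (AF, AC) pair, and requiring both thresholds to hold when both are supplied is an assumption.

```python
import operator

# Map each documented direction to a comparison operator.
_DIRECTIONS = {"above": operator.gt, "below": operator.lt, "equal": operator.eq}


def passes_frequency_filter(af, ac, direction, frequency=None, allele_count=None):
    """Return True if a single (AF, AC) pair satisfies the requested thresholds."""
    if frequency is None and allele_count is None:
        raise ValueError("At least one of frequency or allele_count is required")
    if direction not in _DIRECTIONS:
        raise ValueError(f"direction must be one of {sorted(_DIRECTIONS)}")
    compare = _DIRECTIONS[direction]
    checks = []
    if frequency is not None:
        checks.append(compare(af, frequency))
    if allele_count is not None:
        checks.append(compare(ac, allele_count))
    # Assumption: when both thresholds are supplied, both must hold.
    return all(checks)


# A rare variant passes a "below 0.1%" filter.
rare = passes_frequency_filter(af=0.0004, ac=12, direction="below", frequency=0.001)
# A common variant passes an "above" filter on both AF and AC.
common = passes_frequency_filter(
    af=0.2, ac=5000, direction="above", frequency=0.01, allele_count=100
)
```

In the real function the comparison is applied to the frequency entry selected by population, subpop, downsampling, and adj; the sketch leaves that selection out.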
- gnomad.utils.filtering.combine_functions(func_list, x, operator_func=<built-in function iand>)[source]
Combine a list of boolean functions to an Expression using the specified operator.
Note
The operator_func is applied cumulatively from left to right of the func_list.
- Parameters:
func_list (List[Callable[[bool], bool]]) – A list of boolean functions that can be applied to x.
x (StructExpression) – Expression to be passed to each function in func_list.
operator_func (Callable[[bool, bool], bool]) – Operator function to combine the functions in func_list. Default is operator.iand.
- Return type:
bool
- Returns:
A boolean from the combined operations.
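The left-to-right combination described in the note can be sketched with plain Python booleans in place of Hail expressions. `combine_functions_sketch` is a hypothetical stand-in written for illustration, and the `AC`/`AF` row fields are made up for the example.

```python
import operator
from functools import reduce


def combine_functions_sketch(func_list, x, operator_func=operator.iand):
    """Apply every function to x, then fold the results left to right."""
    return reduce(operator_func, (func(x) for func in func_list))


# Two illustrative checks on a dict standing in for a row struct.
checks = [lambda row: row["AC"] > 0, lambda row: row["AF"] < 0.01]

both_pass = combine_functions_sketch(checks, {"AC": 5, "AF": 0.001})
one_fails = combine_functions_sketch(checks, {"AC": 0, "AF": 0.001})
# Swapping in operator.ior turns "all checks" into "any check".
either_pass = combine_functions_sketch(checks, {"AC": 0, "AF": 0.001}, operator.ior)
```

With the default `operator.iand` every function must return True; passing `operator.ior` relaxes that to at least one.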
- gnomad.utils.filtering.low_conf_regions_expr(locus_expr, filter_lcr=True, filter_decoy=True, filter_segdup=True, filter_exome_low_coverage_regions=False, filter_telomeres_and_centromeres=False, high_conf_regions=None)[source]
Create an expression to filter low confidence regions.
- Parameters:
locus_expr (LocusExpression) – Locus expression to use for filtering.
filter_lcr (bool) – Whether to filter LCR regions
filter_decoy (bool) – Whether to filter decoy regions
filter_segdup (bool) – Whether to filter Segdup regions
filter_exome_low_coverage_regions (bool) – Whether to filter exome low confidence regions
filter_telomeres_and_centromeres (bool) – Whether to filter telomeres and centromeres
high_conf_regions (Optional[List[str]]) – Paths to set of high confidence regions to restrict to (union of regions)
- Return type:
- Returns:
Bool expression of whether loci are not low confidence (TRUE) or low confidence (FALSE)
- gnomad.utils.filtering.filter_low_conf_regions(t, **kwargs)[source]
Filter low-confidence regions.
- Parameters:
t (Union[MatrixTable, Table]) – MatrixTable or Table to filter.
kwargs – Keyword arguments to pass to low_conf_regions_expr.
- Return type:
Union[MatrixTable, Table]
- Returns:
MatrixTable or Table with low confidence regions removed.
- gnomad.utils.filtering.filter_to_autosomes(t)[source]
Filter the Table or MatrixTable to autosomes only.
This assumes that the input contains a field named locus of type Locus
- Parameters:
t (Union[MatrixTable, Table]) – Input MT/HT
- Return type:
Union[MatrixTable, Table]
- Returns:
MT/HT autosomes
- gnomad.utils.filtering.add_filters_expr(filters, current_filters=None)[source]
Create an expression to create or add filters.
For each entry in the filters dictionary, if the value evaluates to True, then the key is added as a filter name.
Current filters are kept if provided using current_filters
- Parameters:
filters (Dict[str, BooleanExpression]) – The filters and their expressions
current_filters (SetExpression) – The set of current filters
- Return type:
- Returns:
An expression that can be used to annotate the filters
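The rule described above (add each key whose value is True, and keep any existing filters) can be shown with Python sets in place of Hail set expressions. `add_filters_sketch` is a hypothetical helper, and the filter names are illustrative only.

```python
def add_filters_sketch(filters, current_filters=None):
    """Return the union of current_filters and every key whose value is True."""
    result = set(current_filters) if current_filters is not None else set()
    result.update(name for name, failed in filters.items() if failed)
    return result


# Only the keys mapped to True become filter names.
new_filters = add_filters_sketch({"AC0": True, "InbreedingCoeff": False, "RF": True})
# Existing filters are preserved when current_filters is given.
extended = add_filters_sketch({"AC0": True}, current_filters={"LCR"})
```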
- gnomad.utils.filtering.subset_samples_and_variants(mtds, sample_path, header=True, table_key='s', sparse=False, gt_expr='GT', remove_dead_alleles=False)[source]
Subset the MatrixTable or VariantDataset to the provided list of samples and their variants.
- Parameters:
mtds (Union[MatrixTable, VariantDataset]) – Input MatrixTable or VariantDataset
sample_path (str) – Path to a file with list of samples
header (bool) – Whether file with samples has a header. Default is True
table_key (str) – Key to sample Table. Default is “s”
sparse (bool) – Whether the MatrixTable is sparse. Default is False
gt_expr (str) – Name of field in MatrixTable containing genotype expression. Default is “GT”
remove_dead_alleles (bool) – Remove alleles observed in no samples. This option is currently only relevant when mtds is a VariantDataset. Default is False
- Return type:
Union[MatrixTable, VariantDataset]
- Returns:
MatrixTable or VariantDataset subsetted to specified samples and their variants
- gnomad.utils.filtering.filter_to_clinvar_pathogenic(t, clnrevstat_field='CLNREVSTAT', clnsig_field='CLNSIG', clnsigconf_field='CLNSIGCONF', remove_no_assertion=True, remove_conflicting=True)[source]
Return a MatrixTable or Table that filters the clinvar data to pathogenic and likely pathogenic variants.
Example use:
    from gnomad.resources.grch38.reference_data import clinvar
    clinvar_ht = clinvar.ht()
    clinvar_ht = filter_to_clinvar_pathogenic(clinvar_ht)
- Parameters:
t (Union[MatrixTable, Table]) – Input dataset that contains clinvar data, could either be a MatrixTable or Table.
clnrevstat_field (str) – The field string for the expression that contains the review status of the clinical significance of clinvar variants.
clnsig_field (str) – The field string for the expression that contains the clinical significance of the clinvar variant.
clnsigconf_field (str) – The field string for the expression that contains the conflicting clinical significance values for the variant. For variants with no conflicting significance, this field should be undefined.
remove_no_assertion (bool) – Flag for removing entries in which the clnrevstat (clinical significance) has no assertions (zero stars).
remove_conflicting (bool) – Flag for removing entries with conflicting clinical interpretations.
- Return type:
Union[MatrixTable, Table]
- Returns:
Filtered MatrixTable or Table
- gnomad.utils.filtering.filter_to_gencode_cds(t, gencode_ht=None, genes=None, by_gene_symbol=True, padding_bp=0, max_collect_intervals=3000)[source]
Filter a Table/MatrixTable to only Gencode CDS regions in protein coding transcripts.
Example use:
    from gnomad.resources.grch37.reference_data import gencode
    gencode_ht = gencode.ht()
    gencode_ht = filter_gencode_to_cds(gencode_ht)
Note
If no Gencode Table is provided, the default version of the Gencode Table resource for the genome build of the input Table/MatrixTable will be used.
Warning
This Gencode CDS interval filter does not take into account the transcript_id, it filters to any locus that is found in a CDS interval for any protein coding transcript. Therefore, if downstream analyses require filtering to CDS intervals by transcript, an additional step must be taken. For example, when filtering VEP transcript consequences, there may be cases where a variant is retained with this filter, but is considered outside the CDS intervals of the transcript per the VEP predicted consequence of the variant.
- Parameters:
t (Union[MatrixTable, Table]) – Input Table/MatrixTable to filter.
gencode_ht (Optional[Table]) – Gencode Table to use for filtering the input Table/MatrixTable to CDS regions. Default is None, which will use the default version of the Gencode Table resource.
genes (Union[str, List[str], None]) – Optional gene(s) to filter to. Can be a single gene string or list of genes. Default is None.
by_gene_symbol (bool) – Whether to filter by gene symbol. Default is True. If False, will filter by gene ID.
padding_bp (int) – Number of bases to pad the CDS intervals by. Default is 0.
max_collect_intervals (int) – Maximum number of intervals for the use of hl.filter_intervals for filtering. When the number of intervals to filter is greater than this number, filter/filter_rows will be used instead. The reason for this is that hl.filter_intervals is faster, but when the number of intervals is too large, this can cause memory errors. Default is 3000.
- Return type:
- Returns:
Table/MatrixTable filtered to loci in Gencode CDS intervals.
- gnomad.utils.filtering.remove_fields_from_constant(constant, fields_to_remove)[source]
Remove fields from a list and display any field(s) missing from the original list.
- Parameters:
constant (List[str]) – List of fields
fields_to_remove (List[str]) – List of fields to remove from constant
- Return type:
List[str]
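A minimal sketch of the behaviour described above: drop the requested fields and report any that were not in the original list. `remove_fields_sketch` is a hypothetical helper. It returns a new list and reports through the logging module; whether the library function mutates its input or reports differently is not stated here, so treat both as assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def remove_fields_sketch(constant, fields_to_remove):
    """Return constant without fields_to_remove, logging any that were absent."""
    missing = [field for field in fields_to_remove if field not in constant]
    for field in missing:
        logger.info("%s missing from %s", field, constant)
    # Build a new list so the caller's list is left untouched.
    return [field for field in constant if field not in fields_to_remove]


pops = ["afr", "amr", "eas", "nfe"]
# "oth" is not in pops, so it is only reported, not removed.
kept = remove_fields_sketch(pops, ["eas", "oth"])
```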
- gnomad.utils.filtering.filter_x_nonpar(t)[source]
Filter to loci that are in non-PAR regions on chromosome X.
- Parameters:
t (Union[Table, MatrixTable]) – Input Table or MatrixTable.
- Return type:
Union[Table, MatrixTable]
- Returns:
Filtered Table or MatrixTable.
- gnomad.utils.filtering.filter_y_nonpar(t)[source]
Filter to loci that are in non-PAR regions on chromosome Y.
- Parameters:
t (Union[Table, MatrixTable]) – Input Table or MatrixTable.
- Return type:
Union[Table, MatrixTable]
- Returns:
Filtered Table or MatrixTable.
- gnomad.utils.filtering.filter_by_numeric_expr_range(t, filter_expr, filter_range, keep_between=True, inclusive=True)[source]
Filter rows in the Table/MatrixTable based on the range of a numeric expression.
- Parameters:
t (Union[MatrixTable, Table]) – Input Table/MatrixTable.
filter_expr (NumericExpression) – NumericExpression to apply filter_range to.
filter_range (tuple) – Range of values to apply to filter_expr.
keep_between (bool) – Whether to keep the values between filter_range instead of keeping values outside filter_range. Default is True.
inclusive (bool) – Whether or not to include the filter_range values themselves. Default is True.
- Return type:
Union[MatrixTable, Table]
- Returns:
Table/MatrixTable filtered to rows with specified criteria.
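The interaction between keep_between and inclusive is easier to see on plain numbers. This sketch uses a hypothetical `keep_value` helper on Python floats instead of a Hail NumericExpression. It assumes that, when keeping values outside the range, `inclusive=True` also keeps the boundary values themselves.

```python
def keep_value(value, filter_range, keep_between=True, inclusive=True):
    """Return True if value should be kept under the given range settings."""
    lower, upper = filter_range
    if keep_between:
        if inclusive:
            return lower <= value <= upper
        return lower < value < upper
    # Keeping values outside the range.
    # Assumption: inclusive=True keeps the boundary values here as well.
    if inclusive:
        return value <= lower or value >= upper
    return value < lower or value > upper


values = [0.0, 0.5, 1.0, 1.5]
inside = [v for v in values if keep_value(v, (0.5, 1.0))]
strictly_inside = [v for v in values if keep_value(v, (0.5, 1.0), inclusive=False)]
outside = [
    v for v in values if keep_value(v, (0.5, 1.0), keep_between=False, inclusive=False)
]
```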
- gnomad.utils.filtering.filter_for_mu(ht, gerp_lower_cutoff=-3.9885, gerp_upper_cutoff=2.6607)[source]
Filter to non-coding annotations and remove GERP outliers.
Note
Values for gerp_lower_cutoff and gerp_upper_cutoff default to -3.9885 and 2.6607, respectively. These values were precalculated on the GRCh37 context table and define the 5th and 95th percentiles.
- Parameters:
ht (Table) – Input Table.
gerp_lower_cutoff (float) – Minimum GERP score for variant to be included. Default is -3.9885.
gerp_upper_cutoff (float) – Maximum GERP score for variant to be included. Default is 2.6607.
- Return type:
- Returns:
Table filtered to intron or intergenic variants with GERP outliers removed.
- gnomad.utils.filtering.split_vds_by_strata(vds, strata_expr)[source]
Split a VDS into multiple VDSs based on strata_expr.
- Parameters:
vds (VariantDataset) – Input VDS.
strata_expr (Expression) – Expression on VDS variant_data MT to split on.
- Return type:
Dict[str, VariantDataset]
- Returns:
Dictionary where strata value is key and VDS is value.
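The shape of the returned dictionary can be pictured with a plain-Python grouping. In this sketch each "VDS" is modelled as a list of sample IDs, and `split_samples_by_strata` is a hypothetical helper. Treating the strata as one value per sample is an assumption made for the illustration.

```python
from collections import defaultdict


def split_samples_by_strata(sample_strata):
    """Group sample IDs by their strata value, keyed by the value as a string."""
    groups = defaultdict(list)
    for sample_id, stratum in sample_strata.items():
        groups[str(stratum)].append(sample_id)
    return dict(groups)


# Hypothetical per-sample strata values.
strata = {"S1": "batch_a", "S2": "batch_b", "S3": "batch_a"}
split = split_samples_by_strata(strata)
```

Each key of `split` corresponds to one strata value, mirroring the Dict[str, VariantDataset] return type above.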
- gnomad.utils.filtering.filter_arrays_by_meta(meta_expr, meta_indexed_exprs, items_to_filter, keep=True, combine_operator='and', exact_match=False)[source]
Filter both metadata array expression and meta data indexed expression by items_to_filter.
The items_to_filter can be used to filter in the following ways based on meta_expr items:
  - By a list of keys, e.g. [“sex”, “downsampling”].
  - By specific key: value pairs, e.g. to filter where ‘pop’ is ‘han’ or ‘papuan’ {“pop”: [“han”, “papuan”]}, or where ‘pop’ is ‘afr’ and/or ‘sex’ is ‘XX’ {“pop”: [“afr”], “sex”: [“XX”]}.
The items can be kept or removed from meta_indexed_exprs and meta_expr based on the value of keep. For example, if meta_indexed_exprs is {‘freq’: ht.freq, ‘freq_meta_sample_count’: ht.index_globals().freq_meta_sample_count} and meta_expr is ht.freq_meta, then with keep set to True the items specified by items_to_filter, such as ‘pop’ = ‘han’, will be kept and all other items will be removed from ht.freq, ht.freq_meta_sample_count, and ht.freq_meta. meta_indexed_exprs can also be a single array expression such as ht.freq.
The filtering can also be applied such that all criteria must be met (combine_operator = “and”) by the meta_expr item in order to be filtered, or at least one of the specified criteria must be met (combine_operator = “or”) by the meta_expr item in order to be filtered.
The exact_match parameter can be used to apply the keep parameter to only items specified in the items_to_filter parameter. For example, by default, if keep is True, combine_operator is “and”, and items_to_filter is [“sex”, “downsampling”], then all items in meta_expr with both “sex” and “downsampling” as keys will be kept. However, if exact_match is True, then the items in meta_expr will only be kept if “sex” and “downsampling” are the only keys in the meta dict.
- Parameters:
meta_expr (
ArrayExpression
) – Metadata expression that contains the values of the elements in meta_indexed_expr. The most often used expression is freq_meta to index into a ‘freq’ array.meta_indexed_exprs (
Union
[Dict
[str
,ArrayExpression
],ArrayExpression
]) – Either a Dictionary where the keys are the expression name and the values are the expressions indexed by the meta_expr such as a ‘freq’ array or just a single expression indexed by the meta_expr.items_to_filter (
Union
[Dict
[str
,List
[str
]],List
[str
]]) – Items to filter by, either a list or a dictionary.keep (
bool
) – Whether to keep or remove the items specified by items_to_filter.combine_operator (
str
) – Whether to use “and” or “or” to combine the items specified by items_to_filter.exact_match (
bool
) – Whether to apply the keep parameter to only the items specified in the items_to_filter parameter or to all items in meta_expr. See the example above for more details. Default is False.
- Return type:
Tuple
[ArrayExpression
,Union
[Dict
[str
,ArrayExpression
],ArrayExpression
]]- Returns:
A Tuple of the filtered metadata expression and a dictionary of metadata indexed expressions when meta_indexed_expr is a Dictionary or a single filtered array expression when meta_indexed_expr is a single array expression.
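The keep, combine_operator, and exact_match options described above can be tried out on plain Python lists. The following is a sketch under stated assumptions, not the library's implementation: `filter_arrays_by_meta_sketch` is a hypothetical helper, the metadata is a simplified list of dicts, and exact_match is read literally as "the item's keys are exactly the keys in items_to_filter".

```python
def filter_arrays_by_meta_sketch(meta, indexed, items_to_filter, keep=True,
                                 combine_operator="and", exact_match=False):
    """Filter meta and the array(s) indexed by it, in parallel."""
    if combine_operator not in ("and", "or"):
        raise ValueError("combine_operator must be 'and' or 'or'")
    combine = all if combine_operator == "and" else any
    # A list of keys is treated as a dict whose values accept anything.
    if not isinstance(items_to_filter, dict):
        items_to_filter = {key: None for key in items_to_filter}

    def matches(item):
        # Assumption: exact_match means the key sets are identical.
        if exact_match and set(item) != set(items_to_filter):
            return False
        hits = [
            key in item and (values is None or item[key] in values)
            for key, values in items_to_filter.items()
        ]
        return combine(hits)

    # Indices to retain: matching items if keep, non-matching otherwise.
    kept = [i for i, item in enumerate(meta) if matches(item) == keep]

    def pick(array):
        return [array[i] for i in kept]

    if isinstance(indexed, dict):
        return pick(meta), {name: pick(array) for name, array in indexed.items()}
    return pick(meta), pick(indexed)


# Simplified, made-up metadata and a parallel array of values.
freq_meta = [{"pop": "afr"}, {"pop": "afr", "sex": "XX"}, {"sex": "XX"},
             {"pop": "nfe", "sex": "XY"}]
freq = [40, 20, 55, 15]

afr_meta, afr_freq = filter_arrays_by_meta_sketch(freq_meta, freq, {"pop": ["afr"]})
pop_only_meta, pop_only = filter_arrays_by_meta_sketch(
    freq_meta, {"freq": freq}, ["pop"], exact_match=True
)
no_sex_meta, no_sex_freq = filter_arrays_by_meta_sketch(
    freq_meta, freq, ["sex"], keep=False
)
```

Passing a dictionary for the indexed arrays returns a dictionary with the same keys, while passing a single list returns a single list, matching the two return shapes described above.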