gnomad.sample_qc.filtering
compute_qc_metrics_residuals | Compute QC metrics residuals after regressing out PCs (and optionally PC^2).
compute_stratified_metrics_filter | Compute median, MAD, and upper and lower thresholds for each metric used in outlier filtering.
compute_stratified_sample_qc | Run hl.sample_qc on different strata and then also merge the results into a single expression.
merge_sample_qc_expr | Create an expression that merges results from non-overlapping strata of hail.sample_qc.
determine_nearest_neighbors | Determine the nearest neighbors of each sample with information in scores_expr.
- gnomad.sample_qc.filtering.compute_qc_metrics_residuals(ht, pc_scores, qc_metrics, use_pc_square=True, n_pcs=None, regression_sample_inclusion_expr=<BooleanExpression of type bool>, strata=None)
Compute QC metrics residuals after regressing out PCs (and optionally PC^2).
Note
The regression_sample_inclusion_expr can be used to select a subset of the samples to include in the regression calculation. Residuals are always computed for all samples.
- Parameters:
ht (Table) – Input sample QC metrics HT.
pc_scores (ArrayNumericExpression) – The expression in the input HT that stores the PC scores.
qc_metrics (Dict[str, NumericExpression]) – A dictionary with the name of each QC metric to compute residuals for and its corresponding expression in the input HT.
use_pc_square (bool) – Whether to use PC^2 in the regression or not.
n_pcs (Optional[int]) – Number of PCs to use. If not set, then all PCs in pc_scores are used.
regression_sample_inclusion_expr (BooleanExpression) – An optional expression to select samples to include in the regression calculation.
strata (Optional[Dict[str, Expression]]) – Optional dictionary used for stratification. Keys are strata names and values are filtering expressions. These expressions should refer to data with discrete types!
- Return type:
- Returns:
Table with QC metrics residuals.
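The idea described above (fit the regression on a subset of samples, then compute residuals for every sample) can be illustrated outside Hail. The following is a minimal pure-Python sketch, not the library's implementation: the helper names `solve_linear_system`, `design_row`, and `metric_residuals` are hypothetical, and ordinary least squares via the normal equations is an assumption made for illustration.

```python
from typing import List, Optional, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def design_row(pcs: Sequence[float], use_pc_square: bool) -> List[float]:
    """Build one regression row: intercept, PCs, and optionally squared PCs."""
    row = [1.0] + list(pcs)
    if use_pc_square:
        row += [pc * pc for pc in pcs]
    return row


def metric_residuals(
    pc_scores: Sequence[Sequence[float]],
    metric: Sequence[float],
    include: Sequence[bool],
    use_pc_square: bool = True,
    n_pcs: Optional[int] = None,
) -> List[float]:
    """Fit on the included samples only, then return residuals for all samples."""
    rows = [design_row(pcs[:n_pcs], use_pc_square) for pcs in pc_scores]
    fit = [(row, y) for row, y, keep in zip(rows, metric, include) if keep]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y, using the included samples only.
    xtx = [[sum(r[i] * r[j] for r, _ in fit) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in fit) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return [y - sum(b * v for b, v in zip(beta, row)) for row, y in zip(rows, metric)]


# Toy data: one PC; the metric is an exact quadratic in the PC for five samples.
pc_scores = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
metric = [2.0 + 3.0 * p[0] + 0.5 * p[0] * p[0] for p in pc_scores]
metric[5] += 10.0  # the last sample is shifted and excluded from the fit
include = [True, True, True, True, True, False]
residuals = metric_residuals(pc_scores, metric, include)
```

Because the excluded sample does not influence the fit, its residual retains the full shift, while the included samples have residuals near zero.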
- gnomad.sample_qc.filtering.compute_stratified_metrics_filter(ht, qc_metrics, strata=None, lower_threshold=4.0, upper_threshold=4.0, metric_threshold=None, filter_name='qc_metrics_filters', comparison_sample_expr=None)
Compute median, MAD, and upper and lower thresholds for each metric used in outlier filtering.
- Parameters:
ht (Table) – HT containing relevant sample QC metric annotations.
qc_metrics (Dict[str, NumericExpression]) – Dictionary of metrics (name and expression) for which to compute the critical values for filtering outliers.
strata (Optional[Dict[str, Expression]]) – Dictionary of annotations used for stratification. These annotations should be discrete types!
lower_threshold (float) – Lower MAD threshold.
upper_threshold (float) – Upper MAD threshold.
metric_threshold (Optional[Dict[str, Tuple[float, float]]]) – Can be used to specify different (lower, upper) thresholds for one or more metrics.
filter_name (str) – Name of resulting filters annotation.
comparison_sample_expr (Union[BooleanExpression, CollectionExpression, None]) – Optional BooleanExpression or CollectionExpression of sample IDs to use for computation of the metric median, MAD, and upper and lower thresholds to use for each sample. For instance, this works well with the output of determine_nearest_neighbors or a boolean expression defining releasable samples.
- Return type:
- Returns:
Table grouped by strata, with upper and lower threshold values computed for each sample QC metric.
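To make the median/MAD thresholds concrete, here is a minimal pure-Python sketch of per-stratum outlier cutoffs. It is an illustration under stated assumptions, not the library's code: the helpers `mad_thresholds` and `stratified_outliers` are hypothetical, and the 1.4826 scaling of the MAD is a common convention assumed here, since the text above does not say whether a scale factor is applied.

```python
from collections import defaultdict
from statistics import median


def mad_thresholds(values, lower_threshold=4.0, upper_threshold=4.0, mad_scale=1.4826):
    """Return the median, scaled MAD, and lower/upper cutoffs for one metric."""
    med = median(values)
    # mad_scale is an assumption; set it to 1.0 for the unscaled MAD.
    mad = mad_scale * median([abs(v - med) for v in values])
    return {
        "median": med,
        "mad": mad,
        "lower": med - lower_threshold * mad,
        "upper": med + upper_threshold * mad,
    }


def stratified_outliers(samples, metric, stratum):
    """Compute cutoffs within each stratum and collect samples outside them."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[stratum]].append(sample[metric])
    thresholds = {key: mad_thresholds(vals) for key, vals in groups.items()}
    failing = set()
    for sample in samples:
        cutoffs = thresholds[sample[stratum]]
        if not (cutoffs["lower"] <= sample[metric] <= cutoffs["upper"]):
            failing.add(sample["s"])
    return thresholds, failing


# Toy data: two strata with different typical values of a hypothetical n_snp metric.
samples = [
    {"s": "a1", "pop": "A", "n_snp": 100},
    {"s": "a2", "pop": "A", "n_snp": 102},
    {"s": "a3", "pop": "A", "n_snp": 98},
    {"s": "a4", "pop": "A", "n_snp": 101},
    {"s": "a5", "pop": "A", "n_snp": 99},
    {"s": "a6", "pop": "A", "n_snp": 150},
    {"s": "b1", "pop": "B", "n_snp": 200},
    {"s": "b2", "pop": "B", "n_snp": 205},
    {"s": "b3", "pop": "B", "n_snp": 195},
    {"s": "b4", "pop": "B", "n_snp": 202},
    {"s": "b5", "pop": "B", "n_snp": 198},
]
thresholds, failing = stratified_outliers(samples, metric="n_snp", stratum="pop")
```

Stratifying matters here: a value of 200 is typical in stratum B but would be far outside the cutoffs computed for stratum A.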
- gnomad.sample_qc.filtering.compute_stratified_sample_qc(mtds, strata, tmp_ht_prefix, gt_col=None)
Run hl.sample_qc on different strata and then also merge the results into a single expression.
Note
Strata should be non-overlapping, e.g. SNVs vs. indels or bi-allelic vs. multi-allelic variants.
- Parameters:
mtds (Union[MatrixTable, VariantDataset]) – Input MatrixTable or VariantDataset.
strata (Dict[str, BooleanExpression]) – Strata names and filtering expressions.
tmp_ht_prefix (Optional[str]) – Optional path prefix to write the intermediate strata results to (recommended for larger datasets).
gt_col (Optional[str]) – Name of entry field storing the genotype. Default: 'GT'.
- Return type:
- Returns:
Sample QC table, including strata-specific numbers.
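The requirement that strata be non-overlapping can be seen in a small pure-Python sketch: when each variant falls in exactly one stratum, per-stratum counts add up to the overall count. This is only an illustration of the idea; the data layout and the helper `stratified_n_non_ref` are hypothetical and unrelated to how Hail stores genotypes.

```python
from typing import Callable, Dict, List

# Toy data: each variant carries a flag and one alt-allele count per sample.
variants = [
    {"is_snv": True, "gts": [0, 1, 2]},
    {"is_snv": True, "gts": [1, 0, 1]},
    {"is_snv": False, "gts": [0, 0, 1]},
    {"is_snv": False, "gts": [1, 1, 0]},
]

# Non-overlapping strata: every variant satisfies exactly one predicate.
strata: Dict[str, Callable[[dict], bool]] = {
    "snv": lambda v: v["is_snv"],
    "indel": lambda v: not v["is_snv"],
}


def stratified_n_non_ref(variants, strata, n_samples) -> Dict[str, List[int]]:
    """Count non-reference genotypes per sample within each stratum."""
    counts = {name: [0] * n_samples for name in strata}
    for variant in variants:
        for name, keep in strata.items():
            if keep(variant):
                for i, gt in enumerate(variant["gts"]):
                    if gt > 0:
                        counts[name][i] += 1
    return counts


per_stratum = stratified_n_non_ref(variants, strata, n_samples=3)
# Because the strata do not overlap, summing them recovers the overall count.
overall = [sum(per_stratum[name][i] for name in strata) for i in range(3)]
```

If the strata overlapped, a variant would be counted in more than one stratum and the summed result would overstate the overall count.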
- gnomad.sample_qc.filtering.merge_sample_qc_expr(sample_qc_exprs)
Create an expression that merges results from non-overlapping strata of hail.sample_qc.
E.g.:
Compute autosomes and sex chromosomes metrics separately, then merge results
Compute bi-allelic and multi-allelic metrics separately, then merge results
Note regarding the merging of dp_stats and gq_stats: Because n is needed to aggregate stdev, n_called is used for this purpose. This should work very well on a standard GATK VCF and it essentially assumes that:
samples that are called have DP and GQ fields
samples that are not called do not have DP and GQ fields
Even if these assumptions are broken for some genotypes, it shouldn't matter too much.
- Parameters:
sample_qc_exprs (List[StructExpression]) – List of sample QC struct expressions for each stratification.
- Return type:
- Returns:
Combined sample QC results
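The note above explains why a count is needed to combine standard deviations. As a hedged illustration (not the library's code), the sketch below merges per-stratum `n`, `mean`, and `stdev` summaries into one; the helpers `summarize` and `merge_stats` are hypothetical, and the population (not sample) standard deviation is assumed.

```python
import math
from statistics import mean, pstdev


def summarize(values):
    """Summarize one stratum; n plays the role that n_called plays in the note."""
    return {"n": len(values), "mean": mean(values), "stdev": pstdev(values)}


def merge_stats(stats):
    """Merge per-stratum summaries from non-overlapping strata."""
    n = sum(s["n"] for s in stats)
    merged_mean = sum(s["n"] * s["mean"] for s in stats) / n
    # E[x^2] per stratum is stdev^2 + mean^2; weight each stratum by its n.
    second_moment = sum(s["n"] * (s["stdev"] ** 2 + s["mean"] ** 2) for s in stats) / n
    variance = max(second_moment - merged_mean ** 2, 0.0)  # guard against rounding
    return {"n": n, "mean": merged_mean, "stdev": math.sqrt(variance)}


# Toy DP values for one sample, split into two non-overlapping strata.
autosome_dp = [30, 32, 28, 35]
sex_chrom_dp = [15, 18, 20]
merged = merge_stats([summarize(autosome_dp), summarize(sex_chrom_dp)])
direct = summarize(autosome_dp + sex_chrom_dp)
```

The merged summary matches the one computed directly on the pooled values, which is only possible because each stratum's `n` is known. If `n` were wrong for some genotypes, the weights would be slightly off, which is the approximation the note accepts.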
- gnomad.sample_qc.filtering.determine_nearest_neighbors(ht, scores_expr, strata=None, n_pcs=None, n_neighbors=50, n_jobs=-1, add_neighbor_distances=False, distance_metric='euclidean', use_approximation=False, n_trees=10)
Determine the nearest neighbors of each sample with information in scores_expr.
Note
If strata is provided, the nearest neighbors for each sample are limited to the other samples with the same strata values. If n_neighbors is greater than the number of samples in a stratification grouping, all samples within the stratification are returned and a warning is raised indicating that any sample within the stratification group has fewer than the expected n_neighbors.
- The following annotations are in the returned Table:
nearest_neighbors
nearest_neighbor_dists (if add_neighbor_distances is True)
- Parameters:
ht (Table) – Input Table.
scores_expr (ArrayNumericExpression) – Expression in the input HT that stores the PC scores.
strata (Optional[Dict[str, Expression]]) – Optional dictionary used for stratification. Keys are strata names and values are filtering expressions. These expressions should refer to data with discrete types!
n_pcs (Optional[int]) – Number of PCs to use. If not set, then all PCs in scores_expr are used.
n_neighbors (int) – Number of nearest neighbors to identify for each sample. Default is 50.
n_jobs (int) – Number of threads to use when finding the nearest neighbors. Default is -1, which uses the number of CPUs on the head node -1.
add_neighbor_distances (bool) – Whether to return an annotation for the nearest neighbor distances.
distance_metric (str) – Distance metric to use. Default is euclidean. Options using scikit-learn are: "euclidean", "cityblock", "cosine", "haversine", "l1", "l2", and "manhattan". Options using Annoy: "angular", "euclidean", "manhattan", "hamming", and "dot".
use_approximation (bool) – Whether to use the package Annoy to determine approximate nearest neighbors instead of using scikit-learn's NearestNeighbors. This method is faster, but only needed for very large datasets, for instance > 500,000 samples.
n_trees (int) – Number of trees to use in the Annoy approximation approach. n_trees is provided during build time and affects the build time and the index size. A larger value will give more accurate results, but larger indexes. Default is 10.
- Return type:
- Returns:
Table with an annotation for the nearest neighbors and optionally their distances.
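The stratified neighbor search described in the note can be sketched with a brute-force Euclidean search in pure Python. This is an illustration only: the function uses scikit-learn or Annoy according to the parameter descriptions, whereas the hypothetical helper `nearest_neighbors` below uses neither, and it assumes that a sample is excluded from its own neighbor list.

```python
import math
from collections import defaultdict


def nearest_neighbors(samples, n_neighbors=50, n_pcs=None):
    """Map each sample ID to its closest samples within the same stratum.

    samples: dict of sample ID -> (stratum, list of PC scores).
    """
    groups = defaultdict(list)
    for sample_id, (stratum, scores) in samples.items():
        groups[stratum].append((sample_id, scores[:n_pcs]))
    result = {}
    for members in groups.values():
        for sample_id, scores in members:
            # Assumption: a sample is not counted as its own neighbor.
            dists = sorted(
                (math.dist(scores, other_scores), other_id)
                for other_id, other_scores in members
                if other_id != sample_id
            )
            # If the stratum is small, fewer than n_neighbors IDs are returned.
            result[sample_id] = [other_id for _, other_id in dists[:n_neighbors]]
    return result


# Toy data: two strata; stratum "y" has fewer samples than n_neighbors + 1.
samples = {
    "a": ("x", [0.0, 0.0]),
    "b": ("x", [1.0, 0.0]),
    "c": ("x", [5.0, 5.0]),
    "d": ("y", [0.0, 0.1]),
    "e": ("y", [0.2, 0.0]),
}
neighbors = nearest_neighbors(samples, n_neighbors=2)
```

Sample "d" is closer to "a" than to "e" in PC space, but it is only matched to "e" because neighbors are restricted to the same stratum. A neighbor list of this shape is the kind of input that comparison_sample_expr in compute_stratified_metrics_filter is described as accepting.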