gnomad.sample_qc.filtering

gnomad.sample_qc.filtering.compute_qc_metrics_residuals(ht, ...)

Compute QC metrics residuals after regressing out PCs (and optionally PC^2).

gnomad.sample_qc.filtering.compute_stratified_metrics_filter(ht, ...)

Compute median, MAD, and upper and lower thresholds for each metric used in outlier filtering.

gnomad.sample_qc.filtering.compute_stratified_sample_qc(...)

Run hl.sample_qc on different strata, then merge the results into a single expression.

gnomad.sample_qc.filtering.merge_sample_qc_expr(...)

Create an expression that merges results from non-overlapping strata of hail.sample_qc.

gnomad.sample_qc.filtering.determine_nearest_neighbors(ht, ...)

Determine the nearest neighbors of each sample with information in scores_expr.

gnomad.sample_qc.filtering.compute_qc_metrics_residuals(ht, pc_scores, qc_metrics, use_pc_square=True, n_pcs=None, regression_sample_inclusion_expr=<BooleanExpression of type bool>, strata=None)[source]

Compute QC metrics residuals after regressing out PCs (and optionally PC^2).

Note

The regression_sample_inclusion_expr can be used to select a subset of the samples to include in the regression calculation. Residuals are always computed for all samples.

Parameters:
  • ht (Table) – Input sample QC metrics HT.

  • pc_scores (ArrayNumericExpression) – The expression in the input HT that stores the PC scores.

  • qc_metrics (Dict[str, NumericExpression]) – A dictionary with the name of each QC metric to compute residuals for and their corresponding expression in the input HT.

  • use_pc_square (bool) – Whether to use PC^2 in the regression or not.

  • n_pcs (Optional[int]) – Number of PCs to use. If not set, then all PCs in pc_scores are used.

  • regression_sample_inclusion_expr (BooleanExpression) – An optional expression to select samples to include in the regression calculation.

  • strata (Optional[Dict[str, Expression]]) – Optional dictionary used for stratification. Keys are strata names and values are filtering expressions. These expressions should refer to data with discrete types!

Return type:

Table

Returns:

Table with QC metrics residuals.
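
The regression described above can be illustrated outside of Hail. The following is a minimal pure-Python sketch, not the library's implementation: it fits an ordinary least-squares model of one metric on the PCs (plus their squares when use_pc_square is True) using only the samples flagged for inclusion, then computes residuals for every sample. The helper names (solve_linear_system, design_row, metric_residuals) and the toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def design_row(pcs, use_pc_square=True):
    """Build one regression row: intercept, each PC, optionally each PC^2."""
    row = [1.0] + list(pcs)
    if use_pc_square:
        row += [pc * pc for pc in pcs]
    return row


def metric_residuals(pc_scores, metric, include, use_pc_square=True):
    """Fit on included samples only; return residuals for all samples."""
    rows = [design_row(p, use_pc_square) for p in pc_scores]
    fit = [(r, y) for r, y, keep in zip(rows, metric, include) if keep]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r, _ in fit) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in fit) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, metric)]


# Toy data (hypothetical): one PC, a metric that is exactly quadratic in the PC,
# and a sixth sample shifted by +10 that is excluded from the fit.
pc_scores = [[-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]]
metric = [2 + 3 * p[0] + 0.5 * p[0] ** 2 for p in pc_scores]
metric[5] += 10.0
include = [True, True, True, True, True, False]

residuals = metric_residuals(pc_scores, metric, include)
```

In this toy case the five included samples have residuals near zero, while the excluded sample still receives a residual (close to its +10 shift), which mirrors the note that residuals are computed for all samples regardless of regression_sample_inclusion_expr.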

gnomad.sample_qc.filtering.compute_stratified_metrics_filter(ht, qc_metrics, strata=None, lower_threshold=4.0, upper_threshold=4.0, metric_threshold=None, filter_name='qc_metrics_filters', comparison_sample_expr=None)[source]

Compute median, MAD, and upper and lower thresholds for each metric used in outlier filtering.

Parameters:
  • ht (Table) – HT containing relevant sample QC metric annotations.

  • qc_metrics (Dict[str, NumericExpression]) – Dictionary of metrics (name and expression) for which to compute the critical values used to filter outliers.

  • strata (Optional[Dict[str, Expression]]) – Dictionary of annotations used for stratification. These annotations should be of discrete types!

  • lower_threshold (float) – Lower MAD threshold.

  • upper_threshold (float) – Upper MAD threshold.

  • metric_threshold (Optional[Dict[str, Tuple[float, float]]]) – Can be used to specify different (lower, upper) thresholds for one or more metrics.

  • filter_name (str) – Name of resulting filters annotation.

  • comparison_sample_expr (Union[BooleanExpression, CollectionExpression, None]) – Optional BooleanExpression or CollectionExpression of sample IDs to use for computation of the metric median, MAD, and upper and lower thresholds to use for each sample. For instance, this works well with the output of determine_nearest_neighbors or a boolean expression defining releasable samples.

Return type:

Table

Returns:

Table grouped by strata, with upper and lower threshold values computed for each sample QC metric.
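
As an illustration of the median and MAD thresholds described above, here is a minimal pure-Python sketch, not the library's code. It groups samples by a single stratum, computes the median and MAD of one metric per group, and flags samples outside median ± threshold × MAD. The function name, the record layout, and the scaling constant k = 1.4826 (a common choice that makes the MAD comparable to a standard deviation for normal data) are assumptions; the text above does not state whether such a constant is applied.

```python
from collections import defaultdict
from statistics import median


def stratified_mad_filter(samples, metric, stratum_key,
                          lower_threshold=4.0, upper_threshold=4.0, k=1.4826):
    """Return per-stratum stats and the set of sample IDs failing the filter.

    `k` is an assumed MAD scaling constant; set k=1.0 for the raw MAD.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[s[stratum_key]].append(s[metric])

    stats = {}
    for stratum, values in groups.items():
        med = median(values)
        mad = k * median([abs(v - med) for v in values])
        stats[stratum] = {
            "median": med,
            "mad": mad,
            "lower": med - lower_threshold * mad,
            "upper": med + upper_threshold * mad,
        }

    # A sample is an outlier relative to its own stratum only.
    outliers = {
        s["id"]
        for s in samples
        if not stats[s[stratum_key]]["lower"] <= s[metric] <= stats[s[stratum_key]]["upper"]
    }
    return stats, outliers


# Toy data (hypothetical): two strata with very different metric scales.
samples = [
    {"id": "s1", "pop": "A", "n_snp": 10},
    {"id": "s2", "pop": "A", "n_snp": 11},
    {"id": "s3", "pop": "A", "n_snp": 12},
    {"id": "s4", "pop": "A", "n_snp": 13},
    {"id": "s5", "pop": "A", "n_snp": 50},
    {"id": "s6", "pop": "B", "n_snp": 100},
    {"id": "s7", "pop": "B", "n_snp": 101},
    {"id": "s8", "pop": "B", "n_snp": 102},
    {"id": "s9", "pop": "B", "n_snp": 103},
    {"id": "s10", "pop": "B", "n_snp": 104},
]

stats, outliers = stratified_mad_filter(samples, "n_snp", "pop")
```

Here only s5 is flagged: its value is far from the rest of stratum A, whereas the much larger values in stratum B are unremarkable within their own group. This is the motivation for stratifying before computing thresholds.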

gnomad.sample_qc.filtering.compute_stratified_sample_qc(mtds, strata, tmp_ht_prefix, gt_col=None)[source]

Run hl.sample_qc on different strata, then merge the results into a single expression.

Note

Strata should be non-overlapping, e.g. SNVs vs. indels, or bi-allelic vs. multi-allelic variants.

Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – Input MatrixTable or VariantDataset

  • strata (Dict[str, BooleanExpression]) – Strata names and filtering expressions

  • tmp_ht_prefix (Optional[str]) – Optional path prefix to write the intermediate strata results to (recommended for larger datasets)

  • gt_col (Optional[str]) – Name of entry field storing the genotype. Default: ‘GT’

Return type:

Table

Returns:

Sample QC table, including strata-specific numbers.

gnomad.sample_qc.filtering.merge_sample_qc_expr(sample_qc_exprs)[source]

Create an expression that merges results from non-overlapping strata of hail.sample_qc.

E.g.:

  • Compute autosomes and sex chromosomes metrics separately, then merge results

  • Compute bi-allelic and multi-allelic metrics separately, then merge results

Note regarding the merging of dp_stats and gq_stats: Because n is needed to aggregate stdev, n_called is used for this purpose. This should work very well on a standard GATK VCF and it essentially assumes that:

  • samples that are called have DP and GQ fields

  • samples that are not called do not have DP and GQ fields

Even if these assumptions are broken for some genotypes, it shouldn’t matter too much.

Parameters:

sample_qc_exprs (List[StructExpression]) – List of sample QC struct expressions for each stratification

Return type:

StructExpression

Returns:

Combined sample QC results
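
The note about needing n to aggregate stdev can be made concrete with a small pure-Python sketch. It is not the library's code: it shows one standard way to pool mean, stdev, min, and max from non-overlapping strata when each stratum also carries a count (playing the role of n_called). The names summarize and merge_stats are hypothetical, and the sketch assumes population (not sample) standard deviations.

```python
from math import sqrt
from statistics import fmean, pstdev


def summarize(values):
    """Per-stratum summary, with n standing in for n_called."""
    return {
        "mean": fmean(values),
        "stdev": pstdev(values),  # population standard deviation (assumption)
        "min": min(values),
        "max": max(values),
        "n": len(values),
    }


def merge_stats(strata_stats):
    """Pool summaries from non-overlapping strata into one summary."""
    total_n = sum(s["n"] for s in strata_stats)
    mean = sum(s["n"] * s["mean"] for s in strata_stats) / total_n
    # E[x^2] per stratum is stdev^2 + mean^2; weight each by its count.
    second_moment = sum(
        s["n"] * (s["stdev"] ** 2 + s["mean"] ** 2) for s in strata_stats
    ) / total_n
    stdev = sqrt(max(second_moment - mean ** 2, 0.0))
    return {
        "mean": mean,
        "stdev": stdev,
        "min": min(s["min"] for s in strata_stats),
        "max": max(s["max"] for s in strata_stats),
        "n": total_n,
    }


# Toy depth values (hypothetical) for two non-overlapping strata.
snv_dp = [20, 30, 25, 35]
indel_dp = [10, 40]

merged = merge_stats([summarize(snv_dp), summarize(indel_dp)])
direct = summarize(snv_dp + indel_dp)
```

The merged summary matches the one computed directly on the concatenated values, which holds only when the count used for weighting equals the number of values that contributed to each stratum's statistics. That is why the assumptions about called genotypes having DP and GQ matter.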

gnomad.sample_qc.filtering.determine_nearest_neighbors(ht, scores_expr, strata=None, n_pcs=None, n_neighbors=50, n_jobs=-1, add_neighbor_distances=False, distance_metric='euclidean', use_approximation=False, n_trees=10)[source]

Determine the nearest neighbors of each sample with information in scores_expr.

Note

If strata is provided, the nearest neighbors for each sample are limited to the other samples with the same strata values. If n_neighbors is greater than the number of samples in a stratification grouping, all samples within that grouping are returned, and a warning is raised indicating that samples in the grouping have fewer than the expected n_neighbors.

The following annotations are in the returned Table:
  • nearest_neighbors

  • nearest_neighbor_dists (if add_neighbor_distances is True)

Parameters:
  • ht (Table) – Input Table.

  • scores_expr (ArrayNumericExpression) – Expression in the input HT that stores the PC scores.

  • strata (Optional[Dict[str, Expression]]) – Optional dictionary used for stratification. Keys are strata names and values are filtering expressions. These expressions should refer to data with discrete types!

  • n_pcs (Optional[int]) – Number of PCs to use. If not set, then all PCs in scores_expr are used.

  • n_neighbors (int) – Number of nearest neighbors to identify for each sample. Default is 50.

  • n_jobs (int) – Number of threads to use when finding the nearest neighbors. Default is -1, which uses the number of CPUs on the head node minus one.

  • add_neighbor_distances (bool) – Whether to return an annotation for the nearest neighbor distances.

  • distance_metric (str) – Distance metric to use. Default is euclidean. Options using scikit-learn are: “euclidean”, “cityblock”, “cosine”, “haversine”, “l1”, “l2”, and “manhattan”. Options using Annoy: “angular”, “euclidean”, “manhattan”, “hamming”, and “dot”.

  • use_approximation (bool) – Whether to use the package Annoy to determine approximate nearest neighbors instead of using scikit-learn’s NearestNeighbors. This method is faster, but only needed for very large datasets, for instance > 500,000 samples.

  • n_trees (int) – Number of trees to use in the Annoy approximation approach. n_trees is set when the index is built and affects both the build time and the index size. A larger value gives more accurate results but produces larger indexes. Default is 10.

Return type:

Table

Returns:

Table with an annotation for the nearest neighbors and optionally their distances.
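
To make the stratification behaviour concrete, here is a brute-force pure-Python sketch of nearest-neighbor lookup within strata using Euclidean distance. It is illustrative only and does not use scikit-learn or Annoy; the function name, the dictionary-based inputs, and the choice to exclude each sample from its own neighbor list are assumptions.

```python
from collections import defaultdict
from math import dist


def nearest_neighbors_by_stratum(scores, strata=None, n_neighbors=50):
    """Map each sample ID to its closest samples within the same stratum.

    scores: dict of sample ID -> list of PC scores.
    strata: optional dict of sample ID -> hashable stratum label.
    """
    groups = defaultdict(list)
    for sample_id in scores:
        label = strata[sample_id] if strata is not None else None
        groups[label].append(sample_id)

    neighbors = {}
    for members in groups.values():
        for sample_id in members:
            # Assumption: a sample is not counted as its own neighbor.
            others = [o for o in members if o != sample_id]
            others.sort(key=lambda o: dist(scores[sample_id], scores[o]))
            # Small strata simply yield fewer than n_neighbors entries.
            neighbors[sample_id] = others[:n_neighbors]
    return neighbors


# Toy PC scores (hypothetical) and a two-group stratification.
scores = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [3.0, 0.0],
    "d": [0.1, 0.0],
    "e": [0.2, 0.0],
}
strata = {"a": "g1", "b": "g1", "c": "g1", "d": "g2", "e": "g2"}

stratified = nearest_neighbors_by_stratum(scores, strata, n_neighbors=2)
unstratified = nearest_neighbors_by_stratum(scores, n_neighbors=2)
```

With stratification, sample a is matched to b and c even though d and e are closer in PC space, and samples in the two-member stratum g2 receive only one neighbor each. That is the situation in which the function described above raises its warning.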