gnomad_qc.v5.resources.sample_qc

Script containing sample QC-related resources.

Module Functions

gnomad_qc.v5.resources.sample_qc.get_sample_qc_root([...])

Return the root GCS path to sample QC results.

gnomad_qc.v5.resources.sample_qc.get_sample_qc([...])

Get AoU sample QC annotations generated by Hail for the specified stratification.

gnomad_qc.v5.resources.sample_qc.get_aou_mt_union([test])

Get the union of AoU ACAF and exome MatrixTables.

gnomad_qc.v5.resources.sample_qc.get_joint_qc([test])

Get joint (exomes + genomes) gnomAD v4 + AoU dense MatrixTableResource.

gnomad_qc.v5.resources.sample_qc.get_cuking_input_path([...])

Return the path containing the input files read by cuKING.

gnomad_qc.v5.resources.sample_qc.get_cuking_output_path([...])

Return the path containing the output files written by cuKING.

gnomad_qc.v5.resources.sample_qc.relatedness([...])

Get the VersionedTableResource for relatedness results.

gnomad_qc.v5.resources.sample_qc.related_samples_to_drop([...])

Get the VersionedTableResource for samples to drop for genetic ancestry PCA.

gnomad_qc.v5.resources.sample_qc.sample_rankings([test])

Get the VersionedTableResource for sample rankings for genetic ancestry PCA.

gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_loadings([...])

Get the genetic ancestry PCA loadings VersionedTableResource.

gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_scores([...])

Get the genetic ancestry PCA scores VersionedTableResource.

gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_eigenvalues([...])

Get the genetic ancestry PCA eigenvalues VersionedTableResource.

gnomad_qc.v5.resources.sample_qc.gen_anc_rf_path([...])

Path to RF model used for inferring genetic ancestry groups.

gnomad_qc.v5.resources.sample_qc.get_gen_anc_ht([...])

Get the TableResource of samples' inferred genetic ancestry group for the indicated gnomAD version.

gnomad_qc.v5.resources.sample_qc.get_gen_anc_pr_ht([...])

Get the TableResource of genetic ancestry inference precision and recall values.

gnomad_qc.v5.resources.sample_qc.per_grp_min_rf_probs_json_path([...])

Get path to JSON file containing per genetic ancestry group minimum RF probabilities.

gnomad_qc.v5.resources.sample_qc.get_outlier_detection_sample_qc([test])

Get modified sample QC Table for sample outlier detection.

gnomad_qc.v5.resources.sample_qc.stratified_filtering([test])

Get VersionedTableResource for stratified genetic ancestry-based metrics filtering.

gnomad_qc.v5.resources.sample_qc.regressed_filtering([...])

Get VersionedTableResource for regression genetic ancestry-based metrics filtering.

gnomad_qc.v5.resources.sample_qc.nearest_neighbors([...])

Get VersionedTableResource for genetic ancestry group PCA nearest neighbors.

gnomad_qc.v5.resources.sample_qc.nearest_neighbors_filtering([test])

Get VersionedTableResource for nearest neighbors platform/genetic ancestry group-based metrics filtering.

gnomad_qc.v5.resources.sample_qc.finalized_outlier_filtering([test])

Get VersionedTableResource for the finalized outlier filtering.

gnomad_qc.v5.resources.sample_qc.get_sample_qc_root(version='5.0', test=False, data_type='genomes', data_set='aou')

Return the root GCS path to sample QC results.

Parameters:
  • version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).

  • test (bool) – If True, return a temporary path (e.g., for testing or development).

  • data_type (str) – Data type (e.g., “genomes” or “exomes”).

  • data_set (str) – Dataset identifier (e.g., “aou”, “hgdp_tgp”).

Return type:

str

Returns:

GCS path to the sample QC directory.
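The way the arguments feed into the returned path can be pictured with a small stand-in. This is a minimal sketch, not the module's implementation: the bucket names and the directory layout are hypothetical placeholders, and only the parameter names and defaults are taken from the signature above.

```python
def sketch_sample_qc_root(
    version: str = "5.0",
    test: bool = False,
    data_type: str = "genomes",
    data_set: str = "aou",
) -> str:
    """Illustrative stand-in for get_sample_qc_root; the layout is assumed."""
    # Hypothetical bucket names. The real buckets are defined inside gnomad_qc.
    bucket = "gs://example-tmp-bucket" if test else "gs://example-bucket"
    return f"{bucket}/sample_qc/{data_type}/{data_set}/{version}"


# With gnomad_qc installed, the corresponding real call would be, for example,
# get_sample_qc_root(test=True, data_set="hgdp_tgp").
print(sketch_sample_qc_root())
print(sketch_sample_qc_root(test=True, data_set="hgdp_tgp"))
```

The point of the sketch is only that test=True redirects to a temporary location while the remaining arguments select a subdirectory; the actual path components may differ.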

gnomad_qc.v5.resources.sample_qc.get_sample_qc(strat='all', test=False)

Get AoU sample QC annotations generated by Hail for the specified stratification.

Possible values for strat:
  • bi_allelic

  • multi_allelic

  • all

Parameters:
  • strat (str) – Which stratification to return.

  • test (bool) – Whether to use a tmp path for analysis of the test VDS instead of the full VDS.

Return type:

VersionedTableResource

Returns:

Sample QC table.
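Because strat accepts only the three values listed above, a caller may want to check the value before requesting the resource. The following is a minimal caller-side sketch; check_strat is a hypothetical helper, and this documentation does not state how get_sample_qc itself reacts to an unlisted value.

```python
ALLOWED_STRATS = ("bi_allelic", "multi_allelic", "all")


def check_strat(strat: str = "all") -> str:
    """Return strat unchanged if it is a documented value, else raise."""
    if strat not in ALLOWED_STRATS:
        raise ValueError(f"strat must be one of {ALLOWED_STRATS}, got {strat!r}")
    return strat


# With gnomad_qc installed, the checked value could then be passed on, e.g.
# get_sample_qc(strat=check_strat("bi_allelic")).
print(check_strat("bi_allelic"))
```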

gnomad_qc.v5.resources.sample_qc.get_aou_mt_union(test=True)

Get the union of AoU ACAF and exome MatrixTables.

Parameters:

test (bool) – Whether to use a tmp path for a test resource. Default is True.

Return type:

MatrixTableResource

Returns:

MatrixTableResource containing the union of AoU ACAF and exome MTs.

gnomad_qc.v5.resources.sample_qc.get_joint_qc(test=False)

Get joint (exomes + genomes) gnomAD v4 + AoU dense MatrixTableResource.

Parameters:

test (bool) – Whether to use a tmp path for a test resource.

Return type:

VersionedMatrixTableResource

Returns:

VersionedMatrixTableResource of QC sites.

gnomad_qc.v5.resources.sample_qc.get_cuking_input_path(version='5.0', test=False, environment='rwb')

Return the path containing the input files read by cuKING.

Those files correspond to Parquet tables derived from the dense QC matrix.

Parameters:
  • version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).

  • test (bool) – Whether to return a path corresponding to a test subset. Default is False.

  • environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Default is ‘rwb’.

Return type:

str

Returns:

Temporary path to hold Parquet input tables for running cuKING.

gnomad_qc.v5.resources.sample_qc.get_cuking_output_path(version='5.0', test=False, environment='rwb')

Return the path containing the output files written by cuKING.

Those files correspond to Parquet tables containing relatedness results.

Parameters:
  • version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).

  • test (bool) – Whether to return a path corresponding to a test subset. Default is False.

  • environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Default is ‘rwb’.

Return type:

str

Returns:

Temporary path to hold the Parquet output tables written by cuKING.
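get_cuking_input_path and get_cuking_output_path take the same three arguments, so a caller may want to request both paths from one set of values to keep them consistent. The sketch below illustrates that idea under stated assumptions: cuking_io_paths, fake_input, and fake_output are hypothetical names, and the stand-in path layout is invented so that the example runs without gnomad_qc.

```python
from typing import Callable, Tuple

VALID_ENVIRONMENTS = ("dataproc", "rwb")


def cuking_io_paths(
    get_input: Callable[..., str],
    get_output: Callable[..., str],
    version: str = "5.0",
    test: bool = False,
    environment: str = "rwb",
) -> Tuple[str, str]:
    """Return (input_path, output_path) built from identical arguments."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"environment must be one of {VALID_ENVIRONMENTS}")
    kwargs = {"version": version, "test": test, "environment": environment}
    return get_input(**kwargs), get_output(**kwargs)


# Hypothetical stand-ins; the real path layout is not shown in this page.
def fake_input(version: str, test: bool, environment: str) -> str:
    return f"gs://example-{environment}/{'test/' if test else ''}{version}/cuking_input"


def fake_output(version: str, test: bool, environment: str) -> str:
    return f"gs://example-{environment}/{'test/' if test else ''}{version}/cuking_output"


print(cuking_io_paths(fake_input, fake_output, test=True))
```

With the package installed, the two real functions could be passed in place of the stand-ins, since they accept the same keyword arguments.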

gnomad_qc.v5.resources.sample_qc.relatedness(test=False, raw=False)

Get the VersionedTableResource for relatedness results.

Parameters:
  • test (bool) – Whether to use a tmp path for a test resource.

  • raw (bool) – Whether to return the raw cuKING output in Hail Table format. If False, returns the processed relatedness table. Default is False.

Return type:

VersionedTableResource

Returns:

VersionedTableResource for relatedness results.

gnomad_qc.v5.resources.sample_qc.related_samples_to_drop(test=False, release=False)

Get the VersionedTableResource for samples to drop for genetic ancestry PCA.

Parameters:
  • test (bool) – Whether to use a tmp path for a test resource.

  • release (bool) – Whether to determine related samples to drop for the release based on outlier filtering of sample QC metrics. Also drops non-released v4 samples and consent drop samples. Default is False.

Return type:

VersionedTableResource

Returns:

VersionedTableResource of related samples to drop.

gnomad_qc.v5.resources.sample_qc.sample_rankings(test=False)

Get the VersionedTableResource for sample rankings for genetic ancestry PCA.

Parameters:

test (bool) – Whether to use a tmp path for a test resource.

Return type:

VersionedTableResource

Returns:

VersionedTableResource of sample rankings.

gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_loadings(include_unreleasable_samples=False, test=False, data_type='joint')

Get the genetic ancestry PCA loadings VersionedTableResource.

Parameters:
  • include_unreleasable_samples (bool) – Whether to get the PCA loadings from the PCA that used unreleasable samples.

  • test (bool) – Whether to use a temp path.

  • data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.

Return type:

VersionedTableResource

Returns:

Genetic ancestry PCA loadings.

gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_scores(include_unreleasable_samples=False, test=False, data_type='joint', projection=False)

Get the genetic ancestry PCA scores VersionedTableResource.

Parameters:
  • include_unreleasable_samples (bool) – Whether to get the PCA scores from the PCA that used unreleasable samples.

  • test (bool) – Whether to use a temp path.

  • data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.

  • projection (bool) – Whether the scores Table (ht) includes projection scores rather than only the original scores.

Return type:

VersionedTableResource

Returns:

Genetic ancestry PCA scores.

gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_eigenvalues(include_unreleasable_samples=False, test=False, data_type='joint')

Get the genetic ancestry PCA eigenvalues VersionedTableResource.

Parameters:
  • include_unreleasable_samples (bool) – Whether to get the PCA eigenvalues from the PCA that used unreleasable samples.

  • test (bool) – Whether to use a temp path.

  • data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.

Return type:

VersionedTableResource

Returns:

Genetic ancestry PCA eigenvalues.

gnomad_qc.v5.resources.sample_qc.gen_anc_rf_path(version='5.0', test=False, data_type='joint')

Path to RF model used for inferring genetic ancestry groups.

Parameters:
  • version (str) – gnomAD version.

  • test (bool) – Whether the RF assignment was from a test dataset.

  • data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.

Return type:

str

Returns:

String path to sample genetic ancestry group RF model.

gnomad_qc.v5.resources.sample_qc.get_gen_anc_ht(version='5.0', test=False, data_type='joint')

Get the TableResource of samples’ inferred genetic ancestry group for the indicated gnomAD version.

Parameters:
  • version (str) – Version of gen anc group TableResource to return.

  • test (bool) – Whether to use the test version of the genetic ancestry TableResource.

  • data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.

Returns:

TableResource of sample gen anc groups.

gnomad_qc.v5.resources.sample_qc.get_gen_anc_pr_ht(version='5.0', test=False, data_type='joint')

Get the TableResource of genetic ancestry inference precision and recall values.

Parameters:
  • version (str) – Version of gen anc group PR TableResource to return.

  • test (bool) – Whether to use the test version of the gen anc group PR TableResource.

  • data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.

Returns:

TableResource of genetic ancestry inference PR values.
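For readers unfamiliar with the two quantities stored in this resource, the sketch below computes per-group precision and recall from paired truth and predicted labels using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). It is an illustration only: the function name and the labels are made up, and the schema of the actual table, including how it was computed, is not described in this documentation.

```python
from collections import Counter
from typing import Dict, List, Tuple


def precision_recall_by_group(
    truth: List[str], predicted: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {group: (precision, recall)} from paired truth/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(truth, predicted):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # sample was called as group p but belongs elsewhere
            fn[t] += 1  # sample belonging to group t was missed
    result = {}
    for group in sorted(set(truth) | set(predicted)):
        called = tp[group] + fp[group]
        actual = tp[group] + fn[group]
        precision = tp[group] / called if called else float("nan")
        recall = tp[group] / actual if actual else float("nan")
        result[group] = (precision, recall)
    return result


# Made-up labels for illustration only.
truth = ["afr", "afr", "eas", "eas", "nfe"]
predicted = ["afr", "eas", "eas", "eas", "nfe"]
print(precision_recall_by_group(truth, predicted))
```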

gnomad_qc.v5.resources.sample_qc.per_grp_min_rf_probs_json_path(version='5.0')

Get path to JSON file containing per genetic ancestry group minimum RF probabilities.

Parameters:

version (str) – Version of the JSON to return.

Returns:

Path to per genetic ancestry group minimum RF probabilities JSON.
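One plausible way such per-group minimums could be applied is sketched below. Everything here is an assumption made for illustration: the JSON is taken to be a flat object mapping a group label to a minimum probability, the values are invented, and assign_group and its "remaining" fallback label are hypothetical rather than part of this module.

```python
import json
from typing import Dict

# Assumed file content: a flat JSON object mapping group label to a minimum
# RF probability. The values are invented for illustration.
example_json = '{"afr": 0.90, "eas": 0.75}'
min_probs: Dict[str, float] = json.loads(example_json)


def assign_group(
    probs: Dict[str, float],
    min_probs: Dict[str, float],
    fallback: str = "remaining",
) -> str:
    """Assign the most probable group if it clears that group's minimum."""
    best_group = max(probs, key=probs.get)
    # A group missing from min_probs is treated as never passing (assumption).
    threshold = min_probs.get(best_group, float("inf"))
    return best_group if probs[best_group] >= threshold else fallback


print(assign_group({"afr": 0.95, "eas": 0.05}, min_probs))
print(assign_group({"afr": 0.80, "eas": 0.20}, min_probs))
```

A per-group threshold of this kind lets groups that the classifier separates less cleanly require a higher probability before a sample is assigned to them.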

gnomad_qc.v5.resources.sample_qc.get_outlier_detection_sample_qc(test=False)

Get modified sample QC Table for sample outlier detection.

This table has the following modifications:
  • Remove hard filtered samples

  • Add project prefix to sample collisions

  • Add ‘r_snp_indel’ metric

  • Sample 1% of the dataset if test is True

Parameters:

test (bool) – Whether to use the test version of the sample QC TableResource.

Return type:

VersionedTableResource

Returns:

Modified sample QC Table.
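The four modifications listed above can be sketched on plain Python dictionaries. This is not the module's code, which presumably operates on Hail Tables. The field names (s, project, hard_filtered, n_snp, n_insertion, n_deletion), the prefix format, the way collisions are supplied, and the definition of r_snp_indel as SNPs divided by insertions plus deletions are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Set


def modify_sample_qc(
    rows: List[Dict],
    colliding_ids: Set[str],
    test: bool = False,
    seed: int = 0,
) -> List[Dict]:
    """Apply the four listed modifications to plain-dict sample QC rows."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        # 1. Remove hard-filtered samples.
        if row["hard_filtered"]:
            continue
        new = dict(row)
        # 2. Add the project prefix to colliding sample IDs.
        if new["s"] in colliding_ids:
            new["s"] = f"{new['project']}_{new['s']}"
        # 3. Add r_snp_indel (assumed: SNPs over insertions plus deletions).
        n_indel = new["n_insertion"] + new["n_deletion"]
        new["r_snp_indel"] = new["n_snp"] / n_indel if n_indel else None
        # 4. Keep roughly 1% of samples when test is True.
        if test and rng.random() >= 0.01:
            continue
        out.append(new)
    return out


# Invented rows for illustration only.
rows = [
    {"s": "S1", "project": "aou", "hard_filtered": False,
     "n_snp": 4000, "n_insertion": 250, "n_deletion": 250},
    {"s": "S1", "project": "gnomad", "hard_filtered": False,
     "n_snp": 3000, "n_insertion": 300, "n_deletion": 300},
    {"s": "S2", "project": "aou", "hard_filtered": True,
     "n_snp": 10, "n_insertion": 1, "n_deletion": 1},
]
for r in modify_sample_qc(rows, colliding_ids={"S1"}):
    print(r["s"], r["r_snp_indel"])
```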

gnomad_qc.v5.resources.sample_qc.stratified_filtering(test=False)

Get VersionedTableResource for stratified genetic ancestry-based metrics filtering.

Parameters:

test (bool) – Whether to use a tmp path for a test resource.

Return type:

VersionedTableResource

Returns:

VersionedTableResource for stratified metrics filtering.

gnomad_qc.v5.resources.sample_qc.regressed_filtering(test=False, include_unreleasable_samples=False)

Get VersionedTableResource for regression genetic ancestry-based metrics filtering.

Parameters:
  • test (bool) – Whether to use a tmp path for a test resource.

  • include_unreleasable_samples (bool) – Whether to get the resource that included unreleasable samples in the regression.

Return type:

VersionedTableResource

Returns:

VersionedTableResource for regression-based metrics filtering.

gnomad_qc.v5.resources.sample_qc.nearest_neighbors(test=False, approximation=False, include_unreleasable_samples=False)

Get VersionedTableResource for genetic ancestry group PCA nearest neighbors.

Parameters:
  • test (bool) – Whether to use a tmp path for a test resource.

  • approximation (bool) – Whether to get the resource computed with approximate nearest neighbors.

  • include_unreleasable_samples (bool) – Whether to get the resource that included unreleasable samples in the nearest neighbors determination.

Return type:

VersionedTableResource

Returns:

VersionedTableResource of PCA nearest neighbors.

gnomad_qc.v5.resources.sample_qc.nearest_neighbors_filtering(test=False)

Get VersionedTableResource for nearest neighbors platform/genetic ancestry group-based metrics filtering.

Parameters:

test (bool) – Whether to use a tmp path for a test resource.

Return type:

VersionedTableResource

Returns:

VersionedTableResource for nearest neighbors-based metrics filtering.

gnomad_qc.v5.resources.sample_qc.finalized_outlier_filtering(test=False)

Get VersionedTableResource for the finalized outlier filtering.

Parameters:

test (bool) – Whether to use a tmp path for a test resource.

Return type:

VersionedTableResource

Returns:

VersionedTableResource for the finalized outlier filtering.