gnomad_qc.v5.resources.sample_qc
Script containing sample QC related resources.
Module Functions
Return the root GCS path to sample QC results.
Get AoU sample QC annotations generated by Hail for the specified stratification.
Get the union of AoU ACAF and exome MatrixTables.
Get joint (exomes + genomes) gnomAD v4 + AoU dense MatrixTableResource.
Return the path containing the input files read by cuKING.
Return the path containing the output files written by cuKING.
Get the VersionedTableResource for relatedness results.
Get the VersionedTableResource for samples to drop for genetic ancestry PCA.
Get the VersionedTableResource for sample rankings for genetic ancestry PCA or release.
Get the genetic ancestry PCA loadings VersionedTableResource.
Get the genetic ancestry PCA scores VersionedTableResource.
Get the genetic ancestry PCA eigenvalues VersionedTableResource.
Path to RF model used for inferring genetic ancestry groups.
Get the TableResource of samples' inferred genetic ancestry group for the indicated gnomAD version.
Get the TableResource of genetic ancestry inference precision and recall values.
Get path to JSON file containing per genetic ancestry group minimum RF probabilities.
Get modified sample QC Table for sample outlier detection.
Get VersionedTableResource for stratified genetic ancestry-based metrics filtering.
Get VersionedTableResource for regression genetic ancestry-based metrics filtering.
Get VersionedTableResource for genetic ancestry group PCA nearest neighbors.
Get VersionedTableResource for nearest neighbors platform/genetic ancestry group-based metrics filtering.
Get VersionedTableResource for the finalized outlier filtering.
- gnomad_qc.v5.resources.sample_qc.get_sample_qc_root(version='5.0', test=False, data_type='genomes', data_set='aou')
Return the root GCS path to sample QC results.
- Parameters:
version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).
test (bool) – If True, return a temporary path (e.g., for testing or development).
data_type (str) – Data type (e.g., “genomes” or “exomes”).
data_set (str) – Dataset identifier (e.g., “aou”, “hgdp_tgp”).
- Return type:
str
- Returns:
GCS path to the sample QC directory.
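As a rough illustration of how the `test` flag could switch between a permanent and a temporary root, the sketch below composes a path from the same four arguments. The function name `sketch_sample_qc_root`, the bucket names, and the directory layout are hypothetical placeholders; this page does not state the paths that gnomad_qc actually returns.

```python
def sketch_sample_qc_root(
    version: str = "5.0",
    test: bool = False,
    data_type: str = "genomes",
    data_set: str = "aou",
) -> str:
    """Illustrative stand-in for get_sample_qc_root; the layout is assumed."""
    # Hypothetical bucket names: the real buckets are not given on this page.
    bucket = "gs://example-tmp-bucket" if test else "gs://example-bucket"
    # Assumed ordering of path components, chosen only for the example.
    return f"{bucket}/sample_qc/{data_type}/{data_set}/{version}"


print(sketch_sample_qc_root())
print(sketch_sample_qc_root(test=True, data_set="hgdp_tgp"))
```

Keeping the test/production switch inside a single root helper means every downstream resource can inherit it by building on the returned string.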
- gnomad_qc.v5.resources.sample_qc.get_sample_qc(strat='all', test=False)
Get AoU sample QC annotations generated by Hail for the specified stratification.
- Possible values for strat:
bi_allelic
multi_allelic
all
- Parameters:
strat (str) – Which stratification to return.
test (bool) – Whether to use a tmp path for analysis of the test VDS instead of the full VDS.
- Return type:
VersionedTableResource
- Returns:
Sample QC table.
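Since `strat` accepts only three values, a caller may want to validate it before requesting the resource. The guard below is a minimal caller-side sketch; `check_strat` and `VALID_STRATS` are hypothetical names, and whether `get_sample_qc` itself rejects other values is not stated here.

```python
# The three stratifications listed for get_sample_qc.
VALID_STRATS = ("bi_allelic", "multi_allelic", "all")


def check_strat(strat: str = "all") -> str:
    """Return strat unchanged if it is one of the documented values."""
    if strat not in VALID_STRATS:
        raise ValueError(
            f"strat must be one of {VALID_STRATS}, got {strat!r}"
        )
    return strat


print(check_strat("bi_allelic"))
```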
- gnomad_qc.v5.resources.sample_qc.get_aou_mt_union(test=True)
Get the union of AoU ACAF and exome MatrixTables.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource. Default is True.
- Return type:
MatrixTableResource
- Returns:
MatrixTableResource containing the union of AoU ACAF and exome MTs.
- gnomad_qc.v5.resources.sample_qc.get_joint_qc(test=False)
Get joint (exomes + genomes) gnomAD v4 + AoU dense MatrixTableResource.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
- Return type:
VersionedMatrixTableResource
- Returns:
VersionedMatrixTableResource of QC sites.
- gnomad_qc.v5.resources.sample_qc.get_cuking_input_path(version='5.0', test=False, environment='rwb')
Return the path containing the input files read by cuKING.
Those files correspond to Parquet tables derived from the dense QC matrix.
- Parameters:
version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).
test (bool) – Whether to return a path corresponding to a test subset. Default is False.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Default is ‘rwb’.
- Return type:
str
- Returns:
Temporary path to hold Parquet input tables for running cuKING.
- gnomad_qc.v5.resources.sample_qc.get_cuking_output_path(version='5.0', test=False, environment='rwb')
Return the path containing the output files written by cuKING.
Those files correspond to Parquet tables containing relatedness results.
- Parameters:
version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).
test (bool) – Whether to return a path corresponding to a test subset. Default is False.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Default is ‘rwb’.
- Return type:
str
- Returns:
Temporary path to hold Parquet output tables for running cuKING.
Get the VersionedTableResource for relatedness results.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
raw (bool) – Whether to return the raw cuKING output in Hail Table format. If False, returns the processed relatedness table. Default is False.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
Get the VersionedTableResource for samples to drop for genetic ancestry PCA.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
release (bool) – Whether to determine related samples to drop for the release based on outlier filtering of sample QC metrics. Also drops non-released v4 samples and consent drop samples. Default is False.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.sample_rankings(test=False, release=False)
Get the VersionedTableResource for sample rankings for genetic ancestry PCA or release.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
release (bool) – Whether to return resource for ranking of all samples based on outlier filtering of sample QC metrics. Used to determine related samples to drop for the release.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_loadings(include_unreleasable_samples=False, test=False, data_type='joint')
Get the genetic ancestry PCA loadings VersionedTableResource.
- Parameters:
include_unreleasable_samples (bool) – Whether to get the PCA loadings from the PCA that used unreleasable samples.
test (bool) – Whether to use a temp path.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Return type:
VersionedTableResource
- Returns:
Genetic ancestry PCA loadings.
- gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_scores(include_unreleasable_samples=False, test=False, data_type='joint', projection=False)
Get the genetic ancestry PCA scores VersionedTableResource.
- Parameters:
include_unreleasable_samples (bool) – Whether to get the PCA scores from the PCA that used unreleasable samples.
test (bool) – Whether to use a temp path.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
projection (bool) – Whether the scores ht includes projection scores instead of just original scores.
- Return type:
VersionedTableResource
- Returns:
Genetic ancestry PCA scores.
- gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_eigenvalues(include_unreleasable_samples=False, test=False, data_type='joint')
Get the genetic ancestry PCA eigenvalues VersionedTableResource.
- Parameters:
include_unreleasable_samples (bool) – Whether to get the PCA eigenvalues from the PCA that used unreleasable samples.
test (bool) – Whether to use a temp path.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Return type:
VersionedTableResource
- Returns:
Genetic ancestry PCA eigenvalues.
- gnomad_qc.v5.resources.sample_qc.gen_anc_rf_path(version='5.0', test=False, data_type='joint')
Path to RF model used for inferring genetic ancestry groups.
- Parameters:
version (str) – gnomAD version.
test (bool) – Whether the RF assignment was from a test dataset.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Return type:
str
- Returns:
String path to sample genetic ancestry group RF model.
- gnomad_qc.v5.resources.sample_qc.get_gen_anc_ht(version='5.0', test=False, data_type='joint', projection_only=False)
Get the TableResource of samples’ inferred genetic ancestry group for the indicated gnomAD version.
- Parameters:
version (str) – Version of gen anc group TableResource to return.
test (bool) – Whether to use the test version of the genetic ancestry TableResource.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
projection_only (bool) – Whether the inference results consist of just the results for the projected samples. When set to False, probability scores will not be included as they cannot be obtained for all samples.
- Returns:
TableResource of sample gen anc groups.
- gnomad_qc.v5.resources.sample_qc.get_gen_anc_pr_ht(version='5.0', test=False, data_type='joint')
Get the TableResource of genetic ancestry inference precision and recall values.
- Parameters:
version (str) – Version of gen anc group PR TableResource to return.
test (bool) – Whether to use the test version of the gen anc group PR TableResource.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Returns:
TableResource of genetic ancestry inference PR values.
- gnomad_qc.v5.resources.sample_qc.per_grp_min_rf_probs_json_path(version='5.0')
Get path to JSON file containing per genetic ancestry group minimum RF probabilities.
- Parameters:
version (str) – Version of the JSON to return.
- Returns:
Path to per genetic ancestry group minimum RF probabilities JSON.
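To show how per-group minimum probabilities might be consumed once the JSON is loaded, the sketch below assumes the file is a flat mapping from group label to minimum RF probability. That structure, the labels `grp_a` and `grp_b`, the numeric values, and the helper `passes_min_prob` are all made up for illustration; the real file contents are not shown on this page.

```python
import json

# Made-up payload standing in for the JSON file; the structure is assumed.
example_json = '{"grp_a": 0.75, "grp_b": 0.9}'
min_probs = json.loads(example_json)


def passes_min_prob(group: str, prob: float, thresholds: dict) -> bool:
    """Return True if prob meets the group's minimum; unknown groups fail."""
    return group in thresholds and prob >= thresholds[group]


print(passes_min_prob("grp_a", 0.8, min_probs))
print(passes_min_prob("grp_b", 0.8, min_probs))
```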
- gnomad_qc.v5.resources.sample_qc.get_outlier_detection_sample_qc(test=False)
Get modified sample QC Table for sample outlier detection.
- This table has the following modifications:
Remove hard filtered samples
Add project prefix to sample collisions
Add ‘r_snp_indel’ metric
Sample 1% of the dataset if test is True
- Parameters:
test (bool) – Whether to use the test version of the sample QC TableResource.
- Return type:
VersionedTableResource
- Returns:
Modified sample QC Table.
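The name ‘r_snp_indel’ suggests a SNP-to-indel ratio. The sketch below assumes it is the SNP count divided by the sum of the insertion and deletion counts (Hail's `sample_qc` output includes the `n_snp`, `n_insertion`, and `n_deletion` fields). That definition is an assumption, since this page does not spell it out, and plain Python stands in for the Hail expressions.

```python
from typing import Optional


def r_snp_indel(n_snp: int, n_insertion: int, n_deletion: int) -> Optional[float]:
    """Assumed definition: SNPs divided by (insertions + deletions)."""
    n_indel = n_insertion + n_deletion
    if n_indel == 0:
        # Ratio is undefined when the sample has no indels.
        return None
    return n_snp / n_indel


print(r_snp_indel(4000, 250, 250))
```

Returning `None` for a zero denominator is a choice made for this example, not a documented behaviour of the real table.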
- gnomad_qc.v5.resources.sample_qc.stratified_filtering(test=False)
Get VersionedTableResource for stratified genetic ancestry-based metrics filtering.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.regressed_filtering(test=False, include_unreleasable_samples=False)
Get VersionedTableResource for regression genetic ancestry-based metrics filtering.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
include_unreleasable_samples (bool) – Whether to get resource that included unreleasable samples in regression.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.nearest_neighbors(test=False, approximation=False, include_unreleasable_samples=False)
Get VersionedTableResource for genetic ancestry group PCA nearest neighbors.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
approximation (bool) – Whether to get resource that is approximate nearest neighbors.
include_unreleasable_samples (bool) – Whether to get resource that included unreleasable samples in nearest neighbors determination.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.nearest_neighbors_filtering(test=False)
Get VersionedTableResource for nearest neighbors platform/genetic ancestry group-based metrics filtering.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.