gnomad_qc.v5.resources.sample_qc
Script containing sample QC related resources.
Module Functions
Return the root GCS path to sample QC results.
Get AoU sample QC annotations generated by Hail for the specified stratification.
Get the union of AoU ACAF and exome MatrixTables.
Get joint (exomes + genomes) gnomAD v4 + AoU dense MatrixTableResource.
Return the path containing the input files read by cuKING.
Return the path containing the output files written by cuKING.
Get the VersionedTableResource for relatedness results.
Get the VersionedTableResource for samples to drop for genetic ancestry PCA.
Get the VersionedTableResource for sample rankings for genetic ancestry PCA.
Get the genetic ancestry PCA loadings VersionedTableResource.
Get the genetic ancestry PCA scores VersionedTableResource.
Get the genetic ancestry PCA eigenvalues VersionedTableResource.
Path to RF model used for inferring genetic ancestry groups.
Get the TableResource of samples' inferred genetic ancestry group for the indicated gnomAD version.
Get the TableResource of genetic ancestry inference precision and recall values.
Get path to JSON file containing per genetic ancestry group minimum RF probabilities.
Get modified sample QC Table for sample outlier detection.
Get VersionedTableResource for stratified genetic ancestry-based metrics filtering.
Get VersionedTableResource for regression genetic ancestry-based metrics filtering.
Get VersionedTableResource for genetic ancestry group PCA nearest neighbors.
Get VersionedTableResource for nearest neighbors platform/genetic ancestry group-based metrics filtering.
Get VersionedTableResource for the finalized outlier filtering.
- gnomad_qc.v5.resources.sample_qc.get_sample_qc_root(version='5.0', test=False, data_type='genomes', data_set='aou')[source]
Return the root GCS path to sample QC results.
- Parameters:
version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).
test (bool) – If True, return a temporary path (e.g., for testing or development).
data_type (str) – Data type (e.g., “genomes” or “exomes”).
data_set (str) – Dataset identifier (e.g., “aou”, “hgdp_tgp”).
- Return type:
str
- Returns:
GCS path to the sample QC directory.
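The four arguments together select one directory. The sketch below only illustrates how such a path could be composed; the bucket names and the directory layout are placeholders invented for the example, not the values used by gnomad_qc, and sketch_sample_qc_root is a hypothetical helper.

```python
def sketch_sample_qc_root(
    version: str = "5.0",
    test: bool = False,
    data_type: str = "genomes",
    data_set: str = "aou",
) -> str:
    """Illustrative only: compose a sample QC root path from the four arguments."""
    # Placeholder buckets; the real bucket names are not stated in this reference.
    bucket = "gs://example-tmp-bucket" if test else "gs://example-bucket"
    # Placeholder layout; the real directory structure may differ.
    return f"{bucket}/v{version}/sample_qc/{data_type}/{data_set}"


print(sketch_sample_qc_root())
print(sketch_sample_qc_root(test=True, data_set="hgdp_tgp"))
```

The point of the sketch is that test=True switches to a temporary location while the remaining arguments keep the same meaning.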
- gnomad_qc.v5.resources.sample_qc.get_sample_qc(strat='all', test=False)[source]
Get AoU sample QC annotations generated by Hail for the specified stratification.
- Possible values for strat:
bi_allelic
multi_allelic
all
- Parameters:
strat (str) – Which stratification to return.
test (bool) – Whether to use a tmp path for analysis of the test VDS instead of the full VDS.
- Return type:
VersionedTableResource
- Returns:
Sample QC table.
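Callers may want to check strat against the three documented values before requesting the resource. This is a minimal sketch of such a check; check_strat is a hypothetical helper, and whether get_sample_qc itself validates the argument is not stated here.

```python
# The three stratifications listed in this reference.
VALID_STRATS = ("bi_allelic", "multi_allelic", "all")


def check_strat(strat: str = "all") -> str:
    """Return strat unchanged if it is a documented value, else raise ValueError."""
    if strat not in VALID_STRATS:
        raise ValueError(
            f"strat must be one of {VALID_STRATS}, got {strat!r}"
        )
    return strat


print(check_strat("bi_allelic"))
```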
- gnomad_qc.v5.resources.sample_qc.get_aou_mt_union(test=True)[source]
Get the union of AoU ACAF and exome MatrixTables.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource. Default is True.
- Return type:
MatrixTableResource
- Returns:
MatrixTableResource containing the union of AoU ACAF and exome MTs.
- gnomad_qc.v5.resources.sample_qc.get_joint_qc(test=False)[source]
Get joint (exomes + genomes) gnomAD v4 + AoU dense MatrixTableResource.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
- Return type:
VersionedMatrixTableResource
- Returns:
VersionedMatrixTableResource of QC sites.
- gnomad_qc.v5.resources.sample_qc.get_cuking_input_path(version='5.0', test=False, environment='rwb')[source]
Return the path containing the input files read by cuKING.
Those files correspond to Parquet tables derived from the dense QC matrix.
- Parameters:
version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).
test (bool) – Whether to return a path corresponding to a test subset. Default is False.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Default is ‘rwb’.
- Return type:
str
- Returns:
Temporary path to hold Parquet input tables for running cuKING.
- gnomad_qc.v5.resources.sample_qc.get_cuking_output_path(version='5.0', test=False, environment='rwb')[source]
Return the path containing the output files written by cuKING.
Those files correspond to Parquet tables containing relatedness results.
- Parameters:
version (str) – Sample QC version (default: CURRENT_SAMPLE_QC_VERSION).
test (bool) – Whether to return a path corresponding to a test subset. Default is False.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Default is ‘rwb’.
- Return type:
str
- Returns:
Temporary path to hold Parquet output tables for running cuKING.
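get_cuking_input_path and get_cuking_output_path take the same three arguments, so one way to keep a run's inputs and outputs aligned is to build the keyword arguments once and reuse them. The helper below is a hypothetical sketch, not part of gnomad_qc; it only checks environment against the two documented values.

```python
from typing import Dict, Union

# The two compute environments named in this reference.
VALID_ENVIRONMENTS = ("dataproc", "rwb")


def cuking_path_kwargs(
    version: str = "5.0",
    test: bool = False,
    environment: str = "rwb",
) -> Dict[str, Union[str, bool]]:
    """Build one set of keyword arguments to share between both path functions."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(
            f"environment must be one of {VALID_ENVIRONMENTS}, got {environment!r}"
        )
    return {"version": version, "test": test, "environment": environment}


kwargs = cuking_path_kwargs(test=True)
print(kwargs)
# With gnomad_qc installed, the same kwargs could be passed to both functions:
#   get_cuking_input_path(**kwargs)
#   get_cuking_output_path(**kwargs)
```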
Get the VersionedTableResource for relatedness results.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
raw (bool) – Whether to return the raw cuKING output in Hail Table format. If False, returns the processed relatedness table. Default is False.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
Get the VersionedTableResource for samples to drop for genetic ancestry PCA.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
release (bool) – Whether to determine related samples to drop for the release based on outlier filtering of sample QC metrics. Also drops non-released v4 samples and consent drop samples. Default is False.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.sample_rankings(test=False)[source]
Get the VersionedTableResource for sample rankings for genetic ancestry PCA.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_loadings(include_unreleasable_samples=False, test=False, data_type='joint')[source]
Get the genetic ancestry PCA loadings VersionedTableResource.
- Parameters:
include_unreleasable_samples (bool) – Whether to get the PCA loadings from the PCA that used unreleasable samples.
test (bool) – Whether to use a temp path.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Return type:
VersionedTableResource
- Returns:
Genetic ancestry PCA loadings.
- gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_scores(include_unreleasable_samples=False, test=False, data_type='joint', projection=False)[source]
Get the genetic ancestry PCA scores VersionedTableResource.
- Parameters:
include_unreleasable_samples (bool) – Whether to get the PCA scores from the PCA that used unreleasable samples.
test (bool) – Whether to use a temp path.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
projection (bool) – Whether the scores ht includes projection scores instead of just original scores.
- Return type:
VersionedTableResource
- Returns:
Genetic ancestry PCA scores.
- gnomad_qc.v5.resources.sample_qc.genetic_ancestry_pca_eigenvalues(include_unreleasable_samples=False, test=False, data_type='joint')[source]
Get the genetic ancestry PCA eigenvalues VersionedTableResource.
- Parameters:
include_unreleasable_samples (bool) – Whether to get the PCA eigenvalues from the PCA that used unreleasable samples.
test (bool) – Whether to use a temp path.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Return type:
VersionedTableResource
- Returns:
Genetic ancestry PCA eigenvalues.
- gnomad_qc.v5.resources.sample_qc.gen_anc_rf_path(version='5.0', test=False, data_type='joint')[source]
Path to RF model used for inferring genetic ancestry groups.
- Parameters:
version (str) – gnomAD Version.
test (bool) – Whether the RF assignment was from a test dataset.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Return type:
str
- Returns:
String path to sample genetic ancestry group RF model.
- gnomad_qc.v5.resources.sample_qc.get_gen_anc_ht(version='5.0', test=False, data_type='joint')[source]
Get the TableResource of samples’ inferred genetic ancestry group for the indicated gnomAD version.
- Parameters:
version (str) – Version of gen anc group TableResource to return.
test (bool) – Whether to use the test version of the genetic ancestry TableResource.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Returns:
TableResource of sample gen anc groups.
- gnomad_qc.v5.resources.sample_qc.get_gen_anc_pr_ht(version='5.0', test=False, data_type='joint')[source]
Get the TableResource of genetic ancestry inference precision and recall values.
- Parameters:
version (str) – Version of gen anc group PR TableResource to return.
test (bool) – Whether to use the test version of the gen anc group PR TableResource.
data_type (str) – Data type used in sample QC, e.g. “exomes” or “joint”.
- Returns:
TableResource of genetic ancestry inference PR values.
- gnomad_qc.v5.resources.sample_qc.per_grp_min_rf_probs_json_path(version='5.0')[source]
Get path to JSON file containing per genetic ancestry group minimum RF probabilities.
- Parameters:
version (str) – Version of the JSON to return.
- Returns:
Path to per genetic ancestry group minimum RF probabilities JSON.
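A per-group minimum probability is typically used to decide whether a sample's top RF prediction is confident enough to keep. The sketch below assumes the JSON is a flat mapping from group label to minimum probability; that layout, the group labels, the "unassigned" fallback label, and the assign_group helper are all assumptions made for illustration and are not taken from gnomad_qc.

```python
import json
from typing import Dict


def assign_group(
    rf_probs: Dict[str, float],
    min_probs: Dict[str, float],
    unassigned: str = "unassigned",
) -> str:
    """Return the most probable group if it meets that group's minimum, else the fallback."""
    # Pick the group with the highest RF probability for this sample.
    best = max(rf_probs, key=lambda group: rf_probs[group])
    # Keep the assignment only if it clears the per-group threshold.
    return best if rf_probs[best] >= min_probs[best] else unassigned


# Inline stand-in for the file contents; the real file would be read from the returned path.
min_probs = json.loads('{"grp_a": 0.9, "grp_b": 0.75}')

print(assign_group({"grp_a": 0.95, "grp_b": 0.05}, min_probs))
print(assign_group({"grp_a": 0.6, "grp_b": 0.4}, min_probs))
```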
- gnomad_qc.v5.resources.sample_qc.get_outlier_detection_sample_qc(test=False)[source]
Get modified sample QC Table for sample outlier detection.
- This table has the following modifications:
Remove hard filtered samples
Add project prefix to sample collisions
Add ‘r_snp_indel’ metric
Sample 1% of the dataset if test is True
- Parameters:
test (bool) – Whether to use the test version of the sample QC TableResource.
- Return type:
VersionedTableResource
- Returns:
Modified sample QC Table.
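This reference names the ‘r_snp_indel’ metric without defining it. The sketch below assumes it is the number of SNPs divided by the combined number of insertions and deletions, computed here in plain Python rather than Hail; the function and argument names are hypothetical, and returning None for a zero denominator is a choice made for the example.

```python
from typing import Optional


def r_snp_indel(n_snp: int, n_insertion: int, n_deletion: int) -> Optional[float]:
    """Assumed definition: SNP count over insertion + deletion count."""
    n_indel = n_insertion + n_deletion
    # Avoid dividing by zero when a sample has no indels.
    if n_indel == 0:
        return None
    return n_snp / n_indel


print(r_snp_indel(800, 60, 40))
```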
- gnomad_qc.v5.resources.sample_qc.stratified_filtering(test=False)[source]
Get VersionedTableResource for stratified genetic ancestry-based metrics filtering.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.regressed_filtering(test=False, include_unreleasable_samples=False)[source]
Get VersionedTableResource for regression genetic ancestry-based metrics filtering.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
include_unreleasable_samples (bool) – Whether to get resource that included unreleasable samples in regression.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.nearest_neighbors(test=False, approximation=False, include_unreleasable_samples=False)[source]
Get VersionedTableResource for genetic ancestry group PCA nearest neighbors.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
approximation (bool) – Whether to get resource that is approximate nearest neighbors.
include_unreleasable_samples (bool) – Whether to get resource that included unreleasable samples in nearest neighbors determination.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.
- gnomad_qc.v5.resources.sample_qc.nearest_neighbors_filtering(test=False)[source]
Get VersionedTableResource for nearest neighbors platform/genetic ancestry group-based metrics filtering.
- Parameters:
test (bool) – Whether to use a tmp path for a test resource.
- Return type:
VersionedTableResource
- Returns:
VersionedTableResource.