gnomad_qc.v4.sample_qc.assign_ancestry

Script to assign global ancestry labels to samples using known v3 population labels or TGP and HGDP labels.

usage: gnomad_qc.v4.sample_qc.assign_ancestry.py [-h]
                                                 [--slack-channel SLACK_CHANNEL]
                                                 [--test] [--overwrite]
                                                 [--run-pca] [--n-pcs N_PCS]
                                                 [--include-unreleasable-samples]
                                                 [--assign-pops]
                                                 [--pop-pcs POP_PCS [POP_PCS ...]]
                                                 [--min-pop-prob MIN_POP_PROB]
                                                 [--include-v2-known-in-training]
                                                 [--v4-population-spike {Arab,Bedouin,Persian,Qatari} [{Arab,Bedouin,Persian,Qatari} ...]]
                                                 [--v3-population-spike {asj,ami,afr,amr,eas,sas,fin,nfe} [{asj,ami,afr,amr,eas,sas,fin,nfe} ...]]
                                                 [--compute-precision-recall]
                                                 [--number-pr-points NUMBER_PR_POINTS]
                                                 [--apply-per-pop-min-rf-probs]
                                                 [--infer-per-pop-min-rf-probs]
                                                 [--min-recall MIN_RECALL]
                                                 [--min-precision MIN_PRECISION]
                                                 [--set-ami-exomes-to-remaining]

Named Arguments

--slack-channel

Slack channel to post results and notifications to.

--test

Run script on test dataset.

Default: False

--overwrite

Overwrite output files.

Default: False

--run-pca

Compute ancestry PCA.

Default: False

--n-pcs

Number of PCs to compute for ancestry PCA. Defaults to 30.

Default: 30

--include-unreleasable-samples

Include unreleasable samples when computing PCA.

Default: False

--assign-pops

Assigns pops from PCA.

Default: False

--pop-pcs

List of PCs to use for ancestry assignment. The values provided should be 1-based. If a single integer is passed, the script assumes it represents the total number of PCs to use, e.g. --pop-pcs=6 will use PCs 1, 2, 3, 4, 5, and 6. Defaults to 20 PCs.

Default: [20]
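The single-integer rule for --pop-pcs can be sketched as follows. This is a minimal illustration, not the script's own code: the parser below is a stand-in that mirrors only this one flag (the nargs="+" form is inferred from the usage line), and expand_pop_pcs is a hypothetical helper name.

```python
import argparse


def expand_pop_pcs(pop_pcs):
    """Turn a single-integer --pop-pcs value into the 1-based list of PCs 1..N."""
    if len(pop_pcs) == 1:
        return list(range(1, pop_pcs[0] + 1))
    # More than one value: treat them as the explicit 1-based PCs to use.
    return list(pop_pcs)


# Hypothetical stand-in parser that mirrors only the --pop-pcs flag.
parser = argparse.ArgumentParser()
parser.add_argument("--pop-pcs", type=int, nargs="+", default=[20])

single = expand_pop_pcs(parser.parse_args(["--pop-pcs", "6"]).pop_pcs)
explicit = expand_pop_pcs(parser.parse_args(["--pop-pcs", "1", "2", "5"]).pop_pcs)
default = expand_pop_pcs(parser.parse_args([]).pop_pcs)

print(single)    # [1, 2, 3, 4, 5, 6]
print(explicit)  # [1, 2, 5]
```

With no flag given, the default [20] expands to PCs 1 through 20, which matches the default pcs list shown in the assign_pops signature.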

--min-pop-prob

Minimum RF prob for pop assignment. Defaults to 0.75.

Default: 0.75

--include-v2-known-in-training

Whether to train RF classifier using v2 known pop labels. Default is False.

Default: False

--v4-population-spike

Possible choices: Arab, Bedouin, Persian, Qatari

List of v4 populations to spike into the RF training populations.

--v3-population-spike

Possible choices: asj, ami, afr, amr, eas, sas, fin, nfe

List of v3 populations to spike into the RF training populations.

--compute-precision-recall

Compute precision and recall for the RF model using evaluation samples. This is computed for all evaluation samples as well as per population.

Default: False

--number-pr-points

Number of min prob cutoffs to compute PR metrics for. e.g. 100 will compute PR metrics for min prob of 0 to 1 in increments of 0.01. Default is 100.

Default: 100

--apply-per-pop-min-rf-probs

Apply per ancestry group minimum RF probabilities for finalized pop assignment instead of using ‘--min-pop-prob’ for all samples. There must be a JSON file located in the path defined by the ‘per_pop_min_rf_probs_json_path’ resource, or ‘--infer-per-pop-min-rf-probs’ must be used.

Default: False

--infer-per-pop-min-rf-probs

Whether to infer per ancestry group minimum RF probabilities and write them out to ‘per_pop_min_rf_probs_json_path’ before determining the finalized pop assignment.

Default: False

--min-recall

Minimum recall value to choose per ancestry group minimum RF probabilities. This cutoff is applied first, and if the chosen cutoff results in a precision lower than ‘--min-precision’, the minimum RF probability with the highest recall that meets ‘--min-precision’ is used. Default is 0.99.

Default: 0.99

--min-precision

Minimum precision value to choose per ancestry group minimum RF probabilities. This cutoff is applied if the minimum RF probability cutoff chosen using ‘--min-recall’ results in a precision lower than this value. In that case, the minimum RF probability with the highest recall that meets ‘--min-precision’ is used. Default is 0.99.

Default: 0.99

--set-ami-exomes-to-remaining

Whether to change the ancestry group for any exomes inferred as ‘ami’ to ‘remaining’. Should be used in cases where only a few exomes were inferred as Amish, to avoid having ancestry groups with only a few samples.

Default: False

Module Functions

gnomad_qc.v4.sample_qc.assign_ancestry.V4_POP_SPIKE_DICT

Dictionary with potential pops to use for training (with v4 race/ethnicity as key and corresponding pop as value).

gnomad_qc.v4.sample_qc.assign_ancestry.V3_SPIKE_PROJECTS

Dictionary with v3 pops as keys and approved cohorts to use for training for those pops as values.

gnomad_qc.v4.sample_qc.assign_ancestry.run_pca(...)

Run population PCA using run_pca_with_relateds.

gnomad_qc.v4.sample_qc.assign_ancestry.prep_ht_for_rf([...])

Prepare the PCA scores hail Table for the random forest population assignment runs.

gnomad_qc.v4.sample_qc.assign_ancestry.assign_pops(...)

Use a random forest model to assign global population labels based on the results from run_pca.

gnomad_qc.v4.sample_qc.assign_ancestry.write_pca_results(...)

Write out the eigenvalue hail Table, scores hail Table, and loadings hail Table returned by run_pca().

gnomad_qc.v4.sample_qc.assign_ancestry.get_most_likely_pop_expr(ht)

Get StructExpression with 'pop' and 'prob' for the most likely population based on RF probabilities.

gnomad_qc.v4.sample_qc.assign_ancestry.compute_precision_recall(ht)

Create Table with false positives (FP), true positives (TP), false negatives (FN), precision, and recall.

gnomad_qc.v4.sample_qc.assign_ancestry.infer_per_pop_min_rf_probs(ht)

Infer per ancestry group minimum RF probabilities from precision and recall values.

gnomad_qc.v4.sample_qc.assign_ancestry.assign_pop_with_per_pop_probs(...)

Assign samples to populations based on population-specific minimum RF probabilities.

gnomad_qc.v4.sample_qc.assign_ancestry.main(args)

Assign global ancestry labels to samples.

gnomad_qc.v4.sample_qc.assign_ancestry.get_script_argument_parser()

Get script argument parser.

gnomad_qc.v4.sample_qc.assign_ancestry.V4_POP_SPIKE_DICT = {'Arab': 'mid', 'Bedouin': 'mid', 'Persian': 'mid', 'Qatari': 'mid'}

Dictionary with potential pops to use for training (with v4 race/ethnicity as key and corresponding pop as value).

gnomad_qc.v4.sample_qc.assign_ancestry.V3_SPIKE_PROJECTS = {'afr': ['TOPMED_Tishkoff_Cardiometabolics_Phase4'], 'ami': ['NHLBI_WholeGenome_Sequencing'], 'amr': ['PAGE: Global Reference Panel', 'PAGE: Multiethnic Cohort (MEC)', 'CostaRica'], 'asj': ['Jewish_Genome_Project'], 'eas': ['osaka'], 'fin': ['G4L Initiative Stanley Center', 'WGSPD3_Palotie_FinnishBP_THL_WGS', 'WGSPD3_Palotie_Finnish_WGS', 'WGSPD3_Palotie_Finnish_WGS_December2018'], 'nfe': ['Estonia_University of Tartu_Whole Genome Sequencing', 'CCDG_Atrial_Fibrillation_Munich', 'CCDG_Atrial_Fibrillation_Norway', 'CCDG_Atrial_Fibrillation_Sweden', 'Estonia_University of Tartu_Whole Genome Sequencing', 'WGSPD'], 'sas': ['CCDG_PROMIS', 'TOPMED_Saleheen_PROMIS_Phase4']}

Dictionary with v3 pops as keys and approved cohorts to use for training for those pops as values. Decisions were made based on results of an analysis to determine which v3 samples/cohorts to use as training samples. This analysis consisted of computing per sample mean Euclidean distances to all samples in a given population, and per sample the mean Euclidean distances limited to only HGDP/1KG samples in each population. Then cohorts were excluded based on the per cohort distributions of these mean distances.

Projects that were excluded based on this analysis are:

  • afr: NHLBI_WholeGenome_Sequencing

  • ami: Pedigree-Based Whole Genome Sequencing of Affective and Psychotic Disorders

  • amr: NHLBI_WholeGenome_Sequencing

  • amr: PAGE: Women’s Health Initiative (WHI)
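The distance metric described above can be sketched in a few lines. This is only an illustration of "per sample mean Euclidean distance" on toy PC scores: the sample IDs, scores, and reference set are made up, mean_distance is a hypothetical helper, and excluding a sample from its own comparison set is an assumption rather than something the description states.

```python
import math


def mean_distance(sample_pcs, reference_pcs):
    """Mean Euclidean distance from one sample's PC scores to a set of reference samples."""
    distances = [math.dist(sample_pcs, ref) for ref in reference_pcs]
    return sum(distances) / len(distances)


# Toy 2-PC scores for one population; every value here is made up.
population_scores = {
    "s1": [0.0, 0.0],
    "s2": [3.0, 4.0],
    "s3": [6.0, 8.0],
}
reference_ids = {"s2", "s3"}  # hypothetical HGDP/1KG samples in this population

results = {}
for sample_id, pcs in population_scores.items():
    # Assumption: a sample is never compared against itself.
    others = [v for k, v in population_scores.items() if k != sample_id]
    refs = [
        v
        for k, v in population_scores.items()
        if k in reference_ids and k != sample_id
    ]
    results[sample_id] = (mean_distance(pcs, others), mean_distance(pcs, refs))

print(results["s2"])  # (5.0, 5.0)
```

Grouping these per-sample means by cohort would then give the per-cohort distributions that the exclusion decisions were based on.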

gnomad_qc.v4.sample_qc.assign_ancestry.run_pca(related_samples_to_drop, include_unreleasable_samples=False, n_pcs=30, test=False)[source]

Run population PCA using run_pca_with_relateds.

Parameters:
  • related_samples_to_drop (Table) – Table of related samples to drop from PCA run.

  • include_unreleasable_samples (bool) – Whether unreleasable samples should be included in the PCA.

  • n_pcs (int) – Number of PCs to compute.

  • test (bool) – Subset QC MT to small test dataset.

Return type:

Tuple[List[float], Table, Table]

Returns:

Eigenvalues, scores and loadings from PCA.

gnomad_qc.v4.sample_qc.assign_ancestry.prep_ht_for_rf(include_unreleasable_samples=False, test=False, include_v2_known_in_training=False, v4_population_spike=None, v3_population_spike=None)[source]

Prepare the PCA scores hail Table for the random forest population assignment runs.

Either train the RF with only HGDP and TGP, or HGDP and TGP and all v2 known labels.

Can also specify list of pops with known v3/v4 labels to include (v3_population_spike/v4_population_spike) for training. Pops supplied for v4 are specified by race/ethnicity and converted to an ancestry group using V4_POP_SPIKE_DICT.

Parameters:
  • include_unreleasable_samples (bool) – Whether unreleasable samples should be included in the PCA.

  • test (bool) – Whether RF should run on the test QC MT.

  • include_v2_known_in_training (bool) – Whether to train RF classifier using v2 known pop labels. Default is False.

  • v4_population_spike (Optional[List[str]]) – Optional List of populations to spike into training. Must be in V4_POP_SPIKE_DICT dictionary. Default is None.

  • v3_population_spike (Optional[List[str]]) – Optional List of populations to spike into training. Must be in V3_SPIKE_PROJECTS dictionary. Default is None.

Return type:

Table

Returns:

Table with input for the random forest.
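As a rough sketch of the race/ethnicity-to-ancestry-group conversion mentioned for v4_population_spike, the lookup can be written as below. The dictionary is copied from V4_POP_SPIKE_DICT as documented on this page; spike_labels_to_pops is a hypothetical helper, and raising ValueError for unknown labels is an assumption about how invalid input might be handled.

```python
V4_POP_SPIKE_DICT = {
    "Arab": "mid",
    "Bedouin": "mid",
    "Persian": "mid",
    "Qatari": "mid",
}


def spike_labels_to_pops(v4_population_spike):
    """Map v4 race/ethnicity labels to ancestry groups, rejecting unknown labels."""
    unknown = [label for label in v4_population_spike if label not in V4_POP_SPIKE_DICT]
    if unknown:
        # Assumed behavior: fail loudly rather than silently drop a label.
        raise ValueError(f"Not in V4_POP_SPIKE_DICT: {unknown}")
    return {label: V4_POP_SPIKE_DICT[label] for label in v4_population_spike}


mapped = spike_labels_to_pops(["Arab", "Persian"])
print(mapped)  # {'Arab': 'mid', 'Persian': 'mid'}
```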

gnomad_qc.v4.sample_qc.assign_ancestry.assign_pops(min_prob, include_unreleasable_samples=False, pcs=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], missing_label='remaining', test=False, overwrite=False, include_v2_known_in_training=False, v4_population_spike=None, v3_population_spike=None)[source]

Use a random forest model to assign global population labels based on the results from run_pca.

Training data is the known label for HGDP and 1KG samples and all v2 samples with known pops unless specified to restrict only to 1KG and HGDP samples. Can also specify a list of pops with known v3/v4 labels to include (v3_population_spike/v4_population_spike) for training. Pops supplied for v4 are specified by race/ethnicity and converted to an ancestry group using V4_POP_SPIKE_DICT. The method assigns a population label to all samples in the dataset.

Parameters:
  • min_prob (float) – Minimum RF probability for pop assignment.

  • include_unreleasable_samples (bool) – Whether unreleasable samples were included in PCA.

  • pcs (List[int]) – List of PCs to use in the RF.

  • missing_label (str) – Label for samples for which the assignment probability is smaller than min_prob.

  • test (bool) – Whether running assignment on a test dataset.

  • overwrite (bool) – Whether to overwrite existing files.

  • include_v2_known_in_training (bool) – Whether to train RF classifier using v2 known pop labels. Default is False.

  • v4_population_spike (Optional[List[str]]) – Optional List of v4 populations to spike into the RF. Must be in V4_POP_SPIKE_DICT dictionary. Defaults to None.

  • v3_population_spike (Optional[List[str]]) – Optional List of v3 populations to spike into the RF. Must be in V3_SPIKE_PROJECTS dictionary. Defaults to None.

Return type:

Tuple[Table, Any]

Returns:

Table of pop assignments and the RF model.

gnomad_qc.v4.sample_qc.assign_ancestry.write_pca_results(pop_pca_eigenvalues, pop_pca_scores_ht, pop_pca_loadings_ht, overwrite=False, included_unreleasables=False, test=False)[source]

Write out the eigenvalue hail Table, scores hail Table, and loadings hail Table returned by run_pca().

Parameters:
  • pop_pca_eigenvalues (List[float]) – List of eigenvalues returned by run_pca.

  • pop_pca_scores_ht (Table) – Table of scores returned by run_pca.

  • pop_pca_loadings_ht (Table) – Table of loadings returned by run_pca.

  • overwrite (bool) – Whether to overwrite an existing file.

  • included_unreleasables (bool) – Whether run_pca included unreleasable samples.

  • test (bool) – Whether the test QC MT was used in the PCA.

Returns:

None

gnomad_qc.v4.sample_qc.assign_ancestry.get_most_likely_pop_expr(ht)[source]

Get StructExpression with ‘pop’ and ‘prob’ for the most likely population based on RF probabilities.

Parameters:

ht (Table) – Input population inference Table with random forest probabilities.

Return type:

Tuple[StructExpression, List[Tuple[str, str]]]

Returns:

StructExpression with ‘pop’ and ‘prob’ for the highest RF probability.
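The idea behind picking the most likely population can be shown in plain Python, without Hail. This is a sketch only: the real function builds a Hail expression over Table fields, whereas most_likely_pop below is a hypothetical helper that works on one sample's probabilities held in a dict, and the probability values are invented.

```python
def most_likely_pop(rf_probs):
    """Return the population with the highest RF probability and that probability."""
    pop, prob = max(rf_probs.items(), key=lambda item: item[1])
    return {"pop": pop, "prob": prob}


# Toy RF probabilities for a single sample.
row = {"afr": 0.02, "nfe": 0.91, "fin": 0.07}
best = most_likely_pop(row)
print(best)  # {'pop': 'nfe', 'prob': 0.91}
```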

gnomad_qc.v4.sample_qc.assign_ancestry.compute_precision_recall(ht, num_pr_points=100)[source]

Create Table with false positives (FP), true positives (TP), false negatives (FN), precision, and recall.

Includes population specific calculations.

Parameters:
  • ht (Table) – Input population inference Table with random forest probabilities.

  • num_pr_points (int) – Number of min prob cutoffs to compute PR metrics for.

Return type:

Table

Returns:

Table with FP, TP, FN, precision, and recall.
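The per-population counts can be illustrated with a small plain-Python sketch. It assumes the usual definitions (a sample counts as assigned to a population when that population is its most likely one and the probability is at least the cutoff) and assumes the cutoff grid includes both 0 and 1; neither detail is confirmed by the text above. The helper name and the evaluation data are made up.

```python
def precision_recall(samples, pop, cutoff):
    """Count TP/FP/FN for one population at one minimum-probability cutoff."""
    tp = fp = fn = 0
    for truth, predicted, prob in samples:
        assigned = predicted == pop and prob >= cutoff
        if assigned and truth == pop:
            tp += 1
        elif assigned:
            fp += 1
        elif truth == pop:
            fn += 1
    # Precision/recall are undefined (None) when their denominator is zero.
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}


# (known pop, most likely pop, RF probability); toy evaluation samples.
evaluation = [
    ("nfe", "nfe", 0.95),
    ("nfe", "nfe", 0.60),
    ("fin", "nfe", 0.80),
    ("fin", "fin", 0.90),
]

num_pr_points = 4  # small for readability; the documented default is 100
cutoffs = [i / num_pr_points for i in range(num_pr_points + 1)]
table = {c: precision_recall(evaluation, "nfe", c) for c in cutoffs}

print(table[0.75])
# {'TP': 1, 'FP': 1, 'FN': 1, 'precision': 0.5, 'recall': 0.5}
```

Raising the cutoff trades recall for precision, which is the trade-off that infer_per_pop_min_rf_probs resolves per ancestry group.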

gnomad_qc.v4.sample_qc.assign_ancestry.infer_per_pop_min_rf_probs(ht, min_recall=0.99, min_precision=0.99)[source]

Infer per ancestry group minimum RF probabilities from precision and recall values.

Minimum recall (min_recall) is used to choose per ancestry group minimum RF probabilities. This min_recall cutoff is applied first, and if the chosen minimum RF probability cutoff results in a precision lower than min_precision, the minimum RF probability with the highest recall that meets min_precision is used.

Parameters:
  • ht (Table) – Precision recall Table returned by compute_precision_recall.

  • min_recall (float) – Minimum recall value to choose per ancestry group minimum RF probabilities. Default is 0.99.

  • min_precision (float) – Minimum precision value to choose per ancestry group minimum RF probabilities. Default is 0.99.

Return type:

Dict[str, Dict[str, float]]

Returns:

Dictionary of per pop minimum probability cutoffs (min_prob_cutoff).
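One way to read the two-step selection rule is sketched below in plain Python. Several details are assumptions: the first step is taken to be "the highest cutoff whose recall still meets min_recall", a population with no cutoff meeting min_precision gets None, and the output uses a min_prob_cutoff key per population. The helper name and the precision/recall rows are invented for illustration.

```python
def choose_cutoff(pr_rows, min_recall=0.99, min_precision=0.99):
    """Pick one population's minimum RF probability from cutoff/precision/recall rows."""
    # Step 1 (assumed tie-break): highest cutoff that still reaches min_recall.
    meets_recall = [r for r in pr_rows if r["recall"] >= min_recall]
    chosen = max(meets_recall, key=lambda r: r["cutoff"]) if meets_recall else None
    # Step 2: if that cutoff is not precise enough, fall back to the cutoff
    # with the highest recall among those that meet min_precision.
    if chosen is None or chosen["precision"] < min_precision:
        meets_precision = [r for r in pr_rows if r["precision"] >= min_precision]
        if not meets_precision:
            return None
        chosen = max(meets_precision, key=lambda r: r["recall"])
    return chosen["cutoff"]


# Toy precision/recall rows per population.
pr_table = {
    "fin": [
        {"cutoff": 0.5, "precision": 0.95, "recall": 1.0},
        {"cutoff": 0.7, "precision": 0.97, "recall": 0.995},
        {"cutoff": 0.9, "precision": 0.995, "recall": 0.96},
    ],
    "nfe": [
        {"cutoff": 0.5, "precision": 0.992, "recall": 1.0},
        {"cutoff": 0.7, "precision": 0.998, "recall": 0.993},
        {"cutoff": 0.9, "precision": 1.0, "recall": 0.9},
    ],
}

min_probs = {pop: {"min_prob_cutoff": choose_cutoff(rows)} for pop, rows in pr_table.items()}
print(min_probs)
# {'fin': {'min_prob_cutoff': 0.9}, 'nfe': {'min_prob_cutoff': 0.7}}
```

In the toy data, 'nfe' is settled by the recall step alone, while 'fin' falls through to the precision step because its recall-based cutoff of 0.7 only reaches a precision of 0.97.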

gnomad_qc.v4.sample_qc.assign_ancestry.assign_pop_with_per_pop_probs(pop_ht, min_prob_cutoffs, missing_label='remaining')[source]

Assign samples to populations based on population-specific minimum RF probabilities.

Parameters:
  • pop_ht (Table) – Table containing results of population inference.

  • min_prob_cutoffs (Dict[str, float]) – Dictionary with population as key, and minimum RF probability required to assign a sample to that population as value.

  • missing_label (str) – Label for samples for which the assignment probability is smaller than required minimum probability.

Return type:

Table

Returns:

Table with ‘pop’ annotation based on supplied per pop min probabilities.
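The per-population thresholding can be sketched without Hail as follows. assign_pop is a hypothetical helper operating on one sample at a time, and the cutoffs and probabilities are toy values; the only rule taken from the description is that a sample whose probability is smaller than its population's cutoff receives missing_label.

```python
def assign_pop(most_likely, min_prob_cutoffs, missing_label="remaining"):
    """Keep the most likely pop only if its probability meets that pop's own cutoff."""
    pop, prob = most_likely["pop"], most_likely["prob"]
    return pop if prob >= min_prob_cutoffs[pop] else missing_label


# Toy per-population cutoffs and most-likely-pop results.
min_prob_cutoffs = {"nfe": 0.7, "fin": 0.9}
samples = {
    "s1": {"pop": "nfe", "prob": 0.75},
    "s2": {"pop": "fin", "prob": 0.75},
    "s3": {"pop": "fin", "prob": 0.9},
}

assigned = {sid: assign_pop(res, min_prob_cutoffs) for sid, res in samples.items()}
print(assigned)  # {'s1': 'nfe', 's2': 'remaining', 's3': 'fin'}
```

The same probability of 0.75 is enough for 'nfe' but not for 'fin', which is the practical difference from applying a single --min-pop-prob to every sample.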

gnomad_qc.v4.sample_qc.assign_ancestry.main(args)[source]

Assign global ancestry labels to samples.