Run Ancestry Inference

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| aou_9.0.0 | May, 2025 | WARP Pipelines | File an issue |

Introduction to the Run Ancestry Inference workflow

run_ancestry is a WDL workflow that trains and applies a PCA-based ancestry classifier. It uses a filtered training callset (e.g., HGDP) to build PCA loadings and a Random Forest model, then projects target samples and predicts ancestry labels.

The workflow runs three major stages: training PCA model generation (create_hw_pca_training), ancestry prediction (call_ancestry), and visualization (plot_ancestry). Outputs include prediction tables, Hail Table tarballs, model artifacts, and interactive PCA plots.

Quickstart table

| Pipeline Feature | Description | Source |
| --- | --- | --- |
| Analysis type | PCA projection + Random Forest ancestry inference | |
| Workflow language | WDL 1.0 | openWDL |
| Data input file format | VCF BGZF + Tabix index + optional metadata TSV | |
| Data output file format | TSV, .tar.gz Hail tables, .pkl, .html | |
| Primary software | Hail, scikit-learn, bokeh | Hail, scikit-learn, Bokeh |

Set-up

Run Ancestry Inference installation and requirements

The workflow code can be downloaded by cloning the WARP GitHub repository. For the latest release, please see the run_ancestry changelog.

The pipeline can be deployed using Cromwell, a GA4GH-compliant workflow management system.

Inputs

Input descriptions

| Input variable name | Description | Type |
| --- | --- | --- |
| hq_variants_intersection | Sites-only VCF of variants shared between the training and input datasets. | File |
| hq_variants_intersection_idx | Index for hq_variants_intersection. | File |
| merged_vcf_shards | Merged input-data VCF filtered to shared HQ sites. | File |
| merged_vcf_shards_idx | Index for merged_vcf_shards. | File |
| filtered_training_set | Full training VCF filtered to shared HQ sites. | File |
| filtered_training_set_idx | Index for filtered_training_set. | File |
| hgdp_metadata_file_in | (Optional) Training sample metadata TSV. Defaults to the public gnomAD HGDP+1KG metadata file. | File? |
| final_output_prefix | Prefix applied to all output artifacts. | String |
| other_cutoff_in | (Optional) Probability threshold for assigning ancestry label oth (other). Default: 0.75. | Float? |
| num_pcs | Number of principal components used in training/projection. Default: 16. | Int |
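
As an illustration of how the inputs above fit together, the following sketch builds a Cromwell-style inputs JSON in Python. The `run_ancestry.` key prefix assumes the workflow is declared under that name in the WDL, and every path and the `cohort1` prefix are hypothetical placeholders, not values from the pipeline.

```python
import json

# All file paths below are hypothetical placeholders.
# The "run_ancestry." prefix assumes the WDL workflow is named run_ancestry.
inputs = {
    "run_ancestry.hq_variants_intersection": "gs://my-bucket/hq_sites.vcf.bgz",
    "run_ancestry.hq_variants_intersection_idx": "gs://my-bucket/hq_sites.vcf.bgz.tbi",
    "run_ancestry.merged_vcf_shards": "gs://my-bucket/merged.vcf.bgz",
    "run_ancestry.merged_vcf_shards_idx": "gs://my-bucket/merged.vcf.bgz.tbi",
    "run_ancestry.filtered_training_set": "gs://my-bucket/training.vcf.bgz",
    "run_ancestry.filtered_training_set_idx": "gs://my-bucket/training.vcf.bgz.tbi",
    "run_ancestry.final_output_prefix": "cohort1",
    # Optional input, shown here at its documented default.
    "run_ancestry.other_cutoff_in": 0.75,
    "run_ancestry.num_pcs": 16,
    # hgdp_metadata_file_in is omitted so the workflow default applies.
}

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Writing `inputs_json` to a file gives an inputs document that can be supplied to Cromwell alongside the workflow WDL.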

Run Ancestry Inference tasks and tools

The workflow runs model training, prediction, and plotting tasks in sequence.

  1. Create training PCA artifacts
  2. Predict ancestry labels
  3. Generate PCA plots

To see specific tool parameters, select the task WDL link in the table; then view the command {} section of the task in the WDL script.

| Task name and WDL link | Tool | Software | Description |
| --- | --- | --- | --- |
| create_hw_pca_training | Hail PCA | hailgenetics/hail:0.2.67 | Runs HWE-normalized PCA on training data, writes loadings and labeled training PCA outputs. |
| call_ancestry | Hail + RandomForestClassifier | hailgenetics/hail:0.2.67 | Projects target samples, trains/applies RF classifier, and writes prediction + model outputs. |
| plot_ancestry | Hail + bokeh | hailgenetics/hail:0.2.67 | Produces interactive ancestry PCA HTML plots (raw and oth-adjusted labels). |

1. Create training PCA artifacts

Builds PCA features and loadings from the training set, joins population labels, and writes tarred Hail table artifacts plus eigenvalues.
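
To make "HWE-normalized" concrete, here is a minimal pure-Python sketch of the per-variant normalization as described in the Hail documentation for `hwe_normalized_pca`. It is not the task's code: the function name `hwe_normalize` is hypothetical, and whether the task applies exactly this scaling is an assumption.

```python
import math

def hwe_normalize(genotypes, n_variants):
    """Normalize one variant's alt-allele counts (0, 1, 2, or None if missing).

    Assumed formula: (g - 2p) / sqrt(2p(1 - p) * n_variants),
    where p is the alt allele frequency among called genotypes.
    """
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))
    if p in (0.0, 1.0):
        raise ValueError("monomorphic variant cannot be normalized")
    scale = math.sqrt(2 * p * (1 - p) * n_variants)
    # Missing calls are mean-imputed (g = 2p), so they normalize to 0.
    return [0.0 if g is None else (g - 2 * p) / scale for g in genotypes]

row = hwe_normalize([0, 1, 2, 1, None], n_variants=1)
print(row)
```

PCA is then run on the matrix of normalized values, and the resulting per-variant loadings are what later projection steps reuse.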

2. Predict ancestry labels

Projects input samples into the training PCA space, trains and applies the Random Forest classifier, and writes the TSV predictions, Hail Tables, and pickled classifier.
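
The projection and oth thresholding can be sketched in plain Python as follows. This is a hedged illustration, not the task's implementation: `project_sample` and `apply_cutoff` are hypothetical names, the population labels in the usage lines are made up, and comparing the top class probability against the cutoff with `>=` is an assumption about how other_cutoff_in is applied.

```python
import math

def project_sample(genotypes, allele_freqs, loadings):
    """Project one sample onto the training PCs.

    genotypes:    alt-allele counts per variant (None = missing call)
    allele_freqs: training-set alt allele frequency per variant, each in (0, 1)
    loadings:     one list of PC loadings per variant
    """
    n_variants = len(genotypes)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for g, p, load in zip(genotypes, allele_freqs, loadings):
        if g is None:
            continue  # a missing call contributes nothing to the score
        norm = (g - 2 * p) / math.sqrt(2 * p * (1 - p) * n_variants)
        for k in range(n_pcs):
            scores[k] += load[k] * norm
    return scores

def apply_cutoff(probabilities, cutoff=0.75):
    """Return the top label, or 'oth' if its probability is below the cutoff."""
    label, prob = max(probabilities.items(), key=lambda kv: kv[1])
    return label if prob >= cutoff else "oth"

scores = project_sample([2, 0], [0.5, 0.5], [[1.0], [0.0]])
confident = apply_cutoff({"pop_a": 0.9, "pop_b": 0.1})
ambiguous = apply_cutoff({"pop_a": 0.5, "pop_b": 0.5})
print(scores, confident, ambiguous)
```

In the workflow, the per-class probabilities would come from the Random Forest classifier applied to the projected scores; here they are supplied by hand to keep the sketch dependency-free.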

3. Generate PCA plots

Uses prediction outputs to render interactive bokeh plots showing ancestry assignments on principal components.

Outputs

| Output variable name | Filename, if applicable | Output format and description |
| --- | --- | --- |
| results_tsv | <final_output_prefix>.ancestry_preds.tsv | Tab-delimited prediction table containing sample IDs, predicted labels, probabilities, and PCA features. |
| results_ht | <final_output_prefix>.ancestry_preds.ht.tar.gz | Tarred Hail Table with projected PCA scores for input samples. |
| results_loadings_ht | <final_output_prefix>_loadings.ht.tar.gz | Tarred Hail Table containing PCA loadings used for projection. |
| pred_plot | <final_output_prefix>.preds.html | Interactive ancestry PCA plot using direct model predictions. |
| pred_oth_plot | <final_output_prefix>.preds_oth.html | Interactive PCA plot after applying oth thresholding. |
| training_pca_labels_ht_tsv | <final_output_prefix>_training_pca.tsv | Training PCA table with sample labels. |
| eigenvalues_txt | <final_output_prefix>_eigenvalues.txt | PCA eigenvalues text file used in model training. |
| classifier_pkl | <final_output_prefix>_rf_classifier.pkl | Pickled Random Forest classifier artifact. |
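
For checking that a run produced everything listed above, the naming pattern in the table can be expressed as a small helper. The function name `expected_outputs` and the `cohort1` prefix are hypothetical; the filename suffixes are copied from the table.

```python
def expected_outputs(prefix):
    """Map each output variable to its filename for a given final_output_prefix."""
    return {
        "results_tsv": f"{prefix}.ancestry_preds.tsv",
        "results_ht": f"{prefix}.ancestry_preds.ht.tar.gz",
        "results_loadings_ht": f"{prefix}_loadings.ht.tar.gz",
        "pred_plot": f"{prefix}.preds.html",
        "pred_oth_plot": f"{prefix}.preds_oth.html",
        "training_pca_labels_ht_tsv": f"{prefix}_training_pca.tsv",
        "eigenvalues_txt": f"{prefix}_eigenvalues.txt",
        "classifier_pkl": f"{prefix}_rf_classifier.pkl",
    }

outputs = expected_outputs("cohort1")
for name, filename in outputs.items():
    print(f"{name}: {filename}")
```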

Versioning

All run_ancestry releases are documented in the changelog.

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.