Ancestry Analysis Workflows (WDL)
This section documents the All of Us ancestry analysis workflows used to convert large callsets, identify shared high-quality sites, infer ancestry labels, and perform downstream cohort QC/relatedness analyses.
Quick Summary
- Purpose: Convert and filter large callsets, build shared high-quality site sets, infer ancestry labels, and perform post-inference sample QC.
- Primary Outputs: Per-chromosome VCFs, HQ-intersection callsets, ancestry predictions, relatedness artifacts, and interactive QC plots.
High-level Analysis Flow
Ancestry analysis is typically run in two phases:
- Core ancestry inference preparation and prediction
- Post-inference relatedness and outlier QC
The first phase produces ancestry labels and PCA features used by the second phase.
Phase 1: Core Ancestry Inference
Run these workflows in order:
| Step | Workflow | Description | WDL |
|---|---|---|---|
| 1 | VDS to VCF | Converts a large VDS into per-contig full and sites-only VCF files for downstream ancestry processing. | WDL |
| 2 | Determine HQ Sites Intersection | Intersects training and input-data variants and filters both datasets to the shared high-quality site set. | WDL |
| 3 | Run Ancestry Inference | Trains and applies a PCA-based Random Forest classifier to infer ancestry labels and generate plots. | WDL |
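The intersection in step 2 can be illustrated with a minimal Python sketch. It is not the workflow's implementation: the Site tuple, the helper names, and the example variants are all hypothetical, and a site is assumed to be identified by contig, position, reference allele, and alternate allele.

```python
from typing import NamedTuple


class Site(NamedTuple):
    """Hypothetical variant key: contig, position, ref allele, alt allele."""
    contig: str
    pos: int
    ref: str
    alt: str


def shared_sites(training, input_data):
    """Return sites present in both collections.

    The result is sorted by tuple order, so contigs compare as plain strings.
    """
    return sorted(set(training) & set(input_data))


def filter_to_sites(records, keep):
    """Keep only (site, payload) records whose site is in the shared set."""
    keep = set(keep)
    return [(site, payload) for site, payload in records if site in keep]


# Made-up example data; real inputs would come from the sites-only VCFs.
training_sites = [
    Site("chr1", 100, "A", "G"),
    Site("chr1", 250, "C", "T"),
    Site("chr2", 40, "G", "A"),
]
input_sites = [
    Site("chr1", 250, "C", "T"),
    Site("chr2", 40, "G", "A"),
    Site("chr2", 90, "T", "C"),
]

shared = shared_sites(training_sites, input_sites)

# Filter both datasets to the same shared site set.
training_filtered = filter_to_sites([(s, "train") for s in training_sites], shared)
input_filtered = filter_to_sites([(s, "input") for s in input_sites], shared)
```

Filtering both datasets to the same site set means the PCA features for the training and input samples are computed over identical variants.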
Phase 2: Post-inference Cohort QC and Relatedness
Once Phase 1 outputs are available, run whichever of these workflows the analysis goals call for:
| Step | Workflow | Description | WDL |
|---|---|---|---|
| 4 | Run Relatedness | Computes sample-pair relatedness and a maximal independent set of samples flagged for relatedness deconfounding. | WDL |
| 5 | Run Sample Outlier QC | Joins ancestry predictions with callset metrics and identifies ancestry-stratified sample QC outliers. | WDL |
| 6 | Run Sample Outlier QC Plotting | Joins demographics and generates interactive PC and metric visualizations for outlier QC review. | WDL |
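To show what "ancestry-stratified" means in step 5, here is a minimal Python sketch that flags outliers within each ancestry group rather than across the whole cohort. The median/MAD rule, the threshold of 4 MADs, the n_snp metric, and the group labels A and B are assumptions chosen for illustration; the workflow's actual metrics and thresholds may differ.

```python
from collections import defaultdict
from statistics import median


def flag_outliers(samples, metric, n_mads=4.0):
    """Flag samples whose metric is far from their own ancestry group's median.

    samples: list of dicts with 'sample_id', 'ancestry', and the metric key.
    Returns the set of flagged sample IDs.
    """
    by_ancestry = defaultdict(list)
    for sample in samples:
        by_ancestry[sample["ancestry"]].append(sample)

    flagged = set()
    for group in by_ancestry.values():
        values = [s[metric] for s in group]
        med = median(values)
        # Median absolute deviation: a spread estimate that resists outliers.
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            continue  # no spread in this group, so nothing to compare against
        for s in group:
            if abs(s[metric] - med) > n_mads * mad:
                flagged.add(s["sample_id"])
    return flagged


# Made-up cohort: the value 150 is unusually high in group A, unusually low in group B.
group_a = [100, 102, 98, 101, 99, 150]
group_b = [200, 205, 195, 150, 202]
cohort = [
    {"sample_id": f"s{i}", "ancestry": label, "n_snp": value}
    for i, (label, value) in enumerate(
        [("A", v) for v in group_a] + [("B", v) for v in group_b], start=1
    )
]

outliers = flag_outliers(cohort, "n_snp")
```

Stratifying by ancestry avoids flagging samples only because their group's typical metric values differ from the cohort-wide distribution.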
Notes
- Data dependency: run_sample_outlier_qc requires ancestry outputs from run_ancestry.
- Plotting dependency: run_sample_outlier_qc_plotting requires outputs from run_sample_outlier_qc.
- Relatedness usage: run_relatedness is commonly used to flag related samples before downstream association analyses.
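The dependencies above, together with the Phase 1 step order, can be written as a small dependency graph. The sketch below uses Python's standard graphlib module (Python 3.9+). The names vds_to_vcf and determine_hq_sites_intersection are hypothetical stand-ins for steps 1 and 2, and run_relatedness is left out because the notes do not state which outputs it depends on.

```python
from graphlib import TopologicalSorter

# Map each workflow to the workflows whose outputs it needs.
DEPENDENCIES = {
    "vds_to_vcf": set(),  # hypothetical name for step 1
    "determine_hq_sites_intersection": {"vds_to_vcf"},  # hypothetical name for step 2
    "run_ancestry": {"determine_hq_sites_intersection"},
    "run_sample_outlier_qc": {"run_ancestry"},
    "run_sample_outlier_qc_plotting": {"run_sample_outlier_qc"},
}

# A valid execution order: every workflow appears after its prerequisites.
order = list(TopologicalSorter(DEPENDENCIES).static_order())


def prerequisites(workflow):
    """Return every workflow that must finish before `workflow` can start."""
    seen = set()
    stack = list(DEPENDENCIES[workflow])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(DEPENDENCIES[dep])
    return seen


for step, name in enumerate(order, start=1):
    print(step, name)
```

Calling prerequisites("run_sample_outlier_qc_plotting") returns all four upstream workflows, which is a quick way to check what must already have run before a given step is launched.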
Feedback
Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.