Ancestry Analysis Workflows (WDL)

This section documents the All of Us ancestry analysis workflows used to convert large callsets, identify shared high-quality sites, infer ancestry labels, and perform downstream cohort QC/relatedness analyses.

Quick Summary

  • Purpose: Convert and filter large callsets, build shared high-quality site sets, infer ancestry labels, and perform post-inference sample QC.
  • Primary Outputs: Per-chromosome VCFs, HQ-intersection callsets, ancestry predictions, relatedness artifacts, and interactive QC plots.

High-level Analysis Flow

Ancestry analysis is typically run in two phases:

  1. Core ancestry inference preparation and prediction
  2. Post-inference relatedness and outlier QC

The first phase produces ancestry labels and PCA features used by the second phase.

Phase 1: Core Ancestry Inference

Run these workflows in order:

Step | Workflow | Description | WDL
1 | VDS to VCF | Converts a large VDS into per-contig full and sites-only VCF files for downstream ancestry processing. | WDL
2 | Determine HQ Sites Intersection | Intersects training and input-data variants and filters both datasets to the shared high-quality site set. | WDL
3 | Run Ancestry Inference | Trains and applies a PCA-based Random Forest classifier to infer ancestry labels and generate plots. | WDL
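
The intersection idea behind step 2 can be illustrated with a minimal Python sketch. It assumes a site is identified by contig, position, reference allele, and alternate allele. The `Site` type, the helper names, and the toy records are hypothetical and are not taken from the workflow itself, which operates on VCF files and also applies quality filters that are not modeled here.

```python
from typing import Iterable, List, NamedTuple, Set


class Site(NamedTuple):
    """Hypothetical variant key: contig, 1-based position, ref and alt alleles."""
    contig: str
    pos: int
    ref: str
    alt: str


def shared_sites(training: Iterable[Site], input_data: Iterable[Site]) -> Set[Site]:
    """Return the sites present in both the training and the input callsets."""
    return set(training) & set(input_data)


def filter_to_sites(records: Iterable[Site], keep: Set[Site]) -> List[Site]:
    """Keep only records whose site is in the shared set, preserving input order."""
    return [record for record in records if record in keep]


# Toy data for illustration only.
training = [
    Site("chr1", 100, "A", "G"),
    Site("chr1", 200, "C", "T"),
    Site("chr2", 50, "G", "A"),
]
input_data = [
    Site("chr1", 200, "C", "T"),
    Site("chr2", 50, "G", "A"),
    Site("chr2", 75, "T", "C"),
]

shared = shared_sites(training, input_data)
training_hq = filter_to_sites(training, shared)
input_hq = filter_to_sites(input_data, shared)
```

After filtering, both datasets contain the same sites, which is the property the downstream PCA step relies on.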

Phase 2: Post-inference Cohort QC and Relatedness

Run these workflows, as needed for your analysis goals, once Phase 1 outputs are available:

Step | Workflow | Description | WDL
4 | Run Relatedness | Computes sample-pair relatedness and a maximal independent set of samples flagged for relatedness deconfounding. | WDL
5 | Run Sample Outlier QC | Joins ancestry predictions with callset metrics and identifies ancestry-stratified sample QC outliers. | WDL
6 | Run Sample Outlier QC Plotting | Joins demographics and generates interactive PC and metric visualizations for outlier QC review. | WDL
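
To make the "maximal independent set" in step 4 concrete, here is a minimal Python sketch of one common greedy approach: treat each related pair as an edge in a graph, then keep samples so that no two kept samples share an edge. This is an assumption for illustration, not the algorithm the workflow necessarily uses. The function name, sample IDs, and pairs are hypothetical, and the pairs are assumed to have already passed some kinship threshold.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def greedy_independent_set(
    samples: Iterable[str], related_pairs: Iterable[Tuple[str, str]]
) -> Tuple[Set[str], Set[str]]:
    """Return (kept, flagged): kept samples are pairwise unrelated, and every
    flagged sample is related to at least one kept sample."""
    samples = list(samples)
    neighbors = defaultdict(set)
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    kept: Set[str] = set()
    excluded: Set[str] = set()
    # Visit samples with the fewest relatives first; ties break on sample ID
    # so the result is deterministic.
    for sample in sorted(samples, key=lambda s: (len(neighbors[s]), s)):
        if sample in excluded:
            continue
        kept.add(sample)
        excluded.update(neighbors[sample])

    flagged = set(samples) - kept
    return kept, flagged


# Toy data for illustration only: S2 is related to S1 and S3; S4 is related to S5.
samples = ["S1", "S2", "S3", "S4", "S5"]
related_pairs = [("S1", "S2"), ("S2", "S3"), ("S4", "S5")]
kept, flagged = greedy_independent_set(samples, related_pairs)
```

A greedy pass like this yields a maximal set (no further sample can be added without creating a related pair), but not necessarily the largest possible one.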

Notes

  • Data dependency: run_sample_outlier_qc requires ancestry outputs from run_ancestry.
  • Plotting dependency: run_sample_outlier_qc_plotting requires outputs from run_sample_outlier_qc.
  • Relatedness usage: run_relatedness is commonly used to flag related samples before downstream association analyses.
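
The dependencies above can be sketched as a small graph and ordered with Python's standard `graphlib` module. The names `run_ancestry`, `run_relatedness`, `run_sample_outlier_qc`, and `run_sample_outlier_qc_plotting` come from the notes. The names `vds_to_vcf` and `determine_hq_sites_intersection` are assumed identifiers for steps 1 and 2, and the edge from `run_relatedness` to `run_ancestry` is an assumption based on Phase 2 running after Phase 1.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each workflow maps to the set of workflows whose outputs it needs.
DEPENDS_ON: Dict[str, Set[str]] = {
    "vds_to_vcf": set(),                                    # assumed name for step 1
    "determine_hq_sites_intersection": {"vds_to_vcf"},      # assumed name for step 2
    "run_ancestry": {"determine_hq_sites_intersection"},
    "run_relatedness": {"run_ancestry"},                    # assumed edge (Phase 2 follows Phase 1)
    "run_sample_outlier_qc": {"run_ancestry"},
    "run_sample_outlier_qc_plotting": {"run_sample_outlier_qc"},
}


def execution_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order in which every workflow follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
for step, workflow in enumerate(order, start=1):
    print(step, workflow)
```

More than one order can be valid: `run_relatedness` and `run_sample_outlier_qc` do not depend on each other in this sketch, so either may come first.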

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.