Skip to main content

Mitochondria Single Sample Pipeline Overview

Pipeline VersionDate UpdatedDocumentation AuthorQuestions or Feedback
aou_9.0.1May, 2026WARP PipelinesFile an issue

Introduction to the Mitochondria Single Sample workflow

The Mitochondria Mitochondria Single Sample Pipeline (mitochondria_single_sample.wdl) is a cloud-optimized WDL pipeline that processes per-sample mitochondrial DNA (mtDNA) data from hg38-aligned whole-genome sequencing (WGS) BAM or CRAM files. It produces high-quality per-sample mtDNA variant calls, base-level coverage metrics, and supporting QC statistics for use in downstream cohort-level analysis.

The pipeline is based on the mtSwirl v2.5_MongoSwirl_Single methodology and implements a two-round alignment and variant calling strategy. In the first round, reads are aligned to the standard hg38 mitochondrial reference and variants are called with Mutect2. A personalized self-reference is then constructed from the called variants, and in the second round reads are realigned to this self-reference for improved variant calling accuracy. Variant calls from the second round are lifted back to hg38 coordinates for a consistent final output.

The pipeline optionally computes NUMT (nuclear mitochondrial DNA segment) coverage metrics for quality control purposes. Outputs from this pipeline — particularly the final VCF and per-base coverage metrics — serve as per-sample inputs to the MitochondriaMerge cohort-level pipeline.

This pipeline was originally released as part of: Gupta, R., Kanai, M., Durham, T.J. et al. Nuclear genetic control of mtDNA copy number and heteroplasmy in humans. Nature, 2023. https://doi.org/10.1038/s41586-023-06426-5.

Quickstart table

Pipeline FeatureDescription
Assay typeWhole-genome sequencing (mtDNA)
Overall workflowSubsetting, alignment, variant calling, liftover, coverage metrics
Workflow languageWDL 1.0
Genomic reference sequencehg38
Variant callerMutect2 (GATK)
Data input file formatBAM or CRAM (hg38-aligned)
Data output file formatVCF, TSV (coverage metrics)

Set-up

Mitochondria Single Sample Pipeline installation and requirements

The pipeline code can be downloaded by cloning the WARP GitHub repository. The WDL is located at all_of_us/mitochondria/mitochondria_single_sample.wdl.

The pipeline can be deployed using Cromwell, a GA4GH-compliant workflow management system.

Inputs

Input descriptions

Input variable nameDescriptionType
wgs_aligned_input_bam_or_cramhg38-aligned WGS BAM or CRAM fileFile
wgs_aligned_input_bam_or_cram_indexIndex for the input BAM or CRAMFile?
sample_nameSample identifier used in output filenamesString
mt_interval_listPicard-style interval list for chrM (including putative NUMT intervals)File
nuc_interval_listInterval list for nuclear NUMT regionsFile
ref_fastahg38 reference FASTAFile
ref_fasta_indexIndex for the reference FASTAFile
ref_dictSequence dictionary for the reference FASTAFile
mt_fastaMitochondrial reference FASTAFile
mt_fasta_indexIndex for the mitochondrial reference FASTAFile
mt_dictSequence dictionary for the mitochondrial reference FASTAFile
blacklisted_sitesBED file of artifact-prone sites to excludeFile
blacklisted_sites_indexIndex for the blacklisted sites BED fileFile
control_region_shifted_reference_interval_listInterval list for the shifted control region referenceFile
non_control_region_interval_listInterval list for the non-control regionFile
HailLiftoverHail liftover scriptFile
FaRenamingScriptScript for FASTA renaming during self-reference constructionFile
CheckVariantBoundsScriptScript to validate variant coordinatesFile
CheckHomOverlapScriptScript to check homoplasmy overlapsFile
JsonToolsJSON utilities scriptFile
force_manual_downloadForce manual download of input filesBoolean
max_read_lengthRead length for optimization (default: 151)Int?
vaf_filter_thresholdHard threshold for filtering low-VAF sitesFloat?
f_score_betaF-score beta balancing recall vs. precision in filteringFloat?
verifyBamIDVerifyBamID contamination thresholdFloat?
compress_output_vcfWhether to compress output VCFBoolean
compute_numt_coverageWhether to compute NUMT coverage metricsBoolean
use_haplotype_caller_nucdnaWhether to use HaplotypeCaller for nuclear DNABoolean
skip_restore_hardclipsSkip restoring hardclips (required for some CRAMs, e.g., AoU)Boolean
gatk_versionGATK version stringString
ucsc_dockerDocker image for UCSC toolsString
genomes_cloud_dockerDocker image for genomes-in-the-cloud toolsString
haplochecker_dockerDocker image for haplocheckerString
gatk_samtools_dockerDocker image for GATK + samtools tasksString

Mitochondria Single Sample Pipeline tasks and tools

The pipeline calls a series of sub-workflows and tasks. The following sections describe each major step:

  1. Subset and revert BAM to chrM
  2. Round 1: align and call variants on standard reference
  3. Construct self-reference
  4. Round 2: align and call variants on self-reference
  5. Liftover and collect outputs
  6. (Optional) NUMT coverage
Task / sub-workflowToolDescription
MongoSubsetBamToChrMAndRevertGATK PrintReads, RevertSamSubsets WGS BAM to chrM and NUMT intervals, reverts to unmapped BAM
AlignAndCallR1BWA, Mutect2, HaplotypeCaller, haplocheckerAligns to hg38 mtDNA reference, calls variants, estimates contamination and assigns haplogroup via haplochecker
ProduceSelfReferenceFilesbcftools consensus, Picard, UCSC liftOverConstructs a personalized self-reference FASTA, a shifted control-region reference, and chain files from Round 1 variant calls
AlignAndCallR2BWA, Mutect2Aligns to self-reference, calls variants, collects coverage metrics
MongoLiftoverVCFAndGetCoveragePicard LiftoverVcfLifts Round 2 VCF back to hg38 coordinates
MongoLiftoverSelfAndCollectOutputsPicard, custom scriptsCollects final coverage metrics in hg38 coordinates
NucCoverageAtEveryBasePicard CollectHsMetrics(Optional) Computes per-base NUMT coverage before and after realignment

1. Subset and revert BAM to chrM

The MongoSubsetBamToChrMAndRevert task uses GATK PrintReads to subset the input WGS BAM or CRAM to the chrM and NUMT intervals, then reverts the subset to an unmapped BAM for realignment. Mean chrM coverage is computed at this step.

2. Round 1: align and call variants on standard reference

AlignAndCallR1 aligns the unmapped chrM reads to the standard hg38 mitochondrial reference using BWA, marks duplicates, and calls SNPs and INDELs using Mutect2. HaplotypeCaller is optionally used for nuclear DNA variants. Contamination is estimated and a major haplogroup is assigned.

3. Construct self-reference

ProduceSelfReferenceFiles uses the Round 1 variant calls to construct a personalized self-reference FASTA via bcftools consensus. Chain files for lifting variant coordinates between the self-reference and hg38 are produced, along with updated interval lists.

4. Round 2: align and call variants on self-reference

AlignAndCallR2 realigns the unmapped reads to the self-reference and calls variants again with Mutect2. Alignment to the self-reference reduces reference bias and improves genotype accuracy, particularly for heteroplasmic sites.

5. Liftover and collect outputs

MongoLiftoverVCFAndGetCoverage lifts the Round 2 VCF back to hg38 coordinates. MongoLiftoverSelfAndCollectOutputs collects per-base coverage metrics in hg38 space.

6. (Optional) NUMT coverage

When compute_numt_coverage = true, NucCoverageAtEveryBase uses Picard CollectHsMetrics to compute per-base NUMT coverage on the original reference, self-reference, and shifted self-reference for QC comparison.

Outputs

Output variable nameDescription
subset_bam / subset_baichrM-subset BAM and index
r1_vcf / r1_vcf_indexRound 1 filtered VCF and index
r1_nuc_vcf / r1_nuc_vcf_indexRound 1 nuclear DNA VCF and index
self_ref_vcf / self_ref_vcf_indexRound 2 VCF called on self-reference
final_vcfFinal VCF lifted back to hg38 coordinates
final_rejected_vcfVariants that failed liftover
final_base_level_coverage_metricsPer-base mtDNA coverage in hg38 coordinates
numt_base_level_coverage(Optional) Per-base NUMT coverage
self_reference_fastaPersonalized self-reference FASTA
reference_to_self_ref_chainChain file from hg38 → self-reference
stats_outputsSummary statistics table
mean_coverageMean mtDNA coverage depth
median_coverageMedian mtDNA coverage depth
major_haplogroupAssigned mtDNA haplogroup
contaminationEstimated contamination fraction
mtdna_consensus_overlapsNumber of overlapping consensus sites

Versioning and testing

All Mitochondria Single Sample Pipeline releases are documented in the mitochondria_single_sample changelog.

Citing the Mitochondria Single Sample

When citing this pipeline, please cite the original publication:

Gupta, R., Kanai, M., Durham, T.J. et al. Nuclear genetic control of mtDNA copy number and heteroplasmy in humans. Nature, 2023. https://doi.org/10.1038/s41586-023-06426-5.

When citing WARP, please use:

Kylee Degatano, Aseel Awdeh, Robert Sidney Cox III, Wes Dingman, George Grant, Farzaneh Khajouei, Elizabeth Kiernan, Kishori Konwar, Kaylee L Mathews, Kevin Palis, Nikelle Petrillo, Geraldine Van der Auwera, Chengchen (Rex) Wang, Jessica Way. "Warp Analysis Research Pipelines: Cloud-optimized workflows for biological data processing and reproducible analysis." Bioinformatics, 2025; https://doi.org/10.1093/bioinformatics/btaf494.

Feedback

Please help us make our tools better by filing an issue in WARP.