Skip to main content

Mitochondria Merge Pipeline Overview

Pipeline VersionDate UpdatedDocumentation AuthorQuestions or Feedback
aou_9.1.0May, 2026WARP PipelinesFile an issue

Introduction to the Mitochondria Merge workflow

The mtDNA Coverage Merge Pipeline (mitochondria_merge.wdl) is a cloud-optimized WDL pipeline that takes per-sample mtDNA VCF files and coverage metrics produced by the Mitochondria Single Sample and merges them into a single annotated cohort-wide callset. The pipeline is designed to scale to hundreds of thousands of samples and has been validated on the All of Us (AoU) v9 dataset of 535,000 samples.

The pipeline proceeds in six stages: metadata preprocessing, HDF5 coverage database construction, sharded VCF ingestion and multi-round merging, sharded homoplasmic-reference imputation and artifact filtering, and finally a dual-track annotation step that produces both a complete callset and a QC-filtered callset.

A companion utility pipeline, get_wgs_median_coverage.wdl, can be used to extract WGS median coverage values from per-sample Picard metrics files and produce the wgs_median_coverage_tsv input required by this pipeline.

Quickstart table

Pipeline FeatureDescription
Assay typeWhole-genome sequencing (mtDNA, cohort-level)
Overall workflowMetadata join, coverage DB construction, sharded VCF merge, hom-ref imputation, annotation
Workflow languageWDL 1.0
Genomic reference sequencehg38
Data input file formatTSV (per-sample VCF and coverage file paths)
Data output file formatHail MatrixTable (tar.gz), VCF, TSV

Set-up

Mitochondria Merge installation and requirements

The pipeline code can be downloaded by cloning the WARP GitHub repository. The WDL is located at all_of_us/mitochondria/mitochondria_merge.wdl.

The pipeline can be deployed using Cromwell, a GA4GH-compliant workflow management system.

Preparing inputs with get_wgs_median_coverage

The get_wgs_median_coverage.wdl utility pipeline reads a two-column TSV (research_id, metrics_path) pointing to per-sample Picard WGS metrics files and extracts the median coverage value for each sample. Its output (median_coverage_tsv) is the wgs_median_coverage_tsv input to mitochondria_merge. The pipeline shards the input TSV for parallel processing and merges results at the end.

Inputs

Sample data inputs

Input variable nameDescriptionType
full_data_tsvPer-sample data table with paths to VCF and coverage files from the Mitochondria Single Sample PipelineFile
sample_list_tsv(Optional) Single-column list of sample IDs to subset toFile?

Metadata inputs

Input variable nameDescriptionType
coverage_tsvPer-sample QC metrics (mean WGS coverage, contamination, biosample collection date)File
ancestry_tsvPredicted genetic ancestry per sampleFile
dob_tsvDate of birth per sampleFile
wgs_median_coverage_tsvWGS median coverage per sample (produced by get_wgs_median_coverage.wdl)File

GCS bucket inputs

Input variable nameDescriptionType
vcf_merge_output_bucketGCS bucket for intermediate MatrixTable tarballs (VCF shards, merged MTs, finalized shards)String
annotated_output_bucketGCS bucket for final annotated outputsString

Optional tuning inputs

Input variable nameDefaultDescriptionType
vcf_col_name"final_vcf"Column name in full_data_tsv containing the per-sample VCF pathString
combined_mt_name"combined_vcf"Name for the combined output MatrixTableString
vcf_merge_shard_size2500Samples per VCF ingest shardInt
vcf_merge_merge_fanin10Fan-in per merge roundInt
vcf_merge_shard_n_partitions192Hail partitions per shard MTInt
finalize_shard_size25000Samples per finalize shardInt
finalize_shard_n_partitions256Hail partitions per finalize shard MTInt
finalize_union_n_partitions1000Hail partitions for the final unioned MTInt

Mitochondria Merge tasks and tools

The pipeline calls a series of tasks defined in mitochondria_merge.wdl. The following sections describe each step:

  1. (Optional) Subset data table
  2. Preprocess metadata TSV
  3. Build HDF5 coverage database
  4. Shard, ingest, and merge VCFs
  5. Finalize: impute hom-ref and filter artifacts
  6. Add annotations
Task nameDocker imageDescription
subset_data_tablepython-numpy-pandas:1.0.0-2.2.3-1.25.2(Optional) Filters the input data table to a specified sample list
process_tsv_filespython-numpy-pandas:1.0.0-2.2.3-1.25.2Joins metadata TSVs into a master sample sheet
annotate_coverageaou-mito-coverage-db:1.0.0Builds an HDF5 coverage database from per-sample coverage TSVs
make_vcf_shards_from_tsvpython-numpy-pandas:1.0.0-2.2.3-1.25.2Partitions the sample list into shard TSVs
build_vcf_shard_mtaou-mito-hail-processing:1.0.0Imports one shard of VCFs into a Hail MatrixTable
make_mt_merge_groupspython-numpy-pandas:1.0.0-2.2.3-1.25.2Plans fan-in merge groups from a list of MT tarballs
merge_mt_shardsaou-mito-hail-processing:1.0.0Merges a group of MT tarballs into one MT via multi_way_union_mts
shard_mt_by_samplesaou-mito-hail-processing:1.0.0Splits the merged MT into per-sample-shard MT tarballs
finalize_mt_with_covdbaou-mito-hail-processing:1.0.0Imputes hom-ref genotypes from HDF5 coverage DB and applies artifact filters
union_mt_shardsaou-mito-hail-processing:1.0.0Unions finalized sample-shard MTs into a single cohort MT
add_annotationsaou-mito-hail-processing:1.0.0Adds cohort-wide variant statistics, QC filters, and population/haplogroup allele frequencies

1. (Optional) Subset data table

subset_data_table filters full_data_tsv to only the rows matching sample IDs in sample_list_tsv. If no sample list is provided, the full table passes through unchanged.

2. Preprocess metadata TSV

process_tsv_files joins the sample data table with coverage_tsv, ancestry_tsv, dob_tsv, and wgs_median_coverage_tsv on sample ID using pandas. Columns are renamed to the expected downstream format (ancestry_predpop, etc.) and an age field is derived from biosample collection date and date of birth. The output processed_tsv is the master sample sheet for all downstream tasks.

3. Build HDF5 coverage database

annotate_coverage reads the per-sample coverage TSVs listed in processed_tsv and writes coverage values into an HDF5 file (coverage.h5) with shape (n_samples, n_positions). Per-position summary statistics (mean, median, fraction of samples with >100× and >1000× coverage) are computed using numpy. This step runs entirely without Hail or Spark.

Output coverage_db.tar.gz contains coverage.h5 and coverage_summary.tsv and is passed to the finalize and annotation tasks.

4. Shard, ingest, and merge VCFs

VCF ingestion is parallelized across sample shards to avoid driver memory exhaustion at large cohort sizes:

  • make_vcf_shards_from_tsv partitions the sample list into TSV shards of vcf_merge_shard_size samples each.
  • build_vcf_shard_mt (scattered) imports one shard of per-sample VCFs into a Hail MatrixTable and writes the result as a .tar.gz to vcf_merge_output_bucket.
  • make_mt_merge_groups + merge_mt_shards (3 rounds, scattered) merge shard MTs in a fan-in tree with vcf_merge_merge_fanin MTs per group. Three rounds reduce hundreds of shard MTs to a single merged MT. Intermediate tarballs are persisted in vcf_merge_output_bucket for call-caching and restartability.

5. Finalize: impute hom-ref and filter artifacts

The merged MT is further sharded by samples for parallelized finalization:

  • shard_mt_by_samples splits the merged MT into sample shards of finalize_shard_size samples each.
  • finalize_mt_with_covdb (scattered) processes each sample shard: reads the HDF5 coverage DB in position blocks, filters to only the rows in each block before annotating entries (avoiding repeated full-matrix scans), imputes homoplasmic-reference genotypes where coverage exceeds the threshold, and applies the artifact-prone site filter.
  • union_mt_shards reunites the finalized sample shards into a single cohort-wide MT via union_cols.

6. Add annotations

add_annotations runs add_annotations.py on the final unioned MT to compute cohort-wide variant statistics, per-population and per-haplogroup allele frequencies, and QC filters. It is called twice in parallel:

  • annotated (keep_all_samples = true): all samples retained
  • filt_annotated (keep_all_samples = false): samples failing QC filters (contamination, copy number, etc.) removed

Outputs are written directly to timestamped subdirectories under annotated_output_bucket.

Outputs

Output variable nameDescriptionType
processed_tsvMaster sample sheet after metadata joinsFile
output_coverage_dbcoverage_db.tar.gz — HDF5 coverage database and summary TSVFile
combined_vcfTar.gz of the finalized, unioned cohort MatrixTableFile
annotated_output_gcs_pathGCS path to annotated outputs (all samples)String
filt_annotated_output_gcs_pathGCS path to annotated outputs (QC-filtered samples only)String

Downstream pipeline: Mito Post Processing

After MitochondriaMerge completes, the annotated Hail MatrixTable can be passed to the optional mito_post_processing.wdl pipeline, which filters the callset and generates a standard set of QC plots.

Mito Post Processing inputs

Input variable nameDescriptionType
input_pathGCS path to the annotated Hail MatrixTable (output of add_annotations)String
output_pathGCS destination directory for all outputsString
output_baseFilename prefix applied to every output fileString
hail_dockerDocker image (default: us.gcr.io/broad-gotc-prod/aou_mitochondria_post:0.0.5)String

Mito Post Processing task

The pipeline has a single task, RunMitoPostProcessing, which runs mito_plot_filter.py to apply variant and sample filters and produce all outputs.

Task nameDocker imageDescription
RunMitoPostProcessingaou_mitochondria_post:0.0.5Filters the annotated MatrixTable and generates QC plots

Mito Post Processing outputs

Output variable nameDescriptionType
filtered_vcfbgzipped VCF of the filtered callsetFile
filtered_vcf_tbiTabix index for filtered_vcfFile
sample_metadata_tsvPer-sample metadata tableFile
variants_per_sample_svgDistribution of variant counts per sampleFile
mito_cn_distribution_svgmtDNA copy number distribution across samplesFile
variant_allele_frequency_svgVariant allele frequency spectrumFile
variant_af_and_allele_fraction_svgCombined AF and allele fraction plotFile
numt_fp_by_mtcn_svgNuMT false-positive rate as a function of mtCNFile
haplogroup_heteroplasmy_svgHeteroplasmy distribution per haplogroupFile
haplogroup_homoplasmy_svgHomoplasmy distribution per haplogroupFile

Versioning and testing

All mitochondria merge pipeline releases are documented in the mitochondria merge changelog.

Citing the Mitochondria Merge Pipeline

When citing this pipeline, please cite the original mtSwirl publication:

Gupta, R., Kanai, M., Durham, T.J. et al. Nuclear genetic control of mtDNA copy number and heteroplasmy in humans. Nature, 2023. https://doi.org/10.1038/s41586-023-06426-5.

When citing WARP, please use:

Kylee Degatano, Aseel Awdeh, Robert Sidney Cox III, Wes Dingman, George Grant, Farzaneh Khajouei, Elizabeth Kiernan, Kishori Konwar, Kaylee L Mathews, Kevin Palis, Nikelle Petrillo, Geraldine Van der Auwera, Chengchen (Rex) Wang, Jessica Way. "Warp Analysis Research Pipelines: Cloud-optimized workflows for biological data processing and reproducible analysis." Bioinformatics, 2025; https://doi.org/10.1093/bioinformatics/btaf494.

Feedback

Please help us make our tools better by filing an issue in WARP.