Imputation Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| :-- | :-- | :-- | :-- |
| ImputationBeagle_v2.3.0 | November, 2025 | Terra Scientific Pipeline Services | Please file an issue in WARP. |

Introduction to the Array Imputation pipeline

The Array Imputation pipeline imputes missing genotypes from either a multi-sample VCF or an array of single-sample VCFs using a large genomic reference panel. It uses Beagle as the imputation tool. Overall, the pipeline filters, phases, and performs imputation on a multi-sample VCF. It outputs the imputed VCF along with key imputation metrics.

Set-up

Workflow installation and requirements

The Array Imputation workflow is written in the Workflow Description Language (WDL) and can be deployed using a WDL-compatible execution engine such as Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. To identify the latest workflow version and release notes, see the Imputation workflow changelog. The latest release of the workflow, example data, and dependencies are available from the WARP releases page. To discover and search releases, use the WARP command-line tool Wreleaser.
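For example, a run on a local Cromwell instance follows the standard Cromwell invocation; the sketch below uses hypothetical file names for the workflow and its inputs JSON (described in the next section), so substitute the files from the WARP release you downloaded.

```bash
# Hypothetical file names; use the WDL and inputs JSON from your WARP release
java -jar cromwell.jar run ImputationBeagle.wdl --inputs ImputationBeagle.inputs.json
```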

Using the Array Imputation pipeline

This pipeline is used by the All of Us + AnVIL Imputation Service. If you choose to use this service, you can impute your samples against the 515,000+ genomes in the All of Us + AnVIL reference panel, which can provide greater accuracy at more sites.

Try the Imputation pipeline in Terra

You can alternatively run the pipeline with your own panel, using this WDL.

Input descriptions

The table below describes each of the Array Imputation pipeline inputs. The workflow requires a multi-sample VCF whose samples all come from the same species and the same genotyping chip.

For examples of how to specify each input in a configuration file, as well as cloud locations for different example input files, see the example input configuration file (JSON).

| Input name | Description | Type |
| :-- | :-- | :-- |
| multi_sample_vcf | Multi-sample VCF file containing genotype data. | File |
| ref_dict | Reference dictionary used for contig information and header updating. | File |
| contigs | Array of allowed contigs/chromosomes to process. | Array of strings |
| reference_panel_path_prefix | Cloud storage path prefix for the reference panel files for all contigs. | String |
| genetic_maps_path | Cloud storage path for the genetic map files for all contigs. | String |
| output_basename | Basename for intermediate and output files. | String |
| chunkLength | Size of each genomic chunk; default 25 Mb. | Int |
| chunkOverlaps | Padding added to the beginning and end of each chunk to reduce edge effects; default 2 Mb. | Int |
| sample_chunk_size | Number of samples per chunk when processing (default: 1,000). | Int |
| pipeline_header_line | Optional additional header line to add to the output VCF. | String |
| min_dr2_for_inclusion | Minimum DR2 value a variant must have to be included in the final output (default: 0.0). | Float |
| bref3_suffix | File extension used for the BREF3 files in the reference panel (default: .bref3). | String |
| unique_variant_ids_suffix | File extension for the unique variant ID files (default: .unique_variants). | String |
| gatk_docker | GATK Docker image (default: us.gcr.io/broad-gatk/gatk:4.6.0.0). | String |
| ubuntu_docker | Ubuntu Docker image (default: us.gcr.io/broad-dsde-methods/ubuntu:20.04). | String |
| error_count_override | Override for the error check on chunk QC; set to 0 for the workflow to continue regardless of how many errors exist. | Int |
| beagle_cpu | Number of CPUs to use for the Beagle Phase and Impute tasks (default: 8). | Int |
| beagle_phase_memory_in_gb | Memory, in GB, to use for the Beagle Phase task (default: 40). | Int |
| beagle_impute_memory_in_gb | Memory, in GB, to use for the Beagle Impute task (default: 45). | Int |
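To illustrate the configuration format, the sketch below is a minimal, hypothetical inputs JSON covering only the required file and path inputs. The ImputationBeagle workflow namespace is inferred from the pipeline version above, and all bucket paths are placeholders; defer to the linked example configuration file for authoritative key names and values.

```json
{
  "ImputationBeagle.multi_sample_vcf": "gs://my-bucket/cohort.vcf.gz",
  "ImputationBeagle.ref_dict": "gs://my-bucket/Homo_sapiens_assembly38.dict",
  "ImputationBeagle.contigs": ["chr1", "chr2", "chr21"],
  "ImputationBeagle.reference_panel_path_prefix": "gs://my-bucket/panel/panel_prefix",
  "ImputationBeagle.genetic_maps_path": "gs://my-bucket/genetic_maps/",
  "ImputationBeagle.output_basename": "cohort_imputed"
}
```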

Workflow tasks and tools

The Array Imputation workflow imports a series of tasks from the ImputationTasks WDL and ImputationBeagleTasks WDL, which are hosted in the Broad tasks library. The table below describes each workflow task, including the task name, the tool and software it uses, and a description of what it does.

| Task name (alias) in WDL | Tool | Software | Description |
| :-- | :-- | :-- | :-- |
| CountSamples | query | bcftools | Uses the merged input VCF to count the number of samples and outputs a TXT file containing the count. |
| CreateVcfIndex | index | bcftools | Creates an index of the input multi-sample VCF. |
| CalculateContigsToProcess | | | Determines which contigs the workflow will process by extracting the contigs present in the input VCF (multi_sample_vcf) and filtering them against the allowed contigs input. |
| CalculateChromosomeLength | grep | bash | Reads chromosome lengths from the reference dictionary and uses them to generate chunk intervals for the GenerateChunk task. |
| GenerateChunk | SelectVariants | GATK | Performs site filtering: selects SNPs only (excluding indels), removes duplicate sites, selects biallelic variants, excludes symbolic/mixed variants, and removes sites where the fraction of samples with no-call genotypes exceeds 0.1. Also subsets to the specified chunk of the genome. |
| ExtractUniqueVariantIds | SelectVariants, query | GATK, bcftools | Extracts and counts unique variant IDs, in CHROM:POS:REF:ALT format, from an optionally specified interval of the input VCF. |
| CountUniqueVariantIdsInOverlap | comm | bash | Counts variants in the filtered VCF chunk using the unique variant ID lists; returns the number of variants in the chunk and the number also present in the reference panel. |
| CheckChunks | convert | bcftools | Confirms that no chunk has fewer than 3 sites, or fewer than 50% of its array sites present in the reference panel; if valid, creates a new VCF output. |
| CountValidContigChunks | | bash | Counts the number of valid chunks by counting true values in the validation boolean array. |
| StoreMetricsInfo | | Python (pandas) | Gathers all results from CheckChunks; creates chunk-level and contig-level (chromosome-level) metrics files with variant counts. |
| ErrorWithMessageIfErrorCountNotZero | | bash | Fails the workflow if any chunk fails the QC check; can be overridden with the error_count_override input. |
| SelectSamplesWithCut | cut | bash | Splits the input VCF into chunks of sample_chunk_size samples when it contains more than sample_chunk_size samples. |
| Phase | | Beagle | Performs phasing on the filtered, validated VCF. |
| Impute | | Beagle | Performs imputation on the pre-phased VCF. |
| LocalizeAndSubsetVcfToRegion | SelectVariants | GATK | Removes the overlap padding from the imputed VCF. |
| QuerySampleChunkedVcfForReannotation | query | bcftools | Queries DS, AP1, and AP2 values from the sample-chunked VCFs for use when merging samples back together. |
| RemoveAPAnnotations | annotate | bcftools | Removes the AP1 and AP2 annotations, which are no longer needed, to reduce file size. |
| RecalculateDR2AndAFChunked | | Python | Uses the query output to summarize DS, AP1, and AP2 values. |
| MergeSampleChunksVcfsWithPaste | paste, view | bash, bcftools | Merges the sample-chunked VCFs. |
| IndexMergedSampleChunksVcfs | index | bcftools | Creates an index for the sample-chunk-merged VCF. |
| AggregateChunkedDR2AndAF | | Python | Takes the summarized DS, AP1, and AP2 data and calculates AF and DR2. |
| ReannotateDR2AndAF | annotate | bcftools | Reannotates DR2 and AF in the sample-chunk-merged VCF. |
| FilterVcfByDR2 | | | Filters variants in the imputed VCF by DR2 threshold, removing variants whose DR2 is below the provided min_dr2_for_inclusion value. |
| UpdateHeader | UpdateVCFSequenceDictionary | GATK | Updates the header of the imputed VCF and adds contig lengths. |
| GatherVcfsNoIndex | GatherVcfs | GATK | Gathers the array of imputed VCFs and merges them into a single VCF output. |
| CreateIndexForGatheredVcf | index | bcftools | Creates an index for the final output VCF. |
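Several of the bcftools-based tasks above amount to one-line commands. As a rough sketch (file names are hypothetical and the exact flags in the WDL may differ), the CountSamples and CreateVcfIndex steps are conceptually equivalent to:

```bash
# Count the samples in the input VCF (as in CountSamples)
bcftools query -l multi_sample.vcf.gz | wc -l

# Create a tabix index for the input VCF (as in CreateVcfIndex)
bcftools index --tbi multi_sample.vcf.gz
```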

Workflow outputs

The table below summarizes the workflow outputs. If running the workflow on Cromwell, these outputs are found in the task execution directory.

| Output name | Description | Type |
| :-- | :-- | :-- |
| imputed_multisample_vcf | VCF from the CreateIndexForGatheredVcf task; contains imputed variants as well as missing variants from the input VCF. | VCF |
| imputed_multisample_vcf_index | Index file for the VCF from the CreateIndexForGatheredVcf task. | Index |
| chunks_info | TSV from the StoreMetricsInfo task; contains the chunk intervals, variant counts per chunk (filtered input and overlap with the reference panel), and whether each chunk was successfully imputed. | TSV |
| contigs_info | TSV from the StoreMetricsInfo task; contains contig-level (chromosome-level) aggregated metrics, including total variant counts in the raw input, filtered input, and overlap with the reference panel for each processed chromosome. | TSV |
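If you later want a stricter quality cutoff than the min_dr2_for_inclusion value you ran with, the final VCF can also be filtered after the fact. The sketch below assumes the Beagle-style DR2 INFO field written by the ReannotateDR2AndAF task, with an arbitrary example threshold and hypothetical file names.

```bash
# Keep only variants whose estimated imputation accuracy (DR2) is at least 0.8 (example threshold)
bcftools view -i 'INFO/DR2>=0.8' cohort_imputed.vcf.gz -Oz -o cohort_imputed.dr2ge0.8.vcf.gz
bcftools index --tbi cohort_imputed.dr2ge0.8.vcf.gz
```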

Important notes

  • Runtime parameters are optimized for Broad's Google Cloud Platform implementation.

Citing the Imputation Pipeline

If you use the Imputation Pipeline in your research, please cite our publication:

Degatano, K., Awdeh, A., Cox III, R.S., Dingman, W., Grant, G., Khajouei, F., Kiernan, E., Konwar, K., Mathews, K.L., Palis, K., et al. Warp Analysis Research Pipelines: Cloud-optimized workflows for biological data processing and reproducible analysis. Bioinformatics 2025; btaf494. https://doi.org/10.1093/bioinformatics/btaf494

Contact us

Help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.

Licensing

Copyright Broad Institute, 2020 | BSD-3

The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.