
Imputation Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| :--- | :--- | :--- | :--- |
| ImputationBeagle_v2.0.0 | August, 2025 | Terra Scientific Pipeline Services | Please file an issue in WARP. |

Introduction to the Array Imputation pipeline

The Array Imputation pipeline imputes missing genotypes from either a multi-sample VCF or an array of single-sample VCFs using a large genomic reference panel. It uses Beagle as the imputation tool. Overall, the pipeline filters, phases, and performs imputation on a multi-sample VCF. It outputs the imputed VCF along with key imputation metrics.

Set-up

Workflow installation and requirements

The Array Imputation workflow is written in the Workflow Description Language (WDL) and can be deployed using a WDL-compatible execution engine like Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. To identify the latest workflow version and release notes, see the Imputation workflow changelog. The latest release of the workflow, example data, and dependencies are available from the WARP releases page. To discover and search releases, use the WARP command-line tool Wreleaser.
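For example, once the workflow WDL, its dependencies, and an inputs JSON have been downloaded from the WARP releases page, a run can be launched from the command line with Cromwell. The Cromwell version and file names below are illustrative, not prescriptive.

```bash
# Launch the workflow locally with Cromwell; the jar version and file names are
# placeholders — substitute the files from the WARP release you downloaded.
java -jar cromwell-86.jar run ImputationBeagle.wdl \
  --inputs ImputationBeagle.inputs.json
```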

Using the Array Imputation pipeline

This pipeline is used by the All of Us + AnVIL Imputation Service. If you choose to use this service, you can impute your samples against the 515,000+ genomes in the All of Us + AnVIL reference panel, which can provide greater accuracy at more sites.

Try the Imputation pipeline in Terra

You can alternatively run the pipeline with your own panel, using this WDL.

Input descriptions

The table below describes each of the Array Imputation pipeline inputs. The workflow requires a multi-sample VCF. These samples must be from the same species and genotyping chip.

For examples of how to specify each input in a configuration file, as well as cloud locations for different example input files, see the example input configuration file (JSON).

| Input name | Description | Type |
| :--- | :--- | :--- |
| ChunkLength | Size of chunks; default set to 25 Mb. | Int |
| chunkOverlaps | Padding added to the beginning and end of each chunk to reduce edge effects; default set to 5 Mb. | Int |
| sample_chunk_size | Number of samples to chunk by when processing; default set to 1,000. | Int |
| multi_sample_vcf | Merged VCF containing multiple samples; can also use an array of individual VCFs. | File |
| ref_dict | Reference dictionary. | File |
| contigs | Array of strings defining which contigs (chromosomes) should be used for the reference panel. | Array of strings |
| reference_panel_path_prefix | Path to the cloud storage containing the reference panel files for all contigs. | String |
| genetics_maps_path | Path to the cloud storage containing the genetic map files for all contigs. | File |
| output_callset_name | Output callset name. | String |
| bcf_suffix | File extension used for the BCF files in the reference panel. | String |
| bref3_suffix | File extension used for the BREF3 files in the reference panel. | String |
| error_count_override | Override for the error check on chunk QC (set to 0 for the workflow to continue no matter how many errors exist). | Int |
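As a rough illustration of the shape of an inputs file, the sketch below writes a minimal JSON using the input names from the table above. The `ImputationBeagle` workflow-name prefix, bucket paths, and values are assumptions; defer to the example input configuration file linked above for the authoritative layout.

```bash
# Hypothetical inputs JSON — the workflow-name prefix, paths, and values are
# placeholders; the WARP example configuration file is the source of truth.
cat > ImputationBeagle.inputs.json <<'EOF'
{
  "ImputationBeagle.multi_sample_vcf": "gs://my-bucket/cohort_array.vcf.gz",
  "ImputationBeagle.ref_dict": "gs://my-bucket/Homo_sapiens_assembly38.dict",
  "ImputationBeagle.contigs": ["chr20", "chr21"],
  "ImputationBeagle.reference_panel_path_prefix": "gs://my-bucket/panel/ref_panel",
  "ImputationBeagle.genetics_maps_path": "gs://my-bucket/genetic_maps/",
  "ImputationBeagle.output_callset_name": "cohort_imputed",
  "ImputationBeagle.bcf_suffix": ".bcf",
  "ImputationBeagle.bref3_suffix": ".bref3",
  "ImputationBeagle.ChunkLength": 25000000
}
EOF
```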

Workflow tasks and tools

The Array Imputation workflow imports a series of tasks from the ImputationTasks WDL and ImputationBeagleTasks WDL, which are hosted in the Broad tasks library. The table below describes each workflow task, including the task name, tools, relevant software and non-default parameters.

| Task name (alias) in WDL | Tool | Software | Description |
| :--- | :--- | :--- | :--- |
| CountSamples | query | bcftools | Uses the merged input VCF file to count the number of samples and outputs a TXT file containing the count. |
| CreateVcfIndex | index | bcftools | Creates an index of the input multi-sample VCF. |
| CalculateChromosomeLength | grep | bash | Reads chromosome lengths from the reference dictionary and uses these to generate chunk intervals for the GenerateChunk task. |
| GenerateChunk | SelectVariants | GATK | Performs site filtering by selecting SNPs only and excluding InDels, removing duplicate sites from the VCF, selecting biallelic variants, excluding symbolic/mixed variants, and removing sites where the fraction of samples with no-call genotypes is greater than 0.1. Also subsets to only a specified chunk of the genome. |
| CountVariantsInChunks | CountVariants, intersect | GATK, bedtools | Counts variants in the filtered VCF chunk; returns the number of variants in the array chunk and the number also present in the reference panel. |
| CheckChunks | convert | bcftools | Confirms that there are no chunks where fewer than 3 sites, or less than 50% of the sites in the array, are also in the reference panel; if valid, creates a new VCF output. |
| StoreChunksInfo | | R | Gathers all results from the CheckChunks task. |
| ErrorWithMessageIfErrorCountNotZero | | bash | Fails the workflow if any chunks fail the QC check; can be overridden with the error_count_override input. |
| SelectSamplesWithCut | cut | bash | Chunks the VCF by sample_chunk_size if more than sample_chunk_size samples exist in the input VCF. |
| Phase | | Beagle | Performs phasing on the filtered, validated VCF. |
| Impute | | Beagle | Performs imputation on the prephased VCF. |
| LocalizeAndSubsetVcfToRegion | SelectVariants | GATK | Removes padding from the imputed VCF. |
| QuerySampleChunkedVcfForReannotation | query | bcftools | Queries DS, AP1, and AP2 from the sample-chunked VCFs to be used when merging samples back together. |
| RemoveAPAnnotations | annotate | bcftools | Removes the AP1 and AP2 annotations to reduce file size now that they are no longer needed. |
| RecalculateDR2AndAFChunked | | python | Uses the query output to summarize DS, AP1, and AP2 values. |
| MergeSampleChunksVcfsWithPaste | paste, view | bash, bcftools | Merges the sample-chunked VCFs. |
| IndexMergedSampleChunksVcfs | index | bcftools | Creates an index for the sample-chunk-merged VCF. |
| AggregateChunkedDR2AndAF | | python | Takes the summarized DS, AP1, and AP2 data and calculates AF and DR2. |
| ReannotateDR2AndAF | annotate | bcftools | Reannotates DR2 and AF in the sample-chunk-merged VCF. |
| UpdateHeader | UpdateVCFSequenceDictionary | GATK | Updates the header of the imputed VCF; adds contig lengths. |
| GatherVcfsNoIndex | GatherVcfs | GATK | Gathers the array of imputed VCFs and merges them into one VCF output. |
| CreateIndexForGatheredVcf | index | bcftools | Creates an index for the final output VCF. |
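To make the table above more concrete, the sketch below approximates the kinds of commands the chunking, phasing, and imputation tasks issue. The exact arguments, parameter values, and file names live in the task WDLs; the intervals, paths, and jar name here are illustrative only, and the pipeline runs phasing and imputation as separate Beagle invocations rather than the single combined call shown.

```bash
# GenerateChunk (GATK SelectVariants): subset one genomic chunk and apply site filters.
# The interval and file names are placeholders.
gatk SelectVariants \
  -V cohort_array.vcf.gz \
  -L chr20:1-30000000 \
  --select-type-to-include SNP \
  --restrict-alleles-to BIALLELIC \
  --max-nocall-fraction 0.1 \
  -O chunk_chr20_1.vcf.gz

# Phase / Impute (Beagle): phase the chunk and impute against the reference panel.
# Jar name, genetic map, and panel file are placeholders.
java -jar beagle.jar \
  gt=chunk_chr20_1.vcf.gz \
  ref=ref_panel_chr20.bref3 \
  map=plink.chr20.GRCh38.map \
  chrom=chr20:1-30000000 \
  impute=true \
  out=imputed_chunk_chr20_1

# Index the imputed chunk (bcftools index).
bcftools index -t imputed_chunk_chr20_1.vcf.gz
```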

Workflow outputs

The table below summarizes the workflow outputs. If running the workflow on Cromwell, these outputs are found in the task execution directory.

| Output name | Description | Type |
| :--- | :--- | :--- |
| imputed_multisample_vcf | VCF from the CreateIndexForGatheredVcf task; contains imputed variants as well as missing variants from the input VCF. | VCF |
| imputed_multisample_vcf_index | Index file for the VCF from the CreateIndexForGatheredVcf task. | Index |
| chunks_info | TSV from the StoreChunksInfo task; contains the chunk intervals as well as the number of variants in the array. | TSV |
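Once the workflow finishes, a few quick bcftools checks can confirm the output looks sensible. The file name below is illustrative; the DR2 and AF fields queried are the ones reannotated by the ReannotateDR2AndAF task.

```bash
# Count samples and records in the final imputed VCF (file name is a placeholder).
bcftools query -l cohort_imputed.vcf.gz | wc -l
bcftools index -n cohort_imputed.vcf.gz

# Peek at per-site imputation quality (DR2) and allele frequency (AF).
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/DR2\t%INFO/AF\n' cohort_imputed.vcf.gz | head
```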

Important notes

  • Runtime parameters are optimized for Broad's Google Cloud Platform implementation.

Citing the Imputation Pipeline

If you use the Imputation Pipeline in your research, please consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Contact us

Help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.

Licensing

Copyright Broad Institute, 2020 | BSD-3

The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.