Skip to main content

Imputation Overview

Pipeline VersionDate UpdatedDocumentation AuthorQuestions or Feedback
Imputation_v1.1.12February, 2024Elizabeth KiernanPlease file GitHub issues in warp or contact the WARP team

Introduction to the Imputation pipeline

The Imputation pipeline imputes missing genotypes from either a multi-sample VCF or an array of single sample VCFs using a large genomic reference panel. It is based on the Michigan Imputation Server pipeline. Overall, the pipeline filters, phases, and performs imputation on a multi-sample VCF. It outputs the imputed VCF along with key imputation metrics.

Set-up

Workflow installation and requirements

The Imputation workflow is written in the Workflow Description Language (WDL) and can be deployed using a WDL-compatible execution engine like Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms.

To identify the latest workflow version and release notes, please see the Imputation workflow changelog.

The latest release of the workflow, example data, and dependencies are available from the WARP releases page. To discover and search releases, use the WARP command-line tool Wreleaser.

Try the Imputation pipeline in Terra

You can run the pipeline in the Imputation workspace on Terra, a cloud-optimized scalable bioinformatics platform. The workspace contains a preconfigured workflow, example inputs, instructions, and cost-estimates.

Input descriptions

The table below describes each of the Imputation pipeline inputs. The workflow requires either a multi-sample VCF or an array of single sample VCFs. These samples must be from the same species and genotyping chip.

You must have two or more samples to run the pipeline.

However, the pipeline is cost-optimized for between 100 and 1,000 samples. After 1,000 samples, the cost per sample no longer decreases (see the Price estimates section).

For examples of how to specify each input in a configuration file, as well as cloud locations for different example input files, see the example input configuration file (JSON).

Input nameDescriptionType
ChunkLengthSize of chunks; default set to 25 MB.Int
chunkOverlapsPadding adding to the beginning and end of each chunk to reduce edge effects; default set 5 MB.Int
multi_sample_vcfMerged VCF containing multiple samples; can also use an array of individual VCFs.File
multi_sample_vcf_indexMerged index for the merged VCF; can also use an array of index files if using an array of VCFs.Index
single_sample_vcfsArray of VCFs, one for each sample; can be used in lieu of a merged VCF containing all samples.Array of files
single_sample_vcf_indicesArray of indices, one for each sample; can be used in lieu of a merged index for a multi-sample VCF.Array of index files
perform_extra_qc_stepsBoolean to indicate if additional QC steps should be performed before imputing; when true, sites with call rates below 95% or low Hardy Weinberg Equilibrium (HWE) p-value are removed before imputation. Default is set to false.Boolean
optional_qc_max_missingOptional float used for the additional QC steps that sets a max threshold for the maximum rate of missing data allowed for sites; default set to 0.05.Float
optional_qc_hweOptional HWE p-value when performing additional QC steps; default set to 0.000001.Float
ref_dictReference dictionary.File
contigsArray of strings defining which contigs (chromsomes) should be used for the reference panel.Array of strings
reference_panel_pathPath to the cloud storage containing the reference panel files for all contigs.String
genetics_maps_eagleGenetic map file for phasing.File
output_callset_nameOutput callset name.String
split_output_to_single_sampleBoolean to split out the final combined VCF to individual sample VCFs; set to false by default.Boolean
merge_ssvcf_mem_mbOptional integer specifying memory allocation for MergeSingleSampleVcfs (in MB); default is 3000.Int
frac_well_imputed_thresholdThreshold for the fraction of well-imputed sites; default set to 0.9.Float
chunks_fail_thresholdMaximum threshold for the number of chunks allowed to fail; default set to 1.Float
vcf_suffixFile extension used for the VCF in the reference panel.String
vcf_index_suffixFile extension used for the VCF index in the reference panel.String
bcf_suffixFile extension used for the BCF in the reference panel.String
bcf_index_suffixFile extension used for the BCF index in the reference panel.String
m3vcf_suffixFile extension used for the M3VCF in the reference panel.String

Imputation reference panel

The Imputation workflow's reference panel files are hosted in a public Google Bucket. For the cloud-path (URI) to the files, see the example input configuration.

Generation of the modified 1000 Genomes reference

Initial tests of the Imputation workflow followed by assessments of polygenic risk score revealed that disease risk scores were lower when computed from imputed array data as opposed to whole-genome sequencing data. This was found to be due to incorrectly genotyped sites in the 1000G reference panel. As a result, the 1000G reference files were modified for the Imputation pipeline as described in the references overview. You can view the original, unmodified 1000G VCFs here.

X-chromosome not imputed

Currently, the pipeline does not perform imputation on the X-chromosome and no reference panels are needed for the X-chromosome. Any sites identified on the X-chromosome after array analysis are merged back into the VCF after the imputation steps.

Workflow tasks and tools

The Imputation workflow imports a series of tasks from the ImputationTasks WDL, which is hosted in the Broad tasks library. The table below describes each workflow task, including the task name, tools, relevant software and non-default parameters.

Task name (alias) in WDLToolSoftwareDescription
MergeSingleSampleVcfsmergebcftoolsIf an array of single sample VCFs are pipeline input, the task merges them into a single VCF.
SetIDs (SetIdsVcfToImpute)annotatebcftools, bashAdds variant IDs to the combined input VCF to create a new VCF. Sorts the alleles for a given variant ID so that REF:ALT is lexicographically consistent across IDs.
ExtractIDs (ExtractIdsVcfToImpute)querybcftoolsExtracts the variant IDs from the SortIDs output VCF to a new “.ids” file so that any missing variants can be added back to the final VCF after imputation.
CountSamplesquerybcftoolsUses the merged input VCF file to count the number of samples and output a TXT file containing the count.
CalculateChromsomeLengthgrepbashReads chromosome lengths from the reference dictionary and uses these to generate chunk intervals for the GenerateChunk task.
GenerateChunkSelectVariantsGATKPerforms site filtering by selecting SNPs only and excluding InDels, removing duplicate sites from the VCF, selecting biallelic variants, excluding symbolic/mixed variants, and removing sites with a maximum fraction of samples with no-call genotypes greater than 0.1. Also subsets to only a specified chunk of the genome.
OptionalQCSites---vcftools, bcftoolsIf the boolean extra_qc_steps is true, performs additional QC steps; excludes sites with more than 95% missing data and assesses sites for Hardy Weinberg Equilibrium, excluding any site with a p-value less than 0.000001.
CountVariantsInChunksCountVariantsGATKCounts variants in the filtered VCF file; Returns the number of chunks in the array and in the reference file.
CheckChunksconvert, indexbcftoolsConfirms that there are no chunks where less than 3 sites or less than 50% of the sites in the array are also in the reference panel; if valid, creates a new VCF output.
PhaseVariantsEagleeagleEagle2Performs phasing on the filtered, validated VCF using the phased reference panel; allows for REF/ALT swaps
Minimac4Minimac4minimac4, bcftoolsPerforms imputation on the prephased VCF; parameterized to include variants that were genotyped but NOT in the reference panel and to specify a minRatio of 0.00001.
AggregateImputationQCMetrics---RUses an R script to take calculate metrics from minimac4 output info file, including total sites, total sites with variants, and sites with an R2 metric of 0.3 (total_sites_r2_gt_0.3); adds the metrics to a new TSV output.
UpdateHeaderUpdateVCFSequenceDictionaryGATKUpdates the header of the imputed VCF; adds contig lengths
SeparateMultiallelicsnormbcftoolsSplits multiallelic sites in the imputed VCF into biallelic records.
RemoveSymbolicAllelesSelectVariantsGATKRemoves SYMBOLIC alleles from the output VCF of the SeparateMultiallelics.
SetIdsannotate, indexbcftoolsSorts the alleles in the variant ID from the RemoveSymbolicAllele output VCF so that REF:ALT is lexicographically consistent across IDs.
GatherVcfsGatherVCFsGATKGathers the array of imputed VCFs and merges them into one VCF output.
ExtractIDsquerybcftoolsExtracts the variant IDs from the imputed VCF.
FindSitesUniqueToFileTwoOnly---UbuntuUses the IDs extracted from imputed VCF and those extracted from original VCF to identify missing variant sites from the original VCF; outputs the IDs to a file.
SelectVariantsByIdsSelectVariantsGATKSelects from the original input VCF any sites which were not included in the imputed VCF.
RemoveAnnotationsannotatebcftoolsRemoves the FORMAT and INFO annotations from the new VCF generated by the SelectVariantsbyIds task that contains the missing variants.
InterleaveVariantsMergeVCFsGATKCombines the missing variants from the original VCF and the imputed variants into a new VCF.
MergeImputationQCMetrics---RUses an R script to calculate the fraction of well-imputed sites and outputs them to a TXT file; the fraction of "well-imputed" sites is based on the minimac reported R2 metric, with R2>0.3 being "well-imputed." Since homomorphic sites lead to an R2 value of 0, we report the fraction of sites with any variation which are well-imputed in addition to the fraction of total sites.
StoreChunksInfo---RUses an R script to record the coordinates of each imputation chunk, number of sites in the original array, and number of sites in the original array which are also in the reference panel, for each imputation chunk.
SplitMultiSampleVcfsplitbcftoolsIf boolean is set to true, will split the interleave variants VCF into single sample VCFs.

Workflow outputs

The table below summarizes the workflow outputs. If running the workflow on Cromwell, these outputs are found in the task execution directory.

Output nameDescriptionType
imputed_single_sample_vcfsArray of imputed single sample VCFs from the SplitMultiSampleVcf task.Array
imputed_single_sample_vcf_indicesArray of indices for the imputed VCFs from the SplitMultiSampleVcf taskArray
imputed_multisample_vcfVCF from the InterleaveVariants task; contains imputed variants as well as missing variants from the input VCF.VCF
imputed_multisample_vcf_indexIndex file for VCF from the InterleaveVariants task.Index
aggregated_imputation_metricsAggregated QC metrics from the MergeImputationQcMetrics task; reports the fraction of sites well-imputed and outputs to TXT file; fraction of "well-imputed" is based on the minimac reported R2 metric, with R2>0.3 being "well-imputed." Since homomorphic sites lead to an R2 value of 0, we report the fraction of sites with any variation which are well-imputed in addition to the fraction of total sites.TXT
chunks_infoTSV from StoreChunksInfo task; contains the chunk intervals as well as the number of variants in the array.TSV
failed_chunksFile with the failed chunks from the StoreChunksInfo task.File
n_failed_chunksFile with the number of failed chunks from the StoreChunksInfo task.File

Important notes

  • Runtime parameters are optimized for Broad's Google Cloud Platform implementation.

Price estimates

The pipeline is cost-optimized for between 100 and 1,000 samples, where the cost per sample continues to decrease until 1,000 samples are run. Cost estimates per sample are provided below:

Cohort size ( # samples)Cost per sample ($)
18
100.8
1000.11
10000.024
13.5 K0.025

Citing the Imputation Pipeline

If you use the Imputation Pipeline in your research, please consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Contact us

Help us make our tools better by contacting the WARP team for pipeline-related suggestions or questions.

Licensing

Copyright Broad Institute, 2020 | BSD-3

The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.