Imputation Overview

Pipeline Version	Date Updated	Documentation Author	Questions or Feedback
Imputation_v1.1.12	February, 2024	Elizabeth Kiernan	Please file an issue in WARP.

Introduction to the Imputation pipeline

The Imputation pipeline imputes missing genotypes from either a multi-sample VCF or an array of single sample VCFs using a large genomic reference panel. It is based on the Michigan Imputation Server pipeline. Overall, the pipeline filters, phases, and performs imputation on a multi-sample VCF. It outputs the imputed VCF along with key imputation metrics.

Set-up

Workflow installation and requirements

The Imputation workflow is written in the Workflow Description Language (WDL) and can be deployed using a WDL-compatible execution engine like Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms.

To identify the latest workflow version and release notes, please see the Imputation workflow changelog.

The latest release of the workflow, example data, and dependencies are available from the WARP releases page. To discover and search releases, use the WARP command-line tool Wreleaser.

Try the Imputation pipeline in Terra

You can run the pipeline in the Imputation workspace on Terra, a cloud-optimized scalable bioinformatics platform. The workspace contains a preconfigured workflow, example inputs, instructions, and cost-estimates.

Input descriptions

The table below describes each of the Imputation pipeline inputs. The workflow requires either a multi-sample VCF or an array of single sample VCFs. These samples must be from the same species and genotyping chip.

You must have two or more samples to run the pipeline.

However, the pipeline is cost-optimized for between 100 and 1,000 samples. After 1,000 samples, the cost per sample no longer decreases (see the Price estimates section).

For examples of how to specify each input in a configuration file, as well as cloud locations for different example input files, see the example input configuration file (JSON).

Input name	Description	Type
ChunkLength	Size of chunks; default set to 25 MB.	Int
chunkOverlaps	Padding adding to the beginning and end of each chunk to reduce edge effects; default set 5 MB.	Int
multi_sample_vcf	Merged VCF containing multiple samples; can also use an array of individual VCFs.	File
multi_sample_vcf_index	Merged index for the merged VCF; can also use an array of index files if using an array of VCFs.	Index
single_sample_vcfs	Array of VCFs, one for each sample; can be used in lieu of a merged VCF containing all samples.	Array of files
single_sample_vcf_indices	Array of indices, one for each sample; can be used in lieu of a merged index for a multi-sample VCF.	Array of index files
perform_extra_qc_steps	Boolean to indicate if additional QC steps should be performed before imputing; when true, sites with call rates below 95% or low Hardy Weinberg Equilibrium (HWE) p-value are removed before imputation. Default is set to false.	Boolean
optional_qc_max_missing	Optional float used for the additional QC steps that sets a max threshold for the maximum rate of missing data allowed for sites; default set to 0.05.	Float
optional_qc_hwe	Optional HWE p-value when performing additional QC steps; default set to 0.000001.	Float
ref_dict	Reference dictionary.	File
contigs	Array of strings defining which contigs (chromsomes) should be used for the reference panel.	Array of strings
reference_panel_path	Path to the cloud storage containing the reference panel files for all contigs.	String
genetics_maps_eagle	Genetic map file for phasing.	File
output_callset_name	Output callset name.	String
split_output_to_single_sample	Boolean to split out the final combined VCF to individual sample VCFs; set to false by default.	Boolean
merge_ssvcf_mem_mb	Optional integer specifying memory allocation for MergeSingleSampleVcfs (in MB); default is 3000.	Int
frac_well_imputed_threshold	Threshold for the fraction of well-imputed sites; default set to 0.9.	Float
chunks_fail_threshold	Maximum threshold for the number of chunks allowed to fail; default set to 1.	Float
vcf_suffix	File extension used for the VCF in the reference panel.	String
vcf_index_suffix	File extension used for the VCF index in the reference panel.	String
bcf_suffix	File extension used for the BCF in the reference panel.	String
bcf_index_suffix	File extension used for the BCF index in the reference panel.	String
m3vcf_suffix	File extension used for the M3VCF in the reference panel.	String

Imputation reference panel

The Imputation workflow's reference panel files are hosted in a public Google Bucket. For the cloud-path (URI) to the files, see the example input configuration.

Generation of the modified 1000 Genomes reference

Initial tests of the Imputation workflow followed by assessments of polygenic risk score revealed that disease risk scores were lower when computed from imputed array data as opposed to whole-genome sequencing data. This was found to be due to incorrectly genotyped sites in the 1000G reference panel. As a result, the 1000G reference files were modified for the Imputation pipeline as described in the references overview. You can view the original, unmodified 1000G VCFs here.

X-chromosome not imputed

Currently, the pipeline does not perform imputation on the X-chromosome and no reference panels are needed for the X-chromosome. Any sites identified on the X-chromosome after array analysis are merged back into the VCF after the imputation steps.

Workflow tasks and tools

The Imputation workflow imports a series of tasks from the ImputationTasks WDL, which is hosted in the Broad tasks library. The table below describes each workflow task, including the task name, tools, relevant software and non-default parameters.

Task name (alias) in WDL	Tool	Software	Description
MergeSingleSampleVcfs	merge	bcftools	If an array of single sample VCFs are pipeline input, the task merges them into a single VCF.
SetIDs (SetIdsVcfToImpute)	annotate	bcftools, bash	Adds variant IDs to the combined input VCF to create a new VCF. Sorts the alleles for a given variant ID so that REF:ALT is lexicographically consistent across IDs.
ExtractIDs (ExtractIdsVcfToImpute)	query	bcftools	Extracts the variant IDs from the SortIDs output VCF to a new “.ids” file so that any missing variants can be added back to the final VCF after imputation.
CountSamples	query	bcftools	Uses the merged input VCF file to count the number of samples and output a TXT file containing the count.
CalculateChromsomeLength	grep	bash	Reads chromosome lengths from the reference dictionary and uses these to generate chunk intervals for the GenerateChunk task.
GenerateChunk	SelectVariants	GATK	Performs site filtering by selecting SNPs only and excluding InDels, removing duplicate sites from the VCF, selecting biallelic variants, excluding symbolic/mixed variants, and removing sites with a maximum fraction of samples with no-call genotypes greater than 0.1. Also subsets to only a specified chunk of the genome.
OptionalQCSites	---	vcftools, bcftools	If the boolean extra_qc_steps is true, performs additional QC steps; excludes sites with more than 95% missing data and assesses sites for Hardy Weinberg Equilibrium, excluding any site with a p-value less than 0.000001.
CountVariantsInChunks	CountVariants	GATK	Counts variants in the filtered VCF file; Returns the number of chunks in the array and in the reference file.
CheckChunks	convert, index	bcftools	Confirms that there are no chunks where less than 3 sites or less than 50% of the sites in the array are also in the reference panel; if valid, creates a new VCF output.
PhaseVariantsEagle	eagle	Eagle2	Performs phasing on the filtered, validated VCF using the phased reference panel; allows for REF/ALT swaps
Minimac4	Minimac4	minimac4, bcftools	Performs imputation on the prephased VCF; parameterized to include variants that were genotyped but NOT in the reference panel and to specify a minRatio of 0.00001.
AggregateImputationQCMetrics	---	R	Uses an R script to take calculate metrics from minimac4 output info file, including total sites, total sites with variants, and sites with an R2 metric of 0.3 (total_sites_r2_gt_0.3); adds the metrics to a new TSV output.
UpdateHeader	UpdateVCFSequenceDictionary	GATK	Updates the header of the imputed VCF; adds contig lengths
SeparateMultiallelics	norm	bcftools	Splits multiallelic sites in the imputed VCF into biallelic records.
RemoveSymbolicAlleles	SelectVariants	GATK	Removes SYMBOLIC alleles from the output VCF of the SeparateMultiallelics.
SetIds	annotate, index	bcftools	Sorts the alleles in the variant ID from the RemoveSymbolicAllele output VCF so that REF:ALT is lexicographically consistent across IDs.
GatherVcfs	GatherVCFs	GATK	Gathers the array of imputed VCFs and merges them into one VCF output.
ExtractIDs	query	bcftools	Extracts the variant IDs from the imputed VCF.
FindSitesUniqueToFileTwoOnly	---	Ubuntu	Uses the IDs extracted from imputed VCF and those extracted from original VCF to identify missing variant sites from the original VCF; outputs the IDs to a file.
SelectVariantsByIds	SelectVariants	GATK	Selects from the original input VCF any sites which were not included in the imputed VCF.
RemoveAnnotations	annotate	bcftools	Removes the FORMAT and INFO annotations from the new VCF generated by the SelectVariantsbyIds task that contains the missing variants.
InterleaveVariants	MergeVCFs	GATK	Combines the missing variants from the original VCF and the imputed variants into a new VCF.
MergeImputationQCMetrics	---	R	Uses an R script to calculate the fraction of well-imputed sites and outputs them to a TXT file; the fraction of "well-imputed" sites is based on the minimac reported R2 metric, with R2>0.3 being "well-imputed." Since homomorphic sites lead to an R2 value of 0, we report the fraction of sites with any variation which are well-imputed in addition to the fraction of total sites.
StoreChunksInfo	---	R	Uses an R script to record the coordinates of each imputation chunk, number of sites in the original array, and number of sites in the original array which are also in the reference panel, for each imputation chunk.
SplitMultiSampleVcf	split	bcftools	If boolean is set to true, will split the interleave variants VCF into single sample VCFs.

Workflow outputs

The table below summarizes the workflow outputs. If running the workflow on Cromwell, these outputs are found in the task execution directory.

Output name	Description	Type
imputed_single_sample_vcfs	Array of imputed single sample VCFs from the SplitMultiSampleVcf task.	Array
imputed_single_sample_vcf_indices	Array of indices for the imputed VCFs from the SplitMultiSampleVcf task	Array
imputed_multisample_vcf	VCF from the InterleaveVariants task; contains imputed variants as well as missing variants from the input VCF.	VCF
imputed_multisample_vcf_index	Index file for VCF from the InterleaveVariants task.	Index
aggregated_imputation_metrics	Aggregated QC metrics from the MergeImputationQcMetrics task; reports the fraction of sites well-imputed and outputs to TXT file; fraction of "well-imputed" is based on the minimac reported R2 metric, with R2>0.3 being "well-imputed." Since homomorphic sites lead to an R2 value of 0, we report the fraction of sites with any variation which are well-imputed in addition to the fraction of total sites.	TXT
chunks_info	TSV from StoreChunksInfo task; contains the chunk intervals as well as the number of variants in the array.	TSV
failed_chunks	File with the failed chunks from the StoreChunksInfo task.	File
n_failed_chunks	File with the number of failed chunks from the StoreChunksInfo task.	File

Important notes

Runtime parameters are optimized for Broad's Google Cloud Platform implementation.

Price estimates

The pipeline is cost-optimized for between 100 and 1,000 samples, where the cost per sample continues to decrease until 1,000 samples are run. Cost estimates per sample are provided below:

Cohort size ( # samples)	Cost per sample ($)
1	8
10	0.8
100	0.11
1000	0.024
13.5 K	0.025

Citing the Imputation Pipeline

If you use the Imputation Pipeline in your research, please consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Contact us

Help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.

Licensing

The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.

Imputation Overview

Introduction to the Imputation pipeline​

Set-up​

Workflow installation and requirements​

Input descriptions​

Imputation reference panel​

Generation of the modified 1000 Genomes reference​

Workflow tasks and tools​

Workflow outputs​

Important notes​

Price estimates​

Citing the Imputation Pipeline​

Contact us​

Licensing​