GLIMPSE2 Low Pass Imputation Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| Glimpse2LowPassImputation_v0.0.10 (pre-release) | May, 2026 | Terra Scientific Pipeline Services | Please file an issue in WARP. |

Introduction to the GLIMPSE2 Low Pass Imputation pipeline

The GLIMPSE2 Low Pass Imputation pipeline imputes missing genotypes from a list of low-pass CRAM/CRAI files (or a sample manifest pointing to GCS file paths) using a large genomic reference panel. It uses GLIMPSE2 as the imputation tool. Overall, the pipeline splits samples into batches, performs variant calling and imputation on each batch across genomic chunks, and merges the results into a final multi-sample VCF. It outputs the imputed VCF along with key imputation metrics.

GLIMPSE2 Low-Pass Imputation Summary

The Glimpse2LowPassImputation workflow is a WDL-based pipeline for low-pass whole genome imputation using GLIMPSE2.
The top-level workflow acts as a gateway that scales to large cohorts: it splits samples into batches, runs a per-batch imputation subworkflow, and then merges the batch outputs back into cohort-level results.

The workflow processes each requested contig independently, imputes each sample batch against reference-defined chunks, ligates chunk outputs per batch/contig, merges sample columns across batches, recomputes AF/INFO annotations, and gathers contig outputs into final genome-wide files.
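
The sample batching described above can be sketched as follows. This is a minimal illustration, not the workflow's own task code: the function name `split_into_batches` is hypothetical, and the actual SplitIntoSampleBatches task may distribute samples differently (for example, by balancing batch sizes).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int) -> List[List[T]]:
    """Split items into consecutive batches of at most batch_size elements.

    Hypothetical helper: it only illustrates the idea behind sample_batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


# Example: 2500 samples with the documented default sample_batch_size of 1000.
sample_ids = [f"sample_{i}" for i in range(2500)]
batches = split_into_batches(sample_ids, batch_size=1000)
print([len(b) for b in batches])  # [1000, 1000, 500]
```

In the workflow, the CRAM, CRAI, and sample ID arrays would need to be split with the same boundaries so that each batch keeps its entries aligned.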

Pipeline Features

| Pipeline features | Description | Source |
| --- | --- | --- |
| Assay type | Low-pass whole genome imputation using GLIMPSE2 | GLIMPSE2 |
| Overall workflow | CRAM calling, shard-based phasing/imputation, ligation, batch merge, and QC | Defined in Glimpse2LowPassImputation.wdl + imported subworkflows/tasks |
| Workflow language | WDL 1.0 | openWDL |
| Sub-workflows | Gateway workflow + Glimpse2LowPassImputationBatch | Imported from Glimpse2LowPassImputationBatch.wdl |
| Genomic processing | Contig-by-contig and reference-chunk-based processing | Workflow scatter logic |
| Cohort scalability | Inputs optionally specified via cram_manifest, sample batching (sample_batch_size) then per-contig batch merge | Gateway orchestration in Glimpse2LowPassImputation.wdl |
| Algorithms | bcftools mpileup/call/norm/merge + GLIMPSE2 phase/ligate + post-merge re-annotation | Task commands in batch and task WDLs |
| Quality control | Sample QC metrics and optional coverage-metrics aggregation | CollectQCMetrics, CombineCoverageMetrics |
| Data input file format | CRAM/CRAI arrays with sample IDs | Workflow input block |
| Data output file format | Imputed VCFs, indexes, md5s, and QC/coverage metric tables | Workflow outputs |
| Containers | GATK, GLIMPSE2, bcftools/samtools suite, Hail, Python, Ubuntu | Runtime blocks |
| Resource optimization | Parallelization by sample batch, contig, and reference shard | Workflow architecture |

Inputs

This gateway workflow expects CRAM-based inputs and a GLIMPSE2-compatible reference panel layout.

| Input | Description |
| --- | --- |
| cram_manifest | Optional manifest TSV file containing columns (including header line) of sample_id, cram_path, and cram_index_path referring to cloud-hosted input files to be imputed. This or all three array inputs (crams, cram_indices, sample_ids) must be provided. |
| crams | Optional array of input CRAMs |
| cram_indices | Optional array of CRAI files corresponding to crams |
| sample_ids | Optional array of sample ID strings corresponding to CRAM inputs |
| contigs | Array of contigs/chromosomes to process |
| reference_panel_prefix | Directory/prefix containing sites.<contig>.vcf.gz, sites_table.<contig>.gz, and reference_chunks.<contig>.txt |
| fasta | Reference FASTA |
| fasta_index | FASTA index |
| output_basename | Basename for intermediate and final outputs |
| ref_dict | Reference dictionary used during ligation/header normalization |
| impute_reference_only_variants | Whether to impute reference-only variants (default: false) |
| call_indels | Whether to include indels during calling/imputation (default: false) |
| calling_batch_size | Batch size for CRAM calling inside each batch subworkflow (default: 100) |
| sample_batch_size | Batch size at gateway level for splitting very large cohorts (default: 1000) |
| glimpse_phase_cpu_override | Optional CPU override for GlimpsePhase task (default: 4) |
| gatk_docker | GATK Docker image |
| glimpse_docker | GLIMPSE2 Docker image |
| docker_merge | Docker image used for merge/re-annotation step |
| info_filter_for_inclusion | Optional minimum INFO score threshold; variants below this value are excluded from the final output VCF (default: 0.0) |
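
As an illustration of the cram_manifest input, the sketch below writes a tab-separated manifest with the three documented columns and a header line. The helper name `write_cram_manifest`, the bucket name, and the sample IDs are hypothetical placeholders; whether the workflow requires this exact column order is an assumption.

```python
import csv
import io


def write_cram_manifest(rows, handle):
    """Write a TSV manifest with a header line and one row per sample.

    rows: iterable of (sample_id, cram_path, cram_index_path) tuples.
    Hypothetical helper, not part of the pipeline.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample_id", "cram_path", "cram_index_path"])
    for sample_id, cram_path, cram_index_path in rows:
        writer.writerow([sample_id, cram_path, cram_index_path])


# Placeholder samples and paths, for illustration only.
rows = [
    ("sample_A", "gs://example-bucket/crams/sample_A.cram",
     "gs://example-bucket/crams/sample_A.cram.crai"),
    ("sample_B", "gs://example-bucket/crams/sample_B.cram",
     "gs://example-bucket/crams/sample_B.cram.crai"),
]

buffer = io.StringIO()
write_cram_manifest(rows, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```

To produce a file instead of an in-memory string, pass a handle opened with `open("manifest.tsv", "w", newline="")`.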

Workflow Tasks

The top-level workflow orchestrates batching, per-batch imputation, and cohort-level merging/re-annotation.

| Task / Call | Purpose | Input Dependencies | Key Function |
| --- | --- | --- | --- |
| ConvertCramManifestToInputArrays | Convert cram manifest input into CRAMs/CRAIs/sample IDs arrays | cram_manifest | Facilitates submission of very large sets of inputs via manifest file |
| SplitIntoSampleBatches | Split CRAMs/CRAIs/sample IDs into sample-level batches | crams, cram_indices, sample_ids from inputs or derived from cram_manifest, sample_batch_size | Enables large-cohort scaling at gateway level |
| RunBatch (Glimpse2LowPassImputationBatch) | Run full low-pass imputation pipeline on each sample batch | Batch-specific CRAMs/indices/sample IDs + reference inputs | Produces per-batch, per-contig ligated imputed VCFs |
| ExtractAnnotations | Extract AF/INFO annotations from each batch contig VCF | Batch ligated VCFs and indexes | Captures annotations needed for post-merge recomputation |
| MergeContigVcfs (MergeSampleChunksVcfsWithPaste) | Merge sample columns across batch VCFs for one contig | Array of batch VCFs for contig | Creates full-cohort contig VCF with aligned site lists |
| RecomputeAndAnnotate | Recompute AF/INFO across merged cohort and write updated contig VCF | Merged contig VCF + extracted annotations | Restores cohort-correct annotations after paste-based merge |
| SelectContigVariants | Create variants-only contig VCF | Re-annotated contig VCF | Removes homozygous-reference-only records |
| CreateContigHomRefVcf | Create hom-ref-sites-only contig VCF | Re-annotated contig VCF | Keeps homozygous-reference-only sites |
| MergeBatchCoverageMetrics | Combine optional coverage metric files across batches | RunBatch.coverage_metrics | Produces aggregated coverage table when metrics exist |
| GatherVcfsNoIndex | Gather contig variant VCFs into genome-wide variant VCF | Variant-only contig VCFs | Produces final genome-wide variant VCF |
| FilterVcfByInfo | Filter variants below the INFO score threshold (optional) | Gathered variant VCF; only runs when info_filter_for_inclusion is supplied | Removes low-quality imputed variants from the final VCF |
| CreateVcfIndexAndMd5 | Index and checksum final variant VCF | Filtered VCF (if info_filter_for_inclusion supplied) or gathered variant VCF | Creates .tbi and md5 |
| GatherVcfsNoIndexHomRefOnly | Gather contig hom-ref-sites-only VCFs | Hom-ref contig VCFs | Produces final genome-wide hom-ref-sites-only VCF |
| CreateVcfIndexAndMd5HomRefOnly | Index and checksum final hom-ref-sites-only VCF | Gathered hom-ref-sites-only VCF | Creates .tbi and md5 |
| CollectQCMetrics | Compute sample QC metrics from final imputed variant VCF | Filtered VCF (if info_filter_for_inclusion supplied) or gathered variant VCF | Generates sample-level QC report |
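
The split between the variants-only and hom-ref-sites-only outputs (SelectContigVariants and CreateContigHomRefVcf) can be pictured with the sketch below. It operates on plain genotype strings rather than real VCF files, and the rule it uses (a record is hom-ref-only when no sample carries a non-reference allele, with missing alleles ignored) is an assumption about how the tasks classify records, not a description of their actual commands. The name `is_hom_ref_only` is hypothetical.

```python
import re


def is_hom_ref_only(genotypes):
    """Return True when no sample genotype carries a non-reference allele.

    Genotypes are VCF-style GT strings such as "0|0", "0/1", or "./.".
    Missing alleles (".") are ignored (an assumption for this sketch).
    """
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)
        if any(allele not in ("0", ".") for allele in alleles):
            return False
    return True


# Toy records: (contig, position, per-sample genotypes).
records = [
    ("chr1", 1000, ["0|0", "0|1", "0|0"]),
    ("chr1", 2000, ["0|0", "0|0", "0|0"]),
    ("chr1", 3000, ["1|1", "0|0", "0|1"]),
]

variant_records = [r for r in records if not is_hom_ref_only(r[2])]
hom_ref_records = [r for r in records if is_hom_ref_only(r[2])]

print([r[1] for r in variant_records])  # [1000, 3000]
print([r[1] for r in hom_ref_records])  # [2000]
```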

Outputs

Upon successful completion, the workflow emits final genome-wide imputed outputs, corresponding index and checksum files, and QC metrics. Coverage metrics are optional.

| Output | Description |
| --- | --- |
| imputed_vcf | Final imputed multi-sample variant VCF |
| imputed_vcf_index | Index file for final imputed VCF |
| imputed_vcf_md5sum | MD5 checksum for final imputed VCF |
| imputed_hom_ref_sites_only_vcf | Final sites-only VCF containing homozygous-reference-only sites |
| imputed_hom_ref_sites_only_vcf_index | Index file for hom-ref-sites-only VCF |
| imputed_hom_ref_sites_only_vcf_md5 | MD5 checksum for hom-ref-sites-only VCF |
| qc_metrics | Sample-level QC metrics table |
| coverage_metrics | Optional combined coverage metrics table |

Glimpse2LowPassImputationBatch summary

The Glimpse2LowPassImputationBatch workflow is the per-batch subworkflow used by the top-level Glimpse2LowPassImputation gateway workflow.
It is designed for batches of up to roughly 1000 samples and returns contig-level ligated imputed VCFs, which the gateway workflow then merges across batches.

Batch Workflow Role

  • runs low-pass variant calling from CRAMs at reference-panel sites
  • phases/imputes each contig in reference-defined chunks using GLIMPSE2
  • ligates chunk-level outputs back into one imputed VCF per contig
  • emits optional coverage metrics aggregated across chunks/contigs

Batch Inputs

| Input | Description |
| --- | --- |
| contigs | Contigs/chromosomes to process |
| reference_panel_prefix | Prefix containing sites.<contig>.vcf.gz, sites_table.<contig>.gz, and reference_chunks.<contig>.txt |
| crams | CRAM files for this batch |
| cram_indices | CRAI files for the batch CRAMs |
| sample_ids | Sample IDs aligned to crams |
| fasta / fasta_index | Reference FASTA and index |
| output_basename | Basename for intermediate and emitted files |
| ref_dict | Reference dictionary used during ligation/reheader |
| impute_reference_only_variants | Pass-through option for GLIMPSE2 phase |
| call_indels | Whether to include indels during calling/imputation |
| calling_batch_size | Internal batch size for CRAM calling fan-out within this subworkflow |
| gatk_docker / glimpse_docker | Container images for GATK and GLIMPSE2 tools |

Batch Internal Processing

| Step | Purpose |
| --- | --- |
| SplitIntoBatches (conditional) | Splits CRAMs/CRAIs/sample IDs into internal calling batches |
| BcftoolsMpileup | Computes pileups at panel sites per internal batch |
| BcftoolsCall | Calls candidate variants from mpileup output |
| BcftoolsNorm | Normalizes and indexes called variants |
| BcftoolsMerge (conditional) | Merges per-internal-batch VCFs if multiple were produced |
| ComputeShardsAndMemoryPerShard | Reads reference chunks and computes per-shard memory estimates |
| GlimpsePhase | Runs GLIMPSE2_phase for each reference shard |
| GlimpseLigate | Ligates shard outputs to one contig-level imputed VCF and updates sequence dictionary |
| MergeBatchCoverageMetrics (conditional) | Combines optional shard/contig coverage metric files |

Batch Outputs

| Output | Description |
| --- | --- |
| imputed_contig_ligated_vcfs | Per-contig ligated imputed VCFs for this sample batch |
| imputed_contig_ligated_vcf_indices | Index files for each contig ligated VCF |
| coverage_metrics | Optional batch-level combined coverage metrics table |

Glimpse2MergeBatches AF and INFO score recalculation

As part of the batch merge and re-annotation step in the top-level workflow, we recalculate AF and INFO scores across the full cohort after merging batch VCFs. This is necessary because the batch-level VCFs contain AF/INFO annotations that are only correct for the samples in that batch, so they must be recalculated for the entire input sample set. To learn more about how we do this, see the AF and INFO score recalculations on GitHub.
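
To see why per-batch annotations cannot simply be carried over, consider allele frequency alone. The sketch below pools per-batch AF values into a cohort-level AF using a sample-count-weighted mean. It assumes every batch has the same ploidy at the site and that each batch AF was computed over all samples in that batch. It is an arithmetic illustration only; the pipeline's actual AF and INFO recalculation is the one documented on GitHub, and the function name here is hypothetical.

```python
def pooled_allele_frequency(batch_afs, batch_sizes):
    """Combine per-batch allele frequencies into one cohort-level frequency.

    Each batch AF is weighted by its number of samples (assumes equal ploidy
    across batches). Hypothetical helper for illustration.
    """
    if len(batch_afs) != len(batch_sizes):
        raise ValueError("batch_afs and batch_sizes must have the same length")
    total_samples = sum(batch_sizes)
    if total_samples == 0:
        raise ValueError("at least one sample is required")
    weighted_sum = sum(af * n for af, n in zip(batch_afs, batch_sizes))
    return weighted_sum / total_samples


# Two batches of different sizes with different batch-level AFs at one site.
batch_afs = [0.10, 0.30]
batch_sizes = [1000, 500]
cohort_af = pooled_allele_frequency(batch_afs, batch_sizes)
print(round(cohort_af, 4))  # 0.1667
```

Neither batch-level value (0.10 or 0.30) matches the cohort-level frequency, which is why the annotations are recomputed after the merge.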

Glimpse2LowPassImputationQuotaConsumed summary

The QuotaConsumed workflow computes the number of submitted samples for service quota accounting.
The quota is derived from the number of CRAM entries found in the provided CRAM manifest.

Glimpse2LowPassImputationQC summary

The InputQC workflow validates the CRAM-based inputs supplied via the cram_manifest for GLIMPSE2 low-pass imputation. Checks include:

  • required manifest columns are present: sample_id, cram_path, cram_index_path
  • counts of CRAMs, CRAIs, and sample IDs match
  • sample IDs are unique
  • CRAM paths are unique
  • CRAM file names end with .cram and indices end with .crai
  • input paths use gs:// format and are accessible
  • CRAM file sizes do not exceed the configured maximum (default: 10 GB)
  • CRAM files are aligned to the expected reference
  • optional requester-pays validation via billing_project_for_rp
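
A subset of the checks above can be sketched locally before submission. The code below covers only the checks that need no cloud access (required columns, unique sample IDs and CRAM paths, file extensions, and the gs:// prefix); it does not test accessibility, file size, reference alignment, or requester-pays settings. The function name `validate_manifest` and the error messages are hypothetical and are not taken from the InputQC workflow.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "cram_path", "cram_index_path")


def validate_manifest(manifest_text):
    """Return a list of human-readable problems found in a TSV manifest."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing required columns: {', '.join(missing)}"]

    errors = []
    seen_ids, seen_crams = set(), set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample_id, cram, crai = [(row.get(c) or "") for c in REQUIRED_COLUMNS]
        if sample_id in seen_ids:
            errors.append(f"line {line_no}: duplicate sample_id {sample_id!r}")
        seen_ids.add(sample_id)
        if cram in seen_crams:
            errors.append(f"line {line_no}: duplicate cram_path {cram!r}")
        seen_crams.add(cram)
        if not cram.endswith(".cram"):
            errors.append(f"line {line_no}: cram_path does not end with .cram")
        if not crai.endswith(".crai"):
            errors.append(f"line {line_no}: cram_index_path does not end with .crai")
        for path in (cram, crai):
            if not path.startswith("gs://"):
                errors.append(f"line {line_no}: path is not a gs:// URL: {path!r}")
    return errors


header = "sample_id\tcram_path\tcram_index_path\n"
example = header + "s1\tgs://example-bucket/s1.cram\tgs://example-bucket/s1.cram.crai\n"
print(validate_manifest(example))  # []
```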

Important notes

  • Runtime parameters are optimized for Broad's Google Cloud Platform implementation.

Citing the Imputation Pipeline

If you use the GLIMPSE2 Low Pass Imputation Pipeline in your research, please consider citing our paper:

Degatano, K., Awdeh, A., Cox III, R.S., Dingman, W., Grant, G., Khajouei, F., Kiernan, E., Konwar, K., Mathews, K.L., Palis, K., et al. Warp Analysis Research Pipelines: Cloud-optimized workflows for biological data processing and reproducible analysis. Bioinformatics 2025; btaf494. https://doi.org/10.1093/bioinformatics/btaf494

Contact us

Help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.

Licensing

Copyright Broad Institute, 2020 | BSD-3

The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.