
VDS to VCF

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| aou_9.0.0 | July, 2025 | WARP Pipelines | File an issue |

Introduction to the VDS to VCF workflow

vds_to_vcf is a WDL workflow that converts a Hail Variant Dataset (VDS) into per-contig VCF outputs for downstream ancestry analysis. It is designed for large All of Us callsets aligned to GRCh38 and supports scatter-based chromosome processing for scalability.

The workflow repartitions the VDS, filters by contig and BED intervals, densifies to a matrix table, and exports both full and sites-only VCFs with Tabix indexes. It also writes two file-of-filenames (FOFN) manifests listing the generated VCFs and index files for downstream workflows.
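The BED filtering step depends on a coordinate convention that is easy to get wrong: BED intervals are 0-based and half-open, while VCF positions are 1-based. The sketch below illustrates that convention in plain Python. It is not the workflow's Hail code, and the helper names `parse_bed` and `in_bed` are hypothetical.

```python
def parse_bed(lines):
    """Parse BED text lines into {contig: [(start, end), ...]}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    intervals = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        intervals.setdefault(contig, []).append((int(start), int(end)))
    return intervals


def in_bed(intervals, contig, pos):
    """Return True if a 1-based VCF position falls inside any BED interval.

    A 0-based half-open interval [start, end) covers 1-based positions
    start + 1 through end, inclusive.
    """
    return any(start < pos <= end for start, end in intervals.get(contig, []))


bed = parse_bed(["chr1\t100\t200"])
print(in_bed(bed, "chr1", 100), in_bed(bed, "chr1", 101))
```

In this example, position 100 is outside the interval and position 101 is the first covered base.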

Quickstart table

| Pipeline Feature | Description | Source |
| --- | --- | --- |
| Analysis type | VDS conversion and interval filtering for ancestry preprocessing | |
| Workflow language | WDL 1.0 | openWDL |
| Genomic reference sequence | GRCh38 | |
| Data input file format | VDS + BED + contig list | |
| Data output file format | VCF BGZF + Tabix index + FOFN text files | |
| Primary software | Hail, GATK, bcftools | Hail, GATK, bcftools |

Set-up

VDS to VCF installation and requirements

The workflow code can be downloaded by cloning the WARP GitHub repository. For the latest release, please see the vds_to_vcf changelog.

The pipeline can be deployed using Cromwell, a GA4GH-compliant workflow management system.

Inputs

Input descriptions

| Input variable name | Description | Type |
| --- | --- | --- |
| vds_gs_url | Google Cloud Storage path to the input Hail Variant Dataset. | String |
| bed_gs_url | Google Cloud Storage path to BED intervals used for filtering. | String |
| n_partitions | Number of partitions to apply to the VDS before processing. Default: 2000. | Int |
| output_prefix | Prefix used for all generated output files. | String |
| contigs | Ordered list of contigs/chromosomes to process in scatter mode. | Array[String] |
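As a rough illustration, the inputs above could be assembled into a Cromwell inputs JSON as follows. This is a sketch under assumptions: the `vds_to_vcf.` key prefix assumes the workflow is declared with that name in the WDL, and the bucket paths and prefix are hypothetical placeholders.

```python
import json


def build_inputs(vds_gs_url, bed_gs_url, output_prefix, contigs, n_partitions=2000):
    """Build a Cromwell-style inputs dictionary for the workflow.

    The "vds_to_vcf." prefix is an assumption about the workflow name
    declared in the WDL; adjust it if the declared name differs.
    """
    return {
        "vds_to_vcf.vds_gs_url": vds_gs_url,
        "vds_to_vcf.bed_gs_url": bed_gs_url,
        "vds_to_vcf.n_partitions": n_partitions,  # documented default: 2000
        "vds_to_vcf.output_prefix": output_prefix,
        "vds_to_vcf.contigs": list(contigs),  # order is preserved
    }


# Hypothetical paths, for illustration only.
inputs = build_inputs(
    vds_gs_url="gs://example-bucket/callset.vds",
    bed_gs_url="gs://example-bucket/intervals.bed",
    output_prefix="example",
    contigs=["chr20", "chr21", "chr22"],
)
print(json.dumps(inputs, indent=2))
```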

VDS to VCF tasks and tools

The VDS to VCF workflow calls two tasks to process each contig and create output manifests.

  1. Process VDS per contig
  2. Create output file manifests

To see specific tool parameters, select the task WDL link in the table; then view the command {} section of the task in the WDL script.

| Task name and WDL link | Tool | Software | Description |
| --- | --- | --- | --- |
| process_vds | Hail, Python | hailgenetics/hail:0.2.134-py3.11 | Repartitions VDS, filters by chromosome and BED intervals, and exports full + sites-only VCFs with indexes. |
| create_fofn | Shell | us.gcr.io/broad-gatk/gatk:4.2.6.1 | Writes text manifests listing output VCF and index file paths. |

1. Process VDS per contig

For each contig in contigs, the workflow calls process_vds to filter and export outputs named with <output_prefix>.<contig>. Each shard generates a full VCF and a sites-only VCF, each with a .tbi index.
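The naming scheme can be sketched as a small helper that derives the four expected filenames for one shard. The suffixes follow the patterns listed in the Outputs table; the function name `shard_outputs` and the example prefix are hypothetical.

```python
def shard_outputs(output_prefix, contig):
    """Return the expected output filenames for a single contig shard."""
    base = f"{output_prefix}.{contig}"
    return {
        "vcf": f"{base}.vcf.bgz",              # full VCF
        "vcf_tbi": f"{base}.vcf.bgz.tbi",      # Tabix index for the full VCF
        "vcf_so": f"{base}.so.vcf.bgz",        # sites-only VCF
        "vcf_so_tbi": f"{base}.so.vcf.bgz.tbi",  # Tabix index for the sites-only VCF
    }


for contig in ["chr21", "chr22"]:
    print(shard_outputs("example", contig))
```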

2. Create output file manifests

After the scatter completes, create_fofn writes two text files: <output_prefix>.fofn1.txt, which lists the full VCF paths, and <output_prefix>.fofn2.txt, which lists the corresponding VCF index paths.
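A file-of-filenames is simply a text file with one path per line. The following sketch shows the idea in Python. The actual create_fofn task is a shell task, so this is an illustration rather than its implementation, and the function name `write_manifests` and the example paths are hypothetical.

```python
from pathlib import Path


def write_manifests(output_prefix, vcf_paths, index_paths, out_dir="."):
    """Write two FOFN manifests, one path per line, and return their locations."""
    out_dir = Path(out_dir)
    vcf_list = out_dir / f"{output_prefix}.fofn1.txt"  # full VCF paths
    idx_list = out_dir / f"{output_prefix}.fofn2.txt"  # VCF index paths
    vcf_list.write_text("".join(f"{p}\n" for p in vcf_paths))
    idx_list.write_text("".join(f"{p}\n" for p in index_paths))
    return vcf_list, idx_list


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    vcfs = ["gs://example-bucket/example.chr21.vcf.bgz"]
    tbis = ["gs://example-bucket/example.chr21.vcf.bgz.tbi"]
    print(write_manifests("example", vcfs, tbis))
```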

Outputs

| Output variable name | Filename, if applicable | Output format and description |
| --- | --- | --- |
| vcfs | <output_prefix>.<contig>.vcf.bgz | Array of per-contig full VCF files. |
| vcfs_tbis | <output_prefix>.<contig>.vcf.bgz.tbi | Array of Tabix indexes for full VCF files. |
| vcfs_so | <output_prefix>.<contig>.so.vcf.bgz | Array of per-contig sites-only VCF files. |
| vcfs_so_tbis | <output_prefix>.<contig>.so.vcf.bgz.tbi | Array of Tabix indexes for sites-only VCF files. |
| vcfs_list | <output_prefix>.fofn1.txt | Text file listing full VCF output paths. |
| vcfs_idx_list | <output_prefix>.fofn2.txt | Text file listing full VCF index output paths. |
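A downstream workflow may want to confirm that the two manifests agree before using them. The sketch below pairs each VCF with its index. It assumes that both manifests list entries in the same order and that each index path is the VCF path plus `.tbi`, neither of which this page guarantees. The function name `pair_manifests` is hypothetical.

```python
def pair_manifests(vcf_lines, idx_lines):
    """Pair each VCF path with its index path, raising ValueError on a mismatch.

    Assumes both manifests are in the same order and that the index
    path equals the VCF path with ".tbi" appended.
    """
    vcfs = [line.strip() for line in vcf_lines if line.strip()]
    idxs = [line.strip() for line in idx_lines if line.strip()]
    if len(vcfs) != len(idxs):
        raise ValueError(f"{len(vcfs)} VCFs but {len(idxs)} indexes")
    pairs = []
    for vcf, idx in zip(vcfs, idxs):
        if idx != vcf + ".tbi":
            raise ValueError(f"index {idx!r} does not match VCF {vcf!r}")
        pairs.append((vcf, idx))
    return pairs


# Hypothetical manifest contents, for illustration only.
print(pair_manifests(
    ["example.chr21.vcf.bgz\n"],
    ["example.chr21.vcf.bgz.tbi\n"],
))
```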

Versioning

All vds_to_vcf releases are documented in the changelog.

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.