Determine HQ Sites Intersection

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| aou_9.0.0 | September, 2025 | WARP Pipelines | File an issue |

Introduction to the Determine HQ Sites Intersection workflow

determine_hq_sites_intersection is a WDL workflow that identifies high-quality (HQ) variant sites shared between a training callset and a target dataset, then filters both datasets to that shared set. This ensures that downstream ancestry inference runs on a consistent set of informative sites.

The workflow applies SNP and quality filters to each input VCF shard, intersects each filtered shard with the training sites-only VCF, and merges the per-shard intersections. It produces three final VCFs, each with a Tabix index: the shared HQ sites-only VCF, the merged input-data VCF filtered to those sites, and the training-set VCF filtered to those sites.

Quickstart table

| Pipeline Feature | Description | Source |
| --- | --- | --- |
| Analysis type | Shared high-quality site selection and filtering | |
| Workflow language | WDL 1.0 | openWDL |
| Data input file format | VCF BGZF + Tabix index (training + data shards) | |
| Data output file format | Filtered VCF BGZF + Tabix index | |
| Primary software | GATK, bcftools | GATK, bcftools |

Set-up

Determine HQ Sites Intersection installation and requirements

The workflow code can be downloaded by cloning the WARP GitHub repository. For the latest release, please see the determine_hq_sites_intersection changelog.

The pipeline can be deployed using Cromwell, a GA4GH-compliant workflow management system.

Inputs

Input descriptions

| Input variable name | Description | Type |
| --- | --- | --- |
| training_vcf_bgz | Full training VCF (with sample genotypes). | File |
| training_vcf_bgz_idx | Index for training_vcf_bgz. | File |
| training_vcf_so_bgz | Sites-only training VCF corresponding to training_vcf_bgz. | File |
| training_vcf_so_bgz_idx | Index for training_vcf_so_bgz. | File |
| ordered_vcf_shards_in | Ordered list of input dataset VCF shards. | Array[File] |
| ordered_vcf_shards_idx_in | Ordered list of VCF indexes corresponding to ordered_vcf_shards_in. | Array[File] |
| ordered_vcf_shards_list | (Optional) Text file listing shard VCF paths; overrides ordered_vcf_shards_in when provided. | File? |
| ordered_vcf_shards_idx_list | (Optional) Text file listing shard index paths; overrides ordered_vcf_shards_idx_in when provided. | File? |
| final_output_prefix | Output prefix for workflow-level naming. | String |
| service_account_json | (Optional) Service account key path used for requester-protected data localization. | String? |
| intersecting_intervals | (Optional) Additional intervals (e.g., exome targets) for intersection filtering. | File? |
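
As an illustration, the required inputs listed above could be assembled into a Cromwell inputs JSON along the following lines. This is a minimal Python sketch, not part of the pipeline: the `build_inputs` helper and the `gs://example-bucket/...` paths are hypothetical, the index paths assume a `.tbi` file sits next to each VCF, and the `determine_hq_sites_intersection.` key prefix assumes the workflow is declared under that name in the WDL.

```python
import json

# Assumption: the WDL declares the workflow under this name.
WORKFLOW = "determine_hq_sites_intersection"


def build_inputs(training_vcf, training_sites_only_vcf, shard_vcfs, prefix):
    """Return a Cromwell-style inputs dict for the required workflow inputs.

    Assumption: every index lives next to its VCF with a '.tbi' suffix.
    """
    if not shard_vcfs:
        raise ValueError("at least one input shard is required")
    inputs = {
        "training_vcf_bgz": training_vcf,
        "training_vcf_bgz_idx": training_vcf + ".tbi",
        "training_vcf_so_bgz": training_sites_only_vcf,
        "training_vcf_so_bgz_idx": training_sites_only_vcf + ".tbi",
        # Shards and their indexes must stay in the same order.
        "ordered_vcf_shards_in": list(shard_vcfs),
        "ordered_vcf_shards_idx_in": [vcf + ".tbi" for vcf in shard_vcfs],
        "final_output_prefix": prefix,
    }
    # Cromwell expects keys of the form '<workflow name>.<input name>'.
    return {f"{WORKFLOW}.{name}": value for name, value in inputs.items()}


# Hypothetical paths, for illustration only.
example_inputs = build_inputs(
    training_vcf="gs://example-bucket/training.vcf.bgz",
    training_sites_only_vcf="gs://example-bucket/training.sites_only.vcf.bgz",
    shard_vcfs=[
        "gs://example-bucket/shards/shard_0.vcf.bgz",
        "gs://example-bucket/shards/shard_1.vcf.bgz",
    ],
    prefix="example_run",
)
print(json.dumps(example_inputs, indent=2))
```

Optional inputs such as ordered_vcf_shards_list or intersecting_intervals are left out of the sketch; they would be added as further keys with the same prefix.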

Determine HQ Sites Intersection tasks and tools

The workflow first filters and intersects each shard, then merges the results and filters both datasets to produce callsets restricted to the shared HQ sites.

  1. Filter and sites-only transform each data shard
  2. Intersect shards with training HQ sites
  3. Merge intersection sites and filter datasets

To see specific tool parameters, select the task WDL link in the table; then view the command {} section of the task in the WDL script.

| Task name and WDL link | Tool | Software | Description |
| --- | --- | --- | --- |
| sitesOnlyAndHQFilterVcf | GATK SelectVariants | us.gcr.io/broad-gatk/gatk:4.2.0.0 | Filters each input shard to high-quality biallelic SNPs and writes a sites-only VCF. |
| intersect_vcfs_as_sites_only | GATK SelectVariants | us.gcr.io/broad-gatk/gatk:4.2.0.0 | Intersects each filtered shard with the training sites-only VCF. |
| merge_vcf_bgzs | bcftools | mgibio/bcftools-cwl:1.12 | Concatenates and sorts VCF shards into a merged VCF. |
| filter_by_sites_only | GATK SelectVariants | us.gcr.io/broad-gatk/gatk:4.2.0.0 | Filters full VCFs to the final intersected sites-only set. |

1. Filter and sites-only transform each data shard

Each shard is filtered for biallelic SNPs, allele frequency, missingness, and optional interval overlap, then exported as a sites-only VCF.
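
The logic of this step can be sketched in plain Python. This is only a conceptual illustration: the workflow itself runs GATK SelectVariants, and the thresholds `min_af` and `max_missing`, the record layout, and the function names are all assumptions chosen for the example, since this page does not state the actual cutoffs. The optional interval overlap is omitted.

```python
BASES = {"A", "C", "G", "T"}


def parse_gt(gt):
    """Split a GT string such as '0/1' or '1|0' into allele codes ('.' = missing)."""
    return gt.replace("|", "/").split("/")


def passes_hq_filter(ref, alts, genotypes, min_af=0.01, max_missing=0.05):
    """Return True for a biallelic SNP that meets the hypothetical thresholds."""
    # Biallelic SNP: exactly one ALT allele, REF and ALT are single bases.
    if len(alts) != 1 or ref not in BASES or alts[0] not in BASES:
        return False
    calls = [parse_gt(gt) for gt in genotypes]
    if not calls:
        return False
    # Missingness: fraction of samples with any missing allele.
    missing = sum(1 for call in calls if "." in call)
    if missing / len(calls) > max_missing:
        return False
    # Alternate allele frequency among called alleles only.
    alleles = [a for call in calls if "." not in call for a in call]
    if not alleles:
        return False
    return alleles.count("1") / len(alleles) >= min_af


def to_sites_only(records, **thresholds):
    """Keep passing records and drop genotypes, leaving (chrom, pos, ref, alt)."""
    return [
        (chrom, pos, ref, alts[0])
        for chrom, pos, ref, alts, genotypes in records
        if passes_hq_filter(ref, alts, genotypes, **thresholds)
    ]


# Toy shard: (chrom, pos, ref, alts, per-sample GT strings).
shard = [
    ("chr1", 100, "A", ("G",), ["0/1", "0/0", "1/1", "0/1"]),      # kept
    ("chr1", 200, "A", ("G", "T"), ["0/1", "0/2", "0/0", "0/0"]),  # multiallelic
    ("chr1", 300, "AT", ("A",), ["0/1", "0/0", "0/0", "0/0"]),     # indel
    ("chr1", 400, "C", ("T",), ["./.", "./.", "0/1", "0/0"]),      # too much missing data
    ("chr1", 500, "C", ("T",), ["0/0", "0/0", "0/0", "0/0"]),      # ALT never observed
]
hq_sites = to_sites_only(shard)
```

In this toy shard only the first record survives, because every other record fails one of the four checks noted in the comments.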

2. Intersect shards with training HQ sites

Each filtered shard is intersected against training_vcf_so_bgz to keep only sites shared with the training HQ set.
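
Conceptually, this intersection is a set-membership test on site keys. The Python sketch below assumes sites are matched on the full (chrom, pos, ref, alt) tuple; how the workflow's GATK SelectVariants call matches alleles is not described on this page, so treat the key choice and the `intersect_sites` helper as illustrative assumptions.

```python
def intersect_sites(shard_sites, training_sites):
    """Keep shard sites that also occur in the training sites-only set.

    Sites are (chrom, pos, ref, alt) tuples; the shard's order is preserved.
    """
    training = set(training_sites)
    return [site for site in shard_sites if site in training]


# Toy data: the training sites-only set and one filtered shard.
training_sites = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 50, "G", "A"),
]
filtered_shard = [
    ("chr1", 100, "A", "G"),  # shared with training
    ("chr1", 180, "T", "C"),  # absent from training
    ("chr1", 250, "C", "A"),  # same position, different ALT allele
]
shared = intersect_sites(filtered_shard, training_sites)
```

Under the allele-aware key assumed here, the site at position 250 is dropped because its ALT allele differs from the training set; a position-only key would keep it.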

3. Merge intersection sites and filter datasets

Intersected shard sites are merged into a unified sites-only VCF. That merged site set is then used to filter both the training full VCF and all input data shards; filtered shards are merged into one data VCF.
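
The merge-then-filter flow described above might look like the following in Python. It is a conceptual sketch only: the workflow uses bcftools to merge and GATK SelectVariants to filter, while the `merge_sites` and `filter_records` helpers, the record layout, and the toy data here are hypothetical.

```python
def merge_sites(per_shard_sites, contig_order):
    """Combine per-shard site lists into one de-duplicated, sorted site list."""
    rank = {contig: i for i, contig in enumerate(contig_order)}
    merged = {site for sites in per_shard_sites for site in sites}
    return sorted(merged, key=lambda s: (rank[s[0]], s[1], s[2], s[3]))


def filter_records(records, keep_sites):
    """Keep full records whose (chrom, pos, ref, alt) key is in keep_sites."""
    keep = set(keep_sites)
    return [record for record in records if record[:4] in keep]


# Intersected sites from two shards (toy data).
per_shard_sites = [
    [("chr1", 100, "A", "G")],
    [("chr2", 50, "C", "T")],
]
merged_sites = merge_sites(per_shard_sites, contig_order=["chr1", "chr2"])

# Full records: (chrom, pos, ref, alt, per-sample GT strings).
training_records = [
    ("chr1", 100, "A", "G", ["0/1"]),
    ("chr1", 150, "T", "C", ["0/0"]),
    ("chr2", 50, "C", "T", ["1/1"]),
]
data_shards = [
    [("chr1", 100, "A", "G", ["0/0", "0/1"]), ("chr1", 120, "G", "A", ["0/1", "0/1"])],
    [("chr2", 50, "C", "T", ["1/1", "0/1"])],
]

# Filter the training VCF and every data shard to the merged site set,
# then concatenate the filtered shards in their original order.
filtered_training = filter_records(training_records, merged_sites)
merged_data = [
    record for shard in data_shards for record in filter_records(shard, merged_sites)
]
```

Both resulting callsets contain exactly the same sites, which is the property the downstream ancestry inference relies on.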

Outputs

| Output variable name | Filename, if applicable | Output format and description |
| --- | --- | --- |
| hq_variants_intersection | merged_sites_only_intersection.vcf.bgz | Sites-only VCF of shared high-quality variants between training and input datasets. |
| hq_variants_intersection_idx | merged_sites_only_intersection.vcf.bgz.tbi | Tabix index for hq_variants_intersection. |
| merged_vcf_shards | merged_data_shards.vcf.bgz | Merged input-data VCF filtered to shared HQ sites. |
| merged_vcf_shards_idx | merged_data_shards.vcf.bgz.tbi | Tabix index for merged_vcf_shards. |
| filtered_training_set | full_training_sites_filtered.0.vcf.bgz | Training full VCF filtered to shared HQ sites. |
| filtered_training_set_idx | full_training_sites_filtered.0.vcf.bgz.tbi | Tabix index for filtered_training_set. |

Versioning

All determine_hq_sites_intersection releases are documented in the changelog.

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.