
JointGenotyping Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
|---|---|---|---|
| JointGenotyping_v1.6.10 | February, 2024 | Elizabeth Kiernan & Kaylee Mathews | Please file an issue in WARP. |

Introduction to the JointGenotyping workflow

The JointGenotyping workflow is an open-source, cloud-optimized pipeline that implements joint variant calling, filtering, and (optional) fingerprinting.

The pipeline can be configured to run using one of the following GATK joint genotyping methods:

  • GenotypeGVCFs (default method) performs joint genotyping on GVCF files stored in GenomicsDB and pre-called with HaplotypeCaller.
  • GnarlyGenotyper performs scalable, “quick and dirty” joint genotyping on a set of GVCF files stored in GenomicsDB and pre-called with HaplotypeCaller.

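The pipeline can be configured to run using one of the following GATK variant filtering techniques:

  • Variant Quality Score Recalibration (VQSR) (default method) uses the VariantRecalibrator tool to build a filtering model and the ApplyVQSR tool to apply it.
  • Variant Extract-Train-Score (VETS) uses the ExtractVariantAnnotations, TrainVariantAnnotationsModel, and ScoreVariantAnnotations tools to extract annotations, train a filtering model, and score variants.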

The pipeline takes in a sample map file listing GVCF files produced by HaplotypeCaller in GVCF mode and produces a filtered VCF file (with index) containing genotypes for all samples present in the input VCF files. All sites that are present in the input VCF file are retained. Filtered sites are annotated as such in the FILTER field. If you are new to VCF files, see the file type specification.

The JointGenotyping pipeline can be adapted to run on Microsoft Azure instead of Google Cloud. For more information, see the azure-warp-joint-calling GitHub repository.

Set-up

JointGenotyping Installation and Requirements

To download the latest JointGenotyping release, see the release tags prefixed with "JointGenotyping" on the WARP releases page. All JointGenotyping pipeline releases are documented in the JointGenotyping changelog.

To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.

If you’re running a JointGenotyping workflow version prior to the latest release, the accompanying documentation for that release may be downloaded with the source code on the WARP releases page (see the folder website/docs/Pipelines/JointGenotyping).

The JointGenotyping pipeline can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform. The Terra Whole-Genome-Analysis-Pipeline and Exome-Analysis-Pipeline workspaces contain the JointGenotyping pipeline; workflows for preprocessing, initial variant calling, and sample map generation; workflow configurations; required reference data and other inputs; and example testing data.

Inputs

The JointGenotyping workflow inputs are specified in JSON configuration files. Example configuration files can be found in the test_inputs folder in the WARP repository.
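
As a rough illustration of what such a configuration file looks like, the sketch below builds a small inputs dictionary and serializes it to JSON. It assumes the Cromwell convention of prefixing each input with the workflow name (`JointGenotyping.`); the helper `build_inputs`, the callset name, and the bucket path are hypothetical, and the set of keys is deliberately incomplete.

```python
import json

WORKFLOW = "JointGenotyping"  # assumed workflow name used as the key prefix


def build_inputs(params):
    """Prefix each parameter name with the workflow name, Cromwell-style."""
    return {f"{WORKFLOW}.{name}": value for name, value in params.items()}


# Hypothetical values for illustration only; real examples live in test_inputs.
example = build_inputs({
    "callset_name": "my_callset",
    "sample_name_map": "gs://my-bucket/sample_map.tsv",
    "use_gnarly_genotyper": False,
    "run_vets": False,
})

inputs_json = json.dumps(example, indent=2, sort_keys=True)
print(inputs_json)
```

The example files in the test_inputs folder remain the authoritative reference for which inputs a given run needs.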

Default joint calling input descriptions

The table below describes the pipeline inputs that apply when the pipeline is run with default parameters and uses GenotypeGVCFs for joint calling and VQSR for variant filtering:

| Parameter name | Description | Type |
|---|---|---|
| unpadded_intervals_file | Describes the intervals for which VCF output will be written; exome data will have different captures/targets. | File |
| callset_name | Identifier for the group of VCF files used for joint calling. | String |
| sample_name_map | Path to file containing the sample names and the cloud location of the individual GVCF files. | String |
| ref_fasta | Reference FASTA file used for joint calling; must agree with reference for unpadded_intervals_file. | File |
| ref_fasta_index | Index for reference FASTA file used for joint calling; must agree with reference for unpadded_intervals_file. | File |
| ref_dict | Reference dictionary file used for joint calling; must agree with reference for unpadded_intervals_file. | File |
| dbsnp_vcf | Resource VCF file containing common SNPs and indels used for annotating the VCF file after joint calling. | File |
| dbsnp_vcf_index | Index for dbsnp_vcf. | File |
| snp_recalibration_tranche_values | Set of sensitivity levels used when running the pipeline using VQSR; value should match estimated sensitivity of truth resource passed as hapmap_resource_vcf to the SNPsVariantRecalibratorCreateModel and SNPsVariantRecalibrator as SNPsVariantRecalibratorScattered tasks; filter cutoff based on sensitivity to common variants (more sensitivity = more false positives); required when run_vets is “false”. | Array[String] |
| snp_recalibration_annotation_values | Features used for filtering model (annotations in VCF file); all allele-specific versions. | Array[String] |
| indel_recalibration_tranche_values | Set of sensitivity levels used when running the pipeline using VQSR; value should match estimated sensitivity of truth resource passed as mills_resource_vcf to the IndelsVariantRecalibrator task; filter cutoff based on sensitivity to common variants (more sensitivity = more false positives); required when run_vets is “false”. | Array[String] |
| indel_recalibration_annotation_values | Features used for filtering model when running the pipeline using VQSR; required when run_vets is “false”. | Array[String] |
| eval_interval_list | Subset of the unpadded intervals file used for metrics. | File |
| hapmap_resource_vcf | Used for SNP variant recalibration; see the GATK Resource Bundle for more information. | File |
| hapmap_resource_vcf_index | Used for SNP variant recalibration; see the GATK Resource Bundle for more information. | File |
| omni_resource_vcf | Used for SNP recalibration; see the GATK Resource Bundle for more information. | File |
| omni_resource_vcf_index | Used for SNP recalibration; see the GATK Resource Bundle for more information. | File |
| one_thousand_genomes_resource_vcf | Used for SNP recalibration; see the GATK Resource Bundle for more information. | File |
| one_thousand_genomes_resource_vcf_index | Used for SNP recalibration; see the GATK Resource Bundle for more information. | File |
| mills_resource_vcf | Used for indel variant recalibration; see the GATK Resource Bundle for more information. | File |
| mills_resource_vcf_index | Used for indel variant recalibration; see the GATK Resource Bundle for more information. | File |
| axiomPoly_resource_vcf | Used for indel variant recalibration; see the GATK Resource Bundle for more information. | File |
| axiomPoly_resource_vcf_index | Used for indel variant recalibration; see the GATK Resource Bundle for more information. | File |
| dbsnp_resource_vcf | Optional file used for SNP/indel variant recalibration; set to dbsnp_vcf by default; see the GATK Resource Bundle for more information. | File |
| dbsnp_resource_vcf_index | Optional file used for SNP/indel variant recalibration; set to dbsnp_vcf_index by default; see the GATK Resource Bundle for more information. | File |
| excess_het_threshold | Optional float used for hard filtering joint calls; phred-scaled p-value; set to 54.69 by default to cut off quality scores greater than a z-score of -4.5 (p-value of 3.4e-06). | Float |
| vqsr_snp_filter_level | Used for applying the recalibration model when running the pipeline using VQSR; required when run_vets is “false”. | Float |
| vqsr_indel_filter_level | Used for applying the recalibration model when running the pipeline using VQSR; required when run_vets is “false”. | Float |
| snp_vqsr_downsampleFactor | The downsample factor used for SNP variant recalibration if the number of GVCF files is greater than the snps_variant_recalibration_threshold when running the pipeline using VQSR; required when run_vets is “false”. | Int |
| top_level_scatter_count | Optional integer used to determine how many files the input interval list should be split into; default will split the interval list into 2 files. | Int |
| gather_vcfs | Optional boolean; “true” is used for small callsets containing fewer than 100,000 GVCF files. | Boolean |
| snps_variant_recalibration_threshold | Optional integer that sets the threshold for the number of callset VCF files used to perform recalibration on a single file; if the number of VCF files exceeds the threshold, variants will be downsampled to enable parallelization; default is “500000”. | Int |
| rename_gvcf_samples | Optional boolean describing whether GVCF samples should be renamed; default is “true”. | Boolean |
| unbounded_scatter_count_scale_factor | Optional float used to scale the scatter count when top_level_scatter_count is not provided as input; default is “0.15”. | Float |
| use_allele_specific_annotations | Optional boolean used for SNP and indel variant recalibration when running the pipeline using VQSR; set to “true” by default. | Boolean |
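
The default excess_het_threshold of 54.69 can be checked against the quoted p-value with the standard phred conversion, -10 * log10(p). The short sketch below does only that arithmetic; the function name `phred_scale` is a hypothetical helper, not part of the pipeline.

```python
import math


def phred_scale(p_value):
    """Convert a p-value to the phred scale: -10 * log10(p)."""
    return -10.0 * math.log10(p_value)


# p-value quoted in the excess_het_threshold description
threshold = phred_scale(3.4e-6)
print(round(threshold, 2))  # 54.69
```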

GnarlyGenotyper joint calling input descriptions

The table below describes the additional pipeline inputs that apply when the pipeline is run with GnarlyGenotyper for joint calling:

| Parameter name | Description | Type |
|---|---|---|
| gnarly_scatter_count | Optional integer used to determine how many files to split the interval list into when using GnarlyGenotyper; default is “10”. | Int |
| use_gnarly_genotyper | Optional boolean describing whether GnarlyGenotyper should be used; default is “false”. | Boolean |

VETS variant filtering input descriptions

The table below describes the additional pipeline inputs that apply when the pipeline is run with VETS for variant filtering:

| Parameter name | Description | Type |
|---|---|---|
| targets_interval_list | Describes the intervals for which the filtering model will be trained when running the pipeline using VETS; for more details, see the associated README; required when run_vets is “true”. | File |
| run_vets | Optional boolean used to describe whether the pipeline will use VQSR (run_vets = false) or VETS (run_vets = true) to create the filtering model; default is “false”. | Boolean |
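
Because several inputs are required only in one filtering mode, it can help to check a configuration before launching a run. The sketch below is a hypothetical pre-flight check, not something the pipeline provides: it lists the inputs that the tables above mark as required for the selected mode and reports any that are missing. Keys are shown without a workflow prefix for brevity.

```python
# Inputs the tables above mark as required for each filtering mode.
VQSR_REQUIRED = [
    "snp_recalibration_tranche_values",
    "indel_recalibration_tranche_values",
    "indel_recalibration_annotation_values",
    "vqsr_snp_filter_level",
    "vqsr_indel_filter_level",
    "snp_vqsr_downsampleFactor",
]
VETS_REQUIRED = ["targets_interval_list"]


def missing_filtering_inputs(inputs):
    """Return the names of mode-specific inputs absent from `inputs`."""
    run_vets = inputs.get("run_vets", False)  # documented default is false
    required = VETS_REQUIRED if run_vets else VQSR_REQUIRED
    return [name for name in required if name not in inputs]


# A VETS configuration that forgot the targets interval list.
print(missing_filtering_inputs({"run_vets": True}))
```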

Fingerprinting input descriptions

The table below describes the pipeline inputs that apply to fingerprinting:

| Parameter name | Description | Type |
|---|---|---|
| haplotype_database | Haplotype reference used for fingerprinting (see the CrosscheckFingerprints task). | File |
| cross_check_fingerprints | Optional boolean describing whether or not the pipeline should check fingerprints; default is “true”. | Boolean |
| scatter_cross_check_fingerprints | Optional boolean describing whether CrossCheckFingerprintsScattered or CrossCheckFingerprintSolo should be run; default is “false” and CrossCheckFingerprintSolo will be run. | Boolean |

Runtime parameter input descriptions

The table below describes the pipeline inputs used for setting runtime parameters of tasks:

| Parameter name | Description | Type |
|---|---|---|
| small_disk | Disk size; dependent on cohort size; requires user input; see example JSON configuration files found in the WARP test_inputs folder for recommendations. | Int |
| medium_disk | Disk size; dependent on cohort size; requires user input; see example JSON configuration files found in the WARP test_inputs folder for recommendations. | Int |
| large_disk | Disk size; dependent on cohort size; requires user input; see example JSON configuration files found in the WARP test_inputs folder for recommendations. | Int |
| huge_disk | Disk size; dependent on cohort size; requires user input; see example JSON configuration files found in the WARP test_inputs folder for recommendations. | Int |

JointGenotyping tasks and tools

The JointGenotyping workflow imports individual "tasks," also written in WDL script, from the WARP tasks folder.

Overall, the JointGenotyping workflow:

  1. Splits the input interval list and imports GVCF files.
  2. Performs joint genotyping using GATK GenotypeGVCFs (default) or GnarlyGenotyper.
  3. Creates single site-specific VCF and index files.
  4. Creates and applies a variant filtering model using GATK VQSR (default) or VETS.
  5. Collects variant calling metrics.
  6. Checks fingerprints (optional).

The tasks and tools used in the JointGenotyping workflow are detailed in the table below.

To see specific tool parameters, select the task WDL link in the table; then find the task and view the command {} section of the task in the WDL script. To view or use the exact tool software, see the task's Docker image which is specified in the task WDL # runtime values section as String docker =.

| Task | Tool | Software | Description |
|---|---|---|---|
| CheckSamplesUnique | bash | bash | Checks that there are more than 50 unique samples in sample_name_map. |
| SplitIntervalList | SplitIntervals | GATK | Splits the unpadded interval list for scattering. |
| ImportGVCFs | GenomicsDBImport | GATK | Imports single-sample GVCF files into GenomicsDB before joint genotyping. |
| SplitIntervalList as GnarlyIntervalScatterDude | SplitIntervals | GATK | If use_gnarly_genotyper is “true” (default is “false”), splits the unpadded interval list for scattering; otherwise, this task is skipped. |
| GnarlyGenotyper | GnarlyGenotyper | GATK | If use_gnarly_genotyper is “true” (default is “false”), performs scalable, “quick and dirty” joint genotyping on a set of GVCF files stored in GenomicsDB; otherwise, this task is skipped. |
| GatherVcfs as TotallyRadicalGatherVcfs | GatherVcfsCloud | GATK | If use_gnarly_genotyper is “true” (default is “false”), compiles the site-specific VCF files generated for each interval into one VCF output and index; otherwise, this task is skipped. |
| GenotypeGVCFs | GenotypeGVCFs | GATK | If use_gnarly_genotyper is “false” (default is “false”), performs joint genotyping on GVCF files stored in GenomicsDB; otherwise, this task is skipped. |
| HardFilterAndMakeSitesOnlyVcf | VariantFiltration, MakeSitesOnlyVcf | GATK | Uses the VCF files to hard filter the variant calls; outputs a VCF file with the site-specific (but not genotype) information. |
| GatherVcfs as SitesOnlyGatherVcf | GatherVcfsCloud | GATK | Compiles the site-specific VCF files generated for each interval into one VCF output file and index. |
| JointVcfFiltering as TrainAndApplyVETS | ExtractVariantAnnotations, TrainVariantAnnotationsModel, ScoreVariantAnnotations | GATK | If run_vets is “true” (default is “false”), calls the JointVcfFiltering.wdl subworkflow to extract variant-level annotations, train a model for variant scoring, and score variants; otherwise, this task is skipped. |
| IndelsVariantRecalibrator | VariantRecalibrator | GATK | If run_vets is “false” (default is “false”), uses the compiled VCF file to build a recalibration model to score indel variant quality; produces a recalibration table. |
| SNPsVariantRecalibratorCreateModel | VariantRecalibrator | GATK | If run_vets is “false” (default is “false”) and the number of input GVCF files is greater than snps_variant_recalibration_threshold, builds a recalibration model to score variant quality; otherwise, this task is skipped. |
| SNPsVariantRecalibrator as SNPsVariantRecalibratorScattered | VariantRecalibrator | GATK | If run_vets is “false” (default is “false”) and the number of input GVCF files is greater than snps_variant_recalibration_threshold, builds a scattered recalibration model to score variant quality; otherwise, this task is skipped. |
| Tasks.GatherTranches as SNPGatherTranches | GatherTranches | GATK | If run_vets is “false” (default is “false”) and the number of input GVCF files is greater than snps_variant_recalibration_threshold, gathers tranches into a single file; otherwise, this task is skipped. |
| SNPsVariantRecalibrator as SNPsVariantRecalibratorClassic | VariantRecalibrator | GATK | If run_vets is “false” (default is “false”) and the number of input GVCF files is not greater than snps_variant_recalibration_threshold, builds a recalibration model to score variant quality; otherwise, this task is skipped. |
| ApplyRecalibration | ApplyVQSR | GATK | If run_vets is “false” (default is “false”), scatters the site-specific VCF file and applies a filtering threshold. |
| CollectVariantCallingMetrics as CollectMetricsSharded | CollectVariantCallingMetrics | GATK | If the callset has at least 1000 GVCF files, returns detail and summary metrics for each of the scattered VCF files. For smaller callsets, metrics are instead collected on the merged VCF file produced by the GatherVcfs as FinalGatherVcf task (listed below). |
| GatherVcfs as FinalGatherVcf | GatherVcfsCloud | GATK | If the callset has fewer than 1000 GVCF files, compiles the VCF files prior to collecting metrics in the CollectVariantCallingMetrics as CollectMetricsOnFullVcf task (listed below). |
| CollectVariantCallingMetrics as CollectMetricsOnFullVcf | CollectVariantCallingMetrics | GATK | If the callset has fewer than 1000 GVCF files, returns metrics for the merged VCF file produced in the GatherVcfs as FinalGatherVcf task. |
| GatherVariantCallingMetrics | AccumulateVariantCallingMetrics | GATK | If the callset has at least 1000 GVCF files, gathers metrics produced for each VCF file. |
| GetFingerprintingIntervalIndices | IntervalListTools | GATK | If cross_check_fingerprints is “true” (default is “true”) and scatter_cross_check_fingerprints is “true” (default is “false”), gets and sorts indices for fingerprint intervals; otherwise, this task is skipped. |
| GatherVcfs as GatherFingerprintingVcfs | GatherVcfsCloud | GATK | If cross_check_fingerprints is “true” (default is “true”) and scatter_cross_check_fingerprints is “true” (default is “false”), compiles the fingerprint VCF files; otherwise, this task is skipped. |
| SelectFingerprintSiteVariants | SelectVariants | GATK | If cross_check_fingerprints is “true” (default is “true”) and scatter_cross_check_fingerprints is “true” (default is “false”), selects variants from the fingerprint VCF file; otherwise, this task is skipped. |
| PartitionSampleNameMap | bash | bash | If cross_check_fingerprints is “true” (default is “true”) and scatter_cross_check_fingerprints is “true” (default is “false”), partitions the sample name map so that files can be scattered by partition; otherwise, this task is skipped. |
| CrossCheckFingerprint as CrossCheckFingerprintsScattered | CrosscheckFingerprints | GATK | If cross_check_fingerprints is “true” (default is “true”) and scatter_cross_check_fingerprints is “true” (default is “false”), checks fingerprints for the VCFs in the scattered partitions and produces a metrics file; otherwise, this task is skipped. |
| GatherPicardMetrics as GatherFingerprintingMetrics | bash | bash | If cross_check_fingerprints is “true” (default is “true”) and scatter_cross_check_fingerprints is “true” (default is “false”), combines the fingerprint metrics files into a single metrics file; otherwise, this task is skipped. |
| CrossCheckFingerprint as CrossCheckFingerprintSolo | CrosscheckFingerprints | GATK | If cross_check_fingerprints is “true” (default is “true”) and scatter_cross_check_fingerprints is “false” (default is “false”), checks fingerprints for the single VCF file and produces a metrics file; otherwise, this task is skipped. |

1. Splits the input interval list and imports GVCF files

The SplitIntervalList task uses GATK’s SplitIntervals tool to split the input interval list into two or more interval files. The number of output interval files can be specified using the top_level_scatter_count input parameter or by specifying unbounded_scatter_count_scale_factor, which will scale the number of output files based on the number of input GVCF files.
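
The choice between the two scatter inputs can be sketched as follows. This is an illustration under stated assumptions rather than the workflow's own code: it assumes the scaled count is the rounded product of the scale factor and the number of GVCF files, and that the result is floored at two to match "two or more interval files". The function name `scatter_count` is hypothetical.

```python
def scatter_count(num_gvcfs, top_level_scatter_count=None, scale_factor=0.15):
    """Pick the number of interval files to split the interval list into."""
    if top_level_scatter_count is not None:
        count = top_level_scatter_count
    else:
        # Assumed: scale by the number of input GVCF files and round.
        count = round(scale_factor * num_gvcfs)
    # Assumed floor of two, matching "two or more interval files".
    return max(count, 2)


print(scatter_count(1000))                              # scaled from callset size
print(scatter_count(1000, top_level_scatter_count=40))  # explicit override
```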

The ImportGVCFs task uses GATK’s GenomicsDBImport tool and the input sample map file to import single-sample GVCF files into GenomicsDB before joint genotyping.
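
A sample map can be inspected before it is handed to the workflow. The sketch below assumes the tab-separated, two-column layout (sample name, then GVCF path) that GenomicsDBImport accepts for its sample name map, and reports duplicated sample names in the spirit of the CheckSamplesUnique task. The helper names, sample names, and bucket paths are hypothetical.

```python
import csv
import io
from collections import Counter


def read_sample_map(text):
    """Parse tab-separated 'sample<TAB>path' rows into a list of pairs."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    return [(row[0], row[1]) for row in rows if len(row) >= 2]


def duplicated_samples(pairs):
    """Return sample names that appear more than once, sorted."""
    counts = Counter(name for name, _ in pairs)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical sample names and bucket paths.
sample_map_text = (
    "NA12878\tgs://bucket/NA12878.g.vcf.gz\n"
    "NA12891\tgs://bucket/NA12891.g.vcf.gz\n"
)
pairs = read_sample_map(sample_map_text)
print(len(pairs), duplicated_samples(pairs))
```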

2. Performs joint genotyping using GATK GenotypeGVCFs (default) or GnarlyGenotyper

GenotypeGVCFs (default)

When use_gnarly_genotyper is “false”, the GenotypeGVCFs task uses GATK’s GenotypeGVCFs tool to perform joint genotyping on GVCF files stored in GenomicsDB that have been pre-called with HaplotypeCaller.

GnarlyGenotyper

When use_gnarly_genotyper is “true”, the SplitIntervalList as GnarlyIntervalScatterDude task splits the unpadded interval list for scattering using GATK’s SplitIntervals tool. The output is used as input for the GnarlyGenotyper task which performs joint genotyping on the set of GVCF files and outputs an array of VCF and index files using the GnarlyGenotyper tool. Those VCF and index files are gathered in the next task, GatherVcfs as TotallyRadicalGatherVcfs, which uses the GatherVcfsCloud tool.

3. Creates single site-specific VCF and index files

The HardFilterAndMakeSitesOnlyVcf task takes in the output VCF and index files produced by either GnarlyGenotyper or GenotypeGVCFs. The task uses the excess_het_threshold input value to hard filter the variant calls using GATK’s VariantFiltration tool. After filtering, the site-specific VCF files are generated from the filtered VCF files by removing all sample-specific genotype information, leaving only the site-level summary information at each site.
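
To make the two operations concrete, the sketch below applies them to a single tab-delimited VCF record in plain Python. It is not how the GATK tools are implemented. It assumes the hard filter compares an INFO annotation named ExcessHet against excess_het_threshold and labels failing sites "ExcessHet"; both the comparison and the label are assumptions. Dropping genotypes relies only on the VCF layout, in which the first eight columns hold site-level fields.

```python
def hard_filter_and_strip(vcf_line, excess_het_threshold=54.69):
    """Flag a record whose ExcessHet exceeds the threshold, then drop genotypes."""
    fields = vcf_line.rstrip("\n").split("\t")
    site = fields[:8]  # CHROM POS ID REF ALT QUAL FILTER INFO
    info = dict(
        item.split("=", 1) for item in site[7].split(";") if "=" in item
    )
    excess_het = float(info.get("ExcessHet", 0.0))
    if excess_het > excess_het_threshold:
        # Assumed filter label; the workflow's actual label may differ.
        if site[6] in (".", "PASS"):
            site[6] = "ExcessHet"
        else:
            site[6] = site[6] + ";ExcessHet"
    return "\t".join(site)


record = "chr1\t100\t.\tA\tG\t50\t.\tExcessHet=60.0;AC=2\tGT\t0/1\t1/1"
print(hard_filter_and_strip(record))
```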

Next, the site-specific VCF and index files for each interval are gathered into a single site-specific VCF and index file by the GatherVcfs as SitesOnlyGatherVcf task, which uses the GatherVcfsCloud tool.

4. Creates and applies a variant filtering model using GATK VQSR (default) or VETS

VQSR (default)

If run_vets is “false”, the IndelsVariantRecalibrator task takes in the site-specific VCF and index files generated in Step 3 and uses GATK’s VariantRecalibrator tool to perform the first step of the Variant Quality Score Recalibration (VQSR) technique of filtering variants. The tool builds a model to be used to score and filter indels and produces a recalibration table as output.

After building the indel filtering model, the workflow uses the VariantRecalibrator tool to build a model to be used to score and filter SNPs. If the number of input GVCF files is greater than snps_variant_recalibration_threshold, the SNPsVariantRecalibratorCreateModel, SNPsVariantRecalibrator as SNPsVariantRecalibratorScattered, and Tasks.GatherTranches as SNPGatherTranches tasks are called to scatter the site-specific VCF and index files, build the SNP model, and gather scattered tranches into a single file. If the number of input GVCF files is not greater than snps_variant_recalibration_threshold, the SNPsVariantRecalibrator as SNPsVariantRecalibratorClassic task is called to build the SNP model.
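
The branching described above reduces to a single comparison against the threshold. The sketch below encodes it as a hypothetical helper that returns the task aliases for a given callset size; it mirrors the documented conditions and is not taken from the WDL itself.

```python
def snp_recalibration_tasks(num_gvcfs, snps_variant_recalibration_threshold=500000):
    """Return the SNP model-building task aliases for a given callset size."""
    if num_gvcfs > snps_variant_recalibration_threshold:
        return [
            "SNPsVariantRecalibratorCreateModel",
            "SNPsVariantRecalibratorScattered",
            "SNPGatherTranches",
        ]
    # At or below the threshold, a single classic task builds the model.
    return ["SNPsVariantRecalibratorClassic"]


print(snp_recalibration_tasks(2000))
```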

The ApplyRecalibration task uses GATK’s ApplyVQSR tool to scatter the site-specific VCF file, apply the indel and SNP filtering models, and output a recalibrated VCF and index file.

VETS

If run_vets is “true”, the JointVcfFiltering as TrainAndApplyVETS task takes in the hard filtered and site-specific VCF and index files generated in Step 3 and calls the JointVcfFiltering.wdl subworkflow. This workflow uses the Variant Extract-Train-Score (VETS) algorithm to extract variant-level annotations, train a filtering model, and score variants based on the model. The subworkflow uses the GATK ExtractVariantAnnotations, TrainVariantAnnotationsModel, and ScoreVariantAnnotations tools to create extracted and scored VCF and index files. The output VCF and index files are not filtered by the score assigned by the model. The score is included in the output VCF files in the INFO field as an annotation called “SCORE”.

The VETS algorithm trains the model only over target regions, rather than including exon tails which can lead to poor-quality data. However, the model is applied everywhere including the exon tails.
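
Since the VETS output is scored but not filtered, any cutoff has to be applied downstream. The sketch below reads the SCORE annotation from a VCF INFO string and applies a caller-chosen threshold. The helper names and the threshold are hypothetical, and the sketch assumes that a higher score indicates a more trustworthy variant.

```python
def vets_score(info_field):
    """Extract the SCORE annotation from a VCF INFO string, or None if absent."""
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        if key == "SCORE" and value:
            return float(value)
    return None


def passes(info_field, min_score):
    """Apply a caller-chosen cutoff; records without a score do not pass."""
    score = vets_score(info_field)
    return score is not None and score >= min_score


print(vets_score("AC=2;SCORE=-0.35;DP=10"))
```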

5. Collects variant calling metrics

Summary and per-sample metrics are collected using Picard’s CollectVariantCallingMetrics tool. For large callsets (at least 1000 GVCF files), the workflow calls the CollectVariantCallingMetrics as CollectMetricsSharded task followed by the GatherVariantCallingMetrics task to compute and gather the variant calling metrics into single output files. For small callsets (fewer than 1000 GVCF files), the workflow calls the GatherVcfs as FinalGatherVcf task followed by the CollectVariantCallingMetrics as CollectMetricsOnFullVcf task to first compile the VCF files and then compute the variant calling metrics. Detail and summary metrics files are produced as outputs of these tasks.
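
The callset-size rule for metrics collection can be written out in a few lines. The helper below is hypothetical and simply restates the 1000-file cutoff from the paragraph above; the returned names are the task aliases used in the task table.

```python
def metrics_tasks(num_gvcfs, large_callset_threshold=1000):
    """Return the metrics task aliases, in call order, for a callset size."""
    if num_gvcfs >= large_callset_threshold:
        # Large callsets: collect per shard, then gather the metrics.
        return ["CollectMetricsSharded", "GatherVariantCallingMetrics"]
    # Small callsets: merge the VCF files first, then collect once.
    return ["FinalGatherVcf", "CollectMetricsOnFullVcf"]


print(metrics_tasks(999))
print(metrics_tasks(1000))
```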

6. Checks fingerprints (optional)

If cross_check_fingerprints is “true”, the workflow will use Picard to determine the likelihood that the input and output data were generated from the same individual to verify that the pipeline didn’t swap any of the samples during processing. The SelectFingerprintSiteVariants task uses GATK’s SelectVariants tool to select variants in the site-specific VCF file based on the variants present in the haplotype_database and outputs a fingerprint VCF and index file. Next, the workflow cross-checks the fingerprints and creates an output metrics file using the CrosscheckFingerprints tool.
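
The two fingerprinting flags interact as summarized in the task table. The hypothetical helper below returns the task aliases that would run for a given pair of flag values. The list follows the table's row order and is not meant to imply the exact execution order.

```python
def fingerprint_tasks(cross_check_fingerprints=True,
                      scatter_cross_check_fingerprints=False):
    """List the fingerprinting task aliases that run for a pair of flags."""
    if not cross_check_fingerprints:
        return []
    if scatter_cross_check_fingerprints:
        return [
            "GetFingerprintingIntervalIndices",
            "GatherFingerprintingVcfs",
            "SelectFingerprintSiteVariants",
            "PartitionSampleNameMap",
            "CrossCheckFingerprintsScattered",
            "GatherFingerprintingMetrics",
        ]
    return ["CrossCheckFingerprintSolo"]


print(fingerprint_tasks())  # documented defaults
```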

Outputs

The following table lists the output variables and files produced by the pipeline.

| Output name | Filename, if applicable | Output format and description |
|---|---|---|
| detail_metrics_file | <callset_name>.variant_calling_detail_metrics | Detail metrics file produced using Picard. |
| summary_metrics_file | <callset_name>.variant_calling_summary_metrics | Summary metrics file produced using Picard. |
| output_vcfs | <callset_name>.vcf.gz or <callset_name>.filtered.<idx>.vcf.gz | Array of all site-specific output VCF files. |
| output_vcf_indices | <callset_name>.vcf.gz.tbi or <callset_name>.filtered.<idx>.vcf.gz.tbi | Array of all output VCF index files. |
| output_intervals | scatterDir/<output_intervals_files> | Interval list file produced by the workflow. |
| crosscheck_fingerprint_check | <callset_name>.fingerprintcheck | Fingerprint metrics. |

Versioning and testing

All JointGenotyping pipeline releases are documented in the JointGenotyping changelog and tested using plumbing and scientific test data. To learn more about WARP pipeline testing, see Testing Pipelines.

Citing the JointGenotyping Pipeline

If you use the JointGenotyping Pipeline in your research, please consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.