Skip to main content

Whole Genome Germline Single Sample Overview

Pipeline VersionDate UpdatedDocumentation AuthorQuestions or Feedback
WholeGenomeGermlineSingleSample_v3.1.20 (see releases page)March, 2024Elizabeth KiernanPlease file GitHub issues in WARP or contact the WARP team

Introduction to the Whole Genome Germline Single Sample Pipeline

The Whole Genome Germline Single Sample (WGS) pipeline implements data pre-processing and initial variant calling according to the GATK Best Practices for germline SNP and Indel discovery in human whole-genome sequencing data. It includes the DRAGEN-GATK mode, which makes the pipeline functionally equivalent to DRAGEN’s analysis pipeline (read more in this DRAGEN-GATK blog).

For a broad overview of the pipeline processes, read the GATK Best Practices documentation for data pre-processing and for germline short variant discovery.

The pipeline adheres to the Functional Equivalence pipeline specification (Regier et al., 2018), a standard set of pipeline parameters to promote data interoperability across a multitude of global research projects and consortia. Read the specification for full details or learn more about functionally equivalent pipelines in this GATK blog.

Want to try the WGS pipeline in Terra?

Two workspaces containing example data and instructions are available to test the WGS pipeline:

  1. a DRAGEN-GATK-Germline-Whole-Genome-Pipeline workspace to showcase the DRAGEN-GATK pipeline mode
  2. a Whole-Genome-Analysis-Pipeline workspace to showcase the WGS pipeline with joint calling

Running the DRAGEN-GATK implementation of the WGS pipeline

Multiple WGS parameters are adjusted for the WGS workflow to run in the DRAGEN-GATK mode.

dragen

Individual DRAGEN-GATK parameters

The WGS workflow can be customized to mix and match different DRAGEN-related parameters. In general, the following booleans may be modified to run in different DRAGEN-realted features:

  1. use_bwa_mem
    • When false, the workflow calls the DRAGEN DRAGMAP aligner instead of BWA mem.
  2. run_dragen_mode_variant_calling
    • When true, the workflow creates a DRAGstr model with the GATK CalibrateDragstrModel tool and uses it for variant calling with HaplotypeCaller in --dragen-mode.
  3. perform_bqsr
    • When false, turns off BQSR as it is not necessary for the DRAGEN pipeline; instead, base error correction is performed during variant calling.
  4. dragen_mode_hard_filter
    • When true, the parameter turns on VCF hard filtering.

Two DRAGEN modes for configuring the WGS pipeline

Although the DRAGEN parameters can be turned on and off as needed, there are two mutually exclusive input workflow modes that can automatically configure the DRAGEN-related inputs:

  1. dragen_functional_equivalence_mode
  2. dragen_maximum_quality_mode

The dragen_functional_equivalence_mode runs the pipeline so that it the outputs are functionally equivalent to those produced with the DRAGEN hardware. This mode will automatically set the following parameters:

  1. run_dragen_mode_variant_calling is true.
  2. use_bwa_mem is false.
  3. perform_bqsr is false.
  4. use_spanning_event_genotyping is false.
  5. dragen_mode_hard_filter is true.

To learn more about how outputs are tested for functional equivalence, try the Functional Equivalence workflow in Terra.

The dragen_maximum_quality_mode runs the pipeline using the DRAGMAP aligner and DRAGEN variant calling, but with additional parameters that produce maximum quality results that are not functionally equivalent to the DRAGEN hardware. This mode will automatically set the following parameters:

  1. run_dragen_mode_variant_calling is true.
  2. use_bwa_mem is false.
  3. perform_bqsr is false.
  4. use_spanning_event_genotyping is true.
  5. dragen_mode_hard_filter is true.

When the workflow applies the DRAGMAP aligner, it calls reference files specific to the aligner. These files are located in a public Google bucket and described in the Input descriptions. See the reference README for details on recreating DRAGEN references.

Set-up

Workflow installation and requirements

The WGS workflow is written in the Workflow Description Language WDL and can be downloaded by cloning the warp repository in GitHub. The workflow can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. For the latest workflow version and release notes, see the WGS changelog.

The latest release of the workflow, example data, and dependencies are available from the WARP releases page. You can explore releases using the WARP command-line tool, Wreleaser.

Input descriptions

The tables below describe each of the WGS pipeline inputs and reference files.

Examples of how to specify each input can be found in the example input configuration files (JSONs).

Multiple references are imported as part of a struct from the DNASeqStruct WDL, which is located in the WARP structs library. For references that are part of a struct, the tables below list the relevant struct’s name.

Overall, the workflow has the following input requirements:

  • Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format
  • One or more read groups, one per uBAM file, all belonging to a single sample (SM)
  • Input uBAM files must additionally comply with the following requirements:
    • All filenames have the same suffix (we use ".unmapped.bam")
    • Files pass validation by ValidateSamFile
    • Reads are in query-sorted order
    • All reads have an RG tag
  • Reference genome must be Hg38 with ALT contigs

Struct inputs

The following table describes the inputs imported from a struct. Although these are specified in the WGS workflow using the struct name, the actual inputs for each struct are specified in the example configuration files.

Input nameStruct name (alias)Input descriptionInput type
base_file_nameSampleAndUnmappedBams (sample_and_unmapped_bams)String used for output files; can be set to a read group ID.String
final_gvcf_base_nameSampleAndUnmappedBams (sample_and_unmapped_bams)Base name for the output GVCF file; can be set to a read group ID.String
flowcell_unmapped_bamsSampleAndUnmappedBams (sample_and_unmapped_bams)Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format; each uBAM file contains one or more read groups all belonging to a single sample (SM).Array of files
sample_nameSampleAndUnmappedBams (sample_and_unmapped_bams)A string to describe the sample; can be set to a read group ID.String
unmapped_bam_suffixSampleAndUnmappedBams (sample_and_unmapped_bams)The suffice for the input uBAM file; must be consistent across files; (ex: “.unmapped.bam”).String
contamination_sites_udDNASeqSingleSampleReferences (references)Contamination site files for the CheckContamination task.File
contamination_sites_bedDNASeqSingleSampleReferences (references)Contamination site files for the CheckContamination task.File
contamination_sites_muDNASeqSingleSampleReferences (references)Contamination site files for the CheckContamination task.File
calling_interval_listDNASeqSingleSampleReferences (references)Interval list used for variant calling.File
reference_binDragmapReference (dragmap_reference)Binary representation of the reference FASTA file used for the DRAGEN mode DRAGMAP aligner.File
hash_table_cfg_binDragmapReference (dragmap_reference)Binary representation of the configuration for the hash table used for the DRAGEN mode DRAGMAP aligner.File
hash_table_cmpDragmapReference (dragmap_reference)Compressed representation of the hash table that is used for the DRAGEN mode DRAGMAP aligner.File
haplotype_scatter_countVariantCallingScatterSettings (scatter_settings)Scatter count used for variant calling.Int
break_bands_at_multiples_ofVariantCallingScatterSettings (scatter_settings)Breaks reference bands up at genomic positions that are multiples of this number; used to reduce GVCF file size.Int
preemptible_triesPapiSettings (papi_settings)Number of times the workflow can be preempted.Int
agg_preemptible_triesPapiSettings (papi_settings)Number of preemtible machine tries for the BamtoCram task.Int

Additional inputs

Additional inputs that are not contained in a struct are described in the table below. Similar to the struct inputs, these inputs are specified in the example configuration files or, when noted, are hardcoded into the WDL workflow.

  • Optional inputs, like the fingerprint_genotypes_file, need to match your input samples. For example, the fingerprint file in the workflow's test input configuration JSON is set up to check fingerprints for the NA12878 Plumbing sample. The sample name in the VCF matches the name used for the sample_name input.
Input nameInput descriptionInput type
fingerprint_genotypes_fileGenotype VCF if optionally performing fingerprinting. For the CheckFingerprint task (below), the sample name specified in the sample_and_unmapped_bams variable must match the sample name in the fingerprint_genoptyes_file (VCF format).File
fingerprint_genotypes_indexOptional index for the fingerprinting VCF.File
wgs_coverage_interval_listInterval list for the CollectWgsMetrics tool.File
provide_bam_outputIf set to true, provides the aligned BAM and index as workflow output; default set to false.Boolean
use_gatk3_haplotype_callerUses the GATK3.5 HaplotypeCalller; default set to true.Boolean
dragen_functional_equivalence_modeBoolean used to run the WGS pipeline in a mode functionally equivalent to DRAGEN; set to false by default. This parameter is mutually exclusive with the dragen_maxiumum_quality_mode and will result in an error if both are set to true.Boolean
dragen_maximum_quality_modeBoolean used to run the pipeline in DRAGEN mode with modifications to produce maximum quality results; set to false by default. This parameter is mutually exclusive with the dragen_functional_equivalence_mode and will result in an error if both are set to true.Boolean
run_dragen_mode_variant_callingBoolean used to indicate that DRAGEN mode should be used for variant calling; default set to false but must be true to compose DRAGstr model and perform variant calling with HaplotypeCaller in dragen-mode.Boolean
use_spanning_event_genotypingBoolean used to call the HaplotypeCaller --disable-spanning-event-genotyping parameter; default set to true so that variant calling includes spanning events. Set to false to run the DRAGEN pipeline.Boolean
unmap_contaminant_readsBoolean to indicate whether to identify extremely short alignments (with clipping on both sides) as cross-species contamination and unmap the reads; default set to true. This feature is not used in the pipeline mode functionally equivalent to DRAGEN.Boolean
perform_bqsrBoolean to turn on base recalibration with BQSR; default set to true, but not necessary when running the pipeline in DRAGEN mode.Boolean
use_bwa_memBoolean indicating if workflow should use the BWA mem aligner; default set to true, but must be set to false to alternatively run the DRAGEN-GATK DRAGMAP aligner.Boolean
use_dragen_hard_filteringBoolean that indicates if workflow should perform hard filtering using the GATK VariantFiltration tool with the --filter-name "DRAGENHardQUAL"; default set to false.Boolean
read_lengthSet to a max of 250 for collecting WGS metrics; hardcoded in the workflow WDL.Int
lod_thresholdLOD threshold for checking fingerprints; hardcoded to -20.0 in workflow WDL.Float
cross_check_fingerprints_byChecks fingerprints by READGROUP; hardcoded in the workflow WDL.String
recalibrated_bam_basenameBasename for the recalibrated BAM file; hardcoded to be the base_file_name in the sample_and_unmapped_bams struct + ".aligned.duplicates_marked.recalibrated" in the workflow WDL.String
final_gvcf_base_nameBasename for the final GVCF file; harcoded in workflow WDL to be the final_gvcf_base_name from the sample_and_unmapped_bams struct, if applicable, or the base_file_name.String

Workflow tasks and tools

The WGS workflow imports a series of tasks, coded in WDL scripts, from the tasks library. To learn more about the software tools implemented in these tasks, read the GATK support site’s data pre-processing and germline short variant discovery overviews.

Want to use the Whole Genome Germline Single Sample workflow in your publication?

Check out the workflow Methods to get started!

The sections below outline each of the WGS workflow’s tasks and include tables detailing substasks, tools, and relevant software.

Quality control metric calculation and alignment of the unmapped BAM

Workflow WDL task name and link: UnmappedBamToAlignedBam.UnmappedBamToAlignedBam

The table below details the subtasks called by the UnmappedBamToAlignedBam task, which calculates metrics on the unsorted, unaligned BAMs for each readgroup using Picard and then aligns reads using either BWA mem or the DRAGEN DRAGMAP aligner. It optionally corrects base calling errors with BQSR. It lastly merges individual recalibrated BAM files into an aggregated BAM.

Subtask name (alias) and task WDL linkToolSoftwareDescription
QC.CollectQualityYieldMetrics (CollectQualityYieldMetrics)CollectQualityYieldMetricsPicardCalculates QC metrics on the unaligned BAM.
SplitRG.SplitLargeReadGroup (SplitRG)------If the BAM size is large, will split the BAMs; performs alignment using either BWA mem (use_bwa_mem = true) or the DRAGMAP aligner (use_bwa_mem = false).
Alignment.SamToFastqAndBwaMemAndMba (SamToFastqAndBwaMemAndMba)SamToFastq; MergeBamAlignmentBWA mem, PicardWhen use_bwa_mem = true, aligns using BWA mem; if use_bwa_mem = false, aligns with DRAGMAP aligner in the DragmapAlignment.SamToFastqAndDragmapAndMba task below.
DragmapAlignment.SamToFastqAndDragmapAndMba (SamToFastqAndDragmapAndMba)dragen-os, MergeBamAlignmentDragmap, PicardWhen use_bwa_mem = false, aligns with the DRAGMAP aligner.
QC.CollectUnsortedReadgroupBamQualityMetrics (CollectUnsortedReadgroupBamQualityMetrics)CollectMultipleMetricsPicardPerforms QC on the aligned BAMs with unsorted readgroups.
Processing.MarkDuplicates (MarkDuplicates)MarkDuplicatesPicardMarks duplicate reads.
Processing.SortSamSortSamPicardSorts the aggregated BAM by coordinate sort order.
QC.CrossCheckFingerprints (CrossCheckFingerprints)CrosscheckFingerprintsPicardOptionally checks fingerprints if haplotype database is provided.
Utils.CreateSequenceGroupingTSV (CreateSequenceGroupingTSV)---pythonCreates the sequencing groupings used for BQSR and PrintReads Scatter.
Processing.CheckContaminationVerifyBamID2---Checks cross-sample contamination prior to variant calling.
Processing.BaseRecalibrator (BaseRecalibrator)BaseRecalibratorGATKIf perform_bqsr is true, performs base recalibration by interval. When using the DRAGEN-GATK mode, perform_bqsr is optionally false as base calling errors are corrected in the DRAGEN variant calling step.
Processing.GatherBqsrReports (GatherBqsrReports)GatherBQSRReportsGATKMerges the BQSR reports resulting from by-interval calibration.
Processing.ApplyBQSR (ApplyBQSR)ApplyBQSRGATKApplies the BQSR base recalibration model by interval.
Processing.GatherSortedBamFiles (GatherBamFiles)GatherBamFilesPicardMerges the recalibrated BAM files.

Aggregate the aligned recalibrated BAM and calculate quality control metrics

Workflow task name and link: AggregatedBamQC.AggregatedBamQC

The table below describes the subtasks of AggregatedBamQC.AggregatedBamQC task, which calculates quality control metrics on the aggregated recalibrated BAM file and checks for sample contamination.

Subtask name (alias) and linkToolSoftwareDescription
QC.CollectReadgroupBamQualityMetrics (CollectReadgroupBamQualityMetrics)CollectMultipleMetricsPicardCollects alignment summary and GC bias quality metrics on the recalibrated BAM.
QC.CollectAggregationMetrics (CollectAggregationMetrics)CollectMultipleMetricsPicardCollects quality metrics from the aggregated BAM.
QC.CheckFingerprint (CheckFingerprint)CheckFingerprintPicardCheck that the fingerprint of the sample BAM matches the sample array.
QC.CalculateReadGroupChecksum (CalculateReadGroupChecksum)CalculateReadGroupChecksumPicardGenerate a checksum per readgroup in the final BAM.

Convert the aggregated recalibrated BAM to CRAM

Workflow task name and link: BamToCram.BamToCram

The table below describes the subtasks of BamToCram.BamToCram task which converts the recalibrated BAM to CRAM format and produces a validation report.

Subtask name (alias) and linkToolSoftwareDescription
Utils.ConvertToCram (ConvertToCram)view, indexsamtoolsConverts the merged, recalibrated BAM to CRAM.
QC.CheckPreValidation (CheckPreValidation)---pythonChecks if the data has massively high duplication or chimerism rates.
QC.ValidateSamFile (ValidateCram)ValidateSamFilePicardValidates the output CRAM file.

Collect WGS metrics using stringent thresholds

Workflow task name and link: QC.CollectWgsMetrics

The table below describes the QC.CollectWgsMetrics task which uses the Picard CollectWGSMetrics tool to calculate whole genome metrics using stringent thresholds.

Subtask name (alias) and linkToolSoftwareDescription
CollectWgsMetricsCollectWgsMetricsPicardCollects WGS metrics using stringent thresholds; tasks will break if the read lengths in the BAM are greater than 250, so the max read_length is set to 250 by default.

Collect raw WGS metrics using less stringent thresholds

Workflow task name and link: QC.CollectRawWgsMetrics

The table below describes the QC.CollectRawWgsMetrics task which uses the Picard CollecRawtWGSMetrics tool to calculate whole genome metrics using common thresholds.

Subtask name (alias) and linkToolSoftwareDescription
QC.CollectRawWgsMetrics (CollectRawWgsMetrics)CollectRawWgsMetricsPicardCollects the raw WGS metrics with commonly used QC metrics.

Call variants with HaplotypeCaller

Workflow task name and link: VariantCalling.VariantCalling (BamToGvcf)

The table below describes the subtasks of the VariantCalling.VariantCalling (BamToGvcf) workflow, which uses the GATK HaplotypeCaller for SNP and Indel discovery. When the workflow runs in DRAGEN mode, it produces a Dragstr model that is used during variant calling, and it performs hard filtering.

Subtask name (alias) and linkToolSoftwareDescription
Dragen.CalibrateDragstrModel (DragstrAutoCalibration)CalibrateDragstrModelGATKIf run_dragen_mode_variant_calling is true, uses the reference FASTA file, the reference’s corresponding public STR (short tandem repeat) table file, and the recalibrated BAM to estimate the parameters for the DRAGEN STR model. The output parameter tables are used for the DRAGEN mode HaplotypeCaller.
Utils.ScatterIntervalList (ScatterIntervalList)IntervalListToolsPicard, pythonBreaks the interval list into subintervals for downstream variant calling.
Calling.HaplotypeCaller_GATK35_GVCF (HaplotypeCallerGATK3)PrintReads, HaplotypeCallerGATK4, GATK3.5If use_gatk3_haplotype_caller is true, will call GATK3 Haplotypecaller to call variants in GVCF mode, otherwise will use the HaplotypeCaller_GATK4_VCF task below.
Calling.HaplotypeCaller_GATK4_VCF (HaplotypeCallerGATK4)HaplotypeCallerGATK4If use_gatk3_haplotype_caller is false, will call GATK4 Haplotypecaller to call variants in GVCF mode. If run_dragen_mode_variant_calling is true, uses the --dragstr-params-path containing the DragSTR model and runs it with HaplotypeCaller in --dragen-mode.
Calling.DragenHardFilterVcf (DragenHardFilterVcf)VariantFiltrationGATKIf dragen_mode_hard_filter is true, performs hard filtering that matches the filtering performed by the DRAGEN 3.4.12 pipeline.
BamProcessing.SortSam (SortBamout)SortSamPicardIf the option to make a BAM out file is selected ( make_bamout is true), sorts and gathers the BAM files into one file.
MergeBamoutsmerge, indexsamtoolsIf make_bamout is true, makes corrections to the merged BAM out file from Picard.
Calling.MergeVCFs (MergeVCFs)MergeVcfsPicardCombines by-interval (g)VCFs into a single sample (g)VCF file.
QC.ValidateVCF (ValidateVCF)ValidateVariantsGATKValidates the (g)VCF from HaplotypeCaller with the -gvcf parameter.
QC.CollectVariantCallingMetrics (CollectVariantCallingMetrics)CollectVariantCallingMetricsPicardPerforms quality control on the (g)VCF.

Workflow outputs

The table below describes the final workflow outputs. If running the workflow on Cromwell, these outputs are found in the respective task's execution directory.

Output variable nameDescriptionType
quality_yield_metricsThe quality metrics calculated for the unmapped BAM files.Array of files
unsorted_read_group_base_distribution_by_cycle_pdfPDF of the base distribution for each unsorted, readgroup-specific BAM.Array of files
unsorted_read_group_base_distribution_by_cycle_metricsMetrics of the base distribution by cycle for each unsorted, readgroup-specific BAM.Array of files
unsorted_read_group_insert_size_histogram_pdfHistograms of insert size for the unsorted, readgroup-specific BAMs.Array of files
unsorted_read_group_insert_size_metricsInsert size metrics for the unsorted, readgroup-specific BAMs.Array of files
unsorted_read_group_quality_by_cycle_pdfQuality by cycle PDF for the unsorted, readgroup-specific BAMs.Array of files
unsorted_read_group_quality_by_cycle_metricsQuality by cycle metrics for the unsorted, readgroup-specific BAMs.Array of files
unsorted_read_group_quality_distribution_pdfQuality distribution PDF for the unsorted, readgroup-specific BAMs.Array of files
unsorted_read_group_quality_distribution_metricsQuality distribution metrics for the unsorted, readgroup-specific BAMs.Array of files
read_group_alignment_summary_metricsAlignment summary metrics for the aggregated BAM.File
read_group_gc_bias_detail_metricsGC bias detail metrics for the aggregated BAM.File
read_group_gc_bias_pdfPDF of the GC bias by readgroup for the aggregated BAM.File
read_group_gc_bias_summary_metricsGC bias summary metrics by readgroup for the aggregated BAM.File
cross_check_fingerprints_metricsFingerprint metrics file if optional fingerprinting is performed.File
selfSMContamination estimate from VerifyBamID2.File
contaminationEstimated contamination from the CheckContamination task.Float
calculate_read_group_checksum_md5MD5 checksum for aggregated BAM.File
agg_alignment_summary_metricsAlignment summary metrics for the aggregated BAM.File
agg_bait_bias_detail_metricsBait bias detail metrics for the aggregated BAM.File
agg_bait_bias_summary_metricsBait bias summary metrics for the aggregated BAM.File
agg_gc_bias_detail_metricGC bias detail metrics for the aggregated BAM.File
agg_gc_bias_pdfPDF of GC bias for the aggregated BAM.File
agg_gc_bias_summary_metricsGC bias summary metrics for the aggregated BAM.File
agg_insert_size_histogram_pdfHistogram of insert size for aggregated BAM.File
agg_insert_size_metricsInsert size metrics for the aggregated BAM.File
agg_pre_adapter_detail_metricsDetails metrics for artifacts that occur prior to the addition of adaptors for the aggregated BAM.File
agg_pre_adapter_summary_metricsSummary metrics for artifacts that occur prior to the addition of adaptors for the aggregated BAM.File
agg_quality_distribution_pdfPDF of the quality distribution for the aggregated BAM.File
agg_quality_distribution_metricsQuality distribution metrics for the aggregated BAM.File
agg_error_summary_metricsError summary metrics for the aggregated BAM.File
fingerprint_summary_metricsOptional fingerprint summary metrics for the aggregated BAM.File
fingerprint_detail_metricsOptional fingerprint detail metrics for the aggregated BAM.File
wgs_metricsMetrics from the CollectWgsMetrics tool.File
raw_wgs_metricMetrics from the CollectRawWgsMetrics tool.File
duplicate_metricsDuplicate read metrics from the MarkDuplicates tool.File
output_bqsr_reportsBQSR reports if BQSR tool is run.File
gvcf_summary_metrics(g)VCF summary metricsFile
gvcf_detail_metrics(g)VCF detail metrics.File
output_bamOutput aligned recalibrated BAM if the provided_output_bam is true.File
output_bam_indexOptional index for the aligned recalibrated BAM if the provided_output_bam is true.File
output_cramAligned, recalibrated output CRAM.File
output_cram_indexIndex for the aligned recalibrated CRAM.File
output_cram_md5MD5 checksum for the aligned recalibrated BAM.File
validate_cram_file_reportValidated report for the CRAM created with the ValidateSam tool.File
output_vcfFinal reblocked GVCF with variant calls produced by HaplotypeCaller (read more in the Reblocking section below).File
output_vcf_indexIndex for the final GVCF.File

Reblocking

Reblocking is a process that compresses a HaplotypeCaller GVCF by merging homRef blocks according to new genotype quality (GQ) bands and facilitates joint genotyping by removing alt alleles that do not appear in the called genotype.

As of November 2021, reblocking is a default task in the WGS pipeline. To skip reblocking, add the following to the workflow's input configuration file (JSON):

"WholeGenomeGermlineSingleSample.BamToGvcf.skip_reblocking": true

The Reblocking task uses the GATK ReblockGVCF tool with the arguments:

-do-qual-approx -floor-blocks -GQB 20 -GQB 30 -GQB 40 

The following summarizes how reblocking affects the WGS GVCF and downstream tools compared to the GVCF produced with the default HaplotypeCaller GQ bands:

  1. PLs are omitted for homozygous reference sites to save space– GQs are output for genotypes, PLs can be approximated as [0, GQ, 2*GQ].

  2. GQ resolution for homozygous reference genotypes is reduced (i.e. homRef GQs will be underconfident) which may affect analyses like de novo calling where well-calibrated reference genotype qualities are important.

  3. Alleles that aren’t called in the sample genotype are dropped. Each variant should have no more than two non-symbolic alt alleles, with the majority having just one plus <NON_REF>.

  4. New annotations enable merging data for filtering without using genotypes. For example:

    • RAW_GT_COUNT(S) for doing ExcessHet calculation from a sites-only file.
    • QUALapprox and/or AS_QUALapprox for doing QUAL approximation/filling.
    • QUAL VCF field from a combined sites-only field.
    • VarDP and/or AS_VarDP used to calculate QualByDepth/QD annotation for VQSR.
  5. The MIN_DP has been removed.

  6. Reblocked GVCFs have the following cost/scale improvements:

    • A reduced storage footprint compared with HaplotypeCaller GVCF output.
    • Fewer VariantContexts (i.e. lines) per VCF which speeds up GenomicsDB/Hail import.
    • Fewer alternate alleles which reduce memory requirements for merging.

Additionally, the 4 GQ band schema has specific improvements compared with the 7-band schema:

  1. It does not drop GQ0s; reblocked GVCFs should cover all the positions that the input GVCF covers.
  2. It has no overlaps; the only overlapping positions should be two variants (i.e. deletions) on separate haplotypes.
  3. No more no-calls; all genotypes should be called. Positions with no data will be homRef with GQ0.

Read more about the reblocked GVCFs in the WARP Blog.

Base quality scores

The final CRAM files have base quality scores binned according to the Functional Equivalence specification (Regier et al., 2018). This does not apply to the workflow's DRAGEN modes, which do not perform BQSR recalibration.

Original ScoreScore after BQSR recalibration
1-6unchanged
7-1210
13-2220
22-infinity30

Important notes

  • Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
  • When the pipeline runs in the dragen_functional_equivalence_mode, it produces functionally equivalent outputs to the DRAGEN pipeline.
  • Additional information about the GATK tool parameters and the DRAGEN-GATK best practices pipeline can be found on the GATK support site.

Citing the WGS Pipeline

If you use the WGS Pipeline in your research, please cite our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Contact us

Please help us make our tools better by contacting the WARP team for pipeline-related suggestions or questions.

Licensing

Copyright Broad Institute, 2020 | BSD-3

The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.