Whole Genome Germline Single Sample Overview
Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
---|---|---|---|
WholeGenomeGermlineSingleSample_v3.1.20 (see releases page) | March, 2024 | Elizabeth Kiernan | Please file an issue in WARP. |
Introduction to the Whole Genome Germline Single Sample Pipeline
The Whole Genome Germline Single Sample (WGS) pipeline implements data pre-processing and initial variant calling according to the GATK Best Practices for germline SNP and Indel discovery in human whole-genome sequencing data. It includes the DRAGEN-GATK mode, which makes the pipeline functionally equivalent to DRAGEN’s analysis pipeline (read more in this DRAGEN-GATK blog).
For a broad overview of the pipeline processes, read the GATK Best Practices documentation for data pre-processing and for germline short variant discovery.
The pipeline adheres to the Functional Equivalence pipeline specification (Regier et al., 2018), a standard set of pipeline parameters to promote data interoperability across a multitude of global research projects and consortia. Read the specification for full details or learn more about functionally equivalent pipelines in this GATK blog.
Two workspaces containing example data and instructions are available to test the WGS pipeline:
- a DRAGEN-GATK-Germline-Whole-Genome-Pipeline workspace to showcase the DRAGEN-GATK pipeline mode
- a Whole-Genome-Analysis-Pipeline workspace to showcase the WGS pipeline with joint calling
Running the DRAGEN-GATK implementation of the WGS pipeline
Multiple WGS parameters are adjusted for the WGS workflow to run in the DRAGEN-GATK mode.
Individual DRAGEN-GATK parameters
The WGS workflow can be customized to mix and match different DRAGEN-related parameters. In general, the following booleans may be modified to run in different DRAGEN-realted features:
use_bwa_mem
- When false, the workflow calls the DRAGEN DRAGMAP aligner instead of BWA mem.
run_dragen_mode_variant_calling
- When true, the workflow creates a DRAGstr model with the GATK CalibrateDragstrModel tool and uses it for variant calling with HaplotypeCaller in --dragen-mode.
perform_bqsr
- When false, turns off BQSR as it is not necessary for the DRAGEN pipeline; instead, base error correction is performed during variant calling.
dragen_mode_hard_filter
- When true, the parameter turns on VCF hard filtering.
Two DRAGEN modes for configuring the WGS pipeline
Although the DRAGEN parameters can be turned on and off as needed, there are two mutually exclusive input workflow modes that can automatically configure the DRAGEN-related inputs:
- dragen_functional_equivalence_mode
- dragen_maximum_quality_mode
The dragen_functional_equivalence_mode runs the pipeline so that it the outputs are functionally equivalent to those produced with the DRAGEN hardware. This mode will automatically set the following parameters:
run_dragen_mode_variant_calling
is true.use_bwa_mem
is false.perform_bqsr
is false.use_spanning_event_genotyping
is false.dragen_mode_hard_filter
is true.
To learn more about how outputs are tested for functional equivalence, try the Functional Equivalence workflow in Terra.
The dragen_maximum_quality_mode runs the pipeline using the DRAGMAP aligner and DRAGEN variant calling, but with additional parameters that produce maximum quality results that are not functionally equivalent to the DRAGEN hardware. This mode will automatically set the following parameters:
run_dragen_mode_variant_calling
is true.use_bwa_mem
is false.perform_bqsr
is false.use_spanning_event_genotyping
is true.dragen_mode_hard_filter
is true.
When the workflow applies the DRAGMAP aligner, it calls reference files specific to the aligner. These files are located in a public Google bucket and described in the Input descriptions. See the reference README for details on recreating DRAGEN references.
Set-up
Workflow installation and requirements
The WGS workflow is written in the Workflow Description Language WDL and can be downloaded by cloning the warp repository in GitHub. The workflow can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. For the latest workflow version and release notes, see the WGS changelog.
The latest release of the workflow, example data, and dependencies are available from the WARP releases page. You can explore releases using the WARP command-line tool, Wreleaser.
Input descriptions
The tables below describe each of the WGS pipeline inputs and reference files.
Examples of how to specify each input can be found in the example input configuration files (JSONs).
Multiple references are imported as part of a struct from the DNASeqStruct WDL, which is located in the WARP structs library. For references that are part of a struct, the tables below list the relevant struct’s name.
Overall, the workflow has the following input requirements:
- Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format
- One or more read groups, one per uBAM file, all belonging to a single sample (SM)
- Input uBAM files must additionally comply with the following requirements:
- All filenames have the same suffix (we use ".unmapped.bam")
- Files pass validation by ValidateSamFile
- Reads are in query-sorted order
- All reads have an RG tag
- Reference genome must be Hg38 with ALT contigs
Struct inputs
The following table describes the inputs imported from a struct. Although these are specified in the WGS workflow using the struct name, the actual inputs for each struct are specified in the example configuration files.
Input name | Struct name (alias) | Input description | Input type |
---|---|---|---|
base_file_name | SampleAndUnmappedBams (sample_and_unmapped_bams) | String used for output files; can be set to a read group ID. | String |
final_gvcf_base_name | SampleAndUnmappedBams (sample_and_unmapped_bams) | Base name for the output GVCF file; can be set to a read group ID. | String |
flowcell_unmapped_bams | SampleAndUnmappedBams (sample_and_unmapped_bams) | Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format; each uBAM file contains one or more read groups all belonging to a single sample (SM). | Array of files |
sample_name | SampleAndUnmappedBams (sample_and_unmapped_bams) | A string to describe the sample; can be set to a read group ID. | String |
unmapped_bam_suffix | SampleAndUnmappedBams (sample_and_unmapped_bams) | The suffice for the input uBAM file; must be consistent across files; (ex: “.unmapped.bam”). | String |
contamination_sites_ud | DNASeqSingleSampleReferences (references) | Contamination site files for the CheckContamination task. | File |
contamination_sites_bed | DNASeqSingleSampleReferences (references) | Contamination site files for the CheckContamination task. | File |
contamination_sites_mu | DNASeqSingleSampleReferences (references) | Contamination site files for the CheckContamination task. | File |
calling_interval_list | DNASeqSingleSampleReferences (references) | Interval list used for variant calling. | File |
reference_bin | DragmapReference (dragmap_reference) | Binary representation of the reference FASTA file used for the DRAGEN mode DRAGMAP aligner. | File |
hash_table_cfg_bin | DragmapReference (dragmap_reference) | Binary representation of the configuration for the hash table used for the DRAGEN mode DRAGMAP aligner. | File |
hash_table_cmp | DragmapReference (dragmap_reference) | Compressed representation of the hash table that is used for the DRAGEN mode DRAGMAP aligner. | File |
haplotype_scatter_count | VariantCallingScatterSettings (scatter_settings) | Scatter count used for variant calling. | Int |
break_bands_at_multiples_of | VariantCallingScatterSettings (scatter_settings) | Breaks reference bands up at genomic positions that are multiples of this number; used to reduce GVCF file size. | Int |
preemptible_tries | PapiSettings (papi_settings) | Number of times the workflow can be preempted. | Int |
agg_preemptible_tries | PapiSettings (papi_settings) | Number of preemtible machine tries for the BamtoCram task. | Int |
Additional inputs
Additional inputs that are not contained in a struct are described in the table below. Similar to the struct inputs, these inputs are specified in the example configuration files or, when noted, are hardcoded into the WDL workflow.
- Optional inputs, like the fingerprint_genotypes_file, need to match your input samples. For example, the fingerprint file in the workflow's test input configuration JSON is set up to check fingerprints for the NA12878 Plumbing sample. The sample name in the VCF matches the name used for the
sample_name
input.
Input name | Input description | Input type |
---|---|---|
fingerprint_genotypes_file | Genotype VCF if optionally performing fingerprinting. For the CheckFingerprint task (below), the sample name specified in the sample_and_unmapped_bams variable must match the sample name in the fingerprint_genoptyes_file (VCF format). | File |
fingerprint_genotypes_index | Optional index for the fingerprinting VCF. | File |
wgs_coverage_interval_list | Interval list for the CollectWgsMetrics tool. | File |
provide_bam_output | If set to true, provides the aligned BAM and index as workflow output; default set to false. | Boolean |
use_gatk3_haplotype_caller | Uses the GATK3.5 HaplotypeCalller; default set to true. | Boolean |
dragen_functional_equivalence_mode | Boolean used to run the WGS pipeline in a mode functionally equivalent to DRAGEN; set to false by default. This parameter is mutually exclusive with the dragen_maxiumum_quality_mode and will result in an error if both are set to true. | Boolean |
dragen_maximum_quality_mode | Boolean used to run the pipeline in DRAGEN mode with modifications to produce maximum quality results; set to false by default. This parameter is mutually exclusive with the dragen_functional_equivalence_mode and will result in an error if both are set to true. | Boolean |
run_dragen_mode_variant_calling | Boolean used to indicate that DRAGEN mode should be used for variant calling; default set to false but must be true to compose DRAGstr model and perform variant calling with HaplotypeCaller in dragen-mode. | Boolean |
use_spanning_event_genotyping | Boolean used to call the HaplotypeCaller --disable-spanning-event-genotyping parameter; default set to true so that variant calling includes spanning events. Set to false to run the DRAGEN pipeline. | Boolean |
unmap_contaminant_reads | Boolean to indicate whether to identify extremely short alignments (with clipping on both sides) as cross-species contamination and unmap the reads; default set to true. This feature is not used in the pipeline mode functionally equivalent to DRAGEN. | Boolean |
perform_bqsr | Boolean to turn on base recalibration with BQSR; default set to true, but not necessary when running the pipeline in DRAGEN mode. | Boolean |
use_bwa_mem | Boolean indicating if workflow should use the BWA mem aligner; default set to true, but must be set to false to alternatively run the DRAGEN-GATK DRAGMAP aligner. | Boolean |
use_dragen_hard_filtering | Boolean that indicates if workflow should perform hard filtering using the GATK VariantFiltration tool with the --filter-name "DRAGENHardQUAL"; default set to false. | Boolean |
read_length | Set to a max of 250 for collecting WGS metrics; hardcoded in the workflow WDL. | Int |
lod_threshold | LOD threshold for checking fingerprints; hardcoded to -20.0 in workflow WDL. | Float |
cross_check_fingerprints_by | Checks fingerprints by READGROUP; hardcoded in the workflow WDL. | String |
recalibrated_bam_basename | Basename for the recalibrated BAM file; hardcoded to be the base_file_name in the sample_and_unmapped_bams struct + ".aligned.duplicates_marked.recalibrated" in the workflow WDL. | String |
final_gvcf_base_name | Basename for the final GVCF file; harcoded in workflow WDL to be the final_gvcf_base_name from the sample_and_unmapped_bams struct, if applicable, or the base_file_name. | String |
Workflow tasks and tools
The WGS workflow imports a series of tasks, coded in WDL scripts, from the tasks library. To learn more about the software tools implemented in these tasks, read the GATK support site’s data pre-processing and germline short variant discovery overviews.
Check out the workflow Methods to get started!
The sections below outline each of the WGS workflow’s tasks and include tables detailing substasks, tools, and relevant software.
Quality control metric calculation and alignment of the unmapped BAM
Workflow WDL task name and link: UnmappedBamToAlignedBam.UnmappedBamToAlignedBam
The table below details the subtasks called by the UnmappedBamToAlignedBam task, which calculates metrics on the unsorted, unaligned BAMs for each readgroup using Picard and then aligns reads using either BWA mem or the DRAGEN DRAGMAP aligner. It optionally corrects base calling errors with BQSR. It lastly merges individual recalibrated BAM files into an aggregated BAM.
Subtask name (alias) and task WDL link | Tool | Software | Description |
---|---|---|---|
QC.CollectQualityYieldMetrics (CollectQualityYieldMetrics) | CollectQualityYieldMetrics | Picard | Calculates QC metrics on the unaligned BAM. |
SplitRG.SplitLargeReadGroup (SplitRG) | --- | --- | If the BAM size is large, will split the BAMs; performs alignment using either BWA mem (use_bwa_mem = true) or the DRAGMAP aligner (use_bwa_mem = false). |
Alignment.SamToFastqAndBwaMemAndMba (SamToFastqAndBwaMemAndMba) | SamToFastq; MergeBamAlignment | BWA mem, Picard | When use_bwa_mem = true, aligns using BWA mem; if use_bwa_mem = false, aligns with DRAGMAP aligner in the DragmapAlignment.SamToFastqAndDragmapAndMba task below. |
DragmapAlignment.SamToFastqAndDragmapAndMba (SamToFastqAndDragmapAndMba) | dragen-os, MergeBamAlignment | Dragmap, Picard | When use_bwa_mem = false, aligns with the DRAGMAP aligner. |
QC.CollectUnsortedReadgroupBamQualityMetrics (CollectUnsortedReadgroupBamQualityMetrics) | CollectMultipleMetrics | Picard | Performs QC on the aligned BAMs with unsorted readgroups. |
Processing.MarkDuplicates (MarkDuplicates) | MarkDuplicates | Picard | Marks duplicate reads. |
Processing.SortSam | SortSam | Picard | Sorts the aggregated BAM by coordinate sort order. |
QC.CrossCheckFingerprints (CrossCheckFingerprints) | CrosscheckFingerprints | Picard | Optionally checks fingerprints if haplotype database is provided. |
Utils.CreateSequenceGroupingTSV (CreateSequenceGroupingTSV) | --- | python | Creates the sequencing groupings used for BQSR and PrintReads Scatter. |
Processing.CheckContamination | VerifyBamID2 | --- | Checks cross-sample contamination prior to variant calling. |
Processing.BaseRecalibrator (BaseRecalibrator) | BaseRecalibrator | GATK | If perform_bqsr is true, performs base recalibration by interval. When using the DRAGEN-GATK mode, perform_bqsr is optionally false as base calling errors are corrected in the DRAGEN variant calling step. |
Processing.GatherBqsrReports (GatherBqsrReports) | GatherBQSRReports | GATK | Merges the BQSR reports resulting from by-interval calibration. |
Processing.ApplyBQSR (ApplyBQSR) | ApplyBQSR | GATK | Applies the BQSR base recalibration model by interval. |
Processing.GatherSortedBamFiles (GatherBamFiles) | GatherBamFiles | Picard | Merges the recalibrated BAM files. |
Aggregate the aligned recalibrated BAM and calculate quality control metrics
Workflow task name and link: AggregatedBamQC.AggregatedBamQC
The table below describes the subtasks of AggregatedBamQC.AggregatedBamQC task, which calculates quality control metrics on the aggregated recalibrated BAM file and checks for sample contamination.
Subtask name (alias) and link | Tool | Software | Description |
---|---|---|---|
QC.CollectReadgroupBamQualityMetrics (CollectReadgroupBamQualityMetrics) | CollectMultipleMetrics | Picard | Collects alignment summary and GC bias quality metrics on the recalibrated BAM. |
QC.CollectAggregationMetrics (CollectAggregationMetrics) | CollectMultipleMetrics | Picard | Collects quality metrics from the aggregated BAM. |
QC.CheckFingerprint (CheckFingerprint) | CheckFingerprint | Picard | Check that the fingerprint of the sample BAM matches the sample array. |
QC.CalculateReadGroupChecksum (CalculateReadGroupChecksum) | CalculateReadGroupChecksum | Picard | Generate a checksum per readgroup in the final BAM. |
Convert the aggregated recalibrated BAM to CRAM
Workflow task name and link: BamToCram.BamToCram
The table below describes the subtasks of BamToCram.BamToCram task which converts the recalibrated BAM to CRAM format and produces a validation report.
Subtask name (alias) and link | Tool | Software | Description |
---|---|---|---|
Utils.ConvertToCram (ConvertToCram) | view, index | samtools | Converts the merged, recalibrated BAM to CRAM. |
QC.CheckPreValidation (CheckPreValidation) | --- | python | Checks if the data has massively high duplication or chimerism rates. |
QC.ValidateSamFile (ValidateCram) | ValidateSamFile | Picard | Validates the output CRAM file. |
Collect WGS metrics using stringent thresholds
Workflow task name and link: QC.CollectWgsMetrics
The table below describes the QC.CollectWgsMetrics task which uses the Picard CollectWGSMetrics tool to calculate whole genome metrics using stringent thresholds.
Subtask name (alias) and link | Tool | Software | Description |
---|---|---|---|
CollectWgsMetrics | CollectWgsMetrics | Picard | Collects WGS metrics using stringent thresholds; tasks will break if the read lengths in the BAM are greater than 250, so the max read_length is set to 250 by default. |
Collect raw WGS metrics using less stringent thresholds
Workflow task name and link: QC.CollectRawWgsMetrics
The table below describes the QC.CollectRawWgsMetrics task which uses the Picard CollecRawtWGSMetrics tool to calculate whole genome metrics using common thresholds.
Subtask name (alias) and link | Tool | Software | Description |
---|---|---|---|
QC.CollectRawWgsMetrics (CollectRawWgsMetrics) | CollectRawWgsMetrics | Picard | Collects the raw WGS metrics with commonly used QC metrics. |
Call variants with HaplotypeCaller
Workflow task name and link: VariantCalling.VariantCalling (BamToGvcf)
The table below describes the subtasks of the VariantCalling.VariantCalling (BamToGvcf) workflow, which uses the GATK HaplotypeCaller for SNP and Indel discovery. When the workflow runs in DRAGEN mode, it produces a Dragstr model that is used during variant calling, and it performs hard filtering.
Subtask name (alias) and link | Tool | Software | Description |
---|---|---|---|
Dragen.CalibrateDragstrModel (DragstrAutoCalibration) | CalibrateDragstrModel | GATK | If run_dragen_mode_variant_calling is true, uses the reference FASTA file, the reference’s corresponding public STR (short tandem repeat) table file, and the recalibrated BAM to estimate the parameters for the DRAGEN STR model. The output parameter tables are used for the DRAGEN mode HaplotypeCaller. |
Utils.ScatterIntervalList (ScatterIntervalList) | IntervalListTools | Picard, python | Breaks the interval list into subintervals for downstream variant calling. |
Calling.HaplotypeCaller_GATK35_GVCF (HaplotypeCallerGATK3) | PrintReads, HaplotypeCaller | GATK4, GATK3.5 | If use_gatk3_haplotype_caller is true, will call GATK3 Haplotypecaller to call variants in GVCF mode, otherwise will use the HaplotypeCaller_GATK4_VCF task below. |
Calling.HaplotypeCaller_GATK4_VCF (HaplotypeCallerGATK4) | HaplotypeCaller | GATK4 | If use_gatk3_haplotype_caller is false, will call GATK4 Haplotypecaller to call variants in GVCF mode. If run_dragen_mode_variant_calling is true, uses the --dragstr-params-path containing the DragSTR model and runs it with HaplotypeCaller in --dragen-mode. |
Calling.DragenHardFilterVcf (DragenHardFilterVcf) | VariantFiltration | GATK | If dragen_mode_hard_filter is true, performs hard filtering that matches the filtering performed by the DRAGEN 3.4.12 pipeline. |
BamProcessing.SortSam (SortBamout) | SortSam | Picard | If the option to make a BAM out file is selected ( make_bamout is true), sorts and gathers the BAM files into one file. |
MergeBamouts | merge, index | samtools | If make_bamout is true, makes corrections to the merged BAM out file from Picard. |
Calling.MergeVCFs (MergeVCFs) | MergeVcfs | Picard | Combines by-interval (g)VCFs into a single sample (g)VCF file. |
QC.ValidateVCF (ValidateVCF) | ValidateVariants | GATK | Validates the (g)VCF from HaplotypeCaller with the -gvcf parameter. |
QC.CollectVariantCallingMetrics (CollectVariantCallingMetrics) | CollectVariantCallingMetrics | Picard | Performs quality control on the (g)VCF. |
Workflow outputs
The table below describes the final workflow outputs. If running the workflow on Cromwell, these outputs are found in the respective task's execution directory.
Output variable name | Description | Type |
---|---|---|
quality_yield_metrics | The quality metrics calculated for the unmapped BAM files. | Array of files |
unsorted_read_group_base_distribution_by_cycle_pdf | PDF of the base distribution for each unsorted, readgroup-specific BAM. | Array of files |
unsorted_read_group_base_distribution_by_cycle_metrics | Metrics of the base distribution by cycle for each unsorted, readgroup-specific BAM. | Array of files |
unsorted_read_group_insert_size_histogram_pdf | Histograms of insert size for the unsorted, readgroup-specific BAMs. | Array of files |
unsorted_read_group_insert_size_metrics | Insert size metrics for the unsorted, readgroup-specific BAMs. | Array of files |
unsorted_read_group_quality_by_cycle_pdf | Quality by cycle PDF for the unsorted, readgroup-specific BAMs. | Array of files |
unsorted_read_group_quality_by_cycle_metrics | Quality by cycle metrics for the unsorted, readgroup-specific BAMs. | Array of files |
unsorted_read_group_quality_distribution_pdf | Quality distribution PDF for the unsorted, readgroup-specific BAMs. | Array of files |
unsorted_read_group_quality_distribution_metrics | Quality distribution metrics for the unsorted, readgroup-specific BAMs. | Array of files |
read_group_alignment_summary_metrics | Alignment summary metrics for the aggregated BAM. | File |
read_group_gc_bias_detail_metrics | GC bias detail metrics for the aggregated BAM. | File |
read_group_gc_bias_pdf | PDF of the GC bias by readgroup for the aggregated BAM. | File |
read_group_gc_bias_summary_metrics | GC bias summary metrics by readgroup for the aggregated BAM. | File |
cross_check_fingerprints_metrics | Fingerprint metrics file if optional fingerprinting is performed. | File |
selfSM | Contamination estimate from VerifyBamID2. | File |
contamination | Estimated contamination from the CheckContamination task. | Float |
calculate_read_group_checksum_md5 | MD5 checksum for aggregated BAM. | File |
agg_alignment_summary_metrics | Alignment summary metrics for the aggregated BAM. | File |
agg_bait_bias_detail_metrics | Bait bias detail metrics for the aggregated BAM. | File |
agg_bait_bias_summary_metrics | Bait bias summary metrics for the aggregated BAM. | File |
agg_gc_bias_detail_metric | GC bias detail metrics for the aggregated BAM. | File |
agg_gc_bias_pdf | PDF of GC bias for the aggregated BAM. | File |
agg_gc_bias_summary_metrics | GC bias summary metrics for the aggregated BAM. | File |
agg_insert_size_histogram_pdf | Histogram of insert size for aggregated BAM. | File |
agg_insert_size_metrics | Insert size metrics for the aggregated BAM. | File |
agg_pre_adapter_detail_metrics | Details metrics for artifacts that occur prior to the addition of adaptors for the aggregated BAM. | File |
agg_pre_adapter_summary_metrics | Summary metrics for artifacts that occur prior to the addition of adaptors for the aggregated BAM. | File |
agg_quality_distribution_pdf | PDF of the quality distribution for the aggregated BAM. | File |
agg_quality_distribution_metrics | Quality distribution metrics for the aggregated BAM. | File |
agg_error_summary_metrics | Error summary metrics for the aggregated BAM. | File |
fingerprint_summary_metrics | Optional fingerprint summary metrics for the aggregated BAM. | File |
fingerprint_detail_metrics | Optional fingerprint detail metrics for the aggregated BAM. | File |
wgs_metrics | Metrics from the CollectWgsMetrics tool. | File |
raw_wgs_metric | Metrics from the CollectRawWgsMetrics tool. | File |
duplicate_metrics | Duplicate read metrics from the MarkDuplicates tool. | File |
output_bqsr_reports | BQSR reports if BQSR tool is run. | File |
gvcf_summary_metrics | (g)VCF summary metrics | File |
gvcf_detail_metrics | (g)VCF detail metrics. | File |
output_bam | Output aligned recalibrated BAM if the provided_output_bam is true. | File |
output_bam_index | Optional index for the aligned recalibrated BAM if the provided_output_bam is true. | File |
output_cram | Aligned, recalibrated output CRAM. | File |
output_cram_index | Index for the aligned recalibrated CRAM. | File |
output_cram_md5 | MD5 checksum for the aligned recalibrated BAM. | File |
validate_cram_file_report | Validated report for the CRAM created with the ValidateSam tool. | File |
output_vcf | Final reblocked GVCF with variant calls produced by HaplotypeCaller (read more in the Reblocking section below). | File |
output_vcf_index | Index for the final GVCF. | File |
Reblocking
Reblocking is a process that compresses a HaplotypeCaller GVCF by merging homRef blocks according to new genotype quality (GQ) bands and facilitates joint genotyping by removing alt alleles that do not appear in the called genotype.
As of November 2021, reblocking is a default task in the WGS pipeline. To skip reblocking, add the following to the workflow's input configuration file (JSON):
"WholeGenomeGermlineSingleSample.BamToGvcf.skip_reblocking": true
The Reblocking task uses the GATK ReblockGVCF tool with the arguments:
-do-qual-approx -floor-blocks -GQB 20 -GQB 30 -GQB 40
The following summarizes how reblocking affects the WGS GVCF and downstream tools compared to the GVCF produced with the default HaplotypeCaller GQ bands:
PLs are omitted for homozygous reference sites to save space– GQs are output for genotypes, PLs can be approximated as [0, GQ, 2*GQ].
GQ resolution for homozygous reference genotypes is reduced (i.e. homRef GQs will be underconfident) which may affect analyses like de novo calling where well-calibrated reference genotype qualities are important.
Alleles that aren’t called in the sample genotype are dropped. Each variant should have no more than two non-symbolic alt alleles, with the majority having just one plus <NON_REF>.
New annotations enable merging data for filtering without using genotypes. For example:
- RAW_GT_COUNT(S) for doing ExcessHet calculation from a sites-only file.
- QUALapprox and/or AS_QUALapprox for doing QUAL approximation/filling.
- QUAL VCF field from a combined sites-only field.
- VarDP and/or AS_VarDP used to calculate QualByDepth/QD annotation for VQSR.
The MIN_DP has been removed.
Reblocked GVCFs have the following cost/scale improvements:
- A reduced storage footprint compared with HaplotypeCaller GVCF output.
- Fewer VariantContexts (i.e. lines) per VCF which speeds up GenomicsDB/Hail import.
- Fewer alternate alleles which reduce memory requirements for merging.
Additionally, the 4 GQ band schema has specific improvements compared with the 7-band schema:
- It does not drop GQ0s; reblocked GVCFs should cover all the positions that the input GVCF covers.
- It has no overlaps; the only overlapping positions should be two variants (i.e. deletions) on separate haplotypes.
- No more no-calls; all genotypes should be called. Positions with no data will be homRef with GQ0.
Read more about the reblocked GVCFs in the WARP Blog.
Base quality scores
The final CRAM files have base quality scores binned according to the Functional Equivalence specification (Regier et al., 2018). This does not apply to the workflow's DRAGEN modes, which do not perform BQSR recalibration.
Original Score | Score after BQSR recalibration |
---|---|
1-6 | unchanged |
7-12 | 10 |
13-22 | 20 |
22-infinity | 30 |
Important notes
- Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
- When the pipeline runs in the dragen_functional_equivalence_mode, it produces functionally equivalent outputs to the DRAGEN pipeline.
- Additional information about the GATK tool parameters and the DRAGEN-GATK best practices pipeline can be found on the GATK support site.
Citing the WGS Pipeline
If you use the WGS Pipeline in your research, please cite our preprint:
Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1
Contact us
Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.
Licensing
Copyright Broad Institute, 2020 | BSD-3
The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.