Exome Germline Single Sample Overview
|Pipeline Version||Date Updated||Documentation Author||Questions or Feedback|
|ExomeGermlineSingleSample_v3.1.15||December, 2023||Elizabeth Kiernan||Please file GitHub issues in WARP or contact the WARP team|
The Exome Germline Single Sample pipeline implements data pre-processing and initial variant calling according to the GATK Best Practices for germline SNP and Indel discovery in human exome sequencing data.
You can test the pipeline in Terra! Go the Exome-Analysis-Pipeline workspace which includes sample data and workflows for preprocessing and initial variant calling, sample map generation, and joint genotyping.
Workflow Installation and Requirements
The Exome Germline Single Sample workflow is written in the Workflow Description Language WDL and can be downloaded by cloning the warp repository in GitHub. The workflow can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. For the latest workflow version and release notes, please see the Exome Germline Single Sample changelog.
Software Version Requirements
- GATK 18.104.22.168
- Picard 2.26.10
- Samtools 1.11
- Python 3.0
- Cromwell version support
- Successfully tested on v52
- Does not work on versions < v23 due to output syntax
- Papi version support
- Successfully tested on Papi v2
Input Requirements and Expectations
- Human exome sequencing data in unmapped BAM (uBAM) format
- One or more read groups, one per uBAM file, all belonging to a single sample (SM)
- Input uBAM files must additionally comply with the following requirements:
- Filenames all have the same suffix (we use ".unmapped.bam")
- Files must pass validation by ValidateSamFile
- Reads are provided in query-sorted order
- All reads must have an RG tag
- Reblocked GVCF output names must end in ".rb.g.vcf.gz"
- Reference genome must be Hg38 with ALT contigs
- Unique exome calling, target, and bait .interval_list obtained from the sequencing provider. Generally the calling, target, and bait files will not be the same.
Workflow Tasks and Tools
Check out the workflow Methods to get started!
- CRAM, CRAM index, and CRAM MD5
- Reblocked GVCF and its GVCF index (read more in the Reblocking section below)
- BQSR report
- Summary metrics; to read more about any particular metric, you can search the metric using the GATK documentation search
Reblocking is a process that compresses a HaplotypeCaller GVCF by merging homRef blocks according to new genotype quality (GQ) bands and facilitates joint genotyping by removing alt alleles that do not appear in the called genotype.
As of November 2021, reblocking is a default task in the Exome pipeline. To skip reblocking, add the following to the workflow's input configuration file (JSON):
The Reblocking task uses the GATK ReblockGVCF tool with the arguments:
-do-qual-approx -floor-blocks -GQB 20 -GQB 30 -GQB 40
The following summarizes how reblocking affects the Exome GVCF and downstream tools compared to the GVCF produced with the default HaplotypeCaller GQ bands:
PLs are omitted for homozygous reference sites to save space– GQs are output for genotypes, PLs can be approximated as [0, GQ, 2*GQ].
GQ resolution for homozygous reference genotypes is reduced (i.e. homRef GQs will be underconfident) which may affect analyses like de novo calling where well-calibrated reference genotype qualities are important.
Alleles that aren’t called in the sample genotype are dropped. Each variant should have no more than two non-symbolic alt alleles, with the majority having just one plus <NON_REF>.
New annotations enable merging data for filtering without using genotypes. For example:
- RAW_GT_COUNT(S) for doing ExcessHet calculation from a sites-only file.
- QUALapprox and/or AS_QUALapprox for doing QUAL approximation/filling.
- QUAL VCF field from a combined sites-only field.
- VarDP and/or AS_VarDP used to calculate QualByDepth/QD annotation for VQSR.
The MIN_DP has been removed.
Reblocked GVCFs have the following cost/scale improvements:
- A reduced storage footprint compared with HaplotypeCaller GVCF output.
- Fewer VariantContexts (i.e. lines) per VCF which speeds up GenomicsDB/Hail import.
- Fewer alternate alleles which reduce memory requirements for merging.
Additionally, the 4 GQ band schema has specific improvements compared with the 7-band schema:
- It does not drop GQ0s; reblocked GVCFs should cover all the positions that the input GVCF covers.
- It has no overlaps; the only overlapping positions should be two variants (i.e. deletions) on separate haplotypes.
- No more no-calls; all genotypes should be called. Positions with no data will be homRef with GQ0.
Read more about the reblocked GVCFs in the WARP Blog.
Base quality scores
|Original Score||Score after BQSR recalibration|
- Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
- For help running workflows on the Google Cloud Platform or locally please view the following tutorial (How to) Execute Workflows from the gatk-workflows Git Organization.
- Please visit the GATK Technical Documentation site for further documentation on our workflows and tools.
- You can access relevant reference and resource bundles in the GATK Resource Bundle.
Copyright Broad Institute, 2023 | BSD-3
The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.