November 9, 2021
We are announcing the latest release of the Whole Genome and Exome production pipelines, which now produce reblocked GVCFs, a compressed form of GVCF files, as default outputs.
This is a major release, which in this case means:
- The data output from the pipeline is scientifically different than in previous versions.
- Data output from this pipeline should not be co-analyzed with data from previous versions (for example, in joint calling), without first reblocking the GVCFs.
- Users of the pipeline will see a change in both input and output interfaces.
Reblocking – what is it and why use it?
Reblocking is a process of compressing GVCFs that reduces file storage footprint and facilitates joint genotyping by removing alt alleles that do not appear in the called genotype. Normal GVCFs contain blocks of adjacent homozygous reference (homRef) genotype calls that are stored as a single entry when they have the exact same value up to a Phred-scaled confidence of 60. The vast majority of downstream analyses discard homRef genotypes with a confidence of less than 20, so we can compress that data without losing any fidelity in downstream analysis.
Additionally, because most whole-genome sequencing aims for 30X coverage, most confidences are around 90 or higher. With that, we can represent large intervals of high coverage with a single entry, indicating all high-quality homRef genotypes.
The reblocked GVCFs are sufficient for multiple downstream applications. While some applications, like population variation surveys, can establish variant frequency using binary (i.e. good/bad) homRef confidences, other applications, like de novo analysis, require more nuance. For the latter kind of applications where the reference status of the parents is critical, the reblocked GVCFs provide multiple intermediate categories for homRef confidences.
Overall, reblocked GVCFs reduce the storage and I/O required to operate on variant data. This leads to faster, more efficient, and less expensive call sets. It is standard practice at the Broad to reblock GVCFs before joint calling. Now, we are making that step a standard part of the GVCF generation in the single sample pipeline.
How is reblocking implemented in the pipelines?
We’ve added reblocking as a default subtask, called “Reblock”, in the WARP VariantCalling WDL script which is imported by both the Exome and Whole Genome workflows. The Reblock task performs reblocking by using the GATK tool ReblockGVCF.
While the reblocking option is turned on by default for any pipeline using the VariantCalling task, you can optionally turn it off by setting the
skip_reblocking input in the workflows input configuration file (JSON). For example, if you’re working with the Exome pipeline, just add the parameter:
We recognize that some researchers might want to perform reblocking independent of the Exome or Whole Genome pipelines, so we also created a utility workflow to reblock GVCFs. This pipeline takes in GVCFs produced by the single sample pipelines and outputs equivalently reblocked GVCFs. It is available in both the WARP repository and in a Terra workspace, and costs less than $0.02 per genome to run.
If you’re curious as to whether your WARP pipeline output is a reblocked GVCF, just look at the filename suffix; a reblocked GVCF will have the