Genomic Data Commons (GDC) Whole Genome Somatic Single Sample Overview


9/12/2014

We are deprecating the Genomic Data Commons Whole Genome Somatic Single Sample Pipeline. Although the code will continue to be available, we are no longer supporting it.

Pipeline Version	Date Updated	Documentation Author	Questions or Feedback
GDCWholeGenomeSomaticSingleSample_v1.3.1	January, 2024	Elizabeth Kiernan	Please file an issue in WARP.

Introduction to the GDC Whole Genome Somatic Single Sample pipeline

The GDC Whole Genome Somatic Single Sample (abbreviated GDC here) pipeline is the alignment and preprocessing workflow for genomic data designed for the National Cancer Institute's Genomic Data Commons.

A high-level overview of the pipeline, along with tool parameters, is available on the GDC Documentation site.

Overall, the pipeline converts reads (CRAM or BAM) to FASTQ and (re)aligns them to the latest human reference genome (see the GDC Reference Genome section below). Each read group is aligned separately. Read group alignments that belong to a single sample are then merged and duplicate reads are flagged for downstream variant calling.

Set-up

Workflow installation and requirements

The workflow is written in the Workflow Description Language (WDL) and can be downloaded by cloning the warp repository on GitHub. The workflow can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms.

For the latest workflow version and release notes, please see the changelog.

Software version requirements

GATK 4.5.0.0
Picard 2.26.10
Samtools 1.11
Python 3.0
Cromwell version support
- Successfully tested on v52
- Does not work on versions < v23 due to output syntax
Papi version support
- Successfully tested on Papi v2

Input descriptions

The table below describes each of the GDC pipeline inputs. The workflow requires a single aligned CRAM or BAM file, or a single unmapped BAM (uBAM), as input and is set up to run on samples with reads longer than 75 bp.

For examples of how to specify each input as well as cloud locations for different example input files, see the input configuration file (JSON).
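As a rough illustration of the single-input requirement described above, the sketch below builds a Cromwell inputs JSON and checks that exactly one of `input_cram`, `input_bam`, or `ubam` is supplied. The workflow name prefix is assumed from the pipeline version string, the `gs://example-bucket/...` paths are hypothetical placeholders, and `build_inputs` is an illustrative helper that is not part of WARP. Only a few inputs are shown; the example input configuration file remains the authoritative reference.

```python
import json

# Assumed workflow name, taken from the pipeline version string.
WORKFLOW = "GDCWholeGenomeSomaticSingleSample"
READ_SOURCES = ("input_cram", "input_bam", "ubam")


def build_inputs(values):
    """Prefix each input with the workflow name after a basic sanity check."""
    supplied = [name for name in READ_SOURCES if values.get(name)]
    if len(supplied) != 1:
        raise ValueError(
            f"expected exactly one of {READ_SOURCES}, got {supplied or 'none'}"
        )
    # Cromwell expects keys of the form <workflow name>.<input name>.
    return {f"{WORKFLOW}.{name}": value for name, value in values.items()}


# Hypothetical paths; replace with real file locations.
example = build_inputs({
    "input_bam": "gs://example-bucket/sample.bam",
    "ref_fasta": "gs://example-bucket/ref/GRCh38.d1.vd1.fa",
    "ref_fai": "gs://example-bucket/ref/GRCh38.d1.vd1.fa.fai",
})
print(json.dumps(example, indent=2))
```

Failing early on an ambiguous combination of read sources is a design choice of this sketch, not a documented behavior of the workflow.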

Input name	Description	Type
input_cram (optional)	A single mapped CRAM file; alternatively, input can be a mapped BAM (`input_bam`) or unmapped BAM (uBAM; `ubam`). If using a CRAM file aligned to a reference different than the reference chosen for the GDC workflow (the `ref_fasta` input), then you must specify both the `cram_ref_fasta` and the `cram_ref_fasta_index` inputs.	File
input_bam (optional)	A single mapped BAM file; alternatively input can be a CRAM (`input_cram`) or uBAM (`ubam`).	File
cram_ref_fasta (optional)	The reference file that was used to align the CRAM input (if used); if unspecified, the workflow will use the `ref_fasta` input file by default.	File
cram_ref_fasta_index (optional)	Index file for the `cram_ref_fasta` input; used only if a CRAM is the workflow input.	File
output_map (optional)	Tab-separated file containing two columns: the read group IDs found in the `input_cram` (or `input_bam`) and the desired name of the uBAM generated for each read group.	File
unmapped_bam_suffix (optional)	Optional string used to name the output uBAM file.	String
ubam (optional)	A single uBAM file; alternatively, input can be a mapped BAM (`input_bam`) or CRAM (`input_cram`).	File
contamination_vcf	VCF file of common variant sites that is used for the check_contamination task.	File
contamination_vcf_index	Index file for the `contamination_vcf` input.	File
dbsnp_vcf	VCF file of known variation sites that can be used to exclude these sites from the analysis; used for the gatk_baserecalibrator task.	File
dbsnp_vcf_index	Index file for the `dbsnp_vcf` input.	File
ref_fasta	Reference FASTA used to convert an input CRAM to BAM; if CRAM is not used, this file does not need to be specified.	File
ref_fai	Reference FASTA index for the `ref_fasta` input.	File
ref_dict	BWA reference dictionary used for alignment.	File
ref_amb	BWA reference file used for alignment.	File
ref_ann	BWA reference file used for alignment.	File
ref_bwt	BWA reference file used for alignment.	File
ref_pac	BWA reference file used for alignment.	File
ref_sa	BWA reference file used for alignment.	File
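The `output_map` input is a two-column, tab-separated file, which can be sketched with Python's `csv` module as shown below. The read group IDs and uBAM names are invented for illustration. The header row (`READ_GROUP_ID`, `OUTPUT`) is an assumption based on the convention of Picard RevertSam's `OUTPUT_MAP` option; check it against the workflow before relying on it.

```python
import csv
import io

# Hypothetical read group IDs mapped to the uBAM names we want.
read_group_to_ubam = {
    "RG1": "sample_RG1.unmapped.bam",
    "RG2": "sample_RG2.unmapped.bam",
}


def write_output_map(mapping, handle):
    """Write one tab-separated row per read group."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header names assumed from Picard RevertSam's OUTPUT_MAP convention.
    writer.writerow(["READ_GROUP_ID", "OUTPUT"])
    for read_group_id, ubam_name in mapping.items():
        writer.writerow([read_group_id, ubam_name])


# Written to an in-memory buffer here; pass an open file to save to disk.
buffer = io.StringIO()
write_output_map(read_group_to_ubam, buffer)
output_map_text = buffer.getvalue()
print(output_map_text, end="")
```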

GDC reference genome

The GDC uses the human reference genome GRCh38.d1.vd1 for all data processing. Unlike the GRCh38 reference used by WARP pipelines for production, the GDC reference includes decoy viral sequences for ten types of human viral genomes. You can learn more about the reference from the GDC documentation.

The reference files required for the GDC workflow are hosted in a public Google Bucket.
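The reference inputs (`ref_fai`, `ref_dict`, and the five BWA index files) usually sit next to the FASTA. The sketch below derives their names using the common conventions of `samtools faidx` (append `.fai`), BWA indexing (append `.amb`, `.ann`, `.bwt`, `.pac`, `.sa`), and Picard sequence dictionaries (replace the FASTA extension with `.dict`). The FASTA path is a hypothetical example, and the files in the hosted bucket may be named differently, so treat the result as a starting point to verify.

```python
# Suffixes that BWA appends to the FASTA file name when indexing.
BWA_INDEX_SUFFIXES = {
    "ref_amb": ".amb",
    "ref_ann": ".ann",
    "ref_bwt": ".bwt",
    "ref_pac": ".pac",
    "ref_sa": ".sa",
}


def companion_files(ref_fasta):
    """Guess companion reference file names from the FASTA path."""
    # Plain string handling keeps URL-style prefixes such as gs:// intact.
    stem = ref_fasta.rsplit(".", 1)[0]
    files = {
        "ref_fasta": ref_fasta,
        "ref_fai": ref_fasta + ".fai",
        "ref_dict": stem + ".dict",
    }
    for input_name, suffix in BWA_INDEX_SUFFIXES.items():
        files[input_name] = ref_fasta + suffix
    return files


# Hypothetical local path used only for illustration.
files = companion_files("references/GRCh38.d1.vd1.fa")
for input_name, path in files.items():
    print(f"{input_name}\t{path}")
```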

Workflow tasks and tools

The workflow imports a series of tasks either from the Workflow script or the Broad tasks library.

Task name in WDL	Tool	Software	Description
CramToUnmappedBams	view, index, RevertSam, ValidateSamFile, SortSam	Samtools and Picard	If a CRAM file is used as input, the task converts it to uBAM and generates an output map that is then used to split the uBAM by read group. The resulting BAM is sorted by query name using Picard.
bam_readgroup_to_contents	view	Samtools	Extracts all the readgroups from the BAM header and returns a WDL array where each row is a readgroup.
biobambam_bamtofastq	bamtofastq	biobambam2	Converts the uBAMs to FASTQ.
emit_pe_records/emit_se_records	---	---	Associates the FASTQ file(s) generated by biobambam_bamtofastq with their respective read group; creates an array of structs to be used as input for alignment.
bwa_pe/ bwa_se	BWA mem, view	BWA and Samtools	Aligns FASTQ reads to the reference genome, generating an aligned BAM file.
picard_markduplicates	MarkDuplicates	Picard	Runs MarkDuplicates in silent mode to locate and tag duplicate reads. Outputs a tagged BAM and metrics file.
sort_and_index_markdup_bam	sort	Samtools	Sorts and indexes the BAM file; outputs the sorted BAM and index file.
check_contamination	GetPileupSummaries and CalculateContamination	GATK	Summarizes counts of reads that support reference, alternate, and other alleles for given sites and then calculates the fraction of reads coming from cross-sample contamination.
gatk_baserecalibrator	BaseRecalibrator	GATK	Generates recalibration table for Base Quality Score Recalibration (BQSR).
gatk_applybqsr	ApplyBQSR	GATK	Recalibrates the base qualities of the input reads based on the recalibration table produced by the BaseRecalibrator tool and outputs a recalibrated BAM file; uses the emit_original_quals parameter to write the original base qualities under the BAM OQ tag.
collect_insert_size_metrics	CollectInsertSizeMetrics	Picard	Generates metrics about insert size distribution in the form of both a histogram and table (txt file).
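To make the `bam_readgroup_to_contents` step more concrete, the sketch below parses `@RG` lines from SAM header text (such as the output of `samtools view -H`) into one dictionary per read group. It is a Python illustration of the idea rather than the task's actual implementation, and the header text is invented.

```python
def parse_read_groups(header_text):
    """Return one dict of TAG -> value for every @RG header line."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG\t"):
            continue
        fields = line.split("\t")[1:]
        # Each field looks like TAG:VALUE; split on the first colon only,
        # because values (for example dates) may contain colons themselves.
        read_groups.append(dict(field.split(":", 1) for field in fields))
    return read_groups


# Invented header for illustration.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:RG1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\n"
    "@RG\tID:RG2\tSM:sample1\tPL:ILLUMINA\tLB:lib1\n"
)
read_groups = parse_read_groups(header)
print([rg["ID"] for rg in read_groups])
```

Each resulting dictionary corresponds to one read group, which matches the unit at which the pipeline aligns reads.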

Workflow outputs

The following table describes the workflow outputs. If running the workflow using Cromwell, these outputs will automatically be placed in the respective task execution directory.

Alternatively, Cromwell allows you to specify an output directory using an options.json, as described in Cromwell's Workflow Options Overview (see the section on Output Copying).
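A minimal sketch of generating such an options file is shown below. It uses Cromwell's `final_workflow_outputs_dir` workflow option; the destination bucket is a hypothetical placeholder, and the file is written to a temporary directory only to keep the example self-contained.

```python
import json
import os
import tempfile

workflow_options = {
    # Hypothetical destination; Cromwell copies final workflow outputs here.
    "final_workflow_outputs_dir": "gs://example-bucket/gdc-outputs",
}

# Write the options file; in practice, choose a path you control.
options_path = os.path.join(tempfile.mkdtemp(), "options.json")
with open(options_path, "w") as handle:
    json.dump(workflow_options, handle, indent=2)

print(f"wrote {options_path}")
```

The resulting file can then be supplied through the `--options` argument of `cromwell run`.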

Output name	Description	Type
validation_report (optional)	Samtools validation report(s); only returned if an aligned CRAM or BAM file is used as workflow input	TXT
unmapped_bams (optional)	uBAM file(s); only returned if an aligned CRAM or BAM file is used as workflow input.	BAM
bam	Base recalibrated BAM file.	BAM
bai	Index file for the BAM.	BAI
md_metrics	Picard MarkDuplicates metrics.	TXT
insert_size_metrics	Picard insert size metrics.	TXT
insert_size_histogram_pdf	Histogram representation of insert size metrics.	PDF
contamination	File containing a value indicating the fraction contamination.	TXT
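The `md_metrics` and `insert_size_metrics` outputs are Picard metrics files, which place a tab-separated table after a line beginning with `## METRICS CLASS`. The sketch below reads that table into dictionaries. The sample text is invented and trimmed to a few `DuplicationMetrics` columns, and `parse_picard_metrics` is an illustrative helper rather than part of the workflow.

```python
def parse_picard_metrics(text):
    """Return the metrics table of a Picard metrics file as a list of dicts."""
    lines = text.splitlines()
    rows = []
    for index, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            # The line after the class marker holds the column names.
            header = lines[index + 1].split("\t")
            # Data rows follow until the first blank line.
            for data_line in lines[index + 2:]:
                if not data_line.strip():
                    break
                rows.append(dict(zip(header, data_line.split("\t"))))
            break
    return rows


# Invented, shortened example of a MarkDuplicates metrics file.
sample_metrics = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# MarkDuplicates (arguments omitted)\n"
    "\n"
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION\n"
    "lib1\t1000\t50\t0.05\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Double\n"
)
rows = parse_picard_metrics(sample_metrics)
print(rows[0]["LIBRARY"], rows[0]["PERCENT_DUPLICATION"])
```

All values come back as strings, so convert fields such as `PERCENT_DUPLICATION` with `float()` before comparing them to a threshold.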

Important notes

Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
Please visit the GATK Technical Documentation site for further documentation on GATK-related workflows and tools.

Citing the GDC Pipeline

If you use the GDC Pipeline in your research, please consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Contact us

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.

Licensing

The workflow script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/master/LICENSE). However, please note that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.
