Skip to main content

RNA with UMIs Overview

Pipeline VersionDate UpdatedDocumentation AuthorsQuestions or Feedback
RNAWithUMIsPipeline_v1.0.16February, 2024Elizabeth Kiernan & Kaylee MathewsPlease file an issue in WARP.

RNAWithUMIs_diagram

Introduction to the RNA with UMIs workflow

The RNA with UMIs pipeline is an open-source, cloud-optimized workflow for processing total RNA isolated with the Transcriptome Capture (TCap) method. TCap is a technique that hybridizes exome baits to cDNA library preparations to better facilitate RNA-sequencing in low-input or poor quality (degraded) samples. These libraries may additionally be prepared with Unique Molecular Identifiers (UMIs), which can help distinguish biological signal from noise resulting from PCR amplification.

Overall, the workflow performs UMI correction, aligns reads to the genome, quantifies gene counts, and calculates quality metrics. The workflow produces genome- and transcriptome-aligned BAMs with indices and quality metrics files.

While this workflow was created to be used with TCap RNA-seq data, it can be used to process any bulk RNA-seq data.

Want to use the RNA with UMIs pipeline for your publication?

Check out the RNA with UMIs Methods section to get started!

Quickstart table

The following table provides a quick glance at the RNA with UMIs pipeline features:

Pipeline featuresDescriptionSource
Assay typeTCap (or any bulk) RNA-seq dataCieslik et al. 2015
Overall workflowRead alignment and transcriptome quantificationCode available from GitHub
Workflow languageWDL 1.0openWDL
Genomic reference sequenceGRCh38 (hg38) and GRCh37 (hg19) human genome primary sequenceGenome Reference Consortium GRCh38 and GRCh37
Gene annotationsGENCODE v34 (hg38) and v19 (hg19) gene annotationsGENCODE v34 and v19
AlignerSTARSTAR
Transcript quantification and metric calculationRNA-SeQC, Picard, and GATKRNA-SeQC, Picard, and GATK
Data input file formatFile format in which sequencing data is providedBAM or FASTQ
Data output file formatsFile formats in which RNA with UMIs outputs are providedBAM; GCT (counts); TXT, TSV, and PDF (metrics)

Set-up

Installation

To download the latest release of the RNA with UMIs pipeline, see the release tags prefixed with "RNAwithUMIs" on the WARP releases page. All releases of the RNA with UMIs pipeline are documented in the RNA with UMIs changelog.

To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.

If you’re running an RNA with UMIs workflow version prior to the latest release, the accompanying documentation for that release may be downloaded with the source code on the WARP releases page (see the source code folder website/docs/Pipelines/RNA_with_UMIs_Pipeline).

The RNA with UMIs pipeline can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform.

Inputs

The RNA with UMIs workflow inputs are specified in JSON configuration files. Example configuration files can be found in the test_inputs folder in the WARP repository.

Input descriptions

The workflow takes in either a set of paired-end FASTQ files or a read group unmapped BAM that has gone through base calling.

Input variable nameDescriptionType
bamRead group-specific unmapped BAM file; alternatively, paired-end FASTQ files (r1_fastq and r2_fastq) may be used.File
r1_fastqRead 1 FASTQ file; alternatively, the unmapped bam file (bam) may be used as input.File
r2_fastqRead 2 FASTQ file; alternatively, the unmapped bam file (bam) may be used as input.File
read1StructureString describing how the bases in a sequencing run should be allocated into logical reads for read 1 by fgbio's ExtractUmisFromBam tool; for more information about read structures, see the fgbio documentation.String
read2StructureString describing how the bases in a sequencing run should be allocated into logical reads for read 2 by fgbio's ExtractUmisFromBam tool; for more information about read structures, see the fgbio documentation.String
output_basenameString used as a prefix in workflow output files.String
platformString used to describe the sequencing platform.String
library_nameString used to describe the library.String
platform_unitString used to describe the platform unit.String
read_group_nameString used to describe the read group name.String
sequencing_centerString used to describe the sequencing center; default is set to "BI".String
starIndexTAR file containing genome indices used for the STAR aligner.File
gtfGene annotation file (GTF) used for the RNA-SeQC tool.File
refFASTA file used for metric collection with Picard tools.File
refIndexFASTA index file used for metric collection with Picard tools.File
refDictDictionary file used for metric collection with Picard tools.File
refFlatrefFlat file used for metric collection with Picard tools.File
ribosomalIntervalsIntervals file used for RNA metric collection with Picard tools.File
exonBedFileBed file used for fragment size calculations with the RNA-SeQC tool; contains non-overlapping exons.File
population_vcfVCF file used for contamination estimation by GATK's GetPileupSummaries and CalculateContamination tools; contains common SNP sites from population-wide studies.File
population_vcf_indexPopulation VCF index file used for contamination estimation by GATK's GetPileupSummaries and CalculateContamination tools.File

References

The pipeline supports both hg19 and hg38 references. The reference set consists of:

  1. .fasta, .fai, and .dict files
  2. STAR index
  3. GTF file (RNASeQC)
  4. ribosomal interval list and refFlat file (Picard)

FASTA, index, and dictionary files

When running the workflow with the hg38 reference, we recommend using a version without HLA, ALT, and decoy contigs. These non-primary assembly contigs lead to reduced sensitivity unless the mapper is ALT-aware (e.g., bwa-mem). STAR is not ALT-aware, so these contigs should be removed.

In contrast, the hg19 reference does not have nearly as many contigs as the hg38 reference, so the workflow can be run using the standard hg19 reference stored in Broad's public reference bucket.

GTF file

Genome annotation files (GTFs) contain information about genes, such as the start and end coordinates of each exon, the name of the gene, and the type of the transcript (e.g., protein-coding, antisense). The workflow uses the GENCODE v34 GTF for hg38 and v19 for hg19.

Ribosomal interval list and refFlat file

The workflow ribosomal interval list and the refFlat file are used by Picard metrics calculation tools. The workflow uses a custom ribosomal interval list based on the public hg38 ribosomal interval list, which has been modified to include mitochondrial rRNA coding genes.

Additional reference resources

For more information about ALT contigs, HLA, decoys, and ALT-aware mapping, see the following resources:

RNA with UMIs tasks and tools

The RNA with UMIs workflow imports two additional WDL scripts. The UMIAwareDuplicateMarking.wdl script is a nested workflow used to mark duplicate sequencing reads, while the RNAWithUMIsTasks.wdl script contains individual "tasks" called by the workflow.

Overall, the RNA with UMIs workflow:

  1. Converts FASTQs to unmapped BAMs.
  2. Extracts UMIs.
  3. Converts unmapped BAMs to FASTQs and filters reads.
  4. Trims adapters and poly(A) tails.
  5. Converts FASTQs to unmapped BAMs.
  6. Aligns reads.
  7. Marks duplicate reads and sorts BAMs.
  8. Quantifies gene counts.
  9. Calculates RNA and genomic metrics.

The tools each task employs in the RNA with UMIs workflow are detailed in the table below.

To see specific tool parameters, select the task WDL link in the table; then find the task and view the command {} section of the task in the WDL script. To view or use the exact tool software, see the task's Docker image which is specified in the task WDL input {} section as String docker =.

Task name and WDL linkToolSoftwareDescription
tasks.FastqToUbamFastqToSamPicardConverts the paired-end FASTQ files to unmapped BAM.
tasks.ExtractUMIsExtractUmisFromBamfgbioExtracts UMIs from the unmapped BAM and stores them in the RX tag of output BAM.
tasks.SamToFastqFastqToSamPicardConverts the BAM file to paired-end FASTQ files for adapter clipping and removes reads that fail platform/vendor quality checks performed by the sequencing platform.
tasks.FastpfastpfastpTrims adapters and polyA tails from reads.
tasks.FastqToUbam (alias = FastqToUbamAfterClipping)FastqToSamPicardConverts the trimmed paired-end FASTQ files to unmapped BAM.
tasks.FastQCFastQCFastQCCollects overall quality control metrics before alignment and generates an HTML-formatted report.
tasks.STARSTARSTARAligns reads to the genome (using the StarIndex file) and outputs aligned reads to BAM. The task additionally converts the resulting BAM file to transcriptome coordinates, producing a transcriptome-aligned BAM. Parameters are listed below.
tasks.CopyReadGroupsToHeaderview, reheaderSamtoolsCopies the read group information from the genome-aligned BAM to the transcriptome-aligned BAM.
UmiMD.UMIAwareDuplicateMarkingSortSam, MarkDuplicates, groupPicard, GATK, UMI-toolsMarks duplicates on the genome-aligned BAM and tags reads with error-corrected UMIs.
UmiMD.UMIAwareDuplicateMarking (alias = UMIAwareDuplicateMarkingTranscriptome)SortSam, MarkDuplicates, groupPicard, GATK, UMI-toolsMarks duplicates on the transcriptome-aligned BAM and tags reads with error-corrected UMIs.
tasks.PostprocessTranscriptomeForRSEMPostProcessReadsForRSEMGATKSorts reads for RSEM compatibility.
tasks.GetSampleNameGetSampleNameGATKWrites the sample name from the unmapped BAM header into a separate text file.
tasks.rnaseqc2rnaseqcRNA-SeQCUses the genome-aligned, duplicate-marked BAM file to calculate TPMs, gene counts, exon counts, fragment sizes, and additional metrics, each of which is outputted to an individual file.
tasks.CollectRNASeqMetricsCollectRNASeqMetricsPicardCalculates RNA metrics; strand specificity is set to SECOND_READ_TRANSCRIPTION_STRAND.
tasks.CollectMultipleMetricsCollectMultipleMetricsPicardCollects multiple classes of metrics; runs tools CollectInsertSizeMetrics and CollectAlignmentSummaryMetrics.
tasks.CalculateContaminationGetPileupSummaries, Calculate ContaminationGATKUses the population VCF and index files to calculate pileup metrics and estimate cross-sample contamination.

1. Convert FASTQ to uBAM

If paired-end FASTQ files are used as inputs for the pipeline, the first task in the pipeline converts those FASTQ files into an unmapped BAM file using Picard's FastqToSam.

This task is skipped if a read group unmapped BAM file is used as input for the pipeline.

2. UMI extraction

Unique molecular identifiers (UMIs) are DNA tags (bases) that identify each DNA molecule before PCR amplification. They enable us to detect whether two reads that align to the same position in the reference are sequences of the same DNA molecule (have the same UMIs; PCR duplicates) or are two separate copies of DNA that happen to have the same sequence (have different UMIs; biological or natural duplicates).

For example, the Illumina sequencers used by the Broad Genomics Platform read 151 bp per read, of which 5 bp are reserved for the UMIs.

The first step of the RNA with UMIs workflow is to use the fgbio's ExtractUMIsFromBam to remove UMI bases from the reads and store them in the RX read tag.

The resulting RX tag may contain information like "ACT-GCT." The "ACT" is the 3 bp for read 1 and the "GCT" is the 3 bp for read 2. Even though we reserve 5 bases in a read, we only use 3 bases for UMI; the other two bases are reserved for cell barcode, which the workflow doesn't use.

3. Convert uBAM to FASTQs and filter reads

The tasks.SamToFastq task uses Picard's FastqToSam to convert the unmapped BAM to paired-end FASTQ files that can then be used for adapter clipping. This step also removes reads that fail to pass platform/vendor quality checks performed by the sequencing platform and flagged with the corresponding SAM flag value.

4. Trim adapters and poly(A) tails

After converting the uBAM to FASTQs, the workflow uses fastp to trim sequencing adapters and poly(A) tails from the reads. This task requires an adapter FASTA file containing the list of sequences to be trimmed and is publicly available in Broad's public reference bucket.

The adapter FASTA file contains the sequences shown below.

>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>polyA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

5. Convert FASTQs to uBAM

The tasks.FastqToUbam (alias = FastqToUbamAfterClipping) task converts trimmed FASTQs to an unmapped BAM using Picard's FastqToSam.

6. Alignment with STAR

After UMI extraction, the workflow aligns the paired-end reads to the reference (hg38 or hg19) using the STAR aligner, which is specifically designed for RNA-seq data and can align cDNA sequences with many "gaps" that correspond to introns.

The task uses the following parameters:

ParameterValueNotes
outSAMunmappedWithinIncludes unmapped reads in the output file rather than dropping those reads to facilitate potential downstream analysis.
outFilterMismatchNoverLmax0.1Sets the maximum allowable ratio of mismatches to read length. Reads with a ratio larger than the set value are filtered. For example, for paired-end reads with length 146, the reads are filtered if the number of mismatches is greater than 29 (146 * 2 * 0.1 = 29).
alignEndsProtrude20 ConcordantPairAllows a maximum of 20 protruding bases at alignment ends and marks these alignments as concordant pairs to prevent reads from small cDNA fragments that were sequenced into adapters from being dropped. This parameter allows for the processing of data derived from low-quality or degraded tissue such as formalin-fixed paraffin-embedded (FFPE) samples.

Additional parameters are used to match ENCODE bulk RNA-seq data standards. To learn more about ENCODE options in STAR, see the STAR manual.

After STAR alignment, the workflow outputs both a genome- and transcriptome-aligned BAM.

7. Mark duplicates and sort BAMs

As described in Step 2 (UMI extraction), UMIs are DNA tags that allow us to distinguish between PCR duplicates (duplicate reads with the same UMI) and biological duplicates (duplicate reads with different UMIs).

In this step, the workflow sorts the aligned BAMs coordinates using Picard's SortSam tool and then groups the duplicates using UMI-tools' group function. Once the duplicates are grouped by UMI, the PCR duplicates are marked using Picard's MarkDuplicates. This step outputs new genome- and transcriptome-aligned BAM files with PCR duplicates tagged and a corresponding index file.

The transcriptome-aligned BAM is then sorted using GATK’s PostProcessReadsForRSEM for compatibility with RSEM. While RSEM is not used in this workflow, it is an additional tool that can be used to quantify expression from RNA-seq data.

8. Gene quantification

After duplicate reads have been tagged, the workflow uses RNA-SeQC to quantify the expression level of transcripts based on the number of reads that align to one or more exons of each gene (rnaseqc2_gene_counts). Exon-level expression is also quantified based on the number of reads that align to each exon (rnaseqc2_exon_counts).

9. Metric calculation

The pipeline uses FastQC, RNA-SeQC, Picard's CollectRNASeqMetrics and CollectMultipleMetrics tools, and GATK's GetPileupSummaries and CalculateContamination tools to calculate summary metrics that can be used to assess the quality of the data each time the pipeline is run.

If you are a member of the Broad Institute's Genomics Platform using the internal RNA with UMIs pipeline, there is an additional step that merges the individual metrics files to create the MergeMetrics.unified_metrics output file and prepare the data for use in the Terra Data Repository.

10. Outputs

Workflow outputs are described in the table below.

Output variable nameDescriptionType
sample_nameSample name extracted from the input unmapped BAM file header.String
transcriptome_bamDuplicate-marked BAM file containing alignments from STAR translated into transcriptome coordinates and postprocessed for RSEM.BAM
transcriptome_duplicate_metricsFile containing duplication metrics.TXT
output_bamDuplicate-marked BAM file containing alignments from STAR translated into genome coordinates.BAM
output_bam_indexIndex file for the output_bam output.BAM Index
duplicate_metricsDuplicate metrics file containing the number of reads marked as duplicates.TXT
rnaseqc2_gene_tpmFile containing TPMs.GCT
rnaseqc2_gene_countsFile containing gene counts.GCT
rnaseqc2_exon_countsFile containing exon counts.GCT
rnaseqc2_fragment_size_histogramFile containing counts of observed fragment size.TXT
rnaseqc2_metricsFile containing RNA-SeQC metrics including strand specificity, 3’/5’ bias, rRNA reads, and others.TSV
picard_rna_metricsMetrics file containing the output of Picard’s CollectRnaSeqMetrics tool.TXT
picard_alignment_summary_metricsMetrics file containing output of Picard’s CollectAlignmentSummaryMetrics tool.TXT
picard_insert_size_metricsMetrics file containing output of Picard’s CollectInsertSizeMetrics tool.TXT
picard_insert_size_histogramHistogram chart of insert size.PDF
picard_base_distribution_by_cycle_metricsMetrics file containing the output of Picard’s CollectBaseDistributionByCycle tool.TXT
picard_base_distribution_by_cycle_pdfChart of nucleotide distribution per cycle.PDF
picard_quality_by_cycle_metricsMetrics file containing the output of Picard’s MeanQualityByCycle tool.TXT
picard_quality_by_cycle_pdfChart of mean quality by cycle.PDF
picard_quality_distribution_metricsMetrics file containing the output of Picard’s QualityScoreDistribution tool.TXT
picard_quality_distribution_pdfChart of quality score distribution.PDF
contaminationFloat representing the calculated cross-sample contamination.Float
contamination_errorFloat representing the error associated with the contamination calculation.Float
fastqc_html_reportHTML report containing general quality control metrics generated by FastQC.File
fastqc_percent_reads_with_adapterFloat representing the percent of reads with adapter sequences present following the adapter clipping task.Float

Versioning

All RNA with UMIs pipeline releases are documented in the pipeline changelog.

Citing the RNA with UMIs Pipeline

If you use the RNA with UMIs Pipeline in your research, please consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.