RNA with UMIs v1.0.16 Methods

Below we provide an example methods section for publications using the RNA with UMIs pipeline. For the complete pipeline documentation, see the RNA with UMIs Overview.

Methods

Data preprocessing, gene counting, and metric calculation were performed using the RNA with UMIs v1.0.16 pipeline, which uses Picard, fgbio v1.4.0, fastp v0.20.1, FastQC v0.11.9, STAR v2.7.10a, Samtools v1.11, UMI-tools v1.1.1, GATK v4.5.0.0, and RNA-SeQC v2.4.2 with default tool parameters unless otherwise specified. Reference files are publicly available in the Broad References Google Bucket and are also listed in example configuration files in the WARP repository.

Paired-end FASTQ files were first converted to an unmapped BAM (uBAM) using Picard's (v3.0.0) FastqToSam tool with SORT_ORDER = unsorted. (If a read group unmapped BAM file is used as input for the pipeline, this step is skipped.) Unique molecular identifiers (UMIs) were extracted from the uBAM using fgbio's ExtractUmisFromBam and stored in the RX read tag.
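The two steps above can be sketched as command lines assembled in Python. This is a minimal illustration, not the pipeline's own task code: the `picard` and `fgbio` launcher names, the file names, and the read structure `3M2S146T` are placeholders assumed for the example. Only `SORT_ORDER=unsorted` and the RX tag come from the text.

```python
import shlex
from typing import List


def fastq_to_sam_cmd(fastq1: str, fastq2: str, out_ubam: str, sample: str) -> List[str]:
    """Build Picard FastqToSam arguments for paired-end FASTQ -> unsorted uBAM."""
    return [
        "picard", "FastqToSam",   # launcher name is an assumption
        f"FASTQ={fastq1}",
        f"FASTQ2={fastq2}",
        f"OUTPUT={out_ubam}",
        f"SAMPLE_NAME={sample}",
        "SORT_ORDER=unsorted",    # setting stated in the methods text
    ]


def extract_umis_cmd(in_ubam: str, out_ubam: str, read_structures: List[str]) -> List[str]:
    """Build fgbio ExtractUmisFromBam arguments that store UMIs in the RX tag."""
    return [
        "fgbio", "ExtractUmisFromBam",   # launcher name is an assumption
        "--input", in_ubam,
        "--output", out_ubam,
        # One read structure per read; the values are supplied by the caller.
        "--read-structure", *read_structures,
        "--molecular-index-tags", "RX",  # tag named in the methods text
    ]


if __name__ == "__main__":
    # Hypothetical file names and read structure, for illustration only.
    print(shlex.join(fastq_to_sam_cmd("r1.fastq.gz", "r2.fastq.gz", "sample.u.bam", "sample")))
    print(shlex.join(extract_umis_cmd("sample.u.bam", "sample.umi.u.bam", ["3M2S146T", "3M2S146T"])))
```

The read structure depends on the library preparation kit, so it should be taken from the pipeline configuration rather than from this sketch.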

After the extraction of UMIs, reads that failed quality control checks performed by the sequencing platform were filtered out and the uBAM was converted to FASTQ files using Picard's SamToFastq tool. Illumina TruSeq adapter and poly(A) sequences were clipped from the reads using fastp. Picard's FastqToSam tool was again used to convert the FASTQ files back to a uBAM. This uBAM was used to calculate quality control metrics using FastQC.

Reads were aligned using STAR to the GRCh38 (hg38) reference, from which HLA, ALT, and decoy contigs had been removed, with gene annotations from GENCODE v34 (or to GRCh37 [hg19] with gene annotations from GENCODE v19). The --readFilesType and --readFilesCommand parameters were set to "SAM PE" and "samtools view -h", respectively, to indicate that the input was a BAM file. To specify that the output was an unsorted BAM that included unmapped reads, --outSAMtype was set to "BAM Unsorted" and --outSAMunmapped was set to "Within". A transcriptome-aligned BAM was also output with --quantMode = TranscriptomeSAM. To match ENCODE bulk RNA-seq data standards, the alignment was performed with parameters --outFilterType = BySJout, --outFilterMultimapNmax = 20, --outFilterMismatchNmax = 999, --alignIntronMin = 20, --alignIntronMax = 1000000, --alignMatesGapMax = 1000000, --alignSJoverhangMin = 8, and --alignSJDBoverhangMin = 1. The minimum fraction of the read length that had to match the reference was set with --outFilterMatchNminOverLread = 0.33 and the maximum ratio of mismatches to read length was set with --outFilterMismatchNoverLmax = 0.1. Chimeric alignments were included with --chimSegmentMin = 15, where 15 was the minimum length of each segment, and --chimMainSegmentMultNmax = 1 to prevent main chimeric segments from mapping to multiple sites. To output chimeric segments with soft-clipping in the aligned BAM, --chimOutType was set to "WithinBAM SoftClip". A maximum of 20 protruding bases at the ends of alignments was allowed with --alignEndsProtrude set to "20 ConcordantPair" to prevent reads from small cDNA fragments that were sequenced into adapters from being dropped.
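The STAR settings listed above can be collected into a single argument list, which makes them easier to review against the text. The sketch below is illustrative only: the parameter names and values are those given in the paragraph, while `--genomeDir`, `--readFilesIn`, `--outFileNamePrefix`, and the paths passed to them are assumptions added so that the command is complete.

```python
import shlex
from typing import Dict, List

# Parameter values exactly as listed in the methods text.
STAR_PARAMS: Dict[str, List[str]] = {
    "--readFilesType": ["SAM", "PE"],
    "--readFilesCommand": ["samtools", "view", "-h"],
    "--outSAMtype": ["BAM", "Unsorted"],
    "--outSAMunmapped": ["Within"],
    "--quantMode": ["TranscriptomeSAM"],
    "--outFilterType": ["BySJout"],
    "--outFilterMultimapNmax": ["20"],
    "--outFilterMismatchNmax": ["999"],
    "--alignIntronMin": ["20"],
    "--alignIntronMax": ["1000000"],
    "--alignMatesGapMax": ["1000000"],
    "--alignSJoverhangMin": ["8"],
    "--alignSJDBoverhangMin": ["1"],
    "--outFilterMatchNminOverLread": ["0.33"],
    "--outFilterMismatchNoverLmax": ["0.1"],
    "--chimSegmentMin": ["15"],
    "--chimMainSegmentMultNmax": ["1"],
    "--chimOutType": ["WithinBAM", "SoftClip"],
    "--alignEndsProtrude": ["20", "ConcordantPair"],
}


def star_cmd(genome_dir: str, ubam: str, out_prefix: str) -> List[str]:
    """Assemble a STAR command line; the three path arguments are placeholders."""
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,         # assumed: path to a prebuilt STAR index
        "--readFilesIn", ubam,             # assumed: the adapter-clipped uBAM
        "--outFileNamePrefix", out_prefix, # assumed: output prefix
    ]
    for flag, values in STAR_PARAMS.items():
        cmd.append(flag)
        cmd.extend(values)
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(shlex.join(star_cmd("star_index", "sample.clipped.u.bam", "sample.")))
```

Keeping multi-word values such as "SAM PE" as separate list items avoids shell-quoting problems when the list is passed to a process launcher without a shell.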

Following alignment, both BAM files were sorted by coordinate with Picard's (v2.26.11) SortSam tool. UMI-tools was then used to further divide putative duplicates into subgroups based on UMI, and sequencing errors in UMIs were corrected. To specify the tag where the UMIs were stored, --extract-umi-method was set to "tag" and --umi-tag was set to "RX". Unmapped reads were included in the output file with --unmapped-reads = use. Tagged BAM files were output using the option --output-bam. SortSam was used again to sort the BAM files by queryname for Picard's (v2.26.11) MarkDuplicates tool. MarkDuplicates was used to mark PCR duplicates and calculate duplicate metrics. After duplicate marking, BAM files were sorted by coordinate using SortSam to facilitate downstream analysis. The transcriptome-aligned, duplicate-marked BAM was sorted and postprocessed using GATK's PostProcessReadsForRSEM tool for compatibility with RSEM.
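To illustrate what grouping reads by UMI with error correction means, the sketch below clusters the UMIs observed at a single alignment position. It is a deliberately simplified greedy stand-in, not the network-based method that UMI-tools implements: each UMI is assigned to the most abundant UMI within a Hamming distance of 1. The function names and the distance threshold are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def group_umis(umis: List[str], max_dist: int = 1) -> Dict[str, str]:
    """Map each observed UMI to a representative UMI.

    UMIs are visited from most to least abundant (ties broken
    alphabetically). A UMI joins the first representative within
    max_dist mismatches; otherwise it becomes a new representative.
    """
    counts = Counter(umis)
    representatives: List[str] = []
    assignment: Dict[str, str] = {}
    for umi, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in representatives:
            if hamming(umi, rep) <= max_dist:
                assignment[umi] = rep  # treated as a sequencing error of rep
                break
        else:
            representatives.append(umi)
            assignment[umi] = umi
    return assignment


if __name__ == "__main__":
    # Toy UMIs for reads sharing one alignment position (made-up data).
    observed = ["ACGT"] * 5 + ["ACGA"] + ["TTTT"] * 2
    print(group_umis(observed))
```

In this toy input, ACGA differs from the abundant ACGT at one base and is folded into its group, while TTTT stays a separate group. Reads in the same group are the putative duplicates that MarkDuplicates then acts on.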

The genome-aligned, duplicate-marked BAM file was then used to calculate summary metrics using RNA-SeQC, Picard's CollectRNASeqMetrics (v2.26.11) and CollectMultipleMetrics (v3.0.0) tools, and GATK's GetPileupSummaries and CalculateContamination tools. CollectMultipleMetrics was used with the programs "CollectInsertSizeMetrics" and "CollectAlignmentSummaryMetrics". GetPileupSummaries was run with the read filters "WellformedReadFilter" and "MappingQualityAvailableReadFilter" disabled.
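The non-default settings for the two metrics tools above could be expressed as follows. This is a sketch under assumptions: the `picard` and `gatk` launcher names and all file names are placeholders, and `PROGRAM=null` is included on the assumption that Picard's default program list should be cleared so that only the two named programs run.

```python
import shlex
from typing import List

# Read filters that the methods text says were disabled.
DISABLED_FILTERS = ["WellformedReadFilter", "MappingQualityAvailableReadFilter"]


def collect_multiple_metrics_cmd(bam: str, out_prefix: str) -> List[str]:
    """Picard CollectMultipleMetrics restricted to the two named programs."""
    return [
        "picard", "CollectMultipleMetrics",  # launcher name is an assumption
        f"INPUT={bam}",
        f"OUTPUT={out_prefix}",
        "PROGRAM=null",                      # assumption: clear the default list
        "PROGRAM=CollectInsertSizeMetrics",
        "PROGRAM=CollectAlignmentSummaryMetrics",
    ]


def get_pileup_summaries_cmd(bam: str, sites_vcf: str, intervals: str, out_table: str) -> List[str]:
    """GATK GetPileupSummaries with the two read filters disabled."""
    cmd = [
        "gatk", "GetPileupSummaries",        # launcher name is an assumption
        "-I", bam,
        "-V", sites_vcf,                     # placeholder: common-variant sites file
        "-L", intervals,                     # placeholder: intervals to summarize
        "-O", out_table,
    ]
    for read_filter in DISABLED_FILTERS:
        cmd += ["--disable-read-filter", read_filter]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(collect_multiple_metrics_cmd("sample.bam", "sample.metrics")))
    print(shlex.join(get_pileup_summaries_cmd("sample.bam", "sites.vcf.gz", "sites.vcf.gz", "pileups.tsv")))
```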

The final outputs of the RNA with UMIs pipeline included metrics generated before alignment with FastQC, a transcriptome-aligned, duplicate-marked BAM file with duplication metrics, and a genome-aligned, duplicate-marked BAM file with corresponding index, duplication metrics, and metrics generated with RNA-SeQC, Picard, and GATK tools.