Smart-seq2 Single Nucleus Multi-Sample Overview

Pipeline Version | Date Updated | Documentation Author | Questions or Feedback
MultiSampleSmartSeq2SingleNuclei_v1.1.0 | July, 2021 | Elizabeth Kiernan | Please file GitHub issues in WARP or contact Kylee Degatano

Introduction to the Smart-seq2 Single Nucleus Multi-Sample pipeline

The Smart-seq2 Single Nucleus Multi-Sample (Multi-snSS2) pipeline was developed in collaboration with the BRAIN Initiative Cell Census Network (BICCN) to process single-nucleus RNA-seq (snRNA-seq) data generated by Smart-seq2 assays. The pipeline's workflow is written in WDL, is freely available in the WARP repository on GitHub, and can be run by any compliant WDL runner (e.g., Cromwell).

The pipeline is designed to process snRNA-seq data from multiple cells. Overall, the workflow trims paired-end FASTQ files, aligns reads to the genome using a modified GTF, counts intronic and exonic reads, and calculates quality control metrics.

The pipeline has been scientifically validated by the BRAIN Initiative Cell Census Network (BICCN). Read more in the validation section.

Try the Multi-snSS2 workflow in Terra

You can run the Smart-seq2 Single Nucleus Multi-Sample workflow in Terra, a cloud-based analysis platform. The Terra Smart-seq2 Single Nucleus Multi-Sample public workspace is preloaded with the Multi-snSS2 workflow, example testing data, and all the required reference data.

Quick start table

Pipeline features | Description | Source
Assay type | Smart-seq2 Single Nucleus | Smart-seq2
Overall workflow | Quality control and transcriptome quantification. | Code available from the WARP repository in GitHub
Workflow language | WDL | openWDL
Genomic reference sequence (for validation) | GRCm38 mouse genome primary sequence. | GENCODE GRCm38 Mouse
Transcriptomic reference annotation (for validation) | Modified M23 GTF built with the BuildIndices workflow. | GENCODE
Aligner | STAR (v.2.7.9a) | STAR
QC metrics | Picard (v.2.20.4) | Broad Institute
Transcript quantification | featureCounts (utilities for counting reads to genomic features). | featureCounts (v2.0.2)
Data input file format | File format in which sequencing data is provided. | FASTQ
Data output file formats | File formats in which Smart-seq2 output is provided. | BAM, Loom (counts and metrics; generated with Loompy v.3.0.6), TSV (counts)

Set-Up

Multi-snSS2 installation and requirements

The latest release of the workflow, example data, and dependencies are available from the WARP releases page (see release tags prefixed with SmartSeq2SingleNucleus). To discover and search releases, use the WARP command-line tool Wreleaser.

The workflow is deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms.

Inputs

There is an example configuration (JSON) file available that you can use to test the Multi-snSS2 workflow. It points to publicly available reference files and sample paired-end FASTQs.

Sample data and reference inputs

The table below details the Multi-snSS2 inputs. The pipeline is designed to take in an array of paired-end reads in the form of two FASTQ files per cell.

  • Reference inputs are created using the BuildIndices Pipeline.
  • The workflow uses a modified version of the 10x Genomics code for building mouse (GRCm38-2020-A) and human (GRCh38-2020-A) reference packages.
  • To enable intron counting, the workflow calls a shell script to create a custom GTF with intron annotations. An intron is considered to be any part of a contig that is neither exonic nor intergenic.

Input Name | Input Description | Input Format
fastq1_input_files | Cloud path to FASTQ files containing forward paired-end sequencing reads for each cell (sample); order must match the order in input_ids. | Array of strings
fastq2_input_files | Cloud path to FASTQ files containing reverse paired-end sequencing reads for each cell (sample); order must match the order in input_ids. | Array of strings
input_ids | Unique identifiers or names for each cell; can be a UUID or human-readable name. | Array of strings
input_names | Optional unique identifiers/names to further describe each cell. If input_ids is a UUID, the input_names could be used as a human-readable identifier. | String
batch_id | Identifier for the batch of multiple samples. | String
batch_name | Optional string to describe the batch or biological sample. | String
input_name_metadata_field | Optional input describing, when applicable, the metadata field containing the input_names. | String
input_id_metadata_field | Optional string describing, when applicable, the metadata field containing the input_ids. | String
project_id | Optional project identifier; usually a number. | String
project_name | Optional project identifier; usually a human-readable name. | String
library | Optional description of the sequencing method or approach. | String
organ | Optional description of the organ from which the cells were derived. | String
species | Optional description of the species from which the cells were derived. | String
tar_star_reference | Genome references for STAR alignment. | TAR
annotations_gtf | Custom GTF file containing annotations for exon and intron tagging; must match the STAR reference. | GTF
genome_ref_fasta | FASTA file used for STAR alignment. | FASTA
adapter_list | File listing adapter sequences used in the library preparation (e.g., Illumina adapters for Illumina sequencing). | FASTA
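
Because the per-cell arrays must line up by position, it can help to assemble the inputs JSON programmatically. The sketch below is one possible way to do that in Python. The workflow name used as the key prefix, the helper name build_inputs, and all file paths are assumptions for illustration; check the workflow WDL and the example configuration file for the actual names.

```python
import json

# Assumed workflow name used as the JSON key prefix (verify against the WDL).
WORKFLOW = "MultiSampleSmartSeq2SingleNucleus"


def build_inputs(input_ids, fastq1_files, fastq2_files, batch_id, references):
    """Build a Cromwell-style inputs dict after checking the per-cell arrays.

    `references` maps input names from the table above (for example
    tar_star_reference) to file paths.
    """
    if not (len(input_ids) == len(fastq1_files) == len(fastq2_files)):
        raise ValueError(
            "input_ids, fastq1_input_files and fastq2_input_files "
            "must have the same length"
        )
    if len(set(input_ids)) != len(input_ids):
        raise ValueError("input_ids must be unique")

    inputs = {
        f"{WORKFLOW}.input_ids": list(input_ids),
        f"{WORKFLOW}.fastq1_input_files": list(fastq1_files),
        f"{WORKFLOW}.fastq2_input_files": list(fastq2_files),
        f"{WORKFLOW}.batch_id": batch_id,
    }
    for name, path in references.items():
        inputs[f"{WORKFLOW}.{name}"] = path
    return inputs


if __name__ == "__main__":
    # Hypothetical cell IDs and bucket paths, for illustration only.
    example = build_inputs(
        input_ids=["cell_A", "cell_B"],
        fastq1_files=["gs://my-bucket/cell_A_R1.fastq.gz",
                      "gs://my-bucket/cell_B_R1.fastq.gz"],
        fastq2_files=["gs://my-bucket/cell_A_R2.fastq.gz",
                      "gs://my-bucket/cell_B_R2.fastq.gz"],
        batch_id="batch_1",
        references={
            "tar_star_reference": "gs://my-bucket/star_reference.tar",
            "annotations_gtf": "gs://my-bucket/modified_annotations.gtf",
            "genome_ref_fasta": "gs://my-bucket/genome.fa",
            "adapter_list": "gs://my-bucket/adapters.fasta",
        },
    )
    print(json.dumps(example, indent=2))
```

The length and uniqueness checks here only mirror the ordering requirement stated in the table; they are not a copy of the workflow's own input checks.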

Running Multi-snSS2

The Multi-snSS2 workflow is in the pipelines/smartseq2_single_nucleus folder of the WARP repository and implements the workflow by importing individual tasks (written in WDL) from the WARP tasks folder.

Multi-snSS2 workflow summary

Task name and task’s WDL link | Description | Software | Tool
CheckInputs.checkInputArrays | Checks the inputs and initiates the per-cell processing. | Bash | NA
TrimAdapters.TrimAdapters | Trims adapter sequences from the FASTQ inputs. | ea-utils | fastq-mcf
StarAlign.StarAlignFastqMultisample | Aligns reads to the genome. | STAR | STAR
Picard.RemoveDuplicatesFromBam | Removes duplicate reads, producing a new BAM output; adds read groups to the deduplicated BAM. | Picard | MarkDuplicates, AddOrReplaceReadGroups
Picard.CollectMultipleMetricsMultiSample | Collects QC metrics on the deduplicated BAM files. | Picard | CollectMultipleMetrics
CountAlignments.CountAlignments | Uses a custom GTF with featureCounts and Python to mark introns, creates a BAM that has alignments spanning intron-exon junctions removed, and counts exons using the custom BAM and by excluding intron tags. | Subread | FeatureCounts, Python 3
LoomUtils.SingleNucleusSmartSeq2LoomOutput | Creates the matrix files (Loom format) for each sample. | Python 3 | Custom script: ss2_loom_merge.py
LoomUtils.AggregateSmartSeq2Loom | Aggregates the matrix files (Loom format) for each sample to produce one final Loom output. | Python 3 | Custom script: ss2_loom_merge.py

Trimming adapters

The TrimAdapters task uses the adapter list reference file to run the fastq-mcf tool. This tool identifies the adapters in the input FASTQ files and performs clipping by using a subsampling parameter of 200,000 reads. The task outputs the trimmed FASTQ files, which are then used for alignment.
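
As a rough illustration of the trimming step, the sketch below assembles a fastq-mcf command line for one cell's paired-end files. It assumes the ea-utils options -C (number of reads to subsample) and -o (output file, given once per input FASTQ); confirm them with fastq-mcf -h. The helper name and file names are hypothetical, and the task's actual command may differ.

```python
import shlex


def fastq_mcf_command(adapter_list, fastq1, fastq2, out1, out2,
                      subsample_reads=200_000):
    """Return an argv list for trimming one pair of FASTQ files.

    Assumes fastq-mcf's documented usage:
    fastq-mcf [options] <adapters.fa> <reads.fq> [mates.fq ...]
    """
    return [
        "fastq-mcf",
        "-C", str(subsample_reads),  # reads used for subsampling
        adapter_list,
        fastq1,
        fastq2,
        "-o", out1,                  # trimmed forward reads
        "-o", out2,                  # trimmed reverse reads
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = fastq_mcf_command(
        "adapters.fasta",
        "cell_A_R1.fastq.gz",
        "cell_A_R2.fastq.gz",
        "cell_A_R1.trimmed.fastq.gz",
        "cell_A_R2.trimmed.fastq.gz",
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces; the list can be passed directly to subprocess.run.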

Aligning reads

The StarAlignFastq task runs the STAR aligner on the trimmed FASTQ files. The STAR quantMode parameter is set to GeneCounts, which counts the number of reads per gene while mapping. The task outputs a coordinate-sorted aligned BAM file.
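
When STAR runs with --quantMode GeneCounts, it writes a tab-separated per-gene count table (ReadsPerGene.out.tab under STAR's default naming) with four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) followed by one row per gene. The sketch below shows one way to read such a table in Python. Whether the workflow keeps this file as an output is not stated here, and the example counts are made up.

```python
import io


def parse_reads_per_gene(handle, column=1):
    """Parse a STAR ReadsPerGene table.

    column 1 holds unstranded counts; columns 2 and 3 hold counts for the
    two stranded conventions. Returns (summary, gene_counts) dicts.
    """
    summary, gene_counts = {}, {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, value = fields[0], int(fields[column])
        # Summary rows start with "N_"; this assumes no gene ID does.
        if name.startswith("N_"):
            summary[name] = value
        else:
            gene_counts[name] = value
    return summary, gene_counts


if __name__ == "__main__":
    # Made-up counts in the ReadsPerGene layout.
    example = io.StringIO(
        "N_unmapped\t10\t10\t10\n"
        "N_multimapping\t5\t5\t5\n"
        "N_noFeature\t20\t30\t40\n"
        "N_ambiguous\t1\t0\t0\n"
        "ENSMUSG00000000001\t7\t3\t4\n"
    )
    summary, gene_counts = parse_reads_per_gene(example)
    print(summary)
    print(gene_counts)
```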

Removing duplicate reads

The RemoveDuplicatesFromBam task removes multi-mapped reads, optical duplicates, and PCR duplicates from the aligned BAM. It then adds read group information and creates a new, coordinate-sorted aligned BAM output.

Collecting metrics

The CollectMultipleMetrics task uses the Picard tool CollectMultipleMetrics to perform QC on the deduplicated BAM file. These metrics are copied to the final cell-by-gene matrix output (Loom file).

Counting genes

The CountAlignments task uses the featureCounts package to count introns and exons. First, the featureCounts tool counts intronic alignments in the deduplicated BAM using a custom GTF with annotated introns. The tool flags intronic alignments if they overlap an annotated intron by a minimum of 3 bp.
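
The 3 bp overlap rule can be illustrated with a small interval check. The sketch below is a simplified model, not featureCounts itself: it treats each alignment as a single contiguous block with 1-based inclusive coordinates and ignores spliced alignments, strand, and multi-feature assignment. The function names are hypothetical.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two 1-based, inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def is_intronic(aln_start, aln_end, introns, min_overlap=3):
    """True if the alignment overlaps any intron by at least min_overlap bp.

    `introns` is an iterable of (start, end) pairs on the same contig.
    """
    return any(
        overlap_bp(aln_start, aln_end, start, end) >= min_overlap
        for start, end in introns
    )


if __name__ == "__main__":
    introns = [(148, 300)]
    # Alignment 100-150 shares bases 148, 149, 150 with the intron.
    print(is_intronic(100, 150, introns))
    # Alignment 100-149 shares only 2 bases, below the threshold.
    print(is_intronic(100, 149, introns))
```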

Next, following pipeline processes established by the BICCN Whole Mouse Brain Working Group, a custom Python script (“remove-reads-on-junctions.py”) removes all alignments in the deduplicated BAM that cross only one intron-exon junction and produces an intermediate BAM file for exon counting. This removes a negligible number of putative intronic alignments that did not meet the 3 bp intron overlap criterion.

Lastly, featureCounts uses the intermediate BAM with junctions removed to count exons. The final outputs of this step include a cell-by-gene matrix of intronic counts, a cell-by-gene matrix of exonic counts, and summary metrics for the different count types.

Creating the cell-by-gene matrix (Loom)

The LoomUtils task combines the Picard metrics (alignment_summary_metrics, deduplication metrics, and the G/C bias summary metrics) with the featureCounts exon and intron counts to create a Loom-formatted cell-by-gene count matrix.

Exonic counts are in the Loom matrix and intronic counts are added as a Loom layer. Read more about Loom file format in the Loompy documentation.
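
To make the matrix-plus-layer layout concrete, the sketch below arranges per-cell exon and intron counts into two matrices with identical shape and ordering, with genes as rows and cells as columns (consistent with gene IDs being row attributes and cell metrics being column attributes). It uses plain Python lists rather than Loompy, and the function name and input structure are assumptions for illustration; it is not the pipeline's ss2_loom_merge.py script.

```python
def build_matrices(exon_counts, intron_counts):
    """Arrange per-cell counts as aligned gene-by-cell matrices.

    Both arguments map cell_id -> {gene_id: count}. Missing entries are
    filled with 0. Returns (genes, cells, exon_matrix, intron_layer).
    """
    cells = sorted(set(exon_counts) | set(intron_counts))
    genes = sorted({
        gene
        for per_cell in (*exon_counts.values(), *intron_counts.values())
        for gene in per_cell
    })

    def matrix(source):
        # One row per gene, one column per cell.
        return [[source.get(cell, {}).get(gene, 0) for cell in cells]
                for gene in genes]

    return genes, cells, matrix(exon_counts), matrix(intron_counts)


if __name__ == "__main__":
    # Made-up counts for two cells and two genes.
    exons = {"cell1": {"geneA": 5}, "cell2": {"geneA": 1, "geneB": 2}}
    introns = {"cell1": {"geneB": 7}}
    genes, cells, exon_matrix, intron_layer = build_matrices(exons, introns)
    print(genes, cells)
    print(exon_matrix)
    print(intron_layer)
```

Keeping the main matrix and the layer in the same gene and cell order is what allows a position in the intron layer to be read against the same position in the exon matrix.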

Outputs

The table below details the final outputs of the Multi-snSS2 workflow.

Output Name | Output Description | Output Format
loom_output | Cell-by-gene count matrix that includes the raw exon counts (in matrix), intron counts (in matrix layer), cell metrics (column attributes), and gene IDs (row attributes). | Loom
bam_files | Array of genome-aligned BAM files (one for each cell) generated with STAR. | Array [BAM]
exon_intron_count_files | Array of TXT files (one per cell) that contain intronic and exonic counts. | Array [TXT]

Validation

The Multi-snSS2 pipeline was scientifically validated by the BRAIN Initiative Cell Census Network (BICCN) 2.0 Whole Mouse Brain Working Group.

Versioning

All Multi-snSS2 release notes are documented in the Multi-snSS2 changelog.

Citing the Multi-snSS2 Pipeline

To cite the Multi-snSS2 pipeline, use the SciCrunch resource identifier.

  • Ex: Smart-seq2 Single Nucleus Multi-Sample Pipeline (RRID:SCR_021312)

To view an example of this citation as well as a publication-style methods section, see the Multi-snSS2 Example Methods.

Consortia Support

This pipeline is supported and used by the BRAIN Initiative Cell Census Network (BICCN).

If your organization also uses this pipeline, we would love to list you! Please reach out to us by contacting Kylee Degatano.

Have Suggestions?

Help us make our tools better by contacting Kylee Degatano for pipeline-related suggestions or questions.