Skip to main content

Smart-seq2 Single Nucleus Multi-Sample Overview

Pipeline VersionDate UpdatedDocumentation AuthorQuestions or Feedback
MultiSampleSmartSeq2SingleNuclei_v1.3.4May, 2024Elizabeth KiernanPlease file GitHub issues in WARP or contact the WARP team

Introduction to the Smart-seq2 Single Nucleus Multi-Sample pipeline

The Smart-seq2 Single Nucleus Multi-Sample (Multi-snSS2) pipeline was developed in collaboration with the BRAIN Initiative Cell Census Network (BICCN) to process single-nucleus RNAseq (snRNAseq) data generated by Smart-seq2 assays. The pipeline's workflow is written in WDL, is freely available in the WARP repository on GitHub, and can be run by any compliant WDL runner (e.g. Crowmell).

The pipeline is designed to process snRNA-seq data from multiple cells. Overall, the workflow trims paired-end FASTQ files, aligns reads to the genome using a modified GTF, counts intronic and exonic reads, and calculates quality control metrics.

The pipeline has been scientifically validated by the BRAIN Institute. Read more in the validation section.

Try the Multi-snSS2 workflow in Terra

You can run the Smart-seq2 Single Nucleus Multi-Sample workflow in Terra, a cloud-based analysis platform. The Terra Smart-seq2 Single Nucleus Multi-Sample public workspace is preloaded with the Multi-snSS2 workflow, example testing data, and all the required reference data.

Quick start table

Pipeline featuresDescriptionSource
Assay typeSmart-seq2 Single NucleusSmart-seq2
Overall workflowQuality control and transcriptome quantification.Code available from the WARP repository in GitHub
Workflow languageWDLopenWDL
Genomic reference sequence (for validation)GRCm38 mouse genome primary sequence.GENCODE GRCm38 mouse reference files
Transcriptomic reference annotation (for validation)Modified M23 GTF built with the BuildIndices workflow.GENCODE M23 GTF; modified version available Broad references
AlignerSTARSTAR
QC metricsPicardBroad Institute
Transcript quantificationfeatureCounts (utilities for counting reads to genomic features).featureCounts
Data input file formatFile format in which sequencing data is provided.FASTQ
Data output file formatsFile formats in which Smart-seq2 output is provided.BAM, Loom (counts and metrics; generated with Loompy v.3.0.6), TSV (counts)

Set-Up

Installation

To download the latest release of the Multi-snSS2 pipeline along with example data and dependencies, see the release tags prefixed with "MultiSampleSmartSeq2SingleNucleus" on the WARP releases page. All releases of the Multi-snSS2 pipeline are documented in the Multi-snSS2 changelog.

To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.

The workflow is deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms.

Inputs

There is an example configuration (JSON) file available that you can use to test the Multi-snSS2 workflow. It points to publicly available reference files and sample paired-end FASTQs.

Sample data and reference inputs

The table below details the Multi-snSS2 inputs. The pipeline is designed to take in an array of paired-end reads in the form of two FASTQ files per cell.

  • The example mouse reference inputs are created using the BuildIndices Pipeline.
  • The pipeline modifies a given GTF downloaded from GENCODE to only include biotypes that are listed in a tab separated file (biotypes.tsv).
  • The example references do not include the pseudogene biotype. Learn more about Ensembl biotypes in the Ensembl overview.
  • To enable intron counting, the workflow calls a python script, add-introns-to-gtf.py, to create a custom GTF with intron annotations. Introns are considered any part of a contig that is not exonic nor intergenic.
Input NameInput DescriptionInput Format
genome_ref_fastaFASTA file used for metric collection by Picard.File
tar_star_referenceTAR file containing genome indices used for the STAR aligner.File
annotations_gtfCustom GTF file containing annotations for exon and intron tagging; must match tar_star_reference.File
adapter_listFASTA file containing adapter sequences used in the library preparation (i.e. Illumina adapters for Illumina sequencing).File
input_idsUnique identifiers or names for each cell; can be a UUID or human-readable name.Array of strings
input_namesOptional unique identifiers/names to further describe each cell. If input_ids is a UUID, the input_names could be used as a human-readable identifier.Array of strings
fastq1_input_filesCloud path to FASTQ files containing forward paired-end sequencing reads for each cell (sample); order must match the order in input_id.Array of strings
fastq2_input_filesCloud path to FASTQ files containing reverse paired-end sequencing reads for each cell (sample); order must match the order in input_id.Array of strings
batch_idIdentifier for the batch of multiple samples.String
batch_nameOptional string to describe the batch or biological sample.String
project_idOptional project identifier; usually a number.Array of strings
project_nameOptional project identifier; usually a human-readable name.Array of strings
libraryOptional description of the sequencing method or approach.Array of strings
organOptional description of the organ from which the cells were derived.Array of strings
speciesOptional description of the species from which the cells were derived.Array of strings
input_name_metadata_fieldOptional input describing, when applicable, the metadata field containing the input_names.String
input_id_metadata_fieldOptional string describing, when applicable, the metadata field containing the input_ids.String

Multi-snSS2 tasks and tools

The Multi-snSS2 workflow imports individual "tasks" (written in WDL script) from the WARP tasks folder.

Overall, the Multi-snSS2 workflow:

  1. Trims adapters.
  2. Aligns reads.
  3. Removes duplicate reads.
  4. Calculates metrics.
  5. Quantifies gene counts.
  6. Merges exon counts, intron counts, and metrics into a Loom-formatted matrix.

The tools each task employs in the Multi-snSS2 workflow are detailed in the table below.

To see specific tool parameters, select the task WDL link in the table; then view the command {} section of the task WDL script. To view or use the exact tool software, see the task's Docker image which is specified in the task WDL # runtime values section as String docker =.

Task name and WDL linkToolSoftwareDescription
CheckInputs.checkInputArrays---BashChecks the inputs and initiates the per cell processing.
StarAlign.STARGenomeRefVersion---BashReads the tar_star_reference file to obtain the genomic reference source, build version, and annotation version.
TrimAdapters.TrimAdaptersfastq-mcfea-utilsTrims adapter sequences from the FASTQ inputs
StarAlign.StarAlignFastqMultisampleSTARSTARAligns reads to the genome.
Picard.RemoveDuplicatesFromBamMarkDuplicates, AddOrReplaceReadGroupsPicardRemoves duplicate reads, producing a new BAM output; adds regroups to deduplicated BAM.
Picard.CollectMultipleMetricsMultiSampleCollectMultipleMetricsPicardCollects QC metrics on the deduplicated BAM files.
CountAlignments.CountAlignmentsFeatureCountsSubread, Python 3Uses a custom GTF with featureCounts and Python to mark introns, create a BAM that has alignments spanning intron-exon junctions removed, and counts exons using the custom BAM and by excluding intron tags.
LoomUtils.SingleNucleusSmartSeq2LoomOutputCustom script: ss2_loom_merge.pyPython 3Creates the matrix files (Loom format) for each sample.
LoomUtils.AggregateSmartSeq2LoomCustom script: ss2_loom_merge.pyPython 3Aggregates the matrix files (Loom format) for each sample to produce one final Loom output.

1. Trimming adapters

The TrimAdapters task uses the adapter list reference file to run the fastq-mcf tool. This tool identifies the adapters in the input FASTQ files and performs clipping by using a subsampling parameter of 200,000 reads. The task outputs the trimmed FASTQ files which are then used for alignment.

2. Aligning reads

The StarAlignFastq task runs the STAR aligner on the trimmed FASTQ files. The STAR quantMode parameter is set to GeneCounts, which counts the number of reads per gene while mapping. The task outputs a coordinate-sorted aligned BAM file.

3. Removing duplicate reads

The RemoveDuplicatesFromBam task removes multi-mapped reads, optical duplicates, and PCR duplicates from the aligned BAM. It then adds readgroup information and creates a new, coordinate-sorted aligned BAM output.

4. Collecting metrics

The CollectMultipleMetrics task uses the Picard tool CollectMultipleMetrics to perform QC on the deduplicated BAM file. These metrics are copied to the final cell-by-gene matrix output (Loom file). A detailed list of these metrics can be found in the Multi-snSS2 Count Matrix Overview.

5. Counting genes

The CountAlignments task uses the featureCounts package to count introns and exons. First, the featureCounts tool counts intronic alignments in the deduplicated BAM using a custom GTF with annotated introns. The tool flags intronic alignments if they overlap an annotated intron by a minimum of 3 bp.

Next, following pipeline processes established by the BICCN Whole Mouse Brain Working Group, a custom Python script (“remove-reads-on-junctions.py”) removes all alignments in the deduplicated BAM that cross only one intron-exon junction and produces an intermediate BAM file for exon counting. This removes a negligible amount of putative intronic alignments that did not meet the 3 bp intron overlap criteria.

Lastly, featureCounts uses the intermediate BAM with junctions removed to count exons. The final outputs of this step include a cell-by-gene matrix of intronic counts, a cell-by-gene matrix of exonic counts, and summary metrics for the different count types.

6. Creating the Loom cell by gene matrix

The LoomUtils task combines the Picard metrics (alignment_summary_metrics, deduplication metrics, and the G/C bias summary metrics) with the featureCount exon and intron counts to create a Loom formatted cell-by-gene count matrix.

Read full details for all the metrics in the Multi-snSS2 Count Matrix Overview.

The cell-by-gene matrix can be examined using Loompy software. Exonic counts are stored in the main Loom matrix which is unnamed by default. They are the default return value of the loompy.connect() command. Intronic counts are stored in the Loom as an additional layer which is named intron_counts.

You can also access exonic and intronic counts using Loompy's layers() method. For example, loompy.connect.layers[“”] will return the exonic counts from the output Loom file. Similarly, loompy.connect.layers[“intron_counts”] will return the intronic counts from the output Loom.

Whole gene counts (which include both intronic and exonic counts) can be accessed by adding the intronic and exonic counts together.

Below is example Loompy code for accessing the Loom's exonic, intronic, and whole gene counts.

import loompy
ds = loompy.connect("/PATH/TO/File.loom")
count_exons = ds[:,:] #geneXcell table for the exonic read counts
count_introns = ds.layer["intron_counts"] #geneXcell table for the intronic read counts
gene_counts = count_exons + count_introns

To read more about the Loom file format and use of layers, see the Loompy documentation.

7. Outputs

The table below details the final outputs of the Multi-snSS2 workflow.

Output variable nameDescriptionType
loom_outputCell-by-gene count matrix that includes the raw exon counts (in matrix), intron counts (in matrix layer), cell metrics (column attributes) and gene IDs (row attributes).Loom
exon_intron_count_filesArray of TXT files (one per cell) that contain intronic and exonic counts.Array [TXT]
bam_filesArray of genome-aligned BAM files (one for each cell) generated with STAR.Array [BAM]
pipeline_version_outVersion of the processing pipeline run on this data.String

The Loom matrix is the default output. See the create_loom_snss2.py script for the detailed code. This matrix contains the count matrices, as well as the gene and cell metrics detailed in the Multi-snSS2 Count Matrix Overview.

To facilitate downstream analysis, the output Loom file contains both gene names and gene IDs.

The output Loom matrix can be converted to an H5AD file using a custom script available in the warp-tools GitHub repository.

Validation

The Multi-snSS2 pipeline was scientifically validated by the BRAIN Initiatives Cell Census Network (BICCN) 2.0 Whole Mouse Brain Working Group.

Versioning

All Multi-snSS2 release notes are documented in the Multi-snSS2 changelog.

Citing the Multi-snSS2 Pipeline

If you use the Multi-snSS2 Pipeline in your research, please identify the pipeline in your methods section using the Multi-snSS2 SciCrunch resource identifier.

  • Ex: Smart-seq2 Single Nucleus Multi-Sample Pipeline (RRID:SCR_021312)

Please also consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Consortia Support

This pipeline is supported and used by the BRAIN Initiative Cell Census Network (BICCN).

Each consortia may use slightly different reference files for data analysis or have different post-processing steps. Learn more by reading the Consortia Processing overview.

If your organization also uses this pipeline, we would love to list you! Please reach out to us by contacting the WARP Pipeline Development team.

Feedback

Please help us make our tools better by contacting the WARP Pipeline Development team for pipeline-related suggestions or questions.