Smart-seq2 Single Nucleus Multi-Sample Overview

Pipeline Version	Date Updated	Documentation Author	Questions or Feedback
MultiSampleSmartSeq2SingleNuclei_v2.0.1	September, 2024	Elizabeth Kiernan	Please file an issue in WARP.

Introduction to the Smart-seq2 Single Nucleus Multi-Sample pipeline

The Smart-seq2 Single Nucleus Multi-Sample (Multi-snSS2) pipeline was developed in collaboration with the BRAIN Initiative Cell Census Network (BICCN) to process single-nucleus RNAseq (snRNAseq) data generated by Smart-seq2 assays. The pipeline's workflow is written in WDL, is freely available in the WARP repository on GitHub, and can be run by any compliant WDL runner (e.g. Crowmell).

The pipeline is designed to process snRNA-seq data from multiple cells. Overall, the workflow trims paired-end FASTQ files, aligns reads to the genome using a modified GTF, counts intronic and exonic reads, and calculates quality control metrics.

The pipeline has been scientifically validated by the BRAIN Institute. Read more in the validation section.

Try the Multi-snSS2 workflow in Terra

You can run the Smart-seq2 Single Nucleus Multi-Sample workflow in Terra, a cloud-based analysis platform. The Terra Smart-seq2 Single Nucleus Multi-Sample public workspace is preloaded with the Multi-snSS2 workflow, example testing data, and all the required reference data.

Quick start table

Pipeline features	Description	Source
Assay type	Smart-seq2 Single Nucleus	Smart-seq2
Overall workflow	Quality control and transcriptome quantification.	Code available from the WARP repository in GitHub
Workflow language	WDL	openWDL
Genomic reference sequence (for validation)	GRCm38 mouse genome primary sequence.	GENCODE GRCm38 mouse reference files
Transcriptomic reference annotation (for validation)	Modified M23 GTF built with the BuildIndices workflow.	GENCODE M23 GTF; modified version available Broad references
Aligner	STAR	STAR
QC metrics	Picard	Broad Institute
Transcript quantification	featureCounts (utilities for counting reads to genomic features).	featureCounts
Data input file format	File format in which sequencing data is provided.	FASTQ
Data output file formats	File formats in which Smart-seq2 output is provided.	BAM, Loom (counts and metrics; generated with Loompy v.3.0.6), TSV (counts)

Set-Up

Installation

To download the latest release of the Multi-snSS2 pipeline along with example data and dependencies, see the release tags prefixed with "MultiSampleSmartSeq2SingleNucleus" on the WARP releases page. All releases of the Multi-snSS2 pipeline are documented in the Multi-snSS2 changelog.

To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.

The workflow is deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms.

Inputs

There is an example configuration (JSON) file available that you can use to test the Multi-snSS2 workflow. It points to publicly available reference files and sample paired-end FASTQs.

Sample data and reference inputs

The table below details the Multi-snSS2 inputs. The pipeline is designed to take in an array of paired-end reads in the form of two FASTQ files per cell.

The example mouse reference inputs are created using the BuildIndices Pipeline.
The pipeline modifies a given GTF downloaded from GENCODE to only include biotypes that are listed in a tab separated file (biotypes.tsv).
The example references do not include the pseudogene biotype. Learn more about Ensembl biotypes in the Ensembl overview.
To enable intron counting, the workflow calls a python script, add-introns-to-gtf.py, to create a custom GTF with intron annotations. Introns are considered any part of a contig that is not exonic nor intergenic.

Input Name	Input Description	Input Format
genome_ref_fasta	FASTA file used for metric collection by Picard.	File
tar_star_reference	TAR file containing genome indices used for the STAR aligner.	File
annotations_gtf	Custom GTF file containing annotations for exon and intron tagging; must match `tar_star_reference`.	File
adapter_list	FASTA file containing adapter sequences used in the library preparation (i.e. Illumina adapters for Illumina sequencing).	File
input_ids	Unique identifiers or names for each cell; can be a UUID or human-readable name.	Array of strings
input_names	Optional unique identifiers/names to further describe each cell. If `input_ids` is a UUID, the `input_names` could be used as a human-readable identifier.	Array of strings
fastq1_input_files	Cloud path to FASTQ files containing forward paired-end sequencing reads for each cell (sample); order must match the order in input_id.	Array of strings
fastq2_input_files	Cloud path to FASTQ files containing reverse paired-end sequencing reads for each cell (sample); order must match the order in input_id.	Array of strings
batch_id	Identifier for the batch of multiple samples.	String
batch_name	Optional string to describe the batch or biological sample.	String
project_id	Optional project identifier; usually a number.	Array of strings
project_name	Optional project identifier; usually a human-readable name.	Array of strings
library	Optional description of the sequencing method or approach.	Array of strings
organ	Optional description of the organ from which the cells were derived.	Array of strings
species	Optional description of the species from which the cells were derived.	Array of strings
input_name_metadata_field	Optional input describing, when applicable, the metadata field containing the `input_names`.	String
input_id_metadata_field	Optional string describing, when applicable, the metadata field containing the `input_ids`.	String
cloud_provider	String describing the cloud provider that should be used to run the workflow; value should be "gcp" or "azure".	String

Multi-snSS2 tasks and tools

The Multi-snSS2 workflow imports individual "tasks" (written in WDL script) from the WARP tasks folder.

Overall, the Multi-snSS2 workflow:

Trims adapters.
Aligns reads.
Removes duplicate reads.
Calculates metrics.
Quantifies gene counts.
Merges exon counts, intron counts, and metrics into a h5ad-formatted matrix.

The tools each task employs in the Multi-snSS2 workflow are detailed in the table below.

To see specific tool parameters, select the task WDL link in the table; then view the command {} section of the task WDL script. To view or use the exact tool software, see the task's Docker image which is specified in the task WDL # runtime values section as String docker =.

Task name and WDL link	Tool	Software	Description
CheckInputs.checkInputArrays	---	Bash	Checks the inputs and initiates the per cell processing.
StarAlign.STARGenomeRefVersion	---	Bash	Reads the `tar_star_reference` file to obtain the genomic reference source, build version, and annotation version.
TrimAdapters.TrimAdapters	fastq-mcf	ea-utils	Trims adapter sequences from the FASTQ inputs
StarAlign.StarAlignFastqMultisample	STAR	STAR	Aligns reads to the genome.
Picard.RemoveDuplicatesFromBam	MarkDuplicates, AddOrReplaceReadGroups	Picard	Removes duplicate reads, producing a new BAM output; adds regroups to deduplicated BAM.
Picard.CollectMultipleMetricsMultiSample	CollectMultipleMetrics	Picard	Collects QC metrics on the deduplicated BAM files.
CountAlignments.CountAlignments	FeatureCounts	Subread, Python 3	Uses a custom GTF with featureCounts and Python to mark introns, create a BAM that has alignments spanning intron-exon junctions removed, and counts exons using the custom BAM and by excluding intron tags.
H5adUtils.SingleNucleusSmartSeq2H5adOutput	Custom script: create_h5ad_snss2.py	Python 3	Creates the matrix files (h5ad format) for each sample.
H5adUtils.AggregateSmartSeq2H5ad	Custom script: ss2_h5ad_merge.py	Python 3	Aggregates the matrix files (h5ad format) for each sample to produce one final h5ad output.

1. Trimming adapters

The TrimAdapters task uses the adapter list reference file to run the fastq-mcf tool. This tool identifies the adapters in the input FASTQ files and performs clipping by using a subsampling parameter of 200,000 reads. The task outputs the trimmed FASTQ files which are then used for alignment.

2. Aligning reads

The StarAlignFastq task runs the STAR aligner on the trimmed FASTQ files. The STAR quantMode parameter is set to GeneCounts, which counts the number of reads per gene while mapping. The task outputs a coordinate-sorted aligned BAM file.

3. Removing duplicate reads

The RemoveDuplicatesFromBam task removes multi-mapped reads, optical duplicates, and PCR duplicates from the aligned BAM. It then adds readgroup information and creates a new, coordinate-sorted aligned BAM output.

4. Collecting metrics

The CollectMultipleMetrics task uses the Picard tool CollectMultipleMetrics to perform QC on the deduplicated BAM file. These metrics are copied to the final cell-by-gene matrix output (h5ad file). A detailed list of these metrics can be found in the Multi-snSS2 Count Matrix Overview.

5. Counting genes

The CountAlignments task uses the featureCounts package to count introns and exons. First, the featureCounts tool counts intronic alignments in the deduplicated BAM using a custom GTF with annotated introns. The tool flags intronic alignments if they overlap an annotated intron by a minimum of 3 bp.

Next, following pipeline processes established by the BICCN Whole Mouse Brain Working Group, a custom Python script (“remove-reads-on-junctions.py”) removes all alignments in the deduplicated BAM that cross only one intron-exon junction and produces an intermediate BAM file for exon counting. This removes a negligible amount of putative intronic alignments that did not meet the 3 bp intron overlap criteria.

Lastly, featureCounts uses the intermediate BAM with junctions removed to count exons. The final outputs of this step include a cell-by-gene matrix of intronic counts, a cell-by-gene matrix of exonic counts, and summary metrics for the different count types.

6. Creating the h5ad cell by gene matrix

The H5adUtils task combines the Picard metrics (alignment_summary_metrics, deduplication metrics, and the G/C bias summary metrics) with the featureCount exon and intron counts to create an h5ad formatted cell-by-gene count matrix.

Read full details for all the metrics in the Multi-snSS2 Count Matrix Overview.

The cell-by-gene matrix can be examined using anndata software. Exonic counts are stored in the main h5ad matrix which is unnamed by default. They are the default return value of the anndata.read_h5ad() command. Intronic counts are stored in the h5ad as an additional layer which is named intron_counts.

You can also access exonic and intronic counts using anndatas's layers() method. For example, anndata.layers[“”] will return the exonic counts from the output h5ad file. Similarly, anndata.connect.layers[“intron_counts”] will return the intronic counts from the output h5ad.

Whole gene counts (which include both intronic and exonic counts) can be accessed by adding the intronic and exonic counts together.

Below is example anndata code for accessing the h5ad's exonic, intronic, and whole gene counts.

import anndata
adata = anndata.read_h5ad("/PATH/TO/File.h5ad")
count_exons = adata.X #geneXcell table for the exonic read counts
count_introns = adata.layers["intron_counts"] #geneXcell table for the intronic read counts

If you would like to get the counts for both introns and exons, you can sum the counts together.

To read more about the h5ad file format and use of layers, see the h5ad documentation.

7. Outputs

The table below details the final outputs of the Multi-snSS2 workflow.

Output variable name	Description	Type
h5ad_output	Cell-by-gene count matrix that includes the raw exon counts (in matrix), intron counts (in matrix layer), cell metrics (column attributes) and gene IDs (row attributes).	h5ad
exon_intron_count_files	Array of TXT files (one per cell) that contain intronic and exonic counts.	Array [TXT]
bam_files	Array of genome-aligned BAM files (one for each cell) generated with STAR.	Array [BAM]
pipeline_version_out	Version of the processing pipeline run on this data.	String

The h5ad matrix is the default output. See the create_h5ad_snss2.py script for the detailed code. This matrix contains the count matrices, as well as the gene and cell metrics detailed in the Multi-snSS2 Count Matrix Overview.

To facilitate downstream analysis, the output h5ad file contains both gene names and gene IDs.

Validation

The Multi-snSS2 pipeline was scientifically validated by the BRAIN Initiatives Cell Census Network (BICCN) 2.0 Whole Mouse Brain Working Group.

Versioning

All Multi-snSS2 release notes are documented in the Multi-snSS2 changelog.

Citing the Multi-snSS2 Pipeline

If you use the Multi-snSS2 Pipeline in your research, please identify the pipeline in your methods section using the Multi-snSS2 SciCrunch resource identifier.

Ex: Smart-seq2 Single Nucleus Multi-Sample Pipeline (RRID:SCR_021312)

Please also consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Consortia Support

This pipeline is supported and used by the BRAIN Initiative Cell Census Network (BICCN).

Each consortia may use slightly different reference files for data analysis or have different post-processing steps. Learn more by reading the Consortia Processing overview.

If your organization also uses this pipeline, we would love to list you! Please reach out to us by filing an issue in WARP.

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.

Smart-seq2 Single Nucleus Multi-Sample Overview

Introduction to the Smart-seq2 Single Nucleus Multi-Sample pipeline​

Quick start table​

Set-Up​

Installation​

Inputs​

Sample data and reference inputs​

Multi-snSS2 tasks and tools​

1. Trimming adapters​

2. Aligning reads​

3. Removing duplicate reads​

4. Collecting metrics​

5. Counting genes​

6. Creating the h5ad cell by gene matrix​

7. Outputs​

Validation​

Versioning​

Citing the Multi-snSS2 Pipeline​

Consortia Support​

Feedback​