Smart-seq2 Single Nucleus Multi-Sample Overview
Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
---|---|---|---|
MultiSampleSmartSeq2SingleNuclei_v2.0.1 | September, 2024 | Elizabeth Kiernan | Please file an issue in WARP. |
Introduction to the Smart-seq2 Single Nucleus Multi-Sample pipeline
The Smart-seq2 Single Nucleus Multi-Sample (Multi-snSS2) pipeline was developed in collaboration with the BRAIN Initiative Cell Census Network (BICCN) to process single-nucleus RNAseq (snRNAseq) data generated by Smart-seq2 assays. The pipeline's workflow is written in WDL, is freely available in the WARP repository on GitHub, and can be run by any compliant WDL runner (e.g. Crowmell).
The pipeline is designed to process snRNA-seq data from multiple cells. Overall, the workflow trims paired-end FASTQ files, aligns reads to the genome using a modified GTF, counts intronic and exonic reads, and calculates quality control metrics.
The pipeline has been scientifically validated by the BRAIN Institute. Read more in the validation section.
You can run the Smart-seq2 Single Nucleus Multi-Sample workflow in Terra, a cloud-based analysis platform. The Terra Smart-seq2 Single Nucleus Multi-Sample public workspace is preloaded with the Multi-snSS2 workflow, example testing data, and all the required reference data.
Quick start table
Pipeline features | Description | Source |
---|---|---|
Assay type | Smart-seq2 Single Nucleus | Smart-seq2 |
Overall workflow | Quality control and transcriptome quantification. | Code available from the WARP repository in GitHub |
Workflow language | WDL | openWDL |
Genomic reference sequence (for validation) | GRCm38 mouse genome primary sequence. | GENCODE GRCm38 mouse reference files |
Transcriptomic reference annotation (for validation) | Modified M23 GTF built with the BuildIndices workflow. | GENCODE M23 GTF; modified version available Broad references |
Aligner | STAR | STAR |
QC metrics | Picard | Broad Institute |
Transcript quantification | featureCounts (utilities for counting reads to genomic features). | featureCounts |
Data input file format | File format in which sequencing data is provided. | FASTQ |
Data output file formats | File formats in which Smart-seq2 output is provided. | BAM, Loom (counts and metrics; generated with Loompy v.3.0.6), TSV (counts) |
Set-Up
Installation
To download the latest release of the Multi-snSS2 pipeline along with example data and dependencies, see the release tags prefixed with "MultiSampleSmartSeq2SingleNucleus" on the WARP releases page. All releases of the Multi-snSS2 pipeline are documented in the Multi-snSS2 changelog.
To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.
The workflow is deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms.
Inputs
There is an example configuration (JSON) file available that you can use to test the Multi-snSS2 workflow. It points to publicly available reference files and sample paired-end FASTQs.
Sample data and reference inputs
The table below details the Multi-snSS2 inputs. The pipeline is designed to take in an array of paired-end reads in the form of two FASTQ files per cell.
- The example mouse reference inputs are created using the BuildIndices Pipeline.
- The pipeline modifies a given GTF downloaded from GENCODE to only include biotypes that are listed in a tab separated file (biotypes.tsv).
- The example references do not include the pseudogene biotype. Learn more about Ensembl biotypes in the Ensembl overview.
- To enable intron counting, the workflow calls a python script,
add-introns-to-gtf.py
, to create a custom GTF with intron annotations. Introns are considered any part of a contig that is not exonic nor intergenic.
Input Name | Input Description | Input Format |
---|---|---|
genome_ref_fasta | FASTA file used for metric collection by Picard. | File |
tar_star_reference | TAR file containing genome indices used for the STAR aligner. | File |
annotations_gtf | Custom GTF file containing annotations for exon and intron tagging; must match tar_star_reference . | File |
adapter_list | FASTA file containing adapter sequences used in the library preparation (i.e. Illumina adapters for Illumina sequencing). | File |
input_ids | Unique identifiers or names for each cell; can be a UUID or human-readable name. | Array of strings |
input_names | Optional unique identifiers/names to further describe each cell. If input_ids is a UUID, the input_names could be used as a human-readable identifier. | Array of strings |
fastq1_input_files | Cloud path to FASTQ files containing forward paired-end sequencing reads for each cell (sample); order must match the order in input_id. | Array of strings |
fastq2_input_files | Cloud path to FASTQ files containing reverse paired-end sequencing reads for each cell (sample); order must match the order in input_id. | Array of strings |
batch_id | Identifier for the batch of multiple samples. | String |
batch_name | Optional string to describe the batch or biological sample. | String |
project_id | Optional project identifier; usually a number. | Array of strings |
project_name | Optional project identifier; usually a human-readable name. | Array of strings |
library | Optional description of the sequencing method or approach. | Array of strings |
organ | Optional description of the organ from which the cells were derived. | Array of strings |
species | Optional description of the species from which the cells were derived. | Array of strings |
input_name_metadata_field | Optional input describing, when applicable, the metadata field containing the input_names . | String |
input_id_metadata_field | Optional string describing, when applicable, the metadata field containing the input_ids . | String |
cloud_provider | String describing the cloud provider that should be used to run the workflow; value should be "gcp" or "azure". | String |
Multi-snSS2 tasks and tools
The Multi-snSS2 workflow imports individual "tasks" (written in WDL script) from the WARP tasks folder.
Overall, the Multi-snSS2 workflow:
- Trims adapters.
- Aligns reads.
- Removes duplicate reads.
- Calculates metrics.
- Quantifies gene counts.
- Merges exon counts, intron counts, and metrics into a h5ad-formatted matrix.
The tools each task employs in the Multi-snSS2 workflow are detailed in the table below.
To see specific tool parameters, select the task WDL link in the table; then view the command {}
section of the task WDL script. To view or use the exact tool software, see the task's Docker image which is specified in the task WDL # runtime values
section as String docker =
.
Task name and WDL link | Tool | Software | Description |
---|---|---|---|
CheckInputs.checkInputArrays | --- | Bash | Checks the inputs and initiates the per cell processing. |
StarAlign.STARGenomeRefVersion | --- | Bash | Reads the tar_star_reference file to obtain the genomic reference source, build version, and annotation version. |
TrimAdapters.TrimAdapters | fastq-mcf | ea-utils | Trims adapter sequences from the FASTQ inputs |
StarAlign.StarAlignFastqMultisample | STAR | STAR | Aligns reads to the genome. |
Picard.RemoveDuplicatesFromBam | MarkDuplicates, AddOrReplaceReadGroups | Picard | Removes duplicate reads, producing a new BAM output; adds regroups to deduplicated BAM. |
Picard.CollectMultipleMetricsMultiSample | CollectMultipleMetrics | Picard | Collects QC metrics on the deduplicated BAM files. |
CountAlignments.CountAlignments | FeatureCounts | Subread, Python 3 | Uses a custom GTF with featureCounts and Python to mark introns, create a BAM that has alignments spanning intron-exon junctions removed, and counts exons using the custom BAM and by excluding intron tags. |
H5adUtils.SingleNucleusSmartSeq2H5adOutput | Custom script: create_h5ad_snss2.py | Python 3 | Creates the matrix files (h5ad format) for each sample. |
H5adUtils.AggregateSmartSeq2H5ad | Custom script: ss2_h5ad_merge.py | Python 3 | Aggregates the matrix files (h5ad format) for each sample to produce one final h5ad output. |
1. Trimming adapters
The TrimAdapters task uses the adapter list reference file to run the fastq-mcf tool. This tool identifies the adapters in the input FASTQ files and performs clipping by using a subsampling parameter of 200,000 reads. The task outputs the trimmed FASTQ files which are then used for alignment.
2. Aligning reads
The StarAlignFastq task runs the STAR aligner on the trimmed FASTQ files. The STAR quantMode parameter is set to GeneCounts, which counts the number of reads per gene while mapping. The task outputs a coordinate-sorted aligned BAM file.
3. Removing duplicate reads
The RemoveDuplicatesFromBam task removes multi-mapped reads, optical duplicates, and PCR duplicates from the aligned BAM. It then adds readgroup information and creates a new, coordinate-sorted aligned BAM output.
4. Collecting metrics
The CollectMultipleMetrics task uses the Picard tool CollectMultipleMetrics to perform QC on the deduplicated BAM file. These metrics are copied to the final cell-by-gene matrix output (h5ad file). A detailed list of these metrics can be found in the Multi-snSS2 Count Matrix Overview.
5. Counting genes
The CountAlignments task uses the featureCounts package to count introns and exons. First, the featureCounts tool counts intronic alignments in the deduplicated BAM using a custom GTF with annotated introns. The tool flags intronic alignments if they overlap an annotated intron by a minimum of 3 bp.
Next, following pipeline processes established by the BICCN Whole Mouse Brain Working Group, a custom Python script (“remove-reads-on-junctions.py”) removes all alignments in the deduplicated BAM that cross only one intron-exon junction and produces an intermediate BAM file for exon counting. This removes a negligible amount of putative intronic alignments that did not meet the 3 bp intron overlap criteria.
Lastly, featureCounts uses the intermediate BAM with junctions removed to count exons. The final outputs of this step include a cell-by-gene matrix of intronic counts, a cell-by-gene matrix of exonic counts, and summary metrics for the different count types.
6. Creating the h5ad cell by gene matrix
The H5adUtils task combines the Picard metrics (alignment_summary_metrics, deduplication metrics, and the G/C bias summary metrics) with the featureCount exon and intron counts to create an h5ad formatted cell-by-gene count matrix.
Read full details for all the metrics in the Multi-snSS2 Count Matrix Overview.
The cell-by-gene matrix can be examined using anndata software. Exonic counts are stored in the main h5ad matrix which is unnamed by default. They are the default return value of the anndata.read_h5ad()
command. Intronic counts are stored in the h5ad as an additional layer which is named intron_counts
.
You can also access exonic and intronic counts using anndatas's layers()
method. For example, anndata.layers[“”]
will return the exonic counts from the output h5ad file. Similarly, anndata.connect.layers[“intron_counts”]
will return the intronic counts from the output h5ad.
Whole gene counts (which include both intronic and exonic counts) can be accessed by adding the intronic and exonic counts together.
Below is example anndata code for accessing the h5ad's exonic, intronic, and whole gene counts.
import anndata
adata = anndata.read_h5ad("/PATH/TO/File.h5ad")
count_exons = adata.X #geneXcell table for the exonic read counts
count_introns = adata.layers["intron_counts"] #geneXcell table for the intronic read counts
If you would like to get the counts for both introns and exons, you can sum the counts together.
To read more about the h5ad file format and use of layers, see the h5ad documentation.
7. Outputs
The table below details the final outputs of the Multi-snSS2 workflow.
Output variable name | Description | Type |
---|---|---|
h5ad_output | Cell-by-gene count matrix that includes the raw exon counts (in matrix), intron counts (in matrix layer), cell metrics (column attributes) and gene IDs (row attributes). | h5ad |
exon_intron_count_files | Array of TXT files (one per cell) that contain intronic and exonic counts. | Array [TXT] |
bam_files | Array of genome-aligned BAM files (one for each cell) generated with STAR. | Array [BAM] |
pipeline_version_out | Version of the processing pipeline run on this data. | String |
The h5ad matrix is the default output. See the create_h5ad_snss2.py script for the detailed code. This matrix contains the count matrices, as well as the gene and cell metrics detailed in the Multi-snSS2 Count Matrix Overview.
To facilitate downstream analysis, the output h5ad file contains both gene names and gene IDs.
Validation
The Multi-snSS2 pipeline was scientifically validated by the BRAIN Initiatives Cell Census Network (BICCN) 2.0 Whole Mouse Brain Working Group.
Versioning
All Multi-snSS2 release notes are documented in the Multi-snSS2 changelog.
Citing the Multi-snSS2 Pipeline
If you use the Multi-snSS2 Pipeline in your research, please identify the pipeline in your methods section using the Multi-snSS2 SciCrunch resource identifier.
- Ex: Smart-seq2 Single Nucleus Multi-Sample Pipeline (RRID:SCR_021312)
Please also consider citing our preprint:
Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1
Consortia Support
This pipeline is supported and used by the BRAIN Initiative Cell Census Network (BICCN).
Each consortia may use slightly different reference files for data analysis or have different post-processing steps. Learn more by reading the Consortia Processing overview.
If your organization also uses this pipeline, we would love to list you! Please reach out to us by filing an issue in WARP.
Feedback
Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.