Smart-seq2 Multi-Sample Overview

Pipeline Version | Date Updated | Documentation Author | Questions or Feedback
MultiSampleSmartSeq2_v2.2.21 | December, 2023 | Elizabeth Kiernan | Please file GitHub issues in WARP or contact the WARP team

Introduction

The Smart-seq2 Multi-Sample (Multi-SS2) Pipeline is a wrapper around the Smart-seq2 Single Sample pipeline. It is developed by the Data Coordination Platform of the Human Cell Atlas to process single-cell RNA-seq (scRNA-seq) data generated by Smart-seq2 assays. The workflow processes multiple cells by importing and running the Smart-seq2 Single Sample workflow for each cell (sample) and then merging the resulting per-cell Loom matrices into a single Loom matrix containing raw counts and TPMs.

Full details about the Smart-seq2 Pipeline can be read in the Smart-seq2 Single Sample Overview in GitHub.

The Multi-SS2 workflow can also be run in Terra, a cloud-based analysis platform. The Terra Smart-seq2 public workspace contains the Smart-seq2 workflow, workflow configurations, required reference data and other inputs, and example testing data.

Want to use the Multi-SS2 Pipeline for your publication?

Check out the Smart-seq2 Publication Methods to get started!

Inputs

There are two example configuration (JSON) files available for testing the Multi-SS2 workflow. Both examples are also preloaded in the Terra Smart-seq2 public workspace.

Sample and Reference Inputs

The workflow’s sample inputs are listed in the table below. Reference inputs are identical to those specified in the Smart-seq2 Single Sample Overview.

The workflow processes both single- and paired-end samples; however, the two types cannot be mixed in the same run.

Input Name | Input Description | Input Type
fastq1_input_files | Cloud locations for each read1 file | Array of strings
fastq2_input_files | Optional cloud locations for each read2 file if running paired-end samples | Array of strings
input_ids | Unique identifiers or names for each cell; can be a UUID or human-readable name | Array of strings
input_names | Optional unique identifiers/names to further describe each cell. If input_id is a UUID, the input_name could be used as a human-readable identifier | String
batch_id | Identifier for the batch of multiple samples | String
batch_name | Optional string to describe the batch or biological sample | String
input_name_metadata_field | Optional string describing, when applicable, the metadata field containing the input_names | String
input_id_metadata_field | Optional string describing, when applicable, the metadata field containing the input_ids | String
project_id | Optional project identifier; usually a number | String
project_name | Optional project identifier; usually a human-readable name | String
library | Optional description of the sequencing method or approach | String
organ | Optional description of the organ from which the cells were derived | String
species | Optional description of the species from which the cells were derived | String
paired_end | Boolean for whether samples are paired-end or not | Boolean
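
As a rough illustration of how the sample inputs fit together, the following Python sketch assembles an inputs dictionary and applies a few consistency checks before writing it out as JSON. It is not taken from the pipeline: the `MultiSampleSmartSeq2.` key prefix, the `gs://example-bucket/...` paths, and the `build_inputs` helper are assumptions for illustration. The paired-end Boolean is written as `paired_end` on the assumption that WDL identifiers cannot contain hyphens, and the required reference inputs are omitted.

```python
import json

# Assumed workflow name, used as the prefix for every key in the inputs JSON.
WORKFLOW = "MultiSampleSmartSeq2"


def build_inputs(input_ids, fastq1_files, fastq2_files=None, batch_id="batch"):
    """Return a dict of sample inputs after basic consistency checks.

    Hypothetical helper: supplying fastq2_files marks the whole run as
    paired-end, so single- and paired-end cells cannot be mixed.
    """
    paired_end = fastq2_files is not None
    if len(set(input_ids)) != len(input_ids):
        raise ValueError("input_ids must be unique")
    if len(fastq1_files) != len(input_ids):
        raise ValueError("need exactly one read1 file per input_id")
    if paired_end and len(fastq2_files) != len(input_ids):
        raise ValueError("paired-end runs need one read2 file per input_id")

    inputs = {
        f"{WORKFLOW}.input_ids": list(input_ids),
        f"{WORKFLOW}.fastq1_input_files": list(fastq1_files),
        f"{WORKFLOW}.batch_id": batch_id,
        f"{WORKFLOW}.paired_end": paired_end,
    }
    if paired_end:
        inputs[f"{WORKFLOW}.fastq2_input_files"] = list(fastq2_files)
    return inputs


# Made-up cell names and bucket paths, for illustration only.
example = build_inputs(
    input_ids=["cell_A", "cell_B"],
    fastq1_files=[
        "gs://example-bucket/cell_A_R1.fastq.gz",
        "gs://example-bucket/cell_B_R1.fastq.gz",
    ],
    fastq2_files=[
        "gs://example-bucket/cell_A_R2.fastq.gz",
        "gs://example-bucket/cell_B_R2.fastq.gz",
    ],
    batch_id="plate_01",
)
print(json.dumps(example, indent=2))
```

The example configuration files mentioned above remain the authoritative reference for the exact key names and required reference inputs.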

Additional Input

Any additional reference inputs are identical to those specified in the "Additional Reference Inputs" section of the Smart-seq2 Single Sample Overview.

Smart-seq2 Multi-Sample Task Summary

The Multi-SS2 Pipeline calls two tasks:

1) SmartSeq2SingleSample: a task that runs the Smart-seq2 Single Sample workflow
2) SmartSeq2PlateAggregation: the wrapper pipeline that aggregates the results
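
The two-step structure (one single-sample run per cell, followed by one aggregation) can be sketched in Python as below. This is only a schematic of the control flow, not the WDL itself: `run_single_sample` and `aggregate_looms` are hypothetical stand-ins for the two tasks, and the file names they return are invented placeholders.

```python
from dataclasses import dataclass


@dataclass
class CellOutputs:
    bam: str
    bam_index: str
    loom: str


def run_single_sample(input_id, fastq1, fastq2=None):
    # Placeholder for one SmartSeq2SingleSample run; the real workflow
    # aligns reads and quantifies expression for this cell.
    return CellOutputs(f"{input_id}.bam", f"{input_id}.bam.bai", f"{input_id}.loom")


def aggregate_looms(loom_paths, batch_id):
    # Placeholder for the aggregation step; the real task merges the
    # per-cell Loom files. The returned file name is an assumption.
    return f"{batch_id}.loom"


def run_multi_sample(input_ids, fastq1_files, fastq2_files=None, batch_id="batch"):
    # Single-end runs have no read2 files, so pair each cell with None.
    read2 = fastq2_files if fastq2_files is not None else [None] * len(input_ids)
    # Scatter: one single-sample run per cell.
    per_cell = [
        run_single_sample(cell_id, r1, r2)
        for cell_id, r1, r2 in zip(input_ids, fastq1_files, read2)
    ]
    # Gather: per-cell BAMs stay as arrays, Loom files are merged into one.
    return {
        "bam_files": [c.bam for c in per_cell],
        "bam_index_files": [c.bam_index for c in per_cell],
        "loom_output": aggregate_looms([c.loom for c in per_cell], batch_id),
    }


outputs = run_multi_sample(
    ["cell_A", "cell_B"],
    ["cell_A_R1.fastq.gz", "cell_B_R1.fastq.gz"],
    batch_id="plate_01",
)
print(outputs)
```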

Outputs

Output File Name | Output Description | Output Type
bam_files | An array of genome-aligned BAM files (one for each sample) generated with HISAT2 | Array
bam_index_files | An array of BAM index files generated with HISAT2 | Array
loom_output | A single Loom cell-by-gene matrix containing raw counts and TPMs for every cell | File

The final Loom matrix is an aggregate of all the individual Loom matrices generated using the Smart-seq2 Single Sample workflow.

The aggregated Loom filename contains the batch_id prefix, which is the string specified in the input configuration. The batch_id is also set as a global attribute in the Loom.
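
To make the aggregation concrete, here is a toy Python sketch that stacks per-cell results into one cell-by-gene structure and records the batch_id as a global attribute. It uses plain dictionaries rather than the Loom format, the `aggregate_cells` helper and all values are made up for illustration, and it assumes every cell was quantified against the same gene list.

```python
def aggregate_cells(per_cell, batch_id):
    """Stack per-cell count and TPM vectors into one cell-by-gene structure.

    Hypothetical helper using plain Python containers; the pipeline itself
    works on Loom files.
    """
    if not per_cell:
        raise ValueError("at least one cell is required")
    genes = list(per_cell[0]["genes"])
    merged = {
        "attrs": {"batch_id": batch_id},  # global attribute
        "genes": genes,
        "cell_ids": [],
        "counts": [],  # one row per cell, one column per gene
        "tpm": [],
    }
    for cell in per_cell:
        if list(cell["genes"]) != genes:
            raise ValueError(f"gene list mismatch for {cell['input_id']}")
        merged["cell_ids"].append(cell["input_id"])
        merged["counts"].append(list(cell["counts"]))
        merged["tpm"].append(list(cell["tpm"]))
    return merged


# Made-up values for two cells and two genes.
cells = [
    {"input_id": "cell_A", "genes": ["GeneX", "GeneY"],
     "counts": [10, 0], "tpm": [1000000.0, 0.0]},
    {"input_id": "cell_B", "genes": ["GeneX", "GeneY"],
     "counts": [4, 6], "tpm": [400000.0, 600000.0]},
]
merged = aggregate_cells(cells, batch_id="plate_01")
print(merged["attrs"], merged["cell_ids"], merged["counts"])
```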

Both the individual sample Loom files and individual BAM files are described in the Smart-seq2 Single Sample Overview.

Zarr Array Deprecation Notice June 2020

Please note that we have deprecated the previously used Zarr array output. The pipeline now uses the Loom file format as the default output.

Validation

The Multi-SS2 Pipeline has been validated for processing human and mouse, stranded or unstranded, paired- or single-end, and plate- or Fluidigm-based Smart-seq2 datasets (see links to validation reports in the table below).

Workflow Configuration | Link to Report
Mouse paired-end | Report
Human and mouse single-end | Report
Human stranded Fluidigm | Report

Versioning

Release information for the Multi-SS2 Pipeline can be found in the changelog. Please note that any major changes to the Smart-seq2 pipeline will be documented in the Smart-seq2 Single Sample changelog.

Citing the Smart-seq2 Multi-Sample Pipeline

If you use the Smart-seq2 Multi-Sample Pipeline in your research, please identify the pipeline in your methods section using the Smart-seq2 Multi-Sample SciCrunch resource identifier.

  • Ex: Smart-seq2 Multi-Sample Pipeline (RRID:SCR_018920)

Please also consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Consortia Support

This pipeline is supported and used by the Human Cell Atlas (HCA) project.

If your organization also uses this pipeline, we would love to list you! Please reach out to us by contacting the WARP team.

Have Suggestions?

Please help us make our tools better by contacting the WARP team for pipeline-related suggestions or questions.