Skip to main content

Smart-seq2 Single Sample Overview

Pipeline VersionDate UpdatedDocumentation AuthorQuestions or Feedback
smartseq2_v5.1.20February, 2024Elizabeth KiernanPlease file GitHub issues in WARP or contact the WARP team

Introduction to the Smart-seq2 Single Sample Pipeline

The Smart-seq2 Single Sample Pipeline (SS2) is designed by the Data Coordination Platform of the Human Cell Atlas to process single-cell RNAseq (scRNAseq) data generated by Smart-seq2 assays. The pipeline's workflow is written in WDL, is freely available on GitHub, and can be run by any compliant WDL runner (e.g. Crowmell). The pipeline is designed to process scRNA-seq data from an individual cell (to process multiple cells, see the Smart-seq2_Multi Sample pipeline).

The SS2 workflow is divided into two parts which run in parallel. In part one, the workflow aligns reads to the genome and performs quality control on genome-aligned BAMs. In part two, it aligns reads to the transcriptome and generates gene expression estimates from transcriptome-aligned BAMs. The pipeline returns reads and expression estimates in BAM, and read counts and QC metrics in Loom and CSV file format.

The pipeline has been validated to support human and mouse, stranded or unstranded, paired- or single-end, and plate- or fluidigm-based Smart-seq2 datasets. Read more in the validation section.

Want to use the Smart-seq2 pipeline for your publication?

Check out the Smart-seq2 Publication Methods to get started!

Quick Start Table

Pipeline FeaturesDescriptionSource
Assay TypeSmart-seq2 Single SampleSmart-seq2
Overall WorkflowQuality control and transcriptome quantificationCode available from GitHub
Workflow LanguageWDLopenWDL
Genomic Reference Sequence (for validation)GRCh38 human genome primary sequence and M21 (GRCm38.p6) mouse genome primary sequenceGENCODE human reference files and mouse reference files
Transcriptomic Reference Annotation (for validation)V27 GENCODE human transcriptome and M21 mouse transcriptomeGENCODE human GTF and mouse GTF
AlignerHISAT2 (v.2.1.0)Kim, et al.,2019
QC MetricsPicard (v.2.26.10)Broad Institute
Transcript QuantificationUtilities for processing large-scale single cell datasetsRSEM v.1.3.0
Data Input File FormatFile format in which sequencing data is providedFASTQ
Data Output File FormatsFile formats in which Smart-seq2 output is providedBAM, Loom (generated with Loompy v.3.0.6), CSV (QC metrics and counts)

Set-Up

Smart-seq2 Single Sample Installation and Requirements

The SS2 workflow code can be downloaded by cloning the GitHub WARP repository. For the latest release of Smart-seq2, please see the Smart-seq2 Single Sample changelog.

The workflow is deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms.

Try the SS2 workflow in Terra

You can run the Smart-seq2 Multi-Sample workflow (a wrapper for the Single Sample workflow described in this document) in Terra, a cloud-based analysis platform. The Terra Smart-seq2 public workspace is preloaded with the Smart-seq2 workflow, example testing data, and all the required reference data.

Inputs

There are two example configuration (JSON) files available to test the SS2 workflow.

Sample Data Input

The pipeline is designed for both single- and paired-end reads in the form of FASTQ files. The workflow accepts two FASTQ files for paired-end experiments and one FASTQ file single-end experiments. It processes one sample (cell).

  • fastq1: forward reads for sample with paired-end sequencing (or reads for sample with single-end sequencing)
  • fastq2: reverse reads for sample with paired-end sequencing (not applicable for samples with single-end sequencing)

The workflow will use modified tasks depending if a sample is paired-end or single-end. This is designated with a boolean, as detailed in the following Reference Inputs section.

Additional Reference Inputs

The Smart-seq2 Single Sample workflow requires multiple reference indexes. Information on how these indexes are created is found in the BuildIndices workflow overview documentation. The table below describes the references and each tool that uses the reference as input, when applicable.

Workflow StepReference nameReference DescriptionTool
Allinput_idString containing unique identifier or name for the cell. It can be a human-readable descriptor (i.e. "cell1") or a UUIDNA
Allinput_nameOptional string that can be used as a name/identifier for the cellNA
Alloutput_nameOutput name, can include pathNA
Allpaired_endA boolean describing if sample is paired-endNA
Allinput_id_metadata_fieldOptional string that describes, when applicable, the metadata field used for input_idNA
Allinput_name_metadata_fieldOptional string that describes, when applicable, the metadata field used for the input_nameNA
Genomic alignment with HISAT2hisat2_ref_indexHISAT2 reference index file in tarballHISAT2
hisat2_ref_nameHISAT2 reference index nameHISAT2
Picard-generated quality control metricsgenome_ref_fastaGenome reference in fasta formatPicard
gene_ref_flatRefFlat file containing the location of RNA transcripts, exon start sites, etc.Picard
rrna_intervalsRibosomal RNA intervals filePicard
strandedLibrary strand information for HISAT2; example values include FR (read corresponds to transcript), RF(read corresponds to reverse complement of transcript), or NONEPicard
Transcriptomic alignment with HISAT2hisat2_ref_trans_indexHISAT2 transcriptome index file in tarballHISAT2
hisat2_ref_trans_nameHISAT2 transcriptome index file nameHISAT2
Gene expression quantification with RSEMrsem_ref_indexRSEM reference index file in tarballRSEM

Running Smart-seq2

The SmartSeq2SingleSample.wdl is in the pipelines/smartseq2_single_sample folder of the WARP repository and implements the workflow by importing individual tasks (written in WDL script) from the WARP tasks folder.

Smart-seq2 Workflow Summary

Overall, the workflow is divided into two parts that are completed after an initial input validation step. Each workflow part comprises tasks (WDL scripts) that are summarized below and are available in the WARP tasks folder in GitHub.

Part 1: Quality Control Tasks

  1. Aligns reads to the genome with HISAT2 v.2.1.0
  2. Calculates summary metrics from an aligned BAM using Picard v.2.26.10

Part 2: Transcriptome Quantification Tasks

  1. Aligns reads to the transcriptome with HISAT v.2.1.0
  2. Quantifies gene expression using RSEM v.1.3.0

The table below provides links to each of the Smart-seq2 Single Sample pipeline's task WDL scripts, as well as documentation for each task's software tools.

If you are looking for the parameters used for each task/tool, click on the task link in the table below and see the command {} section of the task WDL. The task's Docker image is also specified in the task WDL in the # runtime values section as String docker =.

TaskTool
HISAT2.HISATPairedEndHISAT2 software
HISAT2.HISATSingleEndHISAT2 software
Picard.CollectMultipleMetricsPicard software
Picard.CollectRnaMetricsPicard software
Picard.CollectDuplicationMetricsPicard software
HISAT2.HISAT2RSEMHISAT2 software
HISAT2.HISAT2RSEMSingleEndHISAT2 software
RSEM.RSEMExpressionRSEM software
GroupMetricsOutputs.GroupQCOutputssctools software
LoomUtils.SmartSeq2LoomOutputpython3 software

Part 1: Quality Control Tasks

1.1 Align reads to the genome using HISAT2

HISAT2 is a fast, cost-efficient alignment tool that can determine the presence of non-transcript sequences and true transcript sequences, taking into account the presence of single-nucleotide polymorphisms (Kim et al.,2019). The Smart-seq2 Single Sample workflow uses the HISAT2 task to call HISAT2 and perform graph-based alignment of paired- or single-end reads (in the form of FASTQ files) to a reference genome. This task requires a reference index which can be built using the BuildIndices.wdl documentation. The outputs of the task include a genome-aligned BAM file, a BAM index, and an alignment log file.

1.2 Calculate summary metrics using Picard

Picard is a suite of command line tools used for manipulating high-throughput sequencing data. The Picard task uses Picard tools to calculate quality control measurements on the HISAT2 genome-aligned BAM file. The task requires a reference genome fasta, a RefFlat gene annotation file, and an RNA intervals file (see the BuildIndices workflow for creating references). The outputs of the task are text and PDF files for each metric.

The Picard task generates QC metrics by using three sub-tasks:

Part 2: Transcriptome Quantification Tasks

2.1 Align reads to the transcriptome using HISAT2

The HISAT2.HISAT2RSEM task uses HISAT2 to align reads to the reference transcriptome. The task requires the hisat2_ref_trans_index file and the sample FASTQ files as input. The output of this task is a transcriptome-aligned BAM file and an alignment log file.

2.2 Quantify gene expression using RSEM

RSEM is a software package for quantifying transcript abundances (Li and Dewey, 2011). The Smart-seq2 Single Sample workflow uses the RSEM.RSEMExpression task to calculate expression estimates from a transcriptome-aligned BAM file using the rsem_ref_index file for reference input. The RSEM tool rsem-calculate-expression is used to estimate gene/isoform expression.

The RSEM task returns the following output files:

  • rsem_gene: gene level expression estimates in FPKM, TPM, and counts ("expected_count")
  • rsem_isoform: isoform level expression estimates in FPKM, TPM, and counts ("expected_count")
  • rsem_time: the time consumed by aligning reads
  • rsem_cnt: alignment statistics
  • rsem_model: RNA-seq model parameters
  • rsem_theta: fraction of reads resulting from background

Only the rsem_gene (TPM and expected_count), rsem_isoform (TPM and expected_count), and rsem_cnt files are used for the final outputs of the Smart-seq2 Single Sample workflow.

After the Smart-seq2 workflow generates HISAT2, Picard and RSEM metrics, the GroupMetricsOutputs task combines most metrics into a "group_results" CSV file array. The workflow uses this array in downstream Loom file generation. The following HISAT2, Picard and RSEM outputs are inputs for this task:

  • HISAT2_log_file
  • base_call_dist_metrics
  • gc_bias_detail_metrics
  • pre_adapter_details_metrics
  • pre_adapter_summary_metrics
  • bait_bias_detail_metrics
  • error_summary_metrics
  • alignment_summary_metrics
  • dedup_metrics
  • rna_metrics
  • gc_bias_summary_metrics
  • HISAT2RSEM_log_file
  • RSEMExpression.rsem_cnt

Outputs

The table below details the final outputs of the SS2 workflow.

Output NameOutput DescriptionOutput Format
Pipeline versionVersion of the processing pipeline run on this datastring
aligned_bamHISAT2 out BAMBAM
bam_indexHISAT2 BAM indexBAM
insert_size_metricsPicard insert size metricstxt
quality_distribution_metricsPicard quality distribution metricstxt
quality_by_cycle_metricsPicard quality by cycle metricstxt
bait_bias_summary_metricsPicard bait bias summary metricstxt
rna_metricsPicard RNA metricstxt
group_resultsArray of Picard metric files generated by the GroupQCOutputs taskCSV
rsem_gene_resultsRSEM file containing gene-level expression estimatestab-delimited
rsem_isoform_resultsRSEM file containing isoform-level expression estimatestab delimited
loom_output_filesLoom file containing RSEM TPM and expected_counts and metricsLoom

The final count matrix is in Loom format and includes only the group_results CSV files and the rsem_gene_results (TPM and expected_count). Note that the TPMs are contained in the Loom “matrix", whereas expected_count is contained in the “layers/estimated_counts".

Zarr Array Deprecation Notice June 2020

Please note that we have deprecated the previously used Zarr array output. The pipeline now uses the Loom file format as the default output.

Validation

The SS2 pipeline has been validated for processing human and mouse, stranded or unstranded, paired- or single-end, and plate- or fluidigm-based Smart-seq2 datasets (see links to validation reports in the table below).

Workflow ConfigurationLink to Report
Mouse paired-endReport
Human and mouse single-endReport
Human stranded fluidigmReport

Versioning

All SS2 release notes are documented in the Smartseq2 Single Sample changelog.

Citing the Smart-seq2 Single Sample Pipeline

If you use the Smart-seq2 Single Sample Pipeline in your research, please identify the pipeline in your methods section using the Smart-seq2 Single Sample SciCrunch resource identifier.

  • Ex: Smart-seq2 Single Sample Pipeline (RRID:SCR_021228)

Please also consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Consortia Support

This pipeline is supported and used by the Human Cell Atlas (HCA) project.

If your organization also uses this pipeline, we would love to list you! Please reach out to us by contacting the WARP team.

Have Suggestions?

Please help us make our tools better by contacting the WARP team for pipeline-related suggestions or questions.