Slide-seq Overview
Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
---|---|---|---|
SlideSeq v3.4.2 | September, 2024 | Elizabeth Kiernan & Kaylee Mathews | Please file an issue in WARP. |
Introduction to the Slide-seq workflow
The Slide-seq workflow is an open-source, cloud-optimized pipeline developed in collaboration with the BRAIN Initiative Cell Census Network (BICCN) and the BRAIN Initiative Cell Atlas Network (BICAN). It supports the processing of spatial transcriptomic data generated with the Slide-seq (commercialized as Curio Seeker) assay.
Overall, the workflow corrects bead barcodes, aligns reads to the genome, generates a count matrix, calculates summary metrics for genes, barcodes, and UMIs, returns read outputs in BAM format, and returns counts in numpy matrix and h5ad file formats.
Slide-seq has been validated for analyzing mouse datasets generated with the Slide-seq assay. Learn more in the validation section.
Where did the pipeline come from?
The Broad Pipelines Development team worked closely with members of the Evan Macosko Laboratory to adapt an on-premise pipeline into the cloud-based Slide-seq pipeline described here. Thank you to James Webber, Jonah Langlieb, Nina Sachdev, and the rest of the Evan Macosko Lab for their commitment to FAIR Principles and for helping create the cloud-based Slide-seq pipeline.
Quickstart table
The following table provides a quick glance at the Slide-seq pipeline features:
Pipeline features | Description | Source |
---|---|---|
Assay type | Slide-seq | Rodriques et al. 2019 |
Overall workflow | Quality control and transcriptome quantification module | Code available from GitHub |
Workflow language | WDL 1.0 | openWDL |
Genomic Reference Sequence | M23 (GRCm38.p6) mouse genome primary sequence | GENCODE mouse reference files |
Transcriptomic reference annotation | M23 mouse transcriptome built with the BuildIndices workflow | GENCODE mouse GTF; modified version available in Broad’s public reference bucket |
Aligner and transcript quantification | STARsolo | Kaminow et al. 2021 |
Data input file format | File format in which sequencing data is provided | FASTQ |
Data output file format | File formats in which Slide-seq output is provided | BAM, Python NumPy arrays, and h5ad |
Set-up
Slide-seq installation
To download the latest Slide-seq release, see the release tags prefixed with "SlideSeq" on the WARP releases page. All Slide-seq pipeline releases are documented in the Slide-seq changelog.
To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.
If you’re running a Slide-seq workflow version prior to the latest release, the accompanying documentation for that release may be downloaded with the source code on the WARP releases page (see the source code folder website/docs/Pipelines/SlideSeq_Pipeline
).
The Slide-seq pipeline can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform. The Slide-seq public workspace on Terra contains the Slide-seq workflow, workflow configuration, required reference data and other inputs, and example testing data.
Inputs
The Slide-seq workflow inputs are specified in JSON configuration files. Example configuration files can be found in the test_inputs folder in the WARP repository.
Input descriptions
Parameter name | Description | Type |
---|---|---|
r1_fastq | Array of Read 1 FASTQ files split across multiple sequencing lanes; forward reads containing bead barcodes and Unique Molecular Identifiers (UMIs); order of files must match the r2_fastq array. | Array[File] |
r2_fastq | Array of Read 2 FASTQ files split across multiple sequencing lanes; reverse reads containing the cDNA fragment generated from captured mRNA; order of files must match the r1_fastq array. | Array[File] |
i1_fastq | Optional array of i1 (index) FASTQ files; index reads used for demultiplexing of multiple samples on one flow cell. | Array[File] |
input_id | Unique identifier describing the biological sample or replicate that corresponds with the FASTQ files; inserted into read group header and used to name output files. | String |
read_structure | Description of the UMI (M) and Barcode (C) positions in the Read 1 FASTQ; used to trim spacer sequences (X) for use by STARsolo; ex. "8C18X6C9M1X". | String |
tar_star_reference | TAR file containing a species-specific reference genome and GTF; generated using the BuildIndices workflow. | File |
annotations_gtf | GTF containing gene annotations used for gene tagging (must match GTF in STAR reference). | File |
output_bam_basename | Optional string used for the output BAM file basename. | String |
count_exons | Optional boolean indicating if the workflow should calculate exon counts; default is set to “true” and produces an h5ad file containing both whole-gene counts and exon counts in an additional layer; when set to “false”, an h5ad file containing only whole-gene counts is produced. | Boolean |
bead_locations | Whitelist TSV file containing bead barcodes and XY coordinates on a single line for each bead; determined by sequencing prior to mRNA transfer and library preparation. | File |
cloud_provider | String describing the cloud provider that should be used to run the workflow; value should be "gcp" or "azure". | String |
Pseudogene handling
The example Slide-seq reference files are created using the BuildIndices pipeline. The BuildIndices pipeline modifies a given GTF downloaded from GENCODE to only include biotypes that are listed in a tab separated file (biotypes.tsv). The example references do not include the pseudogene biotype. Learn more about Ensembl biotypes in the Ensembl overview.
Sample inputs for analyses in a Terra Workspace
The Slide-seq pipeline is currently available on the cloud-based platform Terra. After registering, you can access the Slide-seq public workspace which is preloaded with instructions and sample data. Please view the Support Center for more information on using the Terra platform.
Slide-seq tasks and tools
The Slide-seq workflow imports individual "tasks," also written in WDL script, from the WARP tasks folder.
Overall, the Slide-seq workflow:
- Calculates prealignment metrics.
- Filters, trims, and splits reads into < 30 GB FASTQs.
- Uses STARsolo to correct bead barcodes, align reads, and count genes.
- Calculates metrics.
- Merges the STAR outputs into NPY and NPZ arrays.
- Merges gene counts and metrics into a h5ad-formatted matrix.
The tools each Slide-seq task employs are detailed in the table below.
To see specific tool parameters, select the task WDL link in the table; then find the task and view the command {}
section of the task in the WDL script. To view or use the exact tool software, see the task's Docker image which is specified in the task WDL # runtime values
section as String docker =
.
Task name and WDL link | Tool | Software | Description |
---|---|---|---|
Metrics.FastqMetricsSlideSeq (alias = FastqMetrics) | fastq_metrics | warp-tools | Calculates prealignment metrics from the input FASTQ files. Outputs metrics files including the bead barcode distribution, UMI distribution, number of reads per cell and number of UMIs per cell. |
FastqProcessing.FastqProcessingSlidSeq (alias = SplitFastq) | fastq_slideseq | warp-tools | Filters reads, trims bead barcodes in read 1 to remove spacer sequences, and partitions the input FASTQ files by bead barcode to create an array of FASTQ files that are each ~ 30 GB. The function keeps all reads belonging to the same bead barcode in the same partitioned FASTQ file. |
StarAlign.STARsoloFastqSlideSeq (alias = STARsoloFastqSlideSeq) | STARsolo | STAR | For each of the partitioned FASTQ files, performs bead barcode correction, poly(A) tail trimming from adapters, alignment, gene annotation, and gene counting. Produces a count matrix and BAM file for each partition. |
Merge.MergeSortBamFiles (alias= MergeBam) | MergeSamFiles | Picard | Merges the array of BAM files into a single BAM and sorts in coordinate order. |
Metrics.CalculateGeneMetrics (alias = GeneMetrics) | TagSort | warp-tools | Sorts the BAM file by gene using the bead barcode (CB), molecule barcode (UB), and gene ID (GX) tags and computes gene metrics. |
Metrics.CalculateUMIsMetrics (alias = UMIsMetrics) | TagSort | warp-tools | Sorts the BAM file by gene using the bead barcode (CB), molecule barcode (UB), and gene ID (GX) tags and computes gene metrics. |
Metrics.CalculateCellMetrics (alias = CellMetrics) | TagSort | warp-tools | Sorts the BAM file by bead barcode (CB), molecule barcode (UB), and gene ID (GX) tags and computes bead barcode metrics. |
StarAlign.MergeStarOutput (alias = MergeStarOutputsExons) | create-npz-output.py | Python 3 | Creates a compressed raw NPY or NPZ file containing the STARsolo output features (NPY), barcodes (NPZ) and counts (NPZ). By default, count_exons is true and exon counts are included in output files. When count_exons is false, exon counts are excluded. |
H5adUtils.SingleNucleusOptimusH5adOutput (alias = OptimusH5adGenerationWithExons) | create_h5ad_optimus.py | Python 3 | Merges the gene counts, bead barcode metrics, and gene metrics data into an h5ad formatted bead-by-gene matrix. By default, the h5ad file contains whole-gene counts with exon counts in an additional layer. When count_exons is false, the task is run as SlideseqH5adGeneration and exon counts are excluded. |
1. Calculating prealignment metrics
The FastqMetricsSlideSeq task calculates prealignment metrics used for assessing data quality from the input FASTQ files. These metrics include the bead barcode distribution, UMI distribution, number of reads per cell and number of UMIs per cell. These metrics are included in the final outputs of the workflow.
2. Filtering reads, trimming barcodes, and splitting FASTQs
Read filtering
Although the function of the bead barcodes is to identify the spatial location of gene expression in the tissue, barcode errors can arise during sequencing that make it difficult to assign reads to the proper location.
The FastqProcessingSlidSeq task uses warp-tools to evaluate barcode errors by comparing the Read 1 FASTQ sequences against a whitelist of barcodes created by sequencing prior to the mRNA transfer and library preparation steps of the Slide-Seq protocol.
Barcodes that are more than one edit distance (Hamming distance) from any of the barcode sequences in the whitelist are filtered at this step and not included in the output. The remaining barcodes are corrected by the STARsoloFastqSlideSeq task.
Barcode trimming
The task uses warp-tools to trim spacer sequences from bead barcodes and UMIs for use by STARsolo, which requires continuous sample barcodes without spacer sequences between them. The input read_structure
is used to parse the barcodes and remove any bases with tags other than C or M, which represent the bead barcode and UMI, respectively. For example, with a read_structure
of 8C18X6C9M1X, bases represented by 18X and 1X are removed from the reads and the string of bases is rewritten with the structure 14C9M. Bases represented by tags other than X will also be removed during this step, so long as they are not C or M.
FASTQ splitting
For faster processing of large datasets, the task divides the input array of FASTQ files into multiple FASTQs, each containing reads from the same bead barcodes such that the FASTQ files are approximately 30 GB in size.
3. Correcting bead barcodes, trimming and aligning reads, and counting genes with STARsolo
Bead barcode correction
The STARsoloFastqSlideSeq task uses the STAR aligner to evaluate barcode errors by comparing the Read 1 FASTQ sequences against a whitelist of barcodes created by sequencing prior to the mRNA transfer and library preparation steps of the Slide-seq protocol.
Corrected barcodes are those that fall within one edit distance (Hamming distance) of one of the barcode sequences in the whitelist. Note that prior to running STARsolo, uncorrectable barcodes (with more than one mismatch) are filtered out by the FastqProcessingSlidSeq task.
Read trimming
Poly(A) tails are trimmed from reads using the STARsolo parameter --clip3pAdapterSeq AAAAAA
.
Alignment
STAR maps barcoded reads to the genome primary assembly reference (see the Quickstart table above for version information). The example references for the Slide-seq workflow were generated using the BuildIndices pipeline.
Gene annotation and counting
Prior to gene counting, STARsolo adds gene annotations which will vary depending on the value of count_exons
. By default, count_exons
is true and STARsolo will run with the parameter --soloFeatures Gene GeneFull
to produce both whole-gene and exon counts. If count_exons
is false, STARsolo will run with the parameter --soloFeatures GeneFull
to produce whole-gene counts.
The resulting BAM files are merged together into a single BAM using the MergeSortBamFiles task.
STARsolo outputs
The task’s output includes a coordinate-sorted BAM file containing the bead barcode-corrected reads and SAM attributes UB UR UY CR CB CY NH GX GN. Additionally, after counting, the task outputs three intermediate TSV files (features, barcodes, and matrix) used for downstream h5ad matrix generation.
4. Calculating metrics
The CalculateGeneMetrics, CalculateUMIsMetrics, and CalculateCellMetrics tasks use warp-tools to calculate summary metrics that help assess the per-bead and per-UMI quality of the data output each time this pipeline is run.
These metrics output from both tasks are included in the output h5ad matrix. A detailed list of these metrics is found in the Slide-seq Count Matrix Overview.
5. Merging the STAR outputs into NPY and NPZ arrays
The STARsolo output includes a features, barcodes, and matrix TSV for each of the partitioned FASTQ input files. The MergeStarOutput task merges each respective TSV. It uses a custom python script to convert the merged matrix, features, and barcodes output from STARsolo into an NPY (features and barcodes)- and NPZ (the matrix)-formatted file.
6. Merging counts and metrics data into h5ad-formatted matrix
The SingleNucleusOptimusH5adOutput task uses a custom python script to merge the converted STARsolo count matrix and the cell (bead) and gene metrics into an h5ad-formatted bead-by-gene matrix. These counts are raw and unfiltered.
Read full details for all the metrics in the Slide-seq Count Matrix Overview.
Gene counts
The type of gene counts in the h5ad file will vary depending on the value of the Slide-seq workflow input, count_exons
. By default, count_exons
is set to true and the output h5ad file will contain whole-gene counts with exon counts in an additional layer.
If the workflow is run with count_exons
set to false, the output h5ad file will contain whole-gene counts. Running the workflow in this configuration will cause the h5ad matrix to have fewer columns (bead barcodes) due to the difference in STARsolo counting mode.
You can determine which type of counts are in the h5ad by looking at the global attribute expression_data_type
.
After running the pipeline with count_exons
set to true, you can access whole-gene and exonic counts using the AnnData layers()
function. For example, adata.layers[“exon_counts”]` will return the exonic counts from the output h5ad.
6. Outputs
Output files of the pipeline include:
- Bead x Gene unnormalized count matrices in h5ad format.
- Unfiltered, sorted BAM file with barcode and downstream analysis tags.
- Bead metadata, including bead metrics.
- Gene metadata, including gene metrics.
The following table lists the output files produced from the pipeline. For samples that have sequenced over multiple lanes, the pipeline will output one merged version of each listed file.
Output Name | Filename, if applicable | Output Type | Output Format |
---|---|---|---|
pipeline_version_out | N/A | Version of the processing pipeline run on this data. | String |
genomic_reference_version | reference_version.txt | File containing genomic reference source, build, and annotation versions. | TXT |
bam | <output_bam_basename>.bam | Aligned BAM | BAM |
matrix | <input_id>_sparse_counts.npz | Converted sparse matrix file from the MergeStarOutputs task. | NPZ |
matrix_row_index | <input_id>_sparse_counts_row_index.npy | Index of beads in count matrix. | NPY |
matrix_col_index | <input_id>_sparse_counts_col_index.npy | Index of genes in count matrix. | NPY |
cell_metrics | cell-metrics.csv.gz | Cell metrics based on bead barcodes | Compressed CSV |
gene_metrics | gene-metrics.csv.gz | Gene metrics | Compressed CSV |
umi_metrics | umi-metrics.csv.gz | UMI metrics | Compressed CSV |
fastq_barcode_distribution | <input_id>.barcode_distribution_XC.txt | Metric file containing the distribution of reads per bead barcode that were calculated prior to alignment. | TXT |
fastq_umi_distribution | <input_id>.barcode_distribution_XM.txt | Metric file containing the distribution of reads per UMI that were calculated prior to alignment. | TXT |
fastq_reads_per_cell | <input_id>.numReads_perCell_XC.txt | Metric file containing the number of reads per barcode that were calculated prior to alignment. | TXT |
fastq_reads_per_umi | <input_id>.numReads_perCell_XM.txt | Metric file containing the number of reads per UMI that were calculated prior to alignment. | TXT |
h5ad_output_file | <input_id>.h5ad | h5ad file containing count data and metadata. | H5AD |
The h5ad matrix is the default output. This matrix contains the unnormalized (unfiltered) count matrices, as well as the gene and bead barcode metrics detailed in the Slide-seq Count Matrix Overview.
Validation against on-prem pipeline
Slide-seq has been validated for processing mouse spatial transcriptomic data generated with the Slide-seq assay. Slide-seq results are compared to those of an on-premise pipeline using BCL input files and python scripts. See the validation report for more information.
Versioning and testing
All Slide-seq pipeline releases are documented in the Slide-seq changelog and tested using plumbing and scientific test data. To learn more about WARP pipeline testing, see Testing Pipelines.
Citing the Slide-seq Pipeline
If you use the Slide-seq Pipeline in your research, please identify the pipeline in your methods section using the Slide-seq SciCrunch resource identifier.
- Ex: Slide-seq Pipeline (RRID:SCR_023379)
Please also consider citing our preprint:
Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1
Consortia support
This pipeline is supported by the BRAIN Initiative Cell Census Network (BICCN) and BRAIN Initiative Cell Atlas Network (BICAN).
If your organization also uses this pipeline, we would like to list you! Please reach out to us by filing an issue in WARP.
Feedback
Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.