Skip to main content

scATAC Overview

Pipeline VersionDate UpdatedDocumentation AuthorQuestions or Feedback
scATAC 1.3.2May 2024Elizabeth KiernanPlease file an issue in WARP.

scATAC_diagram

Introduction

The scATAC Pipeline was developed by the Broad DSP Pipelines team to process single cell/nucleus ATAC-seq datasets. The pipeline is based on the SnapATAC pipeline described by Fang et al. (2019). Overall, the pipeline uses the python module SnapTools to align and process paired reads in the form of FASTQ files. It produces an hdf5-structured Snap file that includes a cell-by-bin count matrix. In addition to the Snap file, the final outputs include a GA4GH compliant aligned BAM and QC metrics.

Want to use the scATAC Pipeline for your publication?

Check out the scATAC Publication Methods to get started!

Quick Start Table

Pipeline FeaturesDescriptionSource
Assay TypeSingle nucleus ATAC-seqPreprint here
Overall WorkflowGenerates Snap file with cell-by-bin matrixCode available from GitHub
Workflow LanguageWDL 1.0openWDL
AlignerBWALi H. and Durbin R., 2009
Data Input File FormatFile format in which sequencing data is providedPaired-end FASTQs with cell barcodes appended to read names (read barcode demultiplexing section here)
Data Output File FormatFile formats in which scATAC output is providedBAM, Snap

Set-up

Workflow Installation and Requirements

The scATAC workflow is written in the Workflow Description Language WDL and can be downloaded by cloning the GitHub WARP repository. The workflow can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. For the latest workflow version and release notes, please see the scATAC changelog.

Pipeline Inputs

The pipeline inputs are detailed in the table below. You can test the workflow by using the human_example.json example configuration file.

Input nameInput typeDescription
input_fastq1FileFASTQ file of the first reads (R1)
input_fastq2FileFASTQ file of the second reads (R2)
input_idStringUnique identifier for the sample; will also be used to name the output files
input_referenceFileReference bundle that is generated with bwa-mk-index-wdl found here
genome_nameStringName of the genomic reference (name that precedes the “.tar” in the input_reference)
output_bamStringName for the output BAM; default is set to the input_id + "_aligned_bam"
bin_size_listStringList of bin sizes used to generate cell-by-bin matrices; default is 10000 bp

Input File Preparation

R1 and R2 FASTQ Preparation

The scATAC workflow requires paired reads in the form FASTQ files with the cell barcodes appended to the readnames. A description of the barcode demultiplexing can be found on the SnapATAC documentation (see barcode demultiplexing section here). The full cell barcode must form the first part of the read name (for both R1 and R2 files) and be separated from the rest of the line by a colon. You can find an example python code to perform demultiplexing in the SnapTools documentation here. The codeblock below demonstrates the correct format.

@CAGTTGCACGTATAGAACAAGGATAGGATAAC:7001113:915:HJ535BCX2:1:1106:1139:1926 1:N:0:0
ACCCTCCGTGTGCCAGGAGATACCATGAATATGCCATAGAACCTGTCTCT
+
DDDDDIIIIIIIIIIIIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Input_reference Preparation

The input_reference is a BWA compatible reference bundle in TAR file format. You can create this BWA reference using the accessory workflow here.

Workflow Tasks and Tools

The scATAC workflow is divided into multiple tasks which are described in the table below. The table also links to the Docker Image for each task and to the documentation or code for the relevant software tool parameters.

TaskTask DescriptionTool Docker ImageParameter Descriptions or Code
AlignPairedEndAlign the modified FASTQ files to the genomesnaptools-bwa:1.0.0-1.4.8-0.7.17-1690310027SnapTools documentation
SnapPreInitial generation of snap filesnaptools-bwa:1.0.0-1.4.8-0.7.17-1690310027SnapTools documentation
SnapCellByBinBinning of data by genomic binssnaptools-bwa:1.0.0-1.4.8-0.7.17-1690310027SnapTools documentation
MakeCompliantBAMGeneration of a GA4GH compliant BAMwarp-tools:1.0.1-1690997141Code
BreakoutSnapExtraction of tables from snap file into text format (for testing and user availability)snap-breakout:0.0.1Code

Task Summary

AlignPairedEnd

The AlignPairedEnd task takes the barcode demultiplexed FASTQ files and aligns reads to the genome using the BWA aligner. It uses the SnapTools min_cov parameter to set the minimum number of barcodes a fragment requires to be included in the final output. This parameter is set to 0. The final task output is an aligned BAM file.

SnapPre

The SnapPre task uses SnapTools to perform preprocessing and filtering on the aligned BAM. The task outputs are a Snap file and QC metrics. The table below details the filtering parameters for the task.

Filtering Parameters
ParameterDescriptionValue
--min-mapqFragments with mappability less than value will be filtered30
--min-flenFragments of length shorter than min_flen will be filtered0
--max-flenFragments of length bigger than min_flen will be filtered1000
--keep-chrmBoolean variable indicates whether to keep reads mapped to chrMTRUE
--keep-singleBoolean variable indicates whether to keep single-end readsTRUE
--keep-secondaryBoolean variable indicates whether to keep secondary alignmentsFALSE
--max-numMax number of barcodes to be stored. Based on the coverage, top max_barcode barcodes are selected and stored1000000
--min-covFragments with less than min-cov number of barcodes will be filtered100

SnapCellByBin

The SnapCellByBin task uses the Snap file to create cell-by-bin count matrices in which a “1” represents a bin with an accessible region of the genome and a “0” represents an inaccessible region. The bin_size_list sets the bin size to 10,000 bp by default but can be changed by specifying the value in the inputs to the workflow.

MakeCompliantBAM

The MakeCompliantBAM task uses a custom python script to make a GA4GH compliant BAM by moving the cellular barcodes in the read names to the CB tag.

BreakoutSnap

The BreakoutSnap task extracts data from the Snap file and exports it to individual CSVs. These CSV outputs are listed in the table in the Outputs section below. The code is available here.

Outputs

The main outputs of the scATAC workflow are the Snap file, Snap QC metrics, and the GA4GH compliant BAM file. All files with the prefix “breakout” are CSV files containing individual pieces of data from the Snap. The sessions for the Snap file are described in the SnapTools documentation. Additionally, you can read detailed information on the Snap file fields for each session (select "View Raw").

Output File NameDescription
output_snap_qcQuality control file corresponding to the snap file
output_snapOutput snap file (in hdf5 container format)
output_aligned_bamOutput BAM file, compliant with GA4GH standards
breakout_barcodesText file containing the FM ('Fragment session') barcodeLen and barcodePos fields
breakout_fragmentsText file containing the FM ('Fragments session') fragChrom, fragLen, and fragStart fields
breakout_binCoordinatesText file with the AM session ('Cell x bin accessibility' matrix) binChrom and binStart fields
breakout_binCountsText file with the AM session ('Cell x bin accessibility' matrix) idx, idy, and count fields
breakout_barcodesSectionText file with the data from the BD session ('Barcode session' table)
output_pipeline_versionStrong describing the version of the scATAC pipeline used.
Snap QC Metrics

The following table details the metrics available in the output_snap_qc file.

QC MetricAbbreviation
Total number of unique barcodesNo abbreviation
Total number of fragmentsTN
Total number of uniquely mappedUM
Total number of single endsSE
Total number of secondary alignmentsSA
Total number of paired endsPE
Total number of proper pairedPP
Total number of proper frag lenPL
Total number of usable fragmentsUS
Total number of unique fragmentsUQ
Total number of chrM fragmentsCM

Running on Terra

Terra is a public, cloud-based platform for biomedical research. If you would like to try the scATAC workflow (previously named "snap-atac") in Terra, you can import the most recent version from Dockstore. Additionally, there is a public scATAC workspace preloaded with the scATAC workflow and downsampled data.

Versioning

All scATAC workflow releases are documented in the scATAC changelog.

Citing the scATAC Pipeline

If you use the scATAC Pipeline in your research, please identify the pipeline in your methods section using the scATAC SciCrunch resource identifier.

  • Ex: scATAC Pipeline (RRID:SCR_018919)

Please also consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Consortia Support

This pipeline is supported and used by the BRAIN Initiative Cell Census Network (BICCN).

If your organization also uses this pipeline, we would love to list you! Please reach out to us by filing an issue in WARP.

Pipeline Improvements

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.