
scATAC Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| scATAC 1.2.0 | January 04 2021 | Elizabeth Kiernan | Please file GitHub issues in WARP or contact Kylee Degatano |

[Figure: scATAC workflow diagram]

Introduction

The scATAC Pipeline was developed by the Broad DSP Pipelines team to process single cell/nucleus ATAC-seq datasets. The pipeline is based on the SnapATAC pipeline described by Fang et al. (2019). Overall, the pipeline uses the Python module SnapTools to align and process paired reads in the form of FASTQ files. It produces an HDF5-structured Snap file that includes a cell-by-bin count matrix. In addition to the Snap file, the final outputs include a GA4GH-compliant aligned BAM and QC metrics.

Want to use the scATAC Pipeline for your publication?

Check out the scATAC Publication Methods to get started!

Quick Start Table

| Pipeline Features | Description | Source |
| --- | --- | --- |
| Assay Type | Single nucleus ATAC-seq | Preprint here |
| Overall Workflow | Generates Snap file with cell-by-bin matrix | Code available from GitHub |
| Workflow Language | WDL 1.0 | openWDL |
| Aligner | BWA | Li H. and Durbin R., 2009 |
| Data Input File Format | File format in which sequencing data is provided | Paired-end FASTQs with cell barcodes appended to read names (read barcode demultiplexing section here) |
| Data Output File Format | File formats in which scATAC output is provided | BAM, Snap |

Set-up

Workflow Installation and Requirements

The scATAC workflow is written in the Workflow Description Language (WDL) and can be downloaded by cloning the GitHub WARP repository. The workflow can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. For the latest workflow version and release notes, please see the scATAC changelog.

Pipeline Inputs

The pipeline inputs are detailed in the table below. You can test the workflow by using the human_example.json example configuration file.

| Input name | Input type | Description |
| --- | --- | --- |
| input_fastq1 | File | FASTQ file of the first reads (R1) |
| input_fastq2 | File | FASTQ file of the second reads (R2) |
| input_id | String | Unique identifier for the sample; will also be used to name the output files |
| input_reference | File | Reference bundle that is generated with bwa-mk-index-wdl found here |
| genome_name | String | Name of the genomic reference (name that precedes the “.tar” in the input_reference) |
| output_bam | String | Name for the output BAM; default is set to the input_id + "_aligned_bam" |
| bin_size_list | String | List of bin sizes used to generate cell-by-bin matrices; default is 10000 bp |
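
The inputs above can be collected into a JSON configuration file. The following is a minimal sketch, not the contents of human_example.json: it assumes the top-level workflow is named "scATAC" (so keys take the form "scATAC.<input name>"), and the sample ID and file paths are hypothetical. The helper names are illustrative only. output_bam is omitted so that the workflow default applies.

```python
import json

# Assumption: the top-level workflow is named "scATAC", so Cromwell-style
# input keys take the form "scATAC.<input name>".
WORKFLOW_NAME = "scATAC"


def genome_name_from_reference(reference_path):
    """Return the part of the bundle's file name that precedes ".tar"."""
    file_name = reference_path.rsplit("/", 1)[-1]
    if not file_name.endswith(".tar"):
        raise ValueError(f"expected a .tar reference bundle, got {file_name!r}")
    return file_name[: -len(".tar")]


def build_inputs(input_id, fastq1, fastq2, reference, bin_size_list="10000"):
    """Assemble an inputs dictionary keyed by workflow-qualified input names."""
    values = {
        "input_id": input_id,
        "input_fastq1": fastq1,
        "input_fastq2": fastq2,
        "input_reference": reference,
        "genome_name": genome_name_from_reference(reference),
        "bin_size_list": bin_size_list,
    }
    return {f"{WORKFLOW_NAME}.{name}": value for name, value in values.items()}


# Hypothetical sample ID and file paths, for illustration only.
inputs = build_inputs(
    "sample1",
    "sample1_R1.fastq.gz",
    "sample1_R2.fastq.gz",
    "references/hg38.tar",
)
print(json.dumps(inputs, indent=2))
```

Deriving genome_name from the bundle's file name keeps the two inputs consistent, which is the relationship the table describes.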

Input File Preparation

R1 and R2 FASTQ Preparation

The scATAC workflow requires paired reads in the form of FASTQ files with the cell barcodes appended to the read names. A description of the barcode demultiplexing can be found in the SnapATAC documentation (see barcode demultiplexing section here). The full cell barcode must form the first part of the read name (for both R1 and R2 files) and be separated from the rest of the line by a colon. You can find example Python code to perform demultiplexing in the SnapTools documentation here. The example record below demonstrates the correct format.

@CAGTTGCACGTATAGAACAAGGATAGGATAAC:7001113:915:HJ535BCX2:1:1106:1139:1926 1:N:0:0
ACCCTCCGTGTGCCAGGAGATACCATGAATATGCCATAGAACCTGTCTCT
+
DDDDDIIIIIIIIIIIIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIII
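
To make the read-name convention concrete, the sketch below checks a FASTQ header line for a leading barcode and shows how a barcode could be placed in front of an existing read name. It is not the SnapTools demultiplexing script. The function names are hypothetical, and the pattern assumes barcodes contain only the characters A, C, G, T, and N.

```python
import re

# Assumption: a barcode is a run of A/C/G/T/N characters that directly
# follows "@" and is separated from the rest of the read name by a colon.
READ_NAME_PATTERN = re.compile(r"^@([ACGTN]+):(.+)$")


def split_barcode(header_line):
    """Split a FASTQ header line into (barcode, rest of the read name)."""
    match = READ_NAME_PATTERN.match(header_line.rstrip("\n"))
    if match is None:
        raise ValueError(f"no leading barcode found in {header_line!r}")
    return match.group(1), match.group(2)


def prepend_barcode(header_line, barcode):
    """Return the header line with the barcode placed before the read name."""
    if not header_line.startswith("@"):
        raise ValueError(f"not a FASTQ header line: {header_line!r}")
    return f"@{barcode}:{header_line[1:]}"


header = (
    "@CAGTTGCACGTATAGAACAAGGATAGGATAAC:"
    "7001113:915:HJ535BCX2:1:1106:1139:1926 1:N:0:0"
)
barcode, rest = split_barcode(header)
print(barcode)
print(rest)
```

The same check should hold for the matching record in both the R1 and R2 files, since both must carry the barcode.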

Input_reference Preparation

The input_reference is a BWA-compatible reference bundle in TAR file format. You can create this BWA reference using the accessory workflow here.

Workflow Tasks and Tools

The scATAC workflow is divided into multiple tasks which are described in the table below. The table also links to the Docker Image for each task and to the documentation or code for the relevant software tool parameters.

| Task | Task Description | Tool Docker Image | Parameter Descriptions or Code |
| --- | --- | --- | --- |
| AlignPairedEnd | Align the modified FASTQ files to the genome | snaptools:0.0.1 | SnapTools documentation |
| SnapPre | Initial generation of snap file | snaptools:0.0.1 | SnapTools documentation |
| SnapCellByBin | Binning of data by genomic bins | snaptools:0.0.1 | SnapTools documentation |
| MakeCompliantBAM | Generation of a GA4GH compliant BAM | snaptools:0.0.1 | Code |
| BreakoutSnap | Extraction of tables from snap file into text format (for testing and user availability) | snap-breakout:0.0.1 | Code |

Task Summary

AlignPairedEnd

The AlignPairedEnd task takes the barcode demultiplexed FASTQ files and aligns reads to the genome using the BWA aligner. It uses the SnapTools min_cov parameter to set the minimum number of barcodes a fragment requires to be included in the final output. This parameter is set to 0. The final task output is an aligned BAM file.

SnapPre

The SnapPre task uses SnapTools to perform preprocessing and filtering on the aligned BAM. The task outputs are a Snap file and QC metrics. The table below details the filtering parameters for the task.

Filtering Parameters

| Parameter | Description | Value |
| --- | --- | --- |
| --min-mapq | Fragments with a mapping quality (MAPQ) less than this value will be filtered | 30 |
| --min-flen | Fragments shorter than min_flen will be filtered | 0 |
| --max-flen | Fragments longer than max_flen will be filtered | 1000 |
| --keep-chrm | Boolean indicating whether to keep reads mapped to chrM | TRUE |
| --keep-single | Boolean indicating whether to keep single-end reads | TRUE |
| --keep-secondary | Boolean indicating whether to keep secondary alignments | FALSE |
| --max-num | Maximum number of barcodes to be stored. Based on coverage, the top max_num barcodes are selected and stored | 1000000 |
| --min-cov | Fragments with fewer than min_cov barcodes will be filtered | 100 |
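
The per-fragment thresholds in the table can be read as a simple predicate. The sketch below is an illustration of that logic only, not the SnapTools implementation: the Fragment and FilterSettings classes and their field names are hypothetical, and the barcode-level parameters (--max-num and --min-cov) are left out because they depend on totals across the whole dataset.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical, simplified view of one aligned fragment."""
    mapq: int
    length: int
    chrom: str
    is_paired: bool
    is_secondary: bool


@dataclass
class FilterSettings:
    """Defaults mirror the values listed in the filtering table."""
    min_mapq: int = 30
    min_flen: int = 0
    max_flen: int = 1000
    keep_chrm: bool = True
    keep_single: bool = True
    keep_secondary: bool = False


def passes_filters(fragment, settings=None):
    """Return True if the fragment survives every per-fragment filter."""
    settings = settings or FilterSettings()
    if fragment.mapq < settings.min_mapq:
        return False
    if fragment.length < settings.min_flen or fragment.length > settings.max_flen:
        return False
    if fragment.chrom == "chrM" and not settings.keep_chrm:
        return False
    if not fragment.is_paired and not settings.keep_single:
        return False
    if fragment.is_secondary and not settings.keep_secondary:
        return False
    return True


good = Fragment(mapq=60, length=180, chrom="chr1", is_paired=True, is_secondary=False)
too_long = Fragment(mapq=60, length=1500, chrom="chr1", is_paired=True, is_secondary=False)
print(passes_filters(good), passes_filters(too_long))
```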

SnapCellByBin

The SnapCellByBin task uses the Snap file to create cell-by-bin count matrices in which a “1” represents a bin with an accessible region of the genome and a “0” represents an inaccessible region. The bin_size_list sets the bin size to 10,000 bp by default but can be changed by specifying the value in the inputs to the workflow.
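
As an illustration of what a binarized cell-by-bin matrix holds, the sketch below assigns fragments to fixed-width bins and reports 1 or 0 per cell and bin. It is a simplified model, not the SnapTools algorithm: the function names are hypothetical, coordinates are assumed to be 0-based with an exclusive end, and a fragment is assumed to mark every bin it overlaps.

```python
from collections import defaultdict


def cell_by_bin(fragments, bin_size=10_000):
    """Map each barcode to the set of (chromosome, bin index) pairs it covers.

    `fragments` is an iterable of (barcode, chrom, start, end) tuples.
    Assumption: 0-based start, exclusive end.
    """
    matrix = defaultdict(set)
    for barcode, chrom, start, end in fragments:
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        for bin_index in range(first_bin, last_bin + 1):
            matrix[barcode].add((chrom, bin_index))
    return matrix


def is_accessible(matrix, barcode, chrom, bin_index):
    """Return 1 if the cell has a fragment in the bin, otherwise 0."""
    return 1 if (chrom, bin_index) in matrix.get(barcode, ()) else 0


# Hypothetical barcodes and coordinates, for illustration only.
fragments = [
    ("AAAC", "chr1", 9_900, 10_100),   # spans bins 0 and 1
    ("AAAC", "chr1", 25_000, 25_200),  # falls in bin 2
    ("GGGT", "chr2", 0, 150),          # falls in bin 0
]
matrix = cell_by_bin(fragments)
print([is_accessible(matrix, "AAAC", "chr1", i) for i in range(4)])
```

Storing only the occupied bins per cell keeps the structure sparse, which matters because most bins are empty for any single cell.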

MakeCompliantBAM

The MakeCompliantBAM task uses a custom Python script (here) to make a GA4GH-compliant BAM by moving the cellular barcodes in the read names to the CB tag.
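
The idea of moving a barcode from the read name to the CB tag can be sketched on a single SAM-format text line. This is not the script used by the task, which operates on BAM files. The function name is hypothetical, and the example alignment record is made up for illustration.

```python
def move_barcode_to_cb_tag(sam_line):
    """Move a leading "BARCODE:" from the read name into a CB:Z: tag.

    Header lines (starting with "@") are returned unchanged.
    """
    if sam_line.startswith("@"):
        return sam_line
    fields = sam_line.rstrip("\n").split("\t")
    barcode, separator, read_name = fields[0].partition(":")
    if not separator or not barcode or not read_name:
        raise ValueError(f"read name has no barcode prefix: {fields[0]!r}")
    fields[0] = read_name
    fields.append(f"CB:Z:{barcode}")
    return "\t".join(fields)


# Hypothetical alignment record with the 11 mandatory SAM fields.
record = "\t".join(
    ["ACGTACGT:read1", "99", "chr1", "100", "60", "4M", "=", "300", "204", "ACGT", "IIII"]
)
converted = move_barcode_to_cb_tag(record)
print(converted)
```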

BreakoutSnap

The BreakoutSnap task extracts data from the Snap file and exports it to individual CSVs. These CSV outputs are listed in the table in the Outputs section below. The code is available here.
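
A breakout CSV can be loaded with the Python standard library. The sketch below reads a bin-counts table into a dictionary. It assumes the file has a header row naming the idx, idy, and count fields, and that idx indexes cells while idy indexes bins. Neither assumption is confirmed here, so check an actual output file before relying on it. The function name and example rows are hypothetical.

```python
import csv
import io


def read_bin_counts(handle):
    """Read rows of idx, idy, count into a {(idx, idy): count} dictionary.

    Assumption: the CSV has a header row with the columns idx, idy, count.
    """
    counts = {}
    for row in csv.DictReader(handle):
        counts[(int(row["idx"]), int(row["idy"]))] = int(row["count"])
    return counts


# Hypothetical file contents, for illustration only.
example_csv = io.StringIO("idx,idy,count\n1,10,1\n1,42,1\n2,10,1\n")
counts = read_bin_counts(example_csv)
print(counts)
```

For a real output, replace the in-memory example with a file opened using open(path, newline="").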

Outputs

The main outputs of the scATAC workflow are the Snap file, Snap QC metrics, and the GA4GH compliant BAM file. All files with the prefix “breakout” are CSV files containing individual pieces of data from the Snap. The sessions for the Snap file are described in the SnapTools documentation. Additionally, you can read detailed information on the Snap file fields for each session (select "View Raw").

| Output File Name | Description |
| --- | --- |
| output_snap_qc | Quality control file corresponding to the snap file |
| output_snap | Output snap file (in hdf5 container format) |
| output_aligned_bam | Output BAM file, compliant with GA4GH standards |
| breakout_barcodes | Text file containing the FM ('Fragment session') barcodeLen and barcodePos fields |
| breakout_fragments | Text file containing the FM ('Fragment session') fragChrom, fragLen, and fragStart fields |
| breakout_binCoordinates | Text file with the AM session ('Cell x bin accessibility' matrix) binChrom and binStart fields |
| breakout_binCounts | Text file with the AM session ('Cell x bin accessibility' matrix) idx, idy, and count fields |
| breakout_barcodesSection | Text file with the data from the BD session ('Barcode session' table) |

Snap QC Metrics

The following table details the metrics available in the output_snap_qc file.

| QC Metric | Abbreviation |
| --- | --- |
| Total number of unique barcodes | No abbreviation |
| Total number of fragments | TN |
| Total number of uniquely mapped | UM |
| Total number of single ends | SE |
| Total number of secondary alignments | SA |
| Total number of paired ends | PE |
| Total number of proper paired | PP |
| Total number of proper frag len | PL |
| Total number of usable fragments | US |
| Total number of unique fragments | UQ |
| Total number of chrM fragments | CM |

Running on Terra

Terra is a public, cloud-based platform for biomedical research. If you would like to try the scATAC workflow (previously named "snap-atac") in Terra, you can import the most recent version from Dockstore. Additionally, there is a public scATAC workspace preloaded with the scATAC workflow and downsampled data.

Versioning

All scATAC workflow releases are documented in the scATAC changelog.

Citing the scATAC Pipeline

Please identify the pipeline in your methods section using the scATAC Pipeline's SciCrunch resource identifier.

  • Ex: scATAC Pipeline (RRID:SCR_018919)

Consortia Support

This pipeline is supported and used by the BRAIN Initiative Cell Census Network (BICCN).

If your organization also uses this pipeline, we would love to list you! Please reach out to us by contacting Kylee Degatano.

Pipeline Improvements

Please help us make our tools better by contacting Kylee Degatano for pipeline-related suggestions or questions.