scATAC Overview

Pipeline Version	Date Updated	Documentation Author	Questions or Feedback
scATAC 1.2.0	January 04 2021	Elizabeth Kiernan	Please file GitHub issues in WARP or contact the WARP team

scATAC_diagram

Introduction

The scATAC Pipeline was developed by the Broad DSP Pipelines team to process single cell/nucleus ATAC-seq datasets. The pipeline is based on the SnapATAC pipeline described by Fang et al. (2019). Overall, the pipeline uses the python module SnapTools to align and process paired reads in the form of FASTQ files. It produces an hdf5-structured Snap file that includes a cell-by-bin count matrix. In addition to the Snap file, the final outputs include a GA4GH compliant aligned BAM and QC metrics.

Want to use the scATAC Pipeline for your publication?

Check out the scATAC Publication Methods to get started!

Quick Start Table

Pipeline Features	Description	Source
Assay Type	Single nucleus ATAC-seq	Preprint here
Overall Workflow	Generates Snap file with cell-by-bin matrix	Code available from GitHub
Workflow Language	WDL 1.0	openWDL
Aligner	BWA	Li H. and Durbin R., 2009
Data Input File Format	File format in which sequencing data is provided	Paired-end FASTQs with cell barcodes appended to read names (read barcode demultiplexing section here)
Data Output File Format	File formats in which scATAC output is provided	BAM, Snap

Set-up

Workflow Installation and Requirements

The scATAC workflow is written in the Workflow Description Language WDL and can be downloaded by cloning the GitHub WARP repository. The workflow can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. For the latest workflow version and release notes, please see the scATAC changelog.

Pipeline Inputs

The pipeline inputs are detailed in the table below. You can test the workflow by using the human_example.json example configuration file.

Input name	Input type	Description
input_fastq1	File	FASTQ file of the first reads (R1)
input_fastq2	File	FASTQ file of the second reads (R2)
input_id	String	Unique identifier for the sample; will also be used to name the output files
input_reference	File	Reference bundle that is generated with bwa-mk-index-wdl found here
genome_name	String	Name of the genomic reference (name that precedes the “.tar” in the input_reference)
output_bam	String	Name for the output BAM; default is set to the `input_id` + "_aligned_bam"
bin_size_list	String	List of bin sizes used to generate cell-by-bin matrices; default is 10000 bp

Input File Preparation

R1 and R2 FASTQ Preparation

The scATAC workflow requires paired reads in the form FASTQ files with the cell barcodes appended to the readnames. A description of the barcode demultiplexing can be found on the SnapATAC documentation (see barcode demultiplexing section here). The full cell barcode must form the first part of the read name (for both R1 and R2 files) and be separated from the rest of the line by a colon. You can find an example python code to perform demultiplexing in the SnapTools documentation here. The codeblock below demonstrates the correct format.

@CAGTTGCACGTATAGAACAAGGATAGGATAAC:7001113:915:HJ535BCX2:1:1106:1139:1926 1:N:0:0
ACCCTCCGTGTGCCAGGAGATACCATGAATATGCCATAGAACCTGTCTCT
+
DDDDDIIIIIIIIIIIIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Input_reference Preparation

The input_reference is a BWA compatible reference bundle in TAR file format. You can create this BWA reference using the accessory workflow here.

Workflow Tasks and Tools

The scATAC workflow is divided into multiple tasks which are described in the table below. The table also links to the Docker Image for each task and to the documentation or code for the relevant software tool parameters.

Task	Task Description	Tool Docker Image	Parameter Descriptions or Code
AlignPairedEnd	Align the modified FASTQ files to the genome	snaptools-bwa:1.0.0-1.4.8-0.7.17-1690310027	SnapTools documentation
SnapPre	Initial generation of snap file	snaptools-bwa:1.0.0-1.4.8-0.7.17-1690310027	SnapTools documentation
SnapCellByBin	Binning of data by genomic bins	snaptools-bwa:1.0.0-1.4.8-0.7.17-1690310027	SnapTools documentation
MakeCompliantBAM	Generation of a GA4GH compliant BAM	warp-tools:1.0.1-1690997141	Code
BreakoutSnap	Extraction of tables from snap file into text format (for testing and user availability)	snap-breakout:0.0.1	Code

Task Summary

AlignPairedEnd

The AlignPairedEnd task takes the barcode demultiplexed FASTQ files and aligns reads to the genome using the BWA aligner. It uses the SnapTools min_cov parameter to set the minimum number of barcodes a fragment requires to be included in the final output. This parameter is set to 0. The final task output is an aligned BAM file.

SnapPre

The SnapPre task uses SnapTools to perform preprocessing and filtering on the aligned BAM. The task outputs are a Snap file and QC metrics. The table below details the filtering parameters for the task.

Filtering Parameters

Parameter	Description	Value
--min-mapq	Fragments with mappability less than value will be filtered	30
--min-flen	Fragments of length shorter than min_flen will be filtered	0
--max-flen	Fragments of length bigger than min_flen will be filtered	1000
--keep-chrm	Boolean variable indicates whether to keep reads mapped to chrM	TRUE
--keep-single	Boolean variable indicates whether to keep single-end reads	TRUE
--keep-secondary	Boolean variable indicates whether to keep secondary alignments	FALSE
--max-num	Max number of barcodes to be stored. Based on the coverage, top max_barcode barcodes are selected and stored	1000000
--min-cov	Fragments with less than min-cov number of barcodes will be filtered	100

SnapCellByBin

The SnapCellByBin task uses the Snap file to create cell-by-bin count matrices in which a “1” represents a bin with an accessible region of the genome and a “0” represents an inaccessible region. The bin_size_list sets the bin size to 10,000 bp by default but can be changed by specifying the value in the inputs to the workflow.

MakeCompliantBAM

The MakeCompliantBAM task uses a custom python script to make a GA4GH compliant BAM by moving the cellular barcodes in the read names to the CB tag.

BreakoutSnap

The BreakoutSnap task extracts data from the Snap file and exports it to individual CSVs. These CSV outputs are listed in the table in the Outputs section below. The code is available here.

Outputs

The main outputs of the scATAC workflow are the Snap file, Snap QC metrics, and the GA4GH compliant BAM file. All files with the prefix “breakout” are CSV files containing individual pieces of data from the Snap. The sessions for the Snap file are described in the SnapTools documentation. Additionally, you can read detailed information on the Snap file fields for each session (select "View Raw").

Output File Name	Description
output_snap_qc	Quality control file corresponding to the snap file
output_snap	Output snap file (in hdf5 container format)
output_aligned_bam	Output BAM file, compliant with GA4GH standards
breakout_barcodes	Text file containing the FM ('Fragment session') barcodeLen and barcodePos fields
breakout_fragments	Text file containing the FM ('Fragments session') fragChrom, fragLen, and fragStart fields
breakout_binCoordinates	Text file with the AM session ('Cell x bin accessibility' matrix) binChrom and binStart fields
breakout_binCounts	Text file with the AM session ('Cell x bin accessibility' matrix) idx, idy, and count fields
breakout_barcodesSection	Text file with the data from the BD session ('Barcode session' table)

Snap QC Metrics

The following table details the metrics available in the output_snap_qc file.

QC Metric	Abbreviation
Total number of unique barcodes	No abbreviation
Total number of fragments	TN
Total number of uniquely mapped	UM
Total number of single ends	SE
Total number of secondary alignments	SA
Total number of paired ends	PE
Total number of proper paired	PP
Total number of proper frag len	PL
Total number of usable fragments	US
Total number of unique fragments	UQ
Total number of chrM fragments	CM

Running on Terra

Terra is a public, cloud-based platform for biomedical research. If you would like to try the scATAC workflow (previously named "snap-atac") in Terra, you can import the most recent version from Dockstore. Additionally, there is a public scATAC workspace preloaded with the scATAC workflow and downsampled data.

Versioning

All scATAC workflow releases are documented in the scATAC changelog.

Citing the scATAC Pipeline

If you use the scATAC Pipeline in your research, please identify the pipeline in your methods section using the scATAC SciCrunch resource identifier.

Ex: scATAC Pipeline (RRID:SCR_018919)

Please also consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Consortia Support

This pipeline is supported and used by the BRAIN Initiative Cell Census Network (BICCN).

If your organization also uses this pipeline, we would love to list you! Please reach out to us by contacting the WARP team.

Pipeline Improvements

Please help us make our tools better by contacting the WARP team for pipeline-related suggestions or questions.

scATAC Overview

Introduction​

Quick Start Table​

Set-up​

Workflow Installation and Requirements​

Pipeline Inputs​

Input File Preparation​

R1 and R2 FASTQ Preparation​

Input_reference Preparation​

Workflow Tasks and Tools​

Task Summary​

AlignPairedEnd​

SnapPre​

Filtering Parameters​

SnapCellByBin​

MakeCompliantBAM​

BreakoutSnap​

Outputs​

Snap QC Metrics​

Running on Terra​

Versioning​

Citing the scATAC Pipeline​

Consortia Support​

Pipeline Improvements​