
ATAC Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| :--- | :--- | :--- | :--- |
| 2.5.1 | November, 2024 | WARP Pipelines | Please file an issue in WARP. |

Introduction to the ATAC workflow

ATAC is an open-source, cloud-optimized pipeline developed in collaboration with members of the BRAIN Initiative (BICCN and BICAN Sequencing Working Group) and SCORCH (see Acknowledgements below). It supports the processing of 10x single-nucleus data generated with 10x Multiome ATAC-seq (Assay for Transposase-Accessible Chromatin), a technique used in molecular biology to assess genome-wide chromatin accessibility.

This workflow is the ATAC component of the Multiome wrapper workflow. It corrects cell barcodes (CBs), aligns reads to the genome, and produces a fragment file as well as per-barcode metrics and library-level metrics.

Quickstart table

The following table provides a quick glance at the ATAC pipeline features:

| Pipeline features | Description | Source |
| :--- | :--- | :--- |
| Assay type | 10x single cell or single nucleus ATAC | 10x Genomics |
| Overall workflow | Barcode correction, read alignment, and fragment quantification | Code available from GitHub |
| Workflow language | WDL 1.0 | openWDL |
| Genomic Reference Sequence | GRCh38 human genome primary sequence | GENCODE |
| Aligner | bwa-mem2 | Li H. and Durbin R., 2009 |
| Fragment quantification | SnapATAC2 | Zhang, K. et al., 2021 |
| Data input file format | File format in which sequencing data is provided | FASTQ |
| Data output file format | File formats in which ATAC output is provided | TSV, h5ad, BAM |
| Library-level metrics | The ATAC pipeline uses SnapATAC2 to generate library-level metrics in CSV format. | Library-level metrics |

Set-up

ATAC installation

To download the latest ATAC release, see the release tags prefixed with "Multiome" on the WARP releases page. All ATAC pipeline releases are documented in the ATAC changelog.

To discover and search releases, use the WARP command-line tool Wreleaser.

ATAC can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform.

Input Variables

The following describes the inputs of the ATAC workflow. For more details on how default inputs are set for the Multiome workflow, see the Multiome overview.

| Variable name | Description |
| :--- | :--- |
| read1_fastq_gzipped | FASTQ inputs (array of compressed read 1 FASTQ files). |
| read2_fastq_gzipped | FASTQ inputs (array of compressed read 2 FASTQ files containing cellular barcodes). |
| read3_fastq_gzipped | FASTQ inputs (array of compressed read 3 FASTQ files). |
| input_id | Output prefix/base name for all intermediate files and pipeline outputs. |
| cloud_provider | String describing the cloud provider that should be used to run the workflow; value must be "gcp" or "azure". |
| preindex | Boolean used for paired-tag data; not applicable to ATAC data types. Default is set to false. |
| atac_expected_cells | Number of cells loaded to create the ATAC library; default is set to 3000. |
| tar_bwa_reference | BWA reference (tar file containing the reference FASTA and corresponding files). |
| num_threads_bwa | Optional integer defining the number of CPUs per node for the BWA-mem alignment task (default: 128). |
| mem_size_bwa | Optional integer defining the memory size in GB for the BWA-mem alignment task (default: 512). |
| cpu_platform_bwa | Optional string defining the CPU platform for the BWA-mem alignment task (default: "Intel Ice Lake"). |
| annotations_gtf | CreateFragmentFile input variable: GTF file used by SnapATAC2 to calculate TSS sites for the fragment file. |
| chrom_sizes | CreateFragmentFile input variable: text file containing chromosome sizes for the genome build (e.g., hg38). |
| whitelist | Whitelist file for ATAC cellular barcodes. |
| adapter_seq_read1 | TrimAdapters input: adapter sequence for the read 1 FASTQ. |
| adapter_seq_read3 | TrimAdapters input: adapter sequence for the read 3 FASTQ. |
| vm_size | String defining the Azure virtual machine family for the workflow (default: "Standard_M128s"). |
| atac_nhash_id | Optional string identifying the library aliquot. When provided, it is echoed in the h5ad unstructured data. |
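For reference, a Cromwell inputs JSON wiring these variables together might look like the sketch below. All file paths are placeholders, and the "ATAC." key prefix is an assumption about the workflow block's name; check both against the released WDL before use.

```python
import json

# Hypothetical Cromwell inputs for the ATAC workflow. File paths are
# placeholders, and the "ATAC." prefix assumes the workflow block is
# named ATAC -- verify against the released WDL.
inputs = {
    "ATAC.read1_fastq_gzipped": ["sample_S1_L001_R1_001.fastq.gz"],
    "ATAC.read2_fastq_gzipped": ["sample_S1_L001_R2_001.fastq.gz"],  # cell barcodes
    "ATAC.read3_fastq_gzipped": ["sample_S1_L001_R3_001.fastq.gz"],
    "ATAC.input_id": "sample",
    "ATAC.cloud_provider": "gcp",
    "ATAC.tar_bwa_reference": "GRCh38_bwa_reference.tar",
    "ATAC.annotations_gtf": "annotations.gtf",
    "ATAC.chrom_sizes": "hg38.chrom.sizes",
    "ATAC.whitelist": "atac_barcode_whitelist.txt",
}

# Write the inputs file that would be passed to Cromwell or Terra.
with open("atac_inputs.json", "w") as f:
    json.dump(inputs, f, indent=2)
```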

ATAC tasks and tools

Overall, the ATAC workflow:

  1. Identifies optimal parameters for performing CB correction and alignment.
  2. Corrects CBs and partitions FASTQs by CB.
  3. Aligns reads.
  4. Generates a fragment file.
  5. Calculates per cell barcode fragment metrics.
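Steps 1 and 2 center on cell barcode correction. As a simplified sketch of the underlying idea (not the actual fastqprocess implementation, which also handles barcode orientation and base qualities), a raw barcode is kept if it matches the whitelist exactly, or rescued if exactly one whitelist entry lies within Hamming distance 1:

```python
# Simplified whitelist-based cell barcode (CB) correction sketch.
# A raw barcode (CR) is accepted on an exact whitelist match, or
# corrected when exactly one whitelist entry is one mismatch away.

def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(raw, whitelist):
    """Return the corrected barcode, or None if no unique match exists."""
    if raw in whitelist:
        return raw
    candidates = [wl for wl in whitelist
                  if len(wl) == len(raw) and hamming(raw, wl) == 1]
    return candidates[0] if len(candidates) == 1 else None

whitelist = {"AAACGG", "TTTGCA", "GGGTAC"}
print(correct_barcode("AAACGG", whitelist))  # exact match
print(correct_barcode("AAACGT", whitelist))  # one mismatch, rescued
print(correct_barcode("AAATTT", whitelist))  # no unique match
```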

The tools each ATAC task employs are detailed in the table below.

To see specific tool parameters, select the task WDL link in the table, then view the command {} section of the task in the WDL script. To view or use the exact tool software, see the task's Docker image, which is specified in the task WDL's runtime section as `String docker =`.

| Task name and WDL link | Tool | Software | Description |
| :--- | :--- | :--- | :--- |
| GetNumSplits | Bash | Bash | Uses the virtual machine type to determine the optimal number of FASTQ files for performing the BWA-mem alignment step. This allows BWA-mem to run in parallel on multiple FASTQ files in the subsequent workflow steps. |
| FastqProcessing as SplitFastq | fastqprocess | custom | Dynamically selects the correct barcode orientation, corrects cell barcodes, and splits FASTQ files by the optimal number determined in the GetNumSplits task. The smaller FASTQ files are grouped by cell barcode with each read having the corrected (CB) and raw barcode (CR) in the read name. |
| TrimAdapters | Cutadapt v4.4 | cutadapt | Trims adapter sequences. |
| BWAPairedEndAlignment | bwa-mem2 | mem | Aligns reads from each set of partitioned FASTQ files to the genome and outputs a BAM with ATAC barcodes in the CB:Z tag. |
| CreateFragmentFile | make_fragment_file, import_data | SnapATAC2 | Generates a fragment file from the final aligned BAM and outputs per-barcode quality metrics in h5ad. A detailed list of these metrics is found in the ATAC Count Matrix Overview. |
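As a conceptual illustration of what CreateFragmentFile produces (not SnapATAC2's actual implementation), each properly paired alignment can be reduced to one fragment interval keyed by its cell barcode, with identical intervals from the same barcode collapsed into the fragment file's read count. The +4/-5 coordinate shift shown here is the common Tn5 offset convention in ATAC-seq and is an assumption of this sketch:

```python
from collections import Counter

# Conceptual fragment-file generation sketch (not SnapATAC2 code):
# each properly paired alignment yields one (chrom, start, stop, barcode)
# interval, shifted by the conventional Tn5 offsets (+4 on the start,
# -5 on the end), and duplicate intervals collapse into a read count.

def to_fragment(chrom, start, end, barcode):
    """Convert a read pair's outer coordinates into a shifted fragment."""
    return (chrom, start + 4, end - 5, barcode)

# Made-up read pairs: (chrom, pair start, pair end, corrected CB).
pairs = [
    ("chr1", 10000, 10250, "AAACGGTT"),
    ("chr1", 10000, 10250, "AAACGGTT"),  # duplicate of the same fragment
    ("chr1", 20000, 20180, "TTTGCAAA"),
]

counts = Counter(to_fragment(*p) for p in pairs)
for (chrom, start, stop, cb), n in sorted(counts.items()):
    print(f"{chrom}\t{start}\t{stop}\t{cb}\t{n}")
# chr1    10004   10245   AAACGGTT        2
# chr1    20004   20175   TTTGCAAA        1
```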

Output variables

| Output variable name | Filename, if applicable | Output format and description |
| :--- | :--- | :--- |
| bam_aligned_output | `<input_id>.bam` | BAM containing aligned reads from the ATAC workflow. |
| fragment_file | `<input_id>.fragments.sorted.tsv.gz` | Bgzipped TSV containing fragment start and stop coordinates per barcode. In order, the columns are "Chromosome", "Start", "Stop", "ATAC Barcode", and "Number Reads". |
| snap_metrics | `<input_id>.metrics.h5ad` | h5ad (AnnData) containing per-barcode metrics from SnapATAC2. A detailed list of these metrics is found in the ATAC Count Matrix Overview. |
| library_metrics | `<input_id>_<atac_nhash_id>_library_metrics.csv` | CSV containing library-level metrics. Read more in the Library Metrics Overview. |
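Because the fragment file is a plain five-column TSV, it can be consumed with standard tooling. The sketch below tallies fragments per barcode from decompressed fragment lines, following the column order given above; the example lines are made up:

```python
from collections import Counter

def fragments_per_barcode(lines):
    """Count fragments per ATAC barcode from fragment-file lines
    (columns: Chromosome, Start, Stop, ATAC Barcode, Number Reads)."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, stop, barcode, n_reads = line.rstrip("\n").split("\t")
        counts[barcode] += 1
    return counts

# In-memory example lines; a real file would be opened with
# gzip.open("<input_id>.fragments.sorted.tsv.gz", "rt").
lines = [
    "chr1\t10004\t10245\tAAACGGTT\t2",
    "chr1\t20004\t20175\tAAACGGTT\t1",
    "chr2\t5000\t5180\tTTTGCAAA\t1",
]
print(fragments_per_barcode(lines))  # Counter({'AAACGGTT': 2, 'TTTGCAAA': 1})
```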

Versioning and testing

All ATAC pipeline releases are documented in the ATAC changelog and tested using plumbing and scientific test data. To learn more about WARP pipeline testing, see Testing Pipelines.

Citing the ATAC Pipeline

If you use the ATAC Pipeline in your research, please identify the pipeline in your methods section using the ATAC SciCrunch resource identifier.

  • Ex: ATAC Pipeline (RRID:SCR_024656)

Please also consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Acknowledgements

We are immensely grateful to the members of the BRAIN Initiative (BICAN Sequencing Working Group) and SCORCH for their invaluable and exceptional contributions to this pipeline. Our heartfelt appreciation goes to Alex Dobin, Aparna Bhaduri, Alec Wysoker, Anish Chakka, Brian Herb, Daofeng Li, Fenna Krienen, Guo-Long Zuo, Jeff Goldy, Kai Zhang, Khalid Shakir, Bo Li, Mariano Gabitto, Michael DeBerardine, Mengyi Song, Melissa Goldman, Nelson Johansen, James Nemesh, and Theresa Hodges for their unwavering dedication and remarkable efforts.