ATAC Overview
Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
---|---|---|---|
2.5.1 | November, 2024 | WARP Pipelines | Please file an issue in WARP. |
Introduction to the ATAC workflow
ATAC is an open-source, cloud-optimized pipeline developed in collaboration with members of the BRAIN Initiative (BICCN and BICAN Sequencing Working Group) and SCORCH (see Acknowledgements below). It supports the processing of 10x single-nucleus data generated with 10x Multiome ATAC-seq (Assay for Transposase-Accessible Chromatin), a technique used in molecular biology to assess genome-wide chromatin accessibility.
This workflow is the ATAC component of the Multiome wrapper workflow. It corrects cell barcodes (CBs), aligns reads to the genome, and produces a fragment file as well as per-barcode metrics and library-level metrics.
Quickstart table
The following table provides a quick glance at the ATAC pipeline features:
Pipeline features | Description | Source |
---|---|---|
Assay type | 10x single cell or single nucleus ATAC | 10x Genomics |
Overall workflow | Barcode correction, read alignment, and fragment quantification | Code available from GitHub |
Workflow language | WDL 1.0 | openWDL |
Genomic Reference Sequence | GRCh38 human genome primary sequence | GENCODE |
Aligner | bwa-mem2 | Li H. and Durbin R., 2009 |
Fragment quantification | SnapATAC2 | Zhang, K. et al., 2021 |
Data input file format | File format in which sequencing data is provided | FASTQ |
Data output file format | File formats in which ATAC output is provided | TSV, h5ad, BAM |
Library-level metrics | The ATAC pipeline uses SnapATAC2 to generate library-level metrics in CSV format. | Library-level metrics |
Set-up
ATAC installation
To download the latest ATAC release, see the release tags prefixed with "Multiome" on the WARP releases page. All ATAC pipeline releases are documented in the ATAC changelog.
To discover and search releases, use the WARP command-line tool Wreleaser.
ATAC can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform.
Input Variables
The following describes the inputs of the ATAC workflow. For more details on how default inputs are set for the Multiome workflow, see the Multiome overview.
Variable name | Description |
---|---|
read1_fastq_gzipped | Fastq inputs (array of compressed read 1 FASTQ files). |
read2_fastq_gzipped | Fastq inputs (array of compressed read 2 FASTQ files containing cellular barcodes). |
read3_fastq_gzipped | Fastq inputs (array of compressed read 3 FASTQ files). |
input_id | Output prefix/base name for all intermediate files and pipeline outputs. |
cloud_provider | String describing the cloud provider that should be used to run the workflow; value should be "gcp" or "azure". |
preindex | Boolean used for paired-tag data and not applicable to ATAC data types; default is set to false. |
atac_expected_cells | Number of cells loaded to create the ATAC library; default is set to 3000. |
tar_bwa_reference | BWA reference (tar file containing reference fasta and corresponding files). |
num_threads_bwa | Optional integer defining the number of CPUs per node for the BWA-mem alignment task (default: 128). |
mem_size_bwa | Optional integer defining the memory size for the BWA-mem alignment task in GB (default: 512). |
cpu_platform_bwa | Optional string defining the CPU platform for the BWA-mem alignment task (default: "Intel Ice Lake"). |
annotations_gtf | CreateFragmentFile input variable: GTF file for SnapATAC2 to calculate TSS sites of fragment file. |
chrom_sizes | CreateFragmentFile input variable: Text file containing chromosome sizes for the genome build (e.g., hg38). |
whitelist | Whitelist file for ATAC cellular barcodes. |
adapter_seq_read1 | TrimAdapters input: Sequence adapter for read 1 fastq. |
adapter_seq_read3 | TrimAdapters input: Sequence adapter for read 3 fastq. |
vm_size | String defining the Azure virtual machine family for the workflow (default: "Standard_M128s"). |
atac_nhash_id | String that represents an optional library aliquot identifier. When used, it is echoed in the h5ad unstructured data. |
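As an illustration of how these inputs might be assembled for a Cromwell run, the sketch below builds an inputs JSON in Python. It assumes the workflow is named ATAC, so each key takes the form ATAC.<variable name> following Cromwell's workflow-name prefix convention. The bucket paths and sample name are hypothetical placeholders, and the set of inputs shown is a subset chosen for illustration rather than a statement of which inputs are required.

```python
import json

WORKFLOW_NAME = "ATAC"  # assumption: workflow name used as the input key prefix
ALLOWED_PROVIDERS = {"gcp", "azure"}  # values listed in the input table above


def build_inputs(values):
    """Prefix each input name with the workflow name, Cromwell-style."""
    provider = values.get("cloud_provider")
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(
            f"cloud_provider must be one of {sorted(ALLOWED_PROVIDERS)}, got {provider!r}"
        )
    return {f"{WORKFLOW_NAME}.{name}": value for name, value in values.items()}


# All paths and the sample name below are hypothetical placeholders.
example_values = {
    "input_id": "sample1",
    "cloud_provider": "gcp",
    "read1_fastq_gzipped": ["gs://example-bucket/sample1_R1.fastq.gz"],
    "read2_fastq_gzipped": ["gs://example-bucket/sample1_R2.fastq.gz"],
    "read3_fastq_gzipped": ["gs://example-bucket/sample1_R3.fastq.gz"],
    "tar_bwa_reference": "gs://example-bucket/reference/bwa_reference.tar",
    "annotations_gtf": "gs://example-bucket/reference/annotations.gtf",
    "chrom_sizes": "gs://example-bucket/reference/chrom.sizes",
    "whitelist": "gs://example-bucket/reference/atac_whitelist.txt",
    "atac_expected_cells": 3000,  # documented default
    "preindex": False,  # documented default; not applicable to ATAC data
}

inputs_json = json.dumps(build_inputs(example_values), indent=2)
print(inputs_json)
```

The printed JSON can be saved to a file and supplied as the inputs file when submitting the workflow. Optional inputs such as num_threads_bwa or vm_size can be added to the dictionary in the same way.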
ATAC tasks and tools
Overall, the ATAC workflow:
- Identifies optimal parameters for performing CB correction and alignment.
- Corrects CBs and partitions FASTQs by CB.
- Aligns reads.
- Generates a fragment file.
- Calculates per cell barcode fragment metrics.
The tools each ATAC task employs are detailed in the table below.
To see specific tool parameters, select the task WDL link in the table; then view the command {} section of the task in the WDL script. To view or use the exact tool software, see the task's Docker image, which is specified in the task WDL # runtime values section as String docker =.
Task name and WDL link | Tool | Software | Description |
---|---|---|---|
GetNumSplits | Bash | Bash | Uses the virtual machine type to determine the optimal number of FASTQ files for performing the BWA-mem alignment step. This allows BWA-mem to run in parallel on multiple FASTQ files in the subsequent workflow steps. |
FastqProcessing as SplitFastq | fastqprocess | custom | Dynamically selects the correct barcode orientation, corrects cell barcodes, and splits FASTQ files by the optimal number determined in the GetNumSplits task. The smaller FASTQ files are grouped by cell barcode with each read having the corrected (CB) and raw barcode (CR) in the read name. |
TrimAdapters | Cutadapt v4.4 | cutadapt | Trims adapter sequences. |
BWAPairedEndAlignment | bwa-mem2 | mem | Aligns reads from each set of partitioned FASTQ files to the genome and outputs a BAM with ATAC barcodes in the CB:Z tag. |
CreateFragmentFile | make_fragment_file, import_data | SnapATAC2 | Generates a fragment file from the final aligned BAM and outputs per barcode quality metrics in h5ad. A detailed list of these metrics is found in the ATAC Count Matrix Overview. |
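To make the idea of whitelist-based barcode correction concrete, here is a minimal sketch of one common approach: accept exact whitelist matches, and otherwise accept a barcode only if exactly one whitelisted barcode is a single substitution away. This is an illustrative assumption, not a description of how fastqprocess implements correction. The function name and the toy whitelist are hypothetical.

```python
def correct_barcode(raw, whitelist):
    """Return the whitelisted barcode for `raw`, or None if unmatched or ambiguous.

    Illustrative only: exact match first, then a unique one-substitution match.
    """
    if raw in whitelist:
        return raw
    candidates = set()
    for i, base in enumerate(raw):
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = raw[:i] + alt + raw[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)
    # Only correct when a single whitelist barcode is within one substitution.
    return candidates.pop() if len(candidates) == 1 else None


# Toy whitelist for demonstration; real ATAC whitelists hold many longer barcodes.
toy_whitelist = {"ACGT", "TTTT"}
print(correct_barcode("ACGA", toy_whitelist))
print(correct_barcode("GGGG", toy_whitelist))
```

In this toy example the first call finds a unique neighbor in the whitelist, while the second has no whitelisted barcode within one substitution and returns None.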
Output variables
Output variable name | Filename, if applicable | Output format and description |
---|---|---|
bam_aligned_output | <input_id>.bam | BAM containing aligned reads from ATAC workflow. |
fragment_file | <input_id>.fragments.sorted.tsv.gz | Bgzipped TSV containing fragment start and stop coordinates per barcode. In order, the columns are "Chromosome", "Start", "Stop", "ATAC Barcode", and "Number Reads". |
snap_metrics | <input_id>.metrics.h5ad | h5ad (AnnData) containing per barcode metrics from SnapATAC2. A detailed list of these metrics is found in the ATAC Count Matrix Overview. |
library_metrics | <input_id>_<atac_nhash_id>_library_metrics.csv | CSV file containing library-level metrics. Read more in the Library Metrics Overview. |
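As a quick way to inspect the fragment file, the following sketch counts fragments per barcode using only the Python standard library. It relies on the column order given in the table (Chromosome, Start, Stop, ATAC Barcode, Number Reads) and on the fact that bgzip output is readable as ordinary gzip. The function name is hypothetical, and skipping lines that start with "#" is a defensive assumption rather than a documented property of the file.

```python
import csv
import gzip
from collections import Counter


def count_fragments_per_barcode(path):
    """Return a Counter mapping ATAC barcode -> number of fragments."""
    counts = Counter()
    # bgzip files are gzip-compatible, so the standard gzip module can read them.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and any comment lines
            barcode = row[3]  # fourth column: ATAC Barcode
            counts[barcode] += 1
    return counts
```

For example, count_fragments_per_barcode("sample1.fragments.sorted.tsv.gz").most_common(5) would list the five barcodes with the most fragments, where the filename is a placeholder following the <input_id>.fragments.sorted.tsv.gz pattern.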
Versioning and testing
All ATAC pipeline releases are documented in the ATAC changelog and tested using plumbing and scientific test data. To learn more about WARP pipeline testing, see Testing Pipelines.
Citing the ATAC Pipeline
If you use the ATAC Pipeline in your research, please identify the pipeline in your methods section using the ATAC SciCrunch resource identifier.
- Ex: ATAC Pipeline (RRID:SCR_024656)
Please also consider citing our preprint:
Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1
Acknowledgements
We are immensely grateful to the members of the BRAIN Initiative (BICAN Sequencing Working Group) and SCORCH for their invaluable and exceptional contributions to this pipeline. Our heartfelt appreciation goes to Alex Dobin, Aparna Bhaduri, Alec Wysoker, Anish Chakka, Brian Herb, Daofeng Li, Fenna Krienen, Guo-Long Zuo, Jeff Goldy, Kai Zhang, Khalid Shakir, Bo Li, Mariano Gabitto, Michael DeBerardine, Mengyi Song, Melissa Goldman, Nelson Johansen, James Nemesh, and Theresa Hodges for their unwavering dedication and remarkable efforts.