ATAC Count Matrix Overview
The ATAC pipeline's default count matrix output is an h5ad file generated using SnapATAC2 and AnnData.
The h5ad file contains unstructured metadata (h5ad.uns; Table 1) as well as per-barcode quality metrics (h5ad.obs; Table 2). When ATAC is run as part of the Multiome pipeline, it also contains the equivalent gene expression barcode for each ATAC barcode. Raw fragments are stored in the h5ad.obsm['insertion'] property of the h5ad file. For more information, see the import_data function in the SnapATAC2 documentation.
The h5ad file does not contain per-gene metrics, meaning the variables/features data frame (h5ad.var) is empty.
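
The layout described above can be checked with a few lines of Python. This is a minimal sketch, not part of the pipeline: it assumes the anndata package is installed and can read the file, and the helper names load_atac_h5ad and describe_layout, as well as the file name atac.h5ad, are hypothetical.

```python
def load_atac_h5ad(path):
    """Read the ATAC h5ad from disk (assumes the anndata package is installed)."""
    import anndata as ad  # imported lazily so describe_layout has no hard dependency
    return ad.read_h5ad(path)


def describe_layout(adata):
    """Summarize which slots of an AnnData-like object are populated."""
    return {
        "n_barcodes": adata.n_obs,                # one row of h5ad.obs per ATAC barcode
        "uns_keys": sorted(adata.uns.keys()),     # global attributes (Table 1)
        "obs_columns": list(adata.obs.columns),   # per-barcode metrics (Table 2)
        "obsm_keys": sorted(adata.obsm.keys()),   # expected to include 'insertion'
        "var_is_empty": 0 in tuple(adata.var.shape),  # True when var has no rows or no columns
    }


# Example usage with a hypothetical file name:
# adata = load_atac_h5ad("atac.h5ad")
# print(describe_layout(adata))
```

If the file matches the description in this overview, var_is_empty should be True and obsm_keys should list insertion.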
Table 1. Global attributes
The global attributes (unstructured metadata) in the h5ad file apply to the whole file, not to any specific part.
Attribute | Program | Details |
---|---|---|
reference_sequences | SnapATAC2 | Data frame containing the chromosome sizes for the genome build (e.g., hg38); created using the chrom_sizes pipeline input. |
NHashID | N/A | A string that represents the NHashID, if specified in the workflow. |
Table 2. Cell metrics
Cell Metrics | Program | Details |
---|---|---|
tsse | SnapATAC2 | Transcription start site enrichment (TSSe) score; lower scores suggest poor data quality. Learn more about TSSe in the Definitions section below. |
n_fragment | SnapATAC2 | Number of unique fragments corresponding to the ATAC cell barcode. Fragments are stored in the h5ad.obsm property of the output h5ad file. Learn more about cell barcodes and fragments in the Definitions section below. |
frac_dup | SnapATAC2 | Fraction of reads associated with the cell barcode that are duplicates. |
frac_mito | SnapATAC2 | Fraction of reads associated with the cell barcode that are mitochondrial. |
gex_barcodes | AnnData | Gene expression barcode associated with each ATAC cell barcode. This column is only produced when ATAC is run as part of the Multiome pipeline. |
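
The cell metrics in Table 2 are typically used to decide which barcodes to keep. The sketch below shows one way to do that in plain Python. The thresholds are illustrative assumptions, not pipeline defaults, and the function names passing_barcodes and atac_to_gex and the toy values are hypothetical.

```python
def passing_barcodes(metrics, min_tsse=5.0, min_fragments=1000, max_frac_mito=0.1):
    """Return barcodes whose metrics meet all thresholds.

    metrics maps each ATAC barcode to a dict of its Table 2 columns.
    The default thresholds are placeholders and should be tuned per dataset.
    """
    return [
        barcode
        for barcode, m in metrics.items()
        if m["tsse"] >= min_tsse
        and m["n_fragment"] >= min_fragments
        and m["frac_mito"] <= max_frac_mito
    ]


def atac_to_gex(metrics):
    """Map ATAC barcodes to gene expression barcodes (Multiome runs only)."""
    return {bc: m["gex_barcodes"] for bc, m in metrics.items() if "gex_barcodes" in m}


# Toy values for illustration only; they are not real pipeline output.
toy_metrics = {
    "AAAC": {"tsse": 12.3, "n_fragment": 8500, "frac_dup": 0.20, "frac_mito": 0.01, "gex_barcodes": "TTTG"},
    "CCCG": {"tsse": 2.1, "n_fragment": 9000, "frac_dup": 0.35, "frac_mito": 0.02, "gex_barcodes": "GGGC"},
    "GGGT": {"tsse": 9.0, "n_fragment": 300, "frac_dup": 0.10, "frac_mito": 0.03, "gex_barcodes": "AAAT"},
}

kept = passing_barcodes(toy_metrics)
print(kept)
print(atac_to_gex(toy_metrics)["AAAC"])
```

With a loaded AnnData object, the same dictionary can be built with adata.obs.to_dict("index"), and the surviving barcodes can be used to subset the object with adata[kept].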
Definitions
- Cell Barcode: Short nucleotide sequence used to label and distinguish which reads come from each unique cell, allowing for tracking of many cells simultaneously.
- Fragment: A segment of DNA bounded by two transposase insertion sites, represented by a read (or read pair) that aligns to a specific location on the reference genome.
- Transcription Start Site Enrichment (TSSe): A common quality control metric in ATAC-seq data, indicating increased accessibility around the transcription start sites of genes. High TSSe suggests successful capture of relevant genomic features, while low TSSe may signal data quality or processing issues.
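
To make the TSSe definition concrete, the sketch below computes a simplified enrichment score: the insertion density in a small window centered on the TSS divided by the density in the outer flanks of a larger window. This illustrates the general idea only. It is not the SnapATAC2 implementation, and the function name and window sizes are assumptions.

```python
def tss_enrichment(offsets, window=2000, center=50, flank=100):
    """Simplified TSS enrichment score.

    offsets: insertion positions relative to the nearest TSS, in bp
             (0 is the TSS itself, negative values are upstream).
    Returns the per-base insertion density within +/- center bp of the TSS
    divided by the per-base density in the outermost flank bp on each side
    of the +/- window region. Returns NaN if the flanks hold no insertions.
    """
    center_count = 0
    flank_count = 0
    for off in offsets:
        distance = abs(off)
        if distance > window:
            continue  # outside the region considered around the TSS
        if distance <= center:
            center_count += 1
        elif distance > window - flank:
            flank_count += 1

    center_density = center_count / (2 * center + 1)  # positions -center..+center
    flank_density = flank_count / (2 * flank)         # flank bp on each side
    if flank_density == 0:
        return float("nan")
    return center_density / flank_density
```

A score well above 1 means insertions concentrate at transcription start sites relative to background, which is the pattern expected for good-quality ATAC-seq data.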