Optimus Count Matrix Overview

Danger: The Loom matrix is deprecated, and the default matrix is now h5ad.

The Optimus pipeline's default count matrix output is an h5ad file generated using AnnData.

It contains the raw but UMI-corrected cell-by-gene counts, which vary depending on the workflow's counting_mode and count_exons parameters. For single-cell data (counting_mode is sc_rna), the counts include only exonic gene counts. For single-nucleus data (counting_mode is sn_rna), the counts are whole transcript counts. Additionally, if count_exons is set to true in sn_rna mode, the h5ad contains the whole transcript counts as well as an additional layer with exonic counts.
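The rules above can be restated as a small lookup. This is an illustrative sketch only: the function name `expected_counts` and its return keys are hypothetical and not part of Optimus, and it assumes that count_exons has no effect outside sn_rna mode, which the text implies but does not state outright.

```python
def expected_counts(counting_mode: str, count_exons: bool = False) -> dict:
    """Summarize which counts the h5ad should hold for a given configuration.

    Hypothetical helper that restates the prose above; not Optimus code.
    """
    if counting_mode == "sc_rna":
        # Single-cell mode: only exonic gene counts.
        return {"main_counts": "exonic", "extra_exonic_layer": False}
    if counting_mode == "sn_rna":
        # Single-nucleus mode: whole transcript counts, plus an additional
        # exonic layer when count_exons is true.
        return {
            "main_counts": "whole_transcript",
            "extra_exonic_layer": bool(count_exons),
        }
    raise ValueError(f"unknown counting_mode: {counting_mode!r}")


print(expected_counts("sn_rna", count_exons=True))
```

The `main_counts` strings mirror the `expression_data_type` values listed in Table 1, so the result can be compared directly against what the file reports.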

You can determine which type of counts are in the h5ad file by checking the expression_data_type key in the unstructured metadata (the anndata.uns property of the matrix; see Table 1 below).
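As a minimal sketch of that check, the helper below reads the key from any object that exposes a dictionary-like `uns` attribute, as an AnnData matrix does. The function name `get_expression_data_type` is hypothetical, and the demo uses a stand-in object so the example runs without an h5ad file.

```python
from types import SimpleNamespace


def get_expression_data_type(adata) -> str:
    """Return the expression_data_type stored in the unstructured metadata."""
    value = adata.uns.get("expression_data_type")
    if value is None:
        raise KeyError("expression_data_type not found in adata.uns")
    # Values read back from disk may not be plain str, so normalize.
    return str(value)


# Stand-in with the same .uns shape as an AnnData matrix (illustration only).
demo = SimpleNamespace(uns={"expression_data_type": "whole_transcript"})
print(get_expression_data_type(demo))
```

With the anndata package installed, the same function can be called on a real matrix loaded with `anndata.read_h5ad("optimus_output.h5ad")`, where the file name is a placeholder for your own output path.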

The matrix also contains multiple metrics for both individual cells (the anndata.obs property of the matrix; Table 2) and individual genes (the anndata.var property of the matrix; Table 3).

Additional Matrix Processing for Consortia

Previous Loom files generated by Optimus for consortia, such as the Human Cell Atlas (HCA) or the BRAIN Initiative Cell Census Network (BICCN), may have additional processing steps. Read the Consortia Processing Overview for details on consortia-specific matrix changes.

Table 1. Global attributes

The global attributes (unstructured metadata) in the h5ad apply to the whole file, not to any specific part.

Attribute	Details
`expression_data_type`	String describing if the pipeline counts exonic or whole transcript (exonic and intronic) reads. For the single-cell mode (`counting_mode = sc_rna`), the value will be `exonic`; for the single-nucleus mode (`counting_mode = sn_rna`), the value will be `whole_transcript`.
`input_id`	The sample or cell ID listed in the pipeline configuration file. This can be any string, but we recommend it be consistent with any sample metadata.
`input_name`	Optional string that can be used to further describe the input.
`input_id_metadata_field`	Optional string that describes, when applicable, the metadata field containing the `input_id`.
`input_name_metadata_field`	Optional string that describes, when applicable, the metadata field containing the `input_name`.
`pipeline_version`	String describing the version of the Optimus pipeline run on the data.
`NHashID`	String that represents the NHashID (an optional library aliquot identifier), if specified during the workflow run.

Table 2. Cell metrics

Cell Metrics	Program	Details
`CellID`	TagSort	The unique identifier for each cell based on cell barcodes (sequences used to identify unique cells); identical to `cell_names`. Learn more about cell barcodes in the Definitions section below.
`cell_names`	TagSort	The unique identifier for each cell based on cell barcodes; identical to `CellID`.
`input_id`	Provided as pipeline input	The sample or cell ID listed in the pipeline configuration file. This can be any string, but we recommend it be consistent with any sample metadata.
`star_IsCell`	STARsolo	A true/false flag demarcating if the STARsolo aligner called a cell barcode as a cell.
`n_reads`	TagSort	The number of reads associated with the cell. Like all metrics, `n_reads` is calculated from the Optimus output BAM file. Prior to alignment, reads are checked against the whitelist and any within one edit distance (Hamming distance) are corrected. These CB-corrected reads are aligned using STARsolo, where they get further CB correction. For this reason, most reads in the aligned BAM file have both `CB` and `UB` tags. Therefore, `n_reads` represents CB-corrected reads, rather than all reads in the input FASTQ files.
`tso_reads`	TagSort	The number of reads that have 20 or more bp of TSO sequence clipped from the 5' end. Calculated using the first number of the cN tag in the BAM, which is specific to the number of TSO nucleotides clipped.
`noise_reads`	TagSort	Number of reads that are categorized by 10x Genomics Cell Ranger as "noise". Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides.
`perfect_molecule_barcodes`	TagSort	The number of reads with molecule barcodes (sequences used to identify unique transcripts) that have no errors. Learn more about UMIs in the Definitions section below.
`reads_mapped_exonic`	STARsolo and TagSort	The number of unique reads counted as exon; counted when the BAM file's `sF` tag is assigned to `1` or `3` and the `NH:i` tag is `1`; mitochondrial reads are excluded.
`reads_mapped_exonic_as`	STARsolo and TagSort	The number of reads counted as exon in the antisense direction; counted when the BAM file's `sF` tag is assigned to `2` or `4` and the `NH:i` tag is `1`; mitochondrial reads are excluded.
`reads_mapped_intronic`	STARsolo and TagSort	The number of unique reads counted as intron; counted when the BAM file's `sF` tag is assigned to `5` and the `NH:i` tag is `1`; mitochondrial reads are excluded.
`reads_mapped_intronic_as`	STARsolo and TagSort	The number of unique reads counted as intron in the antisense direction; counted when the BAM file's `sF` tag is assigned to `6` and the `NH:i` tag is `1`; mitochondrial reads are excluded.
`duplicate_reads`	TagSort	Not currently calculated for Optimus output; number of duplicate reads.
`n_mitochondrial_genes`	TagSort	The number of mitochondrial genes detected by this cell.
`n_mitochondrial_molecules`	TagSort	The number of molecules from mitochondrial genes detected for this cell.
`pct_mitochondrial_molecules`	TagSort	The percentage of molecules from mitochondrial genes detected for this cell.
`reads_mapped_uniquely`	TagSort	The number of reads mapped to a single unambiguous location in the genome; mitochondrial reads are excluded.
`reads_mapped_multiple`	TagSort	The number of reads mapped to multiple genomic positions with equal confidence; mitochondrial reads are excluded.
`spliced_reads`	TagSort	The number of reads that overlap splicing junctions.
`antisense_reads`	TagSort	Not calculated for Optimus outputs; see `reads_mapped_exonic_as` or `reads_mapped_intronic_as` for antisense counts.
`molecule_barcode_fraction_bases_above_30_mean`	TagSort	The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the cell.
`molecule_barcode_fraction_bases_above_30_variance`	TagSort	The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the cell.
`genomic_reads_fraction_bases_quality_above_30_mean`	TagSort	The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the cell (included for 10x Cell Ranger count comparison).
`genomic_reads_fraction_bases_quality_above_30_variance`	TagSort	The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the cell (included for 10x Cell Ranger count comparison).
`genomic_read_quality_mean`	TagSort	Average quality of Illumina base calls in the genomic reads corresponding to the cell.
`genomic_read_quality_variance`	TagSort	Variance in quality of Illumina base calls in the genomic reads corresponding to the cell.
`n_molecules`	TagSort	Number of molecules corresponding to the cell (only reflects reads with CB and UB tags).
`n_fragments`	TagSort	Number of fragments (distinct segments of reads that align to a specific location on the reference genome), corresponding to the cell barcode. Learn more in the Definitions section below.
`reads_per_fragment`	TagSort	The average number of reads associated with each fragment in the cell.
`fragments_per_molecule`	TagSort	The average number of fragments associated with each molecule in the cell.
`fragments_with_single_read_evidence`	TagSort	The number of fragments associated with the cell that are observed by only one read.
`molecules_with_single_read_evidence`	TagSort	The number of molecules associated with the cell that are observed by only one read.
`reads_mapped_mitochondrial`	TagSort	The number of unique reads (`NH:i:1` BAM tag) that come from mitochondrial genes.
`perfect_cell_barcodes`	TagSort	The number of reads whose cell barcodes contain no error.
`reads_mapped_too_many_loci`	TagSort	The number of reads that were mapped to too many loci across the genome and as a consequence, are reported unmapped by the aligner.
`cell_barcode_fraction_bases_above_30_variance`	TagSort	The variance of the fraction of Illumina base calls for the cell barcode sequence that are greater than 30, across molecules.
`cell_barcode_fraction_bases_above_30_mean`	TagSort	The average fraction of Illumina base calls for the cell barcode sequences that are greater than 30, across molecules.
`n_genes`	TagSort	The number of genes detected by this cell.
`genes_detected_multiple_observations`	TagSort	The number of genes that are observed by more than one read in this cell.
`emptydrops_FDR`	dropletUtils	False Discovery Rate (FDR) for being a non-empty droplet; single-cell data will read `NA` if task is unable to detect knee point inflection. Column is not included for data run in the `sn_rna` mode.
`emptydrops_IsCell`	dropletUtils	Binarized call of cell/background based on predefined FDR cutoff; single-cell data will read `NA` if task is unable to detect knee point inflection. Column is not included for data run in the `sn_rna` mode.
`emptydrops_Limited`	dropletUtils	Indicates whether a lower p-value could be obtained by increasing the number of iterations; single-cell data will read `NA` if task is unable to detect knee point inflection. Column is not included for data run in the `sn_rna` mode.
`emptydrops_LogProb`	dropletUtils	The log-probability of observing the barcode’s count vector under the null model; single-cell data will read `NA` if the task is unable to detect knee point inflection. Column is not included for data run in the `sn_rna` mode.
`emptydrops_PValue`	dropletUtils	The Monte Carlo p-value against the null model; single-cell data will read `NA` if task is unable to detect knee point inflection. Column is not included for data run in the `sn_rna` mode.
`emptydrops_Total`	dropletUtils	The total read counts for each barcode; single-cell data will read `NA` if task is unable to detect knee point inflection. Column is not included for data run in the `sn_rna` mode.
`reads_mapped_intergenic`	STARsolo and TagSort	The number of reads counted as intergenic; counted when the BAM file's `sF` tag is assigned to a `7` and the `NH:i` tag is `1`.
`reads_unmapped`	TagSort	The total number of reads that are unmapped; counted when the BAM file's `sF` tag is `0`.
`reads_per_molecule`	TagSort	The average number of reads associated with each molecule in the cell.
`doublet_score`	Modified version of DoubletFinder	A score produced by a modified version of the DoubletFinder software that normalizes data using scanpy and then uses the k-nearest neighbors algorithm to determine cells. This program is non-deterministic, so results will vary across runs of the workflow. The metrics are used to determine overall library quality.
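
The `sF` and `NH:i` conditions listed in Table 2 can be read as a decision rule per read. The sketch below restates the table rows in that form; it is not TagSort's implementation. The function name `classify_read` and the `other` bucket are hypothetical, the `sF` value is taken as a plain integer (an assumption about how the tag is parsed), and the order in which the intergenic and mitochondrial checks are applied is also an assumption.

```python
from collections import Counter


def classify_read(sf: int, nh: int, is_mitochondrial: bool = False) -> str:
    """Map one read's sF and NH:i values to a Table 2 read-category metric."""
    if sf == 0:
        return "reads_unmapped"
    if nh != 1:
        # The categories below all require a uniquely mapped read (NH:i:1).
        return "other"
    if sf == 7:
        return "reads_mapped_intergenic"
    if is_mitochondrial:
        # Mitochondrial reads are excluded from the exonic/intronic metrics.
        return "reads_mapped_mitochondrial"
    if sf in (1, 3):
        return "reads_mapped_exonic"
    if sf in (2, 4):
        return "reads_mapped_exonic_as"
    if sf == 5:
        return "reads_mapped_intronic"
    if sf == 6:
        return "reads_mapped_intronic_as"
    return "other"


# Made-up (sF, NH, is_mitochondrial) values for illustration.
reads = [(1, 1, False), (3, 1, False), (2, 1, False), (5, 1, False),
         (7, 1, False), (0, 0, False), (1, 2, False), (1, 1, True)]
tally = Counter(classify_read(*r) for r in reads)
print(dict(tally))
```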

Table 3. Gene metrics

Gene Metrics	Program	Details
`ensembl_ids`	GENCODE GTF	The `gene_id` listed in the GENCODE GTF file.
`Gene`	GENCODE GTF	The unique `gene_name` provided in the GENCODE GTF file; identical to the `gene_names` attribute.
`gene_names`	GENCODE GTF	The unique `gene_name` provided in the GENCODE GTF file; identical to the `Gene` attribute.
`n_reads`	TagSort	The number of reads associated with this gene.
`tso_reads`	TagSort	The number of reads that have 20 or more bp of TSO sequence clipped from the 5' end. Calculated using the first number of the cN tag in the BAM, which is specific to the number of TSO nucleotides clipped.
`noise_reads`	TagSort	Not currently calculated for Optimus output; number of reads that are categorized by 10x Genomics Cell Ranger as "noise"; refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides.
`perfect_molecule_barcodes`	TagSort	The number of reads with molecule barcodes (sequences used to identify unique transcripts) that have no errors. Learn more about UMIs in the Definitions section below.
`reads_mapped_exonic`	STARsolo and TagSort	The number of unique reads counted as exon; counted when the BAM file's `sF` tag is assigned to `1` or `3` and the `NH:i` tag is `1`; mitochondrial reads are excluded.
`reads_mapped_exonic_as`	STARsolo and TagSort	The number of reads counted as exon in the antisense direction; counted when the BAM file's `sF` tag is assigned to a `2` or `4` and the `NH:i` tag is `1`; mitochondrial reads are excluded.
`reads_mapped_intronic`	STARsolo and TagSort	The number of reads counted as intron; counted when the BAM file's `sF` tag is assigned to a `5` and the `NH:i` tag is `1`; mitochondrial reads are excluded.
`reads_mapped_intronic_as`	STARsolo and TagSort	The number of reads counted as intron in the antisense direction; counted when the BAM file's `sF` tag is assigned to a `6` and the `NH:i` tag is `1`; mitochondrial reads are excluded.
`reads_mapped_uniquely`	TagSort	The number of reads mapped to a single unambiguous location in the genome; mitochondrial reads are excluded.
`reads_mapped_multiple`	TagSort	The number of reads mapped to multiple genomic positions with equal confidence; mitochondrial reads are excluded.
`spliced_reads`	TagSort	The number of reads that overlap splicing junctions.
`antisense_reads`	TagSort	The number of reads that are mapped to the antisense strand instead of the transcribed strand.
`duplicate_reads`	TagSort	The number of duplicate reads.
`molecule_barcode_fraction_bases_above_30_mean`	TagSort	The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the gene.
`molecule_barcode_fraction_bases_above_30_variance`	TagSort	The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the gene.
`genomic_reads_fraction_bases_quality_above_30_mean`	TagSort	The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the gene (included for 10x Cell Ranger count comparison).
`genomic_reads_fraction_bases_quality_above_30_variance`	TagSort	The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the gene (included for 10x Cell Ranger count comparison).
`genomic_read_quality_mean`	TagSort	Average quality of Illumina base calls in the genomic reads corresponding to the gene.
`genomic_read_quality_variance`	TagSort	Variance in quality of Illumina base calls in the genomic reads corresponding to the gene.
`n_molecules`	TagSort	Number of molecules corresponding to the gene (only reflects reads with `CB` and `UB` tags).
`n_fragments`	TagSort	Number of fragments (distinct segments of reads that align to a specific location on the reference genome), corresponding to the gene. Learn more in the Definitions section below.
`reads_per_molecule`	TagSort	The average number of reads associated with each molecule in the gene.
`reads_per_fragment`	TagSort	The average number of reads associated with each fragment in the gene.
`fragments_per_molecule`	TagSort	The average number of fragments associated with each molecule in the gene.
`fragments_with_single_read_evidence`	TagSort	The number of fragments associated with the gene that are observed by only one read.
`molecules_with_single_read_evidence`	TagSort	The number of molecules associated with the gene that are observed by only one read.
`reads_mapped_mitochondrial`	TagSort	The number of unique reads (`NH:i:1` BAM tag) that come from mitochondrial genes.
`number_cells_detected_multiple`	TagSort	The number of cells which observe more than one read of the gene.
`number_cells_expressing`	TagSort	The number of cells that detect the gene.

Definitions

Cell Barcode: Short nucleotide sequence used to label and distinguish which reads come from each unique cell, allowing for tracking of many cells simultaneously.
Fragment: A distinct segment of a read that aligns to a specific location on the reference genome. The TagSort function defines fragments based on: 1) the presence of a combined UMI/GX/CB tag, 2) the reference (Chr1, Chr2, etc.), 3) the nucleotide position, and 4) the strand. While some cells may have more n_fragments than n_reads (for example, when an RNA read overlaps an exon-exon junction), some barcodes may have fewer fragments than reads (for example, when a cell has multiple reads that overlap).
Unique Molecular Identifier (UMI): Short nucleotide sequence used to label and distinguish which reads come from each unique transcript present in the cell at the time of lysis, allowing for tracking of many transcripts simultaneously.
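
To make the fragment definition concrete, the sketch below counts fragments as distinct combinations of the four properties listed above. It is a simplified model rather than TagSort's code: the `AlignedSegment` type, its field names, and the idea of representing a junction-spanning read as two segments are assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class AlignedSegment(NamedTuple):
    tag: str        # combined UMI/GX/CB tag
    reference: str  # for example "Chr1"
    position: int   # nucleotide position of the segment
    strand: str     # "+" or "-"


def count_fragments(segments: Iterable[AlignedSegment]) -> int:
    """Count distinct (tag, reference, position, strand) combinations."""
    return len(set(segments))


# Two reads that overlap at the same location collapse into one fragment,
# so there are fewer fragments than reads.
overlapping = [
    AlignedSegment("UMI1/GENE1/CB1", "Chr1", 1000, "+"),
    AlignedSegment("UMI1/GENE1/CB1", "Chr1", 1000, "+"),
]

# One read spanning an exon-exon junction, modeled here as two segments,
# yields more fragments than reads.
spliced = [
    AlignedSegment("UMI2/GENE2/CB1", "Chr2", 500, "-"),
    AlignedSegment("UMI2/GENE2/CB1", "Chr2", 900, "-"),
]

print(count_fragments(overlapping), count_fragments(spliced))
```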
