Optimus Count Matrix Overview

Warning: The Loom matrix is deprecated and the default matrix is now h5ad.

The Optimus pipeline's default count matrix output is an h5ad file generated using AnnData.

It contains the raw but UMI-corrected cell-by-gene counts, which vary depending on the workflow's counting_mode and count_exons parameters. For single-cell data (counting_mode is sc_rna), the counts include only exonic gene counts. For single-nucleus data (counting_mode is sn_rna), the counts are whole transcript counts. Additionally, if count_exons is set to true in sn_rna mode, the h5ad contains the whole transcript counts as well as an additional layer with exonic counts.

You can determine which type of counts the h5ad file contains by checking the expression_data_type key in the unstructured metadata (the anndata.uns property of the matrix; see Table 1 below).
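As a minimal sketch of that check, the function below maps the two expression_data_type values documented in Table 1 to a readable description. It accepts any dict-like object, so a plain dictionary stands in for the anndata.uns property here; the function name, the example values, and the handling of missing or unrecognized values are assumptions made for illustration.

```python
from typing import Mapping

# Values documented in Table 1 for the expression_data_type key.
COUNT_TYPE_DESCRIPTIONS = {
    "exonic": "exonic counts only (counting_mode = sc_rna)",
    "whole_transcript": "whole transcript counts, exonic and intronic (counting_mode = sn_rna)",
}


def describe_count_type(uns: Mapping) -> str:
    """Describe the counts in the matrix from its unstructured metadata.

    `uns` is any dict-like object, such as the uns property of an AnnData object.
    """
    value = uns.get("expression_data_type")
    if value is None:
        return "expression_data_type is missing from the unstructured metadata"
    # str() also handles string-like values that are not plain Python strings.
    return COUNT_TYPE_DESCRIPTIONS.get(str(value), f"unrecognized value: {value!r}")


# Stand-in for adata.uns; these values are made up for illustration.
example_uns = {"expression_data_type": "whole_transcript", "input_id": "sample_1"}
print(describe_count_type(example_uns))
```

With the anndata package installed, the same function should work on a real file by loading it with anndata.read_h5ad and passing the resulting object's uns property.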

The matrix also contains multiple metrics for both individual cells (the anndata.obs property of the matrix; Table 2) and individual genes (the anndata.var property of the matrix; Table 3).

Additional Matrix Processing for Consortia

Previous Loom files generated by Optimus for consortia, such as the Human Cell Atlas (HCA) or the BRAIN Initiative Cell Census Network (BICCN), may have additional processing steps. Read the Consortia Processing Overview for details on consortia-specific matrix changes.

Table 1. Global attributes

The global attributes (unstructured metadata) in the h5ad apply to the whole file rather than to any specific part.

Attribute | Details
expression_data_type | String describing if the pipeline counts exonic or whole transcript (exonic and intronic) reads. For the single-cell mode (counting_mode = sc_rna), the value will be exonic; for the single-nucleus mode (counting_mode = sn_rna), the value will be whole_transcript.
input_id | The sample or cell ID listed in the pipeline configuration file. This can be any string, but we recommend it be consistent with any sample metadata.
input_name | Optional string that can be used to further describe the input.
input_id_metadata_field | Optional string that describes, when applicable, the metadata field containing the input_id.
input_name_metadata_field | Optional string that describes, when applicable, the metadata field containing the input_name.
pipeline_version | String describing the version of the Optimus pipeline run on the data.
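The sketch below shows one way to check a file's global attributes against Table 1. It assumes that the attributes not marked "Optional" in the table are always present, which the table implies but does not state outright; the function name and example values are hypothetical.

```python
from typing import List, Mapping

# Assumption: attributes not described as "Optional" in Table 1 are always present.
EXPECTED_ATTRIBUTES = ("expression_data_type", "input_id", "pipeline_version")
OPTIONAL_ATTRIBUTES = ("input_name", "input_id_metadata_field", "input_name_metadata_field")


def missing_attributes(uns: Mapping) -> List[str]:
    """Return the expected global attributes that are absent from `uns`."""
    return [key for key in EXPECTED_ATTRIBUTES if key not in uns]


# Stand-in for adata.uns; the values are placeholders, not real pipeline output.
example_uns = {"expression_data_type": "exonic", "input_id": "sample_1"}
print(missing_attributes(example_uns))

# Optional attributes are reported separately because their absence is not an error.
print([key for key in OPTIONAL_ATTRIBUTES if key in example_uns])
```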

Table 2. Cell metrics

Cell Metrics | Program | Details
CellID | TagSort | The unique identifier for each cell based on cell barcodes (sequences used to identify unique cells); identical to cell_names. Learn more about cell barcodes in the Definitions section below.
cell_names | TagSort | The unique identifier for each cell based on cell barcodes; identical to CellID.
input_id | Provided as pipeline input | The sample or cell ID listed in the pipeline configuration file. This can be any string, but we recommend it be consistent with any sample metadata.
n_reads | TagSort | The number of reads associated with the cell. Like all metrics, n_reads is calculated from the Optimus output BAM file. Prior to alignment, reads are checked against the whitelist and any within one edit distance (Hamming distance) are corrected. These CB-corrected reads are aligned using STARsolo, where they get further CB correction. For this reason, most reads in the aligned BAM file have both CB and UB tags. Therefore, n_reads represents CB-corrected reads, rather than all reads in the input FASTQ files.
noise_reads | TagSort | The number of reads that 10x Genomics Cell Ranger categorizes as "noise"; refers to long polymers or reads with high numbers of N (ambiguous) nucleotides.
perfect_molecule_barcodes | TagSort | The number of reads with molecule barcodes (sequences used to identify unique transcripts) that have no errors. Learn more about UMIs in the Definitions section below.
reads_mapped_exonic | STARsolo and TagSort | The number of unique reads counted as exon; counted when the BAM file's sF tag is assigned to 1 or 3 and the NH:i tag is 1; mitochondrial reads are excluded.
reads_mapped_exonic_as | STARsolo and TagSort | The number of reads counted as exon in the antisense direction; counted when the BAM file's sF tag is assigned to a 2 or 4 and the NH:i tag is 1; mitochondrial reads are excluded.
reads_mapped_intronic | STARsolo and TagSort | The number of unique reads counted as intron; counted when the BAM file's sF tag is assigned to a 5 and the NH:i tag is 1; mitochondrial reads are excluded.
reads_mapped_intronic_as | STARsolo and TagSort | The number of unique reads counted as intron in the antisense direction; counted when the BAM file's sF tag is assigned to a 6 and the NH:i tag is 1; mitochondrial reads are excluded.
duplicate_reads | TagSort | Not currently calculated for Optimus output; number of duplicate reads.
n_mitochondrial_genes | TagSort | The number of mitochondrial genes detected by this cell.
n_mitochondrial_molecules | TagSort | The number of molecules from mitochondrial genes detected for this cell.
pct_mitochondrial_molecules | TagSort | The percentage of molecules from mitochondrial genes detected for this cell.
reads_mapped_uniquely | TagSort | The number of reads mapped to a single unambiguous location in the genome; mitochondrial reads are excluded.
reads_mapped_multiple | TagSort | The number of reads mapped to multiple genomic positions with equal confidence; mitochondrial reads are excluded.
spliced_reads | TagSort | The number of reads that overlap splicing junctions.
antisense_reads | TagSort | Not calculated for Optimus outputs; see reads_mapped_exonic_as or reads_mapped_intronic_as for antisense counts.
molecule_barcode_fraction_bases_above_30_mean | TagSort | The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the cell.
molecule_barcode_fraction_bases_above_30_variance | TagSort | The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the cell.
genomic_reads_fraction_bases_quality_above_30_mean | TagSort | The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the cell (included for 10x Cell Ranger count comparison).
genomic_reads_fraction_bases_quality_above_30_variance | TagSort | The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the cell (included for 10x Cell Ranger count comparison).
genomic_read_quality_mean | TagSort | Average quality of Illumina base calls in the genomic reads corresponding to the cell.
genomic_read_quality_variance | TagSort | Variance in quality of Illumina base calls in the genomic reads corresponding to the cell.
n_molecules | TagSort | Number of molecules corresponding to the cell (only reflects reads with CB and UB tags).
n_fragments | TagSort | Number of fragments (distinct segments of reads that align to a specific location on the reference genome), corresponding to the cell barcode. Learn more in the Definitions section below.
reads_per_fragment | TagSort | The average number of reads associated with each fragment in the cell.
fragments_per_molecule | TagSort | The average number of fragments associated with each molecule in the cell.
fragments_with_single_read_evidence | TagSort | The number of fragments associated with the cell that are observed by only one read.
molecules_with_single_read_evidence | TagSort | The number of molecules associated with the cell that are observed by only one read.
perfect_cell_barcodes | TagSort | The number of reads whose cell barcodes contain no error.
reads_mapped_too_many_loci | TagSort | The number of reads that were mapped to too many loci across the genome and, as a consequence, are reported unmapped by the aligner.
cell_barcode_fraction_bases_above_30_variance | TagSort | The variance of the fraction of Illumina base calls for the cell barcode sequence that are greater than 30, across molecules.
cell_barcode_fraction_bases_above_30_mean | TagSort | The average fraction of Illumina base calls for the cell barcode sequences that are greater than 30, across molecules.
n_genes | TagSort | The number of genes detected by this cell.
genes_detected_multiple_observations | TagSort | The number of genes that are observed by more than one read in this cell.
emptydrops_FDR | dropletUtils | False Discovery Rate (FDR) for being a non-empty droplet; single-cell data will read NA if the task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode.
emptydrops_IsCell | dropletUtils | Binarized call of cell/background based on predefined FDR cutoff; single-cell data will read NA if the task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode.
emptydrops_Limited | dropletUtils | Indicates whether a lower p-value could be obtained by increasing the number of iterations; single-cell data will read NA if the task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode.
emptydrops_LogProb | dropletUtils | The log-probability of observing the barcode’s count vector under the null model; single-cell data will read NA if the task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode.
emptydrops_PValue | dropletUtils | The Monte Carlo p-value against the null model; single-cell data will read NA if the task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode.
emptydrops_Total | dropletUtils | The total read counts for each barcode; single-cell data will read NA if the task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode.
reads_mapped_intergenic | STARsolo and TagSort | The number of reads counted as intergenic; counted when the BAM file's sF tag is assigned to a 7 and the NH:i tag is 1.
reads_unmapped | TagSort | The total number of reads that are unmapped; counted when the BAM file's sF tag is 0.
reads_per_molecule | TagSort | The average number of reads associated with each molecule in the cell.
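The sF and NH:i conditions listed in Table 2 can be summarized as a small lookup. The sketch below is an illustration built only from the table's descriptions, not the TagSort implementation: it takes the two tag values as plain integers, ignores the exclusion of mitochondrial reads, and uses made-up example reads.

```python
from collections import Counter
from typing import Optional

# sF tag values taken from the metric descriptions in Table 2.
SF_TO_METRIC = {
    1: "reads_mapped_exonic",
    3: "reads_mapped_exonic",
    2: "reads_mapped_exonic_as",
    4: "reads_mapped_exonic_as",
    5: "reads_mapped_intronic",
    6: "reads_mapped_intronic_as",
    7: "reads_mapped_intergenic",
}


def classify_read(sf: int, nh: int) -> Optional[str]:
    """Return the Table 2 metric a read contributes to, or None if it matches none.

    Simplification: mitochondrial reads are not filtered out here.
    """
    if sf == 0:
        # Table 2 lists no NH:i condition for unmapped reads.
        return "reads_unmapped"
    if nh != 1:
        # The mapped categories above all require NH:i to be 1.
        return None
    return SF_TO_METRIC.get(sf)


# Made-up (sF, NH) pairs for the reads of a single cell.
example_reads = [(1, 1), (3, 1), (2, 1), (5, 1), (6, 1), (7, 1), (0, 0), (1, 2)]
labels = (classify_read(sf, nh) for sf, nh in example_reads)
tally = Counter(label for label in labels if label is not None)
print(dict(tally))
```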

Table 3. Gene metrics

Gene Metrics | Program | Details
ensembl_ids | GENCODE GTF | The gene_id listed in the GENCODE GTF file.
Gene | GENCODE GTF | The unique gene_name provided in the GENCODE GTF file; identical to the gene_names attribute.
gene_names | GENCODE GTF | The unique gene_name provided in the GENCODE GTF file; identical to the Gene attribute.
n_reads | TagSort | The number of reads associated with this gene.
noise_reads | TagSort | Not currently calculated for Optimus output; number of reads that are categorized by 10x Genomics Cell Ranger as "noise"; refers to long polymers or reads with high numbers of N (ambiguous) nucleotides.
perfect_molecule_barcodes | TagSort | The number of reads with molecule barcodes (sequences used to identify unique transcripts) that have no errors. Learn more about UMIs in the Definitions section below.
reads_mapped_exonic | STARsolo and TagSort | The number of unique reads counted as exon; counted when the BAM file's sF tag is assigned to 1 or 3 and the NH:i tag is 1; mitochondrial reads are excluded.
reads_mapped_exonic_as | STARsolo and TagSort | The number of reads counted as exon in the antisense direction; counted when the BAM file's sF tag is assigned to a 2 or 4 and the NH:i tag is 1; mitochondrial reads are excluded.
reads_mapped_intronic | STARsolo and TagSort | The number of reads counted as intron; counted when the BAM file's sF tag is assigned to a 5 and the NH:i tag is 1; mitochondrial reads are excluded.
reads_mapped_intronic_as | STARsolo and TagSort | The number of reads counted as intron in the antisense direction; counted when the BAM file's sF tag is assigned to a 6 and the NH:i tag is 1; mitochondrial reads are excluded.
reads_mapped_uniquely | TagSort | The number of reads mapped to a single unambiguous location in the genome; mitochondrial reads are excluded.
reads_mapped_multiple | TagSort | The number of reads mapped to multiple genomic positions with equal confidence; mitochondrial reads are excluded.
spliced_reads | TagSort | The number of reads that overlap splicing junctions.
antisense_reads | TagSort | The number of reads that are mapped to the antisense strand instead of the transcribed strand.
duplicate_reads | TagSort | The number of duplicate reads.
molecule_barcode_fraction_bases_above_30_mean | TagSort | The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the gene.
molecule_barcode_fraction_bases_above_30_variance | TagSort | The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the gene.
genomic_reads_fraction_bases_quality_above_30_mean | TagSort | The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the gene (included for 10x Cell Ranger count comparison).
genomic_reads_fraction_bases_quality_above_30_variance | TagSort | The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the gene (included for 10x Cell Ranger count comparison).
genomic_read_quality_mean | TagSort | Average quality of Illumina base calls in the genomic reads corresponding to the gene.
genomic_read_quality_variance | TagSort | Variance in quality of Illumina base calls in the genomic reads corresponding to the gene.
n_molecules | TagSort | Number of molecules corresponding to the gene (only reflects reads with CB and UB tags).
n_fragments | TagSort | Number of fragments (distinct segments of reads that align to a specific location on the reference genome), corresponding to the gene. Learn more in the Definitions section below.
reads_per_molecule | TagSort | The average number of reads associated with each molecule in the gene.
reads_per_fragment | TagSort | The average number of reads associated with each fragment in the gene.
fragments_per_molecule | TagSort | The average number of fragments associated with each molecule in the gene.
fragments_with_single_read_evidence | TagSort | The number of fragments associated with the gene that are observed by only one read.
molecules_with_single_read_evidence | TagSort | The number of molecules associated with the gene that are observed by only one read.
number_cells_detected_multiple | TagSort | The number of cells which observe more than one read of the gene.
number_cells_expressing | TagSort | The number of cells that detect the gene.
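To make the last two gene metrics concrete, the sketch below derives them from a toy table of read counts per cell and gene. It assumes that a cell "detects" a gene when it has at least one read for it, which is an interpretation of the table rather than a statement about how TagSort computes the values; the barcodes, gene names, and counts are invented.

```python
from typing import Dict, Tuple


def cell_counts_for_gene(reads: Dict[Tuple[str, str], int], gene: str) -> Tuple[int, int]:
    """Return (number_cells_expressing, number_cells_detected_multiple) for a gene.

    `reads` maps (cell barcode, gene name) to the number of reads observed.
    Assumption: a cell detects a gene if it has at least one read for it.
    """
    per_cell = [n for (cell, g), n in reads.items() if g == gene]
    expressing = sum(1 for n in per_cell if n >= 1)
    multiple = sum(1 for n in per_cell if n > 1)
    return expressing, multiple


# Invented read counts for three cells and two genes.
example_reads = {
    ("AAAC", "GeneA"): 3,
    ("AAAG", "GeneA"): 1,
    ("AAAT", "GeneA"): 0,
    ("AAAC", "GeneB"): 2,
}
print(cell_counts_for_gene(example_reads, "GeneA"))
```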

Definitions

  • Cell Barcode: Short nucleotide sequence used to label and distinguish which reads come from each unique cell, allowing for tracking of many cells simultaneously.
  • Fragment: A distinct segment of a read that aligns to a specific location on the reference genome. The TagSort function defines fragments based on: 1) the presence of a combined UMI/GX/CB tag, 2) the reference (Chr1, Chr2, etc.), 3) the nucleotide position, and 4) the strand. While some cells may have more n_fragments than n_reads (for example, when an RNA read overlaps an exon-exon junction), some barcodes may have fewer fragments than reads (for example, when a cell has multiple reads that overlap).
  • Unique Molecular Identifier (UMI): Short nucleotide sequence used to label and distinguish which reads come from each unique transcript present in the cell at the time of lysis, allowing for tracking of many transcripts simultaneously.
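The fragment definition above can be illustrated by treating its four components as a key and counting distinct keys. This is a simplified sketch and not the TagSort code: the Alignment type, the tag strings, and the positions are hypothetical, and reads that "overlap" are modeled as reads sharing the same key.

```python
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    """One aligned segment of a read, reduced to the fields that define a fragment."""
    tag: str        # combined UMI/GX/CB tag (placeholder strings below)
    reference: str  # reference sequence name, for example "chr1"
    position: int   # nucleotide position of the aligned segment
    strand: str     # "+" or "-"


def count_fragments(alignments: Iterable[Alignment]) -> int:
    """Count distinct fragments: segments with identical keys collapse into one."""
    return len(set(alignments))


# Two reads with the same key give fewer fragments than reads.
overlapping_reads = [
    Alignment("UMI1-GENE1-CB1", "chr1", 100, "+"),
    Alignment("UMI1-GENE1-CB1", "chr1", 100, "+"),
]
print(count_fragments(overlapping_reads))

# One read spanning an exon-exon junction aligns as two segments,
# giving more fragments than reads.
junction_read = [
    Alignment("UMI2-GENE1-CB1", "chr1", 100, "+"),
    Alignment("UMI2-GENE1-CB1", "chr1", 500, "+"),
]
print(count_fragments(junction_read))
```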