Optimus Count Matrix Overview
The Loom matrix is deprecated and the default matrix is now h5ad.
The Optimus pipeline's default count matrix output is an h5ad file generated using AnnData.
It contains the raw but UMI-corrected cell-by-gene counts, which vary depending on the workflow's counting_mode and count_exons parameters. If running single-cell data (counting_mode is sc_rna), the counts will include only exonic gene counts. If running single-nucleus data (counting_mode is sn_rna), the counts will be whole transcript counts. Additionally, if count_exons is set to true in sn_rna mode, the h5ad will contain the whole transcript counts as well as an additional layer with exonic counts.
You can determine which type of counts are in the h5ad file by looking at the expression_data_type key in the unstructured metadata (the anndata.uns property of the matrix; see Table 1 below).
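As a minimal sketch, such a check might look like the following. The helper describe_counts is hypothetical. It accepts any dict-like object, so with the anndata package installed you could pass the uns property of a file loaded with anndata.read_h5ad; the file name in the comment is a placeholder.

```python
def describe_counts(uns):
    """Describe the count type recorded in the unstructured metadata."""
    data_type = uns.get("expression_data_type")
    if data_type == "exonic":
        return "exonic counts (sc_rna mode)"
    if data_type == "whole_transcript":
        return "whole transcript counts, exonic and intronic (sn_rna mode)"
    return f"unrecognized expression_data_type: {data_type!r}"


# Usage with a real file (requires the anndata package; path is a placeholder):
#   import anndata
#   adata = anndata.read_h5ad("optimus_output.h5ad")
#   print(describe_counts(adata.uns))
print(describe_counts({"expression_data_type": "whole_transcript"}))
```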
The matrix also contains multiple metrics for both individual cells (the anndata.obs property of the matrix; Table 2) and individual genes (the anndata.var property of the matrix; Table 3).
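To illustrate how the per-cell metrics might be used, the sketch below keeps barcodes that STARsolo called as cells and that fall under a mitochondrial threshold. The helper name, the 20 percent threshold, and the example rows are assumptions for illustration, not pipeline defaults. Rows are plain dictionaries keyed by Table 2 column names, which is one shape you could derive from the anndata.obs property.

```python
def select_cells(rows, max_pct_mito=20.0):
    """Return CellIDs flagged as cells with acceptable mitochondrial content."""
    kept = []
    for row in rows:
        # star_IsCell is the true/false STARsolo cell call;
        # pct_mitochondrial_molecules is assumed to be on a 0-100 scale.
        if row["star_IsCell"] and row["pct_mitochondrial_molecules"] <= max_pct_mito:
            kept.append(row["CellID"])
    return kept


# Made-up rows for demonstration only.
example_rows = [
    {"CellID": "AAAC", "star_IsCell": True, "pct_mitochondrial_molecules": 3.5},
    {"CellID": "AAAG", "star_IsCell": False, "pct_mitochondrial_molecules": 1.0},
    {"CellID": "AAAT", "star_IsCell": True, "pct_mitochondrial_molecules": 42.0},
]
print(select_cells(example_rows))  # prints ['AAAC']
```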
Previous Loom files generated by Optimus for consortia, such as the Human Cell Atlas (HCA) or the BRAIN Initiative Cell Census Network (BICCN), may have additional processing steps. Read the Consortia Processing Overview for details on consortia-specific matrix changes.
Table 1. Global attributes
The global attributes (unstructured metadata) in the h5ad apply to the whole file, not any specific part.
Attribute | Details |
---|---|
expression_data_type | String describing if the pipeline counts exonic or whole transcript (exonic and intronic) reads. For the single-cell mode (counting_mode = sc_rna), the value will be exonic; for the single-nucleus mode (counting_mode = sn_rna), the value will be whole_transcript. |
input_id | The sample or cell ID listed in the pipeline configuration file. This can be any string, but we recommend it be consistent with any sample metadata. |
input_name | Optional string that can be used to further describe the input. |
input_id_metadata_field | Optional string that describes, when applicable, the metadata field containing the input_id. |
input_name_metadata_field | Optional string that describes, when applicable, the metadata field containing the input_name. |
pipeline_version | String describing the version of the Optimus pipeline run on the data. |
NHashID | String that represents NHashID (an optional library aliquot identifier) if specified during the workflow run. |
Table 2. Cell metrics
Cell Metrics | Program | Details |
---|---|---|
CellID | TagSort | The unique identifier for each cell based on cell barcodes (sequences used to identify unique cells); identical to cell_names. Learn more about cell barcodes in the Definitions section below. |
cell_names | TagSort | The unique identifier for each cell based on cell barcodes; identical to CellID. |
input_id | Provided as pipeline input | The sample or cell ID listed in the pipeline configuration file. This can be any string, but we recommend it be consistent with any sample metadata. |
star_IsCell | STARsolo | A true/false flag demarcating if the STARsolo aligner called a cell barcode as a cell. |
n_reads | TagSort | The number of reads associated with the cell. Like all metrics, n_reads is calculated from the Optimus output BAM file. Prior to alignment, reads are checked against the whitelist and any within one edit distance (Hamming distance) are corrected. These CB-corrected reads are aligned using STARsolo, where they get further CB correction. For this reason, most reads in the aligned BAM file have both CB and UB tags. Therefore, n_reads represents CB-corrected reads, rather than all reads in the input FASTQ files. |
tso_reads | TagSort | The number of reads that have 20 or more bp of TSO sequence clipped from the 5' end. Calculated using the first number of the cN tag in the BAM, which is specific to the number of TSO nucleotides clipped. |
noise_reads | TagSort | Number of reads that are categorized by 10x Genomics Cell Ranger as "noise". Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides. |
perfect_molecule_barcodes | TagSort | The number of reads with molecule barcodes (sequences used to identify unique transcripts) that have no errors. Learn more about UMIs in the Definitions section below. |
reads_mapped_exonic | STARsolo and TagSort | The number of unique reads counted as exon; counted when the BAM file's sF tag is assigned to 1 or 3 and the NH:i tag is 1; mitochondrial reads are excluded. |
reads_mapped_exonic_as | STARsolo and TagSort | The number of reads counted as exon in the antisense direction; counted when the BAM file's sF tag is assigned to 2 or 4 and the NH:i tag is 1; mitochondrial reads are excluded. |
reads_mapped_intronic | STARsolo and TagSort | The number of unique reads counted as intron; counted when the BAM file's sF tag is assigned to 5 and the NH:i tag is 1; mitochondrial reads are excluded. |
reads_mapped_intronic_as | STARsolo and TagSort | The number of unique reads counted as intron in the antisense direction; counted when the BAM file's sF tag is assigned to 6 and the NH:i tag is 1; mitochondrial reads are excluded. |
duplicate_reads | TagSort | Not currently calculated for Optimus output; number of duplicate reads. |
n_mitochondrial_genes | TagSort | The number of mitochondrial genes detected by this cell. |
n_mitochondrial_molecules | TagSort | The number of molecules from mitochondrial genes detected for this cell. |
pct_mitochondrial_molecules | TagSort | The percentage of molecules from mitochondrial genes detected for this cell. |
reads_mapped_uniquely | TagSort | The number of reads mapped to a single unambiguous location in the genome; mitochondrial reads are excluded. |
reads_mapped_multiple | TagSort | The number of reads mapped to multiple genomic positions with equal confidence; mitochondrial reads are excluded. |
spliced_reads | TagSort | The number of reads that overlap splicing junctions. |
antisense_reads | TagSort | Not calculated for Optimus outputs; see reads_mapped_exonic_as or reads_mapped_intronic_as for antisense counts. |
molecule_barcode_fraction_bases_above_30_mean | TagSort | The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the cell. |
molecule_barcode_fraction_bases_above_30_variance | TagSort | The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the cell. |
genomic_reads_fraction_bases_quality_above_30_mean | TagSort | The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the cell (included for 10x Cell Ranger count comparison). |
genomic_reads_fraction_bases_quality_above_30_variance | TagSort | The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the cell (included for 10x Cell Ranger count comparison). |
genomic_read_quality_mean | TagSort | Average quality of Illumina base calls in the genomic reads corresponding to the cell. |
genomic_read_quality_variance | TagSort | Variance in quality of Illumina base calls in the genomic reads corresponding to the cell. |
n_molecules | TagSort | Number of molecules corresponding to the cell (only reflects reads with CB and UB tags). |
n_fragments | TagSort | Number of fragments (distinct segments of reads that align to a specific location on the reference genome), corresponding to the cell barcode. Learn more in the Definitions section below. |
reads_per_fragment | TagSort | The average number of reads associated with each fragment in the cell. |
fragments_per_molecule | TagSort | The average number of fragments associated with each molecule in the cell. |
fragments_with_single_read_evidence | TagSort | The number of fragments associated with the cell that are observed by only one read. |
molecules_with_single_read_evidence | TagSort | The number of molecules associated with the cell that are observed by only one read. |
reads_mapped_mitochondrial | TagSort | The number of unique reads (NH:i:1 BAM tag) that come from mitochondrial genes. |
perfect_cell_barcodes | TagSort | The number of reads whose cell barcodes contain no error. |
reads_mapped_too_many_loci | TagSort | The number of reads that were mapped to too many loci across the genome and as a consequence, are reported unmapped by the aligner. |
cell_barcode_fraction_bases_above_30_variance | TagSort | The variance of the fraction of Illumina base calls for the cell barcode sequence that are greater than 30, across molecules. |
cell_barcode_fraction_bases_above_30_mean | TagSort | The average fraction of Illumina base calls for the cell barcode sequences that are greater than 30, across molecules. |
n_genes | TagSort | The number of genes detected by this cell. |
genes_detected_multiple_observations | TagSort | The number of genes that are observed by more than one read in this cell. |
emptydrops_FDR | dropletUtils | False Discovery Rate (FDR) for being a non-empty droplet; single-cell data will read NA if task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode. |
emptydrops_IsCell | dropletUtils | Binarized call of cell/background based on predefined FDR cutoff; single-cell data will read NA if task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode. |
emptydrops_Limited | dropletUtils | Indicates whether a lower p-value could be obtained by increasing the number of iterations; single-cell data will read NA if task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode. |
emptydrops_LogProb | dropletUtils | The log-probability of observing the barcode’s count vector under the null model; single-cell data will read NA if the task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode. |
emptydrops_PValue | dropletUtils | The Monte Carlo p-value against the null model; single-cell data will read NA if task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode. |
emptydrops_Total | dropletUtils | The total read counts for each barcode; single-cell data will read NA if task is unable to detect knee point inflection. Column is not included for data run in the sn_rna mode. |
reads_mapped_intergenic | STARsolo and TagSort | The number of reads counted as intergenic; counted when the BAM file's sF tag is assigned to 7 and the NH:i tag is 1. |
reads_unmapped | TagSort | The total number of reads that are unmapped; counted when the BAM file's sF tag is 0. |
reads_per_molecule | TagSort | The average number of reads associated with each molecule in the cell. |
doublet_score | Modified version of DoubletFinder | A score produced by a modified version of the DoubletFinder software that normalizes data using scanpy and then uses the k-nearest neighbors algorithm to determine cells. This program is non-deterministic, so results will vary across runs of the workflow. The metrics are used to determine overall library quality. |
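The sF and NH:i conditions listed in Table 2 can be summarized as a lookup. This is an illustrative sketch, not TagSort's implementation: the function name, the from_mito_gene parameter, and the order of the checks (intergenic before mitochondrial) are assumptions.

```python
# sF values for uniquely mapped reads (NH:i tag is 1), as listed in Table 2.
SF_TO_METRIC = {
    1: "reads_mapped_exonic",
    3: "reads_mapped_exonic",
    2: "reads_mapped_exonic_as",
    4: "reads_mapped_exonic_as",
    5: "reads_mapped_intronic",
    6: "reads_mapped_intronic_as",
    7: "reads_mapped_intergenic",
}


def classify_read(sf, nh, from_mito_gene=False):
    """Return the Table 2 metric a read would count toward, or None."""
    if sf == 0:
        return "reads_unmapped"
    if nh != 1:
        # Not uniquely mapped, so none of the sF-based categories apply.
        return None
    if sf == 7:
        return "reads_mapped_intergenic"
    if from_mito_gene:
        # Exonic and intronic metrics exclude mitochondrial reads.
        return "reads_mapped_mitochondrial"
    return SF_TO_METRIC.get(sf)


print(classify_read(sf=3, nh=1))  # prints reads_mapped_exonic
```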
Table 3. Gene metrics
Gene Metrics | Program | Details |
---|---|---|
ensembl_ids | GENCODE GTF | The gene_id listed in the GENCODE GTF file. |
Gene | GENCODE GTF | The unique gene_name provided in the GENCODE GTF file; identical to the gene_names attribute. |
gene_names | GENCODE GTF | The unique gene_name provided in the GENCODE GTF file; identical to the Gene attribute. |
n_reads | TagSort | The number of reads associated with this gene. |
tso_reads | TagSort | The number of reads that have 20 or more bp of TSO sequence clipped from the 5' end. Calculated using the first number of the cN tag in the BAM, which is specific to the number of TSO nucleotides clipped. |
noise_reads | TagSort | Not currently calculated for Optimus output; number of reads that are categorized by 10x Genomics Cell Ranger as "noise"; refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides. |
perfect_molecule_barcodes | TagSort | The number of reads with molecule barcodes (sequences used to identify unique transcripts) that have no errors. Learn more about UMIs in the Definitions section below. |
reads_mapped_exonic | STARsolo and TagSort | The number of unique reads counted as exon; counted when the BAM file's sF tag is assigned to 1 or 3 and the NH:i tag is 1; mitochondrial reads are excluded. |
reads_mapped_exonic_as | STARsolo and TagSort | The number of reads counted as exon in the antisense direction; counted when the BAM file's sF tag is assigned to 2 or 4 and the NH:i tag is 1; mitochondrial reads are excluded. |
reads_mapped_intronic | STARsolo and TagSort | The number of reads counted as intron; counted when the BAM file's sF tag is assigned to 5 and the NH:i tag is 1; mitochondrial reads are excluded. |
reads_mapped_intronic_as | STARsolo and TagSort | The number of reads counted as intron in the antisense direction; counted when the BAM file's sF tag is assigned to 6 and the NH:i tag is 1; mitochondrial reads are excluded. |
reads_mapped_uniquely | TagSort | The number of reads mapped to a single unambiguous location in the genome; mitochondrial reads are excluded. |
reads_mapped_multiple | TagSort | The number of reads mapped to multiple genomic positions with equal confidence; mitochondrial reads are excluded. |
spliced_reads | TagSort | The number of reads that overlap splicing junctions. |
antisense_reads | TagSort | The number of reads that are mapped to the antisense strand instead of the transcribed strand. |
duplicate_reads | TagSort | The number of duplicate reads. |
molecule_barcode_fraction_bases_above_30_mean | TagSort | The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the gene. |
molecule_barcode_fraction_bases_above_30_variance | TagSort | The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of the gene. |
genomic_reads_fraction_bases_quality_above_30_mean | TagSort | The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the gene (included for 10x Cell Ranger count comparison). |
genomic_reads_fraction_bases_quality_above_30_variance | TagSort | The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of the gene (included for 10x Cell Ranger count comparison). |
genomic_read_quality_mean | TagSort | Average quality of Illumina base calls in the genomic reads corresponding to the gene. |
genomic_read_quality_variance | TagSort | Variance in quality of Illumina base calls in the genomic reads corresponding to the gene. |
n_molecules | TagSort | Number of molecules corresponding to the gene (only reflects reads with CB and UB tags). |
n_fragments | TagSort | Number of fragments (distinct segments of reads that align to a specific location on the reference genome), corresponding to the gene. Learn more in the Definitions section below. |
reads_per_molecule | TagSort | The average number of reads associated with each molecule in the gene. |
reads_per_fragment | TagSort | The average number of reads associated with each fragment in the gene. |
fragments_per_molecule | TagSort | The average number of fragments associated with each molecule in the gene. |
fragments_with_single_read_evidence | TagSort | The number of fragments associated with the gene that are observed by only one read. |
molecules_with_single_read_evidence | TagSort | The number of molecules associated with the gene that are observed by only one read. |
reads_mapped_mitochondrial | TagSort | The number of unique reads (NH:i:1 BAM tag) that come from mitochondrial genes. |
number_cells_detected_multiple | TagSort | The number of cells which observe more than one read of the gene. |
number_cells_expressing | TagSort | The number of cells that detect the gene. |
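As a worked example of the last two gene metrics, the sketch below derives them from per-cell read counts for a single gene. It assumes that "detect" means at least one read; the function name and the input data are hypothetical.

```python
def gene_cell_summary(reads_per_cell):
    """Count cells that detect a gene, and cells with more than one read of it."""
    counts = list(reads_per_cell.values())
    return {
        # Assumption: a cell "detects" the gene when it has at least one read.
        "number_cells_expressing": sum(1 for n in counts if n > 0),
        "number_cells_detected_multiple": sum(1 for n in counts if n > 1),
    }


# Made-up read counts for one gene across three cell barcodes.
print(gene_cell_summary({"AAAC": 0, "AAAG": 1, "AAAT": 5}))
# prints {'number_cells_expressing': 2, 'number_cells_detected_multiple': 1}
```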
Definitions
- Cell Barcode: Short nucleotide sequence used to label and distinguish which reads come from each unique cell, allowing for tracking of many cells simultaneously.
- Fragment: A distinct segment of a read that aligns to a specific location on the reference genome. The TagSort function defines fragments based on: 1) the presence of a combined UMI/GX/CB tag, 2) the reference (Chr1, Chr2, etc.), 3) the nucleotide position, and 4) the strand. While some cells may have more n_fragments than n_reads (for example, when an RNA read overlaps an exon-exon junction), some barcodes may have fewer fragments than reads (for example, when a cell has multiple reads that overlap).
- Unique Molecular Identifier (UMI): Short nucleotide sequence used to label and distinguish which reads come from each unique transcript present in the cell at the time of lysis, allowing for tracking of many transcripts simultaneously.
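The fragment definition above can be sketched by counting distinct keys built from the four listed properties. This is an illustration under stated assumptions, not TagSort's code: the Segment record is hypothetical, a junction-spanning read is modeled as two aligned segments, and overlapping reads are modeled as sharing the same position.

```python
from collections import namedtuple

# Hypothetical record: one aligned segment of a read.
Segment = namedtuple("Segment", "read_id tag reference position strand")


def fragment_metrics(segments):
    """Return (n_reads, n_fragments) for a list of aligned segments."""
    n_reads = len({s.read_id for s in segments})
    # A fragment is keyed by combined tag, reference, position, and strand.
    n_fragments = len({(s.tag, s.reference, s.position, s.strand) for s in segments})
    return n_reads, n_fragments


# Two reads with the same key: fewer fragments than reads.
overlapping = [
    Segment("r1", "UMI1/GENE1/CELL1", "Chr1", 100, "+"),
    Segment("r2", "UMI1/GENE1/CELL1", "Chr1", 100, "+"),
]
# One read split across an exon-exon junction: more fragments than reads.
spliced = [
    Segment("r3", "UMI2/GENE1/CELL1", "Chr1", 100, "+"),
    Segment("r3", "UMI2/GENE1/CELL1", "Chr1", 900, "+"),
]
print(fragment_metrics(overlapping))  # prints (2, 1)
print(fragment_metrics(spliced))      # prints (1, 2)
```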