Celldega File Formats
While there has been tremendous progress in developing standardized data formats and architectures for spatial-omics data, namely SpatialData and the related AnnData, these approaches currently lack support for interactive cloud-based visualization of large (>100M transcripts) Spatial Transcriptomics (ST) data. Furthermore, all-in-one data format approaches preclude the development of compact visualization-specific data formats.
The Celldega project addresses these challenges with the development of a new ST data format called LandscapeFiles, specifically built for cloud-based visualization. LandscapeFiles support Celldega's Landscape visualization method by leveraging compact image formats and cloud-native data formats to enable efficient storage and visualization of image (e.g., microscopy images) and vectorized data (e.g., transcript coordinates). This approach is highly scalable, enabling the visualization of very large ST datasets (>400M transcripts), while remaining compact enough that the LandscapeFiles for an entire Xenium dataset can be hosted in a public GitHub repository.
LandscapeFiles
LandscapeFiles are generated using the Celldega pre module (see the example Google Colab notebook Celldega-Landscape-Pre-Process_Xenium-Pancreas-Dataset) and are used by Celldega's JavaScript front-end to interactively visualize ST data. Users have several options for hosting LandscapeFiles, either on the cloud (e.g., Terra.bio buckets) or locally (e.g., by running a local server).
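As one way to host LandscapeFiles locally, the sketch below serves a directory over HTTP using only Python's standard library. The `Access-Control-Allow-Origin` header is an assumption: a browser front-end loaded from a different origin would typically need it, but Celldega's actual hosting requirements are not described here. The names `CORSRequestHandler` and `make_landscape_server`, the directory path, and the port are all hypothetical.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that also allows cross-origin requests."""

    def end_headers(self):
        # Assumption: the front-end is served from another origin and needs CORS.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def make_landscape_server(directory, port=8000):
    """Build (but do not start) a server rooted at a LandscapeFiles directory."""
    handler = functools.partial(CORSRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted):
# make_landscape_server("path/to/landscape_files").serve_forever()
```

The server binds to 127.0.0.1 only, so the files are reachable from the local machine but not from the network.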
iST LandscapeFiles
The file structure for a Xenium Prime dataset's LandscapeFiles is shown below.
.
├── cbg
├── cell_clusters
├── cell_metadata.parquet
├── cell_segmentation
├── df_sig.parquet
├── landscape_parameters.json
├── meta_gene.parquet
├── pyramid_images
│ ├── bound_files
│ ├── dapi_files
│ ├── prot_files
│ └── rna_files
├── transcript_tiles
└── xenium_transform.csv
The LandscapeFiles for an example public 10X Genomics Xenium dataset can be found here.
Cell-by-Gene
The cbg directory contains one Parquet file per gene. Each file holds a table of that gene's non-zero single-cell expression counts. See example below:
A2ML1
aaaaljij-1 18
aaabgfcl-1 24
aaacghkb-1 28
aaachnfg-1 14
aaacknep-1 1
Cell Clusters
The cell_clusters directory contains single-cell clustering data. For Xenium data, this includes the default clustering results, stored in two Parquet files.
cluster.parquet
This file contains the cluster identity of each cell. See example below:
cluster
aaaaljij-1 28
aaabgfcl-1 27
aaacghkb-1 27
aaachnfg-1 28
aaacknep-1 28
meta_cluster.parquet
This file contains metadata on the cell clusters, which includes the color and cell count. See example below:
color count
1 #1f77b4 12742
2 #ff7f0e 10058
3 #2ca02c 9171
4 #d62728 8781
5 #9467bd 7760
Cell Metadata
The cell_metadata.parquet file contains the centroid positions of all cells. See example below:
cell_id name geometry
aaaaljij-1 aaaaljij-1 [819.7626194690856, 10819.416697734863]
aaabgfcl-1 aaabgfcl-1 [861.7377139772034, 10683.254024123535]
aaacghkb-1 aaacghkb-1 [876.8403955191346, 10627.146491566895]
aaachnfg-1 aaachnfg-1 [799.6315031020813, 10692.094786328125]
aaacknep-1 aaacknep-1 [760.0623424668274, 10729.360408533203]
Cell Segmentation
The cell_segmentation directory contains tiled Parquet files, each holding the segmentation polygons of the cells within one tile. See example below:
cell_id GEOMETRY name
mnnojdjm-1 [[[35052.998290142576, 2648.999973659546], [35... mnnojdjm-1
mnoafgkh-1 [[[35229.99735775, 2654.999800875], [35227.998... mnoafgkh-1
mnodjmcf-1 [[[35090.99920641015, 2657.9998580948486], [35... mnodjmcf-1
moelbbjj-1 [[[35233.997817008785, 2602.9998622198486], [3... moelbbjj-1
moemfhce-1 [[[35242.998275892576, 2539.9998095], [35239.9... moemfhce-1
Cell Cluster Gene Expression Signatures
The df_sig.parquet file contains the gene expression signatures of the cell clusters, where a signature is defined as the average gene expression level across a cluster's cells. Rows are genes and columns are clusters. See example below:
1 2 3 4 5 6 7 8 9 10 ... 20 21 22 23 24 25 26 27 28 29
A2ML1 0.000235 0.000597 0.000109 0.000114 0.002320 0.000823 0.000996 0.000372 0.000760 0.000473 ... 0.018356 0.0000 0.000000 0.000000 0.000000 0.000000 2.593168 8.309091 5.602632 0.000000
AAMP 0.296029 0.298668 0.032494 0.052727 0.003222 0.515027 0.014938 0.070061 0.395857 0.470894 ... 0.122905 0.0960 0.429596 0.058206 0.128743 0.511224 0.580745 0.456566 0.136842 0.052632
AAR2 0.075655 0.069994 0.015375 0.023118 0.002964 0.118705 0.006971 0.038840 0.091410 0.117132 ... 0.054270 0.0424 0.129148 0.022901 0.030938 0.130612 0.154244 0.117172 0.057895 0.021053
AARSD1 0.074557 0.156194 0.013412 0.017880 0.001546 0.200357 0.005477 0.028805 0.121057 0.120208 ... 0.047087 0.0272 0.093274 0.019084 0.028942 0.223469 0.120083 0.024242 0.005263 0.010526
ABAT 0.004787 0.008053 0.009814 0.015830 0.000902 0.009743 0.004481 0.004832 0.003801 0.006626 ... 0.008779 0.0072 0.004484 0.017176 0.008982 0.002041 0.006211 0.002020 0.000000 0.005263
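To make the signature definition concrete, here is a minimal sketch that averages one gene's counts within each cluster. It assumes the inputs mirror the cbg and cluster.parquet tables shown earlier (sparse non-zero counts per cell, and one cluster label per cell); the function name and the plain-dictionary inputs are illustrative and not part of Celldega's API.

```python
from collections import Counter, defaultdict


def cluster_signature(gene_counts, cell_cluster):
    """Average one gene's expression within each cluster.

    gene_counts: {cell_id: count}, non-zero entries only (as in a cbg table).
    cell_cluster: {cell_id: cluster label} for every cell.
    """
    cluster_sizes = Counter(cell_cluster.values())
    totals = defaultdict(float)
    for cell_id, count in gene_counts.items():
        totals[cell_cluster[cell_id]] += count
    # Cells absent from gene_counts contribute zero but still count in the mean.
    return {c: totals[c] / n for c, n in cluster_sizes.items()}


# Hypothetical toy data reusing cell IDs from the examples above.
counts = {"aaaaljij-1": 18, "aaabgfcl-1": 24}
clusters = {"aaaaljij-1": 28, "aaabgfcl-1": 27, "aaacghkb-1": 27, "aaachnfg-1": 28}
signature = cluster_signature(counts, clusters)
```

Because the cbg tables store only non-zero counts, the denominator has to come from the full cluster sizes rather than from the number of rows in the gene's table.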
Landscape Parameters
The landscape_parameters.json file contains configuration information for the dataset. See example below:
{
  "technology": "Xenium",
  "max_pyramid_zoom": 16,
  "tile_size": 250,
  "image_info": [
    {"name": "dapi", "button_name": "DAPI", "color": [0, 0, 255]},
    {"name": "bound", "button_name": "BOUND", "color": [0, 255, 0]},
    {"name": "rna", "button_name": "RNA", "color": [255, 0, 0]},
    {"name": "prot", "button_name": "PROT", "color": [255, 255, 255]}
  ],
  "image_format": ".webp"
}
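As an illustration of how a client might consume this configuration, the sketch below parses the JSON and derives a display color and a pyramid descriptor path for each image channel. The mapping from a channel `name` to `pyramid_images/<name>.dzi` is inferred from the directory listings in this document rather than from a documented rule, and the function name and output keys are hypothetical.

```python
import json


def image_channels(params_text):
    """Summarize the image channels declared in landscape_parameters.json."""
    params = json.loads(params_text)
    channels = []
    for info in params["image_info"]:
        r, g, b = info["color"]
        channels.append(
            {
                "name": info["name"],
                "label": info["button_name"],
                # RGB triplet rendered as a CSS-style hex color.
                "hex_color": f"#{r:02x}{g:02x}{b:02x}",
                # Assumption: each channel has a matching .dzi descriptor.
                "dzi_path": f"pyramid_images/{info['name']}.dzi",
                "tile_ext": params["image_format"],
            }
        )
    return channels
```

For the example file above, this would yield four entries (dapi, bound, rna, prot), with the DAPI channel colored `#0000ff`.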
Gene Metadata
The meta_gene.parquet file contains gene-level metadata: average expression across all cells, standard deviation, maximum expression, proportion of cells with non-zero expression, and the color assigned to each gene. See example below:
mean std max non-zero color
A2ML1 0.078391 3.128721 46.0 0.000009 #1f77b4
AAMP 0.175449 1.621841 7.0 0.000009 #ff7f0e
AAR2 0.048494 0.780702 4.0 0.000009 #2ca02c
AARSD1 0.060533 0.897824 4.0 0.000009 #d62728
ABAT 0.006575 0.285613 3.0 0.000009 #9467bd
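The per-gene statistics listed above can be derived from a gene's sparse counts plus the total number of cells. The sketch below shows one way to do so; it assumes the population standard deviation, which may differ from the convention Celldega uses, and the function name is hypothetical.

```python
import math


def gene_summary(nonzero_counts, n_cells):
    """Summary statistics for one gene from its non-zero counts.

    nonzero_counts: counts for the cells that express the gene.
    n_cells: total number of cells, including those with zero counts.
    """
    mean = sum(nonzero_counts) / n_cells
    # Zero-count cells add nothing to the sum of squares.
    mean_sq = sum(c * c for c in nonzero_counts) / n_cells
    variance = max(mean_sq - mean * mean, 0.0)  # guard against rounding error
    return {
        "mean": mean,
        "std": math.sqrt(variance),  # assumption: population std (ddof=0)
        "max": max(nonzero_counts, default=0),
        "non-zero": len(nonzero_counts) / n_cells,
    }
```

For example, a gene with counts `[2, 4]` among 4 cells has a mean of 1.5, a maximum of 4, and a non-zero proportion of 0.5.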
Pyramid Images
The pyramid_images directory contains the imaging-based ST (iST) images from all available channels, saved as Deep Zoom pyramids in the WebP image format. An example directory structure for a Xenium multi-modal dataset looks like:
.
├── bound.dzi
├── bound_files
│ ├── 0
│ ├── 1
│ ├── 10
│ ├── 11
│ ├── 12
│ ├── 13
│ ├── 14
│ ├── 15
│ ├── 16
│ ├── 2
│ ├── 3
│ ├── 4
│ ├── 5
│ ├── 6
│ ├── 7
│ ├── 8
│ ├── 9
│ └── vips-properties.xml
├── dapi.dzi
├── dapi_files
│ ├── 0
│ ├── 1
│ ├── 10
│ ├── 11
│ ├── 12
│ ├── 13
│ ├── 14
│ ├── 15
│ ├── 16
│ ├── 2
│ ├── 3
│ ├── 4
│ ├── 5
│ ├── 6
│ ├── 7
│ ├── 8
│ ├── 9
│ └── vips-properties.xml
├── prot.dzi
├── prot_files
│ ├── 0
│ ├── 1
│ ├── 10
│ ├── 11
│ ├── 12
│ ├── 13
│ ├── 14
│ ├── 15
│ ├── 16
│ ├── 2
│ ├── 3
│ ├── 4
│ ├── 5
│ ├── 6
│ ├── 7
│ ├── 8
│ ├── 9
│ └── vips-properties.xml
├── rna.dzi
└── rna_files
  ├── 0
  ├── 1
  ├── 10
  ├── 11
  ├── 12
  ├── 13
  ├── 14
  ├── 15
  ├── 16
  ├── 2
  ├── 3
  ├── 4
  ├── 5
  ├── 6
  ├── 7
  ├── 8
  ├── 9
  └── vips-properties.xml
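The numbered subdirectories are Deep Zoom levels. In the standard Deep Zoom layout, the highest level holds the full-resolution image, each lower level halves the resolution, and level 0 is a single pixel. The sketch below computes level sizes and tile paths under that convention. The 40000 x 30000 image size is hypothetical; the real dimensions and tile size are recorded in each `.dzi` descriptor.

```python
def dzi_max_level(width, height):
    """Highest Deep Zoom level, i.e. ceil(log2(max(width, height)))."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float error.
    return (max(width, height) - 1).bit_length()


def dzi_level_size(width, height, level):
    """Image size in pixels at a given level (each level down halves it)."""
    scale = 2 ** (dzi_max_level(width, height) - level)
    # Integer ceiling division.
    return -(-width // scale), -(-height // scale)


def dzi_tile_path(name, level, col, row, ext=".webp"):
    """Relative path of one tile, following the <name>_files/<level>/<col>_<row> layout."""
    return f"{name}_files/{level}/{col}_{row}{ext}"


# Hypothetical image size, chosen so the pyramid has levels 0 through 16.
full_size = dzi_level_size(40000, 30000, 16)
half_size = dzi_level_size(40000, 30000, 15)
```

Levels 0 through 16, as in the listing above, correspond to an image whose longer side is between 32769 and 65536 pixels.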
Transcript Tiles
The transcript_tiles directory contains tiled Parquet files, each holding the gene name and coordinates of the transcripts within one tile. See example below:
name geometry
20862147 AARSD1 [25663.38, 11758.09]
20862230 ABCA1 [25650.44, 11757.28]
20862780 ABCD1 [25634.19, 11754.12]
20862819 ABCD1 [25635.51, 11755.29]
20863051 ABHD6 [25506.54, 11753.75]
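Tiling lets a viewer fetch only the data that falls inside the current viewport. The sketch below shows the usual grid arithmetic for mapping a coordinate to a tile index. It assumes that the `tile_size` of 250 in landscape_parameters.json applies to these tiles and that the grid starts at the origin of the same coordinate space as the `geometry` column; neither assumption is confirmed here, and the naming of the tile files is not covered.

```python
import math


def tile_index(x, y, tile_size=250, x_min=0.0, y_min=0.0):
    """Column and row of the grid tile containing the point (x, y)."""
    return (
        math.floor((x - x_min) / tile_size),
        math.floor((y - y_min) / tile_size),
    )


def tiles_in_view(x0, y0, x1, y1, tile_size=250):
    """All tile indices overlapping the rectangle from (x0, y0) to (x1, y1)."""
    i0, j0 = tile_index(x0, y0, tile_size)
    i1, j1 = tile_index(x1, y1, tile_size)
    return [(i, j) for i in range(i0, i1 + 1) for j in range(j0, j1 + 1)]


# The first transcript in the example table above.
first_tile = tile_index(25663.38, 11758.09)
```

Under these assumptions, the first transcript in the example falls in tile (102, 47).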
Image Transformation
The xenium_transform.csv file contains the 3x3 transformation matrix that maps physical coordinates to image coordinates.
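A 3x3 matrix of this kind is normally applied to 2D points in homogeneous coordinates. The sketch below shows that calculation, assuming the column-vector convention (the matrix multiplies `[x, y, 1]` from the left). The matrix values are made up for illustration and are not taken from a real xenium_transform.csv; parsing the file is omitted because its delimiter is not specified here.

```python
def apply_transform(matrix, x, y):
    """Map a physical point (x, y) to image coordinates with a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = matrix
    xh = a * x + b * y + c
    yh = d * x + e * y + f
    w = g * x + h * y + i  # equals 1 for affine transforms
    return xh / w, yh / w


# Hypothetical matrix: scale by 2, then translate by (10, 20).
example_matrix = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 0.0, 1.0],
]
image_xy = apply_transform(example_matrix, 5.0, 5.0)
```

With this example matrix, the physical point (5, 5) maps to the image point (20, 30).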