Row Group Storage Mode
Overview
Celldega supports an optional row group storage mode that consolidates tile-based data into a single Parquet file (or a few chunked files) per data type using Apache Parquet's row group feature. Instead of storing thousands of individual tile files, this mode stores all tiles of a data type as row groups within one file.
This approach offers significant advantages for cloud hosting platforms that have file count limitations (e.g., GitHub, Hugging Face) while maintaining comparable performance through HTTP Range Requests.
Why Row Groups?
The Problem with Many Small Files
Traditional LandscapeFiles store each tile as an individual Parquet file:
- Transcript tiles: transcript_tiles/tile_X_Y.parquet (potentially thousands of files)
- Cell boundaries: cell_segmentation/tile_X_Y.parquet (thousands of files)
- Gene expression: cbg/GENE_NAME.parquet (hundreds of files)
- Image tiles: pyramid_images/channel_files/zoom/X_Y.webp (tens of thousands of files)
For a typical Xenium dataset, this can result in 50,000+ files, which:
- Can approach or exceed file-count limits on platforms like GitHub (100,000 files)
- Creates overhead in file system operations
- Complicates data distribution and hosting
The Row Group Solution
Row groups allow storing multiple logical "tiles" within a single physical file:
- Each row group corresponds to one tile/gene/image
- Row groups can be read independently via byte-range offsets
- HTTP Range Requests enable fetching only the needed row groups
- Total file count is reduced from 50,000+ to ~10 files
Enabling Row Group Mode
Python Preprocessing
Enable row group mode by setting use_row_groups=True when running preprocessing:
import celldega as dega
dega.pre.main(
    sample='Xenium_Sample',
    data_root_dir='data/xenium_data/',
    tile_size=250,
    path_dega_files='data/landscape_files/my_sample_row_groups',
    use_int_index=True,
    use_row_groups=True,  # Enable row group mode
)
Command Line
python -m celldega.pre.run_pre_processing \
--sample Xenium_Sample \
--data_root_dir data/xenium_data/ \
--path_dega_files data/landscape_files/my_sample_row_groups \
--use_row_groups True
File Structure Comparison
Traditional Mode (Individual Files)
landscape_files/
├── cbg/
│ ├── GENE1.parquet
│ ├── GENE2.parquet
│ └── ... (hundreds of files)
├── cell_segmentation/
│ ├── tile_0_0.parquet
│ ├── tile_0_1.parquet
│ └── ... (thousands of files)
├── transcript_tiles/
│ ├── tile_0_0.parquet
│ ├── tile_0_1.parquet
│ └── ... (thousands of files)
├── pyramid_images/
│ ├── dapi.dzi
│ ├── dapi_files/
│ │ ├── 0/
│ │ ├── 1/
│ │ └── ... (tens of thousands of WebP files)
│ └── ...
└── landscape_parameters.json
Row Group Mode (Consolidated Files)
landscape_files/
├── cbg.parquet # All genes as row groups
├── transcripts.parquet # All transcript tiles as row groups
├── cell_segmentation.parquet # All cell tiles as row groups
├── pyramid_images/
│ ├── dapi.dzi # Metadata preserved
│ ├── dapi.parquet # All image tiles as row groups
│ ├── rna.dzi
│ ├── rna.parquet
│ └── ...
└── landscape_parameters.json # Updated with row group info
How It Works
Formula-Based Indexing
Row groups are organized using a deterministic indexing scheme that allows direct lookup without parsing metadata:
row_group_index = tile_x * num_tiles_y + tile_y
For example, with a grid of 137×55 tiles (num_tiles_x = 137, num_tiles_y = 55):
- Tile (0, 0) → Row group 0
- Tile (0, 1) → Row group 1
- Tile (1, 0) → Row group 55
- Tile (136, 54) → Row group 7534
This formula enables the frontend to compute exactly which row groups to fetch based on the visible viewport, without needing to scan file metadata.
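The lookup can be sketched in a few lines of Python. This is a minimal illustration, not Celldega's frontend code: the function names and the viewport arguments are hypothetical, and it assumes that tile (i, j) covers the half-open square starting at (i * tile_size, j * tile_size).

```python
def tile_to_row_group(tile_x: int, tile_y: int, num_tiles_y: int) -> int:
    """Map a tile coordinate to its row group index (formula from the text)."""
    return tile_x * num_tiles_y + tile_y


def row_groups_in_viewport(x_min, y_min, x_max, y_max,
                           tile_size, num_tiles_x, num_tiles_y):
    """Return the row group indices of all tiles overlapping a viewport.

    Assumption: tile (i, j) covers [i*tile_size, (i+1)*tile_size) in x
    and [j*tile_size, (j+1)*tile_size) in y.
    """
    # Clamp the tile range to the grid so out-of-bounds viewports are safe.
    tx_min = max(0, int(x_min // tile_size))
    ty_min = max(0, int(y_min // tile_size))
    tx_max = min(num_tiles_x - 1, int(x_max // tile_size))
    ty_max = min(num_tiles_y - 1, int(y_max // tile_size))
    return [
        tile_to_row_group(tx, ty, num_tiles_y)
        for tx in range(tx_min, tx_max + 1)
        for ty in range(ty_min, ty_max + 1)
    ]


# A 137 x 55 grid with 250-unit tiles, as in the example above.
print(tile_to_row_group(136, 54, 55))  # 7534
print(row_groups_in_viewport(0, 0, 300, 300, 250, 137, 55))  # [0, 1, 55, 56]
```

Because the mapping is pure arithmetic, the set of row groups for a viewport is known before any network request is made.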
HTTP Range Requests
When hosted on a server that supports Range requests (most cloud storage does), the frontend uses parquet-wasm's streaming capabilities to fetch only the needed row groups:
- Frontend calculates which tiles are in view
- Computes row group indices using the formula
- Issues Range requests for just those row groups
- parquet-wasm fetches only the required bytes
This provides performance comparable to individual files while maintaining a single consolidated file.
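To illustrate the mechanism, the sketch below turns row group byte spans into HTTP Range header values. It is not how parquet-wasm is implemented; the helper names and the (offset, length) pairs are hypothetical, and in practice the spans come from the Parquet footer. Merging adjacent spans is an optional refinement that reduces the number of requests.

```python
def coalesce_ranges(ranges):
    """Merge (offset, length) byte ranges that touch or overlap."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length  # exclusive end
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of issuing a new request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def range_header(start, end):
    """Build an HTTP Range header value; HTTP byte ranges are inclusive."""
    return f"bytes={start}-{end - 1}"


# Hypothetical (offset, length) pairs for three row groups.
row_group_bytes = [(4, 1000), (1004, 500), (9000, 250)]
headers = [range_header(s, e) for s, e in coalesce_ranges(row_group_bytes)]
print(headers)  # ['bytes=4-1503', 'bytes=9000-9249']
```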
Fallback Mode
If the server doesn't support Range requests (e.g., local development servers), the frontend automatically falls back to:
1. Downloading the entire Parquet file once
2. Caching it in memory
3. Reading specific row groups from the cached data
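The decision between the two modes can be sketched as follows. This is an assumption-laden illustration rather than the frontend's actual logic: `supports_range` and `RowGroupSource` are hypothetical names. It relies only on standard HTTP behavior, where a server that honors a Range request answers 206 Partial Content with a Content-Range header, while a server that ignores it answers 200 with the full body.

```python
def supports_range(status_code: int, headers: dict) -> bool:
    """Decide from a probe response whether a Range request was honored."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return status_code == 206 and "content-range" in lowered


class RowGroupSource:
    """Hypothetical helper that picks a loading strategy and remembers it."""

    def __init__(self, status_code, headers):
        self.mode = "range" if supports_range(status_code, headers) else "full_download"
        self._cache = None  # holds the whole file in fallback mode

    def load_full(self, fetch_all):
        # Download the file only on first use, then serve it from memory.
        if self._cache is None:
            self._cache = fetch_all()
        return self._cache


# A server that ignored the Range header and returned the full body.
src = RowGroupSource(200, {"Content-Type": "application/octet-stream"})
print(src.mode)  # full_download
```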
landscape_parameters.json Updates
When use_row_groups=True, the landscape_parameters.json includes additional configuration:
{
"technology": "Xenium",
"use_row_groups": true,
"tile_grid": {
"num_tiles_x": 137,
"num_tiles_y": 55,
"tile_size": 250
},
"row_group_files": {
"transcripts": "transcripts.parquet",
"cell_segmentation": "cell_segmentation.parquet",
"cbg": "cbg.parquet",
"images": {
"dapi": {
"path": "pyramid_images/dapi.parquet",
"zoom_info": { ... },
"zoom_levels": [0, 1, 2, ..., 16]
},
"rna": {
"path": "pyramid_images/rna.parquet",
"zoom_info": { ... },
"zoom_levels": [0, 1, 2, ..., 16]
}
}
},
"image_dimensions": {
"width": 34375,
"height": 13750,
"tile_size": 256
},
"max_pyramid_zoom": 16
}
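A consumer of this file might resolve data sources along the following lines. This is a sketch under stated assumptions: `resolve_source` and `expected_tile_row_groups` are hypothetical helpers, the JSON is a trimmed copy of the example above with the image entries omitted, and the traditional directory names are taken from the layout shown earlier.

```python
import json

# Trimmed copy of the configuration above (image entries omitted).
PARAMS_JSON = """
{
  "use_row_groups": true,
  "tile_grid": {"num_tiles_x": 137, "num_tiles_y": 55, "tile_size": 250},
  "row_group_files": {
    "transcripts": "transcripts.parquet",
    "cell_segmentation": "cell_segmentation.parquet",
    "cbg": "cbg.parquet"
  }
}
"""

# Directory names used by traditional mode, from the layout shown earlier.
TRADITIONAL_DIRS = {
    "transcripts": "transcript_tiles",
    "cell_segmentation": "cell_segmentation",
    "cbg": "cbg",
}


def resolve_source(params: dict, data_type: str) -> str:
    """Return the consolidated file, or the tile directory in traditional mode."""
    if params.get("use_row_groups", False):
        return params["row_group_files"][data_type]
    return TRADITIONAL_DIRS[data_type] + "/"


def expected_tile_row_groups(params: dict) -> int:
    """One row group per grid tile, empty tiles included."""
    grid = params["tile_grid"]
    return grid["num_tiles_x"] * grid["num_tiles_y"]


params = json.loads(PARAMS_JSON)
print(resolve_source(params, "transcripts"))  # transcripts.parquet
print(expected_tile_row_groups(params))       # 7535
```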
Data Types Supported
Transcripts (transcripts.parquet)
- One row group per spatial tile
- Columns: name (gene), geometry (coordinates), tile_x, tile_y
- Empty tiles are preserved as empty row groups for index consistency
Cell Boundaries (cell_segmentation.parquet)
- One row group per spatial tile
- Columns: cell_id, GEOMETRY (polygon), name, tile_x, tile_y
- Same grid as transcripts for consistent indexing
Cell-by-Gene (cbg.parquet)
- One row group per gene (sorted alphabetically)
- Columns: cell_id, expression, gene
- Gene-to-row-group mapping stored in Parquet metadata
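As an illustration of what such a mapping looks like, the sketch below derives one from a sorted gene list. The gene names and the function name are placeholders, and the exact sort order Celldega uses (for example, its handling of case) and the metadata key it writes are not specified here, so this assumes Python's default string ordering.

```python
def build_gene_index(genes):
    """Map each gene name to its row group index, in sorted order.

    Assumption: "sorted alphabetically" matches Python's default string sort.
    """
    return {gene: index for index, gene in enumerate(sorted(genes))}


# Placeholder gene names for illustration only.
gene_index = build_gene_index(["EPCAM", "ACTB", "CD3E"])
print(gene_index)  # {'ACTB': 0, 'CD3E': 1, 'EPCAM': 2}
print(gene_index["CD3E"])  # 1
```

A reader holding this mapping can request a single gene's row group directly instead of scanning the file.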
Image Tiles (pyramid_images/<channel>.parquet)
- One row group per image tile across all zoom levels
- Columns: image_data (binary), zoom, tile_x, tile_y
- Zoom-aware indexing with cumulative offsets
- Original WebP tile directories are deleted; .dzi files are preserved
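One way zoom-aware indexing with cumulative offsets could work is sketched below. The layout is an assumption, not a description of Celldega's actual ordering: zoom levels are stored one after another starting at level 0, and tiles within a level follow the same tile_x * rows + tile_y pattern as the spatial tiles. Level dimensions follow the Deep Zoom convention of halving (rounding up) at each step below the maximum level. All function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def level_grids(width, height, tile_size, max_zoom):
    """Tile grid (cols, rows) for each Deep Zoom level 0..max_zoom."""
    grids = []
    for zoom in range(max_zoom + 1):
        scale = 2 ** (max_zoom - zoom)  # level max_zoom is full resolution
        level_w = max(1, ceil_div(width, scale))
        level_h = max(1, ceil_div(height, scale))
        grids.append((ceil_div(level_w, tile_size), ceil_div(level_h, tile_size)))
    return grids


def cumulative_offsets(grids):
    """Index of the first row group of each level, plus the total count."""
    offsets, total = [], 0
    for cols, rows in grids:
        offsets.append(total)
        total += cols * rows
    return offsets, total


def image_row_group(zoom, tile_x, tile_y, grids, offsets):
    """Assumed layout: level offset + tile_x * rows + tile_y."""
    _, rows = grids[zoom]
    return offsets[zoom] + tile_x * rows + tile_y


# Small illustrative image: 1024 x 512 pixels, 256-pixel tiles, 11 levels.
grids = level_grids(1024, 512, 256, max_zoom=10)
offsets, total = cumulative_offsets(grids)
print(grids[10], offsets[10], total)  # (4, 2) 11 19
print(image_row_group(10, 3, 1, grids, offsets))  # 18
```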
Performance Considerations
Advantages
- Reduced file count: 50,000+ files → ~10 files
- Better hosting compatibility: Works with GitHub, Hugging Face limits
- Efficient streaming: HTTP Range Requests fetch only needed data
- Simpler distribution: Fewer files to manage and transfer
Trade-offs
- Footer size: Large datasets require reading a larger Parquet footer (~100 KB to 1 MB)
- Initial overhead: First request includes footer parsing
- Memory in fallback mode: Full file loaded if Range requests unavailable
Optimizations Applied
- write_statistics=False reduces footer size by ~50%
- Deterministic indexing eliminates metadata scanning
- Automatic fallback ensures compatibility
Backwards Compatibility
- Row group mode is opt-in via use_row_groups=True
- Default behavior (use_row_groups=False) is unchanged
- Frontend auto-detects the mode from landscape_parameters.json
- Both modes can coexist in different datasets