Dataset Module API Reference
The dataset module contains dataset-level modality constructors and
helpers for building MuData-backed DatasetCollection objects from cell-level
AnnData metadata.
DatasetCollection is the dataset-level collection object: its obs table is
the canonical dataset/sample axis, and derived feature spaces are stored
directly as MuData modalities in dset.mod. Dataset-specific feature
calculation should happen through methods on this object so the result is
attached to the collection that owns the dataset axis. Cell-level AnnData is
used as an input to constructors and calculations, but it is not stored on the
dataset-level collection.
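
    import celldega as dega

    # adata is an existing cell-level AnnData. It is read by the constructor
    # and the calc_* methods but is not stored on the collection.
    dset = dega.dataset.DatasetCollection(
        adata,
        dataset_col="sample_id",
        obs_columns=["patient_id", "condition"],
    )

    # Attach a dataset-by-population modality (cell-type proportions).
    dset.calc_dataset_by_pop(
        adata,
        category="cell_type",
        output="proportion",
    )
    population = dset.mod["population"]

    # Attach a dataset-by-gene signature for the CD8 T population.
    dset.calc_dataset_signature(
        adata,
        category="cell_type",
        value="CD8 T",
        modality_name="cd8_t_expression",
        missing_datasets="nan",
    )
    cd8_t_expression = dset.mod["cd8_t_expression"]

    # Round-trip the collection through an .h5mu file.
    dset.write("dataset.h5mu")
    loaded = dega.dataset.DatasetCollection.read("dataset.h5mu")
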
Dataset-level Celldega collection objects.
DatasetCollection
Bases: CelldegaCollection
Dataset-level collection with convenience modality constructors.
DatasetCollection is the dataset-level Celldega collection. Its obs
table is the canonical dataset/sample axis, and feature constructors attach
clusterable AnnData modalities directly to self.mod.
__init__(adata=None, dataset_col='sample_id', obs_columns=None, obs=None, mdata=None, source=None, name=None, meta=None, mod=None, relations=None, provenance=None, uns=None)
Build a dataset-level collection.
The dataset/sample observation axis is established in one of three ways:
from a pre-built mdata (dataset_col is recovered from its metadata),
from an explicit obs table, or, most commonly, by grouping
cell-level adata by dataset_col (one row per unique dataset,
with an n_cells count and the first value of each column in obs_columns).
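The grouping path can be pictured with a small standard-library sketch. This is not Celldega's implementation: `build_dataset_obs` is a hypothetical helper, and a list of dictionaries stands in for the rows of `adata.obs`. It only illustrates the documented outcome, one row per unique dataset with an `n_cells` count and the first value seen for each requested metadata column.

```python
def build_dataset_obs(cell_obs, dataset_col="sample_id", obs_columns=None):
    """Collapse cell-level rows into one entry per dataset (illustrative only).

    cell_obs is a list of dicts standing in for the rows of adata.obs.
    """
    obs_columns = obs_columns or []
    dataset_obs = {}
    for row in cell_obs:
        dataset_id = row[dataset_col]
        if dataset_id not in dataset_obs:
            # First cell seen for this dataset: keep its metadata values.
            entry = {col: row[col] for col in obs_columns}
            entry["n_cells"] = 0
            dataset_obs[dataset_id] = entry
        dataset_obs[dataset_id]["n_cells"] += 1
    return dataset_obs


# Made-up cell-level metadata for two datasets.
cells = [
    {"sample_id": "s1", "patient_id": "p1", "condition": "ctrl"},
    {"sample_id": "s1", "patient_id": "p1", "condition": "ctrl"},
    {"sample_id": "s2", "patient_id": "p2", "condition": "treated"},
]
dataset_obs = build_dataset_obs(cells, obs_columns=["patient_id", "condition"])
```

Here `dataset_obs` has one entry each for `s1` and `s2`, in first-seen order, which mirrors the role of the collection's obs table as the dataset/sample axis.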
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData \| None | Cell-level … | None |
| dataset_col | str | | 'sample_id' |
| obs_columns | list[str] \| None | Per-dataset metadata columns to carry over from … | None |
| obs | DataFrame \| None | Pre-built dataset observation table (alternative to …) | None |
| mdata | MuData \| None | Pre-built … | None |
| source | str \| dict[str, Any] \| None | Source descriptor recorded in provenance and … | None |
| name | str \| None | Optional collection name (stored in metadata). | None |
| meta | dict[str, Any] \| None | Extra metadata merged into … | None |
| mod | dict[str, AnnData] \| None | Feature-space modalities to attach up front. | None |
| relations | dict[str, Any] \| None | Square dataset-by-dataset matrices for … | None |
| provenance | dict[str, Any] \| None | Free-form provenance metadata. | None |
| uns | dict[str, Any] \| None | Extra Celldega metadata. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If … |
calc_dataset_by_pop(adata, category='leiden', modality_name='population', output='proportion', min_cells=1, dataset_col=None)
Calculate a dataset-by-population modality and attach it to self.mod.
For each dataset, counts cells per category value and stores the
result as a dataset (rows) by population (columns) feature matrix.
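The counting step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: `dataset_by_population` is a hypothetical name, proportions are assumed to be each dataset's counts divided by its total cell count, `min_cells` is assumed to drop datasets with fewer total cells, and `"count"` is a made-up label for the raw-count output (the accepted `output` values other than `'proportion'` are not listed here).

```python
from collections import Counter, defaultdict


def dataset_by_population(cells, dataset_col="sample_id", category="leiden",
                          output="proportion", min_cells=1):
    """Build a dataset (rows) by population (columns) matrix (illustrative only)."""
    counts = defaultdict(Counter)
    for row in cells:
        counts[row[dataset_col]][row[category]] += 1

    # Column order: every population seen in any dataset, sorted by name.
    populations = sorted({pop for c in counts.values() for pop in c})

    matrix = {}
    for dataset_id, pop_counts in counts.items():
        total = sum(pop_counts.values())
        if total < min_cells:
            continue  # assumed meaning of min_cells: drop small datasets
        if output == "proportion":
            matrix[dataset_id] = [pop_counts[p] / total for p in populations]
        else:
            # Any other value returns raw counts in this sketch.
            matrix[dataset_id] = [pop_counts[p] for p in populations]
    return populations, matrix


# Made-up cell-level labels: s1 has three T cells and one B cell, s2 has two B cells.
cells = [
    {"sample_id": "s1", "cell_type": "T"},
    {"sample_id": "s1", "cell_type": "T"},
    {"sample_id": "s1", "cell_type": "T"},
    {"sample_id": "s1", "cell_type": "B"},
    {"sample_id": "s2", "cell_type": "B"},
    {"sample_id": "s2", "cell_type": "B"},
]
populations, matrix = dataset_by_population(cells, category="cell_type")
```

A population that is absent from a dataset gets a zero in that row, so every row has the same columns and the matrix can be used as a clusterable feature space.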
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Cell-level … | required |
| category | str | | 'leiden' |
| modality_name | str | Key for the modality in … | 'population' |
| output | str | | 'proportion' |
| min_cells | int | Minimum cells for a dataset row to be kept. | 1 |
| dataset_col | str \| None | Override the collection's dataset column; defaults to … | None |
Returns:
| Type | Description |
|---|---|
| None | |
calc_dataset_signature(adata, category, value, modality_name=None, layer=None, aggregate='sum', normalization='log1p_cpm', min_cells=1, missing_datasets='nan', dataset_col=None, var_entity_type='gene')
Calculate and attach a dataset-by-feature signature for one category value.
Selects the cells where adata.obs[category] == value and aggregates
their expression per dataset into a dataset (rows) by gene (columns)
signature modality (a pseudobulk profile for that one population).
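A minimal sketch of the pseudobulk idea, using only the standard library, is shown below. Several details are assumptions rather than documented behavior: `dataset_signature` is a hypothetical helper, `'log1p_cpm'` is taken to mean `log1p(counts / total * 1e6)` applied to the summed counts, `missing_datasets='nan'` is taken to mean that datasets with too few selected cells get an all-NaN row, and every cell's dataset id is expected to appear in `dataset_ids`.

```python
import math


def dataset_signature(cells, counts, category, value, dataset_ids,
                      dataset_col="sample_id", min_cells=1):
    """Sum expression per dataset for one population (illustrative only).

    cells:  list of dicts standing in for adata.obs rows
    counts: list of per-cell gene count lists, in the same order as cells
    """
    n_genes = len(counts[0])
    sums = {d: [0.0] * n_genes for d in dataset_ids}
    n_selected = {d: 0 for d in dataset_ids}

    for row, cell_counts in zip(cells, counts):
        if row[category] != value:
            continue  # keep only cells of the requested population
        d = row[dataset_col]
        n_selected[d] += 1
        sums[d] = [s + c for s, c in zip(sums[d], cell_counts)]

    signature = {}
    for d in dataset_ids:
        total = sum(sums[d])
        if n_selected[d] < min_cells or total == 0:
            # Assumed 'nan' policy: keep the row, fill it with NaN.
            signature[d] = [math.nan] * n_genes
        else:
            # Assumed log1p CPM: scale to counts per million, then log1p.
            signature[d] = [math.log1p(s / total * 1e6) for s in sums[d]]
    return signature


# Made-up data: two CD8 T cells in s1, none in s2; two genes per cell.
cells = [
    {"sample_id": "s1", "cell_type": "CD8 T"},
    {"sample_id": "s1", "cell_type": "CD8 T"},
    {"sample_id": "s1", "cell_type": "B"},
    {"sample_id": "s2", "cell_type": "B"},
]
counts = [[1, 3], [2, 2], [9, 9], [5, 5]]
signature = dataset_signature(
    cells, counts, category="cell_type", value="CD8 T", dataset_ids=["s1", "s2"]
)
```

Keeping a NaN row for `s2` preserves alignment with the dataset axis, so the signature has one row per dataset even when a population is absent from some of them.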
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Cell-level … | required |
| category | str | | required |
| value | Any | The … | required |
| modality_name | str \| None | Key for the modality; defaults to … | None |
| layer | str \| None | | None |
| aggregate | str | | 'sum' |
| normalization | str \| None | | 'log1p_cpm' |
| min_cells | int | Minimum selected cells for a dataset to get a real row. | 1 |
| missing_datasets | str | | 'nan' |
| dataset_col | str \| None | Override the collection's dataset column. | None |
| var_entity_type | str | Entity type written to the modality's … | 'gene' |
Returns:
| Type | Description |
|---|---|
| None | |