Dataset Module API Reference

The dataset module contains dataset-level modality constructors and helpers for building MuData-backed DatasetCollection objects from cell-level AnnData metadata.

DatasetCollection is the dataset-level collection object: its obs table is the canonical dataset/sample axis, and derived feature spaces are stored directly as MuData modalities in dset.mod. Dataset-specific feature calculation should happen through methods on this object so the result is attached to the collection that owns the dataset axis. Cell-level AnnData is used as an input to constructors and calculations, but it is not stored on the dataset-level collection.

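import celldega as dega

# `adata` is a cell-level AnnData whose obs holds the sample_id, patient_id,
# condition, and cell_type columns used below.

# Build the dataset axis: one row per unique sample_id.
dset = dega.dataset.DatasetCollection(
    adata,
    dataset_col="sample_id",
    obs_columns=["patient_id", "condition"],
)

# Attach a dataset-by-population modality of within-dataset proportions.
dset.calc_dataset_by_pop(
    adata,
    category="cell_type",
    output="proportion",
)

population = dset.mod["population"]

# Attach a per-dataset expression signature for the CD8 T population.
dset.calc_dataset_signature(
    adata,
    category="cell_type",
    value="CD8 T",
    modality_name="cd8_t_expression",
    missing_datasets="nan",
)

cd8_t_expression = dset.mod["cd8_t_expression"]

# Round-trip the collection through an .h5mu file.
dset.write("dataset.h5mu")
loaded = dega.dataset.DatasetCollection.read("dataset.h5mu")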

Dataset-level Celldega collection objects.

DatasetCollection

Bases: CelldegaCollection

Dataset-level collection with convenience modality constructors.

DatasetCollection is the dataset-level Celldega collection. Its obs table is the canonical dataset/sample axis, and feature constructors attach clusterable AnnData modalities directly to self.mod.

__init__(adata=None, dataset_col='sample_id', obs_columns=None, obs=None, mdata=None, source=None, name=None, meta=None, mod=None, relations=None, provenance=None, uns=None)

Build a dataset-level collection.

The dataset/sample observation axis is established in one of three ways: from a pre-built mdata (with dataset_col recovered from its metadata), from an explicit obs table, or, most commonly, by grouping cell-level adata by dataset_col. The last path yields one row per unique dataset, with an n_cells count and the first value of each column in obs_columns.

Parameters:

adata (AnnData | None, default None): Cell-level AnnData to derive the dataset axis from (required when neither obs nor mdata is given).

dataset_col (str, default 'sample_id'): adata.obs column identifying the dataset/sample/patient unit; becomes the collection's observation index.

obs_columns (list[str] | None, default None): Per-dataset metadata columns to carry over from adata.obs (first value per dataset).

obs (DataFrame | None, default None): Pre-built dataset observation table (alternative to adata).

mdata (MuData | None, default None): Pre-built MuData to wrap (e.g. from read).

source (str | dict[str, Any] | None, default None): Source descriptor recorded in provenance and uns["sources"]["cells"].

name (str | None, default None): Optional collection name (stored in metadata).

meta (dict[str, Any] | None, default None): Extra metadata merged into uns["celldega"].

mod (dict[str, AnnData] | None, default None): Feature-space modalities to attach up front.

relations (dict[str, Any] | None, default None): Square dataset-by-dataset matrices for mdata.obsp.

provenance (dict[str, Any] | None, default None): Free-form provenance metadata.

uns (dict[str, Any] | None, default None): Extra Celldega metadata.

Raises:

ValueError: If adata is not provided and neither obs nor mdata is given.
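The default construction path described for __init__ (grouping cell-level obs by dataset_col) can be approximated with plain pandas. This is a minimal sketch under stated assumptions, not Celldega's implementation: the toy cell_obs table stands in for adata.obs, and the exact column order and dtypes of the real obs table are not confirmed by this page.

```python
import pandas as pd

# Toy cell-level obs table standing in for adata.obs (all values are made up).
cell_obs = pd.DataFrame(
    {
        "sample_id": ["s1", "s1", "s2", "s2", "s2"],
        "patient_id": ["p1", "p1", "p2", "p2", "p2"],
        "condition": ["ctrl", "ctrl", "treated", "treated", "treated"],
    }
)

dataset_col = "sample_id"
obs_columns = ["patient_id", "condition"]

grouped = cell_obs.groupby(dataset_col, sort=True)

# One row per unique dataset. GroupBy.first() takes the first non-null
# value of each carried-over column within the group.
dataset_obs = grouped[obs_columns].first()

# Number of cells per dataset, aligned on the dataset index.
dataset_obs.insert(0, "n_cells", grouped.size())

print(dataset_obs)
```

The resulting table is indexed by sample_id, which matches the description of dataset_col becoming the collection's observation index.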

calc_dataset_by_pop(adata, category='leiden', modality_name='population', output='proportion', min_cells=1, dataset_col=None)

Calculate a dataset-by-population modality and attach it to self.mod.

For each dataset, counts cells per category value and stores the result as a dataset (rows) by population (columns) feature matrix.

Parameters:

adata (AnnData, required): Cell-level AnnData containing dataset_col and category in obs.

category (str, default 'leiden'): obs column naming the population/cell-type/cluster.

modality_name (str, default 'population'): Key for the modality in self.mod.

output (str, default 'proportion'): "proportion" (within-dataset fractions) or "counts".

min_cells (int, default 1): Minimum cells for a dataset row to be kept.

dataset_col (str | None, default None): Override the collection's dataset column; defaults to self.dataset_col.

Returns:

None: The modality is attached to self.mod[modality_name].
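The counting step behind calc_dataset_by_pop can be illustrated with a pandas cross-tabulation. This is a hedged sketch of the idea, not the library's code: the toy data are invented, and it assumes min_cells is compared against each dataset's total cell count, which this page does not spell out.

```python
import pandas as pd

# Toy cell-level obs table standing in for adata.obs (values are made up).
cell_obs = pd.DataFrame(
    {
        "sample_id": ["s1", "s1", "s1", "s2", "s2", "s3"],
        "cell_type": ["B", "CD8 T", "CD8 T", "B", "B", "CD8 T"],
    }
)

min_cells = 2

# Dataset (rows) by population (columns) cell counts: output="counts".
counts = pd.crosstab(cell_obs["sample_id"], cell_obs["cell_type"])

# Assumed reading of min_cells: drop datasets with fewer total cells.
counts = counts.loc[counts.sum(axis=1) >= min_cells]

# output="proportion": divide each row by its total so it sums to 1.
proportion = counts.div(counts.sum(axis=1), axis=0)

print(proportion)
```

Here s3 has a single cell and is dropped by the min_cells filter, while s1 and s2 each keep a row whose fractions sum to 1.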

calc_dataset_signature(adata, category, value, modality_name=None, layer=None, aggregate='sum', normalization='log1p_cpm', min_cells=1, missing_datasets='nan', dataset_col=None, var_entity_type='gene')

Calculate and attach a dataset-by-feature signature for one category value.

Selects the cells where adata.obs[category] == value and aggregates their expression per dataset into a dataset (rows) by gene (columns) signature modality (a pseudobulk profile for that one population).

Parameters:

adata (AnnData, required): Cell-level AnnData with dataset_col, category, and an expression matrix (X or layer).

category (str, required): obs column to select on (e.g. "cell_type").

value (Any, required): The category value whose cells form the signature (e.g. "CD8 T").

modality_name (str | None, default None): Key for the modality; defaults to f"{slug(value)}_signature".

layer (str | None, default None): adata layer to aggregate; None uses adata.X.

aggregate (str, default 'sum'): "sum" or "mean" across the selected cells.

normalization (str | None, default 'log1p_cpm'): None, "cpm", or "log1p_cpm" applied per dataset row.

min_cells (int, default 1): Minimum selected cells for a dataset to get a real row.

missing_datasets (str, default 'nan'): "nan" keeps the full observation axis and marks datasets below min_cells (or with no selected cells) as NaN rows; "raise" rejects them instead.

dataset_col (str | None, default None): Override the collection's dataset column.

var_entity_type (str, default 'gene'): Entity type written to the modality's var["entity_type"] (default "gene").

Returns:

None: The modality is attached to self.mod.
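The pseudobulk idea behind calc_dataset_signature can be sketched with NumPy and pandas. This is an illustration under assumptions rather than the library's implementation: the gene names and counts are invented, "cpm" is assumed to mean scaling each dataset row to sum to one million, "log1p_cpm" is assumed to be log1p of that, and min_cells filtering is left out for brevity.

```python
import numpy as np
import pandas as pd

# Toy inputs standing in for adata.var_names, adata.obs, and adata.X.
genes = ["GZMB", "CD8A", "MS4A1"]
cell_obs = pd.DataFrame(
    {
        "sample_id": ["s1", "s1", "s2", "s3"],
        "cell_type": ["CD8 T", "CD8 T", "B", "CD8 T"],
    }
)
X = np.array(
    [[3, 1, 0], [1, 1, 0], [0, 0, 5], [2, 2, 0]],
    dtype=float,
)

# The collection's full dataset axis, including datasets without CD8 T cells.
all_datasets = ["s1", "s2", "s3"]

# Keep only the cells of the requested population.
mask = (cell_obs["cell_type"] == "CD8 T").to_numpy()
selected = pd.DataFrame(X[mask], columns=genes)
selected_ids = cell_obs.loc[mask, "sample_id"].to_numpy()

# aggregate="sum": add up the selected cells within each dataset.
pseudobulk = selected.groupby(selected_ids).sum()

# Assumed "log1p_cpm": scale each row to one million, then apply log1p.
cpm = pseudobulk.div(pseudobulk.sum(axis=1), axis=0) * 1e6
signature = np.log1p(cpm)

# missing_datasets="nan": restore the full axis; s2 becomes a NaN row.
signature = signature.reindex(all_datasets)

print(signature)
```

Keeping the NaN row for s2 means the signature shares its row index with the collection's other modalities, which appears to be the point of the "nan" option.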