Set Module API Reference

The set module provides SetCollection, the set-level Celldega collection. A SetCollection represents collections that are literally defined as sets of some base element (most commonly cells) with no intrinsic geometry of their own — clustering results, spatial-domain algorithm outputs (SpaGCN, GraphST, GASTON, Points2Regions), or manual annotations projected back to cells.

Each observation is one set. The defining modality membership is a sparse sets × cells incidence matrix, so a set never loses track of exactly which cells belong to it. Where DatasetCollection and NeighborhoodCollection make a derived feature (an expression signature, a geometry) first-class, SetCollection makes membership itself first-class, and signatures, population composition, and set-to-set overlap are all derived from it.

```python
import celldega as dega

# Build one SetCollection per clustering "opinion" (the cells define the sets)
clust = dega.set.SetCollection(adata, set_col="leiden", name="leiden")

# Per-set expression signature (pseudobulk). feature_type is only required when
# passing a MuData; for an AnnData it defaults to "gene" -> modality "expression".
clust.calc_signature(adata)
clust.calc_signature(mdata, feature_type="protein")   # protein modality of a MuData

# Per-set cell-type composition (sets x populations)
clust.calc_population(adata, category="cell_type")

# Cross-algorithm comparison: membership IoU between two SetCollections that
# share the same cells (different obs). Rectangular modality on `clust`.
clust_b = dega.set.SetCollection(adata, set_col="spagcn", name="spagcn")
clust.calc_overlap(clust_b)

# Consensus across algorithms: concatenate, self-overlap (square relation),
# make it a clusterable modality, then cut the dendrogram via the Matrix.
combined = dega.set.concat_sets([clust, clust_b])
combined.calc_overlap()                        # -> combined.relations["overlap"]
combined.add_relation_modality("overlap")      # -> combined.mod["overlap_relation"]

clust.write("clusters.h5mu")
loaded = dega.set.SetCollection.read("clusters.h5mu")
```

Hierarchical clustering of any modality is done with the Matrix / Clustergram classes, and the resulting dendrogram can be cut into flat labels with Matrix.to_cluster / Clustergram.to_cluster (e.g. to define consensus domains or meta-clusters), which you then attach back to the collection's obs.
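As a rough illustration of what cutting a dendrogram into flat labels involves, the sketch below uses SciPy directly as a stand-in for Matrix.to_cluster / Clustergram.to_cluster, whose internals are not described here. The IoU values, the set ids, and the choice of average linkage on 1 - IoU are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical square set-by-set IoU relation for a concatenated collection.
set_ids = ["leiden::0", "leiden::1", "spagcn::0", "spagcn::1"]
iou = np.array([
    [1.0, 0.0, 0.8, 0.1],
    [0.0, 1.0, 0.1, 0.9],
    [0.8, 0.1, 1.0, 0.0],
    [0.1, 0.9, 0.0, 1.0],
])

# Turn similarity into a condensed distance vector (assumed choice: 1 - IoU).
dist = squareform(1.0 - iou, checks=True)

# Build the dendrogram and cut it into two flat groups.
tree = linkage(dist, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

# Flat labels indexed by set id, ready to be joined onto an obs table.
consensus = pd.Series(labels, index=set_ids, name="consensus")
print(consensus)
```

Sets that share most of their cells (here leiden::0 with spagcn::0, and leiden::1 with spagcn::1) end up with the same flat label, which is the kind of column you would attach back to the collection's obs.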

Set-level Celldega collection objects.

`SetCollection`

Bases: CelldegaCollection

Set-level Celldega collection backed by a sets x elements membership matrix.

The canonical observation axis is one row per set; the defining modality membership is a sparse AnnData with sets as observations and elements (cells) as variables, carrying per-cell spatial coordinates in var when available. Feature spaces (expression signatures) and relations (set-to-set overlap) are derived from this membership.

`__init__(adata=None, set_col=None, obs=None, mdata=None, membership=None, name=None, source=None, element_type='cell', meta=None, mod=None, relations=None, provenance=None, uns=None)`

Build a set-level collection.

The set observation axis is established in one of three ways: from a pre-built mdata (e.g. via SetCollection.read), from a ready-made membership modality, or, most commonly, by binning cell-level adata over the categorical set_col (one row per unique label), which also constructs the sparse membership modality and tags cell coordinates onto its var.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `adata` | `AnnData \| None` | Cell-level `AnnData` whose `set_col` labels define the sets (required when neither `obs`/`membership` nor `mdata` is given). | `None` |
| `set_col` | `str \| None` | `adata.obs` column whose categories become the sets (e.g. `"leiden"`, `"spagcn"`); recorded as the source algorithm. | `None` |
| `obs` | `DataFrame \| None` | Pre-built set observation table (alternative to `adata`). | `None` |
| `mdata` | `Any \| None` | Pre-built `MuData` to wrap (e.g. from `read`). | `None` |
| `membership` | `AnnData \| None` | Pre-built `sets x elements` membership modality. | `None` |
| `name` | `str \| None` | Optional collection / algorithm name (e.g. `"spagcn"`). | `None` |
| `source` | `str \| dict[str, Any] \| None` | Source descriptor recorded in provenance. | `None` |
| `element_type` | `str` | Entity type of the membership `var` axis; `"cell"` today, `"gene"` for a future gene-set library. | `'cell'` |
| `meta` | `dict[str, Any] \| None` | Extra metadata merged into `uns["celldega"]`. | `None` |
| `mod` | `dict[str, AnnData] \| None` | Feature-space modalities to attach up front. | `None` |
| `relations` | `dict[str, Any] \| None` | Square set-by-set matrices for `mdata.obsp`. | `None` |
| `provenance` | `dict[str, Any] \| None` | Free-form provenance metadata. | `None` |
| `uns` | `dict[str, Any] \| None` | Extra Celldega metadata. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If no construction input (`adata` + `set_col`, `obs`/`membership`, or `mdata`) is provided. |
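To make the "binning over set_col" path concrete, here is a minimal sketch of how a binary sets x cells incidence matrix can be built from one label per cell. It is not Celldega's implementation: the helper name `membership_from_labels` and the toy labels are hypothetical, and only pandas and SciPy calls are used.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def membership_from_labels(labels: pd.Series):
    """Build a binary sets x cells incidence matrix from one label per cell.

    Hypothetical helper for illustration; returns (matrix, set ids).
    """
    cat = labels.astype("category")
    codes = cat.cat.codes.to_numpy()   # set index per cell, -1 for missing labels
    keep = codes >= 0                  # unlabeled cells belong to no set
    rows = codes[keep]
    cols = np.flatnonzero(keep)
    data = np.ones(rows.size, dtype=np.int8)
    shape = (len(cat.cat.categories), labels.size)
    return sparse.csr_matrix((data, (rows, cols)), shape=shape), cat.cat.categories


# Toy stand-in for adata.obs["leiden"]: five cells, three clusters.
labels = pd.Series(["0", "1", "0", "2", "1"],
                   index=[f"cell_{i}" for i in range(5)])
membership, set_ids = membership_from_labels(labels)
print(membership.toarray())
```

Each row is one set and each column one cell, so a row's nonzero columns are exactly the cells belonging to that set.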

`calc_overlap(other=None, weights='membership', metric='iou', key='overlap', modality_name=None, var_entity_type='set')`

Calculate set-to-set membership overlap (the cross-algorithm comparison engine).

Computes overlap between this collection's sets and other's sets over their shared element (cell) axis as A.X @ B.X.T. One engine, two outputs:

- other is None (self-overlap, e.g. on a concatenated collection) → a square relation stored in self.relations[key]; convert it to a clusterable modality with add_relation_modality and hierarchically cluster to find consensus sets (Fig 4C-i).
- other given → a rectangular modality self_sets x other_sets attached to self.mod (e.g. domains vs. manual annotation, Fig 4C-ii).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `other` | `SetCollection \| None` | Another `SetCollection` sharing the element axis; defaults to `self`. | `None` |
| `weights` | `str` | Membership modality to compare on. | `'membership'` |
| `metric` | `str` | `"iou"` (Jaccard) or `"intersection"` (raw shared count). | `'iou'` |
| `key` | `str` | Relation key (self-overlap) or default modality stem. | `'overlap'` |
| `modality_name` | `str \| None` | Modality key for the cross-collection case. | `None` |
| `var_entity_type` | `str` | Entity type for the rectangular modality's `var`. | `'set'` |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | The dense overlap matrix (also stored as a relation or modality). |
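The arithmetic behind the two metrics can be sketched with plain NumPy and SciPy. For binary membership matrices A and B over the same cells, A @ B.T counts shared cells, and IoU divides that by |A| + |B| - intersection. The function name `membership_overlap` and the toy matrices below are assumptions for illustration, not the library's code.

```python
import numpy as np
from scipy import sparse


def membership_overlap(a, b, metric="iou"):
    """Set-to-set overlap between two binary sets x cells matrices (illustrative)."""
    a = sparse.csr_matrix(a, dtype=np.float64)
    b = sparse.csr_matrix(b, dtype=np.float64)
    inter = (a @ b.T).toarray()                    # shared cell counts
    if metric == "intersection":
        return inter
    size_a = np.asarray(a.sum(axis=1)).ravel()     # cells per set in a
    size_b = np.asarray(b.sum(axis=1)).ravel()     # cells per set in b
    union = size_a[:, None] + size_b[None, :] - inter
    # Empty-vs-empty pairs get 0 instead of a division by zero.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


# Two partitions of the same six cells (rows = sets, columns = cells).
A = np.array([[1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 1]])
B = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]])

iou = membership_overlap(A, B)                     # rectangular: A sets x B sets
self_iou = membership_overlap(A, A)                # square self-overlap
print(iou)
```

Passing the same matrix twice gives the square self-overlap case, whose diagonal is 1 for every non-empty set.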

`calc_population(data, category='leiden', output='proportion', weights='membership', modality_name='population')`

Calculate a set-by-population composition modality.

For each set, counts its member cells per category value (cell type / cluster) into a sets x populations modality — e.g. the cell-type composition of each spatial domain. Computed as membership @ one_hot(category). Mirrors NeighborhoodCollection.calc_population / DatasetCollection.calc_population.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `AnnData \| MuData` | Cell-level `AnnData` (or `MuData`) carrying `category` in `obs`; cells are aligned to the membership `var` axis. | required |
| `category` | `str` | `obs` column naming the population/cell-type/cluster. | `'leiden'` |
| `output` | `str` | `"proportion"` (within-set fractions) or `"counts"`. | `'proportion'` |
| `weights` | `str` | Membership modality to aggregate. | `'membership'` |
| `modality_name` | `str` | Key for the modality in `self.mod`. | `'population'` |

Returns:

| Type | Description |
| --- | --- |
| `None` | `None`; the modality is attached to `self.mod[modality_name]`. |
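The product membership @ one_hot(category) can be reproduced on a toy example. This is a sketch under assumed inputs (two sets, five cells, two cell types); the variable names are illustrative and the row normalization shown for "proportion" is the obvious reading of "within-set fractions", not a confirmed implementation detail.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Two sets over five cells (rows = sets, columns = cells).
membership = sparse.csr_matrix(np.array([[1, 1, 1, 0, 0],
                                         [0, 0, 0, 1, 1]]))

# Per-cell category, in the same cell order as the membership columns.
cell_type = pd.Series(["T", "T", "B", "B", "B"], dtype="category")

dummies = pd.get_dummies(cell_type)           # cells x populations indicator table
populations = dummies.columns
one_hot = dummies.to_numpy(dtype=float)

counts = membership @ one_hot                 # sets x populations cell counts
row_sums = counts.sum(axis=1, keepdims=True)
proportion = np.divide(counts, row_sums,
                       out=np.zeros_like(counts), where=row_sums > 0)

table = pd.DataFrame(proportion, index=["domain_0", "domain_1"],
                     columns=list(populations))
print(table)
```

The first set holds two T cells and one B cell, so its row is one third B and two thirds T; the second set is entirely B.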

`calc_signature(data, feature_type=None, layer=None, weights='membership', aggregate='mean', normalization='log1p_cpm', modality_name=None)`

Calculate and attach a set-by-feature signature (pseudobulk).

Aggregates the per-cell feature matrix of each set's member cells into a sets x features modality, using the stored membership matrix as the aggregation operator. Consistent with DatasetCollection.calc_signature and NeighborhoodCollection.calc_signature — the entity is implied by the instance, so it is not repeated in the name.

feature_type is only needed when data is a MuData (it names the modality to aggregate and labels the output). For a plain AnnData the matrix is unambiguous and feature_type defaults to "gene"; pass a protein AnnData (with feature_type="protein" to label it) for a protein signature, or use layer for an alternative matrix over the same features (raw vs. normalized).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `AnnData \| MuData` | Cell-level `AnnData`, or a `MuData` paired with `feature_type`. Cells are aligned to the membership `var` axis. | required |
| `feature_type` | `str \| None` | Output feature label / `MuData` modality selector. Required for `MuData`; optional for `AnnData` (default `"gene"`). | `None` |
| `layer` | `str \| None` | `adata` layer to aggregate; `None` uses `adata.X`. | `None` |
| `weights` | `str` | Membership modality driving aggregation: `"membership"` (binary, hard assignment) or `"weight"` (soft/probabilistic). | `'membership'` |
| `aggregate` | `str` | `"mean"` or `"sum"` across each set's member cells. | `'mean'` |
| `normalization` | `str \| None` | `None`, `"cpm"`, or `"log1p_cpm"` per set row. | `'log1p_cpm'` |
| `modality_name` | `str \| None` | Key for the modality; defaults to `"expression"` for genes and to `feature_type` otherwise. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `None` | `None`; the modality is attached to `self.mod`. |
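A minimal sketch of pseudobulk aggregation with the membership matrix as the operator follows. The function name `pseudobulk` is hypothetical, and the order of operations (aggregate first, then scale each set row to counts per million, then log1p) is an assumption; the reference above does not spell it out.

```python
import numpy as np
from scipy import sparse


def pseudobulk(membership, X, aggregate="mean", normalization="log1p_cpm"):
    """Aggregate a cells x features matrix into sets x features (illustrative)."""
    M = sparse.csr_matrix(membership, dtype=np.float64)
    sig = M @ X                                        # per-set feature sums
    sig = sig.toarray() if sparse.issparse(sig) else np.asarray(sig, dtype=np.float64)
    if aggregate == "mean":
        sizes = np.asarray(M.sum(axis=1)).ravel()      # member cells per set
        sig = sig / np.maximum(sizes, 1)[:, None]      # guard against empty sets
    if normalization in ("cpm", "log1p_cpm"):
        totals = sig.sum(axis=1, keepdims=True)
        sig = np.divide(sig * 1e6, totals,
                        out=np.zeros_like(sig), where=totals > 0)
    if normalization == "log1p_cpm":
        sig = np.log1p(sig)
    return sig


# Two sets over three cells; two features per cell.
M = np.array([[1, 1, 0],
              [0, 0, 1]])
X = np.array([[1.0, 3.0],
              [3.0, 1.0],
              [0.0, 5.0]])

sums = pseudobulk(M, X, aggregate="sum", normalization=None)
cpm = pseudobulk(M, X, normalization="cpm")
signature = pseudobulk(M, X)                           # mean + log1p(CPM)
print(signature)
```

With soft weights instead of a binary matrix, the same product yields a weighted sum, and the per-set size becomes the sum of weights.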

`to_nbhd(method='points', **kwargs)`

Graduate set membership to geometry, returning a NeighborhoodCollection.

For each set, gather its member cells, read their coordinates from the membership.var axis, and materialize geometry: "points" stores the raw MultiPoint (unopinionated); "alpha_shape" / "convex_hull" build a polygon (opinionated). The inverse operation, NeighborhoodCollection.to_set, projects geometry back to cell sets — round-tripping alpha_shape quantifies how faithfully a polygon recovers its defining cells (precision/recall).

TODO(DEGA-487): implement by reusing nbhd.alpha_shape_cell_clusters and constructing a NeighborhoodCollection (lazy import to avoid a cycle).
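Since to_nbhd is still marked as a TODO, the following is only a sketch of the round-trip idea and not Celldega behaviour. It uses scipy.spatial.ConvexHull as a stand-in for the "convex_hull" method, with made-up coordinates, and scores how well the polygon recovers the defining cells.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical cell coordinates; the first five cells belong to the set.
coords = np.array([
    [0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0], [2.0, 1.0],  # members
    [1.0, 1.0],                                                  # non-member inside
    [6.0, 6.0], [5.0, 1.0],                                      # non-members outside
])
member = np.array([True, True, True, True, True, False, False, False])

# Membership -> geometry: convex hull of the member cells.
hull = ConvexHull(coords[member])
polygon = coords[member][hull.vertices]      # hull corner coordinates

# Geometry -> membership: a point is inside when it satisfies every facet
# inequality normal . x + offset <= 0 (small tolerance for boundary points).
normals, offsets = hull.equations[:, :-1], hull.equations[:, -1]
inside = np.all(coords @ normals.T + offsets <= 1e-9, axis=1)

tp = int(np.sum(inside & member))
precision = tp / int(inside.sum())           # recovered cells that were members
recall = tp / int(member.sum())              # members that were recovered
print(precision, recall)
```

A convex hull always contains its defining points, so recall is 1 here and only precision drops when foreign cells fall inside the polygon. A tighter shape such as an alpha shape can trade recall for precision.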

`concat_sets(collections, names=None, weights='membership')`

Stack per-algorithm SetCollection objects into one comparison collection.

Unions the element (cell) axis across all inputs, prefixes each set id with its collection name (so spagcn::3 and gaston::5 stay distinct), and vstacks the membership matrices. The result is the input to the consensus workflow: self-overlap with SetCollection.calc_overlap, then add_relation_modality, then hierarchical clustering.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `collections` | `list[SetCollection]` | Per-algorithm set collections sharing an element namespace. | required |
| `names` | `list[str] \| None` | Optional prefixes; defaults to each collection's `name` or index. | `None` |
| `weights` | `str` | Membership modality to stack. | `'membership'` |

Returns:

| Type | Description |
| --- | --- |
| `SetCollection` | A combined `SetCollection` whose `obs` carries a `set_source` column. |
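The union, prefix, and vstack steps can be sketched on bare matrices. The helper `stack_memberships` and its input layout (a dict of name to matrix, set ids, cell ids) are hypothetical and chosen only to keep the example self-contained; the real function operates on SetCollection objects.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def stack_memberships(named):
    """Stack membership matrices over the union of their cell axes (illustrative).

    `named` maps a collection name to (matrix, set_ids, cell_ids).
    """
    # Union of the element axis across all inputs, in sorted order.
    all_cells = pd.Index(sorted({c for _, _, cells in named.values() for c in cells}))
    blocks, set_ids, sources = [], [], []
    for name, (mat, sets, cells) in named.items():
        coo = sparse.coo_matrix(mat)
        col_map = all_cells.get_indexer(cells)       # old column -> union column
        blocks.append(sparse.csr_matrix(
            (coo.data, (coo.row, col_map[coo.col])),
            shape=(coo.shape[0], len(all_cells)),
        ))
        set_ids += [f"{name}::{s}" for s in sets]    # keep ids distinct per source
        sources += [name] * len(sets)
    stacked = sparse.vstack(blocks, format="csr")
    obs = pd.DataFrame({"set_source": sources}, index=set_ids)
    return stacked, obs, all_cells


named = {
    "spagcn": (np.array([[1, 1, 0], [0, 0, 1]]), ["0", "1"], ["c1", "c2", "c3"]),
    "gaston": (np.array([[1, 0], [0, 1]]), ["0", "1"], ["c2", "c4"]),
}
stacked, obs, cells = stack_memberships(named)
print(obs)
print(stacked.toarray())
```

Cells missing from one input simply get zero columns in that input's block, so every row of the stacked matrix is comparable over the same axis.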
