Set Module API Reference
The set module provides SetCollection, the set-level Celldega collection.
A SetCollection represents collections that are literally defined as sets of
some base element (most commonly cells) with no intrinsic geometry of their
own — clustering results, spatial-domain algorithm outputs (SpaGCN, GraphST,
GASTON, Points2Regions), or manual annotations projected back to cells.
Each observation is one set. The defining modality, `membership`, is a sparse
sets × cells incidence matrix, so a set never loses track of exactly which
cells belong to it. Where DatasetCollection and NeighborhoodCollection make
a derived feature (an expression signature, a geometry) first-class,
SetCollection makes membership itself first-class, and signatures, population
composition, and set-to-set overlap are all derived from it.
```python
import celldega as dega

# Build one SetCollection per clustering "opinion" (the cells define the sets)
clust = dega.set.SetCollection(adata, set_col="leiden", name="leiden")

# Per-set expression signature (pseudobulk). feature_type is only required when
# passing a MuData; for an AnnData it defaults to "gene" -> modality "expression".
clust.calc_signature(adata)
clust.calc_signature(mdata, feature_type="protein")  # protein modality of a MuData

# Per-set cell-type composition (sets x populations)
clust.calc_population(adata, category="cell_type")

# Cross-algorithm comparison: membership IoU between two SetCollections that
# share the same cells (different obs). Rectangular modality on `clust`.
clust_b = dega.set.SetCollection(adata, set_col="spagcn", name="spagcn")
clust.calc_overlap(clust_b)

# Consensus across algorithms: concatenate, self-overlap (square relation),
# make it a clusterable modality, then cut the dendrogram via the Matrix.
combined = dega.set.concat_sets([clust, clust_b])
combined.calc_overlap()                     # -> combined.relations["overlap"]
combined.add_relation_modality("overlap")   # -> combined.mod["overlap_relation"]

clust.write("clusters.h5mu")
loaded = dega.set.SetCollection.read("clusters.h5mu")
```
Hierarchical clustering of any modality is done with the Matrix /
Clustergram classes, and the resulting dendrogram can be cut into flat labels
with Matrix.to_cluster / Clustergram.to_cluster (e.g. to define consensus
domains or meta-clusters), which you then attach back to the collection's obs.
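The dendrogram cut described above can be illustrated without Celldega at all. The sketch below is not the `Matrix.to_cluster` implementation; it swaps in SciPy's `linkage` and `fcluster` on a small, made-up square overlap matrix (the values and set ids are hypothetical) to show how a set-to-set similarity becomes flat consensus labels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical self-overlap (IoU) matrix for four sets from two algorithms.
overlap = np.array([
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, 0.8, 0.1, 1.0],
])
set_ids = ["leiden::0", "leiden::1", "spagcn::0", "spagcn::1"]

# Turn similarity into a distance; the diagonal must be exactly zero.
distance = 1.0 - overlap
np.fill_diagonal(distance, 0.0)

# linkage expects the condensed (upper-triangle) form of the distance matrix.
condensed = squareform(distance, checks=True)
tree = linkage(condensed, method="average")

# Cut the dendrogram into at most two flat clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
consensus = dict(zip(set_ids, labels.tolist()))
```

The resulting `consensus` mapping is the kind of per-set label you would attach back to the collection's obs; the linkage method and the number of clusters are choices made for this example only.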
Set-level Celldega collection objects.
SetCollection
Bases: CelldegaCollection
Set-level Celldega collection backed by a sets x elements membership matrix.
The canonical observation axis is one row per set; the defining modality,
`membership`, is a sparse AnnData with sets as observations and elements
(cells) as variables, carrying per-cell spatial coordinates in var when
available. Feature spaces (expression signatures) and relations (set-to-set
overlap) are derived from this membership.
__init__(adata=None, set_col=None, obs=None, mdata=None, membership=None, name=None, source=None, element_type='cell', meta=None, mod=None, relations=None, provenance=None, uns=None)
Build a set-level collection.
The set observation axis is established one of three ways: from a
pre-built mdata (e.g. via SetCollection.read), from a ready-made
membership modality, or — most commonly — by binning cell-level
adata over the categorical set_col (one row per unique label),
which also constructs the sparse membership modality and tags cell
coordinates onto its var.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `adata` | `AnnData \| None` | Cell-level | `None` |
| `set_col` | `str \| None` | | `None` |
| `obs` | `DataFrame \| None` | Pre-built set observation table (alternative to | `None` |
| `mdata` | `Any \| None` | Pre-built | `None` |
| `membership` | `AnnData \| None` | Pre-built | `None` |
| `name` | `str \| None` | Optional collection / algorithm name (e.g. | `None` |
| `source` | `str \| dict[str, Any] \| None` | Source descriptor recorded in provenance. | `None` |
| `element_type` | `str` | Entity type of the membership | `'cell'` |
| `meta` | `dict[str, Any] \| None` | Extra metadata merged into | `None` |
| `mod` | `dict[str, AnnData] \| None` | Feature-space modalities to attach up front. | `None` |
| `relations` | `dict[str, Any] \| None` | Square set-by-set matrices for | `None` |
| `provenance` | `dict[str, Any] \| None` | Free-form provenance metadata. | `None` |
| `uns` | `dict[str, Any] \| None` | Extra Celldega metadata. | `None` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no construction input ( |
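To make the "binning over set_col" construction concrete, here is a minimal sketch of how a sparse sets x cells incidence matrix can be built from one label per cell. It is an illustration under assumptions, not the library's constructor: `build_membership` is a hypothetical helper, and it uses NumPy and `scipy.sparse` directly rather than AnnData.

```python
import numpy as np
from scipy import sparse


def build_membership(labels):
    """Return (set_ids, sets x cells CSR incidence matrix) from per-cell labels.

    Hypothetical helper for illustration; one row per unique label.
    """
    labels = np.asarray(labels)
    # set_index[j] is the row (set) that cell j belongs to.
    set_ids, set_index = np.unique(labels, return_inverse=True)
    n_cells = labels.shape[0]
    data = np.ones(n_cells, dtype=np.int8)
    membership = sparse.csr_matrix(
        (data, (set_index, np.arange(n_cells))),
        shape=(set_ids.shape[0], n_cells),
    )
    return set_ids, membership


labels = ["a", "b", "a", "c", "b", "a"]
set_ids, membership = build_membership(labels)
dense = membership.toarray()
```

Because each cell carries exactly one label here, every column of the matrix sums to one; overlapping sets would simply have more than one nonzero entry per column.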
calc_overlap(other=None, weights='membership', metric='iou', key='overlap', modality_name=None, var_entity_type='set')
Calculate set-to-set membership overlap (the cross-algorithm comparison engine).
Computes overlap between this collection's sets and other's sets over
their shared element (cell) axis as A.X @ B.X.T. One engine, two
outputs:
- `other is None` (self-overlap, e.g. on a concatenated collection) → a square relation stored in `self.relations[key]`; convert it to a clusterable modality with `add_relation_modality` and hierarchically cluster to find consensus sets (Fig 4C-i).
- `other` given → a rectangular modality `self_sets x other_sets` attached to `self.mod` (e.g. domains vs. manual annotation, Fig 4C-ii).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `other` | `SetCollection \| None` | Another | `None` |
| `weights` | `str` | Membership modality to compare on. | `'membership'` |
| `metric` | `str` | | `'iou'` |
| `key` | `str` | Relation key (self-overlap) or default modality stem. | `'overlap'` |
| `modality_name` | `str \| None` | Modality key for the cross-collection case. | `None` |
| `var_entity_type` | `str` | Entity type for the rectangular modality's | `'set'` |
Returns:
| Type | Description |
|---|---|
| `ndarray` | The dense overlap matrix (also stored as a relation or modality). |
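The `A.X @ B.X.T` product counts shared cells for every pair of sets, and IoU follows from those counts and the set sizes. The sketch below shows that arithmetic on small dense arrays; `membership_iou` is a hypothetical helper, and the handling of empty sets (overlap reported as 0) is an assumption of this example rather than documented behavior.

```python
import numpy as np


def membership_iou(a, b):
    """IoU between rows of two binary sets x cells matrices on a shared cell axis."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    intersection = a @ b.T                  # shared cells for each pair of sets
    sizes_a = a.sum(axis=1)[:, None]        # column vector of set sizes in a
    sizes_b = b.sum(axis=1)[None, :]        # row vector of set sizes in b
    union = sizes_a + sizes_b - intersection
    # Assumption: a pair of empty sets (union 0) gets overlap 0.
    return np.divide(
        intersection, union, out=np.zeros_like(intersection), where=union > 0
    )


a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]])
b = np.array([[1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1, 1]])

iou = membership_iou(a, b)        # rectangular: 2 sets x 3 sets
self_iou = membership_iou(a, a)   # square: diagonal is 1 for non-empty sets
```

The rectangular result corresponds to the cross-collection case and the square one to self-overlap; only the second is symmetric and therefore suitable for hierarchical clustering as a relation.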
calc_population(data, category='leiden', output='proportion', weights='membership', modality_name='population')
Calculate a set-by-population composition modality.
For each set, counts its member cells per category value (cell type /
cluster) into a sets x populations modality — e.g. the cell-type
composition of each spatial domain. Computed as
membership @ one_hot(category). Mirrors
NeighborhoodCollection.calc_population / DatasetCollection.calc_population.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `AnnData \| MuData` | Cell-level | required |
| `category` | `str` | | `'leiden'` |
| `output` | `str` | | `'proportion'` |
| `weights` | `str` | Membership modality to aggregate. | `'membership'` |
| `modality_name` | `str` | Key for the modality in | `'population'` |
Returns:
| Type | Description |
|---|---|
| `None` | |
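The `membership @ one_hot(category)` product can be sketched in a few lines of NumPy. `population_composition` below is a hypothetical helper, not the library method; it returns both raw counts and row-normalized proportions so the relationship between the two is visible.

```python
import numpy as np


def population_composition(membership, categories):
    """Return (populations, counts, proportions) for a sets x cells membership."""
    membership = np.asarray(membership, dtype=float)
    populations, codes = np.unique(np.asarray(categories), return_inverse=True)
    # One-hot encode the per-cell category: cells x populations.
    one_hot = np.zeros((codes.shape[0], populations.shape[0]))
    one_hot[np.arange(codes.shape[0]), codes] = 1.0
    counts = membership @ one_hot                       # sets x populations
    totals = counts.sum(axis=1, keepdims=True)
    proportions = np.divide(
        counts, totals, out=np.zeros_like(counts), where=totals > 0
    )
    return populations, counts, proportions


membership = [[1, 1, 1, 0, 0],   # set 0 holds cells 0, 1, 2
              [0, 0, 0, 1, 1]]   # set 1 holds cells 3, 4
cell_types = ["T", "B", "T", "B", "B"]

populations, counts, proportions = population_composition(membership, cell_types)
```

Each row of `proportions` sums to one for a non-empty set, which is what makes compositions comparable across sets of different sizes.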
calc_signature(data, feature_type=None, layer=None, weights='membership', aggregate='mean', normalization='log1p_cpm', modality_name=None)
Calculate and attach a set-by-feature signature (pseudobulk).
Aggregates the per-cell feature matrix of each set's member cells into a
sets x features modality, using the stored membership matrix as the
aggregation operator. Consistent with DatasetCollection.calc_signature
and NeighborhoodCollection.calc_signature — the entity is implied by
the instance, so it is not repeated in the name.
feature_type is only needed when data is a MuData (it names the
modality to aggregate and labels the output). For a plain AnnData the
matrix is unambiguous and feature_type defaults to "gene"; pass a
protein AnnData (with feature_type="protein" to label it) for a
protein signature, or use layer for an alternative matrix over the same
features (raw vs. normalized).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `AnnData \| MuData` | Cell-level | required |
| `feature_type` | `str \| None` | Output feature label / | `None` |
| `layer` | `str \| None` | | `None` |
| `weights` | `str` | Membership modality driving aggregation — | `'membership'` |
| `aggregate` | `str` | | `'mean'` |
| `normalization` | `str \| None` | | `'log1p_cpm'` |
| `modality_name` | `str \| None` | Key for the modality; defaults to | `None` |
Returns:
| Type | Description |
|---|---|
| `None` | |
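Using the membership matrix as an aggregation operator amounts to a matrix product followed by a division by set size. The sketch below illustrates a mean pseudobulk on dense arrays; both helpers are hypothetical, and the normalization step (counts per million on the aggregated values, then `log1p`) is only one plausible reading of the name `log1p_cpm`, not a statement of what the library does.

```python
import numpy as np


def mean_signature(membership, features):
    """Mean of the per-cell feature rows over each set's member cells."""
    membership = np.asarray(membership, dtype=float)    # sets x cells
    features = np.asarray(features, dtype=float)        # cells x features
    sums = membership @ features                        # sets x features
    sizes = membership.sum(axis=1, keepdims=True)
    return np.divide(sums, sizes, out=np.zeros_like(sums), where=sizes > 0)


def log1p_cpm(signature):
    """Assumed normalization: scale each row to one million, then log1p."""
    totals = signature.sum(axis=1, keepdims=True)
    cpm = np.divide(
        signature * 1e6, totals, out=np.zeros_like(signature), where=totals > 0
    )
    return np.log1p(cpm)


membership = [[1, 1, 0],     # set 0 holds cells 0 and 1
              [0, 0, 1]]     # set 1 holds cell 2
counts = [[2, 0],            # cells x genes
          [4, 2],
          [1, 3]]

signature = mean_signature(membership, counts)
normalized = log1p_cpm(signature)
```

Swapping the division by set size for nothing at all gives a summed pseudobulk instead of a mean; which of the two is appropriate depends on the downstream comparison.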
to_nbhd(method='points', **kwargs)
Graduate set membership to geometry, returning a NeighborhoodCollection.
For each set, gather its member cells, read their coordinates from the
membership.var axis, and materialize geometry: "points" stores the
raw MultiPoint (unopinionated); "alpha_shape" / "convex_hull"
build a polygon (opinionated). The inverse operation,
NeighborhoodCollection.to_set, projects geometry back to cell sets —
round-tripping alpha_shape quantifies how faithfully a polygon recovers
its defining cells (precision/recall).
TODO(DEGA-487): implement by reusing nbhd.alpha_shape_cell_clusters and
constructing a NeighborhoodCollection (lazy import to avoid a cycle).
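The round-trip score mentioned above reduces to comparing two cell sets: the cells that defined a set and the cells recovered when its polygon is projected back. Since `to_nbhd` is still marked as a TODO, the sketch below covers only the scoring step; `precision_recall` and the `recovered` mask are hypothetical stand-ins for whatever the geometric projection would return.

```python
import numpy as np


def precision_recall(defining, recovered):
    """Score how well a recovered cell set matches the defining cell set."""
    defining = np.asarray(defining, dtype=bool)
    recovered = np.asarray(recovered, dtype=bool)
    true_positive = np.count_nonzero(defining & recovered)
    n_recovered = np.count_nonzero(recovered)
    n_defining = np.count_nonzero(defining)
    # Fraction of recovered cells that were original members.
    precision = true_positive / n_recovered if n_recovered else 0.0
    # Fraction of original members that were recovered.
    recall = true_positive / n_defining if n_defining else 0.0
    return precision, recall


# Six cells: the set was defined by cells 0-3; the (hypothetical) polygon
# captures all four of them plus one neighbouring cell from another set.
defining = [1, 1, 1, 1, 0, 0]
recovered = [1, 1, 1, 1, 1, 0]

precision, recall = precision_recall(defining, recovered)
```

A polygon that encloses all of its defining cells has recall 1, while precision drops as soon as it also encloses cells from other sets, which is the typical failure mode for interleaved populations.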
concat_sets(collections, names=None, weights='membership')
Stack per-algorithm SetCollection objects into one comparison collection.
Unions the element (cell) axis across all inputs, prefixes each set id with its
collection name (so spagcn::3 and gaston::5 stay distinct), and
vstacks the membership matrices. The result is the input to the self-overlap
consensus workflow: SetCollection.calc_overlap → add_relation_modality →
hierarchical clustering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `collections` | `list[SetCollection]` | Per-algorithm set collections sharing an element namespace. | required |
| `names` | `list[str] \| None` | Optional prefixes; defaults to each collection's | `None` |
| `weights` | `str` | Membership modality to stack. | `'membership'` |
Returns:
| Type | Description |
|---|---|
| `SetCollection` | A combined |
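The three steps described for `concat_sets` (union the cell axis, prefix set ids, vstack) can be sketched on plain arrays. `concat_memberships` is a hypothetical helper operating on dense matrices and Python lists; the real function works on SetCollection objects, and the cell ordering chosen here (sorted ids) is an assumption of the example.

```python
import numpy as np


def concat_memberships(collections):
    """Stack memberships from several algorithms onto one shared cell axis.

    collections: dict mapping name -> (set_ids, cell_ids, sets x cells matrix).
    """
    # Union of all cell ids, in a deterministic (sorted) order.
    all_cells = sorted({c for _, cell_ids, _ in collections.values() for c in cell_ids})
    column = {cell: j for j, cell in enumerate(all_cells)}

    stacked_ids, blocks = [], []
    for name, (set_ids, cell_ids, matrix) in collections.items():
        matrix = np.asarray(matrix)
        # Re-index this collection's columns onto the shared cell axis.
        block = np.zeros((matrix.shape[0], len(all_cells)), dtype=matrix.dtype)
        block[:, [column[c] for c in cell_ids]] = matrix
        blocks.append(block)
        # Prefix set ids so labels from different algorithms stay distinct.
        stacked_ids.extend(f"{name}::{set_id}" for set_id in set_ids)
    return stacked_ids, all_cells, np.vstack(blocks)


collections = {
    "spagcn": ([0, 1], ["c1", "c2", "c3"], [[1, 1, 0], [0, 0, 1]]),
    "gaston": ([0], ["c2", "c3", "c4"], [[1, 1, 1]]),
}
stacked_ids, all_cells, stacked = concat_memberships(collections)
```

Cells missing from one algorithm's output simply become all-zero columns in that algorithm's rows, so a subsequent self-overlap treats them as non-members rather than dropping them.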