SetCollection: cluster / set space
This notebook builds a SetCollection from a single-cell clustering of the public Xenium Human Skin dataset, calculates a per-set gene-expression signature (pseudobulk), and hands the resulting modality to the Matrix / Clustergram classes for hierarchical clustering — including a Clustergram linked to a Landscape spatial view.
A SetCollection treats each cluster (or spatial domain, or manual annotation) as a set of cells: its defining membership modality is a sparse sets × cells matrix, so a set never loses track of which cells belong to it. Signatures, cell-type composition, and set-to-set overlap are all derived from that membership.
The Matrix class can also aggregate cells into a heatmap directly and statelessly (shown at the end). The SetCollection path is the stateful counterpart: it persists membership, signatures, and overlaps as a MuData object you can write, reload, and compare across algorithms.
from pathlib import Path
from urllib.parse import quote
import numpy as np
import pandas as pd
import requests
import scanpy as sc
import celldega as dega
Load the Xenium Human Skin dataset
We download a prepared AnnData from the Celldega supporting-data repository on the Hugging Face Hub and cache it locally, so reruns skip the download.
REPO_ID = "broadinstitute/Celldega_Supporting_Data"
REVISION = "main"
CACHE_DIR = Path("data/celldega_supporting_data")
H5AD_PATH = "Xenium_Prime_Human_Skin_FFPE_outs.h5ad"
def download_repo_file(repo_id, repo_path, cache_dir, revision="main"):
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / Path(repo_path).name
    if local_path.exists():
        return local_path
    encoded_path = quote(repo_path)
    url = f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{encoded_path}"
    with requests.get(url, stream=True, timeout=300) as response:
        response.raise_for_status()
        with local_path.open('wb') as handle:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    handle.write(chunk)
    return local_path
local_h5ad = download_repo_file(REPO_ID, H5AD_PATH, CACHE_DIR, REVISION)
adata = sc.read_h5ad(local_h5ad)
adata
AnnData object with n_obs × n_vars = 109709 × 5004
obs: 'transcript_counts', 'control_probe_counts', 'genomic_control_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'deprecated_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'nucleus_count', 'segmentation_method', 'region', 'z_level', 'cell_labels', 'n_counts', 'leiden'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
uns: 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'spatialdata_attrs', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Define the sets
The sets are defined by a categorical cell label, here the precomputed Leiden clusters. (If the column were missing, we would compute it with sc.pp.pca / sc.pp.neighbors / sc.tl.leiden.)
set_col = "leiden"
if set_col not in adata.obs.columns:
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata, key_added=set_col)
adata.obs[set_col].value_counts()
leiden
0     14459
1     13326
2     11199
3      9367
4      8487
5      8012
6      6896
7      5226
8      5185
9      4970
10     4643
11     3680
12     3072
13     2849
14     2566
15     2517
16     2481
17      774
Name: count, dtype: int64
Build the SetCollection
Each Leiden cluster becomes one observation (one set); the sparse membership modality records which cells belong to each. Cell coordinates are tagged onto membership.var so the collection can later graduate to spatial geometry via to_nbhd.
clust = dega.set.SetCollection(adata, set_col=set_col, name="leiden")
print(clust.obs[[set_col, "n_cells"]])
clust.mod["membership"]
       leiden  n_cells
leiden
12         12     3072
0           0    14459
6           6     6896
4           4     8487
2           2    11199
10         10     4643
15         15     2517
14         14     2566
16         16     2481
3           3     9367
5           5     8012
8           8     5185
11         11     3680
7           7     5226
1           1    13326
13         13     2849
9           9     4970
17         17      774
AnnData object with n_obs × n_vars = 18 × 109709
obs: 'leiden', 'n_cells', 'set_source', 'color'
var: 'cell', 'center_x', 'center_y', 'entity_type'
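The membership idea can be sketched without celldega. The block below is a minimal illustration, assuming toy labels in place of adata.obs["leiden"]; the variable names are hypothetical and this is not how SetCollection is implemented internally. It builds a sparse sets × cells indicator matrix and reads set sizes and member cells back from it.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy labels standing in for adata.obs["leiden"] (hypothetical data).
labels = pd.Series(["0", "1", "0", "2", "1", "0"], name="leiden").astype("category")

set_names = labels.cat.categories        # one set per category
set_idx = labels.cat.codes.to_numpy()    # row (set) index of each cell
cell_idx = np.arange(labels.size)        # column (cell) index

# Sparse sets x cells indicator: entry (s, c) is 1 when cell c belongs to set s.
membership = sparse.csr_matrix(
    (np.ones(labels.size, dtype=np.int8), (set_idx, cell_idx)),
    shape=(len(set_names), labels.size),
)

n_cells = np.asarray(membership.sum(axis=1)).ravel()  # size of each set
members_of_0 = membership[0].indices                  # column indices of cells in set "0"
```

Because every cell appears in exactly one row here, the matrix stores one nonzero per cell, which is why a sparse layout suits a dataset of this size.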
Per-set expression signature (pseudobulk)
calc_signature aggregates each set's member cells. With an AnnData the feature type defaults to gene, producing an expression modality of shape sets × genes. This public dataset's X is already log-normalized, so we aggregate with normalization=None (a plain mean of the normalized values per cluster). (Pass a MuData with feature_type="protein" for a protein signature instead.)
clust.calc_signature(adata, aggregate="mean", normalization=None)
expression = clust.mod["expression"]
expression
AnnData object with n_obs × n_vars = 18 × 5004
obs: 'leiden', 'n_cells', 'set_source', 'color'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'gene', 'entity_type'
uns: 'feature_type', 'aggregate', 'normalization', 'layer', 'axis_entities'
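As a rough picture of what a mean pseudobulk amounts to, the sketch below multiplies a membership matrix by a cells × genes matrix and divides by set size. It assumes random toy values and hypothetical variable names; calc_signature may differ in details such as layers or normalization.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = rng.random((6, 4))                    # cells x genes, hypothetical normalized values
set_idx = np.array([0, 1, 0, 2, 1, 0])    # set index of each cell

# sets x cells indicator matrix
membership = sparse.csr_matrix(
    (np.ones(6), (set_idx, np.arange(6))), shape=(3, 6)
)
n_cells = np.asarray(membership.sum(axis=1)).ravel()

# Sum the member cells of each set, then divide by set size:
# the result is a sets x genes mean signature.
signature = (membership @ X) / n_cells[:, None]
```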
Hand the signature to the Matrix / Clustergram
Matrix transposes an AnnData to features × observations, so here rows are genes and columns are sets. We z-score each gene across the clusters (mat.norm(axis="row", by="zscore")) so the heatmap shows each gene's relative enrichment per cluster, then cluster both axes. calc_signature stamped axis_entities onto the modality, so the Matrix automatically knows its columns are cells grouped by leiden (col_entity), which is what lets the Clustergram link back to a Landscape.
mat = dega.clust.Matrix(expression, col_attr=[set_col])
mat.norm(axis="row", by="zscore")
mat.clust(dist_type="cosine", linkage_type="average")
mat.row_entity, mat.col_entity
({'entity': 'gene', 'attr': 'name'}, {'entity': 'cell', 'attr': 'leiden'})
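For readers who want to see the normalization and clustering steps in isolation, here is a minimal sketch using SciPy on a random genes × sets matrix. It assumes a population standard deviation for the z-score and SciPy's average-linkage cosine clustering; the Matrix class may use different conventions.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
sig = rng.normal(size=(5, 8))   # genes x sets, hypothetical signature values

# z-score each gene (row) across the sets (columns).
z = (sig - sig.mean(axis=1, keepdims=True)) / sig.std(axis=1, keepdims=True)

# Average-linkage hierarchical clustering with cosine distance on both axes.
row_link = linkage(z, method="average", metric="cosine")     # genes
col_link = linkage(z.T, method="average", metric="cosine")   # sets

# Dendrogram leaf order gives the heatmap's row and column ordering.
row_order = leaves_list(row_link)
col_order = leaves_list(col_link)
z_ordered = z[np.ix_(row_order, col_order)]
```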
Link the Clustergram to a Landscape
Passing the same adata to the Landscape makes its cluster definitions and colors identical to the Clustergram's (both come from adata.obs["leiden"]). Because the signature carries col_entity = {entity: cell, attr: leiden}, the landscape_clustergram link knows that clicking a column (a Leiden cluster) should color the cells of that cluster in the spatial Landscape. The same wiring works for any set_col (cell types, spatial domains) — the attribute is taken from the Clustergram's col_entity, not hard-coded to leiden. The Landscape loads the prepared LandscapeFiles (image tiles + cell positions) for this sample from GitHub.
base_url = (
"https://raw.githubusercontent.com/broadinstitute/"
"celldega_Xenium_Prime_Human_Skin_FFPE_outs/main/Xenium_Prime_Human_Skin_FFPE_outs"
)
# The SetCollection stored a per-cluster color in clust.obs['color']; reuse it so
# the Landscape and Clustergram share one consistent palette.
meta_cluster = clust.obs[["color"]].copy()
meta_cluster["count"] = clust.obs["n_cells"]
landscape = dega.viz.Landscape(base_url=base_url, adata=adata, meta_cluster=meta_cluster)
dega.viz.landscape_clustergram(landscape, dega.viz.Clustergram(matrix=mat))
/var/folders/8d/jxpy9rd10j7fp2rcj_s5sz3c0000gq/T/ipykernel_87463/206546466.py:9: UserWarning: Transformation matrix not found at https://raw.githubusercontent.com/broadinstitute/celldega_Xenium_Prime_Human_Skin_FFPE_outs/main/Xenium_Prime_Human_Skin_FFPE_outs/micron_to_image_transform.csv. Using identity.
  landscape = dega.viz.Landscape(base_url=base_url, adata=adata, meta_cluster=meta_cluster)
Cut the dendrogram into groups
Matrix.to_cluster cuts the linkage into flat labels. Cutting the column axis groups the sets into meta-clusters, which we attach back to clust.obs. In the widget, dragging the dendrogram slider populates cgm.dendro_cut (where cgm is the Clustergram instance), and cgm.to_cluster(axis="col") reads that slider value instead.
set_groups = mat.to_cluster(axis="col", n_clusters=4)
clust.obs["meta_cluster"] = set_groups.reindex(clust.obs.index).astype("Int64")
clust.obs[[set_col, "n_cells", "meta_cluster"]]
| leiden (index) | leiden | n_cells | meta_cluster |
|---|---|---|---|
| 12 | 12 | 3072 | 2 |
| 0 | 0 | 14459 | 4 |
| 6 | 6 | 6896 | 4 |
| 4 | 4 | 8487 | 1 |
| 2 | 2 | 11199 | 4 |
| 10 | 10 | 4643 | 4 |
| 15 | 15 | 2517 | 3 |
| 14 | 14 | 2566 | 3 |
| 16 | 16 | 2481 | 3 |
| 3 | 3 | 9367 | 1 |
| 5 | 5 | 8012 | 1 |
| 8 | 8 | 5185 | 1 |
| 11 | 11 | 3680 | 4 |
| 7 | 7 | 5226 | 4 |
| 1 | 1 | 13326 | 1 |
| 13 | 13 | 2849 | 1 |
| 9 | 9 | 4970 | 4 |
| 17 | 17 | 774 | 3 |
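Cutting a linkage into a fixed number of flat groups can be sketched with SciPy's fcluster. The example assumes random set profiles and hypothetical names; it illustrates the general technique rather than the internals of Matrix.to_cluster. Note that the "maxclust" criterion yields at most the requested number of groups.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
set_names = [str(i) for i in range(8)]
profiles = rng.normal(size=(8, 20))   # sets x genes, hypothetical signatures

link = linkage(profiles, method="average", metric="cosine")

# Cut the tree into at most 4 flat groups (labels start at 1).
flat = fcluster(link, t=4, criterion="maxclust")
set_groups = pd.Series(flat, index=set_names, name="meta_cluster")

# Attach the labels to a table whose rows are in a different order;
# reindex aligns by set name rather than by position.
obs = pd.DataFrame({"n_cells": list(range(8))}, index=set_names[::-1])
obs["meta_cluster"] = set_groups.reindex(obs.index).astype("Int64")
```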
The quick, stateless Matrix path
When you don't need a persistent SetCollection, Matrix can aggregate cells into a gene × cluster heatmap directly via downsample_to. This is faster and stateless — ideal for a quick look — but it does not retain membership, cross-algorithm overlap, or a reloadable MuData. The two paths produce the same kind of pseudobulk heatmap; the SetCollection path is what you reach for when the aggregation is a durable analysis object rather than a one-off view.
mat_quick = dega.clust.Matrix(adata)
mat_quick.downsample_to(category=set_col, axis="col")
mat_quick.norm(axis="row", by="zscore")
mat_quick.clust()
dega.viz.Clustergram(matrix=mat_quick)
/Users/feni/Documents/celldega/src/celldega/clust/matrix.py:307: UserWarning: Large matrix (109709 x 5004). Consider filtering. self.load_adata(data, col_attr=col_attr, row_attr=row_attr)
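The claim that both paths yield the same kind of pseudobulk can be checked on toy data with plain pandas. This sketch assumes mean aggregation and hypothetical names; it mirrors the idea, not the code, of downsample_to and calc_signature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
genes = ["G1", "G2", "G3"]
cells = [f"cell_{i}" for i in range(6)]
expr = pd.DataFrame(rng.random((6, 3)), index=cells, columns=genes)  # cells x genes
labels = pd.Series(["0", "1", "0", "2", "1", "0"], index=cells, name="leiden")

# Stateless: group and average in one step; membership is discarded afterwards.
quick = expr.groupby(labels).mean().T                 # genes x clusters

# Stateful style: keep an explicit sets x cells indicator, then aggregate from it.
indicator = pd.get_dummies(labels).T.astype(float)    # sets x cells
stateful = (indicator @ expr).div(indicator.sum(axis=1), axis=0).T  # genes x sets
```

The two tables hold the same numbers; the difference is that the indicator table survives and can be reused for composition or overlap calculations.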
When to use which
- Matrix.downsample_to: quick, stateless aggregation for a one-off heatmap.
- SetCollection: when you want to persist membership and signatures, compute per-set composition (calc_population) and overlap, compare multiple clustering/domain algorithms (concat_sets + calc_overlap), or graduate sets to spatial neighborhoods (to_nbhd).
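To make the overlap idea concrete, the sketch below computes a Jaccard index between the sets of two labelings from their membership tables. It assumes toy labels and a Jaccard measure chosen for illustration; the metric that calc_overlap actually reports is not specified here.

```python
import numpy as np
import pandas as pd

cells = [f"cell_{i}" for i in range(6)]
leiden = pd.Series(["0", "0", "0", "1", "1", "1"], index=cells)    # hypothetical clustering
domains = pd.Series(["A", "A", "B", "B", "B", "B"], index=cells)   # hypothetical domains

m1 = pd.get_dummies(leiden).T.astype(int)     # sets x cells
m2 = pd.get_dummies(domains).T.astype(int)    # sets x cells

# Shared cells for every pair of sets (rows: leiden sets, columns: domains).
inter = m1 @ m2.T
sizes1 = m1.sum(axis=1).to_numpy()[:, None]
sizes2 = m2.sum(axis=1).to_numpy()[None, :]

# Jaccard index = |intersection| / |union|
jaccard = inter / (sizes1 + sizes2 - inter)
```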