SetCollection: cluster / set space

This notebook builds a SetCollection from a single-cell clustering of the public Xenium Human Skin dataset, calculates a per-set gene-expression signature (pseudobulk), and hands the resulting modality to the Matrix / Clustergram classes for hierarchical clustering — including a Clustergram linked to a Landscape spatial view.

A SetCollection treats each cluster (or spatial domain, or manual annotation) as a set of cells: its defining membership modality is a sparse sets × cells matrix, so a set never loses track of which cells belong to it. Signatures, cell-type composition, and set-to-set overlap are all derived from that membership.

The Matrix class can also aggregate cells into a heatmap directly and statelessly (shown at the end). The SetCollection path is the stateful counterpart — it persists membership, signatures, and overlaps as a MuData object you can write, reload, and compare across algorithms.

In [1]:

from pathlib import Path
from urllib.parse import quote

import numpy as np
import pandas as pd
import requests
import scanpy as sc

import celldega as dega

Load the Xenium Human Skin dataset

We stream a prepared AnnData from the Celldega supporting-data repository on the Hugging Face Hub.

In [2]:

REPO_ID = "broadinstitute/Celldega_Supporting_Data"
REVISION = "main"
CACHE_DIR = Path("data/celldega_supporting_data")
H5AD_PATH = "Xenium_Prime_Human_Skin_FFPE_outs.h5ad"


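def download_repo_file(repo_id, repo_path, cache_dir, revision="main"):
    """Download one file from a Hugging Face dataset repo, reusing a cached copy."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / Path(repo_path).name
    if local_path.exists():
        return local_path
    encoded_path = quote(repo_path)
    url = f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{encoded_path}"
    # Stream to a temporary ".part" file so an interrupted download is never
    # mistaken for a complete cached file on the next run.
    partial_path = local_path.with_name(local_path.name + ".part")
    with requests.get(url, stream=True, timeout=300) as response:
        response.raise_for_status()
        with partial_path.open("wb") as handle:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    handle.write(chunk)
    partial_path.replace(local_path)
    return local_path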


local_h5ad = download_repo_file(REPO_ID, H5AD_PATH, CACHE_DIR, REVISION)
adata = sc.read_h5ad(local_h5ad)
adata

Out[2]:

AnnData object with n_obs × n_vars = 109709 × 5004
    obs: 'transcript_counts', 'control_probe_counts', 'genomic_control_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'deprecated_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'nucleus_count', 'segmentation_method', 'region', 'z_level', 'cell_labels', 'n_counts', 'leiden'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
    uns: 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'spatialdata_attrs', 'umap'
    obsm: 'X_pca', 'X_umap', 'spatial'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Define the sets

The sets are defined by a categorical cell label, here the precomputed Leiden clusters. (If the column were missing, we would compute it with sc.pp.pca, sc.pp.neighbors, and sc.tl.leiden, as the fallback branch in the cell below does.)

In [3]:

set_col = "leiden"
if set_col not in adata.obs.columns:
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata, key_added=set_col)
adata.obs[set_col].value_counts()

Out[3]:

leiden
0     14459
1     13326
2     11199
3      9367
4      8487
5      8012
6      6896
7      5226
8      5185
9      4970
10     4643
11     3680
12     3072
13     2849
14     2566
15     2517
16     2481
17      774
Name: count, dtype: int64

Build the SetCollection

Each Leiden cluster becomes one observation (one set); the sparse membership modality records which cells belong to each. Cell coordinates are tagged onto membership.var so the collection can later graduate to spatial geometry via to_nbhd.

In [4]:

clust = dega.set.SetCollection(adata, set_col=set_col, name="leiden")
print(clust.obs[[set_col, "n_cells"]])
clust.mod["membership"]

       leiden  n_cells
leiden                
12         12     3072
0           0    14459
6           6     6896
4           4     8487
2           2    11199
10         10     4643
15         15     2517
14         14     2566
16         16     2481
3           3     9367
5           5     8012
8           8     5185
11         11     3680
7           7     5226
1           1    13326
13         13     2849
9           9     4970
17         17      774

Out[4]:

AnnData object with n_obs × n_vars = 18 × 109709
    obs: 'leiden', 'n_cells', 'set_source', 'color'
    var: 'cell', 'center_x', 'center_y', 'entity_type'

Per-set expression signature (pseudobulk)

calc_signature aggregates each set's member cells. With an AnnData the feature type defaults to gene, producing an expression modality of shape sets × genes. This public dataset's X is already log-normalized, so we aggregate with normalization=None (a plain mean of the normalized values per cluster). (Pass a MuData with feature_type="protein" for a protein signature instead.)

In [5]:

clust.calc_signature(adata, aggregate="mean", normalization=None)
expression = clust.mod["expression"]
expression

Out[5]:

AnnData object with n_obs × n_vars = 18 × 5004
    obs: 'leiden', 'n_cells', 'set_source', 'color'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'gene', 'entity_type'
    uns: 'feature_type', 'aggregate', 'normalization', 'layer', 'axis_entities'
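
As a rough sketch of what a mean pseudobulk amounts to (assuming a plain per-set mean with no extra normalization, and not claiming this is how calc_signature is implemented), the signature is the membership matrix times the cells × genes matrix, divided by each set's size. The toy arrays below are invented.

```python
import numpy as np

# Hypothetical toy data: 4 cells x 2 genes, already normalized.
X = np.array([
    [1.0, 0.0],
    [3.0, 2.0],
    [5.0, 4.0],
    [7.0, 6.0],
])
labels = np.array([0, 1, 0, 1])  # set index of each cell

# Dense sets x cells indicator matrix (sparse in practice).
n_sets = labels.max() + 1
membership = np.zeros((n_sets, X.shape[0]))
membership[labels, np.arange(X.shape[0])] = 1.0

# Sum expression over member cells, then divide by set size to get the mean.
set_sizes = membership.sum(axis=1, keepdims=True)
signature = (membership @ X) / set_sizes  # sets x genes

print(signature)
# [[3. 2.]
#  [5. 4.]]
```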

Hand the signature to the Matrix / Clustergram

Matrix transposes an AnnData to features × observations, so here rows are genes and columns are sets. We z-score each gene across the clusters (mat.norm(axis="row", by="zscore")) so the heatmap shows each gene's relative enrichment per cluster, then cluster both axes. calc_signature stamped axis_entities onto the modality, so the Matrix automatically knows its columns are cells grouped by leiden (col_entity), which is what lets the Clustergram link back to a Landscape.

In [6]:

mat = dega.clust.Matrix(expression, col_attr=[set_col])
mat.norm(axis="row", by="zscore")
mat.clust(dist_type="cosine", linkage_type="average")
mat.row_entity, mat.col_entity

Out[6]:

({'entity': 'gene', 'attr': 'name'}, {'entity': 'cell', 'attr': 'leiden'})
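
For readers who want to see the normalization and clustering steps outside the Matrix class, the following is a generic sketch using NumPy and SciPy. It assumes a population standard deviation (ddof=0) for the z-score, which may differ from what mat.norm uses, and the random genes × sets matrix is purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
sig = rng.normal(size=(6, 4))  # hypothetical genes x sets signature

# Row-wise z-score: each gene gets mean 0 and unit variance across the sets.
z = (sig - sig.mean(axis=1, keepdims=True)) / sig.std(axis=1, keepdims=True)

# Average-linkage hierarchical clustering with cosine distance on both axes.
row_link = linkage(z, method="average", metric="cosine")    # genes
col_link = linkage(z.T, method="average", metric="cosine")  # sets

# Leaf orders give the row and column ordering a heatmap would use.
row_order = leaves_list(row_link)
col_order = leaves_list(col_link)
ordered = z[np.ix_(row_order, col_order)]
```

Z-scoring by row before clustering means the column dendrogram groups sets by which genes are relatively enriched in them, rather than by overall expression level.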

Link the Clustergram to a Landscape

Passing the same adata to the Landscape makes its cluster definitions and colors identical to the Clustergram's (both come from adata.obs["leiden"]). Because the signature carries col_entity = {entity: cell, attr: leiden}, the landscape_clustergram link knows that clicking a column (a Leiden cluster) should color the cells of that cluster in the spatial Landscape. The same wiring works for any set_col (cell types, spatial domains) — the attribute is taken from the Clustergram's col_entity, not hard-coded to leiden. The Landscape loads the prepared LandscapeFiles (image tiles + cell positions) for this sample from GitHub.

In [7]:

base_url = (
    "https://raw.githubusercontent.com/broadinstitute/"
    "celldega_Xenium_Prime_Human_Skin_FFPE_outs/main/Xenium_Prime_Human_Skin_FFPE_outs"
)
# The SetCollection stored a per-cluster color in clust.obs['color']; reuse it so
# the Landscape and Clustergram share one consistent palette.
meta_cluster = clust.obs[["color"]].copy()
meta_cluster["count"] = clust.obs["n_cells"]
landscape = dega.viz.Landscape(base_url=base_url, adata=adata, meta_cluster=meta_cluster)
dega.viz.landscape_clustergram(landscape, dega.viz.Clustergram(matrix=mat))

/var/folders/8d/jxpy9rd10j7fp2rcj_s5sz3c0000gq/T/ipykernel_87463/206546466.py:9: UserWarning: Transformation matrix not found at https://raw.githubusercontent.com/broadinstitute/celldega_Xenium_Prime_Human_Skin_FFPE_outs/main/Xenium_Prime_Human_Skin_FFPE_outs/micron_to_image_transform.csv. Using identity.
  landscape = dega.viz.Landscape(base_url=base_url, adata=adata, meta_cluster=meta_cluster)

Out[7]:
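
The effect of clicking a Clustergram column can be pictured as a lookup from the cluster attribute to per-cell colors. The sketch below is only an illustration of that idea in pandas; the function name colors_for_selection, the palette, and the gray background color are hypothetical and are not part of the celldega API.

```python
import pandas as pd

# Hypothetical per-cell cluster labels and per-cluster palette.
cell_labels = pd.Series(["0", "1", "0", "2"], index=["c0", "c1", "c2", "c3"], name="leiden")
cluster_colors = pd.Series({"0": "#1f77b4", "1": "#ff7f0e", "2": "#2ca02c"})


def colors_for_selection(selected, labels, palette, background="#d3d3d3"):
    """Color cells of the selected cluster with its palette color, gray out the rest."""
    colors = labels.map(palette)
    return colors.where(labels == selected, background)


highlighted = colors_for_selection("0", cell_labels, cluster_colors)
print(highlighted.tolist())
# ['#1f77b4', '#d3d3d3', '#1f77b4', '#d3d3d3']
```

Nothing in the function is specific to Leiden labels, which mirrors the point above that the attribute comes from col_entity rather than being hard-coded.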

Cut the dendrogram into groups

Matrix.to_cluster cuts the linkage into flat labels. Cutting the column axis groups the sets into meta-clusters, which we attach back to clust.obs. In the widget, dragging the dendrogram slider populates cgm.dendro_cut (where cgm is a Clustergram instance held in a variable), and cgm.to_cluster(axis="col") reads that slider value instead of an explicit n_clusters.

In [8]:

set_groups = mat.to_cluster(axis="col", n_clusters=4)
clust.obs["meta_cluster"] = set_groups.reindex(clust.obs.index).astype("Int64")
clust.obs[[set_col, "n_cells", "meta_cluster"]]

Out[8]:

       leiden  n_cells  meta_cluster
leiden
12         12     3072             2
0           0    14459             4
6           6     6896             4
4           4     8487             1
2           2    11199             4
10         10     4643             4
15         15     2517             3
14         14     2566             3
16         16     2481             3
3           3     9367             1
5           5     8012             1
8           8     5185             1
11         11     3680             4
7           7     5226             4
1           1    13326             1
13         13     2849             1
9           9     4970             4
17         17      774             3
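
Cutting a linkage into a fixed number of flat groups can be sketched with SciPy's fcluster. This is a generic illustration on invented 2-D profiles, assuming the "maxclust" criterion; it does not claim to reproduce what Matrix.to_cluster does internally. The reindex step mirrors how the labels are aligned to clust.obs above.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical per-set profiles: two tight pairs and one outlier.
set_ids = ["0", "1", "2", "3", "4"]
profiles = np.array([
    [0.0, 0.1],
    [0.1, 0.0],
    [5.0, 5.1],
    [5.1, 5.0],
    [20.0, 20.1],
])

link = linkage(profiles, method="average", metric="euclidean")

# Cut the tree so that at most three flat clusters remain.
flat = pd.Series(fcluster(link, t=3, criterion="maxclust"), index=set_ids)

# Align to an obs-style index in a different order; Int64 tolerates missing sets.
obs_index = ["4", "0", "2", "1", "3"]
meta_cluster = flat.reindex(obs_index).astype("Int64")
print(meta_cluster)
```

The integer labels themselves are arbitrary; only which sets share a label is meaningful.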

The quick, stateless Matrix path

When you don't need a persistent SetCollection, Matrix can aggregate cells into a gene × cluster heatmap directly via downsample_to. This is faster and stateless — ideal for a quick look — but it does not retain membership, cross-algorithm overlap, or a reloadable MuData. The two paths produce the same kind of pseudobulk heatmap; the SetCollection path is what you reach for when the aggregation is a durable analysis object rather than a one-off view.

In [9]:

mat_quick = dega.clust.Matrix(adata)
mat_quick.downsample_to(category=set_col, axis="col")
mat_quick.norm(axis="row", by="zscore")
mat_quick.clust()
dega.viz.Clustergram(matrix=mat_quick)

/Users/feni/Documents/celldega/src/celldega/clust/matrix.py:307: UserWarning: Large matrix (109709 x 5004). Consider filtering.
  self.load_adata(data, col_attr=col_attr, row_attr=row_attr)

Out[9]:
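
In spirit, the stateless path is a group-by over cells. The pandas sketch below assumes a mean aggregate (the default used by downsample_to is not stated here) and uses invented gene names and values; it yields a genes × clusters table but keeps no record of which cells went into each column.

```python
import pandas as pd

# Hypothetical cells x genes expression table and per-cell cluster labels.
expr = pd.DataFrame(
    [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]],
    index=["c0", "c1", "c2", "c3"],
    columns=["GENE_A", "GENE_B"],
)
labels = pd.Series(["0", "1", "0", "1"], index=expr.index, name="leiden")

# Average cells within each cluster, then transpose to genes x clusters.
pseudobulk = expr.groupby(labels).mean().T
print(pseudobulk)
```

After this step only the aggregated table remains, which is the trade-off the stateless path makes for convenience.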

When to use which

Matrix.downsample_to — quick, stateless aggregation for a one-off heatmap.
SetCollection — when you want to persist membership and signatures, compute per-set composition (calc_population) and overlap, compare multiple clustering/domain algorithms (concat_sets + calc_overlap), or graduate sets to spatial neighborhoods (to_nbhd).
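
One reason to keep membership around is that set-to-set overlap follows directly from it. The sketch below computes intersection counts and a Jaccard index between two hypothetical labelings of the same six cells; it is a generic illustration and makes no claim about the metric or output format of calc_overlap.

```python
import numpy as np

# Two hypothetical clusterings of the same six cells.
labels_a = np.array([0, 0, 1, 1, 1, 2])
labels_b = np.array([0, 0, 0, 1, 1, 1])


def one_hot(labels):
    """Return a sets x cells indicator matrix for integer labels."""
    m = np.zeros((labels.max() + 1, labels.size), dtype=np.int64)
    m[labels, np.arange(labels.size)] = 1
    return m


ma, mb = one_hot(labels_a), one_hot(labels_b)

# Shared cells between every set in A and every set in B.
intersection = ma @ mb.T
# |A| + |B| - |A and B| gives the size of each pairwise union.
union = ma.sum(axis=1)[:, None] + mb.sum(axis=1)[None, :] - intersection
jaccard = intersection / union

print(intersection)
# [[2 0]
#  [1 2]
#  [0 1]]
```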
