Select API

The select module provides a small query and sampling layer over AnnData. It is designed to answer questions like:

  • Which entities match this metadata/gene filter?
  • In what stable order should those ids be shown?
  • What query and sampler produced that order?

The main class is Selector, used from the dega.select namespace:

import celldega as dega

selector = dega.select.Selector(adata)

Core Concepts

Selector separates three pieces of selection logic:

| Concept | API | Purpose |
| --- | --- | --- |
| Attribute | selector.attr(...), selector.gene(...) | Reference per-entity values from adata.obs or gene expression |
| Query | (selector.attr("cluster") == "B cell") | Define the candidate set with categorical, numeric, and gene-expression predicates combined with boolean logic |
| Sampler | selector.samplers.random(...), selector.samplers.quantile_bin(...), or sampler=3000 | Choose or rank ids from the candidate set |

This keeps the backend responsible for producing a stable ordered list of ids plus provenance. Widgets such as Yearbook can then paginate over that ordered list.

Basic Selection

selector = dega.select.Selector(adata)

query = (
    (selector.attr("cluster") == "B cell")
    & selector.attr("sample_id").isin(["S1", "S2"])
)

selection = selector.select(query=query)
selection.names()

selection.names() returns the ordered ids, usually names from adata.obs_names.

Safe Default Preview

If a query matches a very large number of entities and no sampler is provided, Selector returns a deterministic random preview instead of accidentally handing a widget or notebook hundreds of thousands of ids.

By default:

  • candidate sets up to 1,000 ids are returned in full;
  • candidate sets over 1,000 ids return a deterministic random preview of 1,000 ids;
  • a warning explains that previewing happened and how to request all matches.

selection = selector.select(query=query)

To intentionally return every matching id, pass sampler="all":

selection = selector.select(query=query, sampler="all")

For the common case of "give me N random ids", pass an integer. This uses the selector's default_preview_seed so the shorthand is reproducible:

selection = selector.select(query=query, sampler=3000)

To make sampling explicit and reproducible, pass a sampler:

selection = selector.select(
    query=query,
    sampler=selector.samplers.random(n=5000, seed=1),
)

The default preview size can be changed when the selector is created:

selector = dega.select.Selector(adata, default_preview_n=2000)

Set default_preview_n=None to disable the preview guard.
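The guard described in this section can be sketched in a few lines. This is only an illustration of the documented behavior, not the library's code: preview_ids is a hypothetical helper, the seed default of 0 is an assumption (the actual default_preview_seed value is not stated here), and the order of ids inside the preview is whatever the standard-library generator produces.

```python
import random
import warnings


def preview_ids(candidate_ids, default_preview_n=1000, seed=0):
    """Hypothetical stand-in for the no-sampler path of Selector.select."""
    ids = list(candidate_ids)
    # Guard disabled, or candidate set small enough: return everything in source order.
    if default_preview_n is None or len(ids) <= default_preview_n:
        return ids
    warnings.warn(
        f"{len(ids)} ids matched; returning a preview of {default_preview_n}. "
        'Pass sampler="all" to return every match.',
        stacklevel=2,
    )
    # A seeded generator makes the preview identical on every call.
    return random.Random(seed).sample(ids, default_preview_n)
```

Because the generator is re-created from the same seed on each call, repeated calls with the same candidates return the same preview.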

Sampling And Ranking

Sampler constructors are grouped under selector.samplers.

Each built-in sampler is backed by a concrete sampler class such as RandomSampler, QuantileBinSampler, RankSampler, GaussianSampler, or StratifiedSampler. selector.samplers.* is the notebook-friendly constructor namespace for creating those objects.

Built-In Samplers

| Sampler | API | Behavior | Good for |
| --- | --- | --- | --- |
| Random | selector.samplers.random(n=..., seed=...) or sampler=3000 | Random sample, optionally reproducible with seed | Quick representative subsets |
| Rank | selector.samplers.rank(attr=..., n=..., by="high") | Deterministic top or bottom ids by attribute value | Highest-expression or lowest-QC examples |
| Quantile Bin | selector.samplers.quantile_bin(attr=..., bin="high", ...) | Sample from a low/mid/high region of a distribution | Representative sampling from tails or the middle |
| Gaussian | selector.samplers.gaussian(attr=..., center=..., std=..., n=...) | Bias samples toward a numeric target value | “Near this score” or “around this expression level” |
| Stratified | selector.samplers.stratified(attr=..., n_per_category=...) or n=... | Evenly distribute samples across categories | Balanced cluster/sample selections |

selection = selector.select(
    query=query,
    sampler=selector.samplers.random(n=24, seed=1),
)

Random sampling is the simplest exploratory sampler. If you just want a bounded subset and do not care about an attribute-specific distribution, this is usually the right default.

For deterministic top or bottom ids by an attribute:

selection = selector.select(
    sampler=selector.samplers.rank(
        attr=selector.gene("MS4A1"),
        n=24,
        by="high",
    ),
)

For gene-expression-driven inspection, use a quantile-bin sampler:

selection = selector.select(
    query=query,
    sampler=selector.samplers.quantile_bin(
        attr=selector.gene("MS4A1"),
        bin="high",
        n=24,
        seed=1,
    ),
)

The returned ids preserve the sampler's order. For bin="high", selected ids are ordered from higher to lower expression after sampling.

For narrower tails such as "top 5%", use proportion or percentile:

selection = selector.select(
    query=query,
    sampler=selector.samplers.quantile_bin(
        attr=selector.gene("MS4A1"),
        bin="high",
        percentile=5,
    ),
)

For Gaussian-weighted sampling around a target value:

selection = selector.select(
    sampler=selector.samplers.gaussian(
        attr=selector.attr("qc_score"),
        center=0.8,
        std=0.05,
        n=24,
        seed=1,
    ),
)

For even sampling across categories:

selection = selector.select(
    sampler=selector.samplers.stratified(
        attr=selector.attr("cluster"),
        n_per_category=10,
        seed=1,
    ),
)

For a total quota distributed as evenly as possible across categories:

selection = selector.select(
    sampler=selector.samplers.stratified(
        attr=selector.attr("cluster"),
        n=100,
        seed=1,
    ),
)

Result Objects

selector.select(...) returns a Selection. It behaves like an ordered list of selected ids and also carries provenance.

len(selection)
selection[0]
list(selection)
selection.names()

For notebook work:

selection.to_dataframe()

For serialization or frontend integration:

selection.to_json()

The JSON-ready object includes:

  • ids: ordered selected ids
  • query: serialized query expression
  • sampler: serialized sampler definition
  • candidate_count: number of entities matching the query
  • selected_count: number of returned ids
  • scores: optional ranking scores keyed by id
  • provenance: execution metadata

Validation

Queries are validated when they are executed by selector.select(...).

  • Missing adata.obs columns raise KeyError.
  • Missing genes raise KeyError.
  • Missing layers raise KeyError.
  • Missing adata.raw raises ValueError when raw=True.

This means query objects can be built lazily, but the selector confirms that their attributes exist in the AnnData object before returning a selection.

Multiple AnnDatas

A selector is bound to one AnnData object. For multiple datasets, instantiate one selector per AnnData and use explicit names:

skin_selector = dega.select.Selector(skin_adata)
lymph_selector = dega.select.Selector(lymph_adata)

skin_selection = skin_selector.select(
    query=skin_selector.attr("cluster") == "B cell",
)

lymph_selection = lymph_selector.select(
    query=lymph_selector.attr("cluster") == "B cell",
)

Spelling out selector and selection is recommended for real notebooks because the two concepts are easy to confuse if abbreviated.

Back-End vs. Front-End Selection

There are two complementary ways to decide which cells a Yearbook shows. They solve the same problem — "which cells, in what order" — but run in different places and have different requirements.

| | Back-end selection (select module) | Front-end query (front_end_query) |
| --- | --- | --- |
| Runs in | Python, before the widget renders | The browser, against LandscapeFiles |
| Requires | An in-memory AnnData object | Only base_url (no Python AnnData) |
| Expressiveness | Full query algebra + five samplers/rankers | Single cluster filter and/or single-gene ranking |
| Reproducibility | Seeded, serialized query + sampler + scores | Stateless; recomputed in the browser each time |
| Provenance | selection.to_json() captured on the widget | Query dict only |
| Pass to Yearbook as | selection= (or a plain id list via cells=) | front_end_query= |

Use the back-end path when you have an AnnData object and want rich, reproducible queries (boolean logic across obs columns and genes, quantile-bin or Gaussian sampling, stratified balancing, captured scores and provenance).

Use the front-end path for lightweight, AnnData-free browsing directly from a dataset URL — for example "show cells in cluster 8" or "rank cells by BRCA1 expression". See the Front-End Query section of the viz docs for the supported dict shapes.

The two map onto each other. A single-gene front-end query {"gene": "MS4A1"} is the in-browser equivalent of the back-end rank sampler:

selection = selector.select(
    sampler=selector.samplers.rank(attr=selector.gene("MS4A1"), by="high"),
)

Yearbook Integration

Yearbook can render a back-end selection directly:

yearbook = dega.viz.Yearbook(
    base_url=base_url,
    selection=selection,
    rows=2,
    cols=4,
)

Internally, Yearbook uses selection.names() as its ordered cells list and stores selection.to_json() for provenance. The frontend can paginate over the ordered ids without needing to understand the query machinery.

selection= accepts a Selection, a JSON-ready selection dict, or a plain list of cell ids. Pass either selection= or cells=, not both.

API Reference

Composable query and sampling tools for selecting AnnData entities.

Attribute dataclass

Reference to an AnnData-backed attribute.

An Attribute is a lazy reference to one column of per-entity values: an adata.obs column (kind="obs") or a gene's expression vector (kind="gene", optionally from a named layer or from adata.raw). It becomes concrete only when a query or sampler is evaluated by a Selector.

Attributes are usually created through Selector.attr or Selector.gene rather than instantiated directly.

Comparison operators build queries; they do not return booleans. Because the operators (==, !=, <, <=, >, >=) and the helper methods (isin, between, ...) return PredicateQuery objects, an Attribute reads like a value but composes like an expression:

selector.attr("qc") >= 0.8            # PredicateQuery, not a bool
selector.attr("cluster").isin(["B", "T"])

The dataclass is declared frozen=True, eq=False: frozen makes it an immutable value object, and eq=False is required so that overriding __eq__ to return a query does not clash with dataclass value-equality (it keeps the default identity-based __hash__).
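The pattern described above, a frozen dataclass whose comparison operators return query objects, can be illustrated with a small standalone sketch. Attr and Predicate are hypothetical stand-ins for Attribute and PredicateQuery; their fields are assumptions made for the example and do not reflect the library's actual definitions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Predicate:
    """Hypothetical stand-in for PredicateQuery: one comparison on one attribute."""

    name: str
    op: str
    value: Any


@dataclass(frozen=True, eq=False)
class Attr:
    """Hypothetical stand-in for Attribute."""

    name: str

    # eq=False stops the dataclass from generating a value-based __eq__,
    # so this hand-written version is the only one in play.
    def __eq__(self, other):
        return Predicate(self.name, "==", other)

    def __ge__(self, other):
        return Predicate(self.name, ">=", other)


query = Attr("qc") >= 0.8  # a Predicate, not a bool
```

The comparison produces a description of the test to run later, which is what allows validation and evaluation to be deferred until a selector executes the query.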

__eq__(other)

Build an equality query (attr == value).

__ge__(other)

Build a greater-than-or-equal query (attr >= value).

__gt__(other)

Build a greater-than query (attr > value).

__le__(other)

Build a less-than-or-equal query (attr <= value).

__lt__(other)

Build a less-than query (attr < value).

__ne__(other)

Build an inequality query (attr != value).

between(left, right, inclusive='both')

Build a query matching values in the range [left, right].

inclusive controls which endpoints count, mirroring pandas.Series.between ("both", "neither", "left", "right").

evaluate(selector)

Resolve this reference to a concrete Series aligned to selector.ids.

Dispatches to the selector's private resolver for the attribute kind: obs columns via Selector._obs_attribute, gene expression via Selector._gene_attribute (honoring layer and raw).

isin(values)

Build a query matching entities whose value is in values.

isna()

Build a query matching entities with a missing value.

notin(values)

Build a query matching entities whose value is not in values.

notna()

Build a query matching entities with a non-missing value.

to_dict()

Return a JSON-ready description of this reference.

Always includes type (the kind) and name; layer and raw are included only when set, so the serialized form stays minimal.

GaussianSampler dataclass

Select ids whose numeric attribute value is near a target center.

Each candidate gets a Gaussian weight exp(-0.5 * ((value - center) / std) ** 2): a value exactly at center scores 1.0 and the weight falls off with distance. std controls the tolerance -- small std is a sharp peak (only very close values matter), large std is broad. (The usual 1 / (std * sqrt(2*pi)) normalizing constant is omitted because it cancels when sorting and when normalizing the sampling probabilities.)

Two modes

  • Rank everything (n is None or >= the candidate count): return all candidates ordered by closeness to center (closest first).
  • Weighted subsample (n smaller than the candidate count): draw n ids without replacement using the Gaussian weights as probabilities (seeded via seed), then re-order the draw closest-first.

In both modes the per-id weight is attached as the selection's score.
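As a worked illustration of the weight formula and the "rank everything" mode, the sketch below computes weights for a handful of made-up QC scores using only the standard library. The cell ids and values are invented for the example, and gaussian_weight is a hypothetical helper rather than part of the select module.

```python
import math


def gaussian_weight(value, center, std):
    """Unnormalized Gaussian weight; equals 1.0 when value == center."""
    return math.exp(-0.5 * ((value - center) / std) ** 2)


# Made-up QC scores keyed by hypothetical cell ids.
qc = {"cell_a": 0.80, "cell_b": 0.78, "cell_c": 0.60, "cell_d": 0.85}
weights = {cid: gaussian_weight(v, center=0.8, std=0.05) for cid, v in qc.items()}

# "Rank everything" mode: order ids closest-first, i.e. by descending weight.
ranked = sorted(weights, key=weights.get, reverse=True)
```

Here cell_a sits exactly on the center and gets weight 1.0, while cell_c, four standard deviations away, gets a weight close to zero and lands last.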

Use for "around this value" inspection -- e.g. cells near a particular QC score or expression level.

Parameters

  • attr: Numeric attribute to weight on.
  • center: Target value the sampler is biased toward.
  • std: Standard deviation of the Gaussian; must be positive. Larger = broader.
  • n: Number of ids to draw. None ranks all candidates by closeness.
  • seed: Random seed used only in the weighted-subsample mode.

Notes

In the subsample mode, an aggressively small std can drive many weights to underflow to exactly 0. If fewer than n candidates retain a positive weight, the underlying rng.choice(replace=False, p=...) will raise.
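The underflow is easy to reproduce with plain floats. The values below are invented for the example; the point is only that a very small std pushes most weights to exactly 0.0, so counting the positive weights before asking for n draws is one way a caller might anticipate the error. The check is a suggestion, not something the library is documented to do.

```python
import math

center, std = 0.8, 0.001  # deliberately tiny std
values = [0.80, 0.75, 0.50, 0.10]  # made-up attribute values

weights = [math.exp(-0.5 * ((v - center) / std) ** 2) for v in values]

# Only candidates with a positive weight can be drawn without replacement.
positive = sum(w > 0 for w in weights)
```

With these numbers only the value at the center keeps a positive weight, so requesting n=2 from this candidate set would hit the failure described above.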

__post_init__()

Validate that n is non-negative and std is strictly positive.

apply(selector, candidate_ids)

Weight candidates by a Gaussian and either rank or weighted-sample them.

Resolves the attribute, drops non-numeric/missing values, computes the Gaussian weights, and either returns all candidates ordered by closeness (when n covers them all) or draws a seeded weighted subset that is then re-ordered closest-first. Returns an empty result with a reason when no numeric values are available.

to_dict()

Return a JSON-ready {type, attr, center, std, n, seed} description.

QuantileBinSampler dataclass

Sample ids from a low/mid/high quantile bin for a numeric attribute.

The attribute's values define two cut points at quantiles q_low and q_high, splitting entities into three bins. bin selects which one to draw from:

  • "low" -> values <= low_cut, ordered ascending;
  • "mid" -> values strictly between the cuts, ordered by closeness to the median;
  • "high" -> values >= high_cut, ordered descending.

The interior boundaries are half-open so the bins partition cleanly (a value sitting exactly on a cut lands in one bin only) -- this matters for tie-heavy data such as raw counts.
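The bin rules above can be sketched as follows. This is a minimal illustration under stated assumptions: select_bin is a hypothetical helper, the two cut points are passed in precomputed rather than derived from quantiles, and "closeness to the median" is taken to mean the median of all supplied values.

```python
import statistics


def select_bin(values, low_cut, high_cut, which):
    """Return the ids in one bin, in that bin's natural order.

    values maps id -> numeric value; the cuts are assumed to be precomputed.
    """
    if which == "low":
        ids = [i for i, v in values.items() if v <= low_cut]
        return sorted(ids, key=lambda i: values[i])
    if which == "high":
        ids = [i for i, v in values.items() if v >= high_cut]
        return sorted(ids, key=lambda i: values[i], reverse=True)
    if which == "mid":
        ids = [i for i, v in values.items() if low_cut < v < high_cut]
        median = statistics.median(values.values())
        return sorted(ids, key=lambda i: abs(values[i] - median))
    raise ValueError(f"unknown bin: {which!r}")


expr = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5, "g": 6}
high = select_bin(expr, low_cut=1.5, high_cut=4.5, which="high")
mid = select_bin(expr, low_cut=1.5, high_cut=4.5, which="mid")
```

In this toy data the high bin comes back largest-first, and the mid bin starts with the id whose value equals the median.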

Useful for representative inspection: e.g. "show me high-expressing cells for this gene" while preserving a stable ranked order in the returned selection.

Specifying the band

The band width can be given three ways (mutually exclusive forms of the same idea):

  • q_low / q_high directly (default thirds: 1/3 and 2/3);
  • proportion -- a fraction in (0, 1] giving the tail/center size;
  • percentile -- the same as proportion but on a 0-100 scale.

With proportion/percentile the cut(s) are derived per bin: "low" takes the bottom fraction, "high" the top fraction, and "mid" a centered band of that width around the median.
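One way to read the per-bin derivation is as a mapping from the band specification to quantile cut points. The helper below is a hypothetical sketch of that mapping: the function name, the (q_low, q_high) tuple it returns, and the use of None for a cut that is not needed are all assumptions for illustration, and "centered around the median" is interpreted as centered on the 0.5 quantile.

```python
def band_to_quantiles(which, proportion=None, percentile=None):
    """Map a band specification to (q_low, q_high); None marks an unused cut."""
    if proportion is not None and percentile is not None:
        raise ValueError("pass proportion or percentile, not both")
    if percentile is not None:
        proportion = percentile / 100
    if proportion is None:
        return (1 / 3, 2 / 3)  # documented default: thirds
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    if which == "low":
        return (proportion, None)  # bottom fraction
    if which == "high":
        return (None, 1 - proportion)  # top fraction
    if which == "mid":
        half = proportion / 2
        return (0.5 - half, 0.5 + half)  # band centered on the 0.5 quantile
    raise ValueError(f"unknown bin: {which!r}")


top_5_percent = band_to_quantiles("high", percentile=5)
```

Under this reading, percentile=5 with bin="high" places the high cut at the 0.95 quantile.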

Sampling vs ranking

If n is None or the bin has <= n members, the whole bin is returned in ranked order. If the bin is larger than n, a seeded random subset of size n is drawn and then re-sorted into the bin's natural order. The per-id value is attached as the selection's score.
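The "subsample, then re-sort" step can be sketched by sampling positions instead of ids, so that sorting the sampled positions restores the bin's natural order. rank_or_subsample is a hypothetical helper and uses the standard-library generator rather than whatever generator the library uses.

```python
import random


def rank_or_subsample(ranked_ids, n=None, seed=None):
    """Return the whole ranked bin, or a seeded subset kept in ranked order."""
    if n is None or len(ranked_ids) <= n:
        return list(ranked_ids)
    rng = random.Random(seed)
    # Sample positions, then sort them so the subset keeps the ranked order.
    keep = sorted(rng.sample(range(len(ranked_ids)), n))
    return [ranked_ids[i] for i in keep]


ranked = list("abcdefghij")  # pretend this is a bin already in natural order
subset = rank_or_subsample(ranked, n=4, seed=1)
```

The subset is random in membership but deterministic for a given seed, and its members appear in the same relative order as in the full bin.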

Parameters

  • attr: Numeric attribute to bin (an obs column or a gene).
  • bin: Which bin to draw from: "low", "mid", or "high".
  • n: Maximum number of ids to return. None returns the whole bin.
  • seed: Random seed used only when the bin is subsampled.
  • q_low, q_high: Lower/upper quantile cut points in [0, 1] with q_low <= q_high.
  • proportion: Alternative band specification as a fraction in (0, 1].
  • percentile: Alternative band specification as a percentage in (0, 100].

__post_init__()

Validate n, the bin name, the quantile order, and the band specs.

Enforces 0 <= q_low <= q_high <= 1, that proportion and percentile are not both set, and that each falls in its valid range.

apply(selector, candidate_ids)

Bin the candidates by quantile, then rank or subsample the chosen bin.

Resolves the attribute over the candidates, drops non-numeric/missing values, computes the two quantile cuts, selects the requested bin in its natural order, and (if larger than n) draws a seeded random subset that is then re-sorted. Returns an empty result with a reason when no numeric values are available. Provenance records the cut points, bin size, and seed.

to_dict()

Return a JSON-ready description, including the band specification.

Always includes the attribute, bin, n, seed, and the q_low/q_high cuts; proportion/percentile are added only when they were supplied.

Query

Base class for boolean query expressions.

A query is a small, immutable expression tree. Leaves are PredicateQuery nodes (one comparison on one attribute) and internal nodes are BooleanQuery combinators. Query objects are normally created by comparing attributes from Selector.attr or Selector.gene rather than constructed directly, and combined with the Python boolean operators below. Parenthesize each comparison, because & and | bind more tightly than comparison operators in Python:

(selector.attr("cluster") == "B cell") & (selector.gene("MS4A1") > 2)

Evaluation is deferred: a query holds no data until evaluate runs it against a Selector.
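To make the expression-tree idea concrete, here is a small standalone sketch of leaves, combinators, and deferred evaluation. Pred and BoolQuery are hypothetical stand-ins for PredicateQuery and BooleanQuery, and the sketch evaluates against a plain list of row dicts instead of a Selector, returning a list of booleans rather than a pd.Series.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable


class Query:
    """Base class: the operators only build tree nodes, they evaluate nothing."""

    def __and__(self, other):
        return BoolQuery("and", (self, other))

    def __or__(self, other):
        return BoolQuery("or", (self, other))

    def __invert__(self):
        return BoolQuery("not", (self,))


@dataclass(frozen=True)
class Pred(Query):
    """Leaf: one comparison on one named field."""

    name: str
    op: Callable[[Any, Any], bool]
    value: Any

    def evaluate(self, rows):
        return [self.op(row[self.name], self.value) for row in rows]


@dataclass(frozen=True)
class BoolQuery(Query):
    """Internal node combining the masks of its children."""

    kind: str
    parts: tuple

    def evaluate(self, rows):
        masks = [part.evaluate(rows) for part in self.parts]
        if self.kind == "not":
            return [not m for m in masks[0]]
        combine = all if self.kind == "and" else any
        return [combine(column) for column in zip(*masks)]


# Made-up rows standing in for per-cell metadata and expression.
rows = [
    {"cluster": "B cell", "MS4A1": 3.0},
    {"cluster": "B cell", "MS4A1": 1.0},
    {"cluster": "T cell", "MS4A1": 5.0},
]
query = Pred("cluster", operator.eq, "B cell") & Pred("MS4A1", operator.gt, 2)
mask = query.evaluate(rows)
```

Building query touches no data; the rows are only read when evaluate is called, which mirrors the deferred evaluation described above.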

__and__(other)

Combine with other using logical AND (q1 & q2).

__invert__()

Negate this query using logical NOT (~q).

__or__(other)

Combine with other using logical OR (q1 | q2).

evaluate(selector)

Evaluate this query against selector and return a boolean mask.

The returned pd.Series is indexed by the selector's entity ids, with True for entities that match. Subclasses implement the actual logic.

to_dict()

Return a JSON-ready representation of this query expression.

RandomSampler dataclass

Randomly order or sample candidate ids.

With n=None this is a shuffle: it returns all candidates in random order. With an n, it draws that many. Without replace the draw is a subset (n is clamped to the number of candidates); with replace=True ids may repeat and the result can be longer than the candidate set. A seed makes the draw reproducible. This sampler attaches no scores.
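The three behaviors (shuffle, clamped subset, draw with replacement) can be sketched with the standard library. random_sample is a hypothetical helper; the library is documented to use a seeded NumPy generator, so the actual ids drawn for a given seed will differ from this sketch, and returning len(ids) draws when replace=True and n is None is an assumption.

```python
import random


def random_sample(candidate_ids, n=None, seed=None, replace=False):
    """Shuffle, subset, or draw with replacement from candidate_ids."""
    ids = list(candidate_ids)
    if not ids or n == 0:
        return []
    rng = random.Random(seed)
    if replace:
        # With replacement the result may repeat ids and exceed len(ids).
        k = len(ids) if n is None else n
        return rng.choices(ids, k=k)
    # Without replacement n is clamped; n=None degenerates to a shuffle.
    k = len(ids) if n is None else min(n, len(ids))
    return rng.sample(ids, k)


shuffled = random_sample(range(5), seed=1)
clamped = random_sample("abc", n=10, seed=1)
repeated = random_sample("abc", n=10, seed=1, replace=True)
```

Only the replace=True call can return more ids than there are candidates.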

Parameters

  • n: Number of ids to return. If None, all candidate ids are shuffled.
  • seed: Optional random seed for reproducible selections.
  • replace: Whether ids may be sampled more than once.

__post_init__()

Validate that n is non-negative (or None).

apply(selector, candidate_ids)

Draw (or shuffle) ids with a seeded NumPy generator.

Returns an empty result when there are no candidates or n == 0. Without replacement, uses a permutation truncated to min(n, len); with replacement, uses rng.choice so ids may repeat. selector is unused (the draw needs no attribute values).

RankSampler dataclass

Deterministically return the highest- or lowest-valued ids for an attribute.

Sorts the candidates by a numeric attribute and takes the top (by="high") or bottom (by="low") n. Unlike the random/Gaussian samplers this is fully deterministic (no seed): a stable mergesort is used so ties keep their original relative order. The per-id value is attached as the score.
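The effect of a stable sort on ties can be seen in a short sketch. It uses Python's built-in sorted, which is also stable (including with reverse=True), as a stand-in for the mergesort mentioned above; rank is a hypothetical helper and the values are made up.

```python
def rank(values, n=None, by="high"):
    """Return ids ordered by value; ties keep their original relative order."""
    if by not in ("high", "low"):
        raise ValueError('by must be "high" or "low"')
    # sorted() is stable, and reverse=True preserves the order of equal keys.
    ordered = sorted(values, key=values.get, reverse=(by == "high"))
    return ordered if n is None else ordered[:n]


counts = {"c1": 5, "c2": 9, "c3": 5, "c4": 1}  # c1 and c3 tie
top = rank(counts, by="high")
bottom_two = rank(counts, n=2, by="low")
```

In both directions c1 stays ahead of c3, because the tie is resolved by input order rather than by chance.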

Good for "top markers" style inspection -- e.g. the highest-expressing cells for a gene, or the lowest-QC cells.

Parameters

  • attr: Numeric attribute to rank by.
  • n: Number of ids to return. None returns all candidates, ranked.
  • by: "high" for descending (largest first), "low" for ascending.

__post_init__()

Validate that n is non-negative and by is "high" or "low".

apply(selector, candidate_ids)

Sort candidates by value and take the top/bottom n.

Resolves the attribute, drops non-numeric/missing values, sorts in the requested direction, and truncates to n. Returns an empty result with a reason when no numeric values are available.

to_dict()

Return a JSON-ready {type, attr, n, by} description.

Sampler

Bases: Protocol

Structural protocol implemented by every sampler/ranker.

A sampler turns a candidate pd.Index into an ordered subset. Anything with the two methods below satisfies the protocol, so Selector.select accepts the built-in samplers (and any duck-typed equivalent) uniformly.

apply(selector, candidate_ids)

Choose and order ids from candidate_ids, returning a SamplingResult.

Receives the bound selector (so attribute-based samplers can resolve their values) and the already-narrowed candidate index, so the sampler only ever does work proportional to the candidate set, not the full AnnData.

to_dict()

Return a JSON-ready description of the sampler and its parameters.
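Structural typing of this kind can be illustrated with typing.Protocol. The sketch below is an assumption-heavy illustration: SamplerLike and EveryOther are hypothetical names, and apply returns a plain list because the constructor of the library's SamplingResult is not described here, so this class is not guaranteed to work with Selector.select as written.

```python
from typing import Any, Protocol, Sequence, runtime_checkable


@runtime_checkable
class SamplerLike(Protocol):
    """Hypothetical mirror of the two-method Sampler protocol."""

    def apply(self, selector: Any, candidate_ids: Sequence[str]) -> Any: ...

    def to_dict(self) -> dict: ...


class EveryOther:
    """Hypothetical custom sampler: keeps every second candidate."""

    def apply(self, selector, candidate_ids):
        # A real sampler would wrap this in a SamplingResult.
        return list(candidate_ids)[::2]

    def to_dict(self):
        return {"type": "every_other"}


sampler = EveryOther()
picked = sampler.apply(None, ["a", "b", "c", "d", "e"])
```

EveryOther never inherits from SamplerLike; it satisfies the protocol purely by having the two methods.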

Selection dataclass

Ordered ids selected from a Selector.

Selection is the object returned by Selector.select. It stores stable ordered ids plus the query, sampler, scores, and provenance used to create that order. It is intentionally list-like, so it can be iterated, indexed, and passed to consumers such as celldega.viz.Yearbook.

Attributes

  • ids: Ordered selected entity ids. For AnnData objects these are usually names from adata.obs_names.
  • query: JSON-ready query representation, or None when no query was used.
  • sampler: JSON-ready sampler representation, or None when ids were returned in source order.
  • candidate_count: Number of entities matching the query before sampling.
  • selected_count: Number of ids returned in ids.
  • provenance: Execution metadata, including source AnnData shape and sampler details.
  • scores: Optional score values keyed by selected id. Ranking samplers, such as quantile-bin gene selection, may populate this.

__getitem__(index)

Index or slice the ordered ids (e.g. selection[0], selection[:5]).

__iter__()

Iterate over the ordered ids, so a Selection works like a list.

__len__()

Return the number of selected ids.

names()

Return selected entity names in stable result order.

This is the most direct way to pass a selection to code that expects a plain list of ids.

page(page, per_page)

Return one zero-based page of ids.

Slices the ordered ids into fixed-size pages, e.g. for paginating a portrait grid. page(0, 24) returns the first 24 ids, page(1, 24) the next 24, and so on. A page past the end returns an empty list.

Parameters

  • page: Zero-based page index. Must be non-negative.
  • per_page: Number of ids per page. Must be positive.
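The paging rule amounts to a bounds-checked slice. The function below is a hypothetical standalone sketch of that behavior over a plain list; the exception type raised for invalid arguments is an assumption.

```python
def page_of(ids, page, per_page):
    """Return one zero-based page of ids; past-the-end pages are empty."""
    if page < 0:
        raise ValueError("page must be non-negative")
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    start = page * per_page
    return list(ids[start:start + per_page])


cell_ids = [f"cell_{i}" for i in range(50)]  # made-up ids
first = page_of(cell_ids, 0, 24)
last = page_of(cell_ids, 2, 24)  # only two ids remain on this page
```

Python slicing already tolerates a start index beyond the end of the list, which is why a page past the end comes back empty rather than raising.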

to_dataframe()

Return selected ids as a ranking DataFrame.

The returned frame always contains id and zero-based rank columns, with rows in result order. If the sampler produced scores, a score column is included (aligned by id; entries default to None for any id without a score). Handy for inspecting a selection in a notebook or joining it back onto other per-entity tables.

to_dict()

Return a JSON-ready representation of the selection.

This is an alias for to_json.

to_frame()

Return selected ids as a ranking DataFrame.

This is an alias for to_dataframe.

to_json()

Return a JSON-ready object including query, sampler, and provenance.

The payload contains ids, the serialized query and sampler (each None when unused), candidate_count, selected_count, JSON-coerced provenance, and scores when the sampler produced them. This is exactly what celldega.viz.Yearbook stores to record how a portrait set was chosen.

Selector

Query and selection interface for an AnnData object.

Selector is the public object for building and executing Celldega selections. A selector is bound to one AnnData object, so every query is validated and evaluated against that object's obs, var_names, X, and optional layers. To work with multiple AnnData objects, instantiate one selector per object.

Parameters

  • adata: AnnData-like object. The first implementation selects over adata.obs rows and can resolve metadata attributes from obs plus gene expression vectors from X or a named layer.
  • default_preview_n: Maximum number of ids to return when no sampler is provided and the candidate set is larger than this value. Set to None to disable the preview guard.
  • default_preview_seed: Seed used for the deterministic random preview when default_preview_n is triggered.

Examples

selector = dega.select.Selector(adata)

query = (
    (selector.attr("cluster") == "B cell")
    & selector.attr("sample_id").isin(["S1", "S2"])
)

selection = selector.select(
    query=query,
    sampler=selector.samplers.quantile_bin(
        attr=selector.gene("MS4A1"),
        bin="high",
        n=24,
        seed=1,
    ),
)
selection.names()

ids property

Entity ids available to this selector.

attr(name)

Reference a per-entity metadata attribute from adata.obs.

The attribute name is validated when a query or sampler using it is executed. Missing columns raise KeyError.

gene(name, *, layer=None, raw=False)

Reference a gene expression vector by gene name.

Parameters

  • name: Gene name in adata.var_names.
  • layer: Optional layer name to use instead of adata.X.
  • raw: If True, use adata.raw.

Notes

Gene and layer names are validated when the attribute is evaluated. Missing genes or layers raise KeyError.

select(query=None, sampler=None)

Evaluate a query and optionally sample/rank the matching ids.

Parameters

  • query: Boolean query expression built from attr, gene, and boolean operators. If omitted, every AnnData observation is a candidate.
  • sampler: How to order/subset the candidates. Accepts several forms, dispatched in this order:

    - None: return candidates in source order, unless the candidate set exceeds default_preview_n, in which case a deterministic random preview is returned with a warning (the guard against accidentally materializing a huge selection);
    - a bool: rejected with a clear error (guards against sampler=True being silently treated as the integer 1);
    - an int: shorthand for a deterministic random sample of that many ids, seeded with default_preview_seed;
    - "all": explicitly return every matching id, no preview guard;
    - a sampler from selector.samplers: applied to the candidates.

Returns

Selection Stable ordered selected ids plus JSON-ready query, sampler, scores, and provenance (source shape, candidate/selected counts, and the sampler's own provenance).

Notes

The query is evaluated first to narrow the candidate set, and the sampler only ever sees that narrowed index -- so expensive samplers do work proportional to the matches, not the whole AnnData.
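The dispatch order for the sampler argument matters mainly because bool is a subclass of int in Python, so the bool check has to come before the int check. The sketch below shows that ordering with a hypothetical describe_sampler helper that only reports which branch would be taken; the exception types are assumptions.

```python
def describe_sampler(sampler):
    """Report which documented branch a sampler argument would take."""
    if sampler is None:
        return "source order (preview guard may apply)"
    # isinstance(True, int) is True, so bools must be rejected first.
    if isinstance(sampler, bool):
        raise TypeError("sampler must not be a bool; pass an int, 'all', or a sampler")
    if isinstance(sampler, int):
        return f"seeded random sample of {sampler} ids"
    if sampler == "all":
        return "every matching id"
    if hasattr(sampler, "apply") and hasattr(sampler, "to_dict"):
        return "sampler object"
    raise TypeError(f"unsupported sampler: {sampler!r}")


branch = describe_sampler(3000)
```

Swapping the bool and int checks would let sampler=True through as a request for a single random id, which is the silent failure the documented order is designed to prevent.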

StratifiedSampler dataclass

Draw a balanced sample across the categories of a categorical attribute.

Provide exactly one of two quota modes:

  • n_per_category -- take up to this many ids from each category (capped by how many that category actually has);
  • n -- a total quota distributed as evenly as possible across categories, via round-robin allocation (categories that run out are skipped, so larger categories absorb the remainder).

Within each category the ids are drawn at random (seeded via seed). The result is grouped by category in the order the categories are processed (i.e. not interleaved across categories). No scores are attached.
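The round-robin allocation for a total quota n can be sketched as repeatedly handing one slot to each category that still has ids left. allocate_round_robin is a hypothetical helper working on made-up per-category counts; the library's actual allocation order and tie handling may differ.

```python
def allocate_round_robin(available, n):
    """Split a total quota n across categories, one slot per pass.

    available maps category -> number of candidate ids in that category.
    """
    quotas = {category: 0 for category in available}
    remaining = n
    while remaining > 0:
        progressed = False
        for category, capacity in available.items():
            if remaining == 0:
                break
            if quotas[category] < capacity:  # exhausted categories are skipped
                quotas[category] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every category is exhausted
            break
    return quotas


quotas = allocate_round_robin({"A": 2, "B": 10, "C": 10}, n=10)
```

Category A runs out after two slots, so B and C absorb the remainder and each end up with four.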

Good for balanced inspection -- e.g. an equal number of cells per cluster.

Parameters

  • attr: Categorical attribute to stratify on.
  • n_per_category: Per-category quota. Mutually exclusive with n.
  • n: Total quota spread evenly across categories. Mutually exclusive with n_per_category.
  • seed: Random seed for the within-category draws.
  • categories: Optional explicit category order/subset. When None, categories are discovered from the data in first-seen order.

__post_init__()

Validate the quotas: non-negative, and exactly one of n / n_per_category.

apply(selector, candidate_ids)

Allocate per-category quotas, then draw that many ids from each category.

Resolves the attribute and drops missing values, groups candidates by category, computes each category's quota (fixed per-category, or round-robin for a total n), draws ids at random within each, and concatenates the groups. Provenance records per-stratum availability and sampled counts plus the quota mode.

to_dict()

Return a JSON-ready description; includes whichever quota and categories were set.