Select API
The select module provides a small query and sampling layer over AnnData. It is designed to answer questions like:
- Which entities match this metadata/gene filter?
- In what stable order should those ids be shown?
- What query and sampler produced that order?
The main class is Selector, used from the dega.select namespace:
import celldega as dega
selector = dega.select.Selector(adata)
Core Concepts
Selector separates three pieces of selection logic:
| Concept | API | Purpose |
|---|---|---|
| Attribute | selector.attr(...), selector.gene(...) | Reference per-entity values from adata.obs or gene expression |
| Query | (selector.attr("cluster") == "B cell") | Define the candidate set with categorical, numeric, and gene-expression predicates combined with boolean logic |
| Sampler | selector.samplers.random(...), selector.samplers.quantile_bin(...), or sampler=3000 | Choose or rank ids from the candidate set |
This separation keeps the backend responsible for producing a stable ordered list of ids plus provenance. Widgets such as Yearbook can then paginate over that ordered list.
Basic Selection
selector = dega.select.Selector(adata)
query = (
(selector.attr("cluster") == "B cell")
& selector.attr("sample_id").isin(["S1", "S2"])
)
selection = selector.select(query=query)
selection.names()
selection.names() returns the ordered ids, usually names from adata.obs_names.
Safe Default Preview
If a query matches a very large number of entities and no sampler is provided, Selector returns a deterministic random preview instead of accidentally handing a widget or notebook hundreds of thousands of ids.
By default:
- candidate sets up to 1,000 ids are returned in full;
- candidate sets over 1,000 ids return a deterministic random preview of 1,000 ids;
- a warning explains that previewing happened and how to request all matches.
selection = selector.select(query=query)
To intentionally return every matching id, pass sampler="all":
selection = selector.select(query=query, sampler="all")
For the common case of "give me N random ids", pass an integer. This uses the selector's default_preview_seed so the shorthand is reproducible:
selection = selector.select(query=query, sampler=3000)
To make sampling explicit and reproducible, pass a sampler:
selection = selector.select(
query=query,
sampler=selector.samplers.random(n=5000, seed=1),
)
The default preview size can be changed when the selector is created:
selector = dega.select.Selector(adata, default_preview_n=2000)
Set default_preview_n=None to disable the preview guard.
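The guard described in this section can be pictured roughly as follows. This is a minimal sketch in plain Python, not the library's implementation: `preview_ids`, the seed value `0`, and the warning text are hypothetical, and the standard-library `random` module stands in for whatever generator `Selector` actually uses.

```python
import random
import warnings


def preview_ids(candidate_ids, default_preview_n=1000, seed=0):
    """Return all ids, or a seeded random preview when there are too many.

    Hypothetical helper: the name, the seed default, and the warning text
    are illustrative assumptions, not celldega's API.
    """
    ids = list(candidate_ids)
    # default_preview_n=None disables the guard entirely.
    if default_preview_n is None or len(ids) <= default_preview_n:
        return ids
    warnings.warn(
        f"{len(ids)} ids matched; returning a preview of {default_preview_n}. "
        'Pass sampler="all" to return every match.',
        stacklevel=2,
    )
    # A fixed seed makes the preview identical across repeated calls.
    return random.Random(seed).sample(ids, default_preview_n)


ids = [f"cell_{i}" for i in range(2500)]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the preview warning for the demo
    first = preview_ids(ids)
    second = preview_ids(ids)
full = preview_ids(ids, default_preview_n=None)
```

Because the seed is fixed, `first` and `second` hold the same 1,000 ids in the same order, which is what lets a widget re-render the preview without reshuffling it.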
Sampling And Ranking
Sampler constructors are grouped under selector.samplers.
Each built-in sampler is backed by a concrete sampler class such as RandomSampler, QuantileBinSampler, RankSampler, GaussianSampler, or StratifiedSampler. selector.samplers.* is the notebook-friendly constructor namespace for creating those objects.
Built-In Samplers
| Sampler | API | Behavior | Good for |
|---|---|---|---|
| Random | selector.samplers.random(n=..., seed=...) or sampler=3000 | Random sample, optionally reproducible with seed | Quick representative subsets |
| Rank | selector.samplers.rank(attr=..., n=..., by="high") | Deterministic top or bottom ids by attribute value | Highest-expression or lowest-QC examples |
| Quantile Bin | selector.samplers.quantile_bin(attr=..., bin="high", ...) | Sample from a low/mid/high region of a distribution | Representative sampling from tails or the middle |
| Gaussian | selector.samplers.gaussian(attr=..., center=..., std=..., n=...) | Bias samples toward a numeric target value | “Near this score” or “around this expression level” |
| Stratified | selector.samplers.stratified(attr=..., n_per_category=...) or n=... | Evenly distribute samples across categories | Balanced cluster/sample selections |
selection = selector.select(
query=query,
sampler=selector.samplers.random(n=24, seed=1),
)
Random sampling is the simplest exploratory sampler. If you just want a bounded subset and do not care about an attribute-specific distribution, this is usually the right default.
For deterministic top or bottom ids by an attribute:
selection = selector.select(
sampler=selector.samplers.rank(
attr=selector.gene("MS4A1"),
n=24,
by="high",
),
)
For gene-expression-driven inspection, use a quantile-bin sampler:
selection = selector.select(
query=query,
sampler=selector.samplers.quantile_bin(
attr=selector.gene("MS4A1"),
bin="high",
n=24,
seed=1,
),
)
The returned ids preserve the sampler's order. For bin="high", selected ids are ordered from higher to lower expression after sampling.
For narrower tails such as "top 5%", use proportion or percentile:
selection = selector.select(
query=query,
sampler=selector.samplers.quantile_bin(
attr=selector.gene("MS4A1"),
bin="high",
percentile=5,
),
)
For Gaussian-weighted sampling around a target value:
selection = selector.select(
sampler=selector.samplers.gaussian(
attr=selector.attr("qc_score"),
center=0.8,
std=0.05,
n=24,
seed=1,
),
)
For even sampling across categories:
selection = selector.select(
sampler=selector.samplers.stratified(
attr=selector.attr("cluster"),
n_per_category=10,
seed=1,
),
)
For a total quota distributed as evenly as possible across categories:
selection = selector.select(
sampler=selector.samplers.stratified(
attr=selector.attr("cluster"),
n=100,
seed=1,
),
)
Result Objects
selector.select(...) returns a Selection. It behaves like an ordered list of selected ids and also carries provenance.
len(selection)
selection[0]
list(selection)
selection.names()
For notebook work:
selection.to_dataframe()
For serialization or frontend integration:
selection.to_json()
The JSON-ready object includes:
- ids: ordered selected ids
- query: serialized query expression
- sampler: serialized sampler definition
- candidate_count: number of entities matching the query
- selected_count: number of returned ids
- scores: optional ranking scores keyed by id
- provenance: execution metadata
Validation
Queries are validated when they are executed by selector.select(...).
- Missing adata.obs columns raise KeyError.
- Missing genes raise KeyError.
- Missing layers raise KeyError.
- Missing adata.raw raises ValueError when raw=True.
This means query objects can be built lazily, but the selector confirms that their attributes exist in the AnnData object before returning a selection.
Multiple AnnDatas
A selector is bound to one AnnData object. For multiple datasets, instantiate one selector per AnnData and use explicit names:
skin_selector = dega.select.Selector(skin_adata)
lymph_selector = dega.select.Selector(lymph_adata)
skin_selection = skin_selector.select(
query=skin_selector.attr("cluster") == "B cell",
)
lymph_selection = lymph_selector.select(
query=lymph_selector.attr("cluster") == "B cell",
)
Spelling out selector and selection is recommended for real notebooks because the two concepts are easy to confuse if abbreviated.
Backend vs. Front-End Selection
There are two complementary ways to decide which cells a Yearbook shows. They
solve the same problem — "which cells, in what order" — but run in different
places and have different requirements.
| | Back-end selection (select module) | Front-end query (front_end_query) |
|---|---|---|
| Runs in | Python, before the widget renders | The browser, against LandscapeFiles |
| Requires | An in-memory AnnData object | Only base_url (no Python AnnData) |
| Expressiveness | Full query algebra + five samplers/rankers | Single cluster filter and/or single-gene ranking |
| Reproducibility | Seeded, serialized query + sampler + scores | Stateless; recomputed in the browser each time |
| Provenance | selection.to_json() captured on the widget | Query dict only |
| Pass to Yearbook as | selection= (or a plain id list via cells=) | front_end_query= |
Use the back-end path when you have an AnnData object and want rich,
reproducible queries (boolean logic across obs columns and genes, quantile-bin
or Gaussian sampling, stratified balancing, captured scores and provenance).
Use the front-end path for lightweight, AnnData-free browsing directly from
a dataset URL — for example "show cells in cluster 8" or "rank cells by BRCA1
expression". See the Front-End Query section
of the viz docs for the supported dict shapes.
The two map onto each other. A single-gene front-end query
{"gene": "MS4A1"} is the in-browser equivalent of the back-end rank sampler:
selection = selector.select(
sampler=selector.samplers.rank(attr=selector.gene("MS4A1"), by="high"),
)
Yearbook Integration
Yearbook can render a back-end selection directly:
yearbook = dega.viz.Yearbook(
base_url=base_url,
selection=selection,
rows=2,
cols=4,
)
Internally, Yearbook uses selection.names() as its ordered cells list and
stores selection.to_json() for provenance. The frontend can paginate over the
ordered ids without needing to understand the query machinery.
selection= accepts a Selection, a JSON-ready selection dict, or a plain list
of cell ids. Pass either selection= or cells=, not both.
API Reference
Composable query and sampling tools for selecting AnnData entities.
Attribute
dataclass
Reference to an AnnData-backed attribute.
An Attribute is a lazy reference to one column of per-entity values: an
adata.obs column (kind="obs") or a gene's expression vector
(kind="gene", optionally from a named layer or from adata.raw). It
becomes concrete only when a query or sampler is evaluated by a
:class:Selector.
Attributes are usually created through :meth:Selector.attr or
:meth:Selector.gene rather than instantiated directly.
Comparison operators build queries; they do not return booleans. Because the
operators (==, !=, <, <=, >, >=) and the helper
methods (:meth:isin, :meth:between, ...) return :class:PredicateQuery
objects, an Attribute reads like a value but composes like an expression::
selector.attr("qc") >= 0.8 # PredicateQuery, not a bool
selector.attr("cluster").isin(["B", "T"])
The dataclass is declared frozen=True, eq=False: frozen makes it an
immutable value object, and eq=False is required so that overriding
__eq__ to return a query does not clash with dataclass value-equality
(it keeps the default identity-based __hash__).
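The pattern described above can be tried out in isolation. The sketch below uses hypothetical `Attr` and `Predicate` classes (not the library's `Attribute` and `PredicateQuery`) to show a frozen dataclass whose comparison operators return expression objects. One detail is worth knowing when reproducing the pattern: plain Python sets `__hash__` to `None` on any class that defines `__eq__` without also defining `__hash__`, so the sketch restores identity hashing explicitly. Whether the library relies on an equivalent line is not confirmed here.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Predicate:
    """Hypothetical stand-in for a PredicateQuery leaf."""

    name: str
    op: str
    value: object


@dataclass(frozen=True, eq=False)
class Attr:
    """Hypothetical stand-in for an attribute reference."""

    name: str

    def __eq__(self, other):
        # Returns an expression object instead of a bool.
        return Predicate(self.name, "==", other)

    def __ge__(self, other):
        return Predicate(self.name, ">=", other)

    # Restore identity-based hashing, which defining __eq__ would otherwise remove.
    __hash__ = object.__hash__


qc = Attr("qc")
predicate = qc >= 0.8  # a Predicate, not a bool

# frozen=True blocks attribute assignment after construction.
try:
    qc.name = "other"
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```

Since `==` no longer answers "are these two references equal?", code that needs to compare references should compare their fields (for example `a.name == b.name`) rather than the objects themselves.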
__eq__(other)
Build an equality query (attr == value).
__ge__(other)
Build a greater-than-or-equal query (attr >= value).
__gt__(other)
Build a greater-than query (attr > value).
__le__(other)
Build a less-than-or-equal query (attr <= value).
__lt__(other)
Build a less-than query (attr < value).
__ne__(other)
Build an inequality query (attr != value).
between(left, right, inclusive='both')
Build a query matching values in the range [left, right].
inclusive controls which endpoints count, mirroring
pandas.Series.between ("both", "neither", "left",
"right").
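The four `inclusive` modes are easy to mix up, so here is a plain-Python analogue of the endpoint rules for a single value. `in_range` is a hypothetical helper written for illustration; the library builds a query object rather than returning a bool.

```python
def in_range(value, left, right, inclusive="both"):
    """Pure-Python analogue of the four `inclusive` modes (illustrative only)."""
    if inclusive not in {"both", "neither", "left", "right"}:
        raise ValueError(f"unknown inclusive mode: {inclusive!r}")
    # "both" and "left" include the lower endpoint.
    lower_ok = value >= left if inclusive in {"both", "left"} else value > left
    # "both" and "right" include the upper endpoint.
    upper_ok = value <= right if inclusive in {"both", "right"} else value < right
    return lower_ok and upper_ok


on_lower_edge = {
    mode: in_range(0.5, 0.5, 1.0, inclusive=mode)
    for mode in ("both", "neither", "left", "right")
}
```

A value sitting exactly on the left endpoint matches only under `"both"` and `"left"`.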
evaluate(selector)
Resolve this reference to a concrete Series aligned to selector.ids.
Dispatches to the selector's private resolver for the attribute kind:
obs columns via :meth:Selector._obs_attribute, gene expression via
:meth:Selector._gene_attribute (honoring layer and raw).
isin(values)
Build a query matching entities whose value is in values.
isna()
Build a query matching entities with a missing value.
notin(values)
Build a query matching entities whose value is not in values.
notna()
Build a query matching entities with a non-missing value.
to_dict()
Return a JSON-ready description of this reference.
Always includes type (the kind) and name; layer and raw
are included only when set, so the serialized form stays minimal.
GaussianSampler
dataclass
Select ids whose numeric attribute value is near a target center.
Each candidate gets a Gaussian weight
exp(-0.5 * ((value - center) / std) ** 2): a value exactly at center scores 1.0 and the weight falls
off with distance. std controls the tolerance -- small std is a sharp
peak (only very close values matter), large std is broad. (The usual
1 / (std * sqrt(2*pi)) normalizing constant is omitted because it cancels
when sorting and when normalizing the sampling probabilities.)
Two modes
- Rank everything (n is None or >= the candidate count): return all candidates ordered by closeness to center (closest first).
- Weighted subsample (n smaller than the candidate count): draw n ids without replacement using the Gaussian weights as probabilities (seeded via seed), then re-order the draw closest-first.
In both modes the per-id weight is attached as the selection's score.
Use for "around this value" inspection -- e.g. cells near a particular QC score or expression level.
Parameters
attr
Numeric attribute to weight on.
center
Target value the sampler is biased toward.
std
Standard deviation of the Gaussian; must be positive. Larger = broader.
n
Number of ids to draw. None ranks all candidates by closeness.
seed
Random seed used only in the weighted-subsample mode.
Notes
In the subsample mode, an aggressively small std can drive many weights to
underflow to exactly 0. If fewer than n candidates retain a positive
weight, the underlying rng.choice(replace=False, p=...) will raise.
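To make the weighting and the underflow caveat concrete, the sketch below computes the unnormalized weights with the standard library only. The `qc_score` values and the `gaussian_weight` helper are made up for illustration, and only the rank-everything mode is reproduced; the weighted draw itself is left out.

```python
import math


def gaussian_weight(value, center, std):
    """Unnormalized Gaussian weight: 1.0 at the center, falling off with distance."""
    if std <= 0:
        raise ValueError("std must be positive")
    return math.exp(-0.5 * ((value - center) / std) ** 2)


# Hypothetical qc_score values keyed by cell id.
qc_score = {"c1": 0.80, "c2": 0.78, "c3": 0.90, "c4": 0.60}

# Rank-everything mode: highest weight (closest to the center) first.
weights = {cid: gaussian_weight(v, center=0.8, std=0.05) for cid, v in qc_score.items()}
ranked = sorted(weights, key=weights.get, reverse=True)

# A very small std makes weights far from the center underflow to exactly 0.0.
sharp = {cid: gaussian_weight(v, center=0.8, std=0.001) for cid, v in qc_score.items()}
positive = [cid for cid, w in sharp.items() if w > 0.0]
```

With `std=0.001` only two of the four cells keep a positive weight, so a weighted draw of three ids without replacement would have nothing left to pick for the third. Checking `len(positive)` against `n` before sampling is one way to catch this early.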
__post_init__()
Validate that n is non-negative and std is strictly positive.
apply(selector, candidate_ids)
Weight candidates by a Gaussian and either rank or weighted-sample them.
Resolves the attribute, drops non-numeric/missing values, computes the
Gaussian weights, and either returns all candidates ordered by closeness
(when n covers them all) or draws a seeded weighted subset that is
then re-ordered closest-first. Returns an empty result with a reason
when no numeric values are available.
to_dict()
Return a JSON-ready {type, attr, center, std, n, seed} description.
QuantileBinSampler
dataclass
Sample ids from a low/mid/high quantile bin for a numeric attribute.
The attribute's values define two cut points at quantiles q_low and
q_high, splitting entities into three bins. bin selects which one to
draw from:
- "low" -> values <= low_cut, ordered ascending;
- "mid" -> values strictly between the cuts, ordered by closeness to the median;
- "high" -> values >= high_cut, ordered descending.
The interior boundaries are half-open so the bins partition cleanly (a value sitting exactly on a cut lands in one bin only) -- this matters for tie-heavy data such as raw counts.
Useful for representative inspection: e.g. "show me high-expressing cells for this gene" while preserving a stable ranked order in the returned selection.
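The three bins and their orderings can be sketched in plain Python. This is an illustration under stated assumptions: `quantile` uses linear interpolation (the method the library uses is not stated here), the expression values are made up, and the tie handling for values that fall exactly on a cut when both cuts coincide is not reproduced.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an ascending list (an assumed method)."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def quantile_bins(values, q_low=1 / 3, q_high=2 / 3):
    """Split {id: value} into low/mid/high id lists, each in its ranked order."""
    ordered = sorted(values.values())
    low_cut, high_cut = quantile(ordered, q_low), quantile(ordered, q_high)
    median = quantile(ordered, 0.5)
    # low: ascending by value.
    low = sorted((i for i, v in values.items() if v <= low_cut), key=values.get)
    # high: descending by value.
    high = sorted(
        (i for i, v in values.items() if v >= high_cut), key=values.get, reverse=True
    )
    # mid: strictly between the cuts, closest to the median first.
    mid = sorted(
        (i for i, v in values.items() if low_cut < v < high_cut),
        key=lambda i: abs(values[i] - median),
    )
    return {"low": low, "mid": mid, "high": high}


# Hypothetical expression values for nine cells: cell_1 -> 1.0, ..., cell_9 -> 9.0.
expression = {f"cell_{k}": float(k) for k in range(1, 10)}
bins = quantile_bins(expression)
```

With the default thirds, the nine cells split three ways, and each bin comes back already ordered the way the selection would present it.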
Specifying the band
The band width can be given three ways (mutually exclusive forms of the same idea):
- q_low / q_high directly (default thirds: 1/3 and 2/3);
- proportion -- a fraction in (0, 1] giving the tail/center size;
- percentile -- the same as proportion but on a 0-100 scale.
With proportion/percentile the cut(s) are derived per bin: "low"
takes the bottom fraction, "high" the top fraction, and "mid" a
centered band of that width around the median.
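One way to read the per-bin derivation is as a small translation from a band width to quantile cut points. The helper below is hypothetical (`band_cuts` is not part of the documented API) and only encodes the three rules stated above.

```python
def band_cuts(bin_name, proportion=None, percentile=None):
    """Translate a band width into quantile cut(s) for one bin (hypothetical helper)."""
    if (proportion is None) == (percentile is None):
        raise ValueError("give exactly one of proportion or percentile")
    # percentile is the same idea as proportion on a 0-100 scale.
    fraction = proportion if proportion is not None else percentile / 100
    if not 0 < fraction <= 1:
        raise ValueError("band width must fall in (0, 1] as a fraction")
    if bin_name == "low":
        return {"q_low": fraction}  # bottom fraction
    if bin_name == "high":
        return {"q_high": 1 - fraction}  # top fraction
    if bin_name == "mid":
        # Centered band of the requested width around the median.
        return {"q_low": 0.5 - fraction / 2, "q_high": 0.5 + fraction / 2}
    raise ValueError(f"unknown bin: {bin_name!r}")


top_5_percent = band_cuts("high", percentile=5)
middle_half = band_cuts("mid", proportion=0.5)
```

Under this reading, `bin="high"` with `percentile=5` places the cut near the 0.95 quantile, and a `"mid"` band with `proportion=0.5` spans the interquartile range.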
Sampling vs ranking
If n is None or the bin has <= n members, the whole bin is
returned in ranked order. If the bin is larger than n, a seeded random
subset of size n is drawn and then re-sorted into the bin's natural order.
The per-id value is attached as the selection's score.
Parameters
attr
Numeric attribute to bin (an obs column or a gene).
bin
Which bin to draw from: "low", "mid", or "high".
n
Maximum number of ids to return. None returns the whole bin.
seed
Random seed used only when the bin is subsampled.
q_low, q_high
Lower/upper quantile cut points in [0, 1] with q_low <= q_high.
proportion
Alternative band specification as a fraction in (0, 1].
percentile
Alternative band specification as a percentage in (0, 100].
__post_init__()
Validate n, the bin name, the quantile order, and the band specs.
Enforces 0 <= q_low <= q_high <= 1, that proportion and
percentile are not both set, and that each falls in its valid range.
apply(selector, candidate_ids)
Bin the candidates by quantile, then rank or subsample the chosen bin.
Resolves the attribute over the candidates, drops non-numeric/missing
values, computes the two quantile cuts, selects the requested bin in its
natural order, and (if larger than n) draws a seeded random subset
that is then re-sorted. Returns an empty result with a reason when no
numeric values are available. Provenance records the cut points, bin
size, and seed.
to_dict()
Return a JSON-ready description, including the band specification.
Always includes the attribute, bin, n, seed, and the
q_low/q_high cuts; proportion/percentile are added only
when they were supplied.
Query
Base class for boolean query expressions.
A query is a small, immutable expression tree. Leaves are
:class:PredicateQuery nodes (one comparison on one attribute) and internal
nodes are :class:BooleanQuery combinators. Query objects are normally
created by comparing attributes from :meth:Selector.attr or
:meth:Selector.gene rather than constructed directly, and combined with the
Python boolean operators below::
(selector.attr("cluster") == "B cell") & (selector.gene("MS4A1") > 2)
Evaluation is deferred: a query holds no data until :meth:evaluate runs it
against a :class:Selector.
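The expression-tree idea can be illustrated with a toy version that evaluates against a list of dict rows instead of an AnnData object. `Node`, `Pred`, and `Bool` are hypothetical stand-ins for the base class, `PredicateQuery`, and `BooleanQuery`; the real classes return a pandas mask, while this sketch returns a plain list of booleans.

```python
from dataclasses import dataclass


class Node:
    """Minimal expression-tree base with the three boolean operators."""

    def __and__(self, other):
        return Bool("and", (self, other))

    def __or__(self, other):
        return Bool("or", (self, other))

    def __invert__(self):
        return Bool("not", (self,))


@dataclass(frozen=True)
class Pred(Node):
    """Leaf: one comparison on one column (hypothetical stand-in)."""

    column: str
    op: str
    value: object

    def evaluate(self, rows):
        test = {"==": lambda a, b: a == b, ">": lambda a, b: a > b}[self.op]
        return [test(row[self.column], self.value) for row in rows]


@dataclass(frozen=True)
class Bool(Node):
    """Internal node combining the masks of its children."""

    kind: str
    children: tuple

    def evaluate(self, rows):
        masks = [child.evaluate(rows) for child in self.children]
        if self.kind == "not":
            return [not flag for flag in masks[0]]
        combine = all if self.kind == "and" else any
        return [combine(flags) for flags in zip(*masks)]


rows = [
    {"cluster": "B cell", "MS4A1": 3.0},
    {"cluster": "B cell", "MS4A1": 0.5},
    {"cluster": "T cell", "MS4A1": 4.0},
]
query = Pred("cluster", "==", "B cell") & Pred("MS4A1", ">", 2)
mask = query.evaluate(rows)  # nothing is computed until this call
inverted = (~query).evaluate(rows)
```

When queries are written with operators on attributes, note that Python's `&` and `|` bind more tightly than `==` and `>`, so each comparison needs its own parentheses before it is combined.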
__and__(other)
Combine with other using logical AND (q1 & q2).
__invert__()
Negate this query using logical NOT (~q).
__or__(other)
Combine with other using logical OR (q1 | q2).
evaluate(selector)
Evaluate this query against selector and return a boolean mask.
The returned pd.Series is indexed by the selector's entity ids, with
True for entities that match. Subclasses implement the actual logic.
to_dict()
Return a JSON-ready representation of this query expression.
RandomSampler
dataclass
Randomly order or sample candidate ids.
With n=None this is a shuffle: it returns all candidates in random order.
With an n, it draws that many. Without replace the draw is a subset
(n is clamped to the number of candidates); with replace=True ids may
repeat and the result can be longer than the candidate set. A seed makes
the draw reproducible. This sampler attaches no scores.
Parameters
n
Number of ids to return. If None, all candidate ids are shuffled.
seed
Optional random seed for reproducible selections.
replace
Whether ids may be sampled more than once.
__post_init__()
Validate that n is non-negative (or None).
apply(selector, candidate_ids)
Draw (or shuffle) ids with a seeded NumPy generator.
Returns an empty result when there are no candidates or n == 0.
Without replacement, uses a permutation truncated to min(n, len);
with replacement, uses rng.choice so ids may repeat. selector is
unused (the draw needs no attribute values).
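The draw logic reads naturally as a shuffle-then-truncate. The sketch below mirrors it with the standard-library `random` module standing in for the NumPy generator, so the actual ids picked for a given seed will differ from the library's. `random_sample` is a hypothetical name, and the behaviour for `replace=True` with `n=None` is an assumption.

```python
import random


def random_sample(candidate_ids, n=None, seed=None, replace=False):
    """Shuffle or sample ids reproducibly (illustrative, stdlib-only)."""
    ids = list(candidate_ids)
    if n is not None and n < 0:
        raise ValueError("n must be non-negative")
    if not ids or n == 0:
        return []
    rng = random.Random(seed)
    if replace:
        # With replacement: ids may repeat and n may exceed len(ids).
        # Drawing len(ids) items when n is None is an assumption of this sketch.
        size = len(ids) if n is None else n
        return rng.choices(ids, k=size)
    # Without replacement: permute everything, then truncate to min(n, len(ids)).
    shuffled = rng.sample(ids, len(ids))
    return shuffled if n is None else shuffled[: min(n, len(ids))]


ids = ["a", "b", "c", "d", "e"]
subset = random_sample(ids, n=3, seed=1)
shuffle = random_sample(ids, seed=1)
clamped = random_sample(ids, n=99, seed=1)
oversized = random_sample(ids, n=8, seed=1, replace=True)
```

Clamping rather than raising means a request for more ids than exist quietly returns the whole candidate set in random order.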
RankSampler
dataclass
Deterministically return the highest- or lowest-valued ids for an attribute.
Sorts the candidates by a numeric attribute and takes the top (by="high")
or bottom (by="low") n. Unlike the random/Gaussian samplers this is
fully deterministic (no seed): a stable mergesort is used so ties keep their
original relative order. The per-id value is attached as the score.
Good for "top markers" style inspection -- e.g. the highest-expressing cells for a gene, or the lowest-QC cells.
Parameters
attr
Numeric attribute to rank by.
n
Number of ids to return. None returns all candidates, ranked.
by
"high" for descending (largest first), "low" for ascending.
__post_init__()
Validate that n is non-negative and by is "high" or "low".
apply(selector, candidate_ids)
Sort candidates by value and take the top/bottom n.
Resolves the attribute, drops non-numeric/missing values, sorts in the
requested direction, and truncates to n. Returns an empty result with
a reason when no numeric values are available.
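The stable-sort behaviour is the part most worth seeing on tied data. The following sketch uses Python's built-in `sorted`, which is stable in both directions, as a stand-in for the mergesort mentioned above. `rank_ids` and the sample values are hypothetical, and whether the library orders ties identically in the descending case is an assumption.

```python
import math


def rank_ids(values, n=None, by="high"):
    """Order ids by value, dropping missing/non-numeric entries (illustrative)."""
    if by not in {"high", "low"}:
        raise ValueError('by must be "high" or "low"')
    numeric = {
        i: v
        for i, v in values.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool) and not math.isnan(v)
    }
    # sorted() is stable, and stays stable with reverse=True, so tied ids
    # keep their original relative order in both directions.
    ordered = sorted(numeric, key=numeric.get, reverse=(by == "high"))
    return ordered if n is None else ordered[:n]


# Hypothetical per-cell values with a tie (c2, c4) and two unusable entries.
values = {"c1": 2.0, "c2": 5.0, "c3": None, "c4": 5.0, "c5": float("nan"), "c6": 1.0}
top = rank_ids(values, n=3, by="high")
bottom = rank_ids(values, n=2, by="low")
```

The tied cells `c2` and `c4` come back in their input order, which is what makes repeated runs return identical selections without a seed.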
to_dict()
Return a JSON-ready {type, attr, n, by} description.
Sampler
Bases: Protocol
Structural protocol implemented by every sampler/ranker.
A sampler turns a candidate pd.Index into an ordered subset. Anything
with the two methods below satisfies the protocol, so :meth:Selector.select
accepts the built-in samplers (and any duck-typed equivalent) uniformly.
apply(selector, candidate_ids)
Choose and order ids from candidate_ids, returning a SamplingResult.
Receives the bound selector (so attribute-based samplers can resolve
their values) and the already-narrowed candidate index, so the sampler
only ever does work proportional to the candidate set, not the full
AnnData.
to_dict()
Return a JSON-ready description of the sampler and its parameters.
Selection
dataclass
Ordered ids selected from a :class:Selector.
Selection is the object returned by :meth:Selector.select. It stores
stable ordered ids plus the query, sampler, scores, and provenance used to
create that order. It is intentionally list-like, so it can be iterated,
indexed, and passed to consumers such as :class:celldega.viz.Yearbook.
Attributes
ids
Ordered selected entity ids. For AnnData objects these are usually
names from adata.obs_names.
query
JSON-ready query representation, or None when no query was used.
sampler
JSON-ready sampler representation, or None when ids were returned in
source order.
candidate_count
Number of entities matching the query before sampling.
selected_count
Number of ids returned in :attr:ids.
provenance
Execution metadata, including source AnnData shape and sampler details.
scores
Optional score values keyed by selected id. Ranking samplers, such as
quantile-bin gene selection, may populate this.
__getitem__(index)
Index or slice the ordered ids (e.g. selection[0], selection[:5]).
__iter__()
Iterate over the ordered ids, so a Selection works like a list.
__len__()
Return the number of selected ids.
names()
Return selected entity names in stable result order.
This is the most direct way to pass a selection to code that expects a plain list of ids.
page(page, per_page)
Return one zero-based page of ids.
Slices the ordered ids into fixed-size pages, e.g. for paginating a
portrait grid. page(0, 24) returns the first 24 ids, page(1, 24)
the next 24, and so on. A page past the end returns an empty list.
Parameters
page
Zero-based page index. Must be non-negative.
per_page
Number of ids per page. Must be positive.
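The paging rule amounts to a slice over the ordered ids. A minimal sketch follows, with a hypothetical `page_of` function; the use of `ValueError` for invalid arguments is an assumption, since the error type is not stated above.

```python
def page_of(ids, page, per_page):
    """Return one zero-based page of ids as a list (illustrative)."""
    if page < 0:
        raise ValueError("page must be non-negative")  # error type assumed
    if per_page <= 0:
        raise ValueError("per_page must be positive")  # error type assumed
    start = page * per_page
    # Slicing past the end of a list yields an empty list rather than raising.
    return list(ids[start : start + per_page])


ids = [f"cell_{i}" for i in range(50)]
first_page = page_of(ids, 0, 24)
last_page = page_of(ids, 2, 24)
past_the_end = page_of(ids, 3, 24)
```

With 50 ids and 24 per page, page 2 holds the final two ids and page 3 is empty.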
to_dataframe()
Return selected ids as a ranking DataFrame.
The returned frame always contains id and zero-based rank
columns, with rows in result order. If the sampler produced scores, a
score column is included (aligned by id; entries default to None
for any id without a score). Handy for inspecting a selection in a
notebook or joining it back onto other per-entity tables.
to_dict()
Return a JSON-ready representation of the selection.
This is an alias for :meth:to_json.
to_frame()
Return selected ids as a ranking DataFrame.
This is an alias for :meth:to_dataframe.
to_json()
Return a JSON-ready object including query, sampler, and provenance.
The payload contains ids, the serialized query and sampler
(each None when unused), candidate_count, selected_count,
JSON-coerced provenance, and scores when the sampler produced
them. This is exactly what :class:celldega.viz.Yearbook stores to
record how a portrait set was chosen.
Selector
Query and selection interface for an AnnData object.
Selector is the public object for building and executing Celldega
selections. A selector is bound to one AnnData object, so every query is
validated and evaluated against that object's obs, var_names, X,
and optional layers. To work with multiple AnnData objects, instantiate one
selector per object.
Parameters
adata
AnnData-like object. The first implementation selects over adata.obs
rows and can resolve metadata attributes from obs plus gene
expression vectors from X or a named layer.
default_preview_n
Maximum number of ids to return when no sampler is provided and the
candidate set is larger than this value. Set to None to disable the
preview guard.
default_preview_seed
Seed used for the deterministic random preview when default_preview_n
is triggered.
Examples
selector = dega.select.Selector(adata)
query = (
    (selector.attr("cluster") == "B cell")
    & selector.attr("sample_id").isin(["S1", "S2"])
)
selection = selector.select(
    query=query,
    sampler=selector.samplers.quantile_bin(
        attr=selector.gene("MS4A1"),
        bin="high",
        n=24,
        seed=1,
    ),
)
selection.names()
ids
property
Entity ids available to this selector.
attr(name)
Reference a per-entity metadata attribute from adata.obs.
The attribute name is validated when a query or sampler using it is
executed. Missing columns raise KeyError.
gene(name, *, layer=None, raw=False)
Reference a gene expression vector by gene name.
Parameters
name
Gene name in adata.var_names.
layer
Optional layer name to use instead of adata.X.
raw
If True, use adata.raw.
Notes
Gene and layer names are validated when the attribute is evaluated.
Missing genes or layers raise KeyError.
select(query=None, sampler=None)
Evaluate a query and optionally sample/rank the matching ids.
Parameters
query
Boolean query expression built from :meth:attr, :meth:gene, and
boolean operators. If omitted, every AnnData observation is a
candidate.
sampler
How to order/subset the candidates. Accepts several forms, dispatched
in this order:
- ``None`` -- return candidates in source order, unless the candidate
set exceeds ``default_preview_n``, in which case a deterministic
random preview is returned with a warning (the guard against
accidentally materializing a huge selection);
- a ``bool`` -- rejected with a clear error (guards against
``sampler=True`` being silently treated as the integer ``1``);
- an ``int`` -- shorthand for a deterministic random sample of that
many ids, seeded with ``default_preview_seed``;
- ``"all"`` -- explicitly return every matching id, no preview guard;
- a sampler from ``selector.samplers`` -- applied to the candidates.
Returns
Selection
Stable ordered selected ids plus JSON-ready query, sampler, scores, and provenance (source shape, candidate/selected counts, and the sampler's own provenance).
Notes
The query is evaluated first to narrow the candidate set, and the sampler only ever sees that narrowed index -- so expensive samplers do work proportional to the matches, not the whole AnnData.
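The dispatch order for `sampler` matters mainly because of the `bool` case. The sketch below classifies an argument in the documented order. `describe_sampler`, the returned strings, and the choice of `TypeError` are all hypothetical; the library's actual error type and internals are not stated here.

```python
def describe_sampler(sampler):
    """Classify a `sampler` argument in the documented dispatch order (sketch)."""
    if sampler is None:
        return "source order (with preview guard)"
    # bool must be checked before int: isinstance(True, int) is True in Python.
    if isinstance(sampler, bool):
        raise TypeError("sampler must not be a bool")  # error type is an assumption
    if isinstance(sampler, int):
        return f"seeded random sample of {sampler}"
    if isinstance(sampler, str) and sampler == "all":
        return "every matching id"
    # Anything exposing the two protocol methods is treated as a sampler.
    if hasattr(sampler, "apply") and hasattr(sampler, "to_dict"):
        return "sampler object"
    raise TypeError(f"unsupported sampler: {sampler!r}")


class FirstThree:
    """Duck-typed object with the two protocol methods (return type simplified)."""

    def apply(self, selector, candidate_ids):
        return list(candidate_ids)[:3]

    def to_dict(self):
        return {"type": "first_three"}


try:
    describe_sampler(True)
except TypeError:
    bool_rejected = True
else:
    bool_rejected = False
```

Without the explicit `bool` check, `sampler=True` would fall through to the integer branch and behave like a request for one random id.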
StratifiedSampler
dataclass
Draw a balanced sample across the categories of a categorical attribute.
Provide exactly one of two quota modes:
- n_per_category -- take up to this many ids from each category (capped by how many that category actually has);
- n -- a total quota distributed as evenly as possible across categories, via round-robin allocation (categories that run out are skipped, so larger categories absorb the remainder).
Within each category the ids are drawn at random (seeded via seed). The
result is grouped by category in the order the categories are processed
(i.e. not interleaved across categories). No scores are attached.
Good for balanced inspection -- e.g. an equal number of cells per cluster.
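The two quota modes differ only in how the per-category counts are computed. The sketch below shows both as plain functions over a hypothetical `{category: available_count}` mapping. The function names are illustrative, and the order in which the library visits categories during the round-robin is an assumption (here, dict order).

```python
def allocate_per_category(available, n_per_category):
    """Fixed quota per category, capped by what each category actually has."""
    return {cat: min(count, n_per_category) for cat, count in available.items()}


def allocate_round_robin(available, total):
    """Hand out a total quota one id at a time, skipping exhausted categories."""
    quotas = {cat: 0 for cat in available}
    remaining = total
    while remaining > 0:
        progressed = False
        for cat, capacity in available.items():
            if remaining == 0:
                break
            if quotas[cat] < capacity:
                quotas[cat] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # every category is exhausted before the quota is met
    return quotas


# Hypothetical cluster sizes: B is much smaller than A and C.
available = {"A": 50, "B": 3, "C": 50}
fixed = allocate_per_category(available, n_per_category=10)
spread = allocate_round_robin(available, total=100)
```

In the round-robin result the small category contributes everything it has, and the two large categories split the rest to within one id of each other.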
Parameters
attr
Categorical attribute to stratify on.
n_per_category
Per-category quota. Mutually exclusive with n.
n
Total quota spread evenly across categories. Mutually exclusive with
n_per_category.
seed
Random seed for the within-category draws.
categories
Optional explicit category order/subset. When None, categories are
discovered from the data in first-seen order.
__post_init__()
Validate the quotas: non-negative, and exactly one of n / n_per_category.
apply(selector, candidate_ids)
Allocate per-category quotas, then draw that many ids from each category.
Resolves the attribute and drops missing values, groups candidates by
category, computes each category's quota (fixed per-category, or
round-robin for a total n), draws ids at random within each, and
concatenates the groups. Provenance records per-stratum availability and
sampled counts plus the quota mode.
to_dict()
Return a JSON-ready description; includes whichever quota and categories were set.