import polars as pl
import requests
from broad_babel.query import get_mapper

Downloading data from 'https://zenodo.org/records/12211976/files/babel.db' to file '/home/runner/.cache/pooch/2eaa6a2f4915f72d7100683f53982ed8-babel.db'.

A very common task when processing morphological profiles is knowing which ones are treatments and which ones are controls. Here we will explore how we can use broad-babel to accomplish this task.
We will be using the CRISPR dataset specified in our JSON index file.
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/v0.11.0/manifests/profile_index.json"
response = requests.get(INDEX_FILE)
profile_index = response.json()
CRISPR_URL = (
    pl.DataFrame(profile_index).filter(pl.col("subset") == "crispr").item(0, "url")
)
profiles = pl.scan_parquet(CRISPR_URL)
print(profiles.collect_schema().names()[:6])

['Metadata_Source', 'Metadata_Plate', 'Metadata_Well', 'Metadata_JCP2022', 'X_1', 'X_2']

The processed profiles carry only minimal metadata: the profile origin (source, plate and well) and the unique JUMP identifier for the perturbation. We will use broad-babel to expand on this metadata, but for simplicity’s sake let us first sample a subset of the data.
jcp_ids = (
    profiles.select(pl.col("Metadata_JCP2022")).unique().collect().to_series().sort()
)
subsample = jcp_ids.sample(10, seed=42)
# Add a well-known control
subsample = (*subsample, "JCP2022_800002")
subsample

('JCP2022_801270',
 'JCP2022_802425',
 'JCP2022_802356',
 'JCP2022_805808',
 'JCP2022_804300',
 'JCP2022_801205',
 'JCP2022_802539',
 'JCP2022_803663',
 'JCP2022_800116',
 'JCP2022_801847',
 'JCP2022_800002')

We will use these JUMP IDs to obtain a mapper that indicates the perturbation type (trt, negcon or, rarely, poscon).
pert_mapper = get_mapper(
    subsample, input_column="JCP2022", output_columns="JCP2022,pert_type"
)
pert_mapper

{'JCP2022_801270': 'trt',
 'JCP2022_800116': 'trt',
 'JCP2022_801847': 'trt',
 'JCP2022_802425': 'trt',
 'JCP2022_804300': 'trt',
 'JCP2022_803663': 'trt',
 'JCP2022_802356': 'trt',
 'JCP2022_802539': 'trt',
 'JCP2022_800002': 'negcon',
 'JCP2022_801205': 'trt',
 'JCP2022_805808': 'trt'}

A couple of important notes about broad_babel’s ‘get_mapper’ and related functions:

- They must be given tuples: calls are cached, which provides significant speed-ups for repeated calls.
- ‘get_mapper’ works for inputs of up to a few tens of thousands of samples. If you try to use it to build a mapper for the entirety of the ‘compounds’ dataset, it is likely to fail. For these cases we suggest the more general function ‘run_query’. You can read more on this and other use cases in Babel’s README.
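The tuple requirement is typical of caches keyed on function arguments: the arguments must be hashable, and a tuple is while a list is not. The sketch below illustrates this with Python's standard `functools.lru_cache`. The function `toy_lookup` is a hypothetical stand-in, and whether broad_babel uses this exact mechanism internally is an assumption.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def toy_lookup(ids: tuple) -> dict:
    """Hypothetical stand-in for a cached mapper (not broad_babel's code)."""
    return {i: i.lower() for i in ids}


ids = ("JCP2022_800002", "JCP2022_801270")

first = toy_lookup(ids)   # computed
second = toy_lookup(ids)  # served from the cache, no recomputation
hits = toy_lookup.cache_info().hits

# A list is unhashable, so it cannot be used as a cache key.
try:
    toy_lookup(list(ids))
    list_ok = True
except TypeError:
    list_ok = False

print(hits, list_ok)
```

Converting an ID collection with `tuple(...)` before the call is therefore enough to make it usable with a cache of this kind.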
We will now repeat the process to get their ‘standard’ names.
name_mapper = get_mapper(
    subsample,  # already a tuple that includes the control JCP2022_800002
    input_column="JCP2022",
    output_columns="JCP2022,standard_key",
)
name_mapper

{'JCP2022_800002': 'non-targeting',
 'JCP2022_801847': 'DMRT2',
 'JCP2022_802356': 'FDX1',
 'JCP2022_801205': 'CDK20',
 'JCP2022_801270': 'CFB',
 'JCP2022_802539': 'G6PC',
 'JCP2022_804300': 'MSX1',
 'JCP2022_805808': 'RAD51B',
 'JCP2022_802425': 'FLNC',
 'JCP2022_803663': 'KIF16B',
 'JCP2022_800116': 'ACOT11'}

To wrap up, we will fetch all the available profiles for these perturbations and use the mappers to add the missing metadata. We also select a few features to showcase how column selection can be performed in polars.
subsample_profiles = profiles.filter(
    pl.col("Metadata_JCP2022").is_in(subsample)
).collect()
profiles_with_meta = subsample_profiles.with_columns(
    pl.col("Metadata_JCP2022").replace(pert_mapper).alias("pert_type"),
    pl.col("Metadata_JCP2022").replace(name_mapper).alias("name"),
)
profiles_with_meta.select(
    pl.col(("name", "pert_type", "^Metadata.*$", "^X_[0-3]$"))
).sort(by="pert_type")

| name | pert_type | Metadata_Source | Metadata_Plate | Metadata_Well | Metadata_JCP2022 | X_1 | X_2 | X_3 | 
|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | f32 | f32 | f32 | 
| "non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "A02" | "JCP2022_800002" | -0.223417 | -0.049487 | -0.826231 | 
| "non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "L23" | "JCP2022_800002" | -0.079349 | -0.016958 | -0.277558 | 
| "non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "I23" | "JCP2022_800002" | -0.023832 | -0.00537 | -0.29832 | 
| "non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "J02" | "JCP2022_800002" | -0.169491 | -0.023422 | -0.088187 | 
| "non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "O23" | "JCP2022_800002" | -0.295112 | -0.124241 | 1.055193 | 
| … | … | … | … | … | … | … | … | … | 
| "CFB" | "trt" | "source_13" | "CP-CC9-R5-19" | "G12" | "JCP2022_801270" | 0.023277 | 0.021747 | -0.914867 | 
| "DMRT2" | "trt" | "source_13" | "CP-CC9-R5-21" | "C10" | "JCP2022_801847" | -0.208865 | -0.046884 | 0.845933 | 
| "MSX1" | "trt" | "source_13" | "CP-CC9-R5-24" | "C03" | "JCP2022_804300" | -0.003333 | 0.053181 | 0.078227 | 
| "CFB" | "trt" | "source_13" | "CP-CC9-R6-19" | "G12" | "JCP2022_801270" | 0.18701 | 0.072405 | 0.381239 | 
| "CFB" | "trt" | "source_13" | "CP-CC9-R7-19" | "G12" | "JCP2022_801270" | 0.051818 | -0.129935 | -0.481606 |