import polars as pl
from broad_babel.query import get_mapper
Incorporate metadata into profiles
A very common task when processing morphological profiles is knowing which ones are treatments and which ones are controls. Here we will explore how we can use broad-babel to accomplish this task.
We will be using the CRISPR dataset specified in our index CSV.
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"
# Look up the URL of the CRISPR profiles in the index
CRISPR_URL = pl.read_csv(INDEX_FILE).filter(pl.col("subset") == "crispr").item(0, "url")
# Scan lazily so that only the columns and rows we need are loaded
profiles = pl.scan_parquet(CRISPR_URL)
print(profiles.collect_schema().names()[:6])
['Metadata_Source', 'Metadata_Plate', 'Metadata_Well', 'Metadata_JCP2022', 'X_1', 'X_2']
For simplicity, the contents of our processed profiles are minimal: the profile origin (source, plate and well) and the unique JUMP identifier for that perturbation. We will use broad-babel to expand on this metadata but, to keep things simple, let us first sample a subset of the data.
jcp_ids = (
    profiles.select(pl.col("Metadata_JCP2022")).unique().collect().to_series().sort()
)
subsample = jcp_ids.sample(10, seed=42)
# Add a well-known control
subsample = (*subsample, "JCP2022_800002")
subsample
('JCP2022_801270',
'JCP2022_802425',
'JCP2022_802356',
'JCP2022_805808',
'JCP2022_804300',
'JCP2022_801205',
'JCP2022_802539',
'JCP2022_803663',
'JCP2022_800116',
'JCP2022_801847',
'JCP2022_800002')
We will use these JUMP IDs to obtain a mapper that indicates the perturbation type (trt, negcon or, rarely, poscon):
pert_mapper = get_mapper(
    subsample, input_column="JCP2022", output_columns="JCP2022,pert_type"
)
pert_mapper
{'JCP2022_800116': 'trt',
'JCP2022_805808': 'trt',
'JCP2022_802425': 'trt',
'JCP2022_803663': 'trt',
'JCP2022_802539': 'trt',
'JCP2022_801847': 'trt',
'JCP2022_800002': 'negcon',
'JCP2022_804300': 'trt',
'JCP2022_802356': 'trt',
'JCP2022_801270': 'trt',
'JCP2022_801205': 'trt'}
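Because the mapper is a plain Python dictionary, a quick sanity check is to tally the perturbation types it contains. The sketch below copies the output shown above into a literal instead of calling broad-babel, so it runs offline; `pert_mapper_example` is a name introduced here for illustration only.

```python
from collections import Counter

# Literal copy of the mapper printed above (no network or database access).
pert_mapper_example = {
    "JCP2022_800116": "trt",
    "JCP2022_805808": "trt",
    "JCP2022_802425": "trt",
    "JCP2022_803663": "trt",
    "JCP2022_802539": "trt",
    "JCP2022_801847": "trt",
    "JCP2022_800002": "negcon",
    "JCP2022_804300": "trt",
    "JCP2022_802356": "trt",
    "JCP2022_801270": "trt",
    "JCP2022_801205": "trt",
}

# Tally how many perturbations fall under each type.
type_counts = Counter(pert_mapper_example.values())
print(type_counts)

# Collect the IDs flagged as negative controls.
negcon_ids = [jcp for jcp, kind in pert_mapper_example.items() if kind == "negcon"]
print(negcon_ids)
```

A check like this makes it easy to confirm that the control added by hand was recognised as a negative control before the mapper is applied to the profiles.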
A couple of important notes about broad_babel’s ‘get_mapper’ and other functions:
- These must be fed tuples, because calls are cached, which provides significant speed-ups for repeated calls.
- ‘get_mapper’ works for datasets of up to a few tens of thousands of samples. If you try to use it to get a mapper for the entirety of the ‘compounds’ dataset, it is likely to fail. For these cases we suggest the more general function ‘run_query’. You can read more on this and other use cases in Babel’s README.
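The tuple requirement is consistent with how memoization generally works in Python: cached arguments must be hashable, and lists are not. The sketch below illustrates this with `functools.lru_cache` and a hypothetical `toy_mapper` function. It is an assumption that broad-babel's caching behaves in a similar way; this is not a description of its implementation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def toy_mapper(ids: tuple) -> dict:
    """Hypothetical stand-in for a cached lookup; not broad-babel's code."""
    # Map each ID to its numeric suffix, purely for illustration.
    return {jcp: jcp.split("_")[1] for jcp in ids}


ids = ["JCP2022_800002", "JCP2022_801270"]

# A tuple is hashable, so the second identical call is served from the cache.
first = toy_mapper(tuple(ids))
second = toy_mapper(tuple(ids))
hits = toy_mapper.cache_info().hits

# A list is unhashable, so the cached function rejects it with a TypeError.
try:
    toy_mapper(ids)
    list_accepted = True
except TypeError:
    list_accepted = False

print(hits, list_accepted)
```

If your IDs live in a list or a polars Series, converting them with `tuple(...)` before the call is enough.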
We will now repeat the process to get their ‘standard’ names:
name_mapper = get_mapper(
    (*subsample, "JCP2022_800002"),
    input_column="JCP2022",
    output_columns="JCP2022,standard_key",
)
name_mapper
{'JCP2022_803663': 'KIF16B',
'JCP2022_804300': 'MSX1',
'JCP2022_800002': 'non-targeting',
'JCP2022_805808': 'RAD51B',
'JCP2022_801847': 'DMRT2',
'JCP2022_802425': 'FLNC',
'JCP2022_802539': 'G6PC',
'JCP2022_801205': 'CDK20',
'JCP2022_800116': 'ACOT11',
'JCP2022_801270': 'CFB',
'JCP2022_802356': 'FDX1'}
To wrap up, we will fetch all the available profiles for these perturbations and use the mappers to add the missing metadata. We also select a few features to showcase how column selection can be performed in polars.
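# Keep only the profiles of the sampled perturbations and load them into memory
subsample_profiles = profiles.filter(
    pl.col("Metadata_JCP2022").is_in(subsample)
).collect()
# Derive the new metadata columns from the JUMP IDs using the two mappers
profiles_with_meta = subsample_profiles.with_columns(
    pl.col("Metadata_JCP2022").replace(pert_mapper).alias("pert_type"),
    pl.col("Metadata_JCP2022").replace(name_mapper).alias("name"),
)
# Strings wrapped in ^...$ are treated as regular expressions over column names
profiles_with_meta.select(
    pl.col(("name", "pert_type", "^Metadata.*$", "^X_[0-3]$"))
).sort(by="pert_type")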
name | pert_type | Metadata_Source | Metadata_Plate | Metadata_Well | Metadata_JCP2022 | X_1 | X_2 | X_3 |
---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | f32 | f32 | f32 |
"non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "A02" | "JCP2022_800002" | -0.223417 | -0.049487 | -0.826231 |
"non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "L23" | "JCP2022_800002" | -0.079349 | -0.016958 | -0.277558 |
"non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "I23" | "JCP2022_800002" | -0.023832 | -0.00537 | -0.29832 |
"non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "J02" | "JCP2022_800002" | -0.169491 | -0.023422 | -0.088187 |
"non-targeting" | "negcon" | "source_13" | "CP-CC9-R1-01" | "O23" | "JCP2022_800002" | -0.295112 | -0.124241 | 1.055193 |
… | … | … | … | … | … | … | … | … |
"CFB" | "trt" | "source_13" | "CP-CC9-R5-19" | "G12" | "JCP2022_801270" | 0.023277 | 0.021747 | -0.914867 |
"DMRT2" | "trt" | "source_13" | "CP-CC9-R5-21" | "C10" | "JCP2022_801847" | -0.208865 | -0.046884 | 0.845933 |
"MSX1" | "trt" | "source_13" | "CP-CC9-R5-24" | "C03" | "JCP2022_804300" | -0.003333 | 0.053181 | 0.078227 |
"CFB" | "trt" | "source_13" | "CP-CC9-R6-19" | "G12" | "JCP2022_801270" | 0.18701 | 0.072405 | 0.381239 |
"CFB" | "trt" | "source_13" | "CP-CC9-R7-19" | "G12" | "JCP2022_801270" | 0.051818 | -0.129935 | -0.481606 |
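
To see why the selection above returns only `X_1`, `X_2` and `X_3` among the features, it helps to test the two patterns against a list of column names. The sketch below uses Python's `re` module as a stand-in for the regex engine used by polars, which is a different implementation, although patterns this simple behave the same in both. The column names `X_4` and `X_10` are made up for illustration.

```python
import re

# Column names modelled on the profiles above, plus two hypothetical extras.
columns = [
    "Metadata_Source",
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_JCP2022",
    "X_1",
    "X_2",
    "X_3",
    "X_4",
    "X_10",
]

# The two regular expressions used in the selection.
patterns = ["^Metadata.*$", "^X_[0-3]$"]

# Keep every column matched by at least one pattern.
selected = [c for c in columns if any(re.search(p, c) for p in patterns)]
print(selected)
```

The anchors matter: `^X_[0-3]$` accepts exactly one digit between 0 and 3 after the prefix, so `X_10` is excluded even though it starts with `X_1`.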