Incorporate metadata into profiles

A very common task when processing morphological profiles is knowing which ones are treatments and which ones are controls. Here we will explore how we can use broad-babel to accomplish this task.

import polars as pl
from broad_babel.query import get_mapper

We will be using the CRISPR dataset specified in our index CSV.

INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"
CRISPR_URL = pl.read_csv(INDEX_FILE).filter(pl.col("subset") == "crispr").item(0, "url")
profiles = pl.scan_parquet(CRISPR_URL)
print(profiles.collect_schema().names()[:6])
['Metadata_Source', 'Metadata_Plate', 'Metadata_Well', 'Metadata_JCP2022', 'X_1', 'X_2']

For simplicity, the contents of our processed profiles are minimal: the profile origin (source, plate and well) and the unique JUMP identifier for that perturbation. We will use broad-babel to expand on this metadata but, to keep things quick, let us first sample a subset of the data.

jcp_ids = (
    profiles.select(pl.col("Metadata_JCP2022")).unique().collect().to_series().sort()
)
subsample = jcp_ids.sample(10, seed=42)
# Add a well-known control
subsample = (*subsample, "JCP2022_800002")
subsample
('JCP2022_801270',
 'JCP2022_802425',
 'JCP2022_802356',
 'JCP2022_805808',
 'JCP2022_804300',
 'JCP2022_801205',
 'JCP2022_802539',
 'JCP2022_803663',
 'JCP2022_800116',
 'JCP2022_801847',
 'JCP2022_800002')

We will use these JUMP ids to obtain a mapper that indicates the perturbation type (trt, negcon or, rarely, poscon).

pert_mapper = get_mapper(
    subsample, input_column="JCP2022", output_columns="JCP2022,pert_type"
)
pert_mapper
{'JCP2022_800116': 'trt',
 'JCP2022_805808': 'trt',
 'JCP2022_802425': 'trt',
 'JCP2022_803663': 'trt',
 'JCP2022_802539': 'trt',
 'JCP2022_801847': 'trt',
 'JCP2022_800002': 'negcon',
 'JCP2022_804300': 'trt',
 'JCP2022_802356': 'trt',
 'JCP2022_801270': 'trt',
 'JCP2022_801205': 'trt'}
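
Since the mapper is a plain Python dictionary, it can be inspected with standard tools. As a small sketch, the snippet below copies the output above into a literal (so it runs without querying broad-babel) and tallies how many perturbations fall under each type, then lists the ids flagged as negative controls.

```python
from collections import Counter

# Copied from the output above so the example runs offline.
pert_mapper = {
    "JCP2022_800116": "trt",
    "JCP2022_805808": "trt",
    "JCP2022_802425": "trt",
    "JCP2022_803663": "trt",
    "JCP2022_802539": "trt",
    "JCP2022_801847": "trt",
    "JCP2022_800002": "negcon",
    "JCP2022_804300": "trt",
    "JCP2022_802356": "trt",
    "JCP2022_801270": "trt",
    "JCP2022_801205": "trt",
}

# Tally how many perturbations belong to each type.
type_counts = Counter(pert_mapper.values())
print(type_counts)

# Collect the ids that are flagged as negative controls.
negcons = sorted(jcp for jcp, kind in pert_mapper.items() if kind == "negcon")
print(negcons)
```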

A couple of important notes about broad_babel’s ‘get_mapper’ and other functions:

- Inputs must be passed as tuples, because calls are cached; this provides significant speed-ups for repeated calls.
- ‘get_mapper’ works for datasets of up to a few tens of thousands of samples. If you try to use it to get a mapper for the entirety of the ‘compounds’ dataset, it is likely to fail. For these cases we suggest the more general function ‘run_query’.

You can read more on this and other use cases in Babel’s readme.
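
To illustrate why tuples are required, here is a minimal sketch that assumes the caching behaves like Python's `functools.lru_cache`, which stores results keyed by the call arguments and therefore needs them to be hashable. `cached_lookup` is a hypothetical stand-in, not part of broad-babel, and the library's actual caching mechanism may differ.

```python
from functools import lru_cache

calls = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def cached_lookup(ids):
    """Hypothetical stand-in for a cached metadata query (not broad-babel code)."""
    global calls
    calls += 1
    return {jcp: "unknown" for jcp in ids}


ids = ("JCP2022_800002", "JCP2022_801270")

cached_lookup(ids)
cached_lookup(ids)  # same tuple: the result is served from the cache
print(calls)

# A list is not hashable, so it cannot be used as a cache key.
try:
    cached_lookup(list(ids))
    list_accepted = True
except TypeError:
    list_accepted = False
print(list_accepted)
```

Under this assumption, converting any list or Series of ids with `tuple(...)` before calling a cached function avoids the error.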

We will now repeat the process to get their ‘standard’ names.

name_mapper = get_mapper(
    subsample,  # already a tuple that includes the control "JCP2022_800002"
    input_column="JCP2022",
    output_columns="JCP2022,standard_key",
)
name_mapper
{'JCP2022_803663': 'KIF16B',
 'JCP2022_804300': 'MSX1',
 'JCP2022_800002': 'non-targeting',
 'JCP2022_805808': 'RAD51B',
 'JCP2022_801847': 'DMRT2',
 'JCP2022_802425': 'FLNC',
 'JCP2022_802539': 'G6PC',
 'JCP2022_801205': 'CDK20',
 'JCP2022_800116': 'ACOT11',
 'JCP2022_801270': 'CFB',
 'JCP2022_802356': 'FDX1'}

To wrap up, we will fetch all the available profiles for these perturbations and use the mappers to add the missing metadata. We also select a few features to showcase how column selection can be performed in polars.

subsample_profiles = profiles.filter(
    pl.col("Metadata_JCP2022").is_in(subsample)
).collect()
profiles_with_meta = subsample_profiles.with_columns(
    pl.col("Metadata_JCP2022").replace(pert_mapper).alias("pert_type"),
    pl.col("Metadata_JCP2022").replace(name_mapper).alias("name"),
)
profiles_with_meta.select(
    pl.col(("name", "pert_type", "^Metadata.*$", "^X_[0-3]$"))
).sort(by="pert_type")
shape: (2_806, 9)
name             pert_type  Metadata_Source  Metadata_Plate  Metadata_Well  Metadata_JCP2022  X_1        X_2        X_3
str              str        str              str             str            str               f32        f32        f32
"non-targeting"  "negcon"   "source_13"      "CP-CC9-R1-01"  "A02"          "JCP2022_800002"  -0.223417  -0.049487  -0.826231
"non-targeting"  "negcon"   "source_13"      "CP-CC9-R1-01"  "L23"          "JCP2022_800002"  -0.079349  -0.016958  -0.277558
"non-targeting"  "negcon"   "source_13"      "CP-CC9-R1-01"  "I23"          "JCP2022_800002"  -0.023832  -0.00537   -0.29832
"non-targeting"  "negcon"   "source_13"      "CP-CC9-R1-01"  "J02"          "JCP2022_800002"  -0.169491  -0.023422  -0.088187
"non-targeting"  "negcon"   "source_13"      "CP-CC9-R1-01"  "O23"          "JCP2022_800002"  -0.295112  -0.124241  1.055193
…
"CFB"            "trt"      "source_13"      "CP-CC9-R5-19"  "G12"          "JCP2022_801270"  0.023277   0.021747   -0.914867
"DMRT2"          "trt"      "source_13"      "CP-CC9-R5-21"  "C10"          "JCP2022_801847"  -0.208865  -0.046884  0.845933
"MSX1"           "trt"      "source_13"      "CP-CC9-R5-24"  "C03"          "JCP2022_804300"  -0.003333  0.053181   0.078227
"CFB"            "trt"      "source_13"      "CP-CC9-R6-19"  "G12"          "JCP2022_801270"  0.18701    0.072405   0.381239
"CFB"            "trt"      "source_13"      "CP-CC9-R7-19"  "G12"          "JCP2022_801270"  0.051818   -0.129935  -0.481606