import polars as pl
import requests
from broad_babel.query import get_mapper
Add metadata to profiles
A very common task when processing morphological profiles is knowing which ones are treatments and which ones are controls. Here we will explore how we can use broad-babel to accomplish this task.
We will be using the CRISPR dataset specified in our JSON index file.
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/v0.11.0/manifests/profile_index.json"
response = requests.get(INDEX_FILE)
profile_index = response.json()
CRISPR_URL = (
pl.DataFrame(profile_index).filter(pl.col("subset") == "crispr").item(0, "url")
)
profiles = pl.scan_parquet(CRISPR_URL)
print(profiles.collect_schema().names()[:6])
['Metadata_Source', 'Metadata_Plate', 'Metadata_Well', 'Metadata_JCP2022', 'X_1', 'X_2']
For simplicity, the contents of our processed profiles are minimal: the profile origin (source, plate and well) and the unique JUMP identifier for that perturbation. We will use broad-babel to expand on this metadata, but to keep things small let us first sample a subset of the data.
jcp_ids = (
profiles.select(pl.col("Metadata_JCP2022")).unique().collect().to_series().sort()
)
subsample = jcp_ids.sample(10, seed=42)
# Add a well-known control
subsample = (*subsample, "JCP2022_800002")
subsample
('JCP2022_806489',
'JCP2022_802541',
'JCP2022_807842',
'JCP2022_805589',
'JCP2022_806326',
'JCP2022_804689',
'JCP2022_801000',
'JCP2022_804826',
'JCP2022_801657',
'JCP2022_807446',
'JCP2022_800002')
We will use these JUMP IDs to obtain a mapper that indicates the perturbation type (trt, negcon or, rarely, poscon).
pert_mapper = get_mapper(
subsample, input_column="JCP2022", output_columns="JCP2022,pert_type"
)
pert_mapper
{'JCP2022_800002': 'negcon',
'JCP2022_806326': 'trt',
'JCP2022_801657': 'trt',
'JCP2022_805589': 'trt',
'JCP2022_806489': 'trt',
'JCP2022_807842': 'trt',
'JCP2022_802541': 'trt',
'JCP2022_804826': 'trt',
'JCP2022_807446': 'trt',
'JCP2022_804689': 'trt',
'JCP2022_801000': 'trt'}
A couple of important notes about broad_babel’s get_mapper and other functions:
- These must be fed tuples, because calls are cached, which provides significant speed-ups for repeated calls.
- get_mapper works for datasets of up to a few tens of thousands of samples. If you try to use it to get a mapper for the entirety of the ‘compounds’ dataset, it is likely to fail. For these cases we suggest the more general function run_query. You can read more on this and other use cases in Babel’s README.
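
The tuple requirement is consistent with how memoization generally works in Python: cached arguments must be hashable. The sketch below uses `functools.lru_cache` with a hypothetical `lookup_pert_type` stand-in. It is not broad_babel's actual implementation, only an illustration of why a list would be rejected while a tuple is accepted and served from the cache on repeated calls.

```python
from functools import lru_cache

# Hypothetical stand-in for a cached lookup; NOT broad_babel's real code.
# The tiny table reuses two IDs from the mapper output above.
_TABLE = {"JCP2022_800002": "negcon", "JCP2022_806489": "trt"}


@lru_cache(maxsize=None)
def lookup_pert_type(ids: tuple) -> tuple:
    # Return (id, pert_type) pairs as a tuple so the cached value is immutable.
    return tuple((i, _TABLE.get(i)) for i in ids)


ids = ("JCP2022_800002", "JCP2022_806489")
first = lookup_pert_type(ids)
second = lookup_pert_type(ids)  # identical arguments: served from the cache
cache_hits = lookup_pert_type.cache_info().hits

try:
    lookup_pert_type(list(ids))  # lists are unhashable, so the cache rejects them
    list_accepted = True
except TypeError:
    list_accepted = False
```

Under this assumption, converting any list or polars Series to a tuple before the call (as done with `(*subsample, "JCP2022_800002")` earlier) is what makes the arguments cacheable.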
We will now repeat the process to get their ‘standard’ name.
name_mapper = get_mapper(
subsample,  # already includes the control added above
input_column="JCP2022",
output_columns="JCP2022,standard_key",
)
name_mapper
{'JCP2022_806489': 'SLC2A4',
'JCP2022_804689': 'NR2F1',
'JCP2022_802541': 'G6PC3',
'JCP2022_804826': 'OTUD3',
'JCP2022_807446': 'UBE2B',
'JCP2022_807842': 'ZNF205',
'JCP2022_801000': 'CACNB1',
'JCP2022_801657': 'CYP4F8',
'JCP2022_805589': 'PRSS41',
'JCP2022_800002': 'non-targeting',
'JCP2022_806326': 'SHOX2'}
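
Before using a mapper to relabel a column, it can help to confirm that every requested ID actually received an entry: polars' `replace` leaves values without a mapping unchanged, so a missing ID would silently appear as a raw JCP ID in the new column. The following is a minimal sketch of such a check using the values printed above; `missing_ids` is a hypothetical helper, not part of broad_babel.

```python
# Values copied from the outputs shown above.
subsample = (
    "JCP2022_806489", "JCP2022_802541", "JCP2022_807842", "JCP2022_805589",
    "JCP2022_806326", "JCP2022_804689", "JCP2022_801000", "JCP2022_804826",
    "JCP2022_801657", "JCP2022_807446", "JCP2022_800002",
)
pert_mapper = {
    "JCP2022_800002": "negcon", "JCP2022_806326": "trt",
    "JCP2022_801657": "trt", "JCP2022_805589": "trt",
    "JCP2022_806489": "trt", "JCP2022_807842": "trt",
    "JCP2022_802541": "trt", "JCP2022_804826": "trt",
    "JCP2022_807446": "trt", "JCP2022_804689": "trt",
    "JCP2022_801000": "trt",
}
name_mapper = {
    "JCP2022_806489": "SLC2A4", "JCP2022_804689": "NR2F1",
    "JCP2022_802541": "G6PC3", "JCP2022_804826": "OTUD3",
    "JCP2022_807446": "UBE2B", "JCP2022_807842": "ZNF205",
    "JCP2022_801000": "CACNB1", "JCP2022_801657": "CYP4F8",
    "JCP2022_805589": "PRSS41", "JCP2022_800002": "non-targeting",
    "JCP2022_806326": "SHOX2",
}


def missing_ids(ids, mapper):
    """Hypothetical helper: IDs with no entry in the mapper, sorted."""
    return sorted(set(ids) - set(mapper))


missing_pert = missing_ids(subsample, pert_mapper)
missing_name = missing_ids(subsample, name_mapper)

# IDs whose perturbation type is anything other than a treatment.
controls = sorted(k for k, v in pert_mapper.items() if v != "trt")
```

Both `missing_pert` and `missing_name` are empty for this subsample, and the only non-treatment entry is the control that was added by hand.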
To wrap up, we will fetch all the available profiles for these perturbations and use the mappers to add the missing metadata. We also select a few features to showcase how selection can be performed in polars.
subsample_profiles = profiles.filter(
pl.col("Metadata_JCP2022").is_in(subsample)
).collect()
profiles_with_meta = subsample_profiles.with_columns(
pl.col("Metadata_JCP2022").replace(pert_mapper).alias("pert_type"),
pl.col("Metadata_JCP2022").replace(name_mapper).alias("name"),
)
profiles_with_meta.select(
pl.col(("name", "pert_type", "^Metadata.*$", "^X_[0-3]$"))
).sort(by="pert_type")
| Metadata_Source | Metadata_Plate | Metadata_Well | Metadata_JCP2022 | X_1 | X_2 | X_3 | pert_type | name |
|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f32 | f32 | f32 | str | str |
| "source_13" | "CP-CC9-R1-01" | "A02" | "JCP2022_800002" | -0.223417 | -0.049487 | -0.826231 | "negcon" | "non-targeting" |
| "source_13" | "CP-CC9-R1-01" | "L23" | "JCP2022_800002" | -0.079349 | -0.016958 | -0.277558 | "negcon" | "non-targeting" |
| "source_13" | "CP-CC9-R1-01" | "I23" | "JCP2022_800002" | -0.023832 | -0.00537 | -0.29832 | "negcon" | "non-targeting" |
| "source_13" | "CP-CC9-R1-01" | "J02" | "JCP2022_800002" | -0.169491 | -0.023422 | -0.088187 | "negcon" | "non-targeting" |
| "source_13" | "CP-CC9-R1-01" | "O23" | "JCP2022_800002" | -0.295112 | -0.124241 | 1.055193 | "negcon" | "non-targeting" |
| … | … | … | … | … | … | … | … | … |
| "source_13" | "CP-CC9-R5-24" | "M19" | "JCP2022_807842" | -0.025408 | -0.03758 | 0.605146 | "trt" | "ZNF205" |
| "source_13" | "CP-CC9-R5-27" | "D15" | "JCP2022_807446" | 0.076753 | -0.040404 | -0.421323 | "trt" | "UBE2B" |
| "source_13" | "CP-CC9-R6-01" | "L07" | "JCP2022_806489" | -0.267132 | -0.136339 | 1.994115 | "trt" | "SLC2A4" |
| "source_13" | "CP-CC9-R7-01" | "L07" | "JCP2022_806489" | 0.046167 | -0.180045 | -6.034665 | "trt" | "SLC2A4" |
| "source_13" | "CP-CC9-R8-01" | "L07" | "JCP2022_806489" | -0.0686 | -0.025567 | 0.013707 | "trt" | "SLC2A4" |
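
The selection above mixes literal column names with regular expressions: polars treats a string passed to `pl.col` as a regex when it starts with `^` and ends with `$`. The sketch below imitates that matching with Python's `re` module on the column names from this example, so it runs without downloading any data. The `select_names` helper is hypothetical, and Python's `re` only approximates the regex engine polars uses, although simple patterns like these behave the same way.

```python
import re

# Column names taken from the table above; "X_4" and "X_10" are made-up
# extras that show what the pattern excludes.
columns = [
    "Metadata_Source", "Metadata_Plate", "Metadata_Well", "Metadata_JCP2022",
    "X_1", "X_2", "X_3", "X_4", "X_10", "pert_type", "name",
]


def select_names(columns, selectors):
    """Hypothetical helper: keep columns matched by any selector.

    Selectors wrapped in ^...$ are treated as regular expressions; anything
    else must equal the column name exactly. The result follows the order of
    `columns`, mirroring the column order seen in the table above.
    """
    def matches(name, selector):
        if selector.startswith("^") and selector.endswith("$"):
            return re.search(selector, name) is not None
        return name == selector

    return [c for c in columns if any(matches(c, s) for s in selectors)]


selected = select_names(
    columns, ("name", "pert_type", "^Metadata.*$", "^X_[0-3]$")
)
```

Here `^X_[0-3]$` keeps `X_1` through `X_3` but drops `X_4` and `X_10`, because the character class allows a single digit between 0 and 3 and the anchors forbid anything after it.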