Retrieve JUMP profiles

This is a tutorial on how to access profiles from the JUMP Cell Painting datasets. We will use polars to fetch the data frames lazily, with the help of s3fs and pyarrow. We prefer lazy loading because the data can be too big to be handled in memory.

import polars as pl

The available datasets are:

  1. cpg0016-jump[crispr]: CRISPR knockout genetic perturbations.
  2. cpg0016-jump[orf]: Overexpression genetic perturbations.
  3. cpg0016-jump[compound]: Chemical perturbations.

Their exact location depends on the transformations (the processing pipeline) that produce each dataset. The AWS paths of the dataframes are listed in the index file below:

INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"

We use a version-controlled CSV file to release the latest corrected profiles.

profile_index = pl.read_csv(INDEX_FILE)
profile_index.head()
shape: (5, 3)
subset                  url                                etag
str                     str                                str
"orf"                   "https://cellpainting-gallery.s…   "c05a241135dcedda4e9cc639480b3f…
"crispr"                "https://cellpainting-gallery.s…   "4c59782c0dd5244f67d14323e83258…
"compound"              "https://cellpainting-gallery.s…   "1368a48ddbd4c44b1bfbc084591aaf…
"orf_interpretable"     "https://cellpainting-gallery.s…   "97b0c31d7d678ca2a5e2353df5799f…
"crispr_interpretable"  "https://cellpainting-gallery.s…   "90b08b824c06bcf16dfc5e788e74f0…

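The etag column in the index can be used to check that a downloaded file is intact. The sketch below is one way to recompute such a value locally; it is not part of the JUMP tooling. It assumes the usual S3 behaviour: a single-part upload has an ETag equal to the MD5 hex digest of the object, while a multipart upload has the MD5 of the concatenated per-part MD5 digests followed by "-<number of parts>". The function name and the 8 MiB default part size are assumptions; the part size actually used for these files is not stated here, and server-side encryption can change the ETag.

```python
import hashlib
import io


def s3_style_etag(stream, part_size=8 * 1024 * 1024):
    """Recompute an S3-style ETag from a binary stream (hypothetical helper).

    Assumes plain MD5 for a single part and "md5-of-part-md5s" plus a
    "-<n_parts>" suffix for several parts. `part_size` is a guess and must
    match the part size used at upload time for multipart objects.
    """
    digests = []
    # Read the stream part by part so large files never sit fully in memory.
    while chunk := stream.read(part_size):
        digests.append(hashlib.md5(chunk).digest())
    if not digests:  # empty object
        return hashlib.md5(b"").hexdigest()
    if len(digests) == 1:  # single-part upload: plain MD5
        return digests[0].hex()
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


# Toy usage on in-memory bytes; a real check would open the parquet file
# with open(path, "rb") and compare against the index's etag value.
single = s3_style_etag(io.BytesIO(b"hello"))
multi = s3_style_etag(io.BytesIO(b"a" * 10), part_size=4)
```

If the recomputed value differs from the index, the part size assumption is the first thing to revisit before concluding that the file is corrupted.
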
We need neither the ‘etag’ column (used to check file integrity) nor the ‘interpretable’ subsets (i.e., the profiles before major modifications), so we filter them out:

selected_profiles = profile_index.filter(
    pl.col("subset").is_in(("crispr", "orf", "compound"))
).select(pl.exclude("etag"))
filepaths = dict(selected_profiles.iter_rows())
print(filepaths)
{'orf': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet', 'crispr': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet', 'compound': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int_featselect_harmony.parquet'}
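
The URLs printed above appear to encode the recipe version, the subset and the chain of processing steps in their path components. As a minimal sketch, assuming the layout seen in these three URLs holds (recipe folder, subset folder, pipeline folder, parquet file), the pieces can be pulled apart with the standard library. The helper name `describe_profile_url` is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# URL copied from the crispr entry of the `filepaths` output.
CRISPR_URL = (
    "https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/"
    "source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/"
    "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/"
    "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet"
)


def describe_profile_url(url):
    """Split a profile URL into its path components (hypothetical helper).

    Assumes the layout .../<recipe>/<subset>/<pipeline>/<pipeline>.parquet
    observed in the tutorial's URLs; other releases may differ.
    """
    parsed = urlparse(url)
    path = PurePosixPath(parsed.path)
    parts = path.parts  # first element is the root "/"
    return {
        "host": parsed.netloc,
        "dataset": parts[1],
        "recipe": parts[-4],
        "subset": parts[-3],
        "pipeline": path.stem,  # file name without ".parquet"
    }


crispr_info = describe_profile_url(CRISPR_URL)
for key, value in crispr_info.items():
    print(f"{key}: {value}")
```

Comparing the `pipeline` entry across subsets makes it easy to see that, for example, the crispr profiles carry an extra `PCA_corrected` suffix that the orf profiles lack.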

We will lazy-load the dataframes and print the number of rows and columns, along with a rough size estimate.

# Collect one entry of summary statistics per dataset.
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = pl.scan_parquet(path)  # builds a lazy query
    n_rows = data.select(pl.len()).collect().item()  # runs a row-count query
    schema = data.collect_schema()  # column names and dtypes
    metadata_cols = [col for col in schema.keys() if col.startswith("Metadata")]
    n_cols = schema.len()
    n_meta_cols = len(metadata_cols)
    # Rough estimate: ~4.03 bytes per value, converted from bytes to MB.
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)

pl.DataFrame(info)
shape: (3, 5)
dataset     #rows   #cols  #Metadata cols  Size (MB)
str         i64     i64    i64             i64
"orf"       81660   726    4               239
"crispr"    51185   263    4               54
"compound"  803853  741    4               2400

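The size column is an estimate: the number of values (rows times columns) multiplied by a factor of 4.03 bytes per value. That factor is presumably close to 4 because the feature columns are stored as 32-bit floats, with the remainder accounting for the few string metadata columns; this interpretation is an assumption. The sketch below repeats the arithmetic with the shapes from the table, without touching the remote files. The function name is hypothetical.

```python
BYTES_PER_VALUE = 4.03  # factor taken from the loop in the tutorial


def estimate_size_mb(n_rows, n_cols, bytes_per_value=BYTES_PER_VALUE):
    """Rough in-memory size in MB, assuming a flat cost per table cell."""
    return int(round(bytes_per_value * n_rows * n_cols / 1e6, 0))


# (rows, columns) as reported in the summary table.
shapes = {
    "orf": (81660, 726),
    "crispr": (51185, 263),
    "compound": (803853, 741),
}

estimates = {name: estimate_size_mb(*shape) for name, shape in shapes.items()}
print(estimates)
# {'orf': 239, 'crispr': 54, 'compound': 2400}
```

At roughly 2.4 GB, the compound dataset is the one for which lazy loading and column selection matter most.
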
Let us now focus on the crispr dataset and use a regex to select the metadata columns. We will then sample rows and display the overview. Note that the collect() method forces the lazy query to run and loads its result into memory.

data = pl.scan_parquet(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()
shape: (5, 4)
Metadata_Source  Metadata_Plate  Metadata_Well  Metadata_JCP2022
str              str             str            str
"source_13"      "CP-CC9-R2-15"  "D02"          "JCP2022_800002"
"source_13"      "CP-CC9-R1-04"  "J18"          "JCP2022_800028"
"source_13"      "CP-CC9-R2-04"  "J09"          "JCP2022_807421"
"source_13"      "CP-CC9-R2-26"  "L14"          "JCP2022_807129"
"source_13"      "CP-CC9-R6-01"  "C12"          "JCP2022_806640"

The following line excludes the metadata columns:

data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only
shape: (5, 259)
X_1 X_2 X_3 X_4 X_5 X_6 X_7 X_8 X_9 X_10 X_11 X_12 X_13 X_14 X_15 X_16 X_17 X_18 X_19 X_20 X_21 X_22 X_23 X_24 X_25 X_26 X_27 X_28 X_29 X_30 X_31 X_32 X_33 X_34 X_35 X_36 X_37 X_223 X_224 X_225 X_226 X_227 X_228 X_229 X_230 X_231 X_232 X_233 X_234 X_235 X_236 X_237 X_238 X_239 X_240 X_241 X_242 X_243 X_244 X_245 X_246 X_247 X_248 X_249 X_250 X_251 X_252 X_253 X_254 X_255 X_256 X_257 X_258 X_259
f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32 f32
0.431689 0.121776 -0.288611 1.199042 -0.758412 -0.466926 -0.777705 -0.081231 -0.619822 -1.27128 -0.373444 0.755662 -0.271196 -0.219682 0.268569 -0.831324 -0.916929 0.128514 0.202126 -0.448374 0.57358 -0.148984 -0.451346 -0.863105 -0.519879 -0.485649 0.067051 -0.461362 -0.87479 0.060438 -0.86988 -0.053304 0.479346 0.415922 0.55612 0.057157 -0.486731 0.070464 0.011686 -0.071482 0.047634 -0.137811 0.010114 -0.146834 0.028652 0.048453 0.015478 -0.371927 -0.318295 -0.07663 0.099552 -0.067174 0.324664 0.11507 0.07018 0.149843 0.090655 -0.024452 -0.167478 -0.063188 0.10028 -0.20603 -0.143531 -0.042267 -0.103231 0.166172 0.268637 -0.249552 -0.125842 -0.010658 0.148293 -0.002996 0.018602 0.120415
-0.286125 -0.139647 0.521229 -0.130772 -0.392223 -0.478905 -2.190718 -0.910039 -0.923397 -0.89992 0.809614 0.195752 1.051458 -0.586142 0.132069 0.691497 2.309921 0.451202 0.017881 0.722985 0.094764 0.458089 0.289687 -0.005019 -0.44384 -0.292192 -0.661437 -0.480588 -0.43835 0.392833 0.883042 -0.183804 -0.63443 0.088329 0.317562 0.790481 0.49558 0.12586 0.150716 0.092419 0.070398 -0.10096 0.241489 -0.02793 -0.069464 0.173498 0.096578 -0.006984 -0.010409 -0.122357 -0.154975 -0.264336 -0.026424 -0.107131 -0.217108 -0.076673 -0.025199 0.178872 0.273566 -0.011964 -0.284162 -0.07764 -0.147836 -0.030516 0.039593 -0.251191 -0.145978 -0.061276 0.260967 0.136172 0.220407 -0.016074 0.24593 -0.051766
0.044537 0.093762 0.38071 -0.078268 -0.332677 -0.492756 -0.54244 -0.751058 0.28314 0.772951 -0.344511 -0.291534 -0.64803 1.04816 0.814905 0.020586 -1.699232 -0.35928 0.474136 -0.500731 0.16648 0.460551 0.773349 -0.584125 0.070497 0.382738 1.290578 1.115024 0.656066 -0.211548 0.615551 1.202399 0.61274 0.467623 0.826743 0.98965 0.515379 -0.035649 0.084653 -0.148614 0.41456 -0.035386 0.039774 0.222122 0.127807 0.212482 -0.087575 0.149949 -0.146337 0.031107 0.048564 -0.151519 -0.256957 -0.147494 -0.051771 0.000703 -0.100694 0.127297 -0.159605 0.056752 0.079783 -0.301415 -0.033567 -0.073402 0.073441 0.003454 -0.065908 0.003793 0.017154 0.122071 0.031753 -0.115469 -0.183939 -0.037042
0.045477 0.020634 0.312316 1.316 -0.831466 -1.536956 0.495057 -1.25451 -0.417021 0.099831 0.010575 0.815467 -0.793362 -0.602823 -0.470462 -1.901034 -0.749613 -0.03417 -0.349764 -0.109558 0.50934 0.937879 -0.567808 -0.361403 0.07038 0.428986 0.178268 -0.264072 -1.08156 0.484804 0.257085 -0.387199 -0.594517 -0.142474 0.364982 0.369385 -0.033974 0.080806 0.047688 0.081428 -0.072393 -0.134251 0.32516 -0.013819 -0.231218 0.235347 -0.099079 -0.214146 -0.088035 0.279149 0.235552 0.056753 -0.002605 -0.121467 -0.011054 0.014276 0.031513 0.056525 -0.204108 0.056208 -0.007412 0.295334 0.059559 -0.072717 0.143892 -0.175082 0.06916 -0.240234 -0.243179 0.132553 -0.10939 -0.006807 -0.081922 -0.033631
-0.128473 -0.163732 0.052351 -3.2502 0.237454 0.327462 2.975345 1.074392 -0.642075 -0.309154 -1.427569 0.209862 -0.207053 -0.785397 -1.690689 0.57705 1.286289 -0.260824 -0.066723 -0.378312 -0.107758 0.58553 0.723803 -0.085321 -0.899026 -0.508275 0.946614 0.681252 0.591428 -0.058463 -0.611216 -0.249337 0.151805 -0.201767 -0.364704 -0.279569 0.032865 -0.103084 -0.092279 0.061387 -0.229078 0.214459 0.018508 -0.164547 0.170245 -0.028671 -0.024243 0.116811 0.03172 0.010574 0.014084 0.15063 -0.053592 -0.297773 -0.033743 0.264092 -0.030906 -0.04306 -0.126682 -0.050824 -0.011592 0.082704 -0.186133 0.172641 -0.056459 0.190109 0.06259 0.093085 -0.251115 0.141207 0.180379 -0.006493 -0.155394 -0.013597

Finally, we can convert this to pandas if we want to perform analyses with that tool. Keep in mind that this loads the entire dataframe into memory.

data_only.to_pandas()
X_1 X_2 X_3 X_4 X_5 X_6 X_7 X_8 X_9 X_10 ... X_250 X_251 X_252 X_253 X_254 X_255 X_256 X_257 X_258 X_259
0 0.431689 0.121776 -0.288611 1.199042 -0.758412 -0.466926 -0.777705 -0.081231 -0.619822 -1.271280 ... -0.103231 0.166172 0.268637 -0.249552 -0.125842 -0.010658 0.148293 -0.002996 0.018602 0.120415
1 -0.286125 -0.139647 0.521229 -0.130772 -0.392223 -0.478905 -2.190718 -0.910039 -0.923397 -0.899920 ... 0.039593 -0.251191 -0.145978 -0.061276 0.260967 0.136172 0.220407 -0.016074 0.245930 -0.051766
2 0.044537 0.093762 0.380710 -0.078268 -0.332677 -0.492756 -0.542440 -0.751058 0.283140 0.772951 ... 0.073441 0.003454 -0.065908 0.003793 0.017154 0.122071 0.031753 -0.115469 -0.183939 -0.037042
3 0.045477 0.020634 0.312316 1.316000 -0.831466 -1.536956 0.495057 -1.254510 -0.417021 0.099831 ... 0.143892 -0.175082 0.069160 -0.240234 -0.243179 0.132553 -0.109390 -0.006807 -0.081922 -0.033631
4 -0.128473 -0.163732 0.052351 -3.250200 0.237454 0.327462 2.975345 1.074392 -0.642075 -0.309154 ... -0.056459 0.190109 0.062590 0.093085 -0.251115 0.141207 0.180379 -0.006493 -0.155394 -0.013597

5 rows × 259 columns