import polars as pl
Retrieve JUMP profiles
This is a tutorial on how to access profiles from the JUMP Cell Painting datasets. We will use polars to fetch the data frames lazily, with the help of s3fs and pyarrow. We prefer lazy loading because the data can be too big to be handled in memory.
The JUMP Cell Painting project provides several processed datasets for morphological profiling:
- crispr: CRISPR knockout genetic perturbations
- orf: Open Reading Frame (ORF) overexpression perturbations
- compound: Chemical compound perturbations
- all: Combined dataset containing all perturbation types
Each dataset is available in two versions:
- Standard: Fully processed including batch correction
- Interpretable: Same processing but without batch correction steps (which involve transformations that lose the original feature space)
All datasets are stored as Parquet files on AWS S3 and can be accessed directly via their URLs. Snakemake workflows for producing these assembled profiles are available here. The specific commit used to produce the profiles can be found in the folder path of each parquet file. For example, jump-profiling-recipe_2024_a917fa7 indicates commit a917fa7 was used. The index file below contains the exact locations and metadata for each dataset:
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/v0.9.0/manifests/profile_index.csv"
We use the version-controlled CSV above to release the latest corrected profiles.
profile_index = pl.read_csv(INDEX_FILE)
profile_index.head()
subset | url | etag |
---|---|---|
str | str | str |
"orf" | "https://cellpainting-gallery.s… | "c05a241135dcedda4e9cc639480b3f… |
"crispr" | "https://cellpainting-gallery.s… | "4c59782c0dd5244f67d14323e83258… |
"compound" | "https://cellpainting-gallery.s… | "1368a48ddbd4c44b1bfbc084591aaf… |
"orf_interpretable" | "https://cellpainting-gallery.s… | "97b0c31d7d678ca2a5e2353df5799f… |
"crispr_interpretable" | "https://cellpainting-gallery.s… | "90b08b824c06bcf16dfc5e788e74f0… |
We need neither the ‘etag’ column (used to check file integrity) nor the ‘interpretable’ subsets (i.e., the profiles before major modifications), so we filter both out:
selected_profiles = profile_index.filter(
    pl.col("subset").is_in(("crispr", "orf", "compound"))
).select(pl.exclude("etag"))
filepaths = dict(selected_profiles.iter_rows())
print(filepaths)
{'orf': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet', 'crispr': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet', 'compound': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int_featselect_harmony.parquet'}
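Since the recipe commit is encoded in the folder path of each file, it can be recovered from a URL with a small regular expression. The sketch below uses the orf URL printed above. The helper name recipe_commit and the assumed folder pattern (jump-profiling-recipe_<year>_<hash>) are illustrative assumptions based on the single example given in this tutorial, not a documented naming scheme.

```python
import re
from typing import Optional

# URL copied from the printed dictionary above (orf entry)
url = (
    "https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/"
    "source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/"
    "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/"
    "profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet"
)


def recipe_commit(url: str) -> Optional[str]:
    """Return the commit hash embedded in the recipe folder name, if any.

    Assumes the folder is named jump-profiling-recipe_<year>_<hash>.
    """
    match = re.search(r"jump-profiling-recipe_\d{4}_([0-9a-f]+)/", url)
    return match.group(1) if match else None


commit = recipe_commit(url)
print(commit)
```

Returning None for URLs that do not match keeps the helper safe to apply to every entry of the index.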
We will lazy-load the dataframes and report, for each one, the number of rows, columns, and metadata columns, along with an estimated in-memory size.
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = pl.scan_parquet(path)
    n_rows = data.select(pl.len()).collect().item()
    schema = data.collect_schema()
    metadata_cols = [col for col in schema.keys() if col.startswith("Metadata")]
    n_cols = schema.len()
    n_meta_cols = len(metadata_cols)
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))  # B -> MB
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)
pl.DataFrame(info)
dataset | #rows | #cols | #Metadata cols | Size (MB) |
---|---|---|---|---|
str | i64 | i64 | i64 | i64 |
"orf" | 81660 | 726 | 4 | 239 |
"crispr" | 51185 | 263 | 4 | 54 |
"compound" | 803853 | 741 | 4 | 2400 |
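The size column can be checked by hand from the row and column counts in the table. The sketch below repeats the same arithmetic in plain Python. The factor 4.03 is taken from the code above; reading it as roughly four bytes per 32-bit value plus a small overhead is an assumption, and the function name estimate_size_mb is hypothetical. Because the metadata columns hold strings, the result is only a rough estimate.

```python
def estimate_size_mb(n_rows: int, n_cols: int, bytes_per_value: float = 4.03) -> int:
    """Rough in-memory size in MB, assuming about 4 bytes per table cell."""
    return int(round(bytes_per_value * n_rows * n_cols / 1e6, 0))


# (rows, columns) as reported in the table above
shapes = {
    "orf": (81660, 726),
    "crispr": (51185, 263),
    "compound": (803853, 741),
}

estimates = {name: estimate_size_mb(rows, cols) for name, (rows, cols) in shapes.items()}
print(estimates)
```

The compound dataset comes out at about 2.4 GB, which is the main motivation for scanning lazily instead of reading everything at once.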
Let us now focus on the crispr dataset and use a regex to select the metadata columns. We will then sample rows and display the overview. Note that the collect() method enforces loading some data into memory.
data = pl.scan_parquet(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()
Metadata_Source | Metadata_Plate | Metadata_Well | Metadata_JCP2022 |
---|---|---|---|
str | str | str | str |
"source_13" | "CP-CC9-R2-15" | "D02" | "JCP2022_800002" |
"source_13" | "CP-CC9-R1-04" | "J18" | "JCP2022_800028" |
"source_13" | "CP-CC9-R2-04" | "J09" | "JCP2022_807421" |
"source_13" | "CP-CC9-R2-26" | "L14" | "JCP2022_807129" |
"source_13" | "CP-CC9-R6-01" | "C12" | "JCP2022_806640" |
The following line excludes the metadata columns:
data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only
X_1 | X_2 | X_3 | X_4 | X_5 | X_6 | X_7 | X_8 | X_9 | X_10 | X_11 | X_12 | X_13 | X_14 | X_15 | X_16 | X_17 | X_18 | X_19 | X_20 | X_21 | X_22 | X_23 | X_24 | X_25 | X_26 | X_27 | X_28 | X_29 | X_30 | X_31 | X_32 | X_33 | X_34 | X_35 | X_36 | X_37 | … | X_223 | X_224 | X_225 | X_226 | X_227 | X_228 | X_229 | X_230 | X_231 | X_232 | X_233 | X_234 | X_235 | X_236 | X_237 | X_238 | X_239 | X_240 | X_241 | X_242 | X_243 | X_244 | X_245 | X_246 | X_247 | X_248 | X_249 | X_250 | X_251 | X_252 | X_253 | X_254 | X_255 | X_256 | X_257 | X_258 | X_259 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | … | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 |
0.431689 | 0.121776 | -0.288611 | 1.199042 | -0.758412 | -0.466926 | -0.777705 | -0.081231 | -0.619822 | -1.27128 | -0.373444 | 0.755662 | -0.271196 | -0.219682 | 0.268569 | -0.831324 | -0.916929 | 0.128514 | 0.202126 | -0.448374 | 0.57358 | -0.148984 | -0.451346 | -0.863105 | -0.519879 | -0.485649 | 0.067051 | -0.461362 | -0.87479 | 0.060438 | -0.86988 | -0.053304 | 0.479346 | 0.415922 | 0.55612 | 0.057157 | -0.486731 | … | 0.070464 | 0.011686 | -0.071482 | 0.047634 | -0.137811 | 0.010114 | -0.146834 | 0.028652 | 0.048453 | 0.015478 | -0.371927 | -0.318295 | -0.07663 | 0.099552 | -0.067174 | 0.324664 | 0.11507 | 0.07018 | 0.149843 | 0.090655 | -0.024452 | -0.167478 | -0.063188 | 0.10028 | -0.20603 | -0.143531 | -0.042267 | -0.103231 | 0.166172 | 0.268637 | -0.249552 | -0.125842 | -0.010658 | 0.148293 | -0.002996 | 0.018602 | 0.120415 |
-0.286125 | -0.139647 | 0.521229 | -0.130772 | -0.392223 | -0.478905 | -2.190718 | -0.910039 | -0.923397 | -0.89992 | 0.809614 | 0.195752 | 1.051458 | -0.586142 | 0.132069 | 0.691497 | 2.309921 | 0.451202 | 0.017881 | 0.722985 | 0.094764 | 0.458089 | 0.289687 | -0.005019 | -0.44384 | -0.292192 | -0.661437 | -0.480588 | -0.43835 | 0.392833 | 0.883042 | -0.183804 | -0.63443 | 0.088329 | 0.317562 | 0.790481 | 0.49558 | … | 0.12586 | 0.150716 | 0.092419 | 0.070398 | -0.10096 | 0.241489 | -0.02793 | -0.069464 | 0.173498 | 0.096578 | -0.006984 | -0.010409 | -0.122357 | -0.154975 | -0.264336 | -0.026424 | -0.107131 | -0.217108 | -0.076673 | -0.025199 | 0.178872 | 0.273566 | -0.011964 | -0.284162 | -0.07764 | -0.147836 | -0.030516 | 0.039593 | -0.251191 | -0.145978 | -0.061276 | 0.260967 | 0.136172 | 0.220407 | -0.016074 | 0.24593 | -0.051766 |
0.044537 | 0.093762 | 0.38071 | -0.078268 | -0.332677 | -0.492756 | -0.54244 | -0.751058 | 0.28314 | 0.772951 | -0.344511 | -0.291534 | -0.64803 | 1.04816 | 0.814905 | 0.020586 | -1.699232 | -0.35928 | 0.474136 | -0.500731 | 0.16648 | 0.460551 | 0.773349 | -0.584125 | 0.070497 | 0.382738 | 1.290578 | 1.115024 | 0.656066 | -0.211548 | 0.615551 | 1.202399 | 0.61274 | 0.467623 | 0.826743 | 0.98965 | 0.515379 | … | -0.035649 | 0.084653 | -0.148614 | 0.41456 | -0.035386 | 0.039774 | 0.222122 | 0.127807 | 0.212482 | -0.087575 | 0.149949 | -0.146337 | 0.031107 | 0.048564 | -0.151519 | -0.256957 | -0.147494 | -0.051771 | 0.000703 | -0.100694 | 0.127297 | -0.159605 | 0.056752 | 0.079783 | -0.301415 | -0.033567 | -0.073402 | 0.073441 | 0.003454 | -0.065908 | 0.003793 | 0.017154 | 0.122071 | 0.031753 | -0.115469 | -0.183939 | -0.037042 |
0.045477 | 0.020634 | 0.312316 | 1.316 | -0.831466 | -1.536956 | 0.495057 | -1.25451 | -0.417021 | 0.099831 | 0.010575 | 0.815467 | -0.793362 | -0.602823 | -0.470462 | -1.901034 | -0.749613 | -0.03417 | -0.349764 | -0.109558 | 0.50934 | 0.937879 | -0.567808 | -0.361403 | 0.07038 | 0.428986 | 0.178268 | -0.264072 | -1.08156 | 0.484804 | 0.257085 | -0.387199 | -0.594517 | -0.142474 | 0.364982 | 0.369385 | -0.033974 | … | 0.080806 | 0.047688 | 0.081428 | -0.072393 | -0.134251 | 0.32516 | -0.013819 | -0.231218 | 0.235347 | -0.099079 | -0.214146 | -0.088035 | 0.279149 | 0.235552 | 0.056753 | -0.002605 | -0.121467 | -0.011054 | 0.014276 | 0.031513 | 0.056525 | -0.204108 | 0.056208 | -0.007412 | 0.295334 | 0.059559 | -0.072717 | 0.143892 | -0.175082 | 0.06916 | -0.240234 | -0.243179 | 0.132553 | -0.10939 | -0.006807 | -0.081922 | -0.033631 |
-0.128473 | -0.163732 | 0.052351 | -3.2502 | 0.237454 | 0.327462 | 2.975345 | 1.074392 | -0.642075 | -0.309154 | -1.427569 | 0.209862 | -0.207053 | -0.785397 | -1.690689 | 0.57705 | 1.286289 | -0.260824 | -0.066723 | -0.378312 | -0.107758 | 0.58553 | 0.723803 | -0.085321 | -0.899026 | -0.508275 | 0.946614 | 0.681252 | 0.591428 | -0.058463 | -0.611216 | -0.249337 | 0.151805 | -0.201767 | -0.364704 | -0.279569 | 0.032865 | … | -0.103084 | -0.092279 | 0.061387 | -0.229078 | 0.214459 | 0.018508 | -0.164547 | 0.170245 | -0.028671 | -0.024243 | 0.116811 | 0.03172 | 0.010574 | 0.014084 | 0.15063 | -0.053592 | -0.297773 | -0.033743 | 0.264092 | -0.030906 | -0.04306 | -0.126682 | -0.050824 | -0.011592 | 0.082704 | -0.186133 | 0.172641 | -0.056459 | 0.190109 | 0.06259 | 0.093085 | -0.251115 | 0.141207 | 0.180379 | -0.006493 | -0.155394 | -0.013597 |
Finally, we can convert data_only to pandas if we want to perform analyses with that tool. Keep in mind that this loads the entire dataframe into memory.
data_only.to_pandas()
| X_1 | X_2 | X_3 | X_4 | X_5 | X_6 | X_7 | X_8 | X_9 | X_10 | ... | X_250 | X_251 | X_252 | X_253 | X_254 | X_255 | X_256 | X_257 | X_258 | X_259 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.431689 | 0.121776 | -0.288611 | 1.199042 | -0.758412 | -0.466926 | -0.777705 | -0.081231 | -0.619822 | -1.271280 | ... | -0.103231 | 0.166172 | 0.268637 | -0.249552 | -0.125842 | -0.010658 | 0.148293 | -0.002996 | 0.018602 | 0.120415 |
1 | -0.286125 | -0.139647 | 0.521229 | -0.130772 | -0.392223 | -0.478905 | -2.190718 | -0.910039 | -0.923397 | -0.899920 | ... | 0.039593 | -0.251191 | -0.145978 | -0.061276 | 0.260967 | 0.136172 | 0.220407 | -0.016074 | 0.245930 | -0.051766 |
2 | 0.044537 | 0.093762 | 0.380710 | -0.078268 | -0.332677 | -0.492756 | -0.542440 | -0.751058 | 0.283140 | 0.772951 | ... | 0.073441 | 0.003454 | -0.065908 | 0.003793 | 0.017154 | 0.122071 | 0.031753 | -0.115469 | -0.183939 | -0.037042 |
3 | 0.045477 | 0.020634 | 0.312316 | 1.316000 | -0.831466 | -1.536956 | 0.495057 | -1.254510 | -0.417021 | 0.099831 | ... | 0.143892 | -0.175082 | 0.069160 | -0.240234 | -0.243179 | 0.132553 | -0.109390 | -0.006807 | -0.081922 | -0.033631 |
4 | -0.128473 | -0.163732 | 0.052351 | -3.250200 | 0.237454 | 0.327462 | 2.975345 | 1.074392 | -0.642075 | -0.309154 | ... | -0.056459 | 0.190109 | 0.062590 | 0.093085 | -0.251115 | 0.141207 | 0.180379 | -0.006493 | -0.155394 | -0.013597 |
5 rows × 259 columns