import polars as pl
Retrieve JUMP profiles
This is a tutorial on how to access profiles from the JUMP Cell Painting datasets. We will use polars to fetch the data frames lazily, with the help of s3fs and pyarrow. We prefer lazy loading because the data can be too big to be handled in memory.
The available datasets are:
- cpg0016-jump[crispr]: CRISPR knockout genetic perturbations.
- cpg0016-jump[orf]: Overexpression genetic perturbations.
- cpg0016-jump[compound]: Chemical perturbations.
Their exact location depends on the transformations that produce the datasets. The AWS paths of the dataframes are listed in the index file below:
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"
We use a version-controlled CSV file to release the latest corrected profiles.
profile_index = pl.read_csv(INDEX_FILE)
profile_index.head()
subset | url | etag |
---|---|---|
str | str | str |
"orf" | "https://cellpainting-gallery.s… | "c05a241135dcedda4e9cc639480b3f… |
"crispr" | "https://cellpainting-gallery.s… | "4c59782c0dd5244f67d14323e83258… |
"compound" | "https://cellpainting-gallery.s… | "1368a48ddbd4c44b1bfbc084591aaf… |
"orf_interpretable" | "https://cellpainting-gallery.s… | "97b0c31d7d678ca2a5e2353df5799f… |
"crispr_interpretable" | "https://cellpainting-gallery.s… | "90b08b824c06bcf16dfc5e788e74f0… |
We need neither the ‘etag’ column (used to check file integrity) nor the ‘interpretable’ subsets (i.e., the profiles before major modifications), so we drop both:
selected_profiles = profile_index.filter(
    pl.col("subset").is_in(("crispr", "orf", "compound"))
).select(pl.exclude("etag"))
filepaths = dict(selected_profiles.iter_rows())
print(filepaths)
{'orf': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet', 'crispr': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet', 'compound': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int_featselect_harmony.parquet'}
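Although the ‘etag’ column is dropped here, it can be useful when a file is downloaded in full. The sketch below shows one way such a check could look, under stated assumptions: S3 ETags of single-part, non-KMS-encrypted uploads are plain MD5 hex digests, while multipart uploads use the MD5 of the concatenated part digests followed by `-<number of parts>`. Whether the JUMP files were uploaded in one part or several, and with which part size, is not stated in this tutorial, so `s3_style_etag` and its `part_size` parameter are hypothetical.

```python
import hashlib
from typing import Optional


def s3_style_etag(data: bytes, part_size: Optional[int] = None) -> str:
    """Compute an S3-style ETag for `data` (illustrative helper).

    part_size=None models a single-part upload: plain MD5 hex digest.
    Otherwise the multipart convention is applied: MD5 of the concatenated
    binary MD5 digests of each part, then "-<number of parts>".
    """
    if part_size is None:
        return hashlib.md5(data).hexdigest()
    parts = [data[i : i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


# Toy payloads, not real profile data.
etag_single = s3_style_etag(b"hello")
etag_multi = s3_style_etag(b"hello world", part_size=5)
print(etag_single)
print(etag_multi)
```

A downloaded file would pass the check if the computed string equals the value stored in the ‘etag’ column of the index.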
We will lazy-load the dataframes and report the number of rows and columns, along with an estimated in-memory size.
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = pl.scan_parquet(path)  # build a lazy query for the remote file
    n_rows = data.select(pl.len()).collect().item()  # row count only
    schema = data.collect_schema()  # column names and dtypes
    metadata_cols = [col for col in schema.keys() if col.startswith("Metadata")]
    n_cols = schema.len()
    n_meta_cols = len(metadata_cols)
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))  # B -> MB
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)
pl.DataFrame(info)
dataset | #rows | #cols | #Metadata cols | Size (MB) |
---|---|---|---|---|
str | i64 | i64 | i64 | i64 |
"orf" | 81660 | 726 | 4 | 239 |
"crispr" | 51185 | 263 | 4 | 54 |
"compound" | 803853 | 741 | 4 | 2400 |
Let us now focus on the crispr dataset and use a regex to select the metadata columns. We will then sample rows and display the overview. Note that the collect() method forces the selected data to be loaded into memory.
data = pl.scan_parquet(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()
Metadata_Source | Metadata_Plate | Metadata_Well | Metadata_JCP2022 |
---|---|---|---|
str | str | str | str |
"source_13" | "CP-CC9-R2-15" | "D02" | "JCP2022_800002" |
"source_13" | "CP-CC9-R1-04" | "J18" | "JCP2022_800028" |
"source_13" | "CP-CC9-R2-04" | "J09" | "JCP2022_807421" |
"source_13" | "CP-CC9-R2-26" | "L14" | "JCP2022_807129" |
"source_13" | "CP-CC9-R6-01" | "C12" | "JCP2022_806640" |
The following line excludes the metadata columns:
data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only
X_1 | X_2 | X_3 | X_4 | X_5 | X_6 | X_7 | X_8 | X_9 | X_10 | X_11 | X_12 | X_13 | X_14 | X_15 | X_16 | X_17 | X_18 | X_19 | X_20 | X_21 | X_22 | X_23 | X_24 | X_25 | X_26 | X_27 | X_28 | X_29 | X_30 | X_31 | X_32 | X_33 | X_34 | X_35 | X_36 | X_37 | … | X_223 | X_224 | X_225 | X_226 | X_227 | X_228 | X_229 | X_230 | X_231 | X_232 | X_233 | X_234 | X_235 | X_236 | X_237 | X_238 | X_239 | X_240 | X_241 | X_242 | X_243 | X_244 | X_245 | X_246 | X_247 | X_248 | X_249 | X_250 | X_251 | X_252 | X_253 | X_254 | X_255 | X_256 | X_257 | X_258 | X_259 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | … | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 |
0.431689 | 0.121776 | -0.288611 | 1.199042 | -0.758412 | -0.466926 | -0.777705 | -0.081231 | -0.619822 | -1.27128 | -0.373444 | 0.755662 | -0.271196 | -0.219682 | 0.268569 | -0.831324 | -0.916929 | 0.128514 | 0.202126 | -0.448374 | 0.57358 | -0.148984 | -0.451346 | -0.863105 | -0.519879 | -0.485649 | 0.067051 | -0.461362 | -0.87479 | 0.060438 | -0.86988 | -0.053304 | 0.479346 | 0.415922 | 0.55612 | 0.057157 | -0.486731 | … | 0.070464 | 0.011686 | -0.071482 | 0.047634 | -0.137811 | 0.010114 | -0.146834 | 0.028652 | 0.048453 | 0.015478 | -0.371927 | -0.318295 | -0.07663 | 0.099552 | -0.067174 | 0.324664 | 0.11507 | 0.07018 | 0.149843 | 0.090655 | -0.024452 | -0.167478 | -0.063188 | 0.10028 | -0.20603 | -0.143531 | -0.042267 | -0.103231 | 0.166172 | 0.268637 | -0.249552 | -0.125842 | -0.010658 | 0.148293 | -0.002996 | 0.018602 | 0.120415 |
-0.286125 | -0.139647 | 0.521229 | -0.130772 | -0.392223 | -0.478905 | -2.190718 | -0.910039 | -0.923397 | -0.89992 | 0.809614 | 0.195752 | 1.051458 | -0.586142 | 0.132069 | 0.691497 | 2.309921 | 0.451202 | 0.017881 | 0.722985 | 0.094764 | 0.458089 | 0.289687 | -0.005019 | -0.44384 | -0.292192 | -0.661437 | -0.480588 | -0.43835 | 0.392833 | 0.883042 | -0.183804 | -0.63443 | 0.088329 | 0.317562 | 0.790481 | 0.49558 | … | 0.12586 | 0.150716 | 0.092419 | 0.070398 | -0.10096 | 0.241489 | -0.02793 | -0.069464 | 0.173498 | 0.096578 | -0.006984 | -0.010409 | -0.122357 | -0.154975 | -0.264336 | -0.026424 | -0.107131 | -0.217108 | -0.076673 | -0.025199 | 0.178872 | 0.273566 | -0.011964 | -0.284162 | -0.07764 | -0.147836 | -0.030516 | 0.039593 | -0.251191 | -0.145978 | -0.061276 | 0.260967 | 0.136172 | 0.220407 | -0.016074 | 0.24593 | -0.051766 |
0.044537 | 0.093762 | 0.38071 | -0.078268 | -0.332677 | -0.492756 | -0.54244 | -0.751058 | 0.28314 | 0.772951 | -0.344511 | -0.291534 | -0.64803 | 1.04816 | 0.814905 | 0.020586 | -1.699232 | -0.35928 | 0.474136 | -0.500731 | 0.16648 | 0.460551 | 0.773349 | -0.584125 | 0.070497 | 0.382738 | 1.290578 | 1.115024 | 0.656066 | -0.211548 | 0.615551 | 1.202399 | 0.61274 | 0.467623 | 0.826743 | 0.98965 | 0.515379 | … | -0.035649 | 0.084653 | -0.148614 | 0.41456 | -0.035386 | 0.039774 | 0.222122 | 0.127807 | 0.212482 | -0.087575 | 0.149949 | -0.146337 | 0.031107 | 0.048564 | -0.151519 | -0.256957 | -0.147494 | -0.051771 | 0.000703 | -0.100694 | 0.127297 | -0.159605 | 0.056752 | 0.079783 | -0.301415 | -0.033567 | -0.073402 | 0.073441 | 0.003454 | -0.065908 | 0.003793 | 0.017154 | 0.122071 | 0.031753 | -0.115469 | -0.183939 | -0.037042 |
0.045477 | 0.020634 | 0.312316 | 1.316 | -0.831466 | -1.536956 | 0.495057 | -1.25451 | -0.417021 | 0.099831 | 0.010575 | 0.815467 | -0.793362 | -0.602823 | -0.470462 | -1.901034 | -0.749613 | -0.03417 | -0.349764 | -0.109558 | 0.50934 | 0.937879 | -0.567808 | -0.361403 | 0.07038 | 0.428986 | 0.178268 | -0.264072 | -1.08156 | 0.484804 | 0.257085 | -0.387199 | -0.594517 | -0.142474 | 0.364982 | 0.369385 | -0.033974 | … | 0.080806 | 0.047688 | 0.081428 | -0.072393 | -0.134251 | 0.32516 | -0.013819 | -0.231218 | 0.235347 | -0.099079 | -0.214146 | -0.088035 | 0.279149 | 0.235552 | 0.056753 | -0.002605 | -0.121467 | -0.011054 | 0.014276 | 0.031513 | 0.056525 | -0.204108 | 0.056208 | -0.007412 | 0.295334 | 0.059559 | -0.072717 | 0.143892 | -0.175082 | 0.06916 | -0.240234 | -0.243179 | 0.132553 | -0.10939 | -0.006807 | -0.081922 | -0.033631 |
-0.128473 | -0.163732 | 0.052351 | -3.2502 | 0.237454 | 0.327462 | 2.975345 | 1.074392 | -0.642075 | -0.309154 | -1.427569 | 0.209862 | -0.207053 | -0.785397 | -1.690689 | 0.57705 | 1.286289 | -0.260824 | -0.066723 | -0.378312 | -0.107758 | 0.58553 | 0.723803 | -0.085321 | -0.899026 | -0.508275 | 0.946614 | 0.681252 | 0.591428 | -0.058463 | -0.611216 | -0.249337 | 0.151805 | -0.201767 | -0.364704 | -0.279569 | 0.032865 | … | -0.103084 | -0.092279 | 0.061387 | -0.229078 | 0.214459 | 0.018508 | -0.164547 | 0.170245 | -0.028671 | -0.024243 | 0.116811 | 0.03172 | 0.010574 | 0.014084 | 0.15063 | -0.053592 | -0.297773 | -0.033743 | 0.264092 | -0.030906 | -0.04306 | -0.126682 | -0.050824 | -0.011592 | 0.082704 | -0.186133 | 0.172641 | -0.056459 | 0.190109 | 0.06259 | 0.093085 | -0.251115 | 0.141207 | 0.180379 | -0.006493 | -0.155394 | -0.013597 |
Finally, we can convert this to pandas if we want to perform analyses with that tool. Keep in mind that this loads the entire dataframe into memory.
data_only.to_pandas()
X_1 | X_2 | X_3 | X_4 | X_5 | X_6 | X_7 | X_8 | X_9 | X_10 | ... | X_250 | X_251 | X_252 | X_253 | X_254 | X_255 | X_256 | X_257 | X_258 | X_259 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.431689 | 0.121776 | -0.288611 | 1.199042 | -0.758412 | -0.466926 | -0.777705 | -0.081231 | -0.619822 | -1.271280 | ... | -0.103231 | 0.166172 | 0.268637 | -0.249552 | -0.125842 | -0.010658 | 0.148293 | -0.002996 | 0.018602 | 0.120415 |
1 | -0.286125 | -0.139647 | 0.521229 | -0.130772 | -0.392223 | -0.478905 | -2.190718 | -0.910039 | -0.923397 | -0.899920 | ... | 0.039593 | -0.251191 | -0.145978 | -0.061276 | 0.260967 | 0.136172 | 0.220407 | -0.016074 | 0.245930 | -0.051766 |
2 | 0.044537 | 0.093762 | 0.380710 | -0.078268 | -0.332677 | -0.492756 | -0.542440 | -0.751058 | 0.283140 | 0.772951 | ... | 0.073441 | 0.003454 | -0.065908 | 0.003793 | 0.017154 | 0.122071 | 0.031753 | -0.115469 | -0.183939 | -0.037042 |
3 | 0.045477 | 0.020634 | 0.312316 | 1.316000 | -0.831466 | -1.536956 | 0.495057 | -1.254510 | -0.417021 | 0.099831 | ... | 0.143892 | -0.175082 | 0.069160 | -0.240234 | -0.243179 | 0.132553 | -0.109390 | -0.006807 | -0.081922 | -0.033631 |
4 | -0.128473 | -0.163732 | 0.052351 | -3.250200 | 0.237454 | 0.327462 | 2.975345 | 1.074392 | -0.642075 | -0.309154 | ... | -0.056459 | 0.190109 | 0.062590 | 0.093085 | -0.251115 | 0.141207 | 0.180379 | -0.006493 | -0.155394 | -0.013597 |
5 rows × 259 columns