Retrieve JUMP profiles

This is a tutorial on how to access profiles from the JUMP Cell Painting datasets. We will use polars to fetch the data frames lazily, with the help of s3fs and pyarrow. We prefer lazy loading because the data can be too big to be handled in memory.

import polars as pl

The shapes of the available datasets are:

cpg0016-jump[crispr]: CRISPR knockouts genetic perturbations.
cpg0016-jump[orf]: Overexpression genetic perturbations.
cpg0016-jump[compound]: Chemical perturbations.

Their explicit location is determined by the transformations that produce the datasets. The aws paths of the dataframes are built from a prefix below:

INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"

We use a version-controlled csv to release the latest corrected profiles

profile_index = pl.read_csv(INDEX_FILE)
profile_index.head()

shape: (5, 3)

subset	url	etag
str	str	str
"orf"	"https://cellpainting-gallery.s…	"c05a241135dcedda4e9cc639480b3f…
"crispr"	"https://cellpainting-gallery.s…	"4c59782c0dd5244f67d14323e83258…
"compound"	"https://cellpainting-gallery.s…	"1368a48ddbd4c44b1bfbc084591aaf…
"orf_interpretable"	"https://cellpainting-gallery.s…	"97b0c31d7d678ca2a5e2353df5799f…
"crispr_interpretable"	"https://cellpainting-gallery.s…	"90b08b824c06bcf16dfc5e788e74f0…

We do not need the ‘etag’ (used to check file integrity) column nor the ‘interpretable’ (i.e., before major modifications)

selected_profiles = profile_index.filter(
    pl.col("subset").is_in(("crispr", "orf", "compound"))
).select(pl.exclude("etag"))
filepaths = dict(selected_profiles.iter_rows())
print(filepaths)

{'orf': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet', 'crispr': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet', 'compound': 'https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int_featselect_harmony.parquet'}

We will lazy-load the dataframes and print the number of rows and columns

info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = pl.scan_parquet(path)
    n_rows = data.select(pl.len()).collect().item()
    schema = data.collect_schema()
    metadata_cols = [col for col in schema.keys() if col.startswith("Metadata")]
    n_cols = schema.len()
    n_meta_cols = len(metadata_cols)
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))  # B -> MB
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)

pl.DataFrame(info)

shape: (3, 5)

dataset	#rows	#cols	#Metadata cols	Size (MB)
str	i64	i64	i64	i64
"orf"	81660	726	4	239
"crispr"	51185	263	4	54
"compound"	803853	741	4	2400

Let us now focus on the crispr dataset and use a regex to select the metadata columns. We will then sample rows and display the overview. Note that the collect() method enforces loading some data into memory.

data = pl.scan_parquet(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()

shape: (5, 4)

Metadata_Source	Metadata_Plate	Metadata_Well	Metadata_JCP2022
str	str	str	str
"source_13"	"CP-CC9-R2-15"	"D02"	"JCP2022_800002"
"source_13"	"CP-CC9-R1-04"	"J18"	"JCP2022_800028"
"source_13"	"CP-CC9-R2-04"	"J09"	"JCP2022_807421"
"source_13"	"CP-CC9-R2-26"	"L14"	"JCP2022_807129"
"source_13"	"CP-CC9-R6-01"	"C12"	"JCP2022_806640"

The following line excludes the metadata columns:

data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only

shape: (5, 259)

X_1	X_2	X_3	X_4	X_5	X_6	X_7	X_8	X_9	X_10	X_11	X_12	X_13	X_14	X_15	X_16	X_17	X_18	X_19	X_20	X_21	X_22	X_23	X_24	X_25	X_26	X_27	X_28	X_29	X_30	X_31	X_32	X_33	X_34	X_35	X_36	X_37	…	X_223	X_224	X_225	X_226	X_227	X_228	X_229	X_230	X_231	X_232	X_233	X_234	X_235	X_236	X_237	X_238	X_239	X_240	X_241	X_242	X_243	X_244	X_245	X_246	X_247	X_248	X_249	X_250	X_251	X_252	X_253	X_254	X_255	X_256	X_257	X_258	X_259
f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	…	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32
0.431689	0.121776	-0.288611	1.199042	-0.758412	-0.466926	-0.777705	-0.081231	-0.619822	-1.27128	-0.373444	0.755662	-0.271196	-0.219682	0.268569	-0.831324	-0.916929	0.128514	0.202126	-0.448374	0.57358	-0.148984	-0.451346	-0.863105	-0.519879	-0.485649	0.067051	-0.461362	-0.87479	0.060438	-0.86988	-0.053304	0.479346	0.415922	0.55612	0.057157	-0.486731	…	0.070464	0.011686	-0.071482	0.047634	-0.137811	0.010114	-0.146834	0.028652	0.048453	0.015478	-0.371927	-0.318295	-0.07663	0.099552	-0.067174	0.324664	0.11507	0.07018	0.149843	0.090655	-0.024452	-0.167478	-0.063188	0.10028	-0.20603	-0.143531	-0.042267	-0.103231	0.166172	0.268637	-0.249552	-0.125842	-0.010658	0.148293	-0.002996	0.018602	0.120415
-0.286125	-0.139647	0.521229	-0.130772	-0.392223	-0.478905	-2.190718	-0.910039	-0.923397	-0.89992	0.809614	0.195752	1.051458	-0.586142	0.132069	0.691497	2.309921	0.451202	0.017881	0.722985	0.094764	0.458089	0.289687	-0.005019	-0.44384	-0.292192	-0.661437	-0.480588	-0.43835	0.392833	0.883042	-0.183804	-0.63443	0.088329	0.317562	0.790481	0.49558	…	0.12586	0.150716	0.092419	0.070398	-0.10096	0.241489	-0.02793	-0.069464	0.173498	0.096578	-0.006984	-0.010409	-0.122357	-0.154975	-0.264336	-0.026424	-0.107131	-0.217108	-0.076673	-0.025199	0.178872	0.273566	-0.011964	-0.284162	-0.07764	-0.147836	-0.030516	0.039593	-0.251191	-0.145978	-0.061276	0.260967	0.136172	0.220407	-0.016074	0.24593	-0.051766
0.044537	0.093762	0.38071	-0.078268	-0.332677	-0.492756	-0.54244	-0.751058	0.28314	0.772951	-0.344511	-0.291534	-0.64803	1.04816	0.814905	0.020586	-1.699232	-0.35928	0.474136	-0.500731	0.16648	0.460551	0.773349	-0.584125	0.070497	0.382738	1.290578	1.115024	0.656066	-0.211548	0.615551	1.202399	0.61274	0.467623	0.826743	0.98965	0.515379	…	-0.035649	0.084653	-0.148614	0.41456	-0.035386	0.039774	0.222122	0.127807	0.212482	-0.087575	0.149949	-0.146337	0.031107	0.048564	-0.151519	-0.256957	-0.147494	-0.051771	0.000703	-0.100694	0.127297	-0.159605	0.056752	0.079783	-0.301415	-0.033567	-0.073402	0.073441	0.003454	-0.065908	0.003793	0.017154	0.122071	0.031753	-0.115469	-0.183939	-0.037042
0.045477	0.020634	0.312316	1.316	-0.831466	-1.536956	0.495057	-1.25451	-0.417021	0.099831	0.010575	0.815467	-0.793362	-0.602823	-0.470462	-1.901034	-0.749613	-0.03417	-0.349764	-0.109558	0.50934	0.937879	-0.567808	-0.361403	0.07038	0.428986	0.178268	-0.264072	-1.08156	0.484804	0.257085	-0.387199	-0.594517	-0.142474	0.364982	0.369385	-0.033974	…	0.080806	0.047688	0.081428	-0.072393	-0.134251	0.32516	-0.013819	-0.231218	0.235347	-0.099079	-0.214146	-0.088035	0.279149	0.235552	0.056753	-0.002605	-0.121467	-0.011054	0.014276	0.031513	0.056525	-0.204108	0.056208	-0.007412	0.295334	0.059559	-0.072717	0.143892	-0.175082	0.06916	-0.240234	-0.243179	0.132553	-0.10939	-0.006807	-0.081922	-0.033631
-0.128473	-0.163732	0.052351	-3.2502	0.237454	0.327462	2.975345	1.074392	-0.642075	-0.309154	-1.427569	0.209862	-0.207053	-0.785397	-1.690689	0.57705	1.286289	-0.260824	-0.066723	-0.378312	-0.107758	0.58553	0.723803	-0.085321	-0.899026	-0.508275	0.946614	0.681252	0.591428	-0.058463	-0.611216	-0.249337	0.151805	-0.201767	-0.364704	-0.279569	0.032865	…	-0.103084	-0.092279	0.061387	-0.229078	0.214459	0.018508	-0.164547	0.170245	-0.028671	-0.024243	0.116811	0.03172	0.010574	0.014084	0.15063	-0.053592	-0.297773	-0.033743	0.264092	-0.030906	-0.04306	-0.126682	-0.050824	-0.011592	0.082704	-0.186133	0.172641	-0.056459	0.190109	0.06259	0.093085	-0.251115	0.141207	0.180379	-0.006493	-0.155394	-0.013597

Finally, we can convert this to pandas if we want to perform analyses with that tool. Keep in mind that this loads the entire dataframe into memory.

data_only.to_pandas()

	X_1	X_2	X_3	X_4	X_5	X_6	X_7	X_8	X_9	X_10	...	X_250	X_251	X_252	X_253	X_254	X_255	X_256	X_257	X_258	X_259
0	0.431689	0.121776	-0.288611	1.199042	-0.758412	-0.466926	-0.777705	-0.081231	-0.619822	-1.271280	...	-0.103231	0.166172	0.268637	-0.249552	-0.125842	-0.010658	0.148293	-0.002996	0.018602	0.120415
1	-0.286125	-0.139647	0.521229	-0.130772	-0.392223	-0.478905	-2.190718	-0.910039	-0.923397	-0.899920	...	0.039593	-0.251191	-0.145978	-0.061276	0.260967	0.136172	0.220407	-0.016074	0.245930	-0.051766
2	0.044537	0.093762	0.380710	-0.078268	-0.332677	-0.492756	-0.542440	-0.751058	0.283140	0.772951	...	0.073441	0.003454	-0.065908	0.003793	0.017154	0.122071	0.031753	-0.115469	-0.183939	-0.037042
3	0.045477	0.020634	0.312316	1.316000	-0.831466	-1.536956	0.495057	-1.254510	-0.417021	0.099831	...	0.143892	-0.175082	0.069160	-0.240234	-0.243179	0.132553	-0.109390	-0.006807	-0.081922	-0.033631
4	-0.128473	-0.163732	0.052351	-3.250200	0.237454	0.327462	2.975345	1.074392	-0.642075	-0.309154	...	-0.056459	0.190109	0.062590	0.093085	-0.251115	0.141207	0.180379	-0.006493	-0.155394	-0.013597

5 rows × 259 columns

Other Formats