Retrieve JUMP profiles

import polars as pl
import requests

The JUMP Cell Painting project provides several processed datasets for morphological profiling. Choose the dataset that matches your perturbation type:

crispr: CRISPR knockout genetic perturbations
orf: Open Reading Frame (ORF) overexpression perturbations
compound: Chemical compound perturbations
all: Combined dataset containing all perturbation types (use for cross-modality comparisons)

Each dataset is available in two processing versions:

Standard (e.g., crispr, compound, orf): Fully processed including batch correction steps. Recommended for most analyses as they provide better cross-dataset comparability.
Interpretable (e.g., crispr_interpretable, compound_interpretable, orf_interpretable): Same initial processing but without batch correction transformations that modify the original feature space. Use these when you need to interpret individual morphological features.

All datasets are stored as Parquet files on AWS S3 and can be accessed directly via their URLs.

The index file below contains the recommended profiles for each subset. Each profile includes: - Direct links to the processing recipe and configuration used - ETags for data integrity verification

For details on creating your own profile manifests, see the manifest guide.

INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/main/manifests/profile_index.json"

We use the version-controlled manifest above to release the latest corrected profiles

# Load the JSON manifest
response = requests.get(INDEX_FILE)
profile_index = response.json()

# Display the manifest data
for dataset in profile_index:
    print(f"- {dataset['subset']}: {dataset['url']}")

- orf: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet
- crispr: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet
- compound: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int_featselect_harmony.parquet
- orf_interpretable: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier.parquet
- crispr_interpretable: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier.parquet
- compound_interpretable: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int.parquet
- all: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet
- all_interpretable: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect.parquet

Each profile in the manifest includes direct links to: - recipe_permalink: The exact version of the processing code used - config_permalink: The specific configuration file that defines the processing steps

Let’s display the key information from the manifest:

# Convert JSON to DataFrame for better display
profile_df = pl.DataFrame(profile_index)

# Show key information in a clean table
display_df = profile_df.select(
    [
        "subset",
        pl.col("url").str.extract(r"([^/]+)\.parquet$").alias("filename"),
        pl.col("recipe_permalink")
        .str.extract(r"tree/([^/]+)$")
        .str.slice(0, 7)
        .alias("recipe_version"),
        pl.col("config_permalink").str.extract(r"([^/]+)\.json$").alias("config"),
    ]
)
display_df

shape: (8, 4)

subset	filename	recipe_version	config
str	str	str	str
"orf"	"profiles_wellpos_cc_var_mad_ou…	"a917fa7"	"orf"
"crispr"	"profiles_wellpos_cc_var_mad_ou…	"a917fa7"	"crispr"
"compound"	"profiles_var_mad_int_featselec…	"a917fa7"	"compound"
"orf_interpretable"	"profiles_wellpos_cc_var_mad_ou…	"a917fa7"	"orf"
"crispr_interpretable"	"profiles_wellpos_cc_var_mad_ou…	"a917fa7"	"crispr"
"compound_interpretable"	"profiles_var_mad_int"	"a917fa7"	"compound"
"all"	"profiles_wellpos_cc_var_mad_ou…	"0224e0f"	"pipeline_2"
"all_interpretable"	"profiles_wellpos_cc_var_mad_ou…	"0224e0f"	"pipeline_2"

Let inspect the standard profiles.

# Create dictionary of subset -> url for the standard profiles only
filepaths = {
    dataset["subset"]: dataset["url"]
    for dataset in profile_index
    if dataset["subset"] in ("crispr", "orf", "compound")
}
print("Selected profiles:")
for subset, url in filepaths.items():
    print(f"  {subset}: {url.split('/')[-1]}")

Selected profiles:
  orf: profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet
  crispr: profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet
  compound: profiles_var_mad_int_featselect_harmony.parquet

We will lazy-load the dataframes and print the number of rows and columns

info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = pl.scan_parquet(path)
    n_rows = data.select(pl.len()).collect().item()
    schema = data.collect_schema()
    metadata_cols = [col for col in schema.keys() if col.startswith("Metadata")]
    n_cols = schema.len()
    n_meta_cols = len(metadata_cols)
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))  # B -> MB
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)

pl.DataFrame(info)

shape: (3, 5)

dataset	#rows	#cols	#Metadata cols	Size (MB)
str	i64	i64	i64	i64
"orf"	81660	726	4	239
"crispr"	51185	263	4	54
"compound"	803853	741	4	2400

Let us now focus on the crispr dataset and use a regex to select the metadata columns. We will then sample rows and display the overview. Note that the collect() method enforces loading some data into memory.

data = pl.scan_parquet(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()

shape: (5, 4)

Metadata_Source	Metadata_Plate	Metadata_Well	Metadata_JCP2022
str	str	str	str
"source_13"	"CP-CC9-R2-15"	"D02"	"JCP2022_800002"
"source_13"	"CP-CC9-R1-04"	"J18"	"JCP2022_800028"
"source_13"	"CP-CC9-R2-04"	"J09"	"JCP2022_807421"
"source_13"	"CP-CC9-R2-26"	"L14"	"JCP2022_807129"
"source_13"	"CP-CC9-R6-01"	"C12"	"JCP2022_806640"

The following line excludes the metadata columns:

data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only

shape: (5, 259)

X_1	X_2	X_3	X_4	X_5	X_6	X_7	X_8	X_9	X_10	X_11	X_12	X_13	X_14	X_15	X_16	X_17	X_18	X_19	X_20	X_21	X_22	X_23	X_24	X_25	X_26	X_27	X_28	X_29	X_30	X_31	X_32	X_33	X_34	X_35	X_36	X_37	…	X_223	X_224	X_225	X_226	X_227	X_228	X_229	X_230	X_231	X_232	X_233	X_234	X_235	X_236	X_237	X_238	X_239	X_240	X_241	X_242	X_243	X_244	X_245	X_246	X_247	X_248	X_249	X_250	X_251	X_252	X_253	X_254	X_255	X_256	X_257	X_258	X_259
f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	…	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32
0.431689	0.121776	-0.288611	1.199042	-0.758412	-0.466926	-0.777705	-0.081231	-0.619822	-1.27128	-0.373444	0.755662	-0.271196	-0.219682	0.268569	-0.831324	-0.916929	0.128514	0.202126	-0.448374	0.57358	-0.148984	-0.451346	-0.863105	-0.519879	-0.485649	0.067051	-0.461362	-0.87479	0.060438	-0.86988	-0.053304	0.479346	0.415922	0.55612	0.057157	-0.486731	…	0.070464	0.011686	-0.071482	0.047634	-0.137811	0.010114	-0.146834	0.028652	0.048453	0.015478	-0.371927	-0.318295	-0.07663	0.099552	-0.067174	0.324664	0.11507	0.07018	0.149843	0.090655	-0.024452	-0.167478	-0.063188	0.10028	-0.20603	-0.143531	-0.042267	-0.103231	0.166172	0.268637	-0.249552	-0.125842	-0.010658	0.148293	-0.002996	0.018602	0.120415
-0.286125	-0.139647	0.521229	-0.130772	-0.392223	-0.478905	-2.190718	-0.910039	-0.923397	-0.89992	0.809614	0.195752	1.051458	-0.586142	0.132069	0.691497	2.309921	0.451202	0.017881	0.722985	0.094764	0.458089	0.289687	-0.005019	-0.44384	-0.292192	-0.661437	-0.480588	-0.43835	0.392833	0.883042	-0.183804	-0.63443	0.088329	0.317562	0.790481	0.49558	…	0.12586	0.150716	0.092419	0.070398	-0.10096	0.241489	-0.02793	-0.069464	0.173498	0.096578	-0.006984	-0.010409	-0.122357	-0.154975	-0.264336	-0.026424	-0.107131	-0.217108	-0.076673	-0.025199	0.178872	0.273566	-0.011964	-0.284162	-0.07764	-0.147836	-0.030516	0.039593	-0.251191	-0.145978	-0.061276	0.260967	0.136172	0.220407	-0.016074	0.24593	-0.051766
0.044537	0.093762	0.38071	-0.078268	-0.332677	-0.492756	-0.54244	-0.751058	0.28314	0.772951	-0.344511	-0.291534	-0.64803	1.04816	0.814905	0.020586	-1.699232	-0.35928	0.474136	-0.500731	0.16648	0.460551	0.773349	-0.584125	0.070497	0.382738	1.290578	1.115024	0.656066	-0.211548	0.615551	1.202399	0.61274	0.467623	0.826743	0.98965	0.515379	…	-0.035649	0.084653	-0.148614	0.41456	-0.035386	0.039774	0.222122	0.127807	0.212482	-0.087575	0.149949	-0.146337	0.031107	0.048564	-0.151519	-0.256957	-0.147494	-0.051771	0.000703	-0.100694	0.127297	-0.159605	0.056752	0.079783	-0.301415	-0.033567	-0.073402	0.073441	0.003454	-0.065908	0.003793	0.017154	0.122071	0.031753	-0.115469	-0.183939	-0.037042
0.045477	0.020634	0.312316	1.316	-0.831466	-1.536956	0.495057	-1.25451	-0.417021	0.099831	0.010575	0.815467	-0.793362	-0.602823	-0.470462	-1.901034	-0.749613	-0.03417	-0.349764	-0.109558	0.50934	0.937879	-0.567808	-0.361403	0.07038	0.428986	0.178268	-0.264072	-1.08156	0.484804	0.257085	-0.387199	-0.594517	-0.142474	0.364982	0.369385	-0.033974	…	0.080806	0.047688	0.081428	-0.072393	-0.134251	0.32516	-0.013819	-0.231218	0.235347	-0.099079	-0.214146	-0.088035	0.279149	0.235552	0.056753	-0.002605	-0.121467	-0.011054	0.014276	0.031513	0.056525	-0.204108	0.056208	-0.007412	0.295334	0.059559	-0.072717	0.143892	-0.175082	0.06916	-0.240234	-0.243179	0.132553	-0.10939	-0.006807	-0.081922	-0.033631
-0.128473	-0.163732	0.052351	-3.2502	0.237454	0.327462	2.975345	1.074392	-0.642075	-0.309154	-1.427569	0.209862	-0.207053	-0.785397	-1.690689	0.57705	1.286289	-0.260824	-0.066723	-0.378312	-0.107758	0.58553	0.723803	-0.085321	-0.899026	-0.508275	0.946614	0.681252	0.591428	-0.058463	-0.611216	-0.249337	0.151805	-0.201767	-0.364704	-0.279569	0.032865	…	-0.103084	-0.092279	0.061387	-0.229078	0.214459	0.018508	-0.164547	0.170245	-0.028671	-0.024243	0.116811	0.03172	0.010574	0.014084	0.15063	-0.053592	-0.297773	-0.033743	0.264092	-0.030906	-0.04306	-0.126682	-0.050824	-0.011592	0.082704	-0.186133	0.172641	-0.056459	0.190109	0.06259	0.093085	-0.251115	0.141207	0.180379	-0.006493	-0.155394	-0.013597

Finally, we can convert this to pandas if we want to perform analyses with that tool. Keep in mind that this loads the entire dataframe into memory.

data_only.to_pandas()

	X_1	X_2	X_3	X_4	X_5	X_6	X_7	X_8	X_9	X_10	...	X_250	X_251	X_252	X_253	X_254	X_255	X_256	X_257	X_258	X_259
0	0.431689	0.121776	-0.288611	1.199042	-0.758412	-0.466926	-0.777705	-0.081231	-0.619822	-1.271280	...	-0.103231	0.166172	0.268637	-0.249552	-0.125842	-0.010658	0.148293	-0.002996	0.018602	0.120415
1	-0.286125	-0.139647	0.521229	-0.130772	-0.392223	-0.478905	-2.190718	-0.910039	-0.923397	-0.899920	...	0.039593	-0.251191	-0.145978	-0.061276	0.260967	0.136172	0.220407	-0.016074	0.245930	-0.051766
2	0.044537	0.093762	0.380710	-0.078268	-0.332677	-0.492756	-0.542440	-0.751058	0.283140	0.772951	...	0.073441	0.003454	-0.065908	0.003793	0.017154	0.122071	0.031753	-0.115469	-0.183939	-0.037042
3	0.045477	0.020634	0.312316	1.316000	-0.831466	-1.536956	0.495057	-1.254510	-0.417021	0.099831	...	0.143892	-0.175082	0.069160	-0.240234	-0.243179	0.132553	-0.109390	-0.006807	-0.081922	-0.033631
4	-0.128473	-0.163732	0.052351	-3.250200	0.237454	0.327462	2.975345	1.074392	-0.642075	-0.309154	...	-0.056459	0.190109	0.062590	0.093085	-0.251115	0.141207	0.180379	-0.006493	-0.155394	-0.013597

5 rows × 259 columns

Other Formats