# Retrieve JUMP profiles

In [1]:
import polars as pl
import requests

The JUMP Cell Painting project provides several processed datasets for
morphological profiling. Choose the dataset that matches your
perturbation type:

-   **`crispr`**: CRISPR knockout genetic perturbations
-   **`orf`**: Open Reading Frame (ORF) overexpression perturbations
-   **`compound`**: Chemical compound perturbations
-   **`all`**: Combined dataset containing all perturbation types (use
    for cross-modality comparisons)

Each dataset is available in two processing versions:

-   **Standard** (e.g., `crispr`, `compound`, `orf`): Fully processed
    including batch correction steps. **Recommended for most analyses**
    as they provide better cross-dataset comparability.

-   **Interpretable** (e.g., `crispr_interpretable`,
    `compound_interpretable`, `orf_interpretable`): Same initial
    processing but without batch correction transformations that modify
    the original feature space. Use these when you need to interpret
    individual morphological features.

All datasets are stored as Parquet files on AWS S3 and can be accessed
directly via their URLs.

The index file below contains the **recommended profiles** for each
subset. Each profile includes:

-   Direct links to the processing recipe and configuration used
-   ETags for data integrity verification
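As a rough illustration of how an ETag can serve as an integrity check, the sketch below compares an expected ETag with the one found in a set of HTTP response headers. The helper names and the `etag` manifest key used in the commented usage are assumptions, not part of the original notebook; check the manifest for the actual field name.

```python
from typing import Mapping


def normalize_etag(etag: str) -> str:
    """Drop the weak-validator prefix (W/) and the surrounding quotes."""
    etag = etag.strip()
    if etag.startswith("W/"):
        etag = etag[2:]
    return etag.strip('"')


def etag_matches(headers: Mapping[str, str], expected_etag: str) -> bool:
    """Return True if the response headers carry the expected ETag."""
    remote = headers.get("ETag", "")
    return bool(remote) and normalize_etag(remote) == normalize_etag(expected_etag)


# Hypothetical usage once the manifest is loaded (needs network access;
# "etag" is an assumed key name):
#   import requests
#   entry = profile_index[0]
#   head = requests.head(entry["url"], timeout=30)
#   print(etag_matches(head.headers, entry["etag"]))
```

A match only indicates that the remote object is the one the manifest describes; it does not by itself validate a local copy of the file.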

For details on creating your own profile manifests, see the [manifest
guide](https://github.com/broadinstitute/jump_hub/blob/main/howto/2_create_project_manifest.md).

In [2]:
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/main/manifests/profile_index.json"

We use the version-controlled manifest above to release the latest
corrected profiles.

In [3]:
# Load the JSON manifest, failing early on HTTP errors
response = requests.get(INDEX_FILE, timeout=30)
response.raise_for_status()
profile_index = response.json()

# Display the manifest data
for dataset in profile_index:
    print(f"- {dataset['subset']}: {dataset['url']}")

- orf: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles_assembled/ORF/v1.0a/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet
- crispr: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles_assembled/CRISPR/v1.0a/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet
- compound: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles_assembled/COMPOUND/v1.0/profiles_var_mad_int_featselect_harmony.parquet
- orf_interpretable: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles_assembled/ORF/v1.0a/profiles_wellpos_cc_var_mad_outlier.parquet
- crispr_interpretable: https://cellpainting-gallery.s3.amazonaws.com/cpg0016-jump-assembled/source_all/workspace/profiles_assembled/CRISPR/v1.0a/profiles_wellpos_cc_var_mad_outlier.parquet
- compound_interpretable: h

Each profile in the manifest includes direct links to:

-   **recipe_permalink**: The exact version of the processing code used
-   **config_permalink**: The specific configuration file that defines
    the processing steps

Let’s display the key information from the manifest:

In [4]:
# Convert JSON to DataFrame for better display
profile_df = pl.DataFrame(profile_index)

# Show key information in a clean table
display_df = profile_df.select(
    [
        "subset",
        pl.col("url").str.extract(r"([^/]+)\.parquet$").alias("filename"),
        pl.col("recipe_permalink")
        .str.extract(r"tree/([^/]+)$")
        .str.slice(0, 7)
        .alias("recipe_version"),
        pl.col("config_permalink").str.extract(r"([^/]+)\.json$").alias("config"),
    ]
)
display_df

Let’s inspect the standard profiles.

In [5]:
# Create dictionary of subset -> url for the standard profiles only
filepaths = {
    dataset["subset"]: dataset["url"]
    for dataset in profile_index
    if dataset["subset"] in ("crispr", "orf", "compound")
}
print("Selected profiles:")
for subset, url in filepaths.items():
    print(f"  {subset}: {url.split('/')[-1]}")

Selected profiles:
  orf: profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet
  crispr: profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet
  compound: profiles_var_mad_int_featselect_harmony.parquet
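The same filter can pick out the interpretable variants instead. The sketch below wraps it in a small helper; `select_profiles` and the placeholder manifest entries (including their URLs) are hypothetical and only illustrate the filtering logic.

```python
def select_profiles(profile_index, interpretable=False):
    """Map subset name -> URL for the standard or the interpretable profiles."""
    base = ("crispr", "orf", "compound")
    suffix = "_interpretable" if interpretable else ""
    wanted = {f"{name}{suffix}" for name in base}
    return {
        entry["subset"]: entry["url"]
        for entry in profile_index
        if entry["subset"] in wanted
    }


# Placeholder entries standing in for the real manifest (URLs are made up).
example_index = [
    {"subset": "crispr", "url": "https://example.org/crispr.parquet"},
    {"subset": "crispr_interpretable", "url": "https://example.org/crispr_raw.parquet"},
    {"subset": "orf", "url": "https://example.org/orf.parquet"},
]

standard_paths = select_profiles(example_index)
interpretable_paths = select_profiles(example_index, interpretable=True)
print(sorted(standard_paths))       # ['crispr', 'orf']
print(sorted(interpretable_paths))  # ['crispr_interpretable']
```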

We will lazy-load each dataframe and report its number of rows and
columns, its number of metadata columns, and an estimated size.

In [6]:
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = pl.scan_parquet(path)
    n_rows = data.select(pl.len()).collect().item()
    schema = data.collect_schema()
    metadata_cols = [col for col in schema.keys() if col.startswith("Metadata")]
    n_cols = schema.len()
    n_meta_cols = len(metadata_cols)
    estimated_size = int(round(4.03 * n_rows * n_cols / 1e6, 0))  # B -> MB
    for k, v in zip(info.keys(), (name, n_rows, n_cols, n_meta_cols, estimated_size)):
        info[k].append(v)

pl.DataFrame(info)
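The size estimate in the loop above is plain arithmetic: rows times columns times an average number of bytes per value, converted to megabytes. The factor 4.03 is taken from the notebook; reading it as roughly four bytes per 32-bit float plus a small overhead is an assumption here, and the helper name is hypothetical.

```python
def estimate_size_mb(n_rows: int, n_cols: int, bytes_per_value: float = 4.03) -> int:
    """Approximate in-memory size in megabytes (1 MB = 1e6 bytes)."""
    return int(round(bytes_per_value * n_rows * n_cols / 1e6))


# Illustrative numbers only, not the real dataset dimensions.
print(estimate_size_mb(n_rows=1_000_000, n_cols=100))  # 403
```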

Let us now focus on the `crispr` dataset and use a regex to select the
metadata columns. We will then sample five rows and display them. Note
that the `collect()` method executes the lazy query and loads its
result into memory.

In [7]:
data = pl.scan_parquet(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()

The following line selects every column except the metadata columns,
again sampling five rows:

In [8]:
data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only

Finally, we can convert this to `pandas` if we want to perform analyses
with that tool. Keep in mind that the conversion holds the entire
dataframe in memory, so sample or filter large datasets before
converting.

In [9]:
data_only.to_pandas()