Dataset Overview

Dataset Collection

This collection comprises 4 datasets:

Principal dataset (cpg0016): 116k chemical and ~22k gene perturbations, split across 12 data-generating centers, all using human U2OS osteosarcoma cells. This includes JUMP-ORF, JUMP-CRISPR, and JUMP-compounds.
3 pilot datasets testing:
- Different perturbation conditions (cpg0000-jump-pilot, including different cell types)
- Staining conditions (cpg0001-cellpainting-protocol)
- Microscopes (cpg0002-jump-scope)

We chose U2OS (osteosarcoma) cells for our major data production work because phenotypes are as visible as, or more visible than, those in the few other lines we’ve tested, and because data already exist for this cell type (namely, cpg0012-wawer-bioactivecompoundprofiling).

Genetic Perturbations:
- CRISPR knockouts of ~8k genes (pooled guides targeting each gene are arrayed into plates)
- ORF (overexpression) reagents for ~12k unique genes, with ~5k that overlap with CRISPR targets
Note that these counts are based on JUMP Cell Painting IDs, so there may be some minor duplication of genes.
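The approximate gene counts above imply roughly how many distinct genes the two genetic modalities cover together. The following is a minimal sketch of that arithmetic using only the rounded figures quoted above; the variable names are illustrative, and the result inherits the caveat about possible duplicate IDs.

```python
# Approximate, rounded counts quoted above (in genes).
crispr_genes = 8_000
orf_genes = 12_000
overlap = 5_000  # genes targeted by both CRISPR and ORF reagents

# Inclusion-exclusion: genes covered by at least one modality.
unique_genes = crispr_genes + orf_genes - overlap
crispr_only = crispr_genes - overlap
orf_only = orf_genes - overlap

print(f"~{unique_genes:,} genes covered by at least one modality")
print(f"~{crispr_only:,} CRISPR-only, ~{orf_only:,} ORF-only")
```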
Chemical Perturbations:
- Partners exchanged ~115,795 compounds
- ~5 replicates of each compound
- Performed as 1-2 replicates at 3-5 different sites globally

JUMP-Target:
- 306 compounds and 160 corresponding genetic perturbations
- Designed to assess connectivity (gene-compound matching, based on annotated gene targets of each compound) in profiling assays
- Includes 384-well plate maps
- Documentation
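As an illustration of what gene-compound matching means here, the sketch below pairs each compound with the genetic perturbations that hit one of its annotated targets. The compound names, gene names, and the `matched_pairs` helper are all hypothetical and are not taken from the JUMP-Target annotations.

```python
# Hypothetical annotations: compound -> set of annotated gene targets.
compound_targets = {
    "compound_A": {"GENE1", "GENE2"},
    "compound_B": {"GENE3"},
    "compound_C": {"GENE9"},
}
# Hypothetical set of genes for which a genetic perturbation exists.
perturbed_genes = {"GENE1", "GENE3", "GENE4"}


def matched_pairs(targets, genes):
    """Return sorted (compound, gene) pairs where the gene is an annotated target."""
    return sorted(
        (compound, gene)
        for compound, annotated in targets.items()
        for gene in annotated & genes
    )


pairs = matched_pairs(compound_targets, perturbed_genes)
print(pairs)
```

A connectivity assessment would then ask whether the profiles of each matched pair are more similar than those of unmatched pairs.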
JUMP-MOA:
- 90 compounds in quadruplicate, laid out on a 384-well plate
- Represents 47 mechanism-of-action classes
- Designed for assessing connectivity between genes and compounds
- Documentation
Positive Controls:
- Set of 8 compounds per sample plate
- List of recommended controls

The experiments used an optimized Cell Painting protocol, published in Cimini et al. Nature Protocols 2023, which builds upon the original Bray et al. Nature Protocols 2016. For detailed implementation guidance, see the Cell Painting wiki.

From 12 sources (data-generating centers):

Images
- 5 channels (DNA, RNA, ER, AGP, Mito) per imaging site within a well
- Multiple sites (images) per well
CellProfiler Output
- Cell segmentation images
- Image-level quality metrics
Profile Data
- Single-cell level profiles
- Well-aggregated profiles
- Normalized features
- Well-aggregated profiles after feature selection applied
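The relationship between single-cell and well-aggregated profiles can be sketched as follows. This assumes aggregation by per-feature median, which is a common choice but is not stated in this overview; the well identifiers, feature names, and values are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical single-cell profiles: (well, {feature: value}).
single_cells = [
    ("A01", {"Cells_AreaShape_Area": 100.0, "Nuclei_Intensity_Mean": 0.2}),
    ("A01", {"Cells_AreaShape_Area": 120.0, "Nuclei_Intensity_Mean": 0.4}),
    ("A01", {"Cells_AreaShape_Area": 500.0, "Nuclei_Intensity_Mean": 0.3}),
    ("A02", {"Cells_AreaShape_Area": 90.0, "Nuclei_Intensity_Mean": 0.1}),
    ("A02", {"Cells_AreaShape_Area": 110.0, "Nuclei_Intensity_Mean": 0.5}),
]


def aggregate_by_well(cells):
    """Collapse single-cell rows into one median profile per well."""
    grouped = defaultdict(lambda: defaultdict(list))
    for well, features in cells:
        for name, value in features.items():
            grouped[well][name].append(value)
    return {
        well: {name: median(values) for name, values in feats.items()}
        for well, feats in grouped.items()
    }


well_profiles = aggregate_by_well(single_cells)
print(well_profiles["A01"]["Cells_AreaShape_Area"])
```

The median is used in this sketch because it is robust to outlier cells, such as the unusually large cell in well A01.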
Index
You can find the profile index here.
- Parquet tables in which profiles were preprocessed with varying optimized pipelines.
- The “Interpretable” tables are processed only up to the point where each feature still maps to its original name (relating to size, shape, intensity, etc.). The batch correction step transforms features into a new space in which they no longer reflect their original meanings, so the “Interpretable” profiles are those taken just before this step. They are not optimally aligned across batches, but they keep the original feature meanings.
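To see why features lose their original meaning once they are transformed into a new space, consider the sketch below. A 45-degree rotation stands in for the correction step; this is purely an illustrative assumption and not the method used to produce these tables. The feature names and values are hypothetical.

```python
import math

# Hypothetical "interpretable" profile: each value maps to a named feature.
feature_names = ["Cells_AreaShape_Area", "Nuclei_Intensity_Mean"]
profile = [2.0, 0.0]  # only the area feature is non-zero

# Stand-in for a correction step (assumption): a 45-degree rotation matrix.
angle = math.pi / 4
transform = [
    [math.cos(angle), -math.sin(angle)],
    [math.sin(angle), math.cos(angle)],
]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


transformed = matvec(transform, profile)

# Both output coordinates now depend on the area feature, so neither
# column can be labeled with a single original feature name any more.
print(dict(zip(feature_names, profile)))
print(transformed)
```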
Processed JUMP reference tables (JUMP_rr tables)
This dataset provides multiple precomputed analysis tables to make JUMP data exploration accessible:

‘X_features.parquet’ contains a ranking of the features that distinguish a given perturbation from negative controls.
‘X_gallery.parquet’ is for visualization of the images with all channels collapsed into one.
‘X_cosinesim…parquet’ contains the pairwise cosine similarity of all perturbations within a given dataset (i.e., orf, crispr). This allows searching for the closest matches for each perturbation of interest or looking at all relationships in a heatmap.
‘X…significance…parquet’ is the statistical significance for the phenotypic activity of a given sample (see broad.io/crispr_feature for a formal definition). It shows which perturbations yielded a phenotype distinguishable from negative controls.
‘full’ tables contain all the data points from the resulting analysis. Their non-full counterparts contain only the most significant entries and are meant for in-browser consumption and queries.
Many of the above tables can be viewed interactively using JUMPrr tools.
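For readers unfamiliar with the cosine similarity tables, the sketch below computes pairwise cosine similarity between a few profiles and looks up the closest match for one of them. The profiles, perturbation names, and helper functions are hypothetical; this illustrates the measure itself, not the pipeline that produced the tables.

```python
import math
from itertools import combinations

# Hypothetical perturbation profiles (feature vectors).
profiles = {
    "pert_A": [1.0, 0.0, 1.0],
    "pert_B": [2.0, 0.0, 2.0],
    "pert_C": [0.0, 1.0, 0.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Similarity for every unordered pair of perturbations.
pairwise = {
    (a, b): cosine_similarity(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}


def closest_match(name):
    """Return the other perturbation with the highest cosine similarity."""
    others = [p for p in profiles if p != name]
    return max(
        others,
        key=lambda other: cosine_similarity(profiles[name], profiles[other]),
    )


print(closest_match("pert_A"))
```

Because cosine similarity ignores vector magnitude, `pert_A` and `pert_B` score as near-identical even though one profile is twice the size of the other.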

The data are hosted in the Cell Painting Gallery (registry.opendata.aws/cellpainting-gallery/). Access and download are free through the AWS Open Data Program.

Many of the processed datasets and manifest files can be found associated with the Broad Institute Imaging Platform community.

How-to guides provided
APIs and libraries for programmatic access:
- cpgdata: Tool to generate index of the Cell Painting Gallery files, for faster querying of parquet files using database languages (such as SQL).
- jump-portrait: Fetch images using standard gene/compound names into a Python session or filesystem.
- jump-babel: Translate perturbation names and access very basic metadata.
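The idea of querying a file index with SQL, rather than listing the whole store, can be sketched with Python's built-in `sqlite3` module. This is a stand-in for illustration only and does not use or reproduce the cpgdata interface; the table schema, source labels, and file keys below are hypothetical.

```python
import sqlite3

# Hypothetical index rows: (dataset, source, file key). Not real gallery paths.
rows = [
    ("cpg0016", "source_1", "example/plate_1/profiles.parquet"),
    ("cpg0016", "source_2", "example/plate_2/profiles.parquet"),
    ("cpg0001", "source_1", "example/plate_3/profiles.parquet"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_index (dataset TEXT, source TEXT, file_key TEXT)"
)
conn.executemany("INSERT INTO file_index VALUES (?, ?, ?)", rows)

# Find every indexed file for one dataset with a single query.
query = "SELECT file_key FROM file_index WHERE dataset = ? ORDER BY file_key"
keys = [file_key for (file_key,) in conn.execute(query, ("cpg0016",))]
conn.close()

print(keys)
```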

Cross-modality matching is still being improved (the three modalities are ORF, CRISPR, and chemicals).
Some wells, plates, and sources were excluded for quality control.
Within-modality matching is generally reliable.

You can find more details here.

For the most current updates, subscribe to our email list.