Cell Painting Gallery folder structure

All projects in the Cell Painting Gallery follow a stereotyped folder structure. The top-level structure is as follows.

cellpainting-gallery
└── <project>
    └── <source>
        ├── images
        ├── workspace
        └── workspace_dl

<project>: top level folder for the project. Keep the name short and simple with [a-z0-9_] only.
<source>: additional nesting level that is an identifier for the contributing institution. It should be present even if the data is from a single source (e.g. s3://cellpainting-gallery/cpg0003-rosetta/ only contains broad/). It can be anonymized (e.g. s3://cellpainting-gallery/cpg0016-jump/ contains source_1/, source_2/, etc.). It can also indicate that it contains data aggregated from multiple sources (e.g. s3://cellpainting-gallery/cpg0016-jump/ contains assembled).
images: all images and illumination correction functions.
workspace: everything else that results from CellProfiler-based features goes here.
workspace_dl: everything else that results from deep learning-based features goes here.
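
As a rough illustration of the nesting above, the following sketch splits a gallery URI into its project, source, and top-level folder. The function and class names are hypothetical and not part of any Cell Painting Gallery tooling; the only facts taken from this page are the bucket name and the three top-level folder names.

```python
from typing import NamedTuple

BUCKET_PREFIX = "s3://cellpainting-gallery/"
TOP_LEVEL_FOLDERS = ("images", "workspace", "workspace_dl")


class GalleryPath(NamedTuple):
    project: str    # e.g. "cpg0016-jump"
    source: str     # e.g. "source_4"
    top_level: str  # one of TOP_LEVEL_FOLDERS
    rest: str       # everything below the top-level folder ("" if nothing)


def parse_gallery_path(uri: str) -> GalleryPath:
    """Split a gallery URI into project, source, top-level folder, and remainder."""
    if not uri.startswith(BUCKET_PREFIX):
        raise ValueError(f"not a cellpainting-gallery URI: {uri}")
    parts = uri[len(BUCKET_PREFIX):].strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(f"expected <project>/<source>/<folder> in: {uri}")
    project, source, top_level = parts[:3]
    if top_level not in TOP_LEVEL_FOLDERS:
        raise ValueError(f"unexpected top-level folder: {top_level}")
    return GalleryPath(project, source, top_level, "/".join(parts[3:]))
```

Because not all projects have all three top-level folders, a successful parse only says the path is well formed, not that the folder exists.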

Not all projects will have all parent structures.

The “completeness” of a project can be checked using our data validator. Please note that it is in alpha and further functionality and documentation are under development.

`images` folder structure#

cellpainting-gallery/
└── <project>
    └── <project-specific-nesting>
        ├── images
        │   ├── YYYY_MM_DD_<batch-name>
        │   │   ├── illum
        │   │   │   ├── <plate-name>
        │   │   │   │   ├── <plate-name>_Illum<Channel>.npy
        │   │   │   │   └── <plate-name>_Illum<Channel>.npy
        │   │   │   └── <plate-name>
        │   │   └── images
        │   │       ├── <full-plate-name>
        │   │       └── <full-plate-name>
        │   └── YYYY_MM_DD_<batch-name>
        └── workspace

Within the outer images folder, there are YYYY_MM_DD_<batch-name> subfolders for each batch. Each batch folder typically starts with YYYY_MM_DD of the date that image acquisition started. The rest of the batch folder name can be a simple ordinal (e.g. YYYY_MM_DD_Batch1) or more descriptive of its contents (e.g. 2020_01_02_TestPhalloidinConcentration). A single batch typically contains all of the plates that were imaged (or started acquisition) on that day. However, for simplifying project tracking and analysis, sometimes plates imaged on the same day are divided into multiple batches where each batch is a different experimental condition (e.g. 2020_01_02_LowPhalloidin and 2020_01_02_HighPhalloidin).
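
The batch naming convention can be checked with a small sketch like the one below. It assumes the typical YYYY_MM_DD_<batch-name> form; since batch folders only "typically" follow it, the hypothetical helper returns None rather than raising when a name does not match.

```python
import re
from datetime import date
from typing import Optional, Tuple

# YYYY_MM_DD_<batch-name>, as described above
BATCH_PATTERN = re.compile(r"^(\d{4})_(\d{2})_(\d{2})_(.+)$")


def parse_batch_folder(name: str) -> Optional[Tuple[date, str]]:
    """Return (acquisition start date, batch label), or None if the name is atypical."""
    match = BATCH_PATTERN.match(name)
    if match is None:
        return None
    year, month, day, label = match.groups()
    try:
        return date(int(year), int(month), int(day)), label
    except ValueError:
        # Digits were present but do not form a real calendar date
        return None
```

For example, 2020_01_02_LowPhalloidin parses to 2 January 2020 with the label LowPhalloidin, while a name without underscores in the date part (such as 20210422_6W_CP257) yields None.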

Arrayed Cell Painting experiments#

Most Cell Painting experiments are arrayed, meaning that each well contains a single perturbation and therefore every cell within that well has the same perturbation. For arrayed experiments, within each YYYY_MM_DD_<batch-name> batch subfolder there is typically an illum and an images folder.

The images folder contains a <full-plate-name> folder for each plate imaged in that batch. The structure beneath the <full-plate-name> folder depends on your imager, but it should contain all the images from the plate, and perhaps some other related metadata generated by the imager. Note that the microscope used for many datasets within the Cell Painting Gallery creates an Images folder nested below the <full-plate-name> folder. This folder is imager-specific and should not be confused with the two higher-level images folders that are part of the required folder nesting.

The illum folder contains a <plate-name> folder for each plate imaged in that batch. The <plate-name> can match the <full-plate-name> or it can be truncated if the <full-plate-name> is long. Note that the relationship between <full-plate-name> and <plate-name> needs to be immediately obvious and the <plate-name> still needs to be a unique identifier (e.g. full-plate-name is BR00117035__2021-05-02T16_02_51-Measurement1 and plate-name is BR00117035). Additionally, the <plate-name> used in the images folder must match that used in the workspace folder. Within each <plate-name> folder there are illumination correction functions for all channels imaged in that plate, as generated by CellProfiler. The illumination correction functions are named <plate-name>_Illum<Channel>.npy.
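
One way to check that truncated plate names remain unique is sketched below. The helper name is hypothetical, and splitting on a double underscore is an assumption that fits the full plate names shown on this page; other imagers may need a different rule.

```python
from typing import Dict, Iterable


def build_plate_name_map(
    full_plate_names: Iterable[str], separator: str = "__"
) -> Dict[str, str]:
    """Map each <full-plate-name> to a truncated <plate-name>, enforcing uniqueness."""
    mapping: Dict[str, str] = {}
    seen: Dict[str, str] = {}  # truncated name -> full name that produced it
    for full_name in full_plate_names:
        # Assumption: everything before the first separator identifies the plate
        short_name = full_name.split(separator, 1)[0]
        if short_name in seen:
            raise ValueError(
                f"{full_name} and {seen[short_name]} both truncate to {short_name}"
            )
        seen[short_name] = full_name
        mapping[full_name] = short_name
    return mapping
```

A collision raises an error instead of silently overwriting an entry, because a non-unique <plate-name> would make the illum and workspace folders ambiguous.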

Note that the images folder contains the raw images as they come off of the microscope. Though images undergo manipulation before analysis (e.g. application of illumination correction functions), intermediate, processed images are not typically saved or uploaded. However, all of the information necessary to replicate the manipulation should be found in the <project-specific-nesting> folder (for a typical arrayed Cell Painting experiment this is just the illumination correction function). For atypical experiments in which the images undergo more extensive manipulation and for which replicating those manipulations is challenging or prohibitive, additional folders of images may be uploaded. Those folders will follow the format images_<manipulation-description>. The most common example of atypical image manipulation in an arrayed experiment is when images are acquired in z-stacks and then max-projected and saved as images_projected before undergoing analysis.

An example of what this looks like in practice for a typical arrayed Cell Painting experiment is below.

cellpainting-gallery
└── cpg0016-jump
    ├── source_1
    ├── source_2
    ├── source_3
    └── source_4
        ├── images
        │   ├── 2021_04_26_Batch1
        │   │   ├── illum
        │   │   │   ├── BR00117035
        │   │   │   │   ├── BR00117035_IllumAGP.npy
        │   │   │   │   ├── BR00117035_IllumBrightfield.npy
        │   │   │   │   ├── BR00117035_IllumBrightfield_H.npy
        │   │   │   │   ├── BR00117035_IllumBrightfield_L.npy
        │   │   │   │   ├── BR00117035_IllumDNA.npy
        │   │   │   │   ├── BR00117035_IllumER.npy
        │   │   │   │   ├── BR00117035_IllumMito.npy
        │   │   │   │   └── BR00117035_IllumRNA.npy
        │   │   │   └── BR00117036
        │   │   └── images
        │   │       ├── BR00117035__2021-05-02T16_02_51-Measurement1
        │   │       └── BR00117036__2021-05-02T18_01_40-Measurement1
        │   └── 2021_05_31_Batch2
        └── workspace

cpg0016-jump is the project folder.
source_4 is the anonymized nesting folder, representing Broad’s data. Note that there are multiple sources in this project, though a nesting folder is still required even if your project doesn’t have multiple sources.
2021_04_26_Batch1 is the batch folder. Note that there are multiple batches of data acquired on different days in this project.
There are two plates in this example. BR00117035__2021-05-02T16_02_51-Measurement1 is the plate name as it comes off the microscope. This naming may differ with different microscopes and different acquisition configurations.
BR00117035 is the truncated plate name that we have given to BR00117035__2021-05-02T16_02_51-Measurement1 that is used for naming the plate in the illum folder (and the workspace folder, discussed below).
In the illum folder, within the BR00117035 plate folder, there are 8 separate illumination correction functions, one for each of the 8 channels imaged in that plate (e.g. BR00117035_IllumAGP.npy is the correction function for the AGP channel.)

Pooled Cell Painting experiments#

Some Cell Painting experiments are pooled, meaning that each well contains multiple perturbations and therefore the identity of a cell’s perturbation requires additional disambiguation, typically through optically reading a barcode. For pooled experiments, within each YYYY_MM_DD_<batch-name> batch subfolder there are at least an illum and an images folder, though likely more given that the complexity of handling pooled experiments often requires generating intermediate images.

The images folder contains the raw images as they come off of the microscope. The images folder contains a <full-plate-name> folder for each plate imaged in that batch. Within each <full-plate-name> folder are subfolders for each cycle of imaging: one or more rounds of Cell Painting and each round of barcode acquisition. Because barcoding and Cell Painting images can be acquired at different magnifications, the cycle folders start with the magnification at acquisition (e.g. 10X_ or 20X_). Barcoding cycle folders should end with _SBS-<n> where n is the barcoding cycle (e.g. _SBS-1 or _SBS-12). Cell Painting folders should contain _CP_ to indicate they are Cell Painting folders, and if multiple cycles of Cell Painting are done then this should be indicated in the folder name as well (e.g. round_1 or round_2). The internal structure of the cycle folders depends on your imager, but it should contain all the images from the plate, and perhaps some other related metadata generated by the imager.

The illum folder contains a <plate-name> folder for each plate imaged in that batch. The <plate-name> can match the <full-plate-name> or it can be truncated if the <full-plate-name> is long. Note that the relationship between <full-plate-name> and <plate-name> needs to be immediately obvious and the <plate-name> still needs to be a unique identifier (e.g. full-plate-name is BR00117035__2021-05-02T16_02_51-Measurement1 and plate-name is BR00117035). Additionally, the <plate-name> used in the images folder must match that used in the workspace folder. Within each <plate-name> folder there are illumination correction functions for all channels imaged in that plate, as generated by CellProfiler. For Cell Painting images, the illumination correction functions are named <plate-name>_Illum<Channel>.npy. If multiple rounds of Cell Painting occurred and there was a duplication of a painting channel between rounds, append _Round<n> to the end of the illum name (e.g. BR00117035_IllumDNA_Round1). For Barcoding images, the illumination correction functions are named <plate-name>_Cycle<n>_Illum<channel>.npy where the channel is the nucleotide it corresponds to (A, C, T, or G) or the label used for alignment (e.g. DNA).
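
The naming rules for pooled illumination correction functions can be summarized in a short sketch. The function names are hypothetical, and placing the .npy extension after the _Round<n> suffix is an assumption, since the example above shows the suffix without an extension.

```python
from typing import Optional


def painting_illum_filename(
    plate_name: str, channel: str, round_number: Optional[int] = None
) -> str:
    """<plate-name>_Illum<Channel>.npy, with _Round<n> appended when given."""
    name = f"{plate_name}_Illum{channel}"
    if round_number is not None:
        # Only needed when a painting channel is duplicated between rounds
        name += f"_Round{round_number}"
    return name + ".npy"  # assumption: the extension follows the round suffix


def barcoding_illum_filename(plate_name: str, cycle: int, channel: str) -> str:
    """<plate-name>_Cycle<n>_Illum<channel>.npy; channel is A, C, T, G, or e.g. DNA."""
    return f"{plate_name}_Cycle{cycle}_Illum{channel}.npy"
```

For instance, barcoding_illum_filename("CP257A", 1, "A") gives CP257A_Cycle1_IllumA.npy, which matches the file names used in the pooled example on this page.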

Because Pooled Cell Painting experiments may require more extensive image manipulation and replicating those manipulations is challenging or prohibitive, additional folders of images may be uploaded. Those folders will follow the format images_<manipulation-description>. The substructure of non-standard folders may vary greatly based on the specific workflow. Some examples are shown below.

An example of what this looks like in practice for a pooled Cell Painting experiment is below.

cellpainting-gallery
└── cpg0021-periscope
    └── broad
        ├── images
        │   ├── 20200805_A549_WG_Screen/
        │   └── 20210422_6W_CP257
        │      ├── illum
        │      │   ├── CP257A
        │      │   │   ├── CP257A_IllumAGP.npy
        │      │   │   ├── CP257A_IllumDNA.npy
        │      │   │   ├── CP257A_IllumER.npy
        │      │   │   ├── CP257A_IllumMito.npy
        │      │   │   ├── CP257A_IllumRNA.npy
        │      │   │   ├── CP257A_Cycle1_IllumA.npy
        │      │   │   ├── CP257A_Cycle1_IllumC.npy
        │      │   │   ├── CP257A_Cycle1_IllumDNA.npy
        │      │   │   ├── CP257A_Cycle1_IllumG.npy
        │      │   │   └── CP257A_Cycle1_IllumT.npy
        │      │   └── CP257B
        │      ├── images_aligned
        │      ├── images_corrected_cropped
        │      │    ├── CP257A-Well1-0
        │      │    │    ├── CorrDNA
        │      │    │    │    ├──CorrDNA_Site_1.tiff
        │      │    │    │    └──CorrDNA_Site_10.tiff
        │      │    │    └── Cycle01_A
        │      │    └── CP257A-Well1-1
        │      ├── images_corrected
        │      │    ├── barcoding
        │      │    │    ├── CP257A-Well1-0
        │      │    │    │    ├──Plate_CP257A_Well_1_Site_0_Cycle01_A.tiff
        │      │    │    │    └──Plate_CP257A_Well_1_Site_0_Cycle01_C.tiff
        │      │    │    └── CP257A-Well1-1
        │      │    └── painting
        │      └── images
        │          ├── CP257A
        │          │   ├── 10X_c1-SBS-1
        │          │   │    ├──Well1_Point1_0000_ChannelDAPI,Cy3,A594,Cy5,Cy7_Seq0000.nd2
        │          │   │    └──Well1_Point1_0001_ChannelDAPI,Cy3,A594,Cy5,Cy7_Seq0001.nd2
        │          │   ├── 10X_c2-SBS-2
        │          │   └── 20X_CP_CP257A
        │          └── CP257B
        └── workspace

cpg0021-periscope is the project folder.
broad is the source of the data.
20210422_6W_CP257 is the batch folder. Note that there are multiple batches of data acquired on different days in this project.
There are two plates in this example, CP257A and CP257B.
In the illum folder, within the CP257A plate folder, there are many illumination correction functions, one for each of the channels imaged during Cell Painting and one for each plate, channel, and cycle of barcode acquisition.
The images_corrected_cropped folder follows one example custom structure of nesting by Plate-Well-Site, then Channel/Cycle.
The images_corrected folder follows a different example custom structure of nesting by arm, then Plate-Well-Site.
The images folder has two example plates. Within the CP257A plate folder we show folders for two cycles of barcode acquisition at 10X magnification (10X_c1-SBS-1 and 10X_c2-SBS-2) and one round of Cell Painting at 20X magnification (20X_CP_CP257A).

`workspace` folder structure#

Let’s look under the workspace folder. Everything but images lives here. These folders are produced when following the data processing steps in the Image-based Profiling Handbook. Below are the minimally required top-level folders under workspace. Note that some experiments may generate additional categories of data/metadata and these should be uploaded to the workspace folder in their own folder/s.

cellpainting-gallery/
└── cpg0016-jump
    └── source_4
        ├── images
        └── workspace
            ├── analysis
            ├── backend
            ├── load_data_csv
            ├── metadata
            └── profiles

analysis: contains the CSV files and optionally object outline PNGs generated by CellProfiler
backend: contains the single-cell SQLite files (one per plate), the well-level aggregated profiles CSV files (also one per plate)
load_data_csv: contains LoadData CSV files used by CellProfiler to process the data
metadata: contains metadata files used to annotate the profiles
profiles: contains a set of well-level profiles files (one set per plate). The set comprises different stages of the CSV files produced when running the profiling recipe, as well as other output.

Examples of additional optional folders you may upload to workspace include:

assaydev: work used to test/optimize segmentation parameters
embeddings: embeddings generated from deep learning models
pipelines: the CellProfiler .cppipe or .cpproj files used
profiles_assembled: versioned profiles processed across multiple batches or sources
qc: quality control data
segmentation: optimized segmentations generated independently of the analysis pipeline
software: scripts used while handling the batch

`analysis` folder structure#

Within the analysis folder is a folder for each batch, and within each batch folder is a folder for each plate. Within the plate folder is an additional analysis folder. It is the only folder at this level; it is redundant and somewhat confusingly named, but we have kept it for legacy reasons.

Within the nested analysis folder, data is typically saved in <plate>-<well>-<site> subfolders with a .csv for each object measured (e.g. Cells.csv) and for experimental details (Experiment.csv) and whole image measurements (Image.csv) from that single site. However, the grouping can vary depending on how the grouping was performed for the CellProfiler run (e.g. an experiment grouped by well instead of site would generate <plate>-<well> folders with the .csvs containing all of the data from the well in each .csv).

Often there is an additional folder such as outlines that contains object outlines or masks containing object masks. These are the actual masks/outlines that were generated during the CellProfiler analysis run and used for analysis.

└── analysis
    ├── 2021_04_26_Batch1
    │   ├── BR00117035
    │   │   └── analysis
    │   │       ├── BR00117035-A01-1
    │   │       │   ├── Cells.csv
    │   │       │   ├── Cytoplasm.csv
    │   │       │   ├── Experiment.csv
    │   │       │   ├── Image.csv
    │   │       │   ├── Nuclei.csv
    │   │       │   └── outlines
    │   │       │       ├── A01_s1--cell_outlines.png
    │   │       │       └── A01_s1--nuclei_outlines.png
    │   │       └── BR00117035-A01-2
    │   └── BR00117036
    └── 2021_05_31_Batch2

In this example batch:

2021_04_26_Batch1 is the batch and BR00117035 is the plate
BR00117035-A01-1 is a folder containing CSV files and outline files for site 1 in well A01 in plate BR00117035. Less granular folders are also acceptable (e.g. BR00117035-A01 containing CSV files for the whole well and outline files for each site in the well).
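
The grouping folder names can be interpreted with a sketch like the following. It is a hypothetical helper that takes the plate name from the parent folder, so that hyphens inside a plate name do not confuse the split, and it assumes the site, when present, is an integer.

```python
from typing import NamedTuple, Optional


class AnalysisGroup(NamedTuple):
    plate: str
    well: str
    site: Optional[int]  # None for well-level (<plate>-<well>) folders


def parse_analysis_folder(folder_name: str, plate: str) -> AnalysisGroup:
    """Parse <plate>-<well>-<site> or <plate>-<well>, given the known plate name."""
    prefix = plate + "-"
    if not folder_name.startswith(prefix):
        raise ValueError(f"{folder_name} does not belong to plate {plate}")
    remainder = folder_name[len(prefix):]
    # Split the remainder at the first hyphen into well and (optional) site
    well, _, site = remainder.partition("-")
    if not well:
        raise ValueError(f"no well found in {folder_name}")
    return AnalysisGroup(plate, well, int(site) if site else None)
```

Other grouping schemes are possible depending on how the CellProfiler run was grouped, so this covers only the two forms described above.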

`backend` folder structure#

Within the backend folder is a folder for each batch, and within each batch folder is a folder for each plate. Within each plate folder is a single-cell SQLite file, comprising all measurements from all cells in the plate, and a CSV that aggregates the single-cell data into a per-well measurement.

└── backend
    └── 2021_04_26_Batch1
        ├── BR00117035
        │   ├── BR00117035.csv
        │   └── BR00117035.sqlite
        └── BR00117036

In this example batch:

2021_04_26_Batch1 is the batch and BR00117035 is the plate
BR00117035.sqlite is the single-cell SQLite file
BR00117035.csv is the aggregated CSV file
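
To get a first look at a single-cell SQLite file after downloading it, one option is to list its tables with Python's built-in sqlite3 module. This is a minimal sketch: the helper name is hypothetical, the path is assumed to point at a local copy, and the table names you see depend on the pipeline that produced the file.

```python
import sqlite3
from typing import List


def list_tables(sqlite_path: str) -> List[str]:
    """Return the names of all tables in a local SQLite file, sorted by name."""
    connection = sqlite3.connect(sqlite_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

Note that sqlite3.connect creates an empty database if the path does not exist, so an empty result may simply mean the path is wrong.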

`load_data_csv` folder structure#

Within the load_data_csv folder is a folder for each batch and within each batch folder is a folder for each plate. Within the plate folder there are typically two files: a load_data.csv for pipelines that do not use an illumination correction function and a load_data_with_illum.csv for pipelines that do. Atypical workflows can have other arrangements, such as a separate CSV for each pipeline in the workflow.

The load_data.csv maps the actual file names and paths and their metadata (e.g. channel number, channel name) to the naming information passed to CellProfiler for running the images in a CellProfiler pipeline. More information on load_data.csv’s and their contents is available in CellProfiler documentation.

Though CellProfiler supports multiple formats for file paths, all load_data.csv files in the CPG use column names of the form URL_<ChannelName>, where each URL starts with s3://cellpainting-gallery. For example:

URL_OrigDNA	URL_OrigER	Metadata_Plate	Metadata_Well	Metadata_Site
s3://cellpainting-gallery/cpg0000-jump-pilot/broad/images/2020_11_04_CPJUMP1/images/BR00116991__2020-11-05T19_51_35-Measurement1/Images/r01c01f01p01-ch5sk1fk1fl1.tiff	s3://cellpainting-gallery/cpg0000-jump-pilot/broad/images/2020_11_04_CPJUMP1/images/BR00116991__2020-11-05T19_51_35-Measurement1/Images/r01c01f01p01-ch4sk1fk1fl1.tiff	BR00116991	A01	1
s3://cellpainting-gallery/cpg0000-jump-pilot/broad/images/2020_11_04_CPJUMP1/images/BR00116991__2020-11-05T19_51_35-Measurement1/Images/r01c01f02p01-ch5sk1fk1fl1.tiff	s3://cellpainting-gallery/cpg0000-jump-pilot/broad/images/2020_11_04_CPJUMP1/images/BR00116991__2020-11-05T19_51_35-Measurement1/Images/r01c01f02p01-ch4sk1fk1fl1.tiff	BR00116991	A01	2
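
The URL_<ChannelName> convention makes it straightforward to pull image locations out of a load_data.csv. The sketch below uses only the standard library; the function names are hypothetical, and it assumes the file is comma-delimited with a header row.

```python
import csv
import io
from typing import Dict, List, Tuple


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split s3://<bucket>/<key> into (bucket, key)."""
    if not url.startswith("s3://"):
        raise ValueError(f"not an S3 URL: {url}")
    bucket, _, key = url[len("s3://"):].partition("/")
    return bucket, key


def image_urls_by_channel(load_data_text: str) -> List[Dict[str, str]]:
    """For each row, map channel name (the part after URL_) to its image URL."""
    reader = csv.DictReader(io.StringIO(load_data_text))
    rows = []
    for row in reader:
        rows.append(
            {
                column[len("URL_"):]: value
                for column, value in row.items()
                if column.startswith("URL_")
            }
        )
    return rows
```

The (bucket, key) pair returned by split_s3_url is the form most S3 clients expect when downloading an object.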

Note that at this time, CellProfiler from source (but not built) can directly use these load_data.csv’s to download and process CPG images but you must have AWS credentials as it does not support unsigned requests. Distributed-CellProfiler will soon support this formatting for file download but not reading directly off the bucket with S3FS.

└── load_data_csv
     └── 2021_04_26_Batch1
         ├── BR00117035
         │   ├── load_data.csv
         │   └── load_data_with_illum.csv
         └── BR00117036

`metadata` folder structure#

arrayed metadata#

The metadata folder has a slightly different structure from other workspace folders. It complies with pycytominer metadata requirements.

Additional context and information may also be found in the profiling recipe and the Image-based Profiling Handbook.

└── metadata
     ├─── external_metadata
     |   └── external_metadata.tsv
     └── platemaps
         └── 2021_04_26_Batch1
             ├── barcode_platemap.csv
             └── platemap
                 └── OAA01.02.03.04.A.txt

All datasets have at least barcode_platemap.csv and PLATEMAP.txt files.

Within barcode_platemap.csv, there are two columns: Assay_Plate_Barcode and Plate_Map_Name.

Assay_Plate_Barcode matches the plate name used for analysis. This may be a full string match to the plate names as acquired off the imager and stored in the images folder (e.g. BR00117035__2021-05-02T16_02_51-Measurement1) or it may be a truncation of the full string as long as it is still a unique identifier (e.g. BR00117035).
Plate_Map_Name is the name of a platemap in the platemaps/BATCH/platemap folder. There may be one-to-one or many-to-one correspondence between Assay_Plate_Barcode and Plate_Map_Name. Platemap naming can vary greatly from dataset to dataset depending upon the source and their data tracking/naming conventions.

Within PLATEMAP.txt there are at least plate_map_name and well_position columns, and there may be any number of additional metadata columns.

plate_map_name matches the Plate_Map_Name in the barcode_platemap.csv and the PLATEMAP in the file name.
well_position matches the well names in the data output by CellProfiler. These are typically based on raw image file naming and so are generally formatted like A01, but may be upper or lowercase and may or may not have zero padding (e.g. a1, a01, A1, A01).
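
Because well_position formatting varies between datasets, it can help to normalize well names before joining platemaps to profiles. The following is a sketch of one possible normalization, not something the gallery enforces; the helper name, the choice of A01 as the target form, and the allowance for two-letter rows are all assumptions.

```python
import re

# One or two row letters, optional zero padding, then the column number
WELL_PATTERN = re.compile(r"^([A-Za-z]{1,2})0*(\d+)$")


def normalize_well(well: str, width: int = 2) -> str:
    """Convert variants such as a1, a01, A1 to an uppercase, zero-padded form (A01)."""
    match = WELL_PATTERN.match(well.strip())
    if match is None:
        raise ValueError(f"unrecognized well name: {well!r}")
    row, column = match.groups()
    return f"{row.upper()}{int(column):0{width}d}"
```

Applying the same function to both the platemap and the CellProfiler output avoids silent mismatches when the two sides use different casing or padding.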

Some datasets additionally have external_metadata.tsv. These files map a perturbation identifier to other metadata using matching column names.

We do not currently enforce metadata harmonization beyond what is described here. However, one can generally expect that metadata have been harmonized within a dataset. We are currently exploring further metadata harmonization requirements and will update our documentation at the point of implementation.

pooled metadata#

For pooled experiments, the primary source of disambiguation of cellular perturbations is through barcode assignment to individual cells and not through a platemap, so the folder structure is different than for arrayed metadata. The Barcodes.csv used for assignment is required and is assumed to be the same for each plate within a batch. Other sources of metadata may be included, particularly if there are additional per-well or per-plate differences in metadata. Additionally, we suggest the inclusion of the metadata.json dictionary used for image processing with the pooled cell painting image processing repository.

└── metadata
   └── 2021_04_26_Batch1
        ├── Barcodes.csv
        └── metadata.json

`profiles` folder structure#

Within the profiles folder is a folder for each batch and within each batch folder is a folder for each plate. Within each plate folder are many files produced by the profiling-recipe that describe single-cell morphological profiles. For a full description of the files, see profiling-recipe files generated.

└── profiles
    └── 2021_04_26_Batch1
        ├── BR00117035
        │   ├── BR00117035.csv.gz
        │   ├── BR00117035_augmented.csv.gz
        │   ├── BR00117035_normalized.csv.gz
        │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
        │   ├── BR00117035_normalized_feature_select_plate.csv.gz
        │   └── BR00117035_normalized_negcon.csv.gz
        └── BR00117036

2021_04_26_Batch1 is the batch and BR00117035 is the plate
The .csv files undergo gzip compression to be .csv.gz files
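
Since the profiles are gzip-compressed CSVs, they can be inspected without decompressing them to disk. The sketch below uses the standard gzip and csv modules on a local copy of a file; the helper name is hypothetical.

```python
import csv
import gzip
from typing import List, Tuple


def summarize_profile(path: str) -> Tuple[List[str], int]:
    """Return (column names, number of data rows) for a local .csv.gz profile."""
    # "rt" opens the compressed file in text mode so csv can read it directly
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For well-level profiles, the row count is a quick check that the expected number of wells is present.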

`profiles_assembled` folder structure#

The profiles_assembled folder contains profiles that have been processed across multiple batches or sources using workflows like the JUMP profiling recipe. Unlike the profiles folder which contains per-plate outputs from the standard profiling recipe, profiles_assembled contains versioned datasets that may combine data from multiple plates, batches, or even sources.

Within the profiles_assembled folder, data is organized by subset, version, and processing variant:

└── profiles_assembled
    └── <subset_name>
        └── <version>
            ├── <processing_variant1>.parquet
            └── <processing_variant2>.parquet

<subset_name>: Describes which data was included (e.g., compound_no_source7 indicates compounds excluding source 7)
<version>: Version of the assembled dataset (e.g., v1.0, v2.0)
<processing_variant>: Describes the specific processing applied (e.g., profiles_var_mad_int_featselect for variance, MAD, intensity feature selection)

For example:

└── profiles_assembled
    └── compound_no_source7
        └── v1.0
            ├── profiles_var_mad_int_featselect.parquet
            └── profiles_var_mad_int_featselect_harmony.parquet

This structure allows multiple processing approaches to coexist without overwriting each other. The provenance of these files is typically tracked using manifest files as described in the JUMP Hub manifest guide (note: this guide is a work in progress).
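
A path under profiles_assembled can be decomposed into the three levels described above with a sketch like this one. The names are hypothetical, and it assumes exactly the <subset_name>/<version>/<processing_variant>.parquet layout.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class AssembledProfile(NamedTuple):
    subset_name: str
    version: str
    processing_variant: str


def parse_assembled_key(key: str) -> AssembledProfile:
    """Extract subset, version, and processing variant from a profiles_assembled path."""
    parts = PurePosixPath(key).parts
    try:
        index = parts.index("profiles_assembled")
        # Unpacking fails with ValueError if fewer than three levels follow
        subset_name, version, filename = parts[index + 1 : index + 4]
    except ValueError:
        raise ValueError(f"not a profiles_assembled file path: {key}") from None
    # Drop the final extension (e.g. .parquet) to get the variant name
    return AssembledProfile(subset_name, version, PurePosixPath(filename).stem)
```

Keeping the version as an opaque string, rather than parsing it into numbers, avoids assuming a particular versioning scheme.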

`quality_control` folder structure#

The quality_control folder has a slightly different structure. The files are all produced by the profiling-recipe.

└── quality_control
    └── heatmap
        └── 2021_04_26_Batch1
            ├── BR00117035
            │   ├── BR00117035_cell_count.png
            │   ├── BR00117035_correlation.png
            │   ├── BR00117035_position_effect.png
            │   └── and possibly others
            └── BR00117036

2021_04_26_Batch1 is the batch and BR00117035 is the plate

`segmentation` folder structure#

Files within segmentation are object segmentations (masks and/or outlines) generated outside of the final (usually CellProfiler) analysis pipeline and therefore likely differ from the segmentations used to make the dataset’s profiles. They are considered “state-of-the-field” or better in quality. The model and/or training data are optionally included. At minimum, a README.md describing the model and where to access it is included in the model folder.

Within objects are the segmented objects, nested into Batch and Plate folders.

└── segmentation
    └── <software>_<hash> (e.g. cellpose_c632b05b6930)
        ├── model
        │   └──README.md
        ├── training
        └── objects
            ├── 2021_04_26_Batch1
            │    ├── BR00117035
            │    └── BR00117036
            └── 2021_05_31_Batch2

Within Batch and Plate folders, substructure can vary. The structure used for cpg0016-jump is as follows:

└── BR00117036
    ├── BR00117036.zarr
    │    └──<source>__<batch>__<plate>__<well>__<site>
    │       ├──label_image/
    │       ├──single_cell_data/
    │       ├──single_cell_index/
    │       └──.zgroup
    └── channel_mapping.json
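
The group names inside the .zarr store encode five fields joined by double underscores. A minimal sketch for splitting them is shown below; the class and function names are hypothetical, and it assumes none of the individual fields contains a double underscore.

```python
from typing import NamedTuple


class SiteGroup(NamedTuple):
    source: str
    batch: str
    plate: str
    well: str
    site: str


def parse_site_group(name: str) -> SiteGroup:
    """Split <source>__<batch>__<plate>__<well>__<site> into its five fields."""
    fields = name.split("__")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields separated by '__', got {len(fields)}: {name}")
    return SiteGroup(*fields)
```

Single underscores inside a field, as in source_4 or 2021_04_26_Batch1, are left untouched because only the double underscore acts as the delimiter.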

`workspace_dl` folder structure#

NOTE: This section is work in progress. More documentation will be added. The structure may change.

Within the workspace_dl folder are several subfolders for different classes of data.

Within the subfolders are folders for each network/model. In this example we have used efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff where efficientnet is the name of the network, imagenet1k is the dataset that was used for training, and ec756ff is a hash for the model. Note that it is possible to use other identifiers for the model such as a Zenodo DOI.

cellpainting-gallery/
└── cpg0016-jump
    └── source_4
        └── workspace_dl
            ├── collated
            ├── consensus
            ├── embeddings
            └── profiles

`collated` folder structure#

The collated folder contains .csv or .parquet files with well-level profiles for all plates in a folder for each network/model.

└── collated
        └── efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff
            └── collated.parquet

`consensus` folder structure#

The consensus folder contains .csv or .parquet files with treatment-level profiles for all plates in a folder for each network/model.

└── consensus
        └── efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff
            └── consensus.parquet

`embeddings` folder structure#

The embeddings folder contains a subfolder for each network/model, with subfolders for each batch. Within each batch folder is a subfolder for each plate. Within each plate subfolder is a subfolder for each well-site. In the well-site subfolder is a .npz or .parquet file with single-cell features extracted from the single image.

└── embeddings
        └── efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff
            ├── 2021_04_26_Batch1
            │   ├── BR00117035
            │   │       ├── A01-1
            │   │       │   └── embedding.parquet
            │   │       └── A01-2
            │   └── BR00117036
            └── 2021_05_31_Batch2

In this example batch:

2021_04_26_Batch1 is the batch and BR00117035 is the plate
efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff is an identifier for the deep learning network, suffixed with some hash for the model
A01-1 is a folder containing the embedding file for site 1 in well A01 in plate BR00117035
embedding.parquet is the single-cell Parquet file containing the embeddings

The folder structure is a little different for DeepProfiler-generated output in that the well-site subfolder is replaced by a well subfolder with a subfolder per site.

└── embeddings
        └── efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff
            ├── 2021_04_26_Batch1
            │   ├── BR00117035
            │   │       ├── A01
            │   │       │   └── 1
            │   │       │   │   └── embedding.npz
            │   │       │   └── 2
            │   └── BR00117036
            └── 2021_05_31_Batch2
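
The two embeddings layouts differ only in the last path components, which the following sketch makes explicit. The function name and the deepprofiler flag are hypothetical; the paths are relative to the <source> folder.

```python
from pathlib import PurePosixPath


def embedding_folder(
    model: str,
    batch: str,
    plate: str,
    well: str,
    site: int,
    deepprofiler: bool = False,
) -> PurePosixPath:
    """Return the folder that holds the embedding file for one site."""
    base = PurePosixPath("workspace_dl", "embeddings", model, batch, plate)
    if deepprofiler:
        # DeepProfiler layout: <well>/<site>/
        return base / well / str(site)
    # Default layout: a single <well>-<site>/ folder
    return base / f"{well}-{site}"
```

For site 1 of well A01, the default layout ends in A01-1 and the DeepProfiler layout ends in A01/1.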

`profiles` folder structure#

Within the profiles folder is a folder for the deep learning network with its hash. Within the network folder is a folder for each batch, and within each batch folder is a folder for each plate containing that plate’s profile file.

└── profiles
        └── efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff
            ├── 2021_04_26_Batch1
            │   ├── BR00117035
            │   │       └── BR00117035.parquet
            │   └── BR00117036
            └── 2021_05_31_Batch2

Complete folder structure#

Here’s the complete folder structure for a sample project with CellProfiler-based features.

 └── cellpainting-gallery
    └── cpg0016-jump
        └── source_4
            ├── images
            │   ├── 2021_04_26_Batch1
            │   │   ├── illum
            │   │   │   ├── BR00117035
            │   │   │   │   ├── BR00117035_IllumAGP.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield_H.npy
            │   │   │   │   ├── BR00117035_IllumBrightfield_L.npy
            │   │   │   │   ├── BR00117035_IllumDNA.npy
            │   │   │   │   ├── BR00117035_IllumER.npy
            │   │   │   │   ├── BR00117035_IllumMito.npy
            │   │   │   │   └── BR00117035_IllumRNA.npy
            │   │   │   └── BR00117036
            │   │   └── images
            │   │       ├── BR00117035__2021-05-02T16_02_51-Measurement1
            │   │       └── BR00117036__2021-05-02T18_01_40-Measurement1
            │   └── 2021_05_31_Batch2
            └── workspace
                ├── analysis
                │   ├── 2021_04_26_Batch1
                │   │   ├── BR00117035
                │   │   │   └── analysis
                │   │   │       ├── BR00117035-A01-1
                │   │   │       │   ├── Cells.csv
                │   │   │       │   ├── Cytoplasm.csv
                │   │   │       │   ├── Image.csv
                │   │   │       │   ├── Nuclei.csv
                │   │   │       │   └── outlines
                │   │   │       │       ├── A01_s1--cell_outlines.png
                │   │   │       │       └── A01_s1--nuclei_outlines.png
                │   │   │       └── BR00117035-A01-2
                │   │   └── BR00117036
                │   └── 2021_05_31_Batch2
                ├── backend
                │   └── 2021_04_26_Batch1
                │       ├── BR00117035
                │       │   ├── BR00117035.csv
                │       │   └── BR00117035.sqlite
                │       └── BR00117036
                ├── load_data_csv
                │   └── 2021_04_26_Batch1
                │       ├── BR00117035
                │       │   ├── load_data.csv.gz
                │       │   └── load_data_with_illum.csv.gz
                │       └── BR00117036
                ├── metadata
                │   ├─── external_metadata
                |   |   └── external_metadata.tsv
                │   └── platemaps
                |       └── 2021_04_26_Batch1
                |           ├── platemap
                |           │   └── OAA01.02.03.04.A.txt
                |           └── barcode_platemap.csv
                ├── quality_control
                │   └── heatmap
                │       └── 2021_04_26_Batch1
                │           ├── BR00117035
                │           │   ├── BR00117035_cell_count.png
                │           │   ├── BR00117035_correlation.png
                │           │   ├── BR00117035_position_effect.png
                │           │   └── and possibly others
                │           └── BR00117036
                └── profiles
                    └── 2021_04_26_Batch1
                        ├── BR00117035
                        │   ├── BR00117035_augmented.csv.gz
                        │   ├── BR00117035_normalized.csv.gz
                        │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
                        │   ├── BR00117035_normalized_feature_select_plate.csv.gz
                        │   ├── BR00117035_normalized_negcon.csv.gz
                        │   ├── BR00117035.csv.gz
                        │   └── and others https://github.com/cytomining/profiling-recipe#files-generated
                        └── BR00117036
