This document outlines the critical dependencies between the input samplesheet, the Nextflow pipeline logic, and the Python scripts that generate `load_data.csv` files for CellProfiler. The pipeline only works if your samplesheet is formatted and your files are named as described here.
Samplesheet Requirements
The samplesheet is the single source of truth for experimental metadata. The pipeline expects specific columns to be present.
Required Columns
| Column | Description | Critical Dependency |
|---|---|---|
| `batch` | Batch identifier (e.g., `Batch1`) | Used for grouping images for illumination calculation. |
| `plate` | Plate identifier (e.g., `Plate1`) | CRITICAL: Must match the plate name used in filenames for the Combined Analysis step. |
| `well` | Well identifier (e.g., `A01`) | CRITICAL: Used to map images to metadata. |
| `site` | Site/Field number (e.g., `1`) | CRITICAL: Used to map images to metadata. |
| `channels` | Comma-separated list of channels | CRITICAL: Used as column headers in `load_data.csv`. Must match the channel names parsed from filenames (see below). |
| `arm` | `painting` or `barcoding` | Determines which subworkflow processes the image. |
| `cycle` | Cycle number (Barcoding only) | CRITICAL: Used for grouping barcoding cycles. |
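A quick pre-flight check can catch a missing column before the pipeline starts. The following is a minimal sketch and not part of the pipeline: `check_samplesheet` and the two-row example are hypothetical, the example leaves out any image-path column your samplesheet may carry, and it assumes that every barcoding row needs a non-empty `cycle`.

```python
import csv
import io

# Column names taken from the table above; `cycle` is checked per row instead.
REQUIRED_COLUMNS = {"batch", "plate", "well", "site", "channels", "arm"}


def check_samplesheet(handle):
    """Return a list of human-readable problems found in a samplesheet CSV."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        arm = row["arm"].strip()
        if arm not in {"painting", "barcoding"}:
            problems.append(f"line {line_no}: unknown arm {arm!r}")
        # Assumption: a barcoding row without a cycle is an error.
        if arm == "barcoding" and not (row.get("cycle") or "").strip():
            problems.append(f"line {line_no}: barcoding row has no cycle")
    return problems


# Hypothetical samplesheet, for illustration only.
example = io.StringIO(
    "batch,plate,well,site,channels,arm,cycle\n"
    'Batch1,Plate1,A01,1,"DNA,Phalloidin,Mito",painting,\n'
    'Batch1,Plate1,A01,1,"A,C,G,T,DNA",barcoding,\n'
)
print(check_samplesheet(example))
```

Here the painting row passes and the barcoding row on line 3 is reported because its `cycle` field is empty.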
Metadata Flow
1. Ingestion: The samplesheet is read by `main.nf`.
2. Channel Creation: Nextflow creates channels carrying `[meta, image]` tuples. `meta` contains all the columns above.
3. Processing:
   - Illumination Calculation/Correction: Metadata (`plate`, `channels`, `cycle`) is passed explicitly to the Python script via CLI arguments.
   - Preprocessing & Combined Analysis: Metadata is implicitly derived from filenames in some legacy paths, but the modern implementation relies on the `meta` map passed from Nextflow.
Channel Naming Constraints
The Python script (bin/generate_load_data_csv.py) uses regular expressions to parse filenames and extract Channel and Cycle information. This is where most user errors occur.
Cell Painting Arm
Input Images (Raw)
- Requirement: Must contain the channel names specified in your samplesheet `channels` column.
- Regex: `Channel([^_]+)` matches the channel list.
- Example: `..._ChannelDNA,Phalloidin,Mito_...`

Corrected Images (Intermediate)
- Requirement: The pipeline generates these. If you provide pre-corrected images, they must match the pattern.
- Regex: `Corr(.+?)\.tiff?`
- Example: `Plate_P1_Well_A01_Site_1_CorrDNA.tiff`. Here, `DNA` is extracted as the channel name.
- Constraint: This extracted name MUST match one of the entries in your samplesheet `channels` column (e.g., `DNA`).
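To see what the two Cell Painting patterns capture, they can be tried directly with Python's `re` module. This is a sketch under assumptions: the helper names are hypothetical, the raw filename is invented (only its `Channel...` token matters), and using `re.search` rather than another matching call is a guess about how the real script applies the patterns.

```python
import re

# Patterns quoted in this section.
RAW_CHANNELS = re.compile(r"Channel([^_]+)")
CORRECTED_CHANNEL = re.compile(r"Corr(.+?)\.tiff?")


def raw_channels(filename):
    """Return the channel list embedded in a raw filename, or [] if absent."""
    match = RAW_CHANNELS.search(filename)
    return match.group(1).split(",") if match else []


def corrected_channel(filename):
    """Return the channel name of a corrected image, or None if absent."""
    match = CORRECTED_CHANNEL.search(filename)
    return match.group(1) if match else None


samplesheet_channels = "DNA,Phalloidin,Mito".split(",")

# Hypothetical raw filename; the corrected filename is the example from above.
print(raw_channels("Well_A01_ChannelDNA,Phalloidin,Mito_Seq0001.tiff"))
print(corrected_channel("Plate_P1_Well_A01_Site_1_CorrDNA.tiff") in samplesheet_channels)
```

The second call is the check that matters in practice: the name extracted after `Corr` has to be one of the entries in the samplesheet `channels` value.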
Barcoding Arm
Input Images (Raw)
- Requirement: Must contain cycle information if it’s a multi-cycle experiment.

Preprocessing & Alignment
- Constraint: The barcoding arm is stricter. It expects specific channel names for the barcode bases.
- Allowed Channels: `A`, `C`, `G`, `T` (for bases), `DNA`, `DAPI` (for reference).
- Regex: `Cycle(\d+)_([ACGT]|DNA|DAPI)\.tiff?`
- Example: `Plate_P1_Well_A01_Site_1_Cycle01_A.tiff`
  - `Cycle01` -> Cycle 1
  - `A` -> Channel A
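The barcoding pattern can be exercised the same way. A minimal sketch, assuming the script uses `re.search` and converts the cycle to an integer; `parse_barcoding` is a hypothetical helper and the `Cy3` filename is an invented counter-example.

```python
import re

# Pattern quoted above.
BARCODING = re.compile(r"Cycle(\d+)_([ACGT]|DNA|DAPI)\.tiff?")


def parse_barcoding(filename):
    """Return (cycle, channel) for a barcoding image, or None if it does not match."""
    match = BARCODING.search(filename)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_barcoding("Plate_P1_Well_A01_Site_1_Cycle01_A.tiff"))    # (1, 'A')
print(parse_barcoding("Plate_P1_Well_A01_Site_1_Cycle01_Cy3.tiff"))  # None
```

A channel name outside the allowed set, such as `Cy3`, simply fails to match, so a file named that way would not be picked up by this pattern.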
Combined Analysis Dependencies
The Combined Analysis step merges data from both arms. This is the most fragile step regarding naming.
The “Plate Name” Trap
The Python script groups files by (Plate, Well, Site).
- Source: It takes `Plate`, `Well`, `Site` from the samplesheet metadata.
- Matching: It looks for files in the input directory.
- The Constraint: The input files for combined analysis are generated by previous steps (IllumApply). These files are named using the metadata from those previous steps.
  - If your samplesheet says Plate is `Plate_1` (underscore), but your raw filenames said `Plate1` (no underscore) and you relied on filename parsing earlier, you might have a mismatch.
  - Best Practice: Ensure the `plate` column in your samplesheet EXACTLY matches the plate identifier used in your filenames if you are relying on any filename-based grouping logic.
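One way to catch a plate-name mismatch early is to compare the samplesheet's plate values against the plate tokens found in the filenames. The sketch below assumes filenames shaped like the examples in this document (`Plate_<plate>_Well_...`); the `PLATE_IN_FILENAME` pattern and `unmatched_plates` helper are hypothetical and are not taken from the pipeline.

```python
import re

# Hypothetical pattern for filenames shaped like Plate_<plate>_Well_<well>_Site_<site>_...
PLATE_IN_FILENAME = re.compile(r"Plate_(.+?)_Well_")


def unmatched_plates(samplesheet_plates, filenames):
    """Return samplesheet plate names that never appear in any filename."""
    seen = set()
    for name in filenames:
        match = PLATE_IN_FILENAME.search(name)
        if match:
            seen.add(match.group(1))
    return sorted(set(samplesheet_plates) - seen)


files = [
    "Plate_P1_Well_A01_Site_1_CorrDNA.tiff",
    "Plate_P2_Well_A01_Site_1_CorrDNA.tiff",
]
# "P_2" in the samplesheet does not match "P2" in the filenames.
print(unmatched_plates(["P1", "P_2"], files))
```

An empty result means every plate named in the samplesheet was found in at least one filename; anything else is worth fixing before the Combined Analysis step.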
Channel Matching
The `generate_load_data_csv.py` script in combined mode uses regex to identify if a file is “Cell Painting” or “Barcoding” based on its filename pattern:
- Barcoding Pattern: Looks for `Cycle(\d+)`.
  - Matches: `..._Cycle01_A.tiff`
- Cell Painting Pattern: Looks for `Corr(.+)`.
  - Matches: `..._CorrDNA.tiff`

Impact:
- If you name a Cell Painting channel `Cycle1` (e.g., `CorrCycle1.tiff`), the script might mistakenly try to parse it as a barcoding image because of the `Cycle` keyword.
- Rule: Avoid using the word `Cycle` in your Cell Painting channel names.
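The ambiguity is easy to reproduce by testing a filename against both patterns. This sketch does not reproduce the script's actual classification logic (in particular, the order in which it tries the patterns is not stated here); `matching_arms` is a hypothetical helper that only reports which patterns match.

```python
import re

# Patterns quoted above.
BARCODING_HINT = re.compile(r"Cycle(\d+)")
PAINTING_HINT = re.compile(r"Corr(.+)")


def matching_arms(filename):
    """List every arm whose pattern matches; more than one entry signals ambiguity."""
    arms = []
    if BARCODING_HINT.search(filename):
        arms.append("barcoding")
    if PAINTING_HINT.search(filename):
        arms.append("painting")
    return arms


for name in (
    "Plate_P1_Well_A01_Site_1_Cycle01_A.tiff",
    "Plate_P1_Well_A01_Site_1_CorrDNA.tiff",
    "Plate_P1_Well_A01_Site_1_CorrCycle1.tiff",  # Cell Painting channel named Cycle1
):
    print(name, matching_arms(name))
```

The first two filenames match exactly one pattern each, while `CorrCycle1.tiff` matches both, which is why such a file can end up treated as a barcoding image.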
Summary Checklist
Before running the pipeline:
- Samplesheet Columns: Ensure `batch`, `plate`, `well`, `site`, `channels`, `arm` are present, plus `cycle` for barcoding rows.
- Channel Names:
  - Cell Painting: Names in the `channels` column match the names in your raw image filenames (e.g., `DNA`, `Mito`).
  - Barcoding: Names in the `channels` column are `A`, `C`, `G`, `T`, `DNA`, or `DAPI`.
- Avoid Keywords: Do not use `Cycle` or `Corr` as part of your raw channel names to avoid regex confusion.
- Consistency: Ensure `plate` names are consistent across all rows for the same physical plate.
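The channel-related items of the checklist can also be checked mechanically. A minimal sketch, assuming the keyword check is case-sensitive (as the regexes in this document are) and that it applies to both arms; `channel_problems` is a hypothetical helper, not a pipeline function.

```python
# Allowed barcoding channels and keywords to avoid, as listed in this document.
BARCODING_CHANNELS = {"A", "C", "G", "T", "DNA", "DAPI"}
RESERVED_KEYWORDS = ("Cycle", "Corr")


def channel_problems(arm, channels):
    """Return checklist violations for one samplesheet row's channel list."""
    problems = []
    for name in (c.strip() for c in channels.split(",")):
        if arm == "barcoding" and name not in BARCODING_CHANNELS:
            problems.append(f"{name!r} is not an allowed barcoding channel")
        for keyword in RESERVED_KEYWORDS:
            if keyword in name:
                problems.append(f"{name!r} contains reserved keyword {keyword!r}")
    return problems


print(channel_problems("painting", "DNA,Phalloidin,Mito"))
print(channel_problems("painting", "DNA,Cycle1"))
print(channel_problems("barcoding", "A,C,G,T,Cy3"))
```

The first call returns an empty list; the other two report the `Cycle1` and `Cy3` channel names respectively.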