Python Scripts

The pipeline includes helper Python scripts that handle data preparation and quality control. You generally don’t need to modify these scripts, but understanding what they do helps with debugging when things go wrong.

Overview¶

These scripts live in the bin/ directory and are automatically available to all pipeline processes:

generate_load_data_csv.py: Universal CSV generator for CellProfiler
qc_barcode_align.py: Barcode alignment quality control
qc_barcode_preprocess.py: Barcode preprocessing quality control

These scripts are automatically available in the process PATH and are called during pipeline execution.

generate_load_data_csv.py¶

Purpose¶

Generates load_data.csv files required by CellProfiler processes. This is the primary data staging script used throughout the pipeline.

Usage¶

generate_load_data_csv.py \
    --pipeline_type <type> \
    --channels <channel_list> \
    --frames <frame_list> \
    --cycles <cycle_list> \
    --output load_data.csv

Parameters¶

Parameter	Required	Description
`--pipeline_type`	Yes	Pipeline stage: `illumcalc`, `illumapply`, `segcheck`, `analysis`, `preprocess`, `combined`
`--channels`	Yes	Comma-separated channel names (e.g., `DAPI,GFP,RFP`)
`--frames`	No	Comma-separated frame indices (e.g., `0,1,2,3`)
`--cycles`	No	Comma-separated cycle numbers (e.g., `1,2,3`) - barcoding only
`--output`	Yes	Output CSV file path

Pipeline Types¶

1. `illumcalc` - Illumination Calculation¶

Stages original multi-channel images for illumination function calculation.

Output columns:

Metadata_Plate, Metadata_Well, Metadata_Site, Metadata_Frame, FileName_<Channel>

Example:

generate_load_data_csv.py \
    --pipeline_type illumcalc \
    --channels DAPI,GFP,RFP \
    --frames 0,1,2,3 \
    --output load_data.csv

2. `illumapply` - Illumination Application¶

Stages original images alongside illumination functions for correction.

Output columns:

Metadata_Plate, Metadata_Well, Metadata_Site, Metadata_Frame,
FileName_Orig<Channel>, FileName_Illum<Channel>

Example:

generate_load_data_csv.py \
    --pipeline_type illumapply \
    --channels DAPI,GFP,RFP \
    --frames 0,1,2,3 \
    --output load_data.csv

3. `segcheck` - Segmentation Check¶

Stages corrected images for segmentation quality control.

Output columns:

Metadata_Plate, Metadata_Well, Metadata_Site, Metadata_Frame, FileName_Corr<Channel>

4. `preprocess` - Barcoding Preprocessing¶

Stages cycle-based images for barcode calling.

Output columns:

Metadata_Plate, Metadata_Well, Metadata_Site, Metadata_Frame, Metadata_Cycle,
FileName_Cycle<N>_<Channel>

Example:

generate_load_data_csv.py \
    --pipeline_type preprocess \
    --channels Cy3,Cy5 \
    --cycles 1,2,3 \
    --frames 0,1,2,3 \
    --output load_data.csv

5. `combined` - Combined Analysis¶

Stages both painting (corrected) and barcoding (preprocessed) images.

Output columns:

Metadata_Plate, Metadata_Well, Metadata_Site, Metadata_Frame,
FileName_Corr<Channel>, FileName_Cycle<N>_<Channel>

Metadata Flow¶

Metadata originates from the input samplesheet and flows through Nextflow channels to the Python scripts. The scripts use two approaches:

Pattern A: Metadata-Driven (ILLUMCALC, ILLUMAPPLY)

Metadata is passed as CLI arguments from the Nextflow meta map:

generate_load_data_csv.py \
    --pipeline-type illumcalc \
    --images-dir ./images \
    --plate ${meta.plate} \      # from samplesheet
    --channels "${channels}" \   # from meta.channels
    --output load_data.csv

Pattern B: Filename-Driven (PREPROCESS, COMBINEDANALYSIS)

All metadata is extracted from standardized filenames:

generate_load_data_csv.py \
    --pipeline-type preprocess \
    --images-dir ./images \
    --output load_data.csv

Filename Patterns¶

The script parses different filename patterns depending on the pipeline stage:

Image Type	Pattern	Example
Original	`Well{well}_Point{site}_{frame}_Channel{channels}_Seq*.ome.tiff`	`WellA1_PointA1_0000_ChannelDNA,GFP_Seq0000.ome.tiff`
Corrected	`Plate_{plate}_Well_{well}_Site_{site}_Corr{channel}.tiff`	`Plate_Plate1_Well_A1_Site_1_CorrDNA.tiff`
Illumination	`{plate}_Illum{channel}.npy`	`Plate1_IllumDNA.npy`
Cycle	`Plate_{plate}_Well_{well}_Site_{site}_Cycle{cycle}_{channel}.tiff`	`Plate_Plate1_Well_A1_Site_1_Cycle01_A.tiff`

CSV Output Structures¶

Different pipeline types produce different CSV structures:

Standard (Cell Painting):

Metadata_Plate	Metadata_Well	Metadata_Site	FileName_OrigDNA	Frame_OrigDNA	FileName_IllumDNA	...
Plate1	A1	1	WellA1_Point_0000.ome.tiff	0	Plate1_IllumDNA.npy	...

With Cycles (Barcoding):

Metadata_Plate	Metadata_Well	Metadata_Site	Metadata_Cycle	FileName_Cycle01_OrigA	Frame_Cycle01_OrigA	...
Plate1	A1	1	1	filename.ome.tiff	0	...

Combined Analysis:

Metadata_Plate	Metadata_Site	Metadata_Well	FileName_CorrDNA	FileName_Cycle01_A	...
Plate1	1	A1	corrected.tiff	cycle1.tiff	...

Implementation Details¶

The script:

Scans directories for TIFF images matching expected patterns
Parses filenames to extract metadata (plate, well, site, frame, channel, cycle)
Groups images by metadata keys
Generates CSV with proper CellProfiler column names

Error Handling¶

The script validates:

File existence
Metadata completeness
Channel consistency
Frame/cycle ranges

qc_barcode_align.py¶

Purpose¶

Jupyter notebook for analyzing barcode alignment quality across cycles.

Functionality¶

Load cycle images: Reads corrected images from all cycles
Calculate pixel shifts: Measures X/Y displacement between cycles
Compute correlations: Calculates Pearson correlation between cycles
Generate visualizations:
- Scatter plots of pixel shifts
- Correlation heatmaps
- Spatial shift maps
Validate thresholds: Checks against barcoding_shift_threshold and barcoding_corr_threshold

Usage¶

# In Jupyter or as script
qc_barcode_align.py \
    --input_dir /path/to/corrected/cycles/ \
    --shift_threshold 50 \
    --corr_threshold 0.9 \
    --output_dir qc/barcode_align/

Output Files¶

shift_summary.csv: Per-site shift statistics
correlation_matrix.csv: Cycle-to-cycle correlations
shift_spatial_plot.png: Spatial distribution of shifts
correlation_heatmap.png: Correlation visualization
qc_report.html: Interactive QC report

QC Criteria¶

Pass criteria:

Mean pixel shift < barcoding_shift_threshold
Mean correlation > barcoding_corr_threshold

Fail criteria:

Any site exceeds shift threshold
Any cycle pair below correlation threshold

Development Guide¶

Adding New Pipeline Types¶

To support a new CellProfiler stage:

Update generate_load_data_csv.py:

def generate_newtype_csv(images, channels, output):
    rows = []
    for img in images:
        row = {
            'Metadata_Plate': img.plate,
            'Metadata_Well': img.well,
            # ... additional metadata
            f'FileName_{channel}': img.filename
        }
        rows.append(row)

    df = pd.DataFrame(rows)
    df.to_csv(output, index=False)

Add to pipeline type switch:

if pipeline_type == 'newtype':
    generate_newtype_csv(images, channels, output)

Update process module:

script:
"""
generate_load_data_csv.py \
    --pipeline_type newtype \
    --channels ${meta.channels.join(',')} \
    --output load_data.csv
"""

Testing Scripts¶

Test scripts in isolation:

# Create test data
mkdir -p test_images
touch test_images/P001_A01_1_0_DAPI.tif
touch test_images/P001_A01_1_0_GFP.tif

# Run script
python bin/generate_load_data_csv.py \
    --pipeline_type illumcalc \
    --channels DAPI,GFP \
    --frames 0 \
    --output test_load_data.csv

# Validate output
head test_load_data.csv

Debugging¶

Enable verbose logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Best Practices¶

Validate inputs: Check file existence before processing
Handle edge cases: Empty directories, missing frames, etc.
Consistent naming: Follow established filename conventions
Error messages: Provide clear, actionable error messages
Logging: Log key operations for debugging
Testing: Write unit tests for parsing logic

Overview¶

generate_load_data_csv.py¶

Purpose¶

Usage¶

Parameters¶

Pipeline Types¶

1. illumcalc - Illumination Calculation¶

2. illumapply - Illumination Application¶

3. segcheck - Segmentation Check¶

4. preprocess - Barcoding Preprocessing¶

5. combined - Combined Analysis¶

Metadata Flow¶

Filename Patterns¶

CSV Output Structures¶

Implementation Details¶

Error Handling¶

qc_barcode_align.py¶

Purpose¶

Functionality¶

Usage¶

Output Files¶

QC Criteria¶

Development Guide¶

Adding New Pipeline Types¶

Testing Scripts¶

Debugging¶

Best Practices¶

1. `illumcalc` - Illumination Calculation¶

2. `illumapply` - Illumination Application¶

3. `segcheck` - Segmentation Check¶

4. `preprocess` - Barcoding Preprocessing¶

5. `combined` - Combined Analysis¶