Using Your Own Data: Run - nf-pooled-cellpainting Documentation

Running the Pipeline with CLI

Once your inputs are ready, run the pipeline with each parameter pointing to your own files:

nextflow run broadinstitute/nf-pooled-cellpainting \
    --input samplesheet.csv \
    --barcodes barcodes.csv \
    --outdir results \
    --painting_illumcalc_cppipe your_painting_illumcalc_cppipe.cppipe \
    --painting_illumapply_cppipe your_painting_illumapply_cppipe.cppipe \
    --painting_segcheck_cppipe your_painting_segcheck_cppipe.cppipe \
    --barcoding_illumcalc_cppipe your_barcoding_illumcalc_cppipe.cppipe \
    --barcoding_illumapply_cppipe your_barcoding_illumapply_cppipe.cppipe \
    --barcoding_preprocess_cppipe your_barcoding_preprocess_cppipe.cppipe \
    --combinedanalysis_cppipe your_combinedanalysis_cppipe.cppipe \
    -profile docker
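The same parameters can also be kept in a params file rather than typed as individual flags. The sketch below writes one, assuming the placeholder file names from the command above; the name `params.yaml` is an arbitrary choice, not something the pipeline requires.

```shell
# Sketch: collect the parameters from the command above into a params file.
# All file names are the same placeholders used in that command.
cat > params.yaml <<'EOF'
input: "samplesheet.csv"
barcodes: "barcodes.csv"
outdir: "results"
painting_illumcalc_cppipe: "your_painting_illumcalc_cppipe.cppipe"
painting_illumapply_cppipe: "your_painting_illumapply_cppipe.cppipe"
painting_segcheck_cppipe: "your_painting_segcheck_cppipe.cppipe"
barcoding_illumcalc_cppipe: "your_barcoding_illumcalc_cppipe.cppipe"
barcoding_illumapply_cppipe: "your_barcoding_illumapply_cppipe.cppipe"
barcoding_preprocess_cppipe: "your_barcoding_preprocess_cppipe.cppipe"
combinedanalysis_cppipe: "your_combinedanalysis_cppipe.cppipe"
EOF

# Then launch with Nextflow's -params-file option instead of the long flag list:
# nextflow run broadinstitute/nf-pooled-cellpainting -params-file params.yaml -profile docker
```

Keeping the parameters in a file makes it easier to reuse exactly the same inputs when the run is repeated later.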

Running the Pipeline with Seqera Platform

Configuring the Pipeline in Seqera Platform

Navigate to Launchpad → Add Pipeline.

Pipeline Settings

Setting	Value
Name	`nf-pooled-cellpainting` or a name describing your run
Pipeline to launch	`https://github.com/broadinstitute/nf-pooled-cellpainting`
Revision	`dev` (for latest updates), `main` (for latest versioned code), or a specific commit
Compute environment	Your AWS Batch environment
Work directory	`s3://your-bucket/prefix/to/scratch/output`
Config profiles	(leave empty for a custom run)

Pipeline Parameters

In the Launchpad, select “Launch” for your pipeline.

In the “Run Parameters” tab, fill in all required Input/Output options. You can enter each value manually in the “Input form view”, or add the following parameters (shown here as YAML) to the JSON or YAML in the “Params file view”. All other parameters have default values, but you may need to change some of them to match your dataset.

input: "s3://your-bucket/samplesheet.csv"
outdir: "s3://your-bucket/results"
barcodes: "s3://your-bucket/barcodes.csv"
painting_illumcalc_cppipe: "s3://your-bucket/pipelines/painting_illumcalc.cppipe"
painting_illumapply_cppipe: "s3://your-bucket/pipelines/painting_illumapply.cppipe"
painting_segcheck_cppipe: "s3://your-bucket/pipelines/painting_segcheck.cppipe"
barcoding_illumcalc_cppipe: "s3://your-bucket/pipelines/barcoding_illumcalc.cppipe"
barcoding_illumapply_cppipe: "s3://your-bucket/pipelines/barcoding_illumapply.cppipe"
barcoding_preprocess_cppipe: "s3://your-bucket/pipelines/barcoding_preprocess.cppipe"
combinedanalysis_cppipe: "s3://your-bucket/pipelines/combinedanalysis.cppipe"

Keep `qc_barcoding_passed: false` and `qc_painting_passed: false` for the first launch of the pipeline. This pauses the pipeline after the QC steps, so you can review their outputs before the final steps run.

Select “Launch”.

Launching and Monitoring Runs

Launch: Click Launch from the pipeline page
Monitor: View real-time task execution in the Runs tab
QC Review: Check outputs in the S3 bucket or via the Reports tab
Resume: After QC review, click Resume (not Relaunch!) with updated parameters:

qc_painting_passed: true
qc_barcoding_passed: true
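For runs started from the command line rather than Seqera Platform, a roughly equivalent step is to repeat the original command with Nextflow's `-resume` option and both QC parameters set to true. The wrapper below is a hypothetical convenience (`resume_after_qc` is not part of the pipeline) and assumes you run it from the same launch directory as the first run, so the cached work directory is found.

```shell
# Hypothetical wrapper, not shipped with nf-pooled-cellpainting.
# Pass exactly the same flags as in the first run; the function appends
# the two QC parameters and -resume so that cached tasks are reused.
resume_after_qc() {
    nextflow run broadinstitute/nf-pooled-cellpainting \
        "$@" \
        --qc_painting_passed true \
        --qc_barcoding_passed true \
        -resume
}

# Example (add the seven *_cppipe flags from the first run as well):
# resume_after_qc --input samplesheet.csv --barcodes barcodes.csv \
#     --outdir results -profile docker
```

Changing any input flag between the first run and the resumed run may invalidate cached tasks, so only the two QC parameters should differ.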

Cost Optimization Tips

Use Spot Instances: 60-90% cost savings for fault-tolerant workloads
Enable Fusion Snapshots: Automatically recover from spot interruptions
Right-size Max CPUs: Start with 500-1000, increase based on queue times
Use Appropriate Instance Types: Memory-optimized (r6id) for Combined Analysis; compute-optimized (c6id) for illumination steps
Clean Up Work Directory: Periodically delete old work directories from S3
Route Long Tasks to On-Demand: See below for avoiding spot reclaim losses on multi-hour tasks
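
As a sketch of the work directory clean-up tip, the helper below wraps the AWS CLI's `aws s3 rm --recursive`. The function name `clean_workdir` is hypothetical, and the prefix in the example is the placeholder work directory from the Pipeline Settings table. It previews by default (`--dryrun`) and deletes only when asked explicitly.

```shell
# Hypothetical helper: preview, then delete, everything under one S3 prefix.
# Usage: clean_workdir <s3-prefix> [delete]
clean_workdir() {
    local prefix="$1" mode="${2:-preview}"
    if [ "$mode" = "delete" ]; then
        # Permanently removes every object under the prefix.
        aws s3 rm "$prefix" --recursive
    else
        # --dryrun only lists what would be removed.
        aws s3 rm "$prefix" --recursive --dryrun
    fi
}

# Example:
# clean_workdir "s3://your-bucket/prefix/to/scratch/output/"          # preview
# clean_workdir "s3://your-bucket/prefix/to/scratch/output/" delete   # delete
```

Deleting a work directory removes the cached task outputs that Resume relies on, so clean up only after a run has fully completed, not while it is paused for QC review.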

Routing Long-Running Tasks to On-Demand Instances

Long-running tasks such as FIJI_STITCHCROP (which can take 4-6 hours) and CELLPROFILER_COMBINEDANALYSIS can lose hours of work if their spot instances are reclaimed. To avoid this:

Create an on-demand compute environment in Seqera Platform (duplicate your spot environment, disable Fusion Snapshots since they’re unnecessary for on-demand)
Route specific processes to the on-demand queue by adding the following to your Nextflow config:

process {
    withName: 'FIJI_STITCHCROP' {
        queue = '<on-demand-queue-name>'
    }
    withName: 'CELLPROFILER_COMBINEDANALYSIS' {
        queue = '<on-demand-queue-name>'
    }
}

The queue name is visible in your Seqera Platform compute environment under “Manual config attributes”.
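
Because both processes go to the same queue, the two selectors can also be merged: Nextflow's `withName` accepts a pattern, and `|` matches either name. The sketch below writes such a config to a file; the file name `on_demand.config` is an arbitrary choice, and `<on-demand-queue-name>` remains a placeholder you must replace.

```shell
# Sketch: write a custom config that routes both long-running processes
# to the on-demand queue with a single selector.
cat > on_demand.config <<'EOF'
process {
    // '|' means "either process name"
    withName: 'FIJI_STITCHCROP|CELLPROFILER_COMBINEDANALYSIS' {
        queue = '<on-demand-queue-name>'
    }
}
EOF

# For command-line runs, pass the file with Nextflow's -c option, e.g.:
# nextflow run broadinstitute/nf-pooled-cellpainting <your flags> -c on_demand.config
```

Keeping the two blocks separate, as in the config above this sketch, is still the better choice if the processes may later need different queues.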

Resource Requirements by Process

Process	CPU	Memory	Notes
CELLPROFILER_ILLUMCALC	1	2 GB	Per plate
CELLPROFILER_ILLUMAPPLY	1-2	6 GB	Per well/site
CELLPROFILER_PREPROCESS	4	8 GB	Per site
FIJI_STITCHCROP	6	36 GB	Memory-intensive
CELLPROFILER_COMBINEDANALYSIS	4	12-32 GB	Most demanding

To override the defaults, add the following to your Nextflow config:

process {
    withName: 'CELLPROFILER_COMBINEDANALYSIS' {
        memory = '64.GB'
    }
}
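
Instead of a fixed value, memory can also grow on each retry, which avoids over-provisioning every task for the worst case. The sketch below uses standard Nextflow directives (`errorStrategy`, `maxRetries`, and a closure over `task.attempt`). The 32 GB starting point and the retry count are assumptions to tune for your dataset, and the pipeline's own defaults may already apply a similar strategy.

```shell
# Sketch: write a config that retries Combined Analysis with more memory
# on each attempt (32 GB, then 64 GB, then 96 GB).
cat > retry_memory.config <<'EOF'
process {
    withName: 'CELLPROFILER_COMBINEDANALYSIS' {
        // Assumed values; adjust to your data.
        memory        = { 32.GB * task.attempt }
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}
EOF
```

Note that `errorStrategy = 'retry'` retries after any failure, not only out-of-memory errors, so a task that fails for another reason is also re-run up to `maxRetries` times.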