StarryNight Execution Layer
Overview
The execution layer in StarryNight defines how modules and pipelines are executed in computing environments. This layer consists of two key components: the execution model, which handles how modules and pipelines are configured and executed in different contexts, and the Snakemake backend, which translates Pipecraft pipelines into concrete, reproducible workflows. Together, these components form the final layer in StarryNight's architecture, turning abstract pipeline definitions into actual running processes.
Purpose
The execution layer serves several critical purposes in the StarryNight architecture:
- Execution Preparation - Preparing configured pipelines for runtime
- Instantiation - Creating configured module instances
- Backend Selection - Choosing appropriate execution backends
- Execution Control - Initiating and monitoring pipeline execution
- Result Management - Handling outputs and logs
- Workflow Translation - Converting Pipecraft pipelines to executable format
- Dependency Management - Handling dependencies between pipeline steps
- Container Execution - Managing execution of containerized operations
- Parallel Processing - Controlling parallel execution of independent steps
This system provides the connection between abstract pipeline definitions and concrete execution in computing environments.
Execution Model
The execution model in the StarryNight execution layer defines how modules and pipelines are run in different contexts, with a particular focus on the notebook workflow.
Notebook Workflow
The typical notebook workflow includes these key steps:
1. Import Components
# Import necessary modules
from starrynight.modules.inventory import GenInvModule
from starrynight.modules.index import GenIndexModule
from starrynight.modules.cp_illum_calc import CPIllumCalcGenLoadDataModule, CPIllumCalcGenCPipeModule
from starrynight.config import DataConfig
from starrynight.experiments.pcp_generic import PCPGenericExperiment
import pipecraft as pc
2. Configure Data Paths
# Set up data paths
workspace_path = "/path/to/workspace"
images_path = "/path/to/images"
scratch_path = "/path/to/scratch"
# Create data config
data_config = DataConfig(
    workspace_path=workspace_path,
    images_path=images_path,
    scratch_path=scratch_path
)
3. Configure Backend
# Configure Snakemake backend
backend_config = pc.SnakemakeBackendConfig(
    use_opentelemetry=False,  # Disable telemetry for notebook
    print_exec=True           # Print execution details
)
# Create backend instance
exec_backend = pc.SnakemakeBackend(backend_config)
Module Configuration and Execution
The execution model handles how modules are configured and run:
# Create and run index module
gen_index_mod = GenIndexModule.from_config(data_config)
exec_backend.run(
    gen_index_mod.pipeline,
    config=backend_config,
    working_dir=data_config.scratch_path / "runs" / "index"
)

# Configure experiment
pcp_init_config = {
    "nuclear_channel": "DAPI",
    "cell_channel": "CellMask",
    "mito_channel": "MitoTracker",
    "barcode_csv_path": "/path/to/barcodes.csv",
    "image_overlap_percentage": 10
}

# Create configured experiment
pcp_experiment = PCPGenericExperiment.from_index(
    index_path=data_config.workspace_path / "index.yaml",
    init_config=pcp_init_config
)

# Create and run pipeline module
illum_load_data_mod = CPIllumCalcGenLoadDataModule.from_config(
    data_config=data_config,
    experiment=pcp_experiment
)
exec_backend.run(
    illum_load_data_mod.pipeline,
    config=backend_config,
    working_dir=data_config.scratch_path / "runs" / "illum_load_data"
)
Backend Selection
The execution system uses backend implementations in Pipecraft. Currently, StarryNight primarily uses the Snakemake backend, configured with options such as:
- Telemetry Settings - Whether to use OpenTelemetry for logging
- Output Settings - How to display execution information
- Resource Settings - CPU, memory, and other resource limits
Execution Artifacts
When a pipeline is executed, several artifacts are generated:
Compiled Workflow
The compiled workflow (e.g., Snakefile) contains the full definition of the operations to be performed. This file includes rules for each operation, input/output specifications, container configuration, and command-line instructions.
Execution Logs
The execution logs capture the entire execution process, including command outputs, errors, and runtime information. These logs provide a complete record of execution for troubleshooting and auditing.
Results
The results of the execution are stored in configured output locations, as defined in the module specifications.
Module State Management
An important aspect of the notebook workflow is module state management. The notebook environment maintains module state during its execution, allowing for iterative development and inspection. This enables users to inspect module configurations, modify parameters, and re-run operations without restarting the entire workflow.
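Because the notebook keeps these objects alive, a single step can be reconfigured and re-run without repeating earlier ones. The sketch below is illustrative and reuses only the calls shown earlier in this document; the changed channel value is a hypothetical example:

# Tweak one experiment parameter, rebuild the affected module, and re-run it
pcp_init_config["cell_channel"] = "Phalloidin"  # hypothetical parameter change
pcp_experiment = PCPGenericExperiment.from_index(
    index_path=data_config.workspace_path / "index.yaml",
    init_config=pcp_init_config
)
illum_load_data_mod = CPIllumCalcGenLoadDataModule.from_config(
    data_config=data_config,
    experiment=pcp_experiment
)
exec_backend.run(
    illum_load_data_mod.pipeline,
    config=backend_config,
    working_dir=data_config.scratch_path / "runs" / "illum_load_data"
)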
Snakemake Backend
The Snakemake backend is StarryNight's primary execution engine, responsible for translating Pipecraft pipelines into Snakemake workflows and executing them.
Snakemake is a workflow management system that:
- Uses a Python-based language to define rules
- Manages dependencies between rules using input/output relationships
- Supports parallel execution of independent tasks
- Provides container integration (Docker, Singularity/Apptainer)
- Handles resource management and scheduling
Backend Implementation in Pipecraft
The Snakemake backend is implemented in the Pipecraft package rather than in the StarryNight core package. This architectural decision:
- Keeps backend implementation details separate from the scientific image processing logic
- Allows for multiple backends to be developed without modifying the core package
- Maintains a clean separation between pipeline definition and execution
When StarryNight modules and pipelines are executed, they use the backend implementations from Pipecraft through a well-defined API.
The "Aha Moment" of Automatic Generation
Critical Point: The Snakemake backend delivers one of the most impressive capabilities of the StarryNight system: the automatic generation of complex Snakefiles from high-level abstractions. This capability is a central architectural achievement that demonstrates the value of the entire system.
For developers who have written Snakemake files manually, seeing a complex 500-line Snakefile generated automatically from high-level module definitions provides an immediate understanding of the system's value. It exemplifies how the StarryNight architecture transforms simple, user-friendly abstractions into complex, reproducible workflows.
Generated Snakefile Structure
When a Pipecraft pipeline is compiled to a Snakefile, it generates a structure like this:
# Generated Snakefile
rule all:
    input:
        "path/to/final/output.csv"

rule operation_name:
    input:
        input_name="path/to/input/file.csv",
        workspace="path/to/workspace"
    output:
        pipeline="path/to/output/pipeline.cppipe"
    container:
        "cellprofiler/starrynight:latest"
    shell:
        "starrynight segcheck generate-pipeline --output-path {output.pipeline} --load-data {input.input_name} --nuclear-channel DAPI --cell-channel CellMask"
The compiled Snakefile defines what inputs each rule expects, what outputs it will create, which container to use, and the actual command to invoke inside that container.
Rule Structure
Each rule in the Snakefile represents a computational step and includes:
- Rule Name - Identifier for the operation
- Inputs - Files or directories required for the operation
- Outputs - Files or directories produced by the operation
- Container - Container image to use for execution
- Shell Command - Command to execute inside the container
Complex Workflow Example
For a multi-step pipeline, the Snakefile would contain multiple interconnected rules:
rule all:
    input:
        "results/analysis_complete.txt"

rule generate_load_data:
    input:
        images="path/to/images"
    output:
        load_data="workspace/load_data/illum_calc_load_data.csv"
    container:
        "cellprofiler/starrynight:latest"
    shell:
        "starrynight illum generate-load-data --images-path {input.images} --output-path {output.load_data} --batch-id Batch1 --plate-id Plate1"

rule generate_pipeline:
    input:
        load_data="workspace/load_data/illum_calc_load_data.csv"
    output:
        pipeline="workspace/pipelines/illum_calc_pipeline.cppipe"
    container:
        "cellprofiler/starrynight:latest"
    shell:
        "starrynight illum generate-pipeline --output-path {output.pipeline} --load-data {input.load_data}"

rule run_pipeline:
    input:
        load_data="workspace/load_data/illum_calc_load_data.csv",
        pipeline="workspace/pipelines/illum_calc_pipeline.cppipe"
    output:
        results="workspace/results",
        complete="results/analysis_complete.txt"
    container:
        "cellprofiler/starrynight:latest"
    shell:
        """
        starrynight illum run-pipeline --load-data {input.load_data} --pipeline {input.pipeline} --output-dir {output.results}
        touch {output.complete}
        """
Snakemake automatically determines the execution order based on the input/output dependencies.
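Because the order comes from the dependency graph rather than the order of rules in the file, the plan can be inspected before anything runs. As a sketch, Snakemake's standard dry-run flag lists the jobs it would execute for a compiled workflow (the Snakefile path here is illustrative):

# Preview the resolved execution order without running any rules
snakemake --snakefile /path/to/scratch/runs/illum_calc/Snakefile --cores 1 --dry-run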
Container Execution Model
StarryNight uses containerization for reproducible algorithm execution. This is implemented through container definitions and backend integration in the Pipecraft package.
Container Definition
The Container class in pipecraft/node.py defines execution environments with:
- image: Docker/Singularity image reference
- cmd: Command to run within the container
- env: Environment variables
Modules use this pattern to define containerized operations:
# From starrynight/modules/cp_illum_calc/calc_cp.py
Container(
    name="cp_calc_illum_invoke_cp",
    input_paths={
        "cppipe_path": [...],
        "load_data_path": [...],
    },
    output_paths={
        "cp_illum_calc_dir": [...]
    },
    config=ContainerConfig(
        image="ghrc.io/leoank/starrynight:dev",
        cmd=["starrynight", "cp", "-p", spec.inputs[0].path, ...],
        env={},
    ),
)
Backend Integration
The SnakeMakeBackend in pipecraft/backend/snakemake.py translates container specifications to Snakemake rules:
- Container images become Snakemake container directives
- Input/output paths define rule dependencies
- Commands define the shell execution
This is implemented in the Mako template at pipecraft/backend/templates/snakemake.mako:
rule ${container.name.replace(" ", "_").lower()}:
    input:
        # Input path definitions...
    output:
        # Output path definitions...
    container: "docker://${container.config.image}"
    shell:
        "${' '.join(container.config.cmd)}"
Execution Flow
The execution process follows these steps:
1. Modules define containers with appropriate configurations
2. The pipeline connects containers in sequential or parallel arrangements
3. The backend compiles the pipeline to Snakemake rules
4. Snakemake handles container execution and dependency tracking
5. Results are stored at specified output paths
Parallelism in Execution
The execution system handles two levels of parallelism:
Rule-level Parallelism
Snakemake automatically handles rule-level parallelism based on the dependency graph:
- Independent rules can run in parallel
- Rules that depend on the outputs of other rules wait for those rules to complete
- The order of execution is determined by the input/output dependencies, not by the order in the file
Task-level Parallelism
For rules that process multiple similar items:
- Multiple instances of the same rule can run in parallel
- Each instance processes a different input/output combination
- This is particularly useful for operations like applying illumination correction to multiple images
The level of parallelism can be controlled with Snakemake parameters:
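For example, when the compiled Snakefile is executed directly, Snakemake's own flags bound how many jobs run at once; the values below are illustrative:

# Allow up to 8 CPU cores for local rule execution
snakemake --snakefile Snakefile --cores 8
# Or cap the number of concurrently scheduled jobs
snakemake --snakefile Snakefile --jobs 4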
Advanced Features
Compiling Without Executing
You can compile a pipeline without executing it. This generates the Snakefile without running it, allowing for inspection and manual execution. Once generated, the Snakefile can be run directly with the Snakemake command-line tool, giving users flexibility in how they execute workflows.
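A minimal sketch of this flow, assuming the backend exposes a separate compile step (the compile method name and its arguments are assumptions here, not something this document confirms):

# Compile the pipeline to a Snakefile without executing it (method name assumed)
exec_backend.compile(
    illum_load_data_mod.pipeline,
    working_dir=data_config.scratch_path / "runs" / "illum_load_data"
)
# The generated Snakefile in the working directory can then be inspected,
# or run manually with the snakemake command-line tool.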
Logs and Monitoring
The Snakemake backend captures detailed logs of execution. These logs include command outputs, error messages, and execution status for each step in the pipeline. They are stored in the working directory and can be accessed for troubleshooting or monitoring.
When execution fails, several troubleshooting approaches are available:
- Examine logs in the working directory
- Check container execution details
- Validate input configurations
- Inspect the compiled workflow file
Execution with Telemetry
For production environments, telemetry can be enabled to send execution information to a monitoring system. This is typically disabled for notebook environments but can be enabled for centralized monitoring in production deployments.
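As a sketch using only the configuration fields already shown above, enabling telemetry is a matter of flipping the flag; where the telemetry data is sent depends on the deployment's OpenTelemetry setup and is not covered here:

# Production-style configuration: telemetry on, verbose printing off
backend_config = pc.SnakemakeBackendConfig(
    use_opentelemetry=True,
    print_exec=False
)
exec_backend = pc.SnakemakeBackend(backend_config)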
Complete Examples
Example Notebook
Here's a complete notebook example integrating these concepts:
# Import necessary components
from starrynight.config import DataConfig
from starrynight.experiments.pcp_generic import PCPGenericExperiment
from starrynight.modules.inventory import GenInvModule
from starrynight.modules.index import GenIndexModule
from starrynight.pipelines.pcp_generic import create_pcp_generic_pipeline
import pipecraft as pc
import os
from pathlib import Path
# Set up paths
workspace_path = Path("/path/to/workspace")
images_path = Path("/path/to/images")
scratch_path = Path("/path/to/scratch")
# Create data config
data_config = DataConfig(
    workspace_path=workspace_path,
    images_path=images_path,
    scratch_path=scratch_path
)

# Configure backend
backend_config = pc.SnakemakeBackendConfig(
    use_opentelemetry=False,
    print_exec=True
)
exec_backend = pc.SnakemakeBackend(backend_config)
# Run indexing and inventory
print("Running indexing...")
gen_index_mod = GenIndexModule.from_config(data_config)
exec_backend.run(
    gen_index_mod.pipeline,
    config=backend_config,
    working_dir=data_config.scratch_path / "runs" / "index"
)

print("Running inventory...")
gen_inv_mod = GenInvModule.from_config(data_config)
exec_backend.run(
    gen_inv_mod.pipeline,
    config=backend_config,
    working_dir=data_config.scratch_path / "runs" / "inventory"
)
# Configure experiment
pcp_init_config = {
    "nuclear_channel": "DAPI",
    "cell_channel": "CellMask",
    "mito_channel": "MitoTracker",
    "barcode_csv_path": str(workspace_path / "barcodes.csv"),
    "image_overlap_percentage": 10
}
pcp_experiment = PCPGenericExperiment.from_index(
    index_path=data_config.workspace_path / "index.yaml",
    init_config=pcp_init_config
)
# Create complete pipeline
print("Creating pipeline...")
modules, pipeline = create_pcp_generic_pipeline(data_config, pcp_experiment)
# Run the pipeline
print("Running pipeline...")
exec_backend.run(
    pipeline=pipeline,
    config=backend_config,
    working_dir=data_config.scratch_path / "runs" / "complete_pipeline"
)
print("Pipeline complete!")
Example: Generated Snakefile
Here's an excerpt from an actual generated Snakefile:
# This file was generated by StarryNight
from snakemake.io import directory
rule all:
    input:
        "workspace/results/analysis_complete.txt"

rule cp_illum_calc_load_data:
    input:
        images="path/to/images/Batch1/Plate1"
    output:
        load_data="workspace/load_data/illum_calc_load_data.csv"
    container:
        "cellprofiler/starrynight:latest"
    shell:
        "starrynight illum generate-load-data --images-path {input.images} --output-path {output.load_data} --batch-id Batch1 --plate-id Plate1 --channel DAPI --channel CellMask --channel MitoTracker"

rule cp_illum_calc_pipeline:
    input:
        load_data="workspace/load_data/illum_calc_load_data.csv"
    output:
        pipeline="workspace/pipelines/illum_calc_pipeline.cppipe"
    container:
        "cellprofiler/starrynight:latest"
    shell:
        "starrynight illum generate-pipeline --output-path {output.pipeline} --load-data {input.load_data}"
# Additional rules...
Future Backends
The Snakemake backend demonstrates the power of the architecture by showing that compute graphs can be converted to executable Snakemake workflows. This separation of pipeline definition from execution enables the possibility of developing additional backends for different environments, such as:
- Cloud-based execution (AWS, GCP, Azure)
- HPC cluster execution
- Kubernetes-based execution
- Custom execution environments
This extensibility is a direct result of the architecture's separation of concerns, where pipelines are defined independently of how they are executed.
Comparison with Other Approaches
The notebook workflow provides several advantages over direct CLI usage:
- State Persistence - Module configurations are maintained in memory
- Parameter Inference - Automatic configuration from experiments
- Containerization - Automatic execution in containers
- Workflow Composition - Easy combination of multiple steps
The execution through Snakemake also offers benefits compared to direct execution:
- Reproducibility - Ensures consistent execution across environments
- Scalability - Scales from laptops to HPC clusters
- Restart Capability - Can resume from failures without redoing completed work
- Resource Management - Can specify CPU, memory, and other resource requirements
- Integration - Works well with containers and existing tools
Conclusion
The execution system in StarryNight provides a powerful approach to running pipelines in a reproducible, containerized manner. By combining a flexible execution model with the Snakemake backend, it enables complex workflows to be executed consistently across different environments.
The automatic generation of detailed, executable Snakefiles from high-level abstractions is one of the most impressive achievements of the StarryNight architecture. This capability demonstrates the power of the separation between definition and execution in the system design, allowing complex workflows to be defined at a high level and automatically translated into executable form.
The execution system bridges the gap between abstract pipeline definitions and concrete execution, providing the final layer in StarryNight's architecture that turns conceptual workflows into running processes.
Next: Configuration Layer