StarryNight Configuration Layer

Overview

The configuration layer in StarryNight consists of two interconnected systems: experiment configuration, which manages experiment-specific parameters and infers settings from data, and module configuration, which connects these parameters with module specifications. Together, these systems enable automatic setup of complex pipelines with minimal manual input, creating a bridge between user-provided parameters and the detailed configuration needed for pipeline execution.

Cross-Cutting Nature

Unlike the other five layers, which operate in a primarily sequential flow, the Configuration Layer functions as a dimensional plane that intersects with all layers. It provides contextual intelligence and adaptive parameters throughout the system:

  • With Algorithm Layer: Provides parameters that influence algorithm behavior
  • With CLI Layer: Supplies default values and validation rules for command parameters
  • With Module Layer: Transforms high-level experiment parameters into detailed specifications
  • With Pipeline Layer: Influences how modules are composed and connected
  • With Execution Layer: Controls runtime behavior and resource allocation

This cross-cutting nature allows the Configuration Layer to maintain state across layer boundaries and adapt the system to specific experimental contexts without changing the core architecture.

Purpose

The configuration layer in StarryNight serves several critical purposes:

  1. Parameter Management - Collecting and organizing essential parameters
  2. Parameter Inference - Determining settings automatically from data
  3. Standardization - Providing consistent configuration across modules
  4. Extensibility - Supporting different experiment types
  5. Module Configuration - Simplifying the setup of pipeline modules
  6. Automatic Setup - Configuring modules with minimal manual input
  7. Parameter Consistency - Ensuring consistent parameters across pipeline steps
  8. Path Management - Setting up standardized input and output paths

Experiment Configuration

Experiment configuration provides a systematic way to manage experiment-specific parameters and infer settings from data, built on the Pydantic data validation framework.

Experiment Classes

Experiments are implemented as Python classes that inherit from a base experiment class:

class PCPGenericExperiment(ExperimentBase):
    """
    Experiment configuration for generic Plate Cell Painting.
    """
    # Implementation...

from_index Method

A critical method in experiment classes is from_index, which initializes the experiment from index data. It takes two main parameters:

  1. Index Path - Path to the index file generated by indexing
  2. Initial Config - User-provided parameters that cannot be inferred

An abridged implementation:

@staticmethod
def from_index(index_path: Path, init_config: dict) -> Self:
    """Configure experiment with index."""
    init_config_parsed = PCPGenericInitConfig.model_validate(init_config)
    if index_path.name.endswith(".csv"):
        index_df = pl.scan_csv(index_path)
    else:
        index_df = pl.scan_parquet(index_path)

    # Get dataset_id from index
    dataset_id = (
        index_df.select(pl.col("dataset_id")).unique().collect().rows()[0][0]
    )

    # Extract images per well from data
    # (cp_images_df holds just the Cell Painting image rows, derived from
    #  index_df; that filtering step is omitted from this excerpt)
    cp_im_per_well = (
        cp_images_df.group_by("batch_id", "plate_id", "well_id")
        .agg(pl.col("key").count())
        .collect()
        .select(pl.col("key"))
        .unique()
        .rows()[0][0]
    )

    # ... additional parameter inference and experiment construction elided

Initial Configuration

The initial configuration includes parameters that cannot be inferred from data:

pcp_init_config = {
    "nuclear_channel": "DAPI",
    "cell_channel": "CellMask",
    "mito_channel": "MitoTracker",
    "barcode_csv_path": "/path/to/barcodes.csv",
    "image_overlap_percentage": 10
}

These parameters are experiment-specific and must be provided by the user.
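
The from_index method shown earlier validates this dictionary with a Pydantic model before using it. Below is a minimal sketch of what that model might look like; the field names mirror the dictionary above, but the actual PCPGenericInitConfig definition may differ:

from pathlib import Path

from pydantic import BaseModel

class PCPGenericInitConfig(BaseModel):
    """Schema for user-provided parameters that cannot be inferred (illustrative)."""

    nuclear_channel: str
    cell_channel: str
    mito_channel: str
    barcode_csv_path: Path
    image_overlap_percentage: int

# model_validate raises a ValidationError if a key is missing or has the wrong type
init_config_parsed = PCPGenericInitConfig.model_validate(pcp_init_config)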

Parameter Inference

A key feature of experiment classes is their ability to infer parameters from data. The system uses Polars (a high-performance dataframe library) to perform complex data analysis tasks:

# Extract number of cycles
sbs_n_cycles = (
    sbs_images_df.select(pl.col("cycle_id").unique().count())
    .collect()
    .rows()[0][0]
)

# Extract channel list
sbs_channel_lists = (
    sbs_images_df.select(pl.col("channel_dict"))
    .collect()
    .to_dict()["channel_dict"]
)
# Deduplicate the per-image channel lists (tuples are hashable, lists are not)
sbs_channel_list = [list(x) for x in set(tuple(x) for x in sbs_channel_lists)]

Examples of inferred parameters:

  1. Images per Well - Calculated from inventory data
  2. Channel Count - Determined from image metadata
  3. Channel List - Extracted from available images
  4. Dataset Structure - Inferred from file organization

This inference reduces the manual configuration burden on users.

Using Experiment Configurations

Once configured, the experiment object is passed to modules when creating them:

# Create a module with experiment configuration
illum_calc_module = CPIllumCalcLoadDataModule.from_config(
    data_config=data_config,
    experiment=pcp_experiment
)

The module then uses the experiment configuration to set its parameters.

Different Experiment Types

The architecture supports different experiment types through a class-based inheritance system. Each experiment type can have its own class with specific parameter inference logic.

Creating New Experiment Types

To create a new experiment type:

  1. Create a new file in the experiments folder
  2. Define a class that inherits from ExperimentBase
  3. Implement the from_index method
  4. Define parameter inference logic
  5. Register the experiment class in the registry
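
Putting these steps together, a minimal skeleton might look like the following (class, field, and column names are placeholders, not an actual StarryNight experiment type):

from pathlib import Path
from typing import Self

import polars as pl

from starrynight.experiments.registry import register_experiment
# ExperimentBase is imported from its module, as in the examples above

@register_experiment("my_assay")          # step 5: register under a lookup name
class MyAssayExperiment(ExperimentBase):  # step 2: inherit from the base class
    """Illustrative experiment type."""

    # User-provided parameter
    primary_channel: str
    # Inferred parameter
    n_sites: int | None = None

    @staticmethod
    def from_index(index_path: Path, init_config: dict) -> Self:  # step 3
        index_df = pl.scan_parquet(index_path)
        # step 4: parameter inference (the "site_id" column is illustrative)
        n_sites = (
            index_df.select(pl.col("site_id").unique().count())
            .collect()
            .rows()[0][0]
        )
        return MyAssayExperiment(
            primary_channel=init_config["primary_channel"],
            n_sites=n_sites,
        )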

Experiment Registry

Experiments are registered in a registry to make them discoverable:

from starrynight.experiments.registry import register_experiment

@register_experiment("pcp_generic")
class PCPGenericExperiment(ExperimentBase):
    """Experiment configuration for generic Plate Cell Painting."""
    # Implementation...

This allows experiments to be looked up by name.
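
Conceptually, such a registry can be as simple as a dictionary populated by a class decorator. The following sketch illustrates the mechanism only and is not the actual starrynight.experiments.registry implementation:

# Illustrative registry mechanism
EXPERIMENT_REGISTRY: dict[str, type] = {}

def register_experiment(name: str):
    """Class decorator that records an experiment class under a lookup name."""
    def decorator(cls: type) -> type:
        EXPERIMENT_REGISTRY[name] = cls
        return cls
    return decorator

# Resolving an experiment class by name
experiment_cls = EXPERIMENT_REGISTRY["pcp_generic"]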

Module Configuration

Module configuration connects experiment parameters with module specifications, allowing modules to be automatically configured based on experiment settings and data configurations.

from_config Method

The primary method for module configuration is from_config:

@staticmethod
def from_config(
    data: DataConfig,
    experiment: Experiment | None = None,
    spec: Container | None = None,
) -> "StarrynightModule":
    """Create module from experiment and data config."""
    if spec is None:
        spec = CPCalcIllumInvokeCPModule._spec()
        spec.inputs[0].path = (
            data.workspace_path.joinpath(
                CP_ILLUM_CALC_CP_CPPIPE_OUT_PATH_SUFFIX,
                CP_ILLUM_CALC_CP_CPPIPE_OUT_NAME,
            )
            .resolve()
            .__str__()
        )

This method creates a configured module based on the provided configurations.

Data Configuration

The DataConfig object provides essential path information:

class DataConfig(BaseModel):
    """Data configuration schema."""

    dataset_path: Path | CloudPath
    storage_path: Path | CloudPath
    workspace_path: Path | CloudPath

These paths are used to locate inputs and set up outputs.
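
For example, a DataConfig can point at local directories or, assuming CloudPath comes from the cloudpathlib library, at cloud storage (all paths below are illustrative):

from pathlib import Path

from cloudpathlib import CloudPath  # assumed source of CloudPath

# Local layout
data_config = DataConfig(
    dataset_path=Path("/data/experiment_images"),
    storage_path=Path("/scratch/starrynight"),
    workspace_path=Path("/data/workspace"),
)

# S3-backed layout
data_config_s3 = DataConfig(
    dataset_path=CloudPath("s3://my-bucket/images"),
    storage_path=CloudPath("s3://my-bucket/scratch"),
    workspace_path=CloudPath("s3://my-bucket/workspace"),
)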

Experiment Integration

The experiment parameter provides experiment-specific information, allowing modules to adapt to specific experiment requirements.

Spec Parameter

The spec parameter allows custom specifications:

# Create module with custom spec
custom_spec = create_custom_spec()
module = ModuleClass.from_config(data_config, experiment, spec=custom_spec)

If not provided, a default spec is created based on the data and experiment configurations.

Implementation Pattern

The typical implementation of from_config follows this pattern:

  1. Create default spec if none provided
  2. Configure inputs based on data_config
  3. Configure parameters based on experiment (if provided)
  4. Configure outputs based on data_config
  5. Create and return module with configured spec

@classmethod
def from_config(cls, data_config, experiment=None, spec=None):
    # Create default spec if none provided
    if spec is None:
        spec = cls().spec

        # Configure inputs from data_config
        spec.inputs["workspace_path"].value = data_config.workspace_path

        # Configure based on experiment if provided
        if experiment is not None:
            spec.inputs["nuclear_channel"].value = experiment.nuclear_channel
            spec.inputs["cell_channel"].value = experiment.cell_channel

        # Configure outputs
        output_path = data_config.workspace_path / "results" / "output.csv"
        spec.outputs["results"].value = output_path

    # Create and return module
    return cls(spec=spec)

Configuration Flow Examples

Example: Segmentation Check Module

The segmentation check module provides a good example of experiment integration, requiring experiment-specific channel information.

Example: Illumination Calculation Module

The illumination calculation module demonstrates that not all modules require experiment parameters. Some modules only use data configuration for paths, without experiment-specific parameters.
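
For such modules, from_config can be called with the data configuration alone, as in this usage sketch:

# Paths alone are enough to configure this module; no experiment is passed
illum_calc_module = CPIllumCalcLoadDataModule.from_config(data_config)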

Detailed Configuration Flow

Let's examine the detailed configuration flow for a module:

1. Create Module Instance

# In a notebook or script
segcheck_module = CPSegcheckGenCPipeModule.from_config(
    data_config=data_config,
    experiment=pcp_experiment
)

2. from_config Implementation

@classmethod
def from_config(cls, data_config, experiment=None, spec=None):
    if spec is None:
        spec = cls().spec

        # Configure workspace path
        spec.inputs["workspace_path"].value = data_config.workspace_path

        # Configure load data path
        load_data_path = data_config.workspace_path / "load_data" / "segcheck_load_data.csv"
        spec.inputs["load_data"].value = load_data_path

        # Configure channels from experiment
        if experiment is not None:
            spec.inputs["nuclear_channel"].value = experiment.nuclear_channel
            spec.inputs["cell_channel"].value = experiment.cell_channel

        # Configure output pipeline path
        pipeline_path = data_config.workspace_path / "pipelines" / "segcheck_pipeline.cppipe"
        spec.outputs["pipeline"].value = pipeline_path

        # Configure output notebook path
        notebook_path = data_config.workspace_path / "notebooks" / "segcheck_visualization.ipynb"
        spec.outputs["notebook"].value = notebook_path

    return cls(spec=spec)

3. Create Pipeline

The configured module then uses this information to create its pipeline:

def create_pipeline(self):
    # Construct CLI command using spec values
    command = [
        "starrynight", "segcheck", "generate-pipeline",
        "--output-path", str(self.spec.outputs["pipeline"].value),
        "--load-data", str(self.spec.inputs["load_data"].value),
        "--nuclear-channel", str(self.spec.inputs["nuclear_channel"].value),
        "--cell-channel", str(self.spec.inputs["cell_channel"].value)
    ]

    # Create pipeline with container
    pipeline = pc.Pipeline()
    with pipeline.sequential() as seq:
        seq.container(
            name="segcheck_pipeline_gen",
            inputs={
                "load_data": str(self.spec.inputs["load_data"].value),
                "workspace": str(self.spec.inputs["workspace_path"].value)
            },
            outputs={
                "pipeline": str(self.spec.outputs["pipeline"].value)
            },
            container_config=pc.ContainerConfig(
                image="cellprofiler/starrynight:latest",
                command=command
            )
        )

    return pipeline

Path Handling Patterns

Modules follow consistent patterns for handling paths:

  1. Input Data - Typically under data_config.dataset_path
  2. Intermediate Data - Under data_config.workspace_path, with subdirectories:
       • load_data/ - For load data files
       • pipelines/ - For pipeline files
       • results/ - For processing results
  3. Execution Data - Under data_config.storage_path / "runs" / module_name
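
The sketch below shows how a module might derive these standard locations (the directory and file names are examples only):

# Intermediate artifacts live under the workspace
load_data_path = data_config.workspace_path / "load_data" / "segcheck_load_data.csv"
pipeline_path = data_config.workspace_path / "pipelines" / "segcheck_pipeline.cppipe"
results_path = data_config.workspace_path / "results" / "segcheck"

# Execution artifacts (run state, logs) live under scratch storage
run_dir = data_config.storage_path / "runs" / "cp_segcheck_gen_cppipe"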

Common Module Sets and Their Configuration

Different module sets have different configuration patterns:

CP Modules (Cell Painting)

CP modules typically require:

  • Nuclear channel
  • Cell channel
  • Other specific channels (e.g., Mito channel)
  • Paths to Cell Painting images

SBS Modules (Sequencing By Synthesis)

SBS modules typically require:

  • Barcode information
  • Image overlap percentage
  • Paths to SBS images

Common Modules (Index, Inventory)

These modules typically require only:

  • Data configuration for paths

No experiment-specific parameters are needed.

Advanced Configuration Topics

Updating Module Configuration

Module configurations can be updated after creation:

# Create module with default configuration
module = CPSegcheckGenCPipeModule.from_config(data_config, experiment)

# Update a parameter
module.spec.inputs["nuclear_channel"].value = "New_DAPI_Channel"

# Regenerate the pipeline
updated_pipeline = module.create_pipeline()

This allows for dynamic reconfiguration.

Serialization and Deserialization

Experiment configurations can be serialized and deserialized using Pydantic's built-in JSON capabilities:

# Serialize experiment to JSON
experiment_json = pcp_experiment.model_dump_json()

# Save to file
with open("experiment_config.json", "w") as f:
    f.write(experiment_json)

# Later, load from file
with open("experiment_config.json", "r") as f:
    experiment_json = f.read()

# Deserialize experiment
pcp_experiment = PCPGenericExperiment.model_validate_json(experiment_json)

This allows experiment configurations to be saved and restored.

Creating Custom Module Configurations

To create custom module configurations:

  1. Create a ModuleSpec with the desired inputs and outputs
  2. Set values for all input parameters
  3. Set paths for all output parameters
  4. Create the module with the custom spec

# Create custom spec
spec = bl.ModuleSpec(
    name="Custom CP Segcheck Pipeline Generator",
    inputs={
        "load_data": bl.PortSpec(type="file", value="/path/to/custom/load_data.csv"),
        "workspace_path": bl.PortSpec(type="directory", value="/path/to/custom/workspace"),
        "nuclear_channel": bl.PortSpec(type="string", value="CustomDAPI"),
        "cell_channel": bl.PortSpec(type="string", value="CustomCellMask")
    },
    outputs={
        "pipeline": bl.PortSpec(type="file", value="/path/to/custom/pipeline.cppipe")
    }
)

# Create module with custom spec
module = CPSegcheckGenCPipeModule.from_config(
    data_config=data_config,
    experiment=None,  # Not needed since spec is fully configured
    spec=spec
)

Data Analysis Utilities

StarryNight includes powerful utilities to extract configuration from complex data structures:

# Extract configurations from data
def get_channels_by_batch_plate(
    df: pl.LazyFrame, batch_id: str, plate_id: str
) -> list[str]:
    channels = (
        df.filter(
            pl.col("batch_id").eq(batch_id)
            & pl.col("plate_id").eq(plate_id)
            & pl.col("channel_dict").is_not_null()
        )
        .select(pl.col("channel_dict").explode().unique(maintain_order=True))
        .collect()
        .to_series()
        .to_list()
    )
    return channels

# Detect the hierarchical structure of the data
# (get_batches, get_plates_by_batch, get_wells_by_batch_plate, and
#  get_sites_by_batch_plate_well are sibling utilities that return the
#  unique values at each level of the hierarchy)
def gen_image_hierarchy(df: pl.LazyFrame) -> dict:
    hierarchy_dict = {}
    batches = get_batches(df)
    for batch in batches:
        plates = get_plates_by_batch(df, batch)
        hierarchy_dict[batch] = {}
        for plate in plates:
            wells = get_wells_by_batch_plate(df, batch, plate)
            hierarchy_dict[batch][plate] = {}
            for well in wells:
                sites = get_sites_by_batch_plate_well(df, batch, plate, well)
                hierarchy_dict[batch][plate][well] = sites
    return hierarchy_dict

These utilities help extract structured information from complex datasets.
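
A brief usage sketch, assuming index_df is the LazyFrame loaded from the index and that "Batch1" and "Plate1" exist in the data:

index_df = pl.scan_parquet(data_config.workspace_path / "index.parquet")

# Channels present for one plate
channels = get_channels_by_batch_plate(index_df, "Batch1", "Plate1")

# Nested batch -> plate -> well -> sites mapping for the whole dataset
hierarchy = gen_image_hierarchy(index_df)
for batch_id, plates in hierarchy.items():
    print(batch_id, list(plates))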

Complete Examples

Example: Complete Experiment Class

Here's a more complete example of an experiment class using Pydantic for validation:

@register_experiment("pcp_generic")
class PCPGenericExperiment(ExperimentBase):
    """
    Experiment configuration for generic Plate Cell Painting.
    """
    # User-provided parameters
    nuclear_channel: str
    cell_channel: str
    mito_channel: str
    barcode_csv_path: Path
    image_overlap_percentage: int

    # Inferred parameters
    dataset_id: str | None = None
    cp_images_df: pl.LazyFrame | None = None
    sbs_images_df: pl.LazyFrame | None = None
    images_per_well: int | None = None
    cp_channels: list[str] | None = None
    cp_channel_count: int | None = None
    sbs_channels: list[str] | None = None
    sbs_channel_count: int | None = None

    @staticmethod
    def from_index(index_path: Path, init_config: dict) -> Self:
        """Create experiment from index and initial config."""
        # Validate initial config with Pydantic
        init_config_parsed = PCPGenericInitConfig.model_validate(init_config)

        # Load index and extract dataset_id
        index_df = pl.scan_parquet(index_path)
        dataset_id = index_df.select(pl.col("dataset_id")).unique().collect().rows()[0][0]

        # Create and configure experiment instance
        experiment = PCPGenericExperiment(
            nuclear_channel=init_config_parsed.nuclear_channel,
            cell_channel=init_config_parsed.cell_channel,
            mito_channel=init_config_parsed.mito_channel,
            barcode_csv_path=init_config_parsed.barcode_csv_path,
            image_overlap_percentage=init_config_parsed.image_overlap_percentage,
            dataset_id=dataset_id
        )

        # Infer additional parameters using Polars
        # (implementation details...)

        return experiment

    def model_post_init(self, __context: Any) -> None:
        """Validate experiment configuration."""
        # Additional validation can be performed here
        if not self.nuclear_channel:
            raise ValueError("Nuclear channel must be specified")

Example: Notebook Workflow

Here's how configuration fits into a typical notebook workflow:

# Import necessary components
from starrynight.config import DataConfig
from starrynight.experiments.pcp_generic import PCPGenericExperiment
from starrynight.modules.cp_segcheck import CPSegcheckGenLoadDataModule, CPSegcheckGenCPipeModule
import pipecraft as pc
import pathlib

# Set up paths
workspace_path = pathlib.Path("/path/to/workspace")
dataset_path = pathlib.Path("/path/to/images")
storage_path = pathlib.Path("/path/to/scratch")

# Create data config
data_config = DataConfig(
    workspace_path=workspace_path,
    dataset_path=dataset_path,
    storage_path=storage_path
)

# Configure experiment
pcp_init_config = {
    "nuclear_channel": "DAPI",
    "cell_channel": "CellMask",
    "mito_channel": "MitoTracker",
    "barcode_csv_path": str(workspace_path / "barcodes.csv"),
    "image_overlap_percentage": 10
}

# Create experiment
pcp_experiment = PCPGenericExperiment.from_index(
    index_path=data_config.workspace_path / "index.parquet",
    init_config=pcp_init_config
)

# Create modules with configuration
load_data_module = CPSegcheckGenLoadDataModule.from_config(
    data_config=data_config,
    experiment=pcp_experiment
)

pipeline_module = CPSegcheckGenCPipeModule.from_config(
    data_config=data_config,
    experiment=pcp_experiment
)

# Configure backend
backend_config = pc.SnakemakeBackendConfig(
    use_opentelemetry=False,
    print_exec=True
)
exec_backend = pc.SnakemakeBackend(backend_config)

# Run modules
exec_backend.run(
    load_data_module.pipeline,
    config=backend_config,
    working_dir=data_config.storage_path / "runs" / "segcheck_load_data"
)

exec_backend.run(
    pipeline_module.pipeline,
    config=backend_config,
    working_dir=data_config.storage_path / "runs" / "segcheck_pipeline"
)

Comparison with Direct CLI Usage

The configuration approach differs significantly from direct CLI usage, providing a richer, more automated workflow with several benefits:

  1. Reduced Manual Configuration - Many parameters are inferred automatically
  2. Consistency - Parameters are defined once and used consistently
  3. Validation - Parameters can be validated during inference using Pydantic
  4. Extensibility - New experiment types can be added without changing modules
  5. Separation of Concerns - Experiment logic is separate from module logic
  6. Reduced Boilerplate - Minimal code required to set up modules
  7. Discoverability - Clear pattern for how modules are configured
  8. Flexibility - Custom specs can override defaults when needed

Conclusion

The experiment and module configuration systems in StarryNight provide a powerful approach to managing parameters, inferring settings from data, and automatically configuring modules. By separating experiment-specific logic from module implementation and providing standardized configuration patterns, they enable flexibility, extensibility, and consistency across the pipeline system.

Together, these configuration systems form a critical bridge between user input and pipeline execution, reducing manual configuration burden while maintaining flexibility for different experiment types and module implementations. They exemplify the architecture's focus on separation of concerns, allowing each component to focus on its specific role while working together to create a cohesive system.

Next: Architecture for Biologists