StarryNight Configuration Layer
Overview
The configuration layer in StarryNight consists of two interconnected systems: experiment configuration, which manages experiment-specific parameters and infers settings from data, and module configuration, which connects these parameters with module specifications. Together, these systems enable automatic setup of complex pipelines with minimal manual input, creating a bridge between user-provided parameters and the detailed configuration needed for pipeline execution.
Cross-Cutting Nature
Unlike the other five layers, which operate in a primarily sequential flow, the Configuration Layer functions as a dimensional plane intersecting all of them. It provides contextual intelligence and adaptive parameters throughout the system:
- With Algorithm Layer: Provides parameters that influence algorithm behavior
- With CLI Layer: Supplies default values and validation rules for command parameters
- With Module Layer: Transforms high-level experiment parameters into detailed specifications
- With Pipeline Layer: Influences how modules are composed and connected
- With Execution Layer: Controls runtime behavior and resource allocation
This cross-cutting nature allows the Configuration Layer to maintain state across layer boundaries and adapt the system to specific experimental contexts without changing the core architecture.
Purpose
The configuration layer in StarryNight serves several critical purposes:
- Parameter Management - Collecting and organizing essential parameters
- Parameter Inference - Determining settings automatically from data
- Standardization - Providing consistent configuration across modules
- Extensibility - Supporting different experiment types
- Module Configuration - Simplifying the setup of pipeline modules
- Automatic Setup - Configuring modules with minimal manual input
- Parameter Consistency - Ensuring consistent parameters across pipeline steps
- Path Management - Setting up standardized input and output paths
Experiment Configuration
Experiment configuration provides a systematic way to manage experiment-specific parameters and infer settings from data, built on the Pydantic data validation framework.
Experiment Classes
Experiments are implemented as Python classes that inherit from a base experiment class:
class PCPGenericExperiment(ExperimentBase):
"""
Experiment configuration for generic Plate Cell Painting.
"""
# Implementation...
From Index Method
A critical method in experiment classes is from_index, which initializes the experiment from index data.
This method takes two main parameters:
- Index Path - Path to the index file generated by indexing
- Initial Config - User-provided parameters that cannot be inferred
@staticmethod
def from_index(index_path: Path, init_config: dict) -> Self:
    """Configure experiment with index."""
    init_config_parsed = PCPGenericInitConfig.model_validate(init_config)
    if index_path.name.endswith(".csv"):
        index_df = pl.scan_csv(index_path)
    else:
        index_df = pl.scan_parquet(index_path)
    # Get dataset_id from index
    dataset_id = (
        index_df.select(pl.col("dataset_id")).unique().collect().rows()[0][0]
    )
    # Extract images per well from data
    # (cp_images_df is a view of index_df filtered to Cell Painting images;
    # the filtering step is omitted in this excerpt)
    cp_im_per_well = (
        cp_images_df.group_by("batch_id", "plate_id", "well_id")
        .agg(pl.col("key").count())
        .collect()
        .select(pl.col("key"))
        .unique()
        .rows()[0][0]
    )
Initial Configuration
The initial configuration includes parameters that cannot be inferred from data:
pcp_init_config = {
"nuclear_channel": "DAPI",
"cell_channel": "CellMask",
"mito_channel": "MitoTracker",
"barcode_csv_path": "/path/to/barcodes.csv",
"image_overlap_percentage": 10
}
These parameters are experiment-specific and must be provided by the user.
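Before use, the initial configuration is validated against a Pydantic model (PCPGenericInitConfig in the from_index excerpt above). A minimal sketch of what such a model might look like, assuming the field names shown in the dictionary above; the actual class in StarryNight may declare additional fields or constraints:
from pathlib import Path

from pydantic import BaseModel, Field

# Hypothetical sketch of the init-config validation model
class PCPGenericInitConfig(BaseModel):
    nuclear_channel: str
    cell_channel: str
    mito_channel: str
    barcode_csv_path: Path
    image_overlap_percentage: int = Field(ge=0, le=100)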
Parameter Inference
A key feature of experiment classes is their ability to infer parameters from data. The system uses Polars (a high-performance dataframe library) to perform complex data analysis tasks:
# Extract number of cycles
sbs_n_cycles = (
    sbs_images_df.select(pl.col("cycle_id").unique().count())
    .collect()
    .rows()[0][0]
)
# Extract channel list
sbs_channel_lists = (
    sbs_images_df.select(pl.col("channel_dict"))
    .collect()
    .to_dict()["channel_dict"]
)
# Deduplicate the list of channel lists (adapted from Stack Overflow)
sbs_channel_list = [list(x) for x in set(tuple(x) for x in sbs_channel_lists)]
Examples of inferred parameters:
- Images per Well - Calculated from inventory data
- Channel Count - Determined from image metadata
- Channel List - Extracted from available images
- Dataset Structure - Inferred from file organization
This inference reduces the manual configuration burden on users.
Using Experiment Configurations
Once configured, the experiment object is passed to modules when creating them:
# Create a module with experiment configuration
illum_calc_module = CPIllumCalcLoadDataModule.from_config(
data_config=data_config,
experiment=pcp_experiment
)
The module then uses the experiment configuration to set its parameters.
Different Experiment Types
The architecture supports different experiment types through a class-based inheritance system. Each experiment type can have its own class with specific parameter inference logic.
Creating New Experiment Types
To create a new experiment type:
- Create a new file in the experiments folder
- Define a class that inherits from ExperimentBase
- Implement the from_index method
- Define parameter inference logic
- Register the experiment class in the registry
Experiment Registry
Experiment classes are registered in a central registry so they can be discovered by name:
from starrynight.experiments.registry import register_experiment
@register_experiment("pcp_generic")
class PCPGenericExperiment(ExperimentBase):
"""Experiment configuration for generic Plate Cell Painting."""
# Implementation...
This allows experiments to be looked up by name.
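For example, a caller can resolve an experiment class by its registered name rather than importing it directly. The registry object shown here (EXPERIMENT_REGISTRY as a simple name-to-class mapping) is an assumption for illustration; the actual lookup API may differ:
# Hypothetical lookup by registered name
from starrynight.experiments.registry import EXPERIMENT_REGISTRY

experiment_cls = EXPERIMENT_REGISTRY["pcp_generic"]
pcp_experiment = experiment_cls.from_index(index_path, pcp_init_config)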
Module Configuration
Module configuration connects experiment parameters with module specifications, allowing modules to be automatically configured based on experiment settings and data configurations.
from_config Method
The primary method for module configuration is from_config:
@staticmethod
def from_config(
    data: DataConfig,
    experiment: Experiment | None = None,
    spec: Container | None = None,
) -> "StarrynightModule":
    """Create module from experiment and data config."""
    if spec is None:
        spec = CPCalcIllumInvokeCPModule._spec()
        spec.inputs[0].path = (
            data.workspace_path.joinpath(
                CP_ILLUM_CALC_CP_CPPIPE_OUT_PATH_SUFFIX,
                CP_ILLUM_CALC_CP_CPPIPE_OUT_NAME,
            )
            .resolve()
            .__str__()
        )
        # ... remaining inputs and outputs are configured similarly,
        # then the module is constructed from the finished spec ...
This method creates a configured module based on the provided configurations.
Data Configuration
The DataConfig object provides essential path information:
class DataConfig(BaseModel):
"""Data configuration schema."""
dataset_path: Path | CloudPath
storage_path: Path | CloudPath
workspace_path: Path | CloudPath
These paths are used to locate inputs and set up outputs.
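Because each field accepts either Path or CloudPath, the same schema works for local and cloud-backed data. A brief sketch, assuming cloudpathlib's CloudPath and an illustrative S3 bucket name:
from pathlib import Path

from cloudpathlib import CloudPath

# Local workspace and scratch space, cloud-hosted dataset (illustrative paths)
data_config = DataConfig(
    dataset_path=CloudPath("s3://example-bucket/images"),
    storage_path=Path("/scratch/starrynight"),
    workspace_path=Path("/home/user/workspace"),
)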
Experiment Integration
The experiment parameter provides experiment-specific information, allowing modules to adapt to specific experiment requirements.
Spec Parameter
The spec parameter allows custom specifications:
# Create module with custom spec
custom_spec = create_custom_spec()
module = ModuleClass.from_config(data_config, experiment, spec=custom_spec)
If not provided, a default spec is created based on the data and experiment configurations.
Implementation Pattern
The typical implementation of from_config follows this pattern:
- Create default spec if none provided
- Configure inputs based on data_config
- Configure parameters based on experiment (if provided)
- Configure outputs based on data_config
- Create and return module with configured spec
@classmethod
def from_config(cls, data_config, experiment=None, spec=None):
# Create default spec if none provided
if spec is None:
spec = cls().spec
# Configure inputs from data_config
spec.inputs["workspace_path"].value = data_config.workspace_path
# Configure based on experiment if provided
if experiment is not None:
spec.inputs["nuclear_channel"].value = experiment.nuclear_channel
spec.inputs["cell_channel"].value = experiment.cell_channel
# Configure outputs
output_path = data_config.workspace_path / "results" / "output.csv"
spec.outputs["results"].value = output_path
# Create and return module
return cls(spec=spec)
Configuration Flow Examples
Example: Segmentation Check Module
The segmentation check module provides a good example of experiment integration, requiring experiment-specific channel information.
Example: Illumination Calculation Module
The illumination calculation module demonstrates that not all modules require experiment parameters. Some modules only use data configuration for paths, without experiment-specific parameters.
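A minimal sketch of this path-only pattern is shown below. The spec keys (images_path, illum_functions) are assumptions for illustration, not the real module's port names:
@classmethod
def from_config(cls, data_config, experiment=None, spec=None):
    # Path-only configuration: the experiment argument is accepted for
    # interface consistency but never read.
    if spec is None:
        spec = cls().spec
    spec.inputs["images_path"].value = data_config.dataset_path
    spec.outputs["illum_functions"].value = (
        data_config.workspace_path / "illum" / "illum_functions.npy"
    )
    return cls(spec=spec)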
Detailed Configuration Flow
Let's examine the detailed configuration flow for a module:
1. Create Module Instance
# In a notebook or script
segcheck_module = CPSegcheckGenCPipeModule.from_config(
data_config=data_config,
experiment=pcp_experiment
)
2. from_config Implementation
@classmethod
def from_config(cls, data_config, experiment=None, spec=None):
if spec is None:
spec = cls().spec
# Configure workspace path
spec.inputs["workspace_path"].value = data_config.workspace_path
# Configure load data path
load_data_path = data_config.workspace_path / "load_data" / "segcheck_load_data.csv"
spec.inputs["load_data"].value = load_data_path
# Configure channels from experiment
if experiment is not None:
spec.inputs["nuclear_channel"].value = experiment.nuclear_channel
spec.inputs["cell_channel"].value = experiment.cell_channel
# Configure output pipeline path
pipeline_path = data_config.workspace_path / "pipelines" / "segcheck_pipeline.cppipe"
spec.outputs["pipeline"].value = pipeline_path
# Configure output notebook path
notebook_path = data_config.workspace_path / "notebooks" / "segcheck_visualization.ipynb"
spec.outputs["notebook"].value = notebook_path
return cls(spec=spec)
3. Create Pipeline
The configured module then uses this information to create its pipeline:
def create_pipeline(self):
# Construct CLI command using spec values
command = [
"starrynight", "segcheck", "generate-pipeline",
"--output-path", str(self.spec.outputs["pipeline"].value),
"--load-data", str(self.spec.inputs["load_data"].value),
"--nuclear-channel", str(self.spec.inputs["nuclear_channel"].value),
"--cell-channel", str(self.spec.inputs["cell_channel"].value)
]
# Create pipeline with container
pipeline = pc.Pipeline()
with pipeline.sequential() as seq:
seq.container(
name="segcheck_pipeline_gen",
inputs={
"load_data": str(self.spec.inputs["load_data"].value),
"workspace": str(self.spec.inputs["workspace_path"].value)
},
outputs={
"pipeline": str(self.spec.outputs["pipeline"].value)
},
container_config=pc.ContainerConfig(
image="cellprofiler/starrynight:latest",
command=command
)
)
return pipeline
Path Handling Patterns
Modules follow consistent patterns for handling paths:
- Input Data - Typically under data_config.dataset_path
- Intermediate Data - Under data_config.workspace_path, with subdirectories:
  - load_data/ - For load data files
  - pipelines/ - For pipeline files
  - results/ - For processing results
- Execution Data - Under data_config.storage_path / "runs" / module_name
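A short sketch of how these conventions compose in practice (the module name and directory layout follow the pattern above and are purely illustrative):
# Illustrative path construction following the conventions above
module_name = "segcheck_pipeline"
load_data_dir = data_config.workspace_path / "load_data"
pipelines_dir = data_config.workspace_path / "pipelines"
results_dir = data_config.workspace_path / "results"
run_dir = data_config.storage_path / "runs" / module_name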
Common Module Sets and Their Configuration
Different module sets have different configuration patterns:
CP Modules (Cell Painting)
CP modules typically require:
- Nuclear channel
- Cell channel
- Other specific channels (e.g., Mito channel)
- Paths to Cell Painting images
SBS Modules (Sequencing By Synthesis)
SBS modules typically require:
- Barcode information
- Image overlap percentage
- Paths to SBS images
Common Modules (Index, Inventory)
These modules typically require only:
- Data configuration for paths
- No experiment-specific parameters
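For instance, an inventory module can be configured from the data configuration alone. The module name and import path below are assumptions for illustration and may differ in your version of StarryNight:
# Assumed module name for the inventory generator
from starrynight.modules.gen_inv import GenInvModule

inventory_module = GenInvModule.from_config(data_config)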
Advanced Configuration Topics
Updating Module Configuration
Module configurations can be updated after creation:
# Create module with default configuration
module = CPSegcheckGenCPipeModule.from_config(data_config, experiment)
# Update a parameter
module.spec.inputs["nuclear_channel"].value = "New_DAPI_Channel"
# Regenerate the pipeline
updated_pipeline = module.create_pipeline()
This allows for dynamic reconfiguration.
Serialization and Deserialization
Experiment configurations can be serialized and deserialized using Pydantic's built-in JSON capabilities:
# Serialize experiment to JSON
experiment_json = pcp_experiment.model_dump_json()
# Save to file
with open("experiment_config.json", "w") as f:
f.write(experiment_json)
# Later, load from file
with open("experiment_config.json", "r") as f:
experiment_json = f.read()
# Deserialize experiment
pcp_experiment = PCPGenericExperiment.model_validate_json(experiment_json)
This allows experiment configurations to be saved and restored.
Creating Custom Module Configurations
To create custom module configurations:
- Create a ModuleSpec with the desired inputs and outputs
- Set values for all input parameters
- Set paths for all output parameters
- Create the module with the custom spec
# Create custom spec
spec = bl.ModuleSpec(
name="Custom CP Segcheck Pipeline Generator",
inputs={
"load_data": bl.PortSpec(type="file", value="/path/to/custom/load_data.csv"),
"workspace_path": bl.PortSpec(type="directory", value="/path/to/custom/workspace"),
"nuclear_channel": bl.PortSpec(type="string", value="CustomDAPI"),
"cell_channel": bl.PortSpec(type="string", value="CustomCellMask")
},
outputs={
"pipeline": bl.PortSpec(type="file", value="/path/to/custom/pipeline.cppipe")
}
)
# Create module with custom spec
module = CPSegcheckGenCPipeModule.from_config(
data_config=data_config,
experiment=None, # Not needed since spec is fully configured
spec=spec
)
Data Analysis Utilities
StarryNight includes powerful utilities to extract configuration from complex data structures:
# Extract configurations from data
def get_channels_by_batch_plate(
    df: pl.LazyFrame, batch_id: str, plate_id: str
) -> list[str]:
    channels = (
        df.filter(
            pl.col("batch_id").eq(batch_id)
            & pl.col("plate_id").eq(plate_id)
            & pl.col("channel_dict").is_not_null()
        )
        .select(pl.col("channel_dict").explode().unique(maintain_order=True))
        .collect()
        .to_series()
        .to_list()
    )
    return channels
# Detecting hierarchical structure in data
def gen_image_hierarchy(df: pl.LazyFrame) -> dict:
hierarchy_dict = {}
batches = get_batches(df)
for batch in batches:
plates = get_plates_by_batch(df, batch)
hierarchy_dict[batch] = {}
for plate in plates:
wells = get_wells_by_batch_plate(df, batch, plate)
hierarchy_dict[batch][plate] = {}
for well in wells:
sites = get_sites_by_batch_plate_well(df, batch, plate, well)
hierarchy_dict[batch][plate][well] = sites
return hierarchy_dict
These utilities help extract structured information from complex datasets.
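A brief usage sketch of these helpers during parameter inference, assuming index_df is the lazy frame loaded in from_index; the batch and plate chosen here are simply the first ones found:
# Illustrative use of the data analysis utilities during inference
hierarchy = gen_image_hierarchy(index_df)
first_batch = next(iter(hierarchy))
first_plate = next(iter(hierarchy[first_batch]))
channels = get_channels_by_batch_plate(index_df, first_batch, first_plate)
print(f"{first_batch}/{first_plate} channels: {channels}")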
Complete Examples
Example: Complete Experiment Class
Here's a more complete example of an experiment class using Pydantic for validation:
@register_experiment("pcp_generic")
class PCPGenericExperiment(ExperimentBase):
"""
Experiment configuration for generic Plate Cell Painting.
"""
# User-provided parameters
nuclear_channel: str
cell_channel: str
mito_channel: str
barcode_csv_path: Path
image_overlap_percentage: int
# Inferred parameters
dataset_id: str | None = None
cp_images_df: pl.LazyFrame | None = None
sbs_images_df: pl.LazyFrame | None = None
images_per_well: int | None = None
cp_channels: list[str] | None = None
cp_channel_count: int | None = None
sbs_channels: list[str] | None = None
sbs_channel_count: int | None = None
@staticmethod
def from_index(index_path: Path, init_config: dict) -> Self:
"""Create experiment from index and initial config."""
# Validate initial config with Pydantic
init_config_parsed = PCPGenericInitConfig.model_validate(init_config)
# Load index and extract dataset_id
index_df = pl.scan_parquet(index_path)
dataset_id = index_df.select(pl.col("dataset_id")).unique().collect().rows()[0][0]
# Create and configure experiment instance
experiment = PCPGenericExperiment(
nuclear_channel=init_config_parsed.nuclear_channel,
cell_channel=init_config_parsed.cell_channel,
mito_channel=init_config_parsed.mito_channel,
barcode_csv_path=init_config_parsed.barcode_csv_path,
image_overlap_percentage=init_config_parsed.image_overlap_percentage,
dataset_id=dataset_id
)
# Infer additional parameters using Polars
# (implementation details...)
return experiment
def model_post_init(self, __context: Any) -> None:
"""Validate experiment configuration."""
# Additional validation can be performed here
if not self.nuclear_channel:
raise ValueError("Nuclear channel must be specified")
Example: Notebook Workflow
Here's how configuration fits into a typical notebook workflow:
# Import necessary components
from starrynight.config import DataConfig
from starrynight.experiments.pcp_generic import PCPGenericExperiment
from starrynight.modules.cp_segcheck import CPSegcheckGenLoadDataModule, CPSegcheckGenCPipeModule
import pipecraft as pc
import pathlib
# Set up paths
workspace_path = pathlib.Path("/path/to/workspace")
dataset_path = pathlib.Path("/path/to/images")
storage_path = pathlib.Path("/path/to/scratch")
# Create data config
data_config = DataConfig(
workspace_path=workspace_path,
dataset_path=dataset_path,
storage_path=storage_path
)
# Configure experiment
pcp_init_config = {
"nuclear_channel": "DAPI",
"cell_channel": "CellMask",
"mito_channel": "MitoTracker",
"barcode_csv_path": str(workspace_path / "barcodes.csv"),
"image_overlap_percentage": 10
}
# Create experiment
pcp_experiment = PCPGenericExperiment.from_index(
index_path=data_config.workspace_path / "index.parquet",
init_config=pcp_init_config
)
# Create modules with configuration
load_data_module = CPSegcheckGenLoadDataModule.from_config(
data_config=data_config,
experiment=pcp_experiment
)
pipeline_module = CPSegcheckGenCPipeModule.from_config(
data_config=data_config,
experiment=pcp_experiment
)
# Configure backend
backend_config = pc.SnakemakeBackendConfig(
use_opentelemetry=False,
print_exec=True
)
exec_backend = pc.SnakemakeBackend(backend_config)
# Run modules
exec_backend.run(
load_data_module.pipeline,
config=backend_config,
working_dir=data_config.storage_path / "runs" / "segcheck_load_data"
)
exec_backend.run(
pipeline_module.pipeline,
config=backend_config,
working_dir=data_config.storage_path / "runs" / "segcheck_pipeline"
)
Comparison with Direct CLI Usage
The configuration approach differs significantly from direct CLI usage, providing a richer, more automated workflow with several benefits:
- Reduced Manual Configuration - Many parameters are inferred automatically
- Consistency - Parameters are defined once and used consistently
- Validation - Parameters can be validated during inference using Pydantic
- Extensibility - New experiment types can be added without changing modules
- Separation of Concerns - Experiment logic is separate from module logic
- Reduced Boilerplate - Minimal code required to set up modules
- Discoverability - Clear pattern for how modules are configured
- Flexibility - Custom specs can override defaults when needed
Conclusion
The experiment and module configuration systems in StarryNight provide a powerful approach to managing parameters, inferring settings from data, and automatically configuring modules. By separating experiment-specific logic from module implementation and providing standardized configuration patterns, they enable flexibility, extensibility, and consistency across the pipeline system.
Together, these configuration systems form a critical bridge between user input and pipeline execution, reducing manual configuration burden while maintaining flexibility for different experiment types and module implementations. They exemplify the architecture's focus on separation of concerns, allowing each component to focus on its specific role while working together to create a cohesive system.