AWS Deployment Roadmap
This roadmap outlines our 8-week plan to deploy StarryNight on AWS, transitioning from local servers to cloud infrastructure.
AWS Deployment Architecture
Core Services
- AWS Batch: Orchestrates containerized compute jobs
- EC2: Hosts login node for job coordination
- S3: Stores data and results
- Container Registry: Public registry (GitHub or similar) for images
- Supporting: CloudWatch (logs), IAM (access), VPC (networking)
System Topology
The AWS deployment follows a hub-and-spoke architecture:
```mermaid
flowchart TB
    subgraph "Control Plane"
        EC2[EC2 Login Node<br/>StarryNight Coordinator]
        GRAFANA[Grafana Stack<br/>OpenTelemetry Collector]
    end
    subgraph "Compute Plane"
        BATCH[AWS Batch<br/>Job Queue]
        COMPUTE[Compute Environment<br/>EC2/Fargate]
    end
    subgraph "Storage"
        S3[S3 Buckets<br/>Data Lake]
        REG[Container Registry<br/>GitHub/Public]
    end
    subgraph "Infrastructure Management"
        PULUMI[Pulumi<br/>IaC]
    end
    EC2 -->|Submit Jobs| BATCH
    BATCH -->|Schedule| COMPUTE
    COMPUTE -->|Read/Write| S3
    COMPUTE -->|Pull Images| REG
    COMPUTE -->|Send Telemetry| GRAFANA
    EC2 <-->|Monitor| GRAFANA
    PULUMI -->|Provision| EC2 & BATCH & S3
```
Monthly Costs (excluding compute/storage)
| Service | Cost | Notes |
|---|---|---|
| EC2 Login Node | $30-40 | t3.medium, always-on |
| CloudWatch | $10-50 | Varies with job volume |
| Pulumi | $0-75 | Free tier likely sufficient |
| Total | $40-165 | Plus data transfer costs |
8-Week Deployment Plan
Phase 1: Infrastructure Foundation (Weeks 1-2)
- Configure Pulumi project structure (team has experience from CytoSkel)
- Define AWS Batch compute environments
- Create S3 bucket hierarchy
- Deploy EC2 login node (24/7 coordinator)
Key Unknown: Team has no AWS Batch experience; configuration requirements unclear
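Because the team already has Pulumi experience from CytoSkel, a minimal Python sketch of the Phase 1 resources might look like the following. The AMI, subnet, security group, and IAM role values are placeholders, and the real definitions will depend on VPC and IAM decisions made with IT.

```python
"""Minimal Pulumi sketch (Python) of the Phase 1 resources.
The AMI, subnet, security group, and IAM role values are placeholders."""
import pulumi
import pulumi_aws as aws

# S3 bucket (prefixes within it model the planned hierarchy)
data_bucket = aws.s3.Bucket("starrynight-data")

# Always-on EC2 login/coordinator node
login_node = aws.ec2.Instance(
    "starrynight-login",
    ami="ami-xxxxxxxx",          # placeholder: current Amazon Linux AMI for the region
    instance_type="t3.medium",   # matches the cost estimate above
)

# Managed, EC2-backed AWS Batch compute environment
compute_env = aws.batch.ComputeEnvironment(
    "starrynight-compute",
    type="MANAGED",
    compute_resources=aws.batch.ComputeEnvironmentComputeResourcesArgs(
        type="EC2",
        min_vcpus=0,
        max_vcpus=256,
        instance_types=["optimal"],
        subnets=["subnet-xxxxxxxx"],         # placeholder
        security_group_ids=["sg-xxxxxxxx"],  # placeholder
        instance_role="arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",  # placeholder
    ),
    service_role="arn:aws:iam::123456789012:role/AWSBatchServiceRole",  # placeholder
)

pulumi.export("data_bucket", data_bucket.bucket)
pulumi.export("compute_environment", compute_env.arn)
```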
Phase 2: Container Pipeline (Weeks 3-4)
- Select public container registry
- Adapt existing CI/CD pipeline for AWS containers
- Set up builds triggered by CellProfiler releases
- Test custom CellProfiler wrappers
Note: Custom StarryNight+CellProfiler containers must be maintained (official CellProfiler images cannot be used)
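One possible way to key builds off CellProfiler releases, assuming the pipeline watches PyPI; the actual trigger mechanism (scheduled CI job, webhook, etc.) is still to be decided:

```python
"""Sketch: detect a new CellProfiler release so CI can rebuild the custom container.
Assumes releases are tracked via the public PyPI JSON API; the state file and the
way CI consumes the output are illustrative choices, not a finalized design."""
import json
import urllib.request

PYPI_URL = "https://pypi.org/pypi/CellProfiler/json"
LAST_BUILT_FILE = "last_built_version.txt"  # hypothetical state file kept by CI


def latest_cellprofiler_version() -> str:
    with urllib.request.urlopen(PYPI_URL) as resp:
        return json.load(resp)["info"]["version"]


def main() -> None:
    latest = latest_cellprofiler_version()
    try:
        with open(LAST_BUILT_FILE) as fh:
            last_built = fh.read().strip()
    except FileNotFoundError:
        last_built = ""
    needs_rebuild = latest != last_built
    # CI (mechanism TBD) can parse this output to decide whether to run the build job
    print(f"cellprofiler_version={latest}")
    print(f"needs_rebuild={str(needs_rebuild).lower()}")


if __name__ == "__main__":
    main()
```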
Phase 3: Integration Testing (Weeks 5-6)
- Test job submission pipeline
- Validate Snakemake → AWS Batch translation
- Verify OpenTelemetry/Grafana integration
- Test Snakemake's job recovery features
Key Risk: CellProfiler error handling requires custom hooks
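A smoke test for the job submission pipeline could look like the sketch below, using boto3 against the queue and job definition created in earlier phases. The queue name, job definition name, region, and test command are placeholders.

```python
"""Smoke-test sketch for the job submission path (Phase 3).
Queue, job definition, region, and command are placeholders."""
import time

import boto3

batch = boto3.client("batch", region_name="us-east-1")  # region is an assumption


def submit_and_wait(queue: str, job_definition: str) -> str:
    """Submit one test job and poll until it reaches a terminal state."""
    job = batch.submit_job(
        jobName="starrynight-smoke-test",
        jobQueue=queue,
        jobDefinition=job_definition,
        containerOverrides={"command": ["cellprofiler", "--version"]},
    )
    job_id = job["jobId"]
    while True:
        status = batch.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(30)


if __name__ == "__main__":
    print(submit_and_wait("starrynight-queue", "starrynight-cellprofiler"))
```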
Phase 4: Production Ready (Weeks 7-8)
- Finalize documentation and runbooks
- Configure monitoring and alerts
- Complete VPC/security setup
- Optimize costs
Note: Security hardening deferred (internal users only)
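For the cost-monitoring item, a billing alarm is one plausible starting point; the threshold and SNS topic below are placeholders, and note that AWS billing metrics are only published in us-east-1.

```python
"""Sketch: a CloudWatch billing alarm for the cost-overrun risk.
Threshold and SNS topic ARN are placeholders."""
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="starrynight-monthly-cost",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                 # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=200.0,              # placeholder budget in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:starrynight-alerts"],  # placeholder topic
)
```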
Configuration Requirements
User Infrastructure Configuration
Based on planning discussions, users configure infrastructure through the UI:
- Module Parameter Exposure
    - Module authors expose parameters like memory and compute requirements
    - Example from discussion: the inventory module exposes a `dataset_path` parameter
    - Proposed: modules could expose `memory` or similar resource parameters
    - Backend implementation decides how to handle these parameters
    - Note: Specific parameter names and UI interface details TBD
- Configuration Flow
    - UI → Module → Pipeline → AWS Batch job definitions (see the sketch after this list)
    - Users cannot directly configure AWS Batch settings
    - Backend determines infrastructure choices (e.g., AWS Batch vs alternatives)
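To make the configuration flow concrete, here is an illustrative sketch of how exposed module parameters could be translated by the backend into AWS Batch container overrides. The parameter names (`dataset_path`, `memory_mb`, `vcpus`) and the `starrynight run` command are hypothetical; the document notes the real names and UI are TBD.

```python
"""Illustrative sketch of the UI → Module → Pipeline → AWS Batch flow.
Parameter names and the CLI command are placeholders, not the actual interface."""
from dataclasses import dataclass


@dataclass
class ModuleSpec:
    name: str
    dataset_path: str       # exposed to users in the UI
    memory_mb: int = 8192   # proposed resource parameter
    vcpus: int = 2


def to_batch_overrides(module: ModuleSpec) -> dict:
    """The backend decides how exposed parameters map onto AWS Batch;
    here they become containerOverrides for submit_job."""
    return {
        # hypothetical CLI entry point
        "command": ["starrynight", "run", module.name, "--dataset", module.dataset_path],
        "resourceRequirements": [
            {"type": "MEMORY", "value": str(module.memory_mb)},
            {"type": "VCPU", "value": str(module.vcpus)},
        ],
    }


# Example: an inventory module configured from the UI
overrides = to_batch_overrides(
    ModuleSpec(name="inventory", dataset_path="s3://starrynight-data/raw/")
)
```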
Job Failure and Restart Procedures
From the planning discussions:
- Snakemake Intelligence
    - Snakemake automatically tracks successful jobs and won't re-run them
    - Only failed or not-yet-run jobs execute on retry
    - Target-based execution model checks for output files (illustrated in the Snakefile sketch after this list)
- QC Review Points
    - QC steps implemented as modules that fail by default
    - Human review required before marking as passed
    - After review, the job can be manually marked to proceed
    - Note: Specific UI for QC approval TBD
- Individual Module Re-execution
    - Each module can be run independently with different parameters
    - Users can modify parameters and re-run specific modules
    - Logs available for each run attempt
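The Snakefile fragment below sketches how the target-based model and a fail-by-default QC gate could work. Rule names, file paths, the `starrynight run` command, and the manual-approval flag file are illustrative assumptions, not the actual StarryNight rules.

```
# Sketch: target-based re-execution plus a QC gate that fails until a human approves.
# Paths, rule names, and the approval mechanism are illustrative.

rule analyze_well:
    input:
        "data/{well}/images.done"
    output:
        "results/{well}/measurements.csv"   # if this exists, the rule is skipped on re-run
    shell:
        "starrynight run cellprofiler-analysis --well {wildcards.well}"

rule qc_gate:
    input:
        "results/{well}/measurements.csv"
    output:
        "qc/{well}/approved.flag"
    run:
        import os
        # Fail by default: the flag is only written once a human has reviewed the
        # results and recorded approval (the approval mechanism/UI is still TBD).
        if not os.path.exists(f"qc/{wildcards.well}/manual_approval.txt"):
            raise ValueError(f"QC for {wildcards.well} awaiting human review")
        open(output[0], "w").close()
```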
Partial Failure Recovery
Based on the discussion about 90% success / 10% failure scenarios:
- Automatic Detection
    - Snakemake identifies which jobs succeeded vs failed
    - Re-running a pipeline only executes failed jobs
    - Note: Specific mechanism for failure detection not fully detailed
- Telemetry and Monitoring
    - OpenTelemetry integration sends logs to the central Grafana stack
    - All stdout/stderr is piped through the telemetry system
    - Challenge: CellProfiler containers need custom wrappers for proper error reporting (see the wrapper sketch after this list)
- Resource Adjustment
    - Failed jobs can be retried with adjusted resources
    - Note: UI mechanism for resource adjustment per retry TBD
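A rough sketch of the kind of wrapper discussed above: it captures all stdout/stderr for the telemetry pipeline and judges success by the presence of expected output files rather than trusting CellProfiler's exit code alone. The specific output check (CSV files in the output directory) is an assumption, not a finalized design.

```python
"""Sketch of a CellProfiler wrapper for containers: capture all output and
decide success from expected output files, since exit codes are unreliable.
The CSV-based success check is an assumption."""
import subprocess
import sys
from pathlib import Path


def run_cellprofiler(pipeline: str, output_dir: str) -> int:
    proc = subprocess.run(
        ["cellprofiler", "-c", "-r", "-p", pipeline, "-o", output_dir],
        capture_output=True,
        text=True,
    )
    # Forward captured output so the container runtime / telemetry collector
    # sees everything CellProfiler wrote.
    sys.stdout.write(proc.stdout)
    sys.stderr.write(proc.stderr)

    # Don't trust the exit code alone: also require expected outputs to exist.
    produced_csvs = list(Path(output_dir).glob("*.csv"))
    if proc.returncode != 0 or not produced_csvs:
        sys.stderr.write("CellProfiler run treated as FAILED by wrapper\n")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(run_cellprofiler(sys.argv[1], sys.argv[2]))
```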
Infrastructure Configuration Notes
Note: Many specifics remain TBD and will be settled during implementation.
Potential areas requiring configuration:
- Network setup (VPC, security groups)
- S3 access policies
- IAM permissions
- Compute preferences (spot vs on-demand)
StarryNight manages job execution; IT retains security/cost control.
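As a starting point for that conversation with IT, an S3 access policy for the Batch job role might be as narrow as the following; the bucket name and action list are placeholders subject to IT review.

```python
"""Illustrative S3 access policy for the Batch job role.
Bucket names and actions are placeholders for IT review."""
STARRYNIGHT_S3_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StarryNightDataAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::starrynight-data",      # placeholder bucket
                "arn:aws:s3:::starrynight-data/*",
            ],
        }
    ],
}
```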
Validation Checklist
- 100-job pipeline test
- Partial failure recovery
- Container version switching
- Telemetry completeness
- Internal user testing
Key Risks
| Risk | Mitigation |
|---|---|
| AWS Batch complexity | Early proof-of-concept |
| Container maintenance | Automated builds |
| Cost overruns | Monitoring and alerts |
| CellProfiler integration | Extensive testing |
Stakeholder Approval Process
Note: This section requires stakeholder input to define the approval process.
Proposed Review Structure (TBD)
- Technical review by engineering team
- Security review by IT/compliance
- Final approval by project sponsors
Open Questions for Stakeholders
- Who are the key stakeholders for approval?
- What are the approval criteria?
- What documentation is required for each review?
- What is the timeline for reviews?
Context from Planning Discussions
IT Team Constraints: IT team will likely lack bandwidth to implement custom infrastructure solutions. StarryNight must provide a predefined AWS configuration that IT teams can approve with a simple "yes/no" decision, rather than requiring custom backend development.
AWS Batch Experience Gap: The team has no hands-on experience with AWS Batch for scientific workloads. The 8-week timeline is based on theoretical assumptions rather than practical knowledge, creating significant unknown risks.
Pulumi Cost Scaling Uncertainty: Unclear how Pulumi pricing scales with infrastructure complexity (10 vs 1000 compute instances). Monthly costs could exceed projections if Pulumi charges per AWS resource rather than per managed service.
Always-On Coordinator Requirements: StarryNight needs a persistent coordinator node for job submission and state management. This requires a dedicated EC2 instance running 24/7, which may introduce additional infrastructure complexity.
Custom Container Maintenance: Cannot use official CellProfiler containers due to need for custom telemetry and error handling wrappers. StarryNight must maintain its own CellProfiler builds, requiring automated CI/CD pipeline and coordination with CellProfiler releases.
CellProfiler Integration Complexity: CellProfiler does not behave like a standard command-line tool; it returns unreliable exit codes and reports errors inconsistently in containers. Standard containerization approaches may fail, so custom error detection and handling mechanisms are required.
Bottom Line: While the roadmap provides a structured 8-week plan, several critical unknowns could significantly impact timeline and complexity. Early proof-of-concept testing is essential before committing to full deployment.