AWS Deployment Roadmap
This roadmap outlines our 8-week plan to deploy StarryNight on AWS, transitioning from local servers to cloud infrastructure.
AWS Deployment Architecture
Core Services
- AWS Batch: Orchestrates containerized compute jobs
- EC2: Hosts login node for job coordination
- S3: Stores data and results
- Container Registry: Public registry (GitHub or similar) for images
- Supporting: CloudWatch (logs), IAM (access), VPC (networking)
System Topology
The AWS deployment follows a hub-and-spoke architecture:
```mermaid
flowchart TB
    subgraph "Control Plane"
        EC2[EC2 Login Node<br/>StarryNight Coordinator]
        GRAFANA[Grafana Stack<br/>OpenTelemetry Collector]
    end
    subgraph "Compute Plane"
        BATCH[AWS Batch<br/>Job Queue]
        COMPUTE[Compute Environment<br/>EC2/Fargate]
    end
    subgraph "Storage"
        S3[S3 Buckets<br/>Data Lake]
        REG[Container Registry<br/>GitHub/Public]
    end
    subgraph "Infrastructure Management"
        PULUMI[Pulumi<br/>IaC]
    end
    EC2 -->|Submit Jobs| BATCH
    BATCH -->|Schedule| COMPUTE
    COMPUTE -->|Read/Write| S3
    COMPUTE -->|Pull Images| REG
    COMPUTE -->|Send Telemetry| GRAFANA
    EC2 <-->|Monitor| GRAFANA
    PULUMI -->|Provision| EC2 & BATCH & S3
```
Monthly Costs (excluding compute/storage)
| Service | Cost | Notes |
|---|---|---|
| EC2 Login Node | $30-40 | t3.medium, always-on |
| CloudWatch | $10-50 | Varies with job volume |
| Pulumi | $0-75 | Free tier likely sufficient |
| Total | $40-165 | Plus data transfer costs |
8-Week Deployment Plan
Phase 1: Infrastructure Foundation (Weeks 1-2)
- Configure Pulumi project structure (team has experience from CytoSkel)
- Define AWS Batch compute environments
- Create S3 bucket hierarchy
- Deploy EC2 login node (24/7 coordinator)
Key Unknown: Team has no AWS Batch experience; configuration requirements unclear
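Because the team already has Pulumi experience from CytoSkel, a minimal Python sketch of the Phase 1 resources might look like the following. The AMI, subnet, security group, and IAM role values are placeholders, and the real definitions will depend on VPC and IAM decisions made with IT.

```python
"""Minimal Pulumi sketch (Python) of the Phase 1 resources.
The AMI, subnet, security group, and IAM role values are placeholders."""
import pulumi
import pulumi_aws as aws

# S3 bucket (prefixes within it model the planned hierarchy)
data_bucket = aws.s3.Bucket("starrynight-data")

# Always-on EC2 login/coordinator node
login_node = aws.ec2.Instance(
    "starrynight-login",
    ami="ami-xxxxxxxx",          # placeholder: current Amazon Linux AMI for the region
    instance_type="t3.medium",   # matches the cost estimate above
)

# Managed, EC2-backed AWS Batch compute environment
compute_env = aws.batch.ComputeEnvironment(
    "starrynight-compute",
    type="MANAGED",
    compute_resources=aws.batch.ComputeEnvironmentComputeResourcesArgs(
        type="EC2",
        min_vcpus=0,
        max_vcpus=256,
        instance_types=["optimal"],
        subnets=["subnet-xxxxxxxx"],         # placeholder
        security_group_ids=["sg-xxxxxxxx"],  # placeholder
        instance_role="arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",  # placeholder
    ),
    service_role="arn:aws:iam::123456789012:role/AWSBatchServiceRole",  # placeholder
)

pulumi.export("data_bucket", data_bucket.bucket)
pulumi.export("compute_environment", compute_env.arn)
```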
Phase 2: Container Pipeline (Weeks 3-4)
- Select public container registry
- Adapt existing CI/CD pipeline for AWS containers
- Set up builds triggered by CellProfiler releases
- Test custom CellProfiler wrappers
Note: Custom StarryNight+CellProfiler containers must be maintained (official CellProfiler images cannot be used)
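One possible way to key builds off CellProfiler releases, assuming the pipeline watches PyPI; the actual trigger mechanism (scheduled CI job, webhook, etc.) is still to be decided:

```python
"""Sketch: detect a new CellProfiler release so CI can rebuild the custom container.
Assumes releases are tracked via the public PyPI JSON API; the state file and the
way CI consumes the output are illustrative choices, not a finalized design."""
import json
import urllib.request

PYPI_URL = "https://pypi.org/pypi/CellProfiler/json"
LAST_BUILT_FILE = "last_built_version.txt"  # hypothetical state file kept by CI


def latest_cellprofiler_version() -> str:
    with urllib.request.urlopen(PYPI_URL) as resp:
        return json.load(resp)["info"]["version"]


def main() -> None:
    latest = latest_cellprofiler_version()
    try:
        with open(LAST_BUILT_FILE) as fh:
            last_built = fh.read().strip()
    except FileNotFoundError:
        last_built = ""
    needs_rebuild = latest != last_built
    # CI (mechanism TBD) can parse this output to decide whether to run the build job
    print(f"cellprofiler_version={latest}")
    print(f"needs_rebuild={str(needs_rebuild).lower()}")


if __name__ == "__main__":
    main()
```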
Phase 3: Integration Testing (Weeks 5-6)
- Test job submission pipeline
- Validate Snakemake → AWS Batch translation
- Verify OpenTelemetry/Grafana integration
- Test Snakemake's job recovery features
Key Risk: CellProfiler error handling requires custom hooks
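A smoke test for the job submission pipeline could look like the sketch below, using boto3 against the queue and job definition created in earlier phases. The queue name, job definition name, region, and test command are placeholders.

```python
"""Smoke-test sketch for the job submission path (Phase 3).
Queue, job definition, region, and command are placeholders."""
import time

import boto3

batch = boto3.client("batch", region_name="us-east-1")  # region is an assumption


def submit_and_wait(queue: str, job_definition: str) -> str:
    """Submit one test job and poll until it reaches a terminal state."""
    job = batch.submit_job(
        jobName="starrynight-smoke-test",
        jobQueue=queue,
        jobDefinition=job_definition,
        containerOverrides={"command": ["cellprofiler", "--version"]},
    )
    job_id = job["jobId"]
    while True:
        status = batch.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(30)


if __name__ == "__main__":
    print(submit_and_wait("starrynight-queue", "starrynight-cellprofiler"))
```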
Phase 4: Production Ready (Weeks 7-8)
- Finalize documentation and runbooks
- Configure monitoring and alerts
- Complete VPC/security setup
- Optimize costs
Note: Security hardening deferred (internal users only)
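For the cost-monitoring item, a billing alarm is one plausible starting point; the threshold and SNS topic below are placeholders, and note that AWS billing metrics are only published in us-east-1.

```python
"""Sketch: a CloudWatch billing alarm for the cost-overrun risk.
Threshold and SNS topic ARN are placeholders."""
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="starrynight-monthly-cost",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                 # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=200.0,              # placeholder budget in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:starrynight-alerts"],  # placeholder topic
)
```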
Configuration Requirements
User Infrastructure Configuration
Based on planning discussions, users configure infrastructure through the UI:
- Module Parameter Exposure
    - Module authors expose parameters like memory and compute requirements
    - Example from discussion: the inventory module exposes a `dataset_path` parameter
    - Proposed: modules could expose `memory` or similar resource parameters
    - Backend implementation decides how to handle these parameters
    - Note: Specific parameter names and UI interface details TBD
- Configuration Flow
    - UI → Module → Pipeline → AWS Batch job definitions (see the sketch after this list)
    - Users cannot directly configure AWS Batch settings
    - Backend determines infrastructure choices (e.g., AWS Batch vs alternatives)
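To make the configuration flow concrete, here is an illustrative sketch of how exposed module parameters could be translated by the backend into AWS Batch container overrides. The parameter names (`dataset_path`, `memory_mb`, `vcpus`) and the `starrynight run` command are hypothetical; the document notes the real names and UI are TBD.

```python
"""Illustrative sketch of the UI → Module → Pipeline → AWS Batch flow.
Parameter names and the CLI command are placeholders, not the actual interface."""
from dataclasses import dataclass


@dataclass
class ModuleSpec:
    name: str
    dataset_path: str       # exposed to users in the UI
    memory_mb: int = 8192   # proposed resource parameter
    vcpus: int = 2


def to_batch_overrides(module: ModuleSpec) -> dict:
    """The backend decides how exposed parameters map onto AWS Batch;
    here they become containerOverrides for submit_job."""
    return {
        # hypothetical CLI entry point
        "command": ["starrynight", "run", module.name, "--dataset", module.dataset_path],
        "resourceRequirements": [
            {"type": "MEMORY", "value": str(module.memory_mb)},
            {"type": "VCPU", "value": str(module.vcpus)},
        ],
    }


# Example: an inventory module configured from the UI
overrides = to_batch_overrides(
    ModuleSpec(name="inventory", dataset_path="s3://starrynight-data/raw/")
)
```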
Job Failure and Restart Procedures
From the planning discussions:
- Snakemake Intelligence
    - Snakemake automatically tracks successful jobs and won't re-run them
    - Only failed or not-yet-run jobs execute on retry
    - Target-based execution model checks for output files (illustrated in the Snakefile sketch after this list)
- QC Review Points
    - QC steps implemented as modules that fail by default
    - Human review required before marking as passed
    - After review, the job can be manually marked to proceed
    - Note: Specific UI for QC approval TBD
- Individual Module Re-execution
    - Each module can be run independently with different parameters
    - Users can modify parameters and re-run specific modules
    - Logs available for each run attempt
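The Snakefile fragment below sketches how the target-based model and a fail-by-default QC gate could work. Rule names, file paths, the `starrynight run` command, and the manual-approval flag file are illustrative assumptions, not the actual StarryNight rules.

```
# Sketch: target-based re-execution plus a QC gate that fails until a human approves.
# Paths, rule names, and the approval mechanism are illustrative.

rule analyze_well:
    input:
        "data/{well}/images.done"
    output:
        "results/{well}/measurements.csv"   # if this exists, the rule is skipped on re-run
    shell:
        "starrynight run cellprofiler-analysis --well {wildcards.well}"

rule qc_gate:
    input:
        "results/{well}/measurements.csv"
    output:
        "qc/{well}/approved.flag"
    run:
        import os
        # Fail by default: the flag is only written once a human has reviewed the
        # results and recorded approval (the approval mechanism/UI is still TBD).
        if not os.path.exists(f"qc/{wildcards.well}/manual_approval.txt"):
            raise ValueError(f"QC for {wildcards.well} awaiting human review")
        open(output[0], "w").close()
```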
Partial Failure Recovery
Based on the discussion about 90% success / 10% failure scenarios:
- Automatic Detection
    - Snakemake identifies which jobs succeeded vs failed
    - Re-running a pipeline only executes failed jobs
    - Note: Specific mechanism for failure detection not fully detailed
- Telemetry and Monitoring
    - OpenTelemetry integration sends logs to the central Grafana stack
    - All stdout/stderr is piped through the telemetry system
    - Challenge: CellProfiler containers need custom wrappers for proper error reporting (see the wrapper sketch after this list)
- Resource Adjustment
    - Failed jobs can be retried with adjusted resources
    - Note: UI mechanism for resource adjustment per retry TBD
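A rough sketch of the kind of wrapper discussed above: it captures all stdout/stderr for the telemetry pipeline and judges success by the presence of expected output files rather than trusting CellProfiler's exit code alone. The specific output check (CSV files in the output directory) is an assumption, not a finalized design.

```python
"""Sketch of a CellProfiler wrapper for containers: capture all output and
decide success from expected output files, since exit codes are unreliable.
The CSV-based success check is an assumption."""
import subprocess
import sys
from pathlib import Path


def run_cellprofiler(pipeline: str, output_dir: str) -> int:
    proc = subprocess.run(
        ["cellprofiler", "-c", "-r", "-p", pipeline, "-o", output_dir],
        capture_output=True,
        text=True,
    )
    # Forward captured output so the container runtime / telemetry collector
    # sees everything CellProfiler wrote.
    sys.stdout.write(proc.stdout)
    sys.stderr.write(proc.stderr)

    # Don't trust the exit code alone: also require expected outputs to exist.
    produced_csvs = list(Path(output_dir).glob("*.csv"))
    if proc.returncode != 0 or not produced_csvs:
        sys.stderr.write("CellProfiler run treated as FAILED by wrapper\n")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(run_cellprofiler(sys.argv[1], sys.argv[2]))
```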
Infrastructure Configuration Notes
Note: Many specifics remain TBD and will be settled during implementation.
Potential areas requiring configuration:
- Network setup (VPC, security groups)
- S3 access policies
- IAM permissions
- Compute preferences (spot vs on-demand)
StarryNight manages job execution; IT retains security/cost control.
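As a starting point for that conversation with IT, an S3 access policy for the Batch job role might be as narrow as the following; the bucket name and action list are placeholders subject to IT review.

```python
"""Illustrative S3 access policy for the Batch job role.
Bucket names and actions are placeholders for IT review."""
STARRYNIGHT_S3_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StarryNightDataAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::starrynight-data",      # placeholder bucket
                "arn:aws:s3:::starrynight-data/*",
            ],
        }
    ],
}
```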
Validation Checklist
- 100-job pipeline test
- Partial failure recovery
- Container version switching
- Telemetry completeness
- Internal user testing
Key Risks
| Risk | Mitigation |
|---|---|
| AWS Batch complexity | Early proof-of-concept |
| Container maintenance | Automated builds |
| Cost overruns | Monitoring and alerts |
| CellProfiler integration | Extensive testing |
Stakeholder Approval Process
Note: This section requires stakeholder input to define the approval process.
Proposed Review Structure (TBD)
- Technical review by engineering team
- Security review by IT/compliance
- Final approval by project sponsors
Open Questions for Stakeholders
- Who are the key stakeholders for approval?
- What are the approval criteria?
- What documentation is required for each review?
- What is the timeline for reviews?
Context from Planning Discussions
IT Team Constraints: IT team will likely lack bandwidth to implement custom infrastructure solutions. StarryNight must provide a predefined AWS configuration that IT teams can approve with a simple "yes/no" decision, rather than requiring custom backend development.
AWS Batch Experience Gap: The team has no hands-on experience with AWS Batch for scientific workloads. The 8-week timeline is based on theoretical assumptions rather than practical knowledge, creating significant unknown risks.
Pulumi Cost Scaling Uncertainty: Unclear how Pulumi pricing scales with infrastructure complexity (10 vs 1000 compute instances). Monthly costs could exceed projections if Pulumi charges per AWS resource rather than per managed service.
Always-On Coordinator Requirements: StarryNight needs a persistent coordinator node for job submission and state management. This requires a dedicated EC2 instance running 24/7, which may introduce additional infrastructure complexity.
Custom Container Maintenance: Cannot use official CellProfiler containers due to need for custom telemetry and error handling wrappers. StarryNight must maintain its own CellProfiler builds, requiring automated CI/CD pipeline and coordination with CellProfiler releases.
CellProfiler Integration Complexity: CellProfiler does not behave like a standard command-line tool; it returns unreliable exit codes and reports errors inconsistently in containers. Standard containerization approaches may fail, so custom error detection and handling mechanisms are required.
Bottom Line: While the roadmap provides a structured 8-week plan, several critical unknowns could significantly impact timeline and complexity. Early proof-of-concept testing is essential before committing to full deployment.