scadalink-design/docs/deployment/production-checklist.md
Joseph Doherty b659978764 Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs
- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.
2026-03-16 22:12:31 -04:00

# ScadaLink Production Deployment Checklist
## Pre-Deployment
### Configuration Verification
- [ ] `ScadaLink:Node:Role` is set correctly (`Central` or `Site`)
- [ ] `ScadaLink:Node:NodeHostname` matches the machine's resolvable hostname
- [ ] `ScadaLink:Cluster:SeedNodes` contains exactly 2 entries for the cluster pair
- [ ] Seed node addresses use fully qualified hostnames (not `localhost`)
- [ ] Remoting port (default 8081) is open bidirectionally between cluster peers
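
Several of the configuration checks above can be automated before a node is started. The following is a minimal sketch, assuming the settings live in a JSON file whose nested sections mirror the colon-separated keys (so `ScadaLink:Node:Role` is `config["ScadaLink"]["Node"]["Role"]`). The function name `check_node_config` and the example seed-node address format are hypothetical, not part of ScadaLink.

```python
from typing import Any

# Substrings that indicate a loopback address rather than a resolvable hostname.
LOOPBACK_MARKERS = ("localhost", "127.0.0.1")


def check_node_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems: list[str] = []
    scada = config.get("ScadaLink", {})

    role = scada.get("Node", {}).get("Role")
    if role not in ("Central", "Site"):
        problems.append(f"Node:Role must be 'Central' or 'Site', got {role!r}")

    seeds = scada.get("Cluster", {}).get("SeedNodes", [])
    if len(seeds) != 2:
        problems.append(f"Cluster:SeedNodes must have exactly 2 entries, got {len(seeds)}")

    for seed in seeds:
        if any(marker in seed.lower() for marker in LOOPBACK_MARKERS):
            problems.append(f"Seed node uses a loopback address: {seed}")

    return problems
```

Loading the real file with `json.load` and failing the deployment when the returned list is non-empty would turn the first few checklist items into a repeatable gate.
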
### Central Node
- [ ] `ScadaLink:Database:ConfigurationDb` connection string is valid and tested
- [ ] `ScadaLink:Database:MachineDataDb` connection string is valid and tested
- [ ] SQL Server login has `db_owner` role on both databases
- [ ] EF Core migrations have been applied (SQL script reviewed and executed)
- [ ] `ScadaLink:Security:JwtSigningKey` is at least 32 characters, randomly generated
- [ ] **Both central nodes use the same JwtSigningKey** (required for JWT failover)
- [ ] `ScadaLink:Security:LdapServer` points to the production LDAP/AD server
- [ ] `ScadaLink:Security:LdapUseTls` is `true` (LDAPS required in production)
- [ ] `ScadaLink:Security:AllowInsecureLdap` is `false`
- [ ] LDAP search base DN is correct for the organization
- [ ] LDAP group-to-role mappings are configured
- [ ] Load balancer is configured in front of central UI (sticky sessions not required)
- [ ] ASP.NET Data Protection keys are shared between central nodes (for cookie failover)
- [ ] HTTPS certificate is installed and configured
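
For the `JwtSigningKey` requirement above, one way to produce a random key of sufficient length is a cryptographically secure generator. This is a sketch only: the helper names `generate_signing_key` and `is_key_long_enough` are hypothetical, and the 48-byte default is an assumption chosen to exceed the 32-character minimum comfortably.

```python
import secrets

MIN_KEY_CHARS = 32  # minimum length stated in the checklist


def generate_signing_key(num_bytes: int = 48) -> str:
    """Return a URL-safe random string built from num_bytes of OS entropy."""
    return secrets.token_urlsafe(num_bytes)


def is_key_long_enough(key: str, min_chars: int = MIN_KEY_CHARS) -> bool:
    """Check only the length; randomness cannot be verified after the fact."""
    return len(key) >= min_chars
```

Generate the key once and distribute the same value to both central nodes through the secrets mechanism in use, since the checklist requires the key to be identical on both.
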
### Site Node
- [ ] `ScadaLink:Node:SiteId` is set and unique across all sites
- [ ] `ScadaLink:Database:SiteDbPath` points to a writable directory
- [ ] SQLite data directory has sufficient disk space (the store-and-forward buffer has no maximum size)
- [ ] `ScadaLink:Communication:CentralSeedNode` points to a reachable central node
- [ ] OPC UA server endpoints are accessible from site nodes
- [ ] OPC UA security certificates are configured if required
### Security
- [ ] No secrets in `appsettings.json` committed to source control
- [ ] Secrets managed via environment variables or a secrets manager
- [ ] Windows Service account has minimum necessary permissions
- [ ] Log directory permissions restrict access to service account and administrators
- [ ] SMTP credentials use OAuth2 Client Credentials (preferred) or secure Basic Auth
- [ ] API keys for Inbound API are generated with sufficient entropy (32+ chars)
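
The "no secrets in `appsettings.json`" item can be partly checked with a scan of the parsed file. The sketch below is a heuristic, not a guarantee: the key-name pattern and the `Password=` test for connection strings are assumptions, and `find_inline_secrets` is a hypothetical helper name.

```python
import re
from typing import Any

# Key names that usually hold secrets (an assumed, non-exhaustive pattern).
SUSPECT_KEY = re.compile(r"password|secret|signingkey|apikey|token", re.IGNORECASE)


def looks_secret(key: str, value: str) -> bool:
    """Flag non-empty values under a suspicious key or with an inline password."""
    if not value:
        return False
    return SUSPECT_KEY.search(key) is not None or "password=" in value.lower()


def find_inline_secrets(node: Any, path: str = "") -> list[str]:
    """Walk parsed JSON and return colon-separated paths of suspected secrets."""
    hits: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}:{key}" if path else key
            if isinstance(value, str) and looks_secret(key, value):
                hits.append(child)
            else:
                hits.extend(find_inline_secrets(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_inline_secrets(item, f"{path}:{index}"))
    return hits
```

Running this over the committed file in a pre-commit hook or CI step would catch the most obvious cases; values supplied through environment variables never appear in the file and are therefore not flagged.
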
### Network
- [ ] DNS resolution works between all cluster nodes
- [ ] Firewall rules permit Akka.NET remoting (TCP 8081)
- [ ] Firewall rules permit LDAP (TCP 636 for LDAPS)
- [ ] Firewall rules permit SMTP (TCP 587 for STARTTLS)
- [ ] Firewall rules permit SQL Server (TCP 1433) from central nodes only
- [ ] Load balancer health check configured against `/health/ready`
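
The firewall items above can be spot-checked from each node with a plain TCP connection attempt. This is a minimal sketch; `port_is_open` is a hypothetical helper, and a successful connection shows only that the port is reachable, not that the service behind it is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `port_is_open("central-b.example.com", 8081)` run from central node A would exercise both DNS resolution and the remoting firewall rule (the hostname here is a placeholder).
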
## Deployment
### Order of Operations
1. Deploy central node A (forms single-node cluster)
2. Verify central node A is healthy: `GET /health/ready` returns 200
3. Deploy central node B (joins existing cluster)
4. Verify both central nodes show as cluster members in logs
5. Deploy site nodes (order does not matter)
6. Verify sites register with central via health dashboard
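
Step 2 above can be scripted so that node B is deployed only once node A reports ready. The following sketch polls `/health/ready` until it returns 200. The retry count, delay, and function names are assumptions, and the `get_status` parameter exists only so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with a non-2xx status
    except OSError:
        return 0  # connection refused, timeout, or DNS failure


def wait_until_ready(
    url: str,
    attempts: int = 30,
    delay: float = 2.0,
    get_status: Callable[[str], int] = http_status,
) -> bool:
    """Poll url until it returns 200 or the attempts are exhausted."""
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A deployment script might call `wait_until_ready("https://central-a.example.com/health/ready")` (placeholder hostname) and abort the rollout when it returns `False`.
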
### Rollback Plan
- [ ] Previous version binaries are retained for rollback
- [ ] Database backup taken before migration
- [ ] Rollback SQL script is available (if migration requires it)
- [ ] Service can be stopped and previous binary restored
## Post-Deployment
### Smoke Tests
- [ ] Central UI is accessible and login works
- [ ] Health dashboard shows all expected sites as online
- [ ] Template engine can create/save/delete a test template
- [ ] Deployment pipeline can deploy a test instance to a site
- [ ] Inbound API responds to test requests with valid API key
- [ ] Notification Service can send a test email
### Monitoring Setup
- [ ] Log aggregation is configured (Serilog file sink + centralized collector)
- [ ] Health dashboard bookmarked for operations team
- [ ] Alerting configured for site offline threshold violations
- [ ] Disk space monitoring on site nodes (SQLite growth)
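
Where no monitoring agent is available on a site node, the disk space item can be approximated with a scheduled script. This is a sketch under stated assumptions: the 20% free-space threshold is illustrative, not a ScadaLink recommendation, and `disk_space_ok` is a hypothetical helper.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the volume containing path that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_space_ok(path: str, min_free_fraction: float = 0.2) -> bool:
    """Return True while free space stays at or above the threshold."""
    return free_fraction(path) >= min_free_fraction
```

Pointing `path` at the directory configured in `ScadaLink:Database:SiteDbPath` measures the volume that the SQLite files actually grow on.
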
### Documentation
- [ ] Cluster topology documented (hostnames, ports, roles)
- [ ] Runbook updated with environment-specific details
- [ ] On-call team briefed on failover procedures