scadalink-design/docs/deployment/production-checklist.md
Joseph Doherty b659978764 Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs
- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.
2026-03-16 22:12:31 -04:00

ScadaLink Production Deployment Checklist

Pre-Deployment

Configuration Verification

  • ScadaLink:Node:Role is set correctly (Central or Site)
  • ScadaLink:Node:NodeHostname matches the machine's resolvable hostname
  • ScadaLink:Cluster:SeedNodes contains exactly 2 entries for the cluster pair
  • Seed node addresses use fully qualified hostnames (not localhost)
  • Remoting port (default 8081) is open bidirectionally between cluster peers
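
The node and cluster items above can be checked mechanically before a rollout. The following is a minimal sketch, assuming the settings live in a JSON file whose nesting mirrors the colon-separated keys (ScadaLink, then Node or Cluster). The function name `check_cluster_config` and the example seed addresses (including the actor system name `scadalink`) are hypothetical, not taken from the ScadaLink code base.

```python
import json


def check_cluster_config(config: dict) -> list[str]:
    """Return a list of problems found in the node/cluster settings."""
    problems = []
    scada = config.get("ScadaLink", {})
    node = scada.get("Node", {})
    cluster = scada.get("Cluster", {})

    # Role must be one of the two values named in the checklist.
    if node.get("Role") not in ("Central", "Site"):
        problems.append("Node:Role must be 'Central' or 'Site'")
    if not node.get("NodeHostname"):
        problems.append("Node:NodeHostname is missing")

    # The checklist requires exactly two seed nodes for the cluster pair.
    seeds = cluster.get("SeedNodes", [])
    if len(seeds) != 2:
        problems.append(f"Cluster:SeedNodes has {len(seeds)} entries, expected 2")

    # Seed addresses must not point at the local loopback interface.
    for seed in seeds:
        if "localhost" in seed.lower() or "127.0.0.1" in seed:
            problems.append(f"Seed node uses a loopback address: {seed}")
    return problems


# Hypothetical example document; hostnames are placeholders.
example = json.loads("""
{
  "ScadaLink": {
    "Node": {"Role": "Central", "NodeHostname": "central-a.example.com"},
    "Cluster": {"SeedNodes": [
      "akka.tcp://scadalink@central-a.example.com:8081",
      "akka.tcp://scadalink@central-b.example.com:8081"
    ]}
  }
}
""")
for problem in check_cluster_config(example):
    print(problem)
```

The sketch does not verify that NodeHostname actually resolves or that the remoting port is reachable; those depend on the target network and are better checked from the machines themselves.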

Central Node

  • ScadaLink:Database:ConfigurationDb connection string is valid and tested
  • ScadaLink:Database:MachineDataDb connection string is valid and tested
  • SQL Server login has db_owner role on both databases
  • EF Core migrations have been applied (SQL script reviewed and executed)
  • ScadaLink:Security:JwtSigningKey is at least 32 characters long and randomly generated
  • Both central nodes use the same JwtSigningKey (required for JWT failover)
  • ScadaLink:Security:LdapServer points to the production LDAP/AD server
  • ScadaLink:Security:LdapUseTls is true (LDAPS required in production)
  • ScadaLink:Security:AllowInsecureLdap is false
  • LDAP search base DN is correct for the organization
  • LDAP group-to-role mappings are configured
  • Load balancer is configured in front of the central UI (sticky sessions not required)
  • ASP.NET Data Protection keys are shared between central nodes (for cookie failover)
  • HTTPS certificate is installed and configured
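
The signing-key and LDAP items above lend themselves to a small pre-flight check. This is a sketch only, assuming the Security section has been loaded into a plain dictionary. The helper names (`generate_signing_key`, `check_security_settings`, `same_key`) are hypothetical, and the 48-byte default is an illustrative choice that comfortably exceeds the 32-character minimum.

```python
import hmac
import secrets

MIN_JWT_KEY_LENGTH = 32  # minimum length stated in the checklist


def generate_signing_key(num_bytes: int = 48) -> str:
    """Return a URL-safe random string drawn from the OS CSPRNG."""
    return secrets.token_urlsafe(num_bytes)


def check_security_settings(security: dict) -> list[str]:
    """Return problems found in the ScadaLink:Security settings."""
    problems = []
    key = security.get("JwtSigningKey", "")
    if len(key) < MIN_JWT_KEY_LENGTH:
        problems.append(
            f"JwtSigningKey is {len(key)} characters, need at least {MIN_JWT_KEY_LENGTH}"
        )
    # Both LDAP flags must be set explicitly to the production values.
    if security.get("LdapUseTls") is not True:
        problems.append("LdapUseTls must be true")
    if security.get("AllowInsecureLdap") is not False:
        problems.append("AllowInsecureLdap must be false")
    return problems


def same_key(key_a: str, key_b: str) -> bool:
    """Compare the keys of the two central nodes in constant time."""
    return hmac.compare_digest(key_a.encode(), key_b.encode())
```

A length check cannot prove that a key was randomly generated, so generating the key with a CSPRNG in the first place is the safer route.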

Site Node

  • ScadaLink:Node:SiteId is set and unique across all sites
  • ScadaLink:Database:SiteDbPath points to a writable directory
  • SQLite data directory has sufficient disk space (the store-and-forward (S&F) buffer has no maximum size)
  • ScadaLink:Communication:CentralSeedNode points to a reachable central node
  • OPC UA server endpoints are accessible from site nodes
  • OPC UA security certificates are configured if required
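
Two of the site-node items above, a writable data directory and unique site identifiers, can be verified with a short script. The sketch below assumes the list of SiteId values has been collected from the site configurations by some other means. Both function names are hypothetical.

```python
import os
import tempfile
from collections import Counter


def is_writable_directory(path: str) -> bool:
    """True if a temporary file can be created (and removed) in path."""
    if not os.path.isdir(path):
        return False
    try:
        # TemporaryFile deletes the file again when the block exits.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


def duplicate_site_ids(site_ids: list[str]) -> list[str]:
    """Return every SiteId that appears more than once, sorted."""
    counts = Counter(site_ids)
    return sorted(site_id for site_id, n in counts.items() if n > 1)
```

Creating a real file tests the effective permissions of the account running the script, so the check is only meaningful when run as the service account.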

Security

  • No secrets in appsettings.json committed to source control
  • Secrets managed via environment variables or a secrets manager
  • Windows Service account has minimum necessary permissions
  • Log directory permissions restrict access to service account and administrators
  • SMTP credentials use OAuth2 Client Credentials (preferred) or Basic Auth over a TLS-protected connection
  • API keys for Inbound API are generated with sufficient entropy (32+ chars)
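
The first item above can be partly automated by scanning appsettings.json for sensitive keys that carry a non-empty value. This is a rough sketch: the set of sensitive key names is an assumption drawn from the keys mentioned in this checklist, and `find_inline_secrets` is a hypothetical helper. If the service uses the standard .NET environment-variable configuration provider (not confirmed by this checklist), a key such as ScadaLink:Security:JwtSigningKey would instead be supplied as the variable ScadaLink__Security__JwtSigningKey.

```python
# Leaf key names treated as secrets; an assumed list based on this checklist.
SENSITIVE_KEYS = frozenset({"JwtSigningKey", "ConfigurationDb", "MachineDataDb"})


def find_inline_secrets(config: dict, sensitive=SENSITIVE_KEYS, prefix: str = "") -> list[str]:
    """Return colon-separated paths of sensitive keys with non-empty string values."""
    found = []
    for name, value in config.items():
        path = f"{prefix}:{name}" if prefix else name
        if isinstance(value, dict):
            # Recurse into nested configuration sections.
            found.extend(find_inline_secrets(value, sensitive, path))
        elif name in sensitive and isinstance(value, str) and value.strip():
            found.append(path)
    return found
```

An empty result does not prove the file is clean, since secrets can hide under key names the list does not cover.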

Network

  • DNS resolution works between all cluster nodes
  • Firewall rules permit Akka.NET remoting (TCP 8081)
  • Firewall rules permit LDAPS (TCP 636)
  • Firewall rules permit SMTP (TCP 587 for TLS)
  • Firewall rules permit SQL Server (TCP 1433) from central nodes only
  • Load balancer health check configured against /health/ready
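
The firewall items above can be spot-checked from each node with a plain TCP connect. The sketch below uses the port numbers listed in this section. The function name `tcp_port_open` and the `REQUIRED_PORTS` table are hypothetical, and a successful connect only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket

# Ports named in the checklist above.
REQUIRED_PORTS = {
    "Akka.NET remoting": 8081,
    "LDAPS": 636,
    "SMTP": 587,
    "SQL Server": 1433,
}


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Because DNS failures also surface as a failed connect, the same check doubles as a coarse test of name resolution between cluster nodes.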

Deployment

Order of Operations

  1. Deploy central node A (forms single-node cluster)
  2. Verify central node A is healthy: GET /health/ready returns 200
  3. Deploy central node B (joins existing cluster)
  4. Verify in the logs that both central nodes appear as cluster members
  5. Deploy site nodes (order does not matter)
  6. Verify on the health dashboard that all sites have registered with central
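
The readiness check in the second step of the sequence above can be scripted so that the next node is deployed only once GET /health/ready returns 200. This is a minimal sketch under assumed defaults: the timeout and polling interval are illustrative, and `http_status` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def wait_until_ready(
    base_url: str,
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    get_status: Callable[[str], Optional[int]] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/health/ready until it returns 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/health/ready"
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `get_status` and `sleep` parameters exist only so the loop can be exercised without a live node. In a deployment script the defaults would be used, for example `wait_until_ready("https://central-a.example.com")` with a placeholder hostname.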

Rollback Plan

  • Previous version binaries are retained for rollback
  • Database backup taken before migration
  • Rollback SQL script is available (if migration requires it)
  • Service can be stopped and previous binary restored

Post-Deployment

Smoke Tests

  • Central UI is accessible and login works
  • Health dashboard shows all expected sites as online
  • Template engine can create/save/delete a test template
  • Deployment pipeline can deploy a test instance to a site
  • Inbound API responds to test requests with valid API key
  • Notification Service can send a test email

Monitoring Setup

  • Log aggregation is configured (Serilog file sink + centralized collector)
  • Health dashboard bookmarked for operations team
  • Alerting configured for site offline threshold violations
  • Disk space monitoring is configured on site nodes (SQLite growth)
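
For the disk space item above, a simple free-space probe can feed whatever alerting system is in use. The sketch below is an assumption-laden illustration: the 20% threshold is arbitrary, and `free_space_fraction` and `disk_space_alert` are hypothetical names. The path would be the volume that holds the site's SQLite data directory.

```python
import shutil


def free_space_fraction(path: str) -> float:
    """Return the fraction of free space on the volume containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # guard against pseudo file systems reporting zero size
    return usage.free / usage.total


def disk_space_alert(path: str, min_free_fraction: float = 0.2) -> bool:
    """True if free space has dropped below the given fraction."""
    return free_space_fraction(path) < min_free_fraction
```

A fraction-based threshold scales across volumes of different sizes, though an absolute minimum in bytes may suit very large disks better.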

Documentation

  • Cluster topology documented (hostnames, ports, roles)
  • Runbook updated with environment-specific details
  • On-call team briefed on failover procedures