Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs
- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)

92 new tests, all passing. Zero warnings.
docs/deployment/topology-guide.md
Normal file
# ScadaLink Cluster Topology Guide

## Architecture Overview

ScadaLink uses a hub-and-spoke architecture:

- **Central Cluster**: Two-node active/standby Akka.NET cluster for management, UI, and coordination.
- **Site Clusters**: Two-node active/standby Akka.NET clusters at each remote site for data collection and local processing.

```
                  ┌──────────────────────────┐
                  │      Central Cluster     │
                  │  ┌──────┐    ┌──────┐    │
Users ──────────► │  │Node A│◄──►│Node B│    │
(HTTPS/LB)        │  │Active│    │Stby  │    │
                  │  └──┬───┘    └──┬───┘    │
                  └─────┼───────────┼────────┘
                        │           │
            ┌───────────┼───────────┼───────────┐
            │           │           │           │
      ┌─────▼─────┐ ┌───▼─────┐ ┌───▼─────┐ ┌───▼─────┐
      │  Site 01  │ │ Site 02 │ │ Site 03 │ │ Site N  │
      │ ┌──┐ ┌──┐ │ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│ │ ┌──┐┌──┐│
      │ │A │ │B │ │ │ │A ││B ││ │ │A ││B ││ │ │A ││B ││
      │ └──┘ └──┘ │ │ └──┘└──┘│ │ └──┘└──┘│ │ └──┘└──┘│
      └───────────┘ └─────────┘ └─────────┘ └─────────┘
```

## Central Cluster Setup

### Cluster Configuration

Both central nodes must be configured as seed nodes for each other:

**Node A** (`central-01.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-01.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

**Node B** (`central-02.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Central",
      "NodeHostname": "central-02.example.com",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@central-01.example.com:8081",
        "akka.tcp://scadalink@central-02.example.com:8081"
      ]
    }
  }
}
```

### Cluster Behavior

- **Split-brain resolver**: Keep-oldest with `down-if-alone = on`, 15-second stable-after.
- **Minimum members**: `min-nr-of-members = 1`, so a single node can form a cluster.
- **Failure detection**: 2-second heartbeat interval, 10-second threshold.
- **Total failover time**: ~25 seconds from node failure to singleton migration.
- **Singleton handover**: Uses CoordinatedShutdown for graceful migration.

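In Akka.NET HOCON terms, these behaviors correspond roughly to the fragment below. This is a sketch assuming the stock Akka.NET split-brain resolver; the exact keys ScadaLink generates may differ, and mapping the "10-second threshold" to `acceptable-heartbeat-pause` is an assumption.

```hocon
akka.cluster {
  # A single node may form the cluster after a total outage
  min-nr-of-members = 1

  # Stock split-brain resolver shipped with Akka.NET
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"

  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }

  failure-detector {
    heartbeat-interval = 2s
    # Assumed mapping of the "10-second threshold" above
    acceptable-heartbeat-pause = 10s
  }
}
```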
### Shared State

Both central nodes share state through:

- **SQL Server**: All configuration, deployment records, templates, and audit logs.
- **JWT signing key**: Same `JwtSigningKey` in both nodes' configuration.
- **Data Protection keys**: Shared key ring (stored in SQL Server or a shared file path).

### Load Balancer

A load balancer sits in front of both central nodes for the Blazor Server UI:

- Health check: `GET /health/ready`
- Protocol: HTTPS (TLS termination at the LB or pass-through)
- Sticky sessions: Not required (JWT + shared Data Protection keys)
- If the active node fails, the LB routes to the standby (which becomes active after singleton migration).

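As an illustration, an HAProxy backend matching these requirements might look like the sketch below. Any LB that supports HTTP health checks works equally well; the backend name and the active/standby preference via `backup` are assumptions, and TLS verification options are elided for brevity.

```
backend scadalink_central
    mode http
    option httpchk GET /health/ready
    http-check expect status 200
    # Prefer central-01; route to central-02 only when central-01 fails its checks
    server central-01 central-01.example.com:443 check ssl verify none
    server central-02 central-02.example.com:443 check ssl verify none backup
```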
## Site Cluster Setup

### Cluster Configuration

Each site has its own two-node cluster:

**Site Node A** (`site-01-a.example.com`):
```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-a.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```

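Site Node B's configuration is not shown here; by analogy with the central pair it should differ only in `NodeHostname`. The following is a sketch derived from that pattern, not verified against the product's config schema:

```json
{
  "ScadaLink": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-01-b.example.com",
      "SiteId": "plant-north",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadalink@site-01-a.example.com:8081",
        "akka.tcp://scadalink@site-01-b.example.com:8081"
      ]
    }
  }
}
```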
### Site Cluster Behavior

- Same split-brain resolver as central (keep-oldest).
- Singleton actors: the Site Deployment Manager migrates on failover.
- Staggered instance startup: a 50 ms delay between Instance Actor creations prevents reconnection storms.
- SQLite persistence: Both nodes access the same SQLite files (or each keeps its own copy with async replication).

### Central-Site Communication

- Sites connect to central via Akka.NET remoting.
- The `Communication:CentralSeedNode` setting in the site config points to one of the central nodes.
- If that central node is down, the site's communication actor retries until it reaches the active central node.

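`Communication:CentralSeedNode` uses .NET configuration-path syntax; in the site's JSON it would sit roughly as follows. This is a sketch: whether the `Communication` section nests under `ScadaLink` is an assumption based on the other config samples in this guide.

```json
{
  "ScadaLink": {
    "Communication": {
      "CentralSeedNode": "akka.tcp://scadalink@central-01.example.com:8081"
    }
  }
}
```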
## Scaling Guidelines

### Target Scale

- 10 sites maximum per central cluster
- 500 machines (instances) total across all sites
- 75 tags per machine (37,500 total tag subscriptions)

### Resource Requirements

| Component    | CPU     | RAM   | Disk   | Notes                                  |
|--------------|---------|-------|--------|----------------------------------------|
| Central node | 4 cores | 8 GB  | 50 GB  | SQL Server is separate                 |
| Site node    | 2 cores | 4 GB  | 20 GB  | SQLite databases grow with store-and-forward (S&F) backlog |
| SQL Server   | 4 cores | 16 GB | 100 GB | Shared across central cluster          |

### Network Bandwidth

- Health reports: ~1 KB per site per 30 seconds, i.e. negligible
- Tag value updates: Depends on data change rate; OPC UA subscriptions are change-driven
- Deployment artifacts: One-time burst per deployment (varies by config size)
- Debug view streaming: ~500 bytes per attribute change per subscriber

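A quick back-of-the-envelope check of the scale and bandwidth figures above (a sketch; the per-message sizes are this guide's estimates, not measurements):

```python
# Target-scale figures from this guide
sites = 10
machines = 500
tags_per_machine = 75

total_tags = machines * tags_per_machine
print(f"total tag subscriptions: {total_tags}")  # 37500, matching the Target Scale section

# Health reports: ~1 KB per site every 30 seconds, summed across all sites
health_bytes_per_sec = sites * 1024 / 30
print(f"health-report traffic: {health_bytes_per_sec:.0f} B/s")  # ~341 B/s, i.e. negligible
```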
## Dual-Node Failure Recovery

### Scenario: Both Nodes Down

1. **First node starts**: Forms a single-node cluster (`min-nr-of-members = 1`).
2. **Central**: Reconnects to SQL Server, reads deployment state, becomes operational.
3. **Site**: Opens SQLite databases, rebuilds Instance Actors from persisted configs, resumes S&F retries.
4. **Second node starts**: Joins the existing cluster as standby.

### Automatic Recovery

No manual intervention is required after a dual-node failure. The first node to start will:

- Form the cluster
- Take over all singletons
- Begin processing immediately
- Accept the second node when it joins