scadalink-design/docs/operations/failover-procedures.md
Joseph Doherty b659978764 Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs
- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.
2026-03-16 22:12:31 -04:00


# ScadaLink Failover Procedures
## Automatic Failover (No Intervention Required)
### Central Cluster Failover
**What happens automatically:**
1. Active central node becomes unreachable (process crash, network failure, hardware failure).
2. Akka.NET failure detection triggers after ~10 seconds (2 s heartbeat interval, 10 s acceptable heartbeat pause).
3. Split-brain resolver (keep-oldest) evaluates cluster state for 15 seconds (stable-after).
4. Standby node is promoted to active. Total time: ~25 seconds.
5. Cluster singletons migrate to the new active node.
6. Load balancer detects the failed node via `/health/ready` and routes traffic to the surviving node.
7. Active user sessions continue (JWT tokens are validated by the new node using the shared signing key).
8. SignalR connections are dropped and Blazor clients automatically reconnect.
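Expressed as configuration, the detection and resolution timings above correspond to Akka.NET cluster settings along these lines (an illustrative HOCON fragment — key names are standard Akka.NET cluster options, values are taken from this document; verify against the deployed configuration):

```hocon
akka.cluster {
  # 2s heartbeat, ~10s of silence before a peer is marked unreachable
  failure-detector {
    heartbeat-interval = 2s
    acceptable-heartbeat-pause = 10s
  }
  # keep-oldest split-brain resolution with a 15s stabilization window
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }
}
```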
**What is preserved:**
- All configuration and deployment state (stored in SQL Server)
- Active JWT sessions (shared signing key)
- Deployment status records (SQL Server with optimistic concurrency)
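The optimistic-concurrency pattern behind the deployment status records can be sketched as a guarded UPDATE: each write carries the version it read, and an affected-row count of zero signals a conflicting concurrent write. A minimal sketch (Python with SQLite standing in for SQL Server; the table and column names are hypothetical):

```python
import sqlite3

def update_status(conn, record_id, new_state, expected_version):
    """Update a deployment status row only if its version is unchanged.

    Returns True on success, False if another writer got there first.
    """
    cur = conn.execute(
        "UPDATE deployment_status SET state = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_state, record_id, expected_version),
    )
    return cur.rowcount == 1

# Hypothetical schema for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deployment_status (id INTEGER PRIMARY KEY, state TEXT, version INTEGER)"
)
conn.execute("INSERT INTO deployment_status VALUES (1, 'pending', 0)")
```

A writer holding a stale version gets a clean rejection instead of silently overwriting a newer record, which is what lets both central nodes share the database safely.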
**What is temporarily disrupted:**
- In-flight deployments: Central re-queries site state and re-issues if needed (idempotent)
- Real-time debug view streams: Clients reconnect automatically
- Health dashboard: Resumes on reconnect
### Site Cluster Failover
**What happens automatically:**
1. Active site node becomes unreachable.
2. Failure detection and split-brain resolution (~25 seconds total).
3. Site Deployment Manager singleton migrates to standby.
4. Instance Actors are recreated from persisted SQLite configurations.
5. Staggered startup: 50ms delay between instance creations to prevent reconnection storms.
6. DCL connection actors reconnect to OPC UA servers.
7. Script Actors and Alarm Actors resume processing from incoming values (no stale state).
8. S&F buffer is read from SQLite — pending retries resume.
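The staggered recreation in step 5 spreads OPC UA reconnections over time instead of firing them all at once. The schedule itself is simple arithmetic (a Python sketch; the 50 ms figure comes from this document, everything else is illustrative):

```python
STAGGER_SECONDS = 0.05  # 50 ms between instance creations

def stagger_schedule(instance_count, stagger=STAGGER_SECONDS):
    """Return the start offset (seconds after recreation begins) per instance."""
    return [i * stagger for i in range(instance_count)]

# For the 500-instance example used in the failover timeline below,
# the last instance starts ~25 s after recreation begins.
offsets = stagger_schedule(500)
```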
**What is preserved:**
- Deployed instance configurations (SQLite)
- Static attribute overrides (SQLite)
- S&F message buffer (SQLite)
- Site event logs (SQLite)
**What is temporarily disrupted:**
- Tag value subscriptions: DCL reconnects and re-subscribes transparently
- Active script executions: Cancelled; trigger fires again on next value change
- Alarm states: Re-evaluated from incoming tag values (correct state within one update cycle)
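Because alarm state is derived purely from incoming tag values, a recreated alarm actor needs no persisted history: the first post-failover update yields the correct state. A deliberately minimal sketch of that stateless evaluation (Python; the high-limit semantics are hypothetical, not ScadaLink's actual alarm model):

```python
def evaluate_alarm(value, high_limit):
    """Derive alarm state from the current value alone.

    No prior state is consulted, so a freshly recreated actor produces
    the correct state from the first incoming value after failover.
    """
    return value > high_limit
```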
## Manual Intervention Scenarios
### Scenario 1: Both Central Nodes Down
**Symptoms:** No central UI access, sites report "central unreachable" in logs.
**Recovery:**
1. Start either central node. It will form a single-node cluster.
2. Verify SQL Server is accessible.
3. Check `/health/ready` returns 200.
4. Start the second node. It will join the cluster automatically.
5. Verify both nodes appear in the Akka.NET cluster member list (check logs for "Member joined").
**No data loss:** All state is in SQL Server.
### Scenario 2: Both Site Nodes Down
**Symptoms:** Site appears offline in central health dashboard.
**Recovery:**
1. Start either site node.
2. Check logs for "Store-and-forward SQLite storage initialized".
3. Verify instance actors are recreated: "Instance {Name}: created N script actors and M alarm actors".
4. Start the second site node.
5. Verify the site appears online in the central health dashboard within 60 seconds.
**No data loss:** All state is in SQLite.
### Scenario 3: Split-Brain (Network Partition Between Peers)
**Symptoms:** Both nodes believe they are the active node. Logs show "Cluster partition detected".
**How the system handles it:**
- Keep-oldest resolver: The older node (first to join cluster) survives; the younger is downed.
- `down-if-alone = on`: If a node is alone (no peers), it downs itself.
- Stable-after (15s): The resolver waits 15 seconds for the partition to stabilize before acting.
**Manual intervention (if auto-resolution fails):**
1. Stop both nodes.
2. Start the preferred node first (it becomes the "oldest").
3. Start the second node.
### Scenario 4: SQL Server Outage (Central)
**Symptoms:** Central UI returns errors. `/health/ready` returns 503. Logs show database connection failures.
**Impact:**
- Active sessions with valid JWTs can still access cached UI state.
- New logins fail (LDAP auth still works but role mapping requires DB).
- Template changes and deployments fail.
- Sites continue operating independently.
**Recovery:**
1. Restore SQL Server access.
2. Central nodes will automatically reconnect (EF Core connection resiliency).
3. Verify `/health/ready` returns 200.
4. No manual intervention needed on ScadaLink nodes.
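EF Core's connection resiliency (step 2) retries transient failures with increasing delays rather than failing fast. A minimal analogue of that retry loop (Python; the retry count and delays here are illustrative, not EF Core's defaults):

```python
import time

def with_retries(operation, max_retries=5, base_delay=0.1):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Example: a query that fails while SQL Server is down, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database unreachable")
    return "ok"
```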
### Scenario 5: Forced Singleton Migration
**When to use:** The active node is degraded but not crashed (e.g., high CPU, disk full).
**Procedure:**
1. Initiate graceful shutdown on the degraded node:
- Stop the Windows Service: `sc.exe stop ScadaLink-Central`
- CoordinatedShutdown will migrate singletons to the standby.
2. Wait for the standby to take over (check logs for "Singleton acquired").
3. Fix the issue on the original node.
4. Restart the service. It will rejoin as standby.
## Failover Timeline
```
T+0s    Active node fails (crash, network loss, hardware fault)
T+2s    First heartbeat missed; Akka.NET begins suspecting the node
T+10s   Acceptable heartbeat pause exceeded; node marked unreachable
T+10s   Split-brain resolver begins stable-after countdown
T+25s   Resolver acts: surviving node promoted to active
T+25s   Singleton migration begins
T+26s   Instance Actors start recreating (staggered)
T+30s   First health report sent from new active node
T+60s   All instances operational (500 instances * 50ms stagger = 25s)
```
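The milestones above follow directly from the configured timings; computing them explicitly (Python; 500 instances and the 50 ms stagger are this document's example figures, and the 1 s singleton handoff is an assumption read off the T+25s/T+26s rows):

```python
HEARTBEAT_PAUSE = 10.0   # seconds until a silent node is marked unreachable
STABLE_AFTER = 15.0      # split-brain resolver stabilization window
SINGLETON_HANDOFF = 1.0  # assumed gap between promotion and actor recreation
STAGGER = 0.05           # per-instance recreation delay (50 ms)
INSTANCES = 500

unreachable_at = HEARTBEAT_PAUSE                           # T+10s
promoted_at = unreachable_at + STABLE_AFTER                # T+25s
recreation_starts = promoted_at + SINGLETON_HANDOFF        # T+26s
all_operational = recreation_starts + INSTANCES * STAGGER  # ~T+51s
```

The computed ~51 s sits comfortably inside the T+60s "all instances operational" bound.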