ScadaLink Failover Procedures
Automatic Failover (No Intervention Required)
Central Cluster Failover
What happens automatically:
- Active central node becomes unreachable (process crash, network failure, hardware failure).
- Akka.NET failure detection triggers after ~10 seconds (2s heartbeat, 10s threshold).
- Split-brain resolver (keep-oldest) evaluates cluster state for 15 seconds (stable-after).
- Standby node is promoted to active. Total time: ~25 seconds.
- Cluster singletons migrate to the new active node.
- Load balancer detects the failed node via /health/ready and routes traffic to the surviving node.
- Active user sessions continue (JWT tokens are validated by the new node using the shared signing key).
- SignalR connections are dropped and Blazor clients automatically reconnect.
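The detection and resolution timings above correspond to standard Akka.NET cluster settings. A hedged sketch of what the HOCON configuration might look like — the key names are standard Akka.NET cluster and split-brain-resolver options, but the exact values and layout here are inferred from the timings quoted above, not taken from the actual deployment:

```hocon
akka.cluster {
  # ~10s failure detection: 2s heartbeats, 10s acceptable pause
  failure-detector {
    heartbeat-interval = 2s
    acceptable-heartbeat-pause = 10s
  }
  # Split-brain resolution: keep-oldest strategy, 15s stable-after
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }
}
```

These three values (2s heartbeat, 10s pause, 15s stable-after) account for the ~25-second total failover time described above.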
What is preserved:
- All configuration and deployment state (stored in SQL Server)
- Active JWT sessions (shared signing key)
- Deployment status records (SQL Server with optimistic concurrency)
What is temporarily disrupted:
- In-flight deployments: Central re-queries site state and re-issues if needed (idempotent)
- Real-time debug view streams: Clients reconnect automatically
- Health dashboard: Resumes on reconnect
Site Cluster Failover
What happens automatically:
- Active site node becomes unreachable.
- Failure detection and split-brain resolution (~25 seconds total).
- Site Deployment Manager singleton migrates to standby.
- Instance Actors are recreated from persisted SQLite configurations.
- Staggered startup: 50ms delay between instance creations to prevent reconnection storms.
- DCL connection actors reconnect to OPC UA servers.
- Script Actors and Alarm Actors resume processing from incoming values (no stale state).
- S&F buffer is read from SQLite — pending retries resume.
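The staggered startup in the list above can be illustrated with a small sketch. The real system is an Akka.NET site node; this Python outline shows only the pacing logic, and names such as `create_instance_actor` are hypothetical stand-ins:

```python
import asyncio

STAGGER_SECONDS = 0.050  # 50 ms pause between instance creations


async def create_instance_actor(config: dict) -> str:
    """Hypothetical stand-in for recreating one Instance Actor
    from its persisted SQLite configuration."""
    return config["name"]


async def recreate_instances(configs: list[dict]) -> list[str]:
    """Recreate instances one at a time, pausing between each so the
    DCL connection actors do not all reconnect to the OPC UA servers
    at once (the 'reconnection storm' the stagger is meant to prevent)."""
    created = []
    for config in configs:
        created.append(await create_instance_actor(config))
        await asyncio.sleep(STAGGER_SECONDS)
    return created


configs = [{"name": f"instance-{i}"} for i in range(5)]
print(asyncio.run(recreate_instances(configs)))
```

At 50 ms per instance, a site with 500 instances spends about 25 seconds on recreation alone, which is where the tail of the failover timeline comes from.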
What is preserved:
- Deployed instance configurations (SQLite)
- Static attribute overrides (SQLite)
- S&F message buffer (SQLite)
- Site event logs (SQLite)
What is temporarily disrupted:
- Tag value subscriptions: DCL reconnects and re-subscribes transparently
- Active script executions: Cancelled; trigger fires again on next value change
- Alarm states: Re-evaluated from incoming tag values (correct state within one update cycle)
Manual Intervention Scenarios
Scenario 1: Both Central Nodes Down
Symptoms: No central UI access, sites report "central unreachable" in logs.
Recovery:
- Start either central node. It will form a single-node cluster.
- Verify SQL Server is accessible.
- Check that /health/ready returns 200.
- Start the second node. It will join the cluster automatically.
- Verify both nodes appear in the Akka.NET cluster member list (check logs for "Member joined").
No data loss: All state is in SQL Server.
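The "/health/ready returns 200" checks in these recovery procedures can be scripted. A minimal sketch — the endpoint path comes from this document, but the function name, polling cadence, and timeout are illustrative assumptions:

```python
import time
import urllib.request


def wait_for_ready(url: str, timeout: float = 60.0, interval: float = 2.0,
                   fetch=None) -> bool:
    """Poll a readiness endpoint until it returns HTTP 200 or the timeout expires.

    fetch is injectable for testing; the default performs a real HTTP GET.
    """
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=5).status)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            pass  # node not listening yet; keep polling
        time.sleep(interval)
    return False
```

Usage during recovery might look like `wait_for_ready("https://<central-node>/health/ready")`, where the hostname is whatever the node is reachable as in your environment.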
Scenario 2: Both Site Nodes Down
Symptoms: Site appears offline in central health dashboard.
Recovery:
- Start either site node.
- Check logs for "Store-and-forward SQLite storage initialized".
- Verify instance actors are recreated: "Instance {Name}: created N script actors and M alarm actors".
- Start the second site node.
- Verify the site appears online in the central health dashboard within 60 seconds.
No data loss: All state is in SQLite.
Scenario 3: Split-Brain (Network Partition Between Peers)
Symptoms: Both nodes believe they are the active node. Logs show "Cluster partition detected".
How the system handles it:
- Keep-oldest resolver: The older node (first to join cluster) survives; the younger is downed.
- down-if-alone = on: If a node is alone (no peers), it downs itself.
- Stable-after (15s): The resolver waits 15 seconds for the partition to stabilize before acting.
Manual intervention (if auto-resolution fails):
- Stop both nodes.
- Start the preferred node first (it becomes the "oldest").
- Start the second node.
Scenario 4: SQL Server Outage (Central)
Symptoms: Central UI returns errors. /health/ready returns 503. Logs show database connection failures.
Impact:
- Active sessions with valid JWTs can still access cached UI state.
- New logins fail (LDAP auth still works but role mapping requires DB).
- Template changes and deployments fail.
- Sites continue operating independently.
Recovery:
- Restore SQL Server access.
- Central nodes will automatically reconnect (EF Core connection resiliency).
- Verify /health/ready returns 200.
- No manual intervention needed on ScadaLink nodes.
Scenario 5: Forced Singleton Migration
When to use: The active node is degraded but not crashed (e.g., high CPU, disk full).
Procedure:
- Initiate graceful shutdown on the degraded node:
  - Stop the Windows Service: sc.exe stop ScadaLink-Central
  - CoordinatedShutdown will migrate singletons to the standby.
- Wait for the standby to take over (check logs for "Singleton acquired").
- Fix the issue on the original node.
- Restart the service. It will rejoin as standby.
Failover Timeline
T+0s Node failure detected (heartbeat timeout)
T+2s Akka.NET marks node as unreachable
T+10s Failure detection confirmed (threshold reached)
T+10s Split-brain resolver begins stable-after countdown
T+25s Resolver actions: surviving node promoted
T+25s Singleton migration begins
T+26s Instance Actors start recreating (staggered)
T+30s Health report sent from new active node
T+60s All instances operational (500 instances * 50ms stagger = 25s)
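The timeline above follows directly from the configured timings. A quick sketch checking the arithmetic, with parameter values taken from this document:

```python
HEARTBEAT_PAUSE_S = 10   # failure-detection threshold (T+10s confirmed)
STABLE_AFTER_S = 15      # split-brain resolver stable-after window
INSTANCES = 500
STAGGER_MS = 50          # delay between instance creations

promoted_at = HEARTBEAT_PAUSE_S + STABLE_AFTER_S          # T+25s: surviving node promoted
stagger_total = INSTANCES * STAGGER_MS / 1000             # 25.0 s of staggered creation
last_instance_created = promoted_at + 1 + stagger_total   # creation begins ~T+26s

print(f"promoted at T+{promoted_at}s, last instance created ~T+{last_instance_created:.0f}s")
# → promoted at T+25s, last instance created ~T+51s
```

The gap between ~T+51s (last actor created) and the T+60s "all operational" figure is presumably headroom for OPC UA reconnects and first-value subscriptions after the final actor starts.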