# ScadaLink Failover Procedures
## Automatic Failover (No Intervention Required)
### Central Cluster Failover
**What happens automatically:**

1. The active central node becomes unreachable (process crash, network failure, hardware failure).
2. Akka.NET failure detection triggers after ~10 seconds (2s heartbeat, 10s threshold).
3. The split-brain resolver (keep-oldest) evaluates cluster state for 15 seconds (stable-after).
4. The standby node is promoted to active. Total time: ~25 seconds.
5. Cluster singletons migrate to the new active node.
6. The load balancer detects the failed node via `/health/ready` and routes traffic to the surviving node.
7. Active user sessions continue (JWT tokens are validated by the new node using the shared signing key).
8. SignalR connections are dropped and Blazor clients reconnect automatically.

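The timings above map onto Akka.NET's failure-detector and split-brain-resolver settings. A minimal HOCON sketch using the values quoted in this runbook (the key names are standard Akka.NET; the actual configuration file in your deployment may differ):

```hocon
akka.cluster {
  # Heartbeat every 2s; tolerate up to 10s of missed heartbeats (step 2 above)
  failure-detector {
    heartbeat-interval = 2s
    acceptable-heartbeat-pause = 10s
  }

  # Keep-oldest split-brain resolution with a 15s stabilization window (step 3)
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }
}
```
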
**What is preserved:**

- All configuration and deployment state (stored in SQL Server)
- Active JWT sessions (shared signing key)
- Deployment status records (SQL Server with optimistic concurrency)

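Session continuity works because both central nodes hold the same JWT signing key, so either node can validate a token minted by the other. A minimal HS256-style sketch of that property (the key and payload are hypothetical, and the real system presumably uses a JWT library rather than raw HMAC):

```python
import base64
import hashlib
import hmac

# Hypothetical shared signing key, identical on both central nodes
SHARED_KEY = b"example-shared-signing-key"

def sign(payload: bytes, key: bytes) -> str:
    """Compute an HS256-style signature over a token payload."""
    digest = hmac.new(key, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode()

# Token signed by node A before failover...
signature = sign(b'{"sub":"operator1"}', SHARED_KEY)

# ...still validates on node B, because B holds the same key
assert hmac.compare_digest(signature, sign(b'{"sub":"operator1"}', SHARED_KEY))
print("token accepted by surviving node")
```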
**What is temporarily disrupted:**

- In-flight deployments: Central re-queries site state and re-issues if needed (idempotent)
- Real-time debug view streams: Clients reconnect automatically
- Health dashboard: Resumes on reconnect

### Site Cluster Failover
**What happens automatically:**

1. The active site node becomes unreachable.
2. Failure detection and split-brain resolution (~25 seconds total).
3. The Site Deployment Manager singleton migrates to the standby.
4. Instance Actors are recreated from persisted SQLite configurations.
5. Staggered startup: a 50ms delay between instance creations prevents reconnection storms.
6. DCL connection actors reconnect to OPC UA servers.
7. Script Actors and Alarm Actors resume processing from incoming values (no stale state).
8. The S&F buffer is read from SQLite and pending retries resume.

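The staggered startup in step 5 bounds how long full recovery takes. A quick back-of-the-envelope helper (the instance counts are examples; only the 50ms stagger comes from this runbook):

```python
STAGGER_MS = 50  # delay between instance creations (step 5 above)

def recreation_seconds(instance_count: int, stagger_ms: int = STAGGER_MS) -> float:
    """Time to recreate all Instance Actors with a fixed stagger delay."""
    return instance_count * stagger_ms / 1000

print(recreation_seconds(500))   # -> 25.0 seconds at target scale
print(recreation_seconds(100))   # -> 5.0 seconds at a smaller site
```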
**What is preserved:**

- Deployed instance configurations (SQLite)
- Static attribute overrides (SQLite)
- S&F message buffer (SQLite)
- Site event logs (SQLite)

**What is temporarily disrupted:**

- Tag value subscriptions: DCL reconnects and re-subscribes transparently
- Active script executions: Cancelled; the trigger fires again on the next value change
- Alarm states: Re-evaluated from incoming tag values (correct state within one update cycle)

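Alarm state needs no persistence because it can be treated as a function of the latest tag value, so one fresh update restores it. A deliberately simplified sketch (real alarm logic, with deadbands and acknowledgement, is richer):

```python
def alarm_active(value: float, high_limit: float) -> bool:
    """Re-evaluate a high-limit alarm from the most recent tag value."""
    return value > high_limit

# After failover, the first incoming update restores the correct state
assert alarm_active(105.0, high_limit=100.0)       # alarm raised
assert not alarm_active(95.0, high_limit=100.0)    # alarm clear
```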
## Manual Intervention Scenarios
### Scenario 1: Both Central Nodes Down
**Symptoms:** No central UI access; sites report "central unreachable" in logs.

**Recovery:**

1. Start either central node. It will form a single-node cluster.
2. Verify SQL Server is accessible.
3. Check that `/health/ready` returns 200.
4. Start the second node. It will join the cluster automatically.
5. Verify both nodes appear in the Akka.NET cluster member list (check logs for "Member joined").

**No data loss:** All state is in SQL Server.

### Scenario 2: Both Site Nodes Down
**Symptoms:** Site appears offline in the central health dashboard.

**Recovery:**

1. Start either site node.
2. Check logs for "Store-and-forward SQLite storage initialized".
3. Verify instance actors are recreated: "Instance {Name}: created N script actors and M alarm actors".
4. Start the second site node.
5. Verify the site appears online in the central health dashboard within 60 seconds.

**No data loss:** All state is in SQLite.

### Scenario 3: Split-Brain (Network Partition Between Peers)
**Symptoms:** Both nodes believe they are the active node. Logs show "Cluster partition detected".

**How the system handles it:**

- Keep-oldest resolver: The older node (first to join the cluster) survives; the younger is downed.
- `down-if-alone = on`: A node that finds itself alone (no reachable peers) downs itself.
- Stable-after (15s): The resolver waits 15 seconds for the partition to stabilize before acting.

**Manual intervention (if auto-resolution fails):**

1. Stop both nodes.
2. Start the preferred node first (it becomes the "oldest").
3. Start the second node.

|
### Scenario 4: SQL Server Outage (Central)
**Symptoms:** Central UI returns errors. `/health/ready` returns 503. Logs show database connection failures.

**Impact:**

- Active sessions with valid JWTs can still access cached UI state.
- New logins fail (LDAP authentication still works, but role mapping requires the database).
- Template changes and deployments fail.
- Sites continue operating independently.

**Recovery:**

1. Restore SQL Server access.
2. Central nodes reconnect automatically (EF Core connection resiliency).
3. Verify `/health/ready` returns 200.
4. No manual intervention is needed on ScadaLink nodes.

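"Connection resiliency" here means transient database failures are retried with backoff instead of surfacing immediately (EF Core exposes this via `EnableRetryOnFailure`). A generic, language-agnostic sketch of the pattern (the attempt counts and delays are illustrative):

```python
import time

def with_retries(op, attempts=5, base_delay=0.1):
    """Run op(), retrying transient connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # exhausted: surface the failure
            time.sleep(base_delay * 2 ** attempt)

# Simulated flaky database call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = with_retries(flaky)
assert result == "ok" and calls["n"] == 3
```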
### Scenario 5: Forced Singleton Migration
**When to use:** The active node is degraded but not crashed (e.g., high CPU, disk full).

**Procedure:**

1. Initiate a graceful shutdown on the degraded node:
   - Stop the Windows Service: `sc.exe stop ScadaLink-Central`
   - CoordinatedShutdown migrates the singletons to the standby.
2. Wait for the standby to take over (check logs for "Singleton acquired").
3. Fix the issue on the original node.
4. Restart the service. The node rejoins as standby.

## Failover Timeline
```
T+0s   Node fails (last heartbeat received)
T+2s   First heartbeat missed
T+10s  Failure detection confirmed (threshold reached); node marked unreachable
T+10s  Split-brain resolver begins stable-after countdown
T+25s  Resolver acts: surviving node promoted
T+25s  Singleton migration begins
T+26s  Instance Actors start recreating (staggered)
T+30s  Health report sent from new active node
T+60s  All instances operational (500 instances * 50ms stagger = 25s)
```