ScadaLink Failover Procedures
Automatic Failover (No Intervention Required)
Central Cluster Failover
What happens automatically:
- Active central node becomes unreachable (process crash, network failure, hardware failure).
- Akka.NET failure detection triggers after ~10 seconds (2s heartbeat, 10s threshold).
- Split-brain resolver (keep-oldest) evaluates cluster state for 15 seconds (stable-after).
- Standby node is promoted to active. Total time: ~25 seconds.
- Cluster singletons migrate to the new active node.
- Load balancer detects the failed node via /health/ready and routes traffic to the surviving node.
- Active user sessions continue (JWT tokens are validated by the new node using the shared signing key).
- SignalR connections are dropped and Blazor clients automatically reconnect.
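The detection and resolution timings above correspond to standard Akka.NET cluster settings. A hedged sketch of what the HOCON configuration might look like — the key names are standard Akka.NET cluster and split-brain-resolver options, but the exact values and layout here are inferred from the timings quoted above, not taken from the actual deployment:

```hocon
akka.cluster {
  # ~10s failure detection: 2s heartbeats, 10s acceptable pause
  failure-detector {
    heartbeat-interval = 2s
    acceptable-heartbeat-pause = 10s
  }
  # Split-brain resolution: keep-oldest strategy, 15s stable-after
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
    keep-oldest {
      down-if-alone = on
    }
  }
}
```

These three values (2s heartbeat, 10s pause, 15s stable-after) account for the ~25-second total failover time described above.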
What is preserved:
- All configuration and deployment state (stored in SQL Server)
- Active JWT sessions (shared signing key)
- Deployment status records (SQL Server with optimistic concurrency)
What is temporarily disrupted:
- In-flight deployments: Central re-queries site state and re-issues if needed (idempotent)
- Real-time debug view streams: Clients reconnect automatically
- Health dashboard: Resumes on reconnect
Site Cluster Failover
What happens automatically:
- Active site node becomes unreachable.
- Failure detection and split-brain resolution (~25 seconds total).
- Site Deployment Manager singleton migrates to standby.
- Instance Actors are recreated from persisted SQLite configurations.
- Staggered startup: 50ms delay between instance creations to prevent reconnection storms.
- DCL connection actors reconnect to OPC UA servers.
- Script Actors and Alarm Actors resume processing from incoming values (no stale state).
- S&F buffer is read from SQLite — pending retries resume.
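The staggered startup in the list above can be illustrated with a small sketch. The real system is an Akka.NET site node; this Python outline shows only the pacing logic, and names such as `create_instance_actor` are hypothetical stand-ins:

```python
import asyncio

STAGGER_SECONDS = 0.050  # 50 ms pause between instance creations


async def create_instance_actor(config: dict) -> str:
    """Hypothetical stand-in for recreating one Instance Actor
    from its persisted SQLite configuration."""
    return config["name"]


async def recreate_instances(configs: list[dict]) -> list[str]:
    """Recreate instances one at a time, pausing between each so the
    DCL connection actors do not all reconnect to the OPC UA servers
    at once (the 'reconnection storm' the stagger is meant to prevent)."""
    created = []
    for config in configs:
        created.append(await create_instance_actor(config))
        await asyncio.sleep(STAGGER_SECONDS)
    return created


configs = [{"name": f"instance-{i}"} for i in range(5)]
print(asyncio.run(recreate_instances(configs)))
```

At 50 ms per instance, a site with 500 instances spends about 25 seconds on recreation alone, which is where the tail of the failover timeline comes from.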
What is preserved:
- Deployed instance configurations (SQLite)
- Static attribute overrides (SQLite)
- S&F message buffer (SQLite)
- Site event logs (SQLite)
What is temporarily disrupted:
- Tag value subscriptions: DCL reconnects and re-subscribes transparently
- Active script executions: Cancelled; trigger fires again on next value change
- Alarm states: Re-evaluated from incoming tag values (correct state within one update cycle)
Manual Intervention Scenarios
Scenario 1: Both Central Nodes Down
Symptoms: No central UI access, sites report "central unreachable" in logs.
Recovery:
- Start either central node. It will form a single-node cluster.
- Verify SQL Server is accessible.
- Check that /health/ready returns 200.
- Start the second node. It will join the cluster automatically.
- Verify both nodes appear in the Akka.NET cluster member list (check logs for "Member joined").
No data loss: All state is in SQL Server.
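The "/health/ready returns 200" checks in these recovery procedures can be scripted. A minimal sketch — the endpoint path comes from this document, but the function name, polling cadence, and timeout are illustrative assumptions:

```python
import time
import urllib.request


def wait_for_ready(url: str, timeout: float = 60.0, interval: float = 2.0,
                   fetch=None) -> bool:
    """Poll a readiness endpoint until it returns HTTP 200 or the timeout expires.

    fetch is injectable for testing; the default performs a real HTTP GET.
    """
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=5).status)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            pass  # node not listening yet; keep polling
        time.sleep(interval)
    return False
```

Usage during recovery might look like `wait_for_ready("https://<central-node>/health/ready")`, where the hostname is whatever the node is reachable as in your environment.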
Scenario 2: Both Site Nodes Down
Symptoms: Site appears offline in central health dashboard.
Recovery:
- Start either site node.
- Check logs for "Store-and-forward SQLite storage initialized".
- Verify instance actors are recreated: "Instance {Name}: created N script actors and M alarm actors".
- Start the second site node.
- Verify the site appears online in the central health dashboard within 60 seconds.
No data loss: All state is in SQLite.
Scenario 3: Split-Brain (Network Partition Between Peers)
Symptoms: Both nodes believe they are the active node. Logs show "Cluster partition detected".
How the system handles it:
- Keep-oldest resolver: The older node (first to join cluster) survives; the younger is downed.
- down-if-alone = on: If a node is alone (no peers), it downs itself.
- Stable-after (15s): The resolver waits 15 seconds for the partition to stabilize before acting.
Manual intervention (if auto-resolution fails):
- Stop both nodes.
- Start the preferred node first (it becomes the "oldest").
- Start the second node.
Scenario 4: SQL Server Outage (Central)
Symptoms: Central UI returns errors. /health/ready returns 503. Logs show database connection failures.
Impact:
- Active sessions with valid JWTs can still access cached UI state.
- New logins fail (LDAP auth still works but role mapping requires DB).
- Template changes and deployments fail.
- Sites continue operating independently.
Recovery:
- Restore SQL Server access.
- Central nodes will automatically reconnect (EF Core connection resiliency).
- Verify /health/ready returns 200.
- No manual intervention needed on ScadaLink nodes.
Scenario 5: Forced Singleton Migration
When to use: The active node is degraded but not crashed (e.g., high CPU, disk full).
Procedure:
- Initiate graceful shutdown on the degraded node:
  - Stop the Windows Service: sc.exe stop ScadaLink-Central
  - CoordinatedShutdown will migrate singletons to the standby.
- Wait for the standby to take over (check logs for "Singleton acquired").
- Fix the issue on the original node.
- Restart the service. It will rejoin as standby.
Failover Timeline
T+0s Node failure detected (heartbeat timeout)
T+2s Akka.NET marks node as unreachable
T+10s Failure detection confirmed (threshold reached)
T+10s Split-brain resolver begins stable-after countdown
T+25s Resolver actions: surviving node promoted
T+25s Singleton migration begins
T+26s Instance Actors start recreating (staggered)
T+30s Health report sent from new active node
T+60s All instances operational (500 instances * 50ms stagger = 25s)
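The timeline above follows directly from the configured timings. A quick sketch checking the arithmetic, with parameter values taken from this document:

```python
HEARTBEAT_PAUSE_S = 10   # failure-detection threshold (T+10s confirmed)
STABLE_AFTER_S = 15      # split-brain resolver stable-after window
INSTANCES = 500
STAGGER_MS = 50          # delay between instance creations

promoted_at = HEARTBEAT_PAUSE_S + STABLE_AFTER_S          # T+25s: surviving node promoted
stagger_total = INSTANCES * STAGGER_MS / 1000             # 25.0 s of staggered creation
last_instance_created = promoted_at + 1 + stagger_total   # creation begins ~T+26s

print(f"promoted at T+{promoted_at}s, last instance created ~T+{last_instance_created:.0f}s")
# → promoted at T+25s, last instance created ~T+51s
```

The gap between ~T+51s (last actor created) and the T+60s "all operational" figure is presumably headroom for OPC UA reconnects and first-value subscriptions after the final actor starts.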