Refine Cluster Infrastructure: split-brain, seed nodes, failure detection, dual recovery

Add keep-oldest split-brain resolver with 15s stable-after duration. Configure both nodes as seed nodes for symmetric startup. Set moderate failure detection defaults (2s heartbeat, 10s threshold, ~25s total failover). Document automatic dual-node recovery from persistent storage with no manual intervention.
2026-03-16 08:07:28 -04:00
parent bd735de8c4
commit 3dd62adf42
2 changed files with 77 additions and 1 deletions
--- a/docs/plans/2026-03-16-cluster-infrastructure-refinement-design.md
+++ b/docs/plans/2026-03-16-cluster-infrastructure-refinement-design.md
@@ -0,0 +1,45 @@
+# Cluster Infrastructure Refinement — Design
+
+**Date**: 2026-03-16
+**Component**: Cluster Infrastructure (`Component-ClusterInfrastructure.md`)
+**Status**: Approved
+
+## Problem
+
+The Cluster Infrastructure doc covered topology and failover behavior but did not specify the split-brain resolver strategy, seed node configuration, failure detection timing, or recovery from dual-node failure.
+
+## Decisions
+
+### Split-Brain Resolver
+- **Keep-oldest** strategy. When a partition occurs, the longest-running node stays active and the younger node downs itself.
+- Stable-after duration: 15 seconds — prevents premature downing during startup or transient instability.
+- Quorum-based strategies were rejected because they cause total cluster shutdown on any partition in a two-node cluster.
+
+### Seed Node Configuration
+- **Both nodes are seed nodes.** No startup ordering dependency. Whichever node starts first forms the cluster.
+
+### Failure Detection Timing
+- Heartbeat interval: **2 seconds**.
+- Failure threshold: **10 seconds** (5 missed heartbeats).
+- Total failover time: **~25 seconds** (10s detection + 15s stable-after, plus singleton restart time).
+- All values configurable. Defaults balance failover speed with stability.
+
+### Dual-Node Recovery
+- **Automatic recovery**, no manual intervention. First node up forms a new cluster from seed configuration.
+- Site clusters rebuild from SQLite (deployed configs, store-and-forward (S&F) buffer). Alarm states are re-evaluated from live data.
+- Central cluster rebuilds from MS SQL. No message buffer state to recover.
+
+## Affected Documents
+
+| Document | Change |
+|----------|--------|
+| `Component-ClusterInfrastructure.md` | Added 3 new sections: Split-Brain Resolution, Failure Detection Timing, Dual-Node Recovery. Updated Node Configuration to clarify both-as-seed. |
+
+## Alternatives Considered
+
+- **Static-quorum / keep-majority**: Rejected — both cause total cluster shutdown on partition in a two-node cluster. Unacceptable for SCADA availability.
+- **Single designated seed node**: Rejected — creates startup ordering dependency for no benefit in a two-node cluster.
+- **Manual recovery on dual failure**: Rejected — system already persists all state needed for automatic recovery.
+- **Fast detection (1s heartbeat, 5s threshold)**: Rejected as too sensitive; brief network hiccups would trigger unnecessary failovers and full actor hierarchy rebuilds.
+- **Conservative detection (5s heartbeat, 30s threshold)**: Rejected because 30 seconds of data collection downtime is too long for SCADA.
+- **Shorter stable-after (10s)**: Rejected — matching the failure threshold risks downing nodes that are slow to respond (GC pause, heavy load).
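
Reviewer note: the timing figures in the design doc above can be sanity-checked with a small sketch. It is illustrative only and not part of the commit. `DetectionProfile` and its methods are hypothetical names, and reading the rejected alternatives as "heartbeat / threshold" pairs is an assumption. Singleton restart time is not quantified in the doc, so the sketch computes only the detection plus stable-after portion of failover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionProfile:
    """Hypothetical container for the timing values named in the design doc."""
    name: str
    heartbeat_s: float   # heartbeat interval, in seconds
    threshold_s: float   # failure threshold, in seconds

    def missed_heartbeats(self) -> float:
        # Number of heartbeat intervals that fit inside the failure threshold.
        return self.threshold_s / self.heartbeat_s

    def failover_floor_s(self, stable_after_s: float) -> float:
        # Detection time plus stable-after; singleton restart comes on top.
        return self.threshold_s + stable_after_s


STABLE_AFTER_S = 15.0  # stable-after duration chosen in the design

DEFAULT = DetectionProfile("default", heartbeat_s=2.0, threshold_s=10.0)
# Assumption: "(1s/5s)" and "(5s/30s)" mean heartbeat / threshold.
FAST = DetectionProfile("fast (rejected)", heartbeat_s=1.0, threshold_s=5.0)
CONSERVATIVE = DetectionProfile("conservative (rejected)", heartbeat_s=5.0, threshold_s=30.0)

if __name__ == "__main__":
    for profile in (DEFAULT, FAST, CONSERVATIVE):
        print(
            f"{profile.name}: {profile.missed_heartbeats():g} missed heartbeats, "
            f"detection after {profile.threshold_s:g}s"
        )
    print(f"default failover floor: {DEFAULT.failover_floor_s(STABLE_AFTER_S):g}s")
```

With the chosen defaults this gives 5 missed heartbeats and a 25-second floor, matching the "~25 seconds" figure before singleton restart is added.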