# Cluster Infrastructure Refinement: Design

**Date**: 2026-03-16
**Component**: Cluster Infrastructure (`docs/requirements/Component-ClusterInfrastructure.md`)
**Status**: Approved

## Problem

The Cluster Infrastructure doc covered topology and failover behavior but lacked specification for the split-brain resolver strategy, seed node configuration, failure detection timing, and dual-node failure recovery.

## Decisions

### Split-Brain Resolver

- **Keep-oldest** strategy. The longest-running node stays active on partition; the younger node downs itself.
- Stable-after duration: 15 seconds. Prevents premature downing during startup or transient instability.
- Quorum-based strategies rejected because they cause total cluster shutdown on any partition in a two-node cluster.

### Seed Node Configuration

- **Both nodes are seed nodes.** No startup ordering dependency. Whichever node starts first forms the cluster.

### Failure Detection Timing

- Heartbeat interval: **2 seconds**.
- Failure threshold: **10 seconds** (5 missed heartbeats).
- Total failover time: **~25 seconds** (10s detection + 15s stable-after + singleton restart).
- All values configurable. Defaults balance failover speed with stability.

### Dual-Node Recovery

- **Automatic recovery**, no manual intervention. First node up forms a new cluster from seed configuration.
- Site clusters rebuild from SQLite (deployed configs, S&F buffer). Alarm states re-evaluate from live data.
- Central cluster rebuilds from MS SQL. No message buffer state to recover.

## Affected Documents

| Document | Change |
|----------|--------|
| `docs/requirements/Component-ClusterInfrastructure.md` | Added 3 new sections: Split-Brain Resolution, Failure Detection Timing, Dual-Node Recovery. Updated Node Configuration to clarify both-as-seed. |

## Alternatives Considered

- **Static-quorum / keep-majority**: Rejected: both cause total cluster shutdown on partition in a two-node cluster. Unacceptable for SCADA availability.
- **Single designated seed node**: Rejected: creates startup ordering dependency for no benefit in a two-node cluster.
- **Manual recovery on dual failure**: Rejected: system already persists all state needed for automatic recovery.
- **Fast detection (1s/5s)**: Rejected: too sensitive; brief network hiccups would trigger unnecessary failovers and full actor hierarchy rebuilds.
- **Conservative detection (5s/30s)**: Rejected: 30 seconds of data collection downtime is too long for SCADA.
- **Shorter stable-after (10s)**: Rejected: matching the failure threshold risks downing nodes that are slow to respond (GC pause, heavy load).
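
The Failure Detection Timing figures can be checked with a short calculation. The sketch below is illustrative only: `FailoverTiming` and its field names are hypothetical and do not come from the component or from any cluster library. It derives the 10-second failure threshold from the heartbeat interval, adds the stable-after duration, and rejects a stable-after that does not exceed the threshold, which mirrors the reasoning behind the rejected 10-second alternative. Singleton restart time is not quantified in this design, so the computed figure is a lower bound on total failover time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTiming:
    """Hypothetical container for the default timing values in this design."""

    heartbeat_interval_s: float = 2.0
    missed_heartbeats: int = 5
    stable_after_s: float = 15.0

    @property
    def failure_threshold_s(self) -> float:
        # Detection time: missed heartbeats multiplied by the interval.
        return self.heartbeat_interval_s * self.missed_heartbeats

    @property
    def failover_floor_s(self) -> float:
        # Detection plus stable-after. Singleton restart comes on top of
        # this and is not quantified in the design, so it is left out.
        return self.failure_threshold_s + self.stable_after_s

    def validate(self) -> None:
        # The design rejects a stable-after equal to the failure threshold,
        # so this check requires it to be strictly longer.
        if self.stable_after_s <= self.failure_threshold_s:
            raise ValueError("stable-after must exceed the failure threshold")


defaults = FailoverTiming()
defaults.validate()
print(defaults.failure_threshold_s)  # 10.0
print(defaults.failover_floor_s)     # 25.0
```

Calling `FailoverTiming(stable_after_s=10.0).validate()` raises `ValueError`, which corresponds to the rejected shorter stable-after alternative.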
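
The split-brain decision can also be illustrated with a simplified model. This is a sketch under stated assumptions, not the resolver implementation of any cluster library: `Node`, `keep_oldest`, and `static_quorum` are hypothetical names, node age is represented by an assumed `uptime_s` field, and the quorum size of 2 is assumed because it is the smallest value that prevents both sides of a two-node cluster from staying up. The model shows why keep-oldest leaves one node active after a partition while a fixed quorum leaves none.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    uptime_s: float  # hypothetical field: seconds since the node joined


def keep_oldest(side: list[Node], cluster: list[Node]) -> list[Node]:
    """Return the nodes on this side that stay up under keep-oldest."""
    oldest = max(cluster, key=lambda n: n.uptime_s)
    # Only the side containing the longest-running node survives.
    return list(side) if oldest in side else []


def static_quorum(side: list[Node], quorum_size: int) -> list[Node]:
    """Return the nodes on this side that stay up under a fixed quorum."""
    # A side survives only if it can still see at least quorum_size nodes.
    return list(side) if len(side) >= quorum_size else []


node_a = Node("node-a", uptime_s=3600.0)  # longest-running node
node_b = Node("node-b", uptime_s=120.0)   # younger node
cluster = [node_a, node_b]

# A partition in a two-node cluster leaves exactly one node on each side.
sides = [[node_a], [node_b]]

oldest_survivors = [n.name for s in sides for n in keep_oldest(s, cluster)]
quorum_survivors = [n.name for s in sides for n in static_quorum(s, quorum_size=2)]

print(oldest_survivors)  # ['node-a']
print(quorum_survivors)  # []
```

Under keep-oldest the older node keeps running and the younger node downs itself; under the assumed quorum of 2, neither side reaches quorum and the whole cluster stops.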