scadalink-design/docs/plans/2026-03-16-cluster-infrastructure-refinement-design.md
Joseph Doherty 3dd62adf42 Refine Cluster Infrastructure: split-brain, seed nodes, failure detection, dual recovery
Add keep-oldest split-brain resolver with 15s stable-after duration. Configure both
nodes as seed nodes for symmetric startup. Set moderate failure detection defaults
(2s heartbeat, 10s threshold, ~25s total failover). Document automatic dual-node
recovery from persistent storage with no manual intervention.
2026-03-16 08:07:28 -04:00

Cluster Infrastructure Refinement — Design

Date: 2026-03-16
Component: Cluster Infrastructure (Component-ClusterInfrastructure.md)
Status: Approved

Problem

The Cluster Infrastructure doc covered topology and failover behavior but did not specify the split-brain resolver strategy, seed node configuration, failure detection timing, or dual-node failure recovery.

Decisions

Split-Brain Resolver

  • Keep-oldest strategy. The longest-running node stays active on partition; the younger node downs itself.
  • Stable-after duration: 15 seconds — prevents premature downing during startup or transient instability.
  • Quorum-based strategies were rejected because they cause total cluster shutdown on any partition in a two-node cluster.
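
The keep-oldest rule can be illustrated with a minimal Python sketch. It is not the cluster framework's resolver: the `Member` type, the `up_number` field, and the `keep_oldest_decision` function are hypothetical names introduced here, and a production resolver handles more cases than this two-node view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    # Hypothetical member record; a lower up_number means the node joined earlier.
    name: str
    up_number: int


def keep_oldest_decision(self_name, members, reachable, stable_for_s,
                         stable_after_s=15.0):
    """Return 'wait', 'stay', or 'down' for the local node.

    members      -- every member known before the partition
    reachable    -- names of members the local node can still reach
    stable_for_s -- seconds since the membership view last changed
    """
    # Do nothing until the view has been stable for the stable-after duration.
    if stable_for_s < stable_after_s:
        return "wait"
    oldest = min(members, key=lambda m: m.up_number)
    my_side = {self_name} | set(reachable)
    # The side of the partition that contains the oldest member survives.
    return "stay" if oldest.name in my_side else "down"


members = [Member("node-a", 1), Member("node-b", 2)]
print(keep_oldest_decision("node-a", members, set(), 20.0))  # older node, partitioned
print(keep_oldest_decision("node-b", members, set(), 20.0))  # younger node, partitioned
```

Under these assumptions the older node keeps running and the younger node downs itself once the 15-second window has elapsed; before that, both nodes wait.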

Seed Node Configuration

  • Both nodes are seed nodes. No startup ordering dependency. Whichever node starts first forms the cluster.

Failure Detection Timing

  • Heartbeat interval: 2 seconds.
  • Failure threshold: 10 seconds (5 missed heartbeats).
  • Total failover time: ~25 seconds (10s detection + 15s stable-after = 25s, plus singleton restart time).
  • All values configurable. Defaults balance failover speed with stability.
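
The arithmetic behind these defaults can be checked with a short Python sketch. The `FailoverTiming` class and its field names are illustrative only and do not correspond to actual configuration keys; singleton restart time is left as a parameter because the design does not quantify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTiming:
    # Defaults taken from the decisions above; field names are hypothetical.
    heartbeat_interval_s: float = 2.0
    failure_threshold_s: float = 10.0
    stable_after_s: float = 15.0

    def missed_heartbeats(self) -> int:
        # Heartbeats that go unanswered before the peer is considered failed.
        return int(self.failure_threshold_s // self.heartbeat_interval_s)

    def failover_estimate_s(self, singleton_restart_s: float = 0.0) -> float:
        # Detection + stable-after + however long the singleton takes to restart.
        return self.failure_threshold_s + self.stable_after_s + singleton_restart_s


timing = FailoverTiming()
print(timing.missed_heartbeats())    # 5
print(timing.failover_estimate_s())  # 25.0
```

Passing a measured restart time to `failover_estimate_s` gives a fuller estimate for a specific deployment.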

Dual-Node Recovery

  • Automatic recovery, no manual intervention. First node up forms a new cluster from seed configuration.
  • Site clusters rebuild from SQLite (deployed configs, S&F buffer). Alarm states re-evaluate from live data.
  • Central cluster rebuilds from MS SQL. No message buffer state to recover.

Affected Documents

| Document | Change |
| --- | --- |
| Component-ClusterInfrastructure.md | Added 3 new sections: Split-Brain Resolution, Failure Detection Timing, Dual-Node Recovery. Updated Node Configuration to clarify both-as-seed. |

Alternatives Considered

  • Static-quorum / keep-majority: Rejected — both cause total cluster shutdown on partition in a two-node cluster. Unacceptable for SCADA availability.
  • Single designated seed node: Rejected — creates startup ordering dependency for no benefit in a two-node cluster.
  • Manual recovery on dual failure: Rejected — system already persists all state needed for automatic recovery.
  • Fast detection (1s/5s): Rejected — too sensitive; brief network hiccups would trigger unnecessary failovers and full actor hierarchy rebuilds.
  • Conservative detection (5s/30s): Rejected — 30 seconds of data collection downtime is too long for SCADA.
  • Shorter stable-after (10s): Rejected — matching the failure threshold risks downing nodes that are slow to respond (GC pause, heavy load).
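
To make the timing trade-offs concrete, the following Python sketch compares the three detection profiles and the two stable-after options. It assumes the stable-after duration stays at 15 seconds for every detection profile, which the design does not state for the rejected options; the names `PROFILES`, `compare`, and `stable_after_margin_s` are hypothetical.

```python
# (heartbeat interval, failure threshold) in seconds, from the alternatives above.
PROFILES = {
    "fast": (1.0, 5.0),
    "moderate": (2.0, 10.0),      # the chosen defaults
    "conservative": (5.0, 30.0),
}


def compare(profiles, stable_after_s):
    """Summarise each profile, excluding singleton restart time."""
    rows = {}
    for name, (interval_s, threshold_s) in profiles.items():
        rows[name] = {
            "missed_heartbeats": threshold_s / interval_s,
            "detection_plus_stable_after_s": threshold_s + stable_after_s,
        }
    return rows


def stable_after_margin_s(stable_after_s, failure_threshold_s):
    """Slack between failure detection and the downing decision."""
    return stable_after_s - failure_threshold_s


# Assumption: stable-after is held at 15 s across all three profiles.
for name, row in compare(PROFILES, 15.0).items():
    print(name, row)

print(stable_after_margin_s(15.0, 10.0))  # chosen stable-after
print(stable_after_margin_s(10.0, 10.0))  # rejected 10 s stable-after
```

With a 10-second stable-after the margin over the failure threshold drops to zero, which is the situation the last alternative rejects.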