From 3dd62adf421c30d0cad026db9f63221576c9329a Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 16 Mar 2026 08:07:28 -0400
Subject: [PATCH] Refine Cluster Infrastructure: split-brain, seed nodes,
 failure detection, dual recovery

Add keep-oldest split-brain resolver with 15s stable-after duration.
Configure both nodes as seed nodes for symmetric startup. Set moderate
failure detection defaults (2s heartbeat, 10s threshold, ~25s total
failover). Document automatic dual-node recovery from persistent storage
with no manual intervention.
---
 Component-ClusterInfrastructure.md            | 33 +++++++++++++-
 ...luster-infrastructure-refinement-design.md | 45 +++++++++++++++++++
 2 files changed, 77 insertions(+), 1 deletion(-)
 create mode 100644 docs/plans/2026-03-16-cluster-infrastructure-refinement-design.md

diff --git a/Component-ClusterInfrastructure.md b/Component-ClusterInfrastructure.md
index 13e7ac3..5d48968 100644
--- a/Component-ClusterInfrastructure.md
+++ b/Component-ClusterInfrastructure.md
@@ -55,10 +55,41 @@ Both central and site clusters.
 - Health reporting resumes from the new active node.
 - Alarm states are re-evaluated from incoming values (alarm state is in-memory only).
 
+## Split-Brain Resolution
+
+The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:
+
+- On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
+- **Stable-after duration**: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
+- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest accepts a brief potential dual-active window during true network partitions, which is safe because site state rebuilds from SQLite and central state is in MS SQL.
+
+## Failure Detection Timing
+
+Configurable defaults for heartbeat and failure detection:
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| Heartbeat interval | 2 seconds | Frequency of health check messages between nodes |
+| Failure detection threshold | 10 seconds | Time without heartbeat before a node is considered unreachable |
+| Stable-after (split-brain) | 15 seconds | Time cluster must be stable before resolver acts |
+| **Total failover time** | **~25 seconds** | Detection (10s) + stable-after (15s) + singleton restart |
+
+These values balance failover speed with stability — fast enough that data collection gaps are small, tolerant enough that brief network hiccups don't trigger unnecessary failovers.
+
+## Dual-Node Recovery
+
+If both nodes in a cluster fail simultaneously (e.g., site power outage):
+
+1. **No manual intervention required.** Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
+2. **State recovery**:
+   - **Site clusters**: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy. Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
+   - **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
+3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.
+
 ## Node Configuration
 
 Each node is configured with:
-- **Cluster seed nodes**: Addresses of both nodes in the cluster.
+- **Cluster seed nodes**: **Both nodes** are seed nodes — each node lists both itself and its partner. Either node can start first and form the cluster; the other joins when it starts. No startup ordering dependency.
 - **Cluster role**: Central or Site (plus site identifier for site clusters).
 - **Akka.NET remoting**: Hostname/port for inter-node and inter-cluster communication.
 - **Local storage paths**: SQLite database locations (site nodes only).
diff --git a/docs/plans/2026-03-16-cluster-infrastructure-refinement-design.md b/docs/plans/2026-03-16-cluster-infrastructure-refinement-design.md
new file mode 100644
index 0000000..db43968
--- /dev/null
+++ b/docs/plans/2026-03-16-cluster-infrastructure-refinement-design.md
@@ -0,0 +1,45 @@
+# Cluster Infrastructure Refinement — Design
+
+**Date**: 2026-03-16
+**Component**: Cluster Infrastructure (`Component-ClusterInfrastructure.md`)
+**Status**: Approved
+
+## Problem
+
+The Cluster Infrastructure doc covered topology and failover behavior but lacked specification for the split-brain resolver strategy, seed node configuration, failure detection timing, and dual-node failure recovery.
+
+## Decisions
+
+### Split-Brain Resolver
+- **Keep-oldest** strategy. The longest-running node stays active on partition; the younger node downs itself.
+- Stable-after duration: 15 seconds — prevents premature downing during startup or transient instability.
+- Quorum-based strategies rejected because they cause total cluster shutdown on any partition in a two-node cluster.
+
+### Seed Node Configuration
+- **Both nodes are seed nodes.** No startup ordering dependency. Whichever node starts first forms the cluster.
+
+### Failure Detection Timing
+- Heartbeat interval: **2 seconds**.
+- Failure threshold: **10 seconds** (5 missed heartbeats).
+- Total failover time: **~25 seconds** (10s detection + 15s stable-after + singleton restart).
+- All values configurable. Defaults balance failover speed with stability.
+
+### Dual-Node Recovery
+- **Automatic recovery**, no manual intervention. First node up forms a new cluster from seed configuration.
+- Site clusters rebuild from SQLite (deployed configs, S&F buffer). Alarm states re-evaluate from live data.
+- Central cluster rebuilds from MS SQL. No message buffer state to recover.
+
+## Affected Documents
+
+| Document | Change |
+|----------|--------|
+| `Component-ClusterInfrastructure.md` | Added 3 new sections: Split-Brain Resolution, Failure Detection Timing, Dual-Node Recovery. Updated Node Configuration to clarify both-as-seed. |
+
+## Alternatives Considered
+
+- **Static-quorum / keep-majority**: Rejected — both cause total cluster shutdown on partition in a two-node cluster. Unacceptable for SCADA availability.
+- **Single designated seed node**: Rejected — creates startup ordering dependency for no benefit in a two-node cluster.
+- **Manual recovery on dual failure**: Rejected — system already persists all state needed for automatic recovery.
+- **Fast detection (1s/5s)**: Rejected — too sensitive; brief network hiccups would trigger unnecessary failovers and full actor hierarchy rebuilds.
+- **Conservative detection (5s/30s)**: Rejected — 30 seconds of data collection downtime is too long for SCADA.
+- **Shorter stable-after (10s)**: Rejected — matching the failure threshold risks downing nodes that are slow to respond (GC pause, heavy load).
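
Reviewer note (not part of the patch): the decisions above map onto Akka.NET HOCON configuration roughly as sketched below. The actor system name (`Scada`), hostnames, and ports are illustrative placeholders, and the document's 10-second "failure detection threshold" is modeled here as `acceptable-heartbeat-pause` on the phi-accrual detector, since that is the Akka setting expressed in seconds.

```hocon
akka {
  actor.provider = cluster

  remote.dot-netty.tcp {
    hostname = "site-node-a"   # this node's own address (illustrative)
    port = 4053
  }

  cluster {
    # Both nodes are seed nodes: each lists itself and its partner,
    # so either node can start first and form the cluster.
    seed-nodes = [
      "akka.tcp://Scada@site-node-a:4053",
      "akka.tcp://Scada@site-node-b:4053"
    ]
    roles = ["site"]

    # Keep-oldest split-brain resolver; membership must be stable for
    # 15 s before the resolver downs unreachable nodes.
    downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-oldest
      stable-after = 15s
    }

    # Phi-accrual failure detector: 2 s heartbeats, ~10 s without a
    # heartbeat before a node is marked unreachable. The unitless phi
    # threshold keeps its library default.
    failure-detector {
      heartbeat-interval = 2s
      acceptable-heartbeat-pause = 10s
    }
  }
}
```

With these values, the worst-case failover is roughly 10 s detection plus 15 s stable-after plus singleton hand-over, matching the ~25 s figure in the patch.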