Refine Cluster Infrastructure: split-brain, seed nodes, failure detection, dual recovery
Add keep-oldest split-brain resolver with 15s stable-after duration. Configure both nodes as seed nodes for symmetric startup. Set moderate failure detection defaults (2s heartbeat, 10s threshold, ~25s total failover). Document automatic dual-node recovery from persistent storage with no manual intervention.
@@ -55,10 +55,41 @@ Both central and site clusters.
- Health reporting resumes from the new active node.
- Alarm states are re-evaluated from incoming values (alarm state is in-memory only).

## Split-Brain Resolution

The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:

- On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
- **Stable-after duration**: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition". In both cases each side sees fewer nodes than quorum and would down itself, resulting in total cluster shutdown. Keep-oldest accepts a brief potential dual-active window during true network partitions, which is safe because site state rebuilds from SQLite and central state is in MS SQL.
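
The decision rule described above can be sketched in a few lines of Python. This is only an illustrative model of the reasoning, not the Akka.NET resolver itself: `Member`, `joined_order`, `decide` and `has_quorum` are hypothetical names introduced here, and the quorum size of 2 is an assumption for a two-node cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Member:
    name: str
    joined_order: int  # hypothetical field: lower value = joined the cluster earlier


def decide(me: Member, members: List[Member]) -> str:
    """Keep-oldest: the longest-standing member stays active, any other downs itself."""
    oldest = min(members, key=lambda m: m.joined_order)
    return "stay-active" if me == oldest else "down-self"


def has_quorum(reachable_nodes: int, quorum_size: int = 2) -> bool:
    """Quorum check as seen from one side (assumed quorum of 2 in a two-node cluster)."""
    return reachable_nodes >= quorum_size


node_a = Member("node-a", joined_order=1)
node_b = Member("node-b", joined_order=2)
cluster = [node_a, node_b]

# Partition: each side evaluates the same rule and reaches complementary decisions.
decision_a = decide(node_a, cluster)
decision_b = decide(node_b, cluster)

# With a quorum rule, a node that can reach only itself fails the check,
# whether its partner crashed or is merely unreachable.
quorum_when_alone = has_quorum(reachable_nodes=1)
```

In this model `decision_a` is `"stay-active"` and `decision_b` is `"down-self"`, while `quorum_when_alone` is `False` in both the crash and the partition scenario, which is the ambiguity the bullet above describes.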

## Failure Detection Timing

Configurable defaults for heartbeat and failure detection:

| Setting | Default | Description |
|---------|---------|-------------|
| Heartbeat interval | 2 seconds | Frequency of health check messages between nodes |
| Failure detection threshold | 10 seconds | Time without heartbeat before a node is considered unreachable |
| Stable-after (split-brain) | 15 seconds | Time cluster must be stable before resolver acts |
| **Total failover time** | **~25 seconds** | Detection (10s) + stable-after (15s), plus singleton restart time |

These values balance failover speed with stability: fast enough that data collection gaps are small, tolerant enough that brief network hiccups don't trigger unnecessary failovers.
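
The arithmetic behind the total can be written out as a small sketch. The class and field names below (`FailoverTiming`, `singleton_restart_s`, and so on) are hypothetical and do not correspond to Akka.NET setting names; the singleton restart time is not specified in this document, so it is left as a parameter that defaults to zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTiming:
    # Defaults taken from the table above; names are illustrative only.
    heartbeat_interval_s: float = 2.0
    failure_detection_threshold_s: float = 10.0
    stable_after_s: float = 15.0

    def missed_heartbeats(self) -> int:
        """Heartbeats that go missing before the node counts as unreachable."""
        return int(self.failure_detection_threshold_s // self.heartbeat_interval_s)

    def failover_estimate_s(self, singleton_restart_s: float = 0.0) -> float:
        """Detection + stable-after, plus however long the singleton takes to restart."""
        return (
            self.failure_detection_threshold_s
            + self.stable_after_s
            + singleton_restart_s
        )


defaults = FailoverTiming()
baseline = defaults.failover_estimate_s()   # detection + stable-after only
missed = defaults.missed_heartbeats()       # 10s threshold / 2s interval
```

With the defaults, the baseline is 25 seconds and the threshold corresponds to five missed heartbeats; any singleton restart time adds on top of that, which is why the table gives the total as approximate.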

## Dual-Node Recovery

If both nodes in a cluster fail simultaneously (e.g., site power outage):

1. **No manual intervention required.** Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
2. **State recovery**:
   - **Site clusters**: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy. Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
   - **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally: there is no pre-existing cluster to conflict with.
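
The recovery sequence above can be laid out as an ordered list of steps. The sketch below models the behavior as this section describes it; it does not reproduce Akka.NET's actual join protocol, and `plan_recovery` with its arguments is a hypothetical helper introduced only for illustration.

```python
from typing import List


def plan_recovery(boot_order: List[str], cluster_kind: str) -> List[str]:
    """Return the recovery steps, in order, after both nodes have gone down.

    boot_order   -- node names in the order they come back up (hypothetical input)
    cluster_kind -- "site" or "central"
    """
    if cluster_kind not in ("site", "central"):
        raise ValueError(f"unknown cluster kind: {cluster_kind!r}")

    steps = []
    for position, node in enumerate(boot_order):
        if position == 0:
            # The first node to start finds no existing cluster and forms one.
            steps.append(f"{node}: forms a new cluster")
        else:
            steps.append(f"{node}: joins the cluster formed by {boot_order[0]}")

    if cluster_kind == "site":
        steps.append("Deployment Manager reads deployed configurations from SQLite "
                     "and re-creates the Instance Actor hierarchy")
        steps.append("store-and-forward buffers are picked up from local storage")
        steps.append("alarm states re-evaluate from incoming data values")
    else:
        steps.append("active node resumes normal operation from MS SQL state")
    return steps


site_steps = plan_recovery(["node-b", "node-a"], "site")
central_steps = plan_recovery(["node-a", "node-b"], "central")
```

Swapping the boot order only changes which node forms the cluster; the state-recovery steps stay the same, which reflects the "no startup ordering dependency" point.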

## Node Configuration

Each node is configured with:

- **Cluster seed nodes**: **Both nodes** are seed nodes. Each node lists both itself and its partner, so either node can start first and form the cluster; the other joins when it starts. There is no startup ordering dependency.
- **Cluster role**: Central or Site (plus site identifier for site clusters).
- **Akka.NET remoting**: Hostname/port for inter-node and inter-cluster communication.
- **Local storage paths**: SQLite database locations (site nodes only).
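
As a rough illustration of how these settings relate to each other, the sketch below checks a node configuration against the rules listed above. Everything here is an assumption for illustration: `NodeConfig`, its field names, the example addresses and the file path are hypothetical and are not the project's actual configuration schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NodeConfig:
    self_address: str                  # this node's remoting hostname/port
    seed_nodes: Tuple[str, ...]        # expected: this node and its partner
    role: str                          # "Central" or "Site"
    site_id: Optional[str] = None      # required for Site nodes
    sqlite_path: Optional[str] = None  # Site nodes only

    def problems(self) -> List[str]:
        """Return a list of rule violations; an empty list means the config is consistent."""
        found = []
        if len(set(self.seed_nodes)) != 2:
            found.append("seed_nodes must list exactly two distinct nodes")
        if self.self_address not in self.seed_nodes:
            found.append("seed_nodes must include this node's own address")
        if self.role not in ("Central", "Site"):
            found.append("role must be Central or Site")
        if self.role == "Site" and not self.site_id:
            found.append("Site nodes need a site identifier")
        if self.role == "Site" and not self.sqlite_path:
            found.append("Site nodes need a local SQLite path")
        if self.role == "Central" and self.sqlite_path:
            found.append("Central nodes do not use local SQLite storage")
        return found


# Hypothetical addresses and path, for illustration only.
seeds = ("site1-a.example:8081", "site1-b.example:8081")
good = NodeConfig("site1-a.example:8081", seeds, "Site",
                  site_id="site1", sqlite_path="/var/lib/app/site.db")
bad = NodeConfig("site1-c.example:8081", seeds, "Site")

good_problems = good.problems()
bad_problems = bad.problems()
```

Here `good_problems` is empty, while `bad_problems` reports the missing self-address in the seed list, the missing site identifier and the missing SQLite path.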