# Cluster Infrastructure Refinement — Design

**Date**: 2026-03-16
**Component**: Cluster Infrastructure (`docs/requirements/Component-ClusterInfrastructure.md`)
**Status**: Approved

## Problem

The Cluster Infrastructure doc covered topology and failover behavior but did not specify the split-brain resolver strategy, seed node configuration, failure detection timing, or dual-node failure recovery.

## Decisions

### Split-Brain Resolver
- **Keep-oldest** strategy. The longest-running node stays active on partition; the younger node downs itself.
- Stable-after duration: 15 seconds. This prevents premature downing during startup or transient instability.
- Quorum-based strategies were rejected because they cause total cluster shutdown on any partition in a two-node cluster.
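
The difference between keep-oldest and a quorum rule in a two-node cluster can be illustrated with a small model. This is only a sketch under assumptions: the `Node` type, its `started_at` field, and the functions `keep_oldest` and `static_quorum` are hypothetical names introduced here, and the logic is a simplification of the idea, not the resolver implementation used by the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    started_at: float  # hypothetical field: start time in seconds, lower means older


def keep_oldest(my_side: list[Node], all_nodes: list[Node]) -> bool:
    """Return True if this side of the partition should stay up.

    The side that contains the oldest (longest-running) node survives.
    """
    oldest = min(all_nodes, key=lambda n: n.started_at)
    return oldest in my_side


def static_quorum(my_side: list[Node], quorum_size: int) -> bool:
    """Return True if this side still has at least quorum_size members."""
    return len(my_side) >= quorum_size


node_a = Node("node-a", started_at=1000.0)  # started first, so it is the oldest
node_b = Node("node-b", started_at=2000.0)
cluster = [node_a, node_b]

# A partition in a two-node cluster leaves each node alone on its own side.
for side in ([node_a], [node_b]):
    print(
        side[0].name,
        "keep-oldest survives:", keep_oldest(side, cluster),
        "quorum-of-2 survives:", static_quorum(side, quorum_size=2),
    )
```

In this model exactly one side survives under keep-oldest, while neither side satisfies a quorum of two, which matches the reason given above for rejecting quorum-based strategies.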

### Seed Node Configuration
- **Both nodes are seed nodes.** No startup ordering dependency. Whichever node starts first forms the cluster.
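
The "no startup ordering dependency" property can be sketched as a toy decision function. Everything here is an assumption for illustration: the node names, the `SEED_NODES` list, and `startup_action` are hypothetical, and the model ignores the real join handshake, timeouts, and retries of the cluster library.

```python
SEED_NODES = ["node-a", "node-b"]  # hypothetical names; both nodes are seeds


def startup_action(self_name: str, running: set[str]) -> str:
    """Decide what a starting node does in this simplified model.

    running is the set of nodes that are already up when this node starts.
    """
    other_seeds = [s for s in SEED_NODES if s != self_name]
    reachable = [s for s in other_seeds if s in running]
    if reachable:
        # Another seed is already up, so join the cluster it formed.
        return f"join cluster via {reachable[0]}"
    # No other seed is up, so this node forms the cluster itself.
    return "form new cluster"


# Either start order works: the first node forms, the second joins.
print(startup_action("node-a", running=set()))
print(startup_action("node-b", running={"node-a"}))
print(startup_action("node-b", running=set()))
print(startup_action("node-a", running={"node-b"}))
```

Because the rule is symmetric in the two node names, neither node needs to be started before the other.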

### Failure Detection Timing
- Heartbeat interval: **2 seconds**.
- Failure threshold: **10 seconds** (5 missed heartbeats).
- Total failover time: **~25 seconds** (10s detection + 15s stable-after + singleton restart).
- All values configurable. Defaults balance failover speed with stability.
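
The arithmetic behind these defaults can be written out as a short sketch. The `FailoverTiming` class and its field names are hypothetical and exist only to make the calculation explicit; the singleton restart time is not specified in this document, so it is left as a parameter that defaults to zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTiming:
    heartbeat_interval_s: float = 2.0  # default heartbeat interval
    missed_heartbeats: int = 5         # heartbeats missed before a node is suspected
    stable_after_s: float = 15.0       # split-brain resolver stable-after duration

    @property
    def detection_s(self) -> float:
        """Failure threshold: time until a silent node is considered failed."""
        return self.heartbeat_interval_s * self.missed_heartbeats

    def total_s(self, singleton_restart_s: float = 0.0) -> float:
        """Detection plus stable-after plus an assumed singleton restart time."""
        return self.detection_s + self.stable_after_s + singleton_restart_s


defaults = FailoverTiming()
print(defaults.detection_s)  # 10.0
print(defaults.total_s())    # 25.0
```

Passing a non-zero `singleton_restart_s` shows how the approximate 25-second figure grows with whatever restart time is measured in practice.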

### Dual-Node Recovery
- **Automatic recovery**, no manual intervention. First node up forms a new cluster from seed configuration.
- Site clusters rebuild from SQLite (deployed configs, S&F buffer). Alarm states re-evaluate from live data.
- Central cluster rebuilds from MS SQL. No message buffer state to recover.

## Affected Documents

| Document | Change |
|----------|--------|
| `docs/requirements/Component-ClusterInfrastructure.md` | Added 3 new sections: Split-Brain Resolution, Failure Detection Timing, Dual-Node Recovery. Updated Node Configuration to clarify both-as-seed. |

## Alternatives Considered

- **Static-quorum / keep-majority**: Rejected. Both cause total cluster shutdown on partition in a two-node cluster, which is unacceptable for SCADA availability.
- **Single designated seed node**: Rejected. It creates a startup ordering dependency for no benefit in a two-node cluster.
- **Manual recovery on dual failure**: Rejected. The system already persists all state needed for automatic recovery.
- **Fast detection (1s/5s)**: Rejected as too sensitive; brief network hiccups would trigger unnecessary failovers and full actor hierarchy rebuilds.
- **Conservative detection (5s/30s)**: Rejected. 30 seconds of data collection downtime is too long for SCADA.
- **Shorter stable-after (10s)**: Rejected. Matching the failure threshold risks downing nodes that are slow to respond (GC pause, heavy load).
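
The timing alternatives above can be compared side by side with a short calculation. This sketch rests on two assumptions that the list does not state outright: the pairs such as "1s/5s" are read as heartbeat interval and failure threshold, and the fast and conservative options are assumed to keep the 15-second stable-after. The `summarize` helper and the option labels are hypothetical.

```python
# Each option: (heartbeat interval s, failure threshold s, stable-after s)
options = {
    "chosen (2s/10s, stable-after 15s)": (2, 10, 15),
    "fast (1s/5s)": (1, 5, 15),
    "conservative (5s/30s)": (5, 30, 15),
    "shorter stable-after (10s)": (2, 10, 10),
}


def summarize(heartbeat_s: int, threshold_s: int, stable_after_s: int) -> dict:
    """Derive the figures used to compare the alternatives."""
    return {
        # Heartbeats that must be missed before the node is declared failed.
        "missed_heartbeats": threshold_s // heartbeat_s,
        # Detection plus stable-after, ignoring singleton restart time.
        "failover_s": threshold_s + stable_after_s,
        # How much longer the resolver waits than the failure detector.
        "stable_margin_s": stable_after_s - threshold_s,
    }


for label, params in options.items():
    print(label, summarize(*params))
```

A `stable_margin_s` of zero for the shorter stable-after option reflects the concern recorded above: the resolver would act as soon as the failure threshold is reached, leaving no slack for a node that is merely slow.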