Verify component designs against Akka.NET best practices documentation
- **Cluster Infrastructure**: add a `min-nr-of-members = 1` requirement for single-node operation after failover. Add a graceful shutdown / CoordinatedShutdown section for fast singleton handover during planned maintenance.
- **Site Runtime**: add explicit supervision strategies per actor type (Resume for coordinators, Stop for short-lived execution actors). Stagger Instance Actor startup to prevent reconnection storms. Add Tell-vs-Ask usage guidance per Akka.NET best practices (Tell on the hot path, Ask at system boundaries only).
- **Data Connection Layer**: add a Connection Actor Model section documenting the Become/Stash pattern for the connection lifecycle state machine.
- **Health Monitoring**: add dead letter count as a monitored metric.
- **Host**: add REQ-HOST-8a for dead letter monitoring (subscribe to the EventStream, log at Warning level, report as a health metric).
@@ -64,6 +64,10 @@ The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:
- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.
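The strategy above can be sketched as a HOCON fragment (key names per the Akka.NET split-brain resolver documentation; surrounding cluster settings such as seed nodes are omitted):

```hocon
akka.cluster {
  # Enable the built-in split-brain resolver
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    keep-oldest {
      # Down the oldest node if it is the only reachable member,
      # instead of letting it continue as an isolated single-node cluster
      down-if-alone = on
    }
  }
}
```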
## Single-Node Operation
`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running. If set to 2, the surviving node waits for a second member before allowing the Cluster Singleton (Site Runtime Deployment Manager) to start — blocking all data collection and script execution indefinitely.
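Sketched as HOCON (only the setting discussed here; the rest of the cluster configuration is omitted):

```hocon
akka.cluster {
  # Allow the surviving node to (re)form the cluster alone after failover,
  # so the Cluster Singleton can start without waiting for a second member.
  min-nr-of-members = 1
}
```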
## Failure Detection Timing
Configurable defaults for heartbeat and failure detection:
@@ -87,6 +91,16 @@ If both nodes in a cluster fail simultaneously (e.g., site power outage):
- **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.
## Graceful Shutdown & Singleton Handover
When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).
Configuration required:
- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
- `akka.cluster.run-coordinated-shutdown-when-down = on`
The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).
## Node Configuration
Each node is configured with: