Verify component designs against Akka.NET best practices documentation
- **Cluster Infrastructure**: add a `min-nr-of-members = 1` requirement for single-node operation after failover. Add a graceful shutdown / CoordinatedShutdown section for fast singleton handover during planned maintenance.
- **Site Runtime**: add explicit supervision strategies per actor type (Resume for coordinators, Stop for short-lived execution actors). Stagger Instance Actor startup to prevent reconnection storms. Add Tell-vs-Ask usage guidance per Akka.NET best practices (Tell on the hot path, Ask at system boundaries only).
- **Data Connection Layer**: add a Connection Actor Model section documenting the Become/Stash pattern for the connection lifecycle state machine.
- **Health Monitoring**: add dead letter count as a monitored metric.
- **Host**: add REQ-HOST-8a for dead letter monitoring (subscribe to the EventStream, log at Warning level, report as a health metric).
@@ -64,6 +64,10 @@ The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:
- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.
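The strategy above can be sketched as a HOCON fragment (key names per the Akka.NET split-brain resolver documentation; surrounding cluster settings such as seed nodes are omitted):

```hocon
akka.cluster {
  # Enable the built-in split-brain resolver
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    keep-oldest {
      # Down the oldest node if it is the only reachable member,
      # instead of letting it continue as an isolated single-node cluster
      down-if-alone = on
    }
  }
}
```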
## Single-Node Operation
`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running. If set to 2, the surviving node waits for a second member before allowing the Cluster Singleton (Site Runtime Deployment Manager) to start — blocking all data collection and script execution indefinitely.
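Sketched as HOCON (only the setting discussed here; the rest of the cluster configuration is omitted):

```hocon
akka.cluster {
  # Allow the surviving node to (re)form the cluster alone after failover,
  # so the Cluster Singleton can start without waiting for a second member.
  min-nr-of-members = 1
}
```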
## Failure Detection Timing
Configurable defaults for heartbeat and failure detection:
@@ -87,6 +91,16 @@ If both nodes in a cluster fail simultaneously (e.g., site power outage):
- **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.
## Graceful Shutdown & Singleton Handover
When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).
Configuration required:
- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
- `akka.cluster.run-coordinated-shutdown-when-down = on`
The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).
## Node Configuration
Each node is configured with: