Verify component designs against Akka.NET best practices documentation

Cluster Infrastructure:
- Add min-nr-of-members=1 requirement for single-node operation after failover.
- Add graceful shutdown / CoordinatedShutdown section for fast singleton handover during planned maintenance.

Site Runtime:
- Add explicit supervision strategies per actor type (Resume for coordinators, Stop for short-lived execution actors).
- Stagger Instance Actor startup to prevent reconnection storms.
- Add Tell-vs-Ask usage guidance per Akka.NET best practices (Tell for hot path, Ask for system boundaries only).

Data Connection Layer:
- Add Connection Actor Model section documenting the Become/Stash pattern for the connection lifecycle state machine.

Health Monitoring:
- Add dead letter count as a monitored metric.

Host:
- Add REQ-HOST-8a for dead letter monitoring (subscribe to EventStream, log at Warning level, report as health metric).
2026-03-16 09:12:36 -04:00
parent de636b908b
commit 409cc62309
5 changed files with 58 additions and 4 deletions
--- a/Component-ClusterInfrastructure.md
+++ b/Component-ClusterInfrastructure.md
@@ -64,6 +64,10 @@ The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:
 - **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
 - **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.
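+
+A minimal HOCON sketch of the resolver settings described above, assuming the standard `akka.cluster.split-brain-resolver` section. How the resolver itself is enabled depends on the Akka.NET version and is not shown:
+
+```hocon
+akka.cluster.split-brain-resolver {
+  # Keep the side of the partition that contains the oldest member.
+  active-strategy = keep-oldest
+  keep-oldest {
+    # The oldest node downs itself when it cannot reach any other member.
+    down-if-alone = on
+  }
+}
+```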

+## Single-Node Operation
+
+`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running. If the value were 2, a node that starts alone (for example, the survivor restarting while its peer is still down) would not be moved to `Up` until a second member joined, so the Cluster Singleton (Site Runtime Deployment Manager) would not start, blocking all data collection and script execution in the meantime.
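+
+As a sketch, the setting above could appear in a node's HOCON configuration as follows (the surrounding file layout is an assumption):
+
+```hocon
+akka.cluster {
+  # Allow a lone node to reach Up so the Cluster Singleton can start.
+  min-nr-of-members = 1
+}
+```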
+
 ## Failure Detection Timing

 Configurable defaults for heartbeat and failure detection:
@@ -87,6 +91,16 @@ If both nodes in a cluster fail simultaneously (e.g., site power outage):
   - **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
 3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.

+## Graceful Shutdown & Singleton Handover
+
+When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).
+
+Configuration required:
+- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
+- `akka.cluster.run-coordinated-shutdown-when-down = on`
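+
+A sketch of the two settings in HOCON form (where this fragment lives in the Host configuration is an assumption):
+
+```hocon
+akka {
+  # Run CoordinatedShutdown when the CLR process is asked to exit.
+  coordinated-shutdown.run-by-clr-shutdown-hook = on
+  # Also run CoordinatedShutdown if this node is downed by the cluster.
+  cluster.run-coordinated-shutdown-when-down = on
+}
+```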
+
+The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).
+
 ## Node Configuration

 Each node is configured with:
--- a/Component-DataConnectionLayer.md
+++ b/Component-DataConnectionLayer.md
@@ -67,6 +67,16 @@ Each value update delivered to an Instance Actor includes:
 - **Quality**: Data quality indicator (good, bad, uncertain).
 - **Timestamp**: When the value was read from the device.

+## Connection Actor Model
+
+Each data connection is managed by a dedicated connection actor that uses the Akka.NET **Become/Stash** pattern to model its lifecycle as a state machine:
+
+- **Connecting**: The actor attempts to establish the connection. Subscription requests and write commands received during this phase are **stashed** (buffered in the actor's stash).
+- **Connected**: The actor is actively servicing subscriptions. On entering this state, all stashed messages are unstashed and processed.
+- **Reconnecting**: The connection was lost. The actor behaves as it does in Connecting: it retries the connection and stashes new requests until the connection is restored.
+
+This pattern defers requests that arrive during connection transitions instead of dropping them, and is the standard Akka.NET approach for actors with I/O lifecycle dependencies.
+
 ## Connection Lifecycle & Reconnection

 The DCL manages connection lifecycle automatically:
--- a/Component-HealthMonitoring.md
+++ b/Component-HealthMonitoring.md
@@ -30,6 +30,7 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
 | Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
 | Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
 | Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
+| Dead letter count | Akka.NET EventStream | Messages sent to actors that no longer exist — indicates stale references or timing issues |

 ## Reporting Protocol

--- a/Component-Host.md
+++ b/Component-Host.md
@@ -98,6 +98,10 @@ The Host must configure Serilog as the logging provider with:
 - Automatic enrichment of every log entry with `SiteId`, `NodeHostname`, and `NodeRole` properties sourced from `NodeConfiguration`.
 - Structured (machine-parseable) output format.

+### REQ-HOST-8a: Dead Letter Monitoring
+
+The Host must subscribe to `DeadLetter` events on the Akka.NET EventStream and log each dead letter at Warning level. Dead letters are messages sent to actors that no longer exist, a common symptom of failover timing issues, stale actor references, or race conditions during instance lifecycle transitions. The dead letter count is reported as a health metric (see Health Monitoring).
+
 ### REQ-HOST-9: Graceful Shutdown

 When the Host process receives a stop signal (Windows Service stop, `Ctrl+C`, or SIGTERM), it must trigger Akka.NET CoordinatedShutdown to allow actors to drain in-flight work before the process exits. The Host must not call `Environment.Exit()` or forcibly terminate the actor system without coordinated shutdown.
--- a/Component-SiteRuntime.md
+++ b/Component-SiteRuntime.md
@@ -54,7 +54,7 @@ Deployment Manager Singleton (Cluster Singleton)
 1. Read all deployed configurations from local SQLite.
 2. Read all shared scripts from local storage.
 3. Compile all scripts (instance scripts, alarm on-trigger scripts, shared scripts).
-4. Create Instance Actors for all deployed, **enabled** instances as child actors.
+4. Create Instance Actors for all deployed, **enabled** instances as child actors. Instance Actors are created in **staggered batches** (e.g., 20 at a time with a short delay between batches) to prevent a reconnection storm — 500 Instance Actors all registering data subscriptions simultaneously would overwhelm OPC UA servers and network capacity.
 5. Make compiled shared script code available to all Script Actors.

 ### Deployment Handling
@@ -110,9 +110,20 @@ Deployment Manager Singleton (Cluster Singleton)
 - On request from central (via Communication Layer), the Instance Actor provides a **snapshot** of all current attribute values and alarm states.
 - Subsequent changes are delivered via the site-wide Akka stream, filtered by instance unique name.

-### Supervision
- The Instance Actor supervises all child Script and Alarm Actors.
- When the Instance Actor is stopped (due to disable, delete, or redeployment), Akka.NET automatically stops all child actors.
+### Supervision Strategy
+
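+Supervision strategies are explicit at each level. The Instance Actor supervises its child Script and Alarm Actors, and the table also covers the short-lived execution actors that those coordinators start: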
+
+| Child Actor | Exception Type | Strategy | Rationale |
+|-------------|---------------|----------|-----------|
+| Script Actor | Any exception | Resume | Script Actor is a coordinator — its state (trigger timers, last execution time) should survive child failures. Script Execution Actor failures are isolated. |
+| Alarm Actor | Any exception | Resume | Alarm Actor holds alarm state. Resume preserves state and continues evaluation on next value update. |
+| Script Execution Actor | Unhandled exception | Stop | Short-lived, per-invocation. Failure is logged; the Script Actor coordinator remains active for future triggers. |
+| Alarm Execution Actor | Unhandled exception | Stop | Short-lived, per on-trigger invocation. Same as Script Execution Actor. |
+
+The Deployment Manager singleton supervises Instance Actors with a **OneForOneStrategy** — one Instance Actor's failure does not affect other instances.
+
+When the Instance Actor is stopped (due to disable, delete, or redeployment), Akka.NET automatically stops all child actors.

 ---

@@ -243,6 +254,20 @@ These constraints are enforced by restricting the set of assemblies and namespac

 ---

+## Tell vs. Ask Usage
+
+Per Akka.NET best practices, internal actor communication uses **Tell** (asynchronous and non-blocking; any reply arrives later as an ordinary message to the sender) for the hot path:
+
+- **Tag value updates** (DCL → Instance Actor): Tell. High-frequency, no response needed.
+- **Attribute change notifications** (Instance Actor → Script/Alarm Actors): Tell. Fan-out notifications.
+- **Stream publishing** (Instance Actor → Akka stream): Tell. Fire-and-forget.
+
+**Ask** is reserved for system boundaries where the caller must wait for a response:
+
+- **`Instance.CallScript()`**: Ask pattern from Script Execution Actor to sibling Script Actor. The caller needs the return value. Acceptable because script calls are infrequent relative to tag updates.
+- **`Route.To().Call()`**: Ask from Inbound API to site Instance Actor via Communication Layer. External caller needs a response.
+- **Debug view snapshot**: Ask from Communication Layer to Instance Actor for initial state.
+
 ## Concurrency & Serialization

 - The Instance Actor processes messages **sequentially** (standard Akka actor model). This means `SetAttribute` calls from concurrent Script Execution Actors are serialized at the Instance Actor, preventing race conditions on attribute state.