Verify component designs against Akka.NET best practices documentation

Cluster Infrastructure: add min-nr-of-members=1 requirement for single-node
operation after failover. Add graceful shutdown / CoordinatedShutdown section
for fast singleton handover during planned maintenance.

Site Runtime: add explicit supervision strategies per actor type (Resume for
coordinators, Stop for short-lived execution actors). Stagger Instance Actor
startup to prevent reconnection storms. Add Tell-vs-Ask usage guidance per
Akka.NET best practices (Tell for hot path, Ask for system boundaries only).

Data Connection Layer: add Connection Actor Model section documenting the
Become/Stash pattern for connection lifecycle state machine.

Health Monitoring: add dead letter count as a monitored metric.

Host: add REQ-HOST-8a for dead letter monitoring (subscribe to EventStream,
log at Warning level, report as health metric).
Joseph Doherty
2026-03-16 09:12:36 -04:00
parent de636b908b
commit 409cc62309
5 changed files with 58 additions and 4 deletions

@@ -64,6 +64,10 @@ The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:
- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition": both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership: at most one node runs the cluster singleton at any time.
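As a rough illustration of the resolver settings described above, a HOCON fragment might look like the following. This is a sketch, not the project's actual configuration file: only the strategy name and the `down-if-alone` flag come from the text, and registration of the split-brain resolver as the downing provider is assumed to be configured elsewhere.

```hocon
akka.cluster.split-brain-resolver {
  # Strategy named in the text above.
  active-strategy = keep-oldest

  keep-oldest {
    # The oldest node downs itself when it cannot reach any other member.
    down-if-alone = on
  }
}
```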
## Single-Node Operation
`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running. If the value were 2, a node that starts or restarts while its peer is down would wait for a second member before the Cluster Singleton (Site Runtime Deployment Manager) is allowed to start, blocking all data collection and script execution until the peer returns.
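A minimal HOCON sketch of this requirement, assuming it is placed in the same configuration that the Host loads for each node:

```hocon
akka.cluster {
  # A lone node may form the cluster, so the Cluster Singleton can start
  # even while the peer node is down.
  min-nr-of-members = 1
}
```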
## Failure Detection Timing
Configurable defaults for heartbeat and failure detection:
@@ -87,6 +91,16 @@ If both nodes in a cluster fail simultaneously (e.g., site power outage):
- **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally, because there is no pre-existing cluster to conflict with.
## Graceful Shutdown & Singleton Handover
When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).
Configuration required:
- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
- `akka.cluster.run-coordinated-shutdown-when-down = on`
The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).
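The two settings listed above could be expressed as a single HOCON fragment along these lines. It is a sketch only; where the fragment lives (embedded resource, file, or code) is not specified by this document.

```hocon
akka {
  # Run CoordinatedShutdown when the CLR begins shutting down.
  coordinated-shutdown.run-by-clr-shutdown-hook = on

  # Run CoordinatedShutdown when this node is downed by the cluster.
  cluster.run-coordinated-shutdown-when-down = on
}
```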
## Node Configuration
Each node is configured with:

@@ -67,6 +67,16 @@ Each value update delivered to an Instance Actor includes:
- **Quality**: Data quality indicator (good, bad, uncertain).
- **Timestamp**: When the value was read from the device.
## Connection Actor Model
Each data connection is managed by a dedicated connection actor that uses the Akka.NET **Become/Stash** pattern to model its lifecycle as a state machine:
- **Connecting**: The actor attempts to establish the connection. Subscription requests and write commands received during this phase are **stashed** (buffered in the actor's stash).
- **Connected**: The actor is actively servicing subscriptions. On entering this state, all stashed messages are unstashed and processed.
- **Reconnecting**: The connection was lost. The actor returns to the same stashing behavior used while connecting, buffering new requests while it retries.
This pattern ensures no messages are lost during connection transitions and is the standard Akka.NET approach for actors with I/O lifecycle dependencies.
## Connection Lifecycle & Reconnection
The DCL manages connection lifecycle automatically:

@@ -30,6 +30,7 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
| Dead letter count | Akka.NET EventStream | Messages sent to actors that no longer exist — indicates stale references or timing issues |
## Reporting Protocol

@@ -98,6 +98,10 @@ The Host must configure Serilog as the logging provider with:
- Automatic enrichment of every log entry with `SiteId`, `NodeHostname`, and `NodeRole` properties sourced from `NodeConfiguration`.
- Structured (machine-parseable) output format.
### REQ-HOST-8a: Dead Letter Monitoring
The Host must subscribe to `DeadLetter` events on the Akka.NET `EventStream` and log each dead letter at Warning level. Dead letters are messages sent to actors that no longer exist, a common symptom of failover timing issues, stale actor references, or race conditions during instance lifecycle transitions. The dead letter count is reported as a health metric (see Health Monitoring).
### REQ-HOST-9: Graceful Shutdown
When the Host process receives a stop signal (Windows Service stop, `Ctrl+C`, or SIGTERM), it must trigger Akka.NET CoordinatedShutdown to allow actors to drain in-flight work before the process exits. The Host must not call `Environment.Exit()` or forcibly terminate the actor system without coordinated shutdown.

@@ -54,7 +54,7 @@ Deployment Manager Singleton (Cluster Singleton)
1. Read all deployed configurations from local SQLite.
2. Read all shared scripts from local storage.
3. Compile all scripts (instance scripts, alarm on-trigger scripts, shared scripts).
4. Create Instance Actors for all deployed, **enabled** instances as child actors. Instance Actors are created in **staggered batches** (e.g., 20 at a time with a short delay between batches) to prevent a reconnection storm: 500 Instance Actors all registering data subscriptions simultaneously would overwhelm OPC UA servers and network capacity.
5. Make compiled shared script code available to all Script Actors.
### Deployment Handling
@@ -110,9 +110,20 @@ Deployment Manager Singleton (Cluster Singleton)
- On request from central (via Communication Layer), the Instance Actor provides a **snapshot** of all current attribute values and alarm states.
- Subsequent changes are delivered via the site-wide Akka stream, filtered by instance unique name.
### Supervision Strategy
The Instance Actor supervises all child Script and Alarm Actors with explicit strategies:
| Child Actor | Exception Type | Strategy | Rationale |
|-------------|---------------|----------|-----------|
| Script Actor | Any exception | Resume | Script Actor is a coordinator — its state (trigger timers, last execution time) should survive child failures. Script Execution Actor failures are isolated. |
| Alarm Actor | Any exception | Resume | Alarm Actor holds alarm state. Resume preserves state and continues evaluation on next value update. |
| Script Execution Actor | Unhandled exception | Stop | Short-lived, per-invocation. Failure is logged; the Script Actor coordinator remains active for future triggers. |
| Alarm Execution Actor | Unhandled exception | Stop | Short-lived, per on-trigger invocation. Same as Script Execution Actor. |
The Deployment Manager singleton supervises Instance Actors with a **OneForOneStrategy** — one Instance Actor's failure does not affect other instances.
When the Instance Actor is stopped (due to disable, delete, or redeployment), Akka.NET automatically stops all child actors.
---
@@ -243,6 +254,20 @@ These constraints are enforced by restricting the set of assemblies and namespac
---
## Tell vs. Ask Usage
Per Akka.NET best practices, internal actor communication uses **Tell** (fire-and-forget with reply-to) for the hot path:
- **Tag value updates** (DCL → Instance Actor): Tell. High-frequency, no response needed.
- **Attribute change notifications** (Instance Actor → Script/Alarm Actors): Tell. Fan-out notifications.
- **Stream publishing** (Instance Actor → Akka stream): Tell. Fire-and-forget.
**Ask** is reserved for system boundaries where the caller must wait for a response:
- **`Instance.CallScript()`**: Ask pattern from Script Execution Actor to sibling Script Actor. The caller needs the return value. Acceptable because script calls are infrequent relative to tag updates.
- **`Route.To().Call()`**: Ask from Inbound API to site Instance Actor via Communication Layer. External caller needs a response.
- **Debug view snapshot**: Ask from Communication Layer to Instance Actor for initial state.
## Concurrency & Serialization
- The Instance Actor processes messages **sequentially** (standard Akka actor model). This means `SetAttribute` calls from concurrent Script Execution Actors are serialized at the Instance Actor, preventing race conditions on attribute state.