Verify component designs against Akka.NET best practices documentation

Cluster Infrastructure: add min-nr-of-members=1 requirement for single-node
operation after failover. Add graceful shutdown / CoordinatedShutdown section
for fast singleton handover during planned maintenance.

Site Runtime: add explicit supervision strategies per actor type (Resume for
coordinators, Stop for short-lived execution actors). Stagger Instance Actor
startup to prevent reconnection storms. Add Tell-vs-Ask usage guidance per
Akka.NET best practices (Tell for hot path, Ask for system boundaries only).

Data Connection Layer: add Connection Actor Model section documenting the
Become/Stash pattern for connection lifecycle state machine.

Health Monitoring: add dead letter count as a monitored metric.

Host: add REQ-HOST-8a for dead letter monitoring (subscribe to EventStream,
log at Warning level, report as health metric).
Joseph Doherty
2026-03-16 09:12:36 -04:00
parent de636b908b
commit 409cc62309
5 changed files with 58 additions and 4 deletions

View File

@@ -64,6 +64,10 @@ The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:
- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.
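
As a hedged illustration, the resolver settings above might look like the following HOCON. This is a sketch, not the project's actual configuration: the `downing-provider-class` line assumes the built-in Akka.NET split brain resolver is what enables the strategy, and no timing values are shown because this section does not state them.

```hocon
akka.cluster {
  # Assumption: the built-in Akka.NET split brain resolver is the downing provider.
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"

  split-brain-resolver {
    active-strategy = keep-oldest

    keep-oldest {
      # The oldest node downs itself if it cannot reach any other member.
      down-if-alone = on
    }
  }
}
```
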
## Single-Node Operation
`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running, and it may need to start or restart while its peer is still down. If the value is 2, a node that forms the cluster alone is not moved to `Up` until a second member joins, so the Cluster Singleton (Site Runtime Deployment Manager) never starts and all data collection and script execution are blocked indefinitely.
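
A minimal sketch of the setting in HOCON, assuming it lives in the same `akka.cluster` block as the rest of the cluster configuration:

```hocon
akka.cluster {
  # A lone node must be able to run the cluster, and therefore the singleton, by itself.
  min-nr-of-members = 1
}
```
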
## Failure Detection Timing
Configurable defaults for heartbeat and failure detection:
@@ -87,6 +91,16 @@ If both nodes in a cluster fail simultaneously (e.g., site power outage):
- **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.
## Graceful Shutdown & Singleton Handover
When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).
Configuration required:
- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
- `akka.cluster.run-coordinated-shutdown-when-down = on`
The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).
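
The two required settings could be expressed in nested HOCON roughly as follows. This is a sketch only: the `hand-over-retry-interval` entry is included to name the setting that bounds handover time, and its value of `1s` is an illustrative assumption rather than a figure from this document.

```hocon
akka {
  coordinated-shutdown {
    # Run CoordinatedShutdown when the CLR process is asked to exit.
    run-by-clr-shutdown-hook = on
  }

  cluster {
    # Run CoordinatedShutdown if this node is downed by the cluster.
    run-coordinated-shutdown-when-down = on

    singleton {
      # Assumed illustrative value; tune for the deployment.
      hand-over-retry-interval = 1s
    }
  }
}
```
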
## Node Configuration
Each node is configured with:

View File

@@ -67,6 +67,16 @@ Each value update delivered to an Instance Actor includes:
- **Quality**: Data quality indicator (good, bad, uncertain).
- **Timestamp**: When the value was read from the device.
## Connection Actor Model
Each data connection is managed by a dedicated connection actor that uses the Akka.NET **Become/Stash** pattern to model its lifecycle as a state machine:
- **Connecting**: The actor attempts to establish the connection. Subscription requests and write commands received during this phase are **stashed** (buffered in the actor's stash).
- **Connected**: The actor is actively servicing subscriptions. On entering this state, all stashed messages are unstashed and processed.
- **Reconnecting**: The connection was lost. The actor behaves as it does in Connecting: it stashes new requests while it retries the connection.
This pattern ensures no messages are lost during connection transitions and is the standard Akka.NET approach for actors with I/O lifecycle dependencies.
## Connection Lifecycle & Reconnection
The DCL manages connection lifecycle automatically:

View File

@@ -30,6 +30,7 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
| Dead letter count | Akka.NET EventStream | Messages sent to actors that no longer exist — indicates stale references or timing issues |
## Reporting Protocol

View File

@@ -98,6 +98,10 @@ The Host must configure Serilog as the logging provider with:
- Automatic enrichment of every log entry with `SiteId`, `NodeHostname`, and `NodeRole` properties sourced from `NodeConfiguration`.
- Structured (machine-parseable) output format.
### REQ-HOST-8a: Dead Letter Monitoring
The Host must subscribe to `DeadLetter` events on the Akka.NET `EventStream` and log each one at Warning level. Dead letters are messages sent to actors that no longer exist, a common symptom of failover timing issues, stale actor references, or race conditions during instance lifecycle transitions. The dead letter count is reported as a health metric (see Health Monitoring).
### REQ-HOST-9: Graceful Shutdown
When the Host process receives a stop signal (Windows Service stop, `Ctrl+C`, or SIGTERM), it must trigger Akka.NET CoordinatedShutdown to allow actors to drain in-flight work before the process exits. The Host must not call `Environment.Exit()` or forcibly terminate the actor system without coordinated shutdown.

View File

@@ -54,7 +54,7 @@ Deployment Manager Singleton (Cluster Singleton)
1. Read all deployed configurations from local SQLite.
2. Read all shared scripts from local storage.
3. Compile all scripts (instance scripts, alarm on-trigger scripts, shared scripts).
4. Create Instance Actors for all deployed, **enabled** instances as child actors.
4. Create Instance Actors for all deployed, **enabled** instances as child actors. Instance Actors are created in **staggered batches** (e.g., 20 at a time with a short delay between batches) to prevent a reconnection storm — 500 Instance Actors all registering data subscriptions simultaneously would overwhelm OPC UA servers and network capacity.
5. Make compiled shared script code available to all Script Actors.
### Deployment Handling
@@ -110,9 +110,20 @@ Deployment Manager Singleton (Cluster Singleton)
- On request from central (via Communication Layer), the Instance Actor provides a **snapshot** of all current attribute values and alarm states.
- Subsequent changes are delivered via the site-wide Akka stream, filtered by instance unique name.
### Supervision
- The Instance Actor supervises all child Script and Alarm Actors.
- When the Instance Actor is stopped (due to disable, delete, or redeployment), Akka.NET automatically stops all child actors.
### Supervision Strategy
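Supervision strategies are explicit at each level: the Instance Actor supervises its child Script and Alarm Actors, and each Script or Alarm Actor supervises its own short-lived Execution Actors: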
| Child Actor | Exception Type | Strategy | Rationale |
|-------------|---------------|----------|-----------|
| Script Actor | Any exception | Resume | Script Actor is a coordinator — its state (trigger timers, last execution time) should survive child failures. Script Execution Actor failures are isolated. |
| Alarm Actor | Any exception | Resume | Alarm Actor holds alarm state. Resume preserves state and continues evaluation on next value update. |
| Script Execution Actor | Unhandled exception | Stop | Short-lived, per-invocation. Failure is logged; the Script Actor coordinator remains active for future triggers. |
| Alarm Execution Actor | Unhandled exception | Stop | Short-lived, per on-trigger invocation. Same as Script Execution Actor. |
The Deployment Manager singleton supervises Instance Actors with a **OneForOneStrategy** — one Instance Actor's failure does not affect other instances.
When the Instance Actor is stopped (due to disable, delete, or redeployment), Akka.NET automatically stops all child actors.
---
@@ -243,6 +254,20 @@ These constraints are enforced by restricting the set of assemblies and namespac
---
## Tell vs. Ask Usage
Per Akka.NET best practices, internal actor communication uses **Tell** (fire-and-forget; any reply is delivered later as a separate message to the sender) for the hot path:
- **Tag value updates** (DCL → Instance Actor): Tell. High-frequency, no response needed.
- **Attribute change notifications** (Instance Actor → Script/Alarm Actors): Tell. Fan-out notifications.
- **Stream publishing** (Instance Actor → Akka stream): Tell. Fire-and-forget.
**Ask** is reserved for request/response interactions, mainly at system boundaries, where the caller must await a reply:
- **`Instance.CallScript()`**: Ask pattern from Script Execution Actor to sibling Script Actor. The caller needs the return value. Acceptable because script calls are infrequent relative to tag updates.
- **`Route.To().Call()`**: Ask from Inbound API to site Instance Actor via Communication Layer. External caller needs a response.
- **Debug view snapshot**: Ask from Communication Layer to Instance Actor for initial state.
## Concurrency & Serialization
- The Instance Actor processes messages **sequentially** (standard Akka actor model). This means `SetAttribute` calls from concurrent Script Execution Actors are serialized at the Instance Actor, preventing race conditions on attribute state.