diff --git a/Component-ClusterInfrastructure.md b/Component-ClusterInfrastructure.md
index 9ad6c16..ea22d62 100644
--- a/Component-ClusterInfrastructure.md
+++ b/Component-ClusterInfrastructure.md
@@ -64,6 +64,10 @@ The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:
 - **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
 - **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership — at most one node runs the cluster singleton at any time.
 
+## Single-Node Operation
+
+`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running. If set to 2, the surviving node waits for a second member before allowing the Cluster Singleton (Site Runtime Deployment Manager) to start — blocking all data collection and script execution indefinitely.
+
 ## Failure Detection Timing
 
 Configurable defaults for heartbeat and failure detection:
@@ -87,6 +91,16 @@ If both nodes in a cluster fail simultaneously (e.g., site power outage):
    - **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
 3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.
 
+## Graceful Shutdown & Singleton Handover
+
+When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).
+
+Configuration required:
+- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
+- `akka.cluster.run-coordinated-shutdown-when-down = on`
+
+The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).
+
 ## Node Configuration
 
 Each node is configured with:
diff --git a/Component-DataConnectionLayer.md b/Component-DataConnectionLayer.md
index 85372f3..3aeefea 100644
--- a/Component-DataConnectionLayer.md
+++ b/Component-DataConnectionLayer.md
@@ -67,6 +67,16 @@ Each value update delivered to an Instance Actor includes:
 - **Quality**: Data quality indicator (good, bad, uncertain).
 - **Timestamp**: When the value was read from the device.
 
+## Connection Actor Model
+
+Each data connection is managed by a dedicated connection actor that uses the Akka.NET **Become/Stash** pattern to model its lifecycle as a state machine:
+
+- **Connecting**: The actor attempts to establish the connection. Subscription requests and write commands received during this phase are **stashed** (buffered in the actor's stash).
+- **Connected**: The actor is actively servicing subscriptions. On entering this state, all stashed messages are unstashed and processed.
+- **Reconnecting**: The connection was lost. The actor transitions back to a connecting-like state, stashing new requests while it retries.
+
+This pattern ensures no messages are lost during connection transitions and is the standard Akka.NET approach for actors with I/O lifecycle dependencies.
+
 ## Connection Lifecycle & Reconnection
 
 The DCL manages connection lifecycle automatically:
diff --git a/Component-HealthMonitoring.md b/Component-HealthMonitoring.md
index 5d30841..72cf5d8 100644
--- a/Component-HealthMonitoring.md
+++ b/Component-HealthMonitoring.md
@@ -30,6 +30,7 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
 | Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
 | Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
 | Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
+| Dead letter count | Akka.NET EventStream | Messages sent to actors that no longer exist — indicates stale references or timing issues |
 
 ## Reporting Protocol
 
diff --git a/Component-Host.md b/Component-Host.md
index fab396f..c24d187 100644
--- a/Component-Host.md
+++ b/Component-Host.md
@@ -98,6 +98,10 @@ The Host must configure Serilog as the logging provider with:
 - Automatic enrichment of every log entry with `SiteId`, `NodeHostname`, and `NodeRole` properties sourced from `NodeConfiguration`.
 - Structured (machine-parseable) output format.
 
+### REQ-HOST-8a: Dead Letter Monitoring
+
+The Host must subscribe to the Akka.NET `DeadLetter` event stream and log dead letters at Warning level. Dead letters indicate messages sent to actors that no longer exist — a common symptom of failover timing issues, stale actor references, or race conditions during instance lifecycle transitions. The dead letter count is reported as a health metric (see Health Monitoring).
+
 ### REQ-HOST-9: Graceful Shutdown
 
 When the Host process receives a stop signal (Windows Service stop, `Ctrl+C`, or SIGTERM), it must trigger Akka.NET CoordinatedShutdown to allow actors to drain in-flight work before the process exits. The Host must not call `Environment.Exit()` or forcibly terminate the actor system without coordinated shutdown.
diff --git a/Component-SiteRuntime.md b/Component-SiteRuntime.md
index c4c1603..afb329d 100644
--- a/Component-SiteRuntime.md
+++ b/Component-SiteRuntime.md
@@ -54,7 +54,7 @@ Deployment Manager Singleton (Cluster Singleton)
 1. Read all deployed configurations from local SQLite.
 2. Read all shared scripts from local storage.
 3. Compile all scripts (instance scripts, alarm on-trigger scripts, shared scripts).
-4. Create Instance Actors for all deployed, **enabled** instances as child actors.
+4. Create Instance Actors for all deployed, **enabled** instances as child actors. Instance Actors are created in **staggered batches** (e.g., 20 at a time with a short delay between batches) to prevent a reconnection storm — 500 Instance Actors all registering data subscriptions simultaneously would overwhelm OPC UA servers and network capacity.
 5. Make compiled shared script code available to all Script Actors.
 
 ### Deployment Handling
@@ -110,9 +110,20 @@ Deployment Manager Singleton (Cluster Singleton)
 - On request from central (via Communication Layer), the Instance Actor provides a **snapshot** of all current attribute values and alarm states.
 - Subsequent changes are delivered via the site-wide Akka stream, filtered by instance unique name.
 
-### Supervision
-- The Instance Actor supervises all child Script and Alarm Actors.
-- When the Instance Actor is stopped (due to disable, delete, or redeployment), Akka.NET automatically stops all child actors.
+### Supervision Strategy
+
+The Instance Actor supervises all child Script and Alarm Actors with explicit strategies:
+
+| Child Actor | Exception Type | Strategy | Rationale |
+|-------------|---------------|----------|-----------|
+| Script Actor | Any exception | Resume | Script Actor is a coordinator — its state (trigger timers, last execution time) should survive child failures. Script Execution Actor failures are isolated. |
+| Alarm Actor | Any exception | Resume | Alarm Actor holds alarm state. Resume preserves state and continues evaluation on next value update. |
+| Script Execution Actor | Unhandled exception | Stop | Short-lived, per-invocation. Failure is logged; the Script Actor coordinator remains active for future triggers. |
+| Alarm Execution Actor | Unhandled exception | Stop | Short-lived, per on-trigger invocation. Same as Script Execution Actor. |
+
+The Deployment Manager singleton supervises Instance Actors with a **OneForOneStrategy** — one Instance Actor's failure does not affect other instances.
+
+When the Instance Actor is stopped (due to disable, delete, or redeployment), Akka.NET automatically stops all child actors.
 
 ---
 
@@ -243,6 +254,20 @@ These constraints are enforced by restricting the set of assemblies and namespac
 
 ---
 
+## Tell vs. Ask Usage
+
+Per Akka.NET best practices, internal actor communication uses **Tell** (fire-and-forget with reply-to) for the hot path:
+
+- **Tag value updates** (DCL → Instance Actor): Tell. High-frequency, no response needed.
+- **Attribute change notifications** (Instance Actor → Script/Alarm Actors): Tell. Fan-out notifications.
+- **Stream publishing** (Instance Actor → Akka stream): Tell. Fire-and-forget.
+
+**Ask** is reserved for system boundaries where a synchronous response is needed:
+
+- **`Instance.CallScript()`**: Ask pattern from Script Execution Actor to sibling Script Actor. The caller needs the return value. Acceptable because script calls are infrequent relative to tag updates.
+- **`Route.To().Call()`**: Ask from Inbound API to site Instance Actor via Communication Layer. External caller needs a response.
+- **Debug view snapshot**: Ask from Communication Layer to Instance Actor for initial state.
+
 ## Concurrency & Serialization
 
 - The Instance Actor processes messages **sequentially** (standard Akka actor model). This means `SetAttribute` calls from concurrent Script Execution Actors are serialized at the Instance Actor, preventing race conditions on attribute state.