docs(components): accuracy fixes from deep review (batch 2)

TemplateEngine (alarm-script-ref ordering, native-alarm-sources not in
revision hash, composition cycle checks, 9-step pipeline), SiteRuntime
(alarm on-trigger scripts run with a restricted context; PreStart seeds
children from defaults before overrides arrive), DataConnectionLayer
(UnsubscribeAlarmsRequest stashed in Connecting), StoreAndForward (InFlight/
Delivered are dead enum values; notifications can park at 50 retries),
ExternalSystemGateway (CachedWrite returns void + enqueues directly; log levels).
This commit is contained in:
Joseph Doherty
2026-06-03 16:34:37 -04:00
parent c5fb02d640
commit 25bae4e43b
5 changed files with 31 additions and 23 deletions
+1 -1
View File
@@ -27,7 +27,7 @@ The DCL performs no alarm evaluation, no trigger evaluation, and no business log
| State | Behaviour | | State | Behaviour |
|-------|-----------| |-------|-----------|
| `Connecting` | Stashes `SubscribeTagsRequest`, `WriteTagRequest`, `UnsubscribeTagsRequest`, `SubscribeAlarmsRequest`. Non-blocking for `BrowseNodeCommand` and `ReadTagValuesCommand` (immediate failure reply). | | `Connecting` | Stashes `SubscribeTagsRequest`, `WriteTagRequest`, `UnsubscribeTagsRequest`, `SubscribeAlarmsRequest`, `UnsubscribeAlarmsRequest`. Non-blocking for `ReadTagValuesCommand` (immediate synchronous failure reply) and `BrowseNodeCommand` (async failure via `PipeTo` when the adapter is `IBrowsableDataConnection`; immediate synchronous reply otherwise). |
| `Connected` | Processes all message types. On entering, calls `Stash.UnstashAll()`. | | `Connected` | Processes all message types. On entering, calls `Stash.UnstashAll()`. |
| `Reconnecting` | Stashes new subscribe/write requests; allows `UnsubscribeTagsRequest` and `UnsubscribeAlarmsRequest` through for cleanup on instance stop. | | `Reconnecting` | Stashes new subscribe/write requests; allows `UnsubscribeTagsRequest` and `UnsubscribeAlarmsRequest` through for cleanup on instance stop. |
+14 -7
View File
@@ -1,6 +1,6 @@
# External System Gateway # External System Gateway
The External System Gateway gives site scripts two runtime capabilities: invoking HTTP/REST APIs on named external systems, and executing SQL writes against named database connections. Both capabilities expose a dual call mode — synchronous (blocking, result returned) and cached (store-and-forward on transient failure, `TrackedOperationId` returned) — so scripts choose the right delivery guarantee per operation without knowing the underlying retry machinery. The External System Gateway gives site scripts two runtime capabilities: invoking HTTP/REST APIs on named external systems, and executing SQL writes against named database connections. Both capabilities expose a dual call mode — synchronous (blocking, result returned) and cached (store-and-forward on transient failure) — so scripts choose the right delivery guarantee per operation without knowing the underlying retry machinery.
## Overview ## Overview
@@ -31,18 +31,20 @@ Every API call and every database write has two modes:
| Mode | API surface | Failure behaviour | Return value | | Mode | API surface | Failure behaviour | Return value |
|------|-------------|-------------------|--------------| |------|-------------|-------------------|--------------|
| Synchronous | `ExternalSystem.Call()` / `Database.Connection()` | All failures returned to script | Response JSON / `DbConnection` | | Synchronous | `ExternalSystem.Call()` / `Database.Connection()` | All failures returned to script | Response JSON / `DbConnection` |
| Cached | `ExternalSystem.CachedCall()` / `Database.CachedWrite()` | Transient → buffered; permanent → returned | `TrackedOperationId` (on buffer) | | Cached | `ExternalSystem.CachedCall()` / `Database.CachedWrite()` | Transient → buffered; permanent → returned | `ExternalCallResult` (buffered) / `void` |
`CachedCallAsync` and `CachedWriteAsync` attempt immediate delivery first. Only a transient failure routes to the Store-and-Forward Engine. `CachedCallAsync` attempts immediate delivery first; only a transient failure routes to the Store-and-Forward Engine. `CachedWriteAsync` makes no immediate SQL attempt — it resolves the connection definition and enqueues directly.
### Error classification ### Error classification
`ErrorClassifier` is the single authority on what counts as transient: `ErrorClassifier` is the authority on HTTP and exception transience for the synchronous call path:
- **HTTP status codes**: 5xx, 408 (Request Timeout), 429 (Too Many Requests) → transient. All other non-success 4xx → permanent. - **HTTP status codes**: 5xx, 408 (Request Timeout), 429 (Too Many Requests) → transient. All other non-success 4xx → permanent.
- **Exceptions**: `HttpRequestException`, `TaskCanceledException`, `TimeoutException`, `OperationCanceledException` → transient. `JsonException` during payload deserialization → permanent (a malformed payload will not become well-formed on retry, so it is parked rather than retried forever). - **Exceptions**: `HttpRequestException`, `TaskCanceledException`, `TimeoutException`, `OperationCanceledException` → transient.
Transient failures on `CachedCall` / `CachedWrite` are silently buffered (logged at `Debug`). Permanent failures are logged at `Warning` and returned to the calling script regardless of call mode, because a permanently-wrong request should surface immediately. `JsonException` during buffered-delivery payload deserialization is classified as permanent inline inside `DeliverBufferedAsync` (both `ExternalSystemClient` and `DatabaseGateway`), not via `ErrorClassifier` — a malformed payload will not become well-formed on retry, so it is parked immediately.
Transient failures on `CachedCall` / `CachedWrite` are silently buffered (logged at `Debug`). Permanent failures on the synchronous (`InvokeHttpAsync`) path are logged at `Warning` and returned to the calling script. Permanent failures detected during buffered retry delivery (`DeliverBufferedAsync`) are logged at `Error` before parking.
## Architecture ## Architecture
@@ -64,6 +66,11 @@ Error body embedded in script-visible messages is capped at 2 048 characters so
```csharp ```csharp
// ExternalSystemClient.cs // ExternalSystemClient.cs
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
{
// The caller asked to abandon the work — do not reclassify as transient.
throw;
}
catch (OperationCanceledException ex) when (timeoutCts.IsCancellationRequested) catch (OperationCanceledException ex) when (timeoutCts.IsCancellationRequested)
{ {
// Our own timeout elapsed — a transient failure per the design. // Our own timeout elapsed — a transient failure per the design.
@@ -159,7 +166,7 @@ cmd.Parameters.AddWithValue("@name", tagName);
var value = await cmd.ExecuteScalarAsync(); var value = await cmd.ExecuteScalarAsync();
``` ```
**Cached database write** — enqueued immediately; returns a `TrackedOperationId`: **Cached database write** — enqueued immediately; returns nothing (`Task`):
```csharp ```csharp
await Database.CachedWrite("MES_DB", await Database.CachedWrite("MES_DB",
+6 -6
View File
@@ -79,7 +79,7 @@ Central sends a `DeployInstanceCommand` carrying a JSON `FlattenedConfiguration`
1. Calls `EnsureDclConnections` to push any new or changed connection definitions to the DCL manager (hash-guarded: unchanged configs are skipped). 1. Calls `EnsureDclConnections` to push any new or changed connection definitions to the DCL manager (hash-guarded: unchanged configs are skipped).
2. Calls `CreateInstanceActor`, which does `Context.ActorOf(props, instanceName)`. 2. Calls `CreateInstanceActor`, which does `Context.ActorOf(props, instanceName)`.
3. Runs an off-thread `Task` that calls `SiteStorageService.StoreDeployedConfigAsync`, clears static overrides and native alarm state, and tells `SiteReplicationActor` to push to the peer node. 3. Runs an off-thread `Task` that calls `SiteStorageService.StoreDeployedConfigAsync`, clears static overrides and native alarm state, and — if `_replicationActor` is non-null (it is optional and null in isolated deployments/tests) — tells `SiteReplicationActor` to push to the peer node.
4. Pipes back a `DeployPersistenceResult`; only on success does it tell the deployer `DeploymentStatus.Success`. If persistence fails, the optimistically-created actor is stopped and the error is returned to central (`SiteRuntime-005`). 4. Pipes back a `DeployPersistenceResult`; only on success does it tell the deployer `DeploymentStatus.Success`. If persistence fails, the optimistically-created actor is stopped and the error is returned to central (`SiteRuntime-005`).
For redeployment (instance already running), the existing actor is stopped and watched: For redeployment (instance already running), the existing actor is stopped and watched:
@@ -103,8 +103,8 @@ The `Terminated` signal fires once the previous actor and its entire subtree hav
On `PreStart`, `InstanceActor`: On `PreStart`, `InstanceActor`:
1. Pipes `SiteStorageService.GetStaticOverridesAsync` to self as a `LoadOverridesResult`. When the message arrives, persisted overrides are applied on top of the flattened-config defaults. 1. Fires `SiteStorageService.GetStaticOverridesAsync` asynchronously and pipes the result to self as a `LoadOverridesResult` — this is non-blocking; the message arrives later in the mailbox.
2. Calls `CreateChildActors()`, which snapshots `_attributes` (the live dictionary) into `attributeSnapshot` before any child constructor runs. Each child's `Props` closure captures the immutable snapshot, not the live dictionary — preventing the race condition described in `SiteRuntime-017`. 2. Calls `CreateChildActors()` **immediately** (before the override message arrives). `CreateChildActors` snapshots `_attributes` (the live dictionary, seeded from flattened-config defaults) into `attributeSnapshot` before any child constructor runs. Each child's `Props` closure captures the immutable snapshot, not the live dictionary — preventing the race condition described in `SiteRuntime-017`. Because the override load is asynchronous, children are created from the un-overridden defaults; when the `LoadOverridesResult` message is subsequently processed, `HandleOverridesLoaded` applies the persisted overrides on top of the live `_attributes` dictionary.
3. Calls `SubscribeToDcl()`, grouping data-sourced attributes by connection name and sending `SubscribeTagsRequest` to the DCL manager. Tag paths are stored in `_tagPathToAttributes`, a `Dictionary<string, List<string>>`, because one physical tag can back more than one attribute canonical name. 3. Calls `SubscribeToDcl()`, grouping data-sourced attributes by connection name and sending `SubscribeTagsRequest` to the DCL manager. Tag paths are stored in `_tagPathToAttributes`, a `Dictionary<string, List<string>>`, because one physical tag can back more than one attribute canonical name.
Data-sourced attributes start with quality `Uncertain` until the first `TagValueUpdate` arrives; static attributes start with quality `Good`. Data-sourced attributes start with quality `Uncertain` until the first `TagValueUpdate` arrives; static attributes start with quality `Good`.
@@ -130,7 +130,7 @@ var diagnostics = script.Compile();
### Shared script library ### Shared script library
`SharedScriptLibrary` holds a `Dictionary<string, Script<object?>>` under a `lock`. It is populated at startup (off-thread by the singleton, piped back as `SharedScriptsLoaded`) and updated live when artifact deployments arrive carrying new shared scripts. Calling `Scripts.CallShared("name", params)` inside a script calls `SharedScriptLibrary.ExecuteAsync`, which runs the compiled delegate inline on the calling thread — no actor messages, no serialization. `SharedScriptLibrary` holds a `Dictionary<string, Script<object?>>` under a `lock`. It is populated at startup (off-thread by the singleton, piped back as `SharedScriptsLoaded`) and updated live when artifact deployments arrive carrying new shared scripts. Calling `Scripts.CallShared("name", params)` inside a script calls `SharedScriptLibrary.ExecuteAsync`, which runs the compiled delegate inline as compiled code (no actor message, no serialization) and is awaited by the caller.
### Script Actor triggers ### Script Actor triggers
@@ -176,7 +176,7 @@ var killSwitch = _hubSource
`NativeAlarmActor` mirrors the condition state of one source binding — an OPC UA A&C server or MxAccess Gateway connection — without writing back to the source. Each condition is keyed by `SourceReference`. `NativeAlarmActor` mirrors the condition state of one source binding — an OPC UA A&C server or MxAccess Gateway connection — without writing back to the source. Each condition is keyed by `SourceReference`.
On `PreStart` it rehydrates last-known state from SQLite (`native_alarm_state` table) and immediately sends a `SubscribeAlarmsRequest` to the DCL manager. The DCL forwards this to the connection's `IAlarmSubscribableConnection` implementation. On `PreStart` it concurrently kicks off two operations: the SQLite rehydration (`GetNativeAlarmsAsync`, piped back as `RehydrationCompleted`) and a `SubscribeAlarmsRequest` to the DCL manager — the subscribe is sent before the async rehydration completes and is not gated on it. The DCL forwards the subscribe request to the connection's `IAlarmSubscribableConnection` implementation.
Transition handling: Transition handling:
@@ -244,7 +244,7 @@ Scripts run inside `ScriptExecutionActor` with a `ScriptGlobals` object as the R
- `Tracking.Status(id)` — reads the site-local `OperationTrackingStore` synchronously. - `Tracking.Status(id)` — reads the site-local `OperationTrackingStore` synchronously.
- `Notify.To("list").Send(...)` — enqueues a notification in the Store-and-Forward Engine for delivery to central. - `Notify.To("list").Send(...)` — enqueues a notification in the Store-and-Forward Engine for delivery to central.
Alarm on-trigger scripts run in `AlarmExecutionActor` with the same API plus an `Alarm` global (`AlarmContext` carrying `Name`, `Level`, `Priority`, `Message`). Alarm on-trigger scripts run in `AlarmExecutionActor` with a **restricted** context: they receive an `Alarm` global (`AlarmContext` carrying `Name`, `Level`, `Priority`, `Message`) and have access to the instance/shared-script surface (`Instance.*`, `Scripts.CallShared`, `Instance.CallScript`), but **not** the external-system, database, notification, or audit integration APIs. `AlarmExecutionActor` builds its `ScriptRuntimeContext` without a `serviceProvider`, so `ExternalSystem`, `Database`, `Notify`, and audit writes are unavailable to alarm on-trigger scripts — those APIs are only resolved inside `ScriptExecutionActor` (instance scripts).
### Debug view ### Debug view
+3 -1
View File
@@ -90,7 +90,7 @@ Three nullable columns (`execution_id`, `source_script`, `parent_execution_id`)
`StoreAndForwardStorage` opens a fresh `SqliteConnection` per call and relies on the Microsoft.Data.Sqlite connection pool (keyed on the connection string) for acceptable performance on the retry sweep. If a pooled-open ever becomes a bottleneck the remedy is a batched sweep API that opens one connection per sweep. `StoreAndForwardStorage` opens a fresh `SqliteConnection` per call and relies on the Microsoft.Data.Sqlite connection pool (keyed on the connection string) for acceptable performance on the retry sweep. If a pooled-open ever becomes a bottleneck the remedy is a batched sweep API that opens one connection per sweep.
Status values from `StoreAndForwardMessageStatus`: `Pending` (0), `InFlight` (1), `Parked` (2), `Delivered` (3). The retry sweep loads only `Pending` rows whose `last_attempt_at` is older than `retry_interval_ms`. The engine uses two status values from `StoreAndForwardMessageStatus`: `Pending` (0) and `Parked` (2). On successful delivery the row is deleted (`RemoveMessageAsync`) — there is no `Delivered` status written. The enum also declares `InFlight` (1) and `Delivered` (3) but neither is assigned anywhere in the engine; they are dead values. The retry sweep loads only `Pending` rows whose `last_attempt_at` is older than `retry_interval_ms`.
### Retry sweep ### Retry sweep
@@ -137,6 +137,8 @@ The four `ReplicationOperationType` values are `Add`, `Remove`, `Park`, and `Req
`NotificationForwarder` is the delivery handler for `StoreAndForwardCategory.Notification`. It deserializes the buffered `PayloadJson` as a `NotificationSubmit`, re-stamps `SourceSiteId` and `SourceInstanceId` from the forwarder's own context (the site is authoritative for these), and sends the submit to the `SiteCommunicationActor` via Akka's `Ask` with a configurable timeout. A `NotificationSubmitAck` with `Accepted = true` returns `true`; any other ack or a timeout throws `NotificationForwardException`, which the engine treats as transient. A payload that cannot be deserialized is logged at Warning and discarded (returns `true`) rather than parked — a corrupt payload cannot be fixed by retrying. `NotificationForwarder` is the delivery handler for `StoreAndForwardCategory.Notification`. It deserializes the buffered `PayloadJson` as a `NotificationSubmit`, re-stamps `SourceSiteId` and `SourceInstanceId` from the forwarder's own context (the site is authoritative for these), and sends the submit to the `SiteCommunicationActor` via Akka's `Ask` with a configurable timeout. A `NotificationSubmitAck` with `Accepted = true` returns `true`; any other ack or a timeout throws `NotificationForwardException`, which the engine treats as transient. A payload that cannot be deserialized is logged at Warning and discarded (returns `true`) rather than parked — a corrupt payload cannot be fixed by retrying.
Notification messages are subject to the same retry budget as every other category. The notification enqueue call passes no explicit `maxRetries`, so it inherits `StoreAndForwardOptions.DefaultMaxRetries` (50). Under a sustained central outage that exhausts all 50 retry attempts, the buffered notification is parked and surfaces in the parked-message UI exactly like any other parked message. Callers that require unbounded retry must pass `maxRetries: 0` to `EnqueueAsync`.
### Parked message management ### Parked message management
`ParkedMessageHandlerActor` is the Akka bridge between `SiteCommunicationActor` and `StoreAndForwardService`. It handles five message types from central: `ParkedMessageHandlerActor` is the Akka bridge between `SiteCommunicationActor` and `StoreAndForwardService`. It handles five message types from central:
+7 -8
View File
@@ -28,7 +28,7 @@ Both edge types are enforced acyclic on every mutating call. `CycleDetector` pro
- `DetectCompositionCycle` — BFS from the proposed composed template through its own compositions. - `DetectCompositionCycle` — BFS from the proposed composed template through its own compositions.
- `DetectCrossGraphCycle` — BFS across both inheritance and composition edges simultaneously, catching cycles that neither pure check alone would find. - `DetectCrossGraphCycle` — BFS across both inheritance and composition edges simultaneously, catching cycles that neither pure check alone would find.
All three run before any composition is written. Because the graph can contain not-yet-saved templates (Id = 0), `CycleDetector.BuildLookup` uses `TryAdd` rather than `ToDictionary` so duplicate Ids do not throw. `AddCompositionAsync` runs `DetectCompositionCycle` and `DetectCrossGraphCycle` before any composition is written — `DetectInheritanceCycle` does not run on the composition path. Because the graph can contain not-yet-saved templates (Id = 0), `CycleDetector.BuildLookup` uses `TryAdd` rather than `ToDictionary` so duplicate Ids do not throw.
### Derived templates and the composition model ### Derived templates and the composition model
@@ -83,7 +83,7 @@ Because each composition slot has a unique `InstanceName` prefix, members from d
3. Parent templates, walking to the root. 3. Parent templates, walking to the root.
4. Composed module members, path-qualified. 4. Composed module members, path-qualified.
The eight steps in order: The nine steps in order:
1. Validate `LockedInDerived` is not violated across each chain. 1. Validate `LockedInDerived` is not violated across each chain.
2. Resolve attributes from the inheritance chain (base-to-derived; `IsInherited` placeholders never shadow live base values). 2. Resolve attributes from the inheritance chain (base-to-derived; `IsInherited` placeholders never shadow live base values).
@@ -91,14 +91,13 @@ The eight steps in order:
4. Apply `InstanceAttributeOverride` records (locked attributes are silently skipped). 4. Apply `InstanceAttributeOverride` records (locked attributes are silently skipped).
5. Apply `InstanceConnectionBinding` records (data-sourced attributes only). 5. Apply `InstanceConnectionBinding` records (data-sourced attributes only).
6. Resolve alarms; for `HiLo` trigger type, merge setpoints key-by-key so a derived template can override just `hi` while inheriting `loLo`. 6. Resolve alarms; for `HiLo` trigger type, merge setpoints key-by-key so a derived template can override just `hi` while inheriting `loLo`.
7. Resolve scripts; wire `ScriptScope` (self- and parent-path) into each composed script so `Attributes["X"]` resolves to the right path-prefix at runtime. 7. Resolve scripts; wire `ScriptScope` (self- and parent-path) into each composed script so `Attributes["X"]` resolves to the right path-prefix at runtime. Then resolve alarm `OnTriggerScriptId` FKs to canonical script names (`ResolveAlarmScriptReferences`).
8. Resolve native alarm source bindings (`TemplateNativeAlarmSource`), apply `InstanceNativeAlarmSourceOverride`. 8. Resolve native alarm source bindings (`TemplateNativeAlarmSource`), apply `InstanceNativeAlarmSourceOverride`.
9. Collect connection configurations — iterate resolved attributes, and for each attribute that has a bound data connection, populate `FlattenedConfiguration.Connections` with the corresponding `ConnectionConfig` (protocol, primary and backup JSON, failover retry count).
Between steps 6 and 7, `ResolveAlarmScriptReferences` resolves each alarm's `OnTriggerScriptId` FK to the canonical name of the corresponding resolved script. A dangling reference (script Id has no resolved script) produces a null `OnTriggerScriptCanonicalName` and is caught by `SemanticValidator`.
### Revision hash ### Revision hash
`RevisionHashService.ComputeHash` produces a deterministic `sha256:<hex>` string over the canonical JSON serialization of the `FlattenedConfiguration`. Volatile fields (`GeneratedAtUtc`) are excluded. Collections are sorted by `CanonicalName` before hashing. Internal `Hashable*` records declare their properties in alphabetical order because `System.Text.Json` emits them in declaration order — out-of-order additions would silently break determinism. The hash is included in the deployment identity and lets the Deployment Manager detect whether a re-flatten has changed anything before pushing to sites. `RevisionHashService.ComputeHash` produces a deterministic `sha256:<hex>` string over the canonical JSON serialization of the `FlattenedConfiguration`. Volatile fields (`GeneratedAtUtc`) are excluded. Collections are sorted by `CanonicalName` before hashing. Internal `Hashable*` records declare their properties in alphabetical order because `System.Text.Json` emits them in declaration order — out-of-order additions would silently break determinism. The hash covers resolved attributes, alarms, scripts, and connection configurations. **`NativeAlarmSources` is not included in the hash**: changes to native alarm source bindings do not alter the revision hash and therefore do not trigger a re-deploy on their own. The hash is included in the deployment identity and lets the Deployment Manager detect whether a re-flatten has changed anything before pushing to sites.
### Diff ### Diff
@@ -106,7 +105,7 @@ Between steps 6 and 7, `ResolveAlarmScriptReferences` resolves each alarm's `OnT
### Concurrent editing ### Concurrent editing
Template edits use **last-write-wins** — there is no optimistic concurrency token on `Template` or its member rows. Two simultaneous edits to the same template produce one winner. This is by design and is documented in the `InstanceService` comment: "Concurrent editing uses last-write-wins — no pessimistic locking or conflict detection." Optimistic concurrency (`RowVersion`) applies to deployment status records in the Deployment Manager, not to template authoring. Template edits use **last-write-wins** — there is no optimistic concurrency token on `Template` or its member rows. Two simultaneous edits to the same template produce one winner. This is by design. The same policy applies to instances and is documented in the `InstanceService` class comment: "Concurrent edits are last-write-wins — there is no version token or conflict detection on instance state." Optimistic concurrency (`RowVersion`) applies to deployment status records in the Deployment Manager, not to template or instance authoring.
## Architecture ## Architecture
@@ -263,7 +262,7 @@ if (!validationResult.IsValid)
### Revision hash changes unexpectedly between deploys ### Revision hash changes unexpectedly between deploys
The SHA-256 hash covers all resolved attributes, alarms, scripts, and connection configurations. Changes to any member anywhere in the inheritance or composition chain — including in parent templates or feature modules the instance does not directly own — will change the hash. Use `DiffService.ComputeDiff` to identify exactly which canonical names changed and why. The SHA-256 hash covers resolved attributes, alarms, scripts, and connection configurations. Changes to any of those members anywhere in the inheritance or composition chain — including in parent templates or feature modules the instance does not directly own — will change the hash. Note that `NativeAlarmSources` is not part of the hash, so native alarm source binding changes alone do not change the revision hash. Use `DiffService.ComputeDiff` to identify exactly which canonical names changed and why.
### Naming collision on composed member add ### Naming collision on composed member add