diff --git a/docs/components/DataConnectionLayer.md b/docs/components/DataConnectionLayer.md index 2a3d2447..66b7d0cf 100644 --- a/docs/components/DataConnectionLayer.md +++ b/docs/components/DataConnectionLayer.md @@ -27,7 +27,7 @@ The DCL performs no alarm evaluation, no trigger evaluation, and no business log | State | Behaviour | |-------|-----------| -| `Connecting` | Stashes `SubscribeTagsRequest`, `WriteTagRequest`, `UnsubscribeTagsRequest`, `SubscribeAlarmsRequest`. Non-blocking for `BrowseNodeCommand` and `ReadTagValuesCommand` (immediate failure reply). | +| `Connecting` | Stashes `SubscribeTagsRequest`, `WriteTagRequest`, `UnsubscribeTagsRequest`, `SubscribeAlarmsRequest`, `UnsubscribeAlarmsRequest`. Non-blocking for `ReadTagValuesCommand` (immediate synchronous failure reply) and `BrowseNodeCommand` (async failure via `PipeTo` when the adapter is `IBrowsableDataConnection`; immediate synchronous reply otherwise). | | `Connected` | Processes all message types. On entering, calls `Stash.UnstashAll()`. | | `Reconnecting` | Stashes new subscribe/write requests; allows `UnsubscribeTagsRequest` and `UnsubscribeAlarmsRequest` through for cleanup on instance stop. | diff --git a/docs/components/ExternalSystemGateway.md b/docs/components/ExternalSystemGateway.md index 1ab25dcf..b250d4b4 100644 --- a/docs/components/ExternalSystemGateway.md +++ b/docs/components/ExternalSystemGateway.md @@ -1,6 +1,6 @@ # External System Gateway -The External System Gateway gives site scripts two runtime capabilities: invoking HTTP/REST APIs on named external systems, and executing SQL writes against named database connections. Both capabilities expose a dual call mode — synchronous (blocking, result returned) and cached (store-and-forward on transient failure, `TrackedOperationId` returned) — so scripts choose the right delivery guarantee per operation without knowing the underlying retry machinery. +The External System Gateway gives site scripts two runtime capabilities: invoking HTTP/REST APIs on named external systems, and executing SQL writes against named database connections. Both capabilities expose a dual call mode — synchronous (blocking, result returned) and cached (store-and-forward on transient failure) — so scripts choose the right delivery guarantee per operation without knowing the underlying retry machinery. ## Overview @@ -31,18 +31,20 @@ Every API call and every database write has two modes: | Mode | API surface | Failure behaviour | Return value | |------|-------------|-------------------|--------------| | Synchronous | `ExternalSystem.Call()` / `Database.Connection()` | All failures returned to script | Response JSON / `DbConnection` | -| Cached | `ExternalSystem.CachedCall()` / `Database.CachedWrite()` | Transient → buffered; permanent → returned | `TrackedOperationId` (on buffer) | +| Cached | `ExternalSystem.CachedCall()` / `Database.CachedWrite()` | Transient → buffered; permanent → returned | `ExternalCallResult` (buffered) / `void` | -`CachedCallAsync` and `CachedWriteAsync` attempt immediate delivery first. Only a transient failure routes to the Store-and-Forward Engine. +`CachedCallAsync` attempts immediate delivery first; only a transient failure routes to the Store-and-Forward Engine. `CachedWriteAsync` makes no immediate SQL attempt — it resolves the connection definition and enqueues directly. ### Error classification -`ErrorClassifier` is the single authority on what counts as transient: +`ErrorClassifier` is the authority on HTTP and exception transience for the synchronous call path: - **HTTP status codes**: 5xx, 408 (Request Timeout), 429 (Too Many Requests) → transient. All other non-success 4xx → permanent. -- **Exceptions**: `HttpRequestException`, `TaskCanceledException`, `TimeoutException`, `OperationCanceledException` → transient. `JsonException` during payload deserialization → permanent (a malformed payload will not become well-formed on retry, so it is parked rather than retried forever). +- **Exceptions**: `HttpRequestException`, `TaskCanceledException`, `TimeoutException`, `OperationCanceledException` → transient. -Transient failures on `CachedCall` / `CachedWrite` are silently buffered (logged at `Debug`). Permanent failures are logged at `Warning` and returned to the calling script regardless of call mode, because a permanently-wrong request should surface immediately. +`JsonException` during buffered-delivery payload deserialization is classified as permanent inline inside `DeliverBufferedAsync` (both `ExternalSystemClient` and `DatabaseGateway`), not via `ErrorClassifier` — a malformed payload will not become well-formed on retry, so it is parked immediately. + +Transient failures on `CachedCall` / `CachedWrite` are silently buffered (logged at `Debug`). Permanent failures on the synchronous (`InvokeHttpAsync`) path are logged at `Warning` and returned to the calling script. Permanent failures detected during buffered retry delivery (`DeliverBufferedAsync`) are logged at `Error` before parking. ## Architecture @@ -64,6 +66,11 @@ Error body embedded in script-visible messages is capped at 2 048 characters so ```csharp // ExternalSystemClient.cs +catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) +{ + // The caller asked to abandon the work — do not reclassify as transient. + throw; +} catch (OperationCanceledException ex) when (timeoutCts.IsCancellationRequested) { // Our own timeout elapsed — a transient failure per the design. @@ -159,7 +166,7 @@ cmd.Parameters.AddWithValue("@name", tagName); var value = await cmd.ExecuteScalarAsync(); ``` -**Cached database write** — enqueued immediately; returns a `TrackedOperationId`: +**Cached database write** — enqueued immediately; returns nothing (`Task`): ```csharp await Database.CachedWrite("MES_DB", diff --git a/docs/components/SiteRuntime.md b/docs/components/SiteRuntime.md index 206ec922..e5675159 100644 --- a/docs/components/SiteRuntime.md +++ b/docs/components/SiteRuntime.md @@ -79,7 +79,7 @@ Central sends a `DeployInstanceCommand` carrying a JSON `FlattenedConfiguration` 1. Calls `EnsureDclConnections` to push any new or changed connection definitions to the DCL manager (hash-guarded: unchanged configs are skipped). 2. Calls `CreateInstanceActor`, which does `Context.ActorOf(props, instanceName)`. -3. Runs an off-thread `Task` that calls `SiteStorageService.StoreDeployedConfigAsync`, clears static overrides and native alarm state, and tells `SiteReplicationActor` to push to the peer node. +3. Runs an off-thread `Task` that calls `SiteStorageService.StoreDeployedConfigAsync`, clears static overrides and native alarm state, and — if `_replicationActor` is non-null (it is optional and null in isolated deployments/tests) — tells `SiteReplicationActor` to push to the peer node. 4. Pipes back a `DeployPersistenceResult`; only on success does it tell the deployer `DeploymentStatus.Success`. If persistence fails, the optimistically-created actor is stopped and the error is returned to central (`SiteRuntime-005`). For redeployment (instance already running), the existing actor is stopped and watched: @@ -103,8 +103,8 @@ The `Terminated` signal fires once the previous actor and its entire subtree hav On `PreStart`, `InstanceActor`: -1. Pipes `SiteStorageService.GetStaticOverridesAsync` to self as a `LoadOverridesResult`. When the message arrives, persisted overrides are applied on top of the flattened-config defaults. -2. Calls `CreateChildActors()`, which snapshots `_attributes` (the live dictionary) into `attributeSnapshot` before any child constructor runs. Each child's `Props` closure captures the immutable snapshot, not the live dictionary — preventing the race condition described in `SiteRuntime-017`. +1. Fires `SiteStorageService.GetStaticOverridesAsync` asynchronously and pipes the result to self as a `LoadOverridesResult` — this is non-blocking; the message arrives later in the mailbox. +2. Calls `CreateChildActors()` **immediately** (before the override message arrives). `CreateChildActors` snapshots `_attributes` (the live dictionary, seeded from flattened-config defaults) into `attributeSnapshot` before any child constructor runs. Each child's `Props` closure captures the immutable snapshot, not the live dictionary — preventing the race condition described in `SiteRuntime-017`. Because the override load is asynchronous, children are created from the un-overridden defaults; when the `LoadOverridesResult` message is subsequently processed, `HandleOverridesLoaded` applies the persisted overrides on top of the live `_attributes` dictionary. 3. Calls `SubscribeToDcl()`, grouping data-sourced attributes by connection name and sending `SubscribeTagsRequest` to the DCL manager. Tag paths are stored in `_tagPathToAttributes`, a `Dictionary>`, because one physical tag can back more than one attribute canonical name. Data-sourced attributes start with quality `Uncertain` until the first `TagValueUpdate` arrives; static attributes start with quality `Good`. @@ -130,7 +130,7 @@ var diagnostics = script.Compile(); ### Shared script library -`SharedScriptLibrary` holds a `Dictionary>` under a `lock`. It is populated at startup (off-thread by the singleton, piped back as `SharedScriptsLoaded`) and updated live when artifact deployments arrive carrying new shared scripts. Calling `Scripts.CallShared("name", params)` inside a script calls `SharedScriptLibrary.ExecuteAsync`, which runs the compiled delegate inline on the calling thread — no actor messages, no serialization. +`SharedScriptLibrary` holds a `Dictionary>` under a `lock`. It is populated at startup (off-thread by the singleton, piped back as `SharedScriptsLoaded`) and updated live when artifact deployments arrive carrying new shared scripts. Calling `Scripts.CallShared("name", params)` inside a script calls `SharedScriptLibrary.ExecuteAsync`, which runs the compiled delegate inline as compiled code (no actor message, no serialization) and is awaited by the caller. ### Script Actor triggers @@ -176,7 +176,7 @@ var killSwitch = _hubSource `NativeAlarmActor` mirrors the condition state of one source binding — an OPC UA A&C server or MxAccess Gateway connection — without writing back to the source. Each condition is keyed by `SourceReference`. -On `PreStart` it rehydrates last-known state from SQLite (`native_alarm_state` table) and immediately sends a `SubscribeAlarmsRequest` to the DCL manager. The DCL forwards this to the connection's `IAlarmSubscribableConnection` implementation. +On `PreStart` it concurrently kicks off two operations: the SQLite rehydration (`GetNativeAlarmsAsync`, piped back as `RehydrationCompleted`) and a `SubscribeAlarmsRequest` to the DCL manager — the subscribe is sent before the async rehydration completes and is not gated on it. The DCL forwards the subscribe request to the connection's `IAlarmSubscribableConnection` implementation. Transition handling: @@ -244,7 +244,7 @@ Scripts run inside `ScriptExecutionActor` with a `ScriptGlobals` object as the R - `Tracking.Status(id)` — reads the site-local `OperationTrackingStore` synchronously. - `Notify.To("list").Send(...)` — enqueues a notification in the Store-and-Forward Engine for delivery to central. -Alarm on-trigger scripts run in `AlarmExecutionActor` with the same API plus an `Alarm` global (`AlarmContext` carrying `Name`, `Level`, `Priority`, `Message`). +Alarm on-trigger scripts run in `AlarmExecutionActor` with a **restricted** context: they receive an `Alarm` global (`AlarmContext` carrying `Name`, `Level`, `Priority`, `Message`) and have access to the instance/shared-script surface (`Instance.*`, `Scripts.CallShared`, `Instance.CallScript`), but **not** the external-system, database, notification, or audit integration APIs. `AlarmExecutionActor` builds its `ScriptRuntimeContext` without a `serviceProvider`, so `ExternalSystem`, `Database`, `Notify`, and audit writes are unavailable to alarm on-trigger scripts — those APIs are only resolved inside `ScriptExecutionActor` (instance scripts). ### Debug view diff --git a/docs/components/StoreAndForward.md b/docs/components/StoreAndForward.md index 0ff8f24d..fadc25a2 100644 --- a/docs/components/StoreAndForward.md +++ b/docs/components/StoreAndForward.md @@ -90,7 +90,7 @@ Three nullable columns (`execution_id`, `source_script`, `parent_execution_id`) `StoreAndForwardStorage` opens a fresh `SqliteConnection` per call and relies on the Microsoft.Data.Sqlite connection pool (keyed on the connection string) for acceptable performance on the retry sweep. If a pooled-open ever becomes a bottleneck the remedy is a batched sweep API that opens one connection per sweep. -Status values from `StoreAndForwardMessageStatus`: `Pending` (0), `InFlight` (1), `Parked` (2), `Delivered` (3). The retry sweep loads only `Pending` rows whose `last_attempt_at` is older than `retry_interval_ms`. +The engine uses two status values from `StoreAndForwardMessageStatus`: `Pending` (0) and `Parked` (2). On successful delivery the row is deleted (`RemoveMessageAsync`) — there is no `Delivered` status written. The enum also declares `InFlight` (1) and `Delivered` (3) but neither is assigned anywhere in the engine; they are dead values. The retry sweep loads only `Pending` rows whose `last_attempt_at` is older than `retry_interval_ms`. ### Retry sweep @@ -137,6 +137,8 @@ The four `ReplicationOperationType` values are `Add`, `Remove`, `Park`, and `Req `NotificationForwarder` is the delivery handler for `StoreAndForwardCategory.Notification`. It deserializes the buffered `PayloadJson` as a `NotificationSubmit`, re-stamps `SourceSiteId` and `SourceInstanceId` from the forwarder's own context (the site is authoritative for these), and sends the submit to the `SiteCommunicationActor` via Akka's `Ask` with a configurable timeout. A `NotificationSubmitAck` with `Accepted = true` returns `true`; any other ack or a timeout throws `NotificationForwardException`, which the engine treats as transient. A payload that cannot be deserialized is logged at Warning and discarded (returns `true`) rather than parked — a corrupt payload cannot be fixed by retrying. +Notification messages are subject to the same retry budget as every other category. The notification enqueue call passes no explicit `maxRetries`, so it inherits `StoreAndForwardOptions.DefaultMaxRetries` (50). Under a sustained central outage that exhausts all 50 retry attempts, the buffered notification is parked and surfaces in the parked-message UI exactly like any other parked message. Callers that require unbounded retry must pass `maxRetries: 0` to `EnqueueAsync`. + ### Parked message management `ParkedMessageHandlerActor` is the Akka bridge between `SiteCommunicationActor` and `StoreAndForwardService`. It handles five message types from central: diff --git a/docs/components/TemplateEngine.md b/docs/components/TemplateEngine.md index 89eafcc5..ad9678ba 100644 --- a/docs/components/TemplateEngine.md +++ b/docs/components/TemplateEngine.md @@ -28,7 +28,7 @@ Both edge types are enforced acyclic on every mutating call. `CycleDetector` pro - `DetectCompositionCycle` — BFS from the proposed composed template through its own compositions. - `DetectCrossGraphCycle` — BFS across both inheritance and composition edges simultaneously, catching cycles that neither pure check alone would find. -All three run before any composition is written. Because the graph can contain not-yet-saved templates (Id = 0), `CycleDetector.BuildLookup` uses `TryAdd` rather than `ToDictionary` so duplicate Ids do not throw. +`AddCompositionAsync` runs `DetectCompositionCycle` and `DetectCrossGraphCycle` before any composition is written — `DetectInheritanceCycle` does not run on the composition path. Because the graph can contain not-yet-saved templates (Id = 0), `CycleDetector.BuildLookup` uses `TryAdd` rather than `ToDictionary` so duplicate Ids do not throw. ### Derived templates and the composition model @@ -83,7 +83,7 @@ Because each composition slot has a unique `InstanceName` prefix, members from d 3. Parent templates, walking to the root. 4. Composed module members, path-qualified. -The eight steps in order: +The nine steps in order: 1. Validate `LockedInDerived` is not violated across each chain. 2. Resolve attributes from the inheritance chain (base-to-derived; `IsInherited` placeholders never shadow live base values). @@ -91,14 +91,13 @@ The eight steps in order: 4. Apply `InstanceAttributeOverride` records (locked attributes are silently skipped). 5. Apply `InstanceConnectionBinding` records (data-sourced attributes only). 6. Resolve alarms; for `HiLo` trigger type, merge setpoints key-by-key so a derived template can override just `hi` while inheriting `loLo`. -7. Resolve scripts; wire `ScriptScope` (self- and parent-path) into each composed script so `Attributes["X"]` resolves to the right path-prefix at runtime. +7. Resolve scripts; wire `ScriptScope` (self- and parent-path) into each composed script so `Attributes["X"]` resolves to the right path-prefix at runtime. Then resolve alarm `OnTriggerScriptId` FKs to canonical script names (`ResolveAlarmScriptReferences`). 8. Resolve native alarm source bindings (`TemplateNativeAlarmSource`), apply `InstanceNativeAlarmSourceOverride`. - -Between steps 6 and 7, `ResolveAlarmScriptReferences` resolves each alarm's `OnTriggerScriptId` FK to the canonical name of the corresponding resolved script. A dangling reference (script Id has no resolved script) produces a null `OnTriggerScriptCanonicalName` and is caught by `SemanticValidator`. +9. Collect connection configurations — iterate resolved attributes, and for each attribute that has a bound data connection, populate `FlattenedConfiguration.Connections` with the corresponding `ConnectionConfig` (protocol, primary and backup JSON, failover retry count). ### Revision hash -`RevisionHashService.ComputeHash` produces a deterministic `sha256:` string over the canonical JSON serialization of the `FlattenedConfiguration`. Volatile fields (`GeneratedAtUtc`) are excluded. Collections are sorted by `CanonicalName` before hashing. Internal `Hashable*` records declare their properties in alphabetical order because `System.Text.Json` emits them in declaration order — out-of-order additions would silently break determinism. The hash is included in the deployment identity and lets the Deployment Manager detect whether a re-flatten has changed anything before pushing to sites. +`RevisionHashService.ComputeHash` produces a deterministic `sha256:` string over the canonical JSON serialization of the `FlattenedConfiguration`. Volatile fields (`GeneratedAtUtc`) are excluded. Collections are sorted by `CanonicalName` before hashing. Internal `Hashable*` records declare their properties in alphabetical order because `System.Text.Json` emits them in declaration order — out-of-order additions would silently break determinism. The hash covers resolved attributes, alarms, scripts, and connection configurations. **`NativeAlarmSources` is not included in the hash**: changes to native alarm source bindings do not alter the revision hash and therefore do not trigger a re-deploy on their own. The hash is included in the deployment identity and lets the Deployment Manager detect whether a re-flatten has changed anything before pushing to sites. ### Diff @@ -106,7 +105,7 @@ Between steps 6 and 7, `ResolveAlarmScriptReferences` resolves each alarm's `OnT ### Concurrent editing -Template edits use **last-write-wins** — there is no optimistic concurrency token on `Template` or its member rows. Two simultaneous edits to the same template produce one winner. This is by design and is documented in the `InstanceService` comment: "Concurrent editing uses last-write-wins — no pessimistic locking or conflict detection." Optimistic concurrency (`RowVersion`) applies to deployment status records in the Deployment Manager, not to template authoring. +Template edits use **last-write-wins** — there is no optimistic concurrency token on `Template` or its member rows. Two simultaneous edits to the same template produce one winner. This is by design. The same policy applies to instances and is documented in the `InstanceService` class comment: "Concurrent edits are last-write-wins — there is no version token or conflict detection on instance state." Optimistic concurrency (`RowVersion`) applies to deployment status records in the Deployment Manager, not to template or instance authoring. ## Architecture @@ -263,7 +262,7 @@ if (!validationResult.IsValid) ### Revision hash changes unexpectedly between deploys -The SHA-256 hash covers all resolved attributes, alarms, scripts, and connection configurations. Changes to any member anywhere in the inheritance or composition chain — including in parent templates or feature modules the instance does not directly own — will change the hash. Use `DiffService.ComputeDiff` to identify exactly which canonical names changed and why. +The SHA-256 hash covers resolved attributes, alarms, scripts, and connection configurations. Changes to any of those members anywhere in the inheritance or composition chain — including in parent templates or feature modules the instance does not directly own — will change the hash. Note that `NativeAlarmSources` is not part of the hash, so native alarm source binding changes alone do not change the revision hash. Use `DiffService.ComputeDiff` to identify exactly which canonical names changed and why. ### Naming collision on composed member add