docs(components): reference docs batch 2/4 — TemplateEngine, DeploymentManager, SiteRuntime, DataConnectionLayer, StoreAndForward, ExternalSystemGateway

2026-06-03 15:47:16 -04:00
parent b89611464b
commit 8fb90ba400
6 changed files with 1644 additions and 0 deletions
@@ -0,0 +1,309 @@
+# Data Connection Layer
+
+The Data Connection Layer is the site-only clean data pipe that abstracts protocol-specific communication behind a uniform actor/interface model, delivering live tag values and native alarm transitions to Instance Actors while hiding all connection lifecycle complexity from the rest of the system.
+
+## Overview
+
+The Data Connection Layer (#4) runs exclusively on site nodes. Central never reads from or writes to physical devices directly. The component code lives in `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/`, with adapters in `Adapters/` and the actor hierarchy in `Actors/`.
+
+Key collaborators are:
+
+- `DataConnectionManagerActor` — the site-level router. One actor per site, child of the Site Runtime hierarchy. It owns the `connectionName → IActorRef` map and routes every inbound message to the right `DataConnectionActor`.
+- `DataConnectionActor` — one per configured data connection. Owns a single `IDataConnection` adapter and models the full connection lifecycle as a Become/Stash state machine.
+- `IDataConnection` (Commons) — the protocol adapter contract. Implemented by `OpcUaDataConnection` and `MxGatewayDataConnection`.
+- `DataConnectionFactory` / `IDataConnectionFactory` — resolves a case-insensitive `ProtocolType` string to a fresh `IDataConnection` instance.
+
+DI registration is through `ServiceCollectionExtensions.AddDataConnectionLayer`, which binds `DataConnectionOptions` from the `DataConnectionLayer` section and `OpcUaGlobalOptions` from the `OpcUa` section. The actors themselves are created inside the ActorSystem, not by DI.
+
+## Key Concepts
+
+### Clean data pipe
+
+The DCL performs no alarm evaluation, no trigger evaluation, and no business logic. Its only job is to subscribe to tag paths requested by Instance Actors, deliver `TagValueUpdate` messages when values change, and route write requests down to the device. Every value change crosses exactly one boundary: device → adapter callback → `DataConnectionActor` → `InstanceActor`. The DCL never publishes to any actor other than the Instance Actors that subscribed.
+
+### Become/Stash lifecycle state machine
+
+`DataConnectionActor` uses Akka.NET's `Become`/`Stash` pattern to model three states cleanly:
+
+| State | Behaviour |
+|-------|-----------|
+| `Connecting` | Stashes `SubscribeTagsRequest`, `WriteTagRequest`, `UnsubscribeTagsRequest`, `SubscribeAlarmsRequest`. Non-blocking for `BrowseNodeCommand` and `ReadTagValuesCommand` (immediate failure reply). |
+| `Connected` | Processes all message types. On entering, calls `Stash.UnstashAll()`. |
+| `Reconnecting` | Stashes new subscribe/write requests; allows `UnsubscribeTagsRequest` and `UnsubscribeAlarmsRequest` through for cleanup on instance stop. |
+
+`BecomeConnecting` fires immediately in `PreStart`. `BecomeConnected` also calls `Stash.UnstashAll()` so queued subscribe requests from instance startup race are processed without any loss. The `IWithStash` and `IWithTimers` interfaces are declared on the actor; both are injected by the Akka.NET infrastructure.
+
+### Adapter generation guard
+
+Each `DataConnectionActor` holds an `_adapterGeneration` integer. When a new `IDataConnection` is created on failover, the generation is incremented. All subscription callbacks capture the generation at the time they are registered; a `TagValueReceived` or `AlarmTransitionReceived` whose generation does not match `_adapterGeneration` is silently dropped. This ensures stale callbacks from a disposed OPC UA SDK session or a closed gRPC stream never reach Instance Actors.
+
+### Protocol extensibility
+
+Adding a new protocol requires implementing `IDataConnection` (and optionally `IBrowsableDataConnection` and/or `IAlarmSubscribableConnection`) and calling `DataConnectionFactory.RegisterAdapter` with the new `ProtocolType` string. The manager and the actor are protocol-agnostic — they always work through the interfaces.
+
+## Architecture
+
+### Actor hierarchy
+
+```
+DataConnectionManagerActor          (one per site; child of DeploymentManagerActor)
+├── DataConnectionActor  "OpcSrv1"  (one per configured connection)
+├── DataConnectionActor  "Gateway2"
+└── ...
+```
+
+`DataConnectionManagerActor` is a `ReceiveActor`. It routes by connection name: `HandleRoute` calls `actor.Forward(request)` when the connection exists or sends a typed failure reply when it does not. The manager owns `ConnectionNotFound` failures; `DataConnectionActor` owns everything else (not connected, server errors, capability checks). The manager's `SupervisorStrategy` is `Resume` with up to ten restarts per minute, so a transient exception on a child never collapses the routing table.
+
+`CreateConnectionCommand` (from the Site Runtime's `DeploymentManagerActor`) triggers `HandleCreateConnection`. The actor name is sanitised from the connection's display name to satisfy Akka's path character constraints.
+
+### Connection state machine
+
+```csharp
+// DataConnectionActor.cs — PreStart and state transitions
+protected override void PreStart()
+{
+    _self = Self;                                  // capture for off-thread use
+    _adapter.Disconnected += OnAdapterDisconnected;
+    BecomeConnecting();
+}
+
+private void OnAdapterDisconnected()
+{
+    // fired from a background thread (gRPC stream, OPC UA keep-alive timer)
+    _self.Tell(new AdapterDisconnected());         // marshal onto actor loop
+}
+
+private void BecomeConnecting()
+{
+    _healthCollector.UpdateConnectionHealth(_connectionName, ConnectionHealth.Connecting);
+    Become(Connecting);
+    Self.Tell(new AttemptConnect());
+}
+
+private void BecomeConnected()
+{
+    _lastConnectedAt = DateTimeOffset.UtcNow;
+    _healthCollector.UpdateConnectionHealth(_connectionName, ConnectionHealth.Connected);
+    Become(Connected);
+    Stash.UnstashAll();
+}
+
+private void BecomeReconnecting()
+{
+    _healthCollector.UpdateConnectionHealth(_connectionName, ConnectionHealth.Disconnected);
+    Become(Reconnecting);
+    PushBadQualityForAllTags();      // immediate bad quality — Instance Actors see it now
+    PushAlarmSourceUnavailable();    // native alarm subscribers notified
+    Timers.StartSingleTimer("reconnect", new AttemptConnect(), _options.ReconnectInterval);
+}
+```
+
+`AttemptConnect` fires `_adapter.ConnectAsync(...)` and pipes `ConnectResult` back to the actor via `PipeTo`. On success, `BecomeConnected` is called; on failure, the counter increments and the reconnect timer is re-armed.
+
+### Transparent re-subscribe
+
+`ReSubscribeAll` runs immediately after a successful reconnect, before `BecomeConnected`. It derives the tag list from `_subscriptionsByInstance` (the durable record of what every Instance Actor asked for) rather than from `_subscriptionIds` (which are adapter handles cleared on every disconnect). All in-flight and unresolved sets are cleared so the new adapter starts clean. Subscriptions are re-issued in parallel via `PipeTo`, so the actor is never blocked during re-subscribe.
+
+```csharp
+// DataConnectionActor.cs — HandleReconnectResult
+if (result.Success)
+{
+    _consecutiveFailures = 0;
+    ReSubscribeAll();        // tag subscriptions
+    ReSubscribeAllAlarms();  // native alarm feeds
+    BecomeConnected();
+}
+```
+
+### Failover between primary and backup endpoints
+
+When `BackupConnectionDetails` is provided, the actor tracks consecutive failures and unstable disconnects. After `FailoverRetryCount` consecutive failures (or unstable connections shorter than `StableConnectionThreshold`), the actor disposes the current adapter, creates a fresh one with the other endpoint's config via `DataConnectionFactory.Create`, and re-arms the connect timer. Failover is round-robin: primary → backup → primary. There is no auto-failback; the connection stays on whichever endpoint is currently working.
+
+### Write path
+
+Writes are fire-and-forget from the script's view only in the sense that there is no store-and-forward. The `DataConnectionActor` pipes `_adapter.WriteAsync` back to the original sender:
+
+```csharp
+// DataConnectionActor.cs — HandleWrite
+private void HandleWrite(WriteTagRequest request)
+{
+    var sender = Sender;
+    var cts = new CancellationTokenSource(_options.WriteTimeout);
+
+    _adapter.WriteAsync(request.TagPath, request.Value, cts.Token).ContinueWith(t =>
+    {
+        cts.Dispose();
+        if (t.IsCompletedSuccessfully)
+            return new WriteTagResponse(request.CorrelationId, t.Result.Success,
+                t.Result.ErrorMessage, DateTimeOffset.UtcNow);
+        if (t.IsCanceled || t.Exception?.GetBaseException() is OperationCanceledException)
+            return new WriteTagResponse(request.CorrelationId, false,
+                $"Write timeout after {_options.WriteTimeout.TotalSeconds:F0}s", DateTimeOffset.UtcNow);
+        return new WriteTagResponse(request.CorrelationId, false,
+            t.Exception?.GetBaseException().Message, DateTimeOffset.UtcNow);
+    }).PipeTo(sender);
+}
+```
+
+`WriteTagResponse` is returned synchronously to the calling script (via the Instance Actor, which awaits it). There is no store-and-forward for writes — buffering stale setpoints for later replay is unsafe in a control context.
+
+### Tag path resolution retry
+
+When `_adapter.SubscribeAsync` throws a resolution-level exception (node not found, device still booting), the tag is added to `_unresolvedTags` and marked `QualityCode.Bad`. A periodic `RetryTagResolution` timer fires at `TagResolutionRetryInterval`. Only tags that are not already in `_resolutionInFlight` are dispatched on each tick, preventing duplicate concurrent subscribe calls for a slow device. When resolution succeeds, the tag moves from `_unresolvedTags` to `_subscriptionIds`, `_resolvedTags` is incremented, and the timer is cancelled once the set is empty.
+
+A separate in-flight set `_subscribesInFlight` tracks the initial `SubscribeAsync` for newly requested tags. Two `SubscribeTagsRequest` messages arriving for different instances that share a tag path both observe the tag as already handled, so only one adapter subscribe is issued.
+
+### Tag subscriber reference counting
+
+`_tagSubscriberCount` maps each tag path to the number of instances subscribed to it. When an instance unsubscribes, the count is decremented. The adapter `UnsubscribeAsync` call and the quality / resolution counters are updated only when the count reaches zero. This means a shared tag path (subscribed by two different instances bound to the same connection) remains active at the adapter until both instances stop.
+
+## Usage
+
+### Creating a connection
+
+The Site Runtime's `DeploymentManagerActor` sends `CreateConnectionCommand` to `DataConnectionManagerActor` during instance deployment:
+
+```csharp
+// Commons/Messages/DataConnection/CreateConnectionCommand.cs
+public record CreateConnectionCommand(
+    string ConnectionName,
+    string ProtocolType,
+    IDictionary<string, string> PrimaryConnectionDetails,
+    IDictionary<string, string>? BackupConnectionDetails = null,
+    int FailoverRetryCount = 3);
+```
+
+`DataConnectionManagerActor.HandleCreateConnection` calls `DataConnectionFactory.Create(ProtocolType, PrimaryConnectionDetails)` to produce the initial `IDataConnection` adapter and spawns a `DataConnectionActor` child.
+
+### Subscribing and receiving values
+
+Instance Actors send `SubscribeTagsRequest` and receive `TagValueUpdate` messages. The actor sends `TagValueUpdate` to the Instance Actor's ref for every value change notification from the adapter. On disconnect, `ConnectionQualityChanged` (with `QualityCode.Bad`) is sent to all subscribers so Instance Actors reflect staleness immediately without waiting for a per-tag callback.
+
+### Browse (design-time only)
+
+`BrowseNodeCommand` is routed to the named `DataConnectionActor`. If the adapter implements `IBrowsableDataConnection`, the actor calls `BrowseChildrenAsync` and pipes `BrowseNodeResult` back. The result is capped to ~100 KB before reply to stay within the Akka remote frame budget on the site→central crossing. Browse is never called on the hot path and Instance Actors never use it.
+
+### Native alarm subscription
+
+`NativeAlarmActor` (Site Runtime) sends `SubscribeAlarmsRequest` to the `DataConnectionManagerActor`, which forwards it to the named `DataConnectionActor`:
+
+```csharp
+// DataConnectionActor.cs — HandleSubscribeAlarms (key excerpt)
+if (_adapter is not IAlarmSubscribableConnection alarmable)
+{
+    subscriber.Tell(new SubscribeAlarmsResponse(
+        request.CorrelationId, request.InstanceUniqueName, false,
+        $"Connection '{_connectionName}' is not alarm-capable.", now));
+    return;
+}
+// register the subscriber for routing before issuing the adapter call
+_alarmSourceSubscribers[request.SourceReference].Add(subscriber);
+
+// open one feed per source reference; subsequent subscribers reuse it
+alarmable.SubscribeAlarmsAsync(sourceRef, filter,
+        t => self.Tell(new AlarmTransitionReceived(t, generation)))
+    .ContinueWith(task => ...)
+    .PipeTo(self);
+```
+
+Incoming `AlarmTransitionReceived` is routed to all subscribers whose registered `SourceReference` is a prefix of the transition's `SourceObjectReference` or `SourceReference`. On disconnect, `PushAlarmSourceUnavailable` sends `NativeAlarmSourceUnavailable` to every alarm subscriber; on reconnect, `ReSubscribeAllAlarms` re-opens the feeds and the adapter replays a snapshot of currently-active conditions.
+
+## Configuration
+
+All settings under `DataConnectionLayer` in `appsettings.json` bind to `DataConnectionOptions`. Global protocol settings (`OpcUa`, `MxGateway` sections) bind to `OpcUaGlobalOptions` and `MxGatewayGlobalOptions`.
+
+### Shared settings (`DataConnectionLayer` section)
+
+| Key | Default | Description |
+|-----|---------|-------------|
+| `ReconnectInterval` | `00:00:05` | Fixed interval between reconnection attempts after a disconnect or failed connect. |
+| `TagResolutionRetryInterval` | `00:00:10` | Retry interval for tag paths that could not be resolved on the device. |
+| `WriteTimeout` | `00:00:30` | Timeout applied to each `WriteAsync` call. A hung write times out and returns an error to the calling script. |
+| `StableConnectionThreshold` | `00:01:00` | A connection that drops before this duration is counted as an unstable disconnect toward `FailoverRetryCount`. |
+
+### OPC UA global settings (`OpcUa` section)
+
+| Key | Default | Description |
+|-----|---------|-------------|
+| `ApplicationName` | `ScadaBridge-DCL` | Application name used in the OPC UA certificate and session negotiation. |
+| `TrustedIssuerStorePath` | `""` | File path to the trusted issuer certificate store. |
+| `TrustedPeerStorePath` | `""` | File path to the trusted peer certificate store. |
+| `RejectedCertificateStorePath` | `""` | File path to the rejected certificate store. |
+
+Empty store paths fall back to a temp-path default so dev runs work without explicit configuration.
+
+### Per-connection settings (OPC UA)
+
+Stored in the `DataConnection.PrimaryConfiguration` JSON dict, parsed by `OpcUaEndpointConfigSerializer.FromFlatDict`.
+
+| Key | Default | Description |
+|-----|---------|-------------|
+| `endpoint` / `EndpointUrl` | `opc.tcp://localhost:4840` | OPC UA server endpoint URL. |
+| `SessionTimeoutMs` | `60000` | Session timeout in ms. |
+| `OperationTimeoutMs` | `15000` | Transport operation timeout in ms. |
+| `PublishingIntervalMs` | `1000` | Subscription publishing interval in ms. |
+| `KeepAliveCount` | `10` | Keep-alive frames before session timeout. |
+| `LifetimeCount` | `30` | Subscription lifetime in publish intervals. |
+| `SamplingIntervalMs` | `1000` | Per-item server sampling rate in ms. |
+| `QueueSize` | `10` | Per-item notification buffer size. |
+| `SecurityMode` | `None` | `None`, `Sign`, or `SignAndEncrypt`. |
+| `AutoAcceptUntrustedCerts` | `true` | Accept untrusted server certificates. |
+
+### Per-connection settings (MxGateway)
+
+Stored in `DataConnection.PrimaryConfiguration`, parsed by `MxGatewayEndpointConfigSerializer.FromFlatDict`.
+
+| Key | Default | Description |
+|-----|---------|-------------|
+| `Endpoint` | `http://localhost:5000` | Gateway base URL. |
+| `ApiKey` | — | Sent as `Authorization: Bearer <key>`. Redacted in logs. |
+| `ClientName` | `scadabridge` | MXAccess client registration name. |
+| `WriteUserId` | `0` | MXAccess user id on writes. `0` = no user context. |
+| `UseTls` | `false` | Use TLS. |
+| `CaFile` | — | CA certificate file path (TLS only). |
+| `ServerName` | — | TLS server-name override. |
+| `ReadTimeoutMs` | `5000` | `ReadBulk` per-call timeout in ms. |
+
+### Per-connection failover settings
+
+Carried on `CreateConnectionCommand` from the deployment artifact.
+
+| Field | Default | Description |
+|-------|---------|-------------|
+| `BackupConnectionDetails` | `null` | If absent, the connection retries its single endpoint indefinitely. |
+| `FailoverRetryCount` | `3` | Consecutive failures (or unstable disconnects) before switching endpoints. |
+
+## Dependencies & Interactions
+
+- [Site Runtime (#3)](./SiteRuntime.md) — `DeploymentManagerActor` sends `CreateConnectionCommand` / `RemoveConnectionCommand` to create and tear down connections during instance deployment. Instance Actors send `SubscribeTagsRequest` / `UnsubscribeTagsRequest` / `WriteTagRequest`. `NativeAlarmActor` (peer to `AlarmActor` under `InstanceActor`) sends `SubscribeAlarmsRequest` / `UnsubscribeAlarmsRequest` and receives `NativeAlarmTransitionUpdate` / `NativeAlarmSourceUnavailable`.
+- [Commons (#16)](./Commons.md) — owns `IDataConnection`, `IAlarmSubscribableConnection`, `IBrowsableDataConnection` interfaces; `TagValue`, `QualityCode`, `NativeAlarmTransition`, `AlarmConditionState`, `AlarmTransitionKind` types; and all message contracts (`SubscribeTagsRequest/Response`, `TagValueUpdate`, `WriteTagRequest/Response`, `SubscribeAlarmsRequest/Response`, `NativeAlarmTransitionUpdate`, `NativeAlarmSourceUnavailable`, `CreateConnectionCommand`, `BrowseNodeCommand/Result`, `DataConnectionHealthReport`).
+- [Health Monitoring (#11)](./HealthMonitoring.md) — `DataConnectionActor` calls `ISiteHealthCollector.UpdateConnectionHealth`, `UpdateTagResolution`, `UpdateTagQuality`, `UpdateConnectionEndpoint`, and `RemoveConnection` to keep the site health report current. `DataConnectionManagerActor` handles `GetAllHealthReports` by forwarding `GetHealthReport` to each child.
+- [Site Event Logging (#12)](./SiteEventLogging.md) — `DataConnectionActor` logs connection lost, restored, and failover events via `ISiteEventLogger.LogEventAsync` (fire-and-forget). Absent in tests where `ISiteEventLogger` is null.
+- Design spec: [Component-DataConnectionLayer.md](../requirements/Component-DataConnectionLayer.md).
+
+## Troubleshooting
+
+### Tags stuck in bad quality after reconnect
+
+After a reconnect, `ReSubscribeAll` re-issues all subscriptions. Tags that resolve successfully immediately receive a seeded value from `ReadAsync` before the subscription's first change notification arrives, so the Instance Actor sees a good-quality value promptly. Tags that fail resolution land in `_unresolvedTags` and are retried at `TagResolutionRetryInterval`. If a tag is persistently bad, check whether the tag path exists on the device and whether the device is online. The health report's `ResolvedTags` vs `TotalSubscribedTags` counters expose the gap without requiring the debug view.
+
+### Connection not failing over to backup
+
+The failover counter uses two independent paths: `_consecutiveFailures` for connect-attempt failures, and `_consecutiveUnstableDisconnects` for connections that drop before `StableConnectionThreshold`. Both must reach `FailoverRetryCount` to trigger a switch. A connection that succeeds but immediately drops does not increment `_consecutiveFailures` — it increments `_consecutiveUnstableDisconnects` instead. Verify both counters are relevant to the observed failure pattern. Failover events are written to Site Event Logging as `Warning` entries.
+
+### Browse hangs on "loading…"
+
+The `DataConnectionActor` caps `BrowseNodeResult` to ~100 KB before returning it. If the picker hangs rather than showing a "results truncated" hint, the reply may have been silently discarded by Akka remoting before reaching central (the reply crosses the site→central frame). Check whether the connection actor is in `Connecting` or `Reconnecting` state — browse while disconnected returns `BrowseFailureKind.ConnectionNotConnected` immediately rather than hanging.
+
+### Alarm feed not receiving transitions
+
+Confirm that the connection's adapter implements `IAlarmSubscribableConnection` (currently `OpcUaDataConnection` and `MxGatewayDataConnection`). A non-capable adapter returns `SubscribeAlarmsResponse(Success = false)`. If the capability is present, check whether the connection is in `Connected` state — `SubscribeAlarmsRequest` is stashed while `Connecting` or `Reconnecting` and processed on entering `Connected`. On reconnect, `ReSubscribeAllAlarms` re-opens the feeds and the adapter replays a snapshot.
+
+## Related Documentation
+
+- [Data Connection Layer design specification](../requirements/Component-DataConnectionLayer.md)
+- [Site Runtime](./SiteRuntime.md)
+- [Commons](./Commons.md)
+- [Health Monitoring](./HealthMonitoring.md)
+- [Site Event Logging](./SiteEventLogging.md)
+- [Central–Site Communication](./Communication.md)
@@ -0,0 +1,209 @@
+# Deployment Manager
+
+The Deployment Manager is the central-side pipeline that takes a validated, flattened instance configuration from the Template Engine, ships it to a site via the Communication Layer, and tracks the result — along with full instance lifecycle commands and system-wide artifact distribution to all connected sites.
+
+## Overview
+
+Deployment Manager (#2) runs exclusively on the central cluster. The site-side counterpart — the Deployment Manager singleton inside Site Runtime — receives and applies what central sends; that actor's design is covered in Site Runtime (#3).
+
+The component code lives in `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/`:
+
+- `DeploymentService` — per-instance deploy, disable, enable, delete, diff, and status queries.
+- `ArtifactDeploymentService` — system-wide artifact broadcast and per-site retry.
+- `FlatteningPipeline` — wraps the Template Engine's `FlatteningService`, `ValidationService`, and `RevisionHashService` into a single call used by `DeploymentService`.
+- `OperationLockManager` — ref-counted per-instance `SemaphoreSlim(1,1)` that serialises all mutating operations on one instance.
+- `StateTransitionValidator` — encodes the allowed state-transition matrix for `InstanceState`.
+- `DeploymentStatusNotifier` — singleton in-process event broadcaster that pushes `DeploymentStatusChange` to the Central UI's Blazor circuits instead of letting them poll.
+
+Registration entry point: `ServiceCollectionExtensions.AddDeploymentManager`. Options are bound from `ScadaBridge:DeploymentManager` in `appsettings.json`.
+
+## Key Concepts
+
+### Deployment identity
+
+Every instance deployment carries two correlated identifiers:
+
+- **`DeploymentId`** — a new `Guid` (formatted `"N"`) minted by `DeploymentService` at the start of each `DeployInstanceAsync` call.
+- **`RevisionHash`** — computed by the Template Engine's `RevisionHashService` over the fully resolved `FlattenedConfiguration`. The hash captures the template state at the moment of flattening, so concurrent last-write-wins template edits do not affect an in-flight deployment.
+
+The pair travels inside `DeployInstanceCommand` to the site. The site uses the `DeploymentId` to detect an already-applied identical command (idempotent re-delivery) and uses the `RevisionHash` to reject a stale configuration that predates what is already running.
+
+Central stores the `RevisionHash` on `DeploymentRecord` and, after a confirmed success, on `DeployedConfigSnapshot`. Comparing the snapshot hash against the current-template hash determines whether an instance is stale without a site round-trip.
+
+### Per-instance operation lock
+
+`OperationLockManager` holds a `Dictionary<string, LockEntry>` keyed by instance `UniqueName`. Each `LockEntry` wraps a `SemaphoreSlim(1,1)` with a reference count so the semaphore is created on first contention and disposed when the last waiter clears. The lock covers all four mutating operations — deploy, disable, enable, delete — so they can never interleave on a single instance. Operations on different instances proceed in parallel.
+
+Lock acquisition throws `TimeoutException` after `DeploymentManagerOptions.OperationLockTimeout` (default 5 s). The operation lock is in-memory and is therefore lost on a central failover; the design treats any in-progress deployment at failover time as failed.
+
+### State transition rules
+
+`StateTransitionValidator` enforces the following matrix:
+
+| `InstanceState` | Deploy | Disable | Enable | Delete |
+|-----------------|--------|---------|--------|--------|
+| `NotDeployed`   | Yes    | No      | No     | Yes    |
+| `Enabled`       | Yes    | Yes     | No     | Yes    |
+| `Disabled`      | Yes*   | No      | Yes    | Yes    |
+
+\* Deploying from `Disabled` transitions the instance to `Enabled` on confirmed success.
+
+### Optimistic concurrency on deployment status
+
+`DeploymentRecord` carries a `RowVersion byte[]` column. EF Core uses this as an optimistic-concurrency token on every `UPDATE` and `DELETE`. A concurrent write to the same record surfaces as `DbUpdateConcurrencyException` rather than silently overwriting the peer's state.
+
+### Failover and in-progress deployments
+
+The operation lock is in-memory. If the active central node fails mid-deployment, the new active node has no lock and no knowledge of what the site received. The `DeploymentRecord` is left `InProgress` (or `Failed` if the failure path ran before the node died). Before allowing a re-deploy, `DeploymentService` calls `TryReconcileWithSiteAsync`, which queries the site for its currently-applied revision hash and reconciles rather than re-sending if the site already has the target revision.
+
+## Architecture
+
+### Instance deploy pipeline
+
+`DeployInstanceAsync` executes the following sequence:
+
+1. **Load and validate state** — loads the `Instance` from `IDeploymentManagerRepository` and checks the transition via `StateTransitionValidator`.
+2. **Acquire operation lock** — `OperationLockManager.AcquireAsync` blocks competing operations on the same instance.
+3. **Flatten and validate** — `IFlatteningPipeline.FlattenAndValidateAsync` runs the Template Engine pipeline and returns a `FlatteningPipelineResult` containing the `FlattenedConfiguration`, `RevisionHash`, and a `ValidationResult`. Semantic validation failures (call targets, argument types, trigger operand types, connection binding completeness) are returned to the caller before any record is written.
+4. **Pre-deploy site reconciliation** — when the prior `DeploymentRecord` for the instance is `InProgress` or `Failed` with a timeout marker (`"Communication failure:"`), the service queries the site via `CommunicationService.QueryDeploymentStateAsync`. If the site already holds the target revision hash, the prior record is updated to `Success` and no new deployment is sent.
+5. **Write `InProgress` record** — a single `DeploymentRecord` insert directly at `InProgress` status (no transient `Pending` hop). `IDeploymentStatusNotifier.NotifyStatusChanged` fires to push the status to the UI.
+6. **Send `DeployInstanceCommand`** — the command carries `DeploymentId`, `InstanceUniqueName`, `RevisionHash`, `FlattenedConfigurationJson`, `DeployedBy`, and `Timestamp`.
+7. **Commit terminal status** — the `DeploymentRecord` is updated to `Success` or `Failed` and saved before any post-success side effects run. This ordering ensures the recorded outcome can never be lost if a post-success write fails.
+8. **Post-success side effects** — `ApplyPostSuccessSideEffectsAsync` sets `Instance.State = Enabled` (or preserves `Disabled` on the reconciliation path) and upserts the `DeployedConfigSnapshot`. These writes are best-effort: a failure here is logged at `Error` but does not flip the already-committed `Success` record back to `Failed`.
+9. **Audit log** — `IAuditService.LogAsync` records `Deploy` / `DeployFailed` / `DeployReconciled` with the `DeploymentId`, status, and user.
+
+Any exception in the site round-trip (steps 6–7) writes `DeploymentStatus.Failed` using `CancellationToken.None` so a cancelled outer token cannot prevent the failure record from being persisted:
+
+```csharp
+// DeploymentService.DeployInstanceAsync — exception handler
+var isTimeout = ex is TimeoutException or OperationCanceledException;
+
+record.Status = DeploymentStatus.Failed;
+record.ErrorMessage = isTimeout
+    ? $"{TimeoutFailurePrefix} {ex.Message}"
+    : $"Deployment error: {ex.Message}";
+record.CompletedAt = DateTimeOffset.UtcNow;
+
+await _repository.UpdateDeploymentRecordAsync(record, CancellationToken.None);
+await _repository.SaveChangesAsync(CancellationToken.None);
+NotifyStatusChange(record);
+```
+
+The `TimeoutFailurePrefix` constant (`"Communication failure:"`) is the marker that `ShouldQuerySiteBeforeRedeploy` checks on the next deploy attempt.
+
+### Pre-deploy site reconciliation
+
+`TryReconcileWithSiteAsync` is invoked only when a prior deployment record exists and `ShouldQuerySiteBeforeRedeploy` returns true:
+
+```csharp
+private static bool ShouldQuerySiteBeforeRedeploy(DeploymentRecord prior) =>
+    prior.Status == DeploymentStatus.InProgress
+    || (prior.Status == DeploymentStatus.Failed
+        && prior.ErrorMessage != null
+        && prior.ErrorMessage.StartsWith(TimeoutFailurePrefix, StringComparison.Ordinal));
+```
+
+If the site responds that it is running the target `RevisionHash`, the stale prior record is updated to `Success` (with the hash corrected to the target), `ApplyPostSuccessSideEffectsAsync` runs with `forceEnabledState: false` to avoid undoing an intentional disable, and the caller receives the reconciled record. A query failure falls through to a normal deploy; the site's own stale-rejection logic is the safety net.
+
+### Deployed config snapshot and diff
+
+`DeployedConfigSnapshot` is a one-per-instance row that stores the `DeploymentId`, `RevisionHash`, and the full `FlattenedConfiguration` JSON as of the last confirmed success. `DeploymentService.GetDeploymentComparisonAsync` re-flattens the current template state, compares the hash, and feeds both configs to `DiffService.ComputeDiff` if the hashes differ, producing a `ConfigurationDiff` with added, removed, and changed attributes, alarms, scripts, and connection bindings.
+
+### Artifact deployment
+
+`ArtifactDeploymentService.DeployToAllSitesAsync` deploys the full system-wide artifact set to every site in parallel. It fetches system-wide artifacts (shared scripts, external systems with serialised methods, database connections, notification lists, SMTP configurations) once via `FetchGlobalArtifactsAsync` before the per-site loop, avoiding N×1 re-queries. Per-site data connections are fetched inside each per-site command build because they legitimately vary per site.
+
+All per-site `DeployArtifactsCommand` messages share one `DeploymentId` so the audit log, UI summary, and persisted `SystemArtifactDeploymentRecord` all reference the same logical deployment. Each site runs under a `cts.CancelAfter(ArtifactDeploymentTimeoutPerSite)` linked source. Successful sites are never rolled back on other failures; individual failed sites are retryable via `RetryForSiteAsync`.
+
+```csharp
+// ArtifactDeploymentService — parallel per-site dispatch
+var tasks = sites.Select(async site =>
+{
+    using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
+    cts.CancelAfter(_options.ArtifactDeploymentTimeoutPerSite);
+
+    var command = siteCommands[site.Id];
+    var response = await _communicationService.DeployArtifactsAsync(
+        site.SiteIdentifier, command, cts.Token);
+
+    return new SiteArtifactResult(
+        site.SiteIdentifier, site.Name, response.Success, response.ErrorMessage);
+}).ToList();
+```
+
+Cross-site artifact version skew is supported by design: a site that missed an artifact deployment continues operating with its current versions until an operator retries.
+
+### Status notification
+
+`DeploymentStatusNotifier` is a DI singleton that exposes `event Action<DeploymentStatusChange>? StatusChanged`. `DeploymentService` calls `NotifyStatusChanged` at every point a `DeploymentRecord` status is written. The Central UI's deployment page subscribes at render time and re-renders over its Blazor Server SignalR circuit without polling. Each subscriber is invoked individually inside a try/catch so a disposed Blazor circuit cannot break the deployment pipeline.
+
+## Usage
+
+`DeploymentService` and `ArtifactDeploymentService` are scoped services, typically resolved by `ManagementService` actor handlers (triggered by `MgmtDeployArtifactsCommand`, `GetDeploymentDiffCommand`, and the instance lifecycle commands) or directly by Central UI Blazor components. Engineers interact through the Central UI; automated bulk operations (deploy all stale instances) decompose into individual `DeployInstanceAsync` calls.
+
+Lifecycle commands (`DisableInstanceAsync`, `EnableInstanceAsync`, `DeleteInstanceAsync`) follow the same lock-then-command pattern as deploy, with `LifecycleCommandTimeout` applied as a linked `CancellationTokenSource` deadline:
+
+```csharp
+// DeploymentService — lifecycle command pattern (disable shown)
+using var lockHandle = await _lockManager.AcquireAsync(
+    instance.UniqueName, _options.OperationLockTimeout, cancellationToken);
+
+using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
+cts.CancelAfter(_options.LifecycleCommandTimeout);
+response = await _communicationService.DisableInstanceAsync(siteId, command, cts.Token);
+```
+
+A timeout on a lifecycle command writes a `DisableTimedOut` / `EnableTimedOut` / `DeleteTimedOut` audit entry via `TryLogLifecycleTimeoutAsync` using `CancellationToken.None`, mirroring the `DeployFailed` audit pattern. The site-side `Instance` state is only updated in the central DB after the site confirms success; a timeout leaves the DB state unchanged.
+
+Delete is stricter than disable/enable: if the site confirms but the central `DeleteInstanceAsync` repository call subsequently fails, the instance record is orphaned. The service logs at `Error`, records a `DeleteOrphaned` audit entry, and returns a descriptive failure so an operator can reconcile — it does not retry automatically.
+
+## Configuration
+
+Options are registered via `AddDeploymentManager` and bound from `ScadaBridge:DeploymentManager`.
+
+| Key | Default | Description |
+|-----|---------|-------------|
+| `OperationLockTimeout` | `00:00:05` | Maximum wait for the per-instance operation lock before throwing `TimeoutException`. |
+| `LifecycleCommandTimeout` | `00:00:30` | Maximum round-trip for a disable, enable, or delete command before the operation is declared timed out. |
+| `ArtifactDeploymentTimeoutPerSite` | `00:02:00` | Per-site deadline for a `DeployArtifactsCommand` response. Sites exceeding this are recorded as failed; others are unaffected. |
+
+## Dependencies & Interactions
+
+- [Template Engine (#1)](./TemplateEngine.md) — `FlatteningPipeline` delegates to `FlatteningService`, `ValidationService`, and `RevisionHashService`. Template state is captured at flatten time; last-write-wins edits made after flatten do not affect the in-flight deployment. `DiffService.ComputeDiff` powers the deployment diff view.
+- [Configuration Database (#17)](./ConfigurationDatabase.md) — owns the EF Core implementation of `IDeploymentManagerRepository`, which stores `DeploymentRecord`, `DeployedConfigSnapshot`, and `SystemArtifactDeploymentRecord`. `IAuditService` (also registered by the Configuration Database component) writes all deployment audit rows.
+- [Central–Site Communication (#5)](./Communication.md) — `CommunicationService` provides `DeployInstanceAsync`, `QueryDeploymentStateAsync`, `DeployArtifactsAsync`, `DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync`. The communication layer routes by `SiteIdentifier` (string), not DB id; `DeploymentService.ResolveSiteIdentifierAsync` resolves the numeric `SiteId` before each cross-cluster call and treats a missing site row as a hard failure.
+- [Commons (#16)](./Commons.md) — owns `DeploymentRecord`, `DeployedConfigSnapshot`, `SystemArtifactDeploymentRecord`, `DeploymentStatus`, `InstanceState`, `DeployInstanceCommand`, `DeployArtifactsCommand`, `DeploymentStateQueryRequest/Response`, `InstanceLifecycleResponse`, and the `IDeploymentManagerRepository` interface.
+- [Site Runtime (#3)](./SiteRuntime.md) — receives `DeployInstanceCommand` and `DeployArtifactsCommand` via the Communication Layer. Site-side apply is all-or-nothing per instance: the Deployment Manager singleton at the site stores the config, compiles all scripts, and creates or replaces the Instance Actor as a unit. A failure at any step is reported back with the specific error message and the previous configuration remains active.
+- [Central UI (#9)](./CentralUI.md) — engineers trigger deployments, view diffs, manage instance lifecycle, and deploy system-wide artifacts through the UI. The deployment status page subscribes to `IDeploymentStatusNotifier.StatusChanged` for real-time push updates via Blazor Server SignalR.
+- [Management Service (#18)](./ManagementService.md) — the actor-layer entry point for deployment commands received over ClusterClient. It resolves `DeploymentService` and `ArtifactDeploymentService` from a per-message DI scope and forwards `MgmtDeployArtifactsCommand`, `GetDeploymentDiffCommand`, and instance lifecycle requests.
+- [Security & Auth (#10)](./Security.md) — the Deployment role is required for all deploy and artifact operations; site-scoped permissions are enforced by the Central UI and Management Service before commands reach `DeploymentService`.
+
+## Troubleshooting
+
+### An instance is stuck InProgress after a central failover
+
+The operation lock is in-memory. On failover the new active node has no lock entry, and the deployment record remains `InProgress`. When the engineer issues a re-deploy, `TryReconcileWithSiteAsync` queries the site; if the site already applied the config the record is updated to `Success` without re-sending. If the site did not apply it, a new deployment proceeds. No manual DB edits are required in the normal failover case.
+
+### A deployment record shows Failed with "Communication failure:"
+
+The site round-trip timed out or was cancelled before a response arrived. The site may or may not have applied the config. On the next deploy attempt the reconciliation query determines the ground truth. If the query also fails (site unreachable), a new `DeployInstanceCommand` is sent; the site rejects it with "already applied" if it ran the previous one.
+
+### DeleteOrphaned audit entry
+
+The site destroyed the Instance Actor but the central DB removal failed. The instance record exists in the central DB but has no corresponding site actor. It cannot be deleted through the normal UI path (the site will reject the delete command because the instance does not exist). Reconcile by removing the central record directly via the Management API or database, referencing the `CommandId` in the audit entry.
+
+### Artifact deployment partially failed
+
+`DeployToAllSitesAsync` returns an `ArtifactDeploymentSummary` with per-site `SiteArtifactResult`. Failed sites do not block or roll back successful ones. Use `RetryForSiteAsync` when the failed site is reachable again; it re-fetches all global artifacts and re-sends to the single site.
+
+## Related Documentation
+
+- [Deployment Manager design specification](../requirements/Component-DeploymentManager.md)
+- [Template Engine](./TemplateEngine.md)
+- [Site Runtime](./SiteRuntime.md)
+- [Configuration Database](./ConfigurationDatabase.md)
+- [Central–Site Communication](./Communication.md)
+- [Commons](./Commons.md)
+- [Central UI](./CentralUI.md)
+- [Management Service](./ManagementService.md)
+- [Security & Auth](./Security.md)
@@ -0,0 +1,219 @@
+# External System Gateway
+
+The External System Gateway gives site scripts two runtime capabilities: invoking HTTP/REST APIs on named external systems, and executing SQL writes against named database connections. Both capabilities expose a dual call mode — synchronous (blocking, result returned) and cached (store-and-forward on transient failure, `TrackedOperationId` returned) — so scripts choose the right delivery guarantee per operation without knowing the underlying retry machinery.
+
+## Overview
+
+External System Gateway (#7) runs exclusively at the site. Definitions — external system endpoints with their authentication and method catalogue, and database connection strings — are authored centrally and deployed to the site's local SQLite by the Deployment Manager. The site never reaches back to the configuration database at call time; the repository resolves each definition from SQLite on the hot path.
+
+The component code lives in `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/`, with all four source files at the root:
+
+- `ExternalSystemClient.cs` — `IExternalSystemClient` implementation; `CallAsync` (synchronous) and `CachedCallAsync` (store-and-forward on transient failure), plus the `DeliverBufferedAsync` entry point consumed by the Store-and-Forward Engine during retry sweeps.
+- `DatabaseGateway.cs` — `IDatabaseGateway` implementation; `GetConnectionAsync` (ADO.NET `SqlConnection`) and `CachedWriteAsync` (S&F-buffered SQL), plus its own `DeliverBufferedAsync` for the retry path.
+- `ErrorClassifier.cs` — static helper that maps HTTP status codes and exception types to `TransientExternalSystemException` / `PermanentExternalSystemException`.
+- `ExternalSystemGatewayOptions.cs` — options class bound from `ScadaBridge:ExternalSystemGateway`.
+- `ServiceCollectionExtensions.cs` — `AddExternalSystemGateway` extension; registers `ExternalSystemClient` and `DatabaseGateway` as scoped services and applies per-system connection limits to named `HttpClient` instances.
+
+Both services are DI-scoped. Script Execution Actors (short-lived, per-invocation) resolve them; blocking I/O from both runs on a dedicated Akka.NET dispatcher to keep the default dispatcher free for coordination actors.
+
+## Key Concepts
+
+### Definitions at rest
+
+An `ExternalSystemDefinition` carries the base `EndpointUrl`, `AuthType` (`"apikey"`, `"basic"`, or `"none"`), `AuthConfiguration` (the credential payload), and per-system retry settings (`MaxRetries`, `RetryDelay`). Its child `ExternalSystemMethod` records each carry `HttpMethod`, `Path` (relative to the base URL), and JSON-serialized `ParameterDefinitions` / `ReturnDefinition`. A `DatabaseConnectionDefinition` carries an ADO.NET `ConnectionString` and its own `MaxRetries` / `RetryDelay`.
+
+Definitions are resolved from the site SQLite repository on every call via name-keyed indexed queries (`GetExternalSystemByNameAsync`, `GetDatabaseConnectionByNameAsync`) rather than a fetch-all-then-filter scan, because definitions are read on every script invocation.
+
+### Dual call modes
+
+Every API call and every database write has two modes:
+
+| Mode | API surface | Failure behaviour | Return value |
+|------|-------------|-------------------|--------------|
+| Synchronous | `ExternalSystem.Call()` / `Database.Connection()` | All failures returned to script | Response JSON / `DbConnection` |
+| Cached | `ExternalSystem.CachedCall()` / `Database.CachedWrite()` | Transient → buffered; permanent → returned | `TrackedOperationId` (on buffer) |
+
+`CachedCallAsync` and `CachedWriteAsync` attempt immediate delivery first. Only a transient failure routes to the Store-and-Forward Engine.
+
+### Error classification
+
+`ErrorClassifier` is the single authority on what counts as transient:
+
+- **HTTP status codes**: 5xx, 408 (Request Timeout), 429 (Too Many Requests) → transient. All other non-success 4xx → permanent.
+- **Exceptions**: `HttpRequestException`, `TaskCanceledException`, `TimeoutException`, `OperationCanceledException` → transient. `JsonException` during payload deserialization → permanent (a malformed payload will not become well-formed on retry, so it is parked rather than retried forever).
+
+Transient failures on `CachedCall` / `CachedWrite` are silently buffered (logged at `Debug`). Permanent failures are logged at `Warning` and returned to the calling script regardless of call mode, because a permanently-wrong request should surface immediately.
+
+## Architecture
+
+### HTTP invocation (`ExternalSystemClient`)
+
+`InvokeHttpAsync` constructs the request, applies auth, dispatches, and classifies the response. The gateway creates a named `HttpClient` per system (`ExternalSystem_{systemName}`) through `IHttpClientFactory`, with `SocketsHttpHandler.MaxConnectionsPerServer` capped by `MaxConcurrentConnectionsPerSystem`. The framework default `HttpClient.Timeout` (100 s) is deliberately overridden to `Timeout.InfiniteTimeSpan` so the gateway's own `CancellationTokenSource(DefaultHttpTimeout)` is the sole timeout source — without this, configured timeouts above 100 s would be silently clipped.
+
+Parameter routing by verb:
+- `POST`, `PUT`, `PATCH` → JSON body (`application/json`).
+- `GET`, `DELETE` → URL query string (null-valued parameters omitted; no trailing `?` when all values are null).
+
+Auth application:
+- `apikey` — `AuthConfiguration` format `"HeaderName:KeyValue"` or bare key value (default header `X-API-Key`).
+- `basic` — `AuthConfiguration` format `"username:password"`, Base64-encoded as `Authorization: Basic ...`.
+- `none` — silent no-op.
+- Missing or malformed `AuthConfiguration` for a type that requires credentials logs a `Warning` but does not abort the call.
+
+Error body embedded in script-visible messages is capped at 2 048 characters so a misbehaving endpoint cannot inflate error strings.
+
+```csharp
+// ExternalSystemClient.cs
+catch (OperationCanceledException ex) when (timeoutCts.IsCancellationRequested)
+{
+    // Our own timeout elapsed — a transient failure per the design.
+    throw ErrorClassifier.AsTransient(
+        $"Timeout calling {system.Name} after {_options.DefaultHttpTimeout.TotalSeconds:0.##}s", ex);
+}
+catch (Exception ex) when (ErrorClassifier.IsTransient(ex))
+{
+    throw ErrorClassifier.AsTransient($"Connection error to {system.Name}: {ex.Message}", ex);
+}
+```
+
+### `CachedCallAsync` — the buffered path
+
+On a transient failure, `CachedCallAsync` serializes `{SystemName, MethodName, Parameters}` as JSON and calls `StoreAndForwardService.EnqueueAsync` with `StoreAndForwardCategory.ExternalSystem`. Three details matter for correct S&F integration:
+
+- **`attemptImmediateDelivery: false`** — the HTTP attempt has already been made; passing `true` would dispatch the same request twice.
+- **`MaxRetries` / `RetryDelay` defaulting** — `ExternalSystemDefinition.MaxRetries` defaults to `0`, and the S&F engine treats a stored `0` as "no limit". A `0` is therefore passed as `null` so the engine's own bounded default applies, avoiding unbounded retry loops on unconfigured systems.
+- **`messageId: trackedOperationId`** — pins the S&F message GUID to the caller-supplied `TrackedOperationId` so the retry loop can emit per-attempt and terminal audit telemetry under the same tracking id.
+
+```csharp
+// ExternalSystemClient.cs — transient branch of CachedCallAsync
+await _storeAndForward.EnqueueAsync(
+    StoreAndForwardCategory.ExternalSystem,
+    systemName,
+    payload,
+    originInstanceName,
+    system.MaxRetries > 0 ? system.MaxRetries : null,
+    system.RetryDelay > TimeSpan.Zero ? system.RetryDelay : null,
+    attemptImmediateDelivery: false,
+    messageId: trackedOperationId?.ToString(),
+    executionId: executionId,
+    sourceScript: sourceScript,
+    parentExecutionId: parentExecutionId);
+
+return new ExternalCallResult(true, null, null, WasBuffered: true);
+```
+
+### `DeliverBufferedAsync` — S&F retry delivery
+
+The Store-and-Forward Engine calls `ExternalSystemClient.DeliverBufferedAsync` and `DatabaseGateway.DeliverBufferedAsync` during retry sweeps. Both methods:
+
+1. Deserialize the payload JSON; treat `JsonException` as permanent (return `false` → park).
+2. Re-resolve the definition by name; if gone, return `false` → park.
+3. Execute the operation. `PermanentExternalSystemException` → park. `TransientExternalSystemException` propagates → engine retries.
+
+### Database gateway (`DatabaseGateway`)
+
+`GetConnectionAsync` resolves the `DatabaseConnectionDefinition`, opens a `SqlConnection` against `ConnectionString`, and returns the open connection. The caller owns disposal. If `OpenAsync` throws (unreachable server, bad credentials), the connection is disposed before the exception propagates.
+
+`CachedWriteAsync` serializes `{ConnectionName, Sql, Parameters}` and enqueues to S&F under `StoreAndForwardCategory.CachedDbWrite`, with the same `MaxRetries` / `RetryDelay` defaulting logic as `CachedCallAsync`.
+
+During retry delivery, `JsonElement` parameter values are converted with a numeric type preference of `long` → `decimal` → `double`. This matters because a script's decimal SQL parameter is serialized as an untagged JSON number; naively casting to `double` loses precision for money and measurement values.
+
+```csharp
+// DatabaseGateway.cs — JsonElementToParameterValue
+JsonValueKind.Number => element.TryGetInt64(out var l)
+    ? l
+    : element.TryGetDecimal(out var dec)
+        ? dec
+        : element.GetDouble(),
+```
+
+## Usage
+
+Scripts interact through `IExternalSystemClient` and `IDatabaseGateway`, which the Script Runtime Context exposes as `ExternalSystem` and `Database` respectively. Scripts never construct gateway types directly.
+
+**Synchronous external system call** — blocks until the response arrives or the timeout elapses:
+
+```csharp
+// Script code (via ScriptRuntimeContext)
+var result = await ExternalSystem.Call("MES", "GetRecipe", new { RecipeId = 42 });
+if (result.Success)
+{
+    var name = result.Response.recipeName; // dynamic JSON access
+}
+```
+
+**Cached external system call** — returns immediately with a `TrackedOperationId`; the actual HTTP request is attempted once and, on transient failure, buffered for retry:
+
+```csharp
+var tracked = await ExternalSystem.CachedCall("MES", "PostProductionResult", payload);
+// tracked.WasBuffered == true when queued to S&F
+```
+
+**Synchronous database access** — caller controls the connection lifetime:
+
+```csharp
+await using var conn = await Database.Connection("HistorianDB");
+using var cmd = conn.CreateCommand();
+cmd.CommandText = "SELECT TOP 1 Value FROM dbo.Tags WHERE Name = @name";
+cmd.Parameters.AddWithValue("@name", tagName);
+var value = await cmd.ExecuteScalarAsync();
+```
+
+**Cached database write** — enqueued immediately; returns a `TrackedOperationId`:
+
+```csharp
+await Database.CachedWrite("MES_DB",
+    "INSERT INTO dbo.ProductionLog (BatchId, Qty) VALUES (@batchId, @qty)",
+    new { batchId = id, qty = quantity });
+```
+
+Call status is observable via `Tracking.Status(trackedOperationId)` — answered site-locally against the S&F tracking table, or centrally via the Site Call Audit page.
+
+## Configuration
+
+Options are bound from `ScadaBridge:ExternalSystemGateway` into `ExternalSystemGatewayOptions` by `AddExternalSystemGateway`.
+
+| Key | Default | Description |
+|-----|---------|-------------|
+| `DefaultHttpTimeout` | `00:00:30` | Per-call HTTP round-trip timeout. Applied via `CancellationTokenSource`; overrides the framework 100 s default. |
+| `MaxConcurrentConnectionsPerSystem` | `10` | `SocketsHttpHandler.MaxConnectionsPerServer` applied to each named `HttpClient` (`ExternalSystem_{name}`). Does not affect other host `HttpClient` instances. |
+
+Per-system retry settings (`MaxRetries`, `RetryDelay`) are properties of `ExternalSystemDefinition` and `DatabaseConnectionDefinition`, authored by operators in the Central UI and deployed as part of the system artifact. The gateway passes these directly to the Store-and-Forward Engine on enqueue.
+
+There is no separate configuration section for database connections — connection strings reside in `DatabaseConnectionDefinition.ConnectionString`, deployed via artifact. Pool tuning (max pool size, connection lifetime) can be embedded in the connection string itself.
+
+## Dependencies & Interactions
+
+- [Commons (#16)](./Commons.md) — owns `IExternalSystemClient`, `IDatabaseGateway`, `ExternalCallResult`, `TrackedOperationId`, `ExternalSystemDefinition`, `ExternalSystemMethod`, `DatabaseConnectionDefinition`, `IExternalSystemRepository`, and the `StoreAndForwardCategory` enum values consumed here.
+- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — receives buffered `ExternalSystem` and `CachedDbWrite` payloads from `CachedCallAsync` / `CachedWriteAsync`; drives retry sweeps by calling `DeliverBufferedAsync` on both gateway types; assigns `TrackedOperationId` tracking rows; owns the site-local operation tracking table read by `Tracking.Status()`.
+- [Configuration Database (#17)](./ConfigurationDatabase.md) — provides `IExternalSystemRepository`, implemented against the site SQLite replica. Central uses the same interface against MS SQL for definition management.
+- [Site Runtime (#3)](../requirements/Component-SiteRuntime.md) — Script Execution Actors resolve `IExternalSystemClient` and `IDatabaseGateway` from DI and expose them to script code as `ExternalSystem` and `Database`. Actors run on a dedicated blocking I/O dispatcher to isolate HTTP and SQL waits from the actor system's default dispatcher.
+- [Site Call Audit (#22)](./SiteCallAudit.md) — receives cached-call lifecycle telemetry (via the combined `CachedCallTelemetry` packet) so cached call status is observable centrally; the gateway's S&F delivery writes the tracking row that `Tracking.Status()` reads.
+- [Audit Log (#23)](./AuditLog.md) — audit rows for `ApiOutbound` and `DbOutbound` channels are emitted by the Script Runtime Context around gateway calls; gateway itself does not write audit rows directly. The `trackedOperationId`, `executionId`, and `parentExecutionId` threaded through `CachedCallAsync` / `CachedWriteAsync` keep audit rows correlated across the retry lifecycle.
+
+## Troubleshooting
+
+### A cached call is stuck retrying
+
+If the external system definition or database connection has `MaxRetries = 0` and the operator intended "no retries", the S&F engine interprets `0` as "no limit" (retry forever). The gateway normalizes `0` to `null` on enqueue so the engine's bounded default applies. Verify the definition's `MaxRetries` field is set to the intended value in the Central UI and redeployed.
+
+### Timeout is not being respected
+
+`ExternalSystemGatewayOptions.DefaultHttpTimeout` applies only when `HttpClient.Timeout` is `Timeout.InfiniteTimeSpan`. The gateway sets this explicitly on every factory-supplied client. If a custom `HttpMessageHandler` upstream resets `Timeout`, the gateway's `CancellationTokenSource(DefaultHttpTimeout)` is still the controlling token because `SendAsync` is called with the linked token, not the raw `cancellationToken`.
+
+### Auth header not sent
+
+The gateway logs a `Warning` when `AuthType` is `"apikey"` or `"basic"` but `AuthConfiguration` is empty or absent, and when `AuthType` is `"basic"` but `AuthConfiguration` has no `:` separator. Check the site log for `ApplyAuth:` warning messages. The credential value is never logged — only the system name and auth type.
+
+### A buffered call is parked immediately
+
+A `JsonException` during `DeliverBufferedAsync` payload deserialization is treated as permanent (the same malformed payload will fail every time). The message is parked rather than retried. Check the site log for `"malformed JSON payload; parking"` alongside the message GUID, then inspect the S&F store for the payload to identify the serialization issue.
+
+## Related Documentation
+
+- [External System Gateway design specification](../requirements/Component-ExternalSystemGateway.md)
+- [Store-and-Forward Engine](./StoreAndForward.md)
+- [Site Call Audit](./SiteCallAudit.md)
+- [Audit Log](./AuditLog.md)
+- [Commons](./Commons.md)
+- [Configuration Database](./ConfigurationDatabase.md)
@@ -0,0 +1,312 @@
+# Site Runtime
+
+The Site Runtime component runs the site-side actor hierarchy that executes deployed machine instances: it owns script compilation, alarm evaluation, native alarm mirroring, and the site-wide Akka stream that carries attribute value and alarm state changes to every subscriber.
+
+## Overview
+
+Site Runtime (#3) operates exclusively on site clusters. Its entry point is the `DeploymentManagerActor` cluster singleton, which re-creates the full actor hierarchy on every site startup or failover. Each deployed enabled instance gets an `InstanceActor` child; each `InstanceActor` spawns `ScriptActor` and `AlarmActor` coordinator children, plus a `NativeAlarmActor` peer for every configured native alarm source. Script invocations spawn short-lived `ScriptExecutionActor` children; alarm on-trigger invocations spawn short-lived `AlarmExecutionActor` children.
+
+The component code lives in `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/`:
+
+- `Actors/` — `DeploymentManagerActor`, `InstanceActor`, `ScriptActor`, `ScriptExecutionActor`, `AlarmActor`, `AlarmExecutionActor`, `NativeAlarmActor`, `SiteReplicationActor`.
+- `Scripts/` — `ScriptCompilationService`, `ScriptExecutionScheduler`, `SharedScriptLibrary`, `ScriptRuntimeContext`, `ScopeAccessors`, `TriggerExpressionGlobals`.
+- `Streaming/` — `SiteStreamManager` (the site-wide Akka broadcast stream).
+- `Persistence/` — `SiteStorageService` (raw SQLite via `Microsoft.Data.Sqlite`), `SiteStorageInitializer`.
+- `Repositories/` — `SiteExternalSystemRepository`, `SiteNotificationRepository`.
+- `Tracking/` — `OperationTrackingStore`, `OperationTrackingOptions`.
+
+`ServiceCollectionExtensions.AddSiteRuntime(connectionString)` registers all singletons; the `Host` calls it and wires the `DeploymentManagerActor` cluster singleton separately via `AkkaHostedService`.
+
+## Key Concepts
+
+### Cluster singleton as the single point of authority
+
+`DeploymentManagerActor` runs as an Akka.NET cluster singleton — guaranteed to be active on exactly one site node at a time. On failover, Akka restarts the singleton on the surviving node. Because all deployment commands from central are routed through the singleton, there is never a split-brain dispute over which node owns instance lifecycle: the singleton is the only actor that calls `Context.ActorOf` for `InstanceActor` children.
+
+### Staggered startup
+
+The singleton reads all deployed configurations from SQLite in `PreStart`, compiles shared scripts off-thread, and then creates `InstanceActor` children in batches. The default batch size is 20, with a 100 ms delay between batches (`StartupBatchSize` and `StartupBatchDelayMs` in `SiteRuntimeOptions`). Without staggering, 500 instances each subscribing to OPC UA tags simultaneously would produce a reconnection storm that overwhelms the OPC UA server.
+
+### Actor supervision and lifetimes
+
+| Actor | Kind | Supervises children with | On exception |
+|---|---|---|---|
+| `DeploymentManagerActor` | long-lived singleton | `OneForOneStrategy` | Resume (coord) / Stop (init failure) |
+| `InstanceActor` | long-lived per instance | `OneForOneStrategy` | Resume for all coordinator children |
+| `ScriptActor` | long-lived coordinator | `OneForOneStrategy` | Stop execution child, keep self |
+| `AlarmActor` | long-lived coordinator | `OneForOneStrategy` | Stop execution child, keep self |
+| `NativeAlarmActor` | long-lived coordinator | — | Supervised by Instance Actor (Resume) |
+| `ScriptExecutionActor` | short-lived per invocation | — | Stops itself; parent logs failure |
+| `AlarmExecutionActor` | short-lived per invocation | — | Stops itself; parent logs failure |
+
+Coordinator actors resume on exception because their in-memory state (trigger timers, last execution time, alarm level) must survive child crashes. Short-lived execution actors stop themselves on completion or exception — the coordinator remains available for the next trigger.
+
+### Dedicated script-execution dispatcher
+
+Script and alarm on-trigger bodies run on the `ScriptExecutionScheduler` (`SiteRuntime-009`): a custom `TaskScheduler` backed by a bounded set of dedicated threads (default 8, `ScriptExecutionThreadCount`). The script body is submitted to this scheduler via `Task.Factory.StartNew(..., scheduler)` inside `ScriptExecutionActor` and `AlarmExecutionActor`. Scripts that block on I/O (database connections, synchronous external system calls) block only the scheduler's threads, leaving the shared .NET thread pool and all Akka dispatchers unaffected.
+
+### Tell vs. Ask
+
+- **Tell**: tag value updates, `AttributeValueChanged` fan-out to child Script/Alarm actors, stream publishing. These are high-frequency or fire-and-forget paths.
+- **Ask**: `Instance.CallScript()` (caller needs the return value), debug view snapshots, attribute reads from the Inbound API. Ask is reserved for cross-boundary, low-frequency calls.
+
+### Attribute serialization through the Instance Actor
+
+All in-memory state mutations (attribute values, qualities, alarm states) run inside `InstanceActor`'s mailbox. Multiple `ScriptExecutionActor` instances may run concurrently but all `SetAttribute` calls serialize through the `InstanceActor` mailbox, preventing race conditions. Concurrent script executions may interleave external side effects (HTTP calls, database writes, notifications); those are independent and intentionally not serialized.
+
+## Architecture
+
+### Actor hierarchy
+
+```
+DeploymentManagerActor  (Akka.NET cluster singleton)
+└── InstanceActor  "MachineA-001"
+    ├── ScriptActor  "MonitorSpeed"           (coordinator)
+    │   └── ScriptExecutionActor             (short-lived, per invocation)
+    ├── ScriptActor  "CalculateOEE"           (coordinator)
+    │   └── ScriptExecutionActor             (short-lived)
+    ├── AlarmActor  "OverTemp"               (coordinator, computed)
+    │   └── AlarmExecutionActor             (short-lived, on-trigger)
+    ├── AlarmActor  "LowPressure"            (coordinator, computed)
+    └── NativeAlarmActor  "OpcUaServer1"     (read-only mirror, peer to AlarmActor)
+```
+
+`NativeAlarmActor` is a sibling of `AlarmActor` — a peer under the same `InstanceActor` parent. It is not a child of `AlarmActor` and has no relationship to the script engine.
+
+### Deployment flow
+
+Central sends a `DeployInstanceCommand` carrying a JSON `FlattenedConfiguration` to the site singleton. The singleton:
+
+1. Calls `EnsureDclConnections` to push any new or changed connection definitions to the DCL manager (hash-guarded: unchanged configs are skipped).
+2. Calls `CreateInstanceActor`, which does `Context.ActorOf(props, instanceName)`.
+3. Runs an off-thread `Task` that calls `SiteStorageService.StoreDeployedConfigAsync`, clears static overrides and native alarm state, and tells `SiteReplicationActor` to push to the peer node.
+4. Pipes back a `DeployPersistenceResult`; only on success does it tell the deployer `DeploymentStatus.Success`. If persistence fails, the optimistically-created actor is stopped and the error is returned to central (`SiteRuntime-005`).
+
+For redeployment (instance already running), the existing actor is stopped and watched:
+
+```csharp
+// DeploymentManagerActor.HandleDeploy
+if (_instanceActors.TryGetValue(instanceName, out var existing))
+{
+    _instanceActors.Remove(instanceName);
+    _pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
+    _terminatingActorsByName[instanceName] = existing;
+    Context.Watch(existing);
+    Context.Stop(existing);
+    return;
+}
+```
+
+The `Terminated` signal fires once the previous actor and its entire subtree have stopped (freeing the actor name), and only then does `ApplyDeployment` run for the replacement. A third deploy arriving mid-termination overwrites the buffered `PendingRedeploy` (last-write-wins) and tells the displaced sender a `Failed-superseded` response (`SiteRuntime-020`).
+
+### Instance Actor initialization
+
+On `PreStart`, `InstanceActor`:
+
+1. Pipes `SiteStorageService.GetStaticOverridesAsync` to self as a `LoadOverridesResult`. When the message arrives, persisted overrides are applied on top of the flattened-config defaults.
+2. Calls `CreateChildActors()`, which snapshots `_attributes` (the live dictionary) into `attributeSnapshot` before any child constructor runs. Each child's `Props` closure captures the immutable snapshot, not the live dictionary — preventing the race condition described in `SiteRuntime-017`.
+3. Calls `SubscribeToDcl()`, grouping data-sourced attributes by connection name and sending `SubscribeTagsRequest` to the DCL manager. Tag paths are stored in `_tagPathToAttributes`, a `Dictionary<string, List<string>>`, because one physical tag can back more than one attribute canonical name.
+
+Data-sourced attributes start with quality `Uncertain` until the first `TagValueUpdate` arrives; static attributes start with quality `Good`.
+
+### Script compilation
+
+`ScriptCompilationService.Compile(name, code)` first runs `ValidateTrustModel`, which uses Roslyn semantic analysis (not substring scanning) to detect references to forbidden namespaces (`System.IO`, `System.Diagnostics.Process`, `System.Threading` — except `Tasks`/`CancellationToken`, `System.Reflection`, `System.Net.Sockets`, `System.Net.Http`). Only after passing trust validation does it call `CSharpScript.Create<object?>` with the restricted `ScriptOptions` (references capped to `object`, `Enumerable`, `Math`, `CSharpArgumentInfo`, and `DynamicJsonElement` assemblies).
+
+```csharp
+// ScriptCompilationService.CompileCore
+var violations = ValidateTrustModel(code);
+if (violations.Count > 0)
+    return ScriptCompilationResult.Failed(violations);
+
+var script = CSharpScript.Create<object?>(
+    code,
+    BuildScriptOptions(),
+    globalsType: globalsType);
+var diagnostics = script.Compile();
+```
+
+`CompileTriggerExpression(name, expression)` follows the same path but uses `TriggerExpressionGlobals` as the globals type instead of `ScriptGlobals` — trigger expressions are read-only and have no access to the script runtime API.
+
+### Shared script library
+
+`SharedScriptLibrary` holds a `Dictionary<string, Script<object?>>` under a `lock`. It is populated at startup (off-thread by the singleton, piped back as `SharedScriptsLoaded`) and updated live when artifact deployments arrive carrying new shared scripts. Calling `Scripts.CallShared("name", params)` inside a script calls `SharedScriptLibrary.ExecuteAsync`, which runs the compiled delegate inline on the calling thread — no actor messages, no serialization.
+
+### Script Actor triggers
+
+`ScriptActor` parses the `ResolvedScript.TriggerType` and `TriggerConfiguration` into a discriminated union (`IntervalTriggerConfig`, `ValueChangeTriggerConfig`, `ConditionalTriggerConfig`, `ExpressionTriggerConfig`). Interval triggers use Akka `ITimerScheduler`. Value-change and conditional triggers react to `AttributeValueChanged` messages forwarded by the Instance Actor. Expression triggers maintain an `_attributeSnapshot` dictionary kept current by every `AttributeValueChanged`, then evaluate the compiled `_compiledTriggerExpression` synchronously with a 2-second `CancellationTokenSource` timeout.
+
+`WhileTrue` mode is handled by `HandleWhileTrueTransition`:
+
+```csharp
+// ScriptActor.HandleWhileTrueTransition
+private void HandleWhileTrueTransition(bool nowTrue, bool wasTrue)
+{
+    if (nowTrue && !wasTrue)
+    {
+        TrySpawnExecution(null);
+        StartWhileTrueTimer();
+    }
+    else if (!nowTrue && wasTrue)
+    {
+        StopWhileTrueTimer();
+    }
+}
+```
+
+On the false→true edge the script fires once and a periodic re-fire timer starts at `MinTimeBetweenRuns` cadence; on the true→false edge the timer stops.
+
+### Site-wide Akka stream
+
+`SiteStreamManager` materializes a broadcast hub in `Initialize(ActorSystem)`, called by the Host after Akka starts. The hub is fed by a `Source.ActorRef<ISiteStreamEvent>` (bounded with `OverflowStrategy.DropHead`). `InstanceActor` publishes via `_streamManager?.PublishAttributeValueChanged(changed)` and `PublishAlarmStateChanged(changed)` — both are Tell calls; they never block the actor.
+
+Each subscriber (typically a `StreamRelayActor` created by the Communication Layer's `SiteStreamGrpcServer`) gets its own materialized sub-graph with an independent `Buffer(_bufferSize, DropHead)` and a `KillSwitch`. A slow subscriber drops only its own events; it cannot stall other subscribers or the publishing Instance Actor.
+
+```csharp
+// SiteStreamManager.Subscribe
+var killSwitch = _hubSource
+    .Where(ev => ev.InstanceUniqueName == capturedInstance)
+    .Buffer(_bufferSize, OverflowStrategy.DropHead)
+    .ViaMaterialized(KillSwitches.Single<ISiteStreamEvent>(), Keep.Right)
+    .To(Sink.ForEach<ISiteStreamEvent>(ev => capturedSubscriber.Tell(ev)))
+    .Run(_materializer);
+```
+
+### Native alarm mirror
+
+`NativeAlarmActor` mirrors the condition state of one source binding — an OPC UA A&C server or MxAccess Gateway connection — without writing back to the source. Each condition is keyed by `SourceReference`.
+
+On `PreStart` it rehydrates last-known state from SQLite (`native_alarm_state` table) and immediately sends a `SubscribeAlarmsRequest` to the DCL manager. The DCL forwards this to the connection's `IAlarmSubscribableConnection` implementation.
+
+Transition handling:
+
+- `AlarmTransitionKind.Snapshot` accumulates into `_snapshotBuffer`.
+- `AlarmTransitionKind.SnapshotComplete` atomically swaps `_alarms` with `_snapshotBuffer`. Conditions absent from the snapshot emit return-to-normal events and drop from the mirror — the mechanism that reconciles state after a reconnect.
+- Live transitions (`Raise`, `Ack`, `Clear`, etc.) upsert by `SourceReference`, ignoring transitions older than the held `TransitionTime` (out-of-order protection).
+
+Retention: a condition that is both inactive and acknowledged (`!Active && Acknowledged`) is dropped from the mirror and its SQLite row deleted. If the mirror exceeds `MirroredAlarmCapPerSource` (default 1000), the oldest condition is dropped and logged. Persistence is fire-and-forget — a write failure is logged but never blocks the actor or suppresses the upward `AlarmStateChanged` emit.
+
+### Enriched `AlarmStateChanged`
+
+Both `AlarmActor` and `NativeAlarmActor` tell the `InstanceActor` an `AlarmStateChanged`. The message was extended additively so existing computed-alarm consumers continue to work unchanged:
+
+| Field | Computed alarm | Native alarm |
+|---|---|---|
+| `Kind` | `AlarmKind.Computed` | `AlarmKind.NativeOpcUa` or `NativeMxAccess` |
+| `Condition` | computed default (auto-acknowledged, `Severity = Priority`) | mirrored `AlarmConditionState` from source |
+| `SourceReference`, `AlarmTypeName`, `Category`, `Message`, `OperatorUser`, `OperatorComment`, `OriginalRaiseTime`, `CurrentValue`, `LimitValue` | empty/null | populated from source transition |
+
+`InstanceActor` stores the latest enriched event per alarm name in `_latestAlarmEvents`. The Debug View snapshot uses this map so native alarm metadata reaches the central debug view.
+
+### Local SQLite schema
+
+`SiteStorageService` owns the site database (raw `Microsoft.Data.Sqlite`, not EF Core). Tables created by `InitializeAsync`:
+
+| Table | Purpose | Reset on redeploy? |
+|---|---|---|
+| `deployed_configurations` | Persisted flattened configs (survives restart/failover) | No (replaced) |
+| `static_attribute_overrides` | Runtime attribute writes (`SetAttribute` on static attrs) | Yes — cleared by `ClearStaticOverridesAsync` |
+| `native_alarm_state` | Mirrored native alarm conditions (survives failover) | Yes — cleared by `ClearNativeAlarmsForInstanceAsync` |
+| `shared_scripts` | Shared script code from artifact deployments | No |
+| `external_systems` | External system definitions | No |
+| `database_connections` | Database connection strings | No |
+| `data_connection_definitions` | OPC UA / MxGateway endpoint definitions | No |
+| `notification_lists` | Notification list definitions | No |
+| `smtp_configurations` | SMTP configuration (from artifact deployment) | No |
+
+### Standby replication
+
+`SiteReplicationActor` runs on every site node (not a singleton). The active node's `DeploymentManagerActor` tells it `ReplicateConfigDeploy`, `ReplicateConfigRemove`, `ReplicateConfigSetEnabled`, `ReplicateArtifacts`, or `ReplicateStoreAndForward`. The replication actor tracks the peer node via Akka cluster membership events and forwards each command to `/user/site-replication` on the peer via `ActorSelection`. Replication is fire-and-forget (no ack wait per design), so a failed write to the standby is logged but does not fail the primary operation.
+
+## Usage
+
+### Lifecycle commands
+
+Central sends commands to the site `DeploymentManagerActor` singleton over the Communication Layer:
+
+| Command | Effect |
+|---|---|
+| `DeployInstanceCommand` | Create or replace the instance actor; persist config to SQLite; clear static overrides and native alarm state |
+| `DisableInstanceCommand` | Stop the instance actor; set `is_enabled = 0` in SQLite; retain config for re-enable |
+| `EnableInstanceCommand` | Create a new instance actor from the stored config |
+| `DeleteInstanceCommand` | Stop the instance actor; remove config from SQLite; store-and-forward messages are NOT cleared |
+| `DeployArtifactsCommand` | Persist shared scripts, external system definitions, database connections, notification lists, data connection definitions; recompile shared scripts; push data connections to DCL |
+
+### Script API surface
+
+Scripts run inside `ScriptExecutionActor` with a `ScriptGlobals` object as the Roslyn host object. The `Instance` global is a `ScriptRuntimeContext`. Convenience top-level aliases (`ExternalSystem`, `Database`, `Notify`, `Scripts`, `Attributes`, `Children`, `Parent`) delegate to context methods. Key calls:
+
+- `Instance.GetAttribute("name")` / `Instance.SetAttribute("name", value)` — Ask to `InstanceActor` for write, in-process for read.
+- `Instance.CallScript("scriptName", params)` — Ask from `ScriptExecutionActor` to sibling `ScriptActor`, which spawns a new `ScriptExecutionActor`.
+- `Scripts.CallShared("name", params)` — `SharedScriptLibrary.ExecuteAsync`, inline on the current scheduler thread.
+- `ExternalSystem.Call(...)` — synchronous HTTP call through `IExternalSystemClient`.
+- `ExternalSystem.CachedCall(...)` / `Database.CachedWrite(...)` — store-and-forwarded; returns a `TrackedOperationId`.
+- `Tracking.Status(id)` — reads the site-local `OperationTrackingStore` synchronously.
+- `Notify.To("list").Send(...)` — enqueues a notification in the Store-and-Forward Engine for delivery to central.
+
+Alarm on-trigger scripts run in `AlarmExecutionActor` with the same API plus an `Alarm` global (`AlarmContext` carrying `Name`, `Level`, `Priority`, `Message`).
+
+### Debug view
+
+The Communication Layer sends `SubscribeDebugViewRequest` or `DebugSnapshotRequest` to the singleton, which forwards to the named `InstanceActor`. The Instance Actor replies with a `DebugViewSnapshot` built from the current `_attributes` dictionary and `_latestAlarmEvents` map. Ongoing changes reach the central debug view via the gRPC stream, not through the actor hierarchy.
+
+## Configuration
+
+All options live in the `ScadaBridge:SiteRuntime` section, bound to `SiteRuntimeOptions`:
+
+| Key | Default | Description |
+|---|---|---|
+| `StartupBatchSize` | `20` | Instance Actors created per batch during staggered startup |
+| `StartupBatchDelayMs` | `100` | Milliseconds between startup batches |
+| `MaxScriptCallDepth` | `10` | Maximum `Instance.CallScript` / `Scripts.CallShared` recursion depth |
+| `ScriptExecutionTimeoutSeconds` | `30` | Per-script body execution timeout; exceeding it cancels and logs an error |
+| `StreamBufferSize` | `1000` | Per-subscriber drop-oldest buffer size for the Akka broadcast stream |
+| `ScriptExecutionThreadCount` | `8` | Dedicated threads in the `ScriptExecutionScheduler` (covers both scripts and alarm on-trigger bodies) |
+| `MirroredAlarmCapPerSource` | `1000` | Maximum mirrored conditions per `NativeAlarmActor` source binding before oldest is dropped and logged |
+| `NativeAlarmRetryIntervalMs` | `5000` | Milliseconds before retrying a failed native alarm subscription |
+
+The SQLite connection string is passed directly to `AddSiteRuntime(connectionString)` in the host composition root and is not part of `SiteRuntimeOptions`.
+
+## Dependencies & Interactions
+
+- [Data Connection Layer (#4)](./DataConnectionLayer.md) — supplies `TagValueUpdate` and `ConnectionQualityChanged` messages to `InstanceActor`; receives `SubscribeTagsRequest` and `WriteTagRequest`. Also supplies `NativeAlarmTransitionUpdate` and `NativeAlarmSourceUnavailable` to `NativeAlarmActor` via `SubscribeAlarmsRequest` (connections implementing `IAlarmSubscribableConnection`).
+- [Central–Site Communication (#5)](./Communication.md) — routes `DeployInstanceCommand`, `DisableInstanceCommand`, `EnableInstanceCommand`, `DeleteInstanceCommand`, `DeployArtifactsCommand`, debug view requests, and Inbound API `RouteToCallRequest` / `RouteToGetAttributesRequest` / `RouteToSetAttributesRequest` to the singleton; receives `DeploymentStatusResponse` and `ArtifactDeploymentResponse` back. The `SiteStreamManager` implements `ISiteStreamSubscriber` so the Communication Layer's `SiteStreamGrpcServer` can subscribe `StreamRelayActor` instances to the broadcast hub.
+- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — `ScriptRuntimeContext` passes `StoreAndForwardService` (resolved from DI inside `ScriptExecutionActor`) for `ExternalSystem.CachedCall`, `Database.CachedWrite`, and `Notify.To().Send()`. Owns the site-local operation tracking table that `Tracking.Status(id)` reads.
+- [External System Gateway (#7)](./ExternalSystemGateway.md) — `IExternalSystemClient` called by `ScriptRuntimeContext.ExternalSystemHelper` for synchronous and cached external system calls.
+- [Site Event Logging (#12)](./SiteEventLogging.md) — `ISiteEventLogger` (resolved from DI inside execution actors) receives script error, alarm error, and script execution events.
+- [Health Monitoring (#11)](./HealthMonitoring.md) — `ISiteHealthCollector` (injected into `DeploymentManagerActor`, `InstanceActor`, `ScriptActor`, `AlarmActor`) tracks instance counts (`SetInstanceCounts`), script errors (`IncrementScriptError`), and alarm errors (`IncrementAlarmError`); sets `SetActiveNode` in `DeploymentManagerActor.PreStart`/`PostStop` so the health report reflects which node holds the singleton.
+- [Audit Log (#23)](./AuditLog.md) — `IAuditWriter` (resolved from DI inside `ScriptExecutionActor`) receives one row per script-trust-boundary call; audit writes are best-effort and never abort the calling script.
+- [Commons (#16)](./Commons.md) — owns all message contracts (`DeployInstanceCommand`, `AttributeValueChanged`, `AlarmStateChanged`, `ScriptCallRequest`, `NativeAlarmTransitionUpdate`, etc.), the `FlattenedConfiguration` / `ResolvedScript` / `ResolvedAlarm` / `ResolvedNativeAlarmSource` types, and the `AlarmKind` / `AlarmState` / `AlarmLevel` / `AlarmConditionState` / `AlarmTransitionKind` enums.
+- Local SQLite — `SiteStorageService` owns the site database. Peer SQLite stores (Store-and-Forward buffer, AuditLog, operation tracking, site event log) are owned by their respective components but share the same SQLite file path convention.
+- Design spec: [Component-SiteRuntime.md](../requirements/Component-SiteRuntime.md).
+
+## Troubleshooting
+
+### Instance stays in `Unknown` or `Failed` after deployment
+
+`DeploymentManagerActor` only tells central `DeploymentStatus.Success` after `SiteStorageService.StoreDeployedConfigAsync` commits. If the site SQLite file is locked, full, or on a read-only volume, the persistence task throws, the optimistically-created actor is stopped, and central receives `Failed`. Check the site event log for `Failed to persist deployment` entries. The site SQLite path is configured in the host `appsettings.json`.
+
+### Reconnection storm on failover
+
+If many instances race to subscribe to OPC UA at once the server may throttle or drop connections. Increase `StartupBatchDelayMs` or decrease `StartupBatchSize` in `SiteRuntimeOptions`. The current defaults (batch 20, delay 100 ms) mean a site with 200 instances takes 1 second to start all subscriptions, which is acceptable for most servers.
+
+### Script execution actor backpressure
+
+The `ScriptExecutionScheduler` has a fixed thread count (`ScriptExecutionThreadCount`, default 8). If all threads are blocked (a burst of scripts each waiting on a slow database or external system), new script invocations queue behind them. The queue is unbounded — memory usage can grow during a backlog. If this is observed, raise `ScriptExecutionThreadCount` or reduce the number of concurrent long-running scripts. Script execution timeout (`ScriptExecutionTimeoutSeconds`) bounds the worst case.
+
+### Native alarm conditions not recovering after reconnect
+
+`NativeAlarmActor` retains last-known conditions during a source outage (it does not clear them) and reconciles state via the reconnect snapshot swap. If the snapshot never arrives (the DCL connection was cleanly unsubscribed rather than failing), the actor may hold stale Active conditions indefinitely. A redeploy of the instance clears `native_alarm_state` in SQLite and forces fresh subscription. Failed subscription retries are logged at `Warning` level with the retry interval.
+
+### `InvalidActorNameException` on rapid redeployment
+
+If two `DeployInstanceCommand` messages arrive for the same instance while the first redeployment is still terminating, the `_terminatingActorsByName` shadow index in `DeploymentManagerActor` detects the collision and buffers the second command (`SiteRuntime-020`). The displaced deploy receives a `Failed-superseded` response. This is expected behaviour — central should observe the `Failed` response and retry when the site is ready.
+
+## Related Documentation
+
+- [Site Runtime design specification](../requirements/Component-SiteRuntime.md)
+- [Central–Site Communication](./Communication.md)
+- [Commons](./Commons.md)
+- [Host](./Host.md)
+- [Audit Log](./AuditLog.md)
+- [Cluster Infrastructure](./ClusterInfrastructure.md)
@@ -0,0 +1,310 @@
+# Store-and-Forward Engine
+
+The Store-and-Forward Engine buffers site-originated outbound messages when a target system or the central cluster is unreachable, retries them on a fixed interval, parks those that exhaust their retry budget, and persists the buffer in a local SQLite database that is asynchronously replicated to the standby node for failover continuity.
+
+## Overview
+
+The Store-and-Forward Engine (#6) is a site-only component. The central cluster has no equivalent buffer; it uses the Notification Outbox (#21) instead for its own queued delivery work. Every site node runs one `StoreAndForwardService` instance, backed by a `StoreAndForwardStorage` SQLite store and an optional `ReplicationService` that fans each buffer mutation to the standby.
+
+The component code lives in `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/`:
+
+- `StoreAndForwardService` — the core buffer: enqueue, retry sweep, park/retry/discard, and the `ICachedCallLifecycleObserver` audit hook.
+- `StoreAndForwardStorage` — the SQLite layer; all reads and writes against `sf_messages`.
+- `ReplicationService` — fire-and-forget buffer replication to the standby.
+- `ParkedMessageHandlerActor` — Akka actor bridge that exposes parked-message query/retry/discard to the `SiteCommunicationActor`.
+- `NotificationForwarder` — the delivery handler for the `Notification` category; forwards buffered notifications to central via the ClusterClient transport and interprets the ack.
+- `StoreAndForwardOptions` — options class bound from the `StoreAndForward` configuration section.
+- `IStoreAndForwardSiteContext` — narrow interface through which the Host supplies the site identifier without creating a project-reference cycle with Health Monitoring.
+
+DI registration is via `ServiceCollectionExtensions.AddStoreAndForward`. Actor bindings (`AddStoreAndForwardActors`) are a separate call resolved during Akka startup in the Host.
+
+The operation tracking table that backs `Tracking.Status(id)` is **not** owned by this component; its implementation (`OperationTrackingStore`, `OperationTrackingOptions`) lives in `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Tracking/`. The engine carries the `TrackedOperationId` linking a buffered message to its tracking row and drives tracking updates through the `ICachedCallLifecycleObserver` hook. The tracking table is documented here because its lifecycle is coupled to the S&F retry loop.
+
+## Key Concepts
+
+### Three message categories
+
+`StoreAndForwardCategory` has three values, each serviced by its own registered delivery handler:
+
+| Category | Delivery target | Tracked? |
+|---|---|---|
+| `ExternalSystem` | External system API (HTTP) | Yes — `TrackedOperationId` |
+| `Notification` | Central cluster (`NotificationForwarder`) | No — central `Notifications` table |
+| `CachedDbWrite` | Database connection | Yes — `TrackedOperationId` |
+
+Only `ExternalSystem` and `CachedDbWrite` generate cached-call audit telemetry through the `ICachedCallLifecycleObserver` hook. `Notification` has its own central-side audit pipeline (Notification Outbox / Audit Log) and is explicitly excluded from that hook.
+
+### Transient vs. permanent failures
+
+Only transient failures are buffered. The delivery handler contract is:
+
+- Returns `true` — delivered. The message is removed from the buffer (or, on the immediate path, never buffered).
+- Returns `false` — permanent failure. The message is not buffered on the immediate path; on a retry the row is parked immediately.
+- Throws — transient failure. On the immediate path the message is buffered for retry; on a retry the retry count is incremented and the row is parked once `MaxRetries` is reached.
+
+A permanent failure for a cached-call category additionally writes a terminal `Failed` row to the operation tracking table via the observer hook. The error is returned synchronously to the calling script; no buffer row is created for a permanent failure.
+
+### Fixed retry interval and no max buffer size
+
+The retry interval is fixed — not exponential. There is no maximum buffer size; messages accumulate until delivery succeeds or the retry budget is exhausted. The default interval is 30 seconds and the background sweep fires every 10 seconds (checking which rows are due via the `last_attempt_at` predicate). Both are configurable.
+
+### Retry budget and parking
+
+`StoreAndForwardMessage.MaxRetries` controls how many background-sweep attempts the engine makes before parking. `MaxRetries = 0` means **no limit** — the message retries on every sweep until delivered and is never parked for retry exhaustion. It is not a "never retry" value; callers that want unbounded retry pass `maxRetries: 0` explicitly. The `EnqueueAsync` `maxRetries` parameter defaults to `StoreAndForwardOptions.DefaultMaxRetries` (50).
+
+### Messages not cleared on instance deletion
+
+When an instance is deleted, its buffered S&F messages are not removed. `StoreAndForwardMessage.OriginInstanceName` records the originating instance at enqueue time so the buffer can continue to deliver and so the central UI can attribute parked messages even after the instance is gone.
+
+### CachedCall idempotency is the caller's responsibility
+
+`StoreAndForwardService` does not deduplicate. If the same message is enqueued twice it is delivered twice. Callers using `ExternalSystem.CachedCall()` or `Database.CachedWrite()` must design payloads to be idempotent, for example by including unique request IDs and relying on the remote end to handle duplicates.
+
+## Architecture
+
+### Buffer storage — `sf_messages`
+
+`StoreAndForwardStorage.InitializeAsync` creates the `sf_messages` table and its indexes:
+
+```sql
+CREATE TABLE IF NOT EXISTS sf_messages (
+    id                TEXT PRIMARY KEY,
+    category          INTEGER NOT NULL,
+    target            TEXT    NOT NULL,
+    payload_json      TEXT    NOT NULL,
+    retry_count       INTEGER NOT NULL DEFAULT 0,
+    max_retries       INTEGER NOT NULL DEFAULT 50,
+    retry_interval_ms INTEGER NOT NULL DEFAULT 30000,
+    created_at        TEXT    NOT NULL,
+    last_attempt_at   TEXT,
+    status            INTEGER NOT NULL DEFAULT 0,
+    last_error        TEXT,
+    origin_instance   TEXT
+);
+
+CREATE INDEX IF NOT EXISTS idx_sf_messages_status ON sf_messages(status);
+CREATE INDEX IF NOT EXISTS idx_sf_messages_category ON sf_messages(category);
+```
+
+Three nullable columns (`execution_id`, `source_script`, `parent_execution_id`) were added by additive migrations after initial rollout. SQLite lacks `ADD COLUMN IF NOT EXISTS`, so each column is probed via `PRAGMA table_info` before the `ALTER TABLE` is issued — making `InitializeAsync` idempotent.
+
+`StoreAndForwardStorage` opens a fresh `SqliteConnection` per call and relies on the Microsoft.Data.Sqlite connection pool (keyed on the connection string) for acceptable performance on the retry sweep. If a pooled-open ever becomes a bottleneck the remedy is a batched sweep API that opens one connection per sweep.
+
+Status values from `StoreAndForwardMessageStatus`: `Pending` (0), `InFlight` (1), `Parked` (2), `Delivered` (3). The retry sweep loads only `Pending` rows whose `last_attempt_at` is older than `retry_interval_ms`.
+
+### Retry sweep
+
+`StoreAndForwardService.RetryPendingMessagesAsync` is the background sweep, fired by `_retryTimer` on `RetryTimerInterval` (default 10 s). An `Interlocked` flag prevents overlapping sweeps. `StopAsync` stops the timer, then awaits any in-flight sweep up to `SweepShutdownWaitTimeout` (10 s) before returning so the host can safely dispose `_storage` and `_replication`.
+
+Each `RetryMessageAsync` call invokes the registered delivery handler for the message's category. A conditional `UpdateMessageIfStatusAsync` is used for every state-changing write so a concurrent operator action (retry, discard) is not silently overwritten by the sweep:
+
+```csharp
+// Transient failure — increment retry, check budget.
+message.RetryCount++;
+message.LastAttemptAt = DateTimeOffset.UtcNow;
+message.LastError = ex.Message;
+
+if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)
+{
+    message.Status = StoreAndForwardMessageStatus.Parked;
+    var parked = await _storage.UpdateMessageIfStatusAsync(
+        message, StoreAndForwardMessageStatus.Pending);
+    if (!parked) return; // operator action won the race
+    Interlocked.Decrement(ref _bufferedCount);
+    _replication?.ReplicatePark(message);
+    // … observer notification …
+}
+else
+{
+    if (!await _storage.UpdateMessageIfStatusAsync(
+            message, StoreAndForwardMessageStatus.Pending))
+        return; // operator action won the race
+    // … observer notification (TransientFailure) …
+}
+```
+
+### Queue-depth gauge
+
+`StoreAndForwardService` maintains a `long _bufferedCount` in-process gauge seeded from a `COUNT(*)` at startup. `BufferAsync` increments it; successful delivery and `Pending→Parked` transitions decrement it; operator requeue (`Parked→Pending`) increments it. `ScadaBridgeTelemetry.SetQueueDepthProvider` registers a sync, non-blocking read callback so the OpenTelemetry/Prometheus collector never needs to run an async query. The gauge is approximate: it is eventually consistent with the store, and standby replication applies to the standby's own counter separately.
+
+### Async replication to standby
+
+`ReplicationService` wraps each buffer mutation — add, remove, park, requeue — in a `Task.Run` fire-and-forget. The active node does not wait for standby acknowledgment. The standby applies each `ReplicationOperation` via `ApplyReplicatedOperationAsync`, which calls the same `StoreAndForwardStorage` methods. Replication failures are logged at Debug and discarded; the standby may be slightly behind the active at any moment, producing at-most a few duplicate deliveries or missed retries after a failover — an accepted trade-off for zero added latency on the enqueue path.
+
+The four `ReplicationOperationType` values are `Add`, `Remove`, `Park`, and `Requeue` (requeue was added to cover the operator-initiated `Parked→Pending` transition so the standby preserves retry intent after failover).
+
+### Notification delivery
+
+`NotificationForwarder` is the delivery handler for `StoreAndForwardCategory.Notification`. It deserializes the buffered `PayloadJson` as a `NotificationSubmit`, re-stamps `SourceSiteId` and `SourceInstanceId` from the forwarder's own context (the site is authoritative for these), and sends the submit to the `SiteCommunicationActor` via Akka's `Ask` with a configurable timeout. A `NotificationSubmitAck` with `Accepted = true` returns `true`; any other ack or a timeout throws `NotificationForwardException`, which the engine treats as transient. A payload that cannot be deserialized is logged at Warning and discarded (returns `true`) rather than parked — a corrupt payload cannot be fixed by retrying.
+
+### Parked message management
+
+`ParkedMessageHandlerActor` is the Akka bridge between `SiteCommunicationActor` and `StoreAndForwardService`. It handles five message types from central:
+
+| Message | Action |
+|---|---|
+| `ParkedMessageQueryRequest` | Paginated list of parked rows, all categories |
+| `ParkedMessageRetryRequest` | Move a parked row back to `Pending` |
+| `ParkedMessageDiscardRequest` | Delete a parked row |
+| `RetryParkedOperation` | Retry a parked cached call (keyed by `TrackedOperationId`) |
+| `DiscardParkedOperation` | Discard a parked cached call (keyed by `TrackedOperationId`) |
+
+All five use `PipeTo` for idiomatic Akka async reply. `RetryParkedMessageAsync` resets `retry_count = 0` and `last_attempt_at = NULL` so the requeued message is due on the next sweep, and replicates a `Requeue` operation to the standby. `DiscardParkedMessageAsync` deletes the row and replicates a `Remove`.
+
+### Operation tracking table
+
+The operation tracking table (`OperationTracking`) is a SQLite table in `site-tracking.db`, owned by `OperationTrackingStore` in `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Tracking/`. Its schema:
+
+```sql
+CREATE TABLE IF NOT EXISTS OperationTracking (
+    TrackedOperationId TEXT    NOT NULL PRIMARY KEY,
+    Kind               TEXT    NOT NULL,
+    TargetSummary      TEXT    NULL,
+    Status             TEXT    NOT NULL,
+    RetryCount         INTEGER NOT NULL DEFAULT 0,
+    LastError          TEXT    NULL,
+    HttpStatus         INTEGER NULL,
+    CreatedAtUtc       TEXT    NOT NULL,
+    UpdatedAtUtc       TEXT    NOT NULL,
+    TerminalAtUtc      TEXT    NULL,
+    SourceInstanceId   TEXT    NULL,
+    SourceScript       TEXT    NULL,
+    SourceNode         TEXT    NULL
+);
+CREATE INDEX IF NOT EXISTS IX_OperationTracking_Status_Updated
+    ON OperationTracking (Status, UpdatedAtUtc);
+```
+
+One row per `TrackedOperationId`; lifecycle `Submitted → Retrying → Delivered / Parked / Failed / Discarded`. Writes are serialised through a `SemaphoreSlim` on a single owned `SqliteConnection`. Reads open a fresh connection to avoid blocking status queries behind in-flight writes.
+
+`Tracking.Status(id)` reads this table **site-locally and authoritatively** — the answer never round-trips to central, even when central is unreachable. Terminal rows are purged after `OperationTrackingOptions.RetentionDays` (default 7 days). The `PurgeTerminalAsync` call only removes rows where `TerminalAtUtc IS NOT NULL` and `TerminalAtUtc < threshold`; non-terminal (in-flight) rows are never purged.
+
+The S&F engine connects to this table only through the `ICachedCallLifecycleObserver` hook, not directly. `OperationTrackingStore` is wired in `SiteRuntime` and injected into the observer implementation; `StoreAndForward` carries the `TrackedOperationId` on the buffered message and passes it to the observer on each attempt.
+
+### Cached-call observer hook
+
+`ICachedCallLifecycleObserver.OnAttemptCompletedAsync` is called by the retry sweep after every `ExternalSystem` or `CachedDbWrite` delivery attempt with a `CachedCallAttemptContext` record:
+
+```csharp
+context = new CachedCallAttemptContext(
+    TrackedOperationId: trackedId,
+    Channel: channel,           // "ApiOutbound" or "DbOutbound"
+    Target: message.Target,
+    SourceSite: _siteId,
+    Outcome: outcome,           // Delivered / TransientFailure / PermanentFailure / ParkedMaxRetries
+    RetryCount: message.RetryCount,
+    LastError: lastError,
+    HttpStatus: httpStatus,
+    CreatedAtUtc: message.CreatedAt.UtcDateTime,
+    OccurredAtUtc: attemptStartUtc,
+    DurationMs: (int)attemptStopwatch.ElapsedMilliseconds,
+    SourceInstanceId: message.OriginInstanceName,
+    ExecutionId: message.ExecutionId,
+    SourceScript: message.SourceScript,
+    ParentExecutionId: message.ParentExecutionId);
+```
+
+The observer implementation (in `ZB.MOM.WW.ScadaBridge.AuditLog`) maps the outcome to `OperationTrackingStore` writes and builds the `CachedCallTelemetry` packet for the central Site Call Audit component. Observer failures are swallowed — a failing audit pipeline must never corrupt S&F retry bookkeeping or be misclassified as a transient delivery failure.
+
+The `_siteId` stamped onto every context is sourced from the optional `IStoreAndForwardSiteContext` binding resolved at construction time. A null or whitespace site id is normalised to `UnknownSiteSentinel` (`$unknown-site`) so a misconfigured host produces a distinctive marker in the central audit log rather than silently merging multiple sites into an empty-string bucket.
+
+## Usage
+
+### Registering the service
+
+```csharp
+// In the Host composition root (site node only):
+services.AddStoreAndForward();
+services.AddStoreAndForwardActors();
+services.Configure<StoreAndForwardOptions>(
+    configuration.GetSection("StoreAndForward"));
+```
+
+### Enqueueing a message
+
+```csharp
+public async Task<StoreAndForwardResult> EnqueueAsync(
+    StoreAndForwardCategory category,
+    string target,
+    string payloadJson,
+    string? originInstanceName = null,
+    int? maxRetries = null,
+    TimeSpan? retryInterval = null,
+    bool attemptImmediateDelivery = true,
+    string? messageId = null,
+    Guid? executionId = null,
+    string? sourceScript = null,
+    Guid? parentExecutionId = null)
+```
+
+Pass `attemptImmediateDelivery: false` when the caller has already attempted delivery itself — the message is placed directly into the buffer for the background sweep without invoking the handler again. The Notification Outbox uses the `messageId` overload to pin the script-generated `NotificationId` as the buffer row's id (the single idempotency key from script through central ingest).
+
+`StoreAndForwardResult` carries `Accepted` (true if delivered or buffered), `MessageId`, and `WasBuffered`.
+
+### Registering a delivery handler
+
+```csharp
+_storeAndForwardService.RegisterDeliveryHandler(
+    StoreAndForwardCategory.ExternalSystem,
+    async message => await _externalSystemGateway.DeliverAsync(message));
+```
+
+Handlers are registered by the component that owns the delivery channel (External System Gateway, database adapter, `NotificationForwarder`) during startup before `StartAsync` is called.
+
+## Configuration
+
+Options class: `StoreAndForwardOptions`, bound from the `StoreAndForward` configuration section.
+
+| Key | Default | Description |
+|---|---|---|
+| `SqliteDbPath` | `./data/store-and-forward.db` | Path to the SQLite buffer database. The directory is created on startup if absent. |
+| `ReplicationEnabled` | `true` | Whether to replicate buffer operations to the standby node. |
+| `DefaultRetryInterval` | `00:00:30` | Fixed retry interval applied when `EnqueueAsync` is called without an explicit `retryInterval`. |
+| `DefaultMaxRetries` | `50` | Max background-sweep attempts before parking. Applied when `EnqueueAsync` is called without an explicit `maxRetries`. `0` = no limit. |
+| `RetryTimerInterval` | `00:00:10` | Cadence of the background retry sweep timer. |
+
+Operation tracking options live separately under `OperationTrackingOptions` (bound in Site Runtime):
+
+| Key | Default | Description |
+|---|---|---|
+| `ConnectionString` | `Data Source=site-tracking.db` | ADO.NET connection string for the tracking SQLite database. |
+| `RetentionDays` | `7` | Terminal rows older than this many days are deleted by the nightly purge. |
+
+## Dependencies & Interactions
+
+- [Commons (#16)](./Commons.md) — owns `StoreAndForwardCategory`, `StoreAndForwardMessageStatus`, `TrackedOperationId`, `TrackingStatusSnapshot`, `ICachedCallLifecycleObserver` / `CachedCallAttemptContext` / `CachedCallAttemptOutcome`, `IOperationTrackingStore`, and the `RemoteQuery` message contracts (`ParkedMessageQueryRequest/Response`, `ParkedMessageRetryRequest/Response`, `ParkedMessageDiscardRequest/Response`, `RetryParkedOperation`, `DiscardParkedOperation`, `ParkedOperationActionAck`).
+- [Central–Site Communication (#5)](./Communication.md) — carries `ParkedMessageQueryRequest/Response` and operator Retry/Discard commands between the central UI and `ParkedMessageHandlerActor`. Also carries buffered notifications (`NotificationSubmit` / `NotificationSubmitAck`) from `NotificationForwarder` to the Notification Outbox, and `CachedCallTelemetry` from the observer implementation to Site Call Audit.
+- [Notification Outbox (#21)](./NotificationOutbox.md) — the central destination for the `Notification` category. Central ingests each forwarded `NotificationSubmit` into the `Notifications` table and replies with `NotificationSubmitAck`; on `Accepted = true` the engine clears the buffered row. The S&F engine is the site half of the outbox handoff.
+- [Site Call Audit (#22)](./SiteCallAudit.md) — the central mirror for cached-call status. Receives `CachedCallTelemetry` (audit rows + operational tracking snapshot) emitted by the observer on each S&F attempt outcome. Relays `RetryParkedOperation` / `DiscardParkedOperation` commands to the site when an operator acts on a parked cached call via the central UI.
+- [Audit Log (#23)](./AuditLog.md) — the observer implementation (`ICachedCallLifecycleObserver`) lives in the Audit Log component. It maps `CachedCallAttemptContext` onto `AuditLog` rows and drives the `CachedCallTelemetry` packet to central.
+- [Site Runtime (#3)](../requirements/Component-SiteRuntime.md) — owns the `OperationTrackingStore` and `OperationTrackingOptions` that back `Tracking.Status(id)`. Script Actors submit messages to `StoreAndForwardService.EnqueueAsync` on the buffered-call paths.
+- [Health Monitoring (#11)](../requirements/Component-HealthMonitoring.md) — `ScadaBridgeTelemetry.SetQueueDepthProvider` registers the `_bufferedCount` gauge read by the OpenTelemetry/Prometheus collector. The `scadabridge.store_and_forward.queue.depth` gauge surfaces on the site health report.
+- [Site Event Logging (#12)](../requirements/Component-SiteEventLogging.md) — the `OnActivity` event on `StoreAndForwardService` posts activity strings (Queued, Delivered, Retried, Parked, Retry, Discard) to the site event log.
+- Design spec: [Component-StoreAndForward.md](../requirements/Component-StoreAndForward.md).
+
+## Troubleshooting
+
+### A message stays in the Pending queue and is never delivered
+
+The retry sweep only picks up rows where `status = Pending AND (last_attempt_at IS NULL OR elapsed >= retry_interval_ms)`. If a row never appears in the sweep output, check that the delivery handler for the category is registered before `StartAsync` is called. A missing handler causes the sweep to log a Warning at category level and skip the row; the row stays `Pending` indefinitely rather than being parked.
+
+### A parked cached call does not respond to Retry from the central UI
+
+`RetryParkedOperation` and `DiscardParkedOperation` are keyed by `TrackedOperationId`, which is the S&F buffer message's `Id`. The buffer row's `Id` is the GUID string of the `TrackedOperationId` in `"N"` (no-hyphens) format for engine-minted ids, or `"D"` (hyphenated) format when the caller supplies one. `TrackedOperationId.TryParse` accepts both; confirm that the id in the command matches the stored row id.
+
+### Standby has duplicate or stale rows after failover
+
+Replication is best-effort and fire-and-forget. A message delivered just before failover may still appear in the standby's buffer (the `Remove` replication did not arrive in time) and will be re-delivered. A message buffered just before failover may not appear (the `Add` replication did not arrive in time) and will be silently skipped. Both are accepted trade-offs; the expected rate is a handful of events per failover, not a systematic backlog.
+
+### `$unknown-site` appears in central audit rows for a site's cached calls
+
+`StoreAndForwardService` was constructed without an `IStoreAndForwardSiteContext` binding, so the site id could not be resolved. Ensure the Host calls `services.AddSingleton<IStoreAndForwardSiteContext>(…)` with an adapter that forwards to the same `NodeOptions.SiteId` read by `ISiteIdentityProvider`.
+
+## Related Documentation
+
+- [Store-and-Forward design specification](../requirements/Component-StoreAndForward.md)
+- [Notification Outbox](./NotificationOutbox.md)
+- [Site Call Audit](./SiteCallAudit.md)
+- [Audit Log](./AuditLog.md)
+- [Central–Site Communication](./Communication.md)
+- [Commons](./Commons.md)
@@ -0,0 +1,285 @@
+# Template Engine
+
+The Template Engine models the machine blueprints — templates — from which all deployed instances are created. It enforces inheritance, composition, locking, naming rules, and acyclicity at authoring time, then flattens the resulting graph plus instance overrides into a concrete, revision-hashed `FlattenedConfiguration` that the Deployment Manager sends to sites.
+
+## Overview
+
+Template Engine (#1) runs on the central cluster only. Sites receive flattened output and have no awareness of template structure. The component code lives in `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/`, organized as follows:
+
+- Root — `TemplateService`, `SharedScriptService`, `TemplateResolver`, `CycleDetector`, `CollisionDetector`, `LockEnforcer`, `TemplateNaming` — core authoring operations and graph invariant enforcement.
+- `Flattening/` — `FlatteningService`, `RevisionHashService`, `DiffService` — produce and compare the deployment-ready representation.
+- `Validation/` — `ValidationService`, `SemanticValidator`, `ScriptCompiler`, `CSharpDelimiterScanner` — pre-deployment and on-demand correctness checks.
+- `Services/` — `InstanceService`, `SiteService`, `AreaService`, `TemplateFolderService`, `TemplateDeletionService` — domain operations that depend on the scoped `ITemplateEngineRepository`.
+
+The single DI entry point is `ServiceCollectionExtensions.AddTemplateEngine`. `TemplateService` and `SharedScriptService` are scoped; the flattening and validation utilities are transient; static helpers (`CycleDetector`, `CollisionDetector`, `LockEnforcer`, `TemplateResolver`) are not registered.
+
+## Key Concepts
+
+### Template graph
+
+The full set of templates forms a directed graph with two independent edge types:
+
+- **Inheritance** — `Template.ParentTemplateId` (nullable `int?`). A null value means no parent; a non-null value sets the defining ancestor. The parent is set at creation time and is immutable thereafter; `UpdateTemplateAsync` rejects any attempt to change it.
+- **Composition** — `TemplateComposition` rows, each pointing from an owner `TemplateId` to a `ComposedTemplateId` with a slot name (`InstanceName`). Only base (non-derived) templates may be composed.
+
+Both edge types are enforced acyclic on every mutating call. `CycleDetector` provides three checks:
+
+- `DetectInheritanceCycle` — walks the proposed parent chain upward looking for the template being modified.
+- `DetectCompositionCycle` — BFS from the proposed composed template through its own compositions.
+- `DetectCrossGraphCycle` — BFS across both inheritance and composition edges simultaneously, catching cycles that neither pure check alone would find.
+
+All three run before any composition is written. Because the graph can contain not-yet-saved templates (Id = 0), `CycleDetector.BuildLookup` uses `TryAdd` rather than `ToDictionary` so duplicate Ids do not throw.
+
+### Derived templates and the composition model
+
+When `AddCompositionAsync` composes template B into template A under slot name `Pump`, the engine calls `CreateCascadedCompositionAsync`, which:
+
+1. Creates a derived `Template` (`IsDerived = true`, `ParentTemplateId = B.Id`, `Name = "Pump"`) as the slot-owned backing record.
+2. Copies B's attributes, alarms, and scripts onto the derived template with `IsInherited = true`.
+3. Creates a `TemplateComposition` row linking A to the derived template.
+4. Sets `derived.OwnerCompositionId` so the slot can be deleted as a unit.
+5. Recurses into B's own compositions to replicate them under the new derived template.
+
+Derived templates are hidden from the main template tree and cannot be directly composed or deleted by name — removal goes through `DeleteCompositionAsync`, which calls `CascadeDeleteDerivedAsync` to tear the whole subtree down.
+
+### Canonical naming and path qualification
+
+Members of composed modules are addressed by **path-qualified canonical names**: `[ModuleInstanceName].[MemberName]`. For deeper nesting the dot chain extends: `Pump.Motor.Speed`. Direct members of the owning template carry no prefix.
+
+`TemplateResolver.ResolveAllMembers` builds a `Dictionary<string, ResolvedTemplateMember>` in inheritance-chain order (root first) then adds composed-module members with their prefix. The last writer on a given canonical name wins — child overrides shadow parent definitions. `FindMemberByCanonicalName` is the entry point for lock checks during `UpdateAttribute/Alarm/ScriptAsync`.
+
+`TemplateNaming.QualifiedName` computes the full dotted path of a derived template at read time by walking the `OwnerCompositionId` chain — the derived template stores only its contained name (`InstanceName`), so the full path is never stored and cannot drift.
+
+### Locking and override granularity
+
+`LockEnforcer` enforces three classes of rules for attributes, alarms, and scripts:
+
+| Rule | Mechanism |
+|------|-----------|
+| `IsLocked` = true blocks all downstream overrides | `ValidateLockChange`: once set, cannot be cleared (one-way ratchet). |
+| `LockedInDerived` = true blocks derived-template overrides of that specific member | `ValidateLockedInDerivedChange`: also a one-way ratchet — cannot be cleared on a base template. |
+| Fixed fields cannot change at any level | `ValidateAttributeOverride` / `ValidateAlarmOverride` / `ValidateScriptOverride`. |
+
+**Attribute fixed fields**: `DataType`, `DataSourceReference`. Overridable: `Value`, `Description`.
+
+**Alarm fixed fields**: `Name`, `TriggerType`. Overridable: `PriorityLevel`, `TriggerConfiguration`, `Description`, `OnTriggerScriptId`.
+
+**Script fixed fields**: `Name`. Overridable: `Code`, `TriggerType`, `TriggerConfiguration`, `MinTimeBetweenRuns`, `ParameterDefinitions`, `ReturnDefinition`.
+
+Intermediate locking is permitted: a child template can lock an unlocked member inherited from its parent. Unlocking is never permitted at any level.
+
+### Naming collisions
+
+Adding a member (attribute, alarm, script, or composition) triggers `CollisionDetector.DetectCollisions` on a speculative clone of the template. The detector collects all canonical names — direct members plus path-qualified composed-module members plus inherited members — groups them by canonical name, and reports any group where two entries come from different origin descriptions. A collision is a design-time error and blocks the operation.
+
+Because each composition slot has a unique `InstanceName` prefix, members from different slots can never collide by canonical name. Collisions arise only when a directly defined member shares an unqualified name with an inherited or composed member under the same owner.
+
+### Flattening
+
+`FlatteningService.Flatten` takes the instance, its template inheritance chain (most-derived first), a composition map, per-composed-template chains, and available data connections, and produces a `FlattenedConfiguration`. The resolution order is:
+
+1. Instance overrides (highest priority, respects locks).
+2. Most-derived template in the inheritance chain.
+3. Parent templates, walking to the root.
+4. Composed module members, path-qualified.
+
+The eight steps in order:
+
+1. Validate `LockedInDerived` is not violated across each chain.
+2. Resolve attributes from the inheritance chain (base-to-derived; `IsInherited` placeholders never shadow live base values).
+3. Resolve composed-module attributes with path-qualified canonical names, recursively.
+4. Apply `InstanceAttributeOverride` records (locked attributes are silently skipped).
+5. Apply `InstanceConnectionBinding` records (data-sourced attributes only).
+6. Resolve alarms; for `HiLo` trigger type, merge setpoints key-by-key so a derived template can override just `hi` while inheriting `loLo`.
+7. Resolve scripts; wire `ScriptScope` (self- and parent-path) into each composed script so `Attributes["X"]` resolves to the right path-prefix at runtime.
+8. Resolve native alarm source bindings (`TemplateNativeAlarmSource`), apply `InstanceNativeAlarmSourceOverride`.
+
+Between steps 6 and 7, `ResolveAlarmScriptReferences` resolves each alarm's `OnTriggerScriptId` FK to the canonical name of the corresponding resolved script. A dangling reference (script Id has no resolved script) produces a null `OnTriggerScriptCanonicalName` and is caught by `SemanticValidator`.
+
+### Revision hash
+
+`RevisionHashService.ComputeHash` produces a deterministic `sha256:<hex>` string over the canonical JSON serialization of the `FlattenedConfiguration`. Volatile fields (`GeneratedAtUtc`) are excluded. Collections are sorted by `CanonicalName` before hashing. Internal `Hashable*` records declare their properties in alphabetical order because `System.Text.Json` emits them in declaration order — out-of-order additions would silently break determinism. The hash is included in the deployment identity and lets the Deployment Manager detect whether a re-flatten has changed anything before pushing to sites.
+
+### Diff
+
+`DiffService.ComputeDiff` compares two `FlattenedConfiguration` snapshots by canonical name, producing `Added`, `Removed`, and `Changed` entries for attributes, alarms, and scripts. `ComputeConnectionsDiff` produces the same shape for data-connection configurations. The diff is used by the Deployment Manager to decide whether a full redeploy is needed.
+
+### Concurrent editing
+
+Template edits use **last-write-wins** — there is no optimistic concurrency token on `Template` or its member rows. Two simultaneous edits to the same template produce one winner. This is by design and is documented in the `InstanceService` comment: "Concurrent editing uses last-write-wins — no pessimistic locking or conflict detection." Optimistic concurrency (`RowVersion`) applies to deployment status records in the Deployment Manager, not to template authoring.
+
+## Architecture
+
+### Service map
+
+```
+AddTemplateEngine()
+├── TemplateService        (scoped)  — template + member CRUD, collision/acyclicity pre-checks
+├── SharedScriptService    (scoped)  — system-wide shared script CRUD + syntax validation
+├── InstanceService        (scoped)  — instance CRUD, overrides, connection bindings
+├── SiteService            (scoped)  — site CRUD
+├── AreaService            (scoped)  — area CRUD
+├── TemplateFolderService  (scoped)  — folder hierarchy, sibling-name uniqueness, acyclicity on move
+├── TemplateDeletionService(scoped)  — deletion constraints; called from TemplateService.DeleteTemplateAsync
+├── FlatteningService      (transient)
+├── RevisionHashService    (transient)
+├── DiffService            (transient)
+├── ValidationService      (transient) — full pipeline: 8 stages merged into one ValidationResult
+├── SemanticValidator      (transient) — call-target, argument-count, operand-type, cross-call rules
+└── ScriptCompiler         (transient) — advisory forbidden-API scan + delimiter balance check
+```
+
+Static helpers — `CycleDetector`, `CollisionDetector`, `LockEnforcer`, `TemplateResolver`, `TemplateNaming` — are not registered with DI.
+
+### Validation pipeline
+
+`ValidationService.Validate` runs eight stages in sequence and merges results via `ValidationResult.Merge`:
+
+| Stage | Category | Outcome |
+|-------|----------|---------|
+| `ValidateFlatteningSuccess` | `FlatteningFailure` | Error on missing name; warning on empty configuration |
+| `ValidateNamingCollisions` | `NamingCollision` | Error per duplicate canonical name within entity type |
+| `ValidateScriptCompilation` | `ScriptCompilation` | Error per script that fails `ScriptCompiler.TryCompile` |
+| `ValidateAlarmTriggerReferences` | `AlarmTriggerReference` | Error when `attributeName` / `attribute` key not in flattened attributes |
+| `ValidateScriptTriggerReferences` | `ScriptTriggerReference` | Same check for script triggers |
+| `ValidateExpressionTriggers` | `ScriptTriggerReference` / `AlarmTriggerReference` | Blank warning, syntax error, or missing `Attributes["X"]` reference |
+| `ValidateConnectionBindingCompleteness` | `ConnectionBinding` | Warning per data-sourced attribute with no binding |
+| `SemanticValidator.Validate` | Multiple | Call targets, argument counts, `RangeViolation`/`HiLo` operand types, on-trigger script existence, cross-call violations, native alarm source completeness |
+
+`ValidationResult` is defined in Commons: `IsValid` is true when `Errors` is empty; `Warnings` do not block deployment. Each `ValidationEntry` carries a `ValidationCategory`, a human-readable `Message`, and an optional `EntityName` (canonical name of the offending entity).
+
+### Key entity types (defined in Commons)
+
+| Type | Namespace | Role |
+|------|-----------|------|
+| `Template` | `Commons.Entities.Templates` | Base and derived template rows; `IsDerived` distinguishes slot-owned derived templates |
+| `TemplateAttribute` | same | Attribute definition with `IsInherited`, `LockedInDerived`, `DataType`, `DataSourceReference` |
+| `TemplateAlarm` | same | Alarm definition; `TriggerType` and `Name` are fixed fields |
+| `TemplateScript` | same | Script definition; `Name` is a fixed field |
+| `TemplateComposition` | same | Slot row linking owner to composed (or derived) template by `InstanceName` |
+| `TemplateNativeAlarmSource` | same | Read-only native alarm binding; `SourceReference` is a raw connection address |
+| `FlattenedConfiguration` | `Commons.Types.Flattening` | Deployment-ready snapshot; fields `Attributes`, `Alarms`, `Scripts`, `NativeAlarmSources`, `Connections` |
+| `ResolvedAttribute` / `ResolvedAlarm` / `ResolvedScript` / `ResolvedNativeAlarmSource` | same | Flattened member records carrying `CanonicalName` and `Source` provenance |
+| `ConfigurationDiff` / `DiffEntry<T>` | same | Diff output keyed by canonical name |
+| `ValidationResult` / `ValidationEntry` / `ValidationCategory` | same | Validation output |
+
+## Usage
+
+### Authoring a template with inheritance and composition
+
+The normal flow goes through `TemplateService` methods; each call validates graph invariants before persisting and audits after:
+
+```csharp
+// Create a base template
+var base = await templateService.CreateTemplateAsync(
+    "MotorBase", description: null, parentTemplateId: null, user: "alice");
+
+// Add an attribute to the base
+await templateService.AddAttributeAsync(base.Value.Id, new TemplateAttribute("Speed")
+{
+    DataType = DataType.Float,
+    DataSourceReference = "/Motor/Speed"
+}, user: "alice");
+
+// Create a child that inherits from the base
+var child = await templateService.CreateTemplateAsync(
+    "PumpMotor", description: null, parentTemplateId: base.Value.Id, user: "alice");
+
+// Compose a feature module (AlarmsModule) into the base — acyclicity and collision
+// checks run; a derived template is auto-created to back the slot
+await templateService.AddCompositionAsync(
+    templateId: base.Value.Id,
+    composedTemplateId: alarmsModule.Id,
+    instanceName: "Alarms",
+    user: "alice");
+```
+
+After composition, `Alarms.HighTemp` is the canonical name of `HighTemp` from `AlarmsModule` as it appears in the flattened output.
+
+### Resolving members and checking overrides
+
+```csharp
+// Returns all effective members with canonical names for templateId
+var members = await templateService.ResolveTemplateMembersAsync(templateId);
+
+// Returns TemplateResolver.ResolvedTemplateMember with CanonicalName, MemberType,
+// IsLocked, and ModulePath (null for direct members, slot prefix for composed members).
+foreach (var m in members)
+{
+    Console.WriteLine($"{m.CanonicalName} ({m.MemberType}) locked={m.IsLocked}");
+}
+```
+
+### Flattening and hashing
+
+`FlatteningService` and `RevisionHashService` are transient services called by the Deployment Manager, not directly from the Central UI. The caller builds the template chain (most-derived first) from the repository and passes it:
+
+```csharp
+var flatResult = flatteningService.Flatten(
+    instance,
+    templateChain,          // IReadOnlyList<Template>, index 0 = instance's template
+    compositionMap,         // Dictionary<int, IReadOnlyList<TemplateComposition>>
+    composedTemplateChains, // Dictionary<int, IReadOnlyList<Template>>
+    dataConnections);       // Dictionary<int, DataConnection>
+
+if (flatResult.IsSuccess)
+{
+    var hash = revisionHashService.ComputeHash(flatResult.Value);
+    // hash: "sha256:3a7f..."
+}
+```
+
+### Validating before deployment
+
+```csharp
+var validationResult = validationService.Validate(flatResult.Value, sharedScripts);
+if (!validationResult.IsValid)
+{
+    foreach (var err in validationResult.Errors)
+        logger.LogError("{Category} {Entity}: {Msg}", err.Category, err.EntityName, err.Message);
+}
+```
+
+## Dependencies & Interactions
+
+- [Commons (#16)](./Commons.md) — owns all entity types (`Template`, `TemplateAttribute`, `TemplateAlarm`, `TemplateScript`, `TemplateComposition`, `TemplateNativeAlarmSource`), the flattening types (`FlattenedConfiguration`, `ResolvedAttribute`, `ResolvedAlarm`, `ResolvedScript`, `ResolvedNativeAlarmSource`, `ConfigurationDiff`), the `ValidationResult`/`ValidationEntry`/`ValidationCategory` hierarchy, the `ITemplateEngineRepository` interface, and the `IAuditService` interface. Template Engine imports from Commons; it never holds a direct EF Core dependency.
+- [Configuration Database (#17)](./ConfigurationDatabase.md) — provides the `ITemplateEngineRepository` implementation backed by the central MS SQL database. `TemplateService`, `InstanceService`, and all `Services/` classes resolve this via constructor injection. EF Core migrations for template tables live in this project.
+- [Deployment Manager (#2)](./DeploymentManager.md) — consumes `FlatteningService`, `RevisionHashService`, `DiffService`, and `ValidationService` to prepare deployment packages. It also calls `TemplateDeletionService.CanDeleteTemplateAsync` to check constraints before removing a template that has active deployments.
+- [Site Runtime (#3)](./SiteRuntime.md) — receives the `FlattenedConfiguration` (via the Deployment Manager) and uses `ResolvedScript.Scope` to set the path-prefix context for script attribute accessors. `ResolvedNativeAlarmSource` records drive the site's `NativeAlarmActor` bindings.
+- [Central UI (#9)](./CentralUI.md) — the primary authoring surface. All template CRUD, instance management, shared script editing, and folder organization go through the Management Service, which delegates to `TemplateService`, `InstanceService`, and `SharedScriptService`. The Central UI calls `ValidateAsync` on-demand so designers see errors before deployment.
+- [Management Service (#18)](./ManagementService.md) — the Akka.NET actor that exposes template operations over the cluster boundary. `TemplateService` and related services are injected into its DI scope per request.
+- [Transport (#24)](./Transport.md) — exports and imports templates as encrypted bundles. On import, the Transport component calls `TemplateService.CreateTemplateAsync` (and member-add methods) for each template in the bundle; acyclicity and collision checks run identically to manual authoring.
+- Design spec: [Component-TemplateEngine.md](../requirements/Component-TemplateEngine.md).
+
+## Troubleshooting
+
+### Composition fails with a cycle error
+
+`CycleDetector.DetectCrossGraphCycle` rejects edges that would create a cycle across either inheritance or composition edges. The most common trigger is composing a template that already (transitively) includes the owner — for example, A composes B, and then trying to compose A into B. The error message identifies the template by name. Remove or restructure the graph to break the circular dependency.
+
+### `LockedInDerived` cannot be cleared
+
+`LockEnforcer.ValidateLockedInDerivedChange` enforces a one-way ratchet: once a base template sets `LockedInDerived = true` on a member, the flag cannot be cleared. This is intentional — clearing it retroactively would make previously blocked derived overrides legal without any visible signal to derived-template authors. The only remediation is to create a new base template without the flag.
+
+### Revision hash changes unexpectedly between deploys
+
+The SHA-256 hash covers all resolved attributes, alarms, scripts, and connection configurations. Changes to any member anywhere in the inheritance or composition chain — including in parent templates or feature modules the instance does not directly own — will change the hash. Use `DiffService.ComputeDiff` to identify exactly which canonical names changed and why.
+
+### Naming collision on composed member add
+
+`CollisionDetector.DetectCollisions` fires when a member's unqualified name collides with a direct or inherited member on the same owner template. Because composed members carry a slot-name prefix, a collision can arise between a directly defined member (`Speed`) and a composed member that resolves to the same unqualified name in an ancestor. The fix is to rename one of the conflicting members before adding the composition.
+
+### Semantic validation: `CallScript` target not found
+
+`SemanticValidator.ExtractCallTargets` uses a substring scan for `CallScript("name", ...)` and `CallShared("name", ...)`. If the target name does not match any resolved script canonical name, the error `CallTargetNotFound` is reported. Check that the call uses the full canonical name, including any composition prefix (e.g., `Alarms.HandleFault`, not just `HandleFault`).
+
+## Related Documentation
+
+- [Template Engine design specification](../requirements/Component-TemplateEngine.md)
+- [Commons](./Commons.md)
+- [Configuration Database](./ConfigurationDatabase.md)
+- [Deployment Manager](./DeploymentManager.md)
+- [Site Runtime](./SiteRuntime.md)
+- [Central UI](./CentralUI.md)
+- [Management Service](./ManagementService.md)
+- [Transport](./Transport.md)