Files
ScadaBridge/docs/components/SiteRuntime.md
T
Joseph Doherty 25bae4e43b docs(components): accuracy fixes from deep review (batch 2)
TemplateEngine (alarm-script-ref ordering, native-alarm-sources not in
revision hash, composition cycle checks, 9-step pipeline), SiteRuntime
(alarm on-trigger scripts run with a restricted context; PreStart seeds
children from defaults before overrides arrive), DataConnectionLayer
(UnsubscribeAlarmsRequest stashed in Connecting), StoreAndForward (InFlight/
Delivered are dead enum values; notifications can park at 50 retries),
ExternalSystemGateway (CachedWrite returns void + enqueues directly; log levels).
2026-06-03 16:34:37 -04:00

313 lines
27 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Site Runtime
The Site Runtime component runs the site-side actor hierarchy that executes deployed machine instances: it owns script compilation, alarm evaluation, native alarm mirroring, and the site-wide Akka stream that carries attribute value and alarm state changes to every subscriber.
## Overview
Site Runtime (#3) operates exclusively on site clusters. Its entry point is the `DeploymentManagerActor` cluster singleton, which re-creates the full actor hierarchy on every site startup or failover. Each deployed enabled instance gets an `InstanceActor` child; each `InstanceActor` spawns `ScriptActor` and `AlarmActor` coordinator children, plus a `NativeAlarmActor` peer for every configured native alarm source. Script invocations spawn short-lived `ScriptExecutionActor` children; alarm on-trigger invocations spawn short-lived `AlarmExecutionActor` children.
The component code lives in `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/`:
- `Actors/``DeploymentManagerActor`, `InstanceActor`, `ScriptActor`, `ScriptExecutionActor`, `AlarmActor`, `AlarmExecutionActor`, `NativeAlarmActor`, `SiteReplicationActor`.
- `Scripts/``ScriptCompilationService`, `ScriptExecutionScheduler`, `SharedScriptLibrary`, `ScriptRuntimeContext`, `ScopeAccessors`, `TriggerExpressionGlobals`.
- `Streaming/``SiteStreamManager` (the site-wide Akka broadcast stream).
- `Persistence/``SiteStorageService` (raw SQLite via `Microsoft.Data.Sqlite`), `SiteStorageInitializer`.
- `Repositories/``SiteExternalSystemRepository`, `SiteNotificationRepository`.
- `Tracking/``OperationTrackingStore`, `OperationTrackingOptions`.
`ServiceCollectionExtensions.AddSiteRuntime(connectionString)` registers all singletons; the `Host` calls it and wires the `DeploymentManagerActor` cluster singleton separately via `AkkaHostedService`.
## Key Concepts
### Cluster singleton as the single point of authority
`DeploymentManagerActor` runs as an Akka.NET cluster singleton — guaranteed to be active on exactly one site node at a time. On failover, Akka restarts the singleton on the surviving node. Because all deployment commands from central are routed through the singleton, there is never a split-brain dispute over which node owns instance lifecycle: the singleton is the only actor that calls `Context.ActorOf` for `InstanceActor` children.
### Staggered startup
The singleton reads all deployed configurations from SQLite in `PreStart`, compiles shared scripts off-thread, and then creates `InstanceActor` children in batches. The default batch size is 20, with a 100 ms delay between batches (`StartupBatchSize` and `StartupBatchDelayMs` in `SiteRuntimeOptions`). Without staggering, 500 instances each subscribing to OPC UA tags simultaneously would produce a reconnection storm that overwhelms the OPC UA server.
### Actor supervision and lifetimes
| Actor | Kind | Supervises children with | On exception |
|---|---|---|---|
| `DeploymentManagerActor` | long-lived singleton | `OneForOneStrategy` | Resume (coord) / Stop (init failure) |
| `InstanceActor` | long-lived per instance | `OneForOneStrategy` | Resume for all coordinator children |
| `ScriptActor` | long-lived coordinator | `OneForOneStrategy` | Stop execution child, keep self |
| `AlarmActor` | long-lived coordinator | `OneForOneStrategy` | Stop execution child, keep self |
| `NativeAlarmActor` | long-lived coordinator | — | Supervised by Instance Actor (Resume) |
| `ScriptExecutionActor` | short-lived per invocation | — | Stops itself; parent logs failure |
| `AlarmExecutionActor` | short-lived per invocation | — | Stops itself; parent logs failure |
Coordinator actors resume on exception because their in-memory state (trigger timers, last execution time, alarm level) must survive child crashes. Short-lived execution actors stop themselves on completion or exception — the coordinator remains available for the next trigger.
### Dedicated script-execution dispatcher
Script and alarm on-trigger bodies run on the `ScriptExecutionScheduler` (`SiteRuntime-009`): a custom `TaskScheduler` backed by a bounded set of dedicated threads (default 8, `ScriptExecutionThreadCount`). The script body is submitted to this scheduler via `Task.Factory.StartNew(..., scheduler)` inside `ScriptExecutionActor` and `AlarmExecutionActor`. Scripts that block on I/O (database connections, synchronous external system calls) block only the scheduler's threads, leaving the shared .NET thread pool and all Akka dispatchers unaffected.
### Tell vs. Ask
- **Tell**: tag value updates, `AttributeValueChanged` fan-out to child Script/Alarm actors, stream publishing. These are high-frequency or fire-and-forget paths.
- **Ask**: `Instance.CallScript()` (caller needs the return value), debug view snapshots, attribute reads from the Inbound API. Ask is reserved for cross-boundary, low-frequency calls.
### Attribute serialization through the Instance Actor
All in-memory state mutations (attribute values, qualities, alarm states) run inside `InstanceActor`'s mailbox. Multiple `ScriptExecutionActor` instances may run concurrently but all `SetAttribute` calls serialize through the `InstanceActor` mailbox, preventing race conditions. Concurrent script executions may interleave external side effects (HTTP calls, database writes, notifications); those are independent and intentionally not serialized.
## Architecture
### Actor hierarchy
```text
DeploymentManagerActor (Akka.NET cluster singleton)
└── InstanceActor "MachineA-001"
├── ScriptActor "MonitorSpeed" (coordinator)
│ └── ScriptExecutionActor (short-lived, per invocation)
├── ScriptActor "CalculateOEE" (coordinator)
│ └── ScriptExecutionActor (short-lived)
├── AlarmActor "OverTemp" (coordinator, computed)
│ └── AlarmExecutionActor (short-lived, on-trigger)
├── AlarmActor "LowPressure" (coordinator, computed)
└── NativeAlarmActor "OpcUaServer1" (read-only mirror, peer to AlarmActor)
```
`NativeAlarmActor` is a sibling of `AlarmActor` — a peer under the same `InstanceActor` parent. It is not a child of `AlarmActor` and has no relationship to the script engine.
### Deployment flow
Central sends a `DeployInstanceCommand` carrying a JSON `FlattenedConfiguration` to the site singleton. The singleton:
1. Calls `EnsureDclConnections` to push any new or changed connection definitions to the DCL manager (hash-guarded: unchanged configs are skipped).
2. Calls `CreateInstanceActor`, which does `Context.ActorOf(props, instanceName)`.
3. Runs an off-thread `Task` that calls `SiteStorageService.StoreDeployedConfigAsync`, clears static overrides and native alarm state, and — if `_replicationActor` is non-null (it is optional and null in isolated deployments/tests) — tells `SiteReplicationActor` to push to the peer node.
4. Pipes back a `DeployPersistenceResult`; only on success does it tell the deployer `DeploymentStatus.Success`. If persistence fails, the optimistically-created actor is stopped and the error is returned to central (`SiteRuntime-005`).
For redeployment (instance already running), the existing actor is stopped and watched:
```csharp
// DeploymentManagerActor.HandleDeploy
if (_instanceActors.TryGetValue(instanceName, out var existing))
{
_instanceActors.Remove(instanceName);
_pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
_terminatingActorsByName[instanceName] = existing;
Context.Watch(existing);
Context.Stop(existing);
return;
}
```
The `Terminated` signal fires once the previous actor and its entire subtree have stopped (freeing the actor name), and only then does `ApplyDeployment` run for the replacement. A third deploy arriving mid-termination overwrites the buffered `PendingRedeploy` (last-write-wins) and tells the displaced sender a `Failed-superseded` response (`SiteRuntime-020`).
### Instance Actor initialization
On `PreStart`, `InstanceActor`:
1. Fires `SiteStorageService.GetStaticOverridesAsync` asynchronously and pipes the result to self as a `LoadOverridesResult` — this is non-blocking; the message arrives later in the mailbox.
2. Calls `CreateChildActors()` **immediately** (before the override message arrives). `CreateChildActors` snapshots `_attributes` (the live dictionary, seeded from flattened-config defaults) into `attributeSnapshot` before any child constructor runs. Each child's `Props` closure captures the immutable snapshot, not the live dictionary — preventing the race condition described in `SiteRuntime-017`. Because the override load is asynchronous, children are created from the un-overridden defaults; when the `LoadOverridesResult` message is subsequently processed, `HandleOverridesLoaded` applies the persisted overrides on top of the live `_attributes` dictionary.
3. Calls `SubscribeToDcl()`, grouping data-sourced attributes by connection name and sending `SubscribeTagsRequest` to the DCL manager. Tag paths are stored in `_tagPathToAttributes`, a `Dictionary<string, List<string>>`, because one physical tag can back more than one attribute canonical name.
Data-sourced attributes start with quality `Uncertain` until the first `TagValueUpdate` arrives; static attributes start with quality `Good`.
### Script compilation
`ScriptCompilationService.Compile(name, code)` first runs `ValidateTrustModel`, which uses Roslyn semantic analysis (not substring scanning) to detect references to forbidden namespaces (`System.IO`, `System.Diagnostics.Process`, `System.Threading` — except `Tasks`/`CancellationToken`, `System.Reflection`, `System.Net.Sockets`, `System.Net.Http`). Only after passing trust validation does it call `CSharpScript.Create<object?>` with the restricted `ScriptOptions` (references capped to `object`, `Enumerable`, `Math`, `CSharpArgumentInfo`, and `DynamicJsonElement` assemblies).
```csharp
// ScriptCompilationService.CompileCore
var violations = ValidateTrustModel(code);
if (violations.Count > 0)
return ScriptCompilationResult.Failed(violations);
var script = CSharpScript.Create<object?>(
code,
BuildScriptOptions(),
globalsType: globalsType);
var diagnostics = script.Compile();
```
`CompileTriggerExpression(name, expression)` follows the same path but uses `TriggerExpressionGlobals` as the globals type instead of `ScriptGlobals` — trigger expressions are read-only and have no access to the script runtime API.
### Shared script library
`SharedScriptLibrary` holds a `Dictionary<string, Script<object?>>` under a `lock`. It is populated at startup (off-thread by the singleton, piped back as `SharedScriptsLoaded`) and updated live when artifact deployments arrive carrying new shared scripts. Calling `Scripts.CallShared("name", params)` inside a script calls `SharedScriptLibrary.ExecuteAsync`, which runs the compiled delegate inline as compiled code (no actor message, no serialization) and is awaited by the caller.
### Script Actor triggers
`ScriptActor` parses the `ResolvedScript.TriggerType` and `TriggerConfiguration` into a discriminated union (`IntervalTriggerConfig`, `ValueChangeTriggerConfig`, `ConditionalTriggerConfig`, `ExpressionTriggerConfig`). Interval triggers use Akka `ITimerScheduler`. Value-change and conditional triggers react to `AttributeValueChanged` messages forwarded by the Instance Actor. Expression triggers maintain an `_attributeSnapshot` dictionary kept current by every `AttributeValueChanged`, then evaluate the compiled `_compiledTriggerExpression` synchronously with a 2-second `CancellationTokenSource` timeout.
`WhileTrue` mode is handled by `HandleWhileTrueTransition`:
```csharp
// ScriptActor.HandleWhileTrueTransition
private void HandleWhileTrueTransition(bool nowTrue, bool wasTrue)
{
if (nowTrue && !wasTrue)
{
TrySpawnExecution(null);
StartWhileTrueTimer();
}
else if (!nowTrue && wasTrue)
{
StopWhileTrueTimer();
}
}
```
On the false→true edge the script fires once and a periodic re-fire timer starts at `MinTimeBetweenRuns` cadence; on the true→false edge the timer stops.
### Site-wide Akka stream
`SiteStreamManager` materializes a broadcast hub in `Initialize(ActorSystem)`, called by the Host after Akka starts. The hub is fed by a `Source.ActorRef<ISiteStreamEvent>` (bounded with `OverflowStrategy.DropHead`). `InstanceActor` publishes via `_streamManager?.PublishAttributeValueChanged(changed)` and `PublishAlarmStateChanged(changed)` — both are Tell calls; they never block the actor.
Each subscriber (typically a `StreamRelayActor` created by the Communication Layer's `SiteStreamGrpcServer`) gets its own materialized sub-graph with an independent `Buffer(_bufferSize, DropHead)` and a `KillSwitch`. A slow subscriber drops only its own events; it cannot stall other subscribers or the publishing Instance Actor.
```csharp
// SiteStreamManager.Subscribe
var killSwitch = _hubSource
.Where(ev => ev.InstanceUniqueName == capturedInstance)
.Buffer(_bufferSize, OverflowStrategy.DropHead)
.ViaMaterialized(KillSwitches.Single<ISiteStreamEvent>(), Keep.Right)
.To(Sink.ForEach<ISiteStreamEvent>(ev => capturedSubscriber.Tell(ev)))
.Run(_materializer);
```
### Native alarm mirror
`NativeAlarmActor` mirrors the condition state of one source binding — an OPC UA A&C server or MxAccess Gateway connection — without writing back to the source. Each condition is keyed by `SourceReference`.
On `PreStart` it concurrently kicks off two operations: the SQLite rehydration (`GetNativeAlarmsAsync`, piped back as `RehydrationCompleted`) and a `SubscribeAlarmsRequest` to the DCL manager — the subscribe is sent before the async rehydration completes and is not gated on it. The DCL forwards the subscribe request to the connection's `IAlarmSubscribableConnection` implementation.
Transition handling:
- `AlarmTransitionKind.Snapshot` accumulates into `_snapshotBuffer`.
- `AlarmTransitionKind.SnapshotComplete` atomically swaps `_alarms` with `_snapshotBuffer`. Conditions absent from the snapshot emit return-to-normal events and drop from the mirror — the mechanism that reconciles state after a reconnect.
- Live transitions (`Raise`, `Ack`, `Clear`, etc.) upsert by `SourceReference`, ignoring transitions older than the held `TransitionTime` (out-of-order protection).
Retention: a condition that is both inactive and acknowledged (`!Active && Acknowledged`) is dropped from the mirror and its SQLite row deleted. If the mirror exceeds `MirroredAlarmCapPerSource` (default 1000), the oldest condition is dropped and logged. Persistence is fire-and-forget — a write failure is logged but never blocks the actor or suppresses the upward `AlarmStateChanged` emit.
### Enriched `AlarmStateChanged`
Both `AlarmActor` and `NativeAlarmActor` tell the `InstanceActor` an `AlarmStateChanged`. The message was extended additively so existing computed-alarm consumers continue to work unchanged:
| Field | Computed alarm | Native alarm |
|---|---|---|
| `Kind` | `AlarmKind.Computed` | `AlarmKind.NativeOpcUa` or `NativeMxAccess` |
| `Condition` | computed default (auto-acknowledged, `Severity = Priority`) | mirrored `AlarmConditionState` from source |
| `SourceReference`, `AlarmTypeName`, `Category`, `Message`, `OperatorUser`, `OperatorComment`, `OriginalRaiseTime`, `CurrentValue`, `LimitValue` | empty/null | populated from source transition |
`InstanceActor` stores the latest enriched event per alarm name in `_latestAlarmEvents`. The Debug View snapshot uses this map so native alarm metadata reaches the central debug view.
### Local SQLite schema
`SiteStorageService` owns the site database (raw `Microsoft.Data.Sqlite`, not EF Core). Tables created by `InitializeAsync`:
| Table | Purpose | Reset on redeploy? |
|---|---|---|
| `deployed_configurations` | Persisted flattened configs (survives restart/failover) | No (replaced) |
| `static_attribute_overrides` | Runtime attribute writes (`SetAttribute` on static attrs) | Yes — cleared by `ClearStaticOverridesAsync` |
| `native_alarm_state` | Mirrored native alarm conditions (survives failover) | Yes — cleared by `ClearNativeAlarmsForInstanceAsync` |
| `shared_scripts` | Shared script code from artifact deployments | No |
| `external_systems` | External system definitions | No |
| `database_connections` | Database connection strings | No |
| `data_connection_definitions` | OPC UA / MxGateway endpoint definitions | No |
| `notification_lists` | Notification list definitions | No |
| `smtp_configurations` | SMTP configuration (from artifact deployment) | No |
### Standby replication
`SiteReplicationActor` runs on every site node (not a singleton). The active node's `DeploymentManagerActor` tells it `ReplicateConfigDeploy`, `ReplicateConfigRemove`, `ReplicateConfigSetEnabled`, `ReplicateArtifacts`, or `ReplicateStoreAndForward`. The replication actor tracks the peer node via Akka cluster membership events and forwards each command to `/user/site-replication` on the peer via `ActorSelection`. Replication is fire-and-forget (no ack wait per design), so a failed write to the standby is logged but does not fail the primary operation.
## Usage
### Lifecycle commands
Central sends commands to the site `DeploymentManagerActor` singleton over the Communication Layer:
| Command | Effect |
|---|---|
| `DeployInstanceCommand` | Create or replace the instance actor; persist config to SQLite; clear static overrides and native alarm state |
| `DisableInstanceCommand` | Stop the instance actor; set `is_enabled = 0` in SQLite; retain config for re-enable |
| `EnableInstanceCommand` | Create a new instance actor from the stored config |
| `DeleteInstanceCommand` | Stop the instance actor; remove config from SQLite; store-and-forward messages are NOT cleared |
| `DeployArtifactsCommand` | Persist shared scripts, external system definitions, database connections, notification lists, data connection definitions; recompile shared scripts; push data connections to DCL |
### Script API surface
Scripts run inside `ScriptExecutionActor` with a `ScriptGlobals` object as the Roslyn host object. The `Instance` global is a `ScriptRuntimeContext`. Convenience top-level aliases (`ExternalSystem`, `Database`, `Notify`, `Scripts`, `Attributes`, `Children`, `Parent`) delegate to context methods. Key calls:
- `Instance.GetAttribute("name")` / `Instance.SetAttribute("name", value)` — Ask to `InstanceActor` for write, in-process for read.
- `Instance.CallScript("scriptName", params)` — Ask from `ScriptExecutionActor` to sibling `ScriptActor`, which spawns a new `ScriptExecutionActor`.
- `Scripts.CallShared("name", params)``SharedScriptLibrary.ExecuteAsync`, inline on the current scheduler thread.
- `ExternalSystem.Call(...)` — synchronous HTTP call through `IExternalSystemClient`.
- `ExternalSystem.CachedCall(...)` / `Database.CachedWrite(...)` — store-and-forwarded; returns a `TrackedOperationId`.
- `Tracking.Status(id)` — reads the site-local `OperationTrackingStore` synchronously.
- `Notify.To("list").Send(...)` — enqueues a notification in the Store-and-Forward Engine for delivery to central.
Alarm on-trigger scripts run in `AlarmExecutionActor` with a **restricted** context: they receive an `Alarm` global (`AlarmContext` carrying `Name`, `Level`, `Priority`, `Message`) and have access to the instance/shared-script surface (`Instance.*`, `Scripts.CallShared`, `Instance.CallScript`), but **not** the external-system, database, notification, or audit integration APIs. `AlarmExecutionActor` builds its `ScriptRuntimeContext` without a `serviceProvider`, so `ExternalSystem`, `Database`, `Notify`, and audit writes are unavailable to alarm on-trigger scripts — those APIs are only resolved inside `ScriptExecutionActor` (instance scripts).
### Debug view
The Communication Layer sends `SubscribeDebugViewRequest` or `DebugSnapshotRequest` to the singleton, which forwards to the named `InstanceActor`. The Instance Actor replies with a `DebugViewSnapshot` built from the current `_attributes` dictionary and `_latestAlarmEvents` map. Ongoing changes reach the central debug view via the gRPC stream, not through the actor hierarchy.
## Configuration
All options live in the `ScadaBridge:SiteRuntime` section, bound to `SiteRuntimeOptions`:
| Key | Default | Description |
|---|---|---|
| `StartupBatchSize` | `20` | Instance Actors created per batch during staggered startup |
| `StartupBatchDelayMs` | `100` | Milliseconds between startup batches |
| `MaxScriptCallDepth` | `10` | Maximum `Instance.CallScript` / `Scripts.CallShared` recursion depth |
| `ScriptExecutionTimeoutSeconds` | `30` | Per-script body execution timeout; exceeding it cancels and logs an error |
| `StreamBufferSize` | `1000` | Per-subscriber drop-oldest buffer size for the Akka broadcast stream |
| `ScriptExecutionThreadCount` | `8` | Dedicated threads in the `ScriptExecutionScheduler` (covers both scripts and alarm on-trigger bodies) |
| `MirroredAlarmCapPerSource` | `1000` | Maximum mirrored conditions per `NativeAlarmActor` source binding before oldest is dropped and logged |
| `NativeAlarmRetryIntervalMs` | `5000` | Milliseconds before retrying a failed native alarm subscription |
The SQLite connection string is passed directly to `AddSiteRuntime(connectionString)` in the host composition root and is not part of `SiteRuntimeOptions`.
## Dependencies & Interactions
- [Data Connection Layer (#4)](./DataConnectionLayer.md) — supplies `TagValueUpdate` and `ConnectionQualityChanged` messages to `InstanceActor`; receives `SubscribeTagsRequest` and `WriteTagRequest`. Also supplies `NativeAlarmTransitionUpdate` and `NativeAlarmSourceUnavailable` to `NativeAlarmActor` via `SubscribeAlarmsRequest` (connections implementing `IAlarmSubscribableConnection`).
- [CentralSite Communication (#5)](./Communication.md) — routes `DeployInstanceCommand`, `DisableInstanceCommand`, `EnableInstanceCommand`, `DeleteInstanceCommand`, `DeployArtifactsCommand`, debug view requests, and Inbound API `RouteToCallRequest` / `RouteToGetAttributesRequest` / `RouteToSetAttributesRequest` to the singleton; receives `DeploymentStatusResponse` and `ArtifactDeploymentResponse` back. The `SiteStreamManager` implements `ISiteStreamSubscriber` so the Communication Layer's `SiteStreamGrpcServer` can subscribe `StreamRelayActor` instances to the broadcast hub.
- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — `ScriptRuntimeContext` passes `StoreAndForwardService` (resolved from DI inside `ScriptExecutionActor`) for `ExternalSystem.CachedCall`, `Database.CachedWrite`, and `Notify.To().Send()`. Owns the site-local operation tracking table that `Tracking.Status(id)` reads.
- [External System Gateway (#7)](./ExternalSystemGateway.md) — `IExternalSystemClient` called by `ScriptRuntimeContext.ExternalSystemHelper` for synchronous and cached external system calls.
- [Site Event Logging (#12)](./SiteEventLogging.md) — `ISiteEventLogger` (resolved from DI inside execution actors) receives script error, alarm error, and script execution events.
- [Health Monitoring (#11)](./HealthMonitoring.md) — `ISiteHealthCollector` (injected into `DeploymentManagerActor`, `InstanceActor`, `ScriptActor`, `AlarmActor`) tracks instance counts (`SetInstanceCounts`), script errors (`IncrementScriptError`), and alarm errors (`IncrementAlarmError`); sets `SetActiveNode` in `DeploymentManagerActor.PreStart`/`PostStop` so the health report reflects which node holds the singleton.
- [Audit Log (#23)](./AuditLog.md) — `IAuditWriter` (resolved from DI inside `ScriptExecutionActor`) receives one row per script-trust-boundary call; audit writes are best-effort and never abort the calling script.
- [Commons (#16)](./Commons.md) — owns all message contracts (`DeployInstanceCommand`, `AttributeValueChanged`, `AlarmStateChanged`, `ScriptCallRequest`, `NativeAlarmTransitionUpdate`, etc.), the `FlattenedConfiguration` / `ResolvedScript` / `ResolvedAlarm` / `ResolvedNativeAlarmSource` types, and the `AlarmKind` / `AlarmState` / `AlarmLevel` / `AlarmConditionState` / `AlarmTransitionKind` enums.
- Local SQLite — `SiteStorageService` owns the site database. Peer SQLite stores (Store-and-Forward buffer, AuditLog, operation tracking, site event log) are owned by their respective components but share the same SQLite file path convention.
- Design spec: [Component-SiteRuntime.md](../requirements/Component-SiteRuntime.md).
## Troubleshooting
### Instance stays in `Unknown` or `Failed` after deployment
`DeploymentManagerActor` only tells central `DeploymentStatus.Success` after `SiteStorageService.StoreDeployedConfigAsync` commits. If the site SQLite file is locked, full, or on a read-only volume, the persistence task throws, the optimistically-created actor is stopped, and central receives `Failed`. Check the site event log for `Failed to persist deployment` entries. The site SQLite path is configured in the host `appsettings.json`.
### Reconnection storm on failover
If many instances race to subscribe to OPC UA at once the server may throttle or drop connections. Increase `StartupBatchDelayMs` or decrease `StartupBatchSize` in `SiteRuntimeOptions`. The current defaults (batch 20, delay 100 ms) mean a site with 200 instances takes 1 second to start all subscriptions, which is acceptable for most servers.
### Script execution actor backpressure
The `ScriptExecutionScheduler` has a fixed thread count (`ScriptExecutionThreadCount`, default 8). If all threads are blocked (a burst of scripts each waiting on a slow database or external system), new script invocations queue behind them. The queue is unbounded — memory usage can grow during a backlog. If this is observed, raise `ScriptExecutionThreadCount` or reduce the number of concurrent long-running scripts. Script execution timeout (`ScriptExecutionTimeoutSeconds`) bounds the worst case.
### Native alarm conditions not recovering after reconnect
`NativeAlarmActor` retains last-known conditions during a source outage (it does not clear them) and reconciles state via the reconnect snapshot swap. If the snapshot never arrives (the DCL connection was cleanly unsubscribed rather than failing), the actor may hold stale Active conditions indefinitely. A redeploy of the instance clears `native_alarm_state` in SQLite and forces fresh subscription. Failed subscription retries are logged at `Warning` level with the retry interval.
### `InvalidActorNameException` on rapid redeployment
If two `DeployInstanceCommand` messages arrive for the same instance while the first redeployment is still terminating, the `_terminatingActorsByName` shadow index in `DeploymentManagerActor` detects the collision and buffers the second command (`SiteRuntime-020`). The displaced deploy receives a `Failed-superseded` response. This is expected behaviour — central should observe the `Failed` response and retry when the site is ready.
## Related Documentation
- [Site Runtime design specification](../requirements/Component-SiteRuntime.md)
- [CentralSite Communication](./Communication.md)
- [Commons](./Commons.md)
- [Host](./Host.md)
- [Audit Log](./AuditLog.md)
- [Cluster Infrastructure](./ClusterInfrastructure.md)