Site Runtime
The Site Runtime component runs the site-side actor hierarchy that executes deployed machine instances: it owns script compilation, alarm evaluation, native alarm mirroring, and the site-wide Akka stream that carries attribute value and alarm state changes to every subscriber.
Overview
Site Runtime (#3) operates exclusively on site clusters. Its entry point is the DeploymentManagerActor cluster singleton, which re-creates the full actor hierarchy on every site startup or failover. Each deployed, enabled instance gets an InstanceActor child; each InstanceActor spawns ScriptActor and AlarmActor coordinator children, plus a NativeAlarmActor peer for every configured native alarm source. Script invocations spawn short-lived ScriptExecutionActor children; alarm on-trigger invocations spawn short-lived AlarmExecutionActor children.
The component code lives in src/ZB.MOM.WW.ScadaBridge.SiteRuntime/:
- Actors/: DeploymentManagerActor, InstanceActor, ScriptActor, ScriptExecutionActor, AlarmActor, AlarmExecutionActor, NativeAlarmActor, SiteReplicationActor.
- Scripts/: ScriptCompilationService, ScriptExecutionScheduler, SharedScriptLibrary, ScriptRuntimeContext, ScopeAccessors, TriggerExpressionGlobals.
- Streaming/: SiteStreamManager (the site-wide Akka broadcast stream).
- Persistence/: SiteStorageService (raw SQLite via Microsoft.Data.Sqlite), SiteStorageInitializer.
- Repositories/: SiteExternalSystemRepository, SiteNotificationRepository.
- Tracking/: OperationTrackingStore, OperationTrackingOptions.
ServiceCollectionExtensions.AddSiteRuntime(connectionString) registers all singletons; the Host calls it and wires the DeploymentManagerActor cluster singleton separately via AkkaHostedService.
Key Concepts
Cluster singleton as the single point of authority
DeploymentManagerActor runs as an Akka.NET cluster singleton, so at most one instance is active across the site cluster at a time. On failover, Akka restarts the singleton on the surviving node. Because all deployment commands from central are routed through the singleton, two nodes never dispute which of them owns instance lifecycle: the singleton is the only actor that calls Context.ActorOf for InstanceActor children.
Staggered startup
The singleton reads all deployed configurations from SQLite in PreStart, compiles shared scripts off-thread, and then creates InstanceActor children in batches. The default batch size is 20, with a 100 ms delay between batches (StartupBatchSize and StartupBatchDelayMs in SiteRuntimeOptions). Without staggering, 500 instances each subscribing to OPC UA tags simultaneously would produce a reconnection storm that overwhelms the OPC UA server.
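The batching arithmetic can be sketched with plain BCL types. This is a minimal illustration, not the singleton's actual code: StartupPlan and its Build method are hypothetical names, and only the defaults (20 per batch, 100 ms apart) come from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the component): plans which instances
// start in which batch, and how long after startup each batch begins.
public static class StartupPlan
{
    public static List<(int OffsetMs, string[] Names)> Build(
        IEnumerable<string> instanceNames, int batchSize = 20, int batchDelayMs = 100)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        if (batchDelayMs < 0) throw new ArgumentOutOfRangeException(nameof(batchDelayMs));

        return instanceNames
            .Chunk(batchSize) // Enumerable.Chunk requires .NET 6 or later
            .Select((names, index) => (OffsetMs: index * batchDelayMs, Names: names))
            .ToList();
    }
}
```

With 200 instance names and the default settings, Build yields 10 batches of 20, and the last batch begins 900 ms after the first.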
Actor supervision and lifetimes
| Actor | Kind | Supervises children with | On exception |
|---|---|---|---|
| DeploymentManagerActor | long-lived singleton | OneForOneStrategy | Resume (coord) / Stop (init failure) |
| InstanceActor | long-lived per instance | OneForOneStrategy | Resume for all coordinator children |
| ScriptActor | long-lived coordinator | OneForOneStrategy | Stop execution child, keep self |
| AlarmActor | long-lived coordinator | OneForOneStrategy | Stop execution child, keep self |
| NativeAlarmActor | long-lived coordinator | none | Supervised by Instance Actor (Resume) |
| ScriptExecutionActor | short-lived per invocation | none | Stops itself; parent logs failure |
| AlarmExecutionActor | short-lived per invocation | none | Stops itself; parent logs failure |
Coordinator actors resume on exception because their in-memory state (trigger timers, last execution time, alarm level) must survive child crashes. Short-lived execution actors stop themselves on completion or exception — the coordinator remains available for the next trigger.
Dedicated script-execution dispatcher
Script and alarm on-trigger bodies run on the ScriptExecutionScheduler (SiteRuntime-009): a custom TaskScheduler backed by a bounded set of dedicated threads (default 8, ScriptExecutionThreadCount). The script body is submitted to this scheduler via Task.Factory.StartNew(..., scheduler) inside ScriptExecutionActor and AlarmExecutionActor. Scripts that block on I/O (database connections, synchronous external system calls) block only the scheduler's threads, leaving the shared .NET thread pool and all Akka dispatchers unaffected.
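A scheduler of this shape can be built from BCL primitives alone. The sketch below is an assumption about the general technique (a TaskScheduler draining a queue on dedicated threads), not the ScriptExecutionScheduler source; the class name and thread naming are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for ScriptExecutionScheduler: tasks queued here run
// only on this scheduler's own threads, never on the shared thread pool.
public sealed class DedicatedThreadScheduler : TaskScheduler, IDisposable
{
    private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();
    private readonly int _threadCount;

    public DedicatedThreadScheduler(int threadCount)
    {
        _threadCount = threadCount;
        for (var i = 0; i < threadCount; i++)
        {
            var thread = new Thread(Run) { IsBackground = true, Name = $"script-exec-{i}" };
            thread.Start();
        }
    }

    public override int MaximumConcurrencyLevel => _threadCount;

    // Each dedicated thread drains the shared queue until Dispose is called.
    private void Run()
    {
        foreach (var task in _queue.GetConsumingEnumerable())
            TryExecuteTask(task);
    }

    protected override void QueueTask(Task task) => _queue.Add(task);

    // Refuse inlining so a blocking body can never run on the caller's thread.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;

    protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

    public void Dispose() => _queue.CompleteAdding();
}
```

A body is submitted with Task.Factory.StartNew(body, CancellationToken.None, TaskCreationOptions.None, scheduler). Because the queue in this sketch is unbounded, work submitted while every thread is blocked simply waits its turn.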
Tell vs. Ask
- Tell: tag value updates, AttributeValueChanged fan-out to child Script/Alarm actors, stream publishing. These are high-frequency or fire-and-forget paths.
- Ask: Instance.CallScript() (caller needs the return value), debug view snapshots, attribute reads from the Inbound API. Ask is reserved for cross-boundary, low-frequency calls.
Attribute serialization through the Instance Actor
All in-memory state mutations (attribute values, qualities, alarm states) run inside InstanceActor's mailbox. Multiple ScriptExecutionActor instances may run concurrently but all SetAttribute calls serialize through the InstanceActor mailbox, preventing race conditions. Concurrent script executions may interleave external side effects (HTTP calls, database writes, notifications); those are independent and intentionally not serialized.
Architecture
Actor hierarchy
DeploymentManagerActor (Akka.NET cluster singleton)
└── InstanceActor "MachineA-001"
├── ScriptActor "MonitorSpeed" (coordinator)
│ └── ScriptExecutionActor (short-lived, per invocation)
├── ScriptActor "CalculateOEE" (coordinator)
│ └── ScriptExecutionActor (short-lived)
├── AlarmActor "OverTemp" (coordinator, computed)
│ └── AlarmExecutionActor (short-lived, on-trigger)
├── AlarmActor "LowPressure" (coordinator, computed)
└── NativeAlarmActor "OpcUaServer1" (read-only mirror, peer to AlarmActor)
NativeAlarmActor is a sibling of AlarmActor — a peer under the same InstanceActor parent. It is not a child of AlarmActor and has no relationship to the script engine.
Deployment flow
Central sends a DeployInstanceCommand carrying a JSON FlattenedConfiguration to the site singleton. The singleton:
- Calls EnsureDclConnections to push any new or changed connection definitions to the DCL manager (hash-guarded: unchanged configs are skipped).
- Calls CreateInstanceActor, which does Context.ActorOf(props, instanceName).
- Runs an off-thread Task that calls SiteStorageService.StoreDeployedConfigAsync, clears static overrides and native alarm state, and, if _replicationActor is non-null (it is optional and null in isolated deployments/tests), tells SiteReplicationActor to push to the peer node.
- Pipes back a DeployPersistenceResult; only on success does it tell the deployer DeploymentStatus.Success. If persistence fails, the optimistically-created actor is stopped and the error is returned to central (SiteRuntime-005).
For redeployment (instance already running), the existing actor is stopped and watched:
```csharp
// DeploymentManagerActor.HandleDeploy
if (_instanceActors.TryGetValue(instanceName, out var existing))
{
    _instanceActors.Remove(instanceName);
    _pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
    _terminatingActorsByName[instanceName] = existing;
    Context.Watch(existing);
    Context.Stop(existing);
    return;
}
```
The Terminated signal fires once the previous actor and its entire subtree have stopped (freeing the actor name), and only then does ApplyDeployment run for the replacement. A third deploy arriving mid-termination overwrites the buffered PendingRedeploy (last-write-wins) and tells the displaced sender a Failed-superseded response (SiteRuntime-020).
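The last-write-wins rule can be modelled without Akka. The following is a simplified sketch: RedeployBuffer and BufferedDeploy are hypothetical names, and senders are plain strings rather than actor references.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical model of the buffering rule; names are illustrative only.
public sealed record BufferedDeploy(string DeploymentId, string Sender);

public sealed class RedeployBuffer
{
    private readonly Dictionary<string, BufferedDeploy> _pendingByInstance =
        new Dictionary<string, BufferedDeploy>();

    // Buffers a deploy for an instance whose old actor is still terminating.
    // Returns the deploy it displaced (to be answered as failed/superseded), or null.
    public BufferedDeploy? Buffer(string instanceName, BufferedDeploy deploy)
    {
        _pendingByInstance.TryGetValue(instanceName, out var displaced);
        _pendingByInstance[instanceName] = deploy;
        return displaced;
    }

    // Called once the old actor has terminated: hands back the deploy to apply, if any.
    public BufferedDeploy? TakeOnTerminated(string instanceName)
        => _pendingByInstance.Remove(instanceName, out var deploy) ? deploy : null;
}
```

Only the most recently buffered deploy survives until termination completes; every earlier one is returned from Buffer so its sender can be told it was superseded.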
Instance Actor initialization
On PreStart, InstanceActor:
- Fires SiteStorageService.GetStaticOverridesAsync asynchronously and pipes the result to self as a LoadOverridesResult. This is non-blocking; the message arrives later in the mailbox.
- Calls CreateChildActors() immediately (before the override message arrives). CreateChildActors snapshots _attributes (the live dictionary, seeded from flattened-config defaults) into attributeSnapshot before any child constructor runs. Each child's Props closure captures the immutable snapshot, not the live dictionary, preventing the race condition described in SiteRuntime-017. Because the override load is asynchronous, children are created from the un-overridden defaults; when the LoadOverridesResult message is subsequently processed, HandleOverridesLoaded applies the persisted overrides on top of the live _attributes dictionary.
- Calls SubscribeToDcl(), grouping data-sourced attributes by connection name and sending SubscribeTagsRequest to the DCL manager. Tag paths are stored in _tagPathToAttributes, a Dictionary<string, List<string>>, because one physical tag can back more than one attribute canonical name.
Data-sourced attributes start with quality Uncertain until the first TagValueUpdate arrives; static attributes start with quality Good.
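The one-tag-to-many-attributes index mentioned above can be sketched as follows. TagIndex, Bind, and AttributesFor are hypothetical names used for illustration; only the Dictionary<string, List<string>> shape comes from the text.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: TagIndex is a hypothetical wrapper around the
// tag path -> attribute names dictionary described in the text.
public sealed class TagIndex
{
    private readonly Dictionary<string, List<string>> _tagPathToAttributes =
        new Dictionary<string, List<string>>();

    // Registers an attribute as backed by the given physical tag.
    public void Bind(string tagPath, string attributeName)
    {
        if (!_tagPathToAttributes.TryGetValue(tagPath, out var attributes))
        {
            attributes = new List<string>();
            _tagPathToAttributes[tagPath] = attributes;
        }
        attributes.Add(attributeName);
    }

    // Every attribute that must be updated when a value for tagPath arrives.
    public IReadOnlyList<string> AttributesFor(string tagPath)
    {
        if (_tagPathToAttributes.TryGetValue(tagPath, out var attributes))
            return attributes;
        return Array.Empty<string>();
    }
}
```

A single incoming value for a tag path is then fanned out to each attribute name returned by AttributesFor.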
Script compilation
ScriptCompilationService.Compile(name, code) first runs ValidateTrustModel, which uses Roslyn semantic analysis (not substring scanning) to detect references to forbidden namespaces: System.IO, System.Diagnostics.Process, System.Threading (except Tasks and CancellationToken), System.Reflection, System.Net.Sockets, and System.Net.Http. Only after passing trust validation does it call CSharpScript.Create<object?> with the restricted ScriptOptions (references capped to the object, Enumerable, Math, CSharpArgumentInfo, and DynamicJsonElement assemblies).
```csharp
// ScriptCompilationService.CompileCore
var violations = ValidateTrustModel(code);
if (violations.Count > 0)
    return ScriptCompilationResult.Failed(violations);

var script = CSharpScript.Create<object?>(
    code,
    BuildScriptOptions(),
    globalsType: globalsType);
var diagnostics = script.Compile();
```
CompileTriggerExpression(name, expression) follows the same path but uses TriggerExpressionGlobals as the globals type instead of ScriptGlobals — trigger expressions are read-only and have no access to the script runtime API.
Shared script library
SharedScriptLibrary holds a Dictionary<string, Script<object?>> under a lock. It is populated at startup (off-thread by the singleton, piped back as SharedScriptsLoaded) and updated live when artifact deployments arrive carrying new shared scripts. Calling Scripts.CallShared("name", params) inside a script calls SharedScriptLibrary.ExecuteAsync, which runs the compiled script inline (no actor message, no serialization) and is awaited by the caller.
Script Actor triggers
ScriptActor parses the ResolvedScript.TriggerType and TriggerConfiguration into a discriminated union (IntervalTriggerConfig, ValueChangeTriggerConfig, ConditionalTriggerConfig, ExpressionTriggerConfig). Interval triggers use Akka ITimerScheduler. Value-change and conditional triggers react to AttributeValueChanged messages forwarded by the Instance Actor. Expression triggers maintain an _attributeSnapshot dictionary kept current by every AttributeValueChanged, then evaluate the compiled _compiledTriggerExpression synchronously with a 2-second CancellationTokenSource timeout.
WhileTrue mode is handled by HandleWhileTrueTransition:
```csharp
// ScriptActor.HandleWhileTrueTransition
private void HandleWhileTrueTransition(bool nowTrue, bool wasTrue)
{
    if (nowTrue && !wasTrue)
    {
        TrySpawnExecution(null);
        StartWhileTrueTimer();
    }
    else if (!nowTrue && wasTrue)
    {
        StopWhileTrueTimer();
    }
}
```
On the false→true edge the script fires once and a periodic re-fire timer starts at MinTimeBetweenRuns cadence; on the true→false edge the timer stops.
Site-wide Akka stream
SiteStreamManager materializes a broadcast hub in Initialize(ActorSystem), called by the Host after Akka starts. The hub is fed by a Source.ActorRef<ISiteStreamEvent> (bounded with OverflowStrategy.DropHead). InstanceActor publishes via _streamManager?.PublishAttributeValueChanged(changed) and PublishAlarmStateChanged(changed) — both are Tell calls; they never block the actor.
Each subscriber (typically a StreamRelayActor created by the Communication Layer's SiteStreamGrpcServer) gets its own materialized sub-graph with an independent Buffer(_bufferSize, DropHead) and a KillSwitch. A slow subscriber drops only its own events; it cannot stall other subscribers or the publishing Instance Actor.
```csharp
// SiteStreamManager.Subscribe
var killSwitch = _hubSource
    .Where(ev => ev.InstanceUniqueName == capturedInstance)
    .Buffer(_bufferSize, OverflowStrategy.DropHead)
    .ViaMaterialized(KillSwitches.Single<ISiteStreamEvent>(), Keep.Right)
    .To(Sink.ForEach<ISiteStreamEvent>(ev => capturedSubscriber.Tell(ev)))
    .Run(_materializer);
```
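The drop-oldest behaviour of a per-subscriber buffer can be demonstrated in isolation. Note that this sketch swaps in System.Threading.Channels rather than Akka Streams, purely to show the overflow semantics; DropOldestBuffer is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Threading.Channels;

// Illustrative stand-in built on System.Threading.Channels (not Akka Streams):
// a bounded buffer that discards its oldest entry when a new one arrives while full.
public sealed class DropOldestBuffer<T>
{
    private readonly Channel<T> _channel;

    public DropOldestBuffer(int capacity)
    {
        _channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.DropOldest,
        });
    }

    // Never blocks the publisher: when the buffer is full the oldest event is dropped.
    public void Publish(T item) => _channel.Writer.TryWrite(item);

    // Reads everything currently buffered, oldest first.
    public List<T> Drain()
    {
        var items = new List<T>();
        while (_channel.Reader.TryRead(out var item))
            items.Add(item);
        return items;
    }
}
```

Publishing the values 1 through 5 into a buffer of capacity 3 and then draining it returns 3, 4, 5: a slow consumer loses its own oldest events while the publisher is never held up.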
Native alarm mirror
NativeAlarmActor mirrors the condition state of one source binding — an OPC UA A&C server or MxAccess Gateway connection — without writing back to the source. Each condition is keyed by SourceReference.
On PreStart it concurrently kicks off two operations: the SQLite rehydration (GetNativeAlarmsAsync, piped back as RehydrationCompleted) and a SubscribeAlarmsRequest to the DCL manager — the subscribe is sent before the async rehydration completes and is not gated on it. The DCL forwards the subscribe request to the connection's IAlarmSubscribableConnection implementation.
Transition handling:
- AlarmTransitionKind.Snapshot accumulates into _snapshotBuffer.
- AlarmTransitionKind.SnapshotComplete atomically swaps _alarms with _snapshotBuffer. Conditions absent from the snapshot emit return-to-normal events and drop from the mirror; this is the mechanism that reconciles state after a reconnect.
- Live transitions (Raise, Ack, Clear, etc.) upsert by SourceReference, ignoring transitions older than the held TransitionTime (out-of-order protection).
Retention: a condition that is both inactive and acknowledged (!Active && Acknowledged) is dropped from the mirror and its SQLite row deleted. If the mirror exceeds MirroredAlarmCapPerSource (default 1000), the oldest condition is dropped and logged. Persistence is fire-and-forget — a write failure is logged but never blocks the actor or suppresses the upward AlarmStateChanged emit.
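The snapshot swap, out-of-order protection, and retention rule can be modelled in a few lines. This is a simplified sketch under stated assumptions: AlarmMirror and MirroredCondition are hypothetical types, persistence and the per-source cap are omitted, and event emission is reduced to return values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified model of the mirror rules; not the NativeAlarmActor source.
public sealed record MirroredCondition(
    string SourceReference, bool Active, bool Acknowledged, DateTime TransitionTime);

public sealed class AlarmMirror
{
    private Dictionary<string, MirroredCondition> _alarms = new Dictionary<string, MirroredCondition>();
    private Dictionary<string, MirroredCondition> _snapshotBuffer = new Dictionary<string, MirroredCondition>();

    public bool Contains(string sourceReference) => _alarms.ContainsKey(sourceReference);

    // Snapshot transitions only accumulate; the live mirror is untouched until completion.
    public void AddSnapshot(MirroredCondition condition)
        => _snapshotBuffer[condition.SourceReference] = condition;

    // Swaps the buffer in and returns the conditions that were mirrored but are
    // absent from the snapshot (these would be reported as returned to normal).
    public List<string> CompleteSnapshot()
    {
        var vanished = _alarms.Keys.Where(key => !_snapshotBuffer.ContainsKey(key)).ToList();
        _alarms = _snapshotBuffer;
        _snapshotBuffer = new Dictionary<string, MirroredCondition>();
        return vanished;
    }

    // Applies a live transition; returns false when it is ignored as stale.
    public bool ApplyLive(MirroredCondition condition)
    {
        if (_alarms.TryGetValue(condition.SourceReference, out var held)
            && condition.TransitionTime < held.TransitionTime)
            return false; // older than what is already held

        if (!condition.Active && condition.Acknowledged)
            _alarms.Remove(condition.SourceReference); // retention: inactive and acknowledged
        else
            _alarms[condition.SourceReference] = condition;
        return true;
    }
}
```

Keeping the snapshot in a separate buffer means a partially received snapshot never replaces good state; the mirror changes only when the completion marker arrives.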
Enriched AlarmStateChanged
Both AlarmActor and NativeAlarmActor tell the InstanceActor an AlarmStateChanged. The message was extended additively so existing computed-alarm consumers continue to work unchanged:
| Field | Computed alarm | Native alarm |
|---|---|---|
| Kind | AlarmKind.Computed | AlarmKind.NativeOpcUa or NativeMxAccess |
| Condition | computed default (auto-acknowledged, Severity = Priority) | mirrored AlarmConditionState from source |
| SourceReference, AlarmTypeName, Category, Message, OperatorUser, OperatorComment, OriginalRaiseTime, CurrentValue, LimitValue | empty/null | populated from source transition |
InstanceActor stores the latest enriched event per alarm name in _latestAlarmEvents. The Debug View snapshot uses this map so native alarm metadata reaches the central debug view.
Local SQLite schema
SiteStorageService owns the site database (raw Microsoft.Data.Sqlite, not EF Core). Tables created by InitializeAsync:
| Table | Purpose | Reset on redeploy? |
|---|---|---|
| deployed_configurations | Persisted flattened configs (survives restart/failover) | No (replaced) |
| static_attribute_overrides | Runtime attribute writes (SetAttribute on static attrs) | Yes (cleared by ClearStaticOverridesAsync) |
| native_alarm_state | Mirrored native alarm conditions (survives failover) | Yes (cleared by ClearNativeAlarmsForInstanceAsync) |
| shared_scripts | Shared script code from artifact deployments | No |
| external_systems | External system definitions | No |
| database_connections | Database connection strings | No |
| data_connection_definitions | OPC UA / MxGateway endpoint definitions | No |
| notification_lists | Notification list definitions | No |
| smtp_configurations | SMTP configuration (from artifact deployment) | No |
Standby replication
SiteReplicationActor runs on every site node (not a singleton). The active node's DeploymentManagerActor tells it ReplicateConfigDeploy, ReplicateConfigRemove, ReplicateConfigSetEnabled, ReplicateArtifacts, or ReplicateStoreAndForward. The replication actor tracks the peer node via Akka cluster membership events and forwards each command to /user/site-replication on the peer via ActorSelection. Replication is fire-and-forget (no ack wait per design), so a failed write to the standby is logged but does not fail the primary operation.
Usage
Lifecycle commands
Central sends commands to the site DeploymentManagerActor singleton over the Communication Layer:
| Command | Effect |
|---|---|
| DeployInstanceCommand | Create or replace the instance actor; persist config to SQLite; clear static overrides and native alarm state |
| DisableInstanceCommand | Stop the instance actor; set is_enabled = 0 in SQLite; retain config for re-enable |
| EnableInstanceCommand | Create a new instance actor from the stored config |
| DeleteInstanceCommand | Stop the instance actor; remove config from SQLite; store-and-forward messages are NOT cleared |
| DeployArtifactsCommand | Persist shared scripts, external system definitions, database connections, notification lists, data connection definitions; recompile shared scripts; push data connections to DCL |
Script API surface
Scripts run inside ScriptExecutionActor with a ScriptGlobals object as the Roslyn host object. The Instance global is a ScriptRuntimeContext. Convenience top-level aliases (ExternalSystem, Database, Notify, Scripts, Attributes, Children, Parent) delegate to context methods. Key calls:
- Instance.GetAttribute("name") / Instance.SetAttribute("name", value): Ask to InstanceActor for write, in-process for read.
- Instance.CallScript("scriptName", params): Ask from ScriptExecutionActor to sibling ScriptActor, which spawns a new ScriptExecutionActor.
- Scripts.CallShared("name", params): SharedScriptLibrary.ExecuteAsync, inline on the current scheduler thread.
- ExternalSystem.Call(...): synchronous HTTP call through IExternalSystemClient.
- ExternalSystem.CachedCall(...) / Database.CachedWrite(...): store-and-forwarded; returns a TrackedOperationId.
- Tracking.Status(id): reads the site-local OperationTrackingStore synchronously.
- Notify.To("list").Send(...): enqueues a notification in the Store-and-Forward Engine for delivery to central.
Alarm on-trigger scripts run in AlarmExecutionActor with a restricted context: they receive an Alarm global (AlarmContext carrying Name, Level, Priority, Message) and have access to the instance/shared-script surface (Instance.*, Scripts.CallShared, Instance.CallScript), but not the external-system, database, notification, or audit integration APIs. AlarmExecutionActor builds its ScriptRuntimeContext without a serviceProvider, so ExternalSystem, Database, Notify, and audit writes are never resolved there; they are available only inside ScriptExecutionActor (instance scripts).
Debug view
The Communication Layer sends SubscribeDebugViewRequest or DebugSnapshotRequest to the singleton, which forwards to the named InstanceActor. The Instance Actor replies with a DebugViewSnapshot built from the current _attributes dictionary and _latestAlarmEvents map. Ongoing changes reach the central debug view via the gRPC stream, not through the actor hierarchy.
Configuration
All options live in the ScadaBridge:SiteRuntime section, bound to SiteRuntimeOptions:
| Key | Default | Description |
|---|---|---|
| StartupBatchSize | 20 | Instance Actors created per batch during staggered startup |
| StartupBatchDelayMs | 100 | Milliseconds between startup batches |
| MaxScriptCallDepth | 10 | Maximum Instance.CallScript / Scripts.CallShared recursion depth |
| ScriptExecutionTimeoutSeconds | 30 | Per-script body execution timeout; exceeding it cancels and logs an error |
| StreamBufferSize | 1000 | Per-subscriber drop-oldest buffer size for the Akka broadcast stream |
| ScriptExecutionThreadCount | 8 | Dedicated threads in the ScriptExecutionScheduler (covers both scripts and alarm on-trigger bodies) |
| MirroredAlarmCapPerSource | 1000 | Maximum mirrored conditions per NativeAlarmActor source binding before oldest is dropped and logged |
| NativeAlarmRetryIntervalMs | 5000 | Milliseconds before retrying a failed native alarm subscription |
The SQLite connection string is passed directly to AddSiteRuntime(connectionString) in the host composition root and is not part of SiteRuntimeOptions.
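As an illustration, the section could look like this in a JSON configuration file. This assumes the host uses the standard .NET JSON configuration provider, where the ScadaBridge:SiteRuntime key maps to nested objects; the values shown are simply the defaults from the table.

```json
{
  "ScadaBridge": {
    "SiteRuntime": {
      "StartupBatchSize": 20,
      "StartupBatchDelayMs": 100,
      "MaxScriptCallDepth": 10,
      "ScriptExecutionTimeoutSeconds": 30,
      "StreamBufferSize": 1000,
      "ScriptExecutionThreadCount": 8,
      "MirroredAlarmCapPerSource": 1000,
      "NativeAlarmRetryIntervalMs": 5000
    }
  }
}
```

Keys left out of the file fall back to the defaults listed above.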
Dependencies & Interactions
- Data Connection Layer (#4): supplies TagValueUpdate and ConnectionQualityChanged messages to InstanceActor; receives SubscribeTagsRequest and WriteTagRequest. Also supplies NativeAlarmTransitionUpdate and NativeAlarmSourceUnavailable to NativeAlarmActor via SubscribeAlarmsRequest (connections implementing IAlarmSubscribableConnection).
- Central–Site Communication (#5): routes DeployInstanceCommand, DisableInstanceCommand, EnableInstanceCommand, DeleteInstanceCommand, DeployArtifactsCommand, debug view requests, and Inbound API RouteToCallRequest / RouteToGetAttributesRequest / RouteToSetAttributesRequest to the singleton; receives DeploymentStatusResponse and ArtifactDeploymentResponse back. The SiteStreamManager implements ISiteStreamSubscriber so the Communication Layer's SiteStreamGrpcServer can subscribe StreamRelayActor instances to the broadcast hub.
- Store-and-Forward Engine (#6): ScriptRuntimeContext uses StoreAndForwardService (resolved from DI inside ScriptExecutionActor) for ExternalSystem.CachedCall, Database.CachedWrite, and Notify.To().Send(). The engine owns the site-local operation tracking table that Tracking.Status(id) reads.
- External System Gateway (#7): IExternalSystemClient is called by ScriptRuntimeContext.ExternalSystemHelper for synchronous and cached external system calls.
- Site Event Logging (#12): ISiteEventLogger (resolved from DI inside execution actors) receives script error, alarm error, and script execution events.
- Health Monitoring (#11): ISiteHealthCollector (injected into DeploymentManagerActor, InstanceActor, ScriptActor, AlarmActor) tracks instance counts (SetInstanceCounts), script errors (IncrementScriptError), and alarm errors (IncrementAlarmError); DeploymentManagerActor calls SetActiveNode in PreStart/PostStop so the health report reflects which node holds the singleton.
- Audit Log (#23): IAuditWriter (resolved from DI inside ScriptExecutionActor) receives one row per script-trust-boundary call; audit writes are best-effort and never abort the calling script.
- Commons (#16): owns all message contracts (DeployInstanceCommand, AttributeValueChanged, AlarmStateChanged, ScriptCallRequest, NativeAlarmTransitionUpdate, etc.), the FlattenedConfiguration / ResolvedScript / ResolvedAlarm / ResolvedNativeAlarmSource types, and the AlarmKind / AlarmState / AlarmLevel / AlarmConditionState / AlarmTransitionKind enums.
- Local SQLite: SiteStorageService owns the site database. Peer SQLite stores (Store-and-Forward buffer, AuditLog, operation tracking, site event log) are owned by their respective components but share the same SQLite file path convention.
- Design spec: Component-SiteRuntime.md.
Troubleshooting
Instance stays in Unknown or Failed after deployment
DeploymentManagerActor only tells central DeploymentStatus.Success after SiteStorageService.StoreDeployedConfigAsync commits. If the site SQLite file is locked, full, or on a read-only volume, the persistence task throws, the optimistically-created actor is stopped, and central receives Failed. Check the site event log for Failed to persist deployment entries. The site SQLite path is configured in the host appsettings.json.
Reconnection storm on failover
If many instances race to subscribe to OPC UA at once, the server may throttle or drop connections. Increase StartupBatchDelayMs or decrease StartupBatchSize in SiteRuntimeOptions. With the current defaults (batch 20, delay 100 ms), a site with 200 instances starts in 10 batches and takes about 1 second to start all subscriptions, which is acceptable for most servers.
Script execution actor backpressure
The ScriptExecutionScheduler has a fixed thread count (ScriptExecutionThreadCount, default 8). If all threads are blocked (a burst of scripts each waiting on a slow database or external system), new script invocations queue behind them. The queue is unbounded — memory usage can grow during a backlog. If this is observed, raise ScriptExecutionThreadCount or reduce the number of concurrent long-running scripts. Script execution timeout (ScriptExecutionTimeoutSeconds) bounds the worst case.
Native alarm conditions not recovering after reconnect
NativeAlarmActor retains last-known conditions during a source outage (it does not clear them) and reconciles state via the reconnect snapshot swap. If the snapshot never arrives (for example, because the DCL connection was cleanly unsubscribed rather than failing), the actor may hold stale Active conditions indefinitely. A redeploy of the instance clears native_alarm_state in SQLite and forces a fresh subscription. Failed subscription retries are logged at Warning level with the retry interval.
InvalidActorNameException on rapid redeployment
If further DeployInstanceCommand messages arrive for the same instance while a redeployment is still terminating the old actor, the _terminatingActorsByName shadow index in DeploymentManagerActor detects the name collision and buffers the newest command instead of creating the actor (SiteRuntime-020). Any previously buffered deploy is displaced, and its sender receives a Failed-superseded response. This is expected behaviour: central should observe the Failed response and retry when the site is ready.