Files
Joseph Doherty 25bae4e43b docs(components): accuracy fixes from deep review (batch 2)
TemplateEngine (alarm-script-ref ordering, native-alarm-sources not in
revision hash, composition cycle checks, 9-step pipeline), SiteRuntime
(alarm on-trigger scripts run with a restricted context; PreStart seeds
children from defaults before overrides arrive), DataConnectionLayer
(UnsubscribeAlarmsRequest stashed in Connecting), StoreAndForward (InFlight/
Delivered are dead enum values; notifications can park at 50 retries),
ExternalSystemGateway (CachedWrite returns void + enqueues directly; log levels).
2026-06-03 16:34:37 -04:00

27 KiB
Raw Permalink Blame History

Site Runtime

The Site Runtime component runs the site-side actor hierarchy that executes deployed machine instances: it owns script compilation, alarm evaluation, native alarm mirroring, and the site-wide Akka stream that carries attribute value and alarm state changes to every subscriber.

Overview

Site Runtime (#3) operates exclusively on site clusters. Its entry point is the DeploymentManagerActor cluster singleton, which re-creates the full actor hierarchy on every site startup or failover. Each deployed enabled instance gets an InstanceActor child; each InstanceActor spawns ScriptActor and AlarmActor coordinator children, plus a NativeAlarmActor peer for every configured native alarm source. Script invocations spawn short-lived ScriptExecutionActor children; alarm on-trigger invocations spawn short-lived AlarmExecutionActor children.

The component code lives in src/ZB.MOM.WW.ScadaBridge.SiteRuntime/:

  • Actors/DeploymentManagerActor, InstanceActor, ScriptActor, ScriptExecutionActor, AlarmActor, AlarmExecutionActor, NativeAlarmActor, SiteReplicationActor.
  • Scripts/ScriptCompilationService, ScriptExecutionScheduler, SharedScriptLibrary, ScriptRuntimeContext, ScopeAccessors, TriggerExpressionGlobals.
  • Streaming/SiteStreamManager (the site-wide Akka broadcast stream).
  • Persistence/SiteStorageService (raw SQLite via Microsoft.Data.Sqlite), SiteStorageInitializer.
  • Repositories/SiteExternalSystemRepository, SiteNotificationRepository.
  • Tracking/OperationTrackingStore, OperationTrackingOptions.

ServiceCollectionExtensions.AddSiteRuntime(connectionString) registers all singletons; the Host calls it and wires the DeploymentManagerActor cluster singleton separately via AkkaHostedService.

Key Concepts

Cluster singleton as the single point of authority

DeploymentManagerActor runs as an Akka.NET cluster singleton — guaranteed to be active on exactly one site node at a time. On failover, Akka restarts the singleton on the surviving node. Because all deployment commands from central are routed through the singleton, there is never a split-brain dispute over which node owns instance lifecycle: the singleton is the only actor that calls Context.ActorOf for InstanceActor children.

Staggered startup

The singleton reads all deployed configurations from SQLite in PreStart, compiles shared scripts off-thread, and then creates InstanceActor children in batches. The default batch size is 20, with a 100 ms delay between batches (StartupBatchSize and StartupBatchDelayMs in SiteRuntimeOptions). Without staggering, 500 instances each subscribing to OPC UA tags simultaneously would produce a reconnection storm that overwhelms the OPC UA server.

Actor supervision and lifetimes

Actor Kind Supervises children with On exception
DeploymentManagerActor long-lived singleton OneForOneStrategy Resume (coord) / Stop (init failure)
InstanceActor long-lived per instance OneForOneStrategy Resume for all coordinator children
ScriptActor long-lived coordinator OneForOneStrategy Stop execution child, keep self
AlarmActor long-lived coordinator OneForOneStrategy Stop execution child, keep self
NativeAlarmActor long-lived coordinator Supervised by Instance Actor (Resume)
ScriptExecutionActor short-lived per invocation Stops itself; parent logs failure
AlarmExecutionActor short-lived per invocation Stops itself; parent logs failure

Coordinator actors resume on exception because their in-memory state (trigger timers, last execution time, alarm level) must survive child crashes. Short-lived execution actors stop themselves on completion or exception — the coordinator remains available for the next trigger.

Dedicated script-execution dispatcher

Script and alarm on-trigger bodies run on the ScriptExecutionScheduler (SiteRuntime-009): a custom TaskScheduler backed by a bounded set of dedicated threads (default 8, ScriptExecutionThreadCount). The script body is submitted to this scheduler via Task.Factory.StartNew(..., scheduler) inside ScriptExecutionActor and AlarmExecutionActor. Scripts that block on I/O (database connections, synchronous external system calls) block only the scheduler's threads, leaving the shared .NET thread pool and all Akka dispatchers unaffected.

Tell vs. Ask

  • Tell: tag value updates, AttributeValueChanged fan-out to child Script/Alarm actors, stream publishing. These are high-frequency or fire-and-forget paths.
  • Ask: Instance.CallScript() (caller needs the return value), debug view snapshots, attribute reads from the Inbound API. Ask is reserved for cross-boundary, low-frequency calls.

Attribute serialization through the Instance Actor

All in-memory state mutations (attribute values, qualities, alarm states) run inside InstanceActor's mailbox. Multiple ScriptExecutionActor instances may run concurrently but all SetAttribute calls serialize through the InstanceActor mailbox, preventing race conditions. Concurrent script executions may interleave external side effects (HTTP calls, database writes, notifications); those are independent and intentionally not serialized.

Architecture

Actor hierarchy

DeploymentManagerActor  (Akka.NET cluster singleton)
└── InstanceActor  "MachineA-001"
    ├── ScriptActor  "MonitorSpeed"           (coordinator)
    │   └── ScriptExecutionActor             (short-lived, per invocation)
    ├── ScriptActor  "CalculateOEE"           (coordinator)
    │   └── ScriptExecutionActor             (short-lived)
    ├── AlarmActor  "OverTemp"               (coordinator, computed)
    │   └── AlarmExecutionActor             (short-lived, on-trigger)
    ├── AlarmActor  "LowPressure"            (coordinator, computed)
    └── NativeAlarmActor  "OpcUaServer1"     (read-only mirror, peer to AlarmActor)

NativeAlarmActor is a sibling of AlarmActor — a peer under the same InstanceActor parent. It is not a child of AlarmActor and has no relationship to the script engine.

Deployment flow

Central sends a DeployInstanceCommand carrying a JSON FlattenedConfiguration to the site singleton. The singleton:

  1. Calls EnsureDclConnections to push any new or changed connection definitions to the DCL manager (hash-guarded: unchanged configs are skipped).
  2. Calls CreateInstanceActor, which does Context.ActorOf(props, instanceName).
  3. Runs an off-thread Task that calls SiteStorageService.StoreDeployedConfigAsync, clears static overrides and native alarm state, and — if _replicationActor is non-null (it is optional and null in isolated deployments/tests) — tells SiteReplicationActor to push to the peer node.
  4. Pipes back a DeployPersistenceResult; only on success does it tell the deployer DeploymentStatus.Success. If persistence fails, the optimistically-created actor is stopped and the error is returned to central (SiteRuntime-005).

For redeployment (instance already running), the existing actor is stopped and watched:

// DeploymentManagerActor.HandleDeploy
if (_instanceActors.TryGetValue(instanceName, out var existing))
{
    _instanceActors.Remove(instanceName);
    _pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
    _terminatingActorsByName[instanceName] = existing;
    Context.Watch(existing);
    Context.Stop(existing);
    return;
}

The Terminated signal fires once the previous actor and its entire subtree have stopped (freeing the actor name), and only then does ApplyDeployment run for the replacement. A third deploy arriving mid-termination overwrites the buffered PendingRedeploy (last-write-wins) and tells the displaced sender a Failed-superseded response (SiteRuntime-020).

Instance Actor initialization

On PreStart, InstanceActor:

  1. Fires SiteStorageService.GetStaticOverridesAsync asynchronously and pipes the result to self as a LoadOverridesResult — this is non-blocking; the message arrives later in the mailbox.
  2. Calls CreateChildActors() immediately (before the override message arrives). CreateChildActors snapshots _attributes (the live dictionary, seeded from flattened-config defaults) into attributeSnapshot before any child constructor runs. Each child's Props closure captures the immutable snapshot, not the live dictionary — preventing the race condition described in SiteRuntime-017. Because the override load is asynchronous, children are created from the un-overridden defaults; when the LoadOverridesResult message is subsequently processed, HandleOverridesLoaded applies the persisted overrides on top of the live _attributes dictionary.
  3. Calls SubscribeToDcl(), grouping data-sourced attributes by connection name and sending SubscribeTagsRequest to the DCL manager. Tag paths are stored in _tagPathToAttributes, a Dictionary<string, List<string>>, because one physical tag can back more than one attribute canonical name.

Data-sourced attributes start with quality Uncertain until the first TagValueUpdate arrives; static attributes start with quality Good.

Script compilation

ScriptCompilationService.Compile(name, code) first runs ValidateTrustModel, which uses Roslyn semantic analysis (not substring scanning) to detect references to forbidden namespaces (System.IO, System.Diagnostics.Process, System.Threading — except Tasks/CancellationToken, System.Reflection, System.Net.Sockets, System.Net.Http). Only after passing trust validation does it call CSharpScript.Create<object?> with the restricted ScriptOptions (references capped to object, Enumerable, Math, CSharpArgumentInfo, and DynamicJsonElement assemblies).

// ScriptCompilationService.CompileCore
var violations = ValidateTrustModel(code);
if (violations.Count > 0)
    return ScriptCompilationResult.Failed(violations);

var script = CSharpScript.Create<object?>(
    code,
    BuildScriptOptions(),
    globalsType: globalsType);
var diagnostics = script.Compile();

CompileTriggerExpression(name, expression) follows the same path but uses TriggerExpressionGlobals as the globals type instead of ScriptGlobals — trigger expressions are read-only and have no access to the script runtime API.

Shared script library

SharedScriptLibrary holds a Dictionary<string, Script<object?>> under a lock. It is populated at startup (off-thread by the singleton, piped back as SharedScriptsLoaded) and updated live when artifact deployments arrive carrying new shared scripts. Calling Scripts.CallShared("name", params) inside a script calls SharedScriptLibrary.ExecuteAsync, which runs the compiled delegate inline as compiled code (no actor message, no serialization) and is awaited by the caller.

Script Actor triggers

ScriptActor parses the ResolvedScript.TriggerType and TriggerConfiguration into a discriminated union (IntervalTriggerConfig, ValueChangeTriggerConfig, ConditionalTriggerConfig, ExpressionTriggerConfig). Interval triggers use Akka ITimerScheduler. Value-change and conditional triggers react to AttributeValueChanged messages forwarded by the Instance Actor. Expression triggers maintain an _attributeSnapshot dictionary kept current by every AttributeValueChanged, then evaluate the compiled _compiledTriggerExpression synchronously with a 2-second CancellationTokenSource timeout.

WhileTrue mode is handled by HandleWhileTrueTransition:

// ScriptActor.HandleWhileTrueTransition
private void HandleWhileTrueTransition(bool nowTrue, bool wasTrue)
{
    if (nowTrue && !wasTrue)
    {
        TrySpawnExecution(null);
        StartWhileTrueTimer();
    }
    else if (!nowTrue && wasTrue)
    {
        StopWhileTrueTimer();
    }
}

On the false→true edge the script fires once and a periodic re-fire timer starts at MinTimeBetweenRuns cadence; on the true→false edge the timer stops.

Site-wide Akka stream

SiteStreamManager materializes a broadcast hub in Initialize(ActorSystem), called by the Host after Akka starts. The hub is fed by a Source.ActorRef<ISiteStreamEvent> (bounded with OverflowStrategy.DropHead). InstanceActor publishes via _streamManager?.PublishAttributeValueChanged(changed) and PublishAlarmStateChanged(changed) — both are Tell calls; they never block the actor.

Each subscriber (typically a StreamRelayActor created by the Communication Layer's SiteStreamGrpcServer) gets its own materialized sub-graph with an independent Buffer(_bufferSize, DropHead) and a KillSwitch. A slow subscriber drops only its own events; it cannot stall other subscribers or the publishing Instance Actor.

// SiteStreamManager.Subscribe
var killSwitch = _hubSource
    .Where(ev => ev.InstanceUniqueName == capturedInstance)
    .Buffer(_bufferSize, OverflowStrategy.DropHead)
    .ViaMaterialized(KillSwitches.Single<ISiteStreamEvent>(), Keep.Right)
    .To(Sink.ForEach<ISiteStreamEvent>(ev => capturedSubscriber.Tell(ev)))
    .Run(_materializer);

Native alarm mirror

NativeAlarmActor mirrors the condition state of one source binding — an OPC UA A&C server or MxAccess Gateway connection — without writing back to the source. Each condition is keyed by SourceReference.

On PreStart it concurrently kicks off two operations: the SQLite rehydration (GetNativeAlarmsAsync, piped back as RehydrationCompleted) and a SubscribeAlarmsRequest to the DCL manager — the subscribe is sent before the async rehydration completes and is not gated on it. The DCL forwards the subscribe request to the connection's IAlarmSubscribableConnection implementation.

Transition handling:

  • AlarmTransitionKind.Snapshot accumulates into _snapshotBuffer.
  • AlarmTransitionKind.SnapshotComplete atomically swaps _alarms with _snapshotBuffer. Conditions absent from the snapshot emit return-to-normal events and drop from the mirror — the mechanism that reconciles state after a reconnect.
  • Live transitions (Raise, Ack, Clear, etc.) upsert by SourceReference, ignoring transitions older than the held TransitionTime (out-of-order protection).

Retention: a condition that is both inactive and acknowledged (!Active && Acknowledged) is dropped from the mirror and its SQLite row deleted. If the mirror exceeds MirroredAlarmCapPerSource (default 1000), the oldest condition is dropped and logged. Persistence is fire-and-forget — a write failure is logged but never blocks the actor or suppresses the upward AlarmStateChanged emit.

Enriched AlarmStateChanged

Both AlarmActor and NativeAlarmActor tell the InstanceActor an AlarmStateChanged. The message was extended additively so existing computed-alarm consumers continue to work unchanged:

Field Computed alarm Native alarm
Kind AlarmKind.Computed AlarmKind.NativeOpcUa or NativeMxAccess
Condition computed default (auto-acknowledged, Severity = Priority) mirrored AlarmConditionState from source
SourceReference, AlarmTypeName, Category, Message, OperatorUser, OperatorComment, OriginalRaiseTime, CurrentValue, LimitValue empty/null populated from source transition

InstanceActor stores the latest enriched event per alarm name in _latestAlarmEvents. The Debug View snapshot uses this map so native alarm metadata reaches the central debug view.

Local SQLite schema

SiteStorageService owns the site database (raw Microsoft.Data.Sqlite, not EF Core). Tables created by InitializeAsync:

Table Purpose Reset on redeploy?
deployed_configurations Persisted flattened configs (survives restart/failover) No (replaced)
static_attribute_overrides Runtime attribute writes (SetAttribute on static attrs) Yes — cleared by ClearStaticOverridesAsync
native_alarm_state Mirrored native alarm conditions (survives failover) Yes — cleared by ClearNativeAlarmsForInstanceAsync
shared_scripts Shared script code from artifact deployments No
external_systems External system definitions No
database_connections Database connection strings No
data_connection_definitions OPC UA / MxGateway endpoint definitions No
notification_lists Notification list definitions No
smtp_configurations SMTP configuration (from artifact deployment) No

Standby replication

SiteReplicationActor runs on every site node (not a singleton). The active node's DeploymentManagerActor tells it ReplicateConfigDeploy, ReplicateConfigRemove, ReplicateConfigSetEnabled, ReplicateArtifacts, or ReplicateStoreAndForward. The replication actor tracks the peer node via Akka cluster membership events and forwards each command to /user/site-replication on the peer via ActorSelection. Replication is fire-and-forget (no ack wait per design), so a failed write to the standby is logged but does not fail the primary operation.

Usage

Lifecycle commands

Central sends commands to the site DeploymentManagerActor singleton over the Communication Layer:

Command Effect
DeployInstanceCommand Create or replace the instance actor; persist config to SQLite; clear static overrides and native alarm state
DisableInstanceCommand Stop the instance actor; set is_enabled = 0 in SQLite; retain config for re-enable
EnableInstanceCommand Create a new instance actor from the stored config
DeleteInstanceCommand Stop the instance actor; remove config from SQLite; store-and-forward messages are NOT cleared
DeployArtifactsCommand Persist shared scripts, external system definitions, database connections, notification lists, data connection definitions; recompile shared scripts; push data connections to DCL

Script API surface

Scripts run inside ScriptExecutionActor with a ScriptGlobals object as the Roslyn host object. The Instance global is a ScriptRuntimeContext. Convenience top-level aliases (ExternalSystem, Database, Notify, Scripts, Attributes, Children, Parent) delegate to context methods. Key calls:

  • Instance.GetAttribute("name") / Instance.SetAttribute("name", value) — Ask to InstanceActor for write, in-process for read.
  • Instance.CallScript("scriptName", params) — Ask from ScriptExecutionActor to sibling ScriptActor, which spawns a new ScriptExecutionActor.
  • Scripts.CallShared("name", params)SharedScriptLibrary.ExecuteAsync, inline on the current scheduler thread.
  • ExternalSystem.Call(...) — synchronous HTTP call through IExternalSystemClient.
  • ExternalSystem.CachedCall(...) / Database.CachedWrite(...) — store-and-forwarded; returns a TrackedOperationId.
  • Tracking.Status(id) — reads the site-local OperationTrackingStore synchronously.
  • Notify.To("list").Send(...) — enqueues a notification in the Store-and-Forward Engine for delivery to central.

Alarm on-trigger scripts run in AlarmExecutionActor with a restricted context: they receive an Alarm global (AlarmContext carrying Name, Level, Priority, Message) and have access to the instance/shared-script surface (Instance.*, Scripts.CallShared, Instance.CallScript), but not the external-system, database, notification, or audit integration APIs. AlarmExecutionActor builds its ScriptRuntimeContext without a serviceProvider, so ExternalSystem, Database, Notify, and audit writes are unavailable to alarm on-trigger scripts — those APIs are only resolved inside ScriptExecutionActor (instance scripts).

Debug view

The Communication Layer sends SubscribeDebugViewRequest or DebugSnapshotRequest to the singleton, which forwards to the named InstanceActor. The Instance Actor replies with a DebugViewSnapshot built from the current _attributes dictionary and _latestAlarmEvents map. Ongoing changes reach the central debug view via the gRPC stream, not through the actor hierarchy.

Configuration

All options live in the ScadaBridge:SiteRuntime section, bound to SiteRuntimeOptions:

Key Default Description
StartupBatchSize 20 Instance Actors created per batch during staggered startup
StartupBatchDelayMs 100 Milliseconds between startup batches
MaxScriptCallDepth 10 Maximum Instance.CallScript / Scripts.CallShared recursion depth
ScriptExecutionTimeoutSeconds 30 Per-script body execution timeout; exceeding it cancels and logs an error
StreamBufferSize 1000 Per-subscriber drop-oldest buffer size for the Akka broadcast stream
ScriptExecutionThreadCount 8 Dedicated threads in the ScriptExecutionScheduler (covers both scripts and alarm on-trigger bodies)
MirroredAlarmCapPerSource 1000 Maximum mirrored conditions per NativeAlarmActor source binding before oldest is dropped and logged
NativeAlarmRetryIntervalMs 5000 Milliseconds before retrying a failed native alarm subscription

The SQLite connection string is passed directly to AddSiteRuntime(connectionString) in the host composition root and is not part of SiteRuntimeOptions.

Dependencies & Interactions

  • Data Connection Layer (#4) — supplies TagValueUpdate and ConnectionQualityChanged messages to InstanceActor; receives SubscribeTagsRequest and WriteTagRequest. Also supplies NativeAlarmTransitionUpdate and NativeAlarmSourceUnavailable to NativeAlarmActor via SubscribeAlarmsRequest (connections implementing IAlarmSubscribableConnection).
  • CentralSite Communication (#5) — routes DeployInstanceCommand, DisableInstanceCommand, EnableInstanceCommand, DeleteInstanceCommand, DeployArtifactsCommand, debug view requests, and Inbound API RouteToCallRequest / RouteToGetAttributesRequest / RouteToSetAttributesRequest to the singleton; receives DeploymentStatusResponse and ArtifactDeploymentResponse back. The SiteStreamManager implements ISiteStreamSubscriber so the Communication Layer's SiteStreamGrpcServer can subscribe StreamRelayActor instances to the broadcast hub.
  • Store-and-Forward Engine (#6)ScriptRuntimeContext passes StoreAndForwardService (resolved from DI inside ScriptExecutionActor) for ExternalSystem.CachedCall, Database.CachedWrite, and Notify.To().Send(). Owns the site-local operation tracking table that Tracking.Status(id) reads.
  • External System Gateway (#7)IExternalSystemClient called by ScriptRuntimeContext.ExternalSystemHelper for synchronous and cached external system calls.
  • Site Event Logging (#12)ISiteEventLogger (resolved from DI inside execution actors) receives script error, alarm error, and script execution events.
  • Health Monitoring (#11)ISiteHealthCollector (injected into DeploymentManagerActor, InstanceActor, ScriptActor, AlarmActor) tracks instance counts (SetInstanceCounts), script errors (IncrementScriptError), and alarm errors (IncrementAlarmError); sets SetActiveNode in DeploymentManagerActor.PreStart/PostStop so the health report reflects which node holds the singleton.
  • Audit Log (#23)IAuditWriter (resolved from DI inside ScriptExecutionActor) receives one row per script-trust-boundary call; audit writes are best-effort and never abort the calling script.
  • Commons (#16) — owns all message contracts (DeployInstanceCommand, AttributeValueChanged, AlarmStateChanged, ScriptCallRequest, NativeAlarmTransitionUpdate, etc.), the FlattenedConfiguration / ResolvedScript / ResolvedAlarm / ResolvedNativeAlarmSource types, and the AlarmKind / AlarmState / AlarmLevel / AlarmConditionState / AlarmTransitionKind enums.
  • Local SQLite — SiteStorageService owns the site database. Peer SQLite stores (Store-and-Forward buffer, AuditLog, operation tracking, site event log) are owned by their respective components but share the same SQLite file path convention.
  • Design spec: Component-SiteRuntime.md.

Troubleshooting

Instance stays in Unknown or Failed after deployment

DeploymentManagerActor only tells central DeploymentStatus.Success after SiteStorageService.StoreDeployedConfigAsync commits. If the site SQLite file is locked, full, or on a read-only volume, the persistence task throws, the optimistically-created actor is stopped, and central receives Failed. Check the site event log for Failed to persist deployment entries. The site SQLite path is configured in the host appsettings.json.

Reconnection storm on failover

If many instances race to subscribe to OPC UA at once the server may throttle or drop connections. Increase StartupBatchDelayMs or decrease StartupBatchSize in SiteRuntimeOptions. The current defaults (batch 20, delay 100 ms) mean a site with 200 instances takes 1 second to start all subscriptions, which is acceptable for most servers.

Script execution actor backpressure

The ScriptExecutionScheduler has a fixed thread count (ScriptExecutionThreadCount, default 8). If all threads are blocked (a burst of scripts each waiting on a slow database or external system), new script invocations queue behind them. The queue is unbounded — memory usage can grow during a backlog. If this is observed, raise ScriptExecutionThreadCount or reduce the number of concurrent long-running scripts. Script execution timeout (ScriptExecutionTimeoutSeconds) bounds the worst case.

Native alarm conditions not recovering after reconnect

NativeAlarmActor retains last-known conditions during a source outage (it does not clear them) and reconciles state via the reconnect snapshot swap. If the snapshot never arrives (the DCL connection was cleanly unsubscribed rather than failing), the actor may hold stale Active conditions indefinitely. A redeploy of the instance clears native_alarm_state in SQLite and forces fresh subscription. Failed subscription retries are logged at Warning level with the retry interval.

InvalidActorNameException on rapid redeployment

If two DeployInstanceCommand messages arrive for the same instance while the first redeployment is still terminating, the _terminatingActorsByName shadow index in DeploymentManagerActor detects the collision and buffers the second command (SiteRuntime-020). The displaced deploy receives a Failed-superseded response. This is expected behaviour — central should observe the Failed response and retry when the site is ready.