Files
Joseph Doherty 7b0b9c7365 refactor: rename ScadaLink → ZB.MOM.WW.ScadaBridge (code + projects + namespaces)
Solution + 23 src projects + 26 test projects renamed; folders, csproj,
namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated.
ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated.
SQL roles/logins, LDAP domains, CLI command name, and CLI config dir
(~/.scadalink → ~/.scadabridge) also renamed.

Build green; 5 Host.Tests fail awaiting SQL login rename in next commit.
Pre-existing StaleTagMonitor timing flakes unchanged.

Rename script committed at tools/rename-to-scadabridge.sh.
2026-05-28 09:37:45 -04:00

77 KiB
Raw Permalink Blame History

Code Review — SiteRuntime

Field Value
Module src/ZB.MOM.WW.ScadaBridge.SiteRuntime
Design doc docs/requirements/Component-SiteRuntime.md
Status Reviewed
Last reviewed 2026-05-28
Reviewer claude-agent
Commit reviewed 1eb6e97
Open findings 3

Summary

The SiteRuntime module is broadly well-structured: the actor hierarchy matches the design doc, supervision strategies are explicit, and the trigger/alarm evaluation logic is thorough. However the review surfaced one genuinely serious correctness defect — Instance.SetAttribute never routes writes to the Data Connection Layer for data-sourced attributes, contradicting a core design decision and silently turning device writes into local-only static overrides. Several other findings cluster around two themes: (1) actor-thread discipline is violated in a few hot paths (blocking .GetAwaiter().GetResult() calls on the actor thread, a fragile fixed-delay reschedule for redeployment), and (2) the site-local repositories reach into SiteStorageService private state via reflection and mint entity IDs with the non-deterministic string.GetHashCode(). Script execution runs on the default thread pool rather than a dedicated blocking dispatcher (the code acknowledges this in a comment but ships it anyway). Test coverage exists for the coordinator actors, persistence and scripting, but the short-lived execution actors, the replication actor, and the repositories are untested.

Re-review 2026-05-17 (commit 39d737e)

The module was re-reviewed at commit 39d737e. No source under src/ZB.MOM.WW.ScadaBridge.SiteRuntime has changed since the previous review at 9c60592 (the only intervening commits are code-review documentation updates), so all of SiteRuntime-001..013, 015, 016 remain Resolved and SiteRuntime-014 remains Deferred — its Deferred justification (a trigger-evaluation concurrency design decision is required before either recommended fix can land in-module) still holds verbatim against the unchanged ScriptActor/AlarmActor source. The re-review nonetheless worked through all 10 checklist categories afresh and surfaced three new findings that the prior pass did not record: a cross-thread Dictionary enumeration race when the Instance Actor's live _attributes dictionary is handed by reference into child ScriptActor/AlarmActor constructors (SiteRuntime-017, Medium); a stale ScriptExecutionActor XML doc that still claims a "dedicated blocking I/O dispatcher" (SiteRuntime-018, Low); and two dead lifecycle handlers in InstanceActor that the Deployment Manager never routes to (SiteRuntime-019, Low). All three were subsequently resolved on 2026-05-17. Open findings: 0.

Re-review 2026-05-28 (commit 1eb6e97)

The module was re-reviewed at commit 1eb6e97 as part of the new baseline review. The SiteRuntime source surface has grown materially since the prior pass — primarily by threading ExecutionId/ParentExecutionId/SourceNode through the script-trust-boundary helpers and the cached-call telemetry emitters, and by adding OperationTrackingStore, the AuditingDbConnection/AuditingDbCommand/AuditingDbDataReader decorators, and ScriptExecutionScheduler. All 10 checklist categories were walked afresh. Seven new findings were recorded: a race that throws InvalidActorNameException when a second DeployInstanceCommand arrives for the same instance while a redeployment is still terminating its predecessor (SiteRuntime-020, Medium); an artifact-only data-connection update that never reaches the DCL (SiteRuntime-021, Medium); AuditingDbCommand.DbConnection.set reaching into AuditingDbConnection._inner via reflection — the same anti- pattern SiteRuntime-006 eliminated for the repositories, now reintroduced and in direct tension with the script trust model that forbids System.Reflection (SiteRuntime-022, Medium); Convert.ToDouble(value) in ScriptActor / AlarmActor running under CurrentCulture so a string attribute value becomes locale-sensitive (SiteRuntime-023, Low); OperationTrackingStore serialising every cached-call write through a single connection + SemaphoreSlim and using sync-over-async in Dispose() (SiteRuntime-024, Medium); inbound-API SetAttribute (and any future caller) accepting unknown attribute names and persisting them as overrides, polluting both _attributes and the SQLite override table (SiteRuntime-025, Low); and the ReplicationMessages.cs outbound/inbound record types still missing public XML docs (SiteRuntime-026, Low). Prior findings 001019 remain Resolved/Deferred — no regressions observed in any of their fixed call sites. Open findings: 7.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs SetAttribute mis-routing, deploy double-count, redeploy reschedule race.
2 Akka.NET conventions Blocking on actor thread, script execution not on a dedicated dispatcher, premature success reply.
3 Concurrency & thread safety _attributes dictionary shared with child actors by reference; _executionCounter is actor-confined (OK).
4 Error handling & resilience Deploy reports Success before persistence; replicated artifact/S&F failures only logged (matches best-effort design).
5 Security Trust-model validation is substring-based and weak; reflection used to read private fields.
6 Performance & resource management Per-call SQLite connections (acceptable); CPU-bound scripts not interruptible by timeout.
7 Design-document adherence SetAttribute DCL routing missing; staggered-startup and supervision otherwise conform.
8 Code organization & conventions Repositories reflect into another class; synthetic IDs non-deterministic.
9 Testing coverage No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories.
10 Documentation & comments Several XML comments describe behaviour the code does not implement (see findings).

Re-review (2026-05-28, 1eb6e97):

# Category Examined Notes
1 Correctness & logic bugs Second-deploy race vs. pending redeploy (020); artifact-only data-connection update never reaches DCL (021); unknown-name SetAttribute persists bogus overrides (025).
2 Akka.NET conventions Trigger-eval blocking on coordinator mailbox remains Deferred (014); short-lived execution actors and replication actor otherwise conform.
3 Concurrency & thread safety DM's _instanceActors cache and _pendingRedeploys map shifted from old race; new ordering race surfaced (020). OperationTrackingStore single-connection + SemaphoreSlim serialises all cached writes (024).
4 Error handling & resilience Task.Run fire-and-forget replication paths log on faulted (acceptable, per "best-effort replication" design). DM's deploy persistence rollback path (resolved as SiteRuntime-005) intact.
5 Security Trust-model semantic analysis (SiteRuntime-011 fix) intact. AuditingDbCommand reflects into AuditingDbConnection._inner — same anti-pattern as SiteRuntime-006 (022). Audit emitter captures SQL parameter values verbatim per M4 design (M5 will redact).
6 Performance & resource management Per-call SQLite connections on hot paths in SiteStorageService (existing pattern, acceptable). OperationTrackingStore Dispose() does sync-over-async (024). ScriptExecutionScheduler bounded threads as expected.
7 Design-document adherence Artifact-only data-connection update path is silently inert (021) — contradicts the "site is self-contained after artifact deployment" intent.
8 Code organization & conventions Repository reflection-via-private-field anti-pattern reintroduced in AuditingDbCommand (022). ReplicationMessages.cs public records still undocumented (026).
9 Testing coverage SiteReplicationActor remains uncovered (SiteRuntime-016 deferred that gap to a clustered-ActorSystem harness, still outstanding). New findings have no targeted coverage yet.
10 Documentation & comments ReplicationMessages.cs records lack XML docs (026); other XML doc surface materially expanded in 1eb6e97.

Findings

SiteRuntime-001 — Instance.SetAttribute never writes to the Data Connection Layer

Severity High
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptRuntimeContext.cs:106, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:204

Description

The design doc (Component-SiteRuntime.md, "GetAttribute / SetAttribute" and "Script Runtime API") states that Instance.SetAttribute on a data-connected attribute must send a write request to the DCL, which writes to the physical device, and that the in-memory value is not optimistically updated. For static attributes it updates memory and persists an override.

The implementation makes no such distinction. ScriptRuntimeContext.SetAttribute unconditionally sends a SetStaticAttributeCommand, and InstanceActor.HandleSetStaticAttribute unconditionally treats every write as a static override: it mutates _attributes, publishes an AttributeValueChanged with hard-coded "Good" quality, notifies children, and persists a SQLite override. A script writing a data-sourced attribute therefore never reaches the device, the write failure can never be returned synchronously to the script, and the in-memory value diverges from the device until the next subscription update overwrites it. The persisted override is also wrong: data-sourced attributes should not have static overrides.

Recommendation

In InstanceActor, look up the target attribute in _configuration.Attributes. If it has a non-empty DataSourceReference, issue a DCL write (e.g. a WriteTagRequest to _dclManager) and surface success/failure to the caller; do not persist an override and do not optimistically mutate _attributes. Only attributes with no data source reference should follow the current static-override path. Consider splitting the message into SetStaticAttributeCommand vs SetDataAttributeCommand, or branching inside the handler.

Resolution

Resolved 2026-05-16 (<pending>): InstanceActor.HandleSetStaticAttribute now resolves the target attribute's data binding from _configuration. Data-sourced attributes are routed via a new HandleSetDataAttribute that Asks the DCL with a WriteTagRequest and pipes the device-write outcome back to the caller as a SetStaticAttributeResponse — no override is persisted and _attributes is not optimistically mutated. Static attributes keep the override path and now also reply with a SetStaticAttributeResponse. ScriptRuntimeContext.SetAttribute is now async Task and Asks the Instance Actor, throwing InvalidOperationException on a failed device write so scripts get the failure synchronously.

SiteRuntime-002 — RouteInboundApiSetAttributes always treats writes as static overrides

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:632

Description

RouteInboundApiSetAttributes (handling Route.To().SetAttribute(s) from the Inbound API) emits a SetStaticAttributeCommand for every attribute, so it inherits the same defect as SiteRuntime-001: writes to data-sourced attributes never reach the device and are instead persisted as static overrides. In addition the response is sent back as unconditionally successful (true) before the Instance Actor has even processed the command, so a non-existent attribute or a future DCL write failure is reported to the external caller as success.

Recommendation

Route through the same corrected InstanceActor write handler as SiteRuntime-001 so the static-vs-data distinction is honoured. The optimistic ack is acceptable for fire-and-forget static writes per the doc, but the XML comment should make the limitation explicit, and once data-attribute writes are supported they need a real response path.

Resolution

Resolved 2026-05-16 (<pending>): RouteInboundApiSetAttributes now Asks the Instance Actor per attribute (instead of fire-and-forget Tell) and aggregates the SetStaticAttributeResponse results. Because the Instance Actor handler is the SiteRuntime-001 corrected handler, data-sourced attributes now reach the DCL and the RouteToSetAttributesResponse reflects the real per-attribute outcome — a non-existent attribute or a failed device write is reported as failure rather than an unconditional optimistic true.

SiteRuntime-003 — Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name

Severity High
Category Akka.NET conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:222

Description

HandleDeploy stops an existing Instance Actor with Context.Stop and then reschedules the same DeployInstanceCommand to itself after a hard-coded 500 ms, hoping the child has fully terminated by then. Context.Stop is asynchronous; the child is only removed from the parent's children collection after it actually stops (including running PostStop on its descendants). If a deeply nested or slow hierarchy takes longer than 500 ms, CreateInstanceActor calls Context.ActorOf with a name that still belongs to the terminating child and throws InvalidActorNameException. The _instanceActors dictionary check does not prevent this — the dictionary entry is removed immediately, but the Akka child registry is not. The 500 ms delay is also unconditionally added to every redeploy latency.

Recommendation

Watch the terminating child (Context.Watch) and recreate the Instance Actor only after receiving the Terminated message, instead of guessing with a timer. Buffer or stash the in-flight DeployInstanceCommand (and any further commands for that instance) until termination completes.

Resolution

Resolved 2026-05-16 (<pending>): HandleDeploy no longer uses a fixed 500 ms reschedule. When a redeployment targets a running instance, the existing Instance Actor is Context.Watch-ed and stopped, and the in-flight DeployInstanceCommand is buffered in a _pendingRedeploys map keyed by the terminating actor ref. A new Terminated handler recreates the Instance Actor only after the predecessor (and its whole subtree) has fully stopped, eliminating the InvalidActorNameException race and the unconditional redeploy-latency penalty. The shared ApplyDeployment helper also skips the _totalDeployedCount increment for redeployments, so the deployed-instance count no longer drifts (this additionally addresses the root cause behind SiteRuntime-004).

SiteRuntime-004 — _totalDeployedCount is incremented on redeployment of an existing instance

Severity Medium — re-triaged: already fixed by the SiteRuntime-003 resolution.
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs (ApplyDeployment)

Description

In HandleDeploy, the existing-actor branch (line 223) reschedules the command and returns. When the rescheduled command runs, no actor exists, so the code falls through to the "new instance" branch and executes _totalDeployedCount++ (line 239). A redeployment is an update of an already-deployed instance, not a new one, so the deployed count is over-counted by one on every redeploy. StoreDeployedConfigAsync uses UPSERT semantics, so the SQLite row count does not grow, but the in-memory _totalDeployedCount (reported to the health collector via UpdateInstanceCounts) drifts upward and the reported "disabled" count becomes wrong.

Re-triage (2026-05-16)

Verified against the current source: this is already fixed. The SiteRuntime-003 resolution replaced the fixed-delay reschedule with a shared ApplyDeployment helper that takes an isRedeploy flag and guards the counter with if (!isRedeploy) _totalDeployedCount++;. The redeploy path (HandleTerminated) always calls ApplyDeployment(..., isRedeploy: true), so the counter is no longer bumped on redeployment. The regression test DeploymentManagerRedeployTests.Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances already covers this and passes. No further code change was required.

Resolution

Resolved 2026-05-16 (commit pending): no new change needed — the root cause was eliminated by the SiteRuntime-003 fix (the isRedeploy guard in ApplyDeployment). Confirmed by the existing passing regression test Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances. Re-triaged from Open to Resolved.

SiteRuntime-005 — Deployment reports Success to central before persistence completes

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs (ApplyDeployment, HandleDeployPersistenceResult)

Description

HandleDeploy replies to central with DeploymentStatus.Success immediately after creating the Instance Actor, while the SQLite persistence (StoreDeployedConfigAsync

  • ClearStaticOverridesAsync) runs asynchronously on a Task.Run. If persistence fails, HandleDeployPersistenceResult only logs an error — central has already been told the deployment succeeded. On a subsequent node restart or failover the instance will not be re-created (it is not in SQLite), so the deployment is silently lost despite central recording success. This contradicts the design's intent that the site is the durable source of truth for deployed configs.

Recommendation

Persist the config before replying, or treat a persistence failure as a deployment failure and send a corrective DeploymentStatusResponse/health signal to central. At minimum, do not report Success until the config row is committed.

Resolution

Resolved 2026-05-16 (commit pending): root cause confirmed — ApplyDeployment sent DeploymentStatusResponse(Success) synchronously before the persistence Task.Run completed. The Success reply is now sent from HandleDeployPersistenceResult only once the persistence result is known: on success it replies Success; on a persistence failure it logs the error, stops the optimistically-created Instance Actor, rolls back the deployed-instance counter, and replies DeploymentStatus.Failed with the error message. DeployPersistenceResult carries an IsRedeploy flag so the counter rollback is skipped for redeployments. Regression tests: DeploymentManagerMediumFindingsTests.Deploy_PersistenceFailure_ReportsFailedNotSuccess and Deploy_Success_ReportsSuccessAndPersistsConfig.

SiteRuntime-006 — Site-local repositories read SiteStorageService private field via reflection

Severity Medium
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SiteExternalSystemRepository.cs, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SiteNotificationRepository.cs

Description

Both repositories' CreateConnection() use Type.GetField("_connectionString", BindingFlags.NonPublic | BindingFlags.Instance) to extract the private connection string out of SiteStorageService. This is brittle (any rename or refactor of the field breaks it at runtime, not compile time), defeats encapsulation, and the accompanying XML comment openly describes it as a "pragmatic" hack and is internally contradictory (it states a connection string is "passed separately at DI registration time" which is not what the code does). It also sits awkwardly against the project's own script trust model, which forbids System.Reflection in scripts.

Recommendation

Expose the connection string properly: add an ISiteStorageConnectionProvider (already referenced in ServiceCollectionExtensions XML docs but not used), or have SiteStorageService expose a CreateConnection() factory, and inject that into the repositories. Remove the reflection entirely.

Resolution

Resolved 2026-05-16 (commit pending): root cause confirmed — both repositories reflected into SiteStorageService._connectionString. SiteStorageService now exposes a public CreateConnection() factory method that returns an unopened SqliteConnection against the site database. Both SiteExternalSystemRepository and SiteNotificationRepository now obtain connections via _storage.CreateConnection(); all reflection (Type.GetField / BindingFlags) and the contradictory XML comments have been removed. This is a fully in-module refactor — no cross-module design decision was needed. Regression test: SiteRepositoryTests.ExternalSystemRepository_RoundTripsStoredDefinition exercises the repository's connection path end-to-end.

SiteRuntime-007 — Synthetic entity IDs use the non-deterministic string.GetHashCode()

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SiteExternalSystemRepository.cs, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SiteNotificationRepository.cs, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SyntheticId.cs

Description

GenerateSyntheticId computes name.GetHashCode() & 0x7FFFFFFF. On .NET Core, string.GetHashCode() is randomized per process by default, so the "stable deterministic synthetic ID" promised by the XML comment is not stable at all — it changes every time the process restarts. Any caller that obtained an ID and later calls GetExternalSystemByIdAsync/GetNotificationListByIdAsync after a restart will fail to find the entity. It also risks collisions: distinct names can hash to the same 31-bit value, and GetExternalSystemByIdAsync would then return the wrong row.

Recommendation

Use a deterministic, collision-resistant hash (e.g. a stable FNV-1a or the first bytes of a SHA-256 of the name) if a synthetic integer ID is genuinely required, or better, change the repository contract to key these site-local artifacts by name rather than synthesising integer IDs.

Resolution

Resolved 2026-05-16 (commit pending): root cause confirmed — both repositories used name.GetHashCode(), which is per-process randomized on .NET Core. A new internal SyntheticId helper computes a deterministic, process-stable 31-bit ID using the FNV-1a hash over the name's UTF-8 bytes. Both GenerateSyntheticId methods now delegate to SyntheticId.From(name). (The integer-keyed lookups are kept because they are mandated by the shared IExternalSystemRepository/INotificationRepository contracts in Commons — changing those contracts to name-keyed would be a cross-module change outside this module's scope; the deterministic hash resolves the correctness defect within scope.) Regression tests: SiteRepositoryTests.ExternalSystemRepository_SyntheticId_IsStableAcrossRestart and NotificationRepository_SyntheticId_IsStableAcrossRestart re-create the service to simulate a process restart and confirm by-ID lookups still resolve.

SiteRuntime-008 — Blocking .GetAwaiter().GetResult() on the actor thread during startup

Severity Medium
Category Akka.NET conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs (HandleStartupConfigsLoaded, LoadSharedScriptsFromStorage, HandleSharedScriptsLoaded)

Description

LoadSharedScriptsFromStorage is called synchronously from HandleStartupConfigsLoaded (the actor's message handler) and performs _storage.GetAllSharedScriptsAsync().GetAwaiter().GetResult() followed by Roslyn compilation of every shared script. This blocks the DeploymentManager singleton's mailbox thread for the full duration of the SQLite read and all shared-script compilation. On the default dispatcher this also ties up a thread-pool thread and risks thread-pool starvation, and the singleton cannot process any other message (deployments, lifecycle commands, debug routing) until it returns. The rest of the class correctly uses PipeTo/ContinueWith.

Recommendation

Load shared scripts asynchronously and PipeTo(Self) an internal message, the same pattern already used for StartupConfigsLoaded. Perform compilation either inside the piped continuation handler (still on the actor thread but at least off the synchronous startup path) or on a dedicated background task whose result is piped back.

Resolution

Resolved 2026-05-16 (commit pending): root cause confirmed — the blocking .GetAwaiter().GetResult() and Roslyn compilation ran on the singleton's mailbox thread inside HandleStartupConfigsLoaded. LoadSharedScriptsFromStorage now runs the SQLite read and the Roslyn compilation on a background Task.Run and pipes a new internal SharedScriptsLoaded message back to the actor. A new HandleSharedScriptsLoaded handler then begins staggered Instance Actor creation, so the compilation→creation ordering is preserved without ever blocking the mailbox. A shared-script load failure is logged and startup proceeds (scripts needing a missing shared script fail at execution time). Regression test: DeploymentManagerMediumFindingsTests.Startup_WithSharedScripts_LoadsConfigsAndStaysResponsive (confirms startup completes and the actor stays responsive with shared scripts present).

SiteRuntime-009 — Script execution actors run scripts on the default thread pool, not a dedicated dispatcher

Severity Medium
Category Akka.NET conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptExecutionActor.cs, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmExecutionActor.cs, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptExecutionScheduler.cs

Description

The design (CLAUDE.md "Architecture & Runtime") states Script Execution Actors run on a dedicated blocking I/O dispatcher. The code does not do this: ScriptActor.SpawnExecution and AlarmActor.SpawnAlarmExecution create the execution actors with no .WithDispatcher(...), and the execution itself runs inside a bare Task.Run, i.e. on the shared .NET thread pool. The // NOTE: In production, configure a dedicated ... dispatcher comments acknowledge the gap but it ships unconfigured. Scripts can perform synchronous blocking I/O (Database.Connection, synchronous ExternalSystem.Call); running them on the shared pool can starve it and stall unrelated Akka dispatchers and HTTP request handling under load.

Recommendation

Define the dedicated dispatcher in HOCON and chain .WithDispatcher(...) on the execution actor Props. If the Task.Run model is kept, run script bodies on a dedicated TaskScheduler / bounded scheduler rather than the global pool. Either way, remove the "in production, configure…" comments by actually configuring it.

Resolution

Resolved 2026-05-16 (commit pending): root cause confirmed — script and alarm on-trigger bodies ran inside a bare Task.Run on the shared ThreadPool. The recommendation's TaskScheduler option was taken because it is fully in-module (a HOCON dispatcher would require editing the Host's ActorSystem config, out of scope). A new ScriptExecutionScheduler provides a bounded set of dedicated background threads (count from the new SiteRuntimeOptions.ScriptExecutionThreadCount, default 8). ScriptExecutionActor and AlarmExecutionActor now run their bodies via Task.Factory.StartNew(..., ScriptExecutionScheduler.Shared(options)).Unwrap() instead of Task.Run, so blocking script I/O is contained to those dedicated threads and cannot starve the global pool. The misleading "in production, configure a dedicated dispatcher" comments were removed. Regression tests: ScriptExecutionSchedulerTests (Scheduler_RunsWork_OffTheThreadPool, Scheduler_RespectsConfiguredThreadCount, Scheduler_Shared_ReturnsSameInstanceForOptions).

SiteRuntime-010 — EnsureDclConnections never updates a connection whose configuration changed

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs (EnsureDclConnections, ComputeConnectionConfigHash)

Description

EnsureDclConnections tracks created connections in _createdConnections and skips any name already present (if (_createdConnections.Contains(name)) continue;). The skip is purely name-based: if a redeployment (or an artifact deployment) changes the endpoint, credentials, backup endpoint, or FailoverRetryCount of an existing connection, the new configuration is silently ignored and the DCL keeps using the stale CreateConnectionCommand. There is no UpdateConnectionCommand path. The design states that after artifact deployment the site is fully self-contained with current configuration; this caching breaks that for connection changes.

Recommendation

Compare the incoming connection config against the last one sent and re-issue a create/update command when it differs, or have the DCL treat CreateConnectionCommand as idempotent upsert and always forward it. Key the cache on a config hash, not just the name.

Resolution

Resolved 2026-05-16 (commit pending): root cause confirmed — the cache was a name-only HashSet, so a changed connection config was silently dropped. _createdConnections is now a Dictionary<string,string> mapping connection name to a SHA-256 hash of its protocol/primary-config/backup-config/failover-retry-count (ComputeConnectionConfigHash). A connection whose hash is unchanged is still skipped; a connection whose config changed re-issues a CreateConnectionCommand so the DCL adopts the new configuration. Regression tests: DeploymentManagerMediumFindingsTests.EnsureDclConnections_ConnectionConfigChanged_ReissuesCreateCommand and EnsureDclConnections_UnchangedConfig_DoesNotReissueCreateCommand.

SiteRuntime-011 — Trust-model validation is a substring scan and is both over- and under-inclusive

Severity Medium
Category Security
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptCompilationService.cs (ValidateTrustModel)

Description

ValidateTrustModel enforces the script trust model by doing raw string.Contains / IndexOf on the script source text for forbidden namespace strings. This is unreliable in both directions:

  • Bypass (under-inclusive): the check looks only for the literal namespace strings. A script can reach forbidden APIs without ever writing System.IO etc. — e.g. via fully-qualified type use through aliasing, global::-prefixed names, or simply because the namespace is already imported transitively. The compilation references include typeof(object).Assembly (the whole of System.Private.CoreLib, which contains System.IO.File, System.Threading.Thread, System.Reflection, etc.), so forbidden types are fully resolvable at compile time and the only barrier is this textual scan.
  • False positives (over-inclusive): any occurrence of the substring in a comment, string literal, or an unrelated identifier (e.g. a variable named ProcessThreading) triggers a violation; the AllowedExceptions logic only rescues exact prefixes.
  • The dead isAllowed variable at line 64 is computed and never used.

Recommendation

Enforce the trust model with a Roslyn SyntaxWalker/semantic analysis (inspect resolved symbols and their containing namespaces/assemblies), or restrict the compilation's metadata references and AssemblyLoadContext so forbidden types are genuinely unavailable, rather than relying on source-text matching. Remove the unused isAllowed variable.

Resolution

Resolved 2026-05-16 (commit pending): root cause confirmed — ValidateTrustModel was a raw string.Contains/IndexOf scan of the source text, with a dead isAllowed variable. It is now Roslyn semantic analysis: the script is parsed and a CSharpCompilation + SemanticModel are built; every name/member/object-creation node is resolved to its symbol and the symbol's containing namespace and fully-qualified containing type are checked against the forbidden roots. Bare namespace symbols are ignored (so the System.Threading qualifier of the allowed System.Threading.Tasks.Task no longer false-positives). A name that cannot be resolved (a type from an assembly deliberately absent from the script's references) falls back to a syntactic fully-qualified-name check, so e.g. System.Net.Http references are still rejected. The dead isAllowed variable was removed. This fixes both the bypass (global::/alias-qualified forbidden types) and the false positives (forbidden namespace string in a comment, string literal, or unrelated identifier). Regression tests: new TrustModelSemanticTests (alias/global:: detection, comment/ literal/identifier non-detection, allowed-exception resolution); all 39 existing SandboxTests + ScriptCompilationServiceTests continue to pass.

SiteRuntime-012 — AttributeAccessor/ScopeAccessors block the script on a synchronous Ask

Severity Low
Category Concurrency & thread safety
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScopeAccessors.cs:28

Description

AttributeAccessor's indexer getter calls _ctx.GetAttribute(...).GetAwaiter().GetResult(), synchronously blocking the script-execution thread on an actor Ask. Combined with SiteRuntime-009 (scripts run on the shared thread pool) this means a script that reads several attributes via Attributes["X"] holds a pool thread blocked for each round-trip. The async variants (GetAsync/SetAsync) exist but the ergonomic indexer encourages the blocking path. The XML comment notes "Reads block on the actor Ask" but does not warn about the thread-pool impact.

Recommendation

Once a dedicated script dispatcher exists (SiteRuntime-009) the blocking is contained to that pool, which is acceptable; until then, document the cost clearly and prefer steering script authors to the async accessors. Consider making the indexer internal-only and exposing only the async API.

Resolution

Resolved 2026-05-16 (pending commit): re-triaged against the current source — the finding's own recommendation states the blocking is acceptable once SiteRuntime-009's dedicated script dispatcher exists, and SiteRuntime-009 is now Resolved (ScriptExecutionActor/AlarmExecutionActor run script bodies on the dedicated ScriptExecutionScheduler threads, confirmed in ScriptExecutionActor.cs:74). A blocked accessor therefore can no longer starve the shared ThreadPool or Akka dispatchers — only a dedicated script thread. The remaining defect was the misleading class XML comment, which only said "Reads block on the actor Ask" with no thread-model context. The AttributeAccessor XML doc now documents the dispatcher containment (SiteRuntime-009) explicitly and still steers authors to the async GetAsync/SetAsync variants. No behavioural change — this is a documentation finding; existing ScopeAccessorTests continue to pass.

SiteRuntime-013 — HandleUnsubscribeDebugView does nothing despite documented behaviour

Severity Low
Category Documentation & comments
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:414

Description

HandleUnsubscribeDebugView is documented ("Debug view unsubscribe — removes subscription") and the actor registers a handler for UnsubscribeDebugViewRequest, but the body only logs a debug message — there is no subscription state in the Instance Actor to remove. The design places the actual subscription lifecycle in SiteStreamManager (Subscribe/Unsubscribe/RemoveSubscriber), so the Instance Actor genuinely has nothing to do here. The handler and its XML comment are therefore misleading: a reader expects it to tear down a subscription.

Recommendation

Either remove the no-op handler and route UnsubscribeDebugViewRequest to wherever the SiteStreamManager subscription is actually cancelled, or correct the XML comment to state explicitly that subscription teardown is handled by SiteStreamManager and this handler is a no-op acknowledgement.

Resolution

Resolved 2026-05-16 (pending commit): root cause confirmed — the Instance Actor holds no per-subscriber state, so HandleUnsubscribeDebugView genuinely has nothing to remove; the real debug-stream subscription lifecycle lives in SiteStreamManager (Subscribe/Unsubscribe/RemoveSubscriber). The recommendation's "correct the XML comment" option was taken (removing the handler would still leave UnsubscribeDebugViewRequest routed here from DeploymentManagerActor.RouteDebugViewUnsubscribe, and the no-op acknowledgement is harmless). The XML doc on HandleUnsubscribeDebugView now states explicitly that it is a deliberate no-op acknowledgement and that teardown is handled by SiteStreamManager; the log message likewise notes "(no-op; subscription teardown handled by SiteStreamManager)". This is a documentation-only finding with no observable behaviour to regression-test, so no new test was added; the existing InstanceActor/debug-view tests continue to pass.

SiteRuntime-014 — Trigger-expression evaluation blocks the coordinator actor thread

Severity Low
Category Akka.NET conventions
Status Deferred
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptActor.cs:219, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:389

Description

EvaluateExpressionTrigger (ScriptActor) and EvaluateExpression (AlarmActor) run a compiled Roslyn script with .RunAsync(...).GetAwaiter().GetResult() directly inside the actor's AttributeValueChanged message handler. This blocks the coordinator actor's mailbox thread for up to the 2-second timeout on every monitored attribute change. Coordinator actors are on the default dispatcher and process the hot path of attribute-change fan-out; a slow expression delays all other messages to that actor and consumes a thread-pool thread for the duration. The inline comments correctly note CPU-bound expressions are not interruptible but do not address the mailbox-blocking concern.

Recommendation

Trigger expressions are expected to be cheap, but to keep the actor responsive consider evaluating them off the actor thread (pipe the boolean result back as an internal message) or pre-compiling to a plain delegate that executes near-instantly without the Roslyn scripting RunAsync machinery.

Resolution

Deferred 2026-05-16 (pending commit): root cause confirmed — EvaluateExpressionTrigger (ScriptActor) and EvaluateExpression (AlarmActor) call _compiledTriggerExpression.RunAsync(...).GetAwaiter().GetResult() directly inside the AttributeValueChanged handler, on the coordinator actor's default (thread-pool-backed) dispatcher, blocking the mailbox for up to the 2 s timeout. Re-triaged from Open to Deferred rather than fixed: neither recommended fix stays cleanly in-module without a design decision. (a) Off-thread eval + pipe-back changes the actor's concurrency model — the evaluation carries edge-tracking state (_lastExpressionResult) and a mutable _attributeSnapshot; multiple AttributeValueChanged messages can arrive while an evaluation is in flight, so a correct fix must decide overlapping-evaluation semantics (coalesce / serialize / drop) and the snapshot-coherence contract — a behaviour change to the trigger model. (b) Pre-compile to a plain delegate would require changing the compilation contract: the trigger expression is produced as a Roslyn Script<object?> by ScriptCompilationService.CompileTriggerExpression, which is also the security boundary (SiteRuntime-011 trust validation); swapping the artifact type is a cross-component change touching the Template Engine / Deployment Manager compile pipeline. Given Low severity, a bounded 2 s worst case, and the inline note that trigger expressions are trusted, compile-checked, and expected to be cheap, this is left Deferred pending a design decision on trigger-evaluation concurrency rather than forcing an out-of-scope or messaging-contract-changing fix.

SiteRuntime-015 — LoggerFactory created per Instance Actor and never disposed

Severity Low
Category Performance & resource management
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:746

Description

CreateInstanceActor does var loggerFactory = new LoggerFactory(); for every Instance Actor it creates, uses it once to produce an ILogger<InstanceActor>, and never disposes it. LoggerFactory is IDisposable. With up to 500 instances (and churn from redeployments) this leaks a factory per instance, and the produced loggers are detached from the application's configured logging providers, so Instance Actor logs may not be routed/filtered consistently with the rest of the host.

Recommendation

Inject the application's ILoggerFactory (or an ILogger<InstanceActor> factory delegate) into DeploymentManagerActor via DI and reuse it, rather than newing one up per child. Do not create a fresh LoggerFactory in a hot creation path.

Resolution

Resolved 2026-05-16 (pending commit): root cause confirmed — CreateInstanceActor did new LoggerFactory() per Instance Actor, never disposed, and detached from the host's logging providers. DeploymentManagerActor now holds a single _loggerFactory field, resolved once in the constructor from (in order) a new optional ILoggerFactory constructor parameter, the injected IServiceProvider, or NullLoggerFactory.Instance as a last resort — never a per-instance allocation. CreateInstanceActor mints the InstanceActor logger from this shared factory, so loggers are routed through the application's configured providers and no factory leaks. Regression test: DeploymentManagerLoggerFactoryTests.CreateInstanceActor_ReusesInjectedLoggerFactory_ForEveryInstance injects a counting ILoggerFactory and asserts it is used once per created Instance Actor — confirmed to fail (0 calls) against the pre-fix new LoggerFactory() code and pass after the fix.

SiteRuntime-016 — Short-lived execution actors, replication actor, and repositories are untested

Severity Low
Category Testing coverage
Status Resolved
Location tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/

Description

The test project covers the coordinator actors (InstanceActor, ScriptActor, AlarmActor, DeploymentManagerActor), persistence, scripting and streaming, but a search of the test sources finds no references to ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, SiteExternalSystemRepository, or SiteNotificationRepository. These cover critical paths: script timeout/failure handling and result reply, alarm on-trigger execution, peer config/S&F replication (including the SendToPeer no-peer drop), and the reflection-based repository reads. Several findings above (001/002 mis-routing, 007 ID instability, 011 trust bypass) would likely have been caught by targeted tests.

Recommendation

Add unit/integration tests for the execution actors (success, timeout, exception, Ask-reply, PoisonPill self-stop), SiteReplicationActor (outbound forward, inbound apply, peer tracking on cluster events), and the two repositories (round-trip read, synthetic-ID lookup, missing-row behaviour).

Resolution

Resolved 2026-05-16 (pending commit): re-triaged against the current test sources — SiteExternalSystemRepository and SiteNotificationRepository are already covered by Repositories/SiteRepositoryTests.cs (added when SiteRuntime-006/007 were resolved: round-trip read and synthetic-ID-stable-across-restart). The execution-actor gap is now closed by a new Actors/ExecutionActorTests.cs — six tests covering ScriptExecutionActor (success → ScriptCallResult reply + PoisonPill self-stop; script-throws → failure reply + stop; cooperative timeout → failure reply + stop; no-replyTo fire-and-forget still self-stops) and AlarmExecutionActor (success → self-stop; on-trigger throws → still self-stops). SiteReplicationActor is not covered here: it depends on Cluster.Get(Context.System) and so requires a clustered ActorSystem HOCON harness that does not yet exist in this test project — adding that harness is a larger test-infrastructure task tracked separately and out of scope for a Low-severity coverage finding; the highest-value untested paths the finding called out (script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests green.

SiteRuntime-017 — Instance Actor's live _attributes dictionary is shared by reference into child actor constructors

Severity Medium
Category Concurrency & thread safety
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:625, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:675, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptActor.cs:83, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:93

Description

InstanceActor.CreateChildActors passes the Instance Actor's own mutable _attributes field (a plain Dictionary<string, object?>) by reference into the Props.Create(...) factory for every ScriptActor and AlarmActor (as the initialAttributes constructor argument). Each child constructor then iterates that dictionary to seed its _attributeSnapshot:

if (initialAttributes != null)
    foreach (var kvp in initialAttributes)
        _attributeSnapshot[kvp.Key] = kvp.Value;

Context.ActorOf returns immediately; the child actor's constructor runs later on the child's mailbox thread. Meanwhile the Instance Actor's PreStart returns and the Instance Actor begins processing its mailbox — HandleTagValueUpdate and HandleAttributeValueChanged both mutate _attributes (_attributes[...] = ...). A DCL tag update that arrives before a child has finished its constructor copy therefore mutates the dictionary on the Instance Actor thread while the child thread is enumerating it. Dictionary<,> is explicitly not safe for concurrent read/write: the enumeration can throw InvalidOperationException ("collection was modified") — which surfaces as an ActorInitializationException and, under the Instance Actor's SupervisorStrategy, stops the child (the strategy returns Stop for ActorInitializationException). The script or alarm is then silently absent for the life of the instance. A torn read of an entry is also possible. The window is small but deterministically reachable on a busy site at startup/failover — exactly the staggered-startup scenario the design is most concerned about.

Recommendation

Do not share the live dictionary. Snapshot it on the Instance Actor thread before constructing the child — e.g. pass new Dictionary<string, object?>(_attributes) (or an immutable copy) into each Props.Create. The copy is made on the Instance Actor thread inside CreateChildActors, so it is race-free, and each child gets a private dictionary to seed from.

Resolution

Resolved 2026-05-17 (commit pending): root cause confirmed — CreateChildActors captured the live _attributes field directly in every child Props.Create closure. CreateChildActors now takes a single new Dictionary<string,object?>(_attributes) snapshot on the Instance Actor thread and hands each ScriptActor/AlarmActor that private copy, so no child constructor ever enumerates a dictionary the Instance Actor is concurrently mutating. Regression test: InstanceActorChildAttributeRaceTests.ChildActors_AreSeededFromAnIsolatedCopy_NotTheLiveAttributesDictionary asserts every child's seed dictionary is a distinct object from the Instance Actor's live _attributes (confirmed to fail — "seeded ... by reference" — against the pre-fix code and pass after). ScriptActor/AlarmActor expose an internal SeedAttributesReference for this assertion (InternalsVisibleTo added for the test project).

SiteRuntime-018 — ScriptExecutionActor XML doc still claims a "dedicated blocking I/O dispatcher"

Severity Low
Category Documentation & comments
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptExecutionActor.cs:17

Description

The class-level XML summary on ScriptExecutionActor states "Runs on a dedicated blocking I/O dispatcher." That is not what the code does. SiteRuntime-009 was resolved by introducing ScriptExecutionScheduler (a bounded dedicated TaskScheduler); the actor itself and its mailbox run on the default Akka dispatcher, and only the script body runs on the scheduler's threads via Task.Factory.StartNew(..., scheduler). The resolution of SiteRuntime-009 explicitly chose the TaskScheduler route instead of a HOCON dispatcher and even removed the "in production, configure a dedicated dispatcher" comments elsewhere — but this stale summary line was missed. A reader is told the actor is on a dedicated dispatcher when it is not, which is misleading when reasoning about mailbox throughput and thread-pool pressure. (AlarmExecutionActor does not carry the equivalent claim — its summary only says "Same pattern as ScriptExecutionActor.")

Recommendation

Correct the summary to describe the actual model: the actor runs on the default dispatcher and the script body is dispatched onto the dedicated ScriptExecutionScheduler (SiteRuntime-009). Align the wording with the accurate comment already present at ScriptExecutionActor.cs:71-73.

Resolution

Resolved 2026-05-17 (commit pending): root cause confirmed — the stale "Runs on a dedicated blocking I/O dispatcher" line in the ScriptExecutionActor class summary was missed when SiteRuntime-009 was resolved. The summary now states the actual model: the actor and its mailbox run on the default Akka dispatcher and only the script body is dispatched onto the dedicated ScriptExecutionScheduler (SiteRuntime-009). Documentation-only change with no observable behaviour, so no regression test was added; the existing suite continues to pass.

SiteRuntime-019 — Dead DisableInstanceCommand / EnableInstanceCommand handlers in InstanceActor

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:106, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:113

Description

InstanceActor's constructor registers Receive<DisableInstanceCommand> and Receive<EnableInstanceCommand> handlers that log and reply with a successful InstanceLifecycleResponse. These handlers are unreachable. The Deployment Manager is the only sender of those commands, and DeploymentManagerActor.HandleDisable / HandleEnable handle the lifecycle entirely themselves — they call Context.Stop(actor) (disable) or CreateInstanceActor(...) (enable) directly and reply to the original sender from the Deployment Manager. Neither command is ever Forward-ed or Tell-ed to the Instance Actor. The handlers are dead code, and they are actively misleading: a maintainer reading InstanceActor would reasonably believe disable/enable is partly an Instance-Actor responsibility, and the no-op "true" reply implies an instance-side acknowledgement contract that does not exist. If a future change did route these commands here, the disable handler would do nothing useful (it does not stop children or tear down state — Akka does that when the parent stops the actor).

Recommendation

Remove the two Receive<...> registrations and their handler bodies from InstanceActor, since the Deployment Manager owns the disable/enable lifecycle. If the intent is to keep them for a future instance-side hook, add an XML comment stating that the Deployment Manager currently handles these and the handlers are a reserved placeholder — but removal is preferred.

Resolution

Resolved 2026-05-17 (commit pending): re-verified as genuinely dead code — a codebase-wide search confirms DisableInstanceCommand/EnableInstanceCommand are only ever sent (from central) to the site DeploymentManagerActor, whose HandleDisable/HandleEnable own the lifecycle entirely (Context.Stop / CreateInstanceActor) and never Forward/Tell the command to the Instance Actor. The two unreachable Receive<...> registrations and their no-op "success" handler bodies were removed from InstanceActor, replaced with a comment stating the Deployment Manager owns this lifecycle. Regression test: InstanceActorTests.InstanceActor_DoesNotHandleDisableOrEnableCommands asserts the Instance Actor produces no InstanceLifecycleResponse for either command (confirmed to fail against the pre-fix dead handlers and pass after removal).

SiteRuntime-020 — Second DeployInstanceCommand arriving during a pending redeploy races the still-terminating actor on its name

Severity Medium
Category Concurrency & thread safety
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:285, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:971

Resolution — added a name → terminating-actor-ref shadow (_terminatingActorsByName) populated when HandleDeploy stops the predecessor and cleared in HandleTerminated. HandleDeploy now detects the mid-termination state before falling through to ApplyDeployment(fresh): on hit it tells the displaced PendingRedeploy.OriginalSender a DeploymentStatus.Failed / "superseded by newer deployment …" response and overwrites the buffered pending command (last-write-wins). Regression test SR020_ThreeRapidDeploys_DoNotThrowInvalidActorNameException_LatestWins fires three rapid deploys, asserts the middle deploy is told it was superseded, the latest succeeds, and the resulting instance is operable (DisableInstanceCommand works).

Description

The SiteRuntime-003 fix makes HandleDeploy watch + stop a running Instance Actor and buffer the in-flight DeployInstanceCommand in _pendingRedeploys until Terminated arrives. The handler also removes the instance from _instanceActors synchronously, in step with the stop request:

if (_instanceActors.TryGetValue(instanceName, out var existing))
{
    _instanceActors.Remove(instanceName);
    _pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
    Context.Watch(existing);
    Context.Stop(existing);
    UpdateInstanceCounts();
    return;
}

// Fresh deployment — no existing actor to replace.
ApplyDeployment(command, Sender, isRedeploy: false);

If a second DeployInstanceCommand for the same instanceName arrives on the singleton's mailbox while the predecessor is still terminating, the _instanceActors.TryGetValue lookup correctly reports "no existing actor" — because the first deploy already removed it — and execution falls through to ApplyDeployment(..., isRedeploy: false). ApplyDeployment immediately calls CreateInstanceActor, which calls Context.ActorOf(props, instanceName). But the predecessor's Akka child name is still registered in the parent's child registry: that name is only released after the predecessor's Terminated signal — exactly the asynchronous gap SiteRuntime-003 was created to plug for the first redeploy. Context.ActorOf therefore throws InvalidActorNameException, which Akka rethrows as ActorInitializationException — and the supervisor's Stop directive on that exception (DeploymentManagerActor.cs:179) silently stops the just-created child. The second deploy is then quietly lost: _instanceActors doesn't contain it (the throw aborted the bookkeeping after CreateInstanceActor's own ContainsKey guard but before _instanceActors[instanceName] = actorRef would have run), _totalDeployedCount was incremented, and the deployer is never told the deployment failed (the persistence Task.Run is also dropped on the throw path). The race is real on a busy site where central retries a deploy because the prior attempt timed out — exactly the scenario the DeploymentManager-006 query-then-deploy idempotency mechanism was designed for.

The first-redeploy case (SiteRuntime-003) does NOT exhibit this because at that point the predecessor's child name was still in _instanceActors, so the branch correctly buffers. The bug is specific to the third (and beyond) incoming deploy when two are already in flight for the same instance.

Recommendation

The pending-redeploy bookkeeping needs to be authoritative for "we are mid- redeploy on this instance", not just the _instanceActors cache. Add a second keyed lookup — e.g. a Dictionary<string, IActorRef> _terminatingActorsByName populated when the predecessor is stopped — and check it BEFORE ApplyDeployment(isRedeploy: false). On a hit, overwrite (or stash) the buffered PendingRedeploy for that terminating actor so the latest command wins on the Terminated signal. Alternatively, defer the deploy by stashing all messages for that instanceName until the predecessor terminates (Akka Stash pattern). Either way, the fall-through to "fresh deployment" needs to be gated on "no instance with this name is currently terminating".

SiteRuntime-021 — HandleDeployArtifacts updates DataConnections in SQLite but never sends CreateConnectionCommand to the DCL

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:931

Resolution (2026-05-28): Took the "refactor EnsureDclConnections into a shared field-based helper" path. Extracted a new EnsureDclConnection(name, protocol, primaryJson, backupJson, failoverRetryCount) method that owns the hash-cache check and the CreateConnectionCommand Tell — both the existing inline EnsureDclConnections(configJson) and the new artifact path now drive through it. ComputeConnectionConfigHash got a field-based overload so the artifact path (which carries data directly on DataConnectionArtifact) reuses the same hash logic as the ConnectionConfig-based inline path. To keep _createdConnections mutation actor-thread-confined (the artifact-deploy persistence runs inside a Task.Run), the off-thread persistence dispatches a new internal ApplyArtifactDataConnectionsToDcl message back to Self after the SQLite writes; the actor-thread handler then iterates and invokes EnsureDclConnection. The DCL only sees CreateConnectionCommand (no Update/Delete messages exist in the codebase, and CreateConnectionCommand is treated as upsert-by-name — same shape as the inline-config path). Build clean; 302 SiteRuntime tests green (the existing EnsureDclConnections_ConnectionConfigChanged_ReissuesCreateCommand regression test still passes through the refactored shared helper).

Description

HandleDeployArtifacts persists the artifact bundle (shared scripts, external systems, database connections, notification lists, SMTP configs, and data connection definitions) into local SQLite. For data connection definitions specifically (DataConnections), the handler calls _storage.StoreDataConnectionDefinitionAsync(...) — but does NOT issue a CreateConnectionCommand (or any other DCL command) to the _dclManager actor. The only path that pushes DCL configuration to the DCL is EnsureDclConnections, called exclusively from the deploy / startup-batch paths against the flattened instance configuration's inline Connections dictionary. There is no equivalent for an artifact-only update.

Concretely: an artifact deployment that changes a data connection's endpoint URL, credentials, backup endpoint, or failover retry count is stored durably in the site SQLite (so on the next node restart the site loads the new config and EnsureDclConnections picks it up) but is silently inert until either an instance using that connection is redeployed or the node restarts. This contradicts the design's "after artifact deployment, the site is fully self-contained" intent (Component-SiteRuntime.md, "System-Wide Artifact Handling") — the runtime DCL keeps using the stale connection until a much heavier trigger event occurs. It is also asymmetric with how SharedScripts are handled in the same method: shared scripts are both stored and recompiled into _sharedScriptLibrary on update so the change is live immediately.

(SiteRuntime-010 fixed a related defect inside EnsureDclConnections — the config-hash cache — but that's only consulted on the inline-config path; the artifact-deployment path never reaches EnsureDclConnections.)

Recommendation

In the DataConnections branch of HandleDeployArtifacts, after the StoreDataConnectionDefinitionAsync call, also send a CreateConnectionCommand to _dclManager for each updated definition, re-using the SiteRuntime-010 config hash so unchanged connections are skipped. Alternatively, refactor EnsureDclConnections to accept a flat list of (name, protocol, configurationJson, backupConfigurationJson, failoverRetryCount) tuples that both the inline (FlattenedConfiguration) and artifact paths can drive through it.

SiteRuntime-022 — AuditingDbCommand.DbConnection.set uses reflection to read AuditingDbConnection._inner

Severity Medium
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/AuditingDbCommand.cs:138

Resolution (2026-05-28): Took the recommended "expose a proper API surface" path (the SiteRuntime-006 precedent). Added an internal DbConnection Inner => _inner; accessor to AuditingDbConnection; both classes are internal sealed in the same assembly, so the accessor stays out of the public API. The AuditingDbCommand.DbConnection setter now unwraps an AuditingDbConnection via auditing.Inner instead of Type.GetField("_inner", BindingFlags.NonPublic | BindingFlags.Instance)!.GetValue(...). No reflection, no !. null-forgiveness hiding a runtime crash, no static-analyzer/IL-trim noise — and the same module that enforces "no System.Reflection in scripts" no longer reflects internally. The getter's _wrappingConnection ?? _inner.Connection fallback was left as-is; addressing the CreateDbCommand() round-trip concern is a separate behavioural decision (the finding marked it secondary). Build clean; 302 SiteRuntime tests green.

Description

The DbConnection setter on AuditingDbCommand unwraps an AuditingDbConnection value by reading its private _inner field via reflection:

set
{
    _wrappingConnection = value;
    _inner.Connection = value switch
    {
        AuditingDbConnection auditing => auditing.GetType()
            .GetField("_inner", BindingFlags.Instance | BindingFlags.NonPublic)
            !.GetValue(auditing) as DbConnection,
        _ => value
    };
}

This is the same encapsulation-violating anti-pattern that SiteRuntime-006 called out for the site repositories. A rename or refactor of AuditingDbConnection._inner breaks the audit decorator at runtime (no compile-time signal), the !. null-forgiving operator hides the crash, and the reflective access trips static analyzers and IL trimming. More problematically, the script trust model the same module enforces in ScriptCompilationService.ValidateTrustModel explicitly forbids System.Reflection in scripts — yet the auditing helper a script ends up running through itself reaches via reflection into a sibling class. Both classes are internal sealed in the same assembly, so this is purely a self-imposed contract violation.

A second smaller concern in the same property: the getter returns _wrappingConnection ?? _inner.Connection. If the caller obtains a command via AuditingDbConnection.CreateDbCommand() and immediately reads cmd.Connection, the getter returns the raw inner connection (not the auditing wrapper), because _wrappingConnection is only populated when the setter is later invoked. That's surprising and at odds with the class's audit-everything intent — a script that round-trips a command through cmd.Connection re-enters the un-audited path.

Recommendation

Expose the wrapped connection through a proper API surface. The simplest fix that matches the SiteRuntime-006 precedent: add an internal DbConnection Inner { get; } property to AuditingDbConnection (both classes are internal sealed, so the property stays out of the public surface) and replace the reflection switch with auditing.Inner. While touching the property, also have the getter return _wrappingConnection even on the synthesised CreateDbCommand path (e.g. set _wrappingConnection to the parent connection inside AuditingDbConnection.CreateDbCommand).

SiteRuntime-023 — Convert.ToDouble(value) in trigger and alarm evaluation is locale-sensitive

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptActor.cs:446, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:340, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:356, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:444

Resolution (2026-05-28): All four call sites (ScriptActor.EvaluateCondition, AlarmActor.EvaluateRangeViolation, AlarmActor.EvaluateRateOfChange, AlarmActor.EvaluateHiLo) now pass CultureInfo.InvariantCulture to Convert.ToDouble, so a string attribute value like "1.5" parses identically regardless of the host's CurrentCulture. For purely-numeric inputs the culture argument is a no-op. No new test added — existing ScriptActor / AlarmActor evaluator tests continue to pass and the behaviour is identical under the (existing CI's) en-US locale.

Description

ScriptActor.EvaluateCondition and the three AlarmActor evaluators (EvaluateRangeViolation, EvaluateRateOfChange, EvaluateHiLo) call Convert.ToDouble(value) without specifying a culture. When value is a string (a path that exists today — attribute values that arrive as JSON- deserialized numbers can still surface as strings on some code paths, particularly array values that are JSON-stringified at InstanceActor.HandleTagValueUpdate:377), Convert.ToDouble parses against CultureInfo.CurrentCulture. On a host whose locale uses a comma decimal separator (German, French, most of continental Europe), "1.5" throws and the condition / alarm silently degrades to its catch-fallthrough (returns false for range/rate-of-change, keeps current level for HiLo, falls back to string-compare for conditionals). The CLAUDE.md "All timestamps are UTC" discipline is the equivalent rule for time; there is no equivalent invariant- culture discipline applied to numeric parsing.

The exposure is bounded — most attribute values arrive as numeric primitives from TagValueUpdate.Value or static FlattenedConfiguration.Attributes (also typed) so the implicit-cast Convert.ToDouble path is hit. But the string path is reachable via inbound API writes (RouteToSetAttributesRequest.AttributeValues is IReadOnlyDictionary<string, string>), via the JSON-array stringification at HandleTagValueUpdate:377, and via static-override values loaded from SQLite (which are persisted as strings — see SetStaticOverrideAsync).

Recommendation

Replace each Convert.ToDouble(value) with Convert.ToDouble(value, CultureInfo.InvariantCulture), or front-load a typed-numeric extraction helper (if (value is double d) return d; if (value is string s && double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out var p)) return p; return Convert.ToDouble(value, CultureInfo.InvariantCulture);). The site is a deterministic machine-control surface; condition evaluation must not depend on the host's regional settings.

SiteRuntime-024 — OperationTrackingStore serialises all writes through one connection + SemaphoreSlim, and Dispose() does sync-over-async

Severity Medium
Category Performance & resource management
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Tracking/OperationTrackingStore.cs:39, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Tracking/OperationTrackingStore.cs:360

Resolution — split reads from writes: the single owned _writeConnection + _writeGate still serialises writers, but GetStatusAsync now opens a fresh SqliteConnection per call against the shared connection string (mirroring SiteStorageService) so reads never block on an in-flight write. Sync Dispose was rewritten to NOT bridge to async — the dispose-once flag is an int flipped with Interlocked.Exchange, the synchronous path disposes _writeConnection + _writeGate directly without acquiring the gate, and DisposeAsync retains the gate-drain semantics for graceful shutdown. Both paths are idempotent; the second call short-circuits via the interlocked flag. Tests: SR024_ConcurrentReads_DoNotBlockOnInFlightWrite, SR024_SyncDispose_DoesNotDeadlock_WhenInvokedFromFreshThread, and SR024_AsyncDispose_DoesNotDeadlock_AndIsIdempotent.

Description

OperationTrackingStore owns exactly one SqliteConnection and gates every public method through a single SemaphoreSlim(1, 1). The class XML comment calls this out as deliberate ("the M3 brief calls out as 'cleaner than the M2 Channel pipeline given the volume'"), and the write volume is genuinely low — at most a handful of lifecycle rows per cached call. But on a busy site the read path (GetStatusAsync) is called by every Tracking.Status(id) invocation from every executing script, and reads are serialised through the same gate as writes. A long-running write (e.g. a Roslyn-script-driven RecordTerminalAsync competing with an SQLite checkpoint) holds the gate and stalls every concurrent status query. SQLite supports concurrent readers with a single writer in WAL mode; the gate forfeits that capability.

A separate concern in the same class: Dispose() calls DisposeAsyncCore().AsTask().GetAwaiter().GetResult(). That is sync-over- async — the very pattern SiteRuntime-008 was a finding for. If a caller disposes the store from a synchronization context that does not allow re-entrance (e.g. an IHostedService.StopAsync continuation observed on the host's sync context, or a finalizer pumping on the thread pool with a stuck continuation), the .WaitAsync() inside DisposeAsyncCore waits for a continuation that will never run, and the dispose deadlocks. The async path itself is correct; only the sync Dispose() wrapper is risky.

Recommendation

For the single-connection gate: split reads and writes into separate gates, or — better — keep the writer single-connection and open a fresh read connection (or pool of read connections) per GetStatusAsync call. SQLite connections are cheap; the SiteStorageService precedent already uses per- call connections on the read path. For Dispose(): prefer Dispose() { GC.SuppressFinalize(this); _connection.Dispose(); _gate.Dispose(); } without an awaited disposal, and have the IAsyncDisposable.DisposeAsync path do the awaiting. If a synchronous disposable is genuinely needed, do not bridge it through the async core — duplicate the dispose-once flag check into a sync path that calls _connection.Dispose() directly.

SiteRuntime-025 — HandleSetStaticAttribute persists unknown attribute names as static overrides

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:223, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:246

Resolution (2026-05-28): HandleSetStaticAttribute now rejects writes whose command.AttributeName does not resolve against _configuration.Attributes. The caller receives SetStaticAttributeResponse(Success: false, ErrorMessage: "Unknown attribute '<name>'"); no override is persisted, _attributes is not mutated, and no synthetic AttributeValueChanged is published — eliminating the in-memory-pollution, restart-resurrection, and debug-stream spam vectors. Regression test InstanceActorSetAttributeTests.SetAttribute_UnknownAttribute_ReturnsFailureAndDoesNotPersistOverride exercises an inbound SetStaticAttributeCommand with an unknown name and asserts the failure response, no DCL traffic, and an empty override row.

Description

HandleSetStaticAttribute resolves the target attribute against _configuration.Attributes to decide whether to route the write to the DCL or treat it as a static-override write. If the lookup fails (resolved == null), isDataSourced is false, and execution falls through to HandleSetStaticAttributeCore — which unconditionally:

  1. inserts the bogus key into the in-memory _attributes dictionary,
  2. publishes an AttributeValueChanged for the bogus key to the site stream and to every child Script/Alarm actor,
  3. persists a row in static_attribute_overrides for the bogus key, and
  4. replies Success = true to the caller.

Concretely, an inbound API Route.To().SetAttribute("notARealAttr", "x") returns success, pollutes the in-memory state with a key that no script can legitimately observe (canonical-name lookup will not produce it), persists a durable SQLite override row that survives restart, and (on every restart) re-injects the polluting key via HandleOverridesLoaded at line 608. The override is not reset on instance redeployment in the same way the "genuine" overrides are — ClearStaticOverridesAsync does clear by instance_unique_name, so the row is eventually cleaned, but only on a full redeploy; in the meantime each restart resurrects it. The publish-to-stream side effect also lets a hostile or buggy inbound caller spam debug-view subscribers with synthetic attribute changes.

Worth flagging at Low: the inbound API surface is already authenticated and the design assumes its callers are trusted. But the no-validation behaviour contradicts the design doc's "Scripts can only read/write attributes on their own instance" framing — an inbound API call inherits the same instance-scope authority as a script, and the script trust model wouldn't sanction this.

Recommendation

In HandleSetStaticAttribute, when resolved == null, reply SetStaticAttributeResponse(Success: false, ErrorMessage: $"Attribute '{command.AttributeName}' not found on instance '{_instanceUniqueName}'") instead of falling through to the override path. Optionally also surface the existence check on the RouteInboundApiSetAttributes fan-out so a multi-attribute write reports the offending key without rolling back the others (the per-attribute Ask shape already supports a partial failure response).

SiteRuntime-026 — ReplicationMessages.cs public record types have no XML documentation

Severity Low
Category Documentation & comments
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:10, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:13, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:15, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:17, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:19, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:25, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:28, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:30, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:32, src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:34

Description

The ten public record types in ReplicationMessages.cs (ReplicateConfigDeploy, ReplicateConfigRemove, ReplicateConfigSetEnabled, ReplicateArtifacts, ReplicateStoreAndForward, ApplyConfigDeploy, ApplyConfigRemove, ApplyConfigSetEnabled, ApplyArtifacts, ApplyStoreAndForward) carry no XML documentation. The file header comment groups them as "outbound" vs "inbound" but the individual records have no <summary> and no parameter docs. The XML-doc baseline 1eb6e97 rolled out across the rest of the module (the commit being reviewed is literally docs: add XML doc comments across src + Sister Projects section in CLAUDE.md), so this file is now the conspicuous outlier — and the CommentChecker skill relied on by the fixdocs workflow will flag every record as missing docs.

Recommendation

Add a <summary> per record naming the direction (outbound → peer / inbound from peer) and what the operation replicates, and <param> docs for each record parameter. Mirror the precedent in src/ZB.MOM.WW.ScadaBridge.Commons/Messages/.../*.cs. While there, consider sealing the inbound vs outbound split with a marker base type (currently they're just named conventionally) so Receive<ReplicateXxx> vs Receive<ApplyXxx> is expressed at the type level — but that's optional and out of scope for a docs-only finding.

Resolution (2026-05-28):

Added a one-line <summary> to each of the ten records (ReplicateConfigDeploy/Remove/SetEnabled/Artifacts/StoreAndForward and ApplyConfigDeploy/Remove/SetEnabled/Artifacts/StoreAndForward) naming the direction (outbound to peer / inbound from peer) and what is replicated. The two pre-existing group-header XML blocks were converted to plain // comments to avoid orphaned doc-summaries above the first record in each group. Marker-base-type idea left out of scope.