scadalink-design/code-reviews/SiteRuntime/findings.md
Joseph Doherty 977d7369a7 docs: add code review process and baseline review of all 19 modules
Establishes a per-module code review workflow under code-reviews/ and
records the 2026-05-16 baseline review (commit 9c60592): 241 findings
across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low).
This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00

Code Review — SiteRuntime

| Field | Value |
|---|---|
| Module | src/ScadaLink.SiteRuntime |
| Design doc | docs/requirements/Component-SiteRuntime.md |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | 9c60592 |
| Open findings | 16 |

Summary

The SiteRuntime module is broadly well-structured: the actor hierarchy matches the design doc, supervision strategies are explicit, and the trigger/alarm evaluation logic is thorough. However, the review surfaced one serious correctness defect: Instance.SetAttribute never routes writes to the Data Connection Layer (DCL) for data-sourced attributes, contradicting a core design decision and silently turning device writes into local-only static overrides. Most other findings cluster around two themes: (1) actor-thread discipline is violated in a few hot paths (blocking .GetAwaiter().GetResult() calls on the actor thread, a fragile fixed-delay reschedule for redeployment), and (2) the site-local repositories reach into SiteStorageService private state via reflection and mint entity IDs with string.GetHashCode(), which is not stable across process restarts. Script execution runs on the default thread pool rather than a dedicated blocking dispatcher; a code comment acknowledges the gap, but the dispatcher is never configured. Tests cover the coordinator actors, persistence, and scripting, but the short-lived execution actors, the replication actor, and the repositories are untested.

Checklist coverage

| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | | SetAttribute mis-routing, deploy double-count, redeploy reschedule race. |
| 2 | Akka.NET conventions | | Blocking on actor thread, script execution not on a dedicated dispatcher, premature success reply. |
| 3 | Concurrency & thread safety | | _attributes dictionary shared with child actors by reference; _executionCounter is actor-confined (OK). |
| 4 | Error handling & resilience | | Deploy reports Success before persistence; replicated artifact/S&F failures only logged (matches best-effort design). |
| 5 | Security | | Trust-model validation is substring-based and weak; reflection used to read private fields. |
| 6 | Performance & resource management | | Per-call SQLite connections (acceptable); CPU-bound scripts not interruptible by timeout. |
| 7 | Design-document adherence | | SetAttribute DCL routing missing; staggered-startup and supervision otherwise conform. |
| 8 | Code organization & conventions | | Repositories reflect into another class; synthetic IDs non-deterministic. |
| 9 | Testing coverage | | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
| 10 | Documentation & comments | | Several XML comments describe behaviour the code does not implement (see findings). |

Findings

SiteRuntime-001 — Instance.SetAttribute never writes to the Data Connection Layer

Severity High
Category Design-document adherence
Status Open
Location src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:106, src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:204

Description

The design doc (Component-SiteRuntime.md, "GetAttribute / SetAttribute" and "Script Runtime API") states that Instance.SetAttribute on a data-connected attribute must send a write request to the DCL, which writes to the physical device, and that the in-memory value is not optimistically updated. For static attributes it updates memory and persists an override.

The implementation makes no such distinction. ScriptRuntimeContext.SetAttribute unconditionally sends a SetStaticAttributeCommand, and InstanceActor.HandleSetStaticAttribute unconditionally treats every write as a static override: it mutates _attributes, publishes an AttributeValueChanged with hard-coded "Good" quality, notifies children, and persists a SQLite override. A script writing a data-sourced attribute therefore never reaches the device, the write failure can never be returned synchronously to the script, and the in-memory value diverges from the device until the next subscription update overwrites it. The persisted override is also wrong: data-sourced attributes should not have static overrides.

Recommendation

In InstanceActor, look up the target attribute in _configuration.Attributes. If it has a non-empty DataSourceReference, issue a DCL write (e.g. a WriteTagRequest to _dclManager) and surface success/failure to the caller; do not persist an override and do not optimistically mutate _attributes. Only attributes with no data source reference should follow the current static-override path. Consider splitting the message into SetStaticAttributeCommand vs SetDataAttributeCommand, or branching inside the handler.
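
The branching described above can be sketched as follows. The module itself is C#; this is a language-neutral model in Python, and every name in it (InstanceModel, AttributeConfig, the dcl_write callback and its (ok, error) return shape) is a hypothetical stand-in rather than the module's real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class AttributeConfig:
    name: str
    data_source_reference: Optional[str] = None  # None or "" means a static attribute


class InstanceModel:
    """Hypothetical stand-in for the write handler in InstanceActor."""

    def __init__(self, attributes,
                 dcl_write: Callable[[str, object], Tuple[bool, Optional[str]]]):
        self._config: Dict[str, AttributeConfig] = {a.name: a for a in attributes}
        self._dcl_write = dcl_write             # assumed to return (ok, error_message)
        self.values: Dict[str, object] = {}     # in-memory attribute values
        self.overrides: Dict[str, object] = {}  # persisted static overrides

    def set_attribute(self, name: str, value: object):
        cfg = self._config.get(name)
        if cfg is None:
            return ("rejected", False, f"unknown attribute: {name}")
        if cfg.data_source_reference:
            # Data-sourced: forward to the device; no optimistic update, no override.
            ok, error = self._dcl_write(cfg.data_source_reference, value)
            return ("dcl", ok, error)
        # Static: update memory and persist an override.
        self.values[name] = value
        self.overrides[name] = value
        return ("static", True, None)
```

In this model a failed device write is returned to the caller and leaves both values and overrides untouched, which is the behaviour the design doc asks for.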

Resolution

Unresolved.

SiteRuntime-002 — RouteInboundApiSetAttributes always treats writes as static overrides

Severity High
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:632

Description

RouteInboundApiSetAttributes (handling Route.To().SetAttribute(s) from the Inbound API) emits a SetStaticAttributeCommand for every attribute, so it inherits the same defect as SiteRuntime-001: writes to data-sourced attributes never reach the device and are instead persisted as static overrides. In addition, the response is sent back as unconditionally successful (true) before the Instance Actor has processed the command, so a write to a non-existent attribute (or, once DCL routing exists, a failed device write) is reported to the external caller as success.

Recommendation

Route through the same corrected InstanceActor write handler as SiteRuntime-001 so the static-vs-data distinction is honoured. The optimistic ack is acceptable for fire-and-forget static writes per the doc, but the XML comment should make the limitation explicit, and once data-attribute writes are supported they need a real response path.

Resolution

Unresolved.

SiteRuntime-003 — Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name

Severity High
Category Akka.NET conventions
Status Open
Location src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:222

Description

HandleDeploy stops an existing Instance Actor with Context.Stop and then reschedules the same DeployInstanceCommand to itself after a hard-coded 500 ms, hoping the child has fully terminated by then. Context.Stop is asynchronous; the child is only removed from the parent's children collection after it actually stops (including running PostStop on its descendants). If a deeply nested or slow hierarchy takes longer than 500 ms, CreateInstanceActor calls Context.ActorOf with a name that still belongs to the terminating child and throws InvalidActorNameException. The _instanceActors dictionary check does not prevent this: the dictionary entry is removed immediately, but the Akka child registry is not. The fixed delay also adds 500 ms to every redeploy, even when the child stops much sooner.

Recommendation

Watch the terminating child (Context.Watch) and recreate the Instance Actor only after receiving the Terminated message, instead of guessing with a timer. Buffer or stash the in-flight DeployInstanceCommand (and any further commands for that instance) until termination completes.
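
A toy model of that bookkeeping, written in Python rather than the module's C# and not using the Akka.NET API: the class, its fields, and the choice to keep only the newest buffered command per instance are all assumptions made for illustration.

```python
class DeploymentManagerModel:
    """Toy model of the parent's bookkeeping; not the Akka.NET API."""

    def __init__(self):
        self.children = set()   # names still held by the parent's child registry
        self.pending = {}       # name -> newest deploy config waiting for Terminated
        self.created = []       # (name, config) for every child actually created

    def handle_deploy(self, name, config):
        if name in self.pending:
            self.pending[name] = config  # already stopping: keep only the newest command
            return "buffered"
        if name in self.children:
            # Real code would call Context.Watch(child) and Context.Stop(child) here.
            self.pending[name] = config
            return "stopping"
        self._create(name, config)
        return "created"

    def handle_terminated(self, name):
        # Runs when the Terminated message arrives: the name is free again.
        self.children.discard(name)
        config = self.pending.pop(name, None)
        if config is not None:
            self._create(name, config)

    def _create(self, name, config):
        if name in self.children:
            raise ValueError(f"actor name still in use: {name}")
        self.children.add(name)
        self.created.append((name, config))
```

Because the child is recreated only from handle_terminated, the name collision cannot occur regardless of how long the old hierarchy takes to stop, and no fixed delay is paid when it stops quickly.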

Resolution

Unresolved.

SiteRuntime-004 — _totalDeployedCount is incremented on redeployment of an existing instance

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:239

Description

In HandleDeploy, the existing-actor branch (line 223) reschedules the command and returns. When the rescheduled command runs, no actor exists, so the code falls through to the "new instance" branch and executes _totalDeployedCount++ (line 239). A redeployment is an update of an already-deployed instance, not a new one, so the deployed count is over-counted by one on every redeploy. StoreDeployedConfigAsync uses UPSERT semantics, so the SQLite row count does not grow, but the in-memory _totalDeployedCount (reported to the health collector via UpdateInstanceCounts) drifts upward and the reported "disabled" count becomes wrong.

Recommendation

Only increment _totalDeployedCount when the instance is genuinely new. Either track whether this deploy replaced an existing config, or derive the deployed count from storage / the union of running actors and disabled configs rather than maintaining a hand-incremented counter.
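
As a minimal illustration of the "derive, do not increment" option, the Python sketch below computes the counts from two sets of instance names. The function name, its inputs, and the shape of the returned counts are hypothetical; the real values passed to UpdateInstanceCounts may differ.

```python
def instance_counts(running_names, disabled_names):
    """Derive counts from the current sets instead of a hand-incremented counter."""
    running = set(running_names)
    deployed = running | set(disabled_names)
    return {
        "deployed": len(deployed),
        "enabled": len(running),
        "disabled": len(deployed) - len(running),
    }
```

Since the result depends only on which names currently exist, redeploying an existing instance cannot inflate the deployed count.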

Resolution

Unresolved.

SiteRuntime-005 — Deployment reports Success to central before persistence completes

Severity Medium
Category Error handling & resilience
Status Open
Location src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:272

Description

HandleDeploy replies to central with DeploymentStatus.Success immediately after creating the Instance Actor, while the SQLite persistence (StoreDeployedConfigAsync + ClearStaticOverridesAsync) runs asynchronously on a Task.Run. If persistence fails, HandleDeployPersistenceResult only logs an error — central has already been told the deployment succeeded. On a subsequent node restart or failover the instance will not be re-created (it is not in SQLite), so the deployment is silently lost despite central recording success. This contradicts the design's intent that the site is the durable source of truth for deployed configs.

Recommendation

Persist the config before replying, or treat a persistence failure as a deployment failure and send a corrective DeploymentStatusResponse/health signal to central. At minimum, do not report Success until the config row is committed.
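
The ordering can be sketched as below. This is a Python model of the control flow only, not the module's C#; handle_deploy, persist, and reply are hypothetical names, and persist is assumed to be an async call that raises on failure.

```python
import asyncio


async def handle_deploy(config, persist, reply):
    """Reply to central only after the config row is committed."""
    try:
        await persist(config)     # hypothetical async store, e.g. the SQLite upsert
    except Exception as exc:      # a persistence failure is a deployment failure
        reply("Failed", str(exc))
        return
    reply("Success", None)
```

The cost is that the reply is delayed by the duration of the write. In an actor, the equivalent would be to pipe the persistence result back to the actor and send the status from that handler rather than awaiting inline.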

Resolution

Unresolved.

SiteRuntime-006 — Site-local repositories read SiteStorageService private field via reflection

Severity Medium
Category Code organization & conventions
Status Open
Location src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:183, src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs:181

Description

Both repositories' CreateConnection() use Type.GetField("_connectionString", BindingFlags.NonPublic | BindingFlags.Instance) to extract the private connection string out of SiteStorageService. This is brittle (any rename or refactor of the field breaks it at runtime, not compile time) and defeats encapsulation. The accompanying XML comment openly describes it as a "pragmatic" hack and contradicts the code: it states that a connection string is "passed separately at DI registration time", which is not what happens. The approach also sits awkwardly against the project's own script trust model, which forbids System.Reflection in scripts.

Recommendation

Expose the connection string properly: add an ISiteStorageConnectionProvider (already referenced in ServiceCollectionExtensions XML docs but not used), or have SiteStorageService expose a CreateConnection() factory, and inject that into the repositories. Remove the reflection entirely.

Resolution

Unresolved.

SiteRuntime-007 — Synthetic entity IDs use the non-deterministic string.GetHashCode()

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:241, src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs:254

Description

GenerateSyntheticId computes name.GetHashCode() & 0x7FFFFFFF. On .NET Core, string.GetHashCode() is randomized per process by default, so the "stable deterministic synthetic ID" promised by the XML comment is not stable at all — it changes every time the process restarts. Any caller that obtained an ID and later calls GetExternalSystemByIdAsync/GetNotificationListByIdAsync after a restart will fail to find the entity. It also risks collisions: distinct names can hash to the same 31-bit value, and GetExternalSystemByIdAsync would then return the wrong row.

Recommendation

Use a deterministic, collision-resistant hash (e.g. a stable FNV-1a or the first bytes of a SHA-256 of the name) if a synthetic integer ID is genuinely required, or better, change the repository contract to key these site-local artifacts by name rather than synthesising integer IDs.
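
For reference, 32-bit FNV-1a over the UTF-8 bytes of the name, masked to 31 bits as the current code does, looks like this. The sketch is in Python rather than C#, and the function name synthetic_id is hypothetical. Python's built-in hash() for strings is also randomized per process by default, so it would have the same defect as string.GetHashCode().

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def synthetic_id(name: str) -> int:
    """Stable 31-bit ID: FNV-1a over the UTF-8 bytes, masked to a non-negative int."""
    h = FNV_OFFSET_BASIS
    for byte in name.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the arithmetic in 32 bits
    return h & 0x7FFFFFFF
```

This fixes the restart instability but not the collision risk: any 31-bit hash can map two names to the same ID, which is why keying by name remains the safer contract.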

Resolution

Unresolved.

SiteRuntime-008 — Blocking .GetAwaiter().GetResult() on the actor thread during startup

Severity Medium
Category Akka.NET conventions
Status Open
Location src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:479

Description

LoadSharedScriptsFromStorage is called synchronously from HandleStartupConfigsLoaded (the actor's message handler) and performs _storage.GetAllSharedScriptsAsync().GetAwaiter().GetResult() followed by Roslyn compilation of every shared script. This blocks the DeploymentManager singleton's mailbox thread for the full duration of the SQLite read and all shared-script compilation. On the default dispatcher this also ties up a thread-pool thread and risks thread-pool starvation, and the singleton cannot process any other message (deployments, lifecycle commands, debug routing) until it returns. The rest of the class correctly uses PipeTo/ContinueWith.

Recommendation

Load shared scripts asynchronously and PipeTo(Self) an internal message, the same pattern already used for StartupConfigsLoaded. Perform compilation either inside the piped continuation handler (still on the actor thread but at least off the synchronous startup path) or on a dedicated background task whose result is piped back.

Resolution

Unresolved.

SiteRuntime-009 — Script execution actors run scripts on the default thread pool, not a dedicated dispatcher

Severity Medium
Category Akka.NET conventions
Status Open
Location src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:72, src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:289, src/ScadaLink.SiteRuntime/Actors/AlarmExecutionActor.cs:57

Description

The design (CLAUDE.md "Architecture & Runtime") states Script Execution Actors run on a dedicated blocking I/O dispatcher. The code does not do this: ScriptActor.SpawnExecution and AlarmActor.SpawnAlarmExecution create the execution actors with no .WithDispatcher(...), and the execution itself runs inside a bare Task.Run, i.e. on the shared .NET thread pool. The // NOTE: In production, configure a dedicated ... dispatcher comments acknowledge the gap but it ships unconfigured. Scripts can perform synchronous blocking I/O (Database.Connection, synchronous ExternalSystem.Call); running them on the shared pool can starve it and stall unrelated Akka dispatchers and HTTP request handling under load.

Recommendation

Define the dedicated dispatcher in HOCON and chain .WithDispatcher(...) on the execution actor Props. If the Task.Run model is kept, run script bodies on a dedicated TaskScheduler / bounded scheduler rather than the global pool. Either way, remove the "in production, configure…" comments by actually configuring it.
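
The isolation idea behind the second option (a dedicated, bounded pool for blocking script bodies) can be modelled in Python as below. This is an analogy, not Akka.NET or TaskScheduler code: the pool size, the thread name prefix, and the helper names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool size and name; the real values belong in configuration.
SCRIPT_POOL = ThreadPoolExecutor(max_workers=4, thread_name_prefix="script-exec")


def run_script(body, *args):
    """Submit a blocking script body to the dedicated pool and return its Future."""
    return SCRIPT_POOL.submit(body, *args)


def worker_name():
    """Sample script body: report which thread is executing it."""
    return threading.current_thread().name
```

With a bound of four workers, at most four blocking scripts run at once and further submissions queue inside this pool, so they cannot exhaust the threads that serve unrelated work.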

Resolution

Unresolved.

SiteRuntime-010 — EnsureDclConnections never updates a connection whose configuration changed

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:413

Description

EnsureDclConnections tracks created connections in _createdConnections and skips any name already present (if (_createdConnections.Contains(name)) continue;). The skip is purely name-based: if a redeployment (or an artifact deployment) changes the endpoint, credentials, backup endpoint, or FailoverRetryCount of an existing connection, the new configuration is silently ignored and the DCL keeps using the stale CreateConnectionCommand. There is no UpdateConnectionCommand path. The design states that after artifact deployment the site is fully self-contained with current configuration; this caching breaks that for connection changes.

Recommendation

Compare the incoming connection config against the last one sent and re-issue a create/update command when it differs, or have the DCL treat CreateConnectionCommand as idempotent upsert and always forward it. Key the cache on a config hash, not just the name.
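
A minimal sketch of keying the cache on a config hash, in Python rather than the module's C#. ConnectionCache, needs_send, and the dictionary-shaped config are hypothetical; the sketch also assumes the connection config can be serialized to JSON.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ConnectionCache:
    """Remembers the fingerprint of the last config sent for each connection name."""

    def __init__(self):
        self._sent = {}  # name -> fingerprint of the last config forwarded to the DCL

    def needs_send(self, name: str, config: dict) -> bool:
        fingerprint = config_fingerprint(config)
        if self._sent.get(name) == fingerprint:
            return False              # unchanged: skip, as the current code does
        self._sent[name] = fingerprint
        return True                   # new or changed: issue a create/update command
```

An unchanged redeploy is still skipped, but a changed endpoint, credential, or FailoverRetryCount produces a different fingerprint and is forwarded.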

Resolution

Unresolved.

SiteRuntime-011 — Trust-model validation is a substring scan and is both over- and under-inclusive

Severity Medium
Category Security
Status Open
Location src/ScadaLink.SiteRuntime/Scripts/ScriptCompilationService.cs:52

Description

ValidateTrustModel enforces the script trust model by doing raw string.Contains / IndexOf on the script source text for forbidden namespace strings. This is unreliable in both directions:

  • Bypass (under-inclusive): the check looks only for the literal namespace strings. A script can reach forbidden APIs without ever writing System.IO etc., for example through a namespace alias (using S = System; then S.IO.File), or simply because the namespace is already imported transitively. The compilation references include typeof(object).Assembly (the whole of System.Private.CoreLib, which contains System.IO.File, System.Threading.Thread, System.Reflection, etc.), so forbidden types are fully resolvable at compile time and the only barrier is this textual scan.
  • False positives (over-inclusive): any occurrence of the substring in a comment, string literal, or an unrelated identifier (e.g. a variable named ProcessThreading) triggers a violation; the AllowedExceptions logic only rescues exact prefixes.
  • Dead code: the isAllowed variable at line 64 is computed and never used.
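
Both failure directions can be reproduced with a small model of a raw text scan. The sketch is in Python and treats the C# scripts as plain strings; the FORBIDDEN tuple is a hypothetical subset, not the list used by ValidateTrustModel.

```python
FORBIDDEN = ("System.IO", "System.Reflection")  # hypothetical subset of the forbidden list


def substring_scan(source: str):
    """Model of a raw text scan: report every forbidden string found in the source."""
    return [ns for ns in FORBIDDEN if ns in source]


# False positive: the namespace appears only inside a comment.
harmless = "// unlike System.IO, this script only does arithmetic\nreturn 1 + 1;"

# Bypass: a namespace alias means the forbidden string never appears literally.
evasive = 'using S = System;\nreturn S.IO.File.ReadAllText("secrets.txt");'
```

Here substring_scan(harmless) flags a script that touches no forbidden API, while substring_scan(evasive) returns an empty list for a script that reads a file.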

Recommendation

Enforce the trust model with a Roslyn SyntaxWalker/semantic analysis (inspect resolved symbols and their containing namespaces/assemblies), or restrict the compilation's metadata references and AssemblyLoadContext so forbidden types are genuinely unavailable, rather than relying on source-text matching. Remove the unused isAllowed variable.

Resolution

Unresolved.

SiteRuntime-012 — AttributeAccessor/ScopeAccessors block the script on a synchronous Ask

Severity Low
Category Concurrency & thread safety
Status Open
Location src/ScadaLink.SiteRuntime/Scripts/ScopeAccessors.cs:28

Description

AttributeAccessor's indexer getter calls _ctx.GetAttribute(...).GetAwaiter().GetResult(), synchronously blocking the script-execution thread on an actor Ask. Combined with SiteRuntime-009 (scripts run on the shared thread pool) this means a script that reads several attributes via Attributes["X"] holds a pool thread blocked for each round-trip. The async variants (GetAsync/SetAsync) exist but the ergonomic indexer encourages the blocking path. The XML comment notes "Reads block on the actor Ask" but does not warn about the thread-pool impact.

Recommendation

Once a dedicated script dispatcher exists (SiteRuntime-009) the blocking is contained to that pool, which is acceptable; until then, document the cost clearly and prefer steering script authors to the async accessors. Consider making the indexer internal-only and exposing only the async API.

Resolution

Unresolved.

SiteRuntime-013 — HandleUnsubscribeDebugView does nothing despite documented behaviour

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:414

Description

HandleUnsubscribeDebugView is documented ("Debug view unsubscribe — removes subscription") and the actor registers a handler for UnsubscribeDebugViewRequest, but the body only logs a debug message — there is no subscription state in the Instance Actor to remove. The design places the actual subscription lifecycle in SiteStreamManager (Subscribe/Unsubscribe/RemoveSubscriber), so the Instance Actor genuinely has nothing to do here. The handler and its XML comment are therefore misleading: a reader expects it to tear down a subscription.

Recommendation

Either remove the no-op handler and route UnsubscribeDebugViewRequest to wherever the SiteStreamManager subscription is actually cancelled, or correct the XML comment to state explicitly that subscription teardown is handled by SiteStreamManager and this handler is a no-op acknowledgement.

Resolution

Unresolved.

SiteRuntime-014 — Trigger-expression evaluation blocks the coordinator actor thread

Severity Low
Category Akka.NET conventions
Status Open
Location src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:219, src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:389

Description

EvaluateExpressionTrigger (ScriptActor) and EvaluateExpression (AlarmActor) run a compiled Roslyn script with .RunAsync(...).GetAwaiter().GetResult() directly inside the actor's AttributeValueChanged message handler. This blocks the coordinator actor's mailbox thread for up to the 2-second timeout on every monitored attribute change. Coordinator actors are on the default dispatcher and process the hot path of attribute-change fan-out; a slow expression delays all other messages to that actor and consumes a thread-pool thread for the duration. The inline comments correctly note CPU-bound expressions are not interruptible but do not address the mailbox-blocking concern.

Recommendation

Trigger expressions are expected to be cheap, but to keep the actor responsive consider evaluating them off the actor thread (pipe the boolean result back as an internal message) or pre-compiling to a plain delegate that executes near-instantly without the Roslyn scripting RunAsync machinery.

Resolution

Unresolved.

SiteRuntime-015 — LoggerFactory created per Instance Actor and never disposed

Severity Low
Category Performance & resource management
Status Open
Location src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:746

Description

CreateInstanceActor does var loggerFactory = new LoggerFactory(); for every Instance Actor it creates, uses it once to produce an ILogger<InstanceActor>, and never disposes it. LoggerFactory is IDisposable. With up to 500 instances (and churn from redeployments) this leaks a factory per instance, and the produced loggers are detached from the application's configured logging providers, so Instance Actor logs may not be routed/filtered consistently with the rest of the host.

Recommendation

Inject the application's ILoggerFactory (or an ILogger<InstanceActor> factory delegate) into DeploymentManagerActor via DI and reuse it, rather than newing one up per child. Do not create a fresh LoggerFactory in a hot creation path.

Resolution

Unresolved.

SiteRuntime-016 — Short-lived execution actors, replication actor, and repositories are untested

Severity Low
Category Testing coverage
Status Open
Location tests/ScadaLink.SiteRuntime.Tests/

Description

The test project covers the coordinator actors (InstanceActor, ScriptActor, AlarmActor, DeploymentManagerActor), persistence, scripting and streaming, but a search of the test sources finds no references to ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, SiteExternalSystemRepository, or SiteNotificationRepository. These cover critical paths: script timeout/failure handling and result reply, alarm on-trigger execution, peer config/S&F replication (including the SendToPeer no-peer drop), and the reflection-based repository reads. Several findings above (001/002 mis-routing, 007 ID instability, 011 trust bypass) would likely have been caught by targeted tests.

Recommendation

Add unit/integration tests for the execution actors (success, timeout, exception, Ask-reply, PoisonPill self-stop), SiteReplicationActor (outbound forward, inbound apply, peer tracking on cluster events), and the two repositories (round-trip read, synthetic-ID lookup, missing-row behaviour).

Resolution

Unresolved.