736 lines
39 KiB
Markdown
736 lines
39 KiB
Markdown
# Code Review — SiteRuntime
|
|
|
|
| Field | Value |
|
|
|-------|-------|
|
|
| Module | `src/ScadaLink.SiteRuntime` |
|
|
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
|
|
| Status | Reviewed |
|
|
| Last reviewed | 2026-05-16 |
|
|
| Reviewer | claude-agent |
|
|
| Commit reviewed | `9c60592` |
|
|
| Open findings | 0 |
|
|
|
|
## Summary
|
|
|
|
The SiteRuntime module is broadly well-structured: the actor hierarchy matches the
|
|
design doc, supervision strategies are explicit, and the trigger/alarm evaluation
|
|
logic is thorough. However the review surfaced one genuinely serious correctness
|
|
defect — `Instance.SetAttribute` never routes writes to the Data Connection Layer
|
|
for data-sourced attributes, contradicting a core design decision and silently
|
|
turning device writes into local-only static overrides. Several other findings
|
|
cluster around two themes: (1) actor-thread discipline is violated in a few hot
|
|
paths (blocking `.GetAwaiter().GetResult()` calls on the actor thread, a fragile
|
|
fixed-delay reschedule for redeployment), and (2) the site-local repositories reach
|
|
into `SiteStorageService` private state via reflection and mint entity IDs with the
|
|
non-deterministic `string.GetHashCode()`. Script execution runs on the default
|
|
thread pool rather than a dedicated blocking dispatcher (the code acknowledges this
|
|
in a comment but ships it anyway). Test coverage exists for the coordinator actors,
|
|
persistence and scripting, but the short-lived execution actors, the replication
|
|
actor, and the repositories are untested.
|
|
|
|
## Checklist coverage
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | ✓ | SetAttribute mis-routing, deploy double-count, redeploy reschedule race. |
|
|
| 2 | Akka.NET conventions | ✓ | Blocking on actor thread, script execution not on a dedicated dispatcher, premature success reply. |
|
|
| 3 | Concurrency & thread safety | ✓ | `_attributes` dictionary shared with child actors by reference; `_executionCounter` is actor-confined (OK). |
|
|
| 4 | Error handling & resilience | ✓ | Deploy reports Success before persistence; replicated artifact/S&F failures only logged (matches best-effort design). |
|
|
| 5 | Security | ✓ | Trust-model validation is substring-based and weak; reflection used to read private fields. |
|
|
| 6 | Performance & resource management | ✓ | Per-call SQLite connections (acceptable); CPU-bound scripts not interruptible by timeout. |
|
|
| 7 | Design-document adherence | ✓ | SetAttribute DCL routing missing; staggered-startup and supervision otherwise conform. |
|
|
| 8 | Code organization & conventions | ✓ | Repositories reflect into another class; synthetic IDs non-deterministic. |
|
|
| 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
|
|
| 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). |
|
|
|
|
## Findings
|
|
|
|
### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:204` |
|
|
|
|
**Description**
|
|
|
|
The design doc (Component-SiteRuntime.md, "GetAttribute / SetAttribute" and
|
|
"Script Runtime API") states that `Instance.SetAttribute` on a *data-connected*
|
|
attribute must send a write request to the DCL, which writes to the physical
|
|
device, and that the in-memory value is **not** optimistically updated. For *static*
|
|
attributes it updates memory and persists an override.
|
|
|
|
The implementation makes no such distinction. `ScriptRuntimeContext.SetAttribute`
|
|
unconditionally sends a `SetStaticAttributeCommand`, and `InstanceActor.HandleSetStaticAttribute`
|
|
unconditionally treats every write as a static override: it mutates `_attributes`,
|
|
publishes an `AttributeValueChanged` with hard-coded `"Good"` quality, notifies
|
|
children, and persists a SQLite override. A script writing a data-sourced attribute
|
|
therefore never reaches the device, the write failure can never be returned
|
|
synchronously to the script, and the in-memory value diverges from the device
|
|
until the next subscription update overwrites it. The persisted override is also
|
|
wrong: data-sourced attributes should not have static overrides.
|
|
|
|
**Recommendation**
|
|
|
|
In `InstanceActor`, look up the target attribute in `_configuration.Attributes`. If
|
|
it has a non-empty `DataSourceReference`, issue a DCL write (e.g. a `WriteTagRequest`
|
|
to `_dclManager`) and surface success/failure to the caller; do not persist an
|
|
override and do not optimistically mutate `_attributes`. Only attributes with no
|
|
data source reference should follow the current static-override path. Consider
|
|
splitting the message into `SetStaticAttributeCommand` vs `SetDataAttributeCommand`,
|
|
or branching inside the handler.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`<pending>`): `InstanceActor.HandleSetStaticAttribute` now resolves
|
|
the target attribute's data binding from `_configuration`. Data-sourced attributes are
|
|
routed via a new `HandleSetDataAttribute` that Asks the DCL with a `WriteTagRequest` and
|
|
pipes the device-write outcome back to the caller as a `SetStaticAttributeResponse` —
|
|
no override is persisted and `_attributes` is not optimistically mutated. Static
|
|
attributes keep the override path and now also reply with a `SetStaticAttributeResponse`.
|
|
`ScriptRuntimeContext.SetAttribute` is now `async Task` and Asks the Instance Actor,
|
|
throwing `InvalidOperationException` on a failed device write so scripts get the failure
|
|
synchronously.
|
|
|
|
### SiteRuntime-002 — `RouteInboundApiSetAttributes` always treats writes as static overrides
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:632` |
|
|
|
|
**Description**
|
|
|
|
`RouteInboundApiSetAttributes` (handling `Route.To().SetAttribute(s)` from the
|
|
Inbound API) emits a `SetStaticAttributeCommand` for every attribute, so it inherits
|
|
the same defect as SiteRuntime-001: writes to data-sourced attributes never reach
|
|
the device and are instead persisted as static overrides. In addition the response
|
|
is sent back as unconditionally successful (`true`) before the Instance Actor has
|
|
even processed the command, so a non-existent attribute or a future DCL write
|
|
failure is reported to the external caller as success.
|
|
|
|
**Recommendation**
|
|
|
|
Route through the same corrected `InstanceActor` write handler as SiteRuntime-001 so
|
|
the static-vs-data distinction is honoured. The optimistic ack is acceptable for
|
|
fire-and-forget static writes per the doc, but the XML comment should make the
|
|
limitation explicit, and once data-attribute writes are supported they need a real
|
|
response path.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`<pending>`): `RouteInboundApiSetAttributes` now Asks the Instance
|
|
Actor per attribute (instead of fire-and-forget Tell) and aggregates the
|
|
`SetStaticAttributeResponse` results. Because the Instance Actor handler is the
|
|
SiteRuntime-001 corrected handler, data-sourced attributes now reach the DCL and the
|
|
`RouteToSetAttributesResponse` reflects the real per-attribute outcome — a non-existent
|
|
attribute or a failed device write is reported as failure rather than an unconditional
|
|
optimistic `true`.
|
|
|
|
### SiteRuntime-003 — Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Akka.NET conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:222` |
|
|
|
|
**Description**
|
|
|
|
`HandleDeploy` stops an existing Instance Actor with `Context.Stop` and then
|
|
reschedules the same `DeployInstanceCommand` to itself after a hard-coded 500 ms,
|
|
hoping the child has fully terminated by then. `Context.Stop` is asynchronous; the
|
|
child is only removed from the parent's children collection after it actually stops
|
|
(including running `PostStop` on its descendants). If a deeply nested or slow
|
|
hierarchy takes longer than 500 ms, `CreateInstanceActor` calls `Context.ActorOf`
|
|
with a name that still belongs to the terminating child and throws
|
|
`InvalidActorNameException`. The `_instanceActors` dictionary check does not prevent
|
|
this — the dictionary entry is removed immediately, but the Akka child registry is
|
|
not. The 500 ms delay is also unconditionally added to every redeploy latency.
|
|
|
|
**Recommendation**
|
|
|
|
Watch the terminating child (`Context.Watch`) and recreate the Instance Actor only
|
|
after receiving the `Terminated` message, instead of guessing with a timer. Buffer
|
|
or stash the in-flight `DeployInstanceCommand` (and any further commands for that
|
|
instance) until termination completes.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`<pending>`): `HandleDeploy` no longer uses a fixed 500 ms
|
|
reschedule. When a redeployment targets a running instance, the existing Instance Actor
|
|
is `Context.Watch`-ed and stopped, and the in-flight `DeployInstanceCommand` is buffered
|
|
in a `_pendingRedeploys` map keyed by the terminating actor ref. A new `Terminated`
|
|
handler recreates the Instance Actor only after the predecessor (and its whole subtree)
|
|
has fully stopped, eliminating the `InvalidActorNameException` race and the
|
|
unconditional redeploy-latency penalty. The shared `ApplyDeployment` helper also skips
|
|
the `_totalDeployedCount` increment for redeployments, so the deployed-instance count no
|
|
longer drifts (this additionally addresses the root cause behind SiteRuntime-004).
|
|
|
|
### SiteRuntime-004 — `_totalDeployedCount` is incremented on redeployment of an existing instance
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium — re-triaged: already fixed by the SiteRuntime-003 resolution. |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`ApplyDeployment`) |
|
|
|
|
**Description**
|
|
|
|
In `HandleDeploy`, the existing-actor branch (line 223) reschedules the command and
|
|
returns. When the rescheduled command runs, no actor exists, so the code falls
|
|
through to the "new instance" branch and executes `_totalDeployedCount++`
|
|
(line 239). A redeployment is an *update* of an already-deployed instance, not a new
|
|
one, so the deployed count is over-counted by one on every redeploy.
|
|
`StoreDeployedConfigAsync` uses UPSERT semantics, so the SQLite row count does not
|
|
grow, but the in-memory `_totalDeployedCount` (reported to the health collector via
|
|
`UpdateInstanceCounts`) drifts upward and the reported "disabled" count becomes
|
|
wrong.
|
|
|
|
**Re-triage (2026-05-16)**
|
|
|
|
Verified against the current source: this is **already fixed**. The SiteRuntime-003
|
|
resolution replaced the fixed-delay reschedule with a shared `ApplyDeployment` helper
|
|
that takes an `isRedeploy` flag and guards the counter with `if (!isRedeploy)
|
|
_totalDeployedCount++;`. The redeploy path (`HandleTerminated`) always calls
|
|
`ApplyDeployment(..., isRedeploy: true)`, so the counter is no longer bumped on
|
|
redeployment. The regression test
|
|
`DeploymentManagerRedeployTests.Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances`
|
|
already covers this and passes. No further code change was required.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`commit pending`): no new change needed — the root cause was
|
|
eliminated by the SiteRuntime-003 fix (the `isRedeploy` guard in `ApplyDeployment`).
|
|
Confirmed by the existing passing regression test
|
|
`Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances`. Re-triaged from Open to
|
|
Resolved.
|
|
|
|
### SiteRuntime-005 — Deployment reports `Success` to central before persistence completes
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`ApplyDeployment`, `HandleDeployPersistenceResult`) |
|
|
|
|
**Description**
|
|
|
|
`HandleDeploy` replies to central with `DeploymentStatus.Success` immediately after
|
|
creating the Instance Actor, while the SQLite persistence (`StoreDeployedConfigAsync`
|
|
+ `ClearStaticOverridesAsync`) runs asynchronously on a `Task.Run`. If persistence
|
|
fails, `HandleDeployPersistenceResult` only logs an error — central has already been
|
|
told the deployment succeeded. On a subsequent node restart or failover the instance
|
|
will not be re-created (it is not in SQLite), so the deployment is silently lost
|
|
despite central recording success. This contradicts the design's intent that the
|
|
site is the durable source of truth for deployed configs.
|
|
|
|
**Recommendation**
|
|
|
|
Persist the config before replying, or treat a persistence failure as a deployment
|
|
failure and send a corrective `DeploymentStatusResponse`/health signal to central.
|
|
At minimum, do not report `Success` until the config row is committed.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`commit pending`): root cause confirmed — `ApplyDeployment` sent
|
|
`DeploymentStatusResponse(Success)` synchronously before the persistence `Task.Run`
|
|
completed. The `Success` reply is now sent from `HandleDeployPersistenceResult` only
|
|
once the persistence result is known: on success it replies `Success`; on a
|
|
persistence failure it logs the error, stops the optimistically-created Instance
|
|
Actor, rolls back the deployed-instance counter, and replies
|
|
`DeploymentStatus.Failed` with the error message. `DeployPersistenceResult` carries an
|
|
`IsRedeploy` flag so the counter rollback is skipped for redeployments. Regression
|
|
tests: `DeploymentManagerMediumFindingsTests.Deploy_PersistenceFailure_ReportsFailedNotSuccess`
|
|
and `Deploy_Success_ReportsSuccessAndPersistsConfig`.
|
|
|
|
### SiteRuntime-006 — Site-local repositories read `SiteStorageService` private field via reflection
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Code organization & conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs` |
|
|
|
|
**Description**
|
|
|
|
Both repositories' `CreateConnection()` use `Type.GetField("_connectionString",
|
|
BindingFlags.NonPublic | BindingFlags.Instance)` to extract the private connection
|
|
string out of `SiteStorageService`. This is brittle (any rename or refactor of the
|
|
field breaks it at runtime, not compile time), defeats encapsulation, and the
|
|
accompanying XML comment openly describes it as a "pragmatic" hack and is internally
|
|
contradictory (it states a connection string is "passed separately at DI
|
|
registration time" which is not what the code does). It also sits awkwardly against
|
|
the project's own script trust model, which forbids `System.Reflection` in scripts.
|
|
|
|
**Recommendation**
|
|
|
|
Expose the connection string properly: add an `ISiteStorageConnectionProvider`
|
|
(already referenced in `ServiceCollectionExtensions` XML docs but not used), or have
|
|
`SiteStorageService` expose a `CreateConnection()` factory, and inject that into the
|
|
repositories. Remove the reflection entirely.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`commit pending`): root cause confirmed — both repositories
|
|
reflected into `SiteStorageService._connectionString`. `SiteStorageService` now
|
|
exposes a public `CreateConnection()` factory method that returns an unopened
|
|
`SqliteConnection` against the site database. Both `SiteExternalSystemRepository` and
|
|
`SiteNotificationRepository` now obtain connections via `_storage.CreateConnection()`;
|
|
all reflection (`Type.GetField` / `BindingFlags`) and the contradictory XML comments
|
|
have been removed. This is a fully in-module refactor — no cross-module design
|
|
decision was needed. Regression test:
|
|
`SiteRepositoryTests.ExternalSystemRepository_RoundTripsStoredDefinition` exercises
|
|
the repository's connection path end-to-end.
|
|
|
|
### SiteRuntime-007 — Synthetic entity IDs use the non-deterministic `string.GetHashCode()`
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SyntheticId.cs` |
|
|
|
|
**Description**
|
|
|
|
`GenerateSyntheticId` computes `name.GetHashCode() & 0x7FFFFFFF`. On .NET Core,
|
|
`string.GetHashCode()` is randomized per process by default, so the "stable
|
|
deterministic synthetic ID" promised by the XML comment is not stable at all — it
|
|
changes every time the process restarts. Any caller that obtained an ID and later
|
|
calls `GetExternalSystemByIdAsync`/`GetNotificationListByIdAsync` after a restart
|
|
will fail to find the entity. It also risks collisions: distinct names can hash to
|
|
the same 31-bit value, and `GetExternalSystemByIdAsync` would then return the wrong
|
|
row.
|
|
|
|
**Recommendation**
|
|
|
|
Use a deterministic, collision-resistant hash (e.g. a stable FNV-1a or the first
|
|
bytes of a SHA-256 of the name) if a synthetic integer ID is genuinely required, or
|
|
better, change the repository contract to key these site-local artifacts by name
|
|
rather than synthesising integer IDs.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`commit pending`): root cause confirmed — both repositories used
|
|
`name.GetHashCode()`, which is per-process randomized on .NET Core. A new internal
|
|
`SyntheticId` helper computes a deterministic, process-stable 31-bit ID using the
|
|
FNV-1a hash over the name's UTF-8 bytes. Both `GenerateSyntheticId` methods now
|
|
delegate to `SyntheticId.From(name)`. (The integer-keyed lookups are kept because
|
|
they are mandated by the shared `IExternalSystemRepository`/`INotificationRepository`
|
|
contracts in Commons — changing those contracts to name-keyed would be a cross-module
|
|
change outside this module's scope; the deterministic hash resolves the correctness
|
|
defect within scope.) Regression tests:
|
|
`SiteRepositoryTests.ExternalSystemRepository_SyntheticId_IsStableAcrossRestart` and
|
|
`NotificationRepository_SyntheticId_IsStableAcrossRestart` re-create the service to
|
|
simulate a process restart and confirm by-ID lookups still resolve.
|
|
|
|
### SiteRuntime-008 — Blocking `.GetAwaiter().GetResult()` on the actor thread during startup
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Akka.NET conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`HandleStartupConfigsLoaded`, `LoadSharedScriptsFromStorage`, `HandleSharedScriptsLoaded`) |
|
|
|
|
**Description**
|
|
|
|
`LoadSharedScriptsFromStorage` is called synchronously from
|
|
`HandleStartupConfigsLoaded` (the actor's message handler) and performs
|
|
`_storage.GetAllSharedScriptsAsync().GetAwaiter().GetResult()` followed by Roslyn
|
|
compilation of every shared script. This blocks the DeploymentManager singleton's
|
|
mailbox thread for the full duration of the SQLite read and all shared-script
|
|
compilation. On the default dispatcher this also ties up a thread-pool thread and
|
|
risks thread-pool starvation, and the singleton cannot process any other message
|
|
(deployments, lifecycle commands, debug routing) until it returns. The rest of the
|
|
class correctly uses `PipeTo`/`ContinueWith`.
|
|
|
|
**Recommendation**
|
|
|
|
Load shared scripts asynchronously and `PipeTo(Self)` an internal message, the same
|
|
pattern already used for `StartupConfigsLoaded`. Perform compilation either inside
|
|
the piped continuation handler (still on the actor thread but at least off the
|
|
synchronous startup path) or on a dedicated background task whose result is piped
|
|
back.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`commit pending`): root cause confirmed — the blocking
|
|
`.GetAwaiter().GetResult()` and Roslyn compilation ran on the singleton's mailbox
|
|
thread inside `HandleStartupConfigsLoaded`. `LoadSharedScriptsFromStorage` now runs
|
|
the SQLite read **and** the Roslyn compilation on a background `Task.Run` and pipes a
|
|
new internal `SharedScriptsLoaded` message back to the actor. A new
|
|
`HandleSharedScriptsLoaded` handler then begins staggered Instance Actor creation, so
|
|
the compilation→creation ordering is preserved without ever blocking the mailbox. A
|
|
shared-script load failure is logged and startup proceeds (scripts needing a missing
|
|
shared script fail at execution time). Regression test:
|
|
`DeploymentManagerMediumFindingsTests.Startup_WithSharedScripts_LoadsConfigsAndStaysResponsive`
|
|
(confirms startup completes and the actor stays responsive with shared scripts
|
|
present).
|
|
|
|
### SiteRuntime-009 — Script execution actors run scripts on the default thread pool, not a dedicated dispatcher
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Akka.NET conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs`, `src/ScadaLink.SiteRuntime/Actors/AlarmExecutionActor.cs`, `src/ScadaLink.SiteRuntime/Scripts/ScriptExecutionScheduler.cs` |
|
|
|
|
**Description**
|
|
|
|
The design (CLAUDE.md "Architecture & Runtime") states Script Execution Actors run
|
|
on a *dedicated blocking I/O dispatcher*. The code does not do this: `ScriptActor.SpawnExecution`
|
|
and `AlarmActor.SpawnAlarmExecution` create the execution actors with no
|
|
`.WithDispatcher(...)`, and the execution itself runs inside a bare `Task.Run`,
|
|
i.e. on the shared .NET thread pool. The `// NOTE: In production, configure a
|
|
dedicated ... dispatcher` comments acknowledge the gap but it ships unconfigured.
|
|
Scripts can perform synchronous blocking I/O (`Database.Connection`, synchronous
|
|
`ExternalSystem.Call`); running them on the shared pool can starve it and stall
|
|
unrelated Akka dispatchers and HTTP request handling under load.
|
|
|
|
**Recommendation**
|
|
|
|
Define the dedicated dispatcher in HOCON and chain `.WithDispatcher(...)` on the
|
|
execution actor `Props`. If the `Task.Run` model is kept, run script bodies on a
|
|
dedicated `TaskScheduler` / bounded scheduler rather than the global pool. Either
|
|
way, remove the "in production, configure…" comments by actually configuring it.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`commit pending`): root cause confirmed — script and alarm
|
|
on-trigger bodies ran inside a bare `Task.Run` on the shared `ThreadPool`. The
|
|
recommendation's `TaskScheduler` option was taken because it is fully in-module (a
|
|
HOCON dispatcher would require editing the Host's ActorSystem config, out of scope).
|
|
A new `ScriptExecutionScheduler` provides a bounded set of dedicated background
|
|
threads (count from the new `SiteRuntimeOptions.ScriptExecutionThreadCount`, default
|
|
8). `ScriptExecutionActor` and `AlarmExecutionActor` now run their bodies via
|
|
`Task.Factory.StartNew(..., ScriptExecutionScheduler.Shared(options)).Unwrap()`
|
|
instead of `Task.Run`, so blocking script I/O is contained to those dedicated threads
|
|
and cannot starve the global pool. The misleading "in production, configure a
|
|
dedicated dispatcher" comments were removed. Regression tests:
|
|
`ScriptExecutionSchedulerTests` (`Scheduler_RunsWork_OffTheThreadPool`,
|
|
`Scheduler_RespectsConfiguredThreadCount`, `Scheduler_Shared_ReturnsSameInstanceForOptions`).
|
|
|
|
### SiteRuntime-010 — `EnsureDclConnections` never updates a connection whose configuration changed
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`EnsureDclConnections`, `ComputeConnectionConfigHash`) |
|
|
|
|
**Description**
|
|
|
|
`EnsureDclConnections` tracks created connections in `_createdConnections` and skips
|
|
any name already present (`if (_createdConnections.Contains(name)) continue;`). The
|
|
skip is purely name-based: if a redeployment (or an artifact deployment) changes the
|
|
endpoint, credentials, backup endpoint, or `FailoverRetryCount` of an existing
|
|
connection, the new configuration is silently ignored and the DCL keeps using the
|
|
stale `CreateConnectionCommand`. There is no `UpdateConnectionCommand` path. The
|
|
design states that after artifact deployment the site is fully self-contained with
|
|
current configuration; this caching breaks that for connection changes.
|
|
|
|
**Recommendation**
|
|
|
|
Compare the incoming connection config against the last one sent and re-issue a
|
|
create/update command when it differs, or have the DCL treat `CreateConnectionCommand`
|
|
as idempotent upsert and always forward it. Key the cache on a config hash, not just
|
|
the name.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`commit pending`): root cause confirmed — the cache was a
|
|
name-only `HashSet`, so a changed connection config was silently dropped.
|
|
`_createdConnections` is now a `Dictionary<string,string>` mapping connection name to
|
|
a SHA-256 hash of its protocol/primary-config/backup-config/failover-retry-count
|
|
(`ComputeConnectionConfigHash`). A connection whose hash is unchanged is still
|
|
skipped; a connection whose config changed re-issues a `CreateConnectionCommand` so
|
|
the DCL adopts the new configuration. Regression tests:
|
|
`DeploymentManagerMediumFindingsTests.EnsureDclConnections_ConnectionConfigChanged_ReissuesCreateCommand`
|
|
and `EnsureDclConnections_UnchangedConfig_DoesNotReissueCreateCommand`.
|
|
|
|
### SiteRuntime-011 — Trust-model validation is a substring scan and is both over- and under-inclusive
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Security |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptCompilationService.cs` (`ValidateTrustModel`) |
|
|
|
|
**Description**
|
|
|
|
`ValidateTrustModel` enforces the script trust model by doing raw `string.Contains` /
|
|
`IndexOf` on the script source text for forbidden namespace strings. This is
|
|
unreliable in both directions:
|
|
|
|
- **Bypass (under-inclusive):** the check looks only for the literal namespace
|
|
strings. A script can reach forbidden APIs without ever writing `System.IO` etc. —
|
|
e.g. via fully-qualified type use through aliasing, `global::`-prefixed names, or
|
|
simply because the namespace is already imported transitively. The compilation
|
|
references include `typeof(object).Assembly` (the whole of `System.Private.CoreLib`,
|
|
which contains `System.IO.File`, `System.Threading.Thread`, `System.Reflection`,
|
|
etc.), so forbidden types are fully resolvable at compile time and the only barrier
|
|
is this textual scan.
|
|
- **False positives (over-inclusive):** any occurrence of the substring in a comment,
|
|
string literal, or an unrelated identifier (e.g. a variable named `ProcessThreading`)
|
|
triggers a violation; the `AllowedExceptions` logic only rescues exact prefixes.
|
|
- The dead `isAllowed` variable at line 64 is computed and never used.
|
|
|
|
**Recommendation**
|
|
|
|
Enforce the trust model with a Roslyn `SyntaxWalker`/semantic analysis (inspect
|
|
resolved symbols and their containing namespaces/assemblies), or restrict the
|
|
compilation's metadata references and `AssemblyLoadContext` so forbidden types are
|
|
genuinely unavailable, rather than relying on source-text matching. Remove the
|
|
unused `isAllowed` variable.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`commit pending`): root cause confirmed — `ValidateTrustModel`
|
|
was a raw `string.Contains`/`IndexOf` scan of the source text, with a dead `isAllowed`
|
|
variable. It is now Roslyn semantic analysis: the script is parsed and a
|
|
`CSharpCompilation` + `SemanticModel` are built; every name/member/object-creation
|
|
node is resolved to its symbol and the symbol's containing namespace and
|
|
fully-qualified containing type are checked against the forbidden roots. Bare
|
|
namespace symbols are ignored (so the `System.Threading` qualifier of the allowed
|
|
`System.Threading.Tasks.Task` no longer false-positives). A name that cannot be
|
|
resolved (a type from an assembly deliberately absent from the script's references)
|
|
falls back to a syntactic fully-qualified-name check, so e.g. `System.Net.Http`
|
|
references are still rejected. The dead `isAllowed` variable was removed. This fixes
|
|
both the bypass (`global::`/alias-qualified forbidden types) and the false positives
|
|
(forbidden namespace string in a comment, string literal, or unrelated identifier).
|
|
Regression tests: new `TrustModelSemanticTests` (alias/`global::` detection, comment/
|
|
literal/identifier non-detection, allowed-exception resolution); all 39 existing
|
|
`SandboxTests` + `ScriptCompilationServiceTests` continue to pass.
|
|
|
|
### SiteRuntime-012 — `AttributeAccessor`/`ScopeAccessors` block the script on a synchronous Ask
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Concurrency & thread safety |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScopeAccessors.cs:28` |
|
|
|
|
**Description**
|
|
|
|
`AttributeAccessor`'s indexer getter calls `_ctx.GetAttribute(...).GetAwaiter().GetResult()`,
|
|
synchronously blocking the script-execution thread on an actor Ask. Combined with
|
|
SiteRuntime-009 (scripts run on the shared thread pool) this means a script that
|
|
reads several attributes via `Attributes["X"]` holds a pool thread blocked for each
|
|
round-trip. The async variants (`GetAsync`/`SetAsync`) exist but the ergonomic
|
|
indexer encourages the blocking path. The XML comment notes "Reads block on the
|
|
actor Ask" but does not warn about the thread-pool impact.
|
|
|
|
**Recommendation**
|
|
|
|
Once a dedicated script dispatcher exists (SiteRuntime-009) the blocking is contained
|
|
to that pool, which is acceptable; until then, document the cost clearly and prefer
|
|
steering script authors to the async accessors. Consider making the indexer
|
|
internal-only and exposing only the async API.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`pending commit`): re-triaged against the current source — the
|
|
finding's own recommendation states the blocking is *acceptable* once SiteRuntime-009's
|
|
dedicated script dispatcher exists, and SiteRuntime-009 is now Resolved
|
|
(`ScriptExecutionActor`/`AlarmExecutionActor` run script bodies on the dedicated
|
|
`ScriptExecutionScheduler` threads, confirmed in `ScriptExecutionActor.cs:74`). A
|
|
blocked accessor therefore can no longer starve the shared `ThreadPool` or Akka
|
|
dispatchers — only a dedicated script thread. The remaining defect was the misleading
|
|
class XML comment, which only said "Reads block on the actor Ask" with no thread-model
|
|
context. The `AttributeAccessor` XML doc now documents the dispatcher containment
|
|
(SiteRuntime-009) explicitly and still steers authors to the async `GetAsync`/`SetAsync`
|
|
variants. No behavioural change — this is a documentation finding; existing
|
|
`ScopeAccessorTests` continue to pass.
|
|
|
|
### SiteRuntime-013 — `HandleUnsubscribeDebugView` does nothing despite documented behaviour
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:414` |
|
|
|
|
**Description**
|
|
|
|
`HandleUnsubscribeDebugView` is documented ("Debug view unsubscribe — removes
|
|
subscription") and the actor registers a handler for `UnsubscribeDebugViewRequest`,
|
|
but the body only logs a debug message — there is no subscription state in the
|
|
Instance Actor to remove. The design places the actual subscription lifecycle in
|
|
`SiteStreamManager` (`Subscribe`/`Unsubscribe`/`RemoveSubscriber`), so the Instance
|
|
Actor genuinely has nothing to do here. The handler and its XML comment are
|
|
therefore misleading: a reader expects it to tear down a subscription.
|
|
|
|
**Recommendation**
|
|
|
|
Either remove the no-op handler and route `UnsubscribeDebugViewRequest` to wherever
|
|
the `SiteStreamManager` subscription is actually cancelled, or correct the XML
|
|
comment to state explicitly that subscription teardown is handled by
|
|
`SiteStreamManager` and this handler is a no-op acknowledgement.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`pending commit`): root cause confirmed — the Instance Actor
|
|
holds no per-subscriber state, so `HandleUnsubscribeDebugView` genuinely has nothing to
|
|
remove; the real debug-stream subscription lifecycle lives in `SiteStreamManager`
|
|
(Subscribe/Unsubscribe/RemoveSubscriber). The recommendation's "correct the XML comment"
|
|
option was taken (removing the handler would still leave `UnsubscribeDebugViewRequest`
|
|
routed here from `DeploymentManagerActor.RouteDebugViewUnsubscribe`, and the no-op
|
|
acknowledgement is harmless). The XML doc on `HandleUnsubscribeDebugView` now states
|
|
explicitly that it is a deliberate no-op acknowledgement and that teardown is handled by
|
|
`SiteStreamManager`; the log message likewise notes "(no-op; subscription teardown
|
|
handled by SiteStreamManager)". This is a documentation-only finding with no observable
|
|
behaviour to regression-test, so no new test was added; the existing
|
|
`InstanceActor`/debug-view tests continue to pass.
|
|
|
|
### SiteRuntime-014 — Trigger-expression evaluation blocks the coordinator actor thread
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Akka.NET conventions |
|
|
| Status | Deferred |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:219`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:389` |
|
|
|
|
**Description**
|
|
|
|
`EvaluateExpressionTrigger` (ScriptActor) and `EvaluateExpression` (AlarmActor) run a
|
|
compiled Roslyn script with `.RunAsync(...).GetAwaiter().GetResult()` directly inside
|
|
the actor's `AttributeValueChanged` message handler. This blocks the coordinator
|
|
actor's mailbox thread for up to the 2-second timeout on every monitored attribute
|
|
change. Coordinator actors are on the default dispatcher and process the hot path of
|
|
attribute-change fan-out; a slow expression delays all other messages to that actor
|
|
and consumes a thread-pool thread for the duration. The inline comments correctly
|
|
note CPU-bound expressions are not interruptible but do not address the
|
|
mailbox-blocking concern.
|
|
|
|
**Recommendation**
|
|
|
|
Trigger expressions are expected to be cheap, but to keep the actor responsive
|
|
consider evaluating them off the actor thread (pipe the boolean result back as an
|
|
internal message) or pre-compiling to a plain delegate that executes near-instantly
|
|
without the Roslyn scripting `RunAsync` machinery.
|
|
|
|
**Resolution**
|
|
|
|
Deferred 2026-05-16 (`pending commit`): root cause confirmed — `EvaluateExpressionTrigger`
|
|
(ScriptActor) and `EvaluateExpression` (AlarmActor) call
|
|
`_compiledTriggerExpression.RunAsync(...).GetAwaiter().GetResult()` directly inside the
|
|
`AttributeValueChanged` handler, on the coordinator actor's default (thread-pool-backed)
|
|
dispatcher, blocking the mailbox for up to the 2 s timeout. Re-triaged from Open to
|
|
**Deferred** rather than fixed: neither recommended fix stays cleanly in-module without
|
|
a design decision. (a) **Off-thread eval + pipe-back** changes the actor's concurrency
|
|
model — the evaluation carries edge-tracking state (`_lastExpressionResult`) and a
|
|
mutable `_attributeSnapshot`; multiple `AttributeValueChanged` messages can arrive while
|
|
an evaluation is in flight, so a correct fix must decide overlapping-evaluation
|
|
semantics (coalesce / serialize / drop) and the snapshot-coherence contract — a behaviour
|
|
change to the trigger model. (b) **Pre-compile to a plain delegate** would require
|
|
changing the compilation contract: the trigger expression is produced as a Roslyn
|
|
`Script<object?>` by `ScriptCompilationService.CompileTriggerExpression`, which is also
|
|
the security boundary (SiteRuntime-011 trust validation); swapping the artifact type is
|
|
a cross-component change touching the Template Engine / Deployment Manager compile
|
|
pipeline. Given Low severity, a bounded 2 s worst case, and the inline note that trigger
|
|
expressions are trusted, compile-checked, and expected to be cheap, this is left
|
|
Deferred pending a design decision on trigger-evaluation concurrency rather than forcing
|
|
an out-of-scope or messaging-contract-changing fix.
|
|
|
|
### SiteRuntime-015 — `LoggerFactory` created per Instance Actor and never disposed
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Performance & resource management |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:746` |
|
|
|
|
**Description**
|
|
|
|
`CreateInstanceActor` does `var loggerFactory = new LoggerFactory();` for every
|
|
Instance Actor it creates, uses it once to produce an `ILogger<InstanceActor>`, and
|
|
never disposes it. `LoggerFactory` is `IDisposable`. With up to 500 instances (and
|
|
churn from redeployments) this leaks a factory per instance, and the produced
|
|
loggers are detached from the application's configured logging providers, so
|
|
Instance Actor logs may not be routed/filtered consistently with the rest of the
|
|
host.
|
|
|
|
**Recommendation**
|
|
|
|
Inject the application's `ILoggerFactory` (or an `ILogger<InstanceActor>` factory
|
|
delegate) into `DeploymentManagerActor` via DI and reuse it, rather than newing one
|
|
up per child. Do not create a fresh `LoggerFactory` in a hot creation path.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`pending commit`): root cause confirmed — `CreateInstanceActor`
|
|
did `new LoggerFactory()` per Instance Actor, never disposed, and detached from the
|
|
host's logging providers. `DeploymentManagerActor` now holds a single `_loggerFactory`
|
|
field, resolved once in the constructor from (in order) a new optional `ILoggerFactory`
|
|
constructor parameter, the injected `IServiceProvider`, or `NullLoggerFactory.Instance`
|
|
as a last resort — never a per-instance allocation. `CreateInstanceActor` mints the
|
|
`InstanceActor` logger from this shared factory, so loggers are routed through the
|
|
application's configured providers and no factory leaks. Regression test:
|
|
`DeploymentManagerLoggerFactoryTests.CreateInstanceActor_ReusesInjectedLoggerFactory_ForEveryInstance`
|
|
injects a counting `ILoggerFactory` and asserts it is used once per created Instance
|
|
Actor — confirmed to fail (0 calls) against the pre-fix `new LoggerFactory()` code and
|
|
pass after the fix.
|
|
|
|
### SiteRuntime-016 — Short-lived execution actors, replication actor, and repositories are untested
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Testing coverage |
|
|
| Status | Resolved |
|
|
| Location | `tests/ScadaLink.SiteRuntime.Tests/` |
|
|
|
|
**Description**
|
|
|
|
The test project covers the coordinator actors (`InstanceActor`, `ScriptActor`,
|
|
`AlarmActor`, `DeploymentManagerActor`), persistence, scripting and streaming, but a
|
|
search of the test sources finds no references to `ScriptExecutionActor`,
|
|
`AlarmExecutionActor`, `SiteReplicationActor`, `SiteExternalSystemRepository`, or
|
|
`SiteNotificationRepository`. These cover critical paths: script timeout/failure
|
|
handling and result reply, alarm on-trigger execution, peer config/S&F replication
|
|
(including the `SendToPeer` no-peer drop), and the reflection-based repository reads.
|
|
Several findings above (001/002 mis-routing, 007 ID instability, 011 trust bypass)
|
|
would likely have been caught by targeted tests.
|
|
|
|
**Recommendation**
|
|
|
|
Add unit/integration tests for the execution actors (success, timeout, exception,
|
|
Ask-reply, PoisonPill self-stop), `SiteReplicationActor` (outbound forward, inbound
|
|
apply, peer tracking on cluster events), and the two repositories (round-trip read,
|
|
synthetic-ID lookup, missing-row behaviour).
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (`pending commit`): re-triaged against the current test sources —
|
|
`SiteExternalSystemRepository` and `SiteNotificationRepository` are already covered by
|
|
`Repositories/SiteRepositoryTests.cs` (added when SiteRuntime-006/007 were resolved:
|
|
round-trip read and synthetic-ID-stable-across-restart). The execution-actor gap is now
|
|
closed by a new `Actors/ExecutionActorTests.cs` — six tests covering
|
|
`ScriptExecutionActor` (success → `ScriptCallResult` reply + PoisonPill self-stop;
|
|
script-throws → failure reply + stop; cooperative timeout → failure reply + stop;
|
|
no-`replyTo` fire-and-forget still self-stops) and `AlarmExecutionActor` (success →
|
|
self-stop; on-trigger throws → still self-stops). `SiteReplicationActor` is *not* covered
|
|
here: it depends on `Cluster.Get(Context.System)` and so requires a clustered
|
|
`ActorSystem` HOCON harness that does not yet exist in this test project — adding that
|
|
harness is a larger test-infrastructure task tracked separately and out of scope for a
|
|
Low-severity coverage finding; the highest-value untested paths the finding called out
|
|
(script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests
|
|
green.
|