487859bff0
Doc/XML-comment drift + small adherence fixes across 17 modules. Highlights: - Host-017: site CoordinatedShutdown ordering — SiteStreamGrpcServer gains CancelAllStreams() (refuse new streams, cancel active), wired into Program.cs site branch via ApplicationStopping. - InboundAPI-021: ParentExecutionId now travels on RouteToGet/SetAttributes symmetric with RouteToCallRequest; RouteHelper stamps from _parentExecutionId. - ClusterInfra-012: ClusterOptionsValidator now requires both seed nodes. - Comm-018: SiteCommunicationActor.HeartbeatMessage.IsActive derived from cluster leader check (was hardcoded true). - DM-020: reconciliation audit row attributes the current user, not prior deployer. - SEL-019: EventLogPurgeService early-exits on standby via active-node check. - Plus comment/XML-doc accuracy fixes across AuditLog, ConfigurationDatabase, NotificationOutbox, SiteRuntime, SiteCallAudit; doc refreshes for Component- Commons / -ManagementService / -CLI / -ExternalSystemGateway / -HealthMonitoring / -Transport / -ConfigurationDatabase; CD-023 index-name doc alignment. 11 new regression tests (RouteHelper x4, SiteStreamGrpcServer x2, ClusterOptionsValidator x1, SiteCommunicationActor x1, DeploymentService x1, EventLogPurgeService x3). Build clean (0 warnings); InboundAPI/Communication/ Host suites all green. README regenerated: 112 open (was 136).
1347 lines
73 KiB
Markdown
1347 lines
73 KiB
Markdown
# Code Review — SiteRuntime
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ScadaLink.SiteRuntime` |
|
||
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-28 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `1eb6e97` |
|
||
| Open findings | 7 |
|
||
|
||
## Summary
|
||
|
||
The SiteRuntime module is broadly well-structured: the actor hierarchy matches the
|
||
design doc, supervision strategies are explicit, and the trigger/alarm evaluation
|
||
logic is thorough. However the review surfaced one genuinely serious correctness
|
||
defect — `Instance.SetAttribute` never routes writes to the Data Connection Layer
|
||
for data-sourced attributes, contradicting a core design decision and silently
|
||
turning device writes into local-only static overrides. Several other findings
|
||
cluster around two themes: (1) actor-thread discipline is violated in a few hot
|
||
paths (blocking `.GetAwaiter().GetResult()` calls on the actor thread, a fragile
|
||
fixed-delay reschedule for redeployment), and (2) the site-local repositories reach
|
||
into `SiteStorageService` private state via reflection and mint entity IDs with the
|
||
non-deterministic `string.GetHashCode()`. Script execution runs on the default
|
||
thread pool rather than a dedicated blocking dispatcher (the code acknowledges this
|
||
in a comment but ships it anyway). Test coverage exists for the coordinator actors,
|
||
persistence and scripting, but the short-lived execution actors, the replication
|
||
actor, and the repositories are untested.
|
||
|
||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||
|
||
The module was re-reviewed at commit `39d737e`. No source under
|
||
`src/ScadaLink.SiteRuntime` has changed since the previous review at `9c60592`
|
||
(the only intervening commits are code-review documentation updates), so all of
|
||
SiteRuntime-001..013, 015, 016 remain Resolved and SiteRuntime-014 remains
|
||
Deferred — its Deferred justification (a trigger-evaluation concurrency design
|
||
decision is required before either recommended fix can land in-module) still
|
||
holds verbatim against the unchanged `ScriptActor`/`AlarmActor` source. The
|
||
re-review nonetheless worked through all 10 checklist categories afresh and
|
||
surfaced three new findings that the prior pass did not record: a cross-thread
|
||
`Dictionary` enumeration race when the Instance Actor's live `_attributes`
|
||
dictionary is handed by reference into child `ScriptActor`/`AlarmActor`
|
||
constructors (SiteRuntime-017, Medium); a stale `ScriptExecutionActor` XML doc
|
||
that still claims a "dedicated blocking I/O dispatcher" (SiteRuntime-018, Low);
|
||
and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
|
||
never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
|
||
2026-05-17. Open findings: 0.
|
||
|
||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||
|
||
The module was re-reviewed at commit `1eb6e97` as part of the new baseline
|
||
review. The SiteRuntime source surface has grown materially since the prior
|
||
pass — primarily by threading `ExecutionId`/`ParentExecutionId`/`SourceNode`
|
||
through the script-trust-boundary helpers and the cached-call telemetry
|
||
emitters, and by adding `OperationTrackingStore`, the
|
||
`AuditingDbConnection`/`AuditingDbCommand`/`AuditingDbDataReader` decorators,
|
||
and `ScriptExecutionScheduler`. All 10 checklist categories were walked afresh.
|
||
Seven new findings were recorded: a race that throws
|
||
`InvalidActorNameException` when a second `DeployInstanceCommand` arrives for
|
||
the same instance while a redeployment is still terminating its predecessor
|
||
(SiteRuntime-020, Medium); an artifact-only data-connection update that never
|
||
reaches the DCL (SiteRuntime-021, Medium); `AuditingDbCommand.DbConnection.set`
|
||
reaching into `AuditingDbConnection._inner` via reflection — the same anti-
|
||
pattern SiteRuntime-006 eliminated for the repositories, now reintroduced and
|
||
in direct tension with the script trust model that forbids `System.Reflection`
|
||
(SiteRuntime-022, Medium); `Convert.ToDouble(value)` in `ScriptActor` /
|
||
`AlarmActor` running under `CurrentCulture` so a string attribute value
|
||
becomes locale-sensitive (SiteRuntime-023, Low); `OperationTrackingStore`
|
||
serialising every cached-call write through a single connection +
|
||
`SemaphoreSlim` and using sync-over-async in `Dispose()` (SiteRuntime-024,
|
||
Medium); inbound-API `SetAttribute` (and any future caller) accepting unknown
|
||
attribute names and persisting them as overrides, polluting both `_attributes`
|
||
and the SQLite override table (SiteRuntime-025, Low); and the
|
||
`ReplicationMessages.cs` outbound/inbound record types still missing public XML
|
||
docs (SiteRuntime-026, Low). Prior findings 001–019 remain
|
||
Resolved/Deferred — no regressions observed in any of their fixed call sites.
|
||
Open findings: 7.
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ✓ | SetAttribute mis-routing, deploy double-count, redeploy reschedule race. |
|
||
| 2 | Akka.NET conventions | ✓ | Blocking on actor thread, script execution not on a dedicated dispatcher, premature success reply. |
|
||
| 3 | Concurrency & thread safety | ✓ | `_attributes` dictionary shared with child actors by reference; `_executionCounter` is actor-confined (OK). |
|
||
| 4 | Error handling & resilience | ✓ | Deploy reports Success before persistence; replicated artifact/S&F failures only logged (matches best-effort design). |
|
||
| 5 | Security | ✓ | Trust-model validation is substring-based and weak; reflection used to read private fields. |
|
||
| 6 | Performance & resource management | ✓ | Per-call SQLite connections (acceptable); CPU-bound scripts not interruptible by timeout. |
|
||
| 7 | Design-document adherence | ✓ | SetAttribute DCL routing missing; staggered-startup and supervision otherwise conform. |
|
||
| 8 | Code organization & conventions | ✓ | Repositories reflect into another class; synthetic IDs non-deterministic. |
|
||
| 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
|
||
| 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). |
|
||
|
||
_Re-review (2026-05-28, `1eb6e97`):_
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ✓ | Second-deploy race vs. pending redeploy (020); artifact-only data-connection update never reaches DCL (021); unknown-name SetAttribute persists bogus overrides (025). |
|
||
| 2 | Akka.NET conventions | ✓ | Trigger-eval blocking on coordinator mailbox remains Deferred (014); short-lived execution actors and replication actor otherwise conform. |
|
||
| 3 | Concurrency & thread safety | ✓ | DM's `_instanceActors` cache and `_pendingRedeploys` map shifted from old race; new ordering race surfaced (020). `OperationTrackingStore` single-connection + SemaphoreSlim serialises all cached writes (024). |
|
||
| 4 | Error handling & resilience | ✓ | `Task.Run` fire-and-forget replication paths log on faulted (acceptable, per "best-effort replication" design). DM's deploy persistence rollback path (resolved as SiteRuntime-005) intact. |
|
||
| 5 | Security | ✓ | Trust-model semantic analysis (SiteRuntime-011 fix) intact. `AuditingDbCommand` reflects into `AuditingDbConnection._inner` — same anti-pattern as SiteRuntime-006 (022). Audit emitter captures SQL parameter values verbatim per M4 design (M5 will redact). |
|
||
| 6 | Performance & resource management | ✓ | Per-call SQLite connections on hot paths in `SiteStorageService` (existing pattern, acceptable). `OperationTrackingStore` `Dispose()` does sync-over-async (024). `ScriptExecutionScheduler` bounded threads as expected. |
|
||
| 7 | Design-document adherence | ✓ | Artifact-only data-connection update path is silently inert (021) — contradicts the "site is self-contained after artifact deployment" intent. |
|
||
| 8 | Code organization & conventions | ✓ | Repository reflection-via-private-field anti-pattern reintroduced in `AuditingDbCommand` (022). `ReplicationMessages.cs` public records still undocumented (026). |
|
||
| 9 | Testing coverage | ✓ | `SiteReplicationActor` remains uncovered (SiteRuntime-016 deferred that gap to a clustered-ActorSystem harness, still outstanding). New findings have no targeted coverage yet. |
|
||
| 10 | Documentation & comments | ✓ | `ReplicationMessages.cs` records lack XML docs (026); other XML doc surface materially expanded in `1eb6e97`. |
|
||
|
||
## Findings
|
||
|
||
### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:204` |
|
||
|
||
**Description**
|
||
|
||
The design doc (Component-SiteRuntime.md, "GetAttribute / SetAttribute" and
|
||
"Script Runtime API") states that `Instance.SetAttribute` on a *data-connected*
|
||
attribute must send a write request to the DCL, which writes to the physical
|
||
device, and that the in-memory value is **not** optimistically updated. For *static*
|
||
attributes it updates memory and persists an override.
|
||
|
||
The implementation makes no such distinction. `ScriptRuntimeContext.SetAttribute`
|
||
unconditionally sends a `SetStaticAttributeCommand`, and `InstanceActor.HandleSetStaticAttribute`
|
||
unconditionally treats every write as a static override: it mutates `_attributes`,
|
||
publishes an `AttributeValueChanged` with hard-coded `"Good"` quality, notifies
|
||
children, and persists a SQLite override. A script writing a data-sourced attribute
|
||
therefore never reaches the device, the write failure can never be returned
|
||
synchronously to the script, and the in-memory value diverges from the device
|
||
until the next subscription update overwrites it. The persisted override is also
|
||
wrong: data-sourced attributes should not have static overrides.
|
||
|
||
**Recommendation**
|
||
|
||
In `InstanceActor`, look up the target attribute in `_configuration.Attributes`. If
|
||
it has a non-empty `DataSourceReference`, issue a DCL write (e.g. a `WriteTagRequest`
|
||
to `_dclManager`) and surface success/failure to the caller; do not persist an
|
||
override and do not optimistically mutate `_attributes`. Only attributes with no
|
||
data source reference should follow the current static-override path. Consider
|
||
splitting the message into `SetStaticAttributeCommand` vs `SetDataAttributeCommand`,
|
||
or branching inside the handler.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`<pending>`): `InstanceActor.HandleSetStaticAttribute` now resolves
|
||
the target attribute's data binding from `_configuration`. Data-sourced attributes are
|
||
routed via a new `HandleSetDataAttribute` that Asks the DCL with a `WriteTagRequest` and
|
||
pipes the device-write outcome back to the caller as a `SetStaticAttributeResponse` —
|
||
no override is persisted and `_attributes` is not optimistically mutated. Static
|
||
attributes keep the override path and now also reply with a `SetStaticAttributeResponse`.
|
||
`ScriptRuntimeContext.SetAttribute` is now `async Task` and Asks the Instance Actor,
|
||
throwing `InvalidOperationException` on a failed device write so scripts get the failure
|
||
synchronously.
|
||
|
||
### SiteRuntime-002 — `RouteInboundApiSetAttributes` always treats writes as static overrides
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:632` |
|
||
|
||
**Description**
|
||
|
||
`RouteInboundApiSetAttributes` (handling `Route.To().SetAttribute(s)` from the
|
||
Inbound API) emits a `SetStaticAttributeCommand` for every attribute, so it inherits
|
||
the same defect as SiteRuntime-001: writes to data-sourced attributes never reach
|
||
the device and are instead persisted as static overrides. In addition the response
|
||
is sent back as unconditionally successful (`true`) before the Instance Actor has
|
||
even processed the command, so a non-existent attribute or a future DCL write
|
||
failure is reported to the external caller as success.
|
||
|
||
**Recommendation**
|
||
|
||
Route through the same corrected `InstanceActor` write handler as SiteRuntime-001 so
|
||
the static-vs-data distinction is honoured. The optimistic ack is acceptable for
|
||
fire-and-forget static writes per the doc, but the XML comment should make the
|
||
limitation explicit, and once data-attribute writes are supported they need a real
|
||
response path.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`<pending>`): `RouteInboundApiSetAttributes` now Asks the Instance
|
||
Actor per attribute (instead of fire-and-forget Tell) and aggregates the
|
||
`SetStaticAttributeResponse` results. Because the Instance Actor handler is the
|
||
SiteRuntime-001 corrected handler, data-sourced attributes now reach the DCL and the
|
||
`RouteToSetAttributesResponse` reflects the real per-attribute outcome — a non-existent
|
||
attribute or a failed device write is reported as failure rather than an unconditional
|
||
optimistic `true`.
|
||
|
||
### SiteRuntime-003 — Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:222` |
|
||
|
||
**Description**
|
||
|
||
`HandleDeploy` stops an existing Instance Actor with `Context.Stop` and then
|
||
reschedules the same `DeployInstanceCommand` to itself after a hard-coded 500 ms,
|
||
hoping the child has fully terminated by then. `Context.Stop` is asynchronous; the
|
||
child is only removed from the parent's children collection after it actually stops
|
||
(including running `PostStop` on its descendants). If a deeply nested or slow
|
||
hierarchy takes longer than 500 ms, `CreateInstanceActor` calls `Context.ActorOf`
|
||
with a name that still belongs to the terminating child and throws
|
||
`InvalidActorNameException`. The `_instanceActors` dictionary check does not prevent
|
||
this — the dictionary entry is removed immediately, but the Akka child registry is
|
||
not. The 500 ms delay is also unconditionally added to every redeploy latency.
|
||
|
||
**Recommendation**
|
||
|
||
Watch the terminating child (`Context.Watch`) and recreate the Instance Actor only
|
||
after receiving the `Terminated` message, instead of guessing with a timer. Buffer
|
||
or stash the in-flight `DeployInstanceCommand` (and any further commands for that
|
||
instance) until termination completes.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`<pending>`): `HandleDeploy` no longer uses a fixed 500 ms
|
||
reschedule. When a redeployment targets a running instance, the existing Instance Actor
|
||
is `Context.Watch`-ed and stopped, and the in-flight `DeployInstanceCommand` is buffered
|
||
in a `_pendingRedeploys` map keyed by the terminating actor ref. A new `Terminated`
|
||
handler recreates the Instance Actor only after the predecessor (and its whole subtree)
|
||
has fully stopped, eliminating the `InvalidActorNameException` race and the
|
||
unconditional redeploy-latency penalty. The shared `ApplyDeployment` helper also skips
|
||
the `_totalDeployedCount` increment for redeployments, so the deployed-instance count no
|
||
longer drifts (this additionally addresses the root cause behind SiteRuntime-004).
|
||
|
||
### SiteRuntime-004 — `_totalDeployedCount` is incremented on redeployment of an existing instance
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — re-triaged: already fixed by the SiteRuntime-003 resolution. |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`ApplyDeployment`) |
|
||
|
||
**Description**
|
||
|
||
In `HandleDeploy`, the existing-actor branch (line 223) reschedules the command and
|
||
returns. When the rescheduled command runs, no actor exists, so the code falls
|
||
through to the "new instance" branch and executes `_totalDeployedCount++`
|
||
(line 239). A redeployment is an *update* of an already-deployed instance, not a new
|
||
one, so the deployed count is over-counted by one on every redeploy.
|
||
`StoreDeployedConfigAsync` uses UPSERT semantics, so the SQLite row count does not
|
||
grow, but the in-memory `_totalDeployedCount` (reported to the health collector via
|
||
`UpdateInstanceCounts`) drifts upward and the reported "disabled" count becomes
|
||
wrong.
|
||
|
||
**Re-triage (2026-05-16)**
|
||
|
||
Verified against the current source: this is **already fixed**. The SiteRuntime-003
|
||
resolution replaced the fixed-delay reschedule with a shared `ApplyDeployment` helper
|
||
that takes an `isRedeploy` flag and guards the counter with `if (!isRedeploy)
|
||
_totalDeployedCount++;`. The redeploy path (`HandleTerminated`) always calls
|
||
`ApplyDeployment(..., isRedeploy: true)`, so the counter is no longer bumped on
|
||
redeployment. The regression test
|
||
`DeploymentManagerRedeployTests.Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances`
|
||
already covers this and passes. No further code change was required.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): no new change needed — the root cause was
|
||
eliminated by the SiteRuntime-003 fix (the `isRedeploy` guard in `ApplyDeployment`).
|
||
Confirmed by the existing passing regression test
|
||
`Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances`. Re-triaged from Open to
|
||
Resolved.
|
||
|
||
### SiteRuntime-005 — Deployment reports `Success` to central before persistence completes
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`ApplyDeployment`, `HandleDeployPersistenceResult`) |
|
||
|
||
**Description**
|
||
|
||
`HandleDeploy` replies to central with `DeploymentStatus.Success` immediately after
|
||
creating the Instance Actor, while the SQLite persistence (`StoreDeployedConfigAsync`
|
||
+ `ClearStaticOverridesAsync`) runs asynchronously on a `Task.Run`. If persistence
|
||
fails, `HandleDeployPersistenceResult` only logs an error — central has already been
|
||
told the deployment succeeded. On a subsequent node restart or failover the instance
|
||
will not be re-created (it is not in SQLite), so the deployment is silently lost
|
||
despite central recording success. This contradicts the design's intent that the
|
||
site is the durable source of truth for deployed configs.
|
||
|
||
**Recommendation**
|
||
|
||
Persist the config before replying, or treat a persistence failure as a deployment
|
||
failure and send a corrective `DeploymentStatusResponse`/health signal to central.
|
||
At minimum, do not report `Success` until the config row is committed.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — `ApplyDeployment` sent
|
||
`DeploymentStatusResponse(Success)` synchronously before the persistence `Task.Run`
|
||
completed. The `Success` reply is now sent from `HandleDeployPersistenceResult` only
|
||
once the persistence result is known: on success it replies `Success`; on a
|
||
persistence failure it logs the error, stops the optimistically-created Instance
|
||
Actor, rolls back the deployed-instance counter, and replies
|
||
`DeploymentStatus.Failed` with the error message. `DeployPersistenceResult` carries an
|
||
`IsRedeploy` flag so the counter rollback is skipped for redeployments. Regression
|
||
tests: `DeploymentManagerMediumFindingsTests.Deploy_PersistenceFailure_ReportsFailedNotSuccess`
|
||
and `Deploy_Success_ReportsSuccessAndPersistsConfig`.
|
||
|
||
### SiteRuntime-006 — Site-local repositories read `SiteStorageService` private field via reflection
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs` |
|
||
|
||
**Description**
|
||
|
||
Both repositories' `CreateConnection()` use `Type.GetField("_connectionString",
|
||
BindingFlags.NonPublic | BindingFlags.Instance)` to extract the private connection
|
||
string out of `SiteStorageService`. This is brittle (any rename or refactor of the
|
||
field breaks it at runtime, not compile time), defeats encapsulation, and the
|
||
accompanying XML comment openly describes it as a "pragmatic" hack and is internally
|
||
contradictory (it states a connection string is "passed separately at DI
|
||
registration time" which is not what the code does). It also sits awkwardly against
|
||
the project's own script trust model, which forbids `System.Reflection` in scripts.
|
||
|
||
**Recommendation**
|
||
|
||
Expose the connection string properly: add an `ISiteStorageConnectionProvider`
|
||
(already referenced in `ServiceCollectionExtensions` XML docs but not used), or have
|
||
`SiteStorageService` expose a `CreateConnection()` factory, and inject that into the
|
||
repositories. Remove the reflection entirely.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — both repositories
|
||
reflected into `SiteStorageService._connectionString`. `SiteStorageService` now
|
||
exposes a public `CreateConnection()` factory method that returns an unopened
|
||
`SqliteConnection` against the site database. Both `SiteExternalSystemRepository` and
|
||
`SiteNotificationRepository` now obtain connections via `_storage.CreateConnection()`;
|
||
all reflection (`Type.GetField` / `BindingFlags`) and the contradictory XML comments
|
||
have been removed. This is a fully in-module refactor — no cross-module design
|
||
decision was needed. Regression test:
|
||
`SiteRepositoryTests.ExternalSystemRepository_RoundTripsStoredDefinition` exercises
|
||
the repository's connection path end-to-end.
|
||
|
||
### SiteRuntime-007 — Synthetic entity IDs use the non-deterministic `string.GetHashCode()`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SyntheticId.cs` |
|
||
|
||
**Description**
|
||
|
||
`GenerateSyntheticId` computes `name.GetHashCode() & 0x7FFFFFFF`. On .NET Core,
|
||
`string.GetHashCode()` is randomized per process by default, so the "stable
|
||
deterministic synthetic ID" promised by the XML comment is not stable at all — it
|
||
changes every time the process restarts. Any caller that obtained an ID and later
|
||
calls `GetExternalSystemByIdAsync`/`GetNotificationListByIdAsync` after a restart
|
||
will fail to find the entity. It also risks collisions: distinct names can hash to
|
||
the same 31-bit value, and `GetExternalSystemByIdAsync` would then return the wrong
|
||
row.
|
||
|
||
**Recommendation**
|
||
|
||
Use a deterministic, collision-resistant hash (e.g. a stable FNV-1a or the first
|
||
bytes of a SHA-256 of the name) if a synthetic integer ID is genuinely required, or
|
||
better, change the repository contract to key these site-local artifacts by name
|
||
rather than synthesising integer IDs.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — both repositories used
|
||
`name.GetHashCode()`, which is per-process randomized on .NET Core. A new internal
|
||
`SyntheticId` helper computes a deterministic, process-stable 31-bit ID using the
|
||
FNV-1a hash over the name's UTF-8 bytes. Both `GenerateSyntheticId` methods now
|
||
delegate to `SyntheticId.From(name)`. (The integer-keyed lookups are kept because
|
||
they are mandated by the shared `IExternalSystemRepository`/`INotificationRepository`
|
||
contracts in Commons — changing those contracts to name-keyed would be a cross-module
|
||
change outside this module's scope; the deterministic hash resolves the correctness
|
||
defect within scope.) Regression tests:
|
||
`SiteRepositoryTests.ExternalSystemRepository_SyntheticId_IsStableAcrossRestart` and
|
||
`NotificationRepository_SyntheticId_IsStableAcrossRestart` re-create the service to
|
||
simulate a process restart and confirm by-ID lookups still resolve.
|
||
|
||
### SiteRuntime-008 — Blocking `.GetAwaiter().GetResult()` on the actor thread during startup
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`HandleStartupConfigsLoaded`, `LoadSharedScriptsFromStorage`, `HandleSharedScriptsLoaded`) |
|
||
|
||
**Description**
|
||
|
||
`LoadSharedScriptsFromStorage` is called synchronously from
|
||
`HandleStartupConfigsLoaded` (the actor's message handler) and performs
|
||
`_storage.GetAllSharedScriptsAsync().GetAwaiter().GetResult()` followed by Roslyn
|
||
compilation of every shared script. This blocks the DeploymentManager singleton's
|
||
mailbox thread for the full duration of the SQLite read and all shared-script
|
||
compilation. On the default dispatcher this also ties up a thread-pool thread and
|
||
risks thread-pool starvation, and the singleton cannot process any other message
|
||
(deployments, lifecycle commands, debug routing) until it returns. The rest of the
|
||
class correctly uses `PipeTo`/`ContinueWith`.
|
||
|
||
**Recommendation**
|
||
|
||
Load shared scripts asynchronously and `PipeTo(Self)` an internal message, the same
|
||
pattern already used for `StartupConfigsLoaded`. Perform compilation either inside
|
||
the piped continuation handler (still on the actor thread but at least off the
|
||
synchronous startup path) or on a dedicated background task whose result is piped
|
||
back.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — the blocking
|
||
`.GetAwaiter().GetResult()` and Roslyn compilation ran on the singleton's mailbox
|
||
thread inside `HandleStartupConfigsLoaded`. `LoadSharedScriptsFromStorage` now runs
|
||
the SQLite read **and** the Roslyn compilation on a background `Task.Run` and pipes a
|
||
new internal `SharedScriptsLoaded` message back to the actor. A new
|
||
`HandleSharedScriptsLoaded` handler then begins staggered Instance Actor creation, so
|
||
the compilation→creation ordering is preserved without ever blocking the mailbox. A
|
||
shared-script load failure is logged and startup proceeds (scripts needing a missing
|
||
shared script fail at execution time). Regression test:
|
||
`DeploymentManagerMediumFindingsTests.Startup_WithSharedScripts_LoadsConfigsAndStaysResponsive`
|
||
(confirms startup completes and the actor stays responsive with shared scripts
|
||
present).
|
||
|
||
### SiteRuntime-009 — Script execution actors run scripts on the default thread pool, not a dedicated dispatcher
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs`, `src/ScadaLink.SiteRuntime/Actors/AlarmExecutionActor.cs`, `src/ScadaLink.SiteRuntime/Scripts/ScriptExecutionScheduler.cs` |
|
||
|
||
**Description**
|
||
|
||
The design (CLAUDE.md "Architecture & Runtime") states Script Execution Actors run
|
||
on a *dedicated blocking I/O dispatcher*. The code does not do this: `ScriptActor.SpawnExecution`
|
||
and `AlarmActor.SpawnAlarmExecution` create the execution actors with no
|
||
`.WithDispatcher(...)`, and the execution itself runs inside a bare `Task.Run`,
|
||
i.e. on the shared .NET thread pool. The `// NOTE: In production, configure a
|
||
dedicated ... dispatcher` comments acknowledge the gap but it ships unconfigured.
|
||
Scripts can perform synchronous blocking I/O (`Database.Connection`, synchronous
|
||
`ExternalSystem.Call`); running them on the shared pool can starve it and stall
|
||
unrelated Akka dispatchers and HTTP request handling under load.
|
||
|
||
**Recommendation**
|
||
|
||
Define the dedicated dispatcher in HOCON and chain `.WithDispatcher(...)` on the
|
||
execution actor `Props`. If the `Task.Run` model is kept, run script bodies on a
|
||
dedicated `TaskScheduler` / bounded scheduler rather than the global pool. Either
|
||
way, remove the "in production, configure…" comments by actually configuring it.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — script and alarm
|
||
on-trigger bodies ran inside a bare `Task.Run` on the shared `ThreadPool`. The
|
||
recommendation's `TaskScheduler` option was taken because it is fully in-module (a
|
||
HOCON dispatcher would require editing the Host's ActorSystem config, out of scope).
|
||
A new `ScriptExecutionScheduler` provides a bounded set of dedicated background
|
||
threads (count from the new `SiteRuntimeOptions.ScriptExecutionThreadCount`, default
|
||
8). `ScriptExecutionActor` and `AlarmExecutionActor` now run their bodies via
|
||
`Task.Factory.StartNew(..., ScriptExecutionScheduler.Shared(options)).Unwrap()`
|
||
instead of `Task.Run`, so blocking script I/O is contained to those dedicated threads
|
||
and cannot starve the global pool. The misleading "in production, configure a
|
||
dedicated dispatcher" comments were removed. Regression tests:
|
||
`ScriptExecutionSchedulerTests` (`Scheduler_RunsWork_OffTheThreadPool`,
|
||
`Scheduler_RespectsConfiguredThreadCount`, `Scheduler_Shared_ReturnsSameInstanceForOptions`).
|
||
|
||
### SiteRuntime-010 — `EnsureDclConnections` never updates a connection whose configuration changed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`EnsureDclConnections`, `ComputeConnectionConfigHash`) |
|
||
|
||
**Description**
|
||
|
||
`EnsureDclConnections` tracks created connections in `_createdConnections` and skips
|
||
any name already present (`if (_createdConnections.Contains(name)) continue;`). The
|
||
skip is purely name-based: if a redeployment (or an artifact deployment) changes the
|
||
endpoint, credentials, backup endpoint, or `FailoverRetryCount` of an existing
|
||
connection, the new configuration is silently ignored and the DCL keeps using the
|
||
stale `CreateConnectionCommand`. There is no `UpdateConnectionCommand` path. The
|
||
design states that after artifact deployment the site is fully self-contained with
|
||
current configuration; this caching breaks that for connection changes.
|
||
|
||
**Recommendation**
|
||
|
||
Compare the incoming connection config against the last one sent and re-issue a
|
||
create/update command when it differs, or have the DCL treat `CreateConnectionCommand`
|
||
as idempotent upsert and always forward it. Key the cache on a config hash, not just
|
||
the name.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — the cache was a
|
||
name-only `HashSet`, so a changed connection config was silently dropped.
|
||
`_createdConnections` is now a `Dictionary<string,string>` mapping connection name to
|
||
a SHA-256 hash of its protocol/primary-config/backup-config/failover-retry-count
|
||
(`ComputeConnectionConfigHash`). A connection whose hash is unchanged is still
|
||
skipped; a connection whose config changed re-issues a `CreateConnectionCommand` so
|
||
the DCL adopts the new configuration. Regression tests:
|
||
`DeploymentManagerMediumFindingsTests.EnsureDclConnections_ConnectionConfigChanged_ReissuesCreateCommand`
|
||
and `EnsureDclConnections_UnchangedConfig_DoesNotReissueCreateCommand`.
|
||
|
||
### SiteRuntime-011 — Trust-model validation is a substring scan and is both over- and under-inclusive
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptCompilationService.cs` (`ValidateTrustModel`) |
|
||
|
||
**Description**
|
||
|
||
`ValidateTrustModel` enforces the script trust model by doing raw `string.Contains` /
|
||
`IndexOf` on the script source text for forbidden namespace strings. This is
|
||
unreliable in both directions:
|
||
|
||
- **Bypass (under-inclusive):** the check looks only for the literal namespace
|
||
strings. A script can reach forbidden APIs without ever writing `System.IO` etc. —
|
||
e.g. via fully-qualified type use through aliasing, `global::`-prefixed names, or
|
||
simply because the namespace is already imported transitively. The compilation
|
||
references include `typeof(object).Assembly` (the whole of `System.Private.CoreLib`,
|
||
which contains `System.IO.File`, `System.Threading.Thread`, `System.Reflection`,
|
||
etc.), so forbidden types are fully resolvable at compile time and the only barrier
|
||
is this textual scan.
|
||
- **False positives (over-inclusive):** any occurrence of the substring in a comment,
|
||
string literal, or an unrelated identifier (e.g. a variable named `ProcessThreading`)
|
||
triggers a violation; the `AllowedExceptions` logic only rescues exact prefixes.
|
||
- The dead `isAllowed` variable at line 64 is computed and never used.
|
||
|
||
**Recommendation**
|
||
|
||
Enforce the trust model with a Roslyn `SyntaxWalker`/semantic analysis (inspect
|
||
resolved symbols and their containing namespaces/assemblies), or restrict the
|
||
compilation's metadata references and `AssemblyLoadContext` so forbidden types are
|
||
genuinely unavailable, rather than relying on source-text matching. Remove the
|
||
unused `isAllowed` variable.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — `ValidateTrustModel`
|
||
was a raw `string.Contains`/`IndexOf` scan of the source text, with a dead `isAllowed`
|
||
variable. It is now Roslyn semantic analysis: the script is parsed and a
|
||
`CSharpCompilation` + `SemanticModel` are built; every name/member/object-creation
|
||
node is resolved to its symbol and the symbol's containing namespace and
|
||
fully-qualified containing type are checked against the forbidden roots. Bare
|
||
namespace symbols are ignored (so the `System.Threading` qualifier of the allowed
|
||
`System.Threading.Tasks.Task` no longer false-positives). A name that cannot be
|
||
resolved (a type from an assembly deliberately absent from the script's references)
|
||
falls back to a syntactic fully-qualified-name check, so e.g. `System.Net.Http`
|
||
references are still rejected. The dead `isAllowed` variable was removed. This fixes
|
||
both the bypass (`global::`/alias-qualified forbidden types) and the false positives
|
||
(forbidden namespace string in a comment, string literal, or unrelated identifier).
|
||
Regression tests: new `TrustModelSemanticTests` (alias/`global::` detection, comment/
|
||
literal/identifier non-detection, allowed-exception resolution); all 39 existing
|
||
`SandboxTests` + `ScriptCompilationServiceTests` continue to pass.
|
||
|
||
### SiteRuntime-012 — `AttributeAccessor`/`ScopeAccessors` block the script on a synchronous Ask
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScopeAccessors.cs:28` |
|
||
|
||
**Description**
|
||
|
||
`AttributeAccessor`'s indexer getter calls `_ctx.GetAttribute(...).GetAwaiter().GetResult()`,
|
||
synchronously blocking the script-execution thread on an actor Ask. Combined with
|
||
SiteRuntime-009 (scripts run on the shared thread pool) this means a script that
|
||
reads several attributes via `Attributes["X"]` holds a pool thread blocked for each
|
||
round-trip. The async variants (`GetAsync`/`SetAsync`) exist but the ergonomic
|
||
indexer encourages the blocking path. The XML comment notes "Reads block on the
|
||
actor Ask" but does not warn about the thread-pool impact.
|
||
|
||
**Recommendation**
|
||
|
||
Once a dedicated script dispatcher exists (SiteRuntime-009) the blocking is contained
|
||
to that pool, which is acceptable; until then, document the cost clearly and prefer
|
||
steering script authors to the async accessors. Consider making the indexer
|
||
internal-only and exposing only the async API.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`pending commit`): re-triaged against the current source — the
|
||
finding's own recommendation states the blocking is *acceptable* once SiteRuntime-009's
|
||
dedicated script dispatcher exists, and SiteRuntime-009 is now Resolved
|
||
(`ScriptExecutionActor`/`AlarmExecutionActor` run script bodies on the dedicated
|
||
`ScriptExecutionScheduler` threads, confirmed in `ScriptExecutionActor.cs:74`). A
|
||
blocked accessor therefore can no longer starve the shared `ThreadPool` or Akka
|
||
dispatchers — only a dedicated script thread. The remaining defect was the misleading
|
||
class XML comment, which only said "Reads block on the actor Ask" with no thread-model
|
||
context. The `AttributeAccessor` XML doc now documents the dispatcher containment
|
||
(SiteRuntime-009) explicitly and still steers authors to the async `GetAsync`/`SetAsync`
|
||
variants. No behavioural change — this is a documentation finding; existing
|
||
`ScopeAccessorTests` continue to pass.
|
||
|
||
### SiteRuntime-013 — `HandleUnsubscribeDebugView` does nothing despite documented behaviour
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:414` |
|
||
|
||
**Description**
|
||
|
||
`HandleUnsubscribeDebugView` is documented ("Debug view unsubscribe — removes
|
||
subscription") and the actor registers a handler for `UnsubscribeDebugViewRequest`,
|
||
but the body only logs a debug message — there is no subscription state in the
|
||
Instance Actor to remove. The design places the actual subscription lifecycle in
|
||
`SiteStreamManager` (`Subscribe`/`Unsubscribe`/`RemoveSubscriber`), so the Instance
|
||
Actor genuinely has nothing to do here. The handler and its XML comment are
|
||
therefore misleading: a reader expects it to tear down a subscription.
|
||
|
||
**Recommendation**
|
||
|
||
Either remove the no-op handler and route `UnsubscribeDebugViewRequest` to wherever
|
||
the `SiteStreamManager` subscription is actually cancelled, or correct the XML
|
||
comment to state explicitly that subscription teardown is handled by
|
||
`SiteStreamManager` and this handler is a no-op acknowledgement.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`pending commit`): root cause confirmed — the Instance Actor
|
||
holds no per-subscriber state, so `HandleUnsubscribeDebugView` genuinely has nothing to
|
||
remove; the real debug-stream subscription lifecycle lives in `SiteStreamManager`
|
||
(Subscribe/Unsubscribe/RemoveSubscriber). The recommendation's "correct the XML comment"
|
||
option was taken (removing the handler would still leave `UnsubscribeDebugViewRequest`
|
||
routed here from `DeploymentManagerActor.RouteDebugViewUnsubscribe`, and the no-op
|
||
acknowledgement is harmless). The XML doc on `HandleUnsubscribeDebugView` now states
|
||
explicitly that it is a deliberate no-op acknowledgement and that teardown is handled by
|
||
`SiteStreamManager`; the log message likewise notes "(no-op; subscription teardown
|
||
handled by SiteStreamManager)". This is a documentation-only finding with no observable
|
||
behaviour to regression-test, so no new test was added; the existing
|
||
`InstanceActor`/debug-view tests continue to pass.
|
||
|
||
### SiteRuntime-014 — Trigger-expression evaluation blocks the coordinator actor thread
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Deferred |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:219`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:389` |
|
||
|
||
**Description**
|
||
|
||
`EvaluateExpressionTrigger` (ScriptActor) and `EvaluateExpression` (AlarmActor) run a
|
||
compiled Roslyn script with `.RunAsync(...).GetAwaiter().GetResult()` directly inside
|
||
the actor's `AttributeValueChanged` message handler. This blocks the coordinator
|
||
actor's mailbox thread for up to the 2-second timeout on every monitored attribute
|
||
change. Coordinator actors are on the default dispatcher and process the hot path of
|
||
attribute-change fan-out; a slow expression delays all other messages to that actor
|
||
and consumes a thread-pool thread for the duration. The inline comments correctly
|
||
note CPU-bound expressions are not interruptible but do not address the
|
||
mailbox-blocking concern.
|
||
|
||
**Recommendation**
|
||
|
||
Trigger expressions are expected to be cheap, but to keep the actor responsive
|
||
consider evaluating them off the actor thread (pipe the boolean result back as an
|
||
internal message) or pre-compiling to a plain delegate that executes near-instantly
|
||
without the Roslyn scripting `RunAsync` machinery.
|
||
|
||
**Resolution**
|
||
|
||
Deferred 2026-05-16 (`pending commit`): root cause confirmed — `EvaluateExpressionTrigger`
|
||
(ScriptActor) and `EvaluateExpression` (AlarmActor) call
|
||
`_compiledTriggerExpression.RunAsync(...).GetAwaiter().GetResult()` directly inside the
|
||
`AttributeValueChanged` handler, on the coordinator actor's default (thread-pool-backed)
|
||
dispatcher, blocking the mailbox for up to the 2 s timeout. Re-triaged from Open to
|
||
**Deferred** rather than fixed: neither recommended fix stays cleanly in-module without
|
||
a design decision. (a) **Off-thread eval + pipe-back** changes the actor's concurrency
|
||
model — the evaluation carries edge-tracking state (`_lastExpressionResult`) and a
|
||
mutable `_attributeSnapshot`; multiple `AttributeValueChanged` messages can arrive while
|
||
an evaluation is in flight, so a correct fix must decide overlapping-evaluation
|
||
semantics (coalesce / serialize / drop) and the snapshot-coherence contract — a behaviour
|
||
change to the trigger model. (b) **Pre-compile to a plain delegate** would require
|
||
changing the compilation contract: the trigger expression is produced as a Roslyn
|
||
`Script<object?>` by `ScriptCompilationService.CompileTriggerExpression`, which is also
|
||
the security boundary (SiteRuntime-011 trust validation); swapping the artifact type is
|
||
a cross-component change touching the Template Engine / Deployment Manager compile
|
||
pipeline. Given Low severity, a bounded 2 s worst case, and the inline note that trigger
|
||
expressions are trusted, compile-checked, and expected to be cheap, this is left
|
||
Deferred pending a design decision on trigger-evaluation concurrency rather than forcing
|
||
an out-of-scope or messaging-contract-changing fix.
|
||
|
||
### SiteRuntime-015 — `LoggerFactory` created per Instance Actor and never disposed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:746` |
|
||
|
||
**Description**
|
||
|
||
`CreateInstanceActor` does `var loggerFactory = new LoggerFactory();` for every
|
||
Instance Actor it creates, uses it once to produce an `ILogger<InstanceActor>`, and
|
||
never disposes it. `LoggerFactory` is `IDisposable`. With up to 500 instances (and
|
||
churn from redeployments) this leaks a factory per instance, and the produced
|
||
loggers are detached from the application's configured logging providers, so
|
||
Instance Actor logs may not be routed/filtered consistently with the rest of the
|
||
host.
|
||
|
||
**Recommendation**
|
||
|
||
Inject the application's `ILoggerFactory` (or an `ILogger<InstanceActor>` factory
|
||
delegate) into `DeploymentManagerActor` via DI and reuse it, rather than newing one
|
||
up per child. Do not create a fresh `LoggerFactory` in a hot creation path.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`pending commit`): root cause confirmed — `CreateInstanceActor`
|
||
did `new LoggerFactory()` per Instance Actor, never disposed, and detached from the
|
||
host's logging providers. `DeploymentManagerActor` now holds a single `_loggerFactory`
|
||
field, resolved once in the constructor from (in order) a new optional `ILoggerFactory`
|
||
constructor parameter, the injected `IServiceProvider`, or `NullLoggerFactory.Instance`
|
||
as a last resort — never a per-instance allocation. `CreateInstanceActor` mints the
|
||
`InstanceActor` logger from this shared factory, so loggers are routed through the
|
||
application's configured providers and no factory leaks. Regression test:
|
||
`DeploymentManagerLoggerFactoryTests.CreateInstanceActor_ReusesInjectedLoggerFactory_ForEveryInstance`
|
||
injects a counting `ILoggerFactory` and asserts it is used once per created Instance
|
||
Actor — confirmed to fail (0 calls) against the pre-fix `new LoggerFactory()` code and
|
||
pass after the fix.
|
||
|
||
### SiteRuntime-016 — Short-lived execution actors, replication actor, and repositories are untested
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Testing coverage |
|
||
| Status | Resolved |
|
||
| Location | `tests/ScadaLink.SiteRuntime.Tests/` |
|
||
|
||
**Description**
|
||
|
||
The test project covers the coordinator actors (`InstanceActor`, `ScriptActor`,
|
||
`AlarmActor`, `DeploymentManagerActor`), persistence, scripting and streaming, but a
|
||
search of the test sources finds no references to `ScriptExecutionActor`,
|
||
`AlarmExecutionActor`, `SiteReplicationActor`, `SiteExternalSystemRepository`, or
|
||
`SiteNotificationRepository`. These cover critical paths: script timeout/failure
|
||
handling and result reply, alarm on-trigger execution, peer config/S&F replication
|
||
(including the `SendToPeer` no-peer drop), and the reflection-based repository reads.
|
||
Several findings above (001/002 mis-routing, 007 ID instability, 011 trust bypass)
|
||
would likely have been caught by targeted tests.
|
||
|
||
**Recommendation**
|
||
|
||
Add unit/integration tests for the execution actors (success, timeout, exception,
|
||
Ask-reply, PoisonPill self-stop), `SiteReplicationActor` (outbound forward, inbound
|
||
apply, peer tracking on cluster events), and the two repositories (round-trip read,
|
||
synthetic-ID lookup, missing-row behaviour).
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`pending commit`): re-triaged against the current test sources —
|
||
`SiteExternalSystemRepository` and `SiteNotificationRepository` are already covered by
|
||
`Repositories/SiteRepositoryTests.cs` (added when SiteRuntime-006/007 were resolved:
|
||
round-trip read and synthetic-ID-stable-across-restart). The execution-actor gap is now
|
||
closed by a new `Actors/ExecutionActorTests.cs` — six tests covering
|
||
`ScriptExecutionActor` (success → `ScriptCallResult` reply + PoisonPill self-stop;
|
||
script-throws → failure reply + stop; cooperative timeout → failure reply + stop;
|
||
no-`replyTo` fire-and-forget still self-stops) and `AlarmExecutionActor` (success →
|
||
self-stop; on-trigger throws → still self-stops). `SiteReplicationActor` is *not* covered
|
||
here: it depends on `Cluster.Get(Context.System)` and so requires a clustered
|
||
`ActorSystem` HOCON harness that does not yet exist in this test project — adding that
|
||
harness is a larger test-infrastructure task tracked separately and out of scope for a
|
||
Low-severity coverage finding; the highest-value untested paths the finding called out
|
||
(script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests
|
||
green.
|
||
|
||
### SiteRuntime-017 — Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:625`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:675`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:83`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:93` |
|
||
|
||
**Description**
|
||
|
||
`InstanceActor.CreateChildActors` passes the Instance Actor's own mutable
|
||
`_attributes` field (a plain `Dictionary<string, object?>`) by reference into the
|
||
`Props.Create(...)` factory for every `ScriptActor` and `AlarmActor` (as the
|
||
`initialAttributes` constructor argument). Each child constructor then iterates
|
||
that dictionary to seed its `_attributeSnapshot`:
|
||
|
||
```csharp
|
||
if (initialAttributes != null)
|
||
foreach (var kvp in initialAttributes)
|
||
_attributeSnapshot[kvp.Key] = kvp.Value;
|
||
```
|
||
|
||
`Context.ActorOf` returns immediately; the child actor's constructor runs later on
|
||
the *child's* mailbox thread. Meanwhile the Instance Actor's `PreStart` returns and
|
||
the Instance Actor begins processing its mailbox — `HandleTagValueUpdate` and
|
||
`HandleAttributeValueChanged` both mutate `_attributes` (`_attributes[...] = ...`).
|
||
A DCL tag update that arrives before a child has finished its constructor copy
|
||
therefore mutates the dictionary on the Instance Actor thread while the child
|
||
thread is enumerating it. `Dictionary<,>` is explicitly not safe for concurrent
|
||
read/write: the enumeration can throw `InvalidOperationException` ("collection was
|
||
modified") — which surfaces as an `ActorInitializationException` and, under the
|
||
Instance Actor's `SupervisorStrategy`, **stops** the child (the strategy returns
|
||
`Stop` for `ActorInitializationException`). The script or alarm is then silently
|
||
absent for the life of the instance. A torn read of an entry is also possible. The
|
||
window is small but deterministically reachable on a busy site at startup/failover
|
||
— exactly the staggered-startup scenario the design is most concerned about.
|
||
|
||
**Recommendation**
|
||
|
||
Do not share the live dictionary. Snapshot it on the Instance Actor thread before
|
||
constructing the child — e.g. pass `new Dictionary<string, object?>(_attributes)`
|
||
(or an immutable copy) into each `Props.Create`. The copy is made on the Instance
|
||
Actor thread inside `CreateChildActors`, so it is race-free, and each child gets a
|
||
private dictionary to seed from.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (`commit pending`): root cause confirmed — `CreateChildActors`
|
||
captured the live `_attributes` field directly in every child `Props.Create`
|
||
closure. `CreateChildActors` now takes a single `new Dictionary<string,object?>(_attributes)`
|
||
snapshot on the Instance Actor thread and hands each `ScriptActor`/`AlarmActor` that
|
||
private copy, so no child constructor ever enumerates a dictionary the Instance
|
||
Actor is concurrently mutating. Regression test:
|
||
`InstanceActorChildAttributeRaceTests.ChildActors_AreSeededFromAnIsolatedCopy_NotTheLiveAttributesDictionary`
|
||
asserts every child's seed dictionary is a distinct object from the Instance
|
||
Actor's live `_attributes` (confirmed to fail — "seeded ... by reference" — against
|
||
the pre-fix code and pass after). `ScriptActor`/`AlarmActor` expose an internal
|
||
`SeedAttributesReference` for this assertion (`InternalsVisibleTo` added for the
|
||
test project).
|
||
|
||
### SiteRuntime-018 — `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher"
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:17` |
|
||
|
||
**Description**
|
||
|
||
The class-level XML summary on `ScriptExecutionActor` states "Runs on a dedicated
|
||
blocking I/O dispatcher." That is not what the code does. SiteRuntime-009 was
|
||
resolved by introducing `ScriptExecutionScheduler` (a bounded dedicated
|
||
`TaskScheduler`); the *actor itself and its mailbox* run on the **default** Akka
|
||
dispatcher, and only the script body runs on the scheduler's threads via
|
||
`Task.Factory.StartNew(..., scheduler)`. The resolution of SiteRuntime-009
|
||
explicitly chose the `TaskScheduler` route *instead of* a HOCON dispatcher and
|
||
even removed the "in production, configure a dedicated dispatcher" comments
|
||
elsewhere — but this stale summary line was missed. A reader is told the actor is
|
||
on a dedicated dispatcher when it is not, which is misleading when reasoning about
|
||
mailbox throughput and thread-pool pressure. (`AlarmExecutionActor` does not carry
|
||
the equivalent claim — its summary only says "Same pattern as ScriptExecutionActor.")
|
||
|
||
**Recommendation**
|
||
|
||
Correct the summary to describe the actual model: the actor runs on the default
|
||
dispatcher and the script body is dispatched onto the dedicated
|
||
`ScriptExecutionScheduler` (SiteRuntime-009). Align the wording with the accurate
|
||
comment already present at `ScriptExecutionActor.cs:71-73`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (`commit pending`): root cause confirmed — the stale
|
||
"Runs on a dedicated blocking I/O dispatcher" line in the `ScriptExecutionActor`
|
||
class summary was missed when SiteRuntime-009 was resolved. The summary now states
|
||
the actual model: the actor and its mailbox run on the default Akka dispatcher and
|
||
only the script body is dispatched onto the dedicated `ScriptExecutionScheduler`
|
||
(SiteRuntime-009). Documentation-only change with no observable behaviour, so no
|
||
regression test was added; the existing suite continues to pass.
|
||
|
||
### SiteRuntime-019 — Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:113` |
|
||
|
||
**Description**
|
||
|
||
`InstanceActor`'s constructor registers `Receive<DisableInstanceCommand>` and
|
||
`Receive<EnableInstanceCommand>` handlers that log and reply with a successful
|
||
`InstanceLifecycleResponse`. These handlers are unreachable. The Deployment Manager
|
||
is the only sender of those commands, and `DeploymentManagerActor.HandleDisable` /
|
||
`HandleEnable` handle the lifecycle entirely themselves — they call
|
||
`Context.Stop(actor)` (disable) or `CreateInstanceActor(...)` (enable) directly and
|
||
reply to the original sender from the Deployment Manager. Neither command is ever
|
||
`Forward`-ed or `Tell`-ed to the Instance Actor. The handlers are dead code, and
|
||
they are actively misleading: a maintainer reading `InstanceActor` would reasonably
|
||
believe disable/enable is partly an Instance-Actor responsibility, and the no-op
|
||
"true" reply implies an instance-side acknowledgement contract that does not exist.
|
||
If a future change *did* route these commands here, the disable handler would do
|
||
nothing useful (it does not stop children or tear down state — Akka does that when
|
||
the parent stops the actor).
|
||
|
||
**Recommendation**
|
||
|
||
Remove the two `Receive<...>` registrations and their handler bodies from
|
||
`InstanceActor`, since the Deployment Manager owns the disable/enable lifecycle.
|
||
If the intent is to keep them for a future instance-side hook, add an XML comment
|
||
stating that the Deployment Manager currently handles these and the handlers are a
|
||
reserved placeholder — but removal is preferred.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (`commit pending`): re-verified as genuinely dead code — a
|
||
codebase-wide search confirms `DisableInstanceCommand`/`EnableInstanceCommand` are
|
||
only ever sent (from central) to the site `DeploymentManagerActor`, whose
|
||
`HandleDisable`/`HandleEnable` own the lifecycle entirely (`Context.Stop` /
|
||
`CreateInstanceActor`) and never `Forward`/`Tell` the command to the Instance
|
||
Actor. The two unreachable `Receive<...>` registrations and their no-op
|
||
"success" handler bodies were removed from `InstanceActor`, replaced with a comment
|
||
stating the Deployment Manager owns this lifecycle. Regression test:
|
||
`InstanceActorTests.InstanceActor_DoesNotHandleDisableOrEnableCommands` asserts the
|
||
Instance Actor produces no `InstanceLifecycleResponse` for either command
|
||
(confirmed to fail against the pre-fix dead handlers and pass after removal).
|
||
|
||
### SiteRuntime-020 — Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:285`, `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:971` |
|
||
|
||
**Resolution** — added a name → terminating-actor-ref shadow
|
||
(`_terminatingActorsByName`) populated when `HandleDeploy` stops the
|
||
predecessor and cleared in `HandleTerminated`. `HandleDeploy` now
|
||
detects the mid-termination state before falling through to
|
||
`ApplyDeployment(fresh)`: on hit it tells the displaced
|
||
`PendingRedeploy.OriginalSender` a `DeploymentStatus.Failed` /
|
||
"superseded by newer deployment …" response and overwrites the buffered
|
||
pending command (last-write-wins). Regression test
|
||
`SR020_ThreeRapidDeploys_DoNotThrowInvalidActorNameException_LatestWins`
|
||
fires three rapid deploys, asserts the middle deploy is told it was
|
||
superseded, the latest succeeds, and the resulting instance is operable
|
||
(DisableInstanceCommand works).
|
||
|
||
**Description**
|
||
|
||
The SiteRuntime-003 fix makes `HandleDeploy` watch + stop a running Instance
|
||
Actor and buffer the in-flight `DeployInstanceCommand` in `_pendingRedeploys`
|
||
until `Terminated` arrives. The handler also removes the instance from
|
||
`_instanceActors` synchronously, in step with the stop request:
|
||
|
||
```csharp
|
||
if (_instanceActors.TryGetValue(instanceName, out var existing))
|
||
{
|
||
_instanceActors.Remove(instanceName);
|
||
_pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
|
||
Context.Watch(existing);
|
||
Context.Stop(existing);
|
||
UpdateInstanceCounts();
|
||
return;
|
||
}
|
||
|
||
// Fresh deployment — no existing actor to replace.
|
||
ApplyDeployment(command, Sender, isRedeploy: false);
|
||
```
|
||
|
||
If a *second* `DeployInstanceCommand` for the same `instanceName` arrives on
|
||
the singleton's mailbox while the predecessor is still terminating, the
|
||
`_instanceActors.TryGetValue` lookup correctly reports "no existing actor" —
|
||
because the first deploy already removed it — and execution falls through to
|
||
`ApplyDeployment(..., isRedeploy: false)`. `ApplyDeployment` immediately calls
|
||
`CreateInstanceActor`, which calls `Context.ActorOf(props, instanceName)`. But
|
||
the predecessor's Akka child name **is still registered** in the parent's
|
||
child registry: that name is only released after the predecessor's `Terminated`
|
||
signal — exactly the asynchronous gap SiteRuntime-003 was created to plug for
|
||
the *first* redeploy. `Context.ActorOf` therefore throws
|
||
`InvalidActorNameException`, which Akka rethrows as
|
||
`ActorInitializationException` — and the supervisor's `Stop` directive on that
|
||
exception (DeploymentManagerActor.cs:179) silently stops the just-created
|
||
child. The second deploy is then quietly lost: `_instanceActors` doesn't
|
||
contain it (the throw aborted the bookkeeping after `CreateInstanceActor`'s
|
||
own `ContainsKey` guard but before `_instanceActors[instanceName] = actorRef`
|
||
would have run), `_totalDeployedCount` was incremented, and the deployer is
|
||
never told the deployment failed (the persistence `Task.Run` is also dropped
|
||
on the throw path). The race is real on a busy site where central retries a
|
||
deploy because the prior attempt timed out — exactly the scenario the
|
||
DeploymentManager-006 query-then-deploy idempotency mechanism was designed for.
|
||
|
||
The first-redeploy case (SiteRuntime-003) does NOT exhibit this because at
|
||
that point the predecessor's child name was still in `_instanceActors`, so the
|
||
branch correctly buffers. The bug is specific to the third (and beyond)
|
||
incoming deploy when two are already in flight for the same instance.
|
||
|
||
**Recommendation**
|
||
|
||
The pending-redeploy bookkeeping needs to be authoritative for "we are mid-
|
||
redeploy on this instance", not just the `_instanceActors` cache. Add a second
|
||
keyed lookup — e.g. a `Dictionary<string, IActorRef> _terminatingActorsByName`
|
||
populated when the predecessor is stopped — and check it BEFORE
|
||
`ApplyDeployment(isRedeploy: false)`. On a hit, overwrite (or stash) the
|
||
buffered `PendingRedeploy` for that terminating actor so the latest command
|
||
wins on the `Terminated` signal. Alternatively, defer the deploy by stashing
|
||
all messages for that `instanceName` until the predecessor terminates (Akka
|
||
`Stash` pattern). Either way, the fall-through to "fresh deployment" needs to
|
||
be gated on "no instance with this name is currently terminating".
|
||
|
||
### SiteRuntime-021 — `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:931` |
|
||
|
||
**Description**
|
||
|
||
`HandleDeployArtifacts` persists the artifact bundle (shared scripts, external
|
||
systems, database connections, notification lists, SMTP configs, and
|
||
**data connection definitions**) into local SQLite. For data connection
|
||
definitions specifically (`DataConnections`), the handler calls
|
||
`_storage.StoreDataConnectionDefinitionAsync(...)` — but does NOT issue a
|
||
`CreateConnectionCommand` (or any other DCL command) to the `_dclManager`
|
||
actor. The only path that pushes DCL configuration to the DCL is
|
||
`EnsureDclConnections`, called exclusively from the deploy / startup-batch
|
||
paths against the **flattened instance configuration's** inline `Connections`
|
||
dictionary. There is no equivalent for an artifact-only update.
|
||
|
||
Concretely: an artifact deployment that changes a data connection's endpoint
|
||
URL, credentials, backup endpoint, or failover retry count is stored
|
||
durably in the site SQLite (so on the *next* node restart the site loads the
|
||
new config and `EnsureDclConnections` picks it up) but is silently inert until
|
||
either an instance using that connection is redeployed or the node restarts.
|
||
This contradicts the design's "after artifact deployment, the site is fully
|
||
self-contained" intent (Component-SiteRuntime.md, "System-Wide Artifact
|
||
Handling") — the runtime DCL keeps using the stale connection until a much
|
||
heavier trigger event occurs. It is also asymmetric with how
|
||
`SharedScripts` are handled in the same method: shared scripts are both
|
||
stored *and* recompiled into `_sharedScriptLibrary` on update so the change is
|
||
live immediately.
|
||
|
||
(SiteRuntime-010 fixed a related defect inside `EnsureDclConnections` — the
|
||
config-hash cache — but that's only consulted on the inline-config path; the
|
||
artifact-deployment path never reaches `EnsureDclConnections`.)
|
||
|
||
**Recommendation**
|
||
|
||
In the `DataConnections` branch of `HandleDeployArtifacts`, after the
|
||
`StoreDataConnectionDefinitionAsync` call, also send a
|
||
`CreateConnectionCommand` to `_dclManager` for each updated definition,
|
||
re-using the SiteRuntime-010 config hash so unchanged connections are skipped.
|
||
Alternatively, refactor `EnsureDclConnections` to accept a flat list of
|
||
`(name, protocol, configurationJson, backupConfigurationJson,
|
||
failoverRetryCount)` tuples that both the inline (`FlattenedConfiguration`)
|
||
and artifact paths can drive through it.
|
||
|
||
### SiteRuntime-022 — `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Code organization & conventions |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.SiteRuntime/Scripts/AuditingDbCommand.cs:138` |
|
||
|
||
**Description**
|
||
|
||
The `DbConnection` setter on `AuditingDbCommand` unwraps an
|
||
`AuditingDbConnection` value by reading its private `_inner` field via
|
||
reflection:
|
||
|
||
```csharp
|
||
set
|
||
{
|
||
_wrappingConnection = value;
|
||
_inner.Connection = value switch
|
||
{
|
||
AuditingDbConnection auditing => auditing.GetType()
|
||
.GetField("_inner", BindingFlags.Instance | BindingFlags.NonPublic)
|
||
!.GetValue(auditing) as DbConnection,
|
||
_ => value
|
||
};
|
||
}
|
||
```
|
||
|
||
This is the same encapsulation-violating anti-pattern that SiteRuntime-006
|
||
called out for the site repositories. A rename or refactor of
|
||
`AuditingDbConnection._inner` breaks the audit decorator at runtime (no
|
||
compile-time signal), the `!.` null-forgiving operator hides the crash, and
|
||
the reflective access trips static analyzers and IL trimming. More
|
||
problematically, the script trust model the same module enforces in
|
||
`ScriptCompilationService.ValidateTrustModel` explicitly forbids
|
||
`System.Reflection` in scripts — yet the auditing helper a script ends up
|
||
running through itself reaches via reflection into a sibling class. Both
|
||
classes are `internal sealed` in the same assembly, so this is purely a
|
||
self-imposed contract violation.
|
||
|
||
A second smaller concern in the same property: the getter returns
|
||
`_wrappingConnection ?? _inner.Connection`. If the caller obtains a command
|
||
via `AuditingDbConnection.CreateDbCommand()` and immediately reads
|
||
`cmd.Connection`, the getter returns the raw inner connection (not the
|
||
auditing wrapper), because `_wrappingConnection` is only populated when the
|
||
setter is later invoked. That's surprising and at odds with the class's
|
||
audit-everything intent — a script that round-trips a command through
|
||
`cmd.Connection` re-enters the un-audited path.
|
||
|
||
**Recommendation**
|
||
|
||
Expose the wrapped connection through a proper API surface. The simplest fix
|
||
that matches the SiteRuntime-006 precedent: add an
|
||
`internal DbConnection Inner { get; }` property to `AuditingDbConnection`
|
||
(both classes are `internal sealed`, so the property stays out of the public
|
||
surface) and replace the reflection switch with `auditing.Inner`. While
|
||
touching the property, also have the getter return `_wrappingConnection` even
|
||
on the synthesised CreateDbCommand path (e.g. set `_wrappingConnection` to
|
||
the parent connection inside `AuditingDbConnection.CreateDbCommand`).
|
||
|
||
### SiteRuntime-023 — `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:446`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:340`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:356`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:444` |
|
||
|
||
**Description**
|
||
|
||
`ScriptActor.EvaluateCondition` and the three `AlarmActor` evaluators
|
||
(`EvaluateRangeViolation`, `EvaluateRateOfChange`, `EvaluateHiLo`) call
|
||
`Convert.ToDouble(value)` without specifying a culture. When `value` is a
|
||
string (a path that exists today — attribute values that arrive as JSON-
|
||
deserialized numbers can still surface as strings on some code paths,
|
||
particularly array values that are JSON-stringified at
|
||
`InstanceActor.HandleTagValueUpdate:377`), `Convert.ToDouble` parses against
|
||
`CultureInfo.CurrentCulture`. On a host whose locale uses a comma decimal
|
||
separator (German, French, most of continental Europe), `"1.5"` throws and
|
||
the condition / alarm silently degrades to its catch-fallthrough (returns
|
||
`false` for range/rate-of-change, keeps current level for HiLo, falls back to
|
||
string-compare for conditionals). The CLAUDE.md "All timestamps are UTC"
|
||
discipline is the equivalent rule for time; there is no equivalent invariant-
|
||
culture discipline applied to numeric parsing.
|
||
|
||
The exposure is bounded — most attribute values arrive as numeric primitives
|
||
from `TagValueUpdate.Value` or static `FlattenedConfiguration.Attributes`
|
||
(also typed) so the implicit-cast `Convert.ToDouble` path is hit. But the
|
||
string path is reachable via inbound API writes
|
||
(`RouteToSetAttributesRequest.AttributeValues` is `IReadOnlyDictionary<string,
|
||
string>`), via the JSON-array stringification at `HandleTagValueUpdate:377`,
|
||
and via static-override values loaded from SQLite (which are persisted as
|
||
strings — see `SetStaticOverrideAsync`).
|
||
|
||
**Recommendation**
|
||
|
||
Replace each `Convert.ToDouble(value)` with `Convert.ToDouble(value,
|
||
CultureInfo.InvariantCulture)`, or front-load a typed-numeric extraction
|
||
helper (`if (value is double d) return d; if (value is string s && double.TryParse(s,
|
||
NumberStyles.Float, CultureInfo.InvariantCulture, out var p)) return p;
|
||
return Convert.ToDouble(value, CultureInfo.InvariantCulture);`). The site is a
|
||
deterministic machine-control surface; condition evaluation must not depend
|
||
on the host's regional settings.
|
||
|
||
### SiteRuntime-024 — `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:39`, `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:360` |
|
||
|
||
**Resolution** — split reads from writes: the single owned
|
||
`_writeConnection` + `_writeGate` still serialises writers, but
|
||
`GetStatusAsync` now opens a fresh `SqliteConnection` per call against
|
||
the shared connection string (mirroring `SiteStorageService`) so reads
|
||
never block on an in-flight write. Sync `Dispose` was rewritten to NOT
|
||
bridge to async — the dispose-once flag is an `int` flipped with
|
||
`Interlocked.Exchange`, the synchronous path disposes
|
||
`_writeConnection` + `_writeGate` directly without acquiring the gate,
|
||
and `DisposeAsync` retains the gate-drain semantics for graceful
|
||
shutdown. Both paths are idempotent; the second call short-circuits via
|
||
the interlocked flag. Tests:
|
||
`SR024_ConcurrentReads_DoNotBlockOnInFlightWrite`,
|
||
`SR024_SyncDispose_DoesNotDeadlock_WhenInvokedFromFreshThread`, and
|
||
`SR024_AsyncDispose_DoesNotDeadlock_AndIsIdempotent`.
|
||
|
||
**Description**
|
||
|
||
`OperationTrackingStore` owns exactly one `SqliteConnection` and gates every
|
||
public method through a single `SemaphoreSlim(1, 1)`. The class XML comment
|
||
calls this out as deliberate ("the M3 brief calls out as 'cleaner than the M2
|
||
Channel<T> pipeline given the volume'"), and the *write* volume is genuinely
|
||
low — at most a handful of lifecycle rows per cached call. But on a busy site
|
||
the *read* path (`GetStatusAsync`) is called by every `Tracking.Status(id)`
|
||
invocation from every executing script, and reads are serialised through the
|
||
same gate as writes. A long-running write (e.g. a Roslyn-script-driven
|
||
`RecordTerminalAsync` competing with an SQLite checkpoint) holds the gate and
|
||
stalls every concurrent status query. SQLite supports concurrent readers with
|
||
a single writer in WAL mode; the gate forfeits that capability.
|
||
|
||
A separate concern in the same class: `Dispose()` calls
|
||
`DisposeAsyncCore().AsTask().GetAwaiter().GetResult()`. That is sync-over-
|
||
async — the very pattern SiteRuntime-008 was a finding for. If a caller
|
||
disposes the store from a synchronization context that does not allow
|
||
re-entrance (e.g. an `IHostedService.StopAsync` continuation observed on the
|
||
host's sync context, or a finalizer pumping on the thread pool with a stuck
|
||
continuation), the `.WaitAsync()` inside `DisposeAsyncCore` waits for a
|
||
continuation that will never run, and the dispose deadlocks. The async path
|
||
itself is correct; only the sync `Dispose()` wrapper is risky.
|
||
|
||
**Recommendation**
|
||
|
||
For the single-connection gate: split reads and writes into separate gates,
|
||
or — better — keep the writer single-connection and open a fresh read
|
||
connection (or pool of read connections) per `GetStatusAsync` call. SQLite
|
||
connections are cheap; the `SiteStorageService` precedent already uses per-
|
||
call connections on the read path. For `Dispose()`: prefer
|
||
`Dispose() { GC.SuppressFinalize(this); _connection.Dispose(); _gate.Dispose(); }`
|
||
without an awaited disposal, and have the `IAsyncDisposable.DisposeAsync`
|
||
path do the awaiting. If a synchronous disposable is genuinely needed, do
|
||
not bridge it through the async core — duplicate the dispose-once flag check
|
||
into a sync path that calls `_connection.Dispose()` directly.
|
||
|
||
### SiteRuntime-025 — `HandleSetStaticAttribute` persists unknown attribute names as static overrides
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:223`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:246` |
|
||
|
||
**Description**
|
||
|
||
`HandleSetStaticAttribute` resolves the target attribute against
|
||
`_configuration.Attributes` to decide whether to route the write to the DCL or
|
||
treat it as a static-override write. If the lookup fails (`resolved == null`),
|
||
`isDataSourced` is false, and execution falls through to
|
||
`HandleSetStaticAttributeCore` — which unconditionally:
|
||
|
||
1. inserts the bogus key into the in-memory `_attributes` dictionary,
|
||
2. publishes an `AttributeValueChanged` for the bogus key to the site stream
|
||
and to every child Script/Alarm actor,
|
||
3. persists a row in `static_attribute_overrides` for the bogus key, and
|
||
4. replies `Success = true` to the caller.
|
||
|
||
Concretely, an inbound API `Route.To().SetAttribute("notARealAttr", "x")`
|
||
returns success, pollutes the in-memory state with a key that no script can
|
||
legitimately observe (canonical-name lookup will not produce it), persists a
|
||
durable SQLite override row that survives restart, and (on every restart)
|
||
re-injects the polluting key via `HandleOverridesLoaded` at line 608. The
|
||
override is **not** reset on instance redeployment in the same way the
|
||
"genuine" overrides are — `ClearStaticOverridesAsync` does clear by
|
||
`instance_unique_name`, so the row is eventually cleaned, but only on a full
|
||
redeploy; in the meantime each restart resurrects it. The publish-to-stream
|
||
side effect also lets a hostile or buggy inbound caller spam debug-view
|
||
subscribers with synthetic attribute changes.
|
||
|
||
Worth flagging at Low: the inbound API surface is already authenticated and
|
||
the design assumes its callers are trusted. But the no-validation behaviour
|
||
contradicts the design doc's "Scripts can only read/write attributes on their
|
||
own instance" framing — an inbound API call inherits the same instance-scope
|
||
authority as a script, and the script trust model wouldn't sanction this.
|
||
|
||
**Recommendation**
|
||
|
||
In `HandleSetStaticAttribute`, when `resolved == null`, reply
|
||
`SetStaticAttributeResponse(Success: false,
|
||
ErrorMessage: $"Attribute '{command.AttributeName}' not found on instance
|
||
'{_instanceUniqueName}'")` instead of falling through to the override path.
|
||
Optionally also surface the existence check on the `RouteInboundApiSetAttributes`
|
||
fan-out so a multi-attribute write reports the offending key without rolling
|
||
back the others (the per-attribute `Ask` shape already supports a partial
|
||
failure response).
|
||
|
||
### SiteRuntime-026 — `ReplicationMessages.cs` public record types have no XML documentation
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:10`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:13`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:15`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:17`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:19`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:25`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:28`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:30`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:32`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:34` |
|
||
|
||
**Description**
|
||
|
||
The ten public record types in `ReplicationMessages.cs`
|
||
(`ReplicateConfigDeploy`, `ReplicateConfigRemove`, `ReplicateConfigSetEnabled`,
|
||
`ReplicateArtifacts`, `ReplicateStoreAndForward`, `ApplyConfigDeploy`,
|
||
`ApplyConfigRemove`, `ApplyConfigSetEnabled`, `ApplyArtifacts`,
|
||
`ApplyStoreAndForward`) carry no XML documentation. The file header comment
|
||
groups them as "outbound" vs "inbound" but the individual records have no
|
||
`<summary>` and no parameter docs. The XML-doc baseline `1eb6e97` rolled out
|
||
across the rest of the module (the commit being reviewed is literally `docs:
|
||
add XML doc comments across src + Sister Projects section in CLAUDE.md`), so
|
||
this file is now the conspicuous outlier — and the `CommentChecker` skill
|
||
relied on by the `fixdocs` workflow will flag every record as missing docs.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `<summary>` per record naming the direction (outbound → peer / inbound
|
||
from peer) and what the operation replicates, and `<param>` docs for each
|
||
record parameter. Mirror the precedent in
|
||
`src/ScadaLink.Commons/Messages/.../*.cs`. While there, consider sealing the
|
||
inbound vs outbound split with a marker base type (currently they're just
|
||
named conventionally) so `Receive<ReplicateXxx>` vs `Receive<ApplyXxx>` is
|
||
expressed at the type level — but that's optional and out of scope for a
|
||
docs-only finding.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Added a one-line `<summary>` to each of the ten records
|
||
(`ReplicateConfigDeploy`/`Remove`/`SetEnabled`/`Artifacts`/`StoreAndForward` and
|
||
`ApplyConfigDeploy`/`Remove`/`SetEnabled`/`Artifacts`/`StoreAndForward`) naming
|
||
the direction (outbound to peer / inbound from peer) and what is replicated.
|
||
The two pre-existing group-header XML blocks were converted to plain `//`
|
||
comments to avoid orphaned doc-summaries above the first record in each group.
|
||
Marker-base-type idea left out of scope.
|