ad4744a295
Flip DataConnectionLayer-029, InboundAPI-031, SiteRuntime-032/033, StoreAndForward-028, AuditLog-017, CentralUI-037, ScriptAnalysis-009 from Open to Resolved with the fixing commit + a one-line resolution each; regen README (0 pending / 576 total across 25 modules).
1708 lines
104 KiB
Markdown
1708 lines
104 KiB
Markdown
# Code Review — SiteRuntime
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` |
|
||
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-06-24 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `1f9de8a2` |
|
||
| Open findings | 0 |
|
||
|
||
## Summary
|
||
|
||
The SiteRuntime module is broadly well-structured: the actor hierarchy matches the
|
||
design doc, supervision strategies are explicit, and the trigger/alarm evaluation
|
||
logic is thorough. However the review surfaced one genuinely serious correctness
|
||
defect — `Instance.SetAttribute` never routes writes to the Data Connection Layer
|
||
for data-sourced attributes, contradicting a core design decision and silently
|
||
turning device writes into local-only static overrides. Several other findings
|
||
cluster around two themes: (1) actor-thread discipline is violated in a few hot
|
||
paths (blocking `.GetAwaiter().GetResult()` calls on the actor thread, a fragile
|
||
fixed-delay reschedule for redeployment), and (2) the site-local repositories reach
|
||
into `SiteStorageService` private state via reflection and mint entity IDs with the
|
||
non-deterministic `string.GetHashCode()`. Script execution runs on the default
|
||
thread pool rather than a dedicated blocking dispatcher (the code acknowledges this
|
||
in a comment but ships it anyway). Test coverage exists for the coordinator actors,
|
||
persistence and scripting, but the short-lived execution actors, the replication
|
||
actor, and the repositories are untested.
|
||
|
||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||
|
||
The module was re-reviewed at commit `39d737e`. No source under
|
||
`src/ZB.MOM.WW.ScadaBridge.SiteRuntime` has changed since the previous review at `9c60592`
|
||
(the only intervening commits are code-review documentation updates), so all of
|
||
SiteRuntime-001..013, 015, 016 remain Resolved and SiteRuntime-014 remains
|
||
Deferred — its Deferred justification (a trigger-evaluation concurrency design
|
||
decision is required before either recommended fix can land in-module) still
|
||
holds verbatim against the unchanged `ScriptActor`/`AlarmActor` source. The
|
||
re-review nonetheless worked through all 10 checklist categories afresh and
|
||
surfaced three new findings that the prior pass did not record: a cross-thread
|
||
`Dictionary` enumeration race when the Instance Actor's live `_attributes`
|
||
dictionary is handed by reference into child `ScriptActor`/`AlarmActor`
|
||
constructors (SiteRuntime-017, Medium); a stale `ScriptExecutionActor` XML doc
|
||
that still claims a "dedicated blocking I/O dispatcher" (SiteRuntime-018, Low);
|
||
and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
|
||
never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
|
||
2026-05-17. Open findings: 0.
|
||
|
||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||
|
||
The module was re-reviewed at commit `1eb6e97` as part of the new baseline
|
||
review. The SiteRuntime source surface has grown materially since the prior
|
||
pass — primarily by threading `ExecutionId`/`ParentExecutionId`/`SourceNode`
|
||
through the script-trust-boundary helpers and the cached-call telemetry
|
||
emitters, and by adding `OperationTrackingStore`, the
|
||
`AuditingDbConnection`/`AuditingDbCommand`/`AuditingDbDataReader` decorators,
|
||
and `ScriptExecutionScheduler`. All 10 checklist categories were walked afresh.
|
||
Seven new findings were recorded: a race that throws
|
||
`InvalidActorNameException` when a second `DeployInstanceCommand` arrives for
|
||
the same instance while a redeployment is still terminating its predecessor
|
||
(SiteRuntime-020, Medium); an artifact-only data-connection update that never
|
||
reaches the DCL (SiteRuntime-021, Medium); `AuditingDbCommand.DbConnection.set`
|
||
reaching into `AuditingDbConnection._inner` via reflection — the same anti-
|
||
pattern SiteRuntime-006 eliminated for the repositories, now reintroduced and
|
||
in direct tension with the script trust model that forbids `System.Reflection`
|
||
(SiteRuntime-022, Medium); `Convert.ToDouble(value)` in `ScriptActor` /
|
||
`AlarmActor` running under `CurrentCulture` so a string attribute value
|
||
becomes locale-sensitive (SiteRuntime-023, Low); `OperationTrackingStore`
|
||
serialising every cached-call write through a single connection +
|
||
`SemaphoreSlim` and using sync-over-async in `Dispose()` (SiteRuntime-024,
|
||
Medium); inbound-API `SetAttribute` (and any future caller) accepting unknown
|
||
attribute names and persisting them as overrides, polluting both `_attributes`
|
||
and the SQLite override table (SiteRuntime-025, Low); and the
|
||
`ReplicationMessages.cs` outbound/inbound record types still missing public XML
|
||
docs (SiteRuntime-026, Low). Prior findings 001–019 remain
|
||
Resolved/Deferred — no regressions observed in any of their fixed call sites.
|
||
Open findings: 7.
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ✓ | SetAttribute mis-routing, deploy double-count, redeploy reschedule race. |
|
||
| 2 | Akka.NET conventions | ✓ | Blocking on actor thread, script execution not on a dedicated dispatcher, premature success reply. |
|
||
| 3 | Concurrency & thread safety | ✓ | `_attributes` dictionary shared with child actors by reference; `_executionCounter` is actor-confined (OK). |
|
||
| 4 | Error handling & resilience | ✓ | Deploy reports Success before persistence; replicated artifact/S&F failures only logged (matches best-effort design). |
|
||
| 5 | Security | ✓ | Trust-model validation is substring-based and weak; reflection used to read private fields. |
|
||
| 6 | Performance & resource management | ✓ | Per-call SQLite connections (acceptable); CPU-bound scripts not interruptible by timeout. |
|
||
| 7 | Design-document adherence | ✓ | SetAttribute DCL routing missing; staggered-startup and supervision otherwise conform. |
|
||
| 8 | Code organization & conventions | ✓ | Repositories reflect into another class; synthetic IDs non-deterministic. |
|
||
| 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
|
||
| 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). |
|
||
|
||
_Re-review (2026-05-28, `1eb6e97`):_
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ✓ | Second-deploy race vs. pending redeploy (020); artifact-only data-connection update never reaches DCL (021); unknown-name SetAttribute persists bogus overrides (025). |
|
||
| 2 | Akka.NET conventions | ✓ | Trigger-eval blocking on coordinator mailbox remains Deferred (014); short-lived execution actors and replication actor otherwise conform. |
|
||
| 3 | Concurrency & thread safety | ✓ | DM's `_instanceActors` cache and `_pendingRedeploys` map shifted from old race; new ordering race surfaced (020). `OperationTrackingStore` single-connection + SemaphoreSlim serialises all cached writes (024). |
|
||
| 4 | Error handling & resilience | ✓ | `Task.Run` fire-and-forget replication paths log on faulted (acceptable, per "best-effort replication" design). DM's deploy persistence rollback path (resolved as SiteRuntime-005) intact. |
|
||
| 5 | Security | ✓ | Trust-model semantic analysis (SiteRuntime-011 fix) intact. `AuditingDbCommand` reflects into `AuditingDbConnection._inner` — same anti-pattern as SiteRuntime-006 (022). Audit emitter captures SQL parameter values verbatim per M4 design (M5 will redact). |
|
||
| 6 | Performance & resource management | ✓ | Per-call SQLite connections on hot paths in `SiteStorageService` (existing pattern, acceptable). `OperationTrackingStore` `Dispose()` does sync-over-async (024). `ScriptExecutionScheduler` bounded threads as expected. |
|
||
| 7 | Design-document adherence | ✓ | Artifact-only data-connection update path is silently inert (021) — contradicts the "site is self-contained after artifact deployment" intent. |
|
||
| 8 | Code organization & conventions | ✓ | Repository reflection-via-private-field anti-pattern reintroduced in `AuditingDbCommand` (022). `ReplicationMessages.cs` public records still undocumented (026). |
|
||
| 9 | Testing coverage | ✓ | `SiteReplicationActor` remains uncovered (SiteRuntime-016 deferred that gap to a clustered-ActorSystem harness, still outstanding). New findings have no targeted coverage yet. |
|
||
| 10 | Documentation & comments | ✓ | `ReplicationMessages.cs` records lack XML docs (026); other XML doc surface materially expanded in `1eb6e97`. |
|
||
|
||
## Findings
|
||
|
||
#### Re-review 2026-06-20 (commit `4307c381`) — full review
|
||
|
||
The module was re-reviewed in full at `4307c381` (current HEAD). The diff against the
|
||
prior baseline `1eb6e97` shows as 100% additions because of the ScadaBridge rename
|
||
(the project moved paths), so the whole module was re-read at its current state rather
|
||
than relying on the diff. Health is good: all prior findings 001–026 remain
|
||
Resolved/Deferred with no regressions observed in their fixed call sites
|
||
(SetAttribute DCL routing, the watch+buffer redeploy, the SiteRuntime-020 terminating
|
||
shadow, the OperationTrackingStore read/write split + safe Dispose, the
|
||
AuditingDb `Inner` accessor, invariant-culture numeric parsing, the
|
||
`_attributes` child-snapshot isolation). The new M7 surface (NativeAlarmActor,
|
||
CertStoreActor) and the WaitForAttribute/batch-write additions are generally
|
||
well-built. Five new findings were recorded, all in the native-alarm and
|
||
deployment-lifecycle bookkeeping: an unbounded `_latestAlarmEvents` map on the
|
||
Instance Actor (027, Medium), a phantom-active alarm left behind by
|
||
`NativeAlarmActor.EnforceCap` (028, Medium), a delete-during-pending-redeploy
|
||
that both over-decrements the deployed counter and resurrects the deleted
|
||
instance (029, Medium), missing tests for the cap/retention/replication paths
|
||
(030, Low), and stale site-side persistence of notification-list + SMTP config
|
||
that the design decision moved to central-only (031, Low).
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ✓ | EnforceCap leaves phantom-active alarm (028); delete-during-redeploy over-decrements + resurrects (029). HiLo/RoC/edge logic otherwise sound. |
|
||
| 2 | Akka.NET conventions | ✓ | Supervision (Resume coordinators / Stop short-lived) correct; execution actors on ScriptExecutionScheduler; cert broadcast uses PipeTo, captures `sender` before async. Trigger-eval mailbox-block stays Deferred (014). |
|
||
| 3 | Concurrency & thread safety | ✓ | Child-snapshot isolation (017) intact; `_createdConnections` actor-confined via ApplyArtifact… dispatch (021); OperationTrackingStore read/write split (024). No new actor-thread violations. |
|
||
| 4 | Error handling & resilience | ✓ | Best-effort persistence/replication paths log-on-fault; deploy reports Success only post-persist (005). Native source-unavailable retains+marks-uncertain. |
|
||
| 5 | Security | ✓ | Trust verdict delegated to shared ScriptTrustValidator; CertStoreActor path-traversal-guarded; no reflection in scripts; SQL params captured per M4 design (redaction at write time). |
|
||
| 6 | Performance & resource management | ✓ | `_latestAlarmEvents` unbounded growth on native-alarm churn (027). ScriptExecutionScheduler background threads OK; per-call SQLite connections acceptable. |
|
||
| 7 | Design-document adherence | ✓ | Actor hierarchy / native-alarm wiring conform. Stale site persistence of notification-list + SMTP config vs. "central-only, not deployed to sites" (031). |
|
||
| 8 | Code organization & conventions | ✓ | Reflection anti-pattern eliminated (006/022); Options owned by component; additive message evolution honoured. |
|
||
| 9 | Testing coverage | ✓ | Broad new suites (NativeAlarmActor, WaitForAttribute, SetAttribute, ExecutionActor, scheduler). EnforceCap/retention-drop + SiteReplicationActor still untested (030). |
|
||
| 10 | Documentation & comments | ✓ | XML docs accurate and extensive; ReplicationMessages now documented (026). No new stale comments found. |
|
||
|
||
### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptRuntimeContext.cs:106`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:204` |
|
||
|
||
**Description**
|
||
|
||
The design doc (Component-SiteRuntime.md, "GetAttribute / SetAttribute" and
|
||
"Script Runtime API") states that `Instance.SetAttribute` on a *data-connected*
|
||
attribute must send a write request to the DCL, which writes to the physical
|
||
device, and that the in-memory value is **not** optimistically updated. For *static*
|
||
attributes it updates memory and persists an override.
|
||
|
||
The implementation makes no such distinction. `ScriptRuntimeContext.SetAttribute`
|
||
unconditionally sends a `SetStaticAttributeCommand`, and `InstanceActor.HandleSetStaticAttribute`
|
||
unconditionally treats every write as a static override: it mutates `_attributes`,
|
||
publishes an `AttributeValueChanged` with hard-coded `"Good"` quality, notifies
|
||
children, and persists a SQLite override. A script writing a data-sourced attribute
|
||
therefore never reaches the device, the write failure can never be returned
|
||
synchronously to the script, and the in-memory value diverges from the device
|
||
until the next subscription update overwrites it. The persisted override is also
|
||
wrong: data-sourced attributes should not have static overrides.
|
||
|
||
**Recommendation**
|
||
|
||
In `InstanceActor`, look up the target attribute in `_configuration.Attributes`. If
|
||
it has a non-empty `DataSourceReference`, issue a DCL write (e.g. a `WriteTagRequest`
|
||
to `_dclManager`) and surface success/failure to the caller; do not persist an
|
||
override and do not optimistically mutate `_attributes`. Only attributes with no
|
||
data source reference should follow the current static-override path. Consider
|
||
splitting the message into `SetStaticAttributeCommand` vs `SetDataAttributeCommand`,
|
||
or branching inside the handler.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`<pending>`): `InstanceActor.HandleSetStaticAttribute` now resolves
|
||
the target attribute's data binding from `_configuration`. Data-sourced attributes are
|
||
routed via a new `HandleSetDataAttribute` that Asks the DCL with a `WriteTagRequest` and
|
||
pipes the device-write outcome back to the caller as a `SetStaticAttributeResponse` —
|
||
no override is persisted and `_attributes` is not optimistically mutated. Static
|
||
attributes keep the override path and now also reply with a `SetStaticAttributeResponse`.
|
||
`ScriptRuntimeContext.SetAttribute` is now `async Task` and Asks the Instance Actor,
|
||
throwing `InvalidOperationException` on a failed device write so scripts get the failure
|
||
synchronously.
|
||
|
||
### SiteRuntime-002 — `RouteInboundApiSetAttributes` always treats writes as static overrides
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:632` |
|
||
|
||
**Description**
|
||
|
||
`RouteInboundApiSetAttributes` (handling `Route.To().SetAttribute(s)` from the
|
||
Inbound API) emits a `SetStaticAttributeCommand` for every attribute, so it inherits
|
||
the same defect as SiteRuntime-001: writes to data-sourced attributes never reach
|
||
the device and are instead persisted as static overrides. In addition the response
|
||
is sent back as unconditionally successful (`true`) before the Instance Actor has
|
||
even processed the command, so a non-existent attribute or a future DCL write
|
||
failure is reported to the external caller as success.
|
||
|
||
**Recommendation**
|
||
|
||
Route through the same corrected `InstanceActor` write handler as SiteRuntime-001 so
|
||
the static-vs-data distinction is honoured. The optimistic ack is acceptable for
|
||
fire-and-forget static writes per the doc, but the XML comment should make the
|
||
limitation explicit, and once data-attribute writes are supported they need a real
|
||
response path.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`<pending>`): `RouteInboundApiSetAttributes` now Asks the Instance
|
||
Actor per attribute (instead of fire-and-forget Tell) and aggregates the
|
||
`SetStaticAttributeResponse` results. Because the Instance Actor handler is the
|
||
SiteRuntime-001 corrected handler, data-sourced attributes now reach the DCL and the
|
||
`RouteToSetAttributesResponse` reflects the real per-attribute outcome — a non-existent
|
||
attribute or a failed device write is reported as failure rather than an unconditional
|
||
optimistic `true`.
|
||
|
||
### SiteRuntime-003 — Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:222` |
|
||
|
||
**Description**
|
||
|
||
`HandleDeploy` stops an existing Instance Actor with `Context.Stop` and then
|
||
reschedules the same `DeployInstanceCommand` to itself after a hard-coded 500 ms,
|
||
hoping the child has fully terminated by then. `Context.Stop` is asynchronous; the
|
||
child is only removed from the parent's children collection after it actually stops
|
||
(including running `PostStop` on its descendants). If a deeply nested or slow
|
||
hierarchy takes longer than 500 ms, `CreateInstanceActor` calls `Context.ActorOf`
|
||
with a name that still belongs to the terminating child and throws
|
||
`InvalidActorNameException`. The `_instanceActors` dictionary check does not prevent
|
||
this — the dictionary entry is removed immediately, but the Akka child registry is
|
||
not. The 500 ms delay is also unconditionally added to every redeploy latency.
|
||
|
||
**Recommendation**
|
||
|
||
Watch the terminating child (`Context.Watch`) and recreate the Instance Actor only
|
||
after receiving the `Terminated` message, instead of guessing with a timer. Buffer
|
||
or stash the in-flight `DeployInstanceCommand` (and any further commands for that
|
||
instance) until termination completes.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`<pending>`): `HandleDeploy` no longer uses a fixed 500 ms
|
||
reschedule. When a redeployment targets a running instance, the existing Instance Actor
|
||
is `Context.Watch`-ed and stopped, and the in-flight `DeployInstanceCommand` is buffered
|
||
in a `_pendingRedeploys` map keyed by the terminating actor ref. A new `Terminated`
|
||
handler recreates the Instance Actor only after the predecessor (and its whole subtree)
|
||
has fully stopped, eliminating the `InvalidActorNameException` race and the
|
||
unconditional redeploy-latency penalty. The shared `ApplyDeployment` helper also skips
|
||
the `_totalDeployedCount` increment for redeployments, so the deployed-instance count no
|
||
longer drifts (this additionally addresses the root cause behind SiteRuntime-004).
|
||
|
||
### SiteRuntime-004 — `_totalDeployedCount` is incremented on redeployment of an existing instance
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — re-triaged: already fixed by the SiteRuntime-003 resolution. |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (`ApplyDeployment`) |
|
||
|
||
**Description**
|
||
|
||
In `HandleDeploy`, the existing-actor branch (line 223) reschedules the command and
|
||
returns. When the rescheduled command runs, no actor exists, so the code falls
|
||
through to the "new instance" branch and executes `_totalDeployedCount++`
|
||
(line 239). A redeployment is an *update* of an already-deployed instance, not a new
|
||
one, so the deployed count is over-counted by one on every redeploy.
|
||
`StoreDeployedConfigAsync` uses UPSERT semantics, so the SQLite row count does not
|
||
grow, but the in-memory `_totalDeployedCount` (reported to the health collector via
|
||
`UpdateInstanceCounts`) drifts upward and the reported "disabled" count becomes
|
||
wrong.
|
||
|
||
**Re-triage (2026-05-16)**
|
||
|
||
Verified against the current source: this is **already fixed**. The SiteRuntime-003
|
||
resolution replaced the fixed-delay reschedule with a shared `ApplyDeployment` helper
|
||
that takes an `isRedeploy` flag and guards the counter with `if (!isRedeploy)
|
||
_totalDeployedCount++;`. The redeploy path (`HandleTerminated`) always calls
|
||
`ApplyDeployment(..., isRedeploy: true)`, so the counter is no longer bumped on
|
||
redeployment. The regression test
|
||
`DeploymentManagerRedeployTests.Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances`
|
||
already covers this and passes. No further code change was required.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): no new change needed — the root cause was
|
||
eliminated by the SiteRuntime-003 fix (the `isRedeploy` guard in `ApplyDeployment`).
|
||
Confirmed by the existing passing regression test
|
||
`Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances`. Re-triaged from Open to
|
||
Resolved.
|
||
|
||
### SiteRuntime-005 — Deployment reports `Success` to central before persistence completes
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (`ApplyDeployment`, `HandleDeployPersistenceResult`) |
|
||
|
||
**Description**
|
||
|
||
`HandleDeploy` replies to central with `DeploymentStatus.Success` immediately after
|
||
creating the Instance Actor, while the SQLite persistence (`StoreDeployedConfigAsync`
|
||
+ `ClearStaticOverridesAsync`) runs asynchronously on a `Task.Run`. If persistence
|
||
fails, `HandleDeployPersistenceResult` only logs an error — central has already been
|
||
told the deployment succeeded. On a subsequent node restart or failover the instance
|
||
will not be re-created (it is not in SQLite), so the deployment is silently lost
|
||
despite central recording success. This contradicts the design's intent that the
|
||
site is the durable source of truth for deployed configs.
|
||
|
||
**Recommendation**
|
||
|
||
Persist the config before replying, or treat a persistence failure as a deployment
|
||
failure and send a corrective `DeploymentStatusResponse`/health signal to central.
|
||
At minimum, do not report `Success` until the config row is committed.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — `ApplyDeployment` sent
|
||
`DeploymentStatusResponse(Success)` synchronously before the persistence `Task.Run`
|
||
completed. The `Success` reply is now sent from `HandleDeployPersistenceResult` only
|
||
once the persistence result is known: on success it replies `Success`; on a
|
||
persistence failure it logs the error, stops the optimistically-created Instance
|
||
Actor, rolls back the deployed-instance counter, and replies
|
||
`DeploymentStatus.Failed` with the error message. `DeployPersistenceResult` carries an
|
||
`IsRedeploy` flag so the counter rollback is skipped for redeployments. Regression
|
||
tests: `DeploymentManagerMediumFindingsTests.Deploy_PersistenceFailure_ReportsFailedNotSuccess`
|
||
and `Deploy_Success_ReportsSuccessAndPersistsConfig`.
|
||
|
||
### SiteRuntime-006 — Site-local repositories read `SiteStorageService` private field via reflection
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SiteExternalSystemRepository.cs`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SiteNotificationRepository.cs` |
|
||
|
||
**Description**
|
||
|
||
Both repositories' `CreateConnection()` use `Type.GetField("_connectionString",
|
||
BindingFlags.NonPublic | BindingFlags.Instance)` to extract the private connection
|
||
string out of `SiteStorageService`. This is brittle (any rename or refactor of the
|
||
field breaks it at runtime, not compile time), defeats encapsulation, and the
|
||
accompanying XML comment openly describes it as a "pragmatic" hack and is internally
|
||
contradictory (it states a connection string is "passed separately at DI
|
||
registration time" which is not what the code does). It also sits awkwardly against
|
||
the project's own script trust model, which forbids `System.Reflection` in scripts.
|
||
|
||
**Recommendation**
|
||
|
||
Expose the connection string properly: add an `ISiteStorageConnectionProvider`
|
||
(already referenced in `ServiceCollectionExtensions` XML docs but not used), or have
|
||
`SiteStorageService` expose a `CreateConnection()` factory, and inject that into the
|
||
repositories. Remove the reflection entirely.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — both repositories
|
||
reflected into `SiteStorageService._connectionString`. `SiteStorageService` now
|
||
exposes a public `CreateConnection()` factory method that returns an unopened
|
||
`SqliteConnection` against the site database. Both `SiteExternalSystemRepository` and
|
||
`SiteNotificationRepository` now obtain connections via `_storage.CreateConnection()`;
|
||
all reflection (`Type.GetField` / `BindingFlags`) and the contradictory XML comments
|
||
have been removed. This is a fully in-module refactor — no cross-module design
|
||
decision was needed. Regression test:
|
||
`SiteRepositoryTests.ExternalSystemRepository_RoundTripsStoredDefinition` exercises
|
||
the repository's connection path end-to-end.
|
||
|
||
### SiteRuntime-007 — Synthetic entity IDs use the non-deterministic `string.GetHashCode()`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SiteExternalSystemRepository.cs`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SiteNotificationRepository.cs`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Repositories/SyntheticId.cs` |
|
||
|
||
**Description**
|
||
|
||
`GenerateSyntheticId` computes `name.GetHashCode() & 0x7FFFFFFF`. On .NET Core,
|
||
`string.GetHashCode()` is randomized per process by default, so the "stable
|
||
deterministic synthetic ID" promised by the XML comment is not stable at all — it
|
||
changes every time the process restarts. Any caller that obtained an ID and later
|
||
calls `GetExternalSystemByIdAsync`/`GetNotificationListByIdAsync` after a restart
|
||
will fail to find the entity. It also risks collisions: distinct names can hash to
|
||
the same 31-bit value, and `GetExternalSystemByIdAsync` would then return the wrong
|
||
row.
|
||
|
||
**Recommendation**
|
||
|
||
Use a deterministic, collision-resistant hash (e.g. a stable FNV-1a or the first
|
||
bytes of a SHA-256 of the name) if a synthetic integer ID is genuinely required, or
|
||
better, change the repository contract to key these site-local artifacts by name
|
||
rather than synthesising integer IDs.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — both repositories used
|
||
`name.GetHashCode()`, which is per-process randomized on .NET Core. A new internal
|
||
`SyntheticId` helper computes a deterministic, process-stable 31-bit ID using the
|
||
FNV-1a hash over the name's UTF-8 bytes. Both `GenerateSyntheticId` methods now
|
||
delegate to `SyntheticId.From(name)`. (The integer-keyed lookups are kept because
|
||
they are mandated by the shared `IExternalSystemRepository`/`INotificationRepository`
|
||
contracts in Commons — changing those contracts to name-keyed would be a cross-module
|
||
change outside this module's scope; the deterministic hash resolves the correctness
|
||
defect within scope.) Regression tests:
|
||
`SiteRepositoryTests.ExternalSystemRepository_SyntheticId_IsStableAcrossRestart` and
|
||
`NotificationRepository_SyntheticId_IsStableAcrossRestart` re-create the service to
|
||
simulate a process restart and confirm by-ID lookups still resolve.
|
||
|
||
### SiteRuntime-008 — Blocking `.GetAwaiter().GetResult()` on the actor thread during startup
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (`HandleStartupConfigsLoaded`, `LoadSharedScriptsFromStorage`, `HandleSharedScriptsLoaded`) |
|
||
|
||
**Description**
|
||
|
||
`LoadSharedScriptsFromStorage` is called synchronously from
|
||
`HandleStartupConfigsLoaded` (the actor's message handler) and performs
|
||
`_storage.GetAllSharedScriptsAsync().GetAwaiter().GetResult()` followed by Roslyn
|
||
compilation of every shared script. This blocks the DeploymentManager singleton's
|
||
mailbox thread for the full duration of the SQLite read and all shared-script
|
||
compilation. On the default dispatcher this also ties up a thread-pool thread and
|
||
risks thread-pool starvation, and the singleton cannot process any other message
|
||
(deployments, lifecycle commands, debug routing) until it returns. The rest of the
|
||
class correctly uses `PipeTo`/`ContinueWith`.
|
||
|
||
**Recommendation**
|
||
|
||
Load shared scripts asynchronously and `PipeTo(Self)` an internal message, the same
|
||
pattern already used for `StartupConfigsLoaded`. Perform compilation either inside
|
||
the piped continuation handler (still on the actor thread but at least off the
|
||
synchronous startup path) or on a dedicated background task whose result is piped
|
||
back.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — the blocking
|
||
`.GetAwaiter().GetResult()` and Roslyn compilation ran on the singleton's mailbox
|
||
thread inside `HandleStartupConfigsLoaded`. `LoadSharedScriptsFromStorage` now runs
|
||
the SQLite read **and** the Roslyn compilation on a background `Task.Run` and pipes a
|
||
new internal `SharedScriptsLoaded` message back to the actor. A new
|
||
`HandleSharedScriptsLoaded` handler then begins staggered Instance Actor creation, so
|
||
the compilation→creation ordering is preserved without ever blocking the mailbox. A
|
||
shared-script load failure is logged and startup proceeds (scripts needing a missing
|
||
shared script fail at execution time). Regression test:
|
||
`DeploymentManagerMediumFindingsTests.Startup_WithSharedScripts_LoadsConfigsAndStaysResponsive`
|
||
(confirms startup completes and the actor stays responsive with shared scripts
|
||
present).
|
||
|
||
### SiteRuntime-009 — Script execution actors run scripts on the default thread pool, not a dedicated dispatcher
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptExecutionActor.cs`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmExecutionActor.cs`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptExecutionScheduler.cs` |
|
||
|
||
**Description**
|
||
|
||
The design (CLAUDE.md "Architecture & Runtime") states Script Execution Actors run
|
||
on a *dedicated blocking I/O dispatcher*. The code does not do this: `ScriptActor.SpawnExecution`
|
||
and `AlarmActor.SpawnAlarmExecution` create the execution actors with no
|
||
`.WithDispatcher(...)`, and the execution itself runs inside a bare `Task.Run`,
|
||
i.e. on the shared .NET thread pool. The `// NOTE: In production, configure a
|
||
dedicated ... dispatcher` comments acknowledge the gap but it ships unconfigured.
|
||
Scripts can perform synchronous blocking I/O (`Database.Connection`, synchronous
|
||
`ExternalSystem.Call`); running them on the shared pool can starve it and stall
|
||
unrelated Akka dispatchers and HTTP request handling under load.
|
||
|
||
**Recommendation**
|
||
|
||
Define the dedicated dispatcher in HOCON and chain `.WithDispatcher(...)` on the
|
||
execution actor `Props`. If the `Task.Run` model is kept, run script bodies on a
|
||
dedicated `TaskScheduler` / bounded scheduler rather than the global pool. Either
|
||
way, remove the "in production, configure…" comments by actually configuring it.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — script and alarm
|
||
on-trigger bodies ran inside a bare `Task.Run` on the shared `ThreadPool`. The
|
||
recommendation's `TaskScheduler` option was taken because it is fully in-module (a
|
||
HOCON dispatcher would require editing the Host's ActorSystem config, out of scope).
|
||
A new `ScriptExecutionScheduler` provides a bounded set of dedicated background
|
||
threads (count from the new `SiteRuntimeOptions.ScriptExecutionThreadCount`, default
|
||
8). `ScriptExecutionActor` and `AlarmExecutionActor` now run their bodies via
|
||
`Task.Factory.StartNew(..., ScriptExecutionScheduler.Shared(options)).Unwrap()`
|
||
instead of `Task.Run`, so blocking script I/O is contained to those dedicated threads
|
||
and cannot starve the global pool. The misleading "in production, configure a
|
||
dedicated dispatcher" comments were removed. Regression tests:
|
||
`ScriptExecutionSchedulerTests` (`Scheduler_RunsWork_OffTheThreadPool`,
|
||
`Scheduler_RespectsConfiguredThreadCount`, `Scheduler_Shared_ReturnsSameInstanceForOptions`).
|
||
|
||
### SiteRuntime-010 — `EnsureDclConnections` never updates a connection whose configuration changed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (`EnsureDclConnections`, `ComputeConnectionConfigHash`) |
|
||
|
||
**Description**
|
||
|
||
`EnsureDclConnections` tracks created connections in `_createdConnections` and skips
|
||
any name already present (`if (_createdConnections.Contains(name)) continue;`). The
|
||
skip is purely name-based: if a redeployment (or an artifact deployment) changes the
|
||
endpoint, credentials, backup endpoint, or `FailoverRetryCount` of an existing
|
||
connection, the new configuration is silently ignored and the DCL keeps using the
|
||
stale `CreateConnectionCommand`. There is no `UpdateConnectionCommand` path. The
|
||
design states that after artifact deployment the site is fully self-contained with
|
||
current configuration; this caching breaks that for connection changes.
|
||
|
||
**Recommendation**
|
||
|
||
Compare the incoming connection config against the last one sent and re-issue a
|
||
create/update command when it differs, or have the DCL treat `CreateConnectionCommand`
|
||
as idempotent upsert and always forward it. Key the cache on a config hash, not just
|
||
the name.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — the cache was a
|
||
name-only `HashSet`, so a changed connection config was silently dropped.
|
||
`_createdConnections` is now a `Dictionary<string,string>` mapping connection name to
|
||
a SHA-256 hash of its protocol/primary-config/backup-config/failover-retry-count
|
||
(`ComputeConnectionConfigHash`). A connection whose hash is unchanged is still
|
||
skipped; a connection whose config changed re-issues a `CreateConnectionCommand` so
|
||
the DCL adopts the new configuration. Regression tests:
|
||
`DeploymentManagerMediumFindingsTests.EnsureDclConnections_ConnectionConfigChanged_ReissuesCreateCommand`
|
||
and `EnsureDclConnections_UnchangedConfig_DoesNotReissueCreateCommand`.
|
||
|
||
### SiteRuntime-011 — Trust-model validation is a substring scan and is both over- and under-inclusive
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptCompilationService.cs` (`ValidateTrustModel`) |
|
||
|
||
**Description**
|
||
|
||
`ValidateTrustModel` enforces the script trust model by doing raw `string.Contains` /
|
||
`IndexOf` on the script source text for forbidden namespace strings. This is
|
||
unreliable in both directions:
|
||
|
||
- **Bypass (under-inclusive):** the check looks only for the literal namespace
|
||
strings. A script can reach forbidden APIs without ever writing `System.IO` etc. —
|
||
e.g. via fully-qualified type use through aliasing, `global::`-prefixed names, or
|
||
simply because the namespace is already imported transitively. The compilation
|
||
references include `typeof(object).Assembly` (the whole of `System.Private.CoreLib`,
|
||
which contains `System.IO.File`, `System.Threading.Thread`, `System.Reflection`,
|
||
etc.), so forbidden types are fully resolvable at compile time and the only barrier
|
||
is this textual scan.
|
||
- **False positives (over-inclusive):** any occurrence of the substring in a comment,
|
||
string literal, or an unrelated identifier (e.g. a variable named `ProcessThreading`)
|
||
triggers a violation; the `AllowedExceptions` logic only rescues exact prefixes.
|
||
- The dead `isAllowed` variable at line 64 is computed and never used.
|
||
|
||
**Recommendation**
|
||
|
||
Enforce the trust model with a Roslyn `SyntaxWalker`/semantic analysis (inspect
|
||
resolved symbols and their containing namespaces/assemblies), or restrict the
|
||
compilation's metadata references and `AssemblyLoadContext` so forbidden types are
|
||
genuinely unavailable, rather than relying on source-text matching. Remove the
|
||
unused `isAllowed` variable.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`commit pending`): root cause confirmed — `ValidateTrustModel`
|
||
was a raw `string.Contains`/`IndexOf` scan of the source text, with a dead `isAllowed`
|
||
variable. It is now Roslyn semantic analysis: the script is parsed and a
|
||
`CSharpCompilation` + `SemanticModel` are built; every name/member/object-creation
|
||
node is resolved to its symbol and the symbol's containing namespace and
|
||
fully-qualified containing type are checked against the forbidden roots. Bare
|
||
namespace symbols are ignored (so the `System.Threading` qualifier of the allowed
|
||
`System.Threading.Tasks.Task` no longer false-positives). A name that cannot be
|
||
resolved (a type from an assembly deliberately absent from the script's references)
|
||
falls back to a syntactic fully-qualified-name check, so e.g. `System.Net.Http`
|
||
references are still rejected. The dead `isAllowed` variable was removed. This fixes
|
||
both the bypass (`global::`/alias-qualified forbidden types) and the false positives
|
||
(forbidden namespace string in a comment, string literal, or unrelated identifier).
|
||
Regression tests: new `TrustModelSemanticTests` (alias/`global::` detection, comment/
|
||
literal/identifier non-detection, allowed-exception resolution); all 39 existing
|
||
`SandboxTests` + `ScriptCompilationServiceTests` continue to pass.
|
||
|
||
### SiteRuntime-012 — `AttributeAccessor`/`ScopeAccessors` block the script on a synchronous Ask
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScopeAccessors.cs:28` |
|
||
|
||
**Description**
|
||
|
||
`AttributeAccessor`'s indexer getter calls `_ctx.GetAttribute(...).GetAwaiter().GetResult()`,
|
||
synchronously blocking the script-execution thread on an actor Ask. Combined with
|
||
SiteRuntime-009 (scripts run on the shared thread pool) this means a script that
|
||
reads several attributes via `Attributes["X"]` holds a pool thread blocked for each
|
||
round-trip. The async variants (`GetAsync`/`SetAsync`) exist but the ergonomic
|
||
indexer encourages the blocking path. The XML comment notes "Reads block on the
|
||
actor Ask" but does not warn about the thread-pool impact.
|
||
|
||
**Recommendation**
|
||
|
||
Once a dedicated script dispatcher exists (SiteRuntime-009) the blocking is contained
|
||
to that pool, which is acceptable; until then, document the cost clearly and prefer
|
||
steering script authors to the async accessors. Consider making the indexer
|
||
internal-only and exposing only the async API.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`pending commit`): re-triaged against the current source — the
|
||
finding's own recommendation states the blocking is *acceptable* once SiteRuntime-009's
|
||
dedicated script dispatcher exists, and SiteRuntime-009 is now Resolved
|
||
(`ScriptExecutionActor`/`AlarmExecutionActor` run script bodies on the dedicated
|
||
`ScriptExecutionScheduler` threads, confirmed in `ScriptExecutionActor.cs:74`). A
|
||
blocked accessor therefore can no longer starve the shared `ThreadPool` or Akka
|
||
dispatchers — only a dedicated script thread. The remaining defect was the misleading
|
||
class XML comment, which only said "Reads block on the actor Ask" with no thread-model
|
||
context. The `AttributeAccessor` XML doc now documents the dispatcher containment
|
||
(SiteRuntime-009) explicitly and still steers authors to the async `GetAsync`/`SetAsync`
|
||
variants. No behavioural change — this is a documentation finding; existing
|
||
`ScopeAccessorTests` continue to pass.
|
||
|
||
### SiteRuntime-013 — `HandleUnsubscribeDebugView` does nothing despite documented behaviour
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:414` |
|
||
|
||
**Description**
|
||
|
||
`HandleUnsubscribeDebugView` is documented ("Debug view unsubscribe — removes
|
||
subscription") and the actor registers a handler for `UnsubscribeDebugViewRequest`,
|
||
but the body only logs a debug message — there is no subscription state in the
|
||
Instance Actor to remove. The design places the actual subscription lifecycle in
|
||
`SiteStreamManager` (`Subscribe`/`Unsubscribe`/`RemoveSubscriber`), so the Instance
|
||
Actor genuinely has nothing to do here. The handler and its XML comment are
|
||
therefore misleading: a reader expects it to tear down a subscription.
|
||
|
||
**Recommendation**
|
||
|
||
Either remove the no-op handler and route `UnsubscribeDebugViewRequest` to wherever
|
||
the `SiteStreamManager` subscription is actually cancelled, or correct the XML
|
||
comment to state explicitly that subscription teardown is handled by
|
||
`SiteStreamManager` and this handler is a no-op acknowledgement.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`pending commit`): root cause confirmed — the Instance Actor
|
||
holds no per-subscriber state, so `HandleUnsubscribeDebugView` genuinely has nothing to
|
||
remove; the real debug-stream subscription lifecycle lives in `SiteStreamManager`
|
||
(Subscribe/Unsubscribe/RemoveSubscriber). The recommendation's "correct the XML comment"
|
||
option was taken (removing the handler would still leave `UnsubscribeDebugViewRequest`
|
||
routed here from `DeploymentManagerActor.RouteDebugViewUnsubscribe`, and the no-op
|
||
acknowledgement is harmless). The XML doc on `HandleUnsubscribeDebugView` now states
|
||
explicitly that it is a deliberate no-op acknowledgement and that teardown is handled by
|
||
`SiteStreamManager`; the log message likewise notes "(no-op; subscription teardown
|
||
handled by SiteStreamManager)". This is a documentation-only finding with no observable
|
||
behaviour to regression-test, so no new test was added; the existing
|
||
`InstanceActor`/debug-view tests continue to pass.
|
||
|
||
### SiteRuntime-014 — Trigger-expression evaluation blocks the coordinator actor thread
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Deferred |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptActor.cs:219`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:389` |
|
||
|
||
**Description**
|
||
|
||
`EvaluateExpressionTrigger` (ScriptActor) and `EvaluateExpression` (AlarmActor) run a
|
||
compiled Roslyn script with `.RunAsync(...).GetAwaiter().GetResult()` directly inside
|
||
the actor's `AttributeValueChanged` message handler. This blocks the coordinator
|
||
actor's mailbox thread for up to the 2-second timeout on every monitored attribute
|
||
change. Coordinator actors are on the default dispatcher and process the hot path of
|
||
attribute-change fan-out; a slow expression delays all other messages to that actor
|
||
and consumes a thread-pool thread for the duration. The inline comments correctly
|
||
note CPU-bound expressions are not interruptible but do not address the
|
||
mailbox-blocking concern.
|
||
|
||
**Recommendation**
|
||
|
||
Trigger expressions are expected to be cheap, but to keep the actor responsive
|
||
consider evaluating them off the actor thread (pipe the boolean result back as an
|
||
internal message) or pre-compiling to a plain delegate that executes near-instantly
|
||
without the Roslyn scripting `RunAsync` machinery.
|
||
|
||
**Resolution**
|
||
|
||
Deferred 2026-05-16 (`pending commit`): root cause confirmed — `EvaluateExpressionTrigger`
|
||
(ScriptActor) and `EvaluateExpression` (AlarmActor) call
|
||
`_compiledTriggerExpression.RunAsync(...).GetAwaiter().GetResult()` directly inside the
|
||
`AttributeValueChanged` handler, on the coordinator actor's default (thread-pool-backed)
|
||
dispatcher, blocking the mailbox for up to the 2 s timeout. Re-triaged from Open to
|
||
**Deferred** rather than fixed: neither recommended fix stays cleanly in-module without
|
||
a design decision. (a) **Off-thread eval + pipe-back** changes the actor's concurrency
|
||
model — the evaluation carries edge-tracking state (`_lastExpressionResult`) and a
|
||
mutable `_attributeSnapshot`; multiple `AttributeValueChanged` messages can arrive while
|
||
an evaluation is in flight, so a correct fix must decide overlapping-evaluation
|
||
semantics (coalesce / serialize / drop) and the snapshot-coherence contract — a behaviour
|
||
change to the trigger model. (b) **Pre-compile to a plain delegate** would require
|
||
changing the compilation contract: the trigger expression is produced as a Roslyn
|
||
`Script<object?>` by `ScriptCompilationService.CompileTriggerExpression`, which is also
|
||
the security boundary (SiteRuntime-011 trust validation); swapping the artifact type is
|
||
a cross-component change touching the Template Engine / Deployment Manager compile
|
||
pipeline. Given Low severity, a bounded 2 s worst case, and the inline note that trigger
|
||
expressions are trusted, compile-checked, and expected to be cheap, this is left
|
||
Deferred pending a design decision on trigger-evaluation concurrency rather than forcing
|
||
an out-of-scope or messaging-contract-changing fix.
|
||
|
||
### SiteRuntime-015 — `LoggerFactory` created per Instance Actor and never disposed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:746` |
|
||
|
||
**Description**
|
||
|
||
`CreateInstanceActor` does `var loggerFactory = new LoggerFactory();` for every
|
||
Instance Actor it creates, uses it once to produce an `ILogger<InstanceActor>`, and
|
||
never disposes it. `LoggerFactory` is `IDisposable`. With up to 500 instances (and
|
||
churn from redeployments) this leaks a factory per instance, and the produced
|
||
loggers are detached from the application's configured logging providers, so
|
||
Instance Actor logs may not be routed/filtered consistently with the rest of the
|
||
host.
|
||
|
||
**Recommendation**
|
||
|
||
Inject the application's `ILoggerFactory` (or an `ILogger<InstanceActor>` factory
|
||
delegate) into `DeploymentManagerActor` via DI and reuse it, rather than newing one
|
||
up per child. Do not create a fresh `LoggerFactory` in a hot creation path.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`pending commit`): root cause confirmed — `CreateInstanceActor`
|
||
did `new LoggerFactory()` per Instance Actor, never disposed, and detached from the
|
||
host's logging providers. `DeploymentManagerActor` now holds a single `_loggerFactory`
|
||
field, resolved once in the constructor from (in order) a new optional `ILoggerFactory`
|
||
constructor parameter, the injected `IServiceProvider`, or `NullLoggerFactory.Instance`
|
||
as a last resort — never a per-instance allocation. `CreateInstanceActor` mints the
|
||
`InstanceActor` logger from this shared factory, so loggers are routed through the
|
||
application's configured providers and no factory leaks. Regression test:
|
||
`DeploymentManagerLoggerFactoryTests.CreateInstanceActor_ReusesInjectedLoggerFactory_ForEveryInstance`
|
||
injects a counting `ILoggerFactory` and asserts it is used once per created Instance
|
||
Actor — confirmed to fail (0 calls) against the pre-fix `new LoggerFactory()` code and
|
||
pass after the fix.
|
||
|
||
### SiteRuntime-016 — Short-lived execution actors, replication actor, and repositories are untested
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Testing coverage |
|
||
| Status | Resolved |
|
||
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` |
|
||
|
||
**Description**
|
||
|
||
The test project covers the coordinator actors (`InstanceActor`, `ScriptActor`,
|
||
`AlarmActor`, `DeploymentManagerActor`), persistence, scripting and streaming, but a
|
||
search of the test sources finds no references to `ScriptExecutionActor`,
|
||
`AlarmExecutionActor`, `SiteReplicationActor`, `SiteExternalSystemRepository`, or
|
||
`SiteNotificationRepository`. These cover critical paths: script timeout/failure
|
||
handling and result reply, alarm on-trigger execution, peer config/S&F replication
|
||
(including the `SendToPeer` no-peer drop), and the reflection-based repository reads.
|
||
Several findings above (001/002 mis-routing, 007 ID instability, 011 trust bypass)
|
||
would likely have been caught by targeted tests.
|
||
|
||
**Recommendation**
|
||
|
||
Add unit/integration tests for the execution actors (success, timeout, exception,
|
||
Ask-reply, PoisonPill self-stop), `SiteReplicationActor` (outbound forward, inbound
|
||
apply, peer tracking on cluster events), and the two repositories (round-trip read,
|
||
synthetic-ID lookup, missing-row behaviour).
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (`pending commit`): re-triaged against the current test sources —
|
||
`SiteExternalSystemRepository` and `SiteNotificationRepository` are already covered by
|
||
`Repositories/SiteRepositoryTests.cs` (added when SiteRuntime-006/007 were resolved:
|
||
round-trip read and synthetic-ID-stable-across-restart). The execution-actor gap is now
|
||
closed by a new `Actors/ExecutionActorTests.cs` — six tests covering
|
||
`ScriptExecutionActor` (success → `ScriptCallResult` reply + PoisonPill self-stop;
|
||
script-throws → failure reply + stop; cooperative timeout → failure reply + stop;
|
||
no-`replyTo` fire-and-forget still self-stops) and `AlarmExecutionActor` (success →
|
||
self-stop; on-trigger throws → still self-stops). `SiteReplicationActor` is *not* covered
|
||
here: it depends on `Cluster.Get(Context.System)` and so requires a clustered
|
||
`ActorSystem` HOCON harness that does not yet exist in this test project — adding that
|
||
harness is a larger test-infrastructure task tracked separately and out of scope for a
|
||
Low-severity coverage finding; the highest-value untested paths the finding called out
|
||
(script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests
|
||
green.
|
||
|
||
### SiteRuntime-017 — Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:625`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:675`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptActor.cs:83`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:93` |
|
||
|
||
**Description**
|
||
|
||
`InstanceActor.CreateChildActors` passes the Instance Actor's own mutable
|
||
`_attributes` field (a plain `Dictionary<string, object?>`) by reference into the
|
||
`Props.Create(...)` factory for every `ScriptActor` and `AlarmActor` (as the
|
||
`initialAttributes` constructor argument). Each child constructor then iterates
|
||
that dictionary to seed its `_attributeSnapshot`:
|
||
|
||
```csharp
|
||
if (initialAttributes != null)
|
||
foreach (var kvp in initialAttributes)
|
||
_attributeSnapshot[kvp.Key] = kvp.Value;
|
||
```
|
||
|
||
`Context.ActorOf` returns immediately; the child actor's constructor runs later on
|
||
the *child's* mailbox thread. Meanwhile the Instance Actor's `PreStart` returns and
|
||
the Instance Actor begins processing its mailbox — `HandleTagValueUpdate` and
|
||
`HandleAttributeValueChanged` both mutate `_attributes` (`_attributes[...] = ...`).
|
||
A DCL tag update that arrives before a child has finished its constructor copy
|
||
therefore mutates the dictionary on the Instance Actor thread while the child
|
||
thread is enumerating it. `Dictionary<,>` is explicitly not safe for concurrent
|
||
read/write: the enumeration can throw `InvalidOperationException` ("collection was
|
||
modified") — which surfaces as an `ActorInitializationException` and, under the
|
||
Instance Actor's `SupervisorStrategy`, **stops** the child (the strategy returns
|
||
`Stop` for `ActorInitializationException`). The script or alarm is then silently
|
||
absent for the life of the instance. A torn read of an entry is also possible. The
|
||
window is small but deterministically reachable on a busy site at startup/failover
|
||
— exactly the staggered-startup scenario the design is most concerned about.
|
||
|
||
**Recommendation**
|
||
|
||
Do not share the live dictionary. Snapshot it on the Instance Actor thread before
|
||
constructing the child — e.g. pass `new Dictionary<string, object?>(_attributes)`
|
||
(or an immutable copy) into each `Props.Create`. The copy is made on the Instance
|
||
Actor thread inside `CreateChildActors`, so it is race-free, and each child gets a
|
||
private dictionary to seed from.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (`commit pending`): root cause confirmed — `CreateChildActors`
|
||
captured the live `_attributes` field directly in every child `Props.Create`
|
||
closure. `CreateChildActors` now takes a single `new Dictionary<string,object?>(_attributes)`
|
||
snapshot on the Instance Actor thread and hands each `ScriptActor`/`AlarmActor` that
|
||
private copy, so no child constructor ever enumerates a dictionary the Instance
|
||
Actor is concurrently mutating. Regression test:
|
||
`InstanceActorChildAttributeRaceTests.ChildActors_AreSeededFromAnIsolatedCopy_NotTheLiveAttributesDictionary`
|
||
asserts every child's seed dictionary is a distinct object from the Instance
|
||
Actor's live `_attributes` (confirmed to fail — "seeded ... by reference" — against
|
||
the pre-fix code and pass after). `ScriptActor`/`AlarmActor` expose an internal
|
||
`SeedAttributesReference` for this assertion (`InternalsVisibleTo` added for the
|
||
test project).
|
||
|
||
### SiteRuntime-018 — `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher"
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptExecutionActor.cs:17` |
|
||
|
||
**Description**
|
||
|
||
The class-level XML summary on `ScriptExecutionActor` states "Runs on a dedicated
|
||
blocking I/O dispatcher." That is not what the code does. SiteRuntime-009 was
|
||
resolved by introducing `ScriptExecutionScheduler` (a bounded dedicated
|
||
`TaskScheduler`); the *actor itself and its mailbox* run on the **default** Akka
|
||
dispatcher, and only the script body runs on the scheduler's threads via
|
||
`Task.Factory.StartNew(..., scheduler)`. The resolution of SiteRuntime-009
|
||
explicitly chose the `TaskScheduler` route *instead of* a HOCON dispatcher and
|
||
even removed the "in production, configure a dedicated dispatcher" comments
|
||
elsewhere — but this stale summary line was missed. A reader is told the actor is
|
||
on a dedicated dispatcher when it is not, which is misleading when reasoning about
|
||
mailbox throughput and thread-pool pressure. (`AlarmExecutionActor` does not carry
|
||
the equivalent claim — its summary only says "Same pattern as ScriptExecutionActor.")
|
||
|
||
**Recommendation**
|
||
|
||
Correct the summary to describe the actual model: the actor runs on the default
|
||
dispatcher and the script body is dispatched onto the dedicated
|
||
`ScriptExecutionScheduler` (SiteRuntime-009). Align the wording with the accurate
|
||
comment already present at `ScriptExecutionActor.cs:71-73`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (`commit pending`): root cause confirmed — the stale
|
||
"Runs on a dedicated blocking I/O dispatcher" line in the `ScriptExecutionActor`
|
||
class summary was missed when SiteRuntime-009 was resolved. The summary now states
|
||
the actual model: the actor and its mailbox run on the default Akka dispatcher and
|
||
only the script body is dispatched onto the dedicated `ScriptExecutionScheduler`
|
||
(SiteRuntime-009). Documentation-only change with no observable behaviour, so no
|
||
regression test was added; the existing suite continues to pass.
|
||
|
||
### SiteRuntime-019 — Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:106`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:113` |
|
||
|
||
**Description**
|
||
|
||
`InstanceActor`'s constructor registers `Receive<DisableInstanceCommand>` and
|
||
`Receive<EnableInstanceCommand>` handlers that log and reply with a successful
|
||
`InstanceLifecycleResponse`. These handlers are unreachable. The Deployment Manager
|
||
is the only sender of those commands, and `DeploymentManagerActor.HandleDisable` /
|
||
`HandleEnable` handle the lifecycle entirely themselves — they call
|
||
`Context.Stop(actor)` (disable) or `CreateInstanceActor(...)` (enable) directly and
|
||
reply to the original sender from the Deployment Manager. Neither command is ever
|
||
`Forward`-ed or `Tell`-ed to the Instance Actor. The handlers are dead code, and
|
||
they are actively misleading: a maintainer reading `InstanceActor` would reasonably
|
||
believe disable/enable is partly an Instance-Actor responsibility, and the no-op
|
||
"true" reply implies an instance-side acknowledgement contract that does not exist.
|
||
If a future change *did* route these commands here, the disable handler would do
|
||
nothing useful (it does not stop children or tear down state — Akka does that when
|
||
the parent stops the actor).
|
||
|
||
**Recommendation**
|
||
|
||
Remove the two `Receive<...>` registrations and their handler bodies from
|
||
`InstanceActor`, since the Deployment Manager owns the disable/enable lifecycle.
|
||
If the intent is to keep them for a future instance-side hook, add an XML comment
|
||
stating that the Deployment Manager currently handles these and the handlers are a
|
||
reserved placeholder — but removal is preferred.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (`commit pending`): re-verified as genuinely dead code — a
|
||
codebase-wide search confirms `DisableInstanceCommand`/`EnableInstanceCommand` are
|
||
only ever sent (from central) to the site `DeploymentManagerActor`, whose
|
||
`HandleDisable`/`HandleEnable` own the lifecycle entirely (`Context.Stop` /
|
||
`CreateInstanceActor`) and never `Forward`/`Tell` the command to the Instance
|
||
Actor. The two unreachable `Receive<...>` registrations and their no-op
|
||
"success" handler bodies were removed from `InstanceActor`, replaced with a comment
|
||
stating the Deployment Manager owns this lifecycle. Regression test:
|
||
`InstanceActorTests.InstanceActor_DoesNotHandleDisableOrEnableCommands` asserts the
|
||
Instance Actor produces no `InstanceLifecycleResponse` for either command
|
||
(confirmed to fail against the pre-fix dead handlers and pass after removal).
|
||
|
||
### SiteRuntime-020 — Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:285`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:971` |
|
||
|
||
**Resolution** — added a name → terminating-actor-ref shadow
|
||
(`_terminatingActorsByName`) populated when `HandleDeploy` stops the
|
||
predecessor and cleared in `HandleTerminated`. `HandleDeploy` now
|
||
detects the mid-termination state before falling through to
|
||
`ApplyDeployment(fresh)`: on hit it tells the displaced
|
||
`PendingRedeploy.OriginalSender` a `DeploymentStatus.Failed` /
|
||
"superseded by newer deployment …" response and overwrites the buffered
|
||
pending command (last-write-wins). Regression test
|
||
`SR020_ThreeRapidDeploys_DoNotThrowInvalidActorNameException_LatestWins`
|
||
fires three rapid deploys, asserts the middle deploy is told it was
|
||
superseded, the latest succeeds, and the resulting instance is operable
|
||
(DisableInstanceCommand works).
|
||
|
||
**Description**
|
||
|
||
The SiteRuntime-003 fix makes `HandleDeploy` watch + stop a running Instance
|
||
Actor and buffer the in-flight `DeployInstanceCommand` in `_pendingRedeploys`
|
||
until `Terminated` arrives. The handler also removes the instance from
|
||
`_instanceActors` synchronously, in step with the stop request:
|
||
|
||
```csharp
|
||
if (_instanceActors.TryGetValue(instanceName, out var existing))
|
||
{
|
||
_instanceActors.Remove(instanceName);
|
||
_pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
|
||
Context.Watch(existing);
|
||
Context.Stop(existing);
|
||
UpdateInstanceCounts();
|
||
return;
|
||
}
|
||
|
||
// Fresh deployment — no existing actor to replace.
|
||
ApplyDeployment(command, Sender, isRedeploy: false);
|
||
```
|
||
|
||
If a *second* `DeployInstanceCommand` for the same `instanceName` arrives on
|
||
the singleton's mailbox while the predecessor is still terminating, the
|
||
`_instanceActors.TryGetValue` lookup correctly reports "no existing actor" —
|
||
because the first deploy already removed it — and execution falls through to
|
||
`ApplyDeployment(..., isRedeploy: false)`. `ApplyDeployment` immediately calls
|
||
`CreateInstanceActor`, which calls `Context.ActorOf(props, instanceName)`. But
|
||
the predecessor's Akka child name **is still registered** in the parent's
|
||
child registry: that name is only released after the predecessor's `Terminated`
|
||
signal — exactly the asynchronous gap SiteRuntime-003 was created to plug for
|
||
the *first* redeploy. `Context.ActorOf` therefore throws
|
||
`InvalidActorNameException`, which Akka rethrows as
|
||
`ActorInitializationException` — and the supervisor's `Stop` directive on that
|
||
exception (DeploymentManagerActor.cs:179) silently stops the just-created
|
||
child. The second deploy is then quietly lost: `_instanceActors` doesn't
|
||
contain it (the throw aborted the bookkeeping after `CreateInstanceActor`'s
|
||
own `ContainsKey` guard but before `_instanceActors[instanceName] = actorRef`
|
||
would have run), `_totalDeployedCount` was incremented, and the deployer is
|
||
never told the deployment failed (the persistence `Task.Run` is also dropped
|
||
on the throw path). The race is real on a busy site where central retries a
|
||
deploy because the prior attempt timed out — exactly the scenario the
|
||
DeploymentManager-006 query-then-deploy idempotency mechanism was designed for.
|
||
|
||
The first-redeploy case (SiteRuntime-003) does NOT exhibit this because at
|
||
that point the predecessor's child name was still in `_instanceActors`, so the
|
||
branch correctly buffers. The bug is specific to the third (and beyond)
|
||
incoming deploy when two are already in flight for the same instance.
|
||
|
||
**Recommendation**
|
||
|
||
The pending-redeploy bookkeeping needs to be authoritative for "we are mid-
|
||
redeploy on this instance", not just the `_instanceActors` cache. Add a second
|
||
keyed lookup — e.g. a `Dictionary<string, IActorRef> _terminatingActorsByName`
|
||
populated when the predecessor is stopped — and check it BEFORE
|
||
`ApplyDeployment(isRedeploy: false)`. On a hit, overwrite (or stash) the
|
||
buffered `PendingRedeploy` for that terminating actor so the latest command
|
||
wins on the `Terminated` signal. Alternatively, defer the deploy by stashing
|
||
all messages for that `instanceName` until the predecessor terminates (Akka
|
||
`Stash` pattern). Either way, the fall-through to "fresh deployment" needs to
|
||
be gated on "no instance with this name is currently terminating".
|
||
|
||
### SiteRuntime-021 — `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:931` |
|
||
|
||
**Resolution (2026-05-28):** Took the "refactor `EnsureDclConnections` into a shared field-based helper" path. Extracted a new `EnsureDclConnection(name, protocol, primaryJson, backupJson, failoverRetryCount)` method that owns the hash-cache check and the `CreateConnectionCommand` Tell — both the existing inline `EnsureDclConnections(configJson)` and the new artifact path now drive through it. `ComputeConnectionConfigHash` got a field-based overload so the artifact path (which carries data directly on `DataConnectionArtifact`) reuses the same hash logic as the `ConnectionConfig`-based inline path. To keep `_createdConnections` mutation actor-thread-confined (the artifact-deploy persistence runs inside a `Task.Run`), the off-thread persistence dispatches a new internal `ApplyArtifactDataConnectionsToDcl` message back to `Self` after the SQLite writes; the actor-thread handler then iterates and invokes `EnsureDclConnection`. The DCL only sees `CreateConnectionCommand` (no `Update`/`Delete` messages exist in the codebase, and `CreateConnectionCommand` is treated as upsert-by-name — same shape as the inline-config path). Build clean; 302 SiteRuntime tests green (the existing `EnsureDclConnections_ConnectionConfigChanged_ReissuesCreateCommand` regression test still passes through the refactored shared helper).
|
||
|
||
**Description**
|
||
|
||
`HandleDeployArtifacts` persists the artifact bundle (shared scripts, external
|
||
systems, database connections, notification lists, SMTP configs, and
|
||
**data connection definitions**) into local SQLite. For data connection
|
||
definitions specifically (`DataConnections`), the handler calls
|
||
`_storage.StoreDataConnectionDefinitionAsync(...)` — but does NOT issue a
|
||
`CreateConnectionCommand` (or any other DCL command) to the `_dclManager`
|
||
actor. The only path that pushes DCL configuration to the DCL is
|
||
`EnsureDclConnections`, called exclusively from the deploy / startup-batch
|
||
paths against the **flattened instance configuration's** inline `Connections`
|
||
dictionary. There is no equivalent for an artifact-only update.
|
||
|
||
Concretely: an artifact deployment that changes a data connection's endpoint
|
||
URL, credentials, backup endpoint, or failover retry count is stored
|
||
durably in the site SQLite (so on the *next* node restart the site loads the
|
||
new config and `EnsureDclConnections` picks it up) but is silently inert until
|
||
either an instance using that connection is redeployed or the node restarts.
|
||
This contradicts the design's "after artifact deployment, the site is fully
|
||
self-contained" intent (Component-SiteRuntime.md, "System-Wide Artifact
|
||
Handling") — the runtime DCL keeps using the stale connection until a much
|
||
heavier trigger event occurs. It is also asymmetric with how
|
||
`SharedScripts` are handled in the same method: shared scripts are both
|
||
stored *and* recompiled into `_sharedScriptLibrary` on update so the change is
|
||
live immediately.
|
||
|
||
(SiteRuntime-010 fixed a related defect inside `EnsureDclConnections` — the
|
||
config-hash cache — but that's only consulted on the inline-config path; the
|
||
artifact-deployment path never reaches `EnsureDclConnections`.)
|
||
|
||
**Recommendation**
|
||
|
||
In the `DataConnections` branch of `HandleDeployArtifacts`, after the
|
||
`StoreDataConnectionDefinitionAsync` call, also send a
|
||
`CreateConnectionCommand` to `_dclManager` for each updated definition,
|
||
re-using the SiteRuntime-010 config hash so unchanged connections are skipped.
|
||
Alternatively, refactor `EnsureDclConnections` to accept a flat list of
|
||
`(name, protocol, configurationJson, backupConfigurationJson,
|
||
failoverRetryCount)` tuples that both the inline (`FlattenedConfiguration`)
|
||
and artifact paths can drive through it.
|
||
|
||
### SiteRuntime-022 — `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/AuditingDbCommand.cs:138` |
|
||
|
||
**Resolution (2026-05-28):** Took the recommended "expose a proper API surface" path (the SiteRuntime-006 precedent). Added an `internal DbConnection Inner => _inner;` accessor to `AuditingDbConnection`; both classes are `internal sealed` in the same assembly, so the accessor stays out of the public API. The `AuditingDbCommand.DbConnection` setter now unwraps an `AuditingDbConnection` via `auditing.Inner` instead of `Type.GetField("_inner", BindingFlags.NonPublic | BindingFlags.Instance)!.GetValue(...)`. No reflection, no `!.` null-forgiveness hiding a runtime crash, no static-analyzer/IL-trim noise — and the same module that enforces "no `System.Reflection` in scripts" no longer reflects internally. The getter's `_wrappingConnection ?? _inner.Connection` fallback was left as-is; addressing the `CreateDbCommand()` round-trip concern is a separate behavioural decision (the finding marked it secondary). Build clean; 302 SiteRuntime tests green.
|
||
|
||
**Description**
|
||
|
||
The `DbConnection` setter on `AuditingDbCommand` unwraps an
|
||
`AuditingDbConnection` value by reading its private `_inner` field via
|
||
reflection:
|
||
|
||
```csharp
|
||
set
|
||
{
|
||
_wrappingConnection = value;
|
||
_inner.Connection = value switch
|
||
{
|
||
AuditingDbConnection auditing => auditing.GetType()
|
||
.GetField("_inner", BindingFlags.Instance | BindingFlags.NonPublic)
|
||
!.GetValue(auditing) as DbConnection,
|
||
_ => value
|
||
};
|
||
}
|
||
```
|
||
|
||
This is the same encapsulation-violating anti-pattern that SiteRuntime-006
|
||
called out for the site repositories. A rename or refactor of
|
||
`AuditingDbConnection._inner` breaks the audit decorator at runtime (no
|
||
compile-time signal), the `!.` null-forgiving operator hides the crash, and
|
||
the reflective access trips static analyzers and IL trimming. More
|
||
problematically, the script trust model the same module enforces in
|
||
`ScriptCompilationService.ValidateTrustModel` explicitly forbids
|
||
`System.Reflection` in scripts — yet the auditing helper a script ends up
|
||
running through itself reaches via reflection into a sibling class. Both
|
||
classes are `internal sealed` in the same assembly, so this is purely a
|
||
self-imposed contract violation.
|
||
|
||
A second smaller concern in the same property: the getter returns
|
||
`_wrappingConnection ?? _inner.Connection`. If the caller obtains a command
|
||
via `AuditingDbConnection.CreateDbCommand()` and immediately reads
|
||
`cmd.Connection`, the getter returns the raw inner connection (not the
|
||
auditing wrapper), because `_wrappingConnection` is only populated when the
|
||
setter is later invoked. That's surprising and at odds with the class's
|
||
audit-everything intent — a script that round-trips a command through
|
||
`cmd.Connection` re-enters the un-audited path.
|
||
|
||
**Recommendation**
|
||
|
||
Expose the wrapped connection through a proper API surface. The simplest fix
|
||
that matches the SiteRuntime-006 precedent: add an
|
||
`internal DbConnection Inner { get; }` property to `AuditingDbConnection`
|
||
(both classes are `internal sealed`, so the property stays out of the public
|
||
surface) and replace the reflection switch with `auditing.Inner`. While
|
||
touching the property, also have the getter return `_wrappingConnection` even
|
||
on the synthesised CreateDbCommand path (e.g. set `_wrappingConnection` to
|
||
the parent connection inside `AuditingDbConnection.CreateDbCommand`).
|
||
|
||
### SiteRuntime-023 — `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/ScriptActor.cs:446`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:340`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:356`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs:444` |
|
||
|
||
**Resolution (2026-05-28):** All four call sites
|
||
(`ScriptActor.EvaluateCondition`, `AlarmActor.EvaluateRangeViolation`,
|
||
`AlarmActor.EvaluateRateOfChange`, `AlarmActor.EvaluateHiLo`) now pass
|
||
`CultureInfo.InvariantCulture` to `Convert.ToDouble`, so a string attribute
|
||
value like `"1.5"` parses identically regardless of the host's
|
||
`CurrentCulture`. For purely-numeric inputs the culture argument is a no-op.
|
||
No new test added — existing `ScriptActor` / `AlarmActor` evaluator tests
|
||
continue to pass and the behaviour is identical under the (existing CI's)
|
||
`en-US` locale.
|
||
|
||
**Description**
|
||
|
||
`ScriptActor.EvaluateCondition` and the three `AlarmActor` evaluators
|
||
(`EvaluateRangeViolation`, `EvaluateRateOfChange`, `EvaluateHiLo`) call
|
||
`Convert.ToDouble(value)` without specifying a culture. When `value` is a
|
||
string (a path that exists today — attribute values that arrive as JSON-
|
||
deserialized numbers can still surface as strings on some code paths,
|
||
particularly array values that are JSON-stringified at
|
||
`InstanceActor.HandleTagValueUpdate:377`), `Convert.ToDouble` parses against
|
||
`CultureInfo.CurrentCulture`. On a host whose locale uses a comma decimal
|
||
separator (German, French, most of continental Europe), `"1.5"` throws and
|
||
the condition / alarm silently degrades to its catch-fallthrough (returns
|
||
`false` for range/rate-of-change, keeps current level for HiLo, falls back to
|
||
string-compare for conditionals). The CLAUDE.md "All timestamps are UTC"
|
||
discipline is the equivalent rule for time; there is no equivalent invariant-
|
||
culture discipline applied to numeric parsing.
|
||
|
||
The exposure is bounded — most attribute values arrive as numeric primitives
|
||
from `TagValueUpdate.Value` or static `FlattenedConfiguration.Attributes`
|
||
(also typed) so the implicit-cast `Convert.ToDouble` path is hit. But the
|
||
string path is reachable via inbound API writes
|
||
(`RouteToSetAttributesRequest.AttributeValues` is `IReadOnlyDictionary<string,
|
||
string>`), via the JSON-array stringification at `HandleTagValueUpdate:377`,
|
||
and via static-override values loaded from SQLite (which are persisted as
|
||
strings — see `SetStaticOverrideAsync`).
|
||
|
||
**Recommendation**
|
||
|
||
Replace each `Convert.ToDouble(value)` with `Convert.ToDouble(value,
|
||
CultureInfo.InvariantCulture)`, or front-load a typed-numeric extraction
|
||
helper (`if (value is double d) return d; if (value is string s && double.TryParse(s,
|
||
NumberStyles.Float, CultureInfo.InvariantCulture, out var p)) return p;
|
||
return Convert.ToDouble(value, CultureInfo.InvariantCulture);`). The site is a
|
||
deterministic machine-control surface; condition evaluation must not depend
|
||
on the host's regional settings.
|
||
|
||
### SiteRuntime-024 — `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Tracking/OperationTrackingStore.cs:39`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Tracking/OperationTrackingStore.cs:360` |
|
||
|
||
**Resolution** — split reads from writes: the single owned
|
||
`_writeConnection` + `_writeGate` still serialises writers, but
|
||
`GetStatusAsync` now opens a fresh `SqliteConnection` per call against
|
||
the shared connection string (mirroring `SiteStorageService`) so reads
|
||
never block on an in-flight write. Sync `Dispose` was rewritten to NOT
|
||
bridge to async — the dispose-once flag is an `int` flipped with
|
||
`Interlocked.Exchange`, the synchronous path disposes
|
||
`_writeConnection` + `_writeGate` directly without acquiring the gate,
|
||
and `DisposeAsync` retains the gate-drain semantics for graceful
|
||
shutdown. Both paths are idempotent; the second call short-circuits via
|
||
the interlocked flag. Tests:
|
||
`SR024_ConcurrentReads_DoNotBlockOnInFlightWrite`,
|
||
`SR024_SyncDispose_DoesNotDeadlock_WhenInvokedFromFreshThread`, and
|
||
`SR024_AsyncDispose_DoesNotDeadlock_AndIsIdempotent`.
|
||
|
||
**Description**
|
||
|
||
`OperationTrackingStore` owns exactly one `SqliteConnection` and gates every
|
||
public method through a single `SemaphoreSlim(1, 1)`. The class XML comment
|
||
calls this out as deliberate ("the M3 brief calls out as 'cleaner than the M2
|
||
Channel<T> pipeline given the volume'"), and the *write* volume is genuinely
|
||
low — at most a handful of lifecycle rows per cached call. But on a busy site
|
||
the *read* path (`GetStatusAsync`) is called by every `Tracking.Status(id)`
|
||
invocation from every executing script, and reads are serialised through the
|
||
same gate as writes. A long-running write (e.g. a Roslyn-script-driven
|
||
`RecordTerminalAsync` competing with an SQLite checkpoint) holds the gate and
|
||
stalls every concurrent status query. SQLite supports concurrent readers with
|
||
a single writer in WAL mode; the gate forfeits that capability.
|
||
|
||
A separate concern in the same class: `Dispose()` calls
|
||
`DisposeAsyncCore().AsTask().GetAwaiter().GetResult()`. That is sync-over-
|
||
async — the very pattern SiteRuntime-008 was a finding for. If a caller
|
||
disposes the store from a synchronization context that does not allow
|
||
re-entrance (e.g. an `IHostedService.StopAsync` continuation observed on the
|
||
host's sync context, or a finalizer pumping on the thread pool with a stuck
|
||
continuation), the `.WaitAsync()` inside `DisposeAsyncCore` waits for a
|
||
continuation that will never run, and the dispose deadlocks. The async path
|
||
itself is correct; only the sync `Dispose()` wrapper is risky.
|
||
|
||
**Recommendation**
|
||
|
||
For the single-connection gate: split reads and writes into separate gates,
|
||
or — better — keep the writer single-connection and open a fresh read
|
||
connection (or pool of read connections) per `GetStatusAsync` call. SQLite
|
||
connections are cheap; the `SiteStorageService` precedent already uses per-
|
||
call connections on the read path. For `Dispose()`: prefer
|
||
`Dispose() { GC.SuppressFinalize(this); _connection.Dispose(); _gate.Dispose(); }`
|
||
without an awaited disposal, and have the `IAsyncDisposable.DisposeAsync`
|
||
path do the awaiting. If a synchronous disposable is genuinely needed, do
|
||
not bridge it through the async core — duplicate the dispose-once flag check
|
||
into a sync path that calls `_connection.Dispose()` directly.
|
||
|
||
### SiteRuntime-025 — `HandleSetStaticAttribute` persists unknown attribute names as static overrides
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:223`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:246` |
|
||
|
||
**Resolution (2026-05-28):** `HandleSetStaticAttribute` now rejects writes
|
||
whose `command.AttributeName` does not resolve against
|
||
`_configuration.Attributes`. The caller receives
|
||
`SetStaticAttributeResponse(Success: false, ErrorMessage: "Unknown attribute
|
||
'<name>'")`; no override is persisted, `_attributes` is not mutated, and no
|
||
synthetic `AttributeValueChanged` is published — eliminating the
|
||
in-memory-pollution, restart-resurrection, and debug-stream spam vectors.
|
||
Regression test
|
||
`InstanceActorSetAttributeTests.SetAttribute_UnknownAttribute_ReturnsFailureAndDoesNotPersistOverride`
|
||
exercises an inbound `SetStaticAttributeCommand` with an unknown name and
|
||
asserts the failure response, no DCL traffic, and an empty override row.
|
||
|
||
**Description**
|
||
|
||
`HandleSetStaticAttribute` resolves the target attribute against
|
||
`_configuration.Attributes` to decide whether to route the write to the DCL or
|
||
treat it as a static-override write. If the lookup fails (`resolved == null`),
|
||
`isDataSourced` is false, and execution falls through to
|
||
`HandleSetStaticAttributeCore` — which unconditionally:
|
||
|
||
1. inserts the bogus key into the in-memory `_attributes` dictionary,
|
||
2. publishes an `AttributeValueChanged` for the bogus key to the site stream
|
||
and to every child Script/Alarm actor,
|
||
3. persists a row in `static_attribute_overrides` for the bogus key, and
|
||
4. replies `Success = true` to the caller.
|
||
|
||
Concretely, an inbound API `Route.To().SetAttribute("notARealAttr", "x")`
|
||
returns success, pollutes the in-memory state with a key that no script can
|
||
legitimately observe (canonical-name lookup will not produce it), persists a
|
||
durable SQLite override row that survives restart, and (on every restart)
|
||
re-injects the polluting key via `HandleOverridesLoaded` at line 608. The
|
||
override is **not** reset on instance redeployment in the same way the
|
||
"genuine" overrides are — `ClearStaticOverridesAsync` does clear by
|
||
`instance_unique_name`, so the row is eventually cleaned, but only on a full
|
||
redeploy; in the meantime each restart resurrects it. The publish-to-stream
|
||
side effect also lets a hostile or buggy inbound caller spam debug-view
|
||
subscribers with synthetic attribute changes.
|
||
|
||
Worth flagging at Low: the inbound API surface is already authenticated and
|
||
the design assumes its callers are trusted. But the no-validation behaviour
|
||
contradicts the design doc's "Scripts can only read/write attributes on their
|
||
own instance" framing — an inbound API call inherits the same instance-scope
|
||
authority as a script, and the script trust model wouldn't sanction this.
|
||
|
||
**Recommendation**
|
||
|
||
In `HandleSetStaticAttribute`, when `resolved == null`, reply
|
||
`SetStaticAttributeResponse(Success: false,
|
||
ErrorMessage: $"Attribute '{command.AttributeName}' not found on instance
|
||
'{_instanceUniqueName}'")` instead of falling through to the override path.
|
||
Optionally also surface the existence check on the `RouteInboundApiSetAttributes`
|
||
fan-out so a multi-attribute write reports the offending key without rolling
|
||
back the others (the per-attribute `Ask` shape already supports a partial
|
||
failure response).
|
||
|
||
### SiteRuntime-026 — `ReplicationMessages.cs` public record types have no XML documentation
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:10`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:13`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:15`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:17`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:19`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:25`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:28`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:30`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:32`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Messages/ReplicationMessages.cs:34` |
|
||
|
||
**Description**
|
||
|
||
The ten public record types in `ReplicationMessages.cs`
|
||
(`ReplicateConfigDeploy`, `ReplicateConfigRemove`, `ReplicateConfigSetEnabled`,
|
||
`ReplicateArtifacts`, `ReplicateStoreAndForward`, `ApplyConfigDeploy`,
|
||
`ApplyConfigRemove`, `ApplyConfigSetEnabled`, `ApplyArtifacts`,
|
||
`ApplyStoreAndForward`) carry no XML documentation. The file header comment
|
||
groups them as "outbound" vs "inbound" but the individual records have no
|
||
`<summary>` and no parameter docs. The XML-doc baseline `1eb6e97` rolled out
|
||
across the rest of the module (the commit being reviewed is literally `docs:
|
||
add XML doc comments across src + Sister Projects section in CLAUDE.md`), so
|
||
this file is now the conspicuous outlier — and the `CommentChecker` skill
|
||
relied on by the `fixdocs` workflow will flag every record as missing docs.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `<summary>` per record naming the direction (outbound → peer / inbound
|
||
from peer) and what the operation replicates, and `<param>` docs for each
|
||
record parameter. Mirror the precedent in
|
||
`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/.../*.cs`. While there, consider sealing the
|
||
inbound vs outbound split with a marker base type (currently they're just
|
||
named conventionally) so `Receive<ReplicateXxx>` vs `Receive<ApplyXxx>` is
|
||
expressed at the type level — but that's optional and out of scope for a
|
||
docs-only finding.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Added a one-line `<summary>` to each of the ten records
|
||
(`ReplicateConfigDeploy`/`Remove`/`SetEnabled`/`Artifacts`/`StoreAndForward` and
|
||
`ApplyConfigDeploy`/`Remove`/`SetEnabled`/`Artifacts`/`StoreAndForward`) naming
|
||
the direction (outbound to peer / inbound from peer) and what is replicated.
|
||
The two pre-existing group-header XML blocks were converted to plain `//`
|
||
comments to avoid orphaned doc-summaries above the first record in each group.
|
||
Marker-base-type idea left out of scope.
|
||
|
||
### SiteRuntime-027 — `InstanceActor._latestAlarmEvents` grows without bound as native conditions churn
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:1007` |
|
||
|
||
**Description**
|
||
|
||
`HandleAlarmStateChanged` stores the latest enriched `AlarmStateChanged` per
|
||
`changed.AlarmName` into `_latestAlarmEvents` (line 1013) and **never removes
|
||
anything from that dictionary**. For computed alarms the key set is bounded (one
|
||
entry per configured alarm). For **native** alarms, however, the key is the
|
||
per-condition `SourceReference` (every `NativeAlarmActor.Emit` stamps
|
||
`AlarmName = t.SourceReference`). When a native condition fully runs its course —
|
||
`NativeAlarmActor.ApplyLiveTransition` drops it from the mirror once
|
||
`!Active && Acknowledged` (NativeAlarmActor.cs:243), or `EnforceCap` evicts the
|
||
oldest — the Instance Actor still holds the final `AlarmStateChanged` for that
|
||
`SourceReference` forever as a (Normal) entry. There is no message back to the
|
||
Instance Actor to evict it.
|
||
|
||
On an OPC UA A&C source that mints a fresh `SourceReference` per occurrence
|
||
(conditionId/branchId style references are common), the map accumulates one
|
||
permanently-retained entry per distinct condition the instance has *ever* seen.
|
||
The `MirroredAlarmCapPerSource` cap (default 1000) bounds the
|
||
`NativeAlarmActor._alarms` set but does **not** bound `_latestAlarmEvents`, which
|
||
is on the long-lived Instance Actor. Over weeks of uptime on a busy site this is a
|
||
steadily-growing per-instance memory leak, and `BuildAlarmStatesSnapshot`
|
||
(line 1055, `_latestAlarmEvents.Values.ToList()`) makes every DebugView snapshot
|
||
proportionally larger and slower, returning thousands of stale Normal rows.
|
||
|
||
**Recommendation**
|
||
|
||
Evict from `_latestAlarmEvents` when a native condition reaches its terminal
|
||
return-to-normal+dropped state. The cleanest signal is the `AlarmStateChanged`
|
||
the `NativeAlarmActor` already emits on drop-out: have the `NativeAlarmActor`
|
||
stamp a "final/dropped" marker (an additive bool on `AlarmStateChanged`, or a
|
||
dedicated `NativeAlarmDropped(sourceReference)` Tell to the parent), and have
|
||
`HandleAlarmStateChanged` `_latestAlarmEvents.Remove(...)` on it. Alternatively
|
||
cap/prune `_latestAlarmEvents` for native keys mirroring
|
||
`MirroredAlarmCapPerSource`. Computed-alarm keys must stay (they are
|
||
configuration-bounded).
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-06-20 (commit `fd618cf1`): added an additive `NativeAlarmDropped(SourceReference)` Tell emitted by `NativeAlarmActor` at all terminal-drop sites (snapshot-swap removal, retention drop, cap eviction) after the return-to-normal emit; `InstanceActor.HandleNativeAlarmDropped` removes the native key from `_latestAlarmEvents` (and `_alarmStates`/`_alarmTimestamps`). Computed-alarm keys are never dropped.
|
||
|
||
### SiteRuntime-028 — `NativeAlarmActor.EnforceCap` drops a condition without a return-to-normal, leaving a phantom Active alarm
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/NativeAlarmActor.cs:281` |
|
||
|
||
**Description**
|
||
|
||
`EnforceCap` evicts the oldest mirrored conditions once the per-source cap is
|
||
exceeded: it removes them from `_alarms`, calls `PersistDelete`, and logs the
|
||
eviction — but it does **not** `Emit` a return-to-normal `AlarmStateChanged` for
|
||
the dropped condition. If the evicted condition was still `Active`, the Instance
|
||
Actor's `_latestAlarmEvents` (and, downstream, central's gRPC stream and the
|
||
operator Alarm Summary page) keep showing it as **Active** indefinitely. The
|
||
mirror has silently forgotten the condition, so no later transition can ever clear
|
||
it — a phantom stuck-active alarm.
|
||
|
||
This is inconsistent with the sibling drop path `ApplySnapshotSwap`
|
||
(NativeAlarmActor.cs:205), which correctly emits `prior.Condition with { Active =
|
||
false }` for every condition that falls out of the new snapshot. The retention
|
||
drop in `ApplyLiveTransition` (line 243) is safe only because it drops *after*
|
||
emitting the condition's own already-inactive state; `EnforceCap` drops a
|
||
condition whose last-emitted state may still be Active, with no compensating emit.
|
||
|
||
The design doc explicitly states the cap eviction is "logged — there is no silent
|
||
truncation," but logging alone does not reconcile the operator-visible state: the
|
||
alarm view is silently wrong.
|
||
|
||
**Recommendation**
|
||
|
||
In `EnforceCap`, before removing each overflow condition, `Emit(a, a.Condition
|
||
with { Active = false })` (mirroring `ApplySnapshotSwap`) so the eviction produces
|
||
a return-to-normal on the stream and clears the phantom. Add a test asserting that
|
||
exceeding `MirroredAlarmCapPerSource` with an active oldest condition emits a
|
||
Normal `AlarmStateChanged` for the evicted `SourceReference`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-06-20 (commit `fd618cf1`): `NativeAlarmActor.EnforceCap` now emits `Emit(evicted, evicted.Condition with { Active = false })` for a still-active evicted condition before removing it, clearing the phantom stuck-Active alarm on the stream/UI. Cap-eviction test added.
|
||
|
||
### SiteRuntime-029 — A delete (or disable) arriving during a pending redeploy over-decrements the counter and is undone by `HandleTerminated`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:627`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:402` |
|
||
|
||
**Description**
|
||
|
||
`HandleDelete` and `HandleDisable` only consult `_instanceActors` to find the live
|
||
actor; neither consults the SiteRuntime-020 `_terminatingActorsByName` /
|
||
`_pendingRedeploys` bookkeeping. When a redeployment is in flight, `HandleDeploy`
|
||
has already removed the instance from `_instanceActors`, stopped the predecessor,
|
||
and buffered a `PendingRedeploy` keyed by the terminating ref. If a
|
||
`DeleteInstanceCommand` then arrives for that same instance *before* the
|
||
`Terminated` signal fires:
|
||
|
||
1. `_instanceActors.TryGetValue` misses (the entry was removed at redeploy time),
|
||
so no actor is stopped.
|
||
2. `_totalDeployedCount = Math.Max(0, _totalDeployedCount - 1)` runs anyway —
|
||
over-decrementing, because the redeploy path did not increment for an update
|
||
(the count had already been adjusted), so the deployed/disabled counts reported
|
||
to the health collector drift.
|
||
3. `RemoveDeployedConfigAsync` deletes the SQLite row and the deleter is told
|
||
success.
|
||
4. The buffered `_pendingRedeploys` entry is **untouched**, so when `Terminated`
|
||
fires, `HandleTerminated` calls `ApplyDeployment(..., isRedeploy: true)`, which
|
||
re-creates the Instance Actor and re-writes the deployed config to SQLite —
|
||
**resurrecting the instance the operator just deleted**, with the counter now
|
||
inconsistent.
|
||
|
||
`HandleDisable` has the milder version: the disable persists `is_enabled = false`,
|
||
but `HandleTerminated` then re-stores the config with `isEnabled: true` and
|
||
re-creates the actor, so a disable issued mid-redeploy is silently reverted to
|
||
enabled.
|
||
|
||
The window is the redeploy-termination interval — small, but reliably hit when
|
||
central issues a delete/disable immediately after a deploy (e.g. an operator
|
||
correcting a mistaken deploy, or an automated teardown), exactly the kind of rapid
|
||
command sequence SiteRuntime-020 was filed to harden.
|
||
|
||
**Recommendation**
|
||
|
||
Make `HandleDelete`/`HandleDisable` authoritative over the mid-redeploy state:
|
||
before falling through, check `_terminatingActorsByName`. On a hit, drop the
|
||
buffered `_pendingRedeploys` entry (so `HandleTerminated` does not resurrect),
|
||
clear the shadow, and for delete tell the buffered redeploy's `OriginalSender` it
|
||
was superseded — mirroring the last-write-wins handling already in `HandleDeploy`.
|
||
Only decrement `_totalDeployedCount` when an instance was actually present
|
||
(in `_instanceActors` **or** terminating). Add a regression test:
|
||
deploy → redeploy → delete-before-Terminated asserts the instance stays deleted
|
||
and the counter is correct.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-06-20 (commit `fd618cf1`): `HandleDelete`/`HandleDisable` now check `_terminatingActorsByName` before the `_instanceActors` fall-through; on a hit they drop the buffered `_pendingRedeploys` entry and clear the shadow so `HandleTerminated` can't resurrect the instance, and `_totalDeployedCount` is decremented only when an instance was actually present. Delete-during-redeploy race test added (verified failing pre-fix).
|
||
|
||
### SiteRuntime-030 — Native-alarm cap/retention-drop and `SiteReplicationActor` paths are untested
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Testing coverage |
|
||
| Status | Resolved |
|
||
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/NativeAlarmActorTests.cs`, `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` |
|
||
|
||
**Description**
|
||
|
||
`NativeAlarmActorTests` covers subscribe, raise, snapshot-swap return-to-normal,
|
||
out-of-order rejection, ack, and site-event emission, but there is **no** test for
|
||
`EnforceCap` (the per-source cap eviction) or for the `ApplyLiveTransition`
|
||
retention drop (`!Active && Acknowledged`). Both are non-trivial state
|
||
transitions, and the cap path harbours an observable defect (SiteRuntime-028) that
|
||
a targeted test would have caught. `SiteReplicationActor` remains entirely
|
||
untested — the carried-forward gap that SiteRuntime-016 explicitly deferred to a
|
||
clustered-ActorSystem harness, still outstanding at this commit (the actor calls
|
||
`Cluster.Get(Context.System)` in its constructor, so it needs a clustered HOCON
|
||
test host). The outbound forward, inbound apply, and `SendToPeer` no-peer-drop
|
||
behaviour are unverified.
|
||
|
||
**Recommendation**
|
||
|
||
Add `NativeAlarmActor` tests for (a) cap eviction emits a return-to-normal for an
|
||
evicted active condition (pairs with SiteRuntime-028) and (b) a resolved condition
|
||
(inactive+acked) drops out and deletes its SQLite row. Stand up the clustered
|
||
test harness SiteRuntime-016 called for and cover `SiteReplicationActor`'s
|
||
outbound/inbound/peer-tracking paths.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-06-20 (commit `fd618cf1`): native-alarm cap-eviction and retention-drop tests added (pairs with -027/-028). The clustered `SiteReplicationActor` test harness remains deferred (needs a clustered ActorSystem host, consistent with the prior -016 deferral).
|
||
|
||
### SiteRuntime-031 — Site still persists notification-list and SMTP config that the design moved to central-only
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:90`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:105`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1383`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1417` |
|
||
|
||
**Description**
|
||
|
||
The design decision (CLAUDE.md "External Integrations" / Component-NotificationService;
|
||
echoed in Component-SiteRuntime.md "System-Wide Artifact Handling": *"Notification
|
||
lists and SMTP configuration are central-only and are not deployed to sites"*) makes
|
||
notification delivery central-only — sites store-and-forward to central and never
|
||
talk to SMTP. Yet SiteRuntime still:
|
||
|
||
- creates the `notification_lists` and `smtp_configurations` SQLite tables
|
||
(`SiteStorageService.InitializeAsync`),
|
||
- writes them from `HandleDeployArtifacts` (`StoreNotificationListAsync` /
|
||
`StoreSmtpConfigurationAsync`, DeploymentManagerActor.cs:1383/1417) and the
|
||
replication path `SiteReplicationActor.HandleApplyArtifacts`, and
|
||
- the `DeployArtifactsCommand` contract still carries `NotificationLists` /
|
||
`SmtpConfigurations`.
|
||
|
||
The SMTP path persists the SMTP `password` field (SiteStorageService.cs:693) into
|
||
plaintext site SQLite. If central no longer populates these (per the design), this
|
||
is dead code carrying a latent secret-at-rest footprint; if central still does, the
|
||
site is storing SMTP credentials it must never use — both contradict the
|
||
central-only delivery decision. Either the code/tables are stale and should be
|
||
removed, or the design doc is stale and the decision needs re-stating. This straddles
|
||
the SiteRuntime/NotificationService boundary and the shared `DeployArtifactsCommand`
|
||
contract, so the direction is a design-owner call rather than a clean in-module fix.
|
||
|
||
**Recommendation**
|
||
|
||
Confirm with the design owner whether central still ships notification-list / SMTP
|
||
artifacts to sites. If not (the stated decision), remove the `notification_lists`
|
||
and `smtp_configurations` tables, the two `Store…Async` writers, the replication
|
||
branches, and the corresponding `DeployArtifactsCommand` fields (a coordinated
|
||
cross-module change). If the decision has changed, update Component-SiteRuntime.md
|
||
and Component-NotificationService.md and treat the plaintext SMTP password as a
|
||
secret-handling finding in its own right.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty.
|
||
|
||
## Re-review — 2026-06-24 (commit `1f9de8a2`)
|
||
|
||
Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator.
|
||
|
||
**Changes reviewed:** The diff lands four remediation slices: (1) SiteRuntime-029 — HandleDisable/HandleDelete now cancel a buffered mid-redeploy (telling the displaced deployer "superseded"), and HandleDelete gates the _totalDeployedCount decrement on a new wasPresent flag; (2) SiteRuntime-027 — a new NativeAlarmDropped message is Tell'd from NativeAlarmActor to InstanceActor on every terminal mirror drop (snapshot-swap, retention drop, cap eviction) so InstanceActor.HandleNativeAlarmDropped evicts the stale _latestAlarmEvents/_alarmStates/_alarmTimestamps key; (3) SiteRuntime-028 — EnforceCap now emits a return-to-normal for a still-active evicted condition before dropping it; (4) DeploymentManager-025/SiteRuntime-031 — both the primary (DeploymentManagerActor) and replica (SiteReplicationActor) artifact-apply paths stop persisting notification lists/SMTP config and instead call the new SiteStorageService.PurgeCentralOnlyNotificationConfigAsync to delete any pre-fix rows (including the plaintext SMTP password).
|
||
|
||
**Verdict:** Three of the four fixes are correct and well-tested: the native-alarm key-eviction (027), the cap-eviction return-to-normal (028), and the central-only notification/SMTP purge (031) are sound, message ordering is preserved (return-to-normal is Tell'd before NativeAlarmDropped on the same sender/receiver pair, so the stream sees the clear before the key is evicted), and the purge runs unconditionally and idempotently on both apply paths against tables that always exist. The SiteRuntime-029 redeploy/delete-race fix is correct for the case it targets (delete-during-redeploy, covered by a new passing test), but it over-corrected the counter guard: gating the _totalDeployedCount decrement on presence in _instanceActors-or-_terminatingActorsByName means deleting a DISABLED instance (which is counted in _totalDeployedCount yet absent from both maps) no longer decrements the count, leaking the deployed/disabled tally on the health dashboard. No test exercises the disable-then-delete count path, so the regression is uncaught. One Low doc-drift item: the SiteRuntime component doc's native-alarm retention/cap sections were not updated for the per-condition _latestAlarmEvents eviction (027) or the cap-eviction return-to-normal emit (028).
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Found: HandleDelete's new wasPresent guard fails to decrement _totalDeployedCount when deleting a disabled instance (absent from both maps but counted). Redeploy-race and native-alarm fixes are otherwise correct. |
|
||
| 2 | Akka.NET conventions | ☑ | NativeAlarmDropped uses Tell (hot path); single sender/receiver pair preserves ordering so return-to-normal precedes the drop signal. No captured sender/this in closures. PipeTo used for async results. No issues. |
|
||
| 3 | Concurrency & thread safety | ☑ | All map mutations (_pendingRedeploys, _terminatingActorsByName, _alarms, _latestAlarmEvents) occur on the actor thread; LogDeploymentEvent off-thread touches only readonly _serviceProvider. No issues. |
|
||
| 4 | Error handling & resilience | ☑ | Purge runs inside SiteReplicationActor's try/catch; native persistence stays fire-and-forget OnlyOnFaulted. Displaced deployers are told Failed-superseded rather than left waiting. No issues. |
|
||
| 5 | Security | ☑ | PurgeCentralOnlyNotificationConfigAsync proactively deletes any pre-fix plaintext SMTP password rows on every apply — a positive security remediation. DELETE statements are static SQL, no injection. No issues. |
|
||
| 6 | Performance & resource management | ☑ | SiteRuntime-027 eviction fixes an unbounded _latestAlarmEvents growth for sources that mint a fresh SourceReference per occurrence. Purge opens a short-lived SqliteConnection per apply (acceptable, low frequency). No issues. |
|
||
| 7 | Design-document adherence | ☑ | Central-only notification/SMTP purge matches the documented decision (sites never deliver). Cap-eviction now emits a final state change, aligning with the retention-drop contract. |
|
||
| 8 | Code organization & conventions | ☑ | NativeAlarmDropped placed in Messages/, NotifyParentDropped centralizes the Tell, wasPresent reads clearly. Comments are accurate and reference the issue IDs. No issues. |
|
||
| 9 | Testing coverage | ☑ | SiteRuntime-027/028 and delete-during-redeploy (SR029) have new passing tests. Gap: no test asserts _totalDeployedCount after disable-then-delete, so the counter-leak regression is uncaught; snapshot-swap drop signal not directly asserted. |
|
||
| 10 | Documentation & comments | ☑ | Code comments are strong. Doc drift: Component-SiteRuntime.md native-alarm retention/cap sections do not mention the new per-condition _latestAlarmEvents eviction (027) or the cap-eviction return-to-normal (028). |
|
||
|
||
**New findings from this re-review (2):**
|
||
|
||
### SiteRuntime-032 — Deleting a disabled instance leaks the deployed count
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:660-690` |
|
||
|
||
**Description**
|
||
|
||
The SiteRuntime-029 fix gates the _totalDeployedCount decrement on a new wasPresent flag that is set true only when the instance is live in _instanceActors OR mid-redeploy in _terminatingActorsByName (lines 661-684). But a DISABLED instance is in NEITHER map: HandleDisable removes it from _instanceActors (line 555) and never adds it to _terminatingActorsByName, yet it remains counted in _totalDeployedCount (startup sets _totalDeployedCount = msg.Configs.Count over ALL configs incl. disabled at line 271, and disable never decrements). So deleting a disabled instance hits the _terminatingActorsByName miss, then the _instanceActors miss, leaves wasPresent=false, and skips the decrement. The instance's deployed-config row is removed from SQLite but _totalDeployedCount stays too high — UpdateInstanceCounts then reports a phantom 'deployed=N, disabled=N-_instanceActors.Count' for an instance that no longer exists. The pre-fix code decremented unconditionally (clamped at 0), so this is a behavioral regression. Each disable→delete cycle accumulates the drift on the health dashboard. The original concern the guard targeted (a delete for a WHOLLY-UNKNOWN instance driving the count negative) was already mitigated by the existing Math.Max(0, ...) clamp.
|
||
|
||
_Severity note (orchestrator): recorded **Medium** rather than the reviewer's initial High. `_totalDeployedCount` feeds only `UpdateInstanceCounts()` -> the health collector's deployed/enabled/disabled tiles; no runtime behaviour is gated on it, and the drift self-heals on the next singleton restart/failover (startup reloads the count from SQLite via `_totalDeployedCount = msg.Configs.Count`). Impact is a monotonic over-count of the disabled tile, accumulating one per disable->delete cycle until restart — incorrect but observational, hence Medium._
|
||
|
||
**Recommendation**
|
||
|
||
Treat 'has a persisted deployed config' as present, not just 'in one of the two in-memory maps'. Either (a) check storage for an existing deployed-config row before deciding wasPresent (e.g. set wasPresent based on the RemoveDeployedConfigAsync result returning whether a row was removed), or (b) keep the unconditional Math.Max(0, _totalDeployedCount - 1) for the delete path (its clamp already prevents going negative) and rely on it; the disabled-instance case then decrements correctly. Add a regression test asserting _totalDeployedCount returns to 0 after deploy → disable → delete (no re-enable).
|
||
|
||
**Resolution**
|
||
|
||
Resolved in commit `9ab1c002` (2026-06-24): replaced the `_totalDeployedCount` int (gated on in-memory map presence) with an authoritative `_deployedInstanceNames` HashSet; the deployed/disabled health counts are derived from its size, so deleting a disabled instance (absent from both maps) now decrements correctly while a delete for a never-deployed instance is a no-op. Regression test `SR032_DeleteDisabledInstance_DecrementsDeployedCount` added.
|
||
|
||
### SiteRuntime-033 — Native-alarm doc stale re: per-condition eviction and cap return-to-normal
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `docs/requirements/Component-SiteRuntime.md:272` |
|
||
|
||
**Description**
|
||
|
||
The native-alarm spec was not updated for two behaviors this diff introduces. (1) The 'Per-source cap' bullet (line 272) states only that the oldest condition is dropped and 'the eviction is logged', but EnforceCap now also emits a return-to-normal AlarmStateChanged for a still-active evicted condition (SiteRuntime-028, NativeAlarmActor.cs:315-318) — the doc implies a silent drop with no state change. (2) The 'Latest-event retention' and 'Reset semantics' bullets (lines 291, 294) describe _latestAlarmEvents as cleared only on redeploy/undeploy, but the SiteRuntime-027 fix now evicts a condition's key per-drop (snapshot-swap, retention drop, cap eviction) via the new NativeAlarmDropped signal — the doc omits this per-condition eviction, which is the whole point of the memory-leak fix.
|
||
|
||
**Recommendation**
|
||
|
||
Update the Per-source cap bullet to note that an active evicted condition emits a final return-to-normal before being dropped (mirroring the retention-drop bullet at line 271), and update the retention/reset bullets to document that _latestAlarmEvents keys for native conditions are evicted per-condition when the condition leaves the mirror (via NativeAlarmDropped), not only on redeploy/undeploy.
|
||
|
||
**Resolution**
|
||
|
||
Resolved in commit `9ab1c002` (2026-06-24): updated `Component-SiteRuntime.md` — the per-source-cap bullet now documents the return-to-normal emit for an active evicted condition, and the reset-semantics bullet documents per-condition `_latestAlarmEvents` eviction via `NativeAlarmDropped` (snapshot-swap / retention drop / cap eviction), not only redeploy/undeploy.
|