docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+271 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 3 |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
@@ -109,6 +109,40 @@ _Re-review (2026-05-28, `1eb6e97`):_
## Findings
#### Re-review 2026-06-20 (commit `4307c381`) — full review
The module was re-reviewed in full at `4307c381` (current HEAD). The diff against the
prior baseline `1eb6e97` shows as 100% additions because of the ScadaBridge rename
(the project moved paths), so the whole module was re-read at its current state rather
than relying on the diff. Health is good: all prior findings 001-026 remain
Resolved/Deferred with no regressions observed in their fixed call sites
(SetAttribute DCL routing, the watch+buffer redeploy, the SiteRuntime-020 terminating
shadow, the OperationTrackingStore read/write split + safe Dispose, the
AuditingDb `Inner` accessor, invariant-culture numeric parsing, the
`_attributes` child-snapshot isolation). The new M7 surface (NativeAlarmActor,
CertStoreActor) and the WaitForAttribute/batch-write additions are generally
well-built. Five new findings were recorded, most of them in the native-alarm and
deployment-lifecycle bookkeeping: an unbounded `_latestAlarmEvents` map on the
Instance Actor (027, Medium), a phantom-active alarm left behind by
`NativeAlarmActor.EnforceCap` (028, Medium), a delete-during-pending-redeploy
that both over-decrements the deployed counter and resurrects the deleted
instance (029, Medium), missing tests for the cap/retention/replication paths
(030, Low), and stale site-side persistence of notification-list + SMTP config
that the design decision moved to central-only (031, Low).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | EnforceCap leaves phantom-active alarm (028); delete-during-redeploy over-decrements + resurrects (029). HiLo/RoC/edge logic otherwise sound. |
| 2 | Akka.NET conventions | ✓ | Supervision (Resume coordinators / Stop short-lived) correct; execution actors on ScriptExecutionScheduler; cert broadcast uses PipeTo, captures `sender` before async. Trigger-eval mailbox-block stays Deferred (014). |
| 3 | Concurrency & thread safety | ✓ | Child-snapshot isolation (017) intact; `_createdConnections` actor-confined via ApplyArtifact… dispatch (021); OperationTrackingStore read/write split (024). No new actor-thread violations. |
| 4 | Error handling & resilience | ✓ | Best-effort persistence/replication paths log-on-fault; deploy reports Success only post-persist (005). Native source-unavailable retains+marks-uncertain. |
| 5 | Security | ✓ | Trust verdict delegated to shared ScriptTrustValidator; CertStoreActor path-traversal-guarded; no reflection in scripts; SQL params captured per M4 design (redaction at write time). |
| 6 | Performance & resource management | ✓ | `_latestAlarmEvents` unbounded growth on native-alarm churn (027). ScriptExecutionScheduler background threads OK; per-call SQLite connections acceptable. |
| 7 | Design-document adherence | ✓ | Actor hierarchy / native-alarm wiring conform. Stale site persistence of notification-list + SMTP config vs. "central-only, not deployed to sites" (031). |
| 8 | Code organization & conventions | ✓ | Reflection anti-pattern eliminated (006/022); Options owned by component; additive message evolution honoured. |
| 9 | Testing coverage | ✓ | Broad new suites (NativeAlarmActor, WaitForAttribute, SetAttribute, ExecutionActor, scheduler). EnforceCap/retention-drop + SiteReplicationActor still untested (030). |
| 10 | Documentation & comments | ✓ | XML docs accurate and extensive; ReplicationMessages now documented (026). No new stale comments found. |
### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
| | |
@@ -1370,3 +1404,237 @@ the direction (outbound to peer / inbound from peer) and what is replicated.
The two pre-existing group-header XML blocks were converted to plain `//`
comments to avoid orphaned doc-summaries above the first record in each group.
Marker-base-type idea left out of scope.
### SiteRuntime-027 — `InstanceActor._latestAlarmEvents` grows without bound as native conditions churn
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:1007` |
**Description**
`HandleAlarmStateChanged` stores the latest enriched `AlarmStateChanged` per
`changed.AlarmName` into `_latestAlarmEvents` (line 1013) and **never removes
anything from that dictionary**. For computed alarms the key set is bounded (one
entry per configured alarm). For **native** alarms, however, the key is the
per-condition `SourceReference` (every `NativeAlarmActor.Emit` stamps
`AlarmName = t.SourceReference`). When a native condition fully runs its course —
`NativeAlarmActor.ApplyLiveTransition` drops it from the mirror once
`!Active && Acknowledged` (NativeAlarmActor.cs:243), or `EnforceCap` evicts the
oldest — the Instance Actor still holds the final `AlarmStateChanged` for that
`SourceReference` forever as a (Normal) entry. There is no message back to the
Instance Actor to evict it.
On an OPC UA A&C source that mints a fresh `SourceReference` per occurrence
(conditionId/branchId style references are common), the map accumulates one
permanently-retained entry per distinct condition the instance has *ever* seen.
The `MirroredAlarmCapPerSource` cap (default 1000) bounds the
`NativeAlarmActor._alarms` set but does **not** bound `_latestAlarmEvents`, which
is on the long-lived Instance Actor. Over weeks of uptime on a busy site this is a
steadily-growing per-instance memory leak, and `BuildAlarmStatesSnapshot`
(line 1055, `_latestAlarmEvents.Values.ToList()`) makes every DebugView snapshot
proportionally larger and slower, returning thousands of stale Normal rows.
**Recommendation**
Evict from `_latestAlarmEvents` when a native condition reaches its terminal
return-to-normal+dropped state. The cleanest signal is the `AlarmStateChanged`
the `NativeAlarmActor` already emits on drop-out: have the `NativeAlarmActor`
stamp a "final/dropped" marker (an additive bool on `AlarmStateChanged`, or a
dedicated `NativeAlarmDropped(sourceReference)` Tell to the parent), and have
`HandleAlarmStateChanged` `_latestAlarmEvents.Remove(...)` on it. Alternatively
cap/prune `_latestAlarmEvents` for native keys mirroring
`MirroredAlarmCapPerSource`. Computed-alarm keys must stay (they are
configuration-bounded).
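
The eviction described above can be sketched in isolation. This is a minimal model, not the actual `InstanceActor` code: `LatestAlarmCache` is a hypothetical stand-in for the actor's dictionary handling, the two records are simplified shapes assumed for illustration, and the Akka.NET message plumbing is left out.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins (assumed shapes); the real messages carry more fields.
public sealed record AlarmStateChanged(string AlarmName, bool Active);
public sealed record NativeAlarmDropped(string SourceReference);

// Hypothetical model of the Instance Actor's latest-event bookkeeping.
public sealed class LatestAlarmCache
{
    private readonly Dictionary<string, AlarmStateChanged> _latestAlarmEvents = new();
    private readonly HashSet<string> _computedAlarmNames;

    public LatestAlarmCache(IEnumerable<string> computedAlarmNames)
        => _computedAlarmNames = new HashSet<string>(computedAlarmNames);

    public int Count => _latestAlarmEvents.Count;

    // Keeps only the most recent event per alarm name / SourceReference.
    public void HandleAlarmStateChanged(AlarmStateChanged changed)
        => _latestAlarmEvents[changed.AlarmName] = changed;

    // Evicts a native key once the mirror reports a terminal drop.
    // Computed-alarm keys are configuration-bounded and are never evicted.
    public bool HandleNativeAlarmDropped(NativeAlarmDropped dropped)
    {
        if (_computedAlarmNames.Contains(dropped.SourceReference))
            return false;
        return _latestAlarmEvents.Remove(dropped.SourceReference);
    }
}
```

With this shape the map size tracks the number of live native conditions plus the configured computed alarms, rather than every condition the instance has ever seen.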
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added an additive `NativeAlarmDropped(SourceReference)` Tell emitted by `NativeAlarmActor` at all terminal-drop sites (snapshot-swap removal, retention drop, cap eviction) after the return-to-normal emit; `InstanceActor.HandleNativeAlarmDropped` removes the native key from `_latestAlarmEvents` (and `_alarmStates`/`_alarmTimestamps`). Computed-alarm keys are never dropped.
### SiteRuntime-028 — `NativeAlarmActor.EnforceCap` drops a condition without a return-to-normal, leaving a phantom Active alarm
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/NativeAlarmActor.cs:281` |
**Description**
`EnforceCap` evicts the oldest mirrored conditions once the per-source cap is
exceeded: it removes them from `_alarms`, calls `PersistDelete`, and logs the
eviction — but it does **not** `Emit` a return-to-normal `AlarmStateChanged` for
the dropped condition. If the evicted condition was still `Active`, the Instance
Actor's `_latestAlarmEvents` (and, downstream, central's gRPC stream and the
operator Alarm Summary page) keep showing it as **Active** indefinitely. The
mirror has silently forgotten the condition, so no later transition can ever clear
it — a phantom stuck-active alarm.
This is inconsistent with the sibling drop path `ApplySnapshotSwap`
(NativeAlarmActor.cs:205), which correctly emits `prior.Condition with { Active =
false }` for every condition that falls out of the new snapshot. The retention
drop in `ApplyLiveTransition` (line 243) is safe only because it drops *after*
emitting the condition's own already-inactive state; `EnforceCap` drops a
condition whose last-emitted state may still be Active, with no compensating emit.
The design doc explicitly states the cap eviction is "logged — there is no silent
truncation," but logging alone does not reconcile the operator-visible state: the
alarm view is silently wrong.
**Recommendation**
In `EnforceCap`, before removing each overflow condition, `Emit(a, a.Condition
with { Active = false })` (mirroring `ApplySnapshotSwap`) so the eviction produces
a return-to-normal on the stream and clears the phantom. Add a test asserting that
exceeding `MirroredAlarmCapPerSource` with an active oldest condition emits a
Normal `AlarmStateChanged` for the evicted `SourceReference`.
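
A rough sketch of the recommended ordering (emit the return-to-normal, then remove) follows. It is a simplified model, not the real `NativeAlarmActor`: `MirrorModel`, the `Condition` record, and the use of `LastChangeUtc` to decide which condition is "oldest" are all assumptions made for illustration, and persistence and logging are omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed, simplified condition shape for illustration only.
public sealed record Condition(string SourceReference, bool Active, DateTime LastChangeUtc);

// Hypothetical model of the per-source mirror and its cap enforcement.
public sealed class MirrorModel
{
    private readonly Dictionary<string, Condition> _alarms = new();
    private readonly int _cap;

    // Stands in for the AlarmStateChanged stream sent to the Instance Actor.
    public List<Condition> Emitted { get; } = new();

    public MirrorModel(int cap) => _cap = cap;

    public int Count => _alarms.Count;

    public void Upsert(Condition c)
    {
        _alarms[c.SourceReference] = c;
        Emitted.Add(c);
        EnforceCap();
    }

    private void EnforceCap()
    {
        int overflow = _alarms.Count - _cap;
        if (overflow <= 0) return;

        // Assumption: "oldest" means the earliest LastChangeUtc.
        var evictees = _alarms.Values.OrderBy(c => c.LastChangeUtc).Take(overflow).ToList();
        foreach (var evicted in evictees)
        {
            // Emit a return-to-normal first so no consumer is left holding
            // a phantom Active state for a condition the mirror has forgotten.
            if (evicted.Active)
                Emitted.Add(evicted with { Active = false });
            _alarms.Remove(evicted.SourceReference);
        }
    }
}
```

The same model doubles as the shape of the suggested test: fill the mirror past the cap with an active oldest condition and assert that the last emitted event for that `SourceReference` is inactive.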
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `NativeAlarmActor.EnforceCap` now emits `Emit(evicted, evicted.Condition with { Active = false })` for a still-active evicted condition before removing it, clearing the phantom stuck-Active alarm on the stream/UI. Cap-eviction test added.
### SiteRuntime-029 — A delete (or disable) arriving during a pending redeploy over-decrements the counter and is undone by `HandleTerminated`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:627`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:402` |
**Description**
`HandleDelete` and `HandleDisable` only consult `_instanceActors` to find the live
actor; neither consults the SiteRuntime-020 `_terminatingActorsByName` /
`_pendingRedeploys` bookkeeping. When a redeployment is in flight, `HandleDeploy`
has already removed the instance from `_instanceActors`, stopped the predecessor,
and buffered a `PendingRedeploy` keyed by the terminating ref. If a
`DeleteInstanceCommand` then arrives for that same instance *before* the
`Terminated` signal fires:
1. `_instanceActors.TryGetValue` misses (the entry was removed at redeploy time),
so no actor is stopped.
2. `_totalDeployedCount = Math.Max(0, _totalDeployedCount - 1)` runs anyway —
over-decrementing, because the redeploy path did not increment for an update
(the count had already been adjusted), so the deployed/disabled counts reported
to the health collector drift.
3. `RemoveDeployedConfigAsync` deletes the SQLite row and the deleter is told
success.
4. The buffered `_pendingRedeploys` entry is **untouched**, so when `Terminated`
fires, `HandleTerminated` calls `ApplyDeployment(..., isRedeploy: true)`, which
re-creates the Instance Actor and re-writes the deployed config to SQLite —
**resurrecting the instance the operator just deleted**, with the counter now
inconsistent.
`HandleDisable` has the milder version: the disable persists `is_enabled = false`,
but `HandleTerminated` then re-stores the config with `isEnabled: true` and
re-creates the actor, so a disable issued mid-redeploy is silently reverted to
enabled.
The window is the redeploy-termination interval — small, but reliably hit when
central issues a delete/disable immediately after a deploy (e.g. an operator
correcting a mistaken deploy, or an automated teardown), exactly the kind of rapid
command sequence SiteRuntime-020 was filed to harden.
**Recommendation**
Make `HandleDelete`/`HandleDisable` authoritative over the mid-redeploy state:
before falling through, check `_terminatingActorsByName`. On a hit, drop the
buffered `_pendingRedeploys` entry (so `HandleTerminated` does not resurrect),
clear the shadow, and for delete tell the buffered redeploy's `OriginalSender` it
was superseded — mirroring the last-write-wins handling already in `HandleDeploy`.
Only decrement `_totalDeployedCount` when an instance was actually present
(in `_instanceActors` **or** terminating). Add a regression test:
deploy → redeploy → delete-before-Terminated asserts the instance stays deleted
and the counter is correct.
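
The intended bookkeeping can be modelled without an actor system. The sketch below is illustrative only: `DeploymentBookkeeping` is a hypothetical class, instances are tracked by name in plain sets, and the pending redeploy is keyed by instance name rather than by the terminating actor ref that the real `DeploymentManagerActor` uses. Persistence and sender notification are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, actor-free model of the deploy/delete/Terminated bookkeeping.
public sealed class DeploymentBookkeeping
{
    private readonly HashSet<string> _instanceActors = new();
    private readonly HashSet<string> _terminatingActorsByName = new();
    private readonly Dictionary<string, string> _pendingRedeploys = new(); // name -> buffered config

    public int TotalDeployedCount { get; private set; }

    public bool IsLive(string name) => _instanceActors.Contains(name);

    public void Deploy(string name, string config)
    {
        if (_instanceActors.Remove(name))
        {
            // Redeploy: the predecessor is stopping; buffer until Terminated.
            _terminatingActorsByName.Add(name);
            _pendingRedeploys[name] = config;
            return;
        }
        if (_terminatingActorsByName.Contains(name))
        {
            _pendingRedeploys[name] = config; // last write wins
            return;
        }
        _instanceActors.Add(name);
        TotalDeployedCount++;
    }

    public void Delete(string name)
    {
        bool wasLive = _instanceActors.Remove(name);
        bool wasTerminating = _terminatingActorsByName.Remove(name); // clear the shadow
        if (wasTerminating)
            _pendingRedeploys.Remove(name); // so Terminated cannot resurrect it

        // Decrement only when an instance was actually present.
        if (wasLive || wasTerminating)
            TotalDeployedCount = Math.Max(0, TotalDeployedCount - 1);
    }

    public void Terminated(string name)
    {
        _terminatingActorsByName.Remove(name);
        if (_pendingRedeploys.Remove(name, out _))
            _instanceActors.Add(name); // apply the buffered redeploy
    }
}
```

Running deploy, redeploy, delete, then `Terminated` against this model leaves the instance absent and the counter at zero, which is the outcome the regression test above is meant to pin down.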
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `HandleDelete`/`HandleDisable` now check `_terminatingActorsByName` before the `_instanceActors` fall-through; on a hit they drop the buffered `_pendingRedeploys` entry and clear the shadow so `HandleTerminated` can't resurrect the instance, and `_totalDeployedCount` is decremented only when an instance was actually present. Delete-during-redeploy race test added (verified failing pre-fix).
### SiteRuntime-030 — Native-alarm cap/retention-drop and `SiteReplicationActor` paths are untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/NativeAlarmActorTests.cs`, `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` |
**Description**
`NativeAlarmActorTests` covers subscribe, raise, snapshot-swap return-to-normal,
out-of-order rejection, ack, and site-event emission, but there is **no** test for
`EnforceCap` (the per-source cap eviction) or for the `ApplyLiveTransition`
retention drop (`!Active && Acknowledged`). Both are non-trivial state
transitions, and the cap path harbours an observable defect (SiteRuntime-028) that
a targeted test would have caught. `SiteReplicationActor` remains entirely
untested — the carried-forward gap that SiteRuntime-016 explicitly deferred to a
clustered-ActorSystem harness, still outstanding at this commit (the actor calls
`Cluster.Get(Context.System)` in its constructor, so it needs a clustered HOCON
test host). The outbound forward, inbound apply, and `SendToPeer` no-peer-drop
behaviour are unverified.
**Recommendation**
Add `NativeAlarmActor` tests for (a) cap eviction emits a return-to-normal for an
evicted active condition (pairs with SiteRuntime-028) and (b) a resolved condition
(inactive+acked) drops out and deletes its SQLite row. Stand up the clustered
test harness SiteRuntime-016 called for and cover `SiteReplicationActor`'s
outbound/inbound/peer-tracking paths.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): native-alarm cap-eviction and retention-drop tests added (pairs with -027/-028). The clustered `SiteReplicationActor` test harness remains deferred (needs a clustered ActorSystem host, consistent with the prior -016 deferral).
### SiteRuntime-031 — Site still persists notification-list and SMTP config that the design moved to central-only
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:90`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:105`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1383`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1417` |
**Description**
The design decision (CLAUDE.md "External Integrations" / Component-NotificationService;
echoed in Component-SiteRuntime.md "System-Wide Artifact Handling": *"Notification
lists and SMTP configuration are central-only and are not deployed to sites"*) makes
notification delivery central-only — sites store-and-forward to central and never
talk to SMTP. Yet SiteRuntime still:
- creates the `notification_lists` and `smtp_configurations` SQLite tables
(`SiteStorageService.InitializeAsync`),
- writes them from `HandleDeployArtifacts` (`StoreNotificationListAsync` /
`StoreSmtpConfigurationAsync`, DeploymentManagerActor.cs:1383/1417) and the
replication path `SiteReplicationActor.HandleApplyArtifacts`, and
- the `DeployArtifactsCommand` contract still carries `NotificationLists` /
`SmtpConfigurations`.
The SMTP path persists the SMTP `password` field (SiteStorageService.cs:693) into
plaintext site SQLite. If central no longer populates these (per the design), this
is dead code carrying a latent secret-at-rest footprint; if central still does, the
site is storing SMTP credentials it must never use — both contradict the
central-only delivery decision. Either the code/tables are stale and should be
removed, or the design doc is stale and the decision needs re-stating. This straddles
the SiteRuntime/NotificationService boundary and the shared `DeployArtifactsCommand`
contract, so the direction is a design-owner call rather than a clean in-module fix.
**Recommendation**
Confirm with the design owner whether central still ships notification-list / SMTP
artifacts to sites. If not (the stated decision), remove the `notification_lists`
and `smtp_configurations` tables, the two `Store…Async` writers, the replication
branches, and the corresponding `DeployArtifactsCommand` fields (a coordinated
cross-module change). If the decision has changed, update Component-SiteRuntime.md
and Component-NotificationService.md and treat the plaintext SMTP password as a
secret-handling finding in its own right.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty.