docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` |
 | Design doc | `docs/requirements/Component-SiteRuntime.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 3 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |

 ## Summary

@@ -109,6 +109,40 @@ _Re-review (2026-05-28, `1eb6e97`):_

 ## Findings

+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+The module was re-reviewed in full at `4307c381` (current HEAD). The diff against the
+prior baseline `1eb6e97` shows as 100% additions because of the ScadaBridge rename
+(the project moved paths), so the whole module was re-read at its current state rather
+than relying on the diff. Health is good: all prior findings 001–026 remain
+Resolved/Deferred with no regressions observed in their fixed call sites
+(SetAttribute DCL routing, the watch+buffer redeploy, the SiteRuntime-020 terminating
+shadow, the OperationTrackingStore read/write split + safe Dispose, the
+AuditingDb `Inner` accessor, invariant-culture numeric parsing, the
+`_attributes` child-snapshot isolation). The new M7 surface (NativeAlarmActor,
+CertStoreActor) and the WaitForAttribute/batch-write additions are generally
+well-built. Five new findings were recorded, mostly in the native-alarm and
+deployment-lifecycle bookkeeping: an unbounded `_latestAlarmEvents` map on the
+Instance Actor (027, Medium), a phantom-active alarm left behind by
+`NativeAlarmActor.EnforceCap` (028, Medium), a delete-during-pending-redeploy
+that both over-decrements the deployed counter and resurrects the deleted
+instance (029, Medium), missing tests for the cap/retention/replication paths
+(030, Low), and stale site-side persistence of notification-list + SMTP config
+that the design decision moved to central-only (031, Low).
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | EnforceCap leaves phantom-active alarm (028); delete-during-redeploy over-decrements + resurrects (029). HiLo/RoC/edge logic otherwise sound. |
+| 2 | Akka.NET conventions | ✓ | Supervision (Resume coordinators / Stop short-lived) correct; execution actors on ScriptExecutionScheduler; cert broadcast uses PipeTo, captures `sender` before async. Trigger-eval mailbox-block stays Deferred (014). |
+| 3 | Concurrency & thread safety | ✓ | Child-snapshot isolation (017) intact; `_createdConnections` actor-confined via ApplyArtifact… dispatch (021); OperationTrackingStore read/write split (024). No new actor-thread violations. |
+| 4 | Error handling & resilience | ✓ | Best-effort persistence/replication paths log-on-fault; deploy reports Success only post-persist (005). Native source-unavailable retains+marks-uncertain. |
+| 5 | Security | ✓ | Trust verdict delegated to shared ScriptTrustValidator; CertStoreActor path-traversal-guarded; no reflection in scripts; SQL params captured per M4 design (redaction at write time). |
+| 6 | Performance & resource management | ✓ | `_latestAlarmEvents` unbounded growth on native-alarm churn (027). ScriptExecutionScheduler background threads OK; per-call SQLite connections acceptable. |
+| 7 | Design-document adherence | ✓ | Actor hierarchy / native-alarm wiring conform. Stale site persistence of notification-list + SMTP config vs. "central-only, not deployed to sites" (031). |
+| 8 | Code organization & conventions | ✓ | Reflection anti-pattern eliminated (006/022); Options owned by component; additive message evolution honoured. |
+| 9 | Testing coverage | ✓ | Broad new suites (NativeAlarmActor, WaitForAttribute, SetAttribute, ExecutionActor, scheduler). EnforceCap/retention-drop + SiteReplicationActor still untested (030). |
+| 10 | Documentation & comments | ✓ | XML docs accurate and extensive; ReplicationMessages now documented (026). No new stale comments found. |
+
 ### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer

 | | |
@@ -1370,3 +1404,237 @@ the direction (outbound to peer / inbound from peer) and what is replicated.
 The two pre-existing group-header XML blocks were converted to plain `//`
 comments to avoid orphaned doc-summaries above the first record in each group.
 Marker-base-type idea left out of scope.
+
+### SiteRuntime-027 — `InstanceActor._latestAlarmEvents` grows without bound as native conditions churn
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:1007` |
+
+**Description**
+
+`HandleAlarmStateChanged` stores the latest enriched `AlarmStateChanged` per
+`changed.AlarmName` into `_latestAlarmEvents` (line 1013) and **never removes
+anything from that dictionary**. For computed alarms the key set is bounded (one
+entry per configured alarm). For **native** alarms, however, the key is the
+per-condition `SourceReference` (every `NativeAlarmActor.Emit` stamps
+`AlarmName = t.SourceReference`). When a native condition fully runs its course —
+`NativeAlarmActor.ApplyLiveTransition` drops it from the mirror once
+`!Active && Acknowledged` (NativeAlarmActor.cs:243), or `EnforceCap` evicts the
+oldest — the Instance Actor still holds the final `AlarmStateChanged` for that
+`SourceReference` forever as a (Normal) entry. There is no message back to the
+Instance Actor to evict it.
+
+On an OPC UA A&C source that mints a fresh `SourceReference` per occurrence
+(conditionId/branchId style references are common), the map accumulates one
+permanently-retained entry per distinct condition the instance has *ever* seen.
+The `MirroredAlarmCapPerSource` cap (default 1000) bounds the
+`NativeAlarmActor._alarms` set but does **not** bound `_latestAlarmEvents`, which
+is on the long-lived Instance Actor. Over weeks of uptime on a busy site this is a
+steadily-growing per-instance memory leak, and `BuildAlarmStatesSnapshot`
+(line 1055, `_latestAlarmEvents.Values.ToList()`) makes every DebugView snapshot
+proportionally larger and slower, returning thousands of stale Normal rows.
+
+**Recommendation**
+
+Evict from `_latestAlarmEvents` when a native condition reaches its terminal
+state (returned to normal and dropped from the mirror). The cleanest signal is the
+`AlarmStateChanged` the `NativeAlarmActor` already emits on drop-out: have the
+`NativeAlarmActor` stamp a "final/dropped" marker (an additive bool on
+`AlarmStateChanged`, or a dedicated `NativeAlarmDropped(sourceReference)` Tell to
+the parent), and have `HandleAlarmStateChanged` call
+`_latestAlarmEvents.Remove(...)` on receipt. Alternatively, cap or prune the
+native keys in `_latestAlarmEvents`, mirroring `MirroredAlarmCapPerSource`.
+Computed-alarm keys must stay (they are configuration-bounded).
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added an additive `NativeAlarmDropped(SourceReference)` Tell emitted by `NativeAlarmActor` at all terminal-drop sites (snapshot-swap removal, retention drop, cap eviction) after the return-to-normal emit; `InstanceActor.HandleNativeAlarmDropped` removes the native key from `_latestAlarmEvents` (and `_alarmStates`/`_alarmTimestamps`). Computed-alarm keys are never dropped.
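The leak and the drop-message fix described above can be sketched with a small Python model. This is an illustration only, not the C# actor code: `InstanceModel`, `churn`, the `cond-N` references, and the `HighTemp` computed alarm are hypothetical names, and the sketch assumes the source mints a fresh `SourceReference` per occurrence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmStateChanged:
    alarm_name: str  # for native alarms this carries the SourceReference
    active: bool


@dataclass(frozen=True)
class NativeAlarmDropped:
    source_reference: str


class InstanceModel:
    """Toy stand-in for the Instance Actor's alarm bookkeeping (hypothetical)."""

    def __init__(self):
        self.latest_alarm_events = {}

    def handle_alarm_state_changed(self, changed):
        # Keep the latest event per alarm name; nothing is removed here.
        self.latest_alarm_events[changed.alarm_name] = changed

    def handle_native_alarm_dropped(self, dropped):
        # Evict the native key; pop() with a default tolerates repeats.
        self.latest_alarm_events.pop(dropped.source_reference, None)


def churn(instance, occurrences, send_drop):
    """Run `occurrences` native conditions through raise -> clear (-> drop)."""
    # One computed alarm: its key is configuration-bounded and must stay.
    instance.handle_alarm_state_changed(AlarmStateChanged("HighTemp", False))
    for i in range(occurrences):
        ref = f"cond-{i}"  # fresh SourceReference per occurrence (assumption)
        instance.handle_alarm_state_changed(AlarmStateChanged(ref, True))
        instance.handle_alarm_state_changed(AlarmStateChanged(ref, False))
        if send_drop:
            instance.handle_native_alarm_dropped(NativeAlarmDropped(ref))
    return len(instance.latest_alarm_events)


leaky = churn(InstanceModel(), 5000, send_drop=False)
bounded = churn(InstanceModel(), 5000, send_drop=True)
print(leaky, bounded)  # 5001 1
```

Without the drop message the map holds one entry per condition ever seen; with it, only the computed-alarm key survives.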
+
+### SiteRuntime-028 — `NativeAlarmActor.EnforceCap` drops a condition without a return-to-normal, leaving a phantom Active alarm
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/NativeAlarmActor.cs:281` |
+
+**Description**
+
+`EnforceCap` evicts the oldest mirrored conditions once the per-source cap is
+exceeded: it removes them from `_alarms`, calls `PersistDelete`, and logs the
+eviction — but it does **not** `Emit` a return-to-normal `AlarmStateChanged` for
+the dropped condition. If the evicted condition was still `Active`, the Instance
+Actor's `_latestAlarmEvents` (and, downstream, central's gRPC stream and the
+operator Alarm Summary page) keep showing it as **Active** indefinitely. The
+mirror has silently forgotten the condition, so no later transition can ever clear
+it — a phantom stuck-active alarm.
+
+This is inconsistent with the sibling drop path `ApplySnapshotSwap`
+(NativeAlarmActor.cs:205), which correctly emits `prior.Condition with { Active =
+false }` for every condition that falls out of the new snapshot. The retention
+drop in `ApplyLiveTransition` (line 243) is safe only because it drops *after*
+emitting the condition's own already-inactive state; `EnforceCap` drops a
+condition whose last-emitted state may still be Active, with no compensating emit.
+
+The design doc explicitly states the cap eviction is "logged — there is no silent
+truncation," but logging alone does not reconcile the operator-visible state: the
+alarm view is silently wrong.
+
+**Recommendation**
+
+In `EnforceCap`, before removing each overflow condition, `Emit(a, a.Condition
+with { Active = false })` (mirroring `ApplySnapshotSwap`) so the eviction produces
+a return-to-normal on the stream and clears the phantom. Add a test asserting that
+exceeding `MirroredAlarmCapPerSource` with an active oldest condition emits a
+Normal `AlarmStateChanged` for the evicted `SourceReference`.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `NativeAlarmActor.EnforceCap` now emits `Emit(evicted, evicted.Condition with { Active = false })` for a still-active evicted condition before removing it, clearing the phantom stuck-Active alarm on the stream/UI. Cap-eviction test added.
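A minimal Python sketch of the compensating emit, under stated assumptions: `NativeAlarmMirror`, `record`, and the `downstream` dictionary are hypothetical stand-ins for the mirror, the `Emit` path, and what the Instance Actor or UI would believe. "Oldest" is modelled as insertion order, which may differ from the real eviction ordering.

```python
from collections import OrderedDict


class NativeAlarmMirror:
    """Toy model of a capped per-source alarm mirror (not the C# actor)."""

    def __init__(self, cap, emit):
        self.cap = cap
        self.emit = emit             # callback: (source_reference, active)
        self.alarms = OrderedDict()  # source_reference -> active, oldest first

    def raise_condition(self, source_reference):
        self.alarms[source_reference] = True
        self.emit(source_reference, True)
        self.enforce_cap()

    def enforce_cap(self):
        while len(self.alarms) > self.cap:
            ref, active = next(iter(self.alarms.items()))  # peek at the oldest
            if active:
                # Return-to-normal before forgetting the condition, so no
                # downstream view is left showing it as Active.
                self.emit(ref, False)
            del self.alarms[ref]


downstream = {}  # last state seen downstream, keyed by source reference


def record(ref, active):
    downstream[ref] = active


mirror = NativeAlarmMirror(cap=2, emit=record)
for ref in ("A", "B", "C"):
    mirror.raise_condition(ref)

# Conditions downstream still believes are active but the mirror has forgotten.
stuck_active = [r for r, active in downstream.items()
                if active and r not in mirror.alarms]
print(stuck_active)  # []
```

Removing the `if active:` emit in `enforce_cap` makes `stuck_active` come back as `['A']`, which is the phantom alarm this finding describes.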
+
+### SiteRuntime-029 — A delete (or disable) arriving during a pending redeploy over-decrements the counter and is undone by `HandleTerminated`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:627`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:402` |
+
+**Description**
+
+`HandleDelete` and `HandleDisable` only consult `_instanceActors` to find the live
+actor; neither consults the SiteRuntime-020 `_terminatingActorsByName` /
+`_pendingRedeploys` bookkeeping. When a redeployment is in flight, `HandleDeploy`
+has already removed the instance from `_instanceActors`, stopped the predecessor,
+and buffered a `PendingRedeploy` keyed by the terminating ref. If a
+`DeleteInstanceCommand` then arrives for that same instance *before* the
+`Terminated` signal fires:
+
+1. `_instanceActors.TryGetValue` misses (the entry was removed at redeploy time),
+   so no actor is stopped.
+2. `_totalDeployedCount = Math.Max(0, _totalDeployedCount - 1)` runs anyway.
+   This over-decrements, because the redeploy path did not increment for an update
+   (the count had already been adjusted), so the deployed/disabled counts reported
+   to the health collector drift.
+3. `RemoveDeployedConfigAsync` deletes the SQLite row and the deleter is told
+   success.
+4. The buffered `_pendingRedeploys` entry is **untouched**, so when `Terminated`
+   fires, `HandleTerminated` calls `ApplyDeployment(..., isRedeploy: true)`, which
+   re-creates the Instance Actor and re-writes the deployed config to SQLite —
+   **resurrecting the instance the operator just deleted**, with the counter now
+   inconsistent.
+
+`HandleDisable` has a milder version of the same race: the disable persists `is_enabled = false`,
+but `HandleTerminated` then re-stores the config with `isEnabled: true` and
+re-creates the actor, so a disable issued mid-redeploy is silently reverted to
+enabled.
+
+The window is the redeploy-termination interval — small, but reliably hit when
+central issues a delete/disable immediately after a deploy (e.g. an operator
+correcting a mistaken deploy, or an automated teardown), exactly the kind of rapid
+command sequence SiteRuntime-020 was filed to harden.
+
+**Recommendation**
+
+Make `HandleDelete`/`HandleDisable` authoritative over the mid-redeploy state:
+before falling through, check `_terminatingActorsByName`. On a hit, drop the
+buffered `_pendingRedeploys` entry (so `HandleTerminated` does not resurrect),
+clear the shadow, and, for delete, tell the buffered redeploy's `OriginalSender`
+that it was superseded, mirroring the last-write-wins handling already in
+`HandleDeploy`. Only decrement `_totalDeployedCount` when an instance was actually
+present (in `_instanceActors` **or** terminating). Add a regression test for the
+sequence deploy → redeploy → delete-before-Terminated, asserting that the instance
+stays deleted and the counter is correct.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `HandleDelete`/`HandleDisable` now check `_terminatingActorsByName` before the `_instanceActors` fall-through; on a hit they drop the buffered `_pendingRedeploys` entry and clear the shadow so `HandleTerminated` can't resurrect the instance, and `_totalDeployedCount` is decremented only when an instance was actually present. Delete-during-redeploy race test added (verified failing pre-fix).
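The race and the described fix can be modelled in a few lines of Python. This is a sketch under assumptions, not the actual `DeploymentManagerActor`: the class, its fields, the integer actor ids, and the `pump-1` instance name are hypothetical, and the counter semantics (a redeploy leaves the count unchanged; a delete decrements once if the instance was live or terminating) are a simplification of the behaviour described in the resolution.

```python
class DeploymentManagerModel:
    """Toy model of deploy / delete / Terminated bookkeeping (hypothetical)."""

    def __init__(self):
        self.instance_actors = {}      # name -> live actor id
        self.terminating_by_name = {}  # name -> actor id being stopped
        self.pending_redeploys = {}    # terminating actor id -> (name, config)
        self.total_deployed = 0
        self._next_id = 0

    def _spawn(self, name):
        self._next_id += 1
        self.instance_actors[name] = self._next_id

    def deploy(self, name, config):
        if name in self.instance_actors:
            # Redeploy: stop the predecessor and buffer until Terminated.
            old = self.instance_actors.pop(name)
            self.terminating_by_name[name] = old
            self.pending_redeploys[old] = (name, config)
        elif name in self.terminating_by_name:
            # Already mid-redeploy: last write wins on the buffered config.
            self.pending_redeploys[self.terminating_by_name[name]] = (name, config)
        else:
            self._spawn(name)
            self.total_deployed += 1

    def delete(self, name):
        was_present = False
        old = self.terminating_by_name.pop(name, None)
        if old is not None:
            # Cancel the buffered redeploy so Terminated cannot resurrect it.
            self.pending_redeploys.pop(old, None)
            was_present = True
        if self.instance_actors.pop(name, None) is not None:
            was_present = True
        if was_present:
            self.total_deployed = max(0, self.total_deployed - 1)

    def terminated(self, actor_id):
        pending = self.pending_redeploys.pop(actor_id, None)
        if pending is None:
            return  # nothing buffered (for example, cancelled by a delete)
        name, _config = pending
        self.terminating_by_name.pop(name, None)
        self._spawn(name)


m = DeploymentManagerModel()
m.deploy("pump-1", "v1")
m.deploy("pump-1", "v2")               # redeploy, predecessor now terminating
old_id = m.terminating_by_name["pump-1"]
m.delete("pump-1")                     # arrives before Terminated
m.terminated(old_id)                   # late Terminated signal
print("pump-1" in m.instance_actors, m.total_deployed)  # False 0
```

If `delete` skipped the `terminating_by_name` check, the late `terminated` call would re-spawn `pump-1`, which is the resurrection the finding describes.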
+
+### SiteRuntime-030 — Native-alarm cap/retention-drop and `SiteReplicationActor` paths are untested
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/NativeAlarmActorTests.cs`, `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` |
+
+**Description**
+
+`NativeAlarmActorTests` covers subscribe, raise, snapshot-swap return-to-normal,
+out-of-order rejection, ack, and site-event emission, but there is **no** test for
+`EnforceCap` (the per-source cap eviction) or for the `ApplyLiveTransition`
+retention drop (`!Active && Acknowledged`). Both are non-trivial state
+transitions, and the cap path harbours an observable defect (SiteRuntime-028) that
+a targeted test would have caught. `SiteReplicationActor` remains entirely
+untested — the carried-forward gap that SiteRuntime-016 explicitly deferred to a
+clustered-ActorSystem harness, still outstanding at this commit (the actor calls
+`Cluster.Get(Context.System)` in its constructor, so it needs a clustered HOCON
+test host). The outbound forward, inbound apply, and `SendToPeer` no-peer-drop
+behaviour are unverified.
+
+**Recommendation**
+
+Add `NativeAlarmActor` tests for (a) cap eviction emits a return-to-normal for an
+evicted active condition (pairs with SiteRuntime-028) and (b) a resolved condition
+(inactive+acked) drops out and deletes its SQLite row. Stand up the clustered
+test harness SiteRuntime-016 called for and cover `SiteReplicationActor`'s
+outbound/inbound/peer-tracking paths.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): native-alarm cap-eviction and retention-drop tests added (pairs with -027/-028). The clustered `SiteReplicationActor` test harness remains deferred (needs a clustered ActorSystem host, consistent with the prior -016 deferral).
+
+### SiteRuntime-031 — Site still persists notification-list and SMTP config that the design moved to central-only
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:90`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:105`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1383`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1417` |
+
+**Description**
+
+The design decision (CLAUDE.md "External Integrations" / Component-NotificationService;
+echoed in Component-SiteRuntime.md "System-Wide Artifact Handling": *"Notification
+lists and SMTP configuration are central-only and are not deployed to sites"*) makes
+notification delivery central-only — sites store-and-forward to central and never
+talk to SMTP. Yet SiteRuntime still:
+
+- creates the `notification_lists` and `smtp_configurations` SQLite tables
+  (`SiteStorageService.InitializeAsync`),
+- writes them from `HandleDeployArtifacts` (`StoreNotificationListAsync` /
+  `StoreSmtpConfigurationAsync`, DeploymentManagerActor.cs:1383/1417) and the
+  replication path `SiteReplicationActor.HandleApplyArtifacts`, and
+- carries `NotificationLists` / `SmtpConfigurations` in the
+  `DeployArtifactsCommand` contract.
+
+The SMTP path persists the SMTP `password` field (SiteStorageService.cs:693) as
+plaintext in site SQLite. If central no longer populates these (per the design), this
+is dead code carrying a latent secret-at-rest footprint; if central still does, the
+site is storing SMTP credentials it must never use — both contradict the
+central-only delivery decision. Either the code/tables are stale and should be
+removed, or the design doc is stale and the decision needs re-stating. This straddles
+the SiteRuntime/NotificationService boundary and the shared `DeployArtifactsCommand`
+contract, so the direction is a design-owner call rather than a clean in-module fix.
+
+**Recommendation**
+
+Confirm with the design owner whether central still ships notification-list / SMTP
+artifacts to sites. If not (the stated decision), remove the `notification_lists`
+and `smtp_configurations` tables, the two `Store…Async` writers, the replication
+branches, and the corresponding `DeployArtifactsCommand` fields (a coordinated
+cross-module change). If the decision has changed, update Component-SiteRuntime.md
+and Component-NotificationService.md and treat the plaintext SMTP password as a
+secret-handling finding in its own right.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty.