docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 5 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |

 ## Summary

@@ -116,6 +116,37 @@ DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from

 ## Findings

+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+The 2026-06-20 full re-review walked all 10 checklist categories against the current
+source, focused on the M7 native-alarm subscribe path, the OPC UA node-browser
+(BrowseNext/search/type-info), and the verify-endpoint + site-local cert-trust surface.
+The M7 surface is well-built overall — the verify-endpoint probe correctly *captures
+but never trusts* an untrusted server certificate, the per-node `CertStoreActor`
+broadcast keeps both site nodes' PKI stores consistent across failover, and the
+node-browser paging/search/type-info additions are clean. All 22 prior findings remain
+`Resolved` and their fixes were verified in place. The review found **4 new findings**,
+all clustering on the native-alarm subscribe path, which did **not** inherit the
+guards the tag-subscribe path earned across DCL-018 / DCL-021 / DCL-022: the alarm
+path leaks its adapter feed on a mid-flight unsubscribe (no DCL-021-style obsolete-
+completion guard), has no alarm-resolution retry to match the tag path's
+`tag-resolution-retry` timer, leaks `_alarmCts` on `DisposeAsync` (only
+`DisconnectAsync` tears it down), and carries a stale "first subscriber wins" comment
+on a last-wins filter assignment.
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | No new issues; DCL-016/020/021 counter/response fixes verified. Alarm-path snapshot atomic-swap + per-source prefix routing are correct. |
+| 2 | Akka.NET conventions | x | No new issues; DCL-022 `IsTimerActive` gate verified. Alarm subscribe uses `ContinueWith(...).PipeTo(Self)` with `Self`/generation captured (no `Sender`/`this` in closures). |
+| 3 | Concurrency & thread safety | x | Finding 023 — native-alarm mid-flight unsubscribe leaks the adapter alarm feed (the tag path's DCL-021 obsolete-completion guard was not mirrored on the alarm path). |
+| 4 | Error handling & resilience | x | Finding 024 — no alarm-resolution retry: a failed initial alarm subscribe strands the subscriber with no feed until the next full reconnect (the tag path retries on `tag-resolution-retry`). |
+| 5 | Security | x | No new issues. The M7 verify-endpoint probe builds a temporary `RealOpcUaClient`, captures the untrusted server cert, and NEVER trusts it; cert trust is gated through `CertStoreActor` broadcast to both nodes. DCL-012/014 auto-accept warning + Commons default remain out-of-scope follow-ups. |
+| 6 | Performance & resource management | x | Finding 025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts` (alarm task + CTS leak on failover/stop); `DisconnectAsync` already cancels+disposes it under `_alarmLock`. |
+| 7 | Design-document adherence | x | No new issues; M7 native-alarm read-only mirror, `AlarmKind` discriminator, and per-source `IAlarmSubscribableConnection` feed match the design. DCL-009 doc action still open at doc level (out of scope). |
+| 8 | Code organization & conventions | x | No issues — alarm messages/POCOs in Commons, `IAlarmSubscribableConnection` capability seam in the module, options class owned by component. |
+| 9 | Testing coverage | x | DCL-001 through DCL-022 regression tests present. Gaps for findings 023 (alarm unsubscribe mid-flight), 024 (failed alarm subscribe → no retry), 025 (`DisposeAsync` alarm-CTS leak). |
+| 10 | Documentation & comments | x | Finding 028 — the `_alarmSourceFilter` "first subscriber wins" XML comment contradicts the last-wins assignment in `HandleSubscribeAlarms`. |
+
 ### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread

 | | |
@@ -1267,3 +1298,209 @@ calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply
 same gate at both call sites. Add a regression test that fires 5 subscribes with
 unresolved tags within one retry interval and asserts the retry fires at most one
 interval after the first failure (not after the fifth subscribe).
+
+### DataConnectionLayer-023 — Native-alarm mid-flight unsubscribe leaks the adapter alarm feed
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1773,1790-1809,1864-1882` |
+
+**Description**
+
+The native-alarm subscribe path mirrors the tag-subscribe path's
+dispatch-I/O-then-`PipeTo(Self)` shape but never inherited the obsolete-completion
+guard the tag path earned in DCL-021. `HandleSubscribeAlarms` adds the source to
+`_alarmSubscribesInFlight` (line 1773) and dispatches
+`alarmable.SubscribeAlarmsAsync(...)` whose `ContinueWith` pipes an
+`AlarmSubscribeCompleted` back to `Self`. If an `UnsubscribeAlarmsRequest` for the
+last (or only) subscriber is processed on the actor thread between that dispatch and
+the completion, `HandleUnsubscribeAlarms` (lines 1864-1882) removes the subscriber,
+empties `_alarmSourceSubscribers`, and removes the filter entries. Its attempt to tear
+down the adapter feed via `_alarmSubscriptionIds.Remove(...)` finds **no entry**,
+because the subscription id has not been stored yet (it is still in flight), so no
+adapter unsubscribe is issued. It also
+leaves the stale `_alarmSubscribesInFlight` entry in place. The late
+`AlarmSubscribeCompleted` then reaches `HandleAlarmSubscribeCompleted` (lines
+1790-1809), which **unconditionally** stores `_alarmSubscriptionIds[msg.SourceReference]
+= msg.SubscriptionId` (line 1796) without re-checking whether any subscriber still
+exists in `_alarmSourceSubscribers`.
+
+The net result is an orphaned adapter alarm feed. The OPC UA monitored-item set (or the
+MxGateway gateway-wide alarm stream) stays alive for a source with **zero** subscribers
+and keeps streaming transitions into `HandleAlarmTransitionReceived`, where they match
+no subscriber set and are silently dropped. The feed is never torn down for the
+lifetime of the adapter: a full `ReSubscribeAllAlarms` on reconnect does clear
+`_alarmSubscriptionIds`, but it re-subscribes from `_alarmSourceSubscribers`, which no
+longer contains the orphaned source, so the stale device-side feed is abandoned rather
+than closed. The
+tag path closes exactly this race in `HandleSubscribeCompleted` (lines 834-867) by
+releasing the just-created adapter handle and clearing `_subscribesInFlight` when the
+instance entry has gone; the alarm path does neither. This matters because native-alarm
+sources are created/destroyed on every deploy/undeploy and instance stop, so each
+mid-flight unsubscribe permanently leaks one device-side alarm subscription and its
+publish traffic.
+
+**Recommendation**
+
+Mirror the closed DCL-021 fix on the alarm path. In `HandleAlarmSubscribeCompleted`,
+before storing the id, guard on the live subscriber set: `if
+(!_alarmSourceSubscribers.ContainsKey(msg.SourceReference)) {
+if (msg.Success && msg.SubscriptionId != null && _adapter is
+IAlarmSubscribableConnection alarmable) _ =
+alarmable.UnsubscribeAlarmsAsync(msg.SubscriptionId); return; }` so a feed that
+completed after its last subscriber left is released at the adapter rather than stored.
+Additionally clear the `_alarmSubscribesInFlight` entry in `HandleUnsubscribeAlarms`
+when the last subscriber leaves, so the in-flight marker does not linger. Add a
+regression test that subscribes a source, sends `UnsubscribeAlarmsRequest` while the
+alarm subscribe I/O is in flight, completes the subscribe, and asserts
+`UnsubscribeAlarmsAsync` is called and `_alarmSubscriptionIds` /
+`_alarmSubscribesInFlight` are clean.
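+
+The guard can be sketched outside Akka.NET as a plain bookkeeping class. This is a
+minimal illustration, not the actor's code: the field names echo the ones cited above,
+but `AlarmBookkeeping`, its method names, the string-typed ids, and the
+`unsubscribeAtAdapter` callback are hypothetical stand-ins for the actor handlers and
+`UnsubscribeAlarmsAsync`.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical, Akka-free model of the alarm bookkeeping (illustrative only).
+sealed class AlarmBookkeeping
+{
+    private readonly Dictionary<string, HashSet<string>> _alarmSourceSubscribers = new();
+    private readonly Dictionary<string, string> _alarmSubscriptionIds = new();
+    private readonly HashSet<string> _alarmSubscribesInFlight = new();
+    private readonly Action<string> _unsubscribeAtAdapter; // stand-in for UnsubscribeAlarmsAsync
+
+    public AlarmBookkeeping(Action<string> unsubscribeAtAdapter) =>
+        _unsubscribeAtAdapter = unsubscribeAtAdapter;
+
+    public void Subscribe(string source, string subscriber)
+    {
+        if (!_alarmSourceSubscribers.TryGetValue(source, out var set))
+            _alarmSourceSubscribers[source] = set = new HashSet<string>();
+        set.Add(subscriber);
+        _alarmSubscribesInFlight.Add(source); // adapter I/O would be dispatched here
+    }
+
+    public void Unsubscribe(string source, string subscriber)
+    {
+        if (!_alarmSourceSubscribers.TryGetValue(source, out var set)) return;
+        set.Remove(subscriber);
+        if (set.Count > 0) return;
+        _alarmSourceSubscribers.Remove(source);
+        _alarmSubscribesInFlight.Remove(source); // do not let the in-flight marker linger
+        if (_alarmSubscriptionIds.Remove(source, out var id)) _unsubscribeAtAdapter(id);
+    }
+
+    public void OnSubscribeCompleted(string source, bool success, string? subscriptionId)
+    {
+        _alarmSubscribesInFlight.Remove(source);
+        if (!_alarmSourceSubscribers.ContainsKey(source))
+        {
+            // Obsolete completion: the last subscriber left while the subscribe was
+            // in flight, so release the feed instead of storing its id.
+            if (success && subscriptionId != null) _unsubscribeAtAdapter(subscriptionId);
+            return;
+        }
+        if (success && subscriptionId != null) _alarmSubscriptionIds[source] = subscriptionId;
+    }
+
+    public bool IsClean =>
+        _alarmSubscriptionIds.Count == 0 && _alarmSubscribesInFlight.Count == 0;
+}
+
+static class Program
+{
+    static void Main()
+    {
+        var released = new List<string>();
+        var state = new AlarmBookkeeping(released.Add);
+        state.Subscribe("src-1", "instance-A");
+        state.Unsubscribe("src-1", "instance-A");            // last subscriber leaves mid-flight
+        state.OnSubscribeCompleted("src-1", true, "sub-42"); // late completion arrives
+        // prints: released=sub-42 clean=True
+        Console.WriteLine($"released={string.Join(",", released)} clean={state.IsClean}");
+    }
+}
+```
+
+The sequence in `Main` is the same one the regression test needs to drive: subscribe,
+unsubscribe before completion, then deliver the completion.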
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `HandleAlarmSubscribeCompleted` now guards against an orphaned completion (if no live subscriber remains for the source, it tears down the just-created adapter subscription and returns) and `HandleUnsubscribeAlarms` clears the in-flight marker when the last subscriber leaves — mirroring the DCL-021 tag-path fix. Regression test added (verified failing pre-fix).
+
+### DataConnectionLayer-024 — No alarm-resolution retry: a failed initial alarm subscribe strands the subscriber until the next full reconnect
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1752-1757,1790-1809,1885-1908` |
+
+**Description**
+
+When an initial native-alarm subscribe fails, the subscriber is left registered but
+with no feed and no path to recovery short of a full connection reconnect.
+`HandleSubscribeAlarms` registers the subscriber in `_alarmSourceSubscribers` (lines
+1752-1757) **before** issuing the adapter subscribe, so a transition arriving
+mid-subscribe is still routed. If `alarmable.SubscribeAlarmsAsync` fails, the piped
+`AlarmSubscribeCompleted` reaches `HandleAlarmSubscribeCompleted` (lines 1790-1809),
+which on the failure branch only logs a warning (line 1801) and replies with a
+`SubscribeAlarmsResponse(success: false, ...)`. It does **not** re-arm any retry, does
+**not** push `NativeAlarmSourceUnavailable` to the subscriber, and does **not** drop
+the source from `_alarmSourceSubscribers`. The subscriber therefore stays in the
+routing map with the adapter feed never established.
+
+Recovery only happens via `ReSubscribeAllAlarms` (lines 1885-1908), which is invoked
+solely from `BecomeConnected` after a full connection reconnect (line 583). So a
+transient device-side failure on the *initial* alarm subscribe — the alarm server
+still booting, a momentary fault — leaves the source dark until the entire connection
+happens to cycle through `Reconnecting`. This diverges from the tag path, which arms a
+periodic `tag-resolution-retry` timer (lines 989-993, 1678-1682; fired by
+`HandleRetryTagResolution` at line 1491) so a tag that fails to resolve at subscribe
+time is retried every `TagResolutionRetryInterval` (10 s default) without needing a
+reconnect. The native-alarm source has no analogous self-healing.
+
+This is partly a **design judgment call** rather than a clear-cut defect: the M7 design
+models native alarms as a read-only mirror, and "drop the source and signal the
+subscriber" is an equally valid contract to "retry the alarm subscribe periodically."
+The defect is that the code currently does **neither** — it silently leaves a
+registered subscriber with a permanently-dark feed and no signal — which is the worst
+of both.
+
+**Recommendation**
+
+Pick one of the two valid contracts and implement it (this is a judgment call — do not
+assume which is wanted):
+
+- **Retry:** add an `alarm-resolution-retry` periodic timer (mirroring
+  `tag-resolution-retry`): track sources whose subscribe failed, re-issue
+  `SubscribeAlarmsAsync` on the timer, and gate the timer start with
+  `IsTimerActive` (per the DCL-022 pattern) so a burst of failed subscribes does not
+  reset it.
+- **Fail fast:** on the failure branch of `HandleAlarmSubscribeCompleted`, push
+  `NativeAlarmSourceUnavailable` to the subscriber(s) for that source and remove the
+  source from `_alarmSourceSubscribers` / `_alarmSubscribesInFlight`, so the
+  `NativeAlarmActor` learns the feed is unavailable rather than waiting forever.
+
+Either way, add a regression test for "initial alarm subscribe fails → subscriber is
+not silently left dark."
+
+**Resolution**
+
+Deferred 2026-06-20: the fix (add an alarm-resolution retry timer vs. push `NativeAlarmSourceUnavailable` and drop the source on initial-subscribe failure) is a design-owner decision; the current code does neither. Awaiting that decision before implementing.
+
+### DataConnectionLayer-025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts`, leaking the alarm task on failover/stop
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/MxGatewayDataConnection.cs:277-283,121-134` |
+
+**Description**
+
+`MxGatewayDataConnection.DisposeAsync` (lines 277-283) cancels `_eventLoopCts` and
+disposes the underlying `_client`, but never touches `_alarmCts`. The alarm feed is
+established in `SubscribeAlarmsAsync` (lines 153-175): under `_alarmLock` it lazily
+creates `_alarmCts = new CancellationTokenSource()` and launches a long-running
+`Task.Run(() => client.RunAlarmStreamAsync(null, ..., token))`. That background alarm
+stream is bound only to `_alarmCts.Token`. `DisconnectAsync` (lines 121-134) already
+tears it down correctly — under `_alarmLock` it cancels and disposes `_alarmCts`,
+nulls it, and resets `_alarmSubCount` — but `DisposeAsync` does not, so the
+`CancellationTokenSource` and the alarm-stream `Task` are both leaked whenever the
+adapter is disposed without a prior `DisconnectAsync`.
+
+The `DataConnectionActor` disposes adapters fire-and-forget on failover (and on
+actor/connection stop) via `_adapter.DisposeAsync()` without necessarily calling
+`DisconnectAsync` first, so every MxGateway failover or connection teardown that goes
+through `DisposeAsync` leaks one running alarm-stream task plus its CTS. Severity is
+Low because the leaked task is cancellation-bound to a CTS that will be GC-reclaimed
+eventually and the gRPC stream it holds will fault when `_client` is disposed, but a
+still-running alarm-stream loop racing a disposed client is precisely the class of
+dangling background work the lock-guarded `DisconnectAsync` block was written to avoid.
+
+**Recommendation**
+
+In `DisposeAsync`, cancel and dispose `_alarmCts` under `_alarmLock` before disposing
+the client — copy the block already present in `DisconnectAsync` (lines 124-130):
+`lock (_alarmLock) { _alarmCts?.Cancel(); _alarmCts?.Dispose(); _alarmCts = null;
+_alarmSubCount = 0; }`. This guarantees the alarm-stream task is cancelled
+deterministically on every teardown path, not just the `DisconnectAsync` one.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `MxGatewayDataConnection.DisposeAsync` now cancels and disposes `_alarmCts` under `_alarmLock`, matching `DisconnectAsync` — no more alarm task/CTS leak on failover/stop.
+
+### DataConnectionLayer-028 — `_alarmSourceFilter` "first subscriber wins" comment contradicts the last-wins code
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:102-103,1758-1761` |
+
+**Description**
+
+The XML doc on the `_alarmSourceFilter` field (lines 102-103) reads "sourceReference →
+raw condition filter string passed to the adapter (**first subscriber wins**)." The
+code does the opposite: `HandleSubscribeAlarms` **unconditionally** overwrites the
+filter on every subscribe — `_alarmSourceFilter[request.SourceReference] =
+request.ConditionFilter;` (line 1758) and likewise re-parses
+`_alarmSourceFilterPredicate[request.SourceReference] =
+AlarmConditionFilter.Parse(request.ConditionFilter)` (line 1761) — with no
+"already present" guard. So a **second** subscriber to the same source reference
+re-filters the shared feed with its own condition filter, i.e. **last subscriber
+wins**, not first. The comment will mislead a maintainer reasoning about which filter
+governs a shared alarm feed when two instances subscribe to the same source with
+different condition filters, and it understates a real behavioural subtlety (the second
+subscriber silently changes the gate applied to the first subscriber's transitions in
+`HandleAlarmTransitionReceived`).
+
+**Recommendation**
+
+Correct the comment to "last subscriber wins" (and note that the shared feed carries a
+single filter, so co-subscribers to one source reference must agree on the condition
+filter). If per-subscriber filtering is actually intended, make the predicate
+per-subscriber rather than per-source so each subscriber's own filter governs only its
+own deliveries — but that is a behaviour change, not a comment fix, and should be
+decided explicitly.
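+
+The difference between the two semantics comes down to which dictionary call is used.
+A minimal sketch, assuming string-keyed maps like `_alarmSourceFilter`; the filter
+strings are hypothetical placeholders, not real condition-filter syntax.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+static class Program
+{
+    static void Main()
+    {
+        var lastWins = new Dictionary<string, string>();
+        var firstWins = new Dictionary<string, string>();
+
+        // Two subscribers to the same source reference with different filters.
+        var requests = new[] { ("src-1", "filter-A"), ("src-1", "filter-B") };
+
+        foreach (var (source, filter) in requests)
+        {
+            lastWins[source] = filter;        // indexer assignment overwrites: last wins
+            firstWins.TryAdd(source, filter); // TryAdd keeps the existing entry: first wins
+        }
+
+        var last = lastWins["src-1"];
+        var first = firstWins["src-1"];
+        // prints: last-wins=filter-B first-wins=filter-A
+        Console.WriteLine($"last-wins={last} first-wins={first}");
+    }
+}
+```
+
+The indexer form matches the assignment quoted from `HandleSubscribeAlarms`; the
+`TryAdd` form is what the old comment implied.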
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment was corrected from "first subscriber wins" to "last subscriber wins" and now notes that co-subscribers share the single filter.