docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+240 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 5 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
## Summary
@@ -116,6 +116,37 @@ DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
## Findings
#### Re-review 2026-06-20 (commit `4307c381`) — full review
The 2026-06-20 full re-review walked all 10 checklist categories against the current
source, focused on the M7 native-alarm subscribe path, the OPC UA node-browser
(BrowseNext/search/type-info), and the verify-endpoint + site-local cert-trust surface.
The M7 surface is well-built overall — the verify-endpoint probe correctly *captures
but never trusts* an untrusted server certificate, the per-node `CertStoreActor`
broadcast keeps both site nodes' PKI stores consistent across failover, and the
node-browser paging/search/type-info additions are clean. All 22 prior findings remain
`Resolved` and their fixes were verified in place. The review found **4 new findings**,
all clustering on the native-alarm subscribe path, which did **not** inherit the
guards the tag-subscribe path earned across DCL-018 / DCL-021 / DCL-022. The alarm
path leaks its adapter feed on a mid-flight unsubscribe (no DCL-021-style
obsolete-completion guard) and has no alarm-resolution retry to match the tag path's
`tag-resolution-retry` timer. In addition, `MxGatewayDataConnection` leaks `_alarmCts`
on `DisposeAsync` (only `DisconnectAsync` tears it down), and the `_alarmSourceFilter`
field carries a stale "first subscriber wins" comment on a last-wins filter assignment.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | No new issues; DCL-016/020/021 counter/response fixes verified. Alarm-path snapshot atomic-swap + per-source prefix routing are correct. |
| 2 | Akka.NET conventions | x | No new issues; DCL-022 `IsTimerActive` gate verified. Alarm subscribe uses `ContinueWith(...).PipeTo(Self)` with `Self`/generation captured (no `Sender`/`this` in closures). |
| 3 | Concurrency & thread safety | x | Finding 023 — native-alarm mid-flight unsubscribe leaks the adapter alarm feed (the tag path's DCL-021 obsolete-completion guard was not mirrored on the alarm path). |
| 4 | Error handling & resilience | x | Finding 024 — no alarm-resolution retry: a failed initial alarm subscribe strands the subscriber with no feed until the next full reconnect (the tag path retries on `tag-resolution-retry`). |
| 5 | Security | x | No new issues. The M7 verify-endpoint probe builds a temporary `RealOpcUaClient`, captures the untrusted server cert, and NEVER trusts it; cert trust is gated through `CertStoreActor` broadcast to both nodes. DCL-012/014 auto-accept warning + Commons default remain out-of-scope follow-ups. |
| 6 | Performance & resource management | x | Finding 025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts` (alarm task + CTS leak on failover/stop); `DisconnectAsync` already cancels+disposes it under `_alarmLock`. |
| 7 | Design-document adherence | x | No new issues; M7 native-alarm read-only mirror, `AlarmKind` discriminator, and per-source `IAlarmSubscribableConnection` feed match the design. DCL-009 doc action still open at doc level (out of scope). |
| 8 | Code organization & conventions | x | No issues — alarm messages/POCOs in Commons, `IAlarmSubscribableConnection` capability seam in the module, options class owned by component. |
| 9 | Testing coverage | x | DCL-001 through DCL-022 regression tests present. Gaps for findings 023 (alarm unsubscribe mid-flight), 024 (failed alarm subscribe → no retry), 025 (`DisposeAsync` alarm-CTS leak). |
| 10 | Documentation & comments | x | Finding 028 — the `_alarmSourceFilter` "first subscriber wins" XML comment contradicts the last-wins assignment in `HandleSubscribeAlarms`. |
### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread
| | |
@@ -1267,3 +1298,209 @@ calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply
same gate at both call sites. Add a regression test that fires 5 subscribes with
unresolved tags within one retry interval and asserts the retry fires at most one
interval after the first failure (not after the fifth subscribe).
### DataConnectionLayer-023 — Native-alarm mid-flight unsubscribe leaks the adapter alarm feed
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1773,1790-1809,1864-1882` |
**Description**
The native-alarm subscribe path mirrors the tag-subscribe path's
dispatch-I/O-then-`PipeTo(Self)` shape but never inherited the obsolete-completion
guard the tag path earned in DCL-021. `HandleSubscribeAlarms` adds the source to
`_alarmSubscribesInFlight` (line 1773) and dispatches
`alarmable.SubscribeAlarmsAsync(...)` whose `ContinueWith` pipes an
`AlarmSubscribeCompleted` back to `Self`. If an `UnsubscribeAlarmsRequest` for the
last (or only) subscriber is processed on the actor thread between that dispatch and
the completion, `HandleUnsubscribeAlarms` (lines 1864-1882) removes the subscriber,
empties `_alarmSourceSubscribers`, and removes the filter entries — but it tries to
tear down the adapter feed via `_alarmSubscriptionIds.Remove(...)`, which **fails**
because the subscription id has not been stored yet (it is still in flight). It also
leaves the stale `_alarmSubscribesInFlight` entry in place. The late
`AlarmSubscribeCompleted` then reaches `HandleAlarmSubscribeCompleted` (lines
1790-1809), which **unconditionally** stores `_alarmSubscriptionIds[msg.SourceReference]
= msg.SubscriptionId` (line 1796) without re-checking whether any subscriber still
exists in `_alarmSourceSubscribers`.
The net result is an orphaned adapter alarm feed: the OPC UA monitored-item set / the
MxGateway gateway-wide alarm stream stays alive for a source with **zero** subscribers,
streaming transitions into `HandleAlarmTransitionReceived` that match no subscriber set
and are silently dropped. The feed is never torn down for the lifetime of the adapter.
A full `ReSubscribeAllAlarms` on reconnect does clear `_alarmSubscriptionIds`, but it
re-subscribes from `_alarmSourceSubscribers`, which no longer contains the orphaned
source, so the stale device-side feed is simply abandoned, not closed. The
tag path closes exactly this race in `HandleSubscribeCompleted` (lines 834-867) by
releasing the just-created adapter handle and clearing `_subscribesInFlight` when the
instance entry has gone; the alarm path does neither. This matters because native-alarm
sources are created/destroyed on every deploy/undeploy and instance stop, so each
mid-flight unsubscribe permanently leaks one device-side alarm subscription and its
publish traffic.
**Recommendation**
Mirror the closed DCL-021 fix on the alarm path. In `HandleAlarmSubscribeCompleted`,
before storing the id, guard on the live subscriber set: `if
(!_alarmSourceSubscribers.ContainsKey(msg.SourceReference)) {
if (msg.Success && msg.SubscriptionId != null && _adapter is
IAlarmSubscribableConnection alarmable) _ =
alarmable.UnsubscribeAlarmsAsync(msg.SubscriptionId); return; }` so a feed that
completed after its last subscriber left is released at the adapter rather than stored.
Additionally clear the `_alarmSubscribesInFlight` entry in `HandleUnsubscribeAlarms`
when the last subscriber leaves, so the in-flight marker does not linger. Add a
regression test that subscribes a source, sends `UnsubscribeAlarmsRequest` while the
alarm subscribe I/O is in flight, completes the subscribe, and asserts
`UnsubscribeAlarmsAsync` is called and `_alarmSubscriptionIds` /
`_alarmSubscribesInFlight` are clean.
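
The guard described above can be sketched outside Akka.NET as a plain, synchronous state model. This is a minimal illustration under stated assumptions, not the actor's actual code: `IAlarmFeed`, `RecordingAlarmFeed`, and `AlarmSubscriptionState` are hypothetical stand-ins for `IAlarmSubscribableConnection` and the actor's `_alarmSourceSubscribers` / `_alarmSubscriptionIds` / `_alarmSubscribesInFlight` bookkeeping, and the adapter call is made synchronous for brevity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the adapter's alarm capability; only the
// unsubscribe call matters for this sketch.
public interface IAlarmFeed
{
    void UnsubscribeAlarms(string subscriptionId);
}

// Test double that records which subscription ids were released.
public sealed class RecordingAlarmFeed : IAlarmFeed
{
    public List<string> Released { get; } = new();
    public void UnsubscribeAlarms(string subscriptionId) => Released.Add(subscriptionId);
}

// Simplified model of the actor's alarm bookkeeping (names are assumptions).
public sealed class AlarmSubscriptionState
{
    private readonly IAlarmFeed _feed;
    public Dictionary<string, HashSet<string>> Subscribers { get; } = new();
    public Dictionary<string, string> SubscriptionIds { get; } = new();
    public HashSet<string> InFlight { get; } = new();

    public AlarmSubscriptionState(IAlarmFeed feed) => _feed = feed;

    public void Subscribe(string source, string subscriber)
    {
        if (!Subscribers.TryGetValue(source, out var set))
            Subscribers[source] = set = new HashSet<string>();
        set.Add(subscriber);
        if (!SubscriptionIds.ContainsKey(source))
            InFlight.Add(source); // adapter subscribe I/O is now pending
    }

    public void Unsubscribe(string source, string subscriber)
    {
        if (!Subscribers.TryGetValue(source, out var set)) return;
        set.Remove(subscriber);
        if (set.Count > 0) return;

        Subscribers.Remove(source);
        InFlight.Remove(source); // do not let the in-flight marker linger
        if (SubscriptionIds.Remove(source, out var id))
            _feed.UnsubscribeAlarms(id);
    }

    public void OnSubscribeCompleted(string source, bool success, string? subscriptionId)
    {
        InFlight.Remove(source);
        if (!Subscribers.ContainsKey(source))
        {
            // The last subscriber left while the subscribe was in flight:
            // release the just-created feed instead of storing its id.
            if (success && subscriptionId != null)
                _feed.UnsubscribeAlarms(subscriptionId);
            return;
        }
        if (success && subscriptionId != null)
            SubscriptionIds[source] = subscriptionId;
    }
}
```

Calling `Subscribe`, then `Unsubscribe`, then `OnSubscribeCompleted` for the same source reproduces the race: with the guard in place the late completion releases the feed and leaves both maps empty, which is the shape the recommended regression test asserts.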
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `HandleAlarmSubscribeCompleted` now guards against an orphaned completion (if no live subscriber remains for the source, it tears down the just-created adapter subscription and returns) and `HandleUnsubscribeAlarms` clears the in-flight marker when the last subscriber leaves — mirroring the DCL-021 tag-path fix. Regression test added (verified failing pre-fix).
### DataConnectionLayer-024 — No alarm-resolution retry: a failed initial alarm subscribe strands the subscriber until the next full reconnect
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1752-1757,1790-1809,1885-1908` |
**Description**
When an initial native-alarm subscribe fails, the subscriber is left registered but
with no feed and no path to recovery short of a full connection reconnect.
`HandleSubscribeAlarms` registers the subscriber in `_alarmSourceSubscribers` (lines
1752-1757) **before** issuing the adapter subscribe, so a transition arriving
mid-subscribe is still routed. If `alarmable.SubscribeAlarmsAsync` fails, the piped
`AlarmSubscribeCompleted` reaches `HandleAlarmSubscribeCompleted` (lines 1790-1809),
which on the failure branch only logs a warning (line 1801) and replies a
`SubscribeAlarmsResponse(success: false, ...)`. It does **not** re-arm any retry, does
**not** push `NativeAlarmSourceUnavailable` to the subscriber, and does **not** drop
the source from `_alarmSourceSubscribers`. The subscriber therefore stays in the
routing map with the adapter feed never established.
Recovery only happens via `ReSubscribeAllAlarms` (lines 1885-1908), which is invoked
solely from `BecomeConnected` after a full connection reconnect (line 583). So a
transient device-side failure on the *initial* alarm subscribe — the alarm server
still booting, a momentary fault — leaves the source dark until the entire connection
happens to cycle through `Reconnecting`. This diverges from the tag path, which arms a
periodic `tag-resolution-retry` timer (lines 989-993, 1678-1682; fired by
`HandleRetryTagResolution` at line 1491) so a tag that fails to resolve at subscribe
time is retried every `TagResolutionRetryInterval` (10 s default) without needing a
reconnect. The native-alarm source has no analogous self-healing.
This is partly a **design judgment call** rather than a clear-cut defect: the M7 design
models native alarms as a read-only mirror, and "drop the source and signal the
subscriber" is an equally valid contract to "retry the alarm subscribe periodically."
The defect is that the code currently does **neither** — it silently leaves a
registered subscriber with a permanently-dark feed and no signal — which is the worst
of both.
**Recommendation**
Pick one of the two valid contracts and implement it (this is a judgment call — do not
assume which is wanted):
- **Retry:** add an `alarm-resolution-retry` periodic timer (mirroring
`tag-resolution-retry`): track sources whose subscribe failed, re-issue
`SubscribeAlarmsAsync` on the timer, and gate the timer start with
`IsTimerActive` (per the DCL-022 pattern) so a burst of failed subscribes does not
reset it.
- **Fail fast:** on the failure branch of `HandleAlarmSubscribeCompleted`, push
`NativeAlarmSourceUnavailable` to the subscriber(s) for that source and remove the
source from `_alarmSourceSubscribers` / `_alarmSubscribesInFlight`, so the
`NativeAlarmActor` learns the feed is unavailable rather than waiting forever.
Either way, add a regression test for "initial alarm subscribe fails → subscriber is
not silently left dark."
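
The two contracts can be compared side by side in a small model. This is a hedged sketch only: `AlarmSubscribeFailurePolicy`, `AlarmSubscribeFailureHandler`, and its members are hypothetical names, the `RetryTimerActive` flag stands in for an `IsTimerActive` check on the actor's timer scheduler, and `UnavailableNotices` stands in for pushing `NativeAlarmSourceUnavailable` to subscribers. It does not choose between the contracts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical switch between the two valid contracts.
public enum AlarmSubscribeFailurePolicy { Retry, FailFast }

public sealed class AlarmSubscribeFailureHandler
{
    private readonly AlarmSubscribeFailurePolicy _policy;

    public Dictionary<string, HashSet<string>> Subscribers { get; } = new();
    public HashSet<string> PendingRetry { get; } = new();
    public List<(string Subscriber, string Source)> UnavailableNotices { get; } = new();
    public bool RetryTimerActive { get; private set; }
    public int TimerStarts { get; private set; }

    public AlarmSubscribeFailureHandler(AlarmSubscribeFailurePolicy policy) => _policy = policy;

    public void AddSubscriber(string source, string subscriber)
    {
        if (!Subscribers.TryGetValue(source, out var set))
            Subscribers[source] = set = new HashSet<string>();
        set.Add(subscriber);
    }

    public void OnSubscribeFailed(string source)
    {
        if (!Subscribers.TryGetValue(source, out var set)) return;

        if (_policy == AlarmSubscribeFailurePolicy.Retry)
        {
            // Retry: remember the source and start the timer only if it is
            // not already running, so a burst of failures does not reset it.
            PendingRetry.Add(source);
            if (!RetryTimerActive)
            {
                RetryTimerActive = true;
                TimerStarts++;
            }
            return;
        }

        // Fail fast: tell every subscriber the feed is unavailable and
        // drop the source so nothing waits on a dark feed.
        foreach (var subscriber in set)
            UnavailableNotices.Add((subscriber, source));
        Subscribers.Remove(source);
    }

    public void OnSubscribeSucceeded(string source)
    {
        PendingRetry.Remove(source);
        if (PendingRetry.Count == 0) RetryTimerActive = false; // nothing left to retry
    }
}
```

Under either policy the subscriber ends up in a defined state (queued for retry, or notified and removed), which is the property the regression test would check.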
**Resolution**
Deferred 2026-06-20: the fix (add an alarm-resolution retry timer vs. push `NativeAlarmSourceUnavailable` and drop the source on initial-subscribe failure) is a design-owner decision; the current code does neither. Awaiting that decision before implementing.
### DataConnectionLayer-025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts`, leaking the alarm task on failover/stop
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/MxGatewayDataConnection.cs:277-283,121-134` |
**Description**
`MxGatewayDataConnection.DisposeAsync` (lines 277-283) cancels `_eventLoopCts` and
disposes the underlying `_client`, but never touches `_alarmCts`. The alarm feed is
established in `SubscribeAlarmsAsync` (lines 153-175): under `_alarmLock` it lazily
creates `_alarmCts = new CancellationTokenSource()` and launches a long-running
`Task.Run(() => client.RunAlarmStreamAsync(null, ..., token))`. That background alarm
stream is bound only to `_alarmCts.Token`. `DisconnectAsync` (lines 121-134) already
tears it down correctly — under `_alarmLock` it cancels and disposes `_alarmCts`,
nulls it, and resets `_alarmSubCount` — but `DisposeAsync` does not, so the
`CancellationTokenSource` and the alarm-stream `Task` are both leaked whenever the
adapter is disposed without a prior `DisconnectAsync`.
The `DataConnectionActor` disposes adapters fire-and-forget on failover (and on
actor/connection stop) via `_adapter.DisposeAsync()` without necessarily calling
`DisconnectAsync` first, so every MxGateway failover or connection teardown that goes
through `DisposeAsync` leaks one running alarm-stream task plus its CTS. Severity is
Low because the leaked task is cancellation-bound to a CTS that will be GC-reclaimed
eventually and the gRPC stream it holds will fault when `_client` is disposed, but a
still-running alarm-stream loop racing a disposed client is precisely the class of
dangling background work the lock-guarded `DisconnectAsync` block was written to avoid.
**Recommendation**
In `DisposeAsync`, cancel and dispose `_alarmCts` under `_alarmLock` before disposing
the client — copy the block already present in `DisconnectAsync` (lines 124-130):
`lock (_alarmLock) { _alarmCts?.Cancel(); _alarmCts?.Dispose(); _alarmCts = null;
_alarmSubCount = 0; }`. This guarantees the alarm-stream task is cancelled
deterministically on every teardown path, not just the `DisconnectAsync` one.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `MxGatewayDataConnection.DisposeAsync` now cancels and disposes `_alarmCts` under `_alarmLock`, matching `DisconnectAsync` — no more alarm task/CTS leak on failover/stop.
### DataConnectionLayer-028 — `_alarmSourceFilter` "first subscriber wins" comment contradicts the last-wins code
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:102-103,1758-1761` |
**Description**
The XML doc on the `_alarmSourceFilter` field (lines 102-103) reads "sourceReference →
raw condition filter string passed to the adapter (**first subscriber wins**)." The
code does the opposite: `HandleSubscribeAlarms` **unconditionally** overwrites the
filter on every subscribe — `_alarmSourceFilter[request.SourceReference] =
request.ConditionFilter;` (line 1758) and likewise re-parses
`_alarmSourceFilterPredicate[request.SourceReference] =
AlarmConditionFilter.Parse(request.ConditionFilter)` (line 1761) — with no
"already present" guard. So a **second** subscriber to the same source reference
re-filters the shared feed with its own condition filter, i.e. **last subscriber
wins**, not first. The comment will mislead a maintainer reasoning about which filter
governs a shared alarm feed when two instances subscribe to the same source with
different condition filters, and it understates a real behavioural subtlety (the second
subscriber silently changes the gate applied to the first subscriber's transitions in
`HandleAlarmTransitionReceived`).
**Recommendation**
Correct the comment to "last subscriber wins" (and note that the shared feed carries a
single filter, so co-subscribers to one source reference must agree on the condition
filter). If per-subscriber filtering is actually intended, make the predicate
per-subscriber rather than per-source so each subscriber's own filter governs only its
own deliveries — but that is a behaviour change, not a comment fix, and should be
decided explicitly.
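
The difference between the two behaviours can be shown with a small model. This is an illustrative sketch, not the module's code: `PerSourceAlarmFilter` and `PerSubscriberAlarmFilter` are hypothetical classes, and a plain `Func<string, bool>` stands in for the predicate produced by `AlarmConditionFilter.Parse`.

```csharp
using System;
using System.Collections.Generic;

// Models the current behaviour: one filter per source, overwritten on every subscribe.
public sealed class PerSourceAlarmFilter
{
    private readonly Dictionary<string, SortedSet<string>> _subscribers = new();
    private readonly Dictionary<string, Func<string, bool>> _filter = new();

    public void Subscribe(string source, string subscriber, Func<string, bool> filter)
    {
        if (!_subscribers.TryGetValue(source, out var set))
            _subscribers[source] = set = new SortedSet<string>();
        set.Add(subscriber);
        _filter[source] = filter; // no "already present" guard: last subscriber wins
    }

    // Returns the subscribers that would receive a transition with this condition.
    public List<string> Route(string source, string condition)
    {
        var result = new List<string>();
        if (_subscribers.TryGetValue(source, out var set) && _filter[source](condition))
            result.AddRange(set);
        return result;
    }
}

// Models the alternative: each subscriber's own filter gates only its own deliveries.
public sealed class PerSubscriberAlarmFilter
{
    private readonly Dictionary<string, SortedDictionary<string, Func<string, bool>>> _subscribers = new();

    public void Subscribe(string source, string subscriber, Func<string, bool> filter)
    {
        if (!_subscribers.TryGetValue(source, out var map))
            _subscribers[source] = map = new SortedDictionary<string, Func<string, bool>>();
        map[subscriber] = filter;
    }

    public List<string> Route(string source, string condition)
    {
        var result = new List<string>();
        if (!_subscribers.TryGetValue(source, out var map)) return result;
        foreach (var (subscriber, filter) in map)
            if (filter(condition)) result.Add(subscriber);
        return result;
    }
}
```

With subscriber `A` filtering on `"HiHi"` and subscriber `B` subscribing afterwards with `"Lo"`, the per-source model routes a `"HiHi"` transition to nobody and a `"Lo"` transition to both, while the per-subscriber model routes each transition only to the subscriber that asked for it.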
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment corrected from 'first subscriber wins' to 'last subscriber wins' (and notes co-subscribers share the single filter).