docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites: DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 5 |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

## Summary

@@ -116,6 +116,37 @@ DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from

## Findings

#### Re-review 2026-06-20 (commit `4307c381`) — full review

The 2026-06-20 full re-review walked all 10 checklist categories against the current
source, focused on the M7 native-alarm subscribe path, the OPC UA node-browser
(BrowseNext/search/type-info), and the verify-endpoint + site-local cert-trust surface.
The M7 surface is well-built overall: the verify-endpoint probe correctly *captures
but never trusts* an untrusted server certificate, the per-node `CertStoreActor`
broadcast keeps both site nodes' PKI stores consistent across failover, and the
node-browser paging/search/type-info additions are clean. All 22 prior findings remain
`Resolved` and their fixes were verified in place. The review found **4 new findings**,
all clustering on the native-alarm subscribe path, which did **not** inherit the
guards the tag-subscribe path earned across DCL-018 / DCL-021 / DCL-022: the alarm
path leaks its adapter feed on a mid-flight unsubscribe (no DCL-021-style
obsolete-completion guard), has no alarm-resolution retry to match the tag path's
`tag-resolution-retry` timer, leaks `_alarmCts` on `DisposeAsync` (only
`DisconnectAsync` tears it down), and carries a stale "first subscriber wins" comment
on a last-wins filter assignment.

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | No new issues; DCL-016/020/021 counter/response fixes verified. Alarm-path snapshot atomic-swap + per-source prefix routing are correct. |
| 2 | Akka.NET conventions | x | No new issues; DCL-022 `IsTimerActive` gate verified. Alarm subscribe uses `ContinueWith(...).PipeTo(Self)` with `Self`/generation captured (no `Sender`/`this` in closures). |
| 3 | Concurrency & thread safety | x | Finding 023: native-alarm mid-flight unsubscribe leaks the adapter alarm feed (the tag path's DCL-021 obsolete-completion guard was not mirrored on the alarm path). |
| 4 | Error handling & resilience | x | Finding 024: no alarm-resolution retry. A failed initial alarm subscribe strands the subscriber with no feed until the next full reconnect (the tag path retries on `tag-resolution-retry`). |
| 5 | Security | x | No new issues. The M7 verify-endpoint probe builds a temporary `RealOpcUaClient`, captures the untrusted server cert, and NEVER trusts it; cert trust is gated through `CertStoreActor` broadcast to both nodes. DCL-012/014 auto-accept warning + Commons default remain out-of-scope follow-ups. |
| 6 | Performance & resource management | x | Finding 025: `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts` (alarm task + CTS leak on failover/stop); `DisconnectAsync` already cancels+disposes it under `_alarmLock`. |
| 7 | Design-document adherence | x | No new issues; M7 native-alarm read-only mirror, `AlarmKind` discriminator, and per-source `IAlarmSubscribableConnection` feed match the design. DCL-009 doc action still open at doc level (out of scope). |
| 8 | Code organization & conventions | x | No issues: alarm messages/POCOs in Commons, `IAlarmSubscribableConnection` capability seam in the module, options class owned by component. |
| 9 | Testing coverage | x | DCL-001 through DCL-022 regression tests present. Gaps for findings 023 (alarm unsubscribe mid-flight), 024 (failed alarm subscribe → no retry), 025 (`DisposeAsync` alarm-CTS leak). |
| 10 | Documentation & comments | x | Finding 028: the `_alarmSourceFilter` "first subscriber wins" XML comment contradicts the last-wins assignment in `HandleSubscribeAlarms`. |

### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread

| | |
@@ -1267,3 +1298,209 @@ calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply
same gate at both call sites. Add a regression test that fires 5 subscribes with
unresolved tags within one retry interval and asserts the retry fires at most one
interval after the first failure (not after the fifth subscribe).

### DataConnectionLayer-023 — Native-alarm mid-flight unsubscribe leaks the adapter alarm feed

| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1773,1790-1809,1864-1882` |

**Description**

The native-alarm subscribe path mirrors the tag-subscribe path's
dispatch-I/O-then-`PipeTo(Self)` shape but never inherited the obsolete-completion
guard the tag path earned in DCL-021. `HandleSubscribeAlarms` adds the source to
`_alarmSubscribesInFlight` (line 1773) and dispatches
`alarmable.SubscribeAlarmsAsync(...)` whose `ContinueWith` pipes an
`AlarmSubscribeCompleted` back to `Self`. If an `UnsubscribeAlarmsRequest` for the
last (or only) subscriber is processed on the actor thread between that dispatch and
the completion, `HandleUnsubscribeAlarms` (lines 1864-1882) removes the subscriber,
empties `_alarmSourceSubscribers`, and removes the filter entries, but it tries to
tear down the adapter feed via `_alarmSubscriptionIds.Remove(...)`, which **fails**
because the subscription id has not been stored yet (it is still in flight). It also
leaves the stale `_alarmSubscribesInFlight` entry in place. The late
`AlarmSubscribeCompleted` then reaches `HandleAlarmSubscribeCompleted` (lines
1790-1809), which **unconditionally** stores
`_alarmSubscriptionIds[msg.SourceReference] = msg.SubscriptionId` (line 1796) without
re-checking whether any subscriber still exists in `_alarmSourceSubscribers`.

The net result is an orphaned adapter alarm feed: the OPC UA monitored-item set / the
MxGateway gateway-wide alarm stream stays alive for a source with **zero** subscribers,
streaming transitions into `HandleAlarmTransitionReceived` that match no subscriber set
and are silently dropped. The feed is never torn down for the lifetime of the adapter
(only a full `ReSubscribeAllAlarms` on reconnect clears `_alarmSubscriptionIds`, and
even that re-subscribes from `_alarmSourceSubscribers`, which no longer contains the
orphaned source, so the stale device-side feed is simply abandoned, not closed). The
tag path closes exactly this race in `HandleSubscribeCompleted` (lines 834-867) by
releasing the just-created adapter handle and clearing `_subscribesInFlight` when the
instance entry has gone; the alarm path does neither. This matters because native-alarm
sources are created/destroyed on every deploy/undeploy and instance stop, so each
mid-flight unsubscribe permanently leaks one device-side alarm subscription and its
publish traffic.

**Recommendation**

Mirror the closed DCL-021 fix on the alarm path. In `HandleAlarmSubscribeCompleted`,
before storing the id, guard on the live subscriber set: `if
(!_alarmSourceSubscribers.ContainsKey(msg.SourceReference)) {
if (msg.Success && msg.SubscriptionId != null && _adapter is
IAlarmSubscribableConnection alarmable) _ =
alarmable.UnsubscribeAlarmsAsync(msg.SubscriptionId); return; }` so a feed that
completed after its last subscriber left is released at the adapter rather than stored.
Additionally clear the `_alarmSubscribesInFlight` entry in `HandleUnsubscribeAlarms`
when the last subscriber leaves, so the in-flight marker does not linger. Add a
regression test that subscribes a source, sends `UnsubscribeAlarmsRequest` while the
alarm subscribe I/O is in flight, completes the subscribe, and asserts
`UnsubscribeAlarmsAsync` is called and `_alarmSubscriptionIds` /
`_alarmSubscribesInFlight` are clean.

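The recommended guard can be sketched outside the actor runtime. This is a minimal model under stated assumptions, not the code in `DataConnectionActor`: `AlarmBookkeeping`, `IAlarmFeed`, and `RecordingFeed` are hypothetical names, the three collections stand in for `_alarmSourceSubscribers`, `_alarmSubscriptionIds`, and `_alarmSubscribesInFlight`, and the asynchronous adapter call is reduced to a synchronous `Unsubscribe`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the adapter seam; only the release call matters here.
public interface IAlarmFeed
{
    void Unsubscribe(string subscriptionId);
}

// Test double that records which adapter subscriptions were released.
public sealed class RecordingFeed : IAlarmFeed
{
    public List<string> Released { get; } = new();
    public void Unsubscribe(string subscriptionId) => Released.Add(subscriptionId);
}

// Models only the bookkeeping described above; messages become plain method calls.
public sealed class AlarmBookkeeping
{
    private readonly Dictionary<string, HashSet<string>> _subscribers = new();
    private readonly Dictionary<string, string> _subscriptionIds = new();
    private readonly HashSet<string> _inFlight = new();
    private readonly IAlarmFeed _feed;

    public AlarmBookkeeping(IAlarmFeed feed) => _feed = feed;

    public bool HasStoredId(string source) => _subscriptionIds.ContainsKey(source);
    public bool IsInFlight(string source) => _inFlight.Contains(source);

    public void Subscribe(string source, string subscriber)
    {
        if (!_subscribers.TryGetValue(source, out var set))
            _subscribers[source] = set = new HashSet<string>();
        set.Add(subscriber);
        _inFlight.Add(source); // the adapter subscribe I/O is now pending
    }

    public void Unsubscribe(string source, string subscriber)
    {
        if (!_subscribers.TryGetValue(source, out var set)) return;
        set.Remove(subscriber);
        if (set.Count > 0) return; // other subscribers still need the feed

        _subscribers.Remove(source);
        _inFlight.Remove(source); // do not let the in-flight marker linger
        if (_subscriptionIds.Remove(source, out var id))
            _feed.Unsubscribe(id); // the id was already stored: release it now
    }

    public void OnSubscribeCompleted(string source, bool success, string? subscriptionId)
    {
        _inFlight.Remove(source);
        if (!_subscribers.ContainsKey(source))
        {
            // Obsolete completion: the last subscriber left while the I/O was pending,
            // so release the just-created adapter subscription instead of storing it.
            if (success && subscriptionId != null) _feed.Unsubscribe(subscriptionId);
            return;
        }
        if (success && subscriptionId != null) _subscriptionIds[source] = subscriptionId;
    }
}
```

Replaying the race against this model (`Subscribe`, then `Unsubscribe`, then a successful `OnSubscribeCompleted`) releases the late subscription at the feed and leaves neither a stored id nor an in-flight marker behind. The model ignores a re-subscribe that arrives while the first subscribe is still pending, which a real fix would also need to consider.
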
**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `HandleAlarmSubscribeCompleted` now guards against an orphaned completion (if no live subscriber remains for the source, it tears down the just-created adapter subscription and returns) and `HandleUnsubscribeAlarms` clears the in-flight marker when the last subscriber leaves, mirroring the DCL-021 tag-path fix. Regression test added (verified failing pre-fix).

### DataConnectionLayer-024 — No alarm-resolution retry: a failed initial alarm subscribe strands the subscriber until the next full reconnect

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1752-1757,1790-1809,1885-1908` |

**Description**

When an initial native-alarm subscribe fails, the subscriber is left registered but
with no feed and no path to recovery short of a full connection reconnect.
`HandleSubscribeAlarms` registers the subscriber in `_alarmSourceSubscribers` (lines
1752-1757) **before** issuing the adapter subscribe, so a transition arriving
mid-subscribe is still routed. If `alarmable.SubscribeAlarmsAsync` fails, the piped
`AlarmSubscribeCompleted` reaches `HandleAlarmSubscribeCompleted` (lines 1790-1809),
which on the failure branch only logs a warning (line 1801) and replies a
`SubscribeAlarmsResponse(success: false, ...)`. It does **not** re-arm any retry, does
**not** push `NativeAlarmSourceUnavailable` to the subscriber, and does **not** drop
the source from `_alarmSourceSubscribers`. The subscriber therefore stays in the
routing map with the adapter feed never established.

Recovery only happens via `ReSubscribeAllAlarms` (lines 1885-1908), which is invoked
solely from `BecomeConnected` after a full connection reconnect (line 583). So a
transient device-side failure on the *initial* alarm subscribe (the alarm server
still booting, a momentary fault) leaves the source dark until the entire connection
happens to cycle through `Reconnecting`. This diverges from the tag path, which arms a
periodic `tag-resolution-retry` timer (lines 989-993, 1678-1682; fired by
`HandleRetryTagResolution` at line 1491) so a tag that fails to resolve at subscribe
time is retried every `TagResolutionRetryInterval` (10 s default) without needing a
reconnect. The native-alarm source has no analogous self-healing.

This is partly a **design judgment call** rather than a clear-cut defect: the M7 design
models native alarms as a read-only mirror, and "drop the source and signal the
subscriber" is an equally valid contract to "retry the alarm subscribe periodically."
The defect is that the code currently does **neither**: it silently leaves a
registered subscriber with a permanently-dark feed and no signal, which is the worst
of both.

**Recommendation**

Pick one of the two valid contracts and implement it (this is a judgment call; do not
assume which is wanted):

- **Retry:** add an `alarm-resolution-retry` periodic timer (mirroring
  `tag-resolution-retry`): track sources whose subscribe failed, re-issue
  `SubscribeAlarmsAsync` on the timer, and gate the timer start with
  `IsTimerActive` (per the DCL-022 pattern) so a burst of failed subscribes does not
  reset it.
- **Fail fast:** on the failure branch of `HandleAlarmSubscribeCompleted`, push
  `NativeAlarmSourceUnavailable` to the subscriber(s) for that source and remove the
  source from `_alarmSourceSubscribers` / `_alarmSubscribesInFlight`, so the
  `NativeAlarmActor` learns the feed is unavailable rather than waiting forever.

Either way, add a regression test for "initial alarm subscribe fails → subscriber is
not silently left dark."

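Should the retry contract be the one chosen, the timer gating could look roughly like the sketch below. It is illustrative only and takes no position on which contract is wanted. `AlarmRetryGate` and all of its members are hypothetical; the nullable deadline stands in for the `IsTimerActive` check, and the current time is passed in explicitly so the behaviour can be tested without a scheduler.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Tracks sources whose alarm subscribe failed and decides when to retry them.
public sealed class AlarmRetryGate
{
    private readonly TimeSpan _interval;
    private readonly HashSet<string> _failedSources = new();
    private DateTimeOffset? _nextRetryAt; // null means "timer not armed"

    public AlarmRetryGate(TimeSpan interval) => _interval = interval;

    public DateTimeOffset? NextRetryAt => _nextRetryAt;

    public void OnSubscribeFailed(string source, DateTimeOffset now)
    {
        _failedSources.Add(source);
        // Arm only when idle, so a burst of failures does not push the deadline out.
        _nextRetryAt ??= now + _interval;
    }

    public void OnSubscribeSucceeded(string source)
    {
        _failedSources.Remove(source);
        if (_failedSources.Count == 0) _nextRetryAt = null; // nothing left to retry
    }

    // Returns the sources to re-subscribe, or an empty list if the deadline has not passed.
    public IReadOnlyList<string> Tick(DateTimeOffset now)
    {
        if (_nextRetryAt == null || now < _nextRetryAt.Value) return Array.Empty<string>();
        _nextRetryAt = now + _interval; // periodic: schedule the following attempt
        return new List<string>(_failedSources);
    }
}
```

With a 10-second interval, five failures spread over the first few seconds leave the deadline at ten seconds after the first failure, which is the property the DCL-022 gate protects on the tag path.
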
**Resolution**

Deferred 2026-06-20: the fix (add an alarm-resolution retry timer vs. push `NativeAlarmSourceUnavailable` and drop the source on initial-subscribe failure) is a design-owner decision; the current code does neither. Awaiting that decision before implementing.

### DataConnectionLayer-025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts`, leaking the alarm task on failover/stop

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/MxGatewayDataConnection.cs:277-283,121-134` |

**Description**

`MxGatewayDataConnection.DisposeAsync` (lines 277-283) cancels `_eventLoopCts` and
disposes the underlying `_client`, but never touches `_alarmCts`. The alarm feed is
established in `SubscribeAlarmsAsync` (lines 153-175): under `_alarmLock` it lazily
creates `_alarmCts = new CancellationTokenSource()` and launches a long-running
`Task.Run(() => client.RunAlarmStreamAsync(null, ..., token))`. That background alarm
stream is bound only to `_alarmCts.Token`. `DisconnectAsync` (lines 121-134) already
tears it down correctly (under `_alarmLock` it cancels and disposes `_alarmCts`,
nulls it, and resets `_alarmSubCount`), but `DisposeAsync` does not, so the
`CancellationTokenSource` and the alarm-stream `Task` are both leaked whenever the
adapter is disposed without a prior `DisconnectAsync`.

The `DataConnectionActor` disposes adapters fire-and-forget on failover (and on
actor/connection stop) via `_adapter.DisposeAsync()` without necessarily calling
`DisconnectAsync` first, so every MxGateway failover or connection teardown that goes
through `DisposeAsync` leaks one running alarm-stream task plus its CTS. Severity is
Low because the leaked task is cancellation-bound to a CTS that will be GC-reclaimed
eventually and the gRPC stream it holds will fault when `_client` is disposed, but a
still-running alarm-stream loop racing a disposed client is precisely the class of
dangling background work the lock-guarded `DisconnectAsync` block was written to avoid.

**Recommendation**

In `DisposeAsync`, cancel and dispose `_alarmCts` under `_alarmLock` before disposing
the client, copying the block already present in `DisconnectAsync` (lines 124-130):
`lock (_alarmLock) { _alarmCts?.Cancel(); _alarmCts?.Dispose(); _alarmCts = null;
_alarmSubCount = 0; }`. This guarantees the alarm-stream task is cancelled
deterministically on every teardown path, not just the `DisconnectAsync` one.

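One way to keep the two teardown paths from drifting apart again is to route both through a single helper. The sketch below illustrates that idea under stated assumptions and is not the adapter's actual code: `AlarmStreamOwner`, `StartAlarmStream`, `StopAlarmStream`, and `RunStreamAsync` are hypothetical, and an infinite `Task.Delay` stands in for the gateway alarm stream.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class AlarmStreamOwner : IAsyncDisposable
{
    private readonly object _alarmLock = new();
    private CancellationTokenSource? _alarmCts;
    private int _alarmSubCount;

    public Task? AlarmTask { get; private set; }
    public int SubscriberCount { get { lock (_alarmLock) return _alarmSubCount; } }

    public void StartAlarmStream()
    {
        lock (_alarmLock)
        {
            _alarmSubCount++;
            if (_alarmCts != null) return; // stream already running
            _alarmCts = new CancellationTokenSource();
            var token = _alarmCts.Token;
            AlarmTask = Task.Run(() => RunStreamAsync(token));
        }
    }

    // Placeholder for the long-running stream: waits until cancelled.
    private static async Task RunStreamAsync(CancellationToken token)
    {
        try { await Task.Delay(Timeout.Infinite, token); }
        catch (OperationCanceledException) { /* normal shutdown */ }
    }

    // Single teardown block shared by every exit path.
    private void StopAlarmStream()
    {
        lock (_alarmLock)
        {
            _alarmCts?.Cancel();
            _alarmCts?.Dispose();
            _alarmCts = null;
            _alarmSubCount = 0;
        }
    }

    public Task DisconnectAsync()
    {
        StopAlarmStream();
        return Task.CompletedTask;
    }

    public ValueTask DisposeAsync()
    {
        StopAlarmStream(); // same teardown as DisconnectAsync, so neither path leaks
        return default;
    }
}
```

Disposing an instance without calling `DisconnectAsync` first still cancels the background task, and calling both in either order is harmless because the helper tolerates a null `_alarmCts`.
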
**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `MxGatewayDataConnection.DisposeAsync` now cancels and disposes `_alarmCts` under `_alarmLock`, matching `DisconnectAsync`, so the alarm task and CTS no longer leak on failover/stop.

### DataConnectionLayer-028 — `_alarmSourceFilter` "first subscriber wins" comment contradicts the last-wins code

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:102-103,1758-1761` |

**Description**

The XML doc on the `_alarmSourceFilter` field (lines 102-103) reads "sourceReference →
raw condition filter string passed to the adapter (**first subscriber wins**)." The
code does the opposite: `HandleSubscribeAlarms` **unconditionally** overwrites the
filter on every subscribe: `_alarmSourceFilter[request.SourceReference] =
request.ConditionFilter;` (line 1758), and likewise re-parses
`_alarmSourceFilterPredicate[request.SourceReference] =
AlarmConditionFilter.Parse(request.ConditionFilter)` (line 1761), with no
"already present" guard. So a **second** subscriber to the same source reference
re-filters the shared feed with its own condition filter, i.e. **last subscriber
wins**, not first. The comment will mislead a maintainer reasoning about which filter
governs a shared alarm feed when two instances subscribe to the same source with
different condition filters, and it understates a real behavioural subtlety (the second
subscriber silently changes the gate applied to the first subscriber's transitions in
`HandleAlarmTransitionReceived`).

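The difference between the two policies comes down to how the dictionary entry is written. The following sketch contrasts them using a plain `Dictionary<string, string>`; `FilterAssignmentDemo` and its methods are hypothetical helpers for illustration, not part of the module.

```csharp
using System.Collections.Generic;

public static class FilterAssignmentDemo
{
    // Indexer assignment overwrites, so the most recent subscriber's filter is kept.
    // Throws KeyNotFoundException if the sequence is empty.
    public static string LastWins(string source, IEnumerable<string> filtersInArrivalOrder)
    {
        var map = new Dictionary<string, string>();
        foreach (var filter in filtersInArrivalOrder)
            map[source] = filter;
        return map[source];
    }

    // TryAdd does nothing when the key already exists, so the first filter is kept.
    // Throws KeyNotFoundException if the sequence is empty.
    public static string FirstWins(string source, IEnumerable<string> filtersInArrivalOrder)
    {
        var map = new Dictionary<string, string>();
        foreach (var filter in filtersInArrivalOrder)
            map.TryAdd(source, filter);
        return map[source];
    }
}
```

Given two subscribers that arrive with filters `"a"` and then `"b"`, `LastWins` returns `"b"` and `FirstWins` returns `"a"`; the unconditional indexer assignment described above corresponds to the first method.
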
**Recommendation**

Correct the comment to "last subscriber wins" (and note that the shared feed carries a
single filter, so co-subscribers to one source reference must agree on the condition
filter). If per-subscriber filtering is actually intended, make the predicate
per-subscriber rather than per-source so each subscriber's own filter governs only its
own deliveries, but that is a behaviour change, not a comment fix, and should be
decided explicitly.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment was corrected from 'first subscriber wins' to 'last subscriber wins' (and now notes that co-subscribers share the single filter).