review(Driver.Galaxy): re-review at HEAD (1 Low deferred; vendoring findings obsolete)

Re-review at 7286d320. Driver.Galaxy-019 (Low, Open): EnsureSessionIntervalAsync caches
MxaccessFailure as 'interval applied' (sub-optimal cadence until reconnect) — deferred
(sealed gateway session + gateway-contract design call). 001-014 re-verified Resolved;
015-018 (vendoring) now obsolete (libs/ replaced by real MxGateway PackageReferences).
This commit is contained in:
Joseph Doherty
2026-06-19 11:34:34 -04:00
parent ae3f071945
commit db72dd1dca
+111 -3
View File
@@ -4,10 +4,10 @@
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` |
| Reviewer | Claude Code |
| Review date | 2026-05-23 |
| Commit reviewed | `a9be809` |
| Review date | 2026-06-19 |
| Commit reviewed | `7286d320` |
| Status | Reviewed |
| Open findings | 0 |
| Open findings | 1 |
## Checklist coverage
@@ -51,6 +51,54 @@ Correctness (1), OtOpcUa conventions (2), Concurrency (3), Error handling
recorded below (Driver.Galaxy-015..018) — none Critical, none High; two
Medium, two Low.
## Re-review 2026-06-19 (commit `7286d320`)
Heavy churn since `a9be809`: Galaxy completed its transition from the retired
alias/relay machinery to a plain Equipment-kind driver, and gained the
write-through path (`AdviseSupervisory` → raw `Write`), native alarms via the
session-less `GatewayGalaxyAlarmFeed`, the `GalaxyDriverProbe` Test-Connect
preflight (+212 LOC, new), and per-platform health probing. 39 src files,
5305 LOC; 31 test files, 6137 LOC.
**The vendoring is gone.** The biggest structural change retires the entire
`libs/` concern that drove the last re-review: the two vendored binary DLLs
(`MxGateway.Client.dll` / `MxGateway.Contracts.dll`) and the path-based
`<Reference HintPath>` items were replaced by real
`<PackageReference Include="ZB.MOM.WW.MxGateway.Client" />` /
`ZB.MOM.WW.MxGateway.Contracts` NuGet references, and
`Config/GalaxyDriverOptions.cs` moved to the sibling `.Contracts` project (out
of this module's scope). This makes **Driver.Galaxy-015, -016, -017, -018**
(all vendoring-provenance / package-skew / contract-version findings) obsolete
as live concerns — they remain in the history below as closed records and are
NOT re-opened.
All twelve prior code-affecting findings (Driver.Galaxy-001..014) verified
still-resolved against the churned source: the EventPump stream-fault →
supervisor hand-off (`OnEventPumpStreamFault`), the Int64 `DataTypeMap` arm,
`StatusCodeMap.IsSuccess()` semantics + `ToQualityCategoryByte` helper,
deterministic earliest-handle alarm-owner selection, `IAsyncDisposable`,
`ReplayAsync` rebind + EventPump restart, `StartDeployWatcher` fault
observation + single `_ownedRepositoryClient`, the `ResolveApiKey`
`env:`/`file:`/`dev:`/warn-on-literal arms, the live `GetMemoryFootprint`
estimate, the O(1) `SubscriptionRegistry` indices, and the `ReinitializeAsync`
equivalent-config gate all survive intact.
One new finding (Driver.Galaxy-019, Low, Open-deferred). No Critical / High /
Medium new findings. Category results:
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No new issues found |
| 2 | OtOpcUa conventions | No new issues found |
| 3 | Concurrency & thread safety | No new issues found |
| 4 | Error handling & resilience | Driver.Galaxy-019 |
| 5 | Security | No new issues found (vendoring retired; no secret logging; `ResolveApiKey` warns on cleartext) |
| 6 | Performance & resource management | No new issues found |
| 7 | Design-document adherence | No new issues found |
| 8 | Code organization & conventions | No new issues found |
| 9 | Testing coverage | No new issues found (31 suites; `GatewayGalaxySubscriber` untestable per Driver.Galaxy-019) |
| 10 | Documentation & comments | No new issues found |
## Findings
### Driver.Galaxy-001
@@ -385,3 +433,63 @@ recommendation would be cosmetic without changing behaviour. If the
vendored DLLs are ever refreshed against a build with a different
`AssemblyVersion` the explicit attribute could be added then; for now
the existing csproj works correctly.
### Driver.Galaxy-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `Runtime/GatewayGalaxySubscriber.cs:89-100` |
| Status | Open |
**Description:** `GatewayGalaxySubscriber.EnsureSessionIntervalAsync` applies
the session-level `SetBufferedUpdateInterval` command and then caches the
requested interval in `_lastAppliedIntervalMs` so later `SubscribeBulk` calls
skip a redundant set. The gate is
`if (reply.ProtocolStatus?.Code is not (ProtocolStatusCode.Ok or ProtocolStatusCode.MxaccessFailure)) return;`
— i.e. it early-returns (no cache) for unexpected protocol codes, but for both
`Ok` **and** `MxaccessFailure` it falls through and records the interval as
applied. `MxaccessFailure` means the gateway reached MXAccess but the COM-side
`SetBufferedUpdateInterval` failed, so the requested cadence was **not** applied
— yet caching it as applied means every subsequent subscribe permanently skips
the retry and the session stays at the gateway's default cadence until a
reconnect clears the cache. A transient MXAccess hiccup during the first
buffered subscribe thus silently pins the wrong publish rate for the session's
life. Two secondary issues compound it: (a) the inline comment claims "the
warning here gives ops the signal," but no warning is logged on the soft-failure
path (and the class holds no logger at all), so the only signal is the PR-6.1
trace span; (b) the early-return (truly-unexpected code) path also caches
nothing AND logs nothing, so an operator has no log-level evidence the cadence
was rejected.
This is bounded in impact — buffered cadence is a performance knob, not a
correctness one; data still flows at the gateway default, and a reconnect resets
the cache (the writer's `InvalidateHandleCaches` clears writer state, and a fresh
`GatewayGalaxySubscriber`/session is built on reopen). Hence Low.
**Recommendation:** Treat `MxaccessFailure` as a non-cacheable soft failure —
only set `_lastAppliedIntervalMs` when `ProtocolStatus.Code == Ok` so a failed
MXAccess apply is retried on the next subscribe. Give the subscriber an
`ILogger` (it is constructed in `BuildProductionRuntimeAsync`, which already has
`_logger`) and emit a `Warning` on both the soft-failure and unexpected-code
paths so the comment's promised signal actually exists. To make this testable,
extract the reply→outcome classification into a pure internal helper
(`ClassifyIntervalReply(ProtocolStatusCode?)`) so it can be unit-tested without
the sealed, internal-ctor `MxGatewaySession``GatewayGalaxySubscriber`
currently has zero unit tests for exactly this reason.
**Resolution:** Deferred 2026-06-19 — the fix is small but cannot be landed
with a real TDD red→green cycle in this module: the behaviour lives entirely
behind the sealed `MxGatewaySession` (no public/internal ctor; the same
constraint that forced every other gw call onto an injectable `IGalaxy*` seam),
so there is no way to drive a synthetic `MxCommandReply` through
`EnsureSessionIntervalAsync` without first refactoring out the pure-classifier
helper the recommendation calls for. That refactor plus giving the subscriber a
logger is a low-risk but non-trivial change to the production gw session path,
and the underlying intent (is caching on `MxaccessFailure` deliberate "the
gateway processed it, don't retry" behaviour, or an oversight?) is a design
question better confirmed against the gateway contract than guessed at in a
review sweep. Tracked for a follow-up that lands the classifier extraction +
its unit tests together. Impact is bounded to a sub-optimal (not broken)
publish cadence that self-heals on reconnect.