Files
lmxopcua/code-reviews/Driver.Galaxy/findings.md
Joseph Doherty ecc91b0e48 fix(driver-galaxy): resolve Medium code-review finding (Driver.Galaxy-011)
GetMemoryFootprint() returned a constant 0 with a stale "PR 4.4 sets this" comment
even though PR 4.4 shipped the SubscriptionRegistry. Replace with a live estimate:
64 bytes × TrackedItemHandleCount + 256 bytes × TrackedSubscriptionCount. A 50k-tag
set now registers ~3 MB with the server's cache-flush heuristic instead of being
invisible. Returns 0 when no subscriptions are active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:48:37 -04:00

238 lines
22 KiB
Markdown

# Code Review — Driver.Galaxy
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 11 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Galaxy-001, Driver.Galaxy-002, Driver.Galaxy-003, Driver.Galaxy-004 |
| 2 | OtOpcUa conventions | Driver.Galaxy-005 |
| 3 | Concurrency & thread safety | Driver.Galaxy-006, Driver.Galaxy-007 |
| 4 | Error handling & resilience | Driver.Galaxy-001, Driver.Galaxy-008, Driver.Galaxy-009 |
| 5 | Security | Driver.Galaxy-010 |
| 6 | Performance & resource management | Driver.Galaxy-011, Driver.Galaxy-012 |
| 7 | Design-document adherence | Driver.Galaxy-013 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Galaxy-014 |
| 10 | Documentation & comments | Driver.Galaxy-005, Driver.Galaxy-013 |
## Findings
### Driver.Galaxy-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Error handling & resilience |
| Location | `Runtime/EventPump.cs:128`, `GalaxyDriver.cs:222` |
| Status | Resolved |
**Description:** The `ReconnectSupervisor` is constructed in `BuildProductionRuntimeAsync` and exposes `ReportTransportFailure(Exception)` as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls `ReportTransportFailure` (a repo-wide search finds only the declaration). When the gateway `StreamEvents` stream faults, `EventPump.RunAsync` catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits — but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and `GetHealth()` keeps reporting `Healthy` because `_supervisor.IsDegraded` stays false. This is a production outage with no self-recovery.
**Recommendation:** Wire the EventPump (and any gw RPC that observes a transport fault) to call `_supervisor.ReportTransportFailure(ex)`. The simplest path: give `EventPump` a fault callback (or expose a `StreamFaulted` event) that `GalaxyDriver` subscribes to and forwards to the supervisor. The supervisor's `ReopenAsync`/`ReplayAsync` must also restart the EventPump itself (see Driver.Galaxy-008).
**Resolution:** Resolved 2026-05-22 — added an optional `onStreamFault` callback to `EventPump`; `RunAsync`'s stream-fault catch block now invokes it, and `GalaxyDriver.EnsureEventPumpStarted` wires it to `OnEventPumpStreamFault` which forwards the cause to `ReconnectSupervisor.ReportTransportFailure`, so a transient gw transport drop now drives reopen → replay. Regression coverage in `EventPumpStreamFaultTests`. Note: the EventPump itself is still not restarted on reconnect — that pump-restart gap remains tracked under Driver.Galaxy-008.
### Driver.Galaxy-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `Browse/DataTypeMap.cs:13`, `Runtime/MxValueDecoder.cs:9` |
| Status | Resolved |
**Description:** `DataTypeMap.Map` maps Galaxy `mx_data_type` codes to six `DriverDataType` values (Boolean, Int32, Float32, Float64, String, DateTime) — there is no `Int64` arm. Yet `MxValueDecoder` and `MxValueEncoder` both fully support Int64 (`MxValue.Int64Value`, `Int64Array`), and the decoder's own XML doc claims "the seven Galaxy data types ... (Boolean, Int32, Int64, Float32, Float64, String, DateTime)". Any Galaxy attribute whose `mx_data_type` is the Int64 code (or any code > 5) falls through the `_ => DriverDataType.String` default. The address-space node is then created as a `String` variable while runtime reads decode an `Int64` boxed value — a type mismatch that produces wrong OPC UA `DataType`/`ValueRank` metadata and likely fails value coercion at the server node layer.
**Recommendation:** Confirm the Galaxy `mx_data_type` integer code for 64-bit integers and add the explicit arm to `DataTypeMap.Map`. If the wire format genuinely has no Int64 type, correct the `MxValueDecoder`/`MxValueEncoder` doc comments instead. Either way the encoder/decoder and the type map must agree.
**Resolution:** Resolved 2026-05-22 — added `6 => DriverDataType.Int64` to `DataTypeMap.Map`, extending the contiguous 0..5 scheme so the type map covers the same seven Galaxy data types `MxValueDecoder`/`MxValueEncoder` already decode/encode; Int64 attributes now build as Int64 nodes instead of falling through to the String default. Regression coverage in `DataTypeMapTests`.
### Driver.Galaxy-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `Runtime/StatusCodeMap.cs:86` |
| Status | Resolved |
**Description:** `FromMxStatus` returns `Good` whenever `status.Success != 0`. The intent (per the surrounding comment "Honors the success flag") is that a non-zero `Success` means success. But if `MxStatusProxy.Success` is itself a native HRESULT/return code rather than a boolean-as-int, then `Success != 0` is exactly the failure condition and the mapper inverts it — every failed write/read would report `Good`. The field name is ambiguous and the rest of the file (`Detail`, `RawDetectedBy`, and `Hresult` used elsewhere) treats `0` as success. `GatewayGalaxyAlarmAcknowledger.cs:62` uses the opposite convention for the sibling field (`reply.Hresult != 0` means failure).
**Recommendation:** Verify the semantics of `MxStatusProxy.Success` against the gateway proto contract. If it is a success-boolean encoded as int, add a code comment pinning that; if it is an HRESULT, invert the check to `status.Success == 0 => Good`.
**Resolution:** Resolved 2026-05-22 — replaced `status.Success != 0` with `status.IsSuccess()` (the `MxStatusProxyExtensions` helper that checks both `success != 0` AND `category == Ok`); the proto contract explicitly documents that `success` is not a boolean and that clients must branch on `category`. Regression coverage updated in `StatusCodeMapTests` with a `SuccessNonZeroButCategoryNotOk_IsNotGood` assertion pinning the fix.
### Driver.Galaxy-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `GalaxyDriver.cs:901` |
| Status | Resolved |
**Description:** `OnPumpDataChange` reconstructs a raw OPC DA quality byte from an OPC UA `StatusCode` for the probe watcher: it shifts `StatusCode >> 30` and maps `0->192, 1->64, _->0`. The `StatusCode` was itself produced upstream by `StatusCodeMap.FromQualityByte`/`FromMxStatus`, so this is a lossy round-trip — it collapses every specific code back to the three category bytes (192/64/0). That happens to satisfy `PerPlatformProbeWatcher.DecodeState` (which only checks `qualityByte < 192`), so the bug is currently benign, but the mapping is fragile and undocumented except for one inline comment. A future edit to the `StatusCodeMap` constants or to the shift width would silently desync the probe-health decode with no test guarding it.
**Recommendation:** Route the probe path off the original quality information rather than reverse-engineering it from a `StatusCode`. Either carry the raw quality byte on `DataValueSnapshot`, or add a `StatusCodeMap.ToQualityCategoryByte(uint)` helper with unit tests so the mapping lives in one place next to its inverse.
**Resolution:** Resolved 2026-05-22 — added `StatusCodeMap.ToQualityCategoryByte(uint)` helper that extracts top-two bits of the OPC UA StatusCode into the OPC DA category byte (Good=192, Uncertain=64, Bad=0); `GalaxyDriver.OnPumpDataChange` now calls this helper instead of inlining the shift+switch, so the mapping lives next to its inverse. Unit tests in `StatusCodeMapTests` cover all three category buckets and the round-trip invariant.
### Driver.Galaxy-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `Runtime/EventPump.cs:81-88` |
| Status | Open |
**Description:** The `BoundedChannelOptions` comment states "Newest-dropped policy: when full, the producer's TryWrite returns false ... We do this manually rather than relying on `BoundedChannelFullMode.DropWrite`" — but the option is then set to `FullMode = BoundedChannelFullMode.Wait`. With `Wait`, `TryWrite` returning `false` on a full channel is correct behaviour, so the code works, but the comment naming the mode and the actual mode disagree, which is confusing for a maintainer deciding whether the policy is `Wait`, `DropWrite`, or `DropNewest`.
**Recommendation:** Either reword the comment to say "we use `Wait` mode but never call the awaitable `WriteAsync``TryWrite` gives us synchronous newest-dropped semantics", or switch to `BoundedChannelFullMode.DropWrite` and keep the manual drop count. Make the comment and the mode consistent.
**Resolution:** _(open)_
### Driver.Galaxy-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `GalaxyDriver.cs:848-861` |
| Status | Resolved |
**Description:** `OnAlarmFeedTransition` picks the "owner" handle with `_alarmSubscriptions.First()` under `_alarmHandlersLock`. `HashSet<T>.First()` enumeration order is unspecified and unstable across mutations — when multiple alarm subscriptions are active, the handle attached to a given `AlarmEventArgs` can change arbitrarily between transitions. The XML doc acknowledges "we still only fire the event once" but the downstream `AlarmConditionService` correlates transitions to the originating subscription via this handle; a non-deterministic owner can misroute unsubscribe bookkeeping or per-subscription state.
**Recommendation:** If alarm transitions genuinely fan out to all subscriptions, raise `OnAlarmEvent` once per active handle (or document that the handle is a non-correlating sentinel and have the server stop relying on it). If a single owner is required, make the choice deterministic (e.g. the earliest-created handle) and stable.
**Resolution:** Resolved 2026-05-22 — changed `_alarmSubscriptions` from `HashSet<GalaxyAlarmSubscriptionHandle>` to `List<GalaxyAlarmSubscriptionHandle>` so insertion order is preserved; `OnAlarmFeedTransition` now picks `[0]` (earliest-registered handle) instead of `First()` on a HashSet, making the owner selection deterministic and stable across mutations. Server routing uses `SourceNodeId` not the handle, so every active subscriber sees the same transition regardless of which handle is attached.
### Driver.Galaxy-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `GalaxyDriver.cs:937-968` |
| Status | Resolved |
**Description:** `Dispose()` is not synchronized against the capability methods. It sets `_disposed = true` then disposes `_eventPump`, `_alarmFeed`, `_ownedMxSession`, `_ownedMxClient`, `_supervisor`, etc. A concurrent `SubscribeAsync`/`ReadAsync`/`WriteAsync` that passed its `ObjectDisposedException.ThrowIf` check at entry can then dereference `_subscriber`/`_dataWriter` whose backing `GalaxyMxSession` is being disposed mid-call, producing `ObjectDisposedException`/`NullReferenceException` from deep inside the gw client rather than a clean failure. `Dispose` also blocks the caller on `GetAwaiter().GetResult()` of several async disposals, risking a deadlock if invoked from a thread-pool-starved context.
**Recommendation:** Gate capability entry points so they cannot start new gw work once `_disposed` is set (e.g. a `CancellationTokenSource` linked into every call, cancelled first in `Dispose`). Consider implementing `IAsyncDisposable` so the async sub-component disposals do not block on `GetResult()`.
**Resolution:** Resolved 2026-05-22 — added `IAsyncDisposable` to `GalaxyDriver` and implemented `DisposeAsync()` as the primary disposal path that awaits each async sub-component (EventPump, AlarmFeed, MxSession, MxClient, RepositoryClient) without blocking; `Dispose()` delegates to `DisposeAsync().AsTask().GetAwaiter().GetResult()` for `using`-statement compatibility. The sync blocking-on-GetResult anti-pattern in the previous Dispose body is eliminated on the hot path. Note: the `CancellationTokenSource` gate for concurrent capability entry was not added — the existing `ObjectDisposedException.ThrowIf(_disposed, this)` guards at capability entry points already provide the fast-fail, and a separate CTS would add complexity without solving the TOCTOU window noted in the finding; that window is benign in practice (the sub-component's own disposed check catches it).
### Driver.Galaxy-008
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `GalaxyDriver.cs:264-276`, `Runtime/EventPump.cs:97-103` |
| Status | Resolved |
**Description:** Even if Driver.Galaxy-001 is fixed and the supervisor's `ReplayAsync` runs, recovery is incomplete. `ReplayAsync` re-issues `SubscribeBulkAsync` for the tracked tags, but the `EventPump` background loop that consumes `StreamEvents` is not restarted. After a stream fault `EventPump.RunAsync` exits and `_channel` is completed; `EventPump.Start()` is a no-op (`if (_loop is not null) return`) because `_loop` is a completed-but-non-null task. So a replayed subscription has no consumer — values are subscribed on the gw but never reach `OnDataChange`. Additionally `ReplayAsync` never re-registers the new item handles the gw returns into `SubscriptionRegistry`; the old stale item handles remain, so even with a live pump the fan-out reverse-map would miss the post-reconnect handles.
**Recommendation:** On reconnect, dispose and recreate the `EventPump` (or make it restartable), and have `ReplayAsync` update `SubscriptionRegistry` bindings with the new item handles returned by the post-reconnect `SubscribeBulkAsync`. Add an integration/parity test that drops the stream mid-subscription and asserts `OnDataChange` resumes.
**Resolution:** Resolved 2026-05-22 — `ReplayAsync` now calls a new `RestartEventPumpForReplay` (disposes the faulted pump, recreates and restarts a fresh one) and re-issues `SubscribeBulkAsync` per subscription, then `SubscriptionRegistry.Rebind` swaps each subscription's stale pre-reconnect item handles for the post-reconnect handles so the fan-out reverse map dispatches to the live pump. New `SubscriptionRegistry.SnapshotEntries`/`Rebind` APIs back the per-subscription replay. Regression coverage in `SubscriptionRegistryTests` (Rebind/SnapshotEntries) and `EventPumpStreamFaultTests.FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch`.
### Driver.Galaxy-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `GalaxyDriver.cs:354-371` |
| Status | Resolved |
**Description:** `StartDeployWatcher` launches the watch loop with `_ = _deployWatcher.StartAsync(CancellationToken.None)` — a fire-and-forget with a discarded `Task`. `StartAsync` can throw synchronously (`InvalidOperationException` if already started); the discard masks that programming error. Separately, `StartDeployWatcher` builds an `_ownedRepositoryClient` purely for the watcher when discovery has not run yet — if `DiscoverAsync` later runs, `BuildDefaultHierarchySource` overwrites `_ownedRepositoryClient` with a second client, leaking the first (only the latest reference is disposed in `Dispose`).
**Recommendation:** Await `StartAsync` (it completes synchronously after scheduling) or at least observe its result. Reuse a single `GalaxyRepositoryClient` across the deploy watcher and the hierarchy source instead of letting `BuildDefaultHierarchySource` clobber the field — guard the assignment or build the client once in `InitializeAsync`.
**Resolution:** Resolved 2026-05-22 — (a) replaced `_ = _deployWatcher.StartAsync(...)` discard with an explicit variable + `IsFaulted` check so any synchronous throw from `StartAsync` (e.g. called-twice `InvalidOperationException`) propagates rather than being silently swallowed; (b) changed both `StartDeployWatcher` and `BuildDefaultHierarchySource` to use `_ownedRepositoryClient ??=` so a client built by the watcher is reused by discovery instead of being overwritten and leaked — only one `GalaxyRepositoryClient` instance is now created and disposed.
### Driver.Galaxy-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `GalaxyDriver.cs:311-341` |
| Status | Open |
**Description:** `ResolveApiKey` supports an `env:`/`file:` indirection and otherwise treats the config string as the literal API key ("Anything else — used as the literal API key. Convenient for dev"). `GalaxyGatewayOptions`' own XML doc claims "the API key never appears in cleartext config". The literal-key fallback silently permits a plaintext API key in the `DriverConfig` JSON column of the central config DB, contradicting the documented contract. There is no warning logged when the literal path is taken.
**Recommendation:** Log a startup warning when `ResolveApiKey` falls through to the literal arm so an operator who accidentally committed a cleartext key sees it, and update the `GalaxyGatewayOptions` doc comment so it no longer over-promises. Consider gating the literal arm behind an explicit `dev:`-style prefix so a cleartext key cannot be used by accident.
**Resolution:** _(open)_
### Driver.Galaxy-011
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `GalaxyDriver.cs:411` |
| Status | Resolved |
**Description:** `GetMemoryFootprint()` unconditionally returns `0` with a comment "PR 4.4 sets this from SubscriptionRegistry size" — PR 4.4 has shipped (the registry exists and is used) but the method was never updated. `IHostConnectivityProbe.GetMemoryFootprint` is consumed by the server's status/health surface to gauge cache-flush pressure; a constant `0` makes the Galaxy driver invisible to that mechanism, so a 50k-tag subscription set never registers as memory pressure and `FlushOptionalCachesAsync` (also a no-op) is never meaningfully triggered.
**Recommendation:** Return a real estimate derived from `SubscriptionRegistry.TrackedSubscriptionCount`/`TrackedItemHandleCount` (and the EventPump channel occupancy), or document explicitly why the Galaxy driver opts out of footprint reporting. Remove the stale "PR 4.4 sets this" comment.
**Resolution:** Resolved 2026-05-22 — replaced the constant `0` with a live estimate derived from `SubscriptionRegistry.TrackedItemHandleCount` (64 bytes/handle) and `TrackedSubscriptionCount` (256 bytes/subscription); returns 0 when no subscriptions are active and grows with the registry. The stale "PR 4.4 sets this" comment is removed. Regression coverage in `GalaxyDriverInfrastructureTests`.
### Driver.Galaxy-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `Runtime/SubscriptionRegistry.cs:65-67`, `GalaxyDriver.cs:538`, `GalaxyDriver.cs:675` |
| Status | Open |
**Description:** Several hot paths are O(n^2) per call. `SubscriptionRegistry.ResolveSubscribers` does `entry.Bindings.FirstOrDefault(b => b.ItemHandle == itemHandle)` — a linear scan of the whole binding list for every event dispatch; at 50k tags this is 50k-element scans on the 1Hz fan-out path. `GalaxyDriver.SubscribeAsync` and `ReadViaSubscribeOnceAsync` correlate results to references with `results.FirstOrDefault(r => string.Equals(...))` inside a `for` loop over all references — O(n^2) over the subscribe batch. `SubscriptionRegistry.Remove` rebuilds a `ConcurrentBag` from a LINQ filter on every unsubscribe.
**Recommendation:** Index `SubscriptionEntry` bindings by item handle (a `Dictionary<int, string>` per entry) so `ResolveSubscribers` is O(1) per subscriber. Project the `SubscribeResult` list into a `Dictionary<string, SubscribeResult>` (OrdinalIgnoreCase) once before the correlation loop. These matter on the documented 50k-tag soak path.
**Resolution:** _(open)_
### Driver.Galaxy-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `GalaxyDriver.cs:14-27`, `GalaxyDriver.cs:374-382`, `Config/GalaxyDriverOptions.cs:84-86` |
| Status | Open |
**Description:** Multiple doc comments are stale relative to the shipped code. `GalaxyDriver`'s class summary still describes the file as "the project skeleton with `IDriver` bodies that wire to a future `IGalaxyGatewayClient` abstraction. Capability interfaces ... land in PRs 4.1-4.7" and references the legacy `GalaxyProxyDriver` coexisting "until PR 7.2" — but PR 7.2 already deleted the legacy Galaxy projects and the capability interfaces are all implemented. `ReinitializeAsync` is still a stub ("for the skeleton we just refresh health") that ignores `driverConfigJson` entirely — a config reapply silently does nothing. `GalaxyReconnectOptions.ReplayOnSessionLost` is defined and documented but never read anywhere in the driver (`ReplayAsync` always replays).
**Recommendation:** Refresh the `GalaxyDriver` class and `ReinitializeAsync` doc comments to describe the shipped state, implement or explicitly reject `ReinitializeAsync` config reapply, and either honour `ReplayOnSessionLost` or remove it from `GalaxyReconnectOptions`.
**Resolution:** _(open)_
### Driver.Galaxy-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` (module-wide) |
| Status | Open |
**Description:** The reconnect/recovery path is the module's highest-risk surface and is effectively untested at the integration seam. The `ReconnectSupervisor` has a clean test seam (injectable `reopen`/`replay`/`backoffDelay`), but because nothing wires `ReportTransportFailure` (Driver.Galaxy-001) there can be no test asserting that an `EventPump` stream fault actually drives recovery — the gap that would have caught the Critical finding. Similarly there appears to be no test that a post-reconnect `ReplayAsync` re-registers new item handles and that `OnDataChange` resumes (Driver.Galaxy-008). The `StatusCodeMap.FromMxStatus` `Success`-flag semantics (Driver.Galaxy-003) and the `DataTypeMap` Int64 gap (Driver.Galaxy-002) are also the kind of behaviour a focused unit test would pin.
**Recommendation:** Add unit/parity tests covering: (a) stream fault -> supervisor reopen -> EventPump restart -> `OnDataChange` resumes; (b) `ReplayAsync` updates `SubscriptionRegistry` with new handles; (c) `StatusCodeMap.FromMxStatus` for both success and failure `MxStatusProxy` rows; (d) `DataTypeMap` for every Galaxy `mx_data_type` code including 64-bit integer.
**Resolution:** _(open)_