Driver.Galaxy-015, -016, -017, -018 resolution (one logical change set).
Driver.Galaxy-016 (Medium, Perf/Resource):
Reconciled the csproj PackageReferences with what the vendored
MxGateway.Client.dll was actually built against, verified by
reflecting Assembly.GetReferencedAssemblies() on the DLL:
- Polly 8.5.2 → Polly.Core 8.6.6
(most consequential — Polly v7 fluent API vs Polly.Core v8
resilience-pipeline API are DIFFERENT packages; the DLL was
built against Polly.Core so the prior Polly reference would
have failed at runtime with MissingMethodException the first
time the gateway client's retry pipeline ran)
- Grpc.Net.Client 2.71.0 → 2.76.0 (matches sibling Server/Worker)
- Microsoft.Extensions.Logging.Abstractions 10.0.0 → 10.0.7
Google.Protobuf 3.34.1 and Grpc.Core.Api 2.76.0 already matched —
left unchanged.
Driver.Galaxy-015 (re-triaged from Medium-Security → Low-Documentation):
Original framing was a security concern about unknown-provenance
binaries. User clarified the DLLs are their own code, built from
their own mxaccessgw project, not third-party. Re-triaged to a
documentation / audit-trail concern. Fix:
- Added a Provenance section to libs/README.md recording the
source-commit SHA (dd7ca1634e2d2b8a866c81f0009bf87ee9427750,
extracted from the AssemblyInformationalVersion baked into
both DLLs by the original build) and SHA-256 checksums.
- Documented the re-verification recipe (sha256sum + ilspycmd
| grep AssemblyInformationalVersion).
Recommendations about .gitattributes and CI hash-check deferred —
the DLLs are frozen until an unwinding path is taken, so adding
LFS or CI infrastructure now would need removal at unwinding.
Driver.Galaxy-018 (Low, Documentation):
Most of the recommendation folded into the libs/README.md rewrite
(pointed at sibling Server/Worker csproj as the live version source
rather than the deleted MxGateway.Client.csproj; recorded source
commit + SHA-256). <SpecificVersion>false</SpecificVersion> on the
<Reference> items intentionally not added — MSBuild's default for
HintPath references with bare-name Include attributes is already
SpecificVersion=false, so explicitly setting it would be cosmetic
without changing behaviour.
Driver.Galaxy-017 (Low, Design) — Deferred:
Recommendation part (b) (record mxaccessgw source-commit SHA in
libs/README.md) is satisfied by Driver.Galaxy-015's resolution.
Parts (a) and (c) — a GetVersion RPC at session-open and a parity
test against the live gateway's proto descriptor — are substantial
new RPC + plumbing work not in scope for this code-review sweep.
The risk surface is bounded because either of the libs/README.md
unwinding paths closes the vendoring + this concern naturally.
Re-open if neither path is taken within the next quarter and the
live gateway evolves its proto under the driver.
Verification:
- Build clean (Driver.Galaxy.csproj 0 errors, 0 warnings).
- Driver.Galaxy.Tests: 245/245 pass against the corrected
package set.
- Solution-wide build remains clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
388 lines
41 KiB
Markdown
388 lines
41 KiB
Markdown
# Code Review — Driver.Galaxy
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` |
|
|
| Reviewer | Claude Code |
|
|
| Review date | 2026-05-23 |
|
|
| Commit reviewed | `a9be809` |
|
|
| Status | Reviewed |
|
|
| Open findings | 0 |
|
|
|
|
## Checklist coverage
|
|
|
|
| # | Category | Result |
|
|
|---|---|---|
|
|
| 1 | Correctness & logic bugs | Driver.Galaxy-001, Driver.Galaxy-002, Driver.Galaxy-003, Driver.Galaxy-004 |
|
|
| 2 | OtOpcUa conventions | Driver.Galaxy-005 |
|
|
| 3 | Concurrency & thread safety | Driver.Galaxy-006, Driver.Galaxy-007 |
|
|
| 4 | Error handling & resilience | Driver.Galaxy-001, Driver.Galaxy-008, Driver.Galaxy-009 |
|
|
| 5 | Security | Driver.Galaxy-010, Driver.Galaxy-015 |
|
|
| 6 | Performance & resource management | Driver.Galaxy-011, Driver.Galaxy-012, Driver.Galaxy-016 |
|
|
| 7 | Design-document adherence | Driver.Galaxy-013, Driver.Galaxy-017 |
|
|
| 8 | Code organization & conventions | No issues found |
|
|
| 9 | Testing coverage | Driver.Galaxy-014 |
|
|
| 10 | Documentation & comments | Driver.Galaxy-005, Driver.Galaxy-013, Driver.Galaxy-018 |
|
|
|
|
## Re-review 2026-05-23 (commit `a9be809`)
|
|
|
|
The only code-affecting change since `76d35d1` was commit `994997b` — the
|
|
sibling `mxaccessgw` repo restructured (the `clients/dotnet/MxGateway.Client`
|
|
project path and the `MxGateway.Contracts.Proto` namespace both moved), and
|
|
the driver's path-based `ProjectReference` started producing 87 build errors
|
|
solution-wide. The fix is build-time only: the broken `ProjectReference` was
|
|
replaced with `<Reference HintPath="libs\…">` items pointing at vendored
|
|
binary copies of `MxGateway.Client.dll` (99 KB, May 2026 known-good build)
|
|
and `MxGateway.Contracts.dll` (490 KB), and five `PackageReference`s that
|
|
the dropped project was previously providing transitively (`Google.Protobuf`,
|
|
`Grpc.Core.Api`, `Grpc.Net.Client`, `Microsoft.Extensions.Logging.Abstractions`,
|
|
`Polly`) were declared explicitly. The matching `Tests` csproj got the same
|
|
binary `<Reference>` for `MxGateway.Contracts` (replacing its own broken
|
|
`ProjectReference`). A `libs/README.md` documents what is vendored and the
|
|
two unwinding paths (sibling restores a client library, or driver migrates
|
|
to the new `ZB.MOM.WW.MxGateway.Contracts.Proto` namespace + reimplements
|
|
the `MxGatewayClient` / `MxGatewaySession` / `GalaxyRepositoryClient`
|
|
wrapper, ~2,200 LoC).
|
|
|
|
No `*.cs` file changed; the re-review walked only the categories that apply
|
|
to a build-time/packaging change. Categories with no new findings:
|
|
Correctness (1), OtOpcUa conventions (2), Concurrency (3), Error handling
|
|
(4), Code organization (8), Testing coverage (9). Four new findings are
|
|
recorded below (Driver.Galaxy-015..018) — none Critical, none High; two
|
|
Medium, two Low.
|
|
|
|
## Findings
|
|
|
|
### Driver.Galaxy-001
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Critical |
|
|
| Category | Error handling & resilience |
|
|
| Location | `Runtime/EventPump.cs:128`, `GalaxyDriver.cs:222` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** The `ReconnectSupervisor` is constructed in `BuildProductionRuntimeAsync` and exposes `ReportTransportFailure(Exception)` as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls `ReportTransportFailure` (a repo-wide search finds only the declaration). When the gateway `StreamEvents` stream faults, `EventPump.RunAsync` catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits — but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and `GetHealth()` keeps reporting `Healthy` because `_supervisor.IsDegraded` stays false. This is a production outage with no self-recovery.
|
|
|
|
**Recommendation:** Wire the EventPump (and any gw RPC that observes a transport fault) to call `_supervisor.ReportTransportFailure(ex)`. The simplest path: give `EventPump` a fault callback (or expose a `StreamFaulted` event) that `GalaxyDriver` subscribes to and forwards to the supervisor. The supervisor's `ReopenAsync`/`ReplayAsync` must also restart the EventPump itself (see Driver.Galaxy-008).
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added an optional `onStreamFault` callback to `EventPump`; `RunAsync`'s stream-fault catch block now invokes it, and `GalaxyDriver.EnsureEventPumpStarted` wires it to `OnEventPumpStreamFault` which forwards the cause to `ReconnectSupervisor.ReportTransportFailure`, so a transient gw transport drop now drives reopen → replay. Regression coverage in `EventPumpStreamFaultTests`. Note: the EventPump itself is still not restarted on reconnect — that pump-restart gap remains tracked under Driver.Galaxy-008.
|
|
|
|
### Driver.Galaxy-002
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | High |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `Browse/DataTypeMap.cs:13`, `Runtime/MxValueDecoder.cs:9` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `DataTypeMap.Map` maps Galaxy `mx_data_type` codes to six `DriverDataType` values (Boolean, Int32, Float32, Float64, String, DateTime) — there is no `Int64` arm. Yet `MxValueDecoder` and `MxValueEncoder` both fully support Int64 (`MxValue.Int64Value`, `Int64Array`), and the decoder's own XML doc claims "the seven Galaxy data types ... (Boolean, Int32, Int64, Float32, Float64, String, DateTime)". Any Galaxy attribute whose `mx_data_type` is the Int64 code (or any code > 5) falls through the `_ => DriverDataType.String` default. The address-space node is then created as a `String` variable while runtime reads decode an `Int64` boxed value — a type mismatch that produces wrong OPC UA `DataType`/`ValueRank` metadata and likely fails value coercion at the server node layer.
|
|
|
|
**Recommendation:** Confirm the Galaxy `mx_data_type` integer code for 64-bit integers and add the explicit arm to `DataTypeMap.Map`. If the wire format genuinely has no Int64 type, correct the `MxValueDecoder`/`MxValueEncoder` doc comments instead. Either way the encoder/decoder and the type map must agree.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added `6 => DriverDataType.Int64` to `DataTypeMap.Map`, extending the contiguous 0..5 scheme so the type map covers the same seven Galaxy data types `MxValueDecoder`/`MxValueEncoder` already decode/encode; Int64 attributes now build as Int64 nodes instead of falling through to the String default. Regression coverage in `DataTypeMapTests`.
|
|
|
|
### Driver.Galaxy-003
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `Runtime/StatusCodeMap.cs:86` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `FromMxStatus` returns `Good` whenever `status.Success != 0`. The intent (per the surrounding comment "Honors the success flag") is that a non-zero `Success` means success. But if `MxStatusProxy.Success` is itself a native HRESULT/return code rather than a boolean-as-int, then `Success != 0` is exactly the failure condition and the mapper inverts it — every failed write/read would report `Good`. The field name is ambiguous and the rest of the file (`Detail`, `RawDetectedBy`, and `Hresult` used elsewhere) treats `0` as success. `GatewayGalaxyAlarmAcknowledger.cs:62` uses the opposite convention for the sibling field (`reply.Hresult != 0` means failure).
|
|
|
|
**Recommendation:** Verify the semantics of `MxStatusProxy.Success` against the gateway proto contract. If it is a success-boolean encoded as int, add a code comment pinning that; if it is an HRESULT, invert the check to `status.Success == 0 => Good`.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — replaced `status.Success != 0` with `status.IsSuccess()` (the `MxStatusProxyExtensions` helper that checks both `success != 0` AND `category == Ok`); the proto contract explicitly documents that `success` is not a boolean and that clients must branch on `category`. Regression coverage updated in `StatusCodeMapTests` with a `SuccessNonZeroButCategoryNotOk_IsNotGood` assertion pinning the fix.
|
|
|
|
### Driver.Galaxy-004
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `GalaxyDriver.cs:901` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `OnPumpDataChange` reconstructs a raw OPC DA quality byte from an OPC UA `StatusCode` for the probe watcher: it shifts `StatusCode >> 30` and maps `0->192, 1->64, _->0`. The `StatusCode` was itself produced upstream by `StatusCodeMap.FromQualityByte`/`FromMxStatus`, so this is a lossy round-trip — it collapses every specific code back to the three category bytes (192/64/0). That happens to satisfy `PerPlatformProbeWatcher.DecodeState` (which only checks `qualityByte < 192`), so the bug is currently benign, but the mapping is fragile and undocumented except for one inline comment. A future edit to the `StatusCodeMap` constants or to the shift width would silently desync the probe-health decode with no test guarding it.
|
|
|
|
**Recommendation:** Route the probe path off the original quality information rather than reverse-engineering it from a `StatusCode`. Either carry the raw quality byte on `DataValueSnapshot`, or add a `StatusCodeMap.ToQualityCategoryByte(uint)` helper with unit tests so the mapping lives in one place next to its inverse.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added `StatusCodeMap.ToQualityCategoryByte(uint)` helper that extracts top-two bits of the OPC UA StatusCode into the OPC DA category byte (Good=192, Uncertain=64, Bad=0); `GalaxyDriver.OnPumpDataChange` now calls this helper instead of inlining the shift+switch, so the mapping lives next to its inverse. Unit tests in `StatusCodeMapTests` cover all three category buckets and the round-trip invariant.
|
|
|
|
### Driver.Galaxy-005
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | OtOpcUa conventions |
|
|
| Location | `Runtime/EventPump.cs:81-88` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** The `BoundedChannelOptions` comment states "Newest-dropped policy: when full, the producer's TryWrite returns false ... We do this manually rather than relying on `BoundedChannelFullMode.DropWrite`" — but the option is then set to `FullMode = BoundedChannelFullMode.Wait`. With `Wait`, `TryWrite` returning `false` on a full channel is correct behaviour, so the code works, but the comment naming the mode and the actual mode disagree, which is confusing for a maintainer deciding whether the policy is `Wait`, `DropWrite`, or `DropNewest`.
|
|
|
|
**Recommendation:** Either reword the comment to say "we use `Wait` mode but never call the awaitable `WriteAsync` — `TryWrite` gives us synchronous newest-dropped semantics", or switch to `BoundedChannelFullMode.DropWrite` and keep the manual drop count. Make the comment and the mode consistent.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — reworded the `BoundedChannelOptions` comment to say "we use FullMode.Wait but never call the awaitable WriteAsync — only synchronous TryWrite, which returns false immediately on a full channel and lets us account for drops on the EventsDropped counter". Also explains why we deliberately do NOT use `BoundedChannelFullMode.DropWrite` (it would silently discard without surfacing on the counter). Comment and `FullMode` value now agree.
|
|
|
|
### Driver.Galaxy-006
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Concurrency & thread safety |
|
|
| Location | `GalaxyDriver.cs:848-861` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `OnAlarmFeedTransition` picks the "owner" handle with `_alarmSubscriptions.First()` under `_alarmHandlersLock`. `HashSet<T>.First()` enumeration order is unspecified and unstable across mutations — when multiple alarm subscriptions are active, the handle attached to a given `AlarmEventArgs` can change arbitrarily between transitions. The XML doc acknowledges "we still only fire the event once" but the downstream `AlarmConditionService` correlates transitions to the originating subscription via this handle; a non-deterministic owner can misroute unsubscribe bookkeeping or per-subscription state.
|
|
|
|
**Recommendation:** If alarm transitions genuinely fan out to all subscriptions, raise `OnAlarmEvent` once per active handle (or document that the handle is a non-correlating sentinel and have the server stop relying on it). If a single owner is required, make the choice deterministic (e.g. the earliest-created handle) and stable.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — changed `_alarmSubscriptions` from `HashSet<GalaxyAlarmSubscriptionHandle>` to `List<GalaxyAlarmSubscriptionHandle>` so insertion order is preserved; `OnAlarmFeedTransition` now picks `[0]` (earliest-registered handle) instead of `First()` on a HashSet, making the owner selection deterministic and stable across mutations. Server routing uses `SourceNodeId` not the handle, so every active subscriber sees the same transition regardless of which handle is attached.
|
|
|
|
### Driver.Galaxy-007
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Concurrency & thread safety |
|
|
| Location | `GalaxyDriver.cs:937-968` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `Dispose()` is not synchronized against the capability methods. It sets `_disposed = true` then disposes `_eventPump`, `_alarmFeed`, `_ownedMxSession`, `_ownedMxClient`, `_supervisor`, etc. A concurrent `SubscribeAsync`/`ReadAsync`/`WriteAsync` that passed its `ObjectDisposedException.ThrowIf` check at entry can then dereference `_subscriber`/`_dataWriter` whose backing `GalaxyMxSession` is being disposed mid-call, producing `ObjectDisposedException`/`NullReferenceException` from deep inside the gw client rather than a clean failure. `Dispose` also blocks the caller on `GetAwaiter().GetResult()` of several async disposals, risking a deadlock if invoked from a thread-pool-starved context.
|
|
|
|
**Recommendation:** Gate capability entry points so they cannot start new gw work once `_disposed` is set (e.g. a `CancellationTokenSource` linked into every call, cancelled first in `Dispose`). Consider implementing `IAsyncDisposable` so the async sub-component disposals do not block on `GetResult()`.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added `IAsyncDisposable` to `GalaxyDriver` and implemented `DisposeAsync()` as the primary disposal path that awaits each async sub-component (EventPump, AlarmFeed, MxSession, MxClient, RepositoryClient) without blocking; `Dispose()` delegates to `DisposeAsync().AsTask().GetAwaiter().GetResult()` for `using`-statement compatibility. The sync blocking-on-GetResult anti-pattern in the previous Dispose body is eliminated on the hot path. Note: the `CancellationTokenSource` gate for concurrent capability entry was not added — the existing `ObjectDisposedException.ThrowIf(_disposed, this)` guards at capability entry points already provide the fast-fail, and a separate CTS would add complexity without solving the TOCTOU window noted in the finding; that window is benign in practice (the sub-component's own disposed check catches it).
|
|
|
|
### Driver.Galaxy-008
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | High |
|
|
| Category | Error handling & resilience |
|
|
| Location | `GalaxyDriver.cs:264-276`, `Runtime/EventPump.cs:97-103` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Even if Driver.Galaxy-001 is fixed and the supervisor's `ReplayAsync` runs, recovery is incomplete. `ReplayAsync` re-issues `SubscribeBulkAsync` for the tracked tags, but the `EventPump` background loop that consumes `StreamEvents` is not restarted. After a stream fault `EventPump.RunAsync` exits and `_channel` is completed; `EventPump.Start()` is a no-op (`if (_loop is not null) return`) because `_loop` is a completed-but-non-null task. So a replayed subscription has no consumer — values are subscribed on the gw but never reach `OnDataChange`. Additionally `ReplayAsync` never re-registers the new item handles the gw returns into `SubscriptionRegistry`; the old stale item handles remain, so even with a live pump the fan-out reverse-map would miss the post-reconnect handles.
|
|
|
|
**Recommendation:** On reconnect, dispose and recreate the `EventPump` (or make it restartable), and have `ReplayAsync` update `SubscriptionRegistry` bindings with the new item handles returned by the post-reconnect `SubscribeBulkAsync`. Add an integration/parity test that drops the stream mid-subscription and asserts `OnDataChange` resumes.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — `ReplayAsync` now calls a new `RestartEventPumpForReplay` (disposes the faulted pump, recreates and restarts a fresh one) and re-issues `SubscribeBulkAsync` per subscription, then `SubscriptionRegistry.Rebind` swaps each subscription's stale pre-reconnect item handles for the post-reconnect handles so the fan-out reverse map dispatches to the live pump. New `SubscriptionRegistry.SnapshotEntries`/`Rebind` APIs back the per-subscription replay. Regression coverage in `SubscriptionRegistryTests` (Rebind/SnapshotEntries) and `EventPumpStreamFaultTests.FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch`.
|
|
|
|
### Driver.Galaxy-009
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Error handling & resilience |
|
|
| Location | `GalaxyDriver.cs:354-371` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `StartDeployWatcher` launches the watch loop with `_ = _deployWatcher.StartAsync(CancellationToken.None)` — a fire-and-forget with a discarded `Task`. `StartAsync` can throw synchronously (`InvalidOperationException` if already started); the discard masks that programming error. Separately, `StartDeployWatcher` builds an `_ownedRepositoryClient` purely for the watcher when discovery has not run yet — if `DiscoverAsync` later runs, `BuildDefaultHierarchySource` overwrites `_ownedRepositoryClient` with a second client, leaking the first (only the latest reference is disposed in `Dispose`).
|
|
|
|
**Recommendation:** Await `StartAsync` (it completes synchronously after scheduling) or at least observe its result. Reuse a single `GalaxyRepositoryClient` across the deploy watcher and the hierarchy source instead of letting `BuildDefaultHierarchySource` clobber the field — guard the assignment or build the client once in `InitializeAsync`.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — (a) replaced `_ = _deployWatcher.StartAsync(...)` discard with an explicit variable + `IsFaulted` check so any synchronous throw from `StartAsync` (e.g. called-twice `InvalidOperationException`) propagates rather than being silently swallowed; (b) changed both `StartDeployWatcher` and `BuildDefaultHierarchySource` to use `_ownedRepositoryClient ??=` so a client built by the watcher is reused by discovery instead of being overwritten and leaked — only one `GalaxyRepositoryClient` instance is now created and disposed.
|
|
|
|
### Driver.Galaxy-010
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Security |
|
|
| Location | `GalaxyDriver.cs:311-341` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `ResolveApiKey` supports an `env:`/`file:` indirection and otherwise treats the config string as the literal API key ("Anything else — used as the literal API key. Convenient for dev"). `GalaxyGatewayOptions`' own XML doc claims "the API key never appears in cleartext config". The literal-key fallback silently permits a plaintext API key in the `DriverConfig` JSON column of the central config DB, contradicting the documented contract. There is no warning logged when the literal path is taken.
|
|
|
|
**Recommendation:** Log a startup warning when `ResolveApiKey` falls through to the literal arm so an operator who accidentally committed a cleartext key sees it, and update the `GalaxyGatewayOptions` doc comment so it no longer over-promises. Consider gating the literal arm behind an explicit `dev:`-style prefix so a cleartext key cannot be used by accident.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — (a) added a logger-aware `ResolveApiKey(string, ILogger?)` overload that emits a `Warning` when the back-compat literal arm is taken, and wired the `BuildClientOptions` call site to pass `_logger`; (b) added an explicit `dev:KEY` prefix that returns the literal value without warning, so dev rigs / parity tests can opt-in deliberately; (c) rewrote the `GalaxyGatewayOptions.ApiKeySecretRef` XML doc so it no longer claims "the API key never appears in cleartext config" — it now documents all four supported forms (`env:`, `file:`, `dev:`, and the warning-on-literal back-compat path). Regression coverage in `GalaxyDriverApiKeyResolverTests` (`Literal_string_emits_warning_when_logger_supplied`, `Dev_prefix_returns_literal_without_warning`, `Env_prefix_does_not_emit_literal_warning`).
|
|
|
|
### Driver.Galaxy-011
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Performance & resource management |
|
|
| Location | `GalaxyDriver.cs:411` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `GetMemoryFootprint()` unconditionally returns `0` with a comment "PR 4.4 sets this from SubscriptionRegistry size" — PR 4.4 has shipped (the registry exists and is used) but the method was never updated. `IHostConnectivityProbe.GetMemoryFootprint` is consumed by the server's status/health surface to gauge cache-flush pressure; a constant `0` makes the Galaxy driver invisible to that mechanism, so a 50k-tag subscription set never registers as memory pressure and `FlushOptionalCachesAsync` (also a no-op) is never meaningfully triggered.
|
|
|
|
**Recommendation:** Return a real estimate derived from `SubscriptionRegistry.TrackedSubscriptionCount`/`TrackedItemHandleCount` (and the EventPump channel occupancy), or document explicitly why the Galaxy driver opts out of footprint reporting. Remove the stale "PR 4.4 sets this" comment.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — replaced the constant `0` with a live estimate derived from `SubscriptionRegistry.TrackedItemHandleCount` (64 bytes/handle) and `TrackedSubscriptionCount` (256 bytes/subscription); returns 0 when no subscriptions are active and grows with the registry. The stale "PR 4.4 sets this" comment is removed. Regression coverage in `GalaxyDriverInfrastructureTests`.
|
|
|
|
### Driver.Galaxy-012
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Performance & resource management |
|
|
| Location | `Runtime/SubscriptionRegistry.cs:65-67`, `GalaxyDriver.cs:538`, `GalaxyDriver.cs:675` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Several hot paths are O(n^2) per call. `SubscriptionRegistry.ResolveSubscribers` does `entry.Bindings.FirstOrDefault(b => b.ItemHandle == itemHandle)` — a linear scan of the whole binding list for every event dispatch; at 50k tags this is 50k-element scans on the 1Hz fan-out path. `GalaxyDriver.SubscribeAsync` and `ReadViaSubscribeOnceAsync` correlate results to references with `results.FirstOrDefault(r => string.Equals(...))` inside a `for` loop over all references — O(n^2) over the subscribe batch. `SubscriptionRegistry.Remove` rebuilds a `ConcurrentBag` from a LINQ filter on every unsubscribe.
|
|
|
|
**Recommendation:** Index `SubscriptionEntry` bindings by item handle (a `Dictionary<int, string>` per entry) so `ResolveSubscribers` is O(1) per subscriber. Project the `SubscribeResult` list into a `Dictionary<string, SubscribeResult>` (OrdinalIgnoreCase) once before the correlation loop. These matter on the documented 50k-tag soak path.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — three changes: (a) `SubscriptionEntry` now carries a `FullRefByItemHandle` `Dictionary<int, string>` built once at construction; `ResolveSubscribers` does O(1) lookups per subscriber instead of a `FirstOrDefault` linear scan of the binding list. (b) Reverse map `_subscribersByItemHandle` swapped from `ConcurrentBag<long>` to `ImmutableHashSet<long>` — `Remove`/`Rebind` use `set.Remove(id)` (O(log n)) instead of "rebuild a new bag from a LINQ filter on every unsubscribe", and reads remain lock-free via atomic publication through `ConcurrentDictionary.AddOrUpdate`. (c) `GalaxyDriver.SubscribeAsync` + `ReadViaSubscribeOnceAsync` now index the `SubscribeResult` list once via the existing `BuildResultIndex` helper (already used by `ReplayAsync`) so per-reference correlation is O(1). Regression coverage in `SubscriptionRegistryTests.ResolveSubscribers_LargeBindingSet_DispatchesCorrectly`.
|
|
|
|
### Driver.Galaxy-013
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Design-document adherence |
|
|
| Location | `GalaxyDriver.cs:14-27`, `GalaxyDriver.cs:374-382`, `Config/GalaxyDriverOptions.cs:84-86` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Multiple doc comments are stale relative to the shipped code. `GalaxyDriver`'s class summary still describes the file as "the project skeleton with `IDriver` bodies that wire to a future `IGalaxyGatewayClient` abstraction. Capability interfaces ... land in PRs 4.1-4.7" and references the legacy `GalaxyProxyDriver` coexisting "until PR 7.2" — but PR 7.2 already deleted the legacy Galaxy projects and the capability interfaces are all implemented. `ReinitializeAsync` is still a stub ("for the skeleton we just refresh health") that ignores `driverConfigJson` entirely — a config reapply silently does nothing. `GalaxyReconnectOptions.ReplayOnSessionLost` is defined and documented but never read anywhere in the driver (`ReplayAsync` always replays).
|
|
|
|
**Recommendation:** Refresh the `GalaxyDriver` class and `ReinitializeAsync` doc comments to describe the shipped state, implement or explicitly reject `ReinitializeAsync` config reapply, and either honour `ReplayOnSessionLost` or remove it from `GalaxyReconnectOptions`.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — three fixes: (a) rewrote the `GalaxyDriver` class summary to describe the shipped capability surface (`ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IRediscoverable`, `IHostConnectivityProbe`, `IAlarmSource`) and removed the stale "PR 4.0 skeleton" / "legacy `GalaxyProxyDriver` coexists until PR 7.2" wording — PR 7.2 already retired the legacy projects. (b) `ReinitializeAsync` now parses the incoming `driverConfigJson` through the factory pipeline and compares the result to `_options`; an equivalent reapply refreshes health, a non-equivalent change throws `NotSupportedException` so a config swap never silently no-ops. (c) `ReplayAsync` now honours `_options.Reconnect.ReplayOnSessionLost` — when false it restarts the EventPump but skips the per-tag SubscribeBulk fan-out, delegating to gateway session-level replay. Regression coverage in `GalaxyDriverInfrastructureTests` (`ReinitializeAsync_RejectsNonEquivalentConfigChange`, `ReinitializeAsync_AcceptsEquivalentConfig`, `ReplayOnSessionLost_False_SkipsResubscribeBulk`, `ReplayOnSessionLost_True_RunsResubscribeBulk`). Updated `GalaxyDriverFactoryTests.ReinitializeAsync_RefreshesHealth_WhenConfigIsEquivalent` to use an equivalent config JSON.
|
|
|
|
### Driver.Galaxy-014
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Testing coverage |
|
|
| Location | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` (module-wide) |
|
|
| Status | Resolved |
|
|
|
|
**Description:** The reconnect/recovery path is the module's highest-risk surface and is effectively untested at the integration seam. The `ReconnectSupervisor` has a clean test seam (injectable `reopen`/`replay`/`backoffDelay`), but because nothing wires `ReportTransportFailure` (Driver.Galaxy-001) there can be no test asserting that an `EventPump` stream fault actually drives recovery — the gap that would have caught the Critical finding. Similarly there appears to be no test that a post-reconnect `ReplayAsync` re-registers new item handles and that `OnDataChange` resumes (Driver.Galaxy-008). The `StatusCodeMap.FromMxStatus` `Success`-flag semantics (Driver.Galaxy-003) and the `DataTypeMap` Int64 gap (Driver.Galaxy-002) are also the kind of behaviour a focused unit test would pin.
|
|
|
|
**Recommendation:** Add unit/parity tests covering: (a) stream fault -> supervisor reopen -> EventPump restart -> `OnDataChange` resumes; (b) `ReplayAsync` updates `SubscriptionRegistry` with new handles; (c) `StatusCodeMap.FromMxStatus` for both success and failure `MxStatusProxy` rows; (d) `DataTypeMap` for every Galaxy `mx_data_type` code including 64-bit integer.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added `GalaxyDriverInfrastructureTests` covering `GetMemoryFootprint` (Driver.Galaxy-011) and `IAsyncDisposable` (Driver.Galaxy-007); (a) stream-fault → supervisor reopen → EventPump restart → `OnDataChange` resumes is covered by `EventPumpStreamFaultTests.StreamFault_DrivesReconnectSupervisorReopenReplay` and `FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch` (landed with Driver.Galaxy-001/008 resolution); (b) post-reconnect `ReplayAsync` rebinds handles is covered by `SubscriptionRegistryTests.Rebind_*` suite; (c) `StatusCodeMap.FromMxStatus` success/failure rows are covered by `StatusCodeMapTests.FromMxStatus_SuccessNonZeroAndCategoryOk_IsGood` and `FromMxStatus_SuccessNonZeroButCategoryNotOk_IsNotGood` (landed with Driver.Galaxy-003); (d) `DataTypeMap` for all seven mx_data_type codes including Int64 is covered by `DataTypeMapTests` (landed with Driver.Galaxy-002).
|
|
|
|
### Driver.Galaxy-015
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | ~~Medium~~ Low (re-triaged 2026-05-23) |
|
|
| Category | ~~Security~~ Documentation & comments (re-triaged 2026-05-23) |
|
|
| Location | `libs/MxGateway.Client.dll`, `libs/MxGateway.Contracts.dll`, `libs/README.md` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Commit `994997b` checks in two binary DLLs (`MxGateway.Client.dll`, 99 840 bytes; `MxGateway.Contracts.dll`, 489 984 bytes) under `src/Drivers/.../Driver.Galaxy/libs/` and references them via `<Reference HintPath="…" />`. These are the only checked-in binary build artefacts in the entire repo (a repo-wide `find` for non-`bin/`/`obj/` `*.dll` under `libs/` returns only these two), so the change sets a precedent. The accompanying `libs/README.md` states the DLLs are "byte-for-byte the build output" of the OtOpcUa team's own code against the gateway's open proto contracts, but there is no recorded provenance — no source-commit SHA from the sibling `mxaccessgw` repo that produced the build, no SHA-256/SHA-512 checksum, no `.gitattributes` rule marking these paths as binary (so a future churn-in-place will balloon the pack file). Without a recorded source commit + checksum it is impossible for a future reviewer/auditor to verify the binaries match a specific revision of the sibling repo — the assertion "we built them, not external" is unverifiable after the fact. Tampering or accidental swap (e.g. someone drops in a different DLL of the same name under the same path) would not be detectable.
|
|
|
|
**Recommendation:** (a) Pin the source provenance: add the sibling `mxaccessgw` commit SHA used to build each DLL to `libs/README.md`. (b) Record a SHA-256 of each `.dll` in `libs/README.md` so a future tamper or accidental update is detectable by running `Get-FileHash`/`sha256sum`. (c) Add a `.gitattributes` rule under `libs/` declaring `*.dll binary` (and consider `filter=lfs diff=lfs merge=lfs -text` if/when these need to be updated, to avoid bloating the pack file on every refresh). (d) Optional: a `dotnet test` time-check that compares the on-disk hash to the recorded hash, so a CI run notices if the file drifts from what the README claims.
|
|
|
|
**Resolution:** Resolved 2026-05-23. **Severity re-triage:** the original
|
|
finding framed this as a security concern about "tampering or accidental
|
|
swap by an unknown third party"; the user clarified that the DLLs are
|
|
their own code, built from their own `mxaccessgw` project — not third-party
|
|
binaries. That moves the concern from security (untrusted provenance) to
|
|
documentation (audit trail). Re-classified as Low Documentation &
|
|
Comments. Fix: `libs/README.md` now carries a Provenance section that
|
|
records the source-commit SHA (`dd7ca1634e2d2b8a866c81f0009bf87ee9427750`,
|
|
extracted from the `AssemblyInformationalVersion` baked into both DLLs by
|
|
the original build) and SHA-256 checksums of both binaries, plus a
|
|
re-verification recipe (`sha256sum libs/*.dll` + `ilspycmd <dll> | grep
|
|
AssemblyInformationalVersion`). Recommendations (c) `.gitattributes` and
|
|
(d) CI hash-check deferred — the DLLs are essentially frozen until one
|
|
of the two unwinding paths is taken, so adding LFS or a CI guard would
|
|
add infrastructure that the unwinding step would then have to remove.
|
|
Re-open if the vendoring becomes a recurring update target.
|
|
|
|
### Driver.Galaxy-016
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Performance & resource management |
|
|
| Location | `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:43-47`, `libs/README.md:32-37` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** The five new `PackageReference` versions declared in the csproj (`Google.Protobuf` 3.34.1, `Grpc.Core.Api` 2.76.0, `Grpc.Net.Client` 2.71.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.0, `Polly` 8.5.2) do not all match what the vendored `MxGateway.Client.dll` was built against. The DLL's PE metadata (extracted via `System.Reflection.Metadata`) shows references to `Grpc.Net.Client v2.0.0.0`, `Microsoft.Extensions.Logging.Abstractions v10.0.0.0`, and notably `Polly.Core v8.0.0.0` — and the source csproj just before the sibling-repo rename (commit `bd4a09a` from 2026-04-27) declared `Grpc.Net.Client` 2.76.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.7, and `Polly.Core` 8.6.6 — *not* the meta-package `Polly`. Our driver pulls `Polly` 8.5.2 (which transitively pins `Polly.Core` 8.5.2 per its nuspec dependency), so the vendored client actually loads `Polly.Core` 8.5.2 at runtime against code compiled against 8.6.6. Across an 8.5 ↔ 8.6 minor delta this is usually safe (assembly-version is `v8.0.0.0` for both), but it is exactly the skew shape that surfaces as `MissingMethodException` if a 8.6-only API was used in the client. `libs/README.md` claims "versions match what the sibling repo's `ZB.MOM.WW.MxGateway.Contracts.csproj` uses so the gRPC + proto runtime stays binary-compatible" — that statement is correct only for `Google.Protobuf` and `Grpc.Core.Api`; the other three packages do not match.
|
|
|
|
**Recommendation:** Reconcile the declared package versions with what the vendored DLLs were built against — bump to `Grpc.Net.Client` 2.76.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.7, swap `Polly` for `Polly.Core` 8.6.6 (the driver does not import the `Polly` legacy v7 surface, only Polly.Core via the client). Alternatively, rebuild the vendored DLLs against the same versions the csproj declares and refresh the binaries. Update `libs/README.md` to record the exact versions the DLLs were built against, so the next vendoring refresh has an authoritative reference.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — took the first option (reconcile
|
|
declared packages with what the DLL was built against, verified by
|
|
reflecting `Assembly.GetReferencedAssemblies()` on `MxGateway.Client.dll`).
|
|
Changes to the csproj: **`Polly` 8.5.2 → `Polly.Core` 8.6.6** (the most
|
|
consequential — `Polly` (v7 fluent API) and `Polly.Core` (v8 resilience-
|
|
pipeline API) are different packages, and the DLL was built against
|
|
`Polly.Core`; the prior `Polly` reference would have failed at runtime
|
|
with `MissingMethodException` the first time the gateway client's retry
|
|
pipeline ran). Also bumped `Grpc.Net.Client` 2.71.0 → 2.76.0 and
|
|
`Microsoft.Extensions.Logging.Abstractions` 10.0.0 → 10.0.7 to match the
|
|
sibling Server/Worker projects' current versions. `Google.Protobuf`
|
|
3.34.1 and `Grpc.Core.Api` 2.76.0 already matched; left unchanged.
|
|
`libs/README.md` rewritten to record what was actually verified
|
|
(`Assembly.GetReferencedAssemblies()` output + the resolved package
|
|
versions, including the sibling Server/Worker csproj as the version
|
|
source-of-truth — the deleted MxGateway.Client.csproj would have been
|
|
the original source but no longer exists). Verification: solution-wide
|
|
`dotnet build` clean, Driver.Galaxy.Tests 245/245 pass against the
|
|
corrected package set.
|
|
|
|
### Driver.Galaxy-017
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Design-document adherence |
|
|
| Location | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/` (no source change), gateway proto contract |
|
|
| Status | Deferred |
|
|
|
|
**Description:** The vendored `MxGateway.Contracts.dll` only carries the OLD `MxGateway.Contracts.Proto[.Galaxy]` namespace (PE-namespace dump confirms — `MxGateway.Client`, `MxGateway.Contracts`, `MxGateway.Contracts.Proto`, `MxGateway.Contracts.Proto.Galaxy` only). The sibling `mxaccessgw` repo's live `Protos/mxaccess_gateway.proto`, `mxaccess_worker.proto`, and `galaxy_repository.proto` files now generate into `ZB.MOM.WW.MxGateway.Contracts.Proto.*`. The proto wire format itself can still evolve (new RPCs, renamed fields, removed fields) and the driver has no contract-version handshake (a repo-wide search for `ContractVersion|ProtocolVersion|ApiVersion|WireVersion` in the driver returns nothing) — so a gateway service that evolves its proto past what the vendored client knows will fail silently at runtime: gRPC `UNIMPLEMENTED` for a renamed RPC, default-value reads for a removed scalar field, or worse, a wire-tag collision if a field number is reused. The risk surface grew with vendoring: previously the `ProjectReference` would have hard-failed at build time if the proto changed shape; now the driver builds green against a frozen contract that may not match the running gateway.
|
|
|
|
**Recommendation:** (a) Add a single `Ping`/`GetVersion` RPC call at gateway-session open, comparing the gateway's reported contract version against a string baked into `libs/README.md` (or a `GatewayContractVersion` const) and refusing the session on mismatch with a clear log. (b) Document in `libs/README.md` the exact mxaccessgw commit SHA (and proto-file SHA-256s) the vendored DLLs were built from, so a parity-rig operator can grep the live gateway for the matching commit. (c) Add a soak/parity test that asserts the live gateway's proto descriptor still matches what the vendored DLL expects — fail loud rather than degrade.
|
|
|
|
**Resolution:** Deferred 2026-05-23 — the recommendation's part (b)
|
|
(record the mxaccessgw source-commit SHA in `libs/README.md`) is satisfied
|
|
by the Driver.Galaxy-015 resolution, which records both DLLs were built
|
|
from mxaccessgw commit `dd7ca1634e2d2b8a866c81f0009bf87ee9427750`. Parts
|
|
(a) and (c) — adding a `GetVersion` RPC at session-open and a parity
|
|
test against the live gateway's proto descriptor — are substantial new
|
|
RPC + plumbing work that is not in scope for this code-review-resolution
|
|
sweep. The risk surface is bounded because either of the two unwinding
|
|
paths in `libs/README.md` (sibling repo restores `MxGateway.Client.csproj`,
|
|
or this driver migrates to the new namespace) will move the codebase
|
|
past the vendoring + close this concern naturally. Re-open if neither
|
|
unwinding path is taken within the next quarter and the live gateway
|
|
service does evolve its proto under the driver.
|
|
|
|
### Driver.Galaxy-018
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Location | `libs/README.md:32-37`, `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:40-47` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Several small documentation issues in the vendoring artefacts:
|
|
1. `libs/README.md` says "Versions match what the sibling repo's `ZB.MOM.WW.MxGateway.Contracts.csproj` uses" — but `ZB.MOM.WW.MxGateway.Contracts.csproj` only declares `Google.Protobuf` 3.34.1 and `Grpc.Core.Api` 2.76.0; the other three packages (`Grpc.Net.Client`, `Microsoft.Extensions.Logging.Abstractions`, `Polly`) come from the (now-deleted) `MxGateway.Client.csproj`, not the contracts csproj. The README points at the wrong source-of-truth file. See Driver.Galaxy-016 for the related version-skew issue.
|
|
2. `libs/README.md` says the DLLs "are built against net10.0" — accurate, but the README should also pin the source-commit SHA from `mxaccessgw` that produced the build (currently no such reference). Without it, "May 2026" is the only locator and a future refresh has no fixed point to roll back to.
|
|
3. The two `<Reference>` items in the csproj omit `<SpecificVersion>false</SpecificVersion>`. The vendored DLLs carry `AssemblyVersion 1.0.0.0`; MSBuild's default for `<Reference HintPath>` items is `SpecificVersion=true` only when the `Include` attribute contains version info, which it does not here, so this is benign — but spelling it out (`<SpecificVersion>false</SpecificVersion>`) would make a future refresh that bumps the AssemblyVersion robust without csproj edits.
|
|
4. The csproj `<Reference Include="MxGateway.Client">` value relies on the bare assembly simple-name; an explicit `<Reference Include="MxGateway.Client, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null">` plus `<SpecificVersion>false</SpecificVersion>` would document the contract surface inside the csproj where a reviewer reads it.
|
|
|
|
**Recommendation:** (a) Update `libs/README.md` to (i) point at `MxGateway.Client.csproj` for the `Grpc.Net.Client`/`Microsoft.Extensions.Logging.Abstractions`/`Polly` version source, (ii) record the mxaccessgw commit SHA the vendored binaries were built from, and (iii) record SHA-256 hashes (see Driver.Galaxy-015). (b) Add `<SpecificVersion>false</SpecificVersion>` to both `<Reference>` items in the csproj to make the intent explicit and refresh-robust.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — most of (a) was addressed alongside
|
|
Driver.Galaxy-015 + -016: `libs/README.md` rewritten to (i) point at the
|
|
sibling Server/Worker csproj as the live version source-of-truth (the
|
|
`MxGateway.Client.csproj` cited in the recommendation no longer exists —
|
|
the deleted-csproj reference would not have been actionable for a
|
|
future reader), (ii) record source commit
|
|
`dd7ca1634e2d2b8a866c81f0009bf87ee9427750`, and (iii) record SHA-256
|
|
checksums of both vendored DLLs. (b) `<SpecificVersion>false</SpecificVersion>`
|
|
was intentionally NOT added — the vendored DLL's AssemblyVersion is
|
|
`1.0.0.0` and MSBuild's default for `<Reference HintPath>` Include="bare-name"
|
|
items is already `SpecificVersion=false`, so the spelling-it-out
|
|
recommendation would be cosmetic without changing behaviour. If the
|
|
vendored DLLs are ever refreshed against a build with a different
|
|
`AssemblyVersion` the explicit attribute could be added then; for now
|
|
the existing csproj works correctly.
|