Re-reviewed the four modules with source changes since the previous review commit76d35d1, per REVIEW-PROCESS.md section 6. Updated each findings.md header (date 2026-05-23, commita9be809) and appended new findings under continued numbering. Regenerated README.md. ## New findings — 12 total across 4 modules ### Core.Scripting (5 new, IDs -012 to -016) - **-012 High Security** — broadened BCL references (System.* + netstandard) re-expose System.Threading.ThreadPool / Timer / AssemblyLoadContext, which the analyzer's deny-list doesn't cover. Re-introduces the background-work threat Core.Scripting-003 closed via System.Threading.Tasks deny. - **-013 Medium Security** — hand-rolled wrapper-source generation lets brace-balanced user source inject sibling methods/classes alongside CompiledScript.Run. Analyzer still gates forbidden types, but the documented 'method body' authoring contract is silently relaxed. - **-014 Medium Concurrency** — CompiledScriptCache.Clear() uses key-only TryRemove(key, out _) — the same race the -006 resolution fixed in GetOrCompile's catch is latent here on publish-replace. - **-015 Low Correctness** — ToCSharpTypeName truncates at first backtick; silently drops closed type arguments of nested-generic shapes (Outer<>.Inner<>). Latent — no production caller uses this shape today. - **-016 Medium Performance** — VirtualTagEngine + ScriptedAlarmEngine call ScriptEvaluator.Compile directly without going through CompiledScriptCache, so the headline -008 collectible-ALC fix doesn't run on the actual production path — the per-publish leak is still in effect. ### Core.ScriptedAlarms (1 new, ID -013) - **-013 Low Documentation** — new internal test accessors return the live mutable scratch dictionary; XML docs don't warn future test authors about the synchronisation contract. ### Driver.Cli.Common (2 new, IDs -007, -008) - **-007 High Correctness** — 0x80550000 was added as BadDeviceFailure but the real OPC UA spec value for BadDeviceFailure is 0x808B0000 (verified against Driver.Galaxy.Runtime.StatusCodeMap and HistorianQualityMapper, both of which use the correct 0x808B0000). 0x80550000 is actually BadSecurityPolicyRejected. The native mappers (FOCAS / AbCip / AbLegacy) all use the wrong 0x80550000; this session's SnapshotFormatter extension propagated the wrong name and the test asserts against the same wrong value so CI is blind — same shape of bug as Driver.Cli.Common-001. - **-008 Low Testing** — new FormatStatus_names_native_driver_emitted_codes Theory is redundant with the existing well-known Theory (same five InlineData rows added to both) and uses weaker ShouldContain assertion than the well-known Theory's ShouldBe. ### Driver.Galaxy (4 new, IDs -015 to -018) - **-015 Medium Security** — vendored DLLs (libs/) have no recorded provenance: no source-commit SHA from the mxaccessgw repo, no SHA-256 checksum in libs/README.md. Tampering / accidental swap undetectable. - **-016 Medium Performance** — version skew between declared PackageReferences (Polly 8.5.2 / Grpc.Net.Client 2.71.0 / Microsoft.Extensions.Logging.Abstractions 10.0.0) and what the vendored DLL was actually built against (Polly.Core 8.6.6 / Grpc.Net.Client 2.76.0 / Microsoft.Extensions.Logging.Abstractions 10.0.7). Latent now (assembly-version refs are loose) but precise shape that produces a runtime MissingMethodException. - **-017 Low Design** — no contract-version handshake between the driver and the gateway; proto could evolve under the gateway without the driver noticing. - **-018 Low Documentation** — libs/README.md points at the wrong sibling csproj as the version source-of-truth; missing SpecificVersion=false on the Reference items; missing mxaccessgw source-commit SHA. ## Particularly notable Two findings undercut commits from this session: - Driver.Cli.Common-007 invalidates commit5a9c459(which named 0x80550000 as BadDeviceFailure across the cross-CLI shortlist). - Core.Scripting-016 invalidates the production effect of commit7b6ab2e(the collectible-ALC fix wired Dispose only via CompiledScriptCache, which the engines don't use). The wider native-mapper miscoding behind -007 also affects three driver modules outside this session's edit scope (FocasStatusMapper, AbCipStatusMapper, AbLegacyStatusMapper all carry the wrong code). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
321 lines
37 KiB
Markdown
321 lines
37 KiB
Markdown
# Code Review — Driver.Galaxy
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` |
|
|
| Reviewer | Claude Code |
|
|
| Review date | 2026-05-23 |
|
|
| Commit reviewed | `a9be809` |
|
|
| Status | Reviewed |
|
|
| Open findings | 4 |
|
|
|
|
## Checklist coverage
|
|
|
|
| # | Category | Result |
|
|
|---|---|---|
|
|
| 1 | Correctness & logic bugs | Driver.Galaxy-001, Driver.Galaxy-002, Driver.Galaxy-003, Driver.Galaxy-004 |
|
|
| 2 | OtOpcUa conventions | Driver.Galaxy-005 |
|
|
| 3 | Concurrency & thread safety | Driver.Galaxy-006, Driver.Galaxy-007 |
|
|
| 4 | Error handling & resilience | Driver.Galaxy-001, Driver.Galaxy-008, Driver.Galaxy-009 |
|
|
| 5 | Security | Driver.Galaxy-010, Driver.Galaxy-015 |
|
|
| 6 | Performance & resource management | Driver.Galaxy-011, Driver.Galaxy-012, Driver.Galaxy-016 |
|
|
| 7 | Design-document adherence | Driver.Galaxy-013, Driver.Galaxy-017 |
|
|
| 8 | Code organization & conventions | No issues found |
|
|
| 9 | Testing coverage | Driver.Galaxy-014 |
|
|
| 10 | Documentation & comments | Driver.Galaxy-005, Driver.Galaxy-013, Driver.Galaxy-018 |
|
|
|
|
## Re-review 2026-05-23 (commit `a9be809`)
|
|
|
|
The only code-affecting change since `76d35d1` was commit `994997b` — the
|
|
sibling `mxaccessgw` repo restructured (the `clients/dotnet/MxGateway.Client`
|
|
project path and the `MxGateway.Contracts.Proto` namespace both moved), and
|
|
the driver's path-based `ProjectReference` started producing 87 build errors
|
|
solution-wide. The fix is build-time only: the broken `ProjectReference` was
|
|
replaced with `<Reference HintPath="libs\…">` items pointing at vendored
|
|
binary copies of `MxGateway.Client.dll` (99 KB, May 2026 known-good build)
|
|
and `MxGateway.Contracts.dll` (490 KB), and five `PackageReference`s that
|
|
the dropped project was previously providing transitively (`Google.Protobuf`,
|
|
`Grpc.Core.Api`, `Grpc.Net.Client`, `Microsoft.Extensions.Logging.Abstractions`,
|
|
`Polly`) were declared explicitly. The matching `Tests` csproj got the same
|
|
binary `<Reference>` for `MxGateway.Contracts` (replacing its own broken
|
|
`ProjectReference`). A `libs/README.md` documents what is vendored and the
|
|
two unwinding paths (sibling restores a client library, or driver migrates
|
|
to the new `ZB.MOM.WW.MxGateway.Contracts.Proto` namespace + reimplements
|
|
the `MxGatewayClient` / `MxGatewaySession` / `GalaxyRepositoryClient`
|
|
wrapper, ~2,200 LoC).
|
|
|
|
No `*.cs` file changed; the re-review walked only the categories that apply
|
|
to a build-time/packaging change. Categories with no new findings:
|
|
Correctness (1), OtOpcUa conventions (2), Concurrency (3), Error handling
|
|
(4), Code organization (8), Testing coverage (9). Four new findings are
|
|
recorded below (Driver.Galaxy-015..018) — none Critical, none High; two
|
|
Medium, two Low.
|
|
|
|
## Findings
|
|
|
|
### Driver.Galaxy-001
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Critical |
|
|
| Category | Error handling & resilience |
|
|
| Location | `Runtime/EventPump.cs:128`, `GalaxyDriver.cs:222` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** The `ReconnectSupervisor` is constructed in `BuildProductionRuntimeAsync` and exposes `ReportTransportFailure(Exception)` as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls `ReportTransportFailure` (a repo-wide search finds only the declaration). When the gateway `StreamEvents` stream faults, `EventPump.RunAsync` catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits — but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and `GetHealth()` keeps reporting `Healthy` because `_supervisor.IsDegraded` stays false. This is a production outage with no self-recovery.
|
|
|
|
**Recommendation:** Wire the EventPump (and any gw RPC that observes a transport fault) to call `_supervisor.ReportTransportFailure(ex)`. The simplest path: give `EventPump` a fault callback (or expose a `StreamFaulted` event) that `GalaxyDriver` subscribes to and forwards to the supervisor. The supervisor's `ReopenAsync`/`ReplayAsync` must also restart the EventPump itself (see Driver.Galaxy-008).
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added an optional `onStreamFault` callback to `EventPump`; `RunAsync`'s stream-fault catch block now invokes it, and `GalaxyDriver.EnsureEventPumpStarted` wires it to `OnEventPumpStreamFault` which forwards the cause to `ReconnectSupervisor.ReportTransportFailure`, so a transient gw transport drop now drives reopen → replay. Regression coverage in `EventPumpStreamFaultTests`. Note: the EventPump itself is still not restarted on reconnect — that pump-restart gap remains tracked under Driver.Galaxy-008.
|
|
|
|
### Driver.Galaxy-002
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | High |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `Browse/DataTypeMap.cs:13`, `Runtime/MxValueDecoder.cs:9` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `DataTypeMap.Map` maps Galaxy `mx_data_type` codes to six `DriverDataType` values (Boolean, Int32, Float32, Float64, String, DateTime) — there is no `Int64` arm. Yet `MxValueDecoder` and `MxValueEncoder` both fully support Int64 (`MxValue.Int64Value`, `Int64Array`), and the decoder's own XML doc claims "the seven Galaxy data types ... (Boolean, Int32, Int64, Float32, Float64, String, DateTime)". Any Galaxy attribute whose `mx_data_type` is the Int64 code (or any code > 5) falls through the `_ => DriverDataType.String` default. The address-space node is then created as a `String` variable while runtime reads decode an `Int64` boxed value — a type mismatch that produces wrong OPC UA `DataType`/`ValueRank` metadata and likely fails value coercion at the server node layer.
|
|
|
|
**Recommendation:** Confirm the Galaxy `mx_data_type` integer code for 64-bit integers and add the explicit arm to `DataTypeMap.Map`. If the wire format genuinely has no Int64 type, correct the `MxValueDecoder`/`MxValueEncoder` doc comments instead. Either way the encoder/decoder and the type map must agree.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added `6 => DriverDataType.Int64` to `DataTypeMap.Map`, extending the contiguous 0..5 scheme so the type map covers the same seven Galaxy data types `MxValueDecoder`/`MxValueEncoder` already decode/encode; Int64 attributes now build as Int64 nodes instead of falling through to the String default. Regression coverage in `DataTypeMapTests`.
|
|
|
|
### Driver.Galaxy-003
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `Runtime/StatusCodeMap.cs:86` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `FromMxStatus` returns `Good` whenever `status.Success != 0`. The intent (per the surrounding comment "Honors the success flag") is that a non-zero `Success` means success. But if `MxStatusProxy.Success` is itself a native HRESULT/return code rather than a boolean-as-int, then `Success != 0` is exactly the failure condition and the mapper inverts it — every failed write/read would report `Good`. The field name is ambiguous and the rest of the file (`Detail`, `RawDetectedBy`, and `Hresult` used elsewhere) treats `0` as success. `GatewayGalaxyAlarmAcknowledger.cs:62` uses the opposite convention for the sibling field (`reply.Hresult != 0` means failure).
|
|
|
|
**Recommendation:** Verify the semantics of `MxStatusProxy.Success` against the gateway proto contract. If it is a success-boolean encoded as int, add a code comment pinning that; if it is an HRESULT, invert the check to `status.Success == 0 => Good`.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — replaced `status.Success != 0` with `status.IsSuccess()` (the `MxStatusProxyExtensions` helper that checks both `success != 0` AND `category == Ok`); the proto contract explicitly documents that `success` is not a boolean and that clients must branch on `category`. Regression coverage updated in `StatusCodeMapTests` with a `SuccessNonZeroButCategoryNotOk_IsNotGood` assertion pinning the fix.
|
|
|
|
### Driver.Galaxy-004
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `GalaxyDriver.cs:901` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `OnPumpDataChange` reconstructs a raw OPC DA quality byte from an OPC UA `StatusCode` for the probe watcher: it shifts `StatusCode >> 30` and maps `0->192, 1->64, _->0`. The `StatusCode` was itself produced upstream by `StatusCodeMap.FromQualityByte`/`FromMxStatus`, so this is a lossy round-trip — it collapses every specific code back to the three category bytes (192/64/0). That happens to satisfy `PerPlatformProbeWatcher.DecodeState` (which only checks `qualityByte < 192`), so the bug is currently benign, but the mapping is fragile and undocumented except for one inline comment. A future edit to the `StatusCodeMap` constants or to the shift width would silently desync the probe-health decode with no test guarding it.
|
|
|
|
**Recommendation:** Route the probe path off the original quality information rather than reverse-engineering it from a `StatusCode`. Either carry the raw quality byte on `DataValueSnapshot`, or add a `StatusCodeMap.ToQualityCategoryByte(uint)` helper with unit tests so the mapping lives in one place next to its inverse.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added `StatusCodeMap.ToQualityCategoryByte(uint)` helper that extracts top-two bits of the OPC UA StatusCode into the OPC DA category byte (Good=192, Uncertain=64, Bad=0); `GalaxyDriver.OnPumpDataChange` now calls this helper instead of inlining the shift+switch, so the mapping lives next to its inverse. Unit tests in `StatusCodeMapTests` cover all three category buckets and the round-trip invariant.
|
|
|
|
### Driver.Galaxy-005
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | OtOpcUa conventions |
|
|
| Location | `Runtime/EventPump.cs:81-88` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** The `BoundedChannelOptions` comment states "Newest-dropped policy: when full, the producer's TryWrite returns false ... We do this manually rather than relying on `BoundedChannelFullMode.DropWrite`" — but the option is then set to `FullMode = BoundedChannelFullMode.Wait`. With `Wait`, `TryWrite` returning `false` on a full channel is correct behaviour, so the code works, but the comment naming the mode and the actual mode disagree, which is confusing for a maintainer deciding whether the policy is `Wait`, `DropWrite`, or `DropNewest`.
|
|
|
|
**Recommendation:** Either reword the comment to say "we use `Wait` mode but never call the awaitable `WriteAsync` — `TryWrite` gives us synchronous newest-dropped semantics", or switch to `BoundedChannelFullMode.DropWrite` and keep the manual drop count. Make the comment and the mode consistent.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — reworded the `BoundedChannelOptions` comment to say "we use FullMode.Wait but never call the awaitable WriteAsync — only synchronous TryWrite, which returns false immediately on a full channel and lets us account for drops on the EventsDropped counter". Also explains why we deliberately do NOT use `BoundedChannelFullMode.DropWrite` (it would silently discard without surfacing on the counter). Comment and `FullMode` value now agree.
|
|
|
|
### Driver.Galaxy-006
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Concurrency & thread safety |
|
|
| Location | `GalaxyDriver.cs:848-861` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `OnAlarmFeedTransition` picks the "owner" handle with `_alarmSubscriptions.First()` under `_alarmHandlersLock`. `HashSet<T>.First()` enumeration order is unspecified and unstable across mutations — when multiple alarm subscriptions are active, the handle attached to a given `AlarmEventArgs` can change arbitrarily between transitions. The XML doc acknowledges "we still only fire the event once" but the downstream `AlarmConditionService` correlates transitions to the originating subscription via this handle; a non-deterministic owner can misroute unsubscribe bookkeeping or per-subscription state.
|
|
|
|
**Recommendation:** If alarm transitions genuinely fan out to all subscriptions, raise `OnAlarmEvent` once per active handle (or document that the handle is a non-correlating sentinel and have the server stop relying on it). If a single owner is required, make the choice deterministic (e.g. the earliest-created handle) and stable.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — changed `_alarmSubscriptions` from `HashSet<GalaxyAlarmSubscriptionHandle>` to `List<GalaxyAlarmSubscriptionHandle>` so insertion order is preserved; `OnAlarmFeedTransition` now picks `[0]` (earliest-registered handle) instead of `First()` on a HashSet, making the owner selection deterministic and stable across mutations. Server routing uses `SourceNodeId` not the handle, so every active subscriber sees the same transition regardless of which handle is attached.
|
|
|
|
### Driver.Galaxy-007
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Concurrency & thread safety |
|
|
| Location | `GalaxyDriver.cs:937-968` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `Dispose()` is not synchronized against the capability methods. It sets `_disposed = true` then disposes `_eventPump`, `_alarmFeed`, `_ownedMxSession`, `_ownedMxClient`, `_supervisor`, etc. A concurrent `SubscribeAsync`/`ReadAsync`/`WriteAsync` that passed its `ObjectDisposedException.ThrowIf` check at entry can then dereference `_subscriber`/`_dataWriter` whose backing `GalaxyMxSession` is being disposed mid-call, producing `ObjectDisposedException`/`NullReferenceException` from deep inside the gw client rather than a clean failure. `Dispose` also blocks the caller on `GetAwaiter().GetResult()` of several async disposals, risking a deadlock if invoked from a thread-pool-starved context.
|
|
|
|
**Recommendation:** Gate capability entry points so they cannot start new gw work once `_disposed` is set (e.g. a `CancellationTokenSource` linked into every call, cancelled first in `Dispose`). Consider implementing `IAsyncDisposable` so the async sub-component disposals do not block on `GetResult()`.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added `IAsyncDisposable` to `GalaxyDriver` and implemented `DisposeAsync()` as the primary disposal path that awaits each async sub-component (EventPump, AlarmFeed, MxSession, MxClient, RepositoryClient) without blocking; `Dispose()` delegates to `DisposeAsync().AsTask().GetAwaiter().GetResult()` for `using`-statement compatibility. The sync blocking-on-GetResult anti-pattern in the previous Dispose body is eliminated on the hot path. Note: the `CancellationTokenSource` gate for concurrent capability entry was not added — the existing `ObjectDisposedException.ThrowIf(_disposed, this)` guards at capability entry points already provide the fast-fail, and a separate CTS would add complexity without solving the TOCTOU window noted in the finding; that window is benign in practice (the sub-component's own disposed check catches it).
|
|
|
|
### Driver.Galaxy-008
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | High |
|
|
| Category | Error handling & resilience |
|
|
| Location | `GalaxyDriver.cs:264-276`, `Runtime/EventPump.cs:97-103` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Even if Driver.Galaxy-001 is fixed and the supervisor's `ReplayAsync` runs, recovery is incomplete. `ReplayAsync` re-issues `SubscribeBulkAsync` for the tracked tags, but the `EventPump` background loop that consumes `StreamEvents` is not restarted. After a stream fault `EventPump.RunAsync` exits and `_channel` is completed; `EventPump.Start()` is a no-op (`if (_loop is not null) return`) because `_loop` is a completed-but-non-null task. So a replayed subscription has no consumer — values are subscribed on the gw but never reach `OnDataChange`. Additionally `ReplayAsync` never re-registers the new item handles the gw returns into `SubscriptionRegistry`; the old stale item handles remain, so even with a live pump the fan-out reverse-map would miss the post-reconnect handles.
|
|
|
|
**Recommendation:** On reconnect, dispose and recreate the `EventPump` (or make it restartable), and have `ReplayAsync` update `SubscriptionRegistry` bindings with the new item handles returned by the post-reconnect `SubscribeBulkAsync`. Add an integration/parity test that drops the stream mid-subscription and asserts `OnDataChange` resumes.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — `ReplayAsync` now calls a new `RestartEventPumpForReplay` (disposes the faulted pump, recreates and restarts a fresh one) and re-issues `SubscribeBulkAsync` per subscription, then `SubscriptionRegistry.Rebind` swaps each subscription's stale pre-reconnect item handles for the post-reconnect handles so the fan-out reverse map dispatches to the live pump. New `SubscriptionRegistry.SnapshotEntries`/`Rebind` APIs back the per-subscription replay. Regression coverage in `SubscriptionRegistryTests` (Rebind/SnapshotEntries) and `EventPumpStreamFaultTests.FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch`.
|
|
|
|
### Driver.Galaxy-009
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Error handling & resilience |
|
|
| Location | `GalaxyDriver.cs:354-371` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `StartDeployWatcher` launches the watch loop with `_ = _deployWatcher.StartAsync(CancellationToken.None)` — a fire-and-forget with a discarded `Task`. `StartAsync` can throw synchronously (`InvalidOperationException` if already started); the discard masks that programming error. Separately, `StartDeployWatcher` builds an `_ownedRepositoryClient` purely for the watcher when discovery has not run yet — if `DiscoverAsync` later runs, `BuildDefaultHierarchySource` overwrites `_ownedRepositoryClient` with a second client, leaking the first (only the latest reference is disposed in `Dispose`).
|
|
|
|
**Recommendation:** Await `StartAsync` (it completes synchronously after scheduling) or at least observe its result. Reuse a single `GalaxyRepositoryClient` across the deploy watcher and the hierarchy source instead of letting `BuildDefaultHierarchySource` clobber the field — guard the assignment or build the client once in `InitializeAsync`.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — (a) replaced `_ = _deployWatcher.StartAsync(...)` discard with an explicit variable + `IsFaulted` check so any synchronous throw from `StartAsync` (e.g. called-twice `InvalidOperationException`) propagates rather than being silently swallowed; (b) changed both `StartDeployWatcher` and `BuildDefaultHierarchySource` to use `_ownedRepositoryClient ??=` so a client built by the watcher is reused by discovery instead of being overwritten and leaked — only one `GalaxyRepositoryClient` instance is now created and disposed.
|
|
|
|
### Driver.Galaxy-010
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Security |
|
|
| Location | `GalaxyDriver.cs:311-341` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `ResolveApiKey` supports an `env:`/`file:` indirection and otherwise treats the config string as the literal API key ("Anything else — used as the literal API key. Convenient for dev"). `GalaxyGatewayOptions`' own XML doc claims "the API key never appears in cleartext config". The literal-key fallback silently permits a plaintext API key in the `DriverConfig` JSON column of the central config DB, contradicting the documented contract. There is no warning logged when the literal path is taken.
|
|
|
|
**Recommendation:** Log a startup warning when `ResolveApiKey` falls through to the literal arm so an operator who accidentally committed a cleartext key sees it, and update the `GalaxyGatewayOptions` doc comment so it no longer over-promises. Consider gating the literal arm behind an explicit `dev:`-style prefix so a cleartext key cannot be used by accident.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — (a) added a logger-aware `ResolveApiKey(string, ILogger?)` overload that emits a `Warning` when the back-compat literal arm is taken, and wired the `BuildClientOptions` call site to pass `_logger`; (b) added an explicit `dev:KEY` prefix that returns the literal value without warning, so dev rigs / parity tests can opt-in deliberately; (c) rewrote the `GalaxyGatewayOptions.ApiKeySecretRef` XML doc so it no longer claims "the API key never appears in cleartext config" — it now documents all four supported forms (`env:`, `file:`, `dev:`, and the warning-on-literal back-compat path). Regression coverage in `GalaxyDriverApiKeyResolverTests` (`Literal_string_emits_warning_when_logger_supplied`, `Dev_prefix_returns_literal_without_warning`, `Env_prefix_does_not_emit_literal_warning`).
|
|
|
|
### Driver.Galaxy-011
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Performance & resource management |
|
|
| Location | `GalaxyDriver.cs:411` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** `GetMemoryFootprint()` unconditionally returns `0` with a comment "PR 4.4 sets this from SubscriptionRegistry size" — PR 4.4 has shipped (the registry exists and is used) but the method was never updated. `IHostConnectivityProbe.GetMemoryFootprint` is consumed by the server's status/health surface to gauge cache-flush pressure; a constant `0` makes the Galaxy driver invisible to that mechanism, so a 50k-tag subscription set never registers as memory pressure and `FlushOptionalCachesAsync` (also a no-op) is never meaningfully triggered.
|
|
|
|
**Recommendation:** Return a real estimate derived from `SubscriptionRegistry.TrackedSubscriptionCount`/`TrackedItemHandleCount` (and the EventPump channel occupancy), or document explicitly why the Galaxy driver opts out of footprint reporting. Remove the stale "PR 4.4 sets this" comment.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — replaced the constant `0` with a live estimate derived from `SubscriptionRegistry.TrackedItemHandleCount` (64 bytes/handle) and `TrackedSubscriptionCount` (256 bytes/subscription); returns 0 when no subscriptions are active and grows with the registry. The stale "PR 4.4 sets this" comment is removed. Regression coverage in `GalaxyDriverInfrastructureTests`.
|
|
|
|
### Driver.Galaxy-012
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Performance & resource management |
|
|
| Location | `Runtime/SubscriptionRegistry.cs:65-67`, `GalaxyDriver.cs:538`, `GalaxyDriver.cs:675` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Several hot paths are O(n^2) per call. `SubscriptionRegistry.ResolveSubscribers` does `entry.Bindings.FirstOrDefault(b => b.ItemHandle == itemHandle)` — a linear scan of the whole binding list for every event dispatch; at 50k tags this is 50k-element scans on the 1Hz fan-out path. `GalaxyDriver.SubscribeAsync` and `ReadViaSubscribeOnceAsync` correlate results to references with `results.FirstOrDefault(r => string.Equals(...))` inside a `for` loop over all references — O(n^2) over the subscribe batch. `SubscriptionRegistry.Remove` rebuilds a `ConcurrentBag` from a LINQ filter on every unsubscribe.
|
|
|
|
**Recommendation:** Index `SubscriptionEntry` bindings by item handle (a `Dictionary<int, string>` per entry) so `ResolveSubscribers` is O(1) per subscriber. Project the `SubscribeResult` list into a `Dictionary<string, SubscribeResult>` (OrdinalIgnoreCase) once before the correlation loop. These matter on the documented 50k-tag soak path.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — three changes: (a) `SubscriptionEntry` now carries a `FullRefByItemHandle` `Dictionary<int, string>` built once at construction; `ResolveSubscribers` does O(1) lookups per subscriber instead of a `FirstOrDefault` linear scan of the binding list. (b) Reverse map `_subscribersByItemHandle` swapped from `ConcurrentBag<long>` to `ImmutableHashSet<long>` — `Remove`/`Rebind` use `set.Remove(id)` (O(log n)) instead of "rebuild a new bag from a LINQ filter on every unsubscribe", and reads remain lock-free via atomic publication through `ConcurrentDictionary.AddOrUpdate`. (c) `GalaxyDriver.SubscribeAsync` + `ReadViaSubscribeOnceAsync` now index the `SubscribeResult` list once via the existing `BuildResultIndex` helper (already used by `ReplayAsync`) so per-reference correlation is O(1). Regression coverage in `SubscriptionRegistryTests.ResolveSubscribers_LargeBindingSet_DispatchesCorrectly`.
|
|
|
|
### Driver.Galaxy-013
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Design-document adherence |
|
|
| Location | `GalaxyDriver.cs:14-27`, `GalaxyDriver.cs:374-382`, `Config/GalaxyDriverOptions.cs:84-86` |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Multiple doc comments are stale relative to the shipped code. `GalaxyDriver`'s class summary still describes the file as "the project skeleton with `IDriver` bodies that wire to a future `IGalaxyGatewayClient` abstraction. Capability interfaces ... land in PRs 4.1-4.7" and references the legacy `GalaxyProxyDriver` coexisting "until PR 7.2" — but PR 7.2 already deleted the legacy Galaxy projects and the capability interfaces are all implemented. `ReinitializeAsync` is still a stub ("for the skeleton we just refresh health") that ignores `driverConfigJson` entirely — a config reapply silently does nothing. `GalaxyReconnectOptions.ReplayOnSessionLost` is defined and documented but never read anywhere in the driver (`ReplayAsync` always replays).
|
|
|
|
**Recommendation:** Refresh the `GalaxyDriver` class and `ReinitializeAsync` doc comments to describe the shipped state, implement or explicitly reject `ReinitializeAsync` config reapply, and either honour `ReplayOnSessionLost` or remove it from `GalaxyReconnectOptions`.
|
|
|
|
**Resolution:** Resolved 2026-05-23 — three fixes: (a) rewrote the `GalaxyDriver` class summary to describe the shipped capability surface (`ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IRediscoverable`, `IHostConnectivityProbe`, `IAlarmSource`) and removed the stale "PR 4.0 skeleton" / "legacy `GalaxyProxyDriver` coexists until PR 7.2" wording — PR 7.2 already retired the legacy projects. (b) `ReinitializeAsync` now parses the incoming `driverConfigJson` through the factory pipeline and compares the result to `_options`; an equivalent reapply refreshes health, a non-equivalent change throws `NotSupportedException` so a config swap never silently no-ops. (c) `ReplayAsync` now honours `_options.Reconnect.ReplayOnSessionLost` — when false it restarts the EventPump but skips the per-tag SubscribeBulk fan-out, delegating to gateway session-level replay. Regression coverage in `GalaxyDriverInfrastructureTests` (`ReinitializeAsync_RejectsNonEquivalentConfigChange`, `ReinitializeAsync_AcceptsEquivalentConfig`, `ReplayOnSessionLost_False_SkipsResubscribeBulk`, `ReplayOnSessionLost_True_RunsResubscribeBulk`). Updated `GalaxyDriverFactoryTests.ReinitializeAsync_RefreshesHealth_WhenConfigIsEquivalent` to use an equivalent config JSON.
|
|
|
|
### Driver.Galaxy-014
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Testing coverage |
|
|
| Location | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` (module-wide) |
|
|
| Status | Resolved |
|
|
|
|
**Description:** The reconnect/recovery path is the module's highest-risk surface and is effectively untested at the integration seam. The `ReconnectSupervisor` has a clean test seam (injectable `reopen`/`replay`/`backoffDelay`), but because nothing wires `ReportTransportFailure` (Driver.Galaxy-001) there can be no test asserting that an `EventPump` stream fault actually drives recovery — the gap that would have caught the Critical finding. Similarly there appears to be no test that a post-reconnect `ReplayAsync` re-registers new item handles and that `OnDataChange` resumes (Driver.Galaxy-008). The `StatusCodeMap.FromMxStatus` `Success`-flag semantics (Driver.Galaxy-003) and the `DataTypeMap` Int64 gap (Driver.Galaxy-002) are also the kind of behaviour a focused unit test would pin.
|
|
|
|
**Recommendation:** Add unit/parity tests covering: (a) stream fault -> supervisor reopen -> EventPump restart -> `OnDataChange` resumes; (b) `ReplayAsync` updates `SubscriptionRegistry` with new handles; (c) `StatusCodeMap.FromMxStatus` for both success and failure `MxStatusProxy` rows; (d) `DataTypeMap` for every Galaxy `mx_data_type` code including 64-bit integer.
|
|
|
|
**Resolution:** Resolved 2026-05-22 — added `GalaxyDriverInfrastructureTests` covering `GetMemoryFootprint` (Driver.Galaxy-011) and `IAsyncDisposable` (Driver.Galaxy-007); (a) stream-fault → supervisor reopen → EventPump restart → `OnDataChange` resumes is covered by `EventPumpStreamFaultTests.StreamFault_DrivesReconnectSupervisorReopenReplay` and `FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch` (landed with Driver.Galaxy-001/008 resolution); (b) post-reconnect `ReplayAsync` rebinds handles is covered by `SubscriptionRegistryTests.Rebind_*` suite; (c) `StatusCodeMap.FromMxStatus` success/failure rows are covered by `StatusCodeMapTests.FromMxStatus_SuccessNonZeroAndCategoryOk_IsGood` and `FromMxStatus_SuccessNonZeroButCategoryNotOk_IsNotGood` (landed with Driver.Galaxy-003); (d) `DataTypeMap` for all seven mx_data_type codes including Int64 is covered by `DataTypeMapTests` (landed with Driver.Galaxy-002).
|
|
|
|
### Driver.Galaxy-015
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Security |
|
|
| Location | `libs/MxGateway.Client.dll`, `libs/MxGateway.Contracts.dll`, `libs/README.md` |
|
|
| Status | Open |
|
|
|
|
**Description:** Commit `994997b` checks in two binary DLLs (`MxGateway.Client.dll`, 99 840 bytes; `MxGateway.Contracts.dll`, 489 984 bytes) under `src/Drivers/.../Driver.Galaxy/libs/` and references them via `<Reference HintPath="…" />`. These are the only checked-in binary build artefacts in the entire repo (a repo-wide `find` for non-`bin/`/`obj/` `*.dll` under `libs/` returns only these two), so the change sets a precedent. The accompanying `libs/README.md` states the DLLs are "byte-for-byte the build output" of the OtOpcUa team's own code against the gateway's open proto contracts, but there is no recorded provenance — no source-commit SHA from the sibling `mxaccessgw` repo that produced the build, no SHA-256/SHA-512 checksum, no `.gitattributes` rule marking these paths as binary (so a future churn-in-place will balloon the pack file). Without a recorded source commit + checksum it is impossible for a future reviewer/auditor to verify the binaries match a specific revision of the sibling repo — the assertion "we built them, not external" is unverifiable after the fact. Tampering or accidental swap (e.g. someone drops in a different DLL of the same name under the same path) would not be detectable.
|
|
|
|
**Recommendation:** (a) Pin the source provenance: add the sibling `mxaccessgw` commit SHA used to build each DLL to `libs/README.md`. (b) Record a SHA-256 of each `.dll` in `libs/README.md` so a future tamper or accidental update is detectable by running `Get-FileHash`/`sha256sum`. (c) Add a `.gitattributes` rule under `libs/` declaring `*.dll binary` (and consider `filter=lfs diff=lfs merge=lfs -text` if/when these need to be updated, to avoid bloating the pack file on every refresh). (d) Optional: a `dotnet test` time-check that compares the on-disk hash to the recorded hash, so a CI run notices if the file drifts from what the README claims.
|
|
|
|
### Driver.Galaxy-016
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Performance & resource management |
|
|
| Location | `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:43-47`, `libs/README.md:32-37` |
|
|
| Status | Open |
|
|
|
|
**Description:** The five new `PackageReference` versions declared in the csproj (`Google.Protobuf` 3.34.1, `Grpc.Core.Api` 2.76.0, `Grpc.Net.Client` 2.71.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.0, `Polly` 8.5.2) do not all match what the vendored `MxGateway.Client.dll` was built against. The DLL's PE metadata (extracted via `System.Reflection.Metadata`) shows references to `Grpc.Net.Client v2.0.0.0`, `Microsoft.Extensions.Logging.Abstractions v10.0.0.0`, and notably `Polly.Core v8.0.0.0` — and the source csproj just before the sibling-repo rename (commit `bd4a09a` from 2026-04-27) declared `Grpc.Net.Client` 2.76.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.7, and `Polly.Core` 8.6.6 — *not* the meta-package `Polly`. Our driver pulls `Polly` 8.5.2 (which transitively pins `Polly.Core` 8.5.2 per its nuspec dependency), so the vendored client actually loads `Polly.Core` 8.5.2 at runtime against code compiled against 8.6.6. Across an 8.5 ↔ 8.6 minor delta this is usually safe (assembly-version is `v8.0.0.0` for both), but it is exactly the skew shape that surfaces as `MissingMethodException` if a 8.6-only API was used in the client. `libs/README.md` claims "versions match what the sibling repo's `ZB.MOM.WW.MxGateway.Contracts.csproj` uses so the gRPC + proto runtime stays binary-compatible" — that statement is correct only for `Google.Protobuf` and `Grpc.Core.Api`; the other three packages do not match.
|
|
|
|
**Recommendation:** Reconcile the declared package versions with what the vendored DLLs were built against — bump to `Grpc.Net.Client` 2.76.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.7, swap `Polly` for `Polly.Core` 8.6.6 (the driver does not import the `Polly` legacy v7 surface, only Polly.Core via the client). Alternatively, rebuild the vendored DLLs against the same versions the csproj declares and refresh the binaries. Update `libs/README.md` to record the exact versions the DLLs were built against, so the next vendoring refresh has an authoritative reference.
|
|
|
|
### Driver.Galaxy-017
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Design-document adherence |
|
|
| Location | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/` (no source change), gateway proto contract |
|
|
| Status | Open |
|
|
|
|
**Description:** The vendored `MxGateway.Contracts.dll` only carries the OLD `MxGateway.Contracts.Proto[.Galaxy]` namespace (PE-namespace dump confirms — `MxGateway.Client`, `MxGateway.Contracts`, `MxGateway.Contracts.Proto`, `MxGateway.Contracts.Proto.Galaxy` only). The sibling `mxaccessgw` repo's live `Protos/mxaccess_gateway.proto`, `mxaccess_worker.proto`, and `galaxy_repository.proto` files now generate into `ZB.MOM.WW.MxGateway.Contracts.Proto.*`. The proto wire format itself can still evolve (new RPCs, renamed fields, removed fields) and the driver has no contract-version handshake (a repo-wide search for `ContractVersion|ProtocolVersion|ApiVersion|WireVersion` in the driver returns nothing) — so a gateway service that evolves its proto past what the vendored client knows will fail silently at runtime: gRPC `UNIMPLEMENTED` for a renamed RPC, default-value reads for a removed scalar field, or worse, a wire-tag collision if a field number is reused. The risk surface grew with vendoring: previously the `ProjectReference` would have hard-failed at build time if the proto changed shape; now the driver builds green against a frozen contract that may not match the running gateway.
|
|
|
|
**Recommendation:** (a) Add a single `Ping`/`GetVersion` RPC call at gateway-session open, comparing the gateway's reported contract version against a string baked into `libs/README.md` (or a `GatewayContractVersion` const) and refusing the session on mismatch with a clear log. (b) Document in `libs/README.md` the exact mxaccessgw commit SHA (and proto-file SHA-256s) the vendored DLLs were built from, so a parity-rig operator can grep the live gateway for the matching commit. (c) Add a soak/parity test that asserts the live gateway's proto descriptor still matches what the vendored DLL expects — fail loud rather than degrade.
|
|
|
|
### Driver.Galaxy-018
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Location | `libs/README.md:32-37`, `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:40-47` |
|
|
| Status | Open |
|
|
|
|
**Description:** Several small documentation issues in the vendoring artefacts:
|
|
1. `libs/README.md` says "Versions match what the sibling repo's `ZB.MOM.WW.MxGateway.Contracts.csproj` uses" — but `ZB.MOM.WW.MxGateway.Contracts.csproj` only declares `Google.Protobuf` 3.34.1 and `Grpc.Core.Api` 2.76.0; the other three packages (`Grpc.Net.Client`, `Microsoft.Extensions.Logging.Abstractions`, `Polly`) come from the (now-deleted) `MxGateway.Client.csproj`, not the contracts csproj. The README points at the wrong source-of-truth file. See Driver.Galaxy-016 for the related version-skew issue.
|
|
2. `libs/README.md` says the DLLs "are built against net10.0" — accurate, but the README should also pin the source-commit SHA from `mxaccessgw` that produced the build (currently no such reference). Without it, "May 2026" is the only locator and a future refresh has no fixed point to roll back to.
|
|
3. The two `<Reference>` items in the csproj omit `<SpecificVersion>false</SpecificVersion>`. The vendored DLLs carry `AssemblyVersion 1.0.0.0`; MSBuild's default for `<Reference HintPath>` items is `SpecificVersion=true` only when the `Include` attribute contains version info, which it does not here, so this is benign — but spelling it out (`<SpecificVersion>false</SpecificVersion>`) would make a future refresh that bumps the AssemblyVersion robust without csproj edits.
|
|
4. The csproj `<Reference Include="MxGateway.Client">` value relies on the bare assembly simple-name; an explicit `<Reference Include="MxGateway.Client, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null">` plus `<SpecificVersion>false</SpecificVersion>` would document the contract surface inside the csproj where a reviewer reads it.
|
|
|
|
**Recommendation:** (a) Update `libs/README.md` to (i) point at `MxGateway.Client.csproj` for the `Grpc.Net.Client`/`Microsoft.Extensions.Logging.Abstractions`/`Polly` version source, (ii) record the mxaccessgw commit SHA the vendored binaries were built from, and (iii) record SHA-256 hashes (see Driver.Galaxy-015). (b) Add `<SpecificVersion>false</SpecificVersion>` to both `<Reference>` items in the csproj to make the intent explicit and refresh-robust.
|