Re-reviewed the four modules with source changes since the previous review commit 76d35d1, per REVIEW-PROCESS.md section 6. Updated each findings.md header (date 2026-05-23, commit a9be809) and appended new findings under continued numbering. Regenerated README.md.

## New findings: 12 total across 4 modules

### Core.Scripting (5 new, IDs -012 to -016)

- **-012 High Security**: broadened BCL references (System.* + netstandard) re-expose System.Threading.ThreadPool / Timer / AssemblyLoadContext, which the analyzer's deny-list doesn't cover. Re-introduces the background-work threat Core.Scripting-003 closed via System.Threading.Tasks deny.
- **-013 Medium Security**: hand-rolled wrapper-source generation lets brace-balanced user source inject sibling methods/classes alongside CompiledScript.Run. Analyzer still gates forbidden types, but the documented 'method body' authoring contract is silently relaxed.
- **-014 Medium Concurrency**: CompiledScriptCache.Clear() uses key-only TryRemove(key, out _); the same race the -006 resolution fixed in GetOrCompile's catch is latent here on publish-replace.
- **-015 Low Correctness**: ToCSharpTypeName truncates at first backtick; silently drops closed type arguments of nested-generic shapes (Outer<>.Inner<>). Latent: no production caller uses this shape today.
- **-016 Medium Performance**: VirtualTagEngine + ScriptedAlarmEngine call ScriptEvaluator.Compile directly without going through CompiledScriptCache, so the headline -008 collectible-ALC fix doesn't run on the actual production path; the per-publish leak is still in effect.

### Core.ScriptedAlarms (1 new, ID -013)

- **-013 Low Documentation**: new internal test accessors return the live mutable scratch dictionary; XML docs don't warn future test authors about the synchronisation contract.

### Driver.Cli.Common (2 new, IDs -007, -008)

- **-007 High Correctness**: 0x80550000 was added as BadDeviceFailure but the real OPC UA spec value for BadDeviceFailure is 0x808B0000 (verified against Driver.Galaxy.Runtime.StatusCodeMap and HistorianQualityMapper, both of which use the correct 0x808B0000). 0x80550000 is actually BadSecurityPolicyRejected. The native mappers (FOCAS / AbCip / AbLegacy) all use the wrong 0x80550000; this session's SnapshotFormatter extension propagated the wrong name and the test asserts against the same wrong value so CI is blind. Same shape of bug as Driver.Cli.Common-001.
- **-008 Low Testing**: new FormatStatus_names_native_driver_emitted_codes Theory is redundant with the existing well-known Theory (same five InlineData rows added to both) and uses weaker ShouldContain assertion than the well-known Theory's ShouldBe.

### Driver.Galaxy (4 new, IDs -015 to -018)

- **-015 Medium Security**: vendored DLLs (libs/) have no recorded provenance: no source-commit SHA from the mxaccessgw repo, no SHA-256 checksum in libs/README.md. Tampering / accidental swap undetectable.
- **-016 Medium Performance**: version skew between declared PackageReferences (Polly 8.5.2 / Grpc.Net.Client 2.71.0 / Microsoft.Extensions.Logging.Abstractions 10.0.0) and what the vendored DLL was actually built against (Polly.Core 8.6.6 / Grpc.Net.Client 2.76.0 / Microsoft.Extensions.Logging.Abstractions 10.0.7). Latent now (assembly-version refs are loose) but precise shape that produces a runtime MissingMethodException.
- **-017 Low Design**: no contract-version handshake between the driver and the gateway; proto could evolve under the gateway without the driver noticing.
- **-018 Low Documentation**: libs/README.md points at the wrong sibling csproj as the version source-of-truth; missing SpecificVersion=false on the Reference items; missing mxaccessgw source-commit SHA.

## Particularly notable

Two findings undercut commits from this session:

- Driver.Cli.Common-007 invalidates commit 5a9c459 (which named 0x80550000 as BadDeviceFailure across the cross-CLI shortlist).
- Core.Scripting-016 invalidates the production effect of commit 7b6ab2e (the collectible-ALC fix wired Dispose only via CompiledScriptCache, which the engines don't use).

The wider native-mapper miscoding behind -007 also affects three driver modules outside this session's edit scope (FocasStatusMapper, AbCipStatusMapper, AbLegacyStatusMapper all carry the wrong code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Code Review: Driver.Galaxy
| Field | Value |
|---|---|
| Module | src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy |
| Reviewer | Claude Code |
| Review date | 2026-05-23 |
| Commit reviewed | a9be809 |
| Status | Reviewed |
| Open findings | 4 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Galaxy-001, Driver.Galaxy-002, Driver.Galaxy-003, Driver.Galaxy-004 |
| 2 | OtOpcUa conventions | Driver.Galaxy-005 |
| 3 | Concurrency & thread safety | Driver.Galaxy-006, Driver.Galaxy-007 |
| 4 | Error handling & resilience | Driver.Galaxy-001, Driver.Galaxy-008, Driver.Galaxy-009 |
| 5 | Security | Driver.Galaxy-010, Driver.Galaxy-015 |
| 6 | Performance & resource management | Driver.Galaxy-011, Driver.Galaxy-012, Driver.Galaxy-016 |
| 7 | Design-document adherence | Driver.Galaxy-013, Driver.Galaxy-017 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Galaxy-014 |
| 10 | Documentation & comments | Driver.Galaxy-005, Driver.Galaxy-013, Driver.Galaxy-018 |
## Re-review 2026-05-23 (commit a9be809)
The only code-affecting change since 76d35d1 was commit 994997b — the
sibling mxaccessgw repo restructured (the clients/dotnet/MxGateway.Client
project path and the MxGateway.Contracts.Proto namespace both moved), and
the driver's path-based ProjectReference started producing 87 build errors
solution-wide. The fix is build-time only: the broken ProjectReference was
replaced with <Reference HintPath="libs\…"> items pointing at vendored
binary copies of MxGateway.Client.dll (99 KB, May 2026 known-good build)
and MxGateway.Contracts.dll (490 KB), and five PackageReferences that
the dropped project was previously providing transitively (Google.Protobuf,
Grpc.Core.Api, Grpc.Net.Client, Microsoft.Extensions.Logging.Abstractions,
Polly) were declared explicitly. The matching Tests csproj got the same
binary <Reference> for MxGateway.Contracts (replacing its own broken
ProjectReference). A libs/README.md documents what is vendored and the
two unwinding paths (sibling restores a client library, or driver migrates
to the new ZB.MOM.WW.MxGateway.Contracts.Proto namespace + reimplements
the MxGatewayClient / MxGatewaySession / GalaxyRepositoryClient
wrapper, ~2,200 LoC).
No *.cs file changed; the re-review walked only the categories that apply
to a build-time/packaging change. Categories with no new findings:
Correctness (1), OtOpcUa conventions (2), Concurrency (3), Error handling
(4), Code organization (8), Testing coverage (9). Four new findings are
recorded below (Driver.Galaxy-015..018) — none Critical, none High; two
Medium, two Low.
## Findings
### Driver.Galaxy-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Error handling & resilience |
| Location | Runtime/EventPump.cs:128, GalaxyDriver.cs:222 |
| Status | Resolved |
Description: The ReconnectSupervisor is constructed in BuildProductionRuntimeAsync and exposes ReportTransportFailure(Exception) as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls ReportTransportFailure (a repo-wide search finds only the declaration). When the gateway StreamEvents stream faults, EventPump.RunAsync catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits — but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and GetHealth() keeps reporting Healthy because _supervisor.IsDegraded stays false. This is a production outage with no self-recovery.
Recommendation: Wire the EventPump (and any gw RPC that observes a transport fault) to call _supervisor.ReportTransportFailure(ex). The simplest path: give EventPump a fault callback (or expose a StreamFaulted event) that GalaxyDriver subscribes to and forwards to the supervisor. The supervisor's ReopenAsync/ReplayAsync must also restart the EventPump itself (see Driver.Galaxy-008).
Resolution: Resolved 2026-05-22 — added an optional onStreamFault callback to EventPump; RunAsync's stream-fault catch block now invokes it, and GalaxyDriver.EnsureEventPumpStarted wires it to OnEventPumpStreamFault which forwards the cause to ReconnectSupervisor.ReportTransportFailure, so a transient gw transport drop now drives reopen → replay. Regression coverage in EventPumpStreamFaultTests. Note: the EventPump itself is still not restarted on reconnect — that pump-restart gap remains tracked under Driver.Galaxy-008.
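
The wiring described in this resolution can be pictured with a small sketch. It is written in Python purely for illustration (the driver itself is C#), and every name in it (`EventPumpSketch`, `SupervisorSketch`, `on_stream_fault`, `faulting_stream`) is a hypothetical stand-in rather than the driver's API. It only shows the shape of a pump that reports its stream fault through an optional callback instead of exiting silently.

```python
from typing import Callable, Iterable, Optional


class SupervisorSketch:
    """Hypothetical stand-in for ReconnectSupervisor."""

    def __init__(self) -> None:
        self.is_degraded = False
        self.last_cause: Optional[BaseException] = None

    def report_transport_failure(self, cause: BaseException) -> None:
        # The real supervisor would start reopen and replay here;
        # this sketch only records that it was told about the fault.
        self.is_degraded = True
        self.last_cause = cause


class EventPumpSketch:
    """Hypothetical stand-in for EventPump: drains a stream, reports a fault."""

    def __init__(
        self,
        stream: Iterable[int],
        on_stream_fault: Optional[Callable[[BaseException], None]] = None,
    ) -> None:
        self._stream = stream
        self._on_stream_fault = on_stream_fault
        self.delivered: list[int] = []

    def run(self) -> None:
        try:
            for event in self._stream:
                self.delivered.append(event)
        except Exception as fault:
            # Stream fault: stop pumping and, if a callback was supplied,
            # forward the cause so the owner can drive recovery.
            if self._on_stream_fault is not None:
                self._on_stream_fault(fault)


def faulting_stream():
    """Yields two events, then fails like a dropped transport (illustrative)."""
    yield 1
    yield 2
    raise ConnectionError("gateway transport dropped")
```

Passing `supervisor.report_transport_failure` as the callback is the analogue of `EnsureEventPumpStarted` wiring the pump to `OnEventPumpStreamFault`; omitting it reproduces the original silent-exit behaviour.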
### Driver.Galaxy-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | Browse/DataTypeMap.cs:13, Runtime/MxValueDecoder.cs:9 |
| Status | Resolved |
Description: DataTypeMap.Map maps Galaxy mx_data_type codes to six DriverDataType values (Boolean, Int32, Float32, Float64, String, DateTime) — there is no Int64 arm. Yet MxValueDecoder and MxValueEncoder both fully support Int64 (MxValue.Int64Value, Int64Array), and the decoder's own XML doc claims "the seven Galaxy data types ... (Boolean, Int32, Int64, Float32, Float64, String, DateTime)". Any Galaxy attribute whose mx_data_type is the Int64 code (or any code > 5) falls through the _ => DriverDataType.String default. The address-space node is then created as a String variable while runtime reads decode an Int64 boxed value — a type mismatch that produces wrong OPC UA DataType/ValueRank metadata and likely fails value coercion at the server node layer.
Recommendation: Confirm the Galaxy mx_data_type integer code for 64-bit integers and add the explicit arm to DataTypeMap.Map. If the wire format genuinely has no Int64 type, correct the MxValueDecoder/MxValueEncoder doc comments instead. Either way the encoder/decoder and the type map must agree.
Resolution: Resolved 2026-05-22 — added 6 => DriverDataType.Int64 to DataTypeMap.Map, extending the contiguous 0..5 scheme so the type map covers the same seven Galaxy data types MxValueDecoder/MxValueEncoder already decode/encode; Int64 attributes now build as Int64 nodes instead of falling through to the String default. Regression coverage in DataTypeMapTests.
### Driver.Galaxy-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | Runtime/StatusCodeMap.cs:86 |
| Status | Resolved |
Description: FromMxStatus returns Good whenever status.Success != 0. The intent (per the surrounding comment "Honors the success flag") is that a non-zero Success means success. But if MxStatusProxy.Success is itself a native HRESULT/return code rather than a boolean-as-int, then Success != 0 is exactly the failure condition and the mapper inverts it — every failed write/read would report Good. The field name is ambiguous and the rest of the file (Detail, RawDetectedBy, and Hresult used elsewhere) treats 0 as success. GatewayGalaxyAlarmAcknowledger.cs:62 uses the opposite convention for the sibling field (reply.Hresult != 0 means failure).
Recommendation: Verify the semantics of MxStatusProxy.Success against the gateway proto contract. If it is a success-boolean encoded as int, add a code comment pinning that; if it is an HRESULT, invert the check to status.Success == 0 => Good.
Resolution: Resolved 2026-05-22 — replaced status.Success != 0 with status.IsSuccess() (the MxStatusProxyExtensions helper that checks both success != 0 AND category == Ok); the proto contract explicitly documents that success is not a boolean and that clients must branch on category. Regression coverage updated in StatusCodeMapTests with a SuccessNonZeroButCategoryNotOk_IsNotGood assertion pinning the fix.
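
A minimal sketch of the two-condition check this resolution describes, written in Python for illustration only. The numeric category values and the `MxStatusSketch` type are assumptions (the review names only the `Ok` category); the point is that a non-zero `success` field alone does not make a status good.

```python
from dataclasses import dataclass

CATEGORY_OK = 0          # hypothetical numeric value for the proto's Ok category
CATEGORY_NOT_OK = 1      # hypothetical non-Ok category, for illustration only


@dataclass(frozen=True)
class MxStatusSketch:
    """Hypothetical stand-in for MxStatusProxy."""
    success: int
    category: int


def is_success(status: MxStatusSketch) -> bool:
    # Both conditions must hold: success is not a boolean on its own,
    # so the category has to be checked as well.
    return status.success != 0 and status.category == CATEGORY_OK
```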
### Driver.Galaxy-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | GalaxyDriver.cs:901 |
| Status | Resolved |
Description: OnPumpDataChange reconstructs a raw OPC DA quality byte from an OPC UA StatusCode for the probe watcher: it shifts StatusCode >> 30 and maps 0->192, 1->64, _->0. The StatusCode was itself produced upstream by StatusCodeMap.FromQualityByte/FromMxStatus, so this is a lossy round-trip — it collapses every specific code back to the three category bytes (192/64/0). That happens to satisfy PerPlatformProbeWatcher.DecodeState (which only checks qualityByte < 192), so the bug is currently benign, but the mapping is fragile and undocumented except for one inline comment. A future edit to the StatusCodeMap constants or to the shift width would silently desync the probe-health decode with no test guarding it.
Recommendation: Route the probe path off the original quality information rather than reverse-engineering it from a StatusCode. Either carry the raw quality byte on DataValueSnapshot, or add a StatusCodeMap.ToQualityCategoryByte(uint) helper with unit tests so the mapping lives in one place next to its inverse.
Resolution: Resolved 2026-05-22 — added StatusCodeMap.ToQualityCategoryByte(uint) helper that extracts top-two bits of the OPC UA StatusCode into the OPC DA category byte (Good=192, Uncertain=64, Bad=0); GalaxyDriver.OnPumpDataChange now calls this helper instead of inlining the shift+switch, so the mapping lives next to its inverse. Unit tests in StatusCodeMapTests cover all three category buckets and the round-trip invariant.
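
The mapping in this finding fits in a few lines. The block below is an illustrative Python sketch, not the driver's C# helper: the function names are assumptions, and only the shift width (30) and the three category bytes (192/64/0) come from the finding itself.

```python
GOOD_BYTE = 192       # OPC DA "Good" category byte
UNCERTAIN_BYTE = 64   # OPC DA "Uncertain" category byte
BAD_BYTE = 0          # OPC DA "Bad" category byte


def to_quality_category_byte(status_code: int) -> int:
    """Collapse a 32-bit OPC UA StatusCode to an OPC DA category byte."""
    severity = (status_code & 0xFFFFFFFF) >> 30   # top two bits
    if severity == 0:
        return GOOD_BYTE
    if severity == 1:
        return UNCERTAIN_BYTE
    return BAD_BYTE   # severity 2 (Bad) and the remaining value 3


def probe_is_healthy(status_code: int) -> bool:
    # Mirrors the consumer described in the finding, which treats any
    # quality byte below 192 as unhealthy.
    return to_quality_category_byte(status_code) >= GOOD_BYTE
```

Because every specific code in a category collapses to the same byte, the mapping is lossy by design; keeping it in one helper means the probe decode and the forward map can be tested together.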
### Driver.Galaxy-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | Runtime/EventPump.cs:81-88 |
| Status | Resolved |
Description: The BoundedChannelOptions comment states "Newest-dropped policy: when full, the producer's TryWrite returns false ... We do this manually rather than relying on BoundedChannelFullMode.DropWrite" — but the option is then set to FullMode = BoundedChannelFullMode.Wait. With Wait, TryWrite returning false on a full channel is correct behaviour, so the code works, but the comment naming the mode and the actual mode disagree, which is confusing for a maintainer deciding whether the policy is Wait, DropWrite, or DropNewest.
Recommendation: Either reword the comment to say "we use Wait mode but never call the awaitable WriteAsync — TryWrite gives us synchronous newest-dropped semantics", or switch to BoundedChannelFullMode.DropWrite and keep the manual drop count. Make the comment and the mode consistent.
Resolution: Resolved 2026-05-23 — reworded the BoundedChannelOptions comment to say "we use FullMode.Wait but never call the awaitable WriteAsync — only synchronous TryWrite, which returns false immediately on a full channel and lets us account for drops on the EventsDropped counter". Also explains why we deliberately do NOT use BoundedChannelFullMode.DropWrite (it would silently discard without surfacing on the counter). Comment and FullMode value now agree.
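
The newest-dropped policy discussed here can be shown with an analogy. The sketch below uses Python's standard `queue.Queue` rather than .NET channels, so it is only an approximation of the driver's behaviour; `DropCountingWriter` and its members are hypothetical names. It shows a non-blocking write that fails on a full buffer so the caller can count the drop.

```python
import queue


class DropCountingWriter:
    """Bounded buffer whose writer never blocks and counts rejected items."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[int]" = queue.Queue(maxsize=capacity)
        self.events_dropped = 0

    def try_write(self, item: int) -> bool:
        try:
            # put_nowait never waits; it raises queue.Full at capacity.
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            # The newest item is rejected and the drop stays visible.
            self.events_dropped += 1
            return False

    def read_all(self) -> list[int]:
        items: list[int] = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items
```

A mode that discards silently inside the buffer would return success to the writer, which is why the resolution keeps the explicit failure path and the counter.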
### Driver.Galaxy-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | GalaxyDriver.cs:848-861 |
| Status | Resolved |
Description: OnAlarmFeedTransition picks the "owner" handle with _alarmSubscriptions.First() under _alarmHandlersLock. HashSet<T>.First() enumeration order is unspecified and unstable across mutations — when multiple alarm subscriptions are active, the handle attached to a given AlarmEventArgs can change arbitrarily between transitions. The XML doc acknowledges "we still only fire the event once" but the downstream AlarmConditionService correlates transitions to the originating subscription via this handle; a non-deterministic owner can misroute unsubscribe bookkeeping or per-subscription state.
Recommendation: If alarm transitions genuinely fan out to all subscriptions, raise OnAlarmEvent once per active handle (or document that the handle is a non-correlating sentinel and have the server stop relying on it). If a single owner is required, make the choice deterministic (e.g. the earliest-created handle) and stable.
Resolution: Resolved 2026-05-22 — changed _alarmSubscriptions from HashSet<GalaxyAlarmSubscriptionHandle> to List<GalaxyAlarmSubscriptionHandle> so insertion order is preserved; OnAlarmFeedTransition now picks [0] (earliest-registered handle) instead of First() on a HashSet, making the owner selection deterministic and stable across mutations. Server routing uses SourceNodeId not the handle, so every active subscriber sees the same transition regardless of which handle is attached.
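
As a rough illustration of the deterministic owner selection in this resolution, the Python sketch below keeps handles in insertion order and always picks the earliest one. The class and method names are hypothetical and handles are plain strings here; the driver's actual type is `GalaxyAlarmSubscriptionHandle`.

```python
from typing import Optional


class AlarmSubscriptionsSketch:
    """Insertion-ordered handle list with a stable 'owner' choice."""

    def __init__(self) -> None:
        self._handles: list[str] = []

    def add(self, handle: str) -> None:
        if handle not in self._handles:   # keep set-like uniqueness
            self._handles.append(handle)

    def remove(self, handle: str) -> None:
        if handle in self._handles:
            self._handles.remove(handle)

    def owner(self) -> Optional[str]:
        # Earliest-registered handle; None when nothing is subscribed.
        return self._handles[0] if self._handles else None
```

The owner only changes when the earliest handle itself is removed, at which point the next-earliest takes over.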
### Driver.Galaxy-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | GalaxyDriver.cs:937-968 |
| Status | Resolved |
Description: Dispose() is not synchronized against the capability methods. It sets _disposed = true then disposes _eventPump, _alarmFeed, _ownedMxSession, _ownedMxClient, _supervisor, etc. A concurrent SubscribeAsync/ReadAsync/WriteAsync that passed its ObjectDisposedException.ThrowIf check at entry can then dereference _subscriber/_dataWriter whose backing GalaxyMxSession is being disposed mid-call, producing ObjectDisposedException/NullReferenceException from deep inside the gw client rather than a clean failure. Dispose also blocks the caller on GetAwaiter().GetResult() of several async disposals, risking a deadlock if invoked from a thread-pool-starved context.
Recommendation: Gate capability entry points so they cannot start new gw work once _disposed is set (e.g. a CancellationTokenSource linked into every call, cancelled first in Dispose). Consider implementing IAsyncDisposable so the async sub-component disposals do not block on GetResult().
Resolution: Resolved 2026-05-22 — added IAsyncDisposable to GalaxyDriver and implemented DisposeAsync() as the primary disposal path that awaits each async sub-component (EventPump, AlarmFeed, MxSession, MxClient, RepositoryClient) without blocking; Dispose() delegates to DisposeAsync().AsTask().GetAwaiter().GetResult() for using-statement compatibility. The sync blocking-on-GetResult anti-pattern in the previous Dispose body is eliminated on the hot path. Note: the CancellationTokenSource gate for concurrent capability entry was not added — the existing ObjectDisposedException.ThrowIf(_disposed, this) guards at capability entry points already provide the fast-fail, and a separate CTS would add complexity without solving the TOCTOU window noted in the finding; that window is benign in practice (the sub-component's own disposed check catches it).
### Driver.Galaxy-008
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | GalaxyDriver.cs:264-276, Runtime/EventPump.cs:97-103 |
| Status | Resolved |
Description: Even if Driver.Galaxy-001 is fixed and the supervisor's ReplayAsync runs, recovery is incomplete. ReplayAsync re-issues SubscribeBulkAsync for the tracked tags, but the EventPump background loop that consumes StreamEvents is not restarted. After a stream fault EventPump.RunAsync exits and _channel is completed; EventPump.Start() is a no-op (if (_loop is not null) return) because _loop is a completed-but-non-null task. So a replayed subscription has no consumer — values are subscribed on the gw but never reach OnDataChange. Additionally ReplayAsync never re-registers the new item handles the gw returns into SubscriptionRegistry; the old stale item handles remain, so even with a live pump the fan-out reverse-map would miss the post-reconnect handles.
Recommendation: On reconnect, dispose and recreate the EventPump (or make it restartable), and have ReplayAsync update SubscriptionRegistry bindings with the new item handles returned by the post-reconnect SubscribeBulkAsync. Add an integration/parity test that drops the stream mid-subscription and asserts OnDataChange resumes.
Resolution: Resolved 2026-05-22 — ReplayAsync now calls a new RestartEventPumpForReplay (disposes the faulted pump, recreates and restarts a fresh one) and re-issues SubscribeBulkAsync per subscription, then SubscriptionRegistry.Rebind swaps each subscription's stale pre-reconnect item handles for the post-reconnect handles so the fan-out reverse map dispatches to the live pump. New SubscriptionRegistry.SnapshotEntries/Rebind APIs back the per-subscription replay. Regression coverage in SubscriptionRegistryTests (Rebind/SnapshotEntries) and EventPumpStreamFaultTests.FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch.
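
The handle swap at the heart of this resolution can be sketched as follows. This is an illustrative Python model under assumed data shapes, not the driver's `SubscriptionRegistry`: the class, its fields, and the method signatures are hypothetical. It shows why stale pre-reconnect handles must be detached before the post-reconnect handles are registered.

```python
class SubscriptionRegistrySketch:
    """Tracks item handles per subscription plus a reverse map for fan-out."""

    def __init__(self) -> None:
        # subscription id -> {item handle -> full reference}
        self._bindings: dict[int, dict[int, str]] = {}
        # item handle -> ids of subscriptions listening on it
        self._subscribers_by_handle: dict[int, frozenset[int]] = {}

    def register(self, subscription_id: int, bindings: dict[int, str]) -> None:
        self._bindings[subscription_id] = dict(bindings)
        for handle in bindings:
            current = self._subscribers_by_handle.get(handle, frozenset())
            self._subscribers_by_handle[handle] = current | {subscription_id}

    def _detach(self, subscription_id: int) -> None:
        for handle in self._bindings.pop(subscription_id, {}):
            remaining = self._subscribers_by_handle.get(handle, frozenset()) - {subscription_id}
            if remaining:
                self._subscribers_by_handle[handle] = remaining
            else:
                self._subscribers_by_handle.pop(handle, None)

    def rebind(self, subscription_id: int, new_bindings: dict[int, str]) -> None:
        # Drop the stale handles first, then publish the fresh ones.
        self._detach(subscription_id)
        self.register(subscription_id, new_bindings)

    def resolve_subscribers(self, item_handle: int) -> list[tuple[int, str]]:
        ids = sorted(self._subscribers_by_handle.get(item_handle, frozenset()))
        return [(sub_id, self._bindings[sub_id][item_handle]) for sub_id in ids]
```

After a rebind, events that arrive on the new handles reach the subscription, while the old handles resolve to nothing.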
### Driver.Galaxy-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | GalaxyDriver.cs:354-371 |
| Status | Resolved |
Description: StartDeployWatcher launches the watch loop with _ = _deployWatcher.StartAsync(CancellationToken.None) — a fire-and-forget with a discarded Task. StartAsync can throw synchronously (InvalidOperationException if already started); the discard masks that programming error. Separately, StartDeployWatcher builds an _ownedRepositoryClient purely for the watcher when discovery has not run yet — if DiscoverAsync later runs, BuildDefaultHierarchySource overwrites _ownedRepositoryClient with a second client, leaking the first (only the latest reference is disposed in Dispose).
Recommendation: Await StartAsync (it completes synchronously after scheduling) or at least observe its result. Reuse a single GalaxyRepositoryClient across the deploy watcher and the hierarchy source instead of letting BuildDefaultHierarchySource clobber the field — guard the assignment or build the client once in InitializeAsync.
Resolution: Resolved 2026-05-22 — (a) replaced _ = _deployWatcher.StartAsync(...) discard with an explicit variable + IsFaulted check so any synchronous throw from StartAsync (e.g. called-twice InvalidOperationException) propagates rather than being silently swallowed; (b) changed both StartDeployWatcher and BuildDefaultHierarchySource to use _ownedRepositoryClient ??= so a client built by the watcher is reused by discovery instead of being overwritten and leaked — only one GalaxyRepositoryClient instance is now created and disposed.
### Driver.Galaxy-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | GalaxyDriver.cs:311-341 |
| Status | Resolved |
Description: ResolveApiKey supports an env:/file: indirection and otherwise treats the config string as the literal API key ("Anything else — used as the literal API key. Convenient for dev"). GalaxyGatewayOptions' own XML doc claims "the API key never appears in cleartext config". The literal-key fallback silently permits a plaintext API key in the DriverConfig JSON column of the central config DB, contradicting the documented contract. There is no warning logged when the literal path is taken.
Recommendation: Log a startup warning when ResolveApiKey falls through to the literal arm so an operator who accidentally committed a cleartext key sees it, and update the GalaxyGatewayOptions doc comment so it no longer over-promises. Consider gating the literal arm behind an explicit dev:-style prefix so a cleartext key cannot be used by accident.
Resolution: Resolved 2026-05-23 — (a) added a logger-aware ResolveApiKey(string, ILogger?) overload that emits a Warning when the back-compat literal arm is taken, and wired the BuildClientOptions call site to pass _logger; (b) added an explicit dev:KEY prefix that returns the literal value without warning, so dev rigs / parity tests can opt-in deliberately; (c) rewrote the GalaxyGatewayOptions.ApiKeySecretRef XML doc so it no longer claims "the API key never appears in cleartext config" — it now documents all four supported forms (env:, file:, dev:, and the warning-on-literal back-compat path). Regression coverage in GalaxyDriverApiKeyResolverTests (Literal_string_emits_warning_when_logger_supplied, Dev_prefix_returns_literal_without_warning, Env_prefix_does_not_emit_literal_warning).
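
The four supported forms can be summarised in a short sketch. It is written in Python for illustration and is not the driver's `ResolveApiKey`: the function name, the `warn` callback, the warning text, the whitespace stripping on `file:`, and the `KeyError` on a missing environment variable are all assumptions. Only the prefixes and the warn-on-literal rule come from the resolution.

```python
import os
from pathlib import Path
from typing import Callable, Optional


def resolve_api_key(secret_ref: str,
                    warn: Optional[Callable[[str], None]] = None) -> str:
    if secret_ref.startswith("env:"):
        # Raises KeyError when the variable is unset (assumed behaviour).
        return os.environ[secret_ref[len("env:"):]]
    if secret_ref.startswith("file:"):
        # Trimming surrounding whitespace is an assumption of this sketch.
        return Path(secret_ref[len("file:"):]).read_text(encoding="utf-8").strip()
    if secret_ref.startswith("dev:"):
        # Deliberate opt-in to a literal key: no warning.
        return secret_ref[len("dev:"):]
    # Back-compat literal arm: still works, but is made visible.
    if warn is not None:
        warn("API key supplied as a literal string; prefer env:, file: or dev:")
    return secret_ref
```

The explicit `dev:` prefix is what separates a deliberate literal key from an accidental one, so only the unprefixed path needs to warn.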
### Driver.Galaxy-011
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | GalaxyDriver.cs:411 |
| Status | Resolved |
Description: GetMemoryFootprint() unconditionally returns 0 with a comment "PR 4.4 sets this from SubscriptionRegistry size" — PR 4.4 has shipped (the registry exists and is used) but the method was never updated. IHostConnectivityProbe.GetMemoryFootprint is consumed by the server's status/health surface to gauge cache-flush pressure; a constant 0 makes the Galaxy driver invisible to that mechanism, so a 50k-tag subscription set never registers as memory pressure and FlushOptionalCachesAsync (also a no-op) is never meaningfully triggered.
Recommendation: Return a real estimate derived from SubscriptionRegistry.TrackedSubscriptionCount/TrackedItemHandleCount (and the EventPump channel occupancy), or document explicitly why the Galaxy driver opts out of footprint reporting. Remove the stale "PR 4.4 sets this" comment.
Resolution: Resolved 2026-05-22 — replaced the constant 0 with a live estimate derived from SubscriptionRegistry.TrackedItemHandleCount (64 bytes/handle) and TrackedSubscriptionCount (256 bytes/subscription); returns 0 when no subscriptions are active and grows with the registry. The stale "PR 4.4 sets this" comment is removed. Regression coverage in GalaxyDriverInfrastructureTests.
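
For a sense of scale, the estimate can be worked through in a few lines. The Python sketch below assumes the two per-item costs are simply summed, which the resolution does not state explicitly; the function name is hypothetical, while the 64-byte and 256-byte figures are taken from the resolution.

```python
BYTES_PER_ITEM_HANDLE = 64     # per tracked item handle (from the resolution)
BYTES_PER_SUBSCRIPTION = 256   # per tracked subscription (from the resolution)


def estimate_memory_footprint(tracked_item_handles: int,
                              tracked_subscriptions: int) -> int:
    # Assumed combination: a plain sum of the two per-item costs.
    return (tracked_item_handles * BYTES_PER_ITEM_HANDLE
            + tracked_subscriptions * BYTES_PER_SUBSCRIPTION)
```

Under that assumption, a single subscription carrying 50,000 item handles reports 3,200,256 bytes (about 3 MiB), and an empty registry reports 0.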
### Driver.Galaxy-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | Runtime/SubscriptionRegistry.cs:65-67, GalaxyDriver.cs:538, GalaxyDriver.cs:675 |
| Status | Resolved |
Description: Several hot paths are O(n^2) per call. SubscriptionRegistry.ResolveSubscribers does entry.Bindings.FirstOrDefault(b => b.ItemHandle == itemHandle) — a linear scan of the whole binding list for every event dispatch; at 50k tags this is 50k-element scans on the 1Hz fan-out path. GalaxyDriver.SubscribeAsync and ReadViaSubscribeOnceAsync correlate results to references with results.FirstOrDefault(r => string.Equals(...)) inside a for loop over all references — O(n^2) over the subscribe batch. SubscriptionRegistry.Remove rebuilds a ConcurrentBag from a LINQ filter on every unsubscribe.
Recommendation: Index SubscriptionEntry bindings by item handle (a Dictionary<int, string> per entry) so ResolveSubscribers is O(1) per subscriber. Project the SubscribeResult list into a Dictionary<string, SubscribeResult> (OrdinalIgnoreCase) once before the correlation loop. These matter on the documented 50k-tag soak path.
Resolution: Resolved 2026-05-23 — three changes: (a) SubscriptionEntry now carries a FullRefByItemHandle Dictionary<int, string> built once at construction; ResolveSubscribers does O(1) lookups per subscriber instead of a FirstOrDefault linear scan of the binding list. (b) Reverse map _subscribersByItemHandle swapped from ConcurrentBag<long> to ImmutableHashSet<long> — Remove/Rebind use set.Remove(id) (O(log n)) instead of "rebuild a new bag from a LINQ filter on every unsubscribe", and reads remain lock-free via atomic publication through ConcurrentDictionary.AddOrUpdate. (c) GalaxyDriver.SubscribeAsync + ReadViaSubscribeOnceAsync now index the SubscribeResult list once via the existing BuildResultIndex helper (already used by ReplayAsync) so per-reference correlation is O(1). Regression coverage in SubscriptionRegistryTests.ResolveSubscribers_LargeBindingSet_DispatchesCorrectly.
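
The correlation fix in part (c) follows a common pattern: index once, then look up. The Python sketch below illustrates it under assumptions: `SubscribeResultSketch` and its field names are hypothetical, and `casefold()` only approximates .NET's `OrdinalIgnoreCase` comparison.

```python
from typing import NamedTuple, Optional


class SubscribeResultSketch(NamedTuple):
    full_reference: str   # hypothetical field name
    item_handle: int      # hypothetical field name


def build_result_index(
    results: list[SubscribeResultSketch],
) -> dict[str, SubscribeResultSketch]:
    index: dict[str, SubscribeResultSketch] = {}
    for result in results:
        # setdefault keeps the first match per key, as FirstOrDefault would.
        index.setdefault(result.full_reference.casefold(), result)
    return index


def correlate(
    references: list[str],
    results: list[SubscribeResultSketch],
) -> list[Optional[SubscribeResultSketch]]:
    index = build_result_index(results)   # one O(n) pass over the results
    # O(1) lookup per reference instead of a linear scan per reference.
    return [index.get(reference.casefold()) for reference in references]
```

For a batch of n references this replaces n linear scans (O(n^2) overall) with one pass to build the index plus n constant-time lookups.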
### Driver.Galaxy-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | GalaxyDriver.cs:14-27, GalaxyDriver.cs:374-382, Config/GalaxyDriverOptions.cs:84-86 |
| Status | Resolved |
Description: Multiple doc comments are stale relative to the shipped code. GalaxyDriver's class summary still describes the file as "the project skeleton with IDriver bodies that wire to a future IGalaxyGatewayClient abstraction. Capability interfaces ... land in PRs 4.1-4.7" and references the legacy GalaxyProxyDriver coexisting "until PR 7.2" — but PR 7.2 already deleted the legacy Galaxy projects and the capability interfaces are all implemented. ReinitializeAsync is still a stub ("for the skeleton we just refresh health") that ignores driverConfigJson entirely — a config reapply silently does nothing. GalaxyReconnectOptions.ReplayOnSessionLost is defined and documented but never read anywhere in the driver (ReplayAsync always replays).
Recommendation: Refresh the GalaxyDriver class and ReinitializeAsync doc comments to describe the shipped state, implement or explicitly reject ReinitializeAsync config reapply, and either honour ReplayOnSessionLost or remove it from GalaxyReconnectOptions.
Resolution: Resolved 2026-05-23 — three fixes: (a) rewrote the GalaxyDriver class summary to describe the shipped capability surface (ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable, IHostConnectivityProbe, IAlarmSource) and removed the stale "PR 4.0 skeleton" / "legacy GalaxyProxyDriver coexists until PR 7.2" wording — PR 7.2 already retired the legacy projects. (b) ReinitializeAsync now parses the incoming driverConfigJson through the factory pipeline and compares the result to _options; an equivalent reapply refreshes health, a non-equivalent change throws NotSupportedException so a config swap never silently no-ops. (c) ReplayAsync now honours _options.Reconnect.ReplayOnSessionLost — when false it restarts the EventPump but skips the per-tag SubscribeBulk fan-out, delegating to gateway session-level replay. Regression coverage in GalaxyDriverInfrastructureTests (ReinitializeAsync_RejectsNonEquivalentConfigChange, ReinitializeAsync_AcceptsEquivalentConfig, ReplayOnSessionLost_False_SkipsResubscribeBulk, ReplayOnSessionLost_True_RunsResubscribeBulk). Updated GalaxyDriverFactoryTests.ReinitializeAsync_RefreshesHealth_WhenConfigIsEquivalent to use an equivalent config JSON.
### Driver.Galaxy-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy (module-wide) |
| Status | Resolved |
Description: The reconnect/recovery path is the module's highest-risk surface and is effectively untested at the integration seam. The ReconnectSupervisor has a clean test seam (injectable reopen/replay/backoffDelay), but because nothing wires ReportTransportFailure (Driver.Galaxy-001) there can be no test asserting that an EventPump stream fault actually drives recovery — the gap that would have caught the Critical finding. Similarly there appears to be no test that a post-reconnect ReplayAsync re-registers new item handles and that OnDataChange resumes (Driver.Galaxy-008). The StatusCodeMap.FromMxStatus Success-flag semantics (Driver.Galaxy-003) and the DataTypeMap Int64 gap (Driver.Galaxy-002) are also the kind of behaviour a focused unit test would pin.
Recommendation: Add unit/parity tests covering: (a) stream fault -> supervisor reopen -> EventPump restart -> OnDataChange resumes; (b) ReplayAsync updates SubscriptionRegistry with new handles; (c) StatusCodeMap.FromMxStatus for both success and failure MxStatusProxy rows; (d) DataTypeMap for every Galaxy mx_data_type code including 64-bit integer.
Resolution: Resolved 2026-05-22 — added GalaxyDriverInfrastructureTests covering GetMemoryFootprint (Driver.Galaxy-011) and IAsyncDisposable (Driver.Galaxy-007); (a) stream-fault → supervisor reopen → EventPump restart → OnDataChange resumes is covered by EventPumpStreamFaultTests.StreamFault_DrivesReconnectSupervisorReopenReplay and FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch (landed with Driver.Galaxy-001/008 resolution); (b) post-reconnect ReplayAsync rebinds handles is covered by SubscriptionRegistryTests.Rebind_* suite; (c) StatusCodeMap.FromMxStatus success/failure rows are covered by StatusCodeMapTests.FromMxStatus_SuccessNonZeroAndCategoryOk_IsGood and FromMxStatus_SuccessNonZeroButCategoryNotOk_IsNotGood (landed with Driver.Galaxy-003); (d) DataTypeMap for all seven mx_data_type codes including Int64 is covered by DataTypeMapTests (landed with Driver.Galaxy-002).
Driver.Galaxy-015
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | libs/MxGateway.Client.dll, libs/MxGateway.Contracts.dll, libs/README.md |
| Status | Open |
Description: Commit 994997b checks in two binary DLLs (MxGateway.Client.dll, 99 840 bytes; MxGateway.Contracts.dll, 489 984 bytes) under src/Drivers/.../Driver.Galaxy/libs/ and references them via <Reference HintPath="…" />. These are the only checked-in binary build artefacts in the entire repo (a repo-wide find for non-bin//obj/ *.dll under libs/ returns only these two), so the change sets a precedent. The accompanying libs/README.md states the DLLs are "byte-for-byte the build output" of the OtOpcUa team's own code against the gateway's open proto contracts, but there is no recorded provenance — no source-commit SHA from the sibling mxaccessgw repo that produced the build, no SHA-256/SHA-512 checksum, no .gitattributes rule marking these paths as binary (so a future churn-in-place will balloon the pack file). Without a recorded source commit + checksum it is impossible for a future reviewer/auditor to verify the binaries match a specific revision of the sibling repo — the assertion "we built them, not external" is unverifiable after the fact. Tampering or accidental swap (e.g. someone drops in a different DLL of the same name under the same path) would not be detectable.
Recommendation: (a) Pin the source provenance: add the sibling mxaccessgw commit SHA used to build each DLL to libs/README.md. (b) Record a SHA-256 of each .dll in libs/README.md so future tampering or an accidental update is detectable by running Get-FileHash/sha256sum. (c) Add a .gitattributes rule under libs/ declaring *.dll binary (and consider filter=lfs diff=lfs merge=lfs -text if/when these need to be updated, to avoid bloating the pack file on every refresh). (d) Optional: add a check that runs under dotnet test and compares the on-disk hash to the recorded hash, so a CI run notices if the file drifts from what the README claims.
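The hash comparison in (b) and (d) could look roughly like the following. This is a minimal sketch in Python rather than the repo's own test stack; `sha256_of`, `find_drift`, and the `recorded` mapping are hypothetical names, and the recorded digests are assumed to have been copied out of libs/README.md by hand, since no checksums exist there today.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the lowercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_drift(libs_dir: Path, recorded: dict[str, str]) -> list[str]:
    """List every recorded file that is missing or whose hash differs.

    `recorded` maps a file name (e.g. "MxGateway.Client.dll") to the hex
    digest written down at vendoring time. An empty result means the
    on-disk binaries still match the recorded provenance.
    """
    problems = []
    for name, expected in sorted(recorded.items()):
        target = libs_dir / name
        if not target.is_file():
            problems.append(f"{name}: missing")
        elif sha256_of(target) != expected.lower():
            problems.append(f"{name}: hash mismatch")
    return problems
```

A CI step would call `find_drift` with the libs/ directory and fail the run when the returned list is non-empty. The comparison lowercases the recorded digest because Get-FileHash prints uppercase hex while sha256sum prints lowercase.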
Driver.Galaxy-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:43-47, libs/README.md:32-37 |
| Status | Open |
Description: The five new PackageReference versions declared in the csproj (Google.Protobuf 3.34.1, Grpc.Core.Api 2.76.0, Grpc.Net.Client 2.71.0, Microsoft.Extensions.Logging.Abstractions 10.0.0, Polly 8.5.2) do not all match what the vendored MxGateway.Client.dll was built against. The DLL's PE metadata (extracted via System.Reflection.Metadata) shows references to Grpc.Net.Client v2.0.0.0, Microsoft.Extensions.Logging.Abstractions v10.0.0.0, and notably Polly.Core v8.0.0.0. The source csproj just before the sibling-repo rename (commit bd4a09a from 2026-04-27) declared Grpc.Net.Client 2.76.0, Microsoft.Extensions.Logging.Abstractions 10.0.7, and Polly.Core 8.6.6, not the meta-package Polly. Our driver pulls Polly 8.5.2 (which transitively pins Polly.Core 8.5.2 per its nuspec dependency), so at runtime the vendored client loads Polly.Core 8.5.2 although it was compiled against 8.6.6. Across an 8.5 ↔ 8.6 minor delta this is usually safe (the assembly version is v8.0.0.0 for both), but it is exactly the skew shape that surfaces as MissingMethodException if an 8.6-only API was used in the client. libs/README.md claims "versions match what the sibling repo's ZB.MOM.WW.MxGateway.Contracts.csproj uses so the gRPC + proto runtime stays binary-compatible"; that statement is correct only for Google.Protobuf and Grpc.Core.Api. The other three packages do not match.
Recommendation: Reconcile the declared package versions with what the vendored DLLs were built against — bump to Grpc.Net.Client 2.76.0, Microsoft.Extensions.Logging.Abstractions 10.0.7, swap Polly for Polly.Core 8.6.6 (the driver does not import the Polly legacy v7 surface, only Polly.Core via the client). Alternatively, rebuild the vendored DLLs against the same versions the csproj declares and refresh the binaries. Update libs/README.md to record the exact versions the DLLs were built against, so the next vendoring refresh has an authoritative reference.
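To keep the declared and built-against versions from drifting apart again, a small check along these lines could run at vendoring-refresh time. It is a sketch under stated assumptions: the version numbers are the ones quoted in this finding, `SAMPLE_CSPROJ` is a trimmed illustration rather than the real project file, the PackageReference items are assumed to carry Version as an attribute (not a child element), and `declared_versions` / `version_skew` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Versions the vendored DLLs were built against, as quoted in this finding.
BUILT_AGAINST = {
    "Google.Protobuf": "3.34.1",
    "Grpc.Core.Api": "2.76.0",
    "Grpc.Net.Client": "2.76.0",
    "Microsoft.Extensions.Logging.Abstractions": "10.0.7",
    "Polly.Core": "8.6.6",
}

# Trimmed illustration of the driver csproj; only the five references matter.
SAMPLE_CSPROJ = """
<Project Sdk="Microsoft.NET.Sdk">
  <ItemGroup>
    <PackageReference Include="Google.Protobuf" Version="3.34.1" />
    <PackageReference Include="Grpc.Core.Api" Version="2.76.0" />
    <PackageReference Include="Grpc.Net.Client" Version="2.71.0" />
    <PackageReference Include="Microsoft.Extensions.Logging.Abstractions" Version="10.0.0" />
    <PackageReference Include="Polly" Version="8.5.2" />
  </ItemGroup>
</Project>
"""


def declared_versions(csproj_xml: str) -> dict[str, str]:
    """Collect PackageReference Include -> Version pairs from csproj text."""
    root = ET.fromstring(csproj_xml)
    return {
        ref.attrib["Include"]: ref.attrib["Version"]
        for ref in root.iter("PackageReference")
        if "Include" in ref.attrib and "Version" in ref.attrib
    }


def version_skew(declared: dict[str, str], built_against: dict[str, str]):
    """Return (package, declared, built) rows that differ, sorted by package.

    A declared value of None means the csproj does not reference the
    package the vendored DLL was built against at all.
    """
    rows = []
    for package, built in sorted(built_against.items()):
        have = declared.get(package)
        if have != built:
            rows.append((package, have, built))
    return rows
```

Run against the sample, `version_skew(declared_versions(SAMPLE_CSPROJ), BUILT_AGAINST)` reports three rows: Grpc.Net.Client (2.71.0 vs 2.76.0), Microsoft.Extensions.Logging.Abstractions (10.0.0 vs 10.0.7), and Polly.Core (not declared directly vs 8.6.6). The check only sees direct references, so the transitive Polly.Core 8.5.2 pulled in through Polly shows up as "not declared" rather than as a version mismatch.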
Driver.Galaxy-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/ (no source change), gateway proto contract |
| Status | Open |
Description: The vendored MxGateway.Contracts.dll only carries the OLD MxGateway.Contracts.Proto[.Galaxy] namespace (PE-namespace dump confirms — MxGateway.Client, MxGateway.Contracts, MxGateway.Contracts.Proto, MxGateway.Contracts.Proto.Galaxy only). The sibling mxaccessgw repo's live Protos/mxaccess_gateway.proto, mxaccess_worker.proto, and galaxy_repository.proto files now generate into ZB.MOM.WW.MxGateway.Contracts.Proto.*. The proto wire format itself can still evolve (new RPCs, renamed fields, removed fields) and the driver has no contract-version handshake (a repo-wide search for ContractVersion|ProtocolVersion|ApiVersion|WireVersion in the driver returns nothing) — so a gateway service that evolves its proto past what the vendored client knows will fail silently at runtime: gRPC UNIMPLEMENTED for a renamed RPC, default-value reads for a removed scalar field, or worse, a wire-tag collision if a field number is reused. The risk surface grew with vendoring: previously the ProjectReference would have hard-failed at build time if the proto changed shape; now the driver builds green against a frozen contract that may not match the running gateway.
Recommendation: (a) Add a single Ping/GetVersion RPC call at gateway-session open, comparing the gateway's reported contract version against a string baked into libs/README.md (or a GatewayContractVersion const) and refusing the session on mismatch with a clear log. (b) Document in libs/README.md the exact mxaccessgw commit SHA (and proto-file SHA-256s) the vendored DLLs were built from, so a parity-rig operator can grep the live gateway for the matching commit. (c) Add a soak/parity test that asserts the live gateway's proto descriptor still matches what the vendored DLL expects — fail loud rather than degrade.
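The session-open check in (a) might be shaped as follows. This is an illustrative sketch only: the gateway contract exposes no version RPC today as far as this review found, so `fetch_version` stands in for the hypothetical Ping/GetVersion call, and `EXPECTED_CONTRACT_VERSION`, `ContractMismatchError`, and `ensure_contract_version` are invented names. The real implementation would live in the driver's session-open path.

```python
from typing import Callable

# Hypothetical constant; the real value would be recorded at vendoring time.
EXPECTED_CONTRACT_VERSION = "1"


class ContractMismatchError(RuntimeError):
    """Raised when the gateway reports a contract version we were not built for."""


def ensure_contract_version(
    fetch_version: Callable[[], str],
    expected: str = EXPECTED_CONTRACT_VERSION,
) -> str:
    """Fetch the gateway's contract version and refuse the session on mismatch.

    `fetch_version` is a placeholder for the proposed version RPC. The
    error message carries both versions so the log line is actionable.
    """
    reported = fetch_version()
    if reported != expected:
        raise ContractMismatchError(
            f"gateway contract version {reported!r} does not match "
            f"vendored client version {expected!r}; refusing session"
        )
    return reported
```

An exact-match comparison is the strictest option. If the gateway is expected to add RPCs in a backward-compatible way, a major/minor scheme that only rejects a differing major version may be a better fit.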
Driver.Galaxy-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | libs/README.md:32-37, ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:40-47 |
| Status | Open |
Description: Several small documentation issues in the vendoring artefacts:
- libs/README.md says "Versions match what the sibling repo's ZB.MOM.WW.MxGateway.Contracts.csproj uses", but ZB.MOM.WW.MxGateway.Contracts.csproj only declares Google.Protobuf 3.34.1 and Grpc.Core.Api 2.76.0; the other three packages (Grpc.Net.Client, Microsoft.Extensions.Logging.Abstractions, Polly) come from the (now-deleted) MxGateway.Client.csproj, not the contracts csproj. The README points at the wrong source-of-truth file. See Driver.Galaxy-016 for the related version-skew issue.
- libs/README.md says the DLLs "are built against net10.0". That is accurate, but the README should also pin the source-commit SHA from mxaccessgw that produced the build (currently no such reference). Without it, "May 2026" is the only locator and a future refresh has no fixed point to roll back to.
- The two <Reference> items in the csproj omit <SpecificVersion>false</SpecificVersion>. The vendored DLLs carry AssemblyVersion 1.0.0.0; MSBuild's default for <Reference HintPath> items is SpecificVersion=true only when the Include attribute contains version info, which it does not here, so this is benign. Spelling it out (<SpecificVersion>false</SpecificVersion>) would still make a future refresh that bumps the AssemblyVersion robust without csproj edits.
- The csproj <Reference Include="MxGateway.Client"> value relies on the bare assembly simple-name; an explicit <Reference Include="MxGateway.Client, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"> plus <SpecificVersion>false</SpecificVersion> would document the contract surface inside the csproj where a reviewer reads it.
Recommendation: (a) Update libs/README.md to (i) point at MxGateway.Client.csproj for the Grpc.Net.Client/Microsoft.Extensions.Logging.Abstractions/Polly version source, (ii) record the mxaccessgw commit SHA the vendored binaries were built from, and (iii) record SHA-256 hashes (see Driver.Galaxy-015). (b) Add <SpecificVersion>false</SpecificVersion> to both <Reference> items in the csproj to make the intent explicit and refresh-robust.