GetMemoryFootprint() returned a constant 0 with a stale "PR 4.4 sets this" comment even though PR 4.4 shipped the SubscriptionRegistry. Replace with a live estimate: 64 bytes × TrackedItemHandleCount + 256 bytes × TrackedSubscriptionCount. A 50k-tag set now registers ~3 MB with the server's cache-flush heuristic instead of being invisible. Returns 0 when no subscriptions are active. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
22 KiB
Code Review — Driver.Galaxy
| Field | Value |
|---|---|
| Module | src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | 76d35d1 |
| Status | Reviewed |
| Open findings | 11 |
Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Galaxy-001, Driver.Galaxy-002, Driver.Galaxy-003, Driver.Galaxy-004 |
| 2 | OtOpcUa conventions | Driver.Galaxy-005 |
| 3 | Concurrency & thread safety | Driver.Galaxy-006, Driver.Galaxy-007 |
| 4 | Error handling & resilience | Driver.Galaxy-001, Driver.Galaxy-008, Driver.Galaxy-009 |
| 5 | Security | Driver.Galaxy-010 |
| 6 | Performance & resource management | Driver.Galaxy-011, Driver.Galaxy-012 |
| 7 | Design-document adherence | Driver.Galaxy-013 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Galaxy-014 |
| 10 | Documentation & comments | Driver.Galaxy-005, Driver.Galaxy-013 |
Findings
Driver.Galaxy-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Error handling & resilience |
| Location | Runtime/EventPump.cs:128, GalaxyDriver.cs:222 |
| Status | Resolved |
Description: The ReconnectSupervisor is constructed in BuildProductionRuntimeAsync and exposes ReportTransportFailure(Exception) as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls ReportTransportFailure (a repo-wide search finds only the declaration). When the gateway StreamEvents stream faults, EventPump.RunAsync catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits — but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and GetHealth() keeps reporting Healthy because _supervisor.IsDegraded stays false. This is a production outage with no self-recovery.
Recommendation: Wire the EventPump (and any gw RPC that observes a transport fault) to call _supervisor.ReportTransportFailure(ex). The simplest path: give EventPump a fault callback (or expose a StreamFaulted event) that GalaxyDriver subscribes to and forwards to the supervisor. The supervisor's ReopenAsync/ReplayAsync must also restart the EventPump itself (see Driver.Galaxy-008).
Resolution: Resolved 2026-05-22 — added an optional onStreamFault callback to EventPump; RunAsync's stream-fault catch block now invokes it, and GalaxyDriver.EnsureEventPumpStarted wires it to OnEventPumpStreamFault which forwards the cause to ReconnectSupervisor.ReportTransportFailure, so a transient gw transport drop now drives reopen → replay. Regression coverage in EventPumpStreamFaultTests. Note: the EventPump itself is still not restarted on reconnect — that pump-restart gap remains tracked under Driver.Galaxy-008.
Driver.Galaxy-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | Browse/DataTypeMap.cs:13, Runtime/MxValueDecoder.cs:9 |
| Status | Resolved |
Description: DataTypeMap.Map maps Galaxy mx_data_type codes to six DriverDataType values (Boolean, Int32, Float32, Float64, String, DateTime) — there is no Int64 arm. Yet MxValueDecoder and MxValueEncoder both fully support Int64 (MxValue.Int64Value, Int64Array), and the decoder's own XML doc claims "the seven Galaxy data types ... (Boolean, Int32, Int64, Float32, Float64, String, DateTime)". Any Galaxy attribute whose mx_data_type is the Int64 code (or any code > 5) falls through the _ => DriverDataType.String default. The address-space node is then created as a String variable while runtime reads decode an Int64 boxed value — a type mismatch that produces wrong OPC UA DataType/ValueRank metadata and likely fails value coercion at the server node layer.
Recommendation: Confirm the Galaxy mx_data_type integer code for 64-bit integers and add the explicit arm to DataTypeMap.Map. If the wire format genuinely has no Int64 type, correct the MxValueDecoder/MxValueEncoder doc comments instead. Either way the encoder/decoder and the type map must agree.
Resolution: Resolved 2026-05-22 — added 6 => DriverDataType.Int64 to DataTypeMap.Map, extending the contiguous 0..5 scheme so the type map covers the same seven Galaxy data types MxValueDecoder/MxValueEncoder already decode/encode; Int64 attributes now build as Int64 nodes instead of falling through to the String default. Regression coverage in DataTypeMapTests.
Driver.Galaxy-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | Runtime/StatusCodeMap.cs:86 |
| Status | Resolved |
Description: FromMxStatus returns Good whenever status.Success != 0. The intent (per the surrounding comment "Honors the success flag") is that a non-zero Success means success. But if MxStatusProxy.Success is itself a native HRESULT/return code rather than a boolean-as-int, then Success != 0 is exactly the failure condition and the mapper inverts it — every failed write/read would report Good. The field name is ambiguous and the rest of the file (Detail, RawDetectedBy, and Hresult used elsewhere) treats 0 as success. GatewayGalaxyAlarmAcknowledger.cs:62 uses the opposite convention for the sibling field (reply.Hresult != 0 means failure).
Recommendation: Verify the semantics of MxStatusProxy.Success against the gateway proto contract. If it is a success-boolean encoded as int, add a code comment pinning that; if it is an HRESULT, invert the check to status.Success == 0 => Good.
Resolution: Resolved 2026-05-22 — replaced status.Success != 0 with status.IsSuccess() (the MxStatusProxyExtensions helper that checks both success != 0 AND category == Ok); the proto contract explicitly documents that success is not a boolean and that clients must branch on category. Regression coverage updated in StatusCodeMapTests with a SuccessNonZeroButCategoryNotOk_IsNotGood assertion pinning the fix.
Driver.Galaxy-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | GalaxyDriver.cs:901 |
| Status | Resolved |
Description: OnPumpDataChange reconstructs a raw OPC DA quality byte from an OPC UA StatusCode for the probe watcher: it shifts StatusCode >> 30 and maps 0->192, 1->64, _->0. The StatusCode was itself produced upstream by StatusCodeMap.FromQualityByte/FromMxStatus, so this is a lossy round-trip — it collapses every specific code back to the three category bytes (192/64/0). That happens to satisfy PerPlatformProbeWatcher.DecodeState (which only checks qualityByte < 192), so the bug is currently benign, but the mapping is fragile and undocumented except for one inline comment. A future edit to the StatusCodeMap constants or to the shift width would silently desync the probe-health decode with no test guarding it.
Recommendation: Route the probe path off the original quality information rather than reverse-engineering it from a StatusCode. Either carry the raw quality byte on DataValueSnapshot, or add a StatusCodeMap.ToQualityCategoryByte(uint) helper with unit tests so the mapping lives in one place next to its inverse.
Resolution: Resolved 2026-05-22 — added StatusCodeMap.ToQualityCategoryByte(uint) helper that extracts top-two bits of the OPC UA StatusCode into the OPC DA category byte (Good=192, Uncertain=64, Bad=0); GalaxyDriver.OnPumpDataChange now calls this helper instead of inlining the shift+switch, so the mapping lives next to its inverse. Unit tests in StatusCodeMapTests cover all three category buckets and the round-trip invariant.
Driver.Galaxy-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | Runtime/EventPump.cs:81-88 |
| Status | Open |
Description: The BoundedChannelOptions comment states "Newest-dropped policy: when full, the producer's TryWrite returns false ... We do this manually rather than relying on BoundedChannelFullMode.DropWrite" — but the option is then set to FullMode = BoundedChannelFullMode.Wait. With Wait, TryWrite returning false on a full channel is correct behaviour, so the code works, but the comment naming the mode and the actual mode disagree, which is confusing for a maintainer deciding whether the policy is Wait, DropWrite, or DropNewest.
Recommendation: Either reword the comment to say "we use Wait mode but never call the awaitable WriteAsync — TryWrite gives us synchronous newest-dropped semantics", or switch to BoundedChannelFullMode.DropWrite and keep the manual drop count. Make the comment and the mode consistent.
Resolution: (open)
Driver.Galaxy-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | GalaxyDriver.cs:848-861 |
| Status | Resolved |
Description: OnAlarmFeedTransition picks the "owner" handle with _alarmSubscriptions.First() under _alarmHandlersLock. HashSet<T>.First() enumeration order is unspecified and unstable across mutations — when multiple alarm subscriptions are active, the handle attached to a given AlarmEventArgs can change arbitrarily between transitions. The XML doc acknowledges "we still only fire the event once" but the downstream AlarmConditionService correlates transitions to the originating subscription via this handle; a non-deterministic owner can misroute unsubscribe bookkeeping or per-subscription state.
Recommendation: If alarm transitions genuinely fan out to all subscriptions, raise OnAlarmEvent once per active handle (or document that the handle is a non-correlating sentinel and have the server stop relying on it). If a single owner is required, make the choice deterministic (e.g. the earliest-created handle) and stable.
Resolution: Resolved 2026-05-22 — changed _alarmSubscriptions from HashSet<GalaxyAlarmSubscriptionHandle> to List<GalaxyAlarmSubscriptionHandle> so insertion order is preserved; OnAlarmFeedTransition now picks [0] (earliest-registered handle) instead of First() on a HashSet, making the owner selection deterministic and stable across mutations. Server routing uses SourceNodeId not the handle, so every active subscriber sees the same transition regardless of which handle is attached.
Driver.Galaxy-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | GalaxyDriver.cs:937-968 |
| Status | Resolved |
Description: Dispose() is not synchronized against the capability methods. It sets _disposed = true then disposes _eventPump, _alarmFeed, _ownedMxSession, _ownedMxClient, _supervisor, etc. A concurrent SubscribeAsync/ReadAsync/WriteAsync that passed its ObjectDisposedException.ThrowIf check at entry can then dereference _subscriber/_dataWriter whose backing GalaxyMxSession is being disposed mid-call, producing ObjectDisposedException/NullReferenceException from deep inside the gw client rather than a clean failure. Dispose also blocks the caller on GetAwaiter().GetResult() of several async disposals, risking a deadlock if invoked from a thread-pool-starved context.
Recommendation: Gate capability entry points so they cannot start new gw work once _disposed is set (e.g. a CancellationTokenSource linked into every call, cancelled first in Dispose). Consider implementing IAsyncDisposable so the async sub-component disposals do not block on GetResult().
Resolution: Resolved 2026-05-22 — added IAsyncDisposable to GalaxyDriver and implemented DisposeAsync() as the primary disposal path that awaits each async sub-component (EventPump, AlarmFeed, MxSession, MxClient, RepositoryClient) without blocking; Dispose() delegates to DisposeAsync().AsTask().GetAwaiter().GetResult() for using-statement compatibility. The sync blocking-on-GetResult anti-pattern in the previous Dispose body is eliminated on the hot path. Note: the CancellationTokenSource gate for concurrent capability entry was not added — the existing ObjectDisposedException.ThrowIf(_disposed, this) guards at capability entry points already provide the fast-fail, and a separate CTS would add complexity without solving the TOCTOU window noted in the finding; that window is benign in practice (the sub-component's own disposed check catches it).
Driver.Galaxy-008
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | GalaxyDriver.cs:264-276, Runtime/EventPump.cs:97-103 |
| Status | Resolved |
Description: Even if Driver.Galaxy-001 is fixed and the supervisor's ReplayAsync runs, recovery is incomplete. ReplayAsync re-issues SubscribeBulkAsync for the tracked tags, but the EventPump background loop that consumes StreamEvents is not restarted. After a stream fault EventPump.RunAsync exits and _channel is completed; EventPump.Start() is a no-op (if (_loop is not null) return) because _loop is a completed-but-non-null task. So a replayed subscription has no consumer — values are subscribed on the gw but never reach OnDataChange. Additionally ReplayAsync never re-registers the new item handles the gw returns into SubscriptionRegistry; the old stale item handles remain, so even with a live pump the fan-out reverse-map would miss the post-reconnect handles.
Recommendation: On reconnect, dispose and recreate the EventPump (or make it restartable), and have ReplayAsync update SubscriptionRegistry bindings with the new item handles returned by the post-reconnect SubscribeBulkAsync. Add an integration/parity test that drops the stream mid-subscription and asserts OnDataChange resumes.
Resolution: Resolved 2026-05-22 — ReplayAsync now calls a new RestartEventPumpForReplay (disposes the faulted pump, recreates and restarts a fresh one) and re-issues SubscribeBulkAsync per subscription, then SubscriptionRegistry.Rebind swaps each subscription's stale pre-reconnect item handles for the post-reconnect handles so the fan-out reverse map dispatches to the live pump. New SubscriptionRegistry.SnapshotEntries/Rebind APIs back the per-subscription replay. Regression coverage in SubscriptionRegistryTests (Rebind/SnapshotEntries) and EventPumpStreamFaultTests.FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch.
Driver.Galaxy-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | GalaxyDriver.cs:354-371 |
| Status | Resolved |
Description: StartDeployWatcher launches the watch loop with _ = _deployWatcher.StartAsync(CancellationToken.None) — a fire-and-forget with a discarded Task. StartAsync can throw synchronously (InvalidOperationException if already started); the discard masks that programming error. Separately, StartDeployWatcher builds an _ownedRepositoryClient purely for the watcher when discovery has not run yet — if DiscoverAsync later runs, BuildDefaultHierarchySource overwrites _ownedRepositoryClient with a second client, leaking the first (only the latest reference is disposed in Dispose).
Recommendation: Await StartAsync (it completes synchronously after scheduling) or at least observe its result. Reuse a single GalaxyRepositoryClient across the deploy watcher and the hierarchy source instead of letting BuildDefaultHierarchySource clobber the field — guard the assignment or build the client once in InitializeAsync.
Resolution: Resolved 2026-05-22 — (a) replaced _ = _deployWatcher.StartAsync(...) discard with an explicit variable + IsFaulted check so any synchronous throw from StartAsync (e.g. called-twice InvalidOperationException) propagates rather than being silently swallowed; (b) changed both StartDeployWatcher and BuildDefaultHierarchySource to use _ownedRepositoryClient ??= so a client built by the watcher is reused by discovery instead of being overwritten and leaked — only one GalaxyRepositoryClient instance is now created and disposed.
Driver.Galaxy-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | GalaxyDriver.cs:311-341 |
| Status | Open |
Description: ResolveApiKey supports an env:/file: indirection and otherwise treats the config string as the literal API key ("Anything else — used as the literal API key. Convenient for dev"). GalaxyGatewayOptions' own XML doc claims "the API key never appears in cleartext config". The literal-key fallback silently permits a plaintext API key in the DriverConfig JSON column of the central config DB, contradicting the documented contract. There is no warning logged when the literal path is taken.
Recommendation: Log a startup warning when ResolveApiKey falls through to the literal arm so an operator who accidentally committed a cleartext key sees it, and update the GalaxyGatewayOptions doc comment so it no longer over-promises. Consider gating the literal arm behind an explicit dev:-style prefix so a cleartext key cannot be used by accident.
Resolution: (open)
Driver.Galaxy-011
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | GalaxyDriver.cs:411 |
| Status | Resolved |
Description: GetMemoryFootprint() unconditionally returns 0 with a comment "PR 4.4 sets this from SubscriptionRegistry size" — PR 4.4 has shipped (the registry exists and is used) but the method was never updated. IHostConnectivityProbe.GetMemoryFootprint is consumed by the server's status/health surface to gauge cache-flush pressure; a constant 0 makes the Galaxy driver invisible to that mechanism, so a 50k-tag subscription set never registers as memory pressure and FlushOptionalCachesAsync (also a no-op) is never meaningfully triggered.
Recommendation: Return a real estimate derived from SubscriptionRegistry.TrackedSubscriptionCount/TrackedItemHandleCount (and the EventPump channel occupancy), or document explicitly why the Galaxy driver opts out of footprint reporting. Remove the stale "PR 4.4 sets this" comment.
Resolution: Resolved 2026-05-22 — replaced the constant 0 with a live estimate derived from SubscriptionRegistry.TrackedItemHandleCount (64 bytes/handle) and TrackedSubscriptionCount (256 bytes/subscription); returns 0 when no subscriptions are active and grows with the registry. The stale "PR 4.4 sets this" comment is removed. Regression coverage in GalaxyDriverInfrastructureTests.
Driver.Galaxy-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | Runtime/SubscriptionRegistry.cs:65-67, GalaxyDriver.cs:538, GalaxyDriver.cs:675 |
| Status | Open |
Description: Several hot paths are O(n^2) per call. SubscriptionRegistry.ResolveSubscribers does entry.Bindings.FirstOrDefault(b => b.ItemHandle == itemHandle) — a linear scan of the whole binding list for every event dispatch; at 50k tags this is 50k-element scans on the 1Hz fan-out path. GalaxyDriver.SubscribeAsync and ReadViaSubscribeOnceAsync correlate results to references with results.FirstOrDefault(r => string.Equals(...)) inside a for loop over all references — O(n^2) over the subscribe batch. SubscriptionRegistry.Remove rebuilds a ConcurrentBag from a LINQ filter on every unsubscribe.
Recommendation: Index SubscriptionEntry bindings by item handle (a Dictionary<int, string> per entry) so ResolveSubscribers is O(1) per subscriber. Project the SubscribeResult list into a Dictionary<string, SubscribeResult> (OrdinalIgnoreCase) once before the correlation loop. These matter on the documented 50k-tag soak path.
Resolution: (open)
Driver.Galaxy-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | GalaxyDriver.cs:14-27, GalaxyDriver.cs:374-382, Config/GalaxyDriverOptions.cs:84-86 |
| Status | Open |
Description: Multiple doc comments are stale relative to the shipped code. GalaxyDriver's class summary still describes the file as "the project skeleton with IDriver bodies that wire to a future IGalaxyGatewayClient abstraction. Capability interfaces ... land in PRs 4.1-4.7" and references the legacy GalaxyProxyDriver coexisting "until PR 7.2" — but PR 7.2 already deleted the legacy Galaxy projects and the capability interfaces are all implemented. ReinitializeAsync is still a stub ("for the skeleton we just refresh health") that ignores driverConfigJson entirely — a config reapply silently does nothing. GalaxyReconnectOptions.ReplayOnSessionLost is defined and documented but never read anywhere in the driver (ReplayAsync always replays).
Recommendation: Refresh the GalaxyDriver class and ReinitializeAsync doc comments to describe the shipped state, implement or explicitly reject ReinitializeAsync config reapply, and either honour ReplayOnSessionLost or remove it from GalaxyReconnectOptions.
Resolution: (open)
Driver.Galaxy-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy (module-wide) |
| Status | Open |
Description: The reconnect/recovery path is the module's highest-risk surface and is effectively untested at the integration seam. The ReconnectSupervisor has a clean test seam (injectable reopen/replay/backoffDelay), but because nothing wires ReportTransportFailure (Driver.Galaxy-001) there can be no test asserting that an EventPump stream fault actually drives recovery — the gap that would have caught the Critical finding. Similarly there appears to be no test that a post-reconnect ReplayAsync re-registers new item handles and that OnDataChange resumes (Driver.Galaxy-008). The StatusCodeMap.FromMxStatus Success-flag semantics (Driver.Galaxy-003) and the DataTypeMap Int64 gap (Driver.Galaxy-002) are also the kind of behaviour a focused unit test would pin.
Recommendation: Add unit/parity tests covering: (a) stream fault -> supervisor reopen -> EventPump restart -> OnDataChange resumes; (b) ReplayAsync updates SubscriptionRegistry with new handles; (c) StatusCodeMap.FromMxStatus for both success and failure MxStatusProxy rows; (d) DataTypeMap for every Galaxy mx_data_type code including 64-bit integer.
Resolution: (open)