Driver.OpcUaClient-001 — ReadAsync/WriteAsync/DiscoverAsync captured the session before acquiring _gate, so a reconnect that completed while the operation was blocked on the gate left the wire call bound to a stale, closed session. All three now re-read Session (and parse NodeIds) inside the _gate critical section after WaitAsync returns. Driver.OpcUaClient-002 — OnReconnectComplete ignored the give-up (null session) case, permanently wedging the driver with no Faulted signal and no reconnect loop. The give-up branch now transitions HostState to Faulted, sets a Faulted DriverHealth with an explanatory message, and re-arms a fresh SessionReconnectHandler (TryRearmReconnect) against the last-known session so an always-on gateway self-heals. Driver.OpcUaClient-003 — BrowseRecursiveAsync discarded browse continuation points, silently truncating large remote folders. It now loops on BrowseResult.ContinuationPoint calling BrowseNextAsync and appending each page until the continuation point is empty. Driver.OpcUaClient-004 — driver-specs.md §8 namespace handling was absent. Added NamespaceMap (built from session.NamespaceUris at connect, rebuilt on reconnect) which persists discovered NodeIds in the server-stable nsu=<uri>;... form; reads/writes re-resolve that form against the current session so a remote namespace-table reorder no longer misaddresses nodes. Added the TargetNamespaceKind option + UnsMappingTable and ValidateNamespaceKind startup enforcement. Driver.OpcUaClient-005 — OnKeepAlive read/wrote _reconnectHandler without a lock, racing the SDK keep-alive timer thread and leaking handlers. The check-and-set in OnKeepAlive, the take-and-clear in ShutdownAsync, and the dispose/re-arm in OnReconnectComplete now all run inside the _probeLock critical section. Adds OpcUaClientNamespaceTests (11 xUnit + Shouldly regression tests) covering ValidateNamespaceKind and the NamespaceMap stable encoding. Reconnect/browse wire paths remain fixture-gated per finding -015. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
20 KiB
Code Review — Driver.OpcUaClient
| Field | Value |
|---|---|
| Module | src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | 76d35d1 |
| Status | Reviewed |
| Open findings | 10 |
Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.OpcUaClient-001, -002, -003, -010, -011 |
| 2 | OtOpcUa conventions | Driver.OpcUaClient-004 |
| 3 | Concurrency & thread safety | Driver.OpcUaClient-005, -006, -007 |
| 4 | Error handling & resilience | Driver.OpcUaClient-002, -008, -009 |
| 5 | Security | Driver.OpcUaClient-012 |
| 6 | Performance & resource management | Driver.OpcUaClient-013, -014 |
| 7 | Design-document adherence | Driver.OpcUaClient-004, -013, -015 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.OpcUaClient-015 |
| 10 | Documentation & comments | Driver.OpcUaClient-011 |
Findings
Driver.OpcUaClient-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | OpcUaClientDriver.cs:444, :466, :517, :540, :599, :610 |
| Status | Resolved |
Description: ReadAsync, WriteAsync, and DiscoverAsync capture the session into a local variable via RequireSession() before acquiring _gate, then perform the wire call on that captured reference inside the gate. The reconnect path (OnReconnectComplete, line 1330) swaps Session to a brand-new ISession. A read that captured the pre-reconnect session at line 444, then blocked on _gate.WaitAsync while a reconnect completed, issues ReadAsync against a stale/closed session. The catch block then fans out BadCommunicationError for the whole batch even though the driver is healthy on the new session, and the operation is silently lost. The gate does not protect against the session being swapped underneath a waiter.
Recommendation: Re-read Session inside the _gate critical section (after WaitAsync returns), or route the session swap itself through _gate so a swap cannot interleave with a gated operation.
Resolution: Resolved 2026-05-22 — ReadAsync/WriteAsync/DiscoverAsync now re-read Session (and parse NodeIds) inside the _gate critical section after WaitAsync returns; a session swapped in by a concurrent reconnect is the one used for the wire call.
Driver.OpcUaClient-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | OpcUaClientDriver.cs:1330-1359 |
| Status | Resolved |
Description: OnReconnectComplete handles only the success case. When SessionReconnectHandler gives up (its retry loop exhausts the 2-minute maxReconnectPeriod), it invokes the callback with handler.Session == null. The code sets Session = null, disposes the handler, and sets _reconnectHandler = null, but leaves _health at whatever it was (typically Degraded) and _hostState at Stopped. There is no further reconnect attempt (the handler is gone, and OnKeepAlive only fires on a live session which no longer exists), and DriverState is never set to Faulted. The driver is permanently wedged: no session, no reconnect loop, no Faulted signal for the Core, and ReinitializeAsync is never triggered. This is the single largest gateway resilience gap.
Recommendation: In OnReconnectComplete, when newSession is null, set _health to a Faulted DriverHealth with an explanatory message so the Core can fan out Bad quality and offer an operator reinitialize. Consider re-arming a fresh reconnect attempt rather than giving up entirely for an always-on gateway.
Resolution: Resolved 2026-05-22 — OnReconnectComplete's give-up branch now transitions HostState to Faulted, sets a Faulted DriverHealth with an explanatory message, and re-arms a fresh SessionReconnectHandler (TryRearmReconnect) against the last-known session so an always-on gateway self-heals while the Core can still offer an operator reinitialize.
Driver.OpcUaClient-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | OpcUaClientDriver.cs:644-711 |
| Status | Resolved |
Description: BrowseRecursiveAsync calls session.BrowseAsync with requestedMaxReferencesPerNode: 0 but never follows browse continuation points. OPC UA servers enforce a server-side max-references-per-node limit; when a node has more children than the server returns in one response, BrowseResult.ContinuationPoint is non-empty and the caller must issue BrowseNext to retrieve the remainder. This driver discards the continuation point, so any folder on the remote server with a large child set is silently truncated: discovered tags go missing from the local address space with no error. For the tens-of-thousands-of-nodes scenario the options doc targets (MaxDiscoveredNodes = 10000), this is a realistic and silent data-completeness bug.
Recommendation: After processing resp.Results[0].References, check resp.Results[0].ContinuationPoint; while non-empty, call session.BrowseNextAsync and append the additional references before recursing/registering.
Resolution: Resolved 2026-05-22 — BrowseRecursiveAsync now loops on the BrowseResult.ContinuationPoint, calling session.BrowseNextAsync and appending each page of references until the continuation point is empty, so large remote folders are no longer silently truncated.
Driver.OpcUaClient-004
| Field | Value |
|---|---|
| Severity | High |
| Category | Design-document adherence |
| Location | OpcUaClientDriver.cs:596-632, :789, OpcUaClientDriverOptions.cs |
| Status | Resolved |
Description: docs/v2/driver-specs.md section 8 mandates two features that are absent. (1) Namespace remapping: the spec requires building a bidirectional namespace map at connect time from session.NamespaceUris. The driver instead stores the raw upstream NodeId string (pv.NodeId.ToString()) as DriverAttributeInfo.FullName and re-parses it verbatim for reads/writes. The namespace index embedded in ns=N;... is server-session-relative; if the upstream server reorders its namespace table across a restart (permitted by the spec), every stored ns=N reference points at the wrong namespace and reads/writes silently address wrong nodes. (2) TargetNamespaceKind enforcement: section 8 requires the driver to enforce Equipment-vs-SystemPlatform choice at startup and fail draft validation on misconfiguration; OpcUaClientDriverOptions has no such knob.
Recommendation: Build a namespace-URI map from session.NamespaceUris at connect time and store NodeIds in a server-stable form (namespace URI plus identifier) rather than session-relative ns=N. Add the TargetNamespaceKind option and the startup validation section 8 describes, or document explicitly why the design deviates.
Resolution: Resolved 2026-05-22 — new NamespaceMap (built from session.NamespaceUris at connect and rebuilt on reconnect) persists discovered NodeIds in the server-stable nsu=<uri>;… form; reads/writes re-resolve that form against the current session so a remote namespace-table reorder no longer misaddresses nodes. Added the TargetNamespaceKind option + UnsMappingTable and ValidateNamespaceKind, which fails draft validation for an Equipment instance lacking a UNS mapping or a SystemPlatform instance carrying one.
Driver.OpcUaClient-005
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | OpcUaClientDriver.cs:1297-1319 |
| Status | Resolved |
Description: OnKeepAlive reads and writes _reconnectHandler without any lock: if (_reconnectHandler is not null) return; followed by _reconnectHandler = new SessionReconnectHandler(...). Keep-alive callbacks are raised from the SDK keep-alive timer thread; on a bad keep-alive the SDK can fire the handler repeatedly while the channel stays down. Two callbacks racing through the check-then-set both observe null, both construct a SessionReconnectHandler, both call BeginReconnect, and the second assignment overwrites the first handler, leaking the first handler (its retry loop keeps running, unreferenced and never disposed) and creating two competing reconnect loops. ShutdownAsync then only cancels/disposes the one that won the assignment race.
Recommendation: Guard the _reconnectHandler check-and-set with _probeLock (already held for _hostState), or use Interlocked.CompareExchange to ensure exactly one handler is constructed per drop.
Resolution: Resolved 2026-05-22 — the _reconnectHandler check-and-set in OnKeepAlive (and the take-and-clear in ShutdownAsync, plus the dispose/re-arm in OnReconnectComplete/TryRearmReconnect) now run inside the _probeLock critical section, so exactly one SessionReconnectHandler is constructed per drop and a racing keep-alive callback cannot leak a handler.
Driver.OpcUaClient-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | OpcUaClientDriver.cs:1330-1359 |
| Status | Open |
Description: OnReconnectComplete mutates Session (line 1347) directly from the reconnect-handler callback thread with no synchronization against ReadAsync/WriteAsync/DiscoverAsync/ShutdownAsync. Session is a plain auto-property with no memory barrier; a concurrent reader on another thread may observe a stale reference. ShutdownAsync (line 425) can also run concurrently with OnReconnectComplete: ShutdownAsync disposes the session and sets Session = null while OnReconnectComplete sets Session = newSession, and the interleaving is unspecified, potentially leaving a live session leaked after shutdown.
Recommendation: Route all Session mutations through a single lock (or the _gate). Make ShutdownAsync cancel the reconnect handler and wait for any in-flight OnReconnectComplete to settle before disposing the session.
Resolution: (open)
Driver.OpcUaClient-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | OpcUaClientDriver.cs:1374, :1376-1383, :508 |
| Status | Open |
Description: Two disposal races. (1) Dispose() does DisposeAsync().AsTask().GetAwaiter().GetResult(), synchronous blocking on async work. The Galaxy stability review (driver-stability.md, the 2026-04-13 findings) explicitly calls out sync-over-async on the OPC UA stack thread as a closed bug class; if Dispose() runs on the OPC UA stack thread or any thread the SDK continuations need, this deadlocks. (2) DisposeAsync disposes _gate (line 1382) after ShutdownAsync returns, but ShutdownAsync does not drain in-flight ReadAsync/WriteAsync operations holding _gate. An in-flight read that calls _gate.Release() (line 508) after _gate.Dispose() throws ObjectDisposedException on a background thread.
Recommendation: Provide an async disposal path callers prefer; if a sync Dispose() is unavoidable keep it free of .GetResult() on SDK-thread-affine work. Before disposing _gate, acquire it once so all in-flight gated operations have completed, or guard releases against disposal.
Resolution: (open)
Driver.OpcUaClient-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | OpcUaClientDriver.cs:1092-1099 |
| Status | Open |
Description: AcknowledgeAsync issues the batched CallAsync and then catches all exceptions with a best-effort empty catch; it also never inspects the per-call results in the success path (_ = await session.CallAsync(...)). An alarm acknowledgment the upstream server rejects (BadConditionAlreadyAcked, BadNodeIdUnknown, BadUserAccessDenied) is reported as success to the caller. IAlarmSource.AcknowledgeAsync has no per-item result, so the only way a failure could surface is via an exception, and the catch suppresses even that. Operators acking a critical alarm get no signal that the ack did not take.
Recommendation: Inspect CallMethodResult.StatusCode for each result and log Bad codes; rethrow (or surface via driver health) genuine transport failures rather than swallowing them. Consider extending the contract so per-ack failures propagate.
Resolution: (open)
Driver.OpcUaClient-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | OpcUaClientDriver.cs:560-564 |
| Status | Open |
Description: WriteAsync's catch block fans out BadCommunicationError across the whole batch on any exception. Writes are non-idempotent by default (IWritable remarks, decision #44/#45): a timeout exception may fire after the upstream server already applied the write. Reporting BadCommunicationError (a code that reads as "definitely did not happen") for a write that may have succeeded is misleading; the OPC UA client downstream may safely re-issue and double-apply. The read path has the same fan-out but reads are idempotent so it is benign there; for writes the ambiguity matters.
Recommendation: Map write timeouts/cancellations to BadTimeout (which downstream correctly treats as "outcome unknown, do not blindly retry") rather than BadCommunicationError, and only use BadCommunicationError for failures that provably occurred before the request reached the wire.
Resolution: (open)
Driver.OpcUaClient-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | OpcUaClientDriver.cs:823-824 |
| Status | Open |
Description: MapUpstreamDataType maps DataTypeIds.Byte (the OPC UA unsigned 8-bit type) to DriverDataType.Int16. Byte should map to an unsigned driver type (UInt16 is the smallest unsigned available, matching how SByte belongs with the signed family). Mapping an unsigned 0-255 type onto signed Int16 misrepresents the type metadata downstream: clients see a signed type for an unsigned source, and any range/validation logic keyed off the driver data type is wrong. SByte correctly belongs with Int16; Byte does not.
Recommendation: Map DataTypeIds.Byte to DriverDataType.UInt16 (or add a Byte/UInt8 driver type if the enum supports finer granularity), keeping SByte and Int16 on the signed Int16 mapping.
Resolution: (open)
Driver.OpcUaClient-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | OpcUaClientDriver.cs:783-784 |
| Status | Open |
Description: The comment on the isArray computation states "-1 = scalar; 1+ = array dimensions; 0 = one-dimensional array". This is inaccurate against OPC UA ValueRank semantics: -3 is ScalarOrOneDimension, -2 is Any, -1 is Scalar, and 0 is OneOrMoreDimensions (not specifically one-dimensional). The code valueRank >= 0 treats -2 (Any) and -3 (ScalarOrOneDimension) as scalar, which is a defensible default, but the comment misdescribes the constants and would mislead a maintainer.
Recommendation: Correct the comment to the actual ValueRank constants (-3 ScalarOrOneDimension, -2 Any, -1 Scalar, 0 OneOrMoreDimensions, 1 OneDimension, >1 multi-dim) and state the deliberate choice that anything >= 0 is treated as an array.
Resolution: (open)
Driver.OpcUaClient-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | OpcUaClientDriver.cs:210-217 |
| Status | Open |
Description: When AutoAcceptCertificates is true the driver registers a CertificateValidation handler that accepts only StatusCodes.BadCertificateUntrusted. A self-signed or otherwise untrusted server certificate frequently fails validation with a different code first (BadCertificateChainIncomplete, BadCertificateTimeInvalid, BadCertificateHostNameInvalid), so auto-accept silently does not accept many real dev certificates and the connect fails confusingly. The handler is added to config.CertificateValidator but never removed; each driver instance leaks a delegate subscription on a validator that may be process-shared. The option doc says auto-accept is dev-only and must be false in production, but there is no runtime guard preventing AutoAcceptCertificates=true shipping to production and no log warning when it is enabled.
Recommendation: When auto-accepting for dev, accept the full set of certificate-validation error codes (or use the SDK AutoAcceptUntrustedCertificates path consistently). Emit a prominent warning log every time AutoAcceptCertificates is enabled so a production misconfiguration is visible. Detach the handler on shutdown.
Resolution: (open)
Driver.OpcUaClient-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | OpcUaClientDriver.cs:436-437 |
| Status | Open |
Description: GetMemoryFootprint() is hard-coded to return 0 and FlushOptionalCachesAsync is a no-op Task.CompletedTask. docs/v2/driver-stability.md section "In-process only (Tier A/B)" makes per-instance allocation tracking a contract requirement, and driver-specs.md section 8 explicitly calls out browse-cache memory: BrowseStrategy=Full against a large remote server can cache tens of thousands of node descriptions and the per-instance budget should bound this. Returning 0 means the Core 30-second footprint poll can never detect this driver's browse-cache growth, and the cache-budget-breach to flush escalation path is dead code. A gateway pointed at a 10k-node server (the configured cap) silently evades the Tier-A memory-guard mechanism.
Recommendation: Track an approximate footprint for the discovered-node set and any cached browse state, return it from GetMemoryFootprint(), and implement FlushOptionalCachesAsync to drop droppable cache. If the driver genuinely holds no significant cache, document why 0 is correct.
Resolution: (open)
Driver.OpcUaClient-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | OpcUaClientDriver.cs:904, :1035 |
| Status | Open |
Description: MonitoredItem.Notification += (mi, args) => ... (and the alarm-event equivalent) attaches a closure-capturing lambda to each monitored item's event. The lambda is never detached. When UnsubscribeAsync removes a subscription it calls Subscription.DeleteAsync but does not clear the MonitoredItem.Notification handlers; if the SDK retains the MonitoredItem/Subscription graph anywhere (the session keeps a reference until its own disposal, or during transfer-on-reconnect), the driver instance is kept alive by the closure longer than necessary.
Recommendation: Detach the Notification handlers when deleting a subscription, or hold the handler delegate so it can be explicitly removed in UnsubscribeAsync/ShutdownAsync.
Resolution: (open)
Driver.OpcUaClient-015
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/*, tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/OpcUaClientSmokeTests.cs |
| Status | Open |
Description: Unit-test coverage is solid for the pure mappers (MapSeverity, MapUpstreamDataType, MapSecurityPolicy, MapAggregateToNodeId, BuildCertificateIdentity, ResolveEndpointCandidates) and for "throws before init" guards, but the highest-risk behaviours of a gateway driver have no test: the reconnect/session-swap path (OnKeepAlive to OnReconnectComplete, findings -001/-002/-005/-006), browse continuation-point handling (-003), the cascading-quality fan-out on a mid-batch transport failure, and namespace remapping (-004). The reconnect test file itself states wire-level disconnect-reconnect-resume coverage lands with the in-process fixture, i.e. the single largest gateway bug surface (per driver-specs.md section 8) is explicitly untested. The integration suite is Docker-fixture gated against opc-plc and is a smoke test only. The failed-reconnect-to-Faulted and concurrent-keep-alive races are pure-logic paths testable with a fake ISession.
Recommendation: Add tests exercising the reconnect callbacks with a stub session (success and give-up cases), a browse test with a paged/continuation-point server stub, and a read-batch test asserting upstream Bad StatusCodes pass through verbatim while a transport throw fans out the local fault code.
Resolution: (open)