# Code Review — Driver.OpcUaClient

| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 0 |

## Checklist coverage

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.OpcUaClient-001, -002, -003, -010, -011 |
| 2 | OtOpcUa conventions | Driver.OpcUaClient-004 |
| 3 | Concurrency & thread safety | Driver.OpcUaClient-005, -006, -007 |
| 4 | Error handling & resilience | Driver.OpcUaClient-002, -008, -009 |
| 5 | Security | Driver.OpcUaClient-012 |
| 6 | Performance & resource management | Driver.OpcUaClient-013, -014 |
| 7 | Design-document adherence | Driver.OpcUaClient-004, -013, -015 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.OpcUaClient-015 |
| 10 | Documentation & comments | Driver.OpcUaClient-011 |

## Findings

### Driver.OpcUaClient-001

| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientDriver.cs:444`, `:466`, `:517`, `:540`, `:599`, `:610` |
| Status | Resolved |

**Description:** ReadAsync, WriteAsync, and DiscoverAsync capture the session into a local variable via RequireSession() before acquiring `_gate`, then perform the wire call on that captured reference inside the gate. The reconnect path (OnReconnectComplete, line 1330) swaps `Session` to a brand-new ISession. A read that captured the pre-reconnect session at line 444, then blocked on `_gate.WaitAsync` while a reconnect completed, issues ReadAsync against a stale/closed session. The catch block then fans out BadCommunicationError for the whole batch even though the driver is healthy on the new session, and the operation is silently lost. The gate does not protect against the session being swapped underneath a waiter.

**Recommendation:** Re-read `Session` inside the `_gate` critical section (after WaitAsync returns), or route the session swap itself through `_gate` so a swap cannot interleave with a gated operation.

**Resolution:** Resolved 2026-05-22 — ReadAsync/WriteAsync/DiscoverAsync now re-read `Session` (and parse NodeIds) inside the `_gate` critical section after `WaitAsync` returns; a session swapped in by a concurrent reconnect is the one used for the wire call.
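
**Illustration:** The re-read-inside-the-gate pattern can be sketched as follows. This is a minimal sketch under stated assumptions, not the driver's code: `IFakeSession`, `NamedSession`, `GatedReader`, and `SwapSession` are hypothetical stand-ins, and the real `ISession` surface is much larger.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the SDK session; only what the sketch needs.
public interface IFakeSession
{
    Task<string> ReadAsync(string nodeId, CancellationToken ct);
}

// Trivial fake that tags each result with the session's name.
public sealed class NamedSession : IFakeSession
{
    private readonly string _name;
    public NamedSession(string name) => _name = name;

    public Task<string> ReadAsync(string nodeId, CancellationToken ct) =>
        Task.FromResult($"{_name}:{nodeId}");
}

public sealed class GatedReader
{
    private readonly SemaphoreSlim _gate = new(1, 1);
    private volatile IFakeSession? _session; // swapped by the (hypothetical) reconnect path

    public void SwapSession(IFakeSession? next) => _session = next;

    public async Task<string> ReadAsync(string nodeId, CancellationToken ct = default)
    {
        await _gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            // Read the field only after the gate is held, so a session swapped in
            // while this caller was waiting is the one used for the wire call.
            var session = _session ?? throw new InvalidOperationException("No live session.");
            return await session.ReadAsync(nodeId, ct).ConfigureAwait(false);
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

Capturing `_session` before `WaitAsync` instead would reintroduce the stale-reference window described above.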

### Driver.OpcUaClient-002

| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `OpcUaClientDriver.cs:1330-1359` |
| Status | Resolved |

**Description:** OnReconnectComplete handles only the success case. When SessionReconnectHandler gives up (its retry loop exhausts the 2-minute maxReconnectPeriod), it invokes the callback with `handler.Session == null`. The code sets `Session = null`, disposes the handler, and sets `_reconnectHandler = null`, but leaves `_health` at whatever it was (typically Degraded) and `_hostState` at Stopped. There is no further reconnect attempt (the handler is gone, and OnKeepAlive only fires on a live session which no longer exists), and DriverState is never set to Faulted. The driver is permanently wedged: no session, no reconnect loop, no Faulted signal for the Core, and ReinitializeAsync is never triggered. This is the single largest gateway resilience gap.

**Recommendation:** In OnReconnectComplete, when newSession is null, set `_health` to a Faulted DriverHealth with an explanatory message so the Core can fan out Bad quality and offer an operator reinitialize. Consider re-arming a fresh reconnect attempt rather than giving up entirely for an always-on gateway.

**Resolution:** Resolved 2026-05-22 — OnReconnectComplete's give-up branch now transitions HostState to Faulted, sets a Faulted DriverHealth with an explanatory message, and re-arms a fresh SessionReconnectHandler (`TryRearmReconnect`) against the last-known session so an always-on gateway self-heals while the Core can still offer an operator reinitialize.

### Driver.OpcUaClient-003

| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientDriver.cs:644-711` |
| Status | Resolved |

**Description:** BrowseRecursiveAsync calls session.BrowseAsync with `requestedMaxReferencesPerNode: 0` but never follows browse continuation points. OPC UA servers enforce a server-side max-references-per-node limit; when a node has more children than the server returns in one response, BrowseResult.ContinuationPoint is non-empty and the caller must issue BrowseNext to retrieve the remainder. This driver discards the continuation point, so any folder on the remote server with a large child set is silently truncated: discovered tags go missing from the local address space with no error. For the tens-of-thousands-of-nodes scenario the options doc targets (MaxDiscoveredNodes = 10000), this is a realistic and silent data-completeness bug.

**Recommendation:** After processing resp.Results[0].References, check resp.Results[0].ContinuationPoint; while non-empty, call session.BrowseNextAsync and append the additional references before recursing/registering.

**Resolution:** Resolved 2026-05-22 — BrowseRecursiveAsync now loops on the BrowseResult.ContinuationPoint, calling `session.BrowseNextAsync` and appending each page of references until the continuation point is empty, so large remote folders are no longer silently truncated.
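
**Illustration:** A continuation-point loop of the kind described might look like the sketch below. All types here (`BrowsePage`, `IPagedBrowser`, `InMemoryPagedBrowser`, `BrowseHelper`) are hypothetical and deliberately avoid the SDK's real `BrowseAsync`/`BrowseNextAsync` signatures; the fake server encodes the next offset in the continuation point purely for demonstration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// One page of references plus an optional continuation point (null or empty = done).
public sealed record BrowsePage(IReadOnlyList<string> References, byte[]? ContinuationPoint);

public interface IPagedBrowser
{
    Task<BrowsePage> BrowseAsync(string nodeId, CancellationToken ct);
    Task<BrowsePage> BrowseNextAsync(byte[] continuationPoint, CancellationToken ct);
}

// Fake server that returns at most pageSize children per response.
public sealed class InMemoryPagedBrowser : IPagedBrowser
{
    private readonly IReadOnlyList<string> _children;
    private readonly int _pageSize;

    public InMemoryPagedBrowser(IReadOnlyList<string> children, int pageSize)
    {
        _children = children;
        _pageSize = pageSize;
    }

    public Task<BrowsePage> BrowseAsync(string nodeId, CancellationToken ct) =>
        Task.FromResult(Page(0));

    public Task<BrowsePage> BrowseNextAsync(byte[] continuationPoint, CancellationToken ct) =>
        Task.FromResult(Page(BitConverter.ToInt32(continuationPoint, 0)));

    private BrowsePage Page(int offset)
    {
        var items = _children.Skip(offset).Take(_pageSize).ToList();
        var next = offset + items.Count;
        return new BrowsePage(items, next < _children.Count ? BitConverter.GetBytes(next) : null);
    }
}

public static class BrowseHelper
{
    public static async Task<List<string>> BrowseAllAsync(
        IPagedBrowser browser, string nodeId, CancellationToken ct = default)
    {
        var all = new List<string>();
        var page = await browser.BrowseAsync(nodeId, ct).ConfigureAwait(false);
        all.AddRange(page.References);

        // Keep paging until the server stops handing back a continuation point.
        while (page.ContinuationPoint is { Length: > 0 } cp)
        {
            page = await browser.BrowseNextAsync(cp, ct).ConfigureAwait(false);
            all.AddRange(page.References);
        }
        return all;
    }
}
```

A stub like `InMemoryPagedBrowser` is also one way to drive the paged-browse test that finding -015 recommends.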

### Driver.OpcUaClient-004

| Field | Value |
|---|---|
| Severity | High |
| Category | Design-document adherence |
| Location | `OpcUaClientDriver.cs:596-632`, `:789`, `OpcUaClientDriverOptions.cs` |
| Status | Resolved |

**Description:** docs/v2/driver-specs.md section 8 mandates two features that are absent. (1) Namespace remapping: the spec requires building a bidirectional namespace map at connect time from session.NamespaceUris. The driver instead stores the raw upstream NodeId string (pv.NodeId.ToString()) as DriverAttributeInfo.FullName and re-parses it verbatim for reads/writes. The namespace index embedded in `ns=N;...` is server-session-relative; if the upstream server reorders its namespace table across a restart (permitted by the spec), every stored ns=N reference points at the wrong namespace and reads/writes silently address wrong nodes. (2) TargetNamespaceKind enforcement: section 8 requires the driver to enforce Equipment-vs-SystemPlatform choice at startup and fail draft validation on misconfiguration; OpcUaClientDriverOptions has no such knob.

**Recommendation:** Build a namespace-URI map from session.NamespaceUris at connect time and store NodeIds in a server-stable form (namespace URI plus identifier) rather than session-relative ns=N. Add the TargetNamespaceKind option and the startup validation section 8 describes, or document explicitly why the design deviates.

**Resolution:** Resolved 2026-05-22 — new `NamespaceMap` (built from session.NamespaceUris at connect and rebuilt on reconnect) persists discovered NodeIds in the server-stable `nsu=<uri>;…` form; reads/writes re-resolve that form against the current session so a remote namespace-table reorder no longer misaddresses nodes. Added the `TargetNamespaceKind` option + `UnsMappingTable` and `ValidateNamespaceKind`, which fails draft validation for an Equipment instance lacking a UNS mapping or a SystemPlatform instance carrying one.

### Driver.OpcUaClient-005

| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientDriver.cs:1297-1319` |
| Status | Resolved |

**Description:** OnKeepAlive reads and writes `_reconnectHandler` without any lock: `if (_reconnectHandler is not null) return;` followed by `_reconnectHandler = new SessionReconnectHandler(...)`. Keep-alive callbacks are raised from the SDK keep-alive timer thread; on a bad keep-alive the SDK can fire the handler repeatedly while the channel stays down. Two callbacks racing through the check-then-set both observe null, both construct a SessionReconnectHandler, both call BeginReconnect, and the second assignment overwrites the first handler, leaking the first handler (its retry loop keeps running, unreferenced and never disposed) and creating two competing reconnect loops. ShutdownAsync then only cancels/disposes the one that won the assignment race.

**Recommendation:** Guard the `_reconnectHandler` check-and-set with `_probeLock` (already held for `_hostState`), or use Interlocked.CompareExchange to ensure exactly one handler is constructed per drop.

**Resolution:** Resolved 2026-05-22 — the `_reconnectHandler` check-and-set in OnKeepAlive (and the take-and-clear in ShutdownAsync, plus the dispose/re-arm in OnReconnectComplete/TryRearmReconnect) now run inside the `_probeLock` critical section, so exactly one SessionReconnectHandler is constructed per drop and a racing keep-alive callback cannot leak a handler.
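
**Illustration:** The check-and-set under a lock can be sketched as below. `ReconnectArmer`, `NoopHandler`, `TryArm`, and `Disarm` are hypothetical names; the sketch only shows that racing callers construct exactly one handler, not how the real `SessionReconnectHandler` is driven.

```csharp
#nullable enable
using System;
using System.Threading;

// Stand-in for a reconnect handler; it only needs to be disposable here.
public sealed class NoopHandler : IDisposable
{
    public void Dispose() { }
}

public sealed class ReconnectArmer
{
    private readonly object _probeLock = new();
    private IDisposable? _reconnectHandler;
    private int _created;

    public int HandlersCreated => Volatile.Read(ref _created);

    // Returns true only for the single caller that actually arms the handler.
    public bool TryArm(Func<IDisposable> createHandler)
    {
        lock (_probeLock)
        {
            // Check and set happen under one lock, so two racing keep-alive
            // callbacks cannot both observe null and both construct a handler.
            if (_reconnectHandler is not null) return false;
            _reconnectHandler = createHandler();
            _created++;
            return true;
        }
    }

    // Take-and-clear under the lock; dispose outside it to keep the lock short.
    public void Disarm()
    {
        IDisposable? handler;
        lock (_probeLock)
        {
            handler = _reconnectHandler;
            _reconnectHandler = null;
        }
        handler?.Dispose();
    }
}
```

`Interlocked.CompareExchange` would be an alternative, at the cost of constructing and then discarding a handler when a caller loses the race.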

### Driver.OpcUaClient-006

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientDriver.cs:1330-1359` |
| Status | Resolved |

**Description:** OnReconnectComplete mutates `Session` (line 1347) directly from the reconnect-handler callback thread with no synchronization against ReadAsync/WriteAsync/DiscoverAsync/ShutdownAsync. Session is a plain auto-property with no memory barrier; a concurrent reader on another thread may observe a stale reference. ShutdownAsync (line 425) can also run concurrently with OnReconnectComplete: ShutdownAsync disposes the session and sets Session = null while OnReconnectComplete sets Session = newSession, and the interleaving is unspecified, potentially leaving a live session leaked after shutdown.

**Recommendation:** Route all Session mutations through a single lock (or the `_gate`). Make ShutdownAsync cancel the reconnect handler and wait for any in-flight OnReconnectComplete to settle before disposing the session.

**Resolution:** Resolved 2026-05-22 — All Session mutations (assignment to newSession in OnReconnectComplete, and assignment to null in ShutdownAsync) now run inside the `_probeLock` critical section, preventing races between the reconnect callback thread, ShutdownAsync, and keep-alive callbacks. KeepAlive handler detach/attach is also done under `_probeLock` so a keep-alive cannot fire against the old session after the swap.

### Driver.OpcUaClient-007

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientDriver.cs:1374`, `:1376-1383`, `:508` |
| Status | Resolved |

**Description:** Two disposal hazards. (1) Dispose() calls `DisposeAsync().AsTask().GetAwaiter().GetResult()`, which blocks synchronously on async work. The Galaxy stability review (driver-stability.md, the 2026-04-13 findings) explicitly calls out sync-over-async on the OPC UA stack thread as a closed bug class; if Dispose() runs on the OPC UA stack thread or any thread the SDK continuations need, this deadlocks. (2) DisposeAsync disposes `_gate` (line 1382) after ShutdownAsync returns, but ShutdownAsync does not drain in-flight ReadAsync/WriteAsync operations holding `_gate`. An in-flight read that calls `_gate.Release()` (line 508) after `_gate.Dispose()` throws ObjectDisposedException on a background thread.

**Recommendation:** Provide an async disposal path callers prefer; if a sync Dispose() is unavoidable keep it free of .GetResult() on SDK-thread-affine work. Before disposing `_gate`, acquire it once so all in-flight gated operations have completed, or guard releases against disposal.

**Resolution:** Resolved 2026-05-22 — `Dispose()` no longer calls `.GetAwaiter().GetResult()` on async work; it performs a purely-synchronous teardown (cancel reconnect handler, detach keep-alive, null Session under `_probeLock`). Both `Dispose()` and `DisposeAsync()` now acquire `_gate` once before disposing it, ensuring any in-flight gated operation has released before the gate is torn down.

### Driver.OpcUaClient-008

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `OpcUaClientDriver.cs:1092-1099` |
| Status | Resolved |

**Description:** AcknowledgeAsync issues the batched CallAsync and then catches all exceptions with a best-effort empty catch; it also never inspects the per-call results in the success path (`_ = await session.CallAsync(...)`). An alarm acknowledgment the upstream server rejects (BadConditionAlreadyAcked, BadNodeIdUnknown, BadUserAccessDenied) is reported as success to the caller. IAlarmSource.AcknowledgeAsync has no per-item result, so the only way a failure could surface is via an exception, and the catch suppresses even that. Operators acking a critical alarm get no signal that the ack did not take.

**Recommendation:** Inspect CallMethodResult.StatusCode for each result and log Bad codes; rethrow (or surface via driver health) genuine transport failures rather than swallowing them. Consider extending the contract so per-ack failures propagate.

**Resolution:** Resolved 2026-05-22 — `AcknowledgeAsync` now inspects each `CallMethodResult.StatusCode` in the success path and logs a Warning for any Bad code (BadConditionAlreadyAcked, BadNodeIdUnknown, BadUserAccessDenied, etc.). `OperationCanceledException` (transport timeout) is now re-thrown instead of swallowed; other transport exceptions are also logged with the driver instance ID. Requires `ILogger<OpcUaClientDriver>` injected via new optional constructor parameter.

### Driver.OpcUaClient-009

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `OpcUaClientDriver.cs:560-564` |
| Status | Resolved |

**Description:** WriteAsync's catch block fans out BadCommunicationError across the whole batch on any exception. Writes are non-idempotent by default (IWritable remarks, decision #44/#45): a timeout exception may fire after the upstream server has already applied the write. Reporting BadCommunicationError (a code that reads as "definitely did not happen") for a write that may have succeeded is misleading; a downstream OPC UA client may conclude that a retry is safe, re-issue the write, and double-apply it. The read path has the same fan-out, but reads are idempotent so it is benign there; for writes the ambiguity matters.

**Recommendation:** Map write timeouts/cancellations to BadTimeout (which downstream correctly treats as "outcome unknown, do not blindly retry") rather than BadCommunicationError, and only use BadCommunicationError for failures that provably occurred before the request reached the wire.

**Resolution:** Resolved 2026-05-22 — `WriteAsync`'s inner catch block now handles `OperationCanceledException` (timeout/cancellation) separately, mapping it to `BadTimeout` (0x800A0000), while all other exceptions map to `BadCommunicationError`. The session-null pre-wire exit still correctly uses `BadCommunicationError`.
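
**Illustration:** The exception-to-status mapping described above can be sketched as a small pure function. `WriteFaultMapper` and `MapWriteException` are hypothetical names; the two constants are the standard OPC UA status code values.

```csharp
using System;

public static class WriteFaultMapper
{
    public const uint BadCommunicationError = 0x80050000;
    public const uint BadTimeout = 0x800A0000;

    // Timeout/cancellation means "outcome unknown": the server may already have
    // applied the write. Everything else is reported as a communication failure.
    public static uint MapWriteException(Exception ex) => ex switch
    {
        OperationCanceledException => BadTimeout, // also covers TaskCanceledException
        _ => BadCommunicationError,
    };
}
```

Keeping the mapping in one pure function makes the BadTimeout versus BadCommunicationError distinction straightforward to pin with a unit test.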

### Driver.OpcUaClient-010

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientDriver.cs:823-824` |
| Status | Resolved |

**Description:** MapUpstreamDataType maps DataTypeIds.Byte (the OPC UA unsigned 8-bit type) to DriverDataType.Int16. Byte should map to an unsigned driver type; UInt16 is the smallest unsigned type available, just as Int16 is the signed type that SByte maps to. Mapping an unsigned 0-255 type onto signed Int16 misrepresents the type metadata downstream: clients see a signed type for an unsigned source, and any range/validation logic keyed off the driver data type is wrong.

**Recommendation:** Map DataTypeIds.Byte to DriverDataType.UInt16 (or add a Byte/UInt8 driver type if the enum supports finer granularity), keeping SByte and Int16 on the signed Int16 mapping.

**Resolution:** Resolved 2026-05-22 — `MapUpstreamDataType` now maps `DataTypeIds.Byte` → `DriverDataType.UInt16` (unsigned family) while `DataTypeIds.SByte` remains on `DriverDataType.Int16` (signed family). Test `MapUpstreamDataType_Byte_maps_to_UInt16_unsigned_family` asserts the fix and `MapUpstreamDataType_maps_Byte_to_UInt16_not_Int16` guards the regression.
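
**Illustration:** The signed/unsigned split can be sketched with cut-down enums. `UpstreamType`, `DriverType`, and `DataTypeMapper` are hypothetical stand-ins for illustration only; they cover just the four types discussed in this finding and are not the driver's actual `DataTypeIds` or `DriverDataType` definitions.

```csharp
using System;

// Hypothetical, reduced stand-ins for the upstream and driver type enums.
public enum UpstreamType { SByte, Byte, Int16, UInt16 }
public enum DriverType { Int16, UInt16 }

public static class DataTypeMapper
{
    public static DriverType Map(UpstreamType type) => type switch
    {
        // Signed family: SByte widens into Int16.
        UpstreamType.SByte or UpstreamType.Int16 => DriverType.Int16,
        // Unsigned family: Byte widens into UInt16, never into a signed type.
        UpstreamType.Byte or UpstreamType.UInt16 => DriverType.UInt16,
        _ => throw new ArgumentOutOfRangeException(nameof(type)),
    };
}
```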

### Driver.OpcUaClient-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `OpcUaClientDriver.cs:1007-1015` |
| Status | Resolved |

**Description:** The comment on the isArray computation stated "-1 = scalar; 1+ = array dimensions; 0 = one-dimensional array". This is inaccurate against OPC UA ValueRank semantics: -3 is ScalarOrOneDimension, -2 is Any, -1 is Scalar, and 0 is OneOrMoreDimensions (not specifically one-dimensional). The code `valueRank >= 0` treats -2 (Any) and -3 (ScalarOrOneDimension) as scalar, which is a defensible default, but the comment misdescribed the constants and would mislead a maintainer.

**Recommendation:** Correct the comment to the actual ValueRank constants (-3 ScalarOrOneDimension, -2 Any, -1 Scalar, 0 OneOrMoreDimensions, 1 OneDimension, >1 multi-dim) and state the deliberate choice that anything >= 0 is treated as an array.

**Resolution:** Resolved 2026-05-23 — `EnrichAndRegisterVariablesAsync` now carries the correct OPC UA Part 3 ValueRank legend (`-3 ScalarOrOneDimension`, `-2 Any`, `-1 Scalar`, `0 OneOrMoreDimensions`, `1 OneDimension`, `>1` specific N-dimensions) and explicitly states the deliberate choice that anything `>= 0` is treated as an array, with `-3`/`-2` conservatively folded into the scalar bucket. Regression tests `ValueRank_constants_have_the_OPCUA_Part3_spec_values` (anchors the SDK constants) and `IsArray_decision_matches_valueRank_greater_or_equal_zero` (theory across -3..2) pin the logic in `OpcUaClientLowFindingsRegressionTests.cs`.
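
**Illustration:** The ValueRank legend and the `>= 0` decision can be written out as below. `ValueRankRules` is a hypothetical helper, not the driver's code; the constant values are the ones listed in this finding.

```csharp
public static class ValueRankRules
{
    public const int ScalarOrOneDimension = -3;
    public const int Any = -2;
    public const int Scalar = -1;
    public const int OneOrMoreDimensions = 0;
    public const int OneDimension = 1;
    // Values greater than 1 denote an array with that specific number of dimensions.

    // Deliberate choice: anything >= 0 is an array; -3 and -2 are
    // conservatively folded into the scalar bucket together with -1.
    public static bool IsArray(int valueRank) => valueRank >= OneOrMoreDimensions;
}
```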

### Driver.OpcUaClient-012

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `OpcUaClientDriver.cs:210-217` |
| Status | Resolved |

**Description:** When AutoAcceptCertificates is true the driver registers a CertificateValidation handler that accepts only StatusCodes.BadCertificateUntrusted. A self-signed or otherwise untrusted server certificate frequently fails validation with a different code first (BadCertificateChainIncomplete, BadCertificateTimeInvalid, BadCertificateHostNameInvalid), so auto-accept silently does not accept many real dev certificates and the connect fails confusingly. The handler is added to config.CertificateValidator but never removed; each driver instance leaks a delegate subscription on a validator that may be process-shared. The option doc says auto-accept is dev-only and must be false in production, but there is no runtime guard preventing AutoAcceptCertificates=true shipping to production and no log warning when it is enabled.

**Recommendation:** When auto-accepting for dev, accept the full set of certificate-validation error codes (or use the SDK AutoAcceptUntrustedCertificates path consistently). Emit a prominent warning log every time AutoAcceptCertificates is enabled so a production misconfiguration is visible. Detach the handler on shutdown.

**Resolution:** Resolved 2026-05-22 — The cert-validation handler now accepts ALL validation errors (not only BadCertificateUntrusted) when `AutoAcceptCertificates=true`, so real dev certs with chain/host/time errors work. A `LogWarning` is emitted at startup whenever the flag is set. The handler delegate + validator reference are stored in `_certValidationHandler`/`_certValidatorRef` and detached in both `ShutdownAsync` and `Dispose()`/`DisposeAsync()` to prevent the delegate leak.

### Driver.OpcUaClient-013

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `OpcUaClientDriver.cs:436-437` |
| Status | Resolved |

**Description:** GetMemoryFootprint() is hard-coded to return 0 and FlushOptionalCachesAsync is a no-op Task.CompletedTask. docs/v2/driver-stability.md section "In-process only (Tier A/B)" makes per-instance allocation tracking a contract requirement, and driver-specs.md section 8 explicitly calls out browse-cache memory: BrowseStrategy=Full against a large remote server can cache tens of thousands of node descriptions and the per-instance budget should bound this. Returning 0 means the Core 30-second footprint poll can never detect this driver's browse-cache growth, so the escalation path from a cache-budget breach to a flush is dead code. A gateway pointed at a 10k-node server (the configured cap) silently evades the Tier-A memory-guard mechanism.

**Recommendation:** Track an approximate footprint for the discovered-node set and any cached browse state, return it from GetMemoryFootprint(), and implement FlushOptionalCachesAsync to drop droppable cache. If the driver genuinely holds no significant cache, document why 0 is correct.

**Resolution:** Resolved 2026-05-22 — `DiscoverAsync` now updates a `_discoveredNodeCount` volatile counter after each pass. `GetMemoryFootprint()` returns `_discoveredNodeCount * 512` (conservative ~512 bytes per node for DriverAttributeInfo + strings). `FlushOptionalCachesAsync` resets `_discoveredNodeCount` to 0, signalling Core that re-discovery will rebuild cleanly. A 10k-node server now reports ~5 MB to the Core slope alarm rather than 0.
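
**Illustration:** The footprint arithmetic can be sketched as below. `FootprintTracker` and `RecordDiscoveryPass` are hypothetical names, and the sketch uses `Interlocked` on a `long` rather than the volatile counter the resolution describes; the 512-byte per-node estimate is the figure quoted above.

```csharp
using System.Threading;
using System.Threading.Tasks;

public sealed class FootprintTracker
{
    private const long BytesPerNodeEstimate = 512; // rough per-node estimate
    private long _discoveredNodeCount;

    // Called after each discovery pass with the number of nodes registered.
    public void RecordDiscoveryPass(int nodeCount) =>
        Interlocked.Exchange(ref _discoveredNodeCount, nodeCount);

    public long GetMemoryFootprint() =>
        Interlocked.Read(ref _discoveredNodeCount) * BytesPerNodeEstimate;

    // Resets the estimate; the next discovery pass repopulates it.
    public Task FlushOptionalCachesAsync(CancellationToken ct = default)
    {
        Interlocked.Exchange(ref _discoveredNodeCount, 0);
        return Task.CompletedTask;
    }
}
```

With 10,000 discovered nodes the estimate is 10,000 × 512 = 5,120,000 bytes, which is the "~5 MB" figure in the resolution.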

### Driver.OpcUaClient-014

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `OpcUaClientDriver.cs:1138`, `:1314` |
| Status | Resolved |

**Description:** `MonitoredItem.Notification += (mi, args) => ...` (and the alarm-event equivalent) attached a closure-capturing lambda to each monitored item's event. The lambda was never detached. When UnsubscribeAsync removed a subscription it called Subscription.DeleteAsync but did not clear the MonitoredItem.Notification handlers; if the SDK retains the MonitoredItem/Subscription graph anywhere (the session keeps a reference until its own disposal, or during transfer-on-reconnect), the driver instance was kept alive by the closure longer than necessary.

**Recommendation:** Detach the Notification handlers when deleting a subscription, or hold the handler delegate so it can be explicitly removed in UnsubscribeAsync/ShutdownAsync.

**Resolution:** Resolved 2026-05-23 — `SubscribeAsync` now stores each `(MonitoredItem, MonitoredItemNotificationEventHandler)` pair in a new `MonitoredItemNotificationHandle` record carried inside `RemoteSubscription`. `SubscribeAlarmsAsync` similarly stores the event-MonitoredItem and its handler delegate on `RemoteAlarmSubscription`. `UnsubscribeAsync`, `UnsubscribeAlarmsAsync`, and the subscription-teardown loops in `ShutdownAsync` now invoke `DetachNotificationHandlers` (or the alarm-equivalent inline `Notification -= rs.Handler`) BEFORE calling `Subscription.DeleteAsync`, so the SDK's invocation list no longer pins the driver through the captured lambda. Reflection-based regression tests `RemoteSubscription_record_carries_handler_delegates_so_they_can_be_detached` and `RemoteAlarmSubscription_record_carries_handler_delegate_so_it_can_be_detached` pin the contract that the handler reference is reachable from the bookkeeping record (`OpcUaClientLowFindingsRegressionTests.cs`).
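
**Illustration:** Holding the handler delegate so it can be detached later can be sketched as below. `FakeMonitoredItem`, `NotificationHandle`, and `SubscriptionBook` are hypothetical stand-ins; the real `MonitoredItem.Notification` event uses the SDK's own delegate type rather than `EventHandler<int>`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Stand-in for an SDK monitored item exposing a notification event.
public sealed class FakeMonitoredItem
{
    public event EventHandler<int>? Notification;
    public bool HasSubscribers => Notification is not null;
    public void Raise(int value) => Notification?.Invoke(this, value);
}

// Bookkeeping record: keeps the exact delegate instance that was attached.
public sealed record NotificationHandle(FakeMonitoredItem Item, EventHandler<int> Handler);

public sealed class SubscriptionBook
{
    private readonly List<NotificationHandle> _handles = new();

    public int Received { get; private set; }

    public void Attach(FakeMonitoredItem item)
    {
        // Store the delegate in a local first; an inline lambda passed straight
        // to += could never be removed again.
        EventHandler<int> handler = (_, value) => Received += value;
        item.Notification += handler;
        _handles.Add(new NotificationHandle(item, handler));
    }

    // Detach before the subscription is deleted so the item's invocation list
    // no longer references this object through the captured lambda.
    public void DetachAll()
    {
        foreach (var handle in _handles)
        {
            handle.Item.Notification -= handle.Handler;
        }
        _handles.Clear();
    }
}
```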

### Driver.OpcUaClient-015

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/*`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/OpcUaClientSmokeTests.cs` |
| Status | Resolved |

**Description:** Unit-test coverage is solid for the pure mappers (MapSeverity, MapUpstreamDataType, MapSecurityPolicy, MapAggregateToNodeId, BuildCertificateIdentity, ResolveEndpointCandidates) and for "throws before init" guards, but the highest-risk behaviours of a gateway driver have no test: the reconnect/session-swap path (OnKeepAlive to OnReconnectComplete, findings -001/-002/-005/-006), browse continuation-point handling (-003), the cascading-quality fan-out on a mid-batch transport failure, and namespace remapping (-004). The reconnect test file itself states wire-level disconnect-reconnect-resume coverage lands with the in-process fixture, i.e. the single largest gateway bug surface (per driver-specs.md section 8) is explicitly untested. The integration suite is Docker-fixture gated against opc-plc and is a smoke test only. The failed-reconnect-to-Faulted and concurrent-keep-alive races are pure-logic paths testable with a fake ISession.

**Recommendation:** Add tests exercising the reconnect callbacks with a stub session (success and give-up cases), a browse test with a paged/continuation-point server stub, and a read-batch test asserting upstream Bad StatusCodes pass through verbatim while a transport throw fans out the local fault code.

**Resolution:** Resolved 2026-05-22 — Added `OpcUaClientMediumFindingsRegressionTests.cs` covering: (1) BadTimeout vs BadCommunicationError status-code distinction for the write-timeout path (Driver.OpcUaClient-009); (2) Byte→UInt16 mapping regression (Driver.OpcUaClient-010); (3) AutoAcceptCertificates warning log assertion (Driver.OpcUaClient-012); (4) GetMemoryFootprint/FlushOptionalCachesAsync contract (Driver.OpcUaClient-013); (5) MapSeverity thresholds, pre-init health, Session null pre-init, GetHostStatuses contract. Wire-level reconnect callback tests remain fixture-gated pending the in-process OPC UA server fixture.