fix(driver-opcuaclient): resolve High code-review findings (Driver.OpcUaClient-001..-005)

Driver.OpcUaClient-001: ReadAsync/WriteAsync/DiscoverAsync captured the session before acquiring _gate, so a reconnect that completed while the operation was blocked on the gate left the wire call bound to a stale, closed session. All three now re-read Session (and parse NodeIds) inside the _gate critical section after WaitAsync returns.

Driver.OpcUaClient-002: OnReconnectComplete ignored the give-up (null session) case, permanently wedging the driver with no Faulted signal and no reconnect loop. The give-up branch now transitions HostState to Faulted, sets a Faulted DriverHealth with an explanatory message, and re-arms a fresh SessionReconnectHandler (TryRearmReconnect) against the last-known session so an always-on gateway self-heals.

Driver.OpcUaClient-003: BrowseRecursiveAsync discarded browse continuation points, silently truncating large remote folders. It now loops on BrowseResult.ContinuationPoint, calling BrowseNextAsync and appending each page until the continuation point is empty.

Driver.OpcUaClient-004: driver-specs.md §8 namespace handling was absent. Added NamespaceMap (built from session.NamespaceUris at connect, rebuilt on reconnect), which persists discovered NodeIds in the server-stable nsu=<uri>;... form; reads/writes re-resolve that form against the current session so a remote namespace-table reorder no longer misaddresses nodes. Added the TargetNamespaceKind option + UnsMappingTable and ValidateNamespaceKind startup enforcement.

Driver.OpcUaClient-005: OnKeepAlive read/wrote _reconnectHandler without a lock, racing the SDK keep-alive timer thread and leaking handlers. The check-and-set in OnKeepAlive, the take-and-clear in ShutdownAsync, and the dispose/re-arm in OnReconnectComplete now all run inside the _probeLock critical section.

Adds OpcUaClientNamespaceTests (11 xUnit + Shouldly regression tests) covering ValidateNamespaceKind and the NamespaceMap stable encoding. Reconnect/browse wire paths remain fixture-gated per finding -015.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -7,7 +7,7 @@
 | Review date | 2026-05-22 |
 | Commit reviewed | `76d35d1` |
 | Status | Reviewed |
-| Open findings | 15 |
+| Open findings | 10 |

 ## Checklist coverage

@@ -33,13 +33,13 @@
 | Severity | High |
 | Category | Correctness & logic bugs |
 | Location | `OpcUaClientDriver.cs:444`, `:466`, `:517`, `:540`, `:599`, `:610` |
-| Status | Open |
+| Status | Resolved |

 **Description:** ReadAsync, WriteAsync, and DiscoverAsync capture the session into a local variable via RequireSession() before acquiring `_gate`, then perform the wire call on that captured reference inside the gate. The reconnect path (OnReconnectComplete, line 1330) swaps `Session` to a brand-new ISession. A read that captured the pre-reconnect session at line 444, then blocked on `_gate.WaitAsync` while a reconnect completed, issues ReadAsync against a stale/closed session. The catch block then fans out BadCommunicationError for the whole batch even though the driver is healthy on the new session, and the operation is silently lost. The gate does not protect against the session being swapped underneath a waiter.

 **Recommendation:** Re-read `Session` inside the `_gate` critical section (after WaitAsync returns), or route the session swap itself through `_gate` so a swap cannot interleave with a gated operation.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-22: ReadAsync/WriteAsync/DiscoverAsync now re-read `Session` (and parse NodeIds) inside the `_gate` critical section after `WaitAsync` returns; a session swapped in by a concurrent reconnect is the one used for the wire call.
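
The capture-before-gate race can be reproduced in miniature. The following is a minimal sketch in Python rather than the driver's C#, and every name in it (`FakeSession`, `Driver`, `read_stale`, `read_fixed`, `swap_session`) is hypothetical. It assumes only what the finding describes: a session reference that a reconnect can swap while a reader waits on the gate.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for an OPC UA session; not the SDK's ISession."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    async def read(self, node_id):
        if self.closed:
            raise ConnectionError(f"{self.name} is closed")
        return f"{node_id}@{self.name}"


class Driver:
    def __init__(self, session):
        self.session = session
        self._gate = asyncio.Lock()

    async def read_stale(self, node_id):
        session = self.session           # captured BEFORE waiting on the gate
        async with self._gate:
            return await session.read(node_id)

    async def read_fixed(self, node_id):
        async with self._gate:
            session = self.session       # re-read AFTER the gate is acquired
            return await session.read(node_id)

    def swap_session(self, new_session):
        # Models a reconnect completing: the old session is closed and replaced.
        self.session.closed = True
        self.session = new_session


async def demo():
    driver = Driver(FakeSession("old"))
    await driver._gate.acquire()         # another operation holds the gate
    stale = asyncio.create_task(driver.read_stale("ns=2;s=Tag"))
    fixed = asyncio.create_task(driver.read_fixed("ns=2;s=Tag"))
    await asyncio.sleep(0)               # let both tasks block on the gate
    driver.swap_session(FakeSession("new"))
    driver._gate.release()
    return await asyncio.gather(stale, fixed, return_exceptions=True)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the reader that captured the session early fails against the closed session, while the reader that re-reads inside the gate completes on the new one.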

 ### Driver.OpcUaClient-002

@@ -48,13 +48,13 @@
 | Severity | High |
 | Category | Error handling & resilience |
 | Location | `OpcUaClientDriver.cs:1330-1359` |
-| Status | Open |
+| Status | Resolved |

 **Description:** OnReconnectComplete handles only the success case. When SessionReconnectHandler gives up (its retry loop exhausts the 2-minute maxReconnectPeriod), it invokes the callback with `handler.Session == null`. The code sets `Session = null`, disposes the handler, and sets `_reconnectHandler = null`, but leaves `_health` at whatever it was (typically Degraded) and `_hostState` at Stopped. There is no further reconnect attempt (the handler is gone, and OnKeepAlive only fires on a live session which no longer exists), and DriverState is never set to Faulted. The driver is permanently wedged: no session, no reconnect loop, no Faulted signal for the Core, and ReinitializeAsync is never triggered. This is the single largest gateway resilience gap.

 **Recommendation:** In OnReconnectComplete, when newSession is null, set `_health` to a Faulted DriverHealth with an explanatory message so the Core can fan out Bad quality and offer an operator reinitialize. Consider re-arming a fresh reconnect attempt rather than giving up entirely for an always-on gateway.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-22: OnReconnectComplete's give-up branch now transitions HostState to Faulted, sets a Faulted DriverHealth with an explanatory message, and re-arms a fresh SessionReconnectHandler (`TryRearmReconnect`) against the last-known session so an always-on gateway self-heals while the Core can still offer an operator reinitialize.
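
The shape of the give-up branch can be illustrated with a small state model. This is a sketch in Python, not the driver's code: the `State`, `Health`, and `Driver` types are hypothetical, the host state and driver health are collapsed into one field for brevity, and re-arming is reduced to a counter so the behaviour is observable.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    FAULTED = "Faulted"


@dataclass
class Health:
    state: State
    message: str = ""


@dataclass
class Driver:
    """Hypothetical model of the reconnect callback; names are illustrative."""

    session: Optional[str] = "session-1"
    last_known_session: Optional[str] = None
    health: Health = field(default_factory=lambda: Health(State.DEGRADED))
    rearm_count: int = 0

    def on_reconnect_complete(self, new_session: Optional[str]) -> None:
        if new_session is not None:
            # Success: adopt the new session and report healthy.
            self.session = new_session
            self.last_known_session = new_session
            self.health = Health(State.HEALTHY)
            return
        # Give-up: remember the old session, surface a Faulted signal,
        # and start a fresh reconnect attempt instead of going silent.
        self.last_known_session = self.session or self.last_known_session
        self.session = None
        self.health = Health(State.FAULTED, "reconnect gave up; re-armed")
        self._rearm_reconnect()

    def _rearm_reconnect(self) -> None:
        # Stand-in for constructing a new reconnect handler.
        self.rearm_count += 1


if __name__ == "__main__":
    driver = Driver()
    driver.on_reconnect_complete(None)
    print(driver.health, driver.rearm_count)
```

The point of the sketch is that the null-session path leaves the driver in a state that is both visible (Faulted, with a message) and recoverable (a new attempt is already running).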

 ### Driver.OpcUaClient-003

@@ -63,13 +63,13 @@
 | Severity | High |
 | Category | Correctness & logic bugs |
 | Location | `OpcUaClientDriver.cs:644-711` |
-| Status | Open |
+| Status | Resolved |

 **Description:** BrowseRecursiveAsync calls session.BrowseAsync with `requestedMaxReferencesPerNode: 0` but never follows browse continuation points. OPC UA servers enforce a server-side max-references-per-node limit; when a node has more children than the server returns in one response, BrowseResult.ContinuationPoint is non-empty and the caller must issue BrowseNext to retrieve the remainder. This driver discards the continuation point, so any folder on the remote server with a large child set is silently truncated: discovered tags go missing from the local address space with no error. For the tens-of-thousands-of-nodes scenario the options doc targets (MaxDiscoveredNodes = 10000), this is a realistic and silent data-completeness bug.

 **Recommendation:** After processing resp.Results[0].References, check resp.Results[0].ContinuationPoint; while non-empty, call session.BrowseNextAsync and append the additional references before recursing/registering.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-22: BrowseRecursiveAsync now loops on the BrowseResult.ContinuationPoint, calling `session.BrowseNextAsync` and appending each page of references until the continuation point is empty, so large remote folders are no longer silently truncated.
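
The continuation-point loop is a plain paging pattern. A minimal sketch follows, in Python against a hypothetical in-memory `FakeServer` rather than the OPC UA SDK; the `browse`/`browse_next` methods and the byte-string continuation point are assumptions made for illustration only.

```python
class FakeServer:
    """Hypothetical server that returns at most page_size references per call."""

    def __init__(self, children, page_size):
        self._children = list(children)
        self._page_size = page_size

    def browse(self, node):
        return self._page(0)

    def browse_next(self, continuation_point):
        # The continuation point here is just the next offset, encoded as bytes.
        return self._page(int(continuation_point.decode()))

    def _page(self, start):
        end = start + self._page_size
        refs = self._children[start:end]
        # An empty continuation point means there is nothing left to fetch.
        cp = str(end).encode() if end < len(self._children) else b""
        return refs, cp


def browse_all(server, node):
    """Collect every reference by following continuation points until empty."""
    refs, cp = server.browse(node)
    all_refs = list(refs)
    while cp:
        refs, cp = server.browse_next(cp)
        all_refs.extend(refs)
    return all_refs


if __name__ == "__main__":
    server = FakeServer([f"Tag{i}" for i in range(7)], page_size=3)
    print(server.browse("Root")[0])      # first page only
    print(browse_all(server, "Root"))    # all pages
```

A caller that stops after the first `browse` sees only the first page, which is the truncation the finding describes.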

 ### Driver.OpcUaClient-004

@@ -78,13 +78,13 @@
 | Severity | High |
 | Category | Design-document adherence |
 | Location | `OpcUaClientDriver.cs:596-632`, `:789`, `OpcUaClientDriverOptions.cs` |
-| Status | Open |
+| Status | Resolved |

 **Description:** docs/v2/driver-specs.md section 8 mandates two features that are absent. (1) Namespace remapping: the spec requires building a bidirectional namespace map at connect time from session.NamespaceUris. The driver instead stores the raw upstream NodeId string (pv.NodeId.ToString()) as DriverAttributeInfo.FullName and re-parses it verbatim for reads/writes. The namespace index embedded in `ns=N;...` is server-session-relative; if the upstream server reorders its namespace table across a restart (permitted by the spec), every stored ns=N reference points at the wrong namespace and reads/writes silently address wrong nodes. (2) TargetNamespaceKind enforcement: section 8 requires the driver to enforce Equipment-vs-SystemPlatform choice at startup and fail draft validation on misconfiguration; OpcUaClientDriverOptions has no such knob.

 **Recommendation:** Build a namespace-URI map from session.NamespaceUris at connect time and store NodeIds in a server-stable form (namespace URI plus identifier) rather than session-relative ns=N. Add the TargetNamespaceKind option and the startup validation section 8 describes, or document explicitly why the design deviates.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-22: new `NamespaceMap` (built from session.NamespaceUris at connect and rebuilt on reconnect) persists discovered NodeIds in the server-stable `nsu=<uri>;…` form; reads/writes re-resolve that form against the current session so a remote namespace-table reorder no longer misaddresses nodes. Added the `TargetNamespaceKind` option + `UnsMappingTable` and `ValidateNamespaceKind`, which fails draft validation for an Equipment instance lacking a UNS mapping or a SystemPlatform instance carrying one.
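
To see why the `nsu=` form survives a namespace-table reorder, consider the sketch below. It is written in Python and is not the driver's `NamespaceMap`: the class, its `to_stable`/`to_session` methods, and the example URIs are hypothetical, and the string handling assumes namespace URIs contain no `;` (a real encoder has to escape that character).

```python
class NamespaceMap:
    """Hypothetical URI <-> index map built from a session's namespace table."""

    def __init__(self, namespace_uris):
        self._uris = list(namespace_uris)
        self._index = {uri: i for i, uri in enumerate(self._uris)}

    def to_stable(self, node_id):
        """Convert session-relative 'ns=N;...' into 'nsu=<uri>;...'."""
        if not node_id.startswith("ns="):
            return node_id  # no prefix means namespace 0, which never moves
        prefix, identifier = node_id.split(";", 1)
        index = int(prefix[len("ns="):])
        return f"nsu={self._uris[index]};{identifier}"

    def to_session(self, stable_id):
        """Resolve 'nsu=<uri>;...' against this session's namespace table."""
        if not stable_id.startswith("nsu="):
            return stable_id
        prefix, identifier = stable_id.split(";", 1)
        index = self._index[prefix[len("nsu="):]]  # KeyError if URI is gone
        return f"ns={index};{identifier}"


if __name__ == "__main__":
    ua = "http://opcfoundation.org/UA/"
    before = NamespaceMap([ua, "urn:vendor:a", "urn:vendor:b"])
    stable = before.to_stable("ns=2;s=Tag")
    # After a server restart the two vendor namespaces have swapped places.
    after = NamespaceMap([ua, "urn:vendor:b", "urn:vendor:a"])
    print(stable, "->", after.to_session(stable))
```

Had the raw `ns=2;s=Tag` string been stored instead, it would resolve to `urn:vendor:a` after the reorder and address a different node.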

 ### Driver.OpcUaClient-005

@@ -93,13 +93,13 @@
 | Severity | High |
 | Category | Concurrency & thread safety |
 | Location | `OpcUaClientDriver.cs:1297-1319` |
-| Status | Open |
+| Status | Resolved |

 **Description:** OnKeepAlive reads and writes `_reconnectHandler` without any lock: `if (_reconnectHandler is not null) return;` followed by `_reconnectHandler = new SessionReconnectHandler(...)`. Keep-alive callbacks are raised from the SDK keep-alive timer thread; on a bad keep-alive the SDK can fire the handler repeatedly while the channel stays down. Two callbacks racing through the check-then-set both observe null, both construct a SessionReconnectHandler, both call BeginReconnect, and the second assignment overwrites the first handler, leaking the first handler (its retry loop keeps running, unreferenced and never disposed) and creating two competing reconnect loops. ShutdownAsync then only cancels/disposes the one that won the assignment race.

 **Recommendation:** Guard the `_reconnectHandler` check-and-set with `_probeLock` (already held for `_hostState`), or use Interlocked.CompareExchange to ensure exactly one handler is constructed per drop.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-22: the `_reconnectHandler` check-and-set in OnKeepAlive (and the take-and-clear in ShutdownAsync, plus the dispose/re-arm in OnReconnectComplete/TryRearmReconnect) now run inside the `_probeLock` critical section, so exactly one SessionReconnectHandler is constructed per drop and a racing keep-alive callback cannot leak a handler.
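
The locked check-and-set and the take-and-clear can be sketched as follows. This is Python with `threading.Lock` standing in for `_probeLock`; the `Driver` class, its method names, and the `handlers_created` counter are hypothetical and exist only to make the single-construction guarantee observable.

```python
import threading


class Driver:
    """Hypothetical model: at most one reconnect handler per connection drop."""

    def __init__(self):
        self._probe_lock = threading.Lock()
        self._reconnect_handler = None
        self.handlers_created = 0

    def on_keep_alive_bad(self):
        # Check and set happen under one lock, so two racing callbacks
        # cannot both observe None and both construct a handler.
        with self._probe_lock:
            if self._reconnect_handler is not None:
                return False
            self.handlers_created += 1
            self._reconnect_handler = object()  # stand-in for a real handler
            return True

    def shutdown(self):
        # Take-and-clear under the lock; the caller disposes the returned
        # handler after the lock has been released.
        with self._probe_lock:
            handler, self._reconnect_handler = self._reconnect_handler, None
        return handler


def race(driver, callbacks=8):
    """Fire several bad keep-alive callbacks at once and collect the results."""
    barrier = threading.Barrier(callbacks)
    results = []

    def worker():
        barrier.wait()                          # release all threads together
        results.append(driver.on_keep_alive_bad())

    threads = [threading.Thread(target=worker) for _ in range(callbacks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    d = Driver()
    print(sum(race(d)), d.handlers_created)
```

However the threads interleave, exactly one callback wins the check-and-set, and `shutdown` always sees the handler that was actually armed.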

 ### Driver.OpcUaClient-006
