fix(driver-opcuaclient): resolve High code-review findings (Driver.OpcUaClient-001..-005)

Driver.OpcUaClient-001 — ReadAsync/WriteAsync/DiscoverAsync captured the
session before acquiring _gate, so a reconnect that completed while the
operation was blocked on the gate left the wire call bound to a stale,
closed session. All three now re-read Session (and parse NodeIds) inside
the _gate critical section after WaitAsync returns.

Driver.OpcUaClient-002 — OnReconnectComplete ignored the give-up (null
session) case, permanently wedging the driver with no Faulted signal and
no reconnect loop. The give-up branch now transitions HostState to
Faulted, sets a Faulted DriverHealth with an explanatory message, and
re-arms a fresh SessionReconnectHandler (TryRearmReconnect) against the
last-known session so an always-on gateway self-heals.

Driver.OpcUaClient-003 — BrowseRecursiveAsync discarded browse
continuation points, silently truncating large remote folders.
It now loops on BrowseResult.ContinuationPoint calling BrowseNextAsync
and appending each page until the continuation point is empty.

Driver.OpcUaClient-004 — driver-specs.md §8 namespace handling was
absent. Added NamespaceMap (built from session.NamespaceUris at connect,
rebuilt on reconnect) which persists discovered NodeIds in the
server-stable nsu=<uri>;... form; reads/writes re-resolve that form
against the current session so a remote namespace-table reorder no
longer misaddresses nodes. Added the TargetNamespaceKind option +
UnsMappingTable and ValidateNamespaceKind startup enforcement.

Driver.OpcUaClient-005 — OnKeepAlive read/wrote _reconnectHandler
without a lock, racing the SDK keep-alive timer thread and leaking
handlers. The check-and-set in OnKeepAlive, the take-and-clear in
ShutdownAsync, and the dispose/re-arm in OnReconnectComplete now all
run inside the _probeLock critical section.

Adds OpcUaClientNamespaceTests (11 xUnit + Shouldly regression tests)
covering ValidateNamespaceKind and the NamespaceMap stable encoding.
Reconnect/browse wire paths remain fixture-gated per finding -015.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-22 06:38:20 -04:00
parent 090d2a4b44
commit ebc0511c72
7 changed files with 638 additions and 99 deletions
@@ -71,10 +71,22 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
/// <summary>
/// SDK-provided reconnect handler that owns the retry loop + session-transfer machinery
/// when the session's keep-alive channel reports a bad status. Null outside the
/// reconnecting window; constructed lazily inside the keep-alive handler.
/// reconnecting window; constructed lazily inside the keep-alive handler. Guarded by
/// <see cref="_probeLock"/> — keep-alive callbacks fire from the SDK timer thread and
/// can race a check-then-set if left unsynchronized (Driver.OpcUaClient-005).
/// </summary>
private SessionReconnectHandler? _reconnectHandler;
/// <summary>
/// Bidirectional namespace map built at connect time from <c>session.NamespaceUris</c>.
/// Stored NodeIds embed the server-stable namespace <b>URI</b> rather than the
/// session-relative <c>ns=N</c> index, so a remote-server namespace-table reorder
/// across a restart does not silently re-point stored references at the wrong
/// namespace (driver-specs.md §8 "Namespace Remapping", finding Driver.OpcUaClient-004).
/// Null until <see cref="InitializeAsync"/> returns cleanly.
/// </summary>
private NamespaceMap? _namespaceMap;
public string DriverInstanceId => driverInstanceId;
public string DriverType => "OpcUaClient";
@@ -83,6 +95,11 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
_health = new DriverHealth(DriverState.Initializing, null, null);
try
{
// Enforce the Equipment-vs-SystemPlatform choice at startup per driver-specs.md
// §8 "Namespace Assignment" — a misconfigured remote fails draft validation here,
// not as a runtime surprise.
ValidateNamespaceKind(_options);
var appConfig = await BuildApplicationConfigurationAsync(cancellationToken).ConfigureAwait(false);
var candidates = ResolveEndpointCandidates(_options);
@@ -126,6 +143,12 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
_keepAliveHandler = OnKeepAlive;
session.KeepAlive += _keepAliveHandler;
// Build the bidirectional namespace map from the freshly negotiated session's
// NamespaceUris (driver-specs.md §8 "Namespace Remapping"). Stored NodeIds carry
// the namespace URI, not the session-relative ns=N index, so a remote namespace
// reorder across a restart can't silently misaddress nodes.
_namespaceMap = NamespaceMap.FromSession(session);
Session = session;
_connectedEndpointUrl = connectedUrl;
_health = new DriverHealth(DriverState.Healthy, DateTime.UtcNow, null);
@@ -235,6 +258,41 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
return [opts.EndpointUrl];
}
/// <summary>
/// Enforce the §8 "Namespace Assignment" rule at startup. An <c>Equipment</c>-kind
/// instance gateways raw equipment data and therefore needs a config-driven UNS
/// mapping table (remote nodes don't conform to UNS); a <c>SystemPlatform</c>-kind
/// instance gateways processed data whose hierarchy is preserved verbatim, so a
/// UNS mapping table is meaningless and rejected. Throwing here surfaces the
/// misconfiguration as a draft-validation failure rather than a runtime surprise.
/// </summary>
internal static void ValidateNamespaceKind(OpcUaClientDriverOptions opts)
{
switch (opts.TargetNamespaceKind)
{
case OpcUaTargetNamespaceKind.Equipment:
if (opts.UnsMappingTable is null || opts.UnsMappingTable.Count == 0)
throw new InvalidOperationException(
"OpcUaClient driver configured with TargetNamespaceKind=Equipment but no " +
"UnsMappingTable: §8 requires a config-driven remote-to-UNS mapping table " +
"because remote nodes do not conform to UNS by default. Provide a mapping " +
"table or set TargetNamespaceKind=SystemPlatform if the remote exposes " +
"processed data.");
break;
case OpcUaTargetNamespaceKind.SystemPlatform:
if (opts.UnsMappingTable is { Count: > 0 })
throw new InvalidOperationException(
"OpcUaClient driver configured with TargetNamespaceKind=SystemPlatform but " +
"a UnsMappingTable was supplied: processed data preserves its own hierarchy " +
"and a UNS mapping table is ambiguous here. Clear the mapping table or set " +
"TargetNamespaceKind=Equipment if the remote exposes raw equipment data.");
break;
default:
throw new ArgumentOutOfRangeException(
nameof(opts), opts.TargetNamespaceKind, "Unknown TargetNamespaceKind.");
}
}
/// <summary>
/// Build the user-identity token from the driver options. Split out of
/// <see cref="InitializeAsync"/> so the failover sweep reuses one identity across
@@ -411,10 +469,17 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
// Abort any in-flight reconnect attempts before touching the session — BeginReconnect's
// retry loop holds a reference to the current session and would fight Session.CloseAsync
// if left spinning.
try { _reconnectHandler?.CancelReconnect(); } catch { }
_reconnectHandler?.Dispose();
_reconnectHandler = null;
// if left spinning. Take the handler under _probeLock so a keep-alive callback racing
// through OnKeepAlive can't arm a fresh handler after we've torn this one down
// (Driver.OpcUaClient-005).
SessionReconnectHandler? handlerToCancel;
lock (_probeLock)
{
handlerToCancel = _reconnectHandler;
_reconnectHandler = null;
}
try { handlerToCancel?.CancelReconnect(); } catch { }
handlerToCancel?.Dispose();
if (_keepAliveHandler is not null && Session is not null)
{
@@ -426,6 +491,7 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
catch { /* best-effort */ }
try { Session?.Dispose(); } catch { }
Session = null;
_namespaceMap = null;
_connectedEndpointUrl = null;
TransitionTo(HostState.Unknown);
@@ -441,31 +507,46 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
public async Task<IReadOnlyList<DataValueSnapshot>> ReadAsync(
IReadOnlyList<string> fullReferences, CancellationToken cancellationToken)
{
var session = RequireSession();
// Make sure a session exists before queuing on the gate, but do NOT bind the wire
// call to this reference — a reconnect can swap Session while we wait on _gate. The
// session actually used is re-read inside the gate (Driver.OpcUaClient-001/-006).
_ = RequireSession();
var results = new DataValueSnapshot[fullReferences.Count];
var now = DateTime.UtcNow;
// Parse NodeIds up-front. Tags whose reference doesn't parse get BadNodeIdInvalid
// and are omitted from the wire request — saves a round-trip against the upstream
// server for a fault the driver can detect locally.
var toSend = new ReadValueIdCollection();
var indexMap = new List<int>(fullReferences.Count); // maps wire-index -> results-index
for (var i = 0; i < fullReferences.Count; i++)
{
if (!TryParseNodeId(session, fullReferences[i], out var nodeId))
{
results[i] = new DataValueSnapshot(null, StatusBadNodeIdInvalid, null, now);
continue;
}
toSend.Add(new ReadValueId { NodeId = nodeId, AttributeId = Attributes.Value });
indexMap.Add(i);
}
if (toSend.Count == 0) return results;
await _gate.WaitAsync(cancellationToken).ConfigureAwait(false);
try
{
// Re-read Session inside the critical section: if a reconnect completed while we
// were blocked on _gate, OnReconnectComplete has already swapped in the new
// session. NodeId parsing is namespace-relative, so it must also use the current
// session's namespace table.
var session = Session;
if (session is null)
{
for (var i = 0; i < fullReferences.Count; i++)
results[i] = new DataValueSnapshot(null, StatusBadCommunicationError, null, now);
return results;
}
// Parse NodeIds against the live session. Tags whose reference doesn't parse get
// BadNodeIdInvalid and are omitted from the wire request — saves a round-trip for
// a fault the driver can detect locally.
var toSend = new ReadValueIdCollection();
var indexMap = new List<int>(fullReferences.Count); // maps wire-index -> results-index
for (var i = 0; i < fullReferences.Count; i++)
{
if (!TryParseNodeId(session, fullReferences[i], out var nodeId))
{
results[i] = new DataValueSnapshot(null, StatusBadNodeIdInvalid, null, now);
continue;
}
toSend.Add(new ReadValueId { NodeId = nodeId, AttributeId = Attributes.Value });
indexMap.Add(i);
}
if (toSend.Count == 0) return results;
try
{
var resp = await session.ReadAsync(
@@ -514,32 +595,45 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
public async Task<IReadOnlyList<WriteResult>> WriteAsync(
IReadOnlyList<Core.Abstractions.WriteRequest> writes, CancellationToken cancellationToken)
{
var session = RequireSession();
// See ReadAsync — the wire call must use the session current inside the gate, not a
// reference captured before WaitAsync (Driver.OpcUaClient-001/-006).
_ = RequireSession();
var results = new WriteResult[writes.Count];
var toSend = new WriteValueCollection();
var indexMap = new List<int>(writes.Count);
for (var i = 0; i < writes.Count; i++)
{
if (!TryParseNodeId(session, writes[i].FullReference, out var nodeId))
{
results[i] = new WriteResult(StatusBadNodeIdInvalid);
continue;
}
toSend.Add(new WriteValue
{
NodeId = nodeId,
AttributeId = Attributes.Value,
Value = new DataValue(new Variant(writes[i].Value)),
});
indexMap.Add(i);
}
if (toSend.Count == 0) return results;
await _gate.WaitAsync(cancellationToken).ConfigureAwait(false);
try
{
var session = Session;
if (session is null)
{
// Writes are non-idempotent (decision #44/#45) — but here the request never
// reached the wire, so BadCommunicationError ("definitely did not happen") is
// the honest code.
for (var i = 0; i < writes.Count; i++)
results[i] = new WriteResult(StatusBadCommunicationError);
return results;
}
var toSend = new WriteValueCollection();
var indexMap = new List<int>(writes.Count);
for (var i = 0; i < writes.Count; i++)
{
if (!TryParseNodeId(session, writes[i].FullReference, out var nodeId))
{
results[i] = new WriteResult(StatusBadNodeIdInvalid);
continue;
}
toSend.Add(new WriteValue
{
NodeId = nodeId,
AttributeId = Attributes.Value,
Value = new DataValue(new Variant(writes[i].Value)),
});
indexMap.Add(i);
}
if (toSend.Count == 0) return results;
try
{
var resp = await session.WriteAsync(
@@ -568,25 +662,26 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
}
/// <summary>
/// Parse a tag's full-reference string as a NodeId. Accepts the standard OPC UA
/// serialized forms (<c>ns=2;s=…</c>, <c>i=2253</c>, <c>ns=4;g=…</c>, <c>ns=3;b=…</c>).
/// Empty + malformed strings return false; the driver surfaces that as
/// <see cref="StatusBadNodeIdInvalid"/> without a wire round-trip.
/// Parse a tag's full-reference string as a NodeId, resolved against the
/// <paramref name="session"/>'s <i>current</i> namespace table. Accepts both the
/// server-stable <c>nsu=&lt;uri&gt;;…</c> form the driver persists (see
/// <see cref="NamespaceMap.ToStableReference"/>) and plain OPC UA serialized forms
/// (<c>ns=2;s=…</c>, <c>i=2253</c>, <c>ns=4;g=…</c>, <c>ns=3;b=…</c>). Resolving the
/// <c>nsu=…</c> form against the current session re-binds it through that session's
/// URI table, so a remote namespace-table reorder across a restart is transparently
/// corrected (driver-specs.md §8). Empty + malformed strings return false; the driver
/// surfaces that as <see cref="StatusBadNodeIdInvalid"/> without a wire round-trip.
/// </summary>
internal static bool TryParseNodeId(ISession session, string fullReference, out NodeId nodeId)
{
nodeId = NodeId.Null;
if (string.IsNullOrWhiteSpace(fullReference)) return false;
try
{
nodeId = NodeId.Parse(session.MessageContext, fullReference);
return !NodeId.IsNull(nodeId);
}
catch
{
return false;
}
}
internal static bool TryParseNodeId(ISession session, string fullReference, out NodeId nodeId) =>
NamespaceMap.TryResolve(session, fullReference, out nodeId);
/// <summary>
/// Render a discovered NodeId in the server-stable form persisted into the local
/// address space. Falls back to the raw serialized NodeId if the namespace map is not
/// yet built (it always is by the time <see cref="DiscoverAsync"/> runs).
/// </summary>
private string StableReference(NodeId nodeId) =>
_namespaceMap?.ToStableReference(nodeId) ?? nodeId.ToString() ?? string.Empty;
private ISession RequireSession() =>
Session ?? throw new InvalidOperationException("OpcUaClientDriver not initialized");
@@ -596,11 +691,10 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
public async Task DiscoverAsync(IAddressSpaceBuilder builder, CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(builder);
var session = RequireSession();
var root = !string.IsNullOrEmpty(_options.BrowseRoot)
? NodeId.Parse(session.MessageContext, _options.BrowseRoot)
: ObjectIds.ObjectsFolder;
// Confirm a session exists before queuing; the session actually browsed is re-read
// inside the gate so a reconnect mid-wait can't leave us browsing a closed session
// (Driver.OpcUaClient-001/-006).
_ = RequireSession();
var rootFolder = builder.Folder("Remote", "Remote");
var visited = new HashSet<NodeId>();
@@ -610,6 +704,14 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
await _gate.WaitAsync(cancellationToken).ConfigureAwait(false);
try
{
var session = Session
?? throw new InvalidOperationException(
"OpcUaClient session was lost before discovery could browse the remote server.");
var root = !string.IsNullOrEmpty(_options.BrowseRoot)
? NodeId.Parse(session.MessageContext, _options.BrowseRoot)
: ObjectIds.ObjectsFolder;
// Pass 1: browse hierarchy + create folders inline, collect variables into a
// pending list. Defers variable registration until attributes are resolved — the
// address-space builder's Variable call is the one-way commit, so doing it only
@@ -664,15 +766,41 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
}
};
BrowseResponse resp;
ReferenceDescriptionCollection refs;
try
{
resp = await session.BrowseAsync(
var resp = await session.BrowseAsync(
requestHeader: null,
view: null,
requestedMaxReferencesPerNode: 0,
nodesToBrowse: browseDescriptions,
ct: ct).ConfigureAwait(false);
if (resp.Results.Count == 0) return;
var result = resp.Results[0];
refs = result.References;
// Follow browse continuation points. OPC UA servers cap the references returned
// per node in a single response; when a folder has more children than the cap,
// BrowseResult.ContinuationPoint is non-empty and the remainder must be pulled
// with BrowseNext. Without this loop a large remote folder is silently truncated
// and discovered tags go missing from the local address space
// (Driver.OpcUaClient-003).
var continuationPoint = result.ContinuationPoint;
while (continuationPoint is { Length: > 0 })
{
var next = await session.BrowseNextAsync(
requestHeader: null,
releaseContinuationPoints: false,
continuationPoints: [continuationPoint],
ct: ct).ConfigureAwait(false);
if (next.Results.Count == 0) break;
var nextResult = next.Results[0];
if (nextResult.References is { Count: > 0 })
refs.AddRange(nextResult.References);
continuationPoint = nextResult.ContinuationPoint;
}
}
catch
{
@@ -682,9 +810,6 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
return;
}
if (resp.Results.Count == 0) return;
var refs = resp.Results[0].References;
foreach (var rf in refs)
{
if (discovered() >= _options.MaxDiscoveredNodes) break;
@@ -787,7 +912,7 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
var historizing = StatusCode.IsGood(histDv.StatusCode) && histDv.Value is bool b && b;
pv.ParentFolder.Variable(pv.BrowseName, pv.DisplayName, new DriverAttributeInfo(
FullName: pv.NodeId.ToString() ?? string.Empty,
FullName: StableReference(pv.NodeId),
DriverDataType: dataType,
IsArray: isArray,
ArrayDim: null,
@@ -799,7 +924,7 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
void RegisterFallback(PendingVariable pv)
{
pv.ParentFolder.Variable(pv.BrowseName, pv.DisplayName, new DriverAttributeInfo(
FullName: pv.NodeId.ToString() ?? string.Empty,
FullName: StableReference(pv.NodeId),
DriverDataType: DriverDataType.Int32,
IsArray: false,
ArrayDim: null,
@@ -1304,15 +1429,25 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
TransitionTo(HostState.Stopped);
// Kick off the SDK's reconnect loop exactly once per drop. The handler handles its
// own retry cadence via ReconnectPeriod; we tear it down in OnReconnectComplete.
if (_reconnectHandler is not null) return;
// Kick off the SDK's reconnect loop exactly once per drop. Keep-alive callbacks fire
// from the SDK keep-alive timer thread and the SDK can fire this handler repeatedly
// while the channel stays down — the check-then-set must be atomic, otherwise two
// callbacks both observe null, both construct a SessionReconnectHandler, and the
// second assignment leaks the first (its retry loop keeps running, unreferenced and
// never disposed). Guard with _probeLock (Driver.OpcUaClient-005).
SessionReconnectHandler handler;
lock (_probeLock)
{
if (_reconnectHandler is not null || _disposed) return;
handler = new SessionReconnectHandler(telemetry: null!,
reconnectAbort: false,
maxReconnectPeriod: (int)TimeSpan.FromMinutes(2).TotalMilliseconds);
_reconnectHandler = handler;
}
_reconnectHandler = new SessionReconnectHandler(telemetry: null!,
reconnectAbort: false,
maxReconnectPeriod: (int)TimeSpan.FromMinutes(2).TotalMilliseconds);
var state = _reconnectHandler.BeginReconnect(
// BeginReconnect is started outside the lock — it does no _reconnectHandler mutation
// and we don't want to hold _probeLock across an SDK call.
handler.BeginReconnect(
sender,
(int)_options.ReconnectPeriod.TotalMilliseconds,
OnReconnectComplete);
@@ -1345,16 +1480,88 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
}
Session = newSession;
_reconnectHandler?.Dispose();
_reconnectHandler = null;
// Whether the reconnect actually succeeded depends on whether the session is
// non-null + connected. When it succeeded, flip back to Running so downstream
// consumers see recovery.
// Retire the handler that just finished. Done under _probeLock so this can't race
// OnKeepAlive arming a fresh handler for a subsequent drop (Driver.OpcUaClient-005).
lock (_probeLock)
{
if (ReferenceEquals(_reconnectHandler, handler))
{
_reconnectHandler.Dispose();
_reconnectHandler = null;
}
}
if (newSession is not null)
{
// Reconnect succeeded. Rebuild the namespace map from the *new* session: the
// remote server may have reordered its namespace table across the restart that
// caused the drop (driver-specs.md §8). Stable nsu= references stored in the
// address space re-resolve correctly against this fresh map.
_namespaceMap = NamespaceMap.FromSession(newSession);
TransitionTo(HostState.Running);
_health = new DriverHealth(DriverState.Healthy, DateTime.UtcNow, null);
return;
}
// The reconnect handler gave up — its retry loop exhausted the 2-minute
// maxReconnectPeriod and invoked the callback with a null Session. Without an
// explicit Faulted signal the driver is permanently wedged: no session, no live
// keep-alive to re-trigger OnKeepAlive, and the Core never learns it must offer an
// operator reinitialize (Driver.OpcUaClient-002). Surface Faulted so the Core fans
// out Bad quality and ReinitializeAsync becomes available, and arm a fresh reconnect
// attempt against the last-known session for an always-on gateway rather than
// abandoning recovery entirely.
TransitionTo(HostState.Faulted);
_health = new DriverHealth(
DriverState.Faulted, _health.LastSuccessfulRead,
"OPC UA session reconnect exhausted its retry window without recovering. " +
"The remote server is unreachable; reinitialize the driver once it is back.");
if (oldSession is not null && !_disposed)
TryRearmReconnect(handler, oldSession);
}
/// <summary>
/// Arm a fresh reconnect attempt after a previous handler gave up. The OPC UA Client
/// driver gateways an always-on remote server, so abandoning recovery permanently is
/// the wrong default — a new <see cref="SessionReconnectHandler"/> keeps retrying so
/// the driver self-heals when the remote returns, while the Faulted health set by the
/// caller still lets an operator force a clean reinitialize in the meantime.
/// </summary>
private void TryRearmReconnect(SessionReconnectHandler exhausted, ISession lastSession)
{
SessionReconnectHandler handler;
lock (_probeLock)
{
// Only re-arm if no other handler took over and we aren't shutting down.
if (_disposed || (_reconnectHandler is not null && !ReferenceEquals(_reconnectHandler, exhausted)))
return;
handler = new SessionReconnectHandler(telemetry: null!,
reconnectAbort: false,
maxReconnectPeriod: (int)TimeSpan.FromMinutes(2).TotalMilliseconds);
_reconnectHandler = handler;
}
try
{
handler.BeginReconnect(
lastSession,
(int)_options.ReconnectPeriod.TotalMilliseconds,
OnReconnectComplete);
}
catch
{
// If the SDK refuses to re-arm (e.g. the last session is fully torn down), drop
// the handler so a later operator ReinitializeAsync isn't blocked by a stale one.
lock (_probeLock)
{
if (ReferenceEquals(_reconnectHandler, handler))
{
_reconnectHandler.Dispose();
_reconnectHandler = null;
}
}
}
}