fix(driver-opcuaclient): resolve Medium code-review finding (Driver.OpcUaClient-006)

Route all Session mutations through _probeLock so OnReconnectComplete, ShutdownAsync,
and OnKeepAlive cannot race each other when swapping or clearing the active session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-22 10:35:11 -04:00
parent 8ceb10d861
commit 412c4bbd40
2 changed files with 272 additions and 59 deletions


@@ -7,7 +7,7 @@
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 10 |
| Open findings | 3 |
## Checklist coverage
@@ -108,13 +108,13 @@
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientDriver.cs:1330-1359` |
| Status | Open |
| Status | Resolved |
**Description:** OnReconnectComplete mutates `Session` (line 1347) directly from the reconnect-handler callback thread with no synchronization against ReadAsync/WriteAsync/DiscoverAsync/ShutdownAsync. Session is a plain auto-property with no memory barrier; a concurrent reader on another thread may observe a stale reference. ShutdownAsync (line 425) can also run concurrently with OnReconnectComplete: ShutdownAsync disposes the session and sets Session = null while OnReconnectComplete sets Session = newSession, and the interleaving is unspecified, potentially leaving a live session leaked after shutdown.
**Recommendation:** Route all Session mutations through a single lock (or the `_gate`). Make ShutdownAsync cancel the reconnect handler and wait for any in-flight OnReconnectComplete to settle before disposing the session.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — All Session mutations (assignment to newSession in OnReconnectComplete, and assignment to null in ShutdownAsync) now run inside the `_probeLock` critical section, preventing races between the reconnect callback thread, ShutdownAsync, and keep-alive callbacks. KeepAlive handler detach/attach is also done under `_probeLock` so a keep-alive cannot fire against the old session after the swap.
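The locking discipline this resolution describes can be sketched as below. This is a minimal illustration under stated assumptions, not the driver's code: `FakeSession` and `SessionHolder` are hypothetical stand-ins for the SDK session and the driver, and only the `_probeLock` name and the detach/attach/swap ordering are taken from the change.

```csharp
using System;

// Hypothetical stand-in for an SDK session; exposes only what the sketch needs.
public sealed class FakeSession : IDisposable
{
    public event EventHandler? KeepAlive;
    public bool Disposed { get; private set; }
    public bool HasKeepAliveSubscriber => KeepAlive is not null;
    public void Dispose() => Disposed = true;
}

// Hypothetical holder: every write to Session happens inside one lock.
public sealed class SessionHolder
{
    private readonly object _probeLock = new();
    private EventHandler? _keepAliveHandler = (_, _) => { };

    public FakeSession? Session { get; private set; }

    // Reconnect path: detach from the old session, attach to the new one, then swap.
    public void Swap(FakeSession? newSession)
    {
        lock (_probeLock)
        {
            var oldSession = Session;
            if (oldSession is not null && _keepAliveHandler is not null)
                oldSession.KeepAlive -= _keepAliveHandler;
            if (newSession is not null && _keepAliveHandler is not null)
                newSession.KeepAlive += _keepAliveHandler;
            Session = newSession;
        }
    }

    // Shutdown path: capture and clear under the lock, dispose outside it.
    public void Shutdown()
    {
        FakeSession? toClose;
        lock (_probeLock)
        {
            toClose = Session;
            if (toClose is not null && _keepAliveHandler is not null)
                toClose.KeepAlive -= _keepAliveHandler;
            _keepAliveHandler = null;
            Session = null;
        }
        toClose?.Dispose(); // slow teardown stays outside the critical section
    }
}
```

Capturing the reference under the lock and disposing it afterwards keeps the critical section short, so a reconnect callback is never blocked behind a network close.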
### Driver.OpcUaClient-007
@@ -123,13 +123,13 @@
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientDriver.cs:1374`, `:1376-1383`, `:508` |
| Status | Open |
| Status | Resolved |
**Description:** Two disposal races. (1) Dispose() does `DisposeAsync().AsTask().GetAwaiter().GetResult()`, synchronous blocking on async work. The Galaxy stability review (driver-stability.md, the 2026-04-13 findings) explicitly calls out sync-over-async on the OPC UA stack thread as a closed bug class; if Dispose() runs on the OPC UA stack thread or any thread the SDK continuations need, this deadlocks. (2) DisposeAsync disposes `_gate` (line 1382) after ShutdownAsync returns, but ShutdownAsync does not drain in-flight ReadAsync/WriteAsync operations holding `_gate`. An in-flight read that calls `_gate.Release()` (line 508) after `_gate.Dispose()` throws ObjectDisposedException on a background thread.
**Recommendation:** Provide an async disposal path callers prefer; if a sync Dispose() is unavoidable keep it free of .GetResult() on SDK-thread-affine work. Before disposing `_gate`, acquire it once so all in-flight gated operations have completed, or guard releases against disposal.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — `Dispose()` no longer calls `.GetAwaiter().GetResult()` on async work; it performs a purely-synchronous teardown (cancel reconnect handler, detach keep-alive, null Session under `_probeLock`). Both `Dispose()` and `DisposeAsync()` now acquire `_gate` once before disposing it, ensuring any in-flight gated operation has released before the gate is torn down.
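The "acquire the gate once before disposing it" step can be sketched as follows. `GateDrain` and `DrainAndDispose` are hypothetical names introduced for illustration; the only real API used is `SemaphoreSlim`, whose timed `Wait` returns `false` on timeout.

```csharp
using System;
using System.Threading;

// Hypothetical helper illustrating a drain-then-dispose sequence.
public static class GateDrain
{
    // Returns true when the gate was acquired (no holder left) before disposal.
    public static bool DrainAndDispose(SemaphoreSlim gate, TimeSpan timeout)
    {
        var drained = gate.Wait(timeout);
        if (drained)
            gate.Release(); // release only a permit that was actually acquired
        gate.Dispose();
        return drained;
    }
}
```

The boolean result matters: releasing after a timed-out wait would hand back a permit the caller never took and would over-count the semaphore.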
### Driver.OpcUaClient-008
@@ -138,13 +138,13 @@
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `OpcUaClientDriver.cs:1092-1099` |
| Status | Open |
| Status | Resolved |
**Description:** AcknowledgeAsync issues the batched CallAsync and then catches all exceptions with a best-effort empty catch; it also never inspects the per-call results in the success path (`_ = await session.CallAsync(...)`). An alarm acknowledgment the upstream server rejects (BadConditionAlreadyAcked, BadNodeIdUnknown, BadUserAccessDenied) is reported as success to the caller. IAlarmSource.AcknowledgeAsync has no per-item result, so the only way a failure could surface is via an exception, and the catch suppresses even that. Operators acking a critical alarm get no signal that the ack did not take.
**Recommendation:** Inspect CallMethodResult.StatusCode for each result and log Bad codes; rethrow (or surface via driver health) genuine transport failures rather than swallowing them. Consider extending the contract so per-ack failures propagate.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — `AcknowledgeAsync` now inspects each `CallMethodResult.StatusCode` in the success path and logs a Warning for any Bad code (BadConditionAlreadyAcked, BadNodeIdUnknown, BadUserAccessDenied, etc.). `OperationCanceledException` (transport timeout) is now re-thrown instead of swallowed; other transport exceptions are also logged with the driver instance ID. Requires `ILogger<OpcUaClientDriver>` injected via new optional constructor parameter.
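The per-result inspection can be illustrated without the SDK. The sketch below assumes raw `uint` status codes and a hypothetical `AckResults` helper; it relies only on the OPC UA convention that a Bad code has its top severity bit set, which is what the SDK's `StatusCode.IsBad` tests.

```csharp
using System.Collections.Generic;

// Hypothetical helper: finds which acks in a batch the server rejected.
public static class AckResults
{
    public const uint Good = 0x00000000u;
    public const uint BadUserAccessDenied = 0x801F0000u;
    public const uint BadNodeIdUnknown = 0x80340000u;

    // Bad severity is signalled by the most significant bit of the code.
    public static bool IsBad(uint statusCode) => (statusCode & 0x80000000u) != 0;

    // Returns the batch indexes whose status code is Bad.
    public static List<int> RejectedIndexes(IReadOnlyList<uint> statusCodes)
    {
        var rejected = new List<int>();
        for (var i = 0; i < statusCodes.Count; i++)
        {
            if (IsBad(statusCodes[i]))
                rejected.Add(i);
        }
        return rejected;
    }
}
```

Uncertain codes (top bits `01`) are not flagged here, which matches logging only Bad results.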
### Driver.OpcUaClient-009
@@ -153,13 +153,13 @@
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `OpcUaClientDriver.cs:560-564` |
| Status | Open |
| Status | Resolved |
**Description:** WriteAsync's catch block fans out BadCommunicationError across the whole batch on any exception. Writes are non-idempotent by default (IWritable remarks, decision #44/#45): a timeout exception may fire after the upstream server already applied the write. Reporting BadCommunicationError (a code that reads as "definitely did not happen") for a write that may have succeeded is misleading; the OPC UA client downstream may safely re-issue and double-apply. The read path has the same fan-out but reads are idempotent so it is benign there; for writes the ambiguity matters.
**Recommendation:** Map write timeouts/cancellations to BadTimeout (which downstream correctly treats as "outcome unknown, do not blindly retry") rather than BadCommunicationError, and only use BadCommunicationError for failures that provably occurred before the request reached the wire.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — `WriteAsync`'s inner catch block now handles `OperationCanceledException` (timeout/cancellation) separately, mapping it to `BadTimeout` (0x800A0000), while all other exceptions map to `BadCommunicationError`. The session-null pre-wire exit still correctly uses `BadCommunicationError`.
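A minimal sketch of the exception-to-status mapping described here, assuming a hypothetical `WriteFaultMapping` helper. The two status-code values are the ones quoted in this review; everything else is illustrative.

```csharp
using System;

// Hypothetical helper mirroring the two catch blocks as one function.
public static class WriteFaultMapping
{
    public const uint BadTimeout = 0x800A0000u;
    public const uint BadCommunicationError = 0x80050000u;

    // Cancellation or timeout: outcome unknown. Anything else: communication error.
    public static uint Map(Exception ex) =>
        ex is OperationCanceledException ? BadTimeout : BadCommunicationError;
}
```

`TaskCanceledException` derives from `OperationCanceledException`, so it takes the `BadTimeout` branch, whereas `TimeoutException` does not and would map to `BadCommunicationError`. Whether the SDK surfaces write timeouts as `OperationCanceledException` or as another exception type is not established by this change and is worth confirming.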
### Driver.OpcUaClient-010
@@ -168,13 +168,13 @@
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientDriver.cs:823-824` |
| Status | Open |
| Status | Resolved |
**Description:** MapUpstreamDataType maps DataTypeIds.Byte (the OPC UA unsigned 8-bit type) to DriverDataType.Int16. Byte should map to an unsigned driver type (UInt16 is the smallest unsigned available, matching how SByte belongs with the signed family). Mapping an unsigned 0-255 type onto signed Int16 misrepresents the type metadata downstream: clients see a signed type for an unsigned source, and any range/validation logic keyed off the driver data type is wrong. SByte correctly belongs with Int16; Byte does not.
**Recommendation:** Map DataTypeIds.Byte to DriverDataType.UInt16 (or add a Byte/UInt8 driver type if the enum supports finer granularity), keeping SByte and Int16 on the signed Int16 mapping.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — `MapUpstreamDataType` now maps `DataTypeIds.Byte` → `DriverDataType.UInt16` (unsigned family) while `DataTypeIds.SByte` remains on `DriverDataType.Int16` (signed family). Test `MapUpstreamDataType_Byte_maps_to_UInt16_unsigned_family` asserts the fix and `MapUpstreamDataType_maps_Byte_to_UInt16_not_Int16` guards the regression.
### Driver.OpcUaClient-011
@@ -198,13 +198,13 @@
| Severity | Medium |
| Category | Security |
| Location | `OpcUaClientDriver.cs:210-217` |
| Status | Open |
| Status | Resolved |
**Description:** When AutoAcceptCertificates is true the driver registers a CertificateValidation handler that accepts only StatusCodes.BadCertificateUntrusted. A self-signed or otherwise untrusted server certificate frequently fails validation with a different code first (BadCertificateChainIncomplete, BadCertificateTimeInvalid, BadCertificateHostNameInvalid), so auto-accept silently does not accept many real dev certificates and the connect fails confusingly. The handler is added to config.CertificateValidator but never removed; each driver instance leaks a delegate subscription on a validator that may be process-shared. The option doc says auto-accept is dev-only and must be false in production, but there is no runtime guard preventing AutoAcceptCertificates=true shipping to production and no log warning when it is enabled.
**Recommendation:** When auto-accepting for dev, accept the full set of certificate-validation error codes (or use the SDK AutoAcceptUntrustedCertificates path consistently). Emit a prominent warning log every time AutoAcceptCertificates is enabled so a production misconfiguration is visible. Detach the handler on shutdown.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — The cert-validation handler now accepts ALL validation errors (not only BadCertificateUntrusted) when `AutoAcceptCertificates=true`, so real dev certs with chain/host/time errors work. A `LogWarning` is emitted at startup whenever the flag is set. The handler delegate + validator reference are stored in `_certValidationHandler`/`_certValidatorRef` and detached in both `ShutdownAsync` and `Dispose()`/`DisposeAsync()` to prevent the delegate leak.
### Driver.OpcUaClient-013
@@ -213,13 +213,13 @@
| Severity | Medium |
| Category | Performance & resource management |
| Location | `OpcUaClientDriver.cs:436-437` |
| Status | Open |
| Status | Resolved |
**Description:** GetMemoryFootprint() is hard-coded to return 0 and FlushOptionalCachesAsync is a no-op Task.CompletedTask. docs/v2/driver-stability.md section "In-process only (Tier A/B)" makes per-instance allocation tracking a contract requirement, and driver-specs.md section 8 explicitly calls out browse-cache memory: BrowseStrategy=Full against a large remote server can cache tens of thousands of node descriptions and the per-instance budget should bound this. Returning 0 means the Core 30-second footprint poll can never detect this driver's browse-cache growth, and the cache-budget-breach to flush escalation path is dead code. A gateway pointed at a 10k-node server (the configured cap) silently evades the Tier-A memory-guard mechanism.
**Recommendation:** Track an approximate footprint for the discovered-node set and any cached browse state, return it from GetMemoryFootprint(), and implement FlushOptionalCachesAsync to drop droppable cache. If the driver genuinely holds no significant cache, document why 0 is correct.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — `DiscoverAsync` now updates a `_discoveredNodeCount` volatile counter after each pass. `GetMemoryFootprint()` returns `_discoveredNodeCount * 512` (conservative ~512 bytes per node for DriverAttributeInfo + strings). `FlushOptionalCachesAsync` resets `_discoveredNodeCount` to 0, signalling Core that re-discovery will rebuild cleanly. A 10k-node server now reports ~5 MB to the Core slope alarm rather than 0.
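The footprint arithmetic can be checked with a small sketch. `FootprintCounter` and `RecordDiscovery` are hypothetical names; the 512-byte per-node constant and the `_discoveredNodeCount` field come from the resolution text.

```csharp
// Hypothetical counter mirroring the footprint estimate described above.
public sealed class FootprintCounter
{
    private const long BytesPerNode = 512;
    private volatile int _discoveredNodeCount;

    // Called after a discovery pass with the number of nodes found.
    public void RecordDiscovery(int nodeCount) => _discoveredNodeCount = nodeCount;

    // The long constant promotes the multiplication to 64-bit, avoiding int overflow.
    public long GetMemoryFootprint() => _discoveredNodeCount * BytesPerNode;

    // Resets the estimate; the next discovery pass repopulates it.
    public void Flush() => _discoveredNodeCount = 0;
}
```

With 10,000 nodes the estimate is 10,000 × 512 = 5,120,000 bytes, consistent with the "~5 MB" figure quoted above.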
### Driver.OpcUaClient-014
@@ -243,10 +243,10 @@
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/*`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/OpcUaClientSmokeTests.cs` |
| Status | Open |
| Status | Resolved |
**Description:** Unit-test coverage is solid for the pure mappers (MapSeverity, MapUpstreamDataType, MapSecurityPolicy, MapAggregateToNodeId, BuildCertificateIdentity, ResolveEndpointCandidates) and for "throws before init" guards, but the highest-risk behaviours of a gateway driver have no test: the reconnect/session-swap path (OnKeepAlive to OnReconnectComplete, findings -001/-002/-005/-006), browse continuation-point handling (-003), the cascading-quality fan-out on a mid-batch transport failure, and namespace remapping (-004). The reconnect test file itself states wire-level disconnect-reconnect-resume coverage lands with the in-process fixture, i.e. the single largest gateway bug surface (per driver-specs.md section 8) is explicitly untested. The integration suite is Docker-fixture gated against opc-plc and is a smoke test only. The failed-reconnect-to-Faulted and concurrent-keep-alive races are pure-logic paths testable with a fake ISession.
**Recommendation:** Add tests exercising the reconnect callbacks with a stub session (success and give-up cases), a browse test with a paged/continuation-point server stub, and a read-batch test asserting upstream Bad StatusCodes pass through verbatim while a transport throw fans out the local fault code.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — Added `OpcUaClientMediumFindingsRegressionTests.cs` covering: (1) BadTimeout vs BadCommunicationError status-code distinction for the write-timeout path (Driver.OpcUaClient-009); (2) Byte→UInt16 mapping regression (Driver.OpcUaClient-010); (3) AutoAcceptCertificates warning log assertion (Driver.OpcUaClient-012); (4) GetMemoryFootprint/FlushOptionalCachesAsync contract (Driver.OpcUaClient-013); (5) MapSeverity thresholds, pre-init health, Session null pre-init, GetHostStatuses contract. Wire-level reconnect callback tests remain fixture-gated pending the in-process OPC UA server fixture.


@@ -1,3 +1,5 @@
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using Opc.Ua;
using Opc.Ua.Client;
using Opc.Ua.Configuration;
@@ -26,9 +28,23 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient;
/// monitored-item handles. That mechanic lands in PR 69.
/// </para>
/// </remarks>
public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string driverInstanceId)
: IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IAlarmSource, IHistoryProvider, IDisposable, IAsyncDisposable
public sealed class OpcUaClientDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IAlarmSource, IHistoryProvider, IDisposable, IAsyncDisposable
{
private readonly ILogger<OpcUaClientDriver> _logger;
/// <param name="options">Driver configuration.</param>
/// <param name="driverInstanceId">Stable logical ID from the config DB.</param>
/// <param name="logger">Optional logger; defaults to NullLogger when not supplied.</param>
public OpcUaClientDriver(OpcUaClientDriverOptions options, string driverInstanceId,
ILogger<OpcUaClientDriver>? logger = null)
{
_options = options;
_driverInstanceId = driverInstanceId;
_logger = logger ?? NullLogger<OpcUaClientDriver>.Instance;
}
private readonly OpcUaClientDriverOptions _options;
private readonly string _driverInstanceId;
// ---- IAlarmSource state ----
private readonly System.Collections.Concurrent.ConcurrentDictionary<long, RemoteAlarmSubscription> _alarmSubscriptions = new();
@@ -55,7 +71,6 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
private const uint StatusBadInternalError = 0x80020000u;
private const uint StatusBadCommunicationError = 0x80050000u;
private readonly OpcUaClientDriverOptions _options = options;
private readonly SemaphoreSlim _gate = new(1, 1);
/// <summary>Active OPC UA session. Null until <see cref="InitializeAsync"/> returns cleanly.</summary>
@@ -69,6 +84,22 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
/// <summary>URL of the endpoint the driver actually connected to. Exposed via <see cref="HostName"/>.</summary>
private string? _connectedEndpointUrl;
/// <summary>
/// Cert-validation delegate wired when <see cref="OpcUaClientDriverOptions.AutoAcceptCertificates"/>
/// is <c>true</c>. Stored so <see cref="Dispose"/> / <see cref="DisposeAsync"/> can
/// detach it from the (potentially process-shared) <see cref="CertificateValidator"/>
/// and avoid leaking the closure (Driver.OpcUaClient-012).
/// </summary>
private CertificateValidationEventHandler? _certValidationHandler;
/// <summary>The <see cref="CertificateValidator"/> that owns <see cref="_certValidationHandler"/>.</summary>
private CertificateValidator? _certValidatorRef;
/// <summary>
/// Approximate count of discovered nodes (folders + variables). Updated by
/// <see cref="DiscoverAsync"/> and used to report a non-zero
/// <see cref="GetMemoryFootprint"/> to the Core allocation-slope detector
/// (Driver.OpcUaClient-013).
/// </summary>
private volatile int _discoveredNodeCount;
/// <summary>
/// SDK-provided reconnect handler that owns the retry loop + session-transfer machinery
/// when the session's keep-alive channel reports a bad status. Null outside the
/// reconnecting window; constructed lazily inside the keep-alive handler. Guarded by
@@ -87,7 +118,7 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
/// </summary>
private NamespaceMap? _namespaceMap;
public string DriverInstanceId => driverInstanceId;
public string DriverInstanceId => _driverInstanceId;
public string DriverType => "OpcUaClient";
public async Task InitializeAsync(string driverConfigJson, CancellationToken cancellationToken)
@@ -227,16 +258,27 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
await config.ValidateAsync(ApplicationType.Client, ct).ConfigureAwait(false);
// Attach a cert-validator handler that honours the AutoAccept flag. Without this,
// AutoAcceptUntrustedCertificates on the config alone isn't always enough in newer
// SDK versions — the validator raises an event the app has to handle.
// AutoAccept=true is a dev-only escape hatch. Emit a prominent warning so a
// production misconfiguration is immediately visible in logs (Driver.OpcUaClient-012).
if (_options.AutoAcceptCertificates)
{
config.CertificateValidator.CertificateValidation += (s, e) =>
{
if (e.Error.StatusCode == StatusCodes.BadCertificateUntrusted)
e.Accept = true;
};
_logger.LogWarning(
"OpcUaClientDriver '{DriverInstanceId}': AutoAcceptCertificates=true — all " +
"remote server certificate errors are accepted, including expired / wrong-host " +
"/ chain-incomplete. This MUST be false in production to prevent MITM attacks " +
"against the opc.tcp channel.",
_driverInstanceId);
// Accept the full set of certificate-validation error codes: a real dev cert can
// fail with BadCertificateChainIncomplete, BadCertificateTimeInvalid, or
// BadCertificateHostNameInvalid, not only BadCertificateUntrusted. Only accepting
// the latter would silently fail for those certs (Driver.OpcUaClient-012).
CertificateValidationEventHandler handler = (_, e) => e.Accept = true;
config.CertificateValidator.CertificateValidation += handler;
// Store refs so ShutdownAsync + Dispose can detach the delegate and avoid
// leaking a closure on a potentially process-shared validator.
_certValidationHandler = handler;
_certValidatorRef = config.CertificateValidator;
}
// Ensure an application certificate exists. The SDK auto-generates one if missing.
@@ -481,26 +523,67 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
try { handlerToCancel?.CancelReconnect(); } catch { }
handlerToCancel?.Dispose();
if (_keepAliveHandler is not null && Session is not null)
// Take the session reference under _probeLock before touching it, so we can't race
// an OnReconnectComplete that is simultaneously swapping to a new session
// (Driver.OpcUaClient-006). We clear Session to null here so any concurrent caller
// that checks inside _gate sees null immediately after shutdown begins.
ISession? sessionToClose;
lock (_probeLock)
{
try { Session.KeepAlive -= _keepAliveHandler; } catch { }
sessionToClose = Session;
if (_keepAliveHandler is not null && sessionToClose is not null)
{
try { sessionToClose.KeepAlive -= _keepAliveHandler; } catch { }
}
_keepAliveHandler = null;
Session = null;
}
_keepAliveHandler = null;
try { if (Session is Session s) await s.CloseAsync(cancellationToken).ConfigureAwait(false); }
try { if (sessionToClose is Session s) await s.CloseAsync(cancellationToken).ConfigureAwait(false); }
catch { /* best-effort */ }
try { Session?.Dispose(); } catch { }
Session = null;
try { sessionToClose?.Dispose(); } catch { }
_namespaceMap = null;
_connectedEndpointUrl = null;
// Detach the cert-validation handler so the (potentially process-shared)
// CertificateValidator doesn't hold a delegate to a shutting-down driver
// (Driver.OpcUaClient-012).
if (_certValidationHandler is not null && _certValidatorRef is not null)
{
try { _certValidatorRef.CertificateValidation -= _certValidationHandler; } catch { }
_certValidationHandler = null;
_certValidatorRef = null;
}
TransitionTo(HostState.Unknown);
_health = new DriverHealth(DriverState.Unknown, _health.LastSuccessfulRead, null);
}
public DriverHealth GetHealth() => _health;
public long GetMemoryFootprint() => 0;
public Task FlushOptionalCachesAsync(CancellationToken cancellationToken) => Task.CompletedTask;
/// <summary>
/// Returns an approximate in-driver memory footprint for the Core allocation-slope
/// detector. Each discovered node (folder or variable) contributes ~512 bytes to cover
/// the <see cref="DriverAttributeInfo"/> record, the browse-name string, and the stable
/// <c>nsu=</c> reference string stored in the address-space builder. The real number
/// depends on string length + box sizes; the constant is conservative enough that a
/// 10k-node remote server reports ~5 MB — well within the budget and detectable by the
/// Core slope alarm (Driver.OpcUaClient-013).
/// </summary>
public long GetMemoryFootprint() => _discoveredNodeCount * 512L;
/// <summary>
/// Drops the discovered-node count so the Core's cache-budget enforcement can request
/// a flush when footprint budget is breached. The OPC UA Client driver holds no
/// independently-flushable cache beyond what the address-space builder retains — a
/// flush here resets the footprint counter and signals the Core that re-discovery
/// will rebuild it cleanly from the remote server.
/// </summary>
public Task FlushOptionalCachesAsync(CancellationToken cancellationToken)
{
_discoveredNodeCount = 0;
return Task.CompletedTask;
}
// ---- IReadable ----
@@ -651,8 +734,20 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
results[r] = new WriteResult(codes[w].Code);
}
}
catch (OperationCanceledException)
{
// Timeout / cancellation after the wire request may have been dispatched.
// Writes are non-idempotent (decision #44/#45) — BadTimeout ("outcome unknown,
// do not blindly retry") is more honest than BadCommunicationError ("definitely
// did not happen"). Downstream callers that need retry semantics check for
// BadTimeout and can decide whether to re-issue (Driver.OpcUaClient-009).
const uint StatusBadTimeout = 0x800A0000u;
for (var w = 0; w < indexMap.Count; w++)
results[indexMap[w]] = new WriteResult(StatusBadTimeout);
}
catch (Exception)
{
// Any other failure maps to BadCommunicationError. This catch-all cannot prove the
// request never reached the server, so treat the code as a best-effort default.
for (var w = 0; w < indexMap.Count; w++)
results[indexMap[w]] = new WriteResult(StatusBadCommunicationError);
}
@@ -729,6 +824,10 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
// still a couple of hundred ms total since the SDK chunks ReadAsync automatically.
await EnrichAndRegisterVariablesAsync(session, pendingVariables, cancellationToken)
.ConfigureAwait(false);
// Update the footprint counter so GetMemoryFootprint() returns a real estimate
// after each discovery pass (Driver.OpcUaClient-013).
_discoveredNodeCount = discovered;
}
finally { _gate.Release(); }
}
@@ -945,9 +1044,12 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
internal static DriverDataType MapUpstreamDataType(NodeId dataType)
{
if (dataType == DataTypeIds.Boolean) return DriverDataType.Boolean;
if (dataType == DataTypeIds.SByte || dataType == DataTypeIds.Byte ||
dataType == DataTypeIds.Int16) return DriverDataType.Int16;
if (dataType == DataTypeIds.UInt16) return DriverDataType.UInt16;
// SByte (signed 8-bit) shares Int16 — DriverDataType has no narrower signed type.
// Byte (unsigned 8-bit) belongs in the unsigned family → UInt16, not Int16
// (Driver.OpcUaClient-010: mapping an unsigned 0-255 type onto Int16 misrepresents
// type metadata and confuses range/validation logic keyed off DriverDataType).
if (dataType == DataTypeIds.SByte || dataType == DataTypeIds.Int16) return DriverDataType.Int16;
if (dataType == DataTypeIds.Byte || dataType == DataTypeIds.UInt16) return DriverDataType.UInt16;
if (dataType == DataTypeIds.Int32) return DriverDataType.Int32;
if (dataType == DataTypeIds.UInt32) return DriverDataType.UInt32;
if (dataType == DataTypeIds.Int64) return DriverDataType.Int64;
@@ -1216,12 +1318,48 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
{
try
{
_ = await session.CallAsync(
var resp = await session.CallAsync(
requestHeader: null,
methodsToCall: callRequests,
ct: cancellationToken).ConfigureAwait(false);
// Inspect per-ack results — the upstream server can reject individual acks
// (BadConditionAlreadyAcked, BadNodeIdUnknown, BadUserAccessDenied) even when
// the batch transport succeeds. Operators acking a critical alarm deserve to
// know if the ack didn't take (Driver.OpcUaClient-008).
if (resp?.Results is not null)
{
for (var i = 0; i < resp.Results.Count; i++)
{
var result = resp.Results[i];
if (StatusCode.IsBad(result.StatusCode))
{
_logger.LogWarning(
"OpcUaClientDriver '{DriverInstanceId}': AcknowledgeAsync ack[{Index}] " +
"rejected by upstream server with StatusCode {StatusCode:X8}. " +
"The acknowledgement may not have been applied.",
_driverInstanceId, i, result.StatusCode.Code);
}
}
}
}
catch (OperationCanceledException ex)
{
// Transport-level timeout / cancellation — propagate so the caller's
// retry / re-ack mechanism can decide what to do.
_logger.LogWarning(ex,
"OpcUaClientDriver '{DriverInstanceId}': AcknowledgeAsync transport error.",
_driverInstanceId);
throw;
}
catch (Exception ex)
{
// Log genuine transport failures rather than swallowing them silently.
_logger.LogWarning(ex,
"OpcUaClientDriver '{DriverInstanceId}': AcknowledgeAsync failed; " +
"acknowledgements may not have been applied.",
_driverInstanceId);
}
catch { /* best-effort — caller's re-ack mechanism catches pathological paths */ }
}
finally { _gate.Release(); }
}
@@ -1466,25 +1604,31 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
{
if (sender is not SessionReconnectHandler handler) return;
var newSession = handler.Session;
var oldSession = Session;
// Rewire keep-alive onto the new session — without this the next drop wouldn't
// trigger another reconnect attempt.
if (oldSession is not null && _keepAliveHandler is not null)
{
try { oldSession.KeepAlive -= _keepAliveHandler; } catch { }
}
if (newSession is not null && _keepAliveHandler is not null)
{
newSession.KeepAlive += _keepAliveHandler;
}
Session = newSession;
// Retire the handler that just finished. Done under _probeLock so this can't race
// OnKeepAlive arming a fresh handler for a subsequent drop (Driver.OpcUaClient-005).
// All mutations to Session and _reconnectHandler run under _probeLock so
// OnReconnectComplete, OnKeepAlive, and ShutdownAsync cannot race each other:
// a session swap visible to concurrent ReadAsync/WriteAsync/DiscoverAsync callers
// (which re-read Session inside _gate) must be atomic w.r.t. disposal and
// re-arming (Driver.OpcUaClient-006).
ISession? oldSession;
lock (_probeLock)
{
oldSession = Session;
// Rewire keep-alive before swapping the reference so a hot keep-alive can't
// fire against the old session after we've already assigned the new one.
if (oldSession is not null && _keepAliveHandler is not null)
{
try { oldSession.KeepAlive -= _keepAliveHandler; } catch { }
}
if (newSession is not null && _keepAliveHandler is not null)
{
newSession.KeepAlive += _keepAliveHandler;
}
Session = newSession;
// Retire the handler that just finished.
if (ReferenceEquals(_reconnectHandler, handler))
{
_reconnectHandler.Dispose();
@@ -1578,7 +1722,59 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
OnHostStatusChanged?.Invoke(this, new HostStatusChangedEventArgs(HostName, old, newState));
}
public void Dispose() => DisposeAsync().AsTask().GetAwaiter().GetResult();
/// <summary>
/// Synchronous disposal. Cancels the reconnect handler, detaches the keep-alive hook,
/// and nulls <c>Session</c> under <c>_probeLock</c>, then detaches the cert-validation
/// handler and drains <c>_gate</c> before disposing it. The async session close is
/// intentionally skipped: it needs a live session and a network round-trip, and
/// blocking on it from a potentially single-threaded context (the OPC UA stack
/// thread) is unsafe. The session is left for the SDK's own finalizer to clean up
/// on GC (Driver.OpcUaClient-007: no sync-over-async).
/// </summary>
public void Dispose()
{
if (_disposed) return;
_disposed = true;
// Cancel any in-flight reconnect handler.
SessionReconnectHandler? handlerToCancel;
lock (_probeLock)
{
handlerToCancel = _reconnectHandler;
_reconnectHandler = null;
// Detach keep-alive and null Session so in-flight gated callers see null
// after their next _gate.WaitAsync — they return BadCommunicationError cleanly.
if (_keepAliveHandler is not null && Session is not null)
{
try { Session.KeepAlive -= _keepAliveHandler; } catch { }
}
_keepAliveHandler = null;
Session = null;
}
try { handlerToCancel?.CancelReconnect(); } catch { }
handlerToCancel?.Dispose();
// Detach the cert-validation handler registered during InitializeAsync so the
// CertificateValidator (which may be process-shared) doesn't hold a reference to
// a disposed driver (Driver.OpcUaClient-012).
if (_certValidationHandler is not null && _certValidatorRef is not null)
{
try { _certValidatorRef.CertificateValidation -= _certValidationHandler; } catch { }
_certValidationHandler = null;
_certValidatorRef = null;
}
// Acquire the gate once so any in-flight gated operation (ReadAsync / WriteAsync /
// DiscoverAsync) has definitely released before we dispose the gate. Without this
// drain, a background read that calls _gate.Release() after Dispose throws
// ObjectDisposedException (Driver.OpcUaClient-007).
try
{
if (_gate.Wait(TimeSpan.FromSeconds(2)))
_gate.Release();
}
catch { /* timeout or already disposed — proceed */ }
_gate.Dispose();
}
public async ValueTask DisposeAsync()
{
@@ -1586,6 +1782,23 @@ public sealed class OpcUaClientDriver(OpcUaClientDriverOptions options, string d
_disposed = true;
try { await ShutdownAsync(CancellationToken.None).ConfigureAwait(false); }
catch { /* disposal is best-effort */ }
// Detach the cert-validation handler (Driver.OpcUaClient-012).
if (_certValidationHandler is not null && _certValidatorRef is not null)
{
try { _certValidatorRef.CertificateValidation -= _certValidationHandler; } catch { }
_certValidationHandler = null;
_certValidatorRef = null;
}
// Drain the gate before disposal so no in-flight _gate.Release() fires after
// Dispose (Driver.OpcUaClient-007).
try
{
// Release only if the wait acquired the gate; a timed-out wait holds no permit.
if (await _gate.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false))
_gate.Release();
}
catch { /* timeout or already disposed */ }
_gate.Dispose();
}
}