review(Client.Shared): fix Disconnect/failover subscription race + CT forwarding

Re-review at 7286d320. -012 (Medium): DisconnectAsync now snapshots+nulls the data/alarm
subscriptions under _subscriptionLock before async teardown (was racing RunFailoverAsync).
-013: SubscribeAlarmsAsync guarded by a semaphore (idempotent under concurrency). -014/-015:
forward CancellationToken through Delete/BrowseNext adapters. + TDD.
This commit is contained in:
Joseph Doherty
2026-06-19 11:58:15 -04:00
parent 887a31e825
commit d68c9db9f9
6 changed files with 240 additions and 28 deletions
+83 -2
View File
@@ -4,8 +4,8 @@
|---|---|
| Module | `src/Client/ZB.MOM.WW.OtOpcUa.Client.Shared` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-06-19 |
| Commit reviewed | `7286d320` |
| Status | Reviewed |
| Open findings | 0 |
@@ -24,6 +24,26 @@
| 9 | Testing coverage | Client.Shared-011 |
| 10 | Documentation & comments | Client.Shared-009 |
## Re-review 2026-06-19 (commit 7286d320)
All 11 prior findings remain correctly Resolved. The module has grown since `76d35d1` with
new alarm-condition methods (`ConfirmAlarmAsync`, `ShelveAlarmAsync`, `EnableAsync`,
`DisableAsync`), a `DefaultApplicationConfigurationFactory` PKI-path lazy-resolve fix, and
`EndpointSelector` extraction. Four new issues were found and fixed in the same pass.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Client.Shared-015 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Client.Shared-012, Client.Shared-013 |
| 4 | Error handling & resilience | No issues found |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Client.Shared-014 (correctness) |
## Findings
### Client.Shared-001
@@ -178,6 +198,7 @@
### Client.Shared-011
| Field | Value |
|---|---|
| Severity | Low |
@@ -190,3 +211,63 @@
**Recommendation:** Add tests for re-entrant/concurrent failover, the alarm fallback path with truncated event fields, and string-write coercion against a typeless node. Extract `DefaultEndpointDiscovery` best-endpoint selection into a pure function so it can be unit tested.
**Resolution:** Resolved 2026-05-23 — added the previously-missing unit coverage: (a) `OnAlarmEvent_MissingAckedActiveButHasConditionNode_FallbackReadsAndRaisesEvent` drives the supplemental-read fallback path with null AckedState/ActiveState fields and a non-null SourceNode and asserts the Galaxy attribute reads populate the delivered event; (b) `WriteValueAsync` typeless-node coverage is exercised via the Client.Shared-008 fix that throws a descriptive `InvalidOperationException` on bad/null current reads; (c) `EndpointSelector` was extracted from `DefaultEndpointDiscovery` as a pure static and a new `EndpointSelectorTests` suite (7 tests) covers security-mode selection, the Basic256Sha256 preference, the hostname rewrite, and the null/empty argument guards; (d) acknowledge happy-path and bad-status paths are covered by the two new `AcknowledgeAlarmAsync_*` tests recorded under Client.Shared-009. Concurrent/re-entrant failover coverage already exists via the resolved Client.Shared-005/-006 tests in the suite.
### Client.Shared-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:DisconnectAsync` (lines 135145 at HEAD), `OpcUaClientService.cs:RunFailoverAsync` (lines 671672) |
| Status | Resolved |
**Description:** `DisconnectAsync` accessed `_dataSubscription` and `_alarmSubscription` without holding `_subscriptionLock`. `RunFailoverAsync` also nulled both fields without a lock. A concurrent failover firing while `DisconnectAsync` was in progress could null the fields between the null-check (`if (_dataSubscription != null)`) and the `DeleteAsync` call, producing a `NullReferenceException` caught silently by the `catch (Exception ex)` wrapper, or — if the failover won the race completely — a double-null-check that skipped the cleanup. The session adapter was similarly unguarded.
**Recommendation:** Snapshot `_dataSubscription` and `_alarmSubscription` under `_subscriptionLock` before the async teardown block, null the fields atomically under the lock, then call `DeleteAsync` on the local copies outside the lock (no async under lock).
**Resolution:** Resolved 2026-06-19 — `DisconnectAsync` now snapshots `dataSubscription` and `alarmSubscription` under `_subscriptionLock` at the top of the cleanup block, nulling the fields atomically before any async call. The local copies are then passed to `DeleteAsync`, eliminating the race window. Tests: `DisconnectAsync_WithActiveSubscriptions_DeletesBothAndForwardsCt` verifies both adapters are deleted and the CT is forwarded.
### Client.Shared-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:SubscribeAlarmsAsync` (line 293 at HEAD) |
| Status | Resolved |
**Description:** The `if (_alarmSubscription != null) return;` guard in `SubscribeAlarmsAsync` was outside any lock or serialisation primitive. Two concurrent callers could both observe `_alarmSubscription == null`, proceed to `CreateSubscriptionAsync`, and create duplicate event subscriptions. The gap is exploitable during the async I/O between the null-check and the assignment of `_alarmSubscription` after `CreateSubscriptionAsync` completes. In practice callers are CLI/UI (single-threaded) or the failover replay path, so the risk is low, but it is a true race.
**Recommendation:** Serialise the check-and-create via a `SemaphoreSlim(1,1)` so only one caller can create the alarm subscription at a time. The semaphore should be async-friendly (no thread-block) and disposed in `Dispose()`.
**Resolution:** Resolved 2026-06-19 — added `_alarmSubscribeSemaphore` (`SemaphoreSlim(1,1)`) and wrapped the entire `SubscribeAlarmsAsync` body (null-check + `CreateSubscriptionAsync` + `AddEventMonitoredItemAsync`) inside `await _alarmSubscribeSemaphore.WaitAsync(ct)` / `Release()`. The semaphore is disposed in `Dispose()`. Test: `SubscribeAlarmsAsync_ConcurrentDuplicateCalls_CreatesExactlyOneSubscription` launches 20 concurrent tasks and asserts exactly one event monitored item across all created subscriptions.
### Client.Shared-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Adapters/DefaultSubscriptionAdapter.cs:DeleteAsync` (line 108 at HEAD) |
| Status | Resolved |
**Description:** `DefaultSubscriptionAdapter.DeleteAsync(CancellationToken ct)` accepted a `CancellationToken` parameter but forwarded it as `_subscription.DeleteAsync(true)`, discarding `ct`. The OPC UA SDK `Subscription.DeleteAsync(bool, CancellationToken)` overload exists and accepts the token. Callers that pass a cancellable token (e.g. `DisconnectAsync`) had no way to interrupt a stalled subscription-delete service round-trip.
**Recommendation:** Pass `ct` to the SDK: `_subscription.DeleteAsync(true, ct)`.
**Resolution:** Resolved 2026-06-19 — changed `_subscription.DeleteAsync(true)` to `_subscription.DeleteAsync(true, ct)` in `DefaultSubscriptionAdapter.DeleteAsync`. Test: `DisconnectAsync_WithActiveSubscriptions_DeletesBothAndForwardsCt` asserts `FakeSubscriptionAdapter.LastDeleteCt` equals the caller's token; `UnsubscribeAlarmsAsync_ForwardsCancellationToken` covers the `UnsubscribeAlarmsAsync` path.
### Client.Shared-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Adapters/DefaultSessionAdapter.cs:BrowseNextAsync` (line 111 at HEAD) |
| Status | Resolved |
**Description:** `DefaultSessionAdapter.BrowseNextAsync(byte[] continuationPoint, CancellationToken ct)` accepted a `CancellationToken` but called `_session.BrowseNextAsync(null, false, continuationPoint)` — a three-argument overload that carries no token. The SDK's `SessionExtensions.BrowseNextAsync(ISession, RequestHeader, bool, byte[], CancellationToken)` extension accepts the token. This meant a cancelled multi-page browse could not abort the continuation round-trip mid-flight. Callers in `OpcUaClientService.BrowseAsync` correctly propagate a `CancellationToken` through `BrowseNextAsync` but the adapter silently dropped it.
**Recommendation:** Use the CT-accepting SDK overload: `_session.BrowseNextAsync(null, false, continuationPoint, ct)`.
**Resolution:** Resolved 2026-06-19 — changed the `_session.BrowseNextAsync(null, false, continuationPoint)` call to `_session.BrowseNextAsync(null, false, continuationPoint, ct)` in `DefaultSessionAdapter.BrowseNextAsync`. Build is clean (0 errors); the SDK extension overload is confirmed present in `Opc.Ua.Client 1.5.377.21`.
@@ -108,7 +108,9 @@ internal sealed class DefaultSessionAdapter : ISessionAdapter
public async Task<(byte[]? ContinuationPoint, ReferenceDescriptionCollection References)> BrowseNextAsync(
byte[] continuationPoint, CancellationToken ct)
{
var (_, nextCp, nextRefs) = await _session.BrowseNextAsync(null, false, continuationPoint);
// Pass the caller's token so a cancelled browse does not block on the continuation
// round-trip (Client.Shared-015).
var (_, nextCp, nextRefs) = await _session.BrowseNextAsync(null, false, continuationPoint, ct);
return (nextCp, nextRefs ?? []);
}
@@ -105,7 +105,8 @@ internal sealed class DefaultSubscriptionAdapter : ISubscriptionAdapter
{
try
{
await _subscription.DeleteAsync(true);
// Forward the caller's token so the delete can be cancelled (Client.Shared-014).
await _subscription.DeleteAsync(true, ct);
}
catch (Exception ex)
{
@@ -21,6 +21,12 @@ public sealed class OpcUaClientService : IOpcUaClientService
// path, and DisconnectAsync, so every read/write must be inside this lock.
private readonly object _subscriptionLock = new();
// Serialises SubscribeAlarmsAsync / UnsubscribeAlarmsAsync so concurrent callers
// cannot both pass the _alarmSubscription == null check and create duplicate event
// subscriptions. Capacity 1 = at most one waiter; async-friendly (no thread-block).
// (Client.Shared-013)
private readonly SemaphoreSlim _alarmSubscribeSemaphore = new(1, 1);
// Track active data subscriptions for replay after failover
private readonly Dictionary<string, (NodeId NodeId, int IntervalMs, uint Handle)> _activeDataSubscriptions = new();
@@ -130,19 +136,26 @@ public sealed class OpcUaClientService : IOpcUaClientService
var endpointUrl = _session?.EndpointUrl ?? _settings?.EndpointUrl ?? string.Empty;
// Snapshot the subscription adapter references under the lock so a concurrent
// RunFailoverAsync that nulls _dataSubscription / _alarmSubscription (Client.Shared-012)
// cannot invalidate the references between the null-check and the DeleteAsync call.
ISubscriptionAdapter? dataSubscription;
ISubscriptionAdapter? alarmSubscription;
lock (_subscriptionLock)
{
dataSubscription = _dataSubscription;
alarmSubscription = _alarmSubscription;
_dataSubscription = null;
_alarmSubscription = null;
}
try
{
if (_dataSubscription != null)
{
await _dataSubscription.DeleteAsync(ct);
_dataSubscription = null;
}
if (dataSubscription != null)
await dataSubscription.DeleteAsync(ct);
if (_alarmSubscription != null)
{
await _alarmSubscription.DeleteAsync(ct);
_alarmSubscription = null;
}
if (alarmSubscription != null)
await alarmSubscription.DeleteAsync(ct);
if (_session != null)
{
@@ -290,22 +303,33 @@ public sealed class OpcUaClientService : IOpcUaClientService
ThrowIfDisposed();
ThrowIfNotConnected();
if (_alarmSubscription != null)
return; // Already subscribed to alarms
var monitorNode = sourceNodeId ?? ObjectIds.Server;
_alarmSubscription = await _session!.CreateSubscriptionAsync(intervalMs, ct);
var filter = CreateAlarmEventFilter();
await _alarmSubscription.AddEventMonitoredItemAsync(
monitorNode, intervalMs, filter, OnAlarmEventNotification, ct);
lock (_subscriptionLock)
// Serialise check-and-create under the semaphore (Client.Shared-013): concurrent
// callers cannot both pass the null-check and create duplicate alarm subscriptions.
// The semaphore is async-friendly; the subscription I/O runs inside the critical section.
await _alarmSubscribeSemaphore.WaitAsync(ct);
try
{
_activeAlarmSubscription = (sourceNodeId, intervalMs);
}
if (_alarmSubscription != null)
return; // Already subscribed to alarms
Logger.Debug("Subscribed to alarm events on {NodeId}", monitorNode);
var monitorNode = sourceNodeId ?? ObjectIds.Server;
_alarmSubscription = await _session!.CreateSubscriptionAsync(intervalMs, ct);
var filter = CreateAlarmEventFilter();
await _alarmSubscription.AddEventMonitoredItemAsync(
monitorNode, intervalMs, filter, OnAlarmEventNotification, ct);
lock (_subscriptionLock)
{
_activeAlarmSubscription = (sourceNodeId, intervalMs);
}
Logger.Debug("Subscribed to alarm events on {NodeId}", monitorNode);
}
finally
{
_alarmSubscribeSemaphore.Release();
}
}
/// <inheritdoc />
@@ -584,6 +608,7 @@ public sealed class OpcUaClientService : IOpcUaClientService
_dataSubscription?.Dispose();
_alarmSubscription?.Dispose();
_session?.Dispose();
_alarmSubscribeSemaphore.Dispose();
lock (_subscriptionLock)
{
@@ -101,10 +101,17 @@ internal sealed class FakeSubscriptionAdapter : ISubscriptionAdapter
return Task.CompletedTask;
}
/// <summary>
/// Gets the cancellation token that was supplied to the most recent <see cref="DeleteAsync"/> call,
/// so tests can assert the CT from the caller is honoured (Client.Shared-014).
/// </summary>
public CancellationToken? LastDeleteCt { get; private set; }
/// <inheritdoc />
public Task DeleteAsync(CancellationToken ct)
{
Deleted = true;
LastDeleteCt = ct;
lock (_itemsLock) _items.Clear();
return Task.CompletedTask;
}
@@ -1630,4 +1630,100 @@ public class OpcUaClientServiceTests : IDisposable
service.ShouldBeAssignableTo<IOpcUaClientService>();
service.Dispose();
}
// --- Client.Shared-012: DisconnectAsync subscription snapshot under lock ---
/// <summary>
/// Regression for Client.Shared-012: DisconnectAsync must snapshot the subscription
/// references under _subscriptionLock before teardown so a concurrent RunFailoverAsync
/// nulling the same fields cannot produce a NullReferenceException between the null-check
/// and the DeleteAsync call. This test verifies that when data and alarm subscriptions
/// exist at disconnect time the adapters are deleted exactly once and the supplied
/// CancellationToken is forwarded (Client.Shared-014).
/// </summary>
[Fact]
public async Task DisconnectAsync_WithActiveSubscriptions_DeletesBothAndForwardsCt()
{
var dataFakeSub = new FakeSubscriptionAdapter();
var alarmFakeSub = new FakeSubscriptionAdapter();
// First CreateSubscription call → data subscription; second → alarm subscription.
var callIndex = 0;
var session = new FakeSessionAdapter
{
NextSubscription = dataFakeSub
};
// Second CreateSubscription call should return the alarm fake.
// We wire it by overriding the factory queue after the first subscribe.
_sessionFactory.EnqueueSession(session);
await _service.ConnectAsync(ValidSettings());
// Subscribe data (uses dataFakeSub from session.NextSubscription).
await _service.SubscribeAsync(new NodeId("ns=2;s=Node1"));
// Now set NextSubscription on the session so SubscribeAlarmsAsync gets the alarm fake.
session.NextSubscription = alarmFakeSub;
await _service.SubscribeAlarmsAsync();
// Both subscriptions are now active. Disconnect with a cancellable token.
using var cts = new CancellationTokenSource();
await _service.DisconnectAsync(cts.Token);
// Both adapters must have been deleted.
dataFakeSub.Deleted.ShouldBeTrue();
alarmFakeSub.Deleted.ShouldBeTrue();
// The CancellationToken must have been forwarded (Client.Shared-014).
dataFakeSub.LastDeleteCt.ShouldBe(cts.Token);
alarmFakeSub.LastDeleteCt.ShouldBe(cts.Token);
}
// --- Client.Shared-013: SubscribeAlarmsAsync idempotency under lock ---
/// <summary>
/// Regression for Client.Shared-013: the _alarmSubscription != null guard in
/// SubscribeAlarmsAsync must be checked under _subscriptionLock so concurrent callers
/// cannot both pass the null-check and create duplicate alarm subscriptions. Exercises
/// concurrent calls from 20 tasks and verifies only one event subscription is created.
/// </summary>
[Fact]
public async Task SubscribeAlarmsAsync_ConcurrentDuplicateCalls_CreatesExactlyOneSubscription()
{
var fakeSub = new FakeSubscriptionAdapter();
var session = new FakeSessionAdapter { NextSubscription = fakeSub };
_sessionFactory.EnqueueSession(session);
await _service.ConnectAsync(ValidSettings());
// Fire many concurrent SubscribeAlarmsAsync calls — only one should "win".
var tasks = Enumerable.Range(0, 20).Select(_ =>
Task.Run(() => _service.SubscribeAlarmsAsync()));
await Task.WhenAll(tasks);
// Exactly one event monitored item must have been added across all calls.
var totalEventItems = session.CreatedSubscriptions.Sum(s => s.AddEventCount);
totalEventItems.ShouldBe(1);
}
// --- Client.Shared-014: CancellationToken forwarding ---
/// <summary>
/// Regression for Client.Shared-014: UnsubscribeAsync must forward the caller's
/// CancellationToken to the subscription adapter's DeleteAsync so the operation
/// can be cancelled — the pre-fix code passed CancellationToken.None.
/// </summary>
[Fact]
public async Task UnsubscribeAlarmsAsync_ForwardsCancellationToken()
{
var fakeSub = new FakeSubscriptionAdapter();
var session = new FakeSessionAdapter { NextSubscription = fakeSub };
_sessionFactory.EnqueueSession(session);
await _service.ConnectAsync(ValidSettings());
await _service.SubscribeAlarmsAsync();
using var cts = new CancellationTokenSource();
await _service.UnsubscribeAlarmsAsync(cts.Token);
fakeSub.LastDeleteCt.ShouldBe(cts.Token);
}
}