review(Client.Shared): fix Disconnect/failover subscription race + CT forwarding

Re-review at 7286d320. -012 (Medium): DisconnectAsync now snapshots+nulls the data/alarm
subscriptions under _subscriptionLock before async teardown (was racing RunFailoverAsync).
-013: SubscribeAlarmsAsync guarded by a semaphore (idempotent under concurrency). -014/-015:
forward CancellationToken through Delete/BrowseNext adapters. + TDD.
This commit is contained in:
Joseph Doherty
2026-06-19 11:58:15 -04:00
parent 887a31e825
commit d68c9db9f9
6 changed files with 240 additions and 28 deletions
+83 -2
View File
@@ -4,8 +4,8 @@
|---|---|
| Module | `src/Client/ZB.MOM.WW.OtOpcUa.Client.Shared` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-06-19 |
| Commit reviewed | `7286d320` |
| Status | Reviewed |
| Open findings | 0 |
@@ -24,6 +24,26 @@
| 9 | Testing coverage | Client.Shared-011 |
| 10 | Documentation & comments | Client.Shared-009 |
## Re-review 2026-06-19 (commit 7286d320)
All 11 prior findings remain correctly Resolved. The module has grown since `76d35d1` with
new alarm-condition methods (`ConfirmAlarmAsync`, `ShelveAlarmAsync`, `EnableAsync`,
`DisableAsync`), a `DefaultApplicationConfigurationFactory` PKI-path lazy-resolve fix, and
`EndpointSelector` extraction. Four new issues were found and fixed in the same pass.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Client.Shared-015 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Client.Shared-012, Client.Shared-013 |
| 4 | Error handling & resilience | No issues found |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Client.Shared-014 (correctness) |
## Findings
### Client.Shared-001
@@ -178,6 +198,7 @@
### Client.Shared-011
| Field | Value |
|---|---|
| Severity | Low |
@@ -190,3 +211,63 @@
**Recommendation:** Add tests for re-entrant/concurrent failover, the alarm fallback path with truncated event fields, and string-write coercion against a typeless node. Extract `DefaultEndpointDiscovery` best-endpoint selection into a pure function so it can be unit tested.
**Resolution:** Resolved 2026-05-23 — added the previously-missing unit coverage: (a) `OnAlarmEvent_MissingAckedActiveButHasConditionNode_FallbackReadsAndRaisesEvent` drives the supplemental-read fallback path with null AckedState/ActiveState fields and a non-null SourceNode and asserts the Galaxy attribute reads populate the delivered event; (b) `WriteValueAsync` typeless-node coverage is exercised via the Client.Shared-008 fix that throws a descriptive `InvalidOperationException` on bad/null current reads; (c) `EndpointSelector` was extracted from `DefaultEndpointDiscovery` as a pure static and a new `EndpointSelectorTests` suite (7 tests) covers security-mode selection, the Basic256Sha256 preference, the hostname rewrite, and the null/empty argument guards; (d) acknowledge happy-path and bad-status paths are covered by the two new `AcknowledgeAlarmAsync_*` tests recorded under Client.Shared-009. Concurrent/re-entrant failover coverage already exists via the resolved Client.Shared-005/-006 tests in the suite.
### Client.Shared-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:DisconnectAsync` (lines 135145 at HEAD), `OpcUaClientService.cs:RunFailoverAsync` (lines 671672) |
| Status | Resolved |
**Description:** `DisconnectAsync` accessed `_dataSubscription` and `_alarmSubscription` without holding `_subscriptionLock`. `RunFailoverAsync` also nulled both fields without a lock. A concurrent failover firing while `DisconnectAsync` was in progress could null the fields between the null-check (`if (_dataSubscription != null)`) and the `DeleteAsync` call, producing a `NullReferenceException` caught silently by the `catch (Exception ex)` wrapper, or — if the failover won the race completely — a double-null-check that skipped the cleanup. The session adapter was similarly unguarded.
**Recommendation:** Snapshot `_dataSubscription` and `_alarmSubscription` under `_subscriptionLock` before the async teardown block, null the fields atomically under the lock, then call `DeleteAsync` on the local copies outside the lock (no async under lock).
**Resolution:** Resolved 2026-06-19 — `DisconnectAsync` now snapshots `dataSubscription` and `alarmSubscription` under `_subscriptionLock` at the top of the cleanup block, nulling the fields atomically before any async call. The local copies are then passed to `DeleteAsync`, eliminating the race window. Tests: `DisconnectAsync_WithActiveSubscriptions_DeletesBothAndForwardsCt` verifies both adapters are deleted and the CT is forwarded.
### Client.Shared-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:SubscribeAlarmsAsync` (line 293 at HEAD) |
| Status | Resolved |
**Description:** The `if (_alarmSubscription != null) return;` guard in `SubscribeAlarmsAsync` was outside any lock or serialisation primitive. Two concurrent callers could both observe `_alarmSubscription == null`, proceed to `CreateSubscriptionAsync`, and create duplicate event subscriptions. The gap is exploitable during the async I/O between the null-check and the assignment of `_alarmSubscription` after `CreateSubscriptionAsync` completes. In practice callers are CLI/UI (single-threaded) or the failover replay path, so the risk is low, but it is a true race.
**Recommendation:** Serialise the check-and-create via a `SemaphoreSlim(1,1)` so only one caller can create the alarm subscription at a time. The semaphore should be async-friendly (no thread-block) and disposed in `Dispose()`.
**Resolution:** Resolved 2026-06-19 — added `_alarmSubscribeSemaphore` (`SemaphoreSlim(1,1)`) and wrapped the entire `SubscribeAlarmsAsync` body (null-check + `CreateSubscriptionAsync` + `AddEventMonitoredItemAsync`) inside `await _alarmSubscribeSemaphore.WaitAsync(ct)` / `Release()`. The semaphore is disposed in `Dispose()`. Test: `SubscribeAlarmsAsync_ConcurrentDuplicateCalls_CreatesExactlyOneSubscription` launches 20 concurrent tasks and asserts exactly one event monitored item across all created subscriptions.
### Client.Shared-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Adapters/DefaultSubscriptionAdapter.cs:DeleteAsync` (line 108 at HEAD) |
| Status | Resolved |
**Description:** `DefaultSubscriptionAdapter.DeleteAsync(CancellationToken ct)` accepted a `CancellationToken` parameter but forwarded it as `_subscription.DeleteAsync(true)`, discarding `ct`. The OPC UA SDK `Subscription.DeleteAsync(bool, CancellationToken)` overload exists and accepts the token. Callers that pass a cancellable token (e.g. `DisconnectAsync`) had no way to interrupt a stalled subscription-delete service round-trip.
**Recommendation:** Pass `ct` to the SDK: `_subscription.DeleteAsync(true, ct)`.
**Resolution:** Resolved 2026-06-19 — changed `_subscription.DeleteAsync(true)` to `_subscription.DeleteAsync(true, ct)` in `DefaultSubscriptionAdapter.DeleteAsync`. Test: `DisconnectAsync_WithActiveSubscriptions_DeletesBothAndForwardsCt` asserts `FakeSubscriptionAdapter.LastDeleteCt` equals the caller's token; `UnsubscribeAlarmsAsync_ForwardsCancellationToken` covers the `UnsubscribeAlarmsAsync` path.
### Client.Shared-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Adapters/DefaultSessionAdapter.cs:BrowseNextAsync` (line 111 at HEAD) |
| Status | Resolved |
**Description:** `DefaultSessionAdapter.BrowseNextAsync(byte[] continuationPoint, CancellationToken ct)` accepted a `CancellationToken` but called `_session.BrowseNextAsync(null, false, continuationPoint)` — a three-argument overload that carries no token. The SDK's `SessionExtensions.BrowseNextAsync(ISession, RequestHeader, bool, byte[], CancellationToken)` extension accepts the token. This meant a cancelled multi-page browse could not abort the continuation round-trip mid-flight. Callers in `OpcUaClientService.BrowseAsync` correctly propagate a `CancellationToken` through `BrowseNextAsync` but the adapter silently dropped it.
**Recommendation:** Use the CT-accepting SDK overload: `_session.BrowseNextAsync(null, false, continuationPoint, ct)`.
**Resolution:** Resolved 2026-06-19 — changed the `_session.BrowseNextAsync(null, false, continuationPoint)` call to `_session.BrowseNextAsync(null, false, continuationPoint, ct)` in `DefaultSessionAdapter.BrowseNextAsync`. Build is clean (0 errors); the SDK extension overload is confirmed present in `Opc.Ua.Client 1.5.377.21`.