Files
lmxopcua/code-reviews/Client.Shared/findings.md
Joseph Doherty e221371a0c fix(client-shared): resolve High code-review findings (Client.Shared-005, Client.Shared-006)
Client.Shared-005: _activeDataSubscriptions (a plain Dictionary) and the
_activeAlarmSubscription tuple were mutated from the caller thread, the
keep-alive failover path, and DisconnectAsync with no synchronization,
risking bucket corrosion / InvalidOperationException / lost entries.
Added a dedicated _subscriptionLock and wrapped every read/write of that
bookkeeping state inside it (Subscribe/Unsubscribe[Alarms]Async,
Disconnect, Dispose, and the snapshot/clear/re-record steps of
ReplaySubscriptionsAsync). Awaited adapter calls stay outside the lock so
it is never held across I/O.

Client.Shared-006: HandleKeepAliveFailureAsync had only a non-atomic
state check guarding re-entry, so two bad keep-alives could each start a
failover loop, racing to dispose/replace _session and double-replaying
subscriptions. It now claims an atomic _failoverInProgress slot via
Interlocked.CompareExchange; a re-entrant call returns immediately. The
loop body moved to RunFailoverAsync, wrapped in try/finally that resets
the flag.

Tests: added KeepAliveFailure_ReentrantWhileFailoverInFlight_RunsFailoverOnce
and SubscribeAndUnsubscribe_ConcurrentCalls_DoNotCorruptState regression
tests; made the FakeSubscriptionAdapter / FakeSessionAdapter /
FakeSessionFactory test doubles thread-safe (and added a CreateGate hook)
so the concurrency tests exercise production locking rather than fake
state. All 138 Client.Shared tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:27:38 -04:00

14 KiB

Code Review — Client.Shared

Field Value
Module src/Client/ZB.MOM.WW.OtOpcUa.Client.Shared
Reviewer Claude Code
Review date 2026-05-22
Commit reviewed 76d35d1
Status Reviewed
Open findings 9

Checklist coverage

# Category Result
1 Correctness & logic bugs Client.Shared-001, Client.Shared-002, Client.Shared-003
2 OtOpcUa conventions Client.Shared-004
3 Concurrency & thread safety Client.Shared-005, Client.Shared-006, Client.Shared-007
4 Error handling & resilience Client.Shared-008, Client.Shared-009
5 Security No issues found
6 Performance & resource management Client.Shared-010
7 Design-document adherence No issues found
8 Code organization & conventions No issues found
9 Testing coverage Client.Shared-011
10 Documentation & comments Client.Shared-009

Findings

Client.Shared-001

Field Value
Severity Medium
Category Correctness & logic bugs
Location OpcUaClientService.cs:552
Status Open

Description: OnAlarmEventNotification returns early when eventFields.EventFields has fewer than 6 entries. The event filter built by CreateAlarmEventFilter always registers 13 select clauses, so a conforming server returns 13 fields. The < 6 threshold is arbitrary and inconsistent: SourceName is index 2 and Severity index 5, but ConditionName (6), Retain (7), Acked/Active (8/9) and ConditionNodeId (12) are all needed for a usable alarm and are each guarded individually with fields.Count > N. A non-conforming server that returns a truncated list (or fewer fields than requested) makes the < 6 early return silently drop the entire notification, including the EventId/SourceName/Severity that are present.

Recommendation: Drop the < 6 early return (or lower it to < 1) and rely on the existing per-index fields.Count > N guards, which already default missing fields safely. If a hard floor is wanted, document why 6 and not 13.

Resolution: (open)

Client.Shared-002

Field Value
Severity Medium
Category Correctness & logic bugs
Location OpcUaClientService.cs:351-355, OpcUaClientService.cs:373
Status Open

Description: GetRedundancyInfoAsync performs unguarded unboxing casts on values read from the server: (int)redundancySupportValue.Value and (byte)serviceLevelValue.Value. Unlike the ServerUriArray/ServerArray reads below them, the RedundancySupport and ServiceLevel reads are not wrapped in try/catch. If the server returns the value boxed as a different numeric type than expected (e.g. ServiceLevel boxed as int instead of byte), or returns a null Value on a Bad DataValue, the cast throws InvalidCastException/NullReferenceException and the whole call fails instead of returning a sensible default.

Recommendation: Wrap the RedundancySupport and ServiceLevel reads in the same defensive pattern used for the array reads, using Convert.ToInt32/Convert.ToByte on the boxed value and falling back to None/0 when the read status is bad or the value is null.

Resolution: (open)

Client.Shared-003

Field Value
Severity Low
Category Correctness & logic bugs
Location Adapters/DefaultSessionAdapter.cs:76, Adapters/DefaultSessionAdapter.cs:273
Status Open

Description: WriteValueAsync returns response.Results[0] and CallMethodAsync reads result.Results[0] without first checking the Results collection is non-empty. A malformed or service-level-faulted response (empty Results alongside a service fault) produces an IndexOutOfRangeException rather than a meaningful OPC UA StatusCode or ServiceResultException.

Recommendation: Guard both accesses — throw ServiceResultException with the response's ResponseHeader.ServiceResult (or BadUnexpectedError) when Results is empty.

Resolution: (open)

Client.Shared-004

Field Value
Severity Low
Category OtOpcUa conventions
Location Adapters/DefaultSessionAdapter.cs:228, Adapters/DefaultSessionAdapter.cs:121, Adapters/DefaultSessionAdapter.cs:172
Status Open

Description: CloseAsync, HistoryReadRawAsync, and HistoryReadAggregateAsync are declared async Task but call the synchronous Session.Close() / Session.HistoryRead(...) APIs and contain no await. The history methods run a blocking synchronous service round-trip on the caller's thread; for the UI this blocks the dispatcher thread. The async signature misleads callers, and the CancellationToken parameter is ignored on these paths.

Recommendation: Use the stack's async overloads (Session.HistoryReadAsync, Session.CloseAsync) where available, or wrap the synchronous calls in Task.Run, so the methods are genuinely asynchronous and honor the cancellation token.

Resolution: (open)

Client.Shared-005

Field Value
Severity High
Category Concurrency & thread safety
Location OpcUaClientService.cs:19, OpcUaClientService.cs:226-249, OpcUaClientService.cs:499-521
Status Resolved

Description: _activeDataSubscriptions is a plain Dictionary mutated from at least three thread contexts with no synchronization: the caller thread (SubscribeAsync/UnsubscribeAsync), the keep-alive callback thread (HandleKeepAliveFailureAsync -> ReplaySubscriptionsAsync, invoked fire-and-forget from the OPC UA KeepAlive event), and DisconnectAsync. Concurrent Add/Remove/Clear/enumeration on a non-thread-safe Dictionary can corrupt its internal buckets, throw InvalidOperationException, or lose entries. A failover firing while the UI calls SubscribeAsync is a realistic trigger. The _activeAlarmSubscription nullable tuple has the same exposure.

Recommendation: Guard all access to _activeDataSubscriptions / _activeAlarmSubscription (and the _session/_dataSubscription/_alarmSubscription fields) with a single lock, or move subscription bookkeeping behind a ConcurrentDictionary plus a lock for the multi-field failover transition.

Resolution: Resolved 2026-05-22 — added a dedicated _subscriptionLock and wrapped every read/write of _activeDataSubscriptions and _activeAlarmSubscription (in Subscribe/Unsubscribe[Alarms]Async, Disconnect, Dispose, and the snapshot/clear/re-record steps of ReplaySubscriptionsAsync) inside it; awaited adapter calls run outside the lock to avoid holding it across I/O.

Client.Shared-006

Field Value
Severity High
Category Concurrency & thread safety
Location OpcUaClientService.cs:97-100, OpcUaClientService.cs:432-497
Status Resolved

Description: HandleKeepAliveFailureAsync is launched fire-and-forget (_ = HandleKeepAliveFailureAsync()) from every bad keep-alive callback. The only guard against re-entry is the non-atomic check if (_state == Reconnecting || _state == Disconnected) return; at the top. Between that read and the TransitionState(Reconnecting, ...) write a few lines later, a second keep-alive failure (the SDK raises KeepAlive repeatedly while a session is down) can pass the same guard, and two failover loops run concurrently — each disposing _session, nulling subscription fields, and racing to assign a new _session. The session created by the loser leaks, and ReplaySubscriptionsAsync can run twice creating duplicate monitored items.

Recommendation: Serialize failover with an Interlocked.CompareExchange flag or a SemaphoreSlim(1,1) so only one failover loop runs at a time; subsequent keep-alive failures during an in-flight failover should be ignored. Make the state transition atomic with the re-entry guard.

Resolution: Resolved 2026-05-22 — HandleKeepAliveFailureAsync now claims an atomic _failoverInProgress slot via Interlocked.CompareExchange(ref _failoverInProgress, 1, 0); a re-entrant bad keep-alive sees 1 and returns immediately, so only one failover loop runs. The loop body moved to RunFailoverAsync, wrapped in try/finally that resets the flag with Interlocked.Exchange.

Client.Shared-007

Field Value
Severity Medium
Category Concurrency & thread safety
Location OpcUaClientService.cs:581-622
Status Open

Description: In the alarm fallback path, the Task.Run closure mutates the captured locals activeState, ackedState, time, and capturedMessage, then reads them when invoking AlarmEvent. Because the captured _session reference can be replaced by a concurrent failover (see Client.Shared-006), the supplemental ReadValueAsync calls may run against a session being disposed, throwing ObjectDisposedException — caught by the bare catch, after which the alarm is delivered with default (false/MinValue) states, silently misreporting it as inactive/unacknowledged. The notification callback also has no back-pressure: a burst of alarm events spawns an unbounded number of Task.Run continuations each doing 3-4 server round-trips.

Recommendation: Capture the session under the same lock proposed in Client.Shared-005 and skip the supplemental read if the session has changed or is disposed. Consider batching the four sequential ReadValueAsync calls into one Read request.

Resolution: (open)

Client.Shared-008

Field Value
Severity Medium
Category Error handling & resilience
Location OpcUaClientService.cs:170-180, Helpers/ValueConverter.cs:15-31
Status Open

Description: WriteValueAsync coerces a string input to the target type by reading the node's current value and inferring the type from currentDataValue.Value. When the node has never been written, or the read returns a Bad status with a null Value, ValueConverter.ConvertValue falls through to the _ => rawValue default and writes a raw string into, for example, an Int32 node — the server then rejects it with BadTypeMismatch, surfacing as a confusing failure unrelated to the operator's input. Separately, ConvertValue uses bool.Parse, which accepts only true/false — operator input of 1/0 throws FormatException that propagates raw to the caller. The read-before-write also doubles the round-trip cost of every string write.

Recommendation: Inspect currentDataValue.StatusCode before trusting Value; when the type cannot be inferred, surface a clear error rather than writing a mistyped value. Make boolean parsing accept 1/0/yes/no, and wrap parse failures in a descriptive exception naming the node and target type.

Resolution: (open)

Client.Shared-009

Field Value
Severity Low
Category Error handling & resilience / Documentation & comments
Location OpcUaClientService.cs:302-322
Status Open

Description: AcknowledgeAlarmAsync is typed Task<StatusCode> and its XML doc implies the returned code reports the ack outcome, but the method unconditionally return StatusCodes.Good. The actual failure path is DefaultSessionAdapter.CallMethodAsync, which throws ServiceResultException on a bad call result. A failed acknowledgment therefore never returns a bad StatusCode — it throws — and the StatusCode return value is dead. Callers writing if (StatusCode.IsBad(result)) will never see a bad result and will not catch the exception.

Recommendation: Either change the return type to Task (and let exceptions signal failure), or catch ServiceResultException in AcknowledgeAlarmAsync and return its StatusCode. Update the XML doc to match whichever is chosen.

Resolution: (open)

Client.Shared-010

Field Value
Severity Low
Category Performance & resource management
Location Models/ConnectionSettings.cs:48, OpcUaClientService.cs:408-417
Status Open

Description: ConnectionSettings.CertificateStorePath is initialized to ClientStoragePaths.GetPkiPath() as a property initializer, so every ConnectionSettings instantiation runs Environment.GetFolderPath + Path.Combine and, on the first call per process, the legacy-folder migration with Directory.Exists/Directory.Move filesystem IO. ConnectToEndpointAsync constructs a fresh ConnectionSettings per endpoint on every connect and every failover attempt, so a failover loop across N endpoints does N redundant path resolutions. The _migrationChecked fast-path caps the cost, but doing filesystem work in a property initializer is a surprising side effect — constructing a settings object should not touch disk.

Recommendation: Make CertificateStorePath default to string.Empty and resolve ClientStoragePaths.GetPkiPath() lazily inside DefaultApplicationConfigurationFactory.CreateAsync only when the path is unset.

Resolution: (open)

Client.Shared-011

Field Value
Severity Low
Category Testing coverage
Location tests/Client/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests/OpcUaClientServiceTests.cs
Status Open

Description: The test suite is solid for the happy paths, connection lifecycle, and single-failover behavior. Gaps relative to the findings above: (a) no test exercises concurrent SubscribeAsync/failover to expose the _activeDataSubscriptions race (Client.Shared-005) or re-entrant keep-alive failures (Client.Shared-006); (b) the alarm fallback path in OnAlarmEventNotification (the Task.Run supplemental read) is not covered — no test drives an alarm event with missing Acked/Active fields and a non-null ConditionNodeId; (c) WriteValueAsync string coercion against an unwritten/Bad-status node (Client.Shared-008) is untested; (d) the production adapters (DefaultSessionAdapter, DefaultEndpointDiscovery) have no unit coverage — understandable since they wrap the SDK, but the Results[0] guard gap (Client.Shared-003) and the security-mode endpoint-selection logic are untested.

Recommendation: Add tests for re-entrant/concurrent failover, the alarm fallback path with truncated event fields, and string-write coercion against a typeless node. Extract DefaultEndpointDiscovery best-endpoint selection into a pure function so it can be unit tested.

Resolution: (open)