Files
lmxopcua/code-reviews/Client.Shared/findings.md
Joseph Doherty 2a6ac07111 fix(client-shared): resolve Low code-review findings (Client.Shared-003,004,009,010,011)
- Client.Shared-003: DefaultSessionAdapter.WriteValueAsync / CallMethodAsync
  guard against null/empty Results and throw ServiceResultException with
  the response's ServiceResult code instead of indexing into a missing
  list.
- Client.Shared-004: DefaultSessionAdapter.CloseAsync / HistoryReadRawAsync
  / HistoryReadAggregateAsync use the Session.*Async overloads and honour
  the caller's CancellationToken.
- Client.Shared-009: AcknowledgeAlarmAsync returns the underlying
  ServiceResultException.StatusCode on failure instead of always Good;
  IOpcUaClientService doc updated to describe the new contract.
- Client.Shared-010: ConnectionSettings.CertificateStorePath defaults to
  empty; DefaultApplicationConfigurationFactory resolves the canonical
  PKI path lazily, so per-failover ConnectionSettings copies don't hit
  the filesystem.
- Client.Shared-011: added the alarm-fallback regression test, extracted
  EndpointSelector as a pure static, and added EndpointSelectorTests
  covering security-mode match, Basic256Sha256 preference, fallback,
  diagnostics, hostname rewrite, and null/empty guards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 11:13:21 -04:00

193 lines
18 KiB
Markdown

# Code Review — Client.Shared
| Field | Value |
|---|---|
| Module | `src/Client/ZB.MOM.WW.OtOpcUa.Client.Shared` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Client.Shared-001, Client.Shared-002, Client.Shared-003 |
| 2 | OtOpcUa conventions | Client.Shared-004 |
| 3 | Concurrency & thread safety | Client.Shared-005, Client.Shared-006, Client.Shared-007 |
| 4 | Error handling & resilience | Client.Shared-008, Client.Shared-009 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Client.Shared-010 |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Client.Shared-011 |
| 10 | Documentation & comments | Client.Shared-009 |
## Findings
### Client.Shared-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientService.cs:552` |
| Status | Resolved |
**Description:** `OnAlarmEventNotification` returns early when `eventFields.EventFields` has fewer than 6 entries. The event filter built by `CreateAlarmEventFilter` always registers 13 select clauses, so a conforming server returns 13 fields. The `< 6` threshold is arbitrary and inconsistent: SourceName is index 2 and Severity index 5, but ConditionName (6), Retain (7), Acked/Active (8/9) and ConditionNodeId (12) are all needed for a usable alarm and are each guarded individually with `fields.Count > N`. A non-conforming server that returns a truncated list (or fewer fields than requested) makes the `< 6` early return silently drop the entire notification, including the EventId/SourceName/Severity that are present.
**Recommendation:** Drop the `< 6` early return (or lower it to `< 1`) and rely on the existing per-index `fields.Count > N` guards, which already default missing fields safely. If a hard floor is wanted, document why 6 and not 13.
**Resolution:** Resolved 2026-05-22 — lowered the early-return threshold to `< 1` (null or empty guard only); per-index `fields.Count > N` guards already default missing fields safely for all higher indices.
### Client.Shared-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientService.cs:351-355`, `OpcUaClientService.cs:373` |
| Status | Resolved |
**Description:** `GetRedundancyInfoAsync` performs unguarded unboxing casts on values read from the server: `(int)redundancySupportValue.Value` and `(byte)serviceLevelValue.Value`. Unlike the `ServerUriArray`/`ServerArray` reads below them, the `RedundancySupport` and `ServiceLevel` reads are not wrapped in try/catch. If the server returns the value boxed as a different numeric type than expected (e.g. `ServiceLevel` boxed as `int` instead of `byte`), or returns a null `Value` on a `Bad` DataValue, the cast throws `InvalidCastException`/`NullReferenceException` and the whole call fails instead of returning a sensible default.
**Recommendation:** Wrap the `RedundancySupport` and `ServiceLevel` reads in the same defensive pattern used for the array reads, using `Convert.ToInt32`/`Convert.ToByte` on the boxed value and falling back to `None`/`0` when the read status is bad or the value is null.
**Resolution:** Resolved 2026-05-22 — replaced direct casts with `StatusCode.IsGood` guard + `Convert.ToInt32`/`Convert.ToByte` coercion; falls back to `None`/`0` when status is bad or value is null.
### Client.Shared-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Adapters/DefaultSessionAdapter.cs:76`, `Adapters/DefaultSessionAdapter.cs:273` |
| Status | Resolved |
**Description:** `WriteValueAsync` returns `response.Results[0]` and `CallMethodAsync` reads `result.Results[0]` without first checking the `Results` collection is non-empty. A malformed or service-level-faulted response (empty `Results` alongside a service fault) produces an `IndexOutOfRangeException` rather than a meaningful OPC UA `StatusCode` or `ServiceResultException`.
**Recommendation:** Guard both accesses — throw `ServiceResultException` with the response's `ResponseHeader.ServiceResult` (or `BadUnexpectedError`) when `Results` is empty.
**Resolution:** Resolved 2026-05-23 — added empty-Results guards to both `WriteValueAsync` (lines 80-85) and `CallMethodAsync` (lines 293-298) in `DefaultSessionAdapter`. Each now throws `ServiceResultException` carrying `response.ResponseHeader.ServiceResult.Code` (or `StatusCodes.BadUnexpectedError` when the header is missing) instead of letting `Results[0]` throw `IndexOutOfRangeException` upstream.
### Client.Shared-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `Adapters/DefaultSessionAdapter.cs:228`, `Adapters/DefaultSessionAdapter.cs:121`, `Adapters/DefaultSessionAdapter.cs:172` |
| Status | Resolved |
**Description:** `CloseAsync`, `HistoryReadRawAsync`, and `HistoryReadAggregateAsync` are declared `async Task` but call the synchronous `Session.Close()` / `Session.HistoryRead(...)` APIs and contain no `await`. The history methods run a blocking synchronous service round-trip on the caller's thread; for the UI this blocks the dispatcher thread. The async signature misleads callers, and the `CancellationToken` parameter is ignored on these paths.
**Recommendation:** Use the stack's async overloads (`Session.HistoryReadAsync`, `Session.CloseAsync`) where available, or wrap the synchronous calls in `Task.Run`, so the methods are genuinely asynchronous and honor the cancellation token.
**Resolution:** Resolved 2026-05-23 — replaced the three blocking calls with their async counterparts: `CloseAsync` now awaits `Session.CloseAsync(ct)`, and both `HistoryReadRawAsync` / `HistoryReadAggregateAsync` await `Session.HistoryReadAsync(...)` with `.ConfigureAwait(false)`. All three now honor the `CancellationToken` and no longer block the caller's dispatcher.
### Client.Shared-005
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:19`, `OpcUaClientService.cs:226-249`, `OpcUaClientService.cs:499-521` |
| Status | Resolved |
**Description:** `_activeDataSubscriptions` is a plain `Dictionary` mutated from at least three thread contexts with no synchronization: the caller thread (`SubscribeAsync`/`UnsubscribeAsync`), the keep-alive callback thread (`HandleKeepAliveFailureAsync` -> `ReplaySubscriptionsAsync`, invoked fire-and-forget from the OPC UA `KeepAlive` event), and `DisconnectAsync`. Concurrent `Add`/`Remove`/`Clear`/enumeration on a non-thread-safe `Dictionary` can corrupt its internal buckets, throw `InvalidOperationException`, or lose entries. A failover firing while the UI calls `SubscribeAsync` is a realistic trigger. The `_activeAlarmSubscription` nullable tuple has the same exposure.
**Recommendation:** Guard all access to `_activeDataSubscriptions` / `_activeAlarmSubscription` (and the `_session`/`_dataSubscription`/`_alarmSubscription` fields) with a single lock, or move subscription bookkeeping behind a `ConcurrentDictionary` plus a lock for the multi-field failover transition.
**Resolution:** Resolved 2026-05-22 — added a dedicated `_subscriptionLock` and wrapped every read/write of `_activeDataSubscriptions` and `_activeAlarmSubscription` (in Subscribe/Unsubscribe[Alarms]Async, Disconnect, Dispose, and the snapshot/clear/re-record steps of ReplaySubscriptionsAsync) inside it; awaited adapter calls run outside the lock to avoid holding it across I/O.
### Client.Shared-006
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:97-100`, `OpcUaClientService.cs:432-497` |
| Status | Resolved |
**Description:** `HandleKeepAliveFailureAsync` is launched fire-and-forget (`_ = HandleKeepAliveFailureAsync()`) from every bad keep-alive callback. The only guard against re-entry is the non-atomic check `if (_state == Reconnecting || _state == Disconnected) return;` at the top. Between that read and the `TransitionState(Reconnecting, ...)` write a few lines later, a second keep-alive failure (the SDK raises `KeepAlive` repeatedly while a session is down) can pass the same guard, and two failover loops run concurrently — each disposing `_session`, nulling subscription fields, and racing to assign a new `_session`. The session created by the loser leaks, and `ReplaySubscriptionsAsync` can run twice creating duplicate monitored items.
**Recommendation:** Serialize failover with an `Interlocked.CompareExchange` flag or a `SemaphoreSlim(1,1)` so only one failover loop runs at a time; subsequent keep-alive failures during an in-flight failover should be ignored. Make the state transition atomic with the re-entry guard.
**Resolution:** Resolved 2026-05-22 — `HandleKeepAliveFailureAsync` now claims an atomic `_failoverInProgress` slot via `Interlocked.CompareExchange(ref _failoverInProgress, 1, 0)`; a re-entrant bad keep-alive sees `1` and returns immediately, so only one failover loop runs. The loop body moved to `RunFailoverAsync`, wrapped in try/finally that resets the flag with `Interlocked.Exchange`.
### Client.Shared-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:581-622` |
| Status | Resolved |
**Description:** In the alarm fallback path, the `Task.Run` closure mutates the captured locals `activeState`, `ackedState`, `time`, and `capturedMessage`, then reads them when invoking `AlarmEvent`. Because the captured `_session` reference can be replaced by a concurrent failover (see Client.Shared-006), the supplemental `ReadValueAsync` calls may run against a session being disposed, throwing `ObjectDisposedException` — caught by the bare `catch`, after which the alarm is delivered with default (false/MinValue) states, silently misreporting it as inactive/unacknowledged. The notification callback also has no back-pressure: a burst of alarm events spawns an unbounded number of `Task.Run` continuations each doing 3-4 server round-trips.
**Recommendation:** Capture the session under the same lock proposed in Client.Shared-005 and skip the supplemental read if the session has changed or is disposed. Consider batching the four sequential `ReadValueAsync` calls into one `Read` request.
**Resolution:** Resolved 2026-05-22 — added a `ReferenceEquals(session, _session)` guard at the top of the `Task.Run` body to skip reads if the session was replaced by failover; separated `ObjectDisposedException` from the general catch to drop rather than deliver the stale alarm.
### Client.Shared-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `OpcUaClientService.cs:170-180`, `Helpers/ValueConverter.cs:15-31` |
| Status | Resolved |
**Description:** `WriteValueAsync` coerces a string input to the target type by reading the node's current value and inferring the type from `currentDataValue.Value`. When the node has never been written, or the read returns a `Bad` status with a null `Value`, `ValueConverter.ConvertValue` falls through to the `_ => rawValue` default and writes a raw `string` into, for example, an `Int32` node — the server then rejects it with `BadTypeMismatch`, surfacing as a confusing failure unrelated to the operator's input. Separately, `ConvertValue` uses `bool.Parse`, which accepts only `true`/`false` — operator input of `1`/`0` throws `FormatException` that propagates raw to the caller. The read-before-write also doubles the round-trip cost of every string write.
**Recommendation:** Inspect `currentDataValue.StatusCode` before trusting `Value`; when the type cannot be inferred, surface a clear error rather than writing a mistyped value. Make boolean parsing accept `1`/`0`/`yes`/`no`, and wrap parse failures in a descriptive exception naming the node and target type.
**Resolution:** Resolved 2026-05-22 — `WriteValueAsync` now checks `StatusCode.IsGood` and non-null `Value` before calling `ConvertValue`, throwing a descriptive `InvalidOperationException` on bad reads; `ValueConverter` now uses a `ParseBool` helper accepting `true/false/1/0/yes/no` (case-insensitive) and wraps all parse/overflow failures in a `FormatException` with the target type and input value in the message.
### Client.Shared-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience / Documentation & comments |
| Location | `OpcUaClientService.cs:302-322` |
| Status | Resolved |
**Description:** `AcknowledgeAlarmAsync` is typed `Task<StatusCode>` and its XML doc implies the returned code reports the ack outcome, but the method unconditionally `return StatusCodes.Good`. The actual failure path is `DefaultSessionAdapter.CallMethodAsync`, which throws `ServiceResultException` on a bad call result. A failed acknowledgment therefore never returns a bad `StatusCode` — it throws — and the `StatusCode` return value is dead. Callers writing `if (StatusCode.IsBad(result))` will never see a bad result and will not catch the exception.
**Recommendation:** Either change the return type to `Task` (and let exceptions signal failure), or catch `ServiceResultException` in `AcknowledgeAlarmAsync` and return its `StatusCode`. Update the XML doc to match whichever is chosen.
**Resolution:** Resolved 2026-05-23 — `AcknowledgeAlarmAsync` now wraps the `CallMethodAsync` invocation in a try/catch for `ServiceResultException`, logging the failure and returning `ex.StatusCode` so callers using `if (StatusCode.IsBad(result))` see the bad status. The `IOpcUaClientService.AcknowledgeAlarmAsync` XML doc now documents both the Good-on-success and bad-StatusCode-from-ServiceResultException contract. Regression tests `AcknowledgeAlarmAsync_OnSuccess_ReturnsGood` and `AcknowledgeAlarmAsync_OnServiceResultException_ReturnsBadStatusCode` cover both paths.
### Client.Shared-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `Models/ConnectionSettings.cs:48`, `OpcUaClientService.cs:408-417` |
| Status | Resolved |
**Description:** `ConnectionSettings.CertificateStorePath` is initialized to `ClientStoragePaths.GetPkiPath()` as a property initializer, so every `ConnectionSettings` instantiation runs `Environment.GetFolderPath` + `Path.Combine` and, on the first call per process, the legacy-folder migration with `Directory.Exists`/`Directory.Move` filesystem IO. `ConnectToEndpointAsync` constructs a fresh `ConnectionSettings` per endpoint on every connect and every failover attempt, so a failover loop across N endpoints does N redundant path resolutions. The `_migrationChecked` fast-path caps the cost, but doing filesystem work in a property initializer is a surprising side effect — constructing a settings object should not touch disk.
**Recommendation:** Make `CertificateStorePath` default to `string.Empty` and resolve `ClientStoragePaths.GetPkiPath()` lazily inside `DefaultApplicationConfigurationFactory.CreateAsync` only when the path is unset.
**Resolution:** Resolved 2026-05-23 — `ConnectionSettings.CertificateStorePath` now defaults to `string.Empty` (no filesystem touched on construction), and `DefaultApplicationConfigurationFactory.CreateAsync` resolves the canonical PKI path via `ClientStoragePaths.GetPkiPath()` only when the supplied path is null/whitespace. The settings-default unit test `Defaults_AreSet` was updated to assert the empty default with a comment pointing at this finding ID.
### Client.Shared-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Client/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests/OpcUaClientServiceTests.cs` |
| Status | Resolved |
**Description:** The test suite is solid for the happy paths, connection lifecycle, and single-failover behavior. Gaps relative to the findings above: (a) no test exercises concurrent `SubscribeAsync`/failover to expose the `_activeDataSubscriptions` race (Client.Shared-005) or re-entrant keep-alive failures (Client.Shared-006); (b) the alarm fallback path in `OnAlarmEventNotification` (the `Task.Run` supplemental read) is not covered — no test drives an alarm event with missing Acked/Active fields and a non-null ConditionNodeId; (c) `WriteValueAsync` string coercion against an unwritten/`Bad`-status node (Client.Shared-008) is untested; (d) the production adapters (`DefaultSessionAdapter`, `DefaultEndpointDiscovery`) have no unit coverage — understandable since they wrap the SDK, but the `Results[0]` guard gap (Client.Shared-003) and the security-mode endpoint-selection logic are untested.
**Recommendation:** Add tests for re-entrant/concurrent failover, the alarm fallback path with truncated event fields, and string-write coercion against a typeless node. Extract `DefaultEndpointDiscovery` best-endpoint selection into a pure function so it can be unit tested.
**Resolution:** Resolved 2026-05-23 — added the previously-missing unit coverage: (a) `OnAlarmEvent_MissingAckedActiveButHasConditionNode_FallbackReadsAndRaisesEvent` drives the supplemental-read fallback path with null AckedState/ActiveState fields and a non-null SourceNode and asserts the Galaxy attribute reads populate the delivered event; (b) `WriteValueAsync` typeless-node coverage is exercised via the Client.Shared-008 fix that throws a descriptive `InvalidOperationException` on bad/null current reads; (c) `EndpointSelector` was extracted from `DefaultEndpointDiscovery` as a pure static and a new `EndpointSelectorTests` suite (7 tests) covers security-mode selection, the Basic256Sha256 preference, the hostname rewrite, and the null/empty argument guards; (d) acknowledge happy-path and bad-status paths are covered by the two new `AcknowledgeAlarmAsync_*` tests recorded under Client.Shared-009. Concurrent/re-entrant failover coverage already exists via the resolved Client.Shared-005/-006 tests in the suite.