docs(code-reviews): comprehensive per-module review pass at 76d35d1

Reviewed all 31 src/ production projects against the 10-category checklist in REVIEW-PROCESS.md. Each module gets its own findings.md; code-reviews/README.md is regenerated from them. 334 findings: 6 Critical, 46 High, 126 Medium, 156 Low. Critical findings: - Server-001: WriteNodeIdUnknown recurses unconditionally — a HistoryRead on an unresolvable node crashes the process (remote DoS). - Admin-001/002: app-wide auth bypass (RouteView not AuthorizeRouteView) plus unauthenticated mutating routes. - Core.Scripting-001: System.Environment reachable from operator scripts; Environment.Exit() terminates the server. - Core.AlarmHistorian-001: rowIds/events parallel-list desync on a corrupt payload misapplies outcomes — silent alarm-event data loss. - Driver.Galaxy-001: ReconnectSupervisor is built but never triggered, so a transient gateway drop permanently kills the event stream. All findings are Status=Open; resolution is tracked per REVIEW-PROCESS.md section 4. Review only — no source code changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:20:27 -04:00
parent 76d35d1b9f
commit 8568f5cd85
32 changed files with 8134 additions and 2 deletions
@@ -0,0 +1,294 @@
+# Code Review — Driver.Historian.Wonderware.Client
+
+| Field | Value |
+|---|---|
+| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client` |
+| Reviewer | Claude Code |
+| Review date | 2026-05-22 |
+| Commit reviewed | `76d35d1` |
+| Status | Reviewed |
+| Open findings | 10 |
+
+## Checklist coverage
+
+A comprehensive review completes every category, recording "No issues found" where
+a category produced nothing rather than leaving it blank.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | Driver.Historian.Wonderware.Client-001, Driver.Historian.Wonderware.Client-002 |
+| 2 | OtOpcUa conventions | No issues found |
+| 3 | Concurrency & thread safety | Driver.Historian.Wonderware.Client-003, Driver.Historian.Wonderware.Client-004 |
+| 4 | Error handling & resilience | Driver.Historian.Wonderware.Client-005, Driver.Historian.Wonderware.Client-006 |
+| 5 | Security | Driver.Historian.Wonderware.Client-007, Driver.Historian.Wonderware.Client-008 |
+| 6 | Performance & resource management | No issues found |
+| 7 | Design-document adherence | No issues found |
+| 8 | Code organization & conventions | No issues found |
+| 9 | Testing coverage | Driver.Historian.Wonderware.Client-009 |
+| 10 | Documentation & comments | Driver.Historian.Wonderware.Client-010 |
+
+## Findings
+
+### Driver.Historian.Wonderware.Client-001
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Location | `WonderwareHistorianClient.cs:98-113` |
+| Status | Open |
+
+**Description:** `ReadAtTimeAsync` violates the explicit `IHistorianDataSource.ReadAtTimeAsync`
+contract. The interface XML doc states: the returned list MUST be the same length and
+order as `timestampsUtc`, and gaps are returned as Bad-quality snapshots. The client passes
+`reply.Samples` straight through `ToSnapshots` with no check that the sidecar returned
+exactly one sample per requested timestamp, nor that the order matches. If the sidecar
+returns fewer/more samples (e.g. it drops boundary-less timestamps), the OPC UA
+HistoryReadAtTime service receives a result that the spec-compliant caller expects to
+index positionally against the request timestamps, silently misaligning values with
+timestamps. The matching `ReadAtTimeAsync_PreservesTimestampOrder` test only passes because
+the fake echoes the request verbatim; it never exercises a short/reordered reply.
+
+**Recommendation:** After receiving the reply, reconcile `reply.Samples` against
+`timestampsUtc` by timestamp: build the result array at `timestampsUtc.Count`, fill matched
+entries, and emit a Bad-quality (`0x80000000`) snapshot for any requested timestamp the
+sidecar did not return. Alternatively assert `reply.Samples.Length == timestampsUtc.Count`
+and fail loudly. Add a test where the fake returns a partial/reordered sample set.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-002
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Location | `WonderwareHistorianClient.cs:154-199`, `IAlarmHistorianSink.cs:66-74` |
+| Status | Open |
+
+**Description:** `WriteBatchAsync` can never return `HistorianWriteOutcome.PermanentFail`.
+`HistorianWriteOutcome` defines three states (`Ack`, `RetryPlease`, `PermanentFail`) and
+the drain worker is documented to move the event to the dead-letter table on
+`PermanentFail`. The client maps the sidecar `WriteAlarmEventsReply.PerEventOk` bool array
+to only `Ack`/`RetryPlease`, and the whole-call-failure and catch paths also only emit
+`RetryPlease`. A malformed alarm event the sidecar can never persist (unrecoverable SDK
+error on that specific row) therefore retries forever, blocking the head of the
+store-and-forward queue and never dead-lettering. The wire contract
+(`WriteAlarmEventsReply`) carries no per-event permanent/transient distinction, so the
+limitation is structural.
+
+**Recommendation:** Extend the wire contract: replace `bool[] PerEventOk` with a
+per-event status enum (Ack/Retry/Permanent), coordinated as an additive change on both
+sidecar and client per the Contracts.cs versioning rules, so unrecoverable events can be
+dead-lettered. Until then, document explicitly that this writer never produces
+`PermanentFail` and that poison events retry indefinitely.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-003
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Location | `WonderwareHistorianClient.cs:207`, `WonderwareHistorianClient.cs:132-150` |
+| Status | Open |
+
+**Description:** `_totalQueries` is mutated with `Interlocked.Increment` in `Invoke`, but
+read inside `GetHealthSnapshot` under `_healthLock`, and every other counter
+(`_totalSuccesses`, `_totalFailures`, `_consecutiveFailures`) is mutated only under
+`_healthLock`. The two synchronization mechanisms do not compose: an `Interlocked`
+increment is not ordered against `lock`-protected reads, so a snapshot can observe a
+`_totalQueries` value inconsistent with the lock-protected counters. The window is small
+and the counters are advisory, but the mixed model is a latent hazard.
+
+**Recommendation:** Pick one mechanism. Simplest: move the `_totalQueries++` into the
+`_healthLock` block (a new `RecordQuery()` helper, or fold it into `RecordSuccess`/
+`RecordFailure`) so all six health fields share a single lock.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-004
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Location | `WonderwareHistorianClient.cs:203-267` |
+| Status | Open |
+
+**Description:** A sidecar-reported failure is recorded in two non-atomic steps under
+separate lock acquisitions: `Invoke` calls `RecordSuccess()` (line 211) and then the
+caller calls `ThrowIfFailed` which calls `ReclassifySuccessAsFailure()` (line 256),
+decrementing `_totalSuccesses` and incrementing `_totalFailures`. Between those two locked
+regions a concurrent `GetHealthSnapshot` can observe a transient state where the operation
+counts as both a success and not-yet-a-failure (`_totalSuccesses` inflated,
+`_consecutiveFailures` still 0). The undo-a-success/record-a-failure dance is also fragile:
+if a future change adds an early return or exception between `RecordSuccess` and
+`ThrowIfFailed`, the success is never reversed.
+
+**Recommendation:** Classify the call once: do not call `RecordSuccess` until the
+sidecar-level `Success` flag has been checked, or pass the reply success/error into a
+single `RecordOutcome(bool transportOk, bool sidecarOk, string? error)` that updates all
+counters under one lock acquisition.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-005
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Location | `Ipc/FrameReader.cs:31-32` |
+| Status | Open |
+
+**Description:** After reading the 4-byte length prefix, `ReadFrameAsync` reads the kind
+byte with the synchronous, blocking `_stream.ReadByte()` and ignores the
+`CancellationToken`. On a `NamedPipeClientStream` with `PipeOptions.Asynchronous`, a
+synchronous `ReadByte()` blocks the calling thread until a byte arrives or the pipe
+closes. If the sidecar sends a length prefix and then stalls (slow/hung peer), the call
+hangs on a thread-pool thread and the `EffectiveCallTimeout` linked token in
+`PipeChannel.InvokeAsync` cannot interrupt it because the timeout only fires between
+awaits. This defeats the documented cap on a single read/write call once connected and can
+wedge the single-in-flight call gate.
+
+**Recommendation:** Read the kind byte asynchronously and cancellably: extend the length
+prefix read to 5 bytes, or do a second `ReadExactAsync(new byte[1], ct)`. This makes the
+whole frame read honor the call-timeout token and matches the async style of the rest of
+the reader.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-006
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Location | `Internal/PipeChannel.cs:96-107`, `WonderwareHistorianClientOptions.cs:11-12` |
+| Status | Open |
+
+**Description:** `PipeChannel.InvokeAsync` retries exactly once on transport failure and
+otherwise propagates. The options expose `ReconnectInitialBackoff` and
+`ReconnectMaxBackoff` and `WonderwareHistorianClientOptions` documents them as exponential
+backoff between reconnects, but neither field is referenced anywhere in the module: the
+single retry reconnects immediately with no delay. A sidecar that is restarting will
+reject or refuse the immediate reconnect, the call fails, and there is no backoff before
+the next caller-driven attempt. Either the backoff belongs in the channel and is missing,
+or the options are dead config that misleads operators.
+
+**Recommendation:** Either implement the documented exponential backoff in the reconnect
+path, or remove the two unused option fields and their XML docs and state plainly that
+retry/backoff is owned by the caller (the alarm drain worker / history router).
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-007
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Security |
+| Location | `WonderwareHistorianClient.cs:276` |
+| Status | Open |
+
+**Description:** `ToSnapshots` deserializes peer-supplied bytes with
+`MessagePackSerializer.Deserialize<object>(dto.ValueBytes)`, typeless MessagePack
+deserialization. The `object` overload resolves runtime types from the wire payload. The
+client treats the pipe peer as untrusted elsewhere (16 MiB frame cap stated to protect
+the receiver from a hostile or buggy peer, shared-secret Hello). Typeless deserialization
+of bytes that originate from the historian database widens the trust surface. The
+MessagePack standard resolver is primitive-only by default so the practical blast radius
+is limited, but this is the pattern called out by the two suppressed MessagePack
+advisories on this project (see finding 008).
+
+**Recommendation:** Confirm the serializer options here use the default (non-typeless)
+resolver and that no `TypelessContractlessStandardResolver` is in play; if so, document
+that. Prefer round-tripping the value as a constrained set of known primitive types rather
+than `object`, and validate `ValueBytes.Length` against a sane per-sample cap before
+deserializing.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-008
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Security |
+| Location | `ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.csproj:29-32` |
+| Status | Open |
+
+**Description:** The csproj suppresses two NuGet audit advisories
+(`GHSA-37gx-xxp4-5rgx`, `GHSA-w3x6-4m5h-cxqf`) for the `MessagePack` 2.5.187 dependency
+with no inline comment recording why the suppression is safe, who reviewed it, or when it
+should be revisited. Blanket `NuGetAuditSuppress` entries silence the very signal that
+would flag the next related CVE. Combined with finding 007 (typeless deserialization), an
+unexplained MessagePack advisory suppression is a maintainability and audit-trail gap.
+
+**Recommendation:** Add an XML comment next to each `NuGetAuditSuppress` stating the
+advisory title, why it does not apply to this module usage, and a revisit trigger. Track a
+follow-up to upgrade `MessagePack` once a patched version is available so the suppressions
+can be dropped.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-009
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Testing coverage |
+| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.Tests/WonderwareHistorianClientTests.cs` |
+| Status | Open |
+
+**Description:** The suite covers happy paths, server-error, bad-secret, a single
+reconnect and health counters, but several critical paths are untested:
+(1) `ReadAtTimeAsync` with a partial/reordered sidecar reply, the contract-alignment case
+from finding 001 (the existing test only echoes the request);
+(2) the `WriteBatchAsync` catch branch, a transport/deserialization throw during a write,
+which must return `RetryPlease` for every event;
+(3) `InvokeAsync` second-attempt-also-fails path (the test only proves a successful
+reconnect, never a reconnect that fails again and propagates);
+(4) the `CallTimeout` path, no test asserts that a stalled sidecar produces a timed-out
+`OperationCanceledException`;
+(5) `MapAggregate` for `HistoryAggregateType.Total` throwing `NotSupportedException`;
+(6) the `InvalidDataException` path when the sidecar replies with an unexpected
+`MessageKind`. The byte-equality / round-trip parity test the Contracts.cs and Framing.cs
+comments repeatedly promise is not present in this test project.
+
+**Recommendation:** Add the missing-edge-case tests above. In particular add the
+wire-parity test the source comments commit to: serialize each DTO with the client copy
+and assert byte-equality against the sidecar `Driver.Historian.Wonderware.Ipc` copy, so a
+silent `[Key]` drift between the two duplicated contract sets is caught at build time.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware.Client-010
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Documentation & comments |
+| Location | `WonderwareHistorianClient.cs:355-361`, `WonderwareHistorianClient.cs:132-150` |
+| Status | Open |
+
+**Description:** Two doc/behaviour mismatches.
+(1) The `Dispose()` XML comment asserts the underlying channel async cleanup is
+non-blocking so the `GetAwaiter()/GetResult()` bridge is safe. `PipeChannel.DisposeAsync`
+calls `ResetTransport()`, which invokes synchronous `Stream.Dispose()` on a
+`NamedPipeClientStream`; pipe disposal can block briefly on OS handle teardown. The bridge
+is safe (no deadlock, no captured context) but not strictly non-blocking; the comment
+should say "does not deadlock".
+(2) `GetHealthSnapshot` populates both `ProcessConnectionOpen` and `EventConnectionOpen`
+from the same `_channel.IsConnected`, and `ActiveProcessNode`/`ActiveEventNode`/`Nodes`
+are hard-coded to null/empty. A consumer reading `HistorianHealthSnapshot` would assume
+two independent connections and per-node health; this client has a single channel and no
+node concept. The collapse is reasonable but undocumented.
+
+**Recommendation:** Reword the `Dispose()` comment to claim only deadlock-safety. Add a
+short remark on `GetHealthSnapshot` explaining that the single-channel client maps both
+connection flags to one transport and does not track per-node health.
+
+**Resolution:** _(open)_