docs(code-reviews): comprehensive per-module review pass at 76d35d1

Reviewed all 31 src/ production projects against the 10-category checklist in REVIEW-PROCESS.md. Each module gets its own findings.md; code-reviews/README.md is regenerated from them. 334 findings: 6 Critical, 46 High, 126 Medium, 156 Low. Critical findings: - Server-001: WriteNodeIdUnknown recurses unconditionally — a HistoryRead on an unresolvable node crashes the process (remote DoS). - Admin-001/002: app-wide auth bypass (RouteView not AuthorizeRouteView) plus unauthenticated mutating routes. - Core.Scripting-001: System.Environment reachable from operator scripts; Environment.Exit() terminates the server. - Core.AlarmHistorian-001: rowIds/events parallel-list desync on a corrupt payload misapplies outcomes — silent alarm-event data loss. - Driver.Galaxy-001: ReconnectSupervisor is built but never triggered, so a transient gateway drop permanently kills the event stream. All findings are Status=Open; resolution is tracked per REVIEW-PROCESS.md section 4. Review only — no source code changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:20:27 -04:00
parent 76d35d1b9f
commit 8568f5cd85
32 changed files with 8134 additions and 2 deletions
@@ -0,0 +1,337 @@
+# Code Review — Driver.Historian.Wonderware
+
+| Field | Value |
+|---|---|
+| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware` |
+| Reviewer | Claude Code |
+| Review date | 2026-05-22 |
+| Commit reviewed | `76d35d1` |
+| Status | Reviewed |
+| Open findings | 12 |
+
+## Checklist coverage
+
+A comprehensive review completes every category, recording "No issues found" where
+a category produced nothing rather than leaving it blank.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness and logic bugs | Driver.Historian.Wonderware-001, -002, -003, -004 |
+| 2 | OtOpcUa conventions | No issues found |
+| 3 | Concurrency and thread safety | Driver.Historian.Wonderware-005 |
+| 4 | Error handling and resilience | Driver.Historian.Wonderware-006, -007, -008 |
+| 5 | Security | No issues found |
+| 6 | Performance and resource management | Driver.Historian.Wonderware-009, -010 |
+| 7 | Design-document adherence | Driver.Historian.Wonderware-011 |
+| 8 | Code organization and conventions | No issues found |
+| 9 | Testing coverage | Driver.Historian.Wonderware-012 |
+| 10 | Documentation and comments | No issues found |
+
+## Findings
+
+### Driver.Historian.Wonderware-001
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | Correctness and logic bugs |
+| Location | `Backend/SdkAlarmHistorianWriteBackend.cs:68`, `Backend/AahClientManagedAlarmEventWriter.cs:82-103` |
+| Status | Open |
+
+**Description:** `MalformedErrors` includes `HistorianAccessError.ErrorValue.WriteToReadOnlyFile`.
+When `ClassifyOutcome` routes that code through `MapOutcome`, `isMalformedInput` is
+`true`, so the per-event result becomes `PermanentFail` and the lmxopcua-side
+store-and-forward sink dead-letters the alarm event. But `WriteToReadOnlyFile` is
+not a property of the event payload; it is a connection-configuration fault (the
+write backend opened the session without `ReadOnly` set to `false`, or the SDK
+defaulted it). Treating it as permanent means a misconfigured or regressed
+connection would silently and permanently discard every alarm event in the batch
+instead of deferring them for retry once the connection is corrected.
+Alarm-event historization is the module's whole purpose, so this is data loss.
+
+**Recommendation:** Move `WriteToReadOnlyFile` out of `MalformedErrors`. It should
+be treated as a connection-class error (abort the batch, reset the connection so
+the reconnect path can re-open with `ReadOnly = false`) or at minimum as
+`RetryPlease`, never `PermanentFail`.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-002
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Correctness and logic bugs |
+| Location | `Ipc/HistorianFrameHandler.cs:162`, `:181` |
+| Status | Open |
+
+**Description:** `HandleWriteAlarmEventsAsync` dereferences `req.Events.Length`
+in both the `_alarmWriter is null` branch (line 162) and the catch block (line
+181). MessagePack deserializes an absent or explicit-nil array field as a `null`
+reference, not `Array.Empty<T>()`. A client (or a buggy/hostile peer) that sends
+a `WriteAlarmEventsRequest` with a null `Events` array triggers a
+`NullReferenceException`. Although `RunOneConnectionAsync` would log it and accept
+the next connection, the request gets no reply frame, so the client correlation-id
+wait hangs until its own timeout. `AahClientManagedAlarmEventWriter.WriteAsync`
+already null-guards `events`; the frame handler does not.
+
+**Recommendation:** Normalize `req.Events` to `Array.Empty<AlarmHistorianEventDto>()`
+immediately after deserialization (or guard each `.Length` access), consistent
+with the null-tolerance the writer already has.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-003
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Correctness and logic bugs |
+| Location | `Backend/HistorianDataSource.cs:320-323`, `:457-460` |
+| Status | Open |
+
+**Description:** Raw and at-time reads decide whether a sample is a string or a
+numeric with `if (!string.IsNullOrEmpty(result.StringValue) && result.Value == 0)`.
+The `result.Value == 0` clause is intended to distinguish a real numeric zero from
+a string tag whose numeric projection is zero, but it is wrong in both directions:
+a numeric (analog) tag that legitimately sampled the value `0` while the SDK also
+populates a non-empty `StringValue` (some Historian builds populate the formatted
+text on every result) is reported to OPC UA as a string, changing the variable
+data type mid-stream; conversely a string tag whose numeric projection is non-zero
+is reported as a numeric. The historian SDK exposes the tag actual data type,
+which should drive the branch instead of a value heuristic.
+
+**Recommendation:** Select string vs. numeric from the SDK result tag-data-type
+field rather than from `Value == 0`. If the type field is genuinely unavailable in
+the bound SDK version, document the limitation explicitly and prefer numeric for
+analog/integer tags.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-004
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Correctness and logic bugs |
+| Location | `Backend/SdkAlarmHistorianWriteBackend.cs:198-201` |
+| Status | Open |
+
+**Description:** `ToHistorianEvent` only assigns `historianEvent.Id` when
+`Guid.TryParse(dto.EventId, ...)` succeeds. If `EventId` is not a parseable GUID
+(or is empty), `Id` stays `Guid.Empty` and the event is written to the historian
+with an all-zeros identifier. Multiple such events collide on the same id, and the
+write is still accepted (`outcomes[i] = Ack`) so neither side detects the problem.
+The non-parseable case is never logged.
+
+**Recommendation:** Log a warning when `EventId` fails to parse, and either reject
+the event as `PermanentFail` (malformed input) or synthesize a fresh
+`Guid.NewGuid()` so each event still gets a unique id.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-005
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Concurrency and thread safety |
+| Location | `Backend/HistorianDataSource.cs:124`, `:126-127` |
+| Status | Open |
+
+**Description:** `GetHealthSnapshot` reads `_activeProcessNode` and
+`_activeEventNode` inside `_healthLock`, but those two fields are written under
+`_connectionLock` / `_eventConnectionLock` (lines 183, 243, 209-210, 266-269) — a
+different lock. The health-counter fields are correctly `_healthLock`-protected,
+but the active-node strings are published under one lock and read under another,
+so the snapshot can observe a stale active-node value relative to the
+connection-open booleans. This is a diagnostics-only path, so impact is limited to
+a momentarily inconsistent health snapshot.
+
+**Recommendation:** Pick one lock for the active-node strings (publish them under
+`_healthLock` on every connection state change, or read them under the connection
+lock), so the snapshot is internally consistent.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-006
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Error handling and resilience |
+| Location | `Ipc/PipeServer.cs:120-128` |
+| Status | Open |
+
+**Description:** `RunAsync` re-accepts connections in a `while` loop. If
+`RunOneConnectionAsync` throws synchronously and immediately on every iteration
+(for example `new NamedPipeServerStream(...)` fails because the pipe name is
+already in use, or `PipeAcl.Create` throws), the loop spins with no delay and no
+backoff, pegging a CPU core and flooding the rolling log file with one `Error`
+line per iteration. There is no circuit-breaker or retry cap.
+
+**Recommendation:** Add a short delay (exponential backoff capped at a few
+seconds) before re-accepting after a caught exception, and consider a
+consecutive-failure threshold that escalates to a fatal exit so the supervisor can
+restart the sidecar cleanly.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-007
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Error handling and resilience |
+| Location | `Ipc/PipeServer.cs:70-75` |
+| Status | Open |
+
+**Description:** When `VerifyCaller` rejects the peer SID, the server logs the
+reason and calls `_current.Disconnect()` with no `HelloAck` frame sent. The
+shared-secret-mismatch and major-version-mismatch paths below it both send a
+rejecting `HelloAck` so the client learns why. A client that fails the SID check
+instead sees an abrupt disconnect and must rely on its own read timeout, with no
+diagnostic on the client side. The asymmetry also makes the SID-rejection path
+harder to test from the client.
+
+**Recommendation:** Send a `HelloAck` with `Accepted = false` and a
+`caller-sid-mismatch` reject reason before disconnecting, consistent with the
+other two rejection paths.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-008
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Error handling and resilience |
+| Location | `Backend/HistorianDataSource.cs:301-307`, `:374-380` |
+| Status | Open |
+
+**Description:** When `query.StartQuery` returns `false`, `ReadRawAsync` and
+`ReadAggregateAsync` call `HandleConnectionError()` and return an empty result
+list. A failed `StartQuery` is not necessarily a connection failure — it can be a
+bad tag name, an invalid time range, or an unsupported aggregate — yet the code
+unconditionally tears down the shared SDK connection. A burst of queries with one
+bad tag name therefore repeatedly drops and re-opens the (relatively expensive)
+historian connection and marks the cluster node failed via `HandleConnectionError`
+into `_picker.MarkFailed`, which can push an otherwise healthy node into cooldown.
+The empty-list result is also indistinguishable from "no data in range" to the
+caller — the `Success` flag on the reply will still be `true`.
+
+**Recommendation:** Inspect `error.ErrorCode` to distinguish connection-class
+failures (reset and mark node failed) from query-class failures (leave the
+connection intact, surface the error). Consider returning a failed reply
+(`Success = false`) for query-class `StartQuery` failures so the client does not
+treat an SDK error as an empty history.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-009
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Performance and resource management |
+| Location | `Backend/HistorianDataSource.cs:382-395`, `Ipc/Contracts.cs:85-99` |
+| Status | Open |
+
+**Description:** `ReadAggregateAsync` drains `query.MoveNext` into `results` with
+no upper bound, unlike `ReadRawAsync`, which honours `maxValues` /
+`MaxValuesPerRead` and breaks. `ReadProcessedRequest` carries no max-buckets field.
+A processed read over a wide time range with a small `IntervalMs` produces an
+unbounded `HistorianAggregateSample` list; the handler then serializes it into
+`ReadProcessedReply`. If the serialized body exceeds the 16 MiB
+`Framing.MaxFrameBodyBytes` cap, `FrameWriter.WriteAsync` throws and the entire
+reply is lost (the client correlation wait hangs), and before that point the
+sidecar holds the whole result set in memory.
+
+**Recommendation:** Apply `_config.MaxValuesPerRead` as a bucket cap in
+`ReadAggregateAsync` (mirroring the raw path), and/or add a `MaxBuckets` field to
+`ReadProcessedRequest`. Reject or truncate result sets that would exceed the frame
+cap with an explicit error reply rather than letting `WriteAsync` throw.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-010
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Performance and resource management |
+| Location | `Backend/HistorianConfiguration.cs:32-36`, `Backend/HistorianDataSource.cs` (all read methods) |
+| Status | Open |
+
+**Description:** `HistorianConfiguration.RequestTimeoutSeconds` is documented as
+the "outer safety timeout applied to sync-over-async Historian operations" and is
+copied around (`SdkAlarmHistorianWriteBackend.CloneConfigWithServerName:346`), but
+it is never read or enforced anywhere. The `HistorianDataSource` read methods are
+declared `Task`-returning but execute the SDK calls synchronously on the caller
+thread and only check the `CancellationToken` between `MoveNext` iterations. There
+is no outer timeout: a hung `StartQuery` or a slow `MoveNext` blocks the single
+pipe-server connection thread indefinitely (the connect path has its own poll
+timeout, but the query path does not). The documented safety net does not exist.
+
+**Recommendation:** Either wire `RequestTimeoutSeconds` into the read paths (a
+`CancellationTokenSource.CancelAfter` linked into `ct`, or run the SDK call on a
+worker with a bounded wait), or remove the property and its XML doc so the code
+does not advertise a guarantee it does not provide.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-011
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Design-document adherence |
+| Location | `Backend/HistorianDataSource.cs:9-12`, `Backend/IHistorianDataSource.cs:9-11`, `Backend/HistorianSample.cs:7-9`, `Backend/HistorianConfiguration.cs:7-9` |
+| Status | Open |
+
+**Description:** Several XML doc comments reference the retired v1 architecture as
+if it were current: "inside Galaxy.Host", "the Proxy maps returned samples", "the
+Host returns these across the IPC boundary as `GalaxyDataValue`", "Populated from
+... the Proxy DriverInstance.DriverConfig". Per `CLAUDE.md`, PR 7.2 retired the
+`Galaxy.Host` / `Galaxy.Proxy` / `Galaxy.Shared` projects, and this driver is now a
+standalone sidecar whose client is the .NET 10 `WonderwareHistorianClient`
+(`docs/AlarmTracking.md`). The comments are stale and misdescribe the current data
+flow, which contradicts the "no stale design docs/comments" expectation in the
+review checklist.
+
+**Recommendation:** Update the doc comments to describe the current sidecar/IPC
+architecture (sidecar talking to `WonderwareHistorianClient` over the named pipe),
+dropping the `Galaxy.Host` / `Proxy` / `GalaxyDataValue` references.
+
+**Resolution:** _(open)_
+
+### Driver.Historian.Wonderware-012
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Testing coverage |
+| Location | `Backend/HistorianDataSource.cs`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` |
+| Status | Open |
+
+**Description:** The unit-test suite covers `HistorianQualityMapper`,
+`HistorianClusterEndpointPicker`, `SdkAlarmHistorianWriteBackend`,
+`AahClientManagedAlarmEventWriter`, the IPC round trip, and `Program` alarm-writer
+wiring. `HistorianDataSource` itself — the largest and most logic-dense file in
+the module — has no direct unit coverage of its read paths, despite
+`IHistorianConnectionFactory` being explicitly extracted "so tests can inject
+fakes that control connection success, failure, and timeout behavior". The
+connect-failover-and-cooldown loop (`ConnectToAnyHealthyNode`), the mid-query
+connection-reset path (`HandleConnectionError`), the string-vs-numeric value
+selection (see -003), the at-time per-timestamp loop, and `ExtractAggregateValue`
+column dispatch are all untested. A stale empty test directory
+(`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/`, containing only
+`bin/obj`) also sits alongside the live `tests/Drivers/...` project and should be
+removed to avoid confusion.
+
+**Recommendation:** Add `HistorianDataSource` tests driving an
+`IHistorianConnectionFactory` fake — covering failover, cooldown, mid-query reset,
+cancellation, and the value-type selection — and delete the stale empty
+`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory.
+
+**Resolution:** _(open)_