docs(code-reviews): comprehensive per-module review pass at 76d35d1

Reviewed all 31 src/ production projects against the 10-category checklist in
REVIEW-PROCESS.md. Each module gets its own findings.md; code-reviews/README.md
is regenerated from them.

334 findings: 6 Critical, 46 High, 126 Medium, 156 Low.

Critical findings:
- Server-001: WriteNodeIdUnknown recurses unconditionally; a HistoryRead on an
  unresolvable node crashes the process (remote DoS).
- Admin-001/002: app-wide auth bypass (RouteView not AuthorizeRouteView) plus
  unauthenticated mutating routes.
- Core.Scripting-001: System.Environment reachable from operator scripts;
  Environment.Exit() terminates the server.
- Core.AlarmHistorian-001: rowIds/events parallel-list desync on a corrupt
  payload misapplies outcomes, causing silent alarm-event data loss.
- Driver.Galaxy-001: ReconnectSupervisor is built but never triggered, so a
  transient gateway drop permanently kills the event stream.

All findings are Status=Open; resolution is tracked per REVIEW-PROCESS.md
section 4.

Review only; no source code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

code-reviews/Driver.Historian.Wonderware/findings.md
@@ -0,0 +1,337 @@
# Code Review — Driver.Historian.Wonderware

| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 12 |

## Checklist coverage

A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.

| # | Category | Result |
|---|---|---|
| 1 | Correctness and logic bugs | Driver.Historian.Wonderware-001, -002, -003, -004 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency and thread safety | Driver.Historian.Wonderware-005 |
| 4 | Error handling and resilience | Driver.Historian.Wonderware-006, -007, -008 |
| 5 | Security | No issues found |
| 6 | Performance and resource management | Driver.Historian.Wonderware-009, -010 |
| 7 | Design-document adherence | Driver.Historian.Wonderware-011 |
| 8 | Code organization and conventions | No issues found |
| 9 | Testing coverage | Driver.Historian.Wonderware-012 |
| 10 | Documentation and comments | No issues found |

## Findings

### Driver.Historian.Wonderware-001

| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness and logic bugs |
| Location | `Backend/SdkAlarmHistorianWriteBackend.cs:68`, `Backend/AahClientManagedAlarmEventWriter.cs:82-103` |
| Status | Open |

**Description:** `MalformedErrors` includes `HistorianAccessError.ErrorValue.WriteToReadOnlyFile`.
When `ClassifyOutcome` routes that code through `MapOutcome`, `isMalformedInput` is
`true`, so the per-event result becomes `PermanentFail` and the lmxopcua-side
store-and-forward sink dead-letters the alarm event. But `WriteToReadOnlyFile` is
not a property of the event payload; it is a connection-configuration fault (the
write backend opened the session without `ReadOnly` set to `false`, or the SDK
defaulted it). Treating it as permanent means a misconfigured or regressed
connection would silently and permanently discard every alarm event in the batch
instead of deferring them for retry once the connection is corrected.
Alarm-event historization is the module's whole purpose, so this is data loss.

**Recommendation:** Move `WriteToReadOnlyFile` out of `MalformedErrors`. It should
be treated as a connection-class error (abort the batch, reset the connection so
the reconnect path can re-open with `ReadOnly = false`) or at minimum as
`RetryPlease`, never `PermanentFail`.
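
One possible shape for the reclassification is sketched below. It is an illustration only: `SdkError`, `WriteOutcome`, and `OutcomeClassifier` are hypothetical stand-ins rather than the module's types, and `InvalidEventData` and `ConnectionLost` are invented codes; only `WriteToReadOnlyFile` comes from the finding.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for HistorianAccessError.ErrorValue (names other than
// WriteToReadOnlyFile are invented for this sketch).
public enum SdkError { Success, InvalidEventData, WriteToReadOnlyFile, ConnectionLost }

// Hypothetical stand-in for the per-event outcome reported to the sink.
public enum WriteOutcome { Ack, RetryPlease, PermanentFail }

public static class OutcomeClassifier
{
    // Only faults caused by the event payload itself belong in this set.
    private static readonly HashSet<SdkError> MalformedErrors =
        new HashSet<SdkError> { SdkError.InvalidEventData };

    // Faults of the session or its configuration; the batch should be aborted
    // and the connection reset so it can be re-opened correctly.
    private static readonly HashSet<SdkError> ConnectionErrors =
        new HashSet<SdkError> { SdkError.WriteToReadOnlyFile, SdkError.ConnectionLost };

    public static bool RequiresConnectionReset(SdkError error) =>
        ConnectionErrors.Contains(error);

    public static WriteOutcome Classify(SdkError error)
    {
        if (error == SdkError.Success) return WriteOutcome.Ack;
        if (MalformedErrors.Contains(error)) return WriteOutcome.PermanentFail;
        // Connection-class and unrecognized errors are deferred, never dropped.
        return WriteOutcome.RetryPlease;
    }
}
```

Defaulting unrecognized codes to `RetryPlease` is a design choice of this sketch: it favors duplicate retries over silent loss of alarm events.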

**Resolution:** _(open)_

### Driver.Historian.Wonderware-002

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness and logic bugs |
| Location | `Ipc/HistorianFrameHandler.cs:162`, `:181` |
| Status | Open |

**Description:** `HandleWriteAlarmEventsAsync` dereferences `req.Events.Length`
in both the `_alarmWriter is null` branch (line 162) and the catch block (line
181). MessagePack deserializes an absent or explicit-nil array field as a `null`
reference, not `Array.Empty<T>()`. A client (or a buggy/hostile peer) that sends
a `WriteAlarmEventsRequest` with a null `Events` array triggers a
`NullReferenceException`. Although `RunOneConnectionAsync` would log it and accept
the next connection, the request gets no reply frame, so the client's
correlation-id wait hangs until its own timeout. `AahClientManagedAlarmEventWriter.WriteAsync`
already null-guards `events`; the frame handler does not.

**Recommendation:** Normalize `req.Events` to `Array.Empty<AlarmHistorianEventDto>()`
immediately after deserialization (or guard each `.Length` access), consistent
with the null-tolerance the writer already has.
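
A minimal sketch of the normalization follows. The two DTO classes are trimmed, hypothetical versions of the IPC contract types (the real ones carry MessagePack attributes and more fields), and `RequestNormalizer` is an invented helper name.

```csharp
using System;

// Hypothetical, trimmed-down shapes of the IPC contract types.
public sealed class AlarmHistorianEventDto
{
    public string EventId { get; set; }
}

public sealed class WriteAlarmEventsRequest
{
    // May be null when the peer omits the field or sends an explicit nil.
    public AlarmHistorianEventDto[] Events { get; set; }
}

public static class RequestNormalizer
{
    // Returns the request's events, substituting an empty array for a missing
    // request or a missing Events field, so later .Length reads cannot throw.
    public static AlarmHistorianEventDto[] EventsOrEmpty(WriteAlarmEventsRequest req) =>
        req?.Events ?? Array.Empty<AlarmHistorianEventDto>();
}
```

Calling such a helper once, right after deserialization, would let both the `_alarmWriter is null` branch and the catch block size their reply from the normalized array.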

**Resolution:** _(open)_

### Driver.Historian.Wonderware-003

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness and logic bugs |
| Location | `Backend/HistorianDataSource.cs:320-323`, `:457-460` |
| Status | Open |

**Description:** Raw and at-time reads decide whether a sample is a string or a
numeric with `if (!string.IsNullOrEmpty(result.StringValue) && result.Value == 0)`.
The `result.Value == 0` clause is intended to distinguish a real numeric zero from
a string tag whose numeric projection is zero, but it is wrong in both directions.
A numeric (analog) tag that legitimately sampled the value `0` while the SDK also
populates a non-empty `StringValue` (some Historian builds populate the formatted
text on every result) is reported to OPC UA as a string, changing the variable's
data type mid-stream. Conversely, a string tag whose numeric projection is non-zero
is reported as a numeric. The historian SDK exposes the tag's actual data type,
which should drive the branch instead of a value heuristic.

**Recommendation:** Select string vs. numeric from the SDK result's tag data-type
field rather than from `Value == 0`. If the type field is genuinely unavailable in
the bound SDK version, document the limitation explicitly and prefer numeric for
analog/integer tags.

**Resolution:** _(open)_

### Driver.Historian.Wonderware-004

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness and logic bugs |
| Location | `Backend/SdkAlarmHistorianWriteBackend.cs:198-201` |
| Status | Open |

**Description:** `ToHistorianEvent` only assigns `historianEvent.Id` when
`Guid.TryParse(dto.EventId, ...)` succeeds. If `EventId` is not a parseable GUID
(or is empty), `Id` stays `Guid.Empty` and the event is written to the historian
with an all-zeros identifier. Multiple such events collide on the same id, and the
write is still accepted (`outcomes[i] = Ack`) so neither side detects the problem.
The non-parseable case is never logged.

**Recommendation:** Log a warning when `EventId` fails to parse, and either reject
the event as `PermanentFail` (malformed input) or synthesize a fresh
`Guid.NewGuid()` so each event still gets a unique id.
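
The "synthesize a fresh id" option could look roughly like the sketch below. `EventIdResolver` is a hypothetical helper, not code from the module, and treating an explicit all-zeros GUID as invalid is an extra assumption of this sketch.

```csharp
using System;

public static class EventIdResolver
{
    // Returns a usable, non-empty event id. When the raw value cannot be used,
    // a fresh GUID is generated and wasSynthesized is set so the caller can
    // log a warning that includes the original string.
    public static Guid Resolve(string rawEventId, out bool wasSynthesized)
    {
        if (Guid.TryParse(rawEventId, out Guid parsed) && parsed != Guid.Empty)
        {
            wasSynthesized = false;
            return parsed;
        }

        wasSynthesized = true;
        return Guid.NewGuid();
    }
}
```

A synthesized id keeps events distinct in the historian but breaks any correlation with the sender's original `EventId`, which is why the warning matters; rejecting the event as `PermanentFail` avoids that trade-off at the cost of dropping it.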

**Resolution:** _(open)_

### Driver.Historian.Wonderware-005

| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency and thread safety |
| Location | `Backend/HistorianDataSource.cs:124`, `:126-127` |
| Status | Open |

**Description:** `GetHealthSnapshot` reads `_activeProcessNode` and
`_activeEventNode` inside `_healthLock`, but those two fields are written under
`_connectionLock` / `_eventConnectionLock` (lines 183, 243, 209-210, 266-269), not
under `_healthLock`. The health-counter fields are correctly `_healthLock`-protected,
but the active-node strings are published under one lock and read under another,
so the snapshot can observe a stale active-node value relative to the
connection-open booleans. This is a diagnostics-only path, so impact is limited to
a momentarily inconsistent health snapshot.

**Recommendation:** Pick one lock for the active-node strings (publish them under
`_healthLock` on every connection state change, or read them under the connection
lock), so the snapshot is internally consistent.

**Resolution:** _(open)_

### Driver.Historian.Wonderware-006

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling and resilience |
| Location | `Ipc/PipeServer.cs:120-128` |
| Status | Open |

**Description:** `RunAsync` re-accepts connections in a `while` loop. If
`RunOneConnectionAsync` throws synchronously and immediately on every iteration
(for example `new NamedPipeServerStream(...)` fails because the pipe name is
already in use, or `PipeAcl.Create` throws), the loop spins with no delay and no
backoff, pegging a CPU core and flooding the rolling log file with one `Error`
line per iteration. There is no circuit-breaker or retry cap.

**Recommendation:** Add a short delay (exponential backoff capped at a few
seconds) before re-accepting after a caught exception, and consider a
consecutive-failure threshold that escalates to a fatal exit so the supervisor can
restart the sidecar cleanly.
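
The delay calculation and escalation check might be factored out as in this sketch. `AcceptBackoff` is a hypothetical helper; the base delay, cap, and threshold are parameters the caller would choose, not values taken from the module.

```csharp
using System;

public static class AcceptBackoff
{
    // Delay to wait before the next accept attempt after the given number of
    // consecutive failures: baseDelay, 2x, 4x, ... capped at maxDelay.
    public static TimeSpan DelayFor(int consecutiveFailures, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (consecutiveFailures <= 0) return TimeSpan.Zero;

        // Bound the exponent so the multiplier cannot overflow.
        int exponent = Math.Min(consecutiveFailures - 1, 20);
        double ms = baseDelay.TotalMilliseconds * (1L << exponent);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    // True once the failure streak is long enough to give up and let the
    // supervisor restart the process.
    public static bool ShouldEscalate(int consecutiveFailures, int threshold) =>
        consecutiveFailures >= threshold;
}
```

In the accept loop, the catch block would increment a failure counter, check `ShouldEscalate`, and `await Task.Delay(DelayFor(...), ct)`; a successfully served connection would reset the counter to zero.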

**Resolution:** _(open)_

### Driver.Historian.Wonderware-007

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling and resilience |
| Location | `Ipc/PipeServer.cs:70-75` |
| Status | Open |

**Description:** When `VerifyCaller` rejects the peer SID, the server logs the
reason and calls `_current.Disconnect()` with no `HelloAck` frame sent. The
shared-secret-mismatch and major-version-mismatch paths below it both send a
rejecting `HelloAck` so the client learns why. A client that fails the SID check
instead sees an abrupt disconnect and must rely on its own read timeout, with no
diagnostic on the client side. The asymmetry also makes the SID-rejection path
harder to test from the client.

**Recommendation:** Send a `HelloAck` with `Accepted = false` and a
`caller-sid-mismatch` reject reason before disconnecting, consistent with the
other two rejection paths.

**Resolution:** _(open)_

### Driver.Historian.Wonderware-008

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling and resilience |
| Location | `Backend/HistorianDataSource.cs:301-307`, `:374-380` |
| Status | Open |

**Description:** When `query.StartQuery` returns `false`, `ReadRawAsync` and
`ReadAggregateAsync` call `HandleConnectionError()` and return an empty result
list. A failed `StartQuery` is not necessarily a connection failure: it can be
caused by a bad tag name, an invalid time range, or an unsupported aggregate. Yet
the code unconditionally tears down the shared SDK connection. A burst of queries
with one bad tag name therefore repeatedly drops and re-opens the (relatively
expensive) historian connection and marks the cluster node failed, because
`HandleConnectionError` calls `_picker.MarkFailed`, which can push an otherwise
healthy node into cooldown. The empty-list result is also indistinguishable from
"no data in range" to the caller, because the `Success` flag on the reply is still
`true`.

**Recommendation:** Inspect `error.ErrorCode` to distinguish connection-class
failures (reset and mark node failed) from query-class failures (leave the
connection intact, surface the error). Consider returning a failed reply
(`Success = false`) for query-class `StartQuery` failures so the client does not
treat an SDK error as an empty history.
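
The split between connection-class and query-class failures could be captured in a small value object, sketched here. `QueryErrorCode` and `StartQueryFailure` are hypothetical; the real codes come from the SDK error object and would need to be mapped by someone with the SDK reference at hand.

```csharp
using System.Collections.Generic;

// Hypothetical error codes standing in for the SDK's error.ErrorCode values.
public enum QueryErrorCode
{
    Unknown,
    ConnectionLost,
    InvalidTagName,
    InvalidTimeRange,
    UnsupportedAggregate
}

public sealed class StartQueryFailure
{
    // Codes that describe the request, not the connection.
    private static readonly HashSet<QueryErrorCode> QueryClass = new HashSet<QueryErrorCode>
    {
        QueryErrorCode.InvalidTagName,
        QueryErrorCode.InvalidTimeRange,
        QueryErrorCode.UnsupportedAggregate
    };

    private StartQueryFailure(bool resetConnection, string message)
    {
        ResetConnection = resetConnection;
        Message = message;
    }

    // True when the caller should reset the connection and mark the node failed.
    public bool ResetConnection { get; }

    // Text to surface in a reply whose Success flag is false.
    public string Message { get; }

    public static StartQueryFailure From(QueryErrorCode code) =>
        new StartQueryFailure(!QueryClass.Contains(code), "StartQuery failed: " + code);
}
```

Unrecognized codes fall through to a connection reset in this sketch, which preserves today's behavior for anything not explicitly known to be a query-class error.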

**Resolution:** _(open)_

### Driver.Historian.Wonderware-009

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance and resource management |
| Location | `Backend/HistorianDataSource.cs:382-395`, `Ipc/Contracts.cs:85-99` |
| Status | Open |

**Description:** `ReadAggregateAsync` drains `query.MoveNext` into `results` with
no upper bound, unlike `ReadRawAsync`, which honours `maxValues` /
`MaxValuesPerRead` and breaks. `ReadProcessedRequest` carries no max-buckets field.
A processed read over a wide time range with a small `IntervalMs` produces an
unbounded `HistorianAggregateSample` list; the handler then serializes it into
`ReadProcessedReply`. If the serialized body exceeds the 16 MiB
`Framing.MaxFrameBodyBytes` cap, `FrameWriter.WriteAsync` throws and the entire
reply is lost (the client's correlation wait hangs), and before that point the
sidecar holds the whole result set in memory.

**Recommendation:** Apply `_config.MaxValuesPerRead` as a bucket cap in
`ReadAggregateAsync` (mirroring the raw path), and/or add a `MaxBuckets` field to
`ReadProcessedRequest`. Reject or truncate result sets that would exceed the frame
cap with an explicit error reply rather than letting `WriteAsync` throw.
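
The capped drain can be illustrated generically, as below. `BoundedDrain` is a hypothetical helper that works over any `IEnumerable<T>`; the actual SDK query is a `MoveNext`-style cursor, so the real loop would differ in shape but not in logic.

```csharp
using System.Collections.Generic;

public static class BoundedDrain
{
    // Copies at most maxItems items from source. truncated is set to true only
    // when at least one further item was available beyond the cap.
    public static List<T> Take<T>(IEnumerable<T> source, int maxItems, out bool truncated)
    {
        var results = new List<T>();
        truncated = false;

        foreach (T item in source)
        {
            if (results.Count >= maxItems)
            {
                truncated = true;
                break;
            }
            results.Add(item);
        }

        return results;
    }
}
```

The `truncated` flag gives the handler what it needs to send an explicit "result truncated" or error reply instead of a silently shortened list.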

**Resolution:** _(open)_

### Driver.Historian.Wonderware-010

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance and resource management |
| Location | `Backend/HistorianConfiguration.cs:32-36`, `Backend/HistorianDataSource.cs` (all read methods) |
| Status | Open |

**Description:** `HistorianConfiguration.RequestTimeoutSeconds` is documented as
the "outer safety timeout applied to sync-over-async Historian operations" and is
copied around (`SdkAlarmHistorianWriteBackend.CloneConfigWithServerName:346`), but
it is never read or enforced anywhere. The `HistorianDataSource` read methods are
declared `Task`-returning but execute the SDK calls synchronously on the caller
thread and only check the `CancellationToken` between `MoveNext` iterations. There
is no outer timeout: a hung `StartQuery` or a slow `MoveNext` blocks the single
pipe-server connection thread indefinitely (the connect path has its own poll
timeout, but the query path does not). The documented safety net does not exist.

**Recommendation:** Either wire `RequestTimeoutSeconds` into the read paths (a
`CancellationTokenSource.CancelAfter` linked into `ct`, or run the SDK call on a
worker with a bounded wait), or remove the property and its XML doc so the code
does not advertise a guarantee it does not provide.
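
A sketch of the "worker with a bounded wait" option, combined with a linked `CancelAfter` token, is shown below. `RequestTimeout` is a hypothetical helper; how the blocking SDK call is wrapped into the delegate is left to the caller.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RequestTimeout
{
    // Runs a blocking call on a worker thread and stops waiting for it after
    // timeout. The token passed to the call is cancelled on timeout or when ct
    // is cancelled, so a cooperative loop (for example between MoveNext calls)
    // can exit early.
    public static async Task<T> RunAsync<T>(
        Func<CancellationToken, T> blockingCall, TimeSpan timeout, CancellationToken ct)
    {
        using (var linked = CancellationTokenSource.CreateLinkedTokenSource(ct))
        {
            linked.CancelAfter(timeout);

            Task<T> work = Task.Run(() => blockingCall(linked.Token));
            Task deadline = Task.Delay(Timeout.Infinite, linked.Token);

            Task first = await Task.WhenAny(work, deadline).ConfigureAwait(false);
            if (first == work)
            {
                linked.Cancel(); // release the pending delay task
                return await work.ConfigureAwait(false);
            }

            // The deadline fired: either the caller cancelled or the timeout elapsed.
            ct.ThrowIfCancellationRequested();
            throw new TimeoutException(
                $"Historian request exceeded {timeout.TotalSeconds:0.###} s.");
        }
    }
}
```

This bounds only the wait, not the work: a call that is truly hung inside the SDK keeps its worker thread until it returns, so a connection reset would probably still be needed after a timeout.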

**Resolution:** _(open)_

### Driver.Historian.Wonderware-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `Backend/HistorianDataSource.cs:9-12`, `Backend/IHistorianDataSource.cs:9-11`, `Backend/HistorianSample.cs:7-9`, `Backend/HistorianConfiguration.cs:7-9` |
| Status | Open |

**Description:** Several XML doc comments reference the retired v1 architecture as
if it were current: "inside Galaxy.Host", "the Proxy maps returned samples", "the
Host returns these across the IPC boundary as `GalaxyDataValue`", "Populated from
... the Proxy DriverInstance.DriverConfig". Per `CLAUDE.md`, PR 7.2 retired the
`Galaxy.Host` / `Galaxy.Proxy` / `Galaxy.Shared` projects, and this driver is now a
standalone sidecar whose client is the .NET 10 `WonderwareHistorianClient`
(`docs/AlarmTracking.md`). The comments are stale and misdescribe the current data
flow, which contradicts the "no stale design docs/comments" expectation in the
review checklist.

**Recommendation:** Update the doc comments to describe the current sidecar/IPC
architecture (sidecar talking to `WonderwareHistorianClient` over the named pipe),
dropping the `Galaxy.Host` / `Proxy` / `GalaxyDataValue` references.

**Resolution:** _(open)_

### Driver.Historian.Wonderware-012

| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `Backend/HistorianDataSource.cs`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` |
| Status | Open |

**Description:** The unit-test suite covers `HistorianQualityMapper`,
`HistorianClusterEndpointPicker`, `SdkAlarmHistorianWriteBackend`,
`AahClientManagedAlarmEventWriter`, the IPC round trip, and `Program` alarm-writer
wiring. `HistorianDataSource` itself (the largest and most logic-dense file in
the module) has no direct unit coverage of its read paths, despite
`IHistorianConnectionFactory` being explicitly extracted "so tests can inject
fakes that control connection success, failure, and timeout behavior". The
connect-failover-and-cooldown loop (`ConnectToAnyHealthyNode`), the mid-query
connection-reset path (`HandleConnectionError`), the string-vs-numeric value
selection (see -003), the at-time per-timestamp loop, and `ExtractAggregateValue`
column dispatch are all untested. A stale empty test directory
(`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/`, containing only
`bin/obj`) also sits alongside the live `tests/Drivers/...` project and should be
removed to avoid confusion.

**Recommendation:** Add `HistorianDataSource` tests driving an
`IHistorianConnectionFactory` fake, covering failover, cooldown, mid-query reset,
cancellation, and the value-type selection, and delete the stale empty
`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory.

**Resolution:** _(open)_