Files
lmxopcua/code-reviews/Driver.Historian.Wonderware/findings.md
Joseph Doherty f982fa1f69 fix(driver-historian-wonderware): resolve High code-review finding (Driver.Historian.Wonderware-001)
WriteToReadOnlyFile was listed in MalformedErrors, so ClassifyOutcome/
MapOutcome routed it to PermanentFail and the store-and-forward sink
dead-lettered every alarm event in the batch. But WriteToReadOnlyFile is
a connection-configuration fault (the write session was opened without
ReadOnly = false), not an event-payload fault — treating it as permanent
silently and permanently discards alarm events on a misconfigured or
regressed connection, which is data loss.

Move WriteToReadOnlyFile from MalformedErrors into ConnectionErrors. The
batch loop now aborts the batch, resets the connection (so the reconnect
path re-opens a writable ReadOnly = false session), and defers the
events as RetryPlease for the next drain tick.

Updated the ClassifyOutcome theory data and added a dedicated regression
test pinning WriteToReadOnlyFile -> RetryPlease.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:59:40 -04:00

15 KiB

Code Review — Driver.Historian.Wonderware

Field Value
Module src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware
Reviewer Claude Code
Review date 2026-05-22
Commit reviewed 76d35d1
Status Reviewed
Open findings 11

Checklist coverage

A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.

# Category Result
1 Correctness and logic bugs Driver.Historian.Wonderware-001, -002, -003, -004
2 OtOpcUa conventions No issues found
3 Concurrency and thread safety Driver.Historian.Wonderware-005
4 Error handling and resilience Driver.Historian.Wonderware-006, -007, -008
5 Security No issues found
6 Performance and resource management Driver.Historian.Wonderware-009, -010
7 Design-document adherence Driver.Historian.Wonderware-011
8 Code organization and conventions No issues found
9 Testing coverage Driver.Historian.Wonderware-012
10 Documentation and comments No issues found

Findings

Driver.Historian.Wonderware-001

Field Value
Severity High
Category Correctness and logic bugs
Location Backend/SdkAlarmHistorianWriteBackend.cs:68, Backend/AahClientManagedAlarmEventWriter.cs:82-103
Status Resolved

Description: MalformedErrors includes HistorianAccessError.ErrorValue.WriteToReadOnlyFile. When ClassifyOutcome routes that code through MapOutcome, isMalformedInput is true, so the per-event result becomes PermanentFail and the lmxopcua-side store-and-forward sink dead-letters the alarm event. But WriteToReadOnlyFile is not a property of the event payload; it is a connection-configuration fault (the write backend opened the session without ReadOnly set to false, or the SDK defaulted it). Treating it as permanent means a misconfigured or regressed connection would silently and permanently discard every alarm event in the batch instead of deferring them for retry once the connection is corrected. Alarm-event historization is the module's whole purpose, so this is data loss.

Recommendation: Move WriteToReadOnlyFile out of MalformedErrors. It should be treated as a connection-class error (abort the batch, reset the connection so the reconnect path can re-open with ReadOnly = false) or at minimum as RetryPlease, never PermanentFail.

Resolution: Resolved 2026-05-22 — moved WriteToReadOnlyFile from MalformedErrors into ConnectionErrors so the batch loop aborts, resets the connection (re-opening with ReadOnly = false), and defers the events as RetryPlease instead of dead-lettering them.

Driver.Historian.Wonderware-002

Field Value
Severity Medium
Category Correctness and logic bugs
Location Ipc/HistorianFrameHandler.cs:162, :181
Status Open

Description: HandleWriteAlarmEventsAsync dereferences req.Events.Length in both the _alarmWriter is null branch (line 162) and the catch block (line 181). MessagePack deserializes an absent or explicit-nil array field as a null reference, not Array.Empty<T>(). A client (or a buggy/hostile peer) that sends a WriteAlarmEventsRequest with a null Events array triggers a NullReferenceException. Although RunOneConnectionAsync would log it and accept the next connection, the request gets no reply frame, so the client correlation-id wait hangs until its own timeout. AahClientManagedAlarmEventWriter.WriteAsync already null-guards events; the frame handler does not.

Recommendation: Normalize req.Events to Array.Empty<AlarmHistorianEventDto>() immediately after deserialization (or guard each .Length access), consistent with the null-tolerance the writer already has.

Resolution: (open)

Driver.Historian.Wonderware-003

Field Value
Severity Medium
Category Correctness and logic bugs
Location Backend/HistorianDataSource.cs:320-323, :457-460
Status Open

Description: Raw and at-time reads decide whether a sample is a string or a numeric with if (!string.IsNullOrEmpty(result.StringValue) && result.Value == 0). The result.Value == 0 clause is intended to distinguish a real numeric zero from a string tag whose numeric projection is zero, but it is wrong in both directions: a numeric (analog) tag that legitimately sampled the value 0 while the SDK also populates a non-empty StringValue (some Historian builds populate the formatted text on every result) is reported to OPC UA as a string, changing the variable data type mid-stream; conversely a string tag whose numeric projection is non-zero is reported as a numeric. The historian SDK exposes the tag actual data type, which should drive the branch instead of a value heuristic.

Recommendation: Select string vs. numeric from the SDK result tag-data-type field rather than from Value == 0. If the type field is genuinely unavailable in the bound SDK version, document the limitation explicitly and prefer numeric for analog/integer tags.

Resolution: (open)

Driver.Historian.Wonderware-004

Field Value
Severity Low
Category Correctness and logic bugs
Location Backend/SdkAlarmHistorianWriteBackend.cs:198-201
Status Open

Description: ToHistorianEvent only assigns historianEvent.Id when Guid.TryParse(dto.EventId, ...) succeeds. If EventId is not a parseable GUID (or is empty), Id stays Guid.Empty and the event is written to the historian with an all-zeros identifier. Multiple such events collide on the same id, and the write is still accepted (outcomes[i] = Ack) so neither side detects the problem. The non-parseable case is never logged.

Recommendation: Log a warning when EventId fails to parse, and either reject the event as PermanentFail (malformed input) or synthesize a fresh Guid.NewGuid() so each event still gets a unique id.

Resolution: (open)

Driver.Historian.Wonderware-005

Field Value
Severity Low
Category Concurrency and thread safety
Location Backend/HistorianDataSource.cs:124, :126-127
Status Open

Description: GetHealthSnapshot reads _activeProcessNode and _activeEventNode inside _healthLock, but those two fields are written under _connectionLock / _eventConnectionLock (lines 183, 243, 209-210, 266-269) — a different lock. The health-counter fields are correctly _healthLock-protected, but the active-node strings are published under one lock and read under another, so the snapshot can observe a stale active-node value relative to the connection-open booleans. This is a diagnostics-only path, so impact is limited to a momentarily inconsistent health snapshot.

Recommendation: Pick one lock for the active-node strings (publish them under _healthLock on every connection state change, or read them under the connection lock), so the snapshot is internally consistent.

Resolution: (open)

Driver.Historian.Wonderware-006

Field Value
Severity Medium
Category Error handling and resilience
Location Ipc/PipeServer.cs:120-128
Status Open

Description: RunAsync re-accepts connections in a while loop. If RunOneConnectionAsync throws synchronously and immediately on every iteration (for example new NamedPipeServerStream(...) fails because the pipe name is already in use, or PipeAcl.Create throws), the loop spins with no delay and no backoff, pegging a CPU core and flooding the rolling log file with one Error line per iteration. There is no circuit-breaker or retry cap.

Recommendation: Add a short delay (exponential backoff capped at a few seconds) before re-accepting after a caught exception, and consider a consecutive-failure threshold that escalates to a fatal exit so the supervisor can restart the sidecar cleanly.

Resolution: (open)

Driver.Historian.Wonderware-007

Field Value
Severity Low
Category Error handling and resilience
Location Ipc/PipeServer.cs:70-75
Status Open

Description: When VerifyCaller rejects the peer SID, the server logs the reason and calls _current.Disconnect() with no HelloAck frame sent. The shared-secret-mismatch and major-version-mismatch paths below it both send a rejecting HelloAck so the client learns why. A client that fails the SID check instead sees an abrupt disconnect and must rely on its own read timeout, with no diagnostic on the client side. The asymmetry also makes the SID-rejection path harder to test from the client.

Recommendation: Send a HelloAck with Accepted = false and a caller-sid-mismatch reject reason before disconnecting, consistent with the other two rejection paths.

Resolution: (open)

Driver.Historian.Wonderware-008

Field Value
Severity Low
Category Error handling and resilience
Location Backend/HistorianDataSource.cs:301-307, :374-380
Status Open

Description: When query.StartQuery returns false, ReadRawAsync and ReadAggregateAsync call HandleConnectionError() and return an empty result list. A failed StartQuery is not necessarily a connection failure — it can be a bad tag name, an invalid time range, or an unsupported aggregate — yet the code unconditionally tears down the shared SDK connection. A burst of queries with one bad tag name therefore repeatedly drops and re-opens the (relatively expensive) historian connection and marks the cluster node failed via HandleConnectionError into _picker.MarkFailed, which can push an otherwise healthy node into cooldown. The empty-list result is also indistinguishable from "no data in range" to the caller — the Success flag on the reply will still be true.

Recommendation: Inspect error.ErrorCode to distinguish connection-class failures (reset and mark node failed) from query-class failures (leave the connection intact, surface the error). Consider returning a failed reply (Success = false) for query-class StartQuery failures so the client does not treat an SDK error as an empty history.

Resolution: (open)

Driver.Historian.Wonderware-009

Field Value
Severity Medium
Category Performance and resource management
Location Backend/HistorianDataSource.cs:382-395, Ipc/Contracts.cs:85-99
Status Open

Description: ReadAggregateAsync drains query.MoveNext into results with no upper bound, unlike ReadRawAsync, which honours maxValues / MaxValuesPerRead and breaks. ReadProcessedRequest carries no max-buckets field. A processed read over a wide time range with a small IntervalMs produces an unbounded HistorianAggregateSample list; the handler then serializes it into ReadProcessedReply. If the serialized body exceeds the 16 MiB Framing.MaxFrameBodyBytes cap, FrameWriter.WriteAsync throws and the entire reply is lost (the client correlation wait hangs), and before that point the sidecar holds the whole result set in memory.

Recommendation: Apply _config.MaxValuesPerRead as a bucket cap in ReadAggregateAsync (mirroring the raw path), and/or add a MaxBuckets field to ReadProcessedRequest. Reject or truncate result sets that would exceed the frame cap with an explicit error reply rather than letting WriteAsync throw.

Resolution: (open)

Driver.Historian.Wonderware-010

Field Value
Severity Low
Category Performance and resource management
Location Backend/HistorianConfiguration.cs:32-36, Backend/HistorianDataSource.cs (all read methods)
Status Open

Description: HistorianConfiguration.RequestTimeoutSeconds is documented as the "outer safety timeout applied to sync-over-async Historian operations" and is copied around (SdkAlarmHistorianWriteBackend.CloneConfigWithServerName:346), but it is never read or enforced anywhere. The HistorianDataSource read methods are declared Task-returning but execute the SDK calls synchronously on the caller thread and only check the CancellationToken between MoveNext iterations. There is no outer timeout: a hung StartQuery or a slow MoveNext blocks the single pipe-server connection thread indefinitely (the connect path has its own poll timeout, but the query path does not). The documented safety net does not exist.

Recommendation: Either wire RequestTimeoutSeconds into the read paths (a CancellationTokenSource.CancelAfter linked into ct, or run the SDK call on a worker with a bounded wait), or remove the property and its XML doc so the code does not advertise a guarantee it does not provide.

Resolution: (open)

Driver.Historian.Wonderware-011

Field Value
Severity Low
Category Design-document adherence
Location Backend/HistorianDataSource.cs:9-12, Backend/IHistorianDataSource.cs:9-11, Backend/HistorianSample.cs:7-9, Backend/HistorianConfiguration.cs:7-9
Status Open

Description: Several XML doc comments reference the retired v1 architecture as if it were current: "inside Galaxy.Host", "the Proxy maps returned samples", "the Host returns these across the IPC boundary as GalaxyDataValue", "Populated from ... the Proxy DriverInstance.DriverConfig". Per CLAUDE.md, PR 7.2 retired the Galaxy.Host / Galaxy.Proxy / Galaxy.Shared projects, and this driver is now a standalone sidecar whose client is the .NET 10 WonderwareHistorianClient (docs/AlarmTracking.md). The comments are stale and misdescribe the current data flow, which contradicts the "no stale design docs/comments" expectation in the review checklist.

Recommendation: Update the doc comments to describe the current sidecar/IPC architecture (sidecar talking to WonderwareHistorianClient over the named pipe), dropping the Galaxy.Host / Proxy / GalaxyDataValue references.

Resolution: (open)

Driver.Historian.Wonderware-012

Field Value
Severity Low
Category Testing coverage
Location Backend/HistorianDataSource.cs, tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/
Status Open

Description: The unit-test suite covers HistorianQualityMapper, HistorianClusterEndpointPicker, SdkAlarmHistorianWriteBackend, AahClientManagedAlarmEventWriter, the IPC round trip, and Program alarm-writer wiring. HistorianDataSource itself — the largest and most logic-dense file in the module — has no direct unit coverage of its read paths, despite IHistorianConnectionFactory being explicitly extracted "so tests can inject fakes that control connection success, failure, and timeout behavior". The connect-failover-and-cooldown loop (ConnectToAnyHealthyNode), the mid-query connection-reset path (HandleConnectionError), the string-vs-numeric value selection (see -003), the at-time per-timestamp loop, and ExtractAggregateValue column dispatch are all untested. A stale empty test directory (tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/, containing only bin/obj) also sits alongside the live tests/Drivers/... project and should be removed to avoid confusion.

Recommendation: Add HistorianDataSource tests driving an IHistorianConnectionFactory fake — covering failover, cooldown, mid-query reset, cancellation, and the value-type selection — and delete the stale empty tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/ directory.

Resolution: (open)