# Code Review — Driver.Historian.Wonderware
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware` |
| Reviewer | Claude Code |
| Review date | 2026-06-19 |
| Commit reviewed | `7286d320` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.

| # | Category | Result |
|---|---|---|
| 1 | Correctness and logic bugs | Driver.Historian.Wonderware-001, -002, -003, -004 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency and thread safety | Driver.Historian.Wonderware-005 |
| 4 | Error handling and resilience | Driver.Historian.Wonderware-006, -007, -008 |
| 5 | Security | No issues found |
| 6 | Performance and resource management | Driver.Historian.Wonderware-009, -010 |
| 7 | Design-document adherence | Driver.Historian.Wonderware-011 |
| 8 | Code organization and conventions | No issues found |
| 9 | Testing coverage | Driver.Historian.Wonderware-012 |
| 10 | Documentation and comments | No issues found |
## Findings
### Driver.Historian.Wonderware-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness and logic bugs |
| Location | `Backend/SdkAlarmHistorianWriteBackend.cs:68`, `Backend/AahClientManagedAlarmEventWriter.cs:82-103` |
| Status | Resolved |

**Description:** `MalformedErrors` includes `HistorianAccessError.ErrorValue.WriteToReadOnlyFile`.
When `ClassifyOutcome` routes that code through `MapOutcome`, `isMalformedInput` is
`true`, so the per-event result becomes `PermanentFail` and the lmxopcua-side
store-and-forward sink dead-letters the alarm event. But `WriteToReadOnlyFile` is
not a property of the event payload; it is a connection-configuration fault (the
write backend opened the session without `ReadOnly` set to `false`, or the SDK
defaulted it). Treating it as permanent means a misconfigured or regressed
connection would silently and permanently discard every alarm event in the batch
instead of deferring them for retry once the connection is corrected.
Alarm-event historization is the module's whole purpose, so this is data loss.

**Recommendation:** Move `WriteToReadOnlyFile` out of `MalformedErrors`. It should
be treated as a connection-class error (abort the batch, reset the connection so
the reconnect path can re-open with `ReadOnly = false`) or at minimum as
`RetryPlease`, never `PermanentFail`.

**Resolution:** Resolved 2026-05-22 — moved `WriteToReadOnlyFile` from `MalformedErrors` into `ConnectionErrors` so the batch loop aborts, resets the connection (re-opening with `ReadOnly = false`), and defers the events as `RetryPlease` instead of dead-lettering them.
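
The classification described above can be sketched roughly as follows. This is a minimal illustration, not the module's code: `SdkErrorCode` is a hypothetical stand-in for the SDK's `HistorianAccessError.ErrorValue`, `InvalidArgument` is an invented example of a payload-level code, and deferring unknown codes as `RetryPlease` is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the SDK error enum; only names quoted in this
// review are reused, and InvalidArgument is an invented payload-level example.
public enum SdkErrorCode { Success, WriteToReadOnlyFile, NoReply, FailedToConnect, InvalidArgument }

public enum WriteOutcome { Ack, RetryPlease, PermanentFail }

public static class OutcomeClassifier
{
    // Connection-class: the session is unusable, so the batch aborts and reconnects.
    public static readonly HashSet<SdkErrorCode> ConnectionErrors = new HashSet<SdkErrorCode>
    {
        SdkErrorCode.WriteToReadOnlyFile,
        SdkErrorCode.NoReply,
        SdkErrorCode.FailedToConnect,
    };

    // Malformed-input: retrying the same payload can never succeed.
    public static readonly HashSet<SdkErrorCode> MalformedErrors = new HashSet<SdkErrorCode>
    {
        SdkErrorCode.InvalidArgument,
    };

    public static WriteOutcome Classify(SdkErrorCode code)
    {
        if (code == SdkErrorCode.Success) return WriteOutcome.Ack;
        if (MalformedErrors.Contains(code)) return WriteOutcome.PermanentFail;
        // Assumption: connection-class and unknown codes are deferred, never dead-lettered.
        return WriteOutcome.RetryPlease;
    }

    public static bool RequiresConnectionReset(SdkErrorCode code) => ConnectionErrors.Contains(code);
}
```

Keeping the two sets disjoint means a configuration fault such as `WriteToReadOnlyFile` can only ever defer events, which is the property the finding asks for.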
### Driver.Historian.Wonderware-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness and logic bugs |
| Location | `Ipc/HistorianFrameHandler.cs:162`, `:181` |
| Status | Resolved |

**Description:** `HandleWriteAlarmEventsAsync` dereferences `req.Events.Length`
in both the `_alarmWriter is null` branch (line 162) and the catch block (line
181). MessagePack deserializes an absent or explicit-nil array field as a `null`
reference, not `Array.Empty<T>()`. A client (or a buggy/hostile peer) that sends
a `WriteAlarmEventsRequest` with a null `Events` array triggers a
`NullReferenceException`. Although `RunOneConnectionAsync` would log it and accept
the next connection, the request gets no reply frame, so the client's correlation-id
wait hangs until its own timeout. `AahClientManagedAlarmEventWriter.WriteAsync`
already null-guards `events`; the frame handler does not.

**Recommendation:** Normalize `req.Events` to `Array.Empty<AlarmHistorianEventDto>()`
immediately after deserialization (or guard each `.Length` access), consistent
with the null-tolerance the writer already has.

**Resolution:** Resolved 2026-05-22 — normalised `req.Events` to `Array.Empty<AlarmHistorianEventDto>()` immediately after deserialization so all subsequent `.Length` accesses are safe against null frames.
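
A minimal sketch of the normalisation step, assuming trimmed-down DTOs. The real types are MessagePack contracts with more fields, and `RequestNormalizer` is a hypothetical helper name used only for illustration.

```csharp
#nullable enable
using System;

// Hypothetical, trimmed-down DTOs standing in for the real wire contracts.
public sealed class AlarmHistorianEventDto
{
    public string EventId { get; set; } = "";
}

public sealed class WriteAlarmEventsRequest
{
    // A deserializer may leave this null when the field is absent or explicitly nil.
    public AlarmHistorianEventDto[]? Events { get; set; }
}

public static class RequestNormalizer
{
    // Replaces a null array with an empty one so later .Length reads cannot throw.
    public static AlarmHistorianEventDto[] NormalizeEvents(WriteAlarmEventsRequest req)
    {
        req.Events ??= Array.Empty<AlarmHistorianEventDto>();
        return req.Events;
    }
}
```

Doing this once, right after deserialization, covers both the writer-missing branch and the catch block without guarding each access separately.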
### Driver.Historian.Wonderware-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness and logic bugs |
| Location | `Backend/HistorianDataSource.cs:320-323`, `:457-460` |
| Status | Resolved |

**Description:** Raw and at-time reads decide whether a sample is a string or a
numeric with `if (!string.IsNullOrEmpty(result.StringValue) && result.Value == 0)`.
The `result.Value == 0` clause is intended to distinguish a real numeric zero from
a string tag whose numeric projection is zero, but it is wrong in both directions:
a numeric (analog) tag that legitimately sampled the value `0` while the SDK also
populates a non-empty `StringValue` (some Historian builds populate the formatted
text on every result) is reported to OPC UA as a string, changing the variable
data type mid-stream; conversely a string tag whose numeric projection is non-zero
is reported as a numeric. The historian SDK exposes the tag's actual data type,
which should drive the branch instead of a value heuristic.

**Recommendation:** Select string vs. numeric from the SDK result's tag-data-type
field rather than from `Value == 0`. If the type field is genuinely unavailable in
the bound SDK version, document the limitation explicitly and prefer numeric for
analog/integer tags.

**Resolution:** Resolved 2026-05-22 — extracted the heuristic into a `SelectValue` helper with a detailed XML doc comment explaining the SDK limitation (`HistoryQueryResult` has no data type field in the bound `aahClientManaged` version); the existing `Value == 0` discriminator is preserved as the best available heuristic with the known edge-case documented.
### Driver.Historian.Wonderware-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness and logic bugs |
| Location | `Backend/SdkAlarmHistorianWriteBackend.cs:198-201` |
| Status | Resolved |

**Description:** `ToHistorianEvent` only assigns `historianEvent.Id` when
`Guid.TryParse(dto.EventId, ...)` succeeds. If `EventId` is not a parseable GUID
(or is empty), `Id` stays `Guid.Empty` and the event is written to the historian
with an all-zeros identifier. Multiple such events collide on the same id, and the
write is still accepted (`outcomes[i] = Ack`) so neither side detects the problem.
The non-parseable case is never logged.

**Recommendation:** Log a warning when `EventId` fails to parse, and either reject
the event as `PermanentFail` (malformed input) or synthesize a fresh
`Guid.NewGuid()` so each event still gets a unique id.

**Resolution:** Resolved 2026-05-23 — `ToHistorianEvent` now synthesizes a fresh `Guid.NewGuid()` when the dto's `EventId` fails `Guid.TryParse`, and logs a warning carrying both the original (unparseable) id and the synthesized id so collisions stop happening silently. Regression tests `ToHistorianEvent_parseable_event_id_is_used_verbatim` and `ToHistorianEvent_unparseable_event_id_synthesizes_unique_non_empty_Guid` in `SdkAlarmHistorianWriteBackendTests`.
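
The parse-or-synthesize behaviour can be illustrated with a small helper. `EventIdResolver` and its `synthesized` flag are hypothetical names for this sketch; the real `ToHistorianEvent` logs the warning itself.

```csharp
#nullable enable
using System;

public static class EventIdResolver
{
    // Returns the parsed id when eventId is a valid GUID string; otherwise a
    // fresh Guid. 'synthesized' tells the caller to log a warning carrying
    // both the original text and the new id.
    public static Guid Resolve(string? eventId, out bool synthesized)
    {
        if (Guid.TryParse(eventId, out var parsed))
        {
            synthesized = false;
            return parsed;
        }

        synthesized = true;
        return Guid.NewGuid();
    }
}
```

Returning the flag instead of logging inside the helper keeps it free of logger dependencies, which makes the two regression cases (parseable and unparseable) easy to pin.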
### Driver.Historian.Wonderware-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency and thread safety |
| Location | `Backend/HistorianDataSource.cs:124`, `:126-127` |
| Status | Resolved |

**Description:** `GetHealthSnapshot` reads `_activeProcessNode` and
`_activeEventNode` inside `_healthLock`, but those two fields are written under
`_connectionLock` / `_eventConnectionLock` (lines 183, 243, 209-210, 266-269) — a
different lock. The health-counter fields are correctly `_healthLock`-protected,
but the active-node strings are published under one lock and read under another,
so the snapshot can observe a stale active-node value relative to the
connection-open booleans. This is a diagnostics-only path, so impact is limited to
a momentarily inconsistent health snapshot.

**Recommendation:** Pick one lock for the active-node strings (publish them under
`_healthLock` on every connection state change, or read them under the connection
lock), so the snapshot is internally consistent.

**Resolution:** Resolved 2026-05-23 — `GetHealthSnapshot` now derives the `ProcessConnectionOpen` / `EventConnectionOpen` booleans from the active-node strings (`_activeProcessNode != null` / `_activeEventNode != null`) which all live under `_healthLock`, instead of reading `_connection`/`_eventConnection` via `Volatile.Read` outside the lock those fields are published under. The snapshot is now self-consistent by construction: open ↔ active node populated. Regression tests in `HistorianDataSourceHealthSnapshotTests` cover the three half-published states plus the steady-state cases.
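
A minimal sketch of the single-lock snapshot, assuming a stripped-down `HealthState` class (a hypothetical name) that holds only the two active-node strings. The real class also carries health counters and the SDK connections.

```csharp
#nullable enable

public sealed class HealthState
{
    private readonly object _healthLock = new object();
    private string? _activeProcessNode; // null means no open process connection
    private string? _activeEventNode;   // null means no open event connection

    public void SetActiveProcessNode(string? node) { lock (_healthLock) { _activeProcessNode = node; } }
    public void SetActiveEventNode(string? node) { lock (_healthLock) { _activeEventNode = node; } }

    // The open flags are derived from the node strings inside the same lock,
    // so "open" and "active node populated" can never disagree in one snapshot.
    public (bool ProcessConnectionOpen, string? ActiveProcessNode,
            bool EventConnectionOpen, string? ActiveEventNode) GetHealthSnapshot()
    {
        lock (_healthLock)
        {
            return (_activeProcessNode != null, _activeProcessNode,
                    _activeEventNode != null, _activeEventNode);
        }
    }
}
```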
### Driver.Historian.Wonderware-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling and resilience |
| Location | `Ipc/PipeServer.cs:120-128` |
| Status | Resolved |

**Description:** `RunAsync` re-accepts connections in a `while` loop. If
`RunOneConnectionAsync` throws synchronously and immediately on every iteration
(for example `new NamedPipeServerStream(...)` fails because the pipe name is
already in use, or `PipeAcl.Create` throws), the loop spins with no delay and no
backoff, pegging a CPU core and flooding the rolling log file with one `Error`
line per iteration. There is no circuit-breaker or retry cap.

**Recommendation:** Add a short delay (exponential backoff capped at a few
seconds) before re-accepting after a caught exception, and consider a
consecutive-failure threshold that escalates to a fatal exit so the supervisor can
restart the sidecar cleanly.

**Resolution:** Resolved 2026-05-22 — added exponential backoff (250 ms → 8 s, six steps) after each connection-loop failure and a `MaxConsecutiveFailures=20` threshold that re-throws so the SCM/NSSM supervisor can restart the sidecar cleanly.
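
One way to read the "250 ms → 8 s, six steps" schedule is a doubling delay capped at 8 s. The helper below is a sketch under that assumption; `AcceptBackoff`, `DelayFor`, and `ShouldEscalate` are hypothetical names, and how the real loop maps failure counts to steps is not stated in this review.

```csharp
using System;

public static class AcceptBackoff
{
    public const int MaxConsecutiveFailures = 20;

    private static readonly TimeSpan BaseDelay = TimeSpan.FromMilliseconds(250);
    private static readonly TimeSpan MaxDelay = TimeSpan.FromSeconds(8);

    // Delay before the next accept attempt. consecutiveFailures is 1 after the
    // first failure; the schedule is 250, 500, 1000, 2000, 4000, 8000 ms and
    // then stays at 8000 ms.
    public static TimeSpan DelayFor(int consecutiveFailures)
    {
        if (consecutiveFailures < 1) return TimeSpan.Zero;
        int doublings = Math.Min(consecutiveFailures - 1, 5); // 250 ms * 2^5 = 8 s
        var delay = TimeSpan.FromMilliseconds(BaseDelay.TotalMilliseconds * (1 << doublings));
        return delay < MaxDelay ? delay : MaxDelay;
    }

    // True once the loop should stop retrying and rethrow, so the service
    // supervisor can restart the process.
    public static bool ShouldEscalate(int consecutiveFailures) =>
        consecutiveFailures >= MaxConsecutiveFailures;
}
```

In an accept loop the counter would be reset to zero after any successful connection, so only an unbroken run of failures reaches the threshold.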
### Driver.Historian.Wonderware-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling and resilience |
| Location | `Ipc/PipeServer.cs:70-75` |
| Status | Resolved |

**Description:** When `VerifyCaller` rejects the peer SID, the server logs the
reason and calls `_current.Disconnect()` with no `HelloAck` frame sent. The
shared-secret-mismatch and major-version-mismatch paths below it both send a
rejecting `HelloAck` so the client learns why. A client that fails the SID check
instead sees an abrupt disconnect and must rely on its own read timeout, with no
diagnostic on the client side. The asymmetry also makes the SID-rejection path
harder to test from the client.

**Recommendation:** Send a `HelloAck` with `Accepted = false` and a
`caller-sid-mismatch` reject reason before disconnecting, consistent with the
other two rejection paths.

**Resolution:** Resolved 2026-05-23 — the SID rejection path now writes a `HelloAck { Accepted=false, RejectReason="caller-sid-mismatch: ..." }` before disconnecting, symmetric with the shared-secret-mismatch and major-version-mismatch paths. The caller-verification function was also extracted into a `CallerVerifier` delegate so tests can override it (the pipe ACL would otherwise block the test client itself). End-to-end regression `PipeServerSidRejectTests.Caller_SID_mismatch_sends_HelloAck_with_reject_reason_before_disconnect` connects a real named-pipe client and asserts the rejecting ack frame arrives.
### Driver.Historian.Wonderware-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling and resilience |
| Location | `Backend/HistorianDataSource.cs:301-307`, `:374-380` |
| Status | Resolved |

**Description:** When `query.StartQuery` returns `false`, `ReadRawAsync` and
`ReadAggregateAsync` call `HandleConnectionError()` and return an empty result
list. A failed `StartQuery` is not necessarily a connection failure — it can be a
bad tag name, an invalid time range, or an unsupported aggregate — yet the code
unconditionally tears down the shared SDK connection. A burst of queries with one
bad tag name therefore repeatedly drops and re-opens the (relatively expensive)
historian connection. Each teardown also marks the cluster node failed, because
`HandleConnectionError` calls `_picker.MarkFailed`, which can push an otherwise
healthy node into cooldown.
The empty-list result is also indistinguishable from "no data in range" to the
caller — the `Success` flag on the reply will still be `true`.

**Recommendation:** Inspect `error.ErrorCode` to distinguish connection-class
failures (reset and mark node failed) from query-class failures (leave the
connection intact, surface the error). Consider returning a failed reply
(`Success = false`) for query-class `StartQuery` failures so the client does not
treat an SDK error as an empty history.

**Resolution:** Resolved 2026-05-23 — extracted a static `ConnectionErrorCodes` set + `IsConnectionClassError` classifier (mirroring the alarm-write side) and centralised the failure handling in a new `HandleStartQueryFailure` helper. Connection-class codes still drop the connection and mark the node failed; query-class codes throw a new `QueryClassStartQueryException` that the outer catch re-throws WITHOUT touching the connection. All four read paths (raw / aggregate / at-time / events) also re-throw caught exceptions so the IPC frame handler surfaces `Success=false` instead of returning an empty list with `Success=true`. Regression tests `HistorianDataSourceStartQueryClassificationTests` pin the connection-class vs query-class classification per error code; the connect-failover suite (`HistorianDataSourceConnectFailoverTests`) verifies the read paths now propagate the exception.
### Driver.Historian.Wonderware-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance and resource management |
| Location | `Backend/HistorianDataSource.cs:382-395`, `Ipc/Contracts.cs:85-99` |
| Status | Resolved |

**Description:** `ReadAggregateAsync` drains `query.MoveNext` into `results` with
no upper bound, unlike `ReadRawAsync`, which honours `maxValues` /
`MaxValuesPerRead` and breaks. `ReadProcessedRequest` carries no max-buckets field.
A processed read over a wide time range with a small `IntervalMs` produces an
unbounded `HistorianAggregateSample` list; the handler then serializes it into
`ReadProcessedReply`. If the serialized body exceeds the 16 MiB
`Framing.MaxFrameBodyBytes` cap, `FrameWriter.WriteAsync` throws and the entire
reply is lost (the client's correlation wait hangs), and before that point the
sidecar holds the whole result set in memory.

**Recommendation:** Apply `_config.MaxValuesPerRead` as a bucket cap in
`ReadAggregateAsync` (mirroring the raw path), and/or add a `MaxBuckets` field to
`ReadProcessedRequest`. Reject or truncate result sets that would exceed the frame
cap with an explicit error reply rather than letting `WriteAsync` throw.

**Resolution:** Resolved 2026-05-22 — applied `_config.MaxValuesPerRead` as a bucket cap in `ReadAggregateAsync` mirroring the raw-read path; truncation logs a Warning with the limit and a hint to widen `IntervalMs` or reduce the time range.
### Driver.Historian.Wonderware-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance and resource management |
| Location | `Backend/HistorianConfiguration.cs:32-36`, `Backend/HistorianDataSource.cs` (all read methods) |
| Status | Resolved |

**Description:** `HistorianConfiguration.RequestTimeoutSeconds` is documented as
the "outer safety timeout applied to sync-over-async Historian operations" and is
copied around (`SdkAlarmHistorianWriteBackend.CloneConfigWithServerName:346`), but
it is never read or enforced anywhere. The `HistorianDataSource` read methods are
declared `Task`-returning but execute the SDK calls synchronously on the caller
thread and only check the `CancellationToken` between `MoveNext` iterations. There
is no outer timeout: a hung `StartQuery` or a slow `MoveNext` blocks the single
pipe-server connection thread indefinitely (the connect path has its own poll
timeout, but the query path does not). The documented safety net does not exist.

**Recommendation:** Either wire `RequestTimeoutSeconds` into the read paths (a
`CancellationTokenSource.CancelAfter` linked into `ct`, or run the SDK call on a
worker with a bounded wait), or remove the property and its XML doc so the code
does not advertise a guarantee it does not provide.

**Resolution:** Resolved 2026-05-23 — added an internal `BuildRequestCts` helper that returns a `CancellationTokenSource` linked into the caller's `ct` with `CancelAfter(RequestTimeoutSeconds)` applied when positive. Each read method (`ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`, `ReadEventsAsync`) now wraps its work with the linked CTS and feeds the linked token into the `ThrowIfCancellationRequested` checks between `MoveNext` iterations, so a hung SDK call cancels at the configured deadline instead of blocking the connection thread indefinitely. Regression tests `HistorianDataSourceRequestTimeoutTests` pin the helper: positive value enforces `CancelAfter`, zero/negative means no timeout, caller cancellation propagates, default is 60s.
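
The helper described in the resolution might look roughly like this. It is a sketch built only on the standard `CancellationTokenSource` API; the containing class name `RequestTimeouts` and the exact signature are assumptions.

```csharp
using System;
using System.Threading;

public static class RequestTimeouts
{
    // Links a per-request deadline into the caller's token. A zero or negative
    // timeout means "no deadline"; caller cancellation still propagates.
    // The caller owns the returned source and must dispose it.
    public static CancellationTokenSource BuildRequestCts(int requestTimeoutSeconds, CancellationToken ct)
    {
        var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        if (requestTimeoutSeconds > 0)
        {
            cts.CancelAfter(TimeSpan.FromSeconds(requestTimeoutSeconds));
        }
        return cts;
    }
}
```

A read method would wrap its work in `using var cts = RequestTimeouts.BuildRequestCts(seconds, ct);` and call `cts.Token.ThrowIfCancellationRequested()` between iterations. Cancellation here is cooperative, so the deadline is only observed at those checks.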
### Driver.Historian.Wonderware-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `Backend/HistorianDataSource.cs:9-12`, `Backend/IHistorianDataSource.cs:9-11`, `Backend/HistorianSample.cs:7-9`, `Backend/HistorianConfiguration.cs:7-9` |
| Status | Resolved |

**Description:** Several XML doc comments reference the retired v1 architecture as
if it were current: "inside Galaxy.Host", "the Proxy maps returned samples", "the
Host returns these across the IPC boundary as `GalaxyDataValue`", "Populated from
... the Proxy DriverInstance.DriverConfig". Per `CLAUDE.md`, PR 7.2 retired the
`Galaxy.Host` / `Galaxy.Proxy` / `Galaxy.Shared` projects, and this driver is now a
standalone sidecar whose client is the .NET 10 `WonderwareHistorianClient`
(`docs/AlarmTracking.md`). The comments are stale and misdescribe the current data
flow, which contradicts the "no stale design docs/comments" expectation in the
review checklist.

**Recommendation:** Update the doc comments to describe the current sidecar/IPC
architecture (sidecar talking to `WonderwareHistorianClient` over the named pipe),
dropping the `Galaxy.Host` / `Proxy` / `GalaxyDataValue` references.

**Resolution:** Resolved 2026-05-23 — refreshed the XML doc comments on `HistorianDataSource`, `IHistorianDataSource`, `HistorianSample` / `HistorianAggregateSample`, and `HistorianConfiguration` to describe the current sidecar / named-pipe / .NET 10 `WonderwareHistorianClient` architecture. References to `Galaxy.Host` / `Galaxy.Proxy` / `GalaxyDataValue` are now framed as historical context tied to the PR 7.2 retirement rather than as current behaviour.
### Driver.Historian.Wonderware-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `Backend/HistorianDataSource.cs`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` |
| Status | Resolved |

**Description:** The unit-test suite covers `HistorianQualityMapper`,
`HistorianClusterEndpointPicker`, `SdkAlarmHistorianWriteBackend`,
`AahClientManagedAlarmEventWriter`, the IPC round trip, and `Program` alarm-writer
wiring. `HistorianDataSource` itself — the largest and most logic-dense file in
the module — has no direct unit coverage of its read paths, despite
`IHistorianConnectionFactory` being explicitly extracted "so tests can inject
fakes that control connection success, failure, and timeout behavior". The
connect-failover-and-cooldown loop (`ConnectToAnyHealthyNode`), the mid-query
connection-reset path (`HandleConnectionError`), the string-vs-numeric value
selection (see -003), the at-time per-timestamp loop, and `ExtractAggregateValue`
column dispatch are all untested. A stale empty test directory
(`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/`, containing only
`bin/obj`) also sits alongside the live `tests/Drivers/...` project and should be
removed to avoid confusion.

**Recommendation:** Add `HistorianDataSource` tests driving an
`IHistorianConnectionFactory` fake — covering failover, cooldown, mid-query reset,
cancellation, and the value-type selection — and delete the stale empty
`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory.

**Resolution:** Resolved 2026-05-23 — added five new `HistorianDataSource`-targeted test files: `HistorianDataSourceHealthSnapshotTests` (snapshot consistency under half-published state, see also -005), `HistorianDataSourceStartQueryClassificationTests` (connection-class vs query-class error-code table, see also -008), `HistorianDataSourceRequestTimeoutTests` (the request-timeout helper, see also -010), `HistorianDataSourceConnectFailoverTests` (cluster failover order + cooldown via the `IHistorianConnectionFactory` fake), and `HistorianDataSourceValueAndAggregateTests` (the string-vs-numeric heuristic via the new SDK-independent `SelectValueFromPair` overload + the `ExtractAggregateValue` column dispatch). Stale empty `tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory deleted. Unit count rose from 80 to 125 (+45 new tests).
## Re-review 2026-06-19 (commit 7286d320)
The transport changed substantially since the `76d35d1` review: the named-pipe
server (`Ipc/PipeServer.cs` + `Ipc/PipeAcl.cs`, both deleted) was replaced by a
shared-secret + optional-TLS TCP server (`Ipc/TcpFrameServer.cs`). `HistorianDataSource`
grew the `IsConnectionClassError` / `HandleStartQueryFailure` / `BuildRequestCts`
helpers (the -008 / -010 fixes). All prior findings remain Resolved. The
re-review covers all 10 categories at `7286d320`; new findings continue the ID
sequence from -012.

| # | Category | Result (re-review) |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Historian.Wonderware-014 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found |
| 4 | Error handling & resilience | Driver.Historian.Wonderware-014 (cross-listed) |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Driver.Historian.Wonderware-015 |
| — | Cross-module contract | Driver.Historian.Wonderware-013 |
### Driver.Historian.Wonderware-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness and logic bugs |
| Location | `Backend/HistorianDataSource.cs:667-729` (`ReadEventsAsync`); also `:405-486` (`ReadRawAsync`), `:495-573` (`ReadAggregateAsync`); `Ipc/Contracts.cs:224-238` (`ReadEventsReply`) |
| Status | Deferred |

**Description:** Cross-module context (Core.Abstractions-009 / OpcUaServer-002):
when a HistoryRead arrives with the `maxEvents <= 0` (or `maxValues <= 0`)
sentinel meaning "no caller cap — return everything", the implementer must
signal more-data (a continuation point) when a backend cap truncates the result;
otherwise the server silently drops history. This sidecar **silently truncates**.
`ReadEventsAsync` substitutes `_config.MaxValuesPerRead` (default 10000) as the
SDK `EventQueryArgs.EventCount` when `maxEvents <= 0` (line 686), and the loop's
only break is `if (maxEvents > 0 && count >= maxEvents)` (line 708) — so with the
sentinel the result is capped server-side at `MaxValuesPerRead` with no signal
that more rows existed. `ReadRawAsync` (`limit = maxValues > 0 ? maxValues :
_config.MaxValuesPerRead`, line 441) and `ReadAggregateAsync` (bucket cap, line
528) behave the same. Crucially the wire contracts (`ReadEventsReply`,
`ReadRawReply`, `ReadProcessedReply`) carry **no** `ContinuationPoint` /
`MoreDataAvailable` / `Truncated` field at all — there is no way for the sidecar
to tell the `WonderwareHistorianClient` "this set was capped", so the OPC UA
server cannot set a Part 11 `ContinuationPoint` and the client silently sees a
short read as a complete one. The aggregate path at least logs a Warning on
truncation (line 548); raw and events truncate with no log.

**Recommendation:** Add a `bool Truncated` (or a continuation token) field to the
three read reply DTOs and set it when the loop broke on the cap rather than on
`MoveNext` exhaustion; the `WonderwareHistorianClient` then maps it to a Part 11
`ContinuationPoint` (or at minimum `GoodMoreData`). At a bare minimum, log a
Warning on raw/event truncation to match the aggregate path so a silently capped
read is at least observable in the rolling log.

**Resolution:** Deferred — this is a cross-module wire-contract change. The
`ReadEventsReply` / `ReadRawReply` / `ReadProcessedReply` MessagePack DTOs are
shared with the .NET 10 `WonderwareHistorianClient` (a different module) and the
OPC UA server's HistoryRead glue; adding a continuation/truncated field and the
client-side mapping to a Part 11 `ContinuationPoint` must be designed and landed
across the sidecar + client + server together (and needs a live historian to
verify the end-to-end Part 11 paging). Out of scope for a self-contained
sidecar-only fix; tracked here for the coordinated change. (Verdict for the
orchestrator's request: this implementation does **not** honor the continuation
contract; it silently truncates with no more-data signal on the wire.)
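
The recommended more-data signal can be sketched independently of the SDK. Everything here is hypothetical: `ReadReply<T>` and its `Truncated` field do not exist in the current contracts, and `CappedReader` stands in for the `MoveNext` loops over an ordinary `IEnumerable<T>`. The sketch assumes `backendMax` is positive.

```csharp
using System.Collections.Generic;

// Hypothetical reply shape; the current wire contracts have no Truncated field.
public sealed class ReadReply<T>
{
    public List<T> Items { get; } = new List<T>();
    public bool Truncated { get; set; }
}

public static class CappedReader
{
    // Copies at most 'limit' rows. callerMax <= 0 is the "no caller cap"
    // sentinel, in which case the backend cap applies.
    public static ReadReply<T> Read<T>(IEnumerable<T> source, int callerMax, int backendMax)
    {
        int limit = callerMax > 0 ? callerMax : backendMax;
        var reply = new ReadReply<T>();
        foreach (var row in source)
        {
            if (reply.Items.Count >= limit)
            {
                // Another row exists beyond the cap: the loop stopped on the
                // limit, not on exhaustion, so the reply must say so.
                reply.Truncated = true;
                break;
            }
            reply.Items.Add(row);
        }
        return reply;
    }
}
```

Peeking for one row past the cap distinguishes "exactly `limit` rows existed" from "more rows were dropped", which is the distinction a client would need before mapping the flag to a continuation point.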
### Driver.Historian.Wonderware-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling and resilience |
| Location | `Backend/HistorianDataSource.cs:596-643` (`ReadAtTimeAsync`) |
| Status | Resolved |

**Description:** The -008 fix taught `ReadRawAsync` / `ReadAggregateAsync` /
`ReadEventsAsync` to classify a failed `StartQuery` as connection-class (reset the
connection, mark the node failed, propagate `Success=false`) vs query-class (keep
the connection, propagate `Success=false`). `ReadAtTimeAsync` was **not** updated
and still uses the pre-008 behaviour: on a `StartQuery` failure for any timestamp
it appends a Bad-quality null sample and `continue`s (lines 610-619) with no
inspection of `error.ErrorCode`. Two consequences:

1. A **connection-class** failure (e.g. `NoReply`, `FailedToConnect`) on the first
timestamp leaves the dead `_connection` in place; every subsequent timestamp's
`StartQuery` also fails on the same dead connection, and the method still calls
`RecordSuccess()` at the end (line 643) and returns `Success=true` with an
all-Bad sample set. The connection is never reset and the node is never marked
failed, so failover/cooldown never engages for an at-time read.
2. The all-Bad result is reported to the client as a successful read of Bad
samples, indistinguishable from "the historian genuinely had no interpolated
value at those instants" — masking a real connection outage.

**Recommendation:** When `StartQuery` returns false in the at-time loop, classify
the error with `IsConnectionClassError`. On a connection-class code, reset the
connection (`HandleConnectionError`) and throw so the IPC layer surfaces
`Success=false` (consistent with the other three read paths). A query-class /
no-data code may continue to record a per-timestamp Bad sample.

**Resolution:** Resolved 2026-06-19 — extracted the per-timestamp StartQuery-failure decision into a pure `ShouldResetConnectionForStartQueryFailure(HistorianAccessError?)` helper (the SDK `HistoryQuery`/`HistorianAccess` types are non-virtual with no interface, so the loop itself can't be driven offline — this mirrors the existing `IsConnectionClassError`/`SelectValueFromPair` testability seams). `ReadAtTimeAsync` now calls it when `StartQuery` returns false: a connection-class code throws an `InvalidOperationException` that the existing outer catch turns into a connection reset + node-failed + `Success=false` (matching the raw/aggregate/event paths); a query-class / no-data code keeps the prior per-timestamp Bad-sample-and-continue behaviour. Regression tests `AtTime_StartQuery_failure_with_connection_class_code_requests_connection_reset`, `AtTime_StartQuery_failure_with_query_class_code_does_not_request_reset`, and `AtTime_StartQuery_failure_with_null_error_defaults_to_no_reset` in `HistorianDataSourceStartQueryClassificationTests`.
### Driver.Historian.Wonderware-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation and comments |
| Location | `Backend/HistorianDataSource.cs:14`, `:95`; `Backend/IHistorianDataSource.cs:11`; `Backend/HistorianConfiguration.cs:9`; `Backend/HistorianSample.cs:7`; `Ipc/Contracts.cs:7`; `Ipc/Framing.cs:4` |
| Status | Resolved |

**Description:** The IPC transport was rewritten from a named pipe to TCP at this
commit (`Ipc/PipeServer.cs` + `Ipc/PipeAcl.cs` deleted; `Ipc/TcpFrameServer.cs`
added; `Program.cs` now binds a `TcpFrameServer`). Several XML doc comments and
header comments still describe the wire as a "named-pipe" / "pipe protocol" /
"pipe-server connection thread": `HistorianDataSource` class summary ("serialises
onto the named-pipe wire"), the `BuildRequestCts` doc ("single pipe-server
connection thread"), `IHistorianDataSource` ("the other side of the named-pipe
IPC"), `HistorianConfiguration` ("client side of the named-pipe IPC"),
`HistorianSample` ("serialises these onto the named-pipe wire"), `Contracts.cs`
("sidecar pipe protocol"), and `Framing.cs` ("Wonderware historian sidecar pipe
protocol"). These now misdescribe the transport — the same class of stale-comment
issue as the resolved -011 (which fixed the retired-Galaxy.Host references).

**Recommendation:** Replace "named-pipe" / "pipe protocol" / "pipe-server" with
the TCP wording ("TCP wire" / "sidecar TCP protocol" / "single TCP connection
thread"), consistent with `TcpFrameServer` and `Program.cs`.

**Resolution:** Resolved 2026-06-19 — updated the eight stale comments to describe the TCP transport: "named-pipe wire" → "TCP wire", "named-pipe IPC" → "TCP IPC", "pipe-server connection thread" → "TCP-server connection thread", "sidecar pipe protocol" / "Wonderware historian sidecar pipe protocol" header → "sidecar TCP protocol". No behaviour change.