review(Driver.Historian.Wonderware): AtTime fails over on connection-class errors

Re-review at 7286d320. -014 (Medium): ReadAtTimeAsync didn't classify StartQuery failures,
so a connection-class failure left a dead connection, re-failed every timestamp, and returned
Success=true with all-Bad (no failover); now resets+fails over via a shared classifier + tests.
-015: refresh stale named-pipe comments to TCP (no wire change). -013 (silent cap truncation,
ties OpcUaServer-002/Core.Abstractions-009) deferred cross-module. NOTE: the SDK-touching tests
are net48 + native aahClientManaged and run only on Windows; macOS verifies build + the SDK-free
subset only.
Joseph Doherty
2026-06-19 11:47:11 -04:00
parent e07a4fbf52
commit b3907efa6e
8 changed files with 225 additions and 11 deletions
@@ -4,10 +4,10 @@
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-06-19 |
| Commit reviewed | `7286d320` |
| Status | Reviewed |
| Open findings | 0 |
| Open findings | 3 |
## Checklist coverage
@@ -335,3 +335,137 @@ cancellation, and the value-type selection — and delete the stale empty
`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory.
**Resolution:** Resolved 2026-05-23 — added five new `HistorianDataSource`-targeted test files: `HistorianDataSourceHealthSnapshotTests` (snapshot consistency under half-published state, see also -005), `HistorianDataSourceStartQueryClassificationTests` (connection-class vs query-class error-code table, see also -008), `HistorianDataSourceRequestTimeoutTests` (the request-timeout helper, see also -010), `HistorianDataSourceConnectFailoverTests` (cluster failover order + cooldown via the `IHistorianConnectionFactory` fake), and `HistorianDataSourceValueAndAggregateTests` (the string-vs-numeric heuristic via the new SDK-independent `SelectValueFromPair` overload + the `ExtractAggregateValue` column dispatch). Stale empty `tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory deleted. Unit count rose from 80 to 125 (+45 new tests).
## Re-review 2026-06-19 (commit 7286d320)
The transport changed substantially since the `76d35d1` review: the named-pipe
server (`Ipc/PipeServer.cs` + `Ipc/PipeAcl.cs`, both deleted) was replaced by a
shared-secret + optional-TLS TCP server (`Ipc/TcpFrameServer.cs`). `HistorianDataSource`
grew the `IsConnectionClassError` / `HandleStartQueryFailure` / `BuildRequestCts`
helpers (the -008 / -010 fixes). All prior findings remain Resolved. The
re-review covers all 10 categories at `7286d320`; new findings continue the ID
sequence from -012.
| # | Category | Result (re-review) |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Historian.Wonderware-014 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found |
| 4 | Error handling & resilience | Driver.Historian.Wonderware-014 (cross-listed) |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Driver.Historian.Wonderware-015 |
| — | Cross-module contract | Driver.Historian.Wonderware-013 |
#### Driver.Historian.Wonderware-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness and logic bugs |
| Location | `Backend/HistorianDataSource.cs:667-729` (`ReadEventsAsync`); also `:405-486` (`ReadRawAsync`), `:495-573` (`ReadAggregateAsync`); `Ipc/Contracts.cs:224-238` (`ReadEventsReply`) |
| Status | Deferred |
**Description:** Cross-module context (Core.Abstractions-009 / OpcUaServer-002):
when a HistoryRead arrives with the `maxEvents <= 0` (or `maxValues <= 0`)
sentinel meaning "no caller cap — return everything", the implementer must
signal more-data (a continuation point) when a backend cap truncates the result;
otherwise the server silently drops history. This sidecar **silently truncates**.
`ReadEventsAsync` substitutes `_config.MaxValuesPerRead` (default 10000) as the
SDK `EventQueryArgs.EventCount` when `maxEvents <= 0` (line 686), and the loop's
only break is `if (maxEvents > 0 && count >= maxEvents)` (line 708) — so with the
sentinel the result is capped server-side at `MaxValuesPerRead` with no signal
that more rows existed. `ReadRawAsync` (`limit = maxValues > 0 ? maxValues :
_config.MaxValuesPerRead`, line 441) and `ReadAggregateAsync` (bucket cap, line
528) behave the same. Crucially the wire contracts (`ReadEventsReply`,
`ReadRawReply`, `ReadProcessedReply`) carry **no** `ContinuationPoint` /
`MoreDataAvailable` / `Truncated` field at all — there is no way for the sidecar
to tell the `WonderwareHistorianClient` "this set was capped", so the OPC UA
server cannot set a Part 11 `ContinuationPoint` and the client silently sees a
short read as a complete one. The aggregate path at least logs a Warning on
truncation (line 548); raw and events truncate with no log.
**Recommendation:** Add a `bool Truncated` (or a continuation token) field to the
three read reply DTOs and set it when the loop broke on the cap rather than on
`MoveNext` exhaustion; the `WonderwareHistorianClient` then maps it to a Part 11
`ContinuationPoint` (or at minimum `GoodMoreData`). At a bare minimum, log a
Warning on raw/event truncation to match the aggregate path so a silently capped
read is at least observable in the rolling log.
**Resolution:** Deferred — this is a cross-module wire-contract change. The
`ReadEventsReply` / `ReadRawReply` / `ReadProcessedReply` MessagePack DTOs are
shared with the .NET 10 `WonderwareHistorianClient` (a different module) and the
OPC UA server's HistoryRead glue; adding a continuation/truncated field and the
client-side mapping to a Part 11 `ContinuationPoint` must be designed and landed
across the sidecar + client + server together (and needs a live historian to
verify the end-to-end Part 11 paging). Out of scope for a self-contained
sidecar-only fix; tracked here for the coordinated change. (In answer to the
orchestrator's verdict request: this implementation does **not** honor the
continuation contract; it silently truncates with no more-data signal on the wire.)
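
For the coordinated change, the sidecar half of the -013 recommendation could look roughly like the sketch below. It is illustrative only: `CappedReadResult<T>`, `CappedReader`, and `ReadWithCap` are hypothetical names, the real reply DTOs currently have no `Truncated` field, and the real loops iterate SDK query objects rather than an `IEnumerable<T>`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reply shape. The real ReadRawReply / ReadEventsReply /
// ReadProcessedReply DTOs carry no such field today.
public sealed class CappedReadResult<T>
{
    public List<T> Items { get; } = new List<T>();
    public bool Truncated { get; set; }
}

public static class CappedReader
{
    // Collects up to the effective limit, then probes for one more row so that
    // "stopped at the cap" can be told apart from "source exhausted".
    public static CappedReadResult<T> ReadWithCap<T>(
        IEnumerable<T> rows, int callerMax, int backendCap)
    {
        // callerMax <= 0 is the "no caller cap" sentinel, so the backend cap applies.
        int limit = callerMax > 0 ? callerMax : backendCap;
        var result = new CappedReadResult<T>();

        using (var e = rows.GetEnumerator())
        {
            while (e.MoveNext())
            {
                if (result.Items.Count >= limit)
                {
                    // A row exists beyond the cap: flag it instead of dropping it silently.
                    result.Truncated = true;
                    break;
                }
                result.Items.Add(e.Current);
            }
        }
        return result;
    }
}
```

Probing for one extra row avoids reporting truncation when the source holds exactly `limit` rows. Simply setting the flag whenever the loop breaks on the cap, as the recommendation words it, would be simpler at the cost of an occasional empty follow-up read.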
#### Driver.Historian.Wonderware-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling and resilience |
| Location | `Backend/HistorianDataSource.cs:596-643` (`ReadAtTimeAsync`) |
| Status | Resolved |
**Description:** The -008 fix taught `ReadRawAsync` / `ReadAggregateAsync` /
`ReadEventsAsync` to classify a failed `StartQuery` as connection-class (reset the
connection, mark the node failed, propagate `Success=false`) vs query-class (keep
the connection, propagate `Success=false`). `ReadAtTimeAsync` was **not** updated
and still uses the pre-008 behaviour: on a `StartQuery` failure for any timestamp
it appends a Bad-quality null sample and `continue`s (lines 610-619) with no
inspection of `error.ErrorCode`. Two consequences:
1. A **connection-class** failure (e.g. `NoReply`, `FailedToConnect`) on the first
timestamp leaves the dead `_connection` in place; every subsequent timestamp's
`StartQuery` also fails on the same dead connection, and the method still calls
`RecordSuccess()` at the end (line 643) and returns `Success=true` with an
all-Bad sample set. The connection is never reset and the node is never marked
failed, so failover/cooldown never engages for an at-time read.
2. The all-Bad result is reported to the client as a successful read of Bad
samples, indistinguishable from "the historian genuinely had no interpolated
value at those instants" — masking a real connection outage.
**Recommendation:** When `StartQuery` returns false in the at-time loop, classify
the error with `IsConnectionClassError`. On a connection-class code, reset the
connection (`HandleConnectionError`) and throw so the IPC layer surfaces
`Success=false` (consistent with the other three read paths). A query-class /
no-data code may continue to record a per-timestamp Bad sample.
**Resolution:** Resolved 2026-06-19 — extracted the per-timestamp StartQuery-failure decision into a pure `ShouldResetConnectionForStartQueryFailure(HistorianAccessError?)` helper (the SDK `HistoryQuery`/`HistorianAccess` types are non-virtual with no interface, so the loop itself can't be driven offline — this mirrors the existing `IsConnectionClassError`/`SelectValueFromPair` testability seams). `ReadAtTimeAsync` now calls it when `StartQuery` returns false: a connection-class code throws an `InvalidOperationException` that the existing outer catch turns into a connection reset + node-failed + `Success=false` (matching the raw/aggregate/event paths); a query-class / no-data code keeps the prior per-timestamp Bad-sample-and-continue behaviour. Regression tests `AtTime_StartQuery_failure_with_connection_class_code_requests_connection_reset`, `AtTime_StartQuery_failure_with_query_class_code_does_not_request_reset`, and `AtTime_StartQuery_failure_with_null_error_defaults_to_no_reset` in `HistorianDataSourceStartQueryClassificationTests`.
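
The decision described in the -014 resolution can be sketched as below. This is not the module's code: the SDK types are replaced by hypothetical stand-ins (`FakeErrorCode`, `FakeAccessError`, `StartQueryFn`), `InvalidTag` is an invented query-class code, and the connection-class set shown contains only the two codes named in the description above; the real table lives in `IsConnectionClassError`.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins for the SDK types, which are not reproduced here.
public enum FakeErrorCode { None, NoReply, FailedToConnect, InvalidTag }

public sealed class FakeAccessError
{
    public FakeErrorCode ErrorCode { get; set; }
}

// Mirrors a "returns false and reports an error" StartQuery call shape (assumed).
public delegate bool StartQueryFn(DateTime timestamp, out FakeAccessError error);

public static class AtTimeFailurePolicy
{
    // Assumed membership: only the two codes the finding names as connection-class.
    private static readonly HashSet<FakeErrorCode> ConnectionClass =
        new HashSet<FakeErrorCode> { FakeErrorCode.NoReply, FakeErrorCode.FailedToConnect };

    // Pure decision: true means "reset the connection and fail the whole read".
    // A null error gives no evidence of a dead connection, so it defaults to no reset.
    public static bool ShouldReset(FakeAccessError error)
    {
        return error != null && ConnectionClass.Contains(error.ErrorCode);
    }

    // Per-timestamp loop: connection-class failures abort the read by throwing,
    // query-class failures record a Bad sample and continue.
    public static List<string> ReadAtTimes(IEnumerable<DateTime> timestamps, StartQueryFn startQuery)
    {
        var qualities = new List<string>();
        foreach (var ts in timestamps)
        {
            FakeAccessError error;
            if (startQuery(ts, out error))
            {
                qualities.Add("Good");
                continue;
            }
            if (ShouldReset(error))
            {
                throw new InvalidOperationException(
                    "connection-class StartQuery failure: " + error.ErrorCode);
            }
            qualities.Add("Bad");
        }
        return qualities;
    }
}
```

Keeping `ShouldReset` free of SDK calls is what lets the three regression tests run without a live historian; the loop here only shows where the decision is applied.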
#### Driver.Historian.Wonderware-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation and comments |
| Location | `Backend/HistorianDataSource.cs:14`, `:95`; `Backend/IHistorianDataSource.cs:11`; `Backend/HistorianConfiguration.cs:9`; `Backend/HistorianSample.cs:7`; `Ipc/Contracts.cs:7`; `Ipc/Framing.cs:4` |
| Status | Resolved |
**Description:** The IPC transport was rewritten from a named pipe to TCP at this
commit (`Ipc/PipeServer.cs` + `Ipc/PipeAcl.cs` deleted; `Ipc/TcpFrameServer.cs`
added; `Program.cs` now binds a `TcpFrameServer`). Several XML doc comments and
header comments still describe the wire as a "named-pipe" / "pipe protocol" /
"pipe-server connection thread": `HistorianDataSource` class summary ("serialises
onto the named-pipe wire"), the `BuildRequestCts` doc ("single pipe-server
connection thread"), `IHistorianDataSource` ("the other side of the named-pipe
IPC"), `HistorianConfiguration` ("client side of the named-pipe IPC"),
`HistorianSample` ("serialises these onto the named-pipe wire"), `Contracts.cs`
("sidecar pipe protocol"), and `Framing.cs` ("Wonderware historian sidecar pipe
protocol"). These now misdescribe the transport — the same class of stale-comment
issue as the resolved -011 (which fixed the retired-Galaxy.Host references).
**Recommendation:** Replace "named-pipe" / "pipe protocol" / "pipe-server" with
the TCP wording ("TCP wire" / "sidecar TCP protocol" / "single TCP connection
thread"), consistent with `TcpFrameServer` and `Program.cs`.
**Resolution:** Resolved 2026-06-19 — updated the eight stale comments to describe the TCP transport: "named-pipe wire" → "TCP wire", "named-pipe IPC" → "TCP IPC", "pipe-server connection thread" → "TCP-server connection thread", "sidecar pipe protocol" (the `Contracts.cs` comment and the `Framing.cs` header) → "sidecar TCP protocol". No behaviour change.