diff --git a/code-reviews/Driver.Historian.Wonderware/findings.md b/code-reviews/Driver.Historian.Wonderware/findings.md index 8473b5e6..bf18dd5b 100644 --- a/code-reviews/Driver.Historian.Wonderware/findings.md +++ b/code-reviews/Driver.Historian.Wonderware/findings.md @@ -4,10 +4,10 @@ |---|---| | Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware` | | Reviewer | Claude Code | -| Review date | 2026-05-22 | -| Commit reviewed | `76d35d1` | +| Review date | 2026-06-19 | +| Commit reviewed | `7286d320` | | Status | Reviewed | -| Open findings | 0 | +| Open findings | 3 | ## Checklist coverage @@ -335,3 +335,137 @@ cancellation, and the value-type selection — and delete the stale empty `tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory. **Resolution:** Resolved 2026-05-23 — added four new `HistorianDataSource`-targeted test files: `HistorianDataSourceHealthSnapshotTests` (snapshot consistency under half-published state, see also -005), `HistorianDataSourceStartQueryClassificationTests` (connection-class vs query-class error-code table, see also -008), `HistorianDataSourceRequestTimeoutTests` (the request-timeout helper, see also -010), `HistorianDataSourceConnectFailoverTests` (cluster failover order + cooldown via the `IHistorianConnectionFactory` fake), and `HistorianDataSourceValueAndAggregateTests` (the string-vs-numeric heuristic via the new SDK-independent `SelectValueFromPair` overload + the `ExtractAggregateValue` column dispatch). Stale empty `tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory deleted. Unit count rose from 80 to 125 (+45 new tests). + +## Re-review 2026-06-19 (commit 7286d320) + +The transport changed substantially since the `76d35d1` review: the named-pipe +server (`Ipc/PipeServer.cs` + `Ipc/PipeAcl.cs`, both deleted) was replaced by a +shared-secret + optional-TLS TCP server (`Ipc/TcpFrameServer.cs`). `HistorianDataSource` +grew the `IsConnectionClassError` / `HandleStartQueryFailure` / `BuildRequestCts` +helpers (the -008 / -010 fixes). All prior findings remain Resolved. The +re-review covers all 10 categories at `7286d320`; new findings continue the ID +sequence from -012. + +| # | Category | Result (re-review) | +|---|---|---| +| 1 | Correctness & logic bugs | Driver.Historian.Wonderware-014 | +| 2 | OtOpcUa conventions | No issues found | +| 3 | Concurrency & thread safety | No issues found | +| 4 | Error handling & resilience | Driver.Historian.Wonderware-014 (cross-listed) | +| 5 | Security | No issues found | +| 6 | Performance & resource management | No issues found | +| 7 | Design-document adherence | No issues found | +| 8 | Code organization & conventions | No issues found | +| 9 | Testing coverage | No issues found | +| 10 | Documentation & comments | Driver.Historian.Wonderware-015 | +| — | Cross-module contract | Driver.Historian.Wonderware-013 | + +#### Driver.Historian.Wonderware-013 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness and logic bugs | +| Location | `Backend/HistorianDataSource.cs:667-729` (`ReadEventsAsync`); also `:405-486` (`ReadRawAsync`), `:495-573` (`ReadAggregateAsync`); `Ipc/Contracts.cs:224-238` (`ReadEventsReply`) | +| Status | Deferred | + +**Description:** Cross-module context (Core.Abstractions-009 / OpcUaServer-002): +when a HistoryRead arrives with the `maxEvents <= 0` (or `maxValues <= 0`) +sentinel meaning "no caller cap — return everything", the implementer must +signal more-data (a continuation point) when a backend cap truncates the result; +otherwise the server silently drops history. This sidecar **silently truncates**. + +`ReadEventsAsync` substitutes `_config.MaxValuesPerRead` (default 10000) as the +SDK `EventQueryArgs.EventCount` when `maxEvents <= 0` (line 686), and the loop's +only break is `if (maxEvents > 0 && count >= maxEvents)` (line 708) — so with the +sentinel the result is capped server-side at `MaxValuesPerRead` with no signal +that more rows existed. `ReadRawAsync` (`limit = maxValues > 0 ? maxValues : +_config.MaxValuesPerRead`, line 441) and `ReadAggregateAsync` (bucket cap, line +528) behave the same. Crucially the wire contracts (`ReadEventsReply`, +`ReadRawReply`, `ReadProcessedReply`) carry **no** `ContinuationPoint` / +`MoreDataAvailable` / `Truncated` field at all — there is no way for the sidecar +to tell the `WonderwareHistorianClient` "this set was capped", so the OPC UA +server cannot set a Part 11 `ContinuationPoint` and the client silently sees a +short read as a complete one. The aggregate path at least logs a Warning on +truncation (line 548); raw and events truncate with no log. + +**Recommendation:** Add a `bool Truncated` (or a continuation token) field to the +three read reply DTOs and set it when the loop broke on the cap rather than on +`MoveNext` exhaustion; the `WonderwareHistorianClient` then maps it to a Part 11 +`ContinuationPoint` (or at minimum `GoodMoreData`). At a bare minimum, log a +Warning on raw/event truncation to match the aggregate path so a silently capped +read is at least observable in the rolling log. + +**Resolution:** Deferred — this is a cross-module wire-contract change. The +`ReadEventsReply` / `ReadRawReply` / `ReadProcessedReply` MessagePack DTOs are +shared with the .NET 10 `WonderwareHistorianClient` (a different module) and the +OPC UA server's HistoryRead glue; adding a continuation/truncated field and the +client-side mapping to a Part 11 `ContinuationPoint` must be designed and landed +across the sidecar + client + server together (and needs a live historian to +verify the end-to-end Part 11 paging). Out of scope for a self-contained +sidecar-only fix; tracked here for the coordinated change. (The orchestrator's +verdict request: this implementation does **not** honor the continuation +contract — it silently truncates with no more-data signal on the wire.) + +#### Driver.Historian.Wonderware-014 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling and resilience | +| Location | `Backend/HistorianDataSource.cs:596-643` (`ReadAtTimeAsync`) | +| Status | Resolved | + +**Description:** The -008 fix taught `ReadRawAsync` / `ReadAggregateAsync` / +`ReadEventsAsync` to classify a failed `StartQuery` as connection-class (reset the +connection, mark the node failed, propagate `Success=false`) vs query-class (keep +the connection, propagate `Success=false`). `ReadAtTimeAsync` was **not** updated +and still uses the pre-008 behaviour: on a `StartQuery` failure for any timestamp +it appends a Bad-quality null sample and `continue`s (lines 610-619) with no +inspection of `error.ErrorCode`. Two consequences: + +1. A **connection-class** failure (e.g. `NoReply`, `FailedToConnect`) on the first + timestamp leaves the dead `_connection` in place; every subsequent timestamp's + `StartQuery` also fails on the same dead connection, and the method still calls + `RecordSuccess()` at the end (line 643) and returns `Success=true` with an + all-Bad sample set. The connection is never reset and the node is never marked + failed, so failover/cooldown never engages for an at-time read. +2. The all-Bad result is reported to the client as a successful read of Bad + samples, indistinguishable from "the historian genuinely had no interpolated + value at those instants" — masking a real connection outage. + +**Recommendation:** When `StartQuery` returns false in the at-time loop, classify +the error with `IsConnectionClassError`. On a connection-class code, reset the +connection (`HandleConnectionError`) and throw so the IPC layer surfaces +`Success=false` (consistent with the other three read paths). A query-class / +no-data code may continue to record a per-timestamp Bad sample. + +**Resolution:** Resolved 2026-06-19 — extracted the per-timestamp StartQuery-failure decision into a pure `ShouldResetConnectionForStartQueryFailure(HistorianAccessError?)` helper (the SDK `HistoryQuery`/`HistorianAccess` types are non-virtual with no interface, so the loop itself can't be driven offline — this mirrors the existing `IsConnectionClassError`/`SelectValueFromPair` testability seams). `ReadAtTimeAsync` now calls it when `StartQuery` returns false: a connection-class code throws an `InvalidOperationException` that the existing outer catch turns into a connection reset + node-failed + `Success=false` (matching the raw/aggregate/event paths); a query-class / no-data code keeps the prior per-timestamp Bad-sample-and-continue behaviour. Regression tests `AtTime_StartQuery_failure_with_connection_class_code_requests_connection_reset`, `AtTime_StartQuery_failure_with_query_class_code_does_not_request_reset`, and `AtTime_StartQuery_failure_with_null_error_defaults_to_no_reset` in `HistorianDataSourceStartQueryClassificationTests`. + +#### Driver.Historian.Wonderware-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation and comments | +| Location | `Backend/HistorianDataSource.cs:14`, `:95`; `Backend/IHistorianDataSource.cs:11`; `Backend/HistorianConfiguration.cs:9`; `Backend/HistorianSample.cs:7`; `Ipc/Contracts.cs:7`; `Ipc/Framing.cs:4` | +| Status | Resolved | + +**Description:** The IPC transport was rewritten from a named pipe to TCP at this +commit (`Ipc/PipeServer.cs` + `Ipc/PipeAcl.cs` deleted; `Ipc/TcpFrameServer.cs` +added; `Program.cs` now binds a `TcpFrameServer`). Several XML doc comments and +header comments still describe the wire as a "named-pipe" / "pipe protocol" / +"pipe-server connection thread": `HistorianDataSource` class summary ("serialises +onto the named-pipe wire"), the `BuildRequestCts` doc ("single pipe-server +connection thread"), `IHistorianDataSource` ("the other side of the named-pipe +IPC"), `HistorianConfiguration` ("client side of the named-pipe IPC"), +`HistorianSample` ("serialises these onto the named-pipe wire"), `Contracts.cs` +("sidecar pipe protocol"), and `Framing.cs` ("Wonderware historian sidecar pipe +protocol"). These now misdescribe the transport — the same class of stale-comment +issue as the resolved -011 (which fixed the retired-Galaxy.Host references). + +**Recommendation:** Replace "named-pipe" / "pipe protocol" / "pipe-server" with +the TCP wording ("TCP wire" / "sidecar TCP protocol" / "single TCP connection +thread"), consistent with `TcpFrameServer` and `Program.cs`. + +**Resolution:** Resolved 2026-06-19 — updated the eight stale comments to describe the TCP transport: "named-pipe wire" → "TCP wire", "named-pipe IPC" → "TCP IPC", "pipe-server connection thread" → "TCP-server connection thread", "sidecar pipe protocol" / "sidecar pipe protocol" header → "sidecar TCP protocol". No behaviour change. diff --git a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianConfiguration.cs b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianConfiguration.cs index 0c0de43b..0c7ce2d6 100644 --- a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianConfiguration.cs +++ b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianConfiguration.cs @@ -6,7 +6,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Backend /// Wonderware Historian SDK configuration. Populated from environment variables at /// sidecar startup (see Program.cs): the supervisor (lmxopcua-side /// WonderwareHistorianClient) spawns the sidecar with these env vars; UA - /// translation lives on the client side of the named-pipe IPC, so this surface is + /// translation lives on the client side of the TCP IPC, so this surface is /// kept OPC-UA-free. The legacy v1 Galaxy.Host / Proxy host this lived in retired /// in PR 7.2. /// diff --git a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianDataSource.cs b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianDataSource.cs index 22f3afb8..98b5fbf5 100644 --- a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianDataSource.cs +++ b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianDataSource.cs @@ -11,7 +11,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Backend /// /// Reads historical data from the Wonderware Historian via the aahClientManaged SDK. /// OPC-UA-free — emits / - /// which the sidecar serialises onto the named-pipe wire (PR 3.3 contracts) for the + /// which the sidecar serialises onto the TCP wire (PR 3.3 contracts) for the /// .NET 10 WonderwareHistorianClient to translate into OPC UA DataValue /// on its side of the IPC. The v1 Galaxy.Host / Proxy architecture this class /// originally lived in retired in PR 7.2. @@ -86,13 +86,27 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Backend internal static bool IsConnectionClassError(HistorianAccessError.ErrorValue code) => ConnectionErrorCodes.Contains(code); + /// + /// Whether a failed StartQuery in the per-timestamp at-time loop should reset + /// the shared SDK connection (and abort the read) rather than record a per-timestamp + /// Bad sample and continue. Returns true only for connection-class error + /// codes; query-class / no-data codes (and a missing error) return false so + /// a single bad/empty timestamp does not tear down a connection that is still serving + /// the other timestamps. The HistoryQuery SDK type is non-virtual and has no + /// interface, so the at-time loop can't be driven offline — this pure helper is the + /// unit-testable seam for the classification. See Driver.Historian.Wonderware-014. + /// + /// The SDK error returned by the failed StartQuery. + internal static bool ShouldResetConnectionForStartQueryFailure(HistorianAccessError? error) + => IsConnectionClassError(error?.ErrorCode ?? HistorianAccessError.ErrorValue.Failure); + /// /// Builds the per-read linked into the /// caller's and pre-wired to fire after /// if positive. The /// read paths use the resulting token in their ThrowIfCancellationRequested /// checks so a hung StartQuery or slow MoveNext cannot block the - /// single pipe-server connection thread indefinitely. See + /// single TCP-server connection thread indefinitely. See /// Driver.Historian.Wonderware-010. /// /// The historian configuration. @@ -409,7 +423,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Backend var results = new List(); // Driver.Historian.Wonderware-010: wire RequestTimeoutSeconds into the read path - // so a hung StartQuery / slow MoveNext can't block the connection thread forever. + // so a hung StartQuery / slow MoveNext can't block the TCP connection thread forever. using var requestCts = BuildRequestCts(_config, ct); var token = requestCts.Token; @@ -609,6 +623,22 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Backend if (!query.StartQuery(args, out var error)) { + // Driver.Historian.Wonderware-014: classify the failure like the raw / + // aggregate / event paths. A connection-class code means the shared + // connection is dead — throw so the whole at-time read aborts and the IPC + // layer surfaces Success=false (the outer catch resets the connection and + // marks the node failed). Without this, every remaining timestamp would + // re-fail StartQuery on the dead connection and the method would still + // report Success=true with an all-Bad result, never failing over. A + // query-class / no-data code keeps the connection and records a Bad sample + // for just this timestamp. + if (ShouldResetConnectionForStartQueryFailure(error)) + { + var code = error?.ErrorCode ?? HistorianAccessError.ErrorValue.Failure; + throw new InvalidOperationException( + $"Historian SDK StartQuery failed for at-time query of tag '{tagName}': {code} ({error?.ErrorDescription})"); + } + results.Add(new HistorianSample { Value = null, diff --git a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianSample.cs b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianSample.cs index 8fab9915..51478c41 100644 --- a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianSample.cs +++ b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/HistorianSample.cs @@ -4,7 +4,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Backend { /// /// OPC-UA-free representation of a single historical data point. The sidecar serialises - /// these onto the named-pipe wire (HistorianSampleDto) for the .NET 10 + /// these onto the TCP wire (HistorianSampleDto) for the .NET 10 /// WonderwareHistorianClient, which maps quality and value into OPC UA /// DataValue on its side. Raw OPC DA quality byte is preserved so the client /// can reuse the same quality mapper it already uses for live reads. diff --git a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/IHistorianDataSource.cs b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/IHistorianDataSource.cs index b697165f..8900909d 100644 --- a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/IHistorianDataSource.cs +++ b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/IHistorianDataSource.cs @@ -8,7 +8,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Backend /// /// OPC-UA-free surface for the Wonderware Historian subsystem inside the historian /// sidecar process. Implementations read via the aahClient* SDK; the .NET 10 - /// WonderwareHistorianClient on the other side of the named-pipe IPC maps + /// WonderwareHistorianClient on the other side of the TCP IPC maps /// returned samples to OPC UA DataValue. The v1 Galaxy.Host / Proxy hosts /// this lived in retired in PR 7.2. /// diff --git a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Ipc/Contracts.cs b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Ipc/Contracts.cs index 91e25400..7a0211fa 100644 --- a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Ipc/Contracts.cs +++ b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Ipc/Contracts.cs @@ -4,7 +4,7 @@ using MessagePack; namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Ipc; // ============================================================================ -// Wire DTOs for the sidecar pipe protocol. The sidecar speaks its own legacy +// Wire DTOs for the sidecar TCP protocol. The sidecar speaks its own legacy // shape (List etc.) — the .NET 10 client (PR 3.4) translates // to / from Core.Abstractions.DataValueSnapshot + HistoricalEvent. // diff --git a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Ipc/Framing.cs b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Ipc/Framing.cs index 26d3c591..bcc51b6b 100644 --- a/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Ipc/Framing.cs +++ b/src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Ipc/Framing.cs @@ -1,7 +1,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Ipc; /// -/// Length-prefixed framing constants for the Wonderware historian sidecar pipe protocol. +/// Length-prefixed framing constants for the Wonderware historian sidecar TCP protocol. /// Each frame on the wire is: /// [4-byte big-endian length][1-byte message kind][MessagePack body]. /// Length is the body size only; the kind byte is not part of the prefixed length. diff --git a/tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/Backend/HistorianDataSourceStartQueryClassificationTests.cs b/tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/Backend/HistorianDataSourceStartQueryClassificationTests.cs index b7c27800..32528255 100644 --- a/tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/Backend/HistorianDataSourceStartQueryClassificationTests.cs +++ b/tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/Backend/HistorianDataSourceStartQueryClassificationTests.cs @@ -51,4 +51,54 @@ public sealed class HistorianDataSourceStartQueryClassificationTests HistorianDataSource.IsConnectionClassError(code).ShouldBeFalse( $"{code} is a query payload problem — must NOT tear down the SDK connection"); } + + // ── Driver.Historian.Wonderware-014: the at-time loop must classify a per-timestamp + // StartQuery failure the same way the raw / aggregate / event paths do. The SDK + // HistoryQuery type is sealed-by-non-virtual + has no interface, so the loop itself + // can't be driven offline; the per-failure decision is therefore extracted into a + // pure helper that the at-time loop calls and these tests pin directly. ────────── + + /// + /// A connection-class StartQuery error in the at-time loop must signal "reset the + /// connection and abort the read" (true) — not silently record a Bad sample and keep + /// hammering the dead connection for every remaining timestamp. + /// + /// The connection-class error code. + [Theory] + [InlineData(HistorianAccessError.ErrorValue.FailedToConnect)] + [InlineData(HistorianAccessError.ErrorValue.NoReply)] + [InlineData(HistorianAccessError.ErrorValue.NotReady)] + public void AtTime_StartQuery_failure_with_connection_class_code_requests_connection_reset( + HistorianAccessError.ErrorValue code) + { + var error = new HistorianAccessError { ErrorCode = code }; + HistorianDataSource.ShouldResetConnectionForStartQueryFailure(error).ShouldBeTrue( + $"{code} is a connection failure — the at-time loop must reset the connection, not record Bad"); + } + + /// + /// A query-class StartQuery error (or a missing error) in the at-time loop must NOT + /// reset the connection (false): a single bad/empty timestamp records a per-timestamp + /// Bad sample and continues to the next without tearing down the shared connection. + /// + /// The query-class error code. + [Theory] + [InlineData(HistorianAccessError.ErrorValue.InvalidArgument)] + [InlineData(HistorianAccessError.ErrorValue.NoData)] + [InlineData(HistorianAccessError.ErrorValue.NotApplicable)] + public void AtTime_StartQuery_failure_with_query_class_code_does_not_request_reset( + HistorianAccessError.ErrorValue code) + { + var error = new HistorianAccessError { ErrorCode = code }; + HistorianDataSource.ShouldResetConnectionForStartQueryFailure(error).ShouldBeFalse( + $"{code} is a query/no-data problem — the at-time loop keeps the connection and records Bad"); + } + + /// A null error defaults to query-class (no reset) — the caller still records a Bad sample. + [Fact] + public void AtTime_StartQuery_failure_with_null_error_defaults_to_no_reset() + { + HistorianDataSource.ShouldResetConnectionForStartQueryFailure(null).ShouldBeFalse( + "a null error must not be promoted to a connection reset"); + } }