fix(code-review): resolve Batch 3 wave A (OpcUaServer history/guard, ControlPlane topology gate)

- OpcUaServer-002: HistoryRead-Events NumValuesPerNode==0 now maps to unbounded (int.MaxValue) instead of the backend default-cap sentinel; no Core.Abstractions contract change (+EventMaxEvents helper tests)
- OpcUaServer-004: EnsureAddressSpaceCreated guard on public mutators -> clear InvalidOperationException instead of bare NRE if called pre-start (+tests)
- OpcUaServer-003: Deferred (endUtc inclusive/exclusive needs live Wonderware boundary confirmation)
- Configuration-013: wire DraftValidator.ValidateClusterTopology into AdminOperationsActor deploy gate (read-only, no migration) (+2 tests)
Joseph Doherty
2026-06-20 22:53:29 -04:00
parent c817d7720e
commit 94eec70fb0
8 changed files with 455 additions and 13 deletions
+3 -3
@@ -7,7 +7,7 @@
| Review date | 2026-06-19 (re-review; first reviewed 2026-05-22) |
| Commit reviewed | `7286d320` (re-review; was `76d35d1`) |
| Status | Reviewed |
| Open findings | 1 |
| Open findings | 0 |
## Checklist coverage
@@ -232,13 +232,13 @@ Prior findings Configuration-001…011 remain Resolved. Notable since the first
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:243` (`ValidateClusterTopology`) |
| Status | Open |
| Status | Resolved |
**Description:** `DraftValidator.ValidateClusterTopology` is documented as the managed pre-publish guard that catches cluster-topology drift the SQL `CK_ServerCluster_RedundancyMode_NodeCount` check cannot see — specifically an operator disabling a `ClusterNode` (effective enabled-count = 1) while `RedundancyMode` stays `Hot`/`Warm`, which would boot the runtime into an invalid-topology band. It is fully unit-tested (`DraftValidatorTests` §"ValidateClusterTopology") but **no production code calls it.** The deploy gate in `AdminOperationsActor.StartDeployment` runs `DraftValidator.Validate(...)` (the snapshot rules) but never `ValidateClusterTopology(...)`, so the documented enabled-node-count guard is inert at deploy time — the only thing standing is the row-level SQL CHECK, which the doc explicitly says is insufficient.
**Recommendation:** Wire `ValidateClusterTopology` into the deploy/publish path — load the `ServerCluster` row(s) + their `ClusterNode`s and run it alongside `Validate`, folding its errors into the same reject summary. The fix belongs in `src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs` (a different module), so it is **deferred from this module's edit scope** and recorded here against the now-dead Configuration-layer method. Cross-module: ControlPlane.
**Resolution:** _(open — fix is in the ControlPlane module's `AdminOperationsActor`, outside Configuration's edit scope)_
**Resolution:** Resolved 2026-06-20 — wired `DraftValidator.ValidateClusterTopology` into the deploy gate in the ControlPlane module's `AdminOperationsActor.HandleStartDeploymentAsync` (`src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs`). Immediately after the existing `DraftValidator.Validate(draft)` call, the handler now loads the `ServerCluster` rows (ClusterId-ordered for a deterministic summary) and their `ClusterNode`s from the **same** `db` context already open in the handler — read-only via `AsNoTracking()`, no second DbContext lifetime, no schema/migration or entity change — groups the nodes by `ClusterId`, and runs `ValidateClusterTopology(cluster, nodes)` per cluster. Its errors are appended to the SAME error list (`Validate(...)` now collected into a `List<ValidationError>`), so a deploy failing either the snapshot rules or the topology guard is rejected with both sets of messages folded into the single reject summary string; ordering stays deterministic (snapshot rules first, then per-cluster topology errors in ClusterId order). The previously-inert enabled-node-count guard (e.g. `RedundancyMode = Hot` with one `ClusterNode` toggled off, effective enabled-count = 1) now rejects at deploy time rather than relying solely on the row-level SQL CHECK the doc says is insufficient. New ControlPlane tests `AdminOperationsActorTests.StartDeployment_rejects_on_invalid_cluster_topology_disabled_node` (Hot + one disabled node → `Rejected` with `ClusterEnabledNodeCountMismatch`, no coordinator dispatch, no Deployment row) and `StartDeployment_accepts_when_cluster_topology_is_valid` (Hot + two enabled nodes → `Accepted`, no topology error, row inserted) pin the wiring; the rejecting test was confirmed red against the unwired handler before the fix. ControlPlane.Tests 62/62 green; the existing `DraftValidatorTests` §"ValidateClusterTopology" (Configuration.Tests 103/103) unchanged and still green.
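The error-folding order described above (snapshot errors first, then per-cluster topology errors in ClusterId order) can be sketched in isolation. This is a minimal illustration, not the actual handler: `Cluster`, `Node`, `DeployGateSketch`, `ValidateTopology` and the "at least two enabled nodes for Hot/Warm" rule are hypothetical stand-ins, and plain strings replace `ValidationError`. The real handler reads the rows from the open `db` context with `AsNoTracking()`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the ServerCluster / ClusterNode entities.
public sealed record Cluster(string ClusterId, string RedundancyMode);
public sealed record Node(string ClusterId, bool Enabled);

public static class DeployGateSketch
{
    // Assumed rule, for illustration only: Hot/Warm need at least two enabled nodes.
    public static IEnumerable<string> ValidateTopology(Cluster cluster, IReadOnlyList<Node> nodes)
    {
        int enabled = nodes.Count(n => n.Enabled);
        bool redundant = cluster.RedundancyMode == "Hot" || cluster.RedundancyMode == "Warm";
        if (redundant && enabled < 2)
            yield return $"ClusterEnabledNodeCountMismatch: {cluster.ClusterId} (enabled={enabled})";
    }

    // Snapshot errors stay first; topology errors follow in ClusterId order,
    // so the combined reject summary is deterministic.
    public static List<string> CollectErrors(
        IEnumerable<string> snapshotErrors,
        IEnumerable<Cluster> clusters,
        IEnumerable<Node> nodes)
    {
        var errors = new List<string>(snapshotErrors);
        var nodesByCluster = nodes.ToLookup(n => n.ClusterId);
        foreach (var cluster in clusters.OrderBy(c => c.ClusterId, StringComparer.Ordinal))
            errors.AddRange(ValidateTopology(cluster, nodesByCluster[cluster.ClusterId].ToList()));
        return errors;
    }
}
```

With one `Hot` cluster and one of its two nodes disabled, `CollectErrors` returns the snapshot errors followed by a single `ClusterEnabledNodeCountMismatch` entry. With both nodes enabled it returns only the snapshot errors.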
### Configuration-014
+40 -7
@@ -7,7 +7,7 @@
| Review date | 2026-06-19 |
| Commit reviewed | `7286d320` |
| Status | Reviewed |
| Open findings | 4 |
| Open findings | 1 |
## Checklist coverage
@@ -72,7 +72,7 @@ which is outside this module's edit boundary.
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `OtOpcUaNodeManager.cs:1748` (`HistoryReadEvents`), `OtOpcUaNodeManager.cs:1814` (`ClampToInt`) |
| Status | Open |
| Status | Resolved |
**Description:** For HistoryRead-Events, `HistoryReadEvents` passes
`ClampToInt(details.NumValuesPerNode)` to `IHistorianDataSource.ReadEventsAsync(maxEvents)` and
@@ -93,7 +93,21 @@ backend truncation and surface a continuation point / `GoodMoreData` for events.
event backends (cross-module, Core.Abstractions contract); option (b) requires the backend to report
truncation. Both cross this module's boundary.
**Resolution:** _(Open — deferred: rooted in the cross-module `IHistoryProvider.ReadEventsAsync` `maxEvents <= 0` sentinel contract (Core.Abstractions-006) and the Wonderware/OpcUaClient event backends; cannot be fixed safely inside OpcUaServer alone.)_
**Resolution:** Resolved — 2026-06-20 (SHA pending): fixed locally inside OpcUaServer without touching
the cross-module `IHistoryProvider.ReadEventsAsync` `maxEvents <= 0` sentinel. Added a small testable
`internal static int EventMaxEvents(uint numValuesPerNode)` helper next to `ClampToInt` that translates
the OPC UA Part 4/11 "no limit" request (`NumValuesPerNode == 0`) to UNBOUNDED (`int.MaxValue`, a very
large positive cap) rather than the backend's `<= 0` "use the default cap" sentinel; a positive value
still passes through `ClampToInt` unchanged. `HistoryReadEvents` now calls `EventMaxEvents(details.NumValuesPerNode)`
instead of `ClampToInt(details.NumValuesPerNode)`, so a "give me the whole window" events read is no
longer silently truncated at the backend default. The sentinel contract + the Wonderware/OpcUaClient
backends are untouched (a positive `int.MaxValue` is never the `<= 0` sentinel). Tests:
`NodeManagerEventMaxEventsTests` (helper purity — `0u→int.MaxValue`, normal passthrough,
`>int.MaxValue→int.MaxValue` clamp, exact-`int.MaxValue` boundary) plus
`NodeManagerHistoryReadEventsTests.Events_unbounded_request_passes_int_max_to_backend` (the recording fake
`IHistorianDataSource` receives `int.MaxValue` when `NumValuesPerNode == 0`). Note: option (b) — surfacing
a continuation point / `GoodMoreData` on backend truncation — remains a cross-module/backend change and is
out of scope; option (a) here removes the silent-truncation defect for the common "all events" request.
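For illustration, the mapping described above can be sketched as follows. The `EventMaxEvents` signature is the one quoted in the resolution; the body of `ClampToInt` and the containing class name `HistoryLimitsSketch` are assumptions, not the actual `OtOpcUaNodeManager` code.

```csharp
using System;

internal static class HistoryLimitsSketch
{
    // Assumed shape of the existing helper: clamp a uint into the int range.
    internal static int ClampToInt(uint value) =>
        value > int.MaxValue ? int.MaxValue : (int)value;

    // NumValuesPerNode == 0 means "no limit" in OPC UA, so map it to a very
    // large positive cap instead of the backend's "<= 0 means default cap" sentinel.
    internal static int EventMaxEvents(uint numValuesPerNode) =>
        numValuesPerNode == 0 ? int.MaxValue : ClampToInt(numValuesPerNode);
}
```

Under these assumptions the helper never returns a value `<= 0`, which is why the backend sentinel contract can stay untouched.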
### OpcUaServer-003
@@ -102,7 +116,7 @@ truncation. Both cross this module's boundary.
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `OtOpcUaNodeManager.cs:1978` (`ServeRawPaged`), `HistoryPaging.cs` (whole), `HistoryPaging.cs:213` (`SliceTieCluster` `next <= endUtc`) |
| Status | Open |
| Status | Deferred |
**Description:** The Raw paging chain treats `endUtc` as an **inclusive** upper bound throughout —
the `HistoryContinuationState`/`HistoryPaging` XML docs all say "the original (inclusive) end of
@@ -126,7 +140,14 @@ the inclusive/exclusive question requires confirming the Wonderware backend's ac
semantics (cross-module / infra), and changing a comparison without that confirmation risks the
opposite off-by-one.
**Resolution:** _(Open — deferred: needs the backend's authoritative endUtc boundary semantics confirmed before the comparison/doc is changed; flipping it blindly risks an off-by-one in the other direction.)_
**Resolution:** Deferred — 2026-06-20: infra-gated. Resolving the `endUtc` inclusive-vs-exclusive
disagreement requires confirming the actual Wonderware historian backend's boundary semantics, which is
hardware/infra-gated and not reachable from this macOS dev host. The impact is benign and bounded — because
the backend is the authority on which samples exist (a sample at exactly `endUtc` never appears in an
exclusive-end read), the disagreement only ever yields ONE extra empty resume page (`[endUtc, endUtc)`
GoodNoData, no continuation point) rather than any duplicated or dropped data. Changing the
`SliceTieCluster` comparison / paging XML docs without confirming the live backend boundary risks
introducing the opposite off-by-one, so no code is changed here pending that live confirmation.
### OpcUaServer-004
@@ -135,7 +156,7 @@ opposite off-by-one.
| Severity | Low |
| Category | Error handling & resilience |
| Location | `OtOpcUaNodeManager.cs:1597` (`ResolveParentFolder`), and every public sink mutator that calls it (`EnsureFolder` 1278, `EnsureVariable` 1335, `MaterialiseAlarmCondition` 597, plus `WriteValue`/`WriteAlarmCondition` `CreateVariable`) |
| Status | Open |
| Status | Resolved |
**Description:** `ResolveParentFolder` dereferences `_root!` with the null-forgiving operator, and
`CreateVariable` uses `_root` (`AddChild`). `_root` is only assigned in `CreateAddressSpace`, which
@@ -153,7 +174,19 @@ mutators, so a too-early call fails legibly instead of with a bare NRE. Low prio
hardening, not a live defect. Left Open to avoid an unscoped change to the mutator entry points on
this critical class without a regression scenario that reproduces the early-call ordering.
**Resolution:** _(Open — defensive-only; latent given current boot ordering. Deferred to avoid an unscoped guard-add across five mutators without a reproducing pre-start ordering scenario.)_
**Resolution:** Resolved — 2026-06-20 (SHA pending): added a private `EnsureAddressSpaceCreated()` helper
that throws `InvalidOperationException("OPC UA address space has not been created yet (server not started.)")`
when `_root` is null, and call it at the top of `ResolveParentFolder` and at every public address-space
mutator entry point (`WriteValue`, `WriteAlarmCondition`, `EnsureFolder`, `EnsureVariable`,
`MaterialiseAlarmCondition`) — right after argument validation, before any `_root` dereference. A too-early
call (a sink wired or a publish replayed before `StartAsync` drives `CreateAddressSpace`) now fails legibly
instead of with a bare NRE out of `ResolveParentFolder` / `CreateVariable`. Happy-path behaviour is
unchanged. The guard was test-feasible after all: `NodeManagerPreStartGuardTests` boots a real host,
borrows the live node manager's real `IServerInternal`, constructs a SECOND, never-started
`OtOpcUaNodeManager` from it (so `_root` is null), and asserts each of the four lock-taking mutators
(`EnsureFolder`/`EnsureVariable`/`WriteValue`/`MaterialiseAlarmCondition`) throws `InvalidOperationException`
(not NRE), with the folder case asserting the message text. (`WriteAlarmCondition`'s guard is identical and
sits on the same path; it is build-verified.) Full `OpcUaServer.Tests` suite green (284/284).
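The guard pattern described above, reduced to a minimal sketch. `NodeManagerSketch` is a hypothetical stand-in for `OtOpcUaNodeManager`, with `_root` typed as a plain `object` and only one mutator shown; the exception message is the one quoted in the resolution.

```csharp
using System;

// Hypothetical, heavily reduced stand-in for the node manager.
public sealed class NodeManagerSketch
{
    // Assigned only by CreateAddressSpace, so it is null before the server starts.
    private object _root;

    public void CreateAddressSpace() => _root = new object();

    private void EnsureAddressSpaceCreated()
    {
        if (_root is null)
            throw new InvalidOperationException(
                "OPC UA address space has not been created yet (server not started.)");
    }

    public void EnsureFolder(string name)
    {
        if (name is null) throw new ArgumentNullException(nameof(name));
        EnsureAddressSpaceCreated(); // right after argument validation, before any _root use
        // ... from here on it is safe to dereference _root
    }
}
```

Calling `EnsureFolder` on a freshly constructed instance throws `InvalidOperationException` rather than a `NullReferenceException`. After `CreateAddressSpace()` the same call proceeds normally.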
### OpcUaServer-005