Files
lmxopcua/code-reviews/OpcUaServer/findings.md
T
Joseph Doherty 94eec70fb0 fix(code-review): resolve Batch 3 wave A (OpcUaServer history/guard, ControlPlane topology gate)
- OpcUaServer-002: HistoryRead-Events NumValuesPerNode==0 now maps to unbounded (int.MaxValue) instead of the backend default-cap sentinel; no Core.Abstractions contract change (+EventMaxEvents helper tests)
- OpcUaServer-004: EnsureAddressSpaceCreated guard on public mutators -> clear InvalidOperationException instead of bare NRE if called pre-start (+tests)
- OpcUaServer-003: Deferred (endUtc inclusive/exclusive needs live Wonderware boundary confirmation)
- Configuration-013: wire DraftValidator.ValidateClusterTopology into AdminOperationsActor deploy gate (read-only, no migration) (+2 tests)
2026-06-20 22:53:29 -04:00

19 KiB

Code Review — OpcUaServer

Field Value
Module src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer
Reviewer Claude Code
Review date 2026-06-19
Commit reviewed 7286d320
Status Reviewed
Open findings 1

Checklist coverage

A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.

# Category Result
1 Correctness & logic bugs OpcUaServer-001, -002, -003, -007
2 OtOpcUa conventions No issues found
3 Concurrency & thread safety No issues found (Lock discipline + fire-and-forget dispatch verified correct)
4 Error handling & resilience OpcUaServer-004
5 Security No issues found (WriteOperate / AlarmAck gates fail closed; anonymous → BadUserAccessDenied; HistoryRead AccessLevel bits correct)
6 Performance & resource management No issues found
7 Design-document adherence No issues found (ConditionId=dotted-FullName routing, MapSeverity buckets, fire-and-forget dispatch all match CLAUDE.md / docs)
8 Code organization & conventions No issues found
9 Testing coverage No issues found (HistoryRead paging incl. oversized-tie-cluster + backstop is covered end-to-end; OpcUaServer-005 re-triaged to Won't Fix)
10 Documentation & comments OpcUaServer-006 (Resolved)

Findings

OpcUaServer-001

Field Value
Severity Medium
Category Correctness & logic bugs
Location AddressSpacePlan.cs:56 (AddressSpacePlan.IsEmpty), AddressSpacePlan.cs:80 (AddressSpacePlanner.Compute)
Status Open

Description: AddressSpaceComposition carries the UNS topology (UnsAreas + UnsLines), and AddressSpaceApplier.MaterialiseHierarchy uses each area's/line's DisplayName for the OPC UA folder display name. But AddressSpacePlanner.Compute only diffs Equipment / DriverInstance / ScriptedAlarm / EquipmentTag / EquipmentVirtualTag — it never diffs UnsAreas / UnsLines. So a deployment whose ONLY change is a UNS Area or Line rename (no equipment/driver/alarm/tag/vtag delta) produces a plan whose IsEmpty is true. In OpcUaPublishActor.HandleRebuild (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:313-318) an empty plan short-circuits BEFORE MaterialiseHierarchy runs, so the renamed Area/Line folder keeps its stale display name until some unrelated structural change forces a full rebuild. Compounding it, OtOpcUaNodeManager.EnsureFolder (OtOpcUaNodeManager.cs:1272) early-returns for an already-present folder id and never updates an existing folder's DisplayName, so even if MaterialiseHierarchy were reached on a rename it would be a no-op until RebuildAddressSpace clears _folders.

Recommendation: Carry UNS area/line diff sets into AddressSpacePlan (mirroring the init-only EquipmentTag/VirtualTag pattern) so a rename is no longer "empty", and have the apply path drive a hierarchy refresh (or make EnsureFolder update an existing folder's DisplayName in place + ClearChangeMasks). Deferred: a complete fix spans the Runtime module (OpcUaPublishActor must honour the new plan flag and call a hierarchy-refresh / rebuild path), which is outside this module's edit boundary.

Resolution: (Open — deferred: needs a coordinated change in the Runtime module's OpcUaPublishActor to act on a UNS-changed plan; an in-AddressSpacePlan change alone is inert because Apply/MaterialiseHierarchy do not refresh existing folder names.)

OpcUaServer-002

Field Value
Severity Medium
Category Correctness & logic bugs
Location OtOpcUaNodeManager.cs:1748 (HistoryReadEvents), OtOpcUaNodeManager.cs:1814 (ClampToInt)
Status Resolved

Description: For HistoryRead-Events, HistoryReadEvents passes ClampToInt(details.NumValuesPerNode) to IHistorianDataSource.ReadEventsAsync(maxEvents) and always returns the result with ContinuationPoint = null ("the full window in one shot"). The events arm never issues continuation points. By the IHistoryProvider.ReadEventsAsync contract (src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs:88-95), maxEvents <= 0 is a sentinel meaning "use the backend's default cap". But OPC UA Part 4/11 define NumValuesPerNode == 0 on a HistoryRead as "no limit — return ALL values". So a client that asks for all events in a window (NumValuesPerNode == 0) is silently truncated at the backend's default cap, and because no continuation point (and no GoodMoreData) is returned, the client has no signal that the result was truncated — it believes it received the complete window. This is a silent-data-loss / spec-deviation on the events arm.

Recommendation: Either (a) translate NumValuesPerNode == 0 to "unbounded" for the events backend (e.g. saturate to a very large cap) rather than the "default cap" sentinel, or (b) detect backend truncation and surface a continuation point / GoodMoreData for events. Deferred: option (a) changes the documented maxEvents <= 0 sentinel semantics shared with the Wonderware/OpcUaClient event backends (cross-module, Core.Abstractions contract); option (b) requires the backend to report truncation. Both cross this module's boundary.

Resolution: Resolved — 2026-06-20 (SHA pending): fixed locally inside OpcUaServer without touching the cross-module IHistoryProvider.ReadEventsAsync maxEvents <= 0 sentinel. Added a small testable internal static int EventMaxEvents(uint numValuesPerNode) helper next to ClampToInt that translates the OPC UA Part 4/11 "no limit" request (NumValuesPerNode == 0) to UNBOUNDED (int.MaxValue, a very large positive cap) rather than the backend's <= 0 "use the default cap" sentinel; a positive value still passes through ClampToInt unchanged. HistoryReadEvents now calls EventMaxEvents(details.NumValuesPerNode) instead of ClampToInt(details.NumValuesPerNode), so a "give me the whole window" events read is no longer silently truncated at the backend default. The sentinel contract + the Wonderware/OpcUaClient backends are untouched (a positive int.MaxValue is never the <= 0 sentinel). Tests: NodeManagerEventMaxEventsTests (helper purity — 0u→int.MaxValue, normal passthrough, >int.MaxValue→int.MaxValue clamp, exact-int.MaxValue boundary) plus NodeManagerHistoryReadEventsTests.Events_unbounded_request_passes_int_max_to_backend (the recording fake IHistorianDataSource receives int.MaxValue when NumValuesPerNode == 0). Note: option (b) — surfacing a continuation point / GoodMoreData on backend truncation — remains a cross-module/backend change and is out of scope; option (a) here removes the silent-truncation defect for the common "all events" request.

OpcUaServer-003

Field Value
Severity Low
Category Correctness & logic bugs
Location OtOpcUaNodeManager.cs:1978 (ServeRawPaged), HistoryPaging.cs (whole), HistoryPaging.cs:213 (SliceTieCluster next <= endUtc)
Status Deferred

Description: The Raw paging chain treats endUtc as an inclusive upper bound throughout — the HistoryContinuationState/HistoryPaging XML docs all say "the original (inclusive) end of the window", and SliceTieCluster advances with next <= endUtc. But the backend contract (IHistoryProvider.ReadRawAsync, IHistoryProvider.cs:25) defines endUtc as an exclusive upper bound. The two arms therefore disagree at the exact endUtc boundary. Impact is small because the backend is the authority on which samples actually exist (a sample at exactly endUtc never appears in an exclusive-end read), so the disagreement only ever yields one extra empty resume page ([endUtc, endUtc) → GoodNoData, no CP) rather than duplicated or dropped data. Separately, the SDK permits a Raw read with StartTime == DateTime.MinValue + a NumValuesPerNode cap (the OPC UA "read N values backward from EndTime" request — validated in CustomNodeManager.HistoryRead); the forward-only resume cursor (ComputeResumeCursor resumes from the LAST sample forward) is not defined for a backward read, so paging a backward Raw read is unsound if the backend honours the backward direction.

Recommendation: Pin the endUtc inclusive-vs-exclusive convention against the actual backend (align the paging XML docs + SliceTieCluster comparison to the contract), and either explicitly reject a StartTime == MinValue paged Raw read with a clear status or document that the backend is assumed to always return chronological forward results regardless of direction. Deferred: resolving the inclusive/exclusive question requires confirming the Wonderware backend's actual boundary semantics (cross-module / infra), and changing a comparison without that confirmation risks the opposite off-by-one.

Resolution: Deferred — 2026-06-20: infra-gated. Resolving the endUtc inclusive-vs-exclusive disagreement requires confirming the actual Wonderware historian backend's boundary semantics, which is hardware/infra-gated and not reachable from this macOS dev host. The impact is benign and bounded — because the backend is the authority on which samples exist (a sample at exactly endUtc never appears in an exclusive-end read), the disagreement only ever yields ONE extra empty resume page ([endUtc, endUtc) → GoodNoData, no continuation point) rather than any duplicated or dropped data. Changing the SliceTieCluster comparison / paging XML docs without confirming the live backend boundary risks introducing the opposite off-by-one, so no code is changed here pending that live confirmation.

OpcUaServer-004

Field Value
Severity Low
Category Error handling & resilience
Location OtOpcUaNodeManager.cs:1597 (ResolveParentFolder), and every public sink mutator that calls it (EnsureFolder 1278, EnsureVariable 1335, MaterialiseAlarmCondition 597, plus WriteValue/WriteAlarmCondition CreateVariable)
Status Resolved

Description: ResolveParentFolder dereferences _root! with the null-forgiving operator, and CreateVariable uses _root (AddChild). _root is only assigned in CreateAddressSpace, which the SDK invokes during StandardServer start. Every public address-space mutator (WriteValue, WriteAlarmCondition, EnsureFolder, EnsureVariable, MaterialiseAlarmCondition) assumes CreateAddressSpace has already run. If any of these is ever called before the server has started (e.g. a sink wired or a publish replayed before StartAsync completes), _root is null and the call NREs out of the node manager. In the current boot ordering the host wires the SdkAddressSpaceSink only after start, so this is latent rather than live, but it is an unguarded ordering hazard on the highest-risk class.

Recommendation: Add an explicit guard (e.g. throw a clear InvalidOperationException("address space not yet created") or no-op with a logged warning) when _root is null at the top of the mutators, so a too-early call fails legibly instead of with a bare NRE. Low priority — defensive hardening, not a live defect. Left Open to avoid an unscoped change to the mutator entry points on this critical class without a regression scenario that reproduces the early-call ordering.

Resolution: Resolved — 2026-06-20 (SHA pending): added a private EnsureAddressSpaceCreated() helper that throws InvalidOperationException("OPC UA address space has not been created yet (server not started.)") when _root is null, and call it at the top of ResolveParentFolder and at every public address-space mutator entry point (WriteValue, WriteAlarmCondition, EnsureFolder, EnsureVariable, MaterialiseAlarmCondition) — right after argument validation, before any _root dereference. A too-early call (a sink wired or a publish replayed before StartAsync drives CreateAddressSpace) now fails legibly instead of with a bare NRE out of ResolveParentFolder / CreateVariable. Happy-path behaviour is unchanged. The guard was test-feasible after all: NodeManagerPreStartGuardTests boots a real host, borrows the live node manager's real IServerInternal, constructs a SECOND, never-started OtOpcUaNodeManager from it (so _root is null), and asserts each of the four lock-taking mutators (EnsureFolder/EnsureVariable/WriteValue/MaterialiseAlarmCondition) throws InvalidOperationException (not NRE), with the folder case asserting the message text. (WriteAlarmCondition's guard is identical and sits on the same path; it is build-verified.) Full OpcUaServer.Tests suite green (284/284).

OpcUaServer-005

Field Value
Severity Low
Category Testing coverage
Location OtOpcUaNodeManager.cs:2049 (ServeRawPaged tie-cluster stall path), OtOpcUaNodeManager.cs:2068 (absurd-burst backstop)
Status Won't Fix

Description: (Re-triaged to a non-issue.) Initial concern was that the oversized-tie-cluster paging path in ServeRawPaged (over-fetch the whole cluster, SliceTieCluster, the MaxTieClusterOverfetch absurd-burst BadHistoryOperationUnsupported backstop) lacked an end-to-end node-manager regression and was only covered at the pure level by HistoryPagingTests.

Recommendation: None required.

Resolution: Won't Fix — 2026-06-19: coverage already exists end-to-end. Verified that NodeManagerHistoryReadPagingTests.Raw_oversized_tie_cluster_pages_within_the_timestamp drives a real multi-page resume INTO the stall (a 5-way tie cluster larger than the page cap) through a booted OtOpcUaSdkServer + InMemoryHistoryContinuationStore and asserts lossless duplicate-free paging, and Raw_tie_cluster_beyond_overfetch_bound_fails_loudly exercises the > MaxTieClusterOverfetch backstop. The module's HistoryRead paging surface is comprehensively tested; no test gap.

OpcUaServer-006

Field Value
Severity Low
Category Documentation & comments
Location OtOpcUaNodeManager.cs:11-30 (class XML doc), OpcUaApplicationHost.cs:88-93 / OpcUaApplicationHost.cs:421-423 (F13/F13c follow-up notes)
Status Resolved

Description: Several XML comments reference superseded plan milestones as if still pending. The OtOpcUaNodeManager class doc says equipment-folder hierarchy + type metadata "still come from the AddressSpaceApplier / EquipmentNodeWalker integration (F14b, tracked under #85)", but EquipmentNodeWalker is no longer the integration path (the composer → applier → sink → node-manager chain replaced it, per AddressSpaceComposer's own header) and #85 hierarchy materialisation has shipped. OpcUaApplicationHost still carries "Full extraction … is tracked as follow-up F13" and "F13c will plug a real LDAP-bound validator" notes, but the IOpcUaUserAuthenticator / RoleCarryingUserIdentity impersonation wiring is implemented and live (the doc/docs/OpcUaServer.md describe it as done). These are stale-comment / undocumented-as-shipped issues, not behaviour bugs.

Recommendation: Refresh the class- and method-level doc comments to drop the EquipmentNodeWalker/F13/F13c/F14b "pending" framing and describe the shipped behaviour, matching docs/OpcUaServer.md.

Resolution: Resolved — 2026-06-19 (SHA pending): rewrote the OtOpcUaNodeManager class XML doc to describe the shipped folder-hierarchy / typed-variable / historized-node / Part 9 condition materialisation (dropping the false "treats every id as a flat BaseDataVariableState" claim and the retired EquipmentNodeWalker/F14b framing), and the two OpcUaApplicationHost doc blocks (class summary + BuildUserTokenPolicies) to describe the shipped impersonation/auth wiring instead of the "F13/F13c pending" framing. Doc-comment-only — no behaviour change; build re-verified green.

OpcUaServer-007

Field Value
Severity Medium
Category Correctness & logic bugs
Location OtOpcUaNodeManager.cs:674 (MaterialiseAlarmConditionalarm.OnTimedUnshelve)
Status Resolved

Description: The system-timer auto-unshelve callback (alarm.OnTimedUnshelve, wired in MaterialiseAlarmCondition) — fired by the SDK when a TimedShelve duration expires — built its AlarmCommand with User: string.Empty: AlarmCommandRouter?.Invoke(new AlarmCommand(alarmId, "Unshelve", string.Empty, null, null)). That command is routed to the scripted-alarm engine, whose ScriptedAlarmEngine.UnshelveAsync calls Part9StateMachine.ApplyUnshelve(cur, user, _clock()) (src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/Part9StateMachine.cs:211-213), and that method opens with if (string.IsNullOrWhiteSpace(user)) throw new ArgumentException("User required.", …). The exception is swallowed downstream (caught by ScriptedAlarmHostActor), so this is NOT a crash — but the auto-unshelve silently no-ops: a TimedShelve never auto-expires, leaving the alarm permanently shelved with no operator-visible error. The bug was operationally invisible and even had a green node-manager test that asserted cmd.User == string.Empty, encoding the defect as expected behaviour. The separate, correct design rule — OnTimedUnshelve BYPASSES the AlarmAck role gate because it is a session-less system timer with no client principal (routing through the gated HandleAlarmCommand would return BadUserAccessDenied) — is intentional and was preserved; only the empty User string was the defect.

Recommendation: Pass a non-empty system user at the OnTimedUnshelve call site so ApplyUnshelve accepts it, while keeping the gate bypass. Use the codebase-wide canonical "system" system-actor label (matching Part9StateMachine's own AutoUnshelve audit user, AlarmConditionState's "engine-internal events ⇒ system" doc, and AuditActor.SystemFallback = "system").

Resolution: Resolved — 2026-06-19 (SHA pending): changed the OnTimedUnshelve delegate in MaterialiseAlarmCondition (OtOpcUaNodeManager.cs:674) to build the AlarmCommand with User: "system" instead of string.Empty, so the engine's ApplyUnshelve user-required guard accepts the system-initiated unshelve and the timed auto-unshelve actually applies. The AlarmAck gate bypass for the session-less system timer is unchanged. TDD: repurposed the existing AlarmCommandRouterTests.OnTimedUnshelve_with_system_context_returns_good_and_routes_unshelve (which previously asserted cmd.User == string.Empty) to assert a non-empty "system" user — verified it failed against the unfixed source (cmd.User should not be null or white space) and passes after the fix. Surgical one-line src change; no public-contract/wire change. Full OpcUaServer.Tests suite green (275/275).