Files

T

Joseph Doherty 40749d3f67 review(OpcUaServer): fix silent auto-unshelve failure (empty User -> 'system')

Cross-module fix from the review sweep. -007 (Medium): OnTimedUnshelve built its AlarmCommand
with User=string.Empty, so Part9StateMachine.ApplyUnshelve rejected it (ArgumentException,
swallowed) and a TimedShelve never auto-expired. Pass the canonical 'system' user; the
AlarmAck-gate bypass is preserved. Repurposed the test that had encoded the bug.

2026-06-19 12:29:40 -04:00

16 KiB

Raw Blame History

Code Review — OpcUaServer

Field	Value
Module	`src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer`
Reviewer	Claude Code
Review date	2026-06-19
Commit reviewed	`7286d320`
Status	Reviewed
Open findings	4

Checklist coverage

A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.

#	Category	Result
1	Correctness & logic bugs	OpcUaServer-001, -002, -003, -007
2	OtOpcUa conventions	No issues found
3	Concurrency & thread safety	No issues found (Lock discipline + fire-and-forget dispatch verified correct)
4	Error handling & resilience	OpcUaServer-004
5	Security	No issues found (WriteOperate / AlarmAck gates fail closed; anonymous → BadUserAccessDenied; HistoryRead AccessLevel bits correct)
6	Performance & resource management	No issues found
7	Design-document adherence	No issues found (ConditionId=dotted-FullName routing, MapSeverity buckets, fire-and-forget dispatch all match CLAUDE.md / docs)
8	Code organization & conventions	No issues found
9	Testing coverage	No issues found (HistoryRead paging incl. oversized-tie-cluster + backstop is covered end-to-end; OpcUaServer-005 re-triaged to Won't Fix)
10	Documentation & comments	OpcUaServer-006 (Resolved)

Findings

OpcUaServer-001

Field	Value
Severity	Medium
Category	Correctness & logic bugs
Location	`AddressSpacePlan.cs:56` (`AddressSpacePlan.IsEmpty`), `AddressSpacePlan.cs:80` (`AddressSpacePlanner.Compute`)
Status	Open

Description: AddressSpaceComposition carries the UNS topology (UnsAreas + UnsLines), and AddressSpaceApplier.MaterialiseHierarchy uses each area's/line's DisplayName for the OPC UA folder display name. But AddressSpacePlanner.Compute only diffs Equipment / DriverInstance / ScriptedAlarm / EquipmentTag / EquipmentVirtualTag — it never diffs UnsAreas / UnsLines. So a deployment whose ONLY change is a UNS Area or Line rename (no equipment/driver/alarm/tag/vtag delta) produces a plan whose IsEmpty is true. In OpcUaPublishActor.HandleRebuild (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:313-318) an empty plan short-circuits BEFORE MaterialiseHierarchy runs, so the renamed Area/Line folder keeps its stale display name until some unrelated structural change forces a full rebuild. Compounding it, OtOpcUaNodeManager.EnsureFolder (OtOpcUaNodeManager.cs:1272) early-returns for an already-present folder id and never updates an existing folder's DisplayName, so even if MaterialiseHierarchy were reached on a rename it would be a no-op until RebuildAddressSpace clears _folders.

Recommendation: Carry UNS area/line diff sets into AddressSpacePlan (mirroring the init-only EquipmentTag/VirtualTag pattern) so a rename is no longer "empty", and have the apply path drive a hierarchy refresh (or make EnsureFolder update an existing folder's DisplayName in place + ClearChangeMasks). Deferred: a complete fix spans the Runtime module (OpcUaPublishActor must honour the new plan flag and call a hierarchy-refresh / rebuild path), which is outside this module's edit boundary.

Resolution: (Open — deferred: needs a coordinated change in the Runtime module's OpcUaPublishActor to act on a UNS-changed plan; an in-AddressSpacePlan change alone is inert because Apply/MaterialiseHierarchy do not refresh existing folder names.)

OpcUaServer-002

Field	Value
Severity	Medium
Category	Correctness & logic bugs
Location	`OtOpcUaNodeManager.cs:1748` (`HistoryReadEvents`), `OtOpcUaNodeManager.cs:1814` (`ClampToInt`)
Status	Open

Description: For HistoryRead-Events, HistoryReadEvents passes ClampToInt(details.NumValuesPerNode) to IHistorianDataSource.ReadEventsAsync(maxEvents) and always returns the result with ContinuationPoint = null ("the full window in one shot"). The events arm never issues continuation points. By the IHistoryProvider.ReadEventsAsync contract (src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs:88-95), maxEvents <= 0 is a sentinel meaning "use the backend's default cap". But OPC UA Part 4/11 define NumValuesPerNode == 0 on a HistoryRead as "no limit — return ALL values". So a client that asks for all events in a window (NumValuesPerNode == 0) is silently truncated at the backend's default cap, and because no continuation point (and no GoodMoreData) is returned, the client has no signal that the result was truncated — it believes it received the complete window. This is a silent-data-loss / spec-deviation on the events arm.

Recommendation: Either (a) translate NumValuesPerNode == 0 to "unbounded" for the events backend (e.g. saturate to a very large cap) rather than the "default cap" sentinel, or (b) detect backend truncation and surface a continuation point / GoodMoreData for events. Deferred: option (a) changes the documented maxEvents <= 0 sentinel semantics shared with the Wonderware/OpcUaClient event backends (cross-module, Core.Abstractions contract); option (b) requires the backend to report truncation. Both cross this module's boundary.

Resolution: (Open — deferred: rooted in the cross-module IHistoryProvider.ReadEventsAsync maxEvents <= 0 sentinel contract (Core.Abstractions-006) and the Wonderware/OpcUaClient event backends; cannot be fixed safely inside OpcUaServer alone.)

OpcUaServer-003

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`OtOpcUaNodeManager.cs:1978` (`ServeRawPaged`), `HistoryPaging.cs` (whole), `HistoryPaging.cs:213` (`SliceTieCluster` `next <= endUtc`)
Status	Open

Description: The Raw paging chain treats endUtc as an inclusive upper bound throughout — the HistoryContinuationState/HistoryPaging XML docs all say "the original (inclusive) end of the window", and SliceTieCluster advances with next <= endUtc. But the backend contract (IHistoryProvider.ReadRawAsync, IHistoryProvider.cs:25) defines endUtc as an exclusive upper bound. The two arms therefore disagree at the exact endUtc boundary. Impact is small because the backend is the authority on which samples actually exist (a sample at exactly endUtc never appears in an exclusive-end read), so the disagreement only ever yields one extra empty resume page ([endUtc, endUtc) → GoodNoData, no CP) rather than duplicated or dropped data. Separately, the SDK permits a Raw read with StartTime == DateTime.MinValue + a NumValuesPerNode cap (the OPC UA "read N values backward from EndTime" request — validated in CustomNodeManager.HistoryRead); the forward-only resume cursor (ComputeResumeCursor resumes from the LAST sample forward) is not defined for a backward read, so paging a backward Raw read is unsound if the backend honours the backward direction.

Recommendation: Pin the endUtc inclusive-vs-exclusive convention against the actual backend (align the paging XML docs + SliceTieCluster comparison to the contract), and either explicitly reject a StartTime == MinValue paged Raw read with a clear status or document that the backend is assumed to always return chronological forward results regardless of direction. Deferred: resolving the inclusive/exclusive question requires confirming the Wonderware backend's actual boundary semantics (cross-module / infra), and changing a comparison without that confirmation risks the opposite off-by-one.

Resolution: (Open — deferred: needs the backend's authoritative endUtc boundary semantics confirmed before the comparison/doc is changed; flipping it blindly risks an off-by-one in the other direction.)

OpcUaServer-004

Field	Value
Severity	Low
Category	Error handling & resilience
Location	`OtOpcUaNodeManager.cs:1597` (`ResolveParentFolder`), and every public sink mutator that calls it (`EnsureFolder` 1278, `EnsureVariable` 1335, `MaterialiseAlarmCondition` 597, plus `WriteValue`/`WriteAlarmCondition` `CreateVariable`)
Status	Open

Description: ResolveParentFolder dereferences _root! with the null-forgiving operator, and CreateVariable uses _root (AddChild). _root is only assigned in CreateAddressSpace, which the SDK invokes during StandardServer start. Every public address-space mutator (WriteValue, WriteAlarmCondition, EnsureFolder, EnsureVariable, MaterialiseAlarmCondition) assumes CreateAddressSpace has already run. If any of these is ever called before the server has started (e.g. a sink wired or a publish replayed before StartAsync completes), _root is null and the call NREs out of the node manager. In the current boot ordering the host wires the SdkAddressSpaceSink only after start, so this is latent rather than live, but it is an unguarded ordering hazard on the highest-risk class.

Recommendation: Add an explicit guard (e.g. throw a clear InvalidOperationException("address space not yet created") or no-op with a logged warning) when _root is null at the top of the mutators, so a too-early call fails legibly instead of with a bare NRE. Low priority — defensive hardening, not a live defect. Left Open to avoid an unscoped change to the mutator entry points on this critical class without a regression scenario that reproduces the early-call ordering.

Resolution: (Open — defensive-only; latent given current boot ordering. Deferred to avoid an unscoped guard-add across five mutators without a reproducing pre-start ordering scenario.)

OpcUaServer-005

Field	Value
Severity	Low
Category	Testing coverage
Location	`OtOpcUaNodeManager.cs:2049` (`ServeRawPaged` tie-cluster stall path), `OtOpcUaNodeManager.cs:2068` (absurd-burst backstop)
Status	Won't Fix

Description: (Re-triaged to a non-issue.) Initial concern was that the oversized-tie-cluster paging path in ServeRawPaged (over-fetch the whole cluster, SliceTieCluster, the MaxTieClusterOverfetch absurd-burst BadHistoryOperationUnsupported backstop) lacked an end-to-end node-manager regression and was only covered at the pure level by HistoryPagingTests.

Recommendation: None required.

Resolution: Won't Fix — 2026-06-19: coverage already exists end-to-end. Verified that NodeManagerHistoryReadPagingTests.Raw_oversized_tie_cluster_pages_within_the_timestamp drives a real multi-page resume INTO the stall (a 5-way tie cluster larger than the page cap) through a booted OtOpcUaSdkServer + InMemoryHistoryContinuationStore and asserts lossless duplicate-free paging, and Raw_tie_cluster_beyond_overfetch_bound_fails_loudly exercises the > MaxTieClusterOverfetch backstop. The module's HistoryRead paging surface is comprehensively tested; no test gap.

OpcUaServer-006

Field	Value
Severity	Low
Category	Documentation & comments
Location	`OtOpcUaNodeManager.cs:11-30` (class XML doc), `OpcUaApplicationHost.cs:88-93` / `OpcUaApplicationHost.cs:421-423` (F13/F13c follow-up notes)
Status	Resolved

Description: Several XML comments reference superseded plan milestones as if still pending. The OtOpcUaNodeManager class doc says equipment-folder hierarchy + type metadata "still come from the AddressSpaceApplier / EquipmentNodeWalker integration (F14b, tracked under #85)", but EquipmentNodeWalker is no longer the integration path (the composer → applier → sink → node-manager chain replaced it, per AddressSpaceComposer's own header) and #85 hierarchy materialisation has shipped. OpcUaApplicationHost still carries "Full extraction … is tracked as follow-up F13" and "F13c will plug a real LDAP-bound validator" notes, but the IOpcUaUserAuthenticator / RoleCarryingUserIdentity impersonation wiring is implemented and live (the doc/docs/OpcUaServer.md describe it as done). These are stale-comment / undocumented-as-shipped issues, not behaviour bugs.

Recommendation: Refresh the class- and method-level doc comments to drop the EquipmentNodeWalker/F13/F13c/F14b "pending" framing and describe the shipped behaviour, matching docs/OpcUaServer.md.

Resolution: Resolved — 2026-06-19 (SHA pending): rewrote the OtOpcUaNodeManager class XML doc to describe the shipped folder-hierarchy / typed-variable / historized-node / Part 9 condition materialisation (dropping the false "treats every id as a flat BaseDataVariableState" claim and the retired EquipmentNodeWalker/F14b framing), and the two OpcUaApplicationHost doc blocks (class summary + BuildUserTokenPolicies) to describe the shipped impersonation/auth wiring instead of the "F13/F13c pending" framing. Doc-comment-only — no behaviour change; build re-verified green.

OpcUaServer-007

Field	Value
Severity	Medium
Category	Correctness & logic bugs
Location	`OtOpcUaNodeManager.cs:674` (`MaterialiseAlarmCondition` → `alarm.OnTimedUnshelve`)
Status	Resolved

Description: The system-timer auto-unshelve callback (alarm.OnTimedUnshelve, wired in MaterialiseAlarmCondition) — fired by the SDK when a TimedShelve duration expires — built its AlarmCommand with User: string.Empty: AlarmCommandRouter?.Invoke(new AlarmCommand(alarmId, "Unshelve", string.Empty, null, null)). That command is routed to the scripted-alarm engine, whose ScriptedAlarmEngine.UnshelveAsync calls Part9StateMachine.ApplyUnshelve(cur, user, _clock()) (src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/Part9StateMachine.cs:211-213), and that method opens with if (string.IsNullOrWhiteSpace(user)) throw new ArgumentException("User required.", …). The exception is swallowed downstream (caught by ScriptedAlarmHostActor), so this is NOT a crash — but the auto-unshelve silently no-ops: a TimedShelve never auto-expires, leaving the alarm permanently shelved with no operator-visible error. The bug was operationally invisible and even had a green node-manager test that asserted cmd.User == string.Empty, encoding the defect as expected behaviour. The separate, correct design rule — OnTimedUnshelve BYPASSES the AlarmAck role gate because it is a session-less system timer with no client principal (routing through the gated HandleAlarmCommand would return BadUserAccessDenied) — is intentional and was preserved; only the empty User string was the defect.

Recommendation: Pass a non-empty system user at the OnTimedUnshelve call site so ApplyUnshelve accepts it, while keeping the gate bypass. Use the codebase-wide canonical "system" system-actor label (matching Part9StateMachine's own AutoUnshelve audit user, AlarmConditionState's "engine-internal events ⇒ system" doc, and AuditActor.SystemFallback = "system").

Resolution: Resolved — 2026-06-19 (SHA pending): changed the OnTimedUnshelve delegate in MaterialiseAlarmCondition (OtOpcUaNodeManager.cs:674) to build the AlarmCommand with User: "system" instead of string.Empty, so the engine's ApplyUnshelve user-required guard accepts the system-initiated unshelve and the timed auto-unshelve actually applies. The AlarmAck gate bypass for the session-less system timer is unchanged. TDD: repurposed the existing AlarmCommandRouterTests.OnTimedUnshelve_with_system_context_returns_good_and_routes_unshelve (which previously asserted cmd.User == string.Empty) to assert a non-empty "system" user — verified it failed against the unfixed source (cmd.User should not be null or white space) and passes after the fix. Surgical one-line src change; no public-contract/wire change. Full OpcUaServer.Tests suite green (275/275).

16 KiB Raw Blame History