Cross-module fix from the review sweep. -007 (Medium): OnTimedUnshelve built its AlarmCommand with User=string.Empty, so Part9StateMachine.ApplyUnshelve rejected it (ArgumentException, swallowed) and a TimedShelve never auto-expired. Pass the canonical 'system' user; the AlarmAck-gate bypass is preserved. Repurposed the test that had encoded the bug.
Code Review — OpcUaServer
| Field | Value |
|---|---|
| Module | src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer |
| Reviewer | Claude Code |
| Review date | 2026-06-19 |
| Commit reviewed | 7286d320 |
| Status | Reviewed |
| Open findings | 4 |
Checklist coverage
A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | OpcUaServer-001, -002, -003, -007 (Resolved) |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found (Lock discipline + fire-and-forget dispatch verified correct) |
| 4 | Error handling & resilience | OpcUaServer-004 |
| 5 | Security | No issues found (WriteOperate / AlarmAck gates fail closed; anonymous → BadUserAccessDenied; HistoryRead AccessLevel bits correct) |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found (ConditionId=dotted-FullName routing, MapSeverity buckets, fire-and-forget dispatch all match CLAUDE.md / docs) |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found (HistoryRead paging incl. oversized-tie-cluster + backstop is covered end-to-end; OpcUaServer-005 re-triaged to Won't Fix) |
| 10 | Documentation & comments | OpcUaServer-006 (Resolved) |
Findings
OpcUaServer-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | AddressSpacePlan.cs:56 (AddressSpacePlan.IsEmpty), AddressSpacePlan.cs:80 (AddressSpacePlanner.Compute) |
| Status | Open |
Description: AddressSpaceComposition carries the UNS topology (UnsAreas + UnsLines), and
AddressSpaceApplier.MaterialiseHierarchy uses each area's/line's DisplayName for the OPC UA
folder display name. But AddressSpacePlanner.Compute only diffs Equipment / DriverInstance /
ScriptedAlarm / EquipmentTag / EquipmentVirtualTag — it never diffs UnsAreas / UnsLines. So a
deployment whose ONLY change is a UNS Area or Line rename (no equipment/driver/alarm/tag/vtag
delta) produces a plan whose IsEmpty is true. In OpcUaPublishActor.HandleRebuild
(src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:313-318) an empty plan
short-circuits BEFORE MaterialiseHierarchy runs, so the renamed Area/Line folder keeps its stale
display name until some unrelated structural change forces a full rebuild. Compounding it,
OtOpcUaNodeManager.EnsureFolder (OtOpcUaNodeManager.cs:1272) early-returns for an
already-present folder id and never updates an existing folder's DisplayName, so even if
MaterialiseHierarchy were reached on a rename it would be a no-op until RebuildAddressSpace
clears _folders.
Recommendation: Carry UNS area/line diff sets into AddressSpacePlan (mirroring the
init-only EquipmentTag/VirtualTag pattern) so a rename is no longer "empty", and have the apply
path drive a hierarchy refresh (or make EnsureFolder update an existing folder's DisplayName
in place + ClearChangeMasks). Deferred: a complete fix spans the Runtime module
(OpcUaPublishActor must honour the new plan flag and call a hierarchy-refresh / rebuild path),
which is outside this module's edit boundary.
Resolution: (Open — deferred: needs a coordinated change in the Runtime module's OpcUaPublishActor to act on a UNS-changed plan; an in-AddressSpacePlan change alone is inert because Apply/MaterialiseHierarchy do not refresh existing folder names.)
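The rename gap can be illustrated with a minimal sketch of a display-name diff. `UnsFolder`, `UnsFolderRename` and `UnsFolderDiff` are hypothetical stand-ins invented for this example; the real AddressSpaceComposition and AddressSpacePlan shapes are not shown in this review, so treat this as an assumption-laden illustration rather than the module's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for a UNS area or line; not the module's real types.
public sealed record UnsFolder(string Id, string DisplayName);

public sealed record UnsFolderRename(string Id, string OldDisplayName, string NewDisplayName);

public static class UnsFolderDiff
{
    // Returns one entry per folder id that exists in both snapshots
    // and whose DisplayName differs between them.
    public static IReadOnlyList<UnsFolderRename> Renames(
        IEnumerable<UnsFolder> previous,
        IEnumerable<UnsFolder> next)
    {
        var previousById = previous.ToDictionary(f => f.Id, StringComparer.Ordinal);

        return next
            .Where(f => previousById.TryGetValue(f.Id, out var old)
                        && !string.Equals(old.DisplayName, f.DisplayName, StringComparison.Ordinal))
            .Select(f => new UnsFolderRename(f.Id, previousById[f.Id].DisplayName, f.DisplayName))
            .ToList();
    }
}
```

Under this sketch, a plan would count as empty only when the rename list is also empty, so a rename-only deployment would no longer short-circuit. As the Resolution notes, the apply path would still need to refresh the existing folder's DisplayName for the diff to have any effect.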
OpcUaServer-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | OtOpcUaNodeManager.cs:1748 (HistoryReadEvents), OtOpcUaNodeManager.cs:1814 (ClampToInt) |
| Status | Open |
Description: For HistoryRead-Events, HistoryReadEvents passes
ClampToInt(details.NumValuesPerNode) to IHistorianDataSource.ReadEventsAsync(maxEvents) and
always returns the result with ContinuationPoint = null ("the full window in one shot"). The
events arm never issues continuation points. By the IHistoryProvider.ReadEventsAsync contract
(src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs:88-95), maxEvents <= 0 is a
sentinel meaning "use the backend's default cap". But OPC UA Part 4/11 define
NumValuesPerNode == 0 on a HistoryRead as "no limit — return ALL values". So a client that asks
for all events in a window (NumValuesPerNode == 0) is silently truncated at the backend's
default cap, and because no continuation point (and no GoodMoreData) is returned, the client has
no signal that the result was truncated — it believes it received the complete window. This is a
silent-data-loss / spec-deviation on the events arm.
Recommendation: Either (a) translate NumValuesPerNode == 0 to "unbounded" for the events
backend (e.g. saturate to a very large cap) rather than the "default cap" sentinel, or (b) detect
backend truncation and surface a continuation point / GoodMoreData for events. Deferred: option
(a) changes the documented maxEvents <= 0 sentinel semantics shared with the Wonderware/OpcUaClient
event backends (cross-module, Core.Abstractions contract); option (b) requires the backend to report
truncation. Both cross this module's boundary.
Resolution: (Open — deferred: rooted in the cross-module IHistoryProvider.ReadEventsAsync maxEvents <= 0 sentinel contract (Core.Abstractions-006) and the Wonderware/OpcUaClient event backends; cannot be fixed safely inside OpcUaServer alone.)
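Option (a) can be sketched as a small mapping function. `HistoryEventLimits.ToMaxEvents` is a hypothetical helper, not the module's ClampToInt; it assumes the backend treats a large positive maxEvents as an effective "no limit", which this review has not confirmed.

```csharp
using System;

public static class HistoryEventLimits
{
    // Hypothetical helper: maps the OPC UA NumValuesPerNode to a maxEvents argument.
    // 0 means "no limit" on the wire, so it saturates to int.MaxValue instead of
    // falling into the backend's "maxEvents <= 0 => default cap" sentinel.
    public static int ToMaxEvents(uint numValuesPerNode)
    {
        if (numValuesPerNode == 0)
        {
            return int.MaxValue;
        }

        // Values above int.MaxValue are clamped rather than wrapped to a negative number.
        return numValuesPerNode > int.MaxValue ? int.MaxValue : (int)numValuesPerNode;
    }
}
```

With this mapping, `ToMaxEvents(0)` never produces the sentinel, so a client asking for the full window is not silently cut at the default cap. The trade-off is that an unbounded read can be expensive, which is one reason option (b) may still be preferable.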
OpcUaServer-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | OtOpcUaNodeManager.cs:1978 (ServeRawPaged), HistoryPaging.cs (whole), HistoryPaging.cs:213 (SliceTieCluster next <= endUtc) |
| Status | Open |
Description: The Raw paging chain treats endUtc as an inclusive upper bound throughout —
the HistoryContinuationState/HistoryPaging XML docs all say "the original (inclusive) end of
the window", and SliceTieCluster advances with next <= endUtc. But the backend contract
(IHistoryProvider.ReadRawAsync, IHistoryProvider.cs:25) defines endUtc as an exclusive
upper bound. The two arms therefore disagree at the exact endUtc boundary. Impact is small
because the backend is the authority on which samples actually exist (a sample at exactly endUtc
never appears in an exclusive-end read), so the disagreement only ever yields one extra empty
resume page ([endUtc, endUtc) → GoodNoData, no CP) rather than duplicated or dropped data.
Separately, the SDK permits a Raw read with StartTime == DateTime.MinValue + a NumValuesPerNode
cap (the OPC UA "read N values backward from EndTime" request — validated in
CustomNodeManager.HistoryRead); the forward-only resume cursor (ComputeResumeCursor resumes from
the LAST sample forward) is not defined for a backward read, so paging a backward Raw read is
unsound if the backend honours the backward direction.
Recommendation: Pin the endUtc inclusive-vs-exclusive convention against the actual backend
(align the paging XML docs + SliceTieCluster comparison to the contract), and either explicitly
reject a StartTime == MinValue paged Raw read with a clear status or document that the backend is
assumed to always return chronological forward results regardless of direction. Deferred: resolving
the inclusive/exclusive question requires confirming the Wonderware backend's actual boundary
semantics (cross-module / infra), and changing a comparison without that confirmation risks the
opposite off-by-one.
Resolution: (Open — deferred: needs the backend's authoritative endUtc boundary semantics confirmed before the comparison/doc is changed; flipping it blindly risks an off-by-one in the other direction.)
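The boundary disagreement comes down to a single comparison operator. The sketch below uses hypothetical names (`ResumeWindow`, `HasMoreInclusive`, `HasMoreExclusive`); the real SliceTieCluster signature is not shown in this review, and the sketch does not say which reading matches the backend.

```csharp
using System;

public static class ResumeWindow
{
    // Inclusive reading (what the paging docs describe): a cursor equal to endUtc
    // still counts as inside the window, so one more page is requested.
    public static bool HasMoreInclusive(DateTime next, DateTime endUtc) => next <= endUtc;

    // Exclusive reading (what the backend contract describes): [start, endUtc) is
    // empty once the cursor reaches endUtc, so paging stops one step earlier.
    public static bool HasMoreExclusive(DateTime next, DateTime endUtc) => next < endUtc;
}
```

The two predicates differ only when `next == endUtc`: the inclusive one returns true and yields the extra empty resume page described above, while the exclusive one returns false. For any other cursor value they agree, which is why the impact is limited to that single page.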
OpcUaServer-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | OtOpcUaNodeManager.cs:1597 (ResolveParentFolder), and every public sink mutator that calls it (EnsureFolder 1278, EnsureVariable 1335, MaterialiseAlarmCondition 597, plus WriteValue/WriteAlarmCondition CreateVariable) |
| Status | Open |
Description: ResolveParentFolder dereferences _root! with the null-forgiving operator, and
CreateVariable uses _root (AddChild). _root is only assigned in CreateAddressSpace, which
the SDK invokes during StandardServer start. Every public address-space mutator
(WriteValue, WriteAlarmCondition, EnsureFolder, EnsureVariable, MaterialiseAlarmCondition)
assumes CreateAddressSpace has already run. If any of these is ever called before the server has
started (e.g. a sink wired or a publish replayed before StartAsync completes), _root is null
and the call NREs out of the node manager. In the current boot ordering the host wires the
SdkAddressSpaceSink only after start, so this is latent rather than live, but it is an unguarded
ordering hazard on the highest-risk class.
Recommendation: Add an explicit guard (e.g. throw a clear InvalidOperationException("address space not yet created") or no-op with a logged warning) when _root is null at the top of the
mutators, so a too-early call fails legibly instead of with a bare NRE. Low priority — defensive
hardening, not a live defect. Left Open to avoid an unscoped change to the mutator entry points on
this critical class without a regression scenario that reproduces the early-call ordering.
Resolution: (Open — defensive-only; latent given current boot ordering. Deferred to avoid an unscoped guard-add across five mutators without a reproducing pre-start ordering scenario.)
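A minimal sketch of the recommended guard follows. `GuardedNodeManagerSketch` and `RequireRoot` are hypothetical; the class models only the `_root` lifecycle and is not the real OtOpcUaNodeManager.

```csharp
#nullable enable
using System;

// Hypothetical miniature of a node manager; only the _root lifecycle is modelled.
public sealed class GuardedNodeManagerSketch
{
    // Assigned only once the address space has been created.
    private object? _root;

    public void CreateAddressSpace() => _root = new object();

    public void EnsureFolder(string folderId)
    {
        // Fails legibly if called before CreateAddressSpace instead of with a bare NRE.
        _ = RequireRoot();
        // ... the real mutator would attach the folder identified by folderId here ...
    }

    private object RequireRoot() =>
        _root ?? throw new InvalidOperationException("address space not yet created");
}
```

Routing each mutator through one helper like `RequireRoot` keeps the check in a single place, so the five entry points cannot drift apart. Whether to throw or to no-op with a logged warning remains the open design choice named in the Recommendation.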
OpcUaServer-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | OtOpcUaNodeManager.cs:2049 (ServeRawPaged tie-cluster stall path), OtOpcUaNodeManager.cs:2068 (absurd-burst backstop) |
| Status | Won't Fix |
Description: (Re-triaged to a non-issue.) Initial concern was that the oversized-tie-cluster
paging path in ServeRawPaged (over-fetch the whole cluster, SliceTieCluster, the
MaxTieClusterOverfetch absurd-burst BadHistoryOperationUnsupported backstop) lacked an
end-to-end node-manager regression and was only covered at the pure level by HistoryPagingTests.
Recommendation: None required.
Resolution: Won't Fix — 2026-06-19: coverage already exists end-to-end. Verified that
NodeManagerHistoryReadPagingTests.Raw_oversized_tie_cluster_pages_within_the_timestamp drives a
real multi-page resume INTO the stall (a 5-way tie cluster larger than the page cap) through a
booted OtOpcUaSdkServer + InMemoryHistoryContinuationStore and asserts lossless duplicate-free
paging, and Raw_tie_cluster_beyond_overfetch_bound_fails_loudly exercises the backstop for
clusters larger than MaxTieClusterOverfetch. The module's HistoryRead paging surface is comprehensively
tested; no test gap.
OpcUaServer-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | OtOpcUaNodeManager.cs:11-30 (class XML doc), OpcUaApplicationHost.cs:88-93 / OpcUaApplicationHost.cs:421-423 (F13/F13c follow-up notes) |
| Status | Resolved |
Description: Several XML comments reference superseded plan milestones as if still pending. The
OtOpcUaNodeManager class doc says equipment-folder hierarchy + type metadata "still come from the
AddressSpaceApplier / EquipmentNodeWalker integration (F14b, tracked under #85)", but
EquipmentNodeWalker is no longer the integration path (the composer → applier → sink → node-manager
chain replaced it, per AddressSpaceComposer's own header) and #85 hierarchy materialisation has
shipped. OpcUaApplicationHost still carries "Full extraction … is tracked as follow-up F13" and
"F13c will plug a real LDAP-bound validator" notes, but the IOpcUaUserAuthenticator /
RoleCarryingUserIdentity impersonation wiring is implemented and live (docs/OpcUaServer.md
describes it as done). These are stale-comment / undocumented-as-shipped issues, not behaviour bugs.
Recommendation: Refresh the class- and method-level doc comments to drop the
EquipmentNodeWalker/F13/F13c/F14b "pending" framing and describe the shipped behaviour, matching
docs/OpcUaServer.md.
Resolution: Resolved — 2026-06-19 (SHA pending): rewrote the OtOpcUaNodeManager class XML
doc to describe the shipped folder-hierarchy / typed-variable / historized-node / Part 9
condition materialisation (dropping the false "treats every id as a flat BaseDataVariableState"
claim and the retired EquipmentNodeWalker/F14b framing), and the two OpcUaApplicationHost
doc blocks (class summary + BuildUserTokenPolicies) to describe the shipped impersonation/auth
wiring instead of the "F13/F13c pending" framing. Doc-comment-only — no behaviour change; build
re-verified green.
OpcUaServer-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | OtOpcUaNodeManager.cs:674 (MaterialiseAlarmCondition → alarm.OnTimedUnshelve) |
| Status | Resolved |
Description: The system-timer auto-unshelve callback (alarm.OnTimedUnshelve, wired in
MaterialiseAlarmCondition) — fired by the SDK when a TimedShelve duration expires — built its
AlarmCommand with User: string.Empty:
AlarmCommandRouter?.Invoke(new AlarmCommand(alarmId, "Unshelve", string.Empty, null, null)). That
command is routed to the scripted-alarm engine, whose ScriptedAlarmEngine.UnshelveAsync calls
Part9StateMachine.ApplyUnshelve(cur, user, _clock())
(src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/Part9StateMachine.cs:211-213), and that method
opens with if (string.IsNullOrWhiteSpace(user)) throw new ArgumentException("User required.", …).
The exception is swallowed downstream (caught by ScriptedAlarmHostActor), so this is NOT a crash —
but the auto-unshelve silently no-ops: a TimedShelve never auto-expires, leaving the alarm
permanently shelved with no operator-visible error. The bug was operationally invisible and even had
a green node-manager test that asserted cmd.User == string.Empty, encoding the defect as expected
behaviour. The separate, correct design rule — OnTimedUnshelve BYPASSES the AlarmAck role gate
because it is a session-less system timer with no client principal (routing through the gated
HandleAlarmCommand would return BadUserAccessDenied) — is intentional and was preserved; only the
empty User string was the defect.
Recommendation: Pass a non-empty system user at the OnTimedUnshelve call site so ApplyUnshelve
accepts it, while keeping the gate bypass. Use the codebase-wide canonical "system" system-actor
label (matching Part9StateMachine's own AutoUnshelve audit user, AlarmConditionState's
"engine-internal events ⇒ system" doc, and AuditActor.SystemFallback = "system").
Resolution: Resolved — 2026-06-19 (SHA pending): changed the OnTimedUnshelve delegate in
MaterialiseAlarmCondition (OtOpcUaNodeManager.cs:674) to build the AlarmCommand with
User: "system" instead of string.Empty, so the engine's ApplyUnshelve user-required guard
accepts the system-initiated unshelve and the timed auto-unshelve actually applies. The AlarmAck
gate bypass for the session-less system timer is unchanged. TDD: repurposed the existing
AlarmCommandRouterTests.OnTimedUnshelve_with_system_context_returns_good_and_routes_unshelve
(which previously asserted cmd.User == string.Empty) to assert a non-empty "system" user —
verified it failed against the unfixed source (cmd.User should not be null or white space) and
passes after the fix. Surgical one-line src change; no public-contract/wire change. Full
OpcUaServer.Tests suite green (275/275).
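The interaction between the user-required guard and the timer callback can be sketched as follows. The `AlarmCommand` record here is a hypothetical shape inferred from the five-argument call quoted in the Description (the last two parameters are unnamed placeholders), and `TimedUnshelveSketch` is invented for illustration; neither is the module's real type.

```csharp
#nullable enable
using System;

// Hypothetical shape inferred from the quoted call; Arg1/Arg2 are placeholders
// because the review does not name the trailing parameters.
public sealed record AlarmCommand(string AlarmId, string Operation, string User, object? Arg1, object? Arg2);

public static class TimedUnshelveSketch
{
    public const string SystemUser = "system";

    // Mirrors the user-required guard described for Part9StateMachine.ApplyUnshelve.
    public static void ApplyUnshelve(AlarmCommand cmd)
    {
        if (string.IsNullOrWhiteSpace(cmd.User))
        {
            throw new ArgumentException("User required.", nameof(cmd));
        }
    }

    // The fixed call site: a non-empty system user instead of string.Empty.
    public static AlarmCommand BuildTimedUnshelve(string alarmId) =>
        new AlarmCommand(alarmId, "Unshelve", SystemUser, null, null);
}
```

In this sketch, a command built with `string.Empty` as the user is rejected by the guard, while the one from `BuildTimedUnshelve` passes. That is the same distinction the repurposed test now asserts.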