40749d3f67
Cross-module fix from the review sweep. -007 (Medium): OnTimedUnshelve built its AlarmCommand with User=string.Empty, so Part9StateMachine.ApplyUnshelve rejected it (ArgumentException, swallowed) and a TimedShelve never auto-expired. Pass the canonical 'system' user; the AlarmAck-gate bypass is preserved. Repurposed the test that had encoded the bug.
254 lines
16 KiB
Markdown
254 lines
16 KiB
Markdown
# Code Review — OpcUaServer
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Module | `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer` |
|
|
| Reviewer | Claude Code |
|
|
| Review date | 2026-06-19 |
|
|
| Commit reviewed | `7286d320` |
|
|
| Status | Reviewed |
|
|
| Open findings | 4 |
|
|
|
|
## Checklist coverage
|
|
|
|
A comprehensive review completes every category, recording "No issues found" where
|
|
a category produced nothing rather than leaving it blank.
|
|
|
|
| # | Category | Result |
|
|
|---|---|---|
|
|
| 1 | Correctness & logic bugs | OpcUaServer-001, -002, -003, -007 |
|
|
| 2 | OtOpcUa conventions | No issues found |
|
|
| 3 | Concurrency & thread safety | No issues found (Lock discipline + fire-and-forget dispatch verified correct) |
|
|
| 4 | Error handling & resilience | OpcUaServer-004 |
|
|
| 5 | Security | No issues found (WriteOperate / AlarmAck gates fail closed; anonymous → BadUserAccessDenied; HistoryRead AccessLevel bits correct) |
|
|
| 6 | Performance & resource management | No issues found |
|
|
| 7 | Design-document adherence | No issues found (ConditionId=dotted-FullName routing, MapSeverity buckets, fire-and-forget dispatch all match CLAUDE.md / docs) |
|
|
| 8 | Code organization & conventions | No issues found |
|
|
| 9 | Testing coverage | No issues found (HistoryRead paging incl. oversized-tie-cluster + backstop is covered end-to-end; OpcUaServer-005 re-triaged to Won't Fix) |
|
|
| 10 | Documentation & comments | OpcUaServer-006 (Resolved) |
|
|
|
|
## Findings
|
|
|
|
<!-- One ### entry per finding. IDs are OpcUaServer-NNN, sequential within the module,
|
|
never reused. Findings are never deleted — close them by changing Status and
|
|
completing Resolution. -->
|
|
|
|
### OpcUaServer-001
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `AddressSpacePlan.cs:56` (`AddressSpacePlan.IsEmpty`), `AddressSpacePlan.cs:80` (`AddressSpacePlanner.Compute`) |
|
|
| Status | Open |
|
|
|
|
**Description:** `AddressSpaceComposition` carries the UNS topology (`UnsAreas` + `UnsLines`), and
|
|
`AddressSpaceApplier.MaterialiseHierarchy` uses each area's/line's `DisplayName` for the OPC UA
|
|
folder display name. But `AddressSpacePlanner.Compute` only diffs Equipment / DriverInstance /
|
|
ScriptedAlarm / EquipmentTag / EquipmentVirtualTag — it never diffs `UnsAreas` / `UnsLines`. So a
|
|
deployment whose ONLY change is a UNS Area or Line **rename** (no equipment/driver/alarm/tag/vtag
|
|
delta) produces a plan whose `IsEmpty` is `true`. In `OpcUaPublishActor.HandleRebuild`
|
|
(`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:313-318`) an empty plan
|
|
short-circuits BEFORE `MaterialiseHierarchy` runs, so the renamed Area/Line folder keeps its stale
|
|
display name until some unrelated structural change forces a full rebuild. Compounding it,
|
|
`OtOpcUaNodeManager.EnsureFolder` (`OtOpcUaNodeManager.cs:1272`) early-returns for an
|
|
already-present folder id and never updates an existing folder's `DisplayName`, so even if
|
|
`MaterialiseHierarchy` were reached on a rename it would be a no-op until `RebuildAddressSpace`
|
|
clears `_folders`.
|
|
|
|
**Recommendation:** Carry UNS area/line diff sets into `AddressSpacePlan` (mirroring the
|
|
init-only EquipmentTag/VirtualTag pattern) so a rename is no longer "empty", and have the apply
|
|
path drive a hierarchy refresh (or make `EnsureFolder` update an existing folder's `DisplayName`
|
|
in place + `ClearChangeMasks`). Deferred: a complete fix spans the Runtime module
|
|
(`OpcUaPublishActor` must honour the new plan flag and call a hierarchy-refresh / rebuild path),
|
|
which is outside this module's edit boundary.
|
|
|
|
**Resolution:** _(Open — deferred: needs a coordinated change in the Runtime module's `OpcUaPublishActor` to act on a UNS-changed plan; an in-`AddressSpacePlan` change alone is inert because `Apply`/`MaterialiseHierarchy` do not refresh existing folder names.)_
|
|
|
|
### OpcUaServer-002
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `OtOpcUaNodeManager.cs:1748` (`HistoryReadEvents`), `OtOpcUaNodeManager.cs:1814` (`ClampToInt`) |
|
|
| Status | Open |
|
|
|
|
**Description:** For HistoryRead-Events, `HistoryReadEvents` passes
|
|
`ClampToInt(details.NumValuesPerNode)` to `IHistorianDataSource.ReadEventsAsync(maxEvents)` and
|
|
always returns the result with `ContinuationPoint = null` ("the full window in one shot"). The
|
|
events arm never issues continuation points. By the `IHistoryProvider.ReadEventsAsync` contract
|
|
(`src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs:88-95`), `maxEvents <= 0` is a
|
|
sentinel meaning "use the backend's **default cap**". But OPC UA Part 4/11 define
|
|
`NumValuesPerNode == 0` on a HistoryRead as "no limit — return ALL values". So a client that asks
|
|
for all events in a window (`NumValuesPerNode == 0`) is silently truncated at the backend's
|
|
default cap, and because no continuation point (and no `GoodMoreData`) is returned, the client has
|
|
no signal that the result was truncated — it believes it received the complete window. This is a
|
|
silent-data-loss / spec-deviation on the events arm.
|
|
|
|
**Recommendation:** Either (a) translate `NumValuesPerNode == 0` to "unbounded" for the events
|
|
backend (e.g. saturate to a very large cap) rather than the "default cap" sentinel, or (b) detect
|
|
backend truncation and surface a continuation point / `GoodMoreData` for events. Deferred: option
|
|
(a) changes the documented `maxEvents <= 0` sentinel semantics shared with the Wonderware/OpcUaClient
|
|
event backends (cross-module, Core.Abstractions contract); option (b) requires the backend to report
|
|
truncation. Both cross this module's boundary.
|
|
|
|
**Resolution:** _(Open — deferred: rooted in the cross-module `IHistoryProvider.ReadEventsAsync` `maxEvents <= 0` sentinel contract (Core.Abstractions-006) and the Wonderware/OpcUaClient event backends; cannot be fixed safely inside OpcUaServer alone.)_
|
|
|
|
### OpcUaServer-003
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `OtOpcUaNodeManager.cs:1978` (`ServeRawPaged`), `HistoryPaging.cs` (whole), `HistoryPaging.cs:213` (`SliceTieCluster` `next <= endUtc`) |
|
|
| Status | Open |
|
|
|
|
**Description:** The Raw paging chain treats `endUtc` as an **inclusive** upper bound throughout —
|
|
the `HistoryContinuationState`/`HistoryPaging` XML docs all say "the original (inclusive) end of
|
|
the window", and `SliceTieCluster` advances with `next <= endUtc`. But the backend contract
|
|
(`IHistoryProvider.ReadRawAsync`, `IHistoryProvider.cs:25`) defines `endUtc` as an **exclusive**
|
|
upper bound. The two arms therefore disagree at the exact `endUtc` boundary. Impact is small
|
|
because the backend is the authority on which samples actually exist (a sample at exactly `endUtc`
|
|
never appears in an exclusive-end read), so the disagreement only ever yields one extra empty
|
|
resume page (`[endUtc, endUtc)` → GoodNoData, no CP) rather than duplicated or dropped data.
|
|
Separately, the SDK permits a Raw read with `StartTime == DateTime.MinValue` + a `NumValuesPerNode`
|
|
cap (the OPC UA "read N values backward from EndTime" request — validated in
|
|
`CustomNodeManager.HistoryRead`); the forward-only resume cursor (`ComputeResumeCursor` resumes from
|
|
the LAST sample forward) is not defined for a backward read, so paging a backward Raw read is
|
|
unsound if the backend honours the backward direction.
|
|
|
|
**Recommendation:** Pin the `endUtc` inclusive-vs-exclusive convention against the actual backend
|
|
(align the paging XML docs + `SliceTieCluster` comparison to the contract), and either explicitly
|
|
reject a `StartTime == MinValue` paged Raw read with a clear status or document that the backend is
|
|
assumed to always return chronological forward results regardless of direction. Deferred: resolving
|
|
the inclusive/exclusive question requires confirming the Wonderware backend's actual boundary
|
|
semantics (cross-module / infra), and changing a comparison without that confirmation risks the
|
|
opposite off-by-one.
|
|
|
|
**Resolution:** _(Open — deferred: needs the backend's authoritative endUtc boundary semantics confirmed before the comparison/doc is changed; flipping it blindly risks an off-by-one in the other direction.)_
|
|
|
|
### OpcUaServer-004
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Error handling & resilience |
|
|
| Location | `OtOpcUaNodeManager.cs:1597` (`ResolveParentFolder`), and every public sink mutator that calls it (`EnsureFolder` 1278, `EnsureVariable` 1335, `MaterialiseAlarmCondition` 597, plus `WriteValue`/`WriteAlarmCondition` `CreateVariable`) |
|
|
| Status | Open |
|
|
|
|
**Description:** `ResolveParentFolder` dereferences `_root!` with the null-forgiving operator, and
|
|
`CreateVariable` uses `_root` (`AddChild`). `_root` is only assigned in `CreateAddressSpace`, which
|
|
the SDK invokes during `StandardServer` start. Every public address-space mutator
|
|
(`WriteValue`, `WriteAlarmCondition`, `EnsureFolder`, `EnsureVariable`, `MaterialiseAlarmCondition`)
|
|
assumes `CreateAddressSpace` has already run. If any of these is ever called before the server has
|
|
started (e.g. a sink wired or a publish replayed before `StartAsync` completes), `_root` is `null`
|
|
and the call NREs out of the node manager. In the current boot ordering the host wires the
|
|
`SdkAddressSpaceSink` only after start, so this is latent rather than live, but it is an unguarded
|
|
ordering hazard on the highest-risk class.
|
|
|
|
**Recommendation:** Add an explicit guard (e.g. throw a clear `InvalidOperationException("address
|
|
space not yet created")` or no-op with a logged warning) when `_root` is null at the top of the
|
|
mutators, so a too-early call fails legibly instead of with a bare NRE. Low priority — defensive
|
|
hardening, not a live defect. Left Open to avoid an unscoped change to the mutator entry points on
|
|
this critical class without a regression scenario that reproduces the early-call ordering.
|
|
|
|
**Resolution:** _(Open — defensive-only; latent given current boot ordering. Deferred to avoid an unscoped guard-add across five mutators without a reproducing pre-start ordering scenario.)_
|
|
|
|
### OpcUaServer-005
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Testing coverage |
|
|
| Location | `OtOpcUaNodeManager.cs:2049` (`ServeRawPaged` tie-cluster stall path), `OtOpcUaNodeManager.cs:2068` (absurd-burst backstop) |
|
|
| Status | Won't Fix |
|
|
|
|
**Description:** (Re-triaged to a non-issue.) Initial concern was that the oversized-tie-cluster
|
|
paging path in `ServeRawPaged` (over-fetch the whole cluster, `SliceTieCluster`, the
|
|
`MaxTieClusterOverfetch` absurd-burst `BadHistoryOperationUnsupported` backstop) lacked an
|
|
end-to-end node-manager regression and was only covered at the pure level by `HistoryPagingTests`.
|
|
|
|
**Recommendation:** None required.
|
|
|
|
**Resolution:** Won't Fix — 2026-06-19: coverage already exists end-to-end. Verified that
|
|
`NodeManagerHistoryReadPagingTests.Raw_oversized_tie_cluster_pages_within_the_timestamp` drives a
|
|
real multi-page resume INTO the stall (a 5-way tie cluster larger than the page cap) through a
|
|
booted `OtOpcUaSdkServer` + `InMemoryHistoryContinuationStore` and asserts lossless duplicate-free
|
|
paging, and `Raw_tie_cluster_beyond_overfetch_bound_fails_loudly` exercises the
|
|
`> MaxTieClusterOverfetch` backstop. The module's HistoryRead paging surface is comprehensively
|
|
tested; no test gap.
|
|
|
|
### OpcUaServer-006
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Location | `OtOpcUaNodeManager.cs:11-30` (class XML doc), `OpcUaApplicationHost.cs:88-93` / `OpcUaApplicationHost.cs:421-423` (F13/F13c follow-up notes) |
|
|
| Status | Resolved |
|
|
|
|
**Description:** Several XML comments reference superseded plan milestones as if still pending. The
|
|
`OtOpcUaNodeManager` class doc says equipment-folder hierarchy + type metadata "still come from the
|
|
AddressSpaceApplier / EquipmentNodeWalker integration (F14b, tracked under #85)", but
|
|
`EquipmentNodeWalker` is no longer the integration path (the composer → applier → sink → node-manager
|
|
chain replaced it, per `AddressSpaceComposer`'s own header) and #85 hierarchy materialisation has
|
|
shipped. `OpcUaApplicationHost` still carries "Full extraction … is tracked as follow-up F13" and
|
|
"F13c will plug a real LDAP-bound validator" notes, but the `IOpcUaUserAuthenticator` /
|
|
`RoleCarryingUserIdentity` impersonation wiring is implemented and live (the doc/`docs/OpcUaServer.md`
|
|
describe it as done). These are stale-comment / undocumented-as-shipped issues, not behaviour bugs.
|
|
|
|
**Recommendation:** Refresh the class- and method-level doc comments to drop the
|
|
`EquipmentNodeWalker`/F13/F13c/F14b "pending" framing and describe the shipped behaviour, matching
|
|
`docs/OpcUaServer.md`.
|
|
|
|
**Resolution:** Resolved — 2026-06-19 (SHA pending): rewrote the `OtOpcUaNodeManager` class XML
|
|
doc to describe the shipped folder-hierarchy / typed-variable / historized-node / Part 9
|
|
condition materialisation (dropping the false "treats every id as a flat BaseDataVariableState"
|
|
claim and the retired `EquipmentNodeWalker`/F14b framing), and the two `OpcUaApplicationHost`
|
|
doc blocks (class summary + `BuildUserTokenPolicies`) to describe the shipped impersonation/auth
|
|
wiring instead of the "F13/F13c pending" framing. Doc-comment-only — no behaviour change; build
|
|
re-verified green.
|
|
|
|
### OpcUaServer-007
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Location | `OtOpcUaNodeManager.cs:674` (`MaterialiseAlarmCondition` → `alarm.OnTimedUnshelve`) |
|
|
| Status | Resolved |
|
|
|
|
**Description:** The system-timer auto-unshelve callback (`alarm.OnTimedUnshelve`, wired in
|
|
`MaterialiseAlarmCondition`) — fired by the SDK when a `TimedShelve` duration expires — built its
|
|
`AlarmCommand` with `User: string.Empty`:
|
|
`AlarmCommandRouter?.Invoke(new AlarmCommand(alarmId, "Unshelve", string.Empty, null, null))`. That
|
|
command is routed to the scripted-alarm engine, whose `ScriptedAlarmEngine.UnshelveAsync` calls
|
|
`Part9StateMachine.ApplyUnshelve(cur, user, _clock())`
|
|
(`src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/Part9StateMachine.cs:211-213`), and that method
|
|
opens with `if (string.IsNullOrWhiteSpace(user)) throw new ArgumentException("User required.", …)`.
|
|
The exception is swallowed downstream (caught by `ScriptedAlarmHostActor`), so this is NOT a crash —
|
|
but the auto-unshelve **silently no-ops**: a `TimedShelve` never auto-expires, leaving the alarm
|
|
permanently shelved with no operator-visible error. The bug was operationally invisible and even had
|
|
a green node-manager test that asserted `cmd.User == string.Empty`, encoding the defect as expected
|
|
behaviour. The separate, correct design rule — `OnTimedUnshelve` BYPASSES the `AlarmAck` role gate
|
|
because it is a session-less system timer with no client principal (routing through the gated
|
|
`HandleAlarmCommand` would return `BadUserAccessDenied`) — is intentional and was preserved; only the
|
|
empty `User` string was the defect.
|
|
|
|
**Recommendation:** Pass a non-empty system user at the `OnTimedUnshelve` call site so `ApplyUnshelve`
|
|
accepts it, while keeping the gate bypass. Use the codebase-wide canonical `"system"` system-actor
|
|
label (matching `Part9StateMachine`'s own `AutoUnshelve` audit user, `AlarmConditionState`'s
|
|
"engine-internal events ⇒ `system`" doc, and `AuditActor.SystemFallback = "system"`).
|
|
|
|
**Resolution:** Resolved — 2026-06-19 (SHA pending): changed the `OnTimedUnshelve` delegate in
|
|
`MaterialiseAlarmCondition` (`OtOpcUaNodeManager.cs:674`) to build the `AlarmCommand` with
|
|
`User: "system"` instead of `string.Empty`, so the engine's `ApplyUnshelve` user-required guard
|
|
accepts the system-initiated unshelve and the timed auto-unshelve actually applies. The `AlarmAck`
|
|
gate bypass for the session-less system timer is unchanged. TDD: repurposed the existing
|
|
`AlarmCommandRouterTests.OnTimedUnshelve_with_system_context_returns_good_and_routes_unshelve`
|
|
(which previously asserted `cmd.User == string.Empty`) to assert a non-empty `"system"` user —
|
|
verified it failed against the unfixed source (`cmd.User should not be null or white space`) and
|
|
passes after the fix. Surgical one-line src change; no public-contract/wire change. Full
|
|
`OpcUaServer.Tests` suite green (275/275).
|