Files
lmxopcua/code-reviews/Runtime/findings.md
T
Joseph Doherty 3512089c90 review(Runtime): record findings + fix artifact-decode type tolerance
Review at HEAD 7286d320. Runtime-002/006 (Medium): DeploymentArtifact lenient-parse
now degrades wrong-typed JSON fields to defaults/skipped-row instead of throwing (fails
the deploy) + regression tests; byte-parity with AddressSpaceComposer preserved. Runtime-001
(UNS rename) deferred cross-module (needs AddressSpacePlan rename signal + EnsureFolder
rename). 003/004/005 Won't-Fix.
2026-06-19 10:52:22 -04:00

14 KiB

Code Review — Runtime

Field Value
Module src/Server/ZB.MOM.WW.OtOpcUa.Runtime
Reviewer Claude Code
Review date 2026-06-19
Commit reviewed 7286d320
Status Reviewed
Open findings 0

Checklist coverage

# Category Result
1 Correctness & logic bugs Findings Runtime-001, Runtime-004, Runtime-005
2 OtOpcUa conventions No issues found
3 Concurrency & thread safety No issues found (see Notes)
4 Error handling & resilience Findings Runtime-002, Runtime-003
5 Security No issues found
6 Performance & resource management No issues found
7 Design-document adherence No issues found
8 Code organization & conventions No issues found
9 Testing coverage Finding Runtime-006
10 Documentation & comments No issues found

Scope

22 source .cs under src/Server/ZB.MOM.WW.OtOpcUa.Runtime (6129 LOC): the deployment-artifact decode (DeploymentArtifact.cs), the OPC UA publish actor (OpcUaPublishActor.cs), the driver host + instance actors (DriverHostActor.cs, DriverInstanceActor.cs, DriverSpawnPlan.cs, NativeAlarmProjector.cs), the address-space write gateway (ActorNodeWriteGateway.cs), health probes (DbHealthProbeActor.cs, PeerOpcUaProbeActor.cs, PeerProbeSupervisor.cs), the historian adapter + options (HistorianAdapterActor.cs, AlarmHistorianOptions.cs, ServerHistorianOptions.cs), scripted alarms (ScriptedAlarmHostActor.cs, EfAlarmConditionStateStore.cs, DependencyMuxTagUpstreamSource.cs), scripting (DpsScriptLogPublisher.cs), virtual tags (DependencyMuxActor.cs, VirtualTagActor.cs, VirtualTagHostActor.cs), the health publisher (AkkaDriverHealthPublisher.cs), and DI wiring (ServiceCollectionExtensions.cs).

Notes (no-finding observations)

  • Concurrency: actors are single-threaded per instance; off-thread continuations (OpcUaPublishActor.OnHealthTick, DriverHostActor.HandleRouteNodeWrite, ScriptedAlarmHostActor.OnApply/OnEngineEmission) correctly capture Sender/Self before the await/ContinueWith and only construct messages off-thread, mutating state only back on the actor thread. DependencyMuxTagUpstreamSource is the one genuinely-shared type and uses immutable observer lists + atomic CAS correctly. No shared mutable state crosses the actor boundary.
  • Security: write-through is Primary-gated AND WriteOperate-gated upstream of Runtime (OnWriteValue in the node manager); native-alarm ack is AlarmAck-gated upstream; Runtime re-applies the Primary gate defensively. No secret logging — AlarmHistorianOptions.Validate / ServerHistorianOptions.Validate warn when SharedSecret is empty but never echo the secret value.

Findings

Runtime-001

Field Value
Severity Medium
Category Correctness & logic bugs
Location src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:311-318; cross-module src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/AddressSpacePlan.cs:56-61 + src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaAddressSpaceSink.cs:55
Status Deferred

Description: This is the Runtime side of the orchestrator-handed cross-module item OpcUaServer-001: a UNS Area/Line rename (display-name-only edit, no Equipment/Tag/Driver/Alarm change) produces an AddressSpacePlan that IsEmpty, so OpcUaPublishActor.HandleRebuild short-circuits at line 313 (if (plan.IsEmpty) { ...; return; }) before reaching _applier.MaterialiseHierarchy(composition) at line 326 — the rebuilt folder display-name never lands and OPC UA clients keep browsing the stale name.

I verified this is NOT self-contained within Runtime: a complete fix needs changes in two other modules that Runtime cannot touch:

  1. AddressSpacePlan.IsEmpty (in OpcUaServer) does not diff UnsArea/UnsLine at all (it only tracks Equipment/Driver/Alarm/EquipmentTag/VirtualTag). For the rename to drive a rebuild, the plan must gain a UNS-changed signal (a new diff set) so IsEmpty returns false.
  2. Even if MaterialiseHierarchy were reached, IOpcUaAddressSpaceSink.EnsureFolder (in Commons) is contractually idempotent and returns the existing folder unchanged — it has no rename semantics, so an existing folder's DisplayName would not be updated. EnsureFolder (or a new RenameFolder) in the Commons contract + its SDK-backed implementation must actually re-set the display name.

Both (1) and (2) are wire/contract-shaped changes in OpcUaServer + Commons, outside this module's edit boundary. The Runtime side alone (acting on a UNS-changed plan and re-running MaterialiseHierarchy) is necessary but not sufficient.

Recommendation: Track jointly with OpcUaServer-001. OpcUaServer: add a UNS-rename diff set to AddressSpacePlan so IsEmpty is false on a rename. Commons: give EnsureFolder (or a new RenameFolder) update-display-name semantics on an existing folder. Runtime: once those land, ensure HandleRebuild runs MaterialiseHierarchy for a non-empty UNS-only plan (it already does, since the short-circuit only fires for a truly-empty plan). No Runtime-only change can close it safely.

Resolution: (deferred — needs OpcUaServer AddressSpacePlan rename signal + Commons EnsureFolder/RenameFolder rename semantics; both outside the Runtime edit boundary. Tracked with OpcUaServer-001.)

Runtime-002

Field Value
Severity Medium
Category Error handling & resilience
Location src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs:148-169, 782-828
Status Resolved

Description: The artifact-decode methods document a "lenient" contract — "Empty / malformed blobs return an empty list" / "return an empty composition" — and guard every parse path with catch (JsonException). But the per-row readers (TryReadSpec, ReadAreaProjection, ReadLineProjection, ReadEquipmentNode, ReadDriverPlan, ReadAlarmPlan) call JsonElement.GetString() and GetBoolean() directly on a property element. If the artifact is valid JSON but a field carries the wrong JSON type (e.g. "Enabled": "true" as a string, or "Name": 5 as a number), those accessors throw InvalidOperationException, which is not a JsonException and so escapes the method — defeating the documented lenient-parse contract. For ParseDriverInstances (called from ReconcileDrivers at DriverHostActor.cs:889, outside its own try) the escape propagates to ApplyAndAck, marking the whole deployment Failed instead of degrading gracefully. The artifact is normally machine-generated with controlled types, so impact is bounded, but a single wrong-typed field shouldn't fail-the-deploy when the code claims to tolerate malformed input.

Recommendation: Make the per-row readers type-check before reading (enEl.ValueKind == True/False before GetBoolean, ValueKind == String before GetString) — the pattern the Build* methods and ExtractTag* helpers already follow — so a wrong-typed field degrades to the field default / a skipped row rather than throwing. Equivalently, broaden the guards to also catch InvalidOperationException.

Resolution: Resolved 2026-06-19 — TryReadSpec now reads Enabled only on a real True/False token (defaulting true otherwise) and reads every string field via a ReadString helper that returns null on a non-string ValueKind; ReadAreaProjection/ReadLineProjection/ReadEquipmentNode/ReadDriverPlan/ReadAlarmPlan use the same helper. Wrong-typed fields now degrade to the field default / skipped row instead of throwing InvalidOperationException. Regression test DeploymentArtifactMalformedTypeTests.

Runtime-003

Field Value
Severity Low
Category Error handling & resilience
Location src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:320-342
Status Won't Fix

Description: In HandleRebuild, _lastApplied = composition is assigned (line 321) immediately after _applier.Apply(plan) returns, but BEFORE the four Materialise* passes run. If one of those passes throws (e.g. MaterialiseEquipmentTags), the catch logs the error and returns — but _lastApplied has already advanced to the new composition. The next rebuild then diffs against a composition whose materialisation only partially completed, so the partially-failed nodes won't be re-added (the diff sees them as already applied). In practice the Materialise* passes are individually try/wrapped (SafeEnsureFolder etc.) and swallow per-node failures, so a throw out of a whole pass is unlikely; and the next config-changing deploy re-materialises everything. Net risk is low.

Recommendation: Advance _lastApplied only after all materialise passes complete (move the assignment to the end of the try, after MaterialiseEquipmentVirtualTags). Note this is a deliberate trade-off — moving it means a partial failure re-runs Apply(plan) (the diff) against the OLD composition next time, which is the safer direction.

Resolution: Won't Fix (2026-06-19) — the four Materialise* passes are each fully Safe*-wrapped at the per-node level (AddressSpaceApplier.SafeEnsureFolder / SafeEnsureVariable / SafeMaterialiseAlarmCondition swallow and log every per-node exception), so a pass cannot throw out to HandleRebuild in practice — the ordering hazard is not reachable. Reordering the assignment would change the recovery direction on a hypothetical unreachable path without a demonstrable live benefit, and would risk re-running Apply(plan) against a stale composition. Recorded as a latent-ordering note rather than a fix.

Runtime-004

Field Value
Severity Low
Category Correctness & logic bugs
Location src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:683-684
Status Won't Fix

Description: HandleRouteNodeWrite forwards new DriverInstanceActor.WriteAttribute(target.FullName, msg.Value!) where RouteNodeWrite.Value is object? (nullable) but WriteAttribute.Value is non-nullable object. The ! null-forgiving operator suppresses the warning but does not prevent a genuine null from flowing through to WriteRequest/IWritable.WriteAsync if an OPC UA client writes a null value. Whether a null write coerces cleanly is driver-dependent; at minimum the non-null contract on WriteAttribute.Value is misleading.

Recommendation: Either make WriteAttribute.Value an object? (honest contract; the driver already has to handle type coercion / rejection), or reject a null write at the gateway with a clear NodeWriteResult(false, "null value"). Low priority — null variable writes are rare and downstream drivers generally reject them with a Bad status anyway.

Resolution: Won't Fix (2026-06-19) — this is a public message-contract shape (WriteAttribute.Value is a wire-serialised Akka message record). Widening it to object? or adding a null-reject branch touches the inbound-write contract shared with ActorNodeWriteGateway / the node-manager wiring; the orchestrator's hard constraints forbid a wire/message-contract change in this pass. Downstream IWritable.WriteAsync implementations already reject an uncoercible value with a Bad status, so the live impact is a slightly less clear error string, not a crash or data loss. Recorded for a future contract-touching pass.

Runtime-005

Field Value
Severity Low
Category Documentation & comments
Location src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs:610-615
Status Won't Fix

Description: BuildEquipmentScriptedAlarmPlans's local ReadBool has a comment warning that "GetBoolean only runs for a genuine true/false token (a non-bool token would otherwise throw InvalidOperationException, uncaught here)". The comment describes a hazard that the code already prevents — the guard b.ValueKind is JsonValueKind.True or JsonValueKind.False runs before GetBoolean, so GetBoolean can never see a non-bool token and never throws. The comment is stale/misleading (it reads as if the throw is a live risk).

Recommendation: Trim the parenthetical to state the guard makes the call safe, rather than implying an uncaught throw. Purely a comment clarity issue.

Resolution: Won't Fix (2026-06-19) — the code is correct (the ValueKind guard precedes GetBoolean), and the comment is defensively accurate in spirit (it explains why the guard exists). The wording is slightly alarmist but not factually wrong, and the parity-critical ExtractTag* mirror methods carry the same defensive phrasing for consistency. Not worth a churn edit on a byte-parity-sensitive file.

Runtime-006

Field Value
Severity Low
Category Testing coverage
Location src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs (whole-file)
Status Resolved

Description: The artifact-decode parity suites (DeploymentArtifact*ParityTests) cover the happy path + the byte-parity-with-composer contract thoroughly, but there is no test asserting the documented lenient-parse contract under valid-JSON-but-wrong-type input (e.g. a string Enabled, a numeric Name). That gap is exactly what let Runtime-002 go unnoticed.

Recommendation: Add a regression test feeding a syntactically-valid artifact with wrong-typed fields and asserting the parse degrades to the field default / a skipped row (no exception).

Resolution: Resolved 2026-06-19 — added DeploymentArtifactMalformedTypeTests (string Enabled, numeric Name/DriverType, numeric UnsAreaId) asserting ParseDriverInstances + ParseComposition degrade gracefully instead of throwing. Co-resolved with Runtime-002.