Files
lmxopcua/code-reviews/Runtime/findings.md
T
Joseph Doherty 3512089c90 review(Runtime): record findings + fix artifact-decode type tolerance
Review at HEAD 7286d320. Runtime-002/006 (Medium): DeploymentArtifact lenient-parse
now degrades wrong-typed JSON fields to defaults/skipped-row instead of throwing (fails
the deploy) + regression tests; byte-parity with AddressSpaceComposer preserved. Runtime-001
(UNS rename) deferred cross-module (needs AddressSpacePlan rename signal + EnsureFolder
rename). 003/004/005 Won't-Fix.
2026-06-19 10:52:22 -04:00

230 lines
14 KiB
Markdown

# Code Review — Runtime
<!-- Per-module findings file. See ../../REVIEW-PROCESS.md for the full process. The base README.md is
generated from these files by regen-readme.py — do not edit README.md by hand. -->
| Field | Value |
|---|---|
| Module | `src/Server/ZB.MOM.WW.OtOpcUa.Runtime` |
| Reviewer | Claude Code |
| Review date | 2026-06-19 |
| Commit reviewed | `7286d320` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Findings Runtime-001, Runtime-004, Runtime-005 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found (see Notes) |
| 4 | Error handling & resilience | Findings Runtime-002, Runtime-003 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Finding Runtime-006 |
| 10 | Documentation & comments | No issues found |
### Scope
22 source `.cs` under `src/Server/ZB.MOM.WW.OtOpcUa.Runtime` (6129 LOC): the deployment-artifact
decode (`DeploymentArtifact.cs`), the OPC UA publish actor (`OpcUaPublishActor.cs`), the driver host +
instance actors (`DriverHostActor.cs`, `DriverInstanceActor.cs`, `DriverSpawnPlan.cs`,
`NativeAlarmProjector.cs`), the address-space write gateway (`ActorNodeWriteGateway.cs`), health probes
(`DbHealthProbeActor.cs`, `PeerOpcUaProbeActor.cs`, `PeerProbeSupervisor.cs`), the historian adapter +
options (`HistorianAdapterActor.cs`, `AlarmHistorianOptions.cs`, `ServerHistorianOptions.cs`), scripted
alarms (`ScriptedAlarmHostActor.cs`, `EfAlarmConditionStateStore.cs`,
`DependencyMuxTagUpstreamSource.cs`), scripting (`DpsScriptLogPublisher.cs`), virtual tags
(`DependencyMuxActor.cs`, `VirtualTagActor.cs`, `VirtualTagHostActor.cs`), the health publisher
(`AkkaDriverHealthPublisher.cs`), and DI wiring (`ServiceCollectionExtensions.cs`).
### Notes (no-finding observations)
- **Concurrency**: actors are single-threaded per instance; off-thread continuations
(`OpcUaPublishActor.OnHealthTick`, `DriverHostActor.HandleRouteNodeWrite`,
`ScriptedAlarmHostActor.OnApply`/`OnEngineEmission`) correctly capture `Sender`/`Self` before the
`await`/`ContinueWith` and only construct messages off-thread, mutating state only back on the actor
thread. `DependencyMuxTagUpstreamSource` is the one genuinely-shared type and uses immutable
observer lists + atomic CAS correctly. No shared mutable state crosses the actor boundary.
- **Security**: write-through is Primary-gated AND `WriteOperate`-gated upstream of Runtime
(`OnWriteValue` in the node manager); native-alarm ack is `AlarmAck`-gated upstream; Runtime re-applies
the Primary gate defensively. No secret logging — `AlarmHistorianOptions.Validate` /
`ServerHistorianOptions.Validate` warn when `SharedSecret` is empty but never echo the secret value.
## Findings
### Runtime-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:311-318`; cross-module `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/AddressSpacePlan.cs:56-61` + `src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaAddressSpaceSink.cs:55` |
| Status | Deferred |
**Description:** This is the Runtime side of the orchestrator-handed cross-module item **OpcUaServer-001**:
a UNS Area/Line *rename* (display-name-only edit, no Equipment/Tag/Driver/Alarm change) produces an
`AddressSpacePlan` that `IsEmpty`, so `OpcUaPublishActor.HandleRebuild` short-circuits at line 313
(`if (plan.IsEmpty) { ...; return; }`) **before** reaching `_applier.MaterialiseHierarchy(composition)`
at line 326 — the rebuilt folder display-name never lands and OPC UA clients keep browsing the stale name.
I verified this is **NOT self-contained within Runtime**: a complete fix needs changes in two other
modules that Runtime cannot touch:
1. `AddressSpacePlan.IsEmpty` (in OpcUaServer) does not diff `UnsArea`/`UnsLine` at all (it only tracks
Equipment/Driver/Alarm/EquipmentTag/VirtualTag). For the rename to drive a rebuild, the plan must
gain a UNS-changed signal (a new diff set) so `IsEmpty` returns `false`.
2. Even if `MaterialiseHierarchy` were reached, `IOpcUaAddressSpaceSink.EnsureFolder` (in Commons) is
contractually **idempotent and returns the existing folder unchanged** — it has no rename semantics,
so an existing folder's `DisplayName` would not be updated. `EnsureFolder` (or a new `RenameFolder`)
in the Commons contract + its SDK-backed implementation must actually re-set the display name.
Both (1) and (2) are wire/contract-shaped changes in OpcUaServer + Commons, outside this module's edit
boundary. The Runtime side alone (acting on a UNS-changed plan and re-running `MaterialiseHierarchy`)
is necessary but not sufficient.
**Recommendation:** Track jointly with OpcUaServer-001. OpcUaServer: add a UNS-rename diff set to
`AddressSpacePlan` so `IsEmpty` is false on a rename. Commons: give `EnsureFolder` (or a new
`RenameFolder`) update-display-name semantics on an existing folder. Runtime: once those land, ensure
`HandleRebuild` runs `MaterialiseHierarchy` for a non-empty UNS-only plan (it already does, since the
short-circuit only fires for a truly-empty plan). No Runtime-only change can close it safely.
**Resolution:** _(deferred — needs OpcUaServer `AddressSpacePlan` rename signal + Commons `EnsureFolder`/`RenameFolder` rename semantics; both outside the Runtime edit boundary. Tracked with OpcUaServer-001.)_
### Runtime-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs:148-169, 782-828` |
| Status | Resolved |
**Description:** The artifact-decode methods document a "lenient" contract — *"Empty / malformed blobs
return an empty list"* / *"return an empty composition"* — and guard every parse path with
`catch (JsonException)`. But the per-row readers (`TryReadSpec`, `ReadAreaProjection`,
`ReadLineProjection`, `ReadEquipmentNode`, `ReadDriverPlan`, `ReadAlarmPlan`) call `JsonElement.GetString()`
and `GetBoolean()` directly on a property element. If the artifact is *valid JSON but a field carries the
wrong JSON type* (e.g. `"Enabled": "true"` as a string, or `"Name": 5` as a number), those accessors throw
`InvalidOperationException`, which is **not** a `JsonException` and so escapes the method — defeating the
documented lenient-parse contract. For `ParseDriverInstances` (called from `ReconcileDrivers` at
`DriverHostActor.cs:889`, outside its own try) the escape propagates to `ApplyAndAck`, marking the whole
deployment **Failed** instead of degrading gracefully. The artifact is normally machine-generated with
controlled types, so impact is bounded, but a single wrong-typed field shouldn't fail-the-deploy when the
code claims to tolerate malformed input.
**Recommendation:** Make the per-row readers type-check before reading (`enEl.ValueKind == True/False`
before `GetBoolean`, `ValueKind == String` before `GetString`) — the pattern the `Build*` methods and
`ExtractTag*` helpers already follow — so a wrong-typed field degrades to the field default / a skipped row
rather than throwing. Equivalently, broaden the guards to also catch `InvalidOperationException`.
**Resolution:** Resolved 2026-06-19 — `TryReadSpec` now reads `Enabled` only on a real `True/False` token
(defaulting `true` otherwise) and reads every string field via a `ReadString` helper that returns null on a
non-string `ValueKind`; `ReadAreaProjection`/`ReadLineProjection`/`ReadEquipmentNode`/`ReadDriverPlan`/`ReadAlarmPlan`
use the same helper. Wrong-typed fields now degrade to the field default / skipped row instead of throwing
`InvalidOperationException`. Regression test `DeploymentArtifactMalformedTypeTests`.
### Runtime-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs:320-342` |
| Status | Won't Fix |
**Description:** In `HandleRebuild`, `_lastApplied = composition` is assigned (line 321) immediately after
`_applier.Apply(plan)` returns, but BEFORE the four `Materialise*` passes run. If one of those passes throws
(e.g. `MaterialiseEquipmentTags`), the catch logs the error and returns — but `_lastApplied` has already
advanced to the new composition. The next rebuild then diffs against a composition whose materialisation
only partially completed, so the partially-failed nodes won't be re-added (the diff sees them as already
applied). In practice the `Materialise*` passes are individually try/wrapped (`SafeEnsureFolder` etc.) and
swallow per-node failures, so a throw out of a whole pass is unlikely; and the next *config-changing*
deploy re-materialises everything. Net risk is low.
**Recommendation:** Advance `_lastApplied` only after all materialise passes complete (move the assignment
to the end of the try, after `MaterialiseEquipmentVirtualTags`). Note this is a deliberate trade-off — moving
it means a partial failure re-runs `Apply(plan)` (the diff) against the OLD composition next time, which is
the safer direction.
**Resolution:** Won't Fix (2026-06-19) — the four `Materialise*` passes are each fully `Safe*`-wrapped at the
per-node level (`AddressSpaceApplier.SafeEnsureFolder` / `SafeEnsureVariable` / `SafeMaterialiseAlarmCondition`
swallow and log every per-node exception), so a pass cannot throw out to `HandleRebuild` in practice — the
ordering hazard is not reachable. Reordering the assignment would change the recovery direction on a
hypothetical unreachable path without a demonstrable live benefit, and would risk re-running `Apply(plan)`
against a stale composition. Recorded as a latent-ordering note rather than a fix.
### Runtime-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:683-684` |
| Status | Won't Fix |
**Description:** `HandleRouteNodeWrite` forwards `new DriverInstanceActor.WriteAttribute(target.FullName,
msg.Value!)` where `RouteNodeWrite.Value` is `object?` (nullable) but `WriteAttribute.Value` is non-nullable
`object`. The `!` null-forgiving operator suppresses the warning but does not prevent a genuine `null` from
flowing through to `WriteRequest`/`IWritable.WriteAsync` if an OPC UA client writes a null value. Whether a
null write coerces cleanly is driver-dependent; at minimum the non-null contract on `WriteAttribute.Value`
is misleading.
**Recommendation:** Either make `WriteAttribute.Value` an `object?` (honest contract; the driver already has
to handle type coercion / rejection), or reject a null write at the gateway with a clear
`NodeWriteResult(false, "null value")`. Low priority — null variable writes are rare and downstream drivers
generally reject them with a Bad status anyway.
**Resolution:** Won't Fix (2026-06-19) — this is a public message-contract shape (`WriteAttribute.Value` is a
wire-serialised Akka message record). Widening it to `object?` or adding a null-reject branch touches the
inbound-write contract shared with `ActorNodeWriteGateway` / the node-manager wiring; the orchestrator's hard
constraints forbid a wire/message-contract change in this pass. Downstream `IWritable.WriteAsync`
implementations already reject an uncoercible value with a Bad status, so the live impact is a slightly less
clear error string, not a crash or data loss. Recorded for a future contract-touching pass.
### Runtime-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs:610-615` |
| Status | Won't Fix |
**Description:** `BuildEquipmentScriptedAlarmPlans`'s local `ReadBool` has a comment warning that *"GetBoolean
only runs for a genuine true/false token (a non-bool token would otherwise throw InvalidOperationException,
uncaught here)"*. The comment describes a hazard that the code already prevents — the guard
`b.ValueKind is JsonValueKind.True or JsonValueKind.False` runs before `GetBoolean`, so `GetBoolean` can
never see a non-bool token and never throws. The comment is stale/misleading (it reads as if the throw is a
live risk).
**Recommendation:** Trim the parenthetical to state the guard makes the call safe, rather than implying an
uncaught throw. Purely a comment clarity issue.
**Resolution:** Won't Fix (2026-06-19) — the code is correct (the `ValueKind` guard precedes `GetBoolean`),
and the comment is defensively accurate in spirit (it explains *why* the guard exists). The wording is
slightly alarmist but not factually wrong, and the parity-critical `ExtractTag*` mirror methods carry the
same defensive phrasing for consistency. Not worth a churn edit on a byte-parity-sensitive file.
### Runtime-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs` (whole-file) |
| Status | Resolved |
**Description:** The artifact-decode parity suites (`DeploymentArtifact*ParityTests`) cover the happy path +
the byte-parity-with-composer contract thoroughly, but there is no test asserting the documented
*lenient-parse* contract under **valid-JSON-but-wrong-type** input (e.g. a string `Enabled`, a numeric
`Name`). That gap is exactly what let Runtime-002 go unnoticed.
**Recommendation:** Add a regression test feeding a syntactically-valid artifact with wrong-typed fields and
asserting the parse degrades to the field default / a skipped row (no exception).
**Resolution:** Resolved 2026-06-19 — added `DeploymentArtifactMalformedTypeTests` (string `Enabled`, numeric
`Name`/`DriverType`, numeric `UnsAreaId`) asserting `ParseDriverInstances` + `ParseComposition` degrade
gracefully instead of throwing. Co-resolved with Runtime-002.