Files
lmxopcua/docs/plans/2026-06-14-write-outcome-self-correction-design.md
T
Joseph Doherty 8a79cb0ec7
v2-ci / build (push) Failing after 42s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
docs(design): write-outcome self-correction (surface real device-write status, #5)
2026-06-14 01:13:38 -04:00

12 KiB

Surface real device-write status (write-outcome self-correction) — Design

Date: 2026-06-14 Status: Approved (brainstorming) — ready for implementation plan Closes: pending.md follow-up #5 ("surface real device-write status to the client" / the "optimistic-write phantom")

Goal

When an authorized inbound OPC UA operator write to a writable equipment-tag node fails at the device, surface that to the client by reverting the node to its real pre-write value on the subscription — instead of leaving the optimistic value lingering as a phantom the device never accepted.

Background — the optimistic-write phantom

The inbound write pipeline (shipped in Milestone 1b, 8d8c05f5):

  1. A client writes a writable equipment-tag node. The OPC UA SDK's CustomNodeManager2.Write holds the node-manager Lock and invokes the node's OnWriteValue handler, OtOpcUaNodeManager.OnEquipmentTagWrite (src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs:604).
  2. EvaluateEquipmentWrite (OtOpcUaNodeManager.cs:637) gates on the WriteOperate data-plane role (fails closed → BadUserAccessDenied), then fire-and-forgets the dispatch via the Action<string,object?> NodeWriteRouter (OtOpcUaNodeManager.cs:100) and returns ServiceResult.Good. The SDK then applies the client value to the node optimistically.
  3. The router lambda (src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs:145) Asks the local DriverHostActor a bounded RouteNodeWrite (10 s) → NodeWriteResult(Success, Reason) (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:130,136), and in its ContinueWith only logs failures.

Why fire-and-forget is mandatory: OnWriteValue runs under the node-manager Lock. Blocking on the device round-trip there would freeze every address-space operation (reads, subscription notifications, the publish path) for up to the Ask timeout. So the synchronous Write response cannot carry the device status — OPC UA writes are optimistic by protocol.

The phantom: a device-rejected write still returned Good, and the SDK kept the client value on the node. For a polling tag the next poll republishes the real register value (self-corrects, so the phantom is transient). For a static attribute that never re-pushes (e.g. many Galaxy attributes), the wrong value lingers forever — a value the device never accepted.

Constraint that shapes the design

The real device status is only available after the async round-trip, and we can't block under the Lock. Therefore the only honest place to surface it is the node's value/quality on the subscription, after the fact. And the real pre-write value is only knowable at OnWriteValue time inside the node manager — once the SDK applies the optimistic value, the prior value is gone. So the node manager must (a) capture the prior value at write time and (b) own the post-round-trip correction. That requires the dispatch to return the outcome to the node manager — which the current fire-and-forget Action<string,object?> router cannot do.

Approved approach — complete the B3 gateway + async self-correct

Reintroduce the originally-designed (B3) write gateway that returns the outcome, and have the node manager own a compare-and-revert correction.

Components

  1. IOpcUaNodeWriteGateway (NEW, Commons)src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs:

    public interface IOpcUaNodeWriteGateway
    {
        Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct);
    }
    public readonly record struct NodeWriteOutcome(bool Success, string? Reason);
    

    Lives in Commons (Akka-free), the same layer as IOpcUaAddressSpaceSink. A NullOpcUaNodeWriteGateway default returns NodeWriteOutcome(false, "writes unavailable") (matching today's no-router-wired BadNotWritable).

  2. Production impl (Host) — Asks the local DriverHostActor RouteNodeWrite (the existing bounded 10 s Ask) and maps NodeWriteResultNodeWriteOutcome. This is the current router lambda repackaged to return the outcome (and resolve the actor lazily per write, as today). Lives in the Akka-aware Host assembly; the OpcUaServer assembly stays Akka-free.

  3. Node-manager seam swap — replace the Action<string,object?> NodeWriteRouter (OtOpcUaNodeManager.cs:100) and OtOpcUaSdkServer.SetNodeWriteRouter (OtOpcUaSdkServer.cs:52) with a settable IOpcUaNodeWriteGateway (volatile field set by the host at StartAsync, exactly the same lifecycle as the current router — null ⇒ writes unavailable). The host wires the production gateway in place of the current SetNodeWriteRouter(...) lambda (OtOpcUaServerHostedService.cs:145).

  4. OnEquipmentTagWrite change (OtOpcUaNodeManager.cs:604): after the WriteOperate gate passes, capture the prior value (((BaseDataVariableState)node).Value + its StatusCode — the SDK hasn't applied the optimistic value yet at handler entry), then kick off gateway.WriteAsync(nodeKey, writtenValue, ct) fire-and-forget (still non-blocking under the Lock) with a continuation that performs the correction on failure. Return ServiceResult.Good as today.

The correction — compare-and-revert

In the continuation (runs on a thread-pool thread after the original Write returned, so the Lock is free to re-acquire briefly):

  • If outcome.Successdo nothing (the poll path keeps polling tags fresh, exactly as today).
  • If !outcome.Successcompare-and-revert: only revert if the node still holds the optimistic value (currentValue equals the value we wrote). This avoids clobbering a fresh poll that already corrected a polling tag between the optimistic write and the failure. When it still holds the optimistic value, revert the node to the captured prior value (with its prior StatusCode, fresh now timestamp) — for a failed write the device value is unchanged, so the prior value is the ground truth (no device re-read needed). Keep the existing failure log line.

Net effect: a device-rejected write to a static attribute bounces the value back on the subscription (phantom gone); polling tags behave exactly as today.

Where the revert is applied

OnEquipmentTagWrite is inside the node manager, which owns the nodes and the Lock. The continuation applies the revert directly (the same lock (Lock) { variable.Value = …; variable.StatusCode = …; variable.ClearChangeMasks(…); } shape the poll/WriteValue path already uses), so no new sink call is strictly required — though the revert MAY be funnelled through the node manager's existing internal value-update helper for consistency. (Implementation plan decides the exact internal call; the behaviour is fixed here.)

Data flow

client Write ─▶ SDK (holds Lock) ─▶ OnEquipmentTagWrite
     │                                    │ WriteOperate gate (fail-closed)
     │                                    │ capture priorValue + priorStatus
     │                                    │ gateway.WriteAsync(nodeId, value, ct)  ── fire-and-forget
     │                                    └─ return Good  (SDK applies optimistic value)
     ▼
  (later, off-Lock continuation)
   outcome = await WriteAsync          ── via DriverHostActor.RouteNodeWrite → NodeWriteResult
   success?  ── yes ▶ nothing (poll keeps it fresh)
             └ no  ▶ if node still == optimistic value: revert to priorValue (+ log)

Edge cases & decisions

  • Compare-and-revert race: between optimistic-apply and the failure, a poll may overwrite the node. We revert only if the current node value still equals the optimistic value. Best-effort value equality (Equals on the boxed values); a missed revert just means a polling tag self-corrects on its own next poll anyway — the only case that needs the revert is the static attribute, which never gets overwritten, so the equality check holds there.
  • Redundancy / not-primary: RouteNodeWrite already primary-gates (returns NodeWriteResult(false, "not primary")). A secondary doesn't serve client writes (clients connect to the primary), and the correction is local to the node that received the write. A "not primary" outcome is just another failure → compare-and-revert on the local node. No special handling.
  • Timeout / actor unreachable: the gateway maps a faulted/timed-out Ask to NodeWriteOutcome(false, "write timeout"|…) → treated as failure → revert. (A genuine device write that succeeded but whose ack was lost would be wrongly reverted, then re-corrected by the next poll for polling tags; for static attributes this is an acceptable, rare, conservative bias toward showing the last-known device value.)
  • No double-apply: the correction never writes to the device — it only updates the local node. So it can't loop.
  • writtenValue capture: the existing handler already captures value into a local for the closure (OtOpcUaNodeManager.cs:612); the prior value is a second capture taken before returning.

Testing strategy

  • Pure decision extraction: extract the correction decision as a pure static (mirroring how EvaluateEquipmentWrite was extracted at OtOpcUaNodeManager.cs:637) — e.g. ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) → bool (+ the revert arguments). Unit tests (xUnit + Shouldly): fail + node still optimistic → revert; fail + node moved on → no revert; success → no revert; null-value edge cases.
  • Gateway: unit-test the production gateway's NodeWriteResult → NodeWriteOutcome mapping (success, failure-with-reason, timeout/faulted Ask → failure) with an Akka.TestKit probe standing in for the DriverHostActor; and the Null/unwired default → (false, "writes unavailable").
  • No bUnit / no Razor — this is server-side only.
  • Live /run gate: write to a static writable tag whose device write deterministically fails, and confirm via Client.CLI subscribe that the node reverts to its prior value; confirm a successful write stays put. The plan picks a reliably-failing write (candidates: a Modbus illegal/out-of-range write on the :5020 sim; or a Galaxy ViewOnly attribute surfaced as a ReadWrite equipment tag so the gateway commit is refused). docker-dev is LOCAL; rebuild central on the branch + redeploy as in prior milestones.

Out of scope (deferred; per the chosen mechanism)

  • Bad-quality blip on failure (keep the value but flag StatusCode = Bad).
  • OPC UA AuditWriteUpdateEvent on failure.
  • Synchronous structural fail-fast (rejecting not-primary / driver-down / unmapped writes in the Write response — needs a Lock-readable driver/primary snapshot).

These remain independently addable later if richer signalling is wanted.

Touched code (anticipated)

  • NEW src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs (interface + NodeWriteOutcome + Null… default).
  • src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs — replace the Action router with the gateway; capture prior value + the compare-and-revert continuation in OnEquipmentTagWrite; add the pure ShouldRevert decision.
  • src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.csSetNodeWriteRouterSetNodeWriteGateway.
  • src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs — wire the production gateway impl instead of the current router lambda; clear it on StopAsync.
  • NEW production gateway impl (Host assembly) — Asks RouteNodeWrite, maps to NodeWriteOutcome.
  • Tests in the OpcUaServer test project + the Host/Runtime test project.

No EF/Configuration schema change, no migration. DriverHostActor.RouteNodeWrite / NodeWriteResult and the driver write path are unchanged — this design only changes how the outcome is consumed on the server side.