Surface real device-write status (write-outcome self-correction) — Design
Date: 2026-06-14
Status: Approved (brainstorming) — ready for implementation plan
Closes: pending.md follow-up #5 ("surface real device-write status to the client" / the "optimistic-write phantom")
Goal
When an authorized inbound OPC UA operator write to a writable equipment-tag node fails at the device, surface that to the client by reverting the node to its real pre-write value on the subscription — instead of leaving the optimistic value lingering as a phantom the device never accepted.
Background — the optimistic-write phantom
The inbound write pipeline (shipped in Milestone 1b, 8d8c05f5):
- A client writes a writable equipment-tag node. The OPC UA SDK's CustomNodeManager2.Write holds the node-manager Lock and invokes the node's OnWriteValue handler, OtOpcUaNodeManager.OnEquipmentTagWrite (src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs:604).
- EvaluateEquipmentWrite (OtOpcUaNodeManager.cs:637) gates on the WriteOperate data-plane role (fails closed → BadUserAccessDenied), then fire-and-forgets the dispatch via the Action<string,object?> NodeWriteRouter (OtOpcUaNodeManager.cs:100) and returns ServiceResult.Good. The SDK then applies the client value to the node optimistically.
- The router lambda (src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs:145) Asks the local DriverHostActor a bounded RouteNodeWrite (10 s) → NodeWriteResult(Success, Reason) (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:130,136), and in its ContinueWith only logs failures.
Why fire-and-forget is mandatory: OnWriteValue runs under the node-manager Lock. Blocking on
the device round-trip there would freeze every address-space operation (reads, subscription
notifications, the publish path) for up to the Ask timeout. So the synchronous Write response
cannot carry the device status — OPC UA writes are optimistic by protocol.
The phantom: a device-rejected write still returned Good, and the SDK kept the client value on
the node. For a polling tag the next poll republishes the real register value (self-corrects, so
the phantom is transient). For a static attribute that never re-pushes (e.g. many Galaxy
attributes), the wrong value lingers forever — a value the device never accepted.
Constraint that shapes the design
The real device status is only available after the async round-trip, and we can't block under the
Lock. Therefore the only honest place to surface it is the node's value/quality on the
subscription, after the fact. And the real pre-write value is only knowable at OnWriteValue
time inside the node manager — once the SDK applies the optimistic value, the prior value is gone. So
the node manager must (a) capture the prior value at write time and (b) own the post-round-trip
correction. That requires the dispatch to return the outcome to the node manager — which the
current fire-and-forget Action<string,object?> router cannot do.
Approved approach — complete the B3 gateway + async self-correct
Reintroduce the originally-designed (B3) write gateway that returns the outcome, and have the node manager own a compare-and-revert correction.
Components
- IOpcUaNodeWriteGateway (NEW, Commons), src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs:

      public interface IOpcUaNodeWriteGateway
      {
          Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct);
      }
      public readonly record struct NodeWriteOutcome(bool Success, string? Reason);

  Lives in Commons (Akka-free), the same layer as IOpcUaAddressSpaceSink. A NullOpcUaNodeWriteGateway default returns NodeWriteOutcome(false, "writes unavailable") (matching today's no-router-wired BadNotWritable).
- Production impl (Host): Asks the local DriverHostActor RouteNodeWrite (the existing bounded 10 s Ask) and maps NodeWriteResult → NodeWriteOutcome. This is the current router lambda repackaged to return the outcome (and resolve the actor lazily per write, as today). Lives in the Akka-aware Host assembly; the OpcUaServer assembly stays Akka-free.
- Node-manager seam swap: replace the Action<string,object?> NodeWriteRouter (OtOpcUaNodeManager.cs:100) and OtOpcUaSdkServer.SetNodeWriteRouter (OtOpcUaSdkServer.cs:52) with a settable IOpcUaNodeWriteGateway (volatile field set by the host at StartAsync, exactly the same lifecycle as the current router; null ⇒ writes unavailable). The host wires the production gateway in place of the current SetNodeWriteRouter(...) lambda (OtOpcUaServerHostedService.cs:145).
- OnEquipmentTagWrite change (OtOpcUaNodeManager.cs:604): after the WriteOperate gate passes, capture the prior value (((BaseDataVariableState)node).Value + its StatusCode; the SDK hasn't applied the optimistic value yet at handler entry), then kick off gateway.WriteAsync(nodeKey, writtenValue, ct) fire-and-forget (still non-blocking under the Lock) with a continuation that performs the correction on failure. Return ServiceResult.Good as today.
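The gateway contract and its unwired default can be sketched as follows. This is a minimal illustration, not the planned implementation: the interface and NodeWriteOutcome shapes are taken from this design, while GatewaySeam, its Current property, and the Instance singleton are hypothetical names invented here to show the "volatile field, null ⇒ writes unavailable" lifecycle.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Shapes as stated in this design.
public readonly record struct NodeWriteOutcome(bool Success, string? Reason);

public interface IOpcUaNodeWriteGateway
{
    Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct);
}

// Default used while no production gateway is wired: every write fails
// immediately with the fixed reason named in the design.
public sealed class NullOpcUaNodeWriteGateway : IOpcUaNodeWriteGateway
{
    // Hypothetical convenience singleton (assumption, not in the design).
    public static readonly NullOpcUaNodeWriteGateway Instance = new();

    public Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct)
        => Task.FromResult(new NodeWriteOutcome(false, "writes unavailable"));
}

// Hypothetical stand-in for the node-manager seam: a volatile field that the
// host sets on start and clears (sets to null) on stop.
public sealed class GatewaySeam
{
    private volatile IOpcUaNodeWriteGateway? _gateway;

    public void SetNodeWriteGateway(IOpcUaNodeWriteGateway? gateway) => _gateway = gateway;

    // null means "writes unavailable", so fall back to the Null default.
    public IOpcUaNodeWriteGateway Current => _gateway ?? NullOpcUaNodeWriteGateway.Instance;
}
```

Falling back to the Null default on a null field is one way to keep the handler free of null checks; returning a synchronous failure status instead, as the current no-router path does, would be an equally valid reading of the design.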
The correction — compare-and-revert
In the continuation (runs on a thread-pool thread after the original Write returned, so the Lock
is free to re-acquire briefly):
- If outcome.Success → do nothing (the poll path keeps polling tags fresh, exactly as today).
- If !outcome.Success → compare-and-revert: only revert if the node still holds the optimistic value (current Value equals the value we wrote). This avoids clobbering a fresh poll that already corrected a polling tag between the optimistic write and the failure. When it still holds the optimistic value, revert the node to the captured prior value (with its prior StatusCode, fresh now timestamp). For a failed write the device value is unchanged, so the prior value is the ground truth (no device re-read needed). Keep the existing failure log line.
Net effect: a device-rejected write to a static attribute bounces the value back on the subscription (phantom gone); polling tags behave exactly as today.
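The decision described above reduces to a small pure function. The sketch below assumes the ShouldRevert signature proposed later in the testing strategy; the containing class name WriteCorrection is hypothetical, and the equality rule (object.Equals on the boxed values) is the best-effort comparison this design names.

```csharp
using System;

// Shape as stated in this design.
public readonly record struct NodeWriteOutcome(bool Success, string? Reason);

// Hypothetical holder class; the design only fixes the decision itself.
public static class WriteCorrection
{
    // Returns true only when the write failed AND the node still holds the
    // optimistic value, i.e. nothing (such as a poll) has overwritten it since.
    public static bool ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue)
    {
        if (outcome.Success)
        {
            return false; // device accepted the write; nothing to correct
        }

        // Best-effort equality on the boxed values; null-safe.
        return object.Equals(currentNodeValue, optimisticValue);
    }
}
```

One caveat of this comparison: object.Equals is reference equality for arrays and is type-sensitive for boxed numerics (a boxed int 5 does not equal a boxed long 5), so array-valued or type-coerced tags may need a dedicated comparison if they are in scope.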
Where the revert is applied
OnEquipmentTagWrite is inside the node manager, which owns the nodes and the Lock. The continuation
applies the revert directly (the same lock (Lock) { variable.Value = …; variable.StatusCode = …; variable.ClearChangeMasks(…); } shape the poll/WriteValue path already uses), so no new sink call is
strictly required — though the revert MAY be funnelled through the node manager's existing internal
value-update helper for consistency. (Implementation plan decides the exact internal call; the
behaviour is fixed here.)
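The capture, fire-and-forget dispatch, and locked revert can be sketched end to end with stand-in types. Everything below is illustrative: FakeVariable and FakeNodeManager are hypothetical substitutes for the SDK node and the node manager, the uint status is a placeholder for the SDK StatusCode, and HandleWrite folds the SDK's optimistic apply into the same method so the sketch runs on its own.

```csharp
using System;
using System.Threading.Tasks;

public readonly record struct NodeWriteOutcome(bool Success, string? Reason);

// Hypothetical stand-in for an SDK variable node (NOT BaseDataVariableState).
public sealed class FakeVariable
{
    public object? Value;
    public uint StatusCode;      // placeholder for the SDK status; 0 = good
    public DateTime Timestamp;
}

// Hypothetical stand-in for the node manager and its Lock.
public sealed class FakeNodeManager
{
    private readonly object _lock = new();

    // Returns the correction task so callers (and tests) can observe it;
    // the real handler would not wait on it.
    public Task HandleWrite(FakeVariable node, object? written, Func<Task<NodeWriteOutcome>> dispatch)
    {
        lock (_lock)
        {
            // Capture the prior value before the optimistic value lands.
            object? priorValue = node.Value;
            uint priorStatus = node.StatusCode;

            // Start the device round-trip without waiting for it.
            Task<NodeWriteOutcome> pending = dispatch();

            // Hop to the thread pool so the correction can never run inline
            // under this lock, even if 'pending' is already completed.
            Task correction = Task.Run(() => CorrectAsync(node, written, priorValue, priorStatus, pending));

            node.Value = written; // stands in for the SDK's optimistic apply
            return correction;
        }
    }

    private async Task CorrectAsync(FakeVariable node, object? written,
        object? priorValue, uint priorStatus, Task<NodeWriteOutcome> pending)
    {
        NodeWriteOutcome outcome;
        try { outcome = await pending.ConfigureAwait(false); }
        catch (Exception ex) { outcome = new NodeWriteOutcome(false, ex.Message); }

        if (outcome.Success) return;

        lock (_lock) // brief re-acquire, off the original Write call
        {
            if (!object.Equals(node.Value, written)) return; // a poll already corrected it
            node.Value = priorValue;
            node.StatusCode = priorStatus;
            node.Timestamp = DateTime.UtcNow;
        }
    }
}
```

The thread-pool hop matters because the lock is re-entrant: a continuation that ran inline on the writing thread would compare against the pre-write value, skip the revert, and leave the phantom in place once the optimistic value is applied.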
Data flow
    client Write ─▶ SDK (holds Lock) ─▶ OnEquipmentTagWrite
         │                                │  WriteOperate gate (fail-closed)
         │                                │  capture priorValue + priorStatus
         │                                │  gateway.WriteAsync(nodeId, value, ct) ── fire-and-forget
         │                                └─ return Good (SDK applies optimistic value)
         ▼
    (later, off-Lock continuation)
      outcome = await WriteAsync ── via DriverHostActor.RouteNodeWrite → NodeWriteResult
      success? ── yes ▶ nothing (poll keeps it fresh)
               └  no  ▶ if node still == optimistic value: revert to priorValue (+ log)
Edge cases & decisions
- Compare-and-revert race: between optimistic-apply and the failure, a poll may overwrite the node. We revert only if the current node value still equals the optimistic value. Best-effort value equality (Equals on the boxed values); a missed revert just means a polling tag self-corrects on its own next poll anyway. The only case that needs the revert is the static attribute, which never gets overwritten, so the equality check holds there.
- Redundancy / not-primary: RouteNodeWrite already primary-gates (returns NodeWriteResult(false, "not primary")). A secondary doesn't serve client writes (clients connect to the primary), and the correction is local to the node that received the write. A "not primary" outcome is just another failure → compare-and-revert on the local node. No special handling.
- Timeout / actor unreachable: the gateway maps a faulted/timed-out Ask to NodeWriteOutcome(false, "write timeout"|…) → treated as failure → revert. (A genuine device write that succeeded but whose ack was lost would be wrongly reverted, then re-corrected by the next poll for polling tags; for static attributes this is an acceptable, rare, conservative bias toward showing the last-known device value.)
- No double-apply: the correction never writes to the device; it only updates the local node. So it can't loop.
- writtenValue capture: the existing handler already captures value into a local for the closure (OtOpcUaNodeManager.cs:612); the prior value is a second capture taken before returning.
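The timeout and fault mapping from the list above might look like the following. This is a sketch under assumptions: NodeWriteResult is redeclared here as a stand-in with the (Success, Reason) shape the design quotes, OutcomeMapping is a hypothetical helper name, and TimeoutException is used as the timeout signal purely for illustration, since the exception type the real Ask faults with is not specified in this document.

```csharp
using System;
using System.Threading.Tasks;

public readonly record struct NodeWriteOutcome(bool Success, string? Reason);

// Stand-in with the shape quoted in this design; the real type lives in Runtime.
public readonly record struct NodeWriteResult(bool Success, string? Reason);

// Hypothetical helper: turns any completion of the Ask into an outcome,
// so callers never see an exception from the dispatch.
public static class OutcomeMapping
{
    public static async Task<NodeWriteOutcome> ToOutcomeAsync(Task<NodeWriteResult> ask)
    {
        try
        {
            NodeWriteResult result = await ask.ConfigureAwait(false);
            return new NodeWriteOutcome(result.Success, result.Reason);
        }
        catch (TimeoutException)
        {
            // Assumption: the timeout surfaces as TimeoutException here.
            return new NodeWriteOutcome(false, "write timeout");
        }
        catch (Exception ex)
        {
            // Any other fault (e.g. actor unreachable) is also a failure.
            // The reason text is illustrative, not a fixed contract.
            return new NodeWriteOutcome(false, ex.Message);
        }
    }
}
```

Because every path yields a NodeWriteOutcome, the node manager's continuation only has to branch on Success, which keeps the revert decision independent of how the dispatch failed.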
Testing strategy
- Pure decision extraction: extract the correction decision as a pure static (mirroring how EvaluateEquipmentWrite was extracted at OtOpcUaNodeManager.cs:637), e.g. ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) → bool (+ the revert arguments). Unit tests (xUnit + Shouldly): fail + node still optimistic → revert; fail + node moved on → no revert; success → no revert; null-value edge cases.
- Gateway: unit-test the production gateway's NodeWriteResult → NodeWriteOutcome mapping (success, failure-with-reason, timeout/faulted Ask → failure) with an Akka.TestKit probe standing in for the DriverHostActor; and the Null/unwired default → (false, "writes unavailable").
- No bUnit / no Razor: this is server-side only.
- Live /run gate: write to a static writable tag whose device write deterministically fails, and confirm via Client.CLI subscribe that the node reverts to its prior value; confirm a successful write stays put. The plan picks a reliably-failing write (candidates: a Modbus illegal/out-of-range write on the :5020 sim; or a Galaxy ViewOnly attribute surfaced as a ReadWrite equipment tag so the gateway commit is refused). docker-dev is LOCAL; rebuild central on the branch + redeploy as in prior milestones.
Out of scope (deferred; per the chosen mechanism)
- Bad-quality blip on failure (keep the value but flag StatusCode = Bad).
- OPC UA AuditWriteUpdateEvent on failure.
- Synchronous structural fail-fast (rejecting not-primary / driver-down / unmapped writes in the Write response; needs a Lock-readable driver/primary snapshot).
These remain independently addable later if richer signalling is wanted.
Touched code (anticipated)
- NEW src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs (interface + NodeWriteOutcome + Null… default).
- src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs: replace the Action router with the gateway; capture prior value + the compare-and-revert continuation in OnEquipmentTagWrite; add the pure ShouldRevert decision.
- src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs: SetNodeWriteRouter → SetNodeWriteGateway.
- src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs: wire the production gateway impl instead of the current router lambda; clear it on StopAsync.
- NEW production gateway impl (Host assembly): Asks RouteNodeWrite, maps to NodeWriteOutcome.
- Tests in the OpcUaServer test project + the Host/Runtime test project.
No EF/Configuration schema change, no migration. DriverHostActor.RouteNodeWrite /
NodeWriteResult and the driver write path are unchanged — this design only changes how the outcome
is consumed on the server side.