# Surface real device-write status (write-outcome self-correction) — Design **Date:** 2026-06-14 **Status:** Approved (brainstorming) — ready for implementation plan **Closes:** `pending.md` follow-up #5 ("surface real device-write status to the client" / the "optimistic-write phantom") ## Goal When an authorized inbound OPC UA operator write to a writable equipment-tag node **fails at the device**, surface that to the client by **reverting the node to its real pre-write value** on the subscription — instead of leaving the optimistic value lingering as a phantom the device never accepted. ## Background — the optimistic-write phantom The inbound write pipeline (shipped in Milestone 1b, `8d8c05f5`): 1. A client writes a writable equipment-tag node. The OPC UA SDK's `CustomNodeManager2.Write` holds the node-manager `Lock` and invokes the node's `OnWriteValue` handler, `OtOpcUaNodeManager.OnEquipmentTagWrite` (`src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs:604`). 2. `EvaluateEquipmentWrite` (`OtOpcUaNodeManager.cs:637`) gates on the `WriteOperate` data-plane role (fails closed → `BadUserAccessDenied`), then **fire-and-forgets** the dispatch via the `Action NodeWriteRouter` (`OtOpcUaNodeManager.cs:100`) and returns `ServiceResult.Good`. The SDK then applies the client value to the node optimistically. 3. The router lambda (`src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs:145`) Asks the local `DriverHostActor` a bounded `RouteNodeWrite` (10 s) → `NodeWriteResult(Success, Reason)` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:130,136`), and in its `ContinueWith` **only logs** failures. **Why fire-and-forget is mandatory:** `OnWriteValue` runs under the node-manager `Lock`. Blocking on the device round-trip there would freeze every address-space operation (reads, subscription notifications, the publish path) for up to the Ask timeout. So the synchronous Write *response* cannot carry the device status — OPC UA writes are optimistic by protocol. **The phantom:** a device-rejected write still returned `Good`, and the SDK kept the client value on the node. For a **polling** tag the next poll republishes the real register value (self-corrects, so the phantom is transient). For a **static** attribute that never re-pushes (e.g. many Galaxy attributes), the wrong value **lingers forever** — a value the device never accepted. ## Constraint that shapes the design The real device status is only available **after** the async round-trip, and we can't block under the `Lock`. Therefore the only honest place to surface it is the node's **value/quality on the subscription**, after the fact. And the **real pre-write value** is only knowable at `OnWriteValue` time inside the node manager — once the SDK applies the optimistic value, the prior value is gone. So the node manager must (a) capture the prior value at write time and (b) own the post-round-trip correction. That requires the dispatch to **return the outcome** to the node manager — which the current fire-and-forget `Action` router cannot do. ## Approved approach — complete the B3 gateway + async self-correct Reintroduce the originally-designed (B3) write gateway that returns the outcome, and have the node manager own a compare-and-revert correction. ### Components 1. **`IOpcUaNodeWriteGateway` (NEW, Commons)** — `src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs`: ```csharp public interface IOpcUaNodeWriteGateway { Task WriteAsync(string nodeId, object? value, CancellationToken ct); } public readonly record struct NodeWriteOutcome(bool Success, string? Reason); ``` Lives in Commons (Akka-free), the same layer as `IOpcUaAddressSpaceSink`. A `NullOpcUaNodeWriteGateway` default returns `NodeWriteOutcome(false, "writes unavailable")` (matching today's no-router-wired `BadNotWritable`). 2. **Production impl (Host)** — Asks the local `DriverHostActor` `RouteNodeWrite` (the existing bounded 10 s Ask) and maps `NodeWriteResult` → `NodeWriteOutcome`. This is the current router lambda repackaged to *return* the outcome (and resolve the actor lazily per write, as today). Lives in the Akka-aware Host assembly; the OpcUaServer assembly stays Akka-free. 3. **Node-manager seam swap** — replace the `Action NodeWriteRouter` (`OtOpcUaNodeManager.cs:100`) and `OtOpcUaSdkServer.SetNodeWriteRouter` (`OtOpcUaSdkServer.cs:52`) with a settable `IOpcUaNodeWriteGateway` (volatile field set by the host at `StartAsync`, exactly the same lifecycle as the current router — null ⇒ writes unavailable). The host wires the production gateway in place of the current `SetNodeWriteRouter(...)` lambda (`OtOpcUaServerHostedService.cs:145`). 4. **`OnEquipmentTagWrite` change** (`OtOpcUaNodeManager.cs:604`): after the `WriteOperate` gate passes, **capture the prior value** (`((BaseDataVariableState)node).Value` + its `StatusCode` — the SDK hasn't applied the optimistic value yet at handler entry), then kick off `gateway.WriteAsync(nodeKey, writtenValue, ct)` **fire-and-forget** (still non-blocking under the Lock) with a continuation that performs the correction on failure. Return `ServiceResult.Good` as today. ### The correction — compare-and-revert In the continuation (runs on a thread-pool thread *after* the original Write returned, so the `Lock` is free to re-acquire briefly): - If `outcome.Success` → **do nothing** (the poll path keeps polling tags fresh, exactly as today). - If `!outcome.Success` → **compare-and-revert**: only revert if the node **still holds the optimistic value** (`currentValue` equals the value we wrote). This avoids clobbering a fresh poll that already corrected a polling tag between the optimistic write and the failure. When it still holds the optimistic value, revert the node to the captured **prior value** (with its prior `StatusCode`, fresh `now` timestamp) — for a failed write the device value is unchanged, so the prior value *is* the ground truth (no device re-read needed). Keep the existing failure log line. Net effect: a device-rejected write to a **static** attribute bounces the value back on the subscription (phantom gone); **polling** tags behave exactly as today. ### Where the revert is applied `OnEquipmentTagWrite` is inside the node manager, which owns the nodes and the `Lock`. The continuation applies the revert directly (the same `lock (Lock) { variable.Value = …; variable.StatusCode = …; variable.ClearChangeMasks(…); }` shape the poll/`WriteValue` path already uses), so no new sink call is strictly required — though the revert MAY be funnelled through the node manager's existing internal value-update helper for consistency. (Implementation plan decides the exact internal call; the behaviour is fixed here.) ## Data flow ``` client Write ─▶ SDK (holds Lock) ─▶ OnEquipmentTagWrite │ │ WriteOperate gate (fail-closed) │ │ capture priorValue + priorStatus │ │ gateway.WriteAsync(nodeId, value, ct) ── fire-and-forget │ └─ return Good (SDK applies optimistic value) ▼ (later, off-Lock continuation) outcome = await WriteAsync ── via DriverHostActor.RouteNodeWrite → NodeWriteResult success? ── yes ▶ nothing (poll keeps it fresh) └ no ▶ if node still == optimistic value: revert to priorValue (+ log) ``` ## Edge cases & decisions - **Compare-and-revert race:** between optimistic-apply and the failure, a poll may overwrite the node. We revert *only* if the current node value still equals the optimistic value. Best-effort value equality (`Equals` on the boxed values); a missed revert just means a polling tag self-corrects on its own next poll anyway — the only case that *needs* the revert is the static attribute, which never gets overwritten, so the equality check holds there. - **Redundancy / not-primary:** `RouteNodeWrite` already primary-gates (returns `NodeWriteResult(false, "not primary")`). A secondary doesn't serve client writes (clients connect to the primary), and the correction is local to the node that received the write. A "not primary" outcome is just another failure → compare-and-revert on the local node. No special handling. - **Timeout / actor unreachable:** the gateway maps a faulted/timed-out Ask to `NodeWriteOutcome(false, "write timeout"|…)` → treated as failure → revert. (A genuine device write that succeeded but whose ack was lost would be wrongly reverted, then re-corrected by the next poll for polling tags; for static attributes this is an acceptable, rare, conservative bias toward showing the last-known device value.) - **No double-apply:** the correction never *writes to the device* — it only updates the local node. So it can't loop. - **`writtenValue` capture:** the existing handler already captures `value` into a local for the closure (`OtOpcUaNodeManager.cs:612`); the prior value is a second capture taken before returning. ## Testing strategy - **Pure decision extraction:** extract the correction decision as a pure static (mirroring how `EvaluateEquipmentWrite` was extracted at `OtOpcUaNodeManager.cs:637`) — e.g. `ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) → bool` (+ the revert arguments). Unit tests (xUnit + Shouldly): fail + node still optimistic → revert; fail + node moved on → no revert; success → no revert; null-value edge cases. - **Gateway:** unit-test the production gateway's `NodeWriteResult → NodeWriteOutcome` mapping (success, failure-with-reason, timeout/faulted Ask → failure) with an Akka.TestKit probe standing in for the `DriverHostActor`; and the `Null`/unwired default → `(false, "writes unavailable")`. - **No bUnit / no Razor** — this is server-side only. - **Live /run gate:** write to a **static** writable tag whose device write **deterministically fails**, and confirm via Client.CLI subscribe that the node **reverts** to its prior value; confirm a **successful** write stays put. The plan picks a reliably-failing write (candidates: a Modbus illegal/out-of-range write on the `:5020` sim; or a Galaxy `ViewOnly` attribute surfaced as a ReadWrite equipment tag so the gateway commit is refused). docker-dev is LOCAL; rebuild central on the branch + redeploy as in prior milestones. ## Out of scope (deferred; per the chosen mechanism) - Bad-quality blip on failure (keep the value but flag `StatusCode = Bad`). - OPC UA `AuditWriteUpdateEvent` on failure. - Synchronous structural fail-fast (rejecting not-primary / driver-down / unmapped writes in the Write *response* — needs a Lock-readable driver/primary snapshot). These remain independently addable later if richer signalling is wanted. ## Touched code (anticipated) - NEW `src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs` (interface + `NodeWriteOutcome` + `Null…` default). - `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs` — replace the `Action` router with the gateway; capture prior value + the compare-and-revert continuation in `OnEquipmentTagWrite`; add the pure `ShouldRevert` decision. - `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs` — `SetNodeWriteRouter` → `SetNodeWriteGateway`. - `src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs` — wire the production gateway impl instead of the current router lambda; clear it on `StopAsync`. - NEW production gateway impl (Host assembly) — Asks `RouteNodeWrite`, maps to `NodeWriteOutcome`. - Tests in the OpcUaServer test project + the Host/Runtime test project. **No EF/Configuration schema change, no migration.** `DriverHostActor.RouteNodeWrite` / `NodeWriteResult` and the driver write path are unchanged — this design only changes how the *outcome* is consumed on the server side.