diff --git a/docs/plans/2026-06-14-write-outcome-self-correction-design.md b/docs/plans/2026-06-14-write-outcome-self-correction-design.md new file mode 100644 index 00000000..42584bfd --- /dev/null +++ b/docs/plans/2026-06-14-write-outcome-self-correction-design.md @@ -0,0 +1,193 @@ +# Surface real device-write status (write-outcome self-correction) — Design + +**Date:** 2026-06-14 +**Status:** Approved (brainstorming) — ready for implementation plan +**Closes:** `pending.md` follow-up #5 ("surface real device-write status to the client" / the "optimistic-write phantom") + +## Goal + +When an authorized inbound OPC UA operator write to a writable equipment-tag node **fails at the +device**, surface that to the client by **reverting the node to its real pre-write value** on the +subscription — instead of leaving the optimistic value lingering as a phantom the device never +accepted. + +## Background — the optimistic-write phantom + +The inbound write pipeline (shipped in Milestone 1b, `8d8c05f5`): + +1. A client writes a writable equipment-tag node. The OPC UA SDK's `CustomNodeManager2.Write` holds + the node-manager `Lock` and invokes the node's `OnWriteValue` handler, + `OtOpcUaNodeManager.OnEquipmentTagWrite` (`src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs:604`). +2. `EvaluateEquipmentWrite` (`OtOpcUaNodeManager.cs:637`) gates on the `WriteOperate` data-plane role + (fails closed → `BadUserAccessDenied`), then **fire-and-forgets** the dispatch via the + `Action NodeWriteRouter` (`OtOpcUaNodeManager.cs:100`) and returns + `ServiceResult.Good`. The SDK then applies the client value to the node optimistically. +3. The router lambda (`src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs:145`) + Asks the local `DriverHostActor` a bounded `RouteNodeWrite` (10 s) → + `NodeWriteResult(Success, Reason)` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:130,136`), + and in its `ContinueWith` **only logs** failures. + +**Why fire-and-forget is mandatory:** `OnWriteValue` runs under the node-manager `Lock`. Blocking on +the device round-trip there would freeze every address-space operation (reads, subscription +notifications, the publish path) for up to the Ask timeout. So the synchronous Write *response* +cannot carry the device status — OPC UA writes are optimistic by protocol. + +**The phantom:** a device-rejected write still returned `Good`, and the SDK kept the client value on +the node. For a **polling** tag the next poll republishes the real register value (self-corrects, so +the phantom is transient). For a **static** attribute that never re-pushes (e.g. many Galaxy +attributes), the wrong value **lingers forever** — a value the device never accepted. + +## Constraint that shapes the design + +The real device status is only available **after** the async round-trip, and we can't block under the +`Lock`. Therefore the only honest place to surface it is the node's **value/quality on the +subscription**, after the fact. And the **real pre-write value** is only knowable at `OnWriteValue` +time inside the node manager — once the SDK applies the optimistic value, the prior value is gone. So +the node manager must (a) capture the prior value at write time and (b) own the post-round-trip +correction. That requires the dispatch to **return the outcome** to the node manager — which the +current fire-and-forget `Action` router cannot do. + +## Approved approach — complete the B3 gateway + async self-correct + +Reintroduce the originally-designed (B3) write gateway that returns the outcome, and have the node +manager own a compare-and-revert correction. + +### Components + +1. **`IOpcUaNodeWriteGateway` (NEW, Commons)** — + `src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs`: + ```csharp + public interface IOpcUaNodeWriteGateway + { + Task WriteAsync(string nodeId, object? value, CancellationToken ct); + } + public readonly record struct NodeWriteOutcome(bool Success, string? Reason); + ``` + Lives in Commons (Akka-free), the same layer as `IOpcUaAddressSpaceSink`. A `NullOpcUaNodeWriteGateway` + default returns `NodeWriteOutcome(false, "writes unavailable")` (matching today's + no-router-wired `BadNotWritable`). + +2. **Production impl (Host)** — Asks the local `DriverHostActor` `RouteNodeWrite` (the existing + bounded 10 s Ask) and maps `NodeWriteResult` → `NodeWriteOutcome`. This is the current router + lambda repackaged to *return* the outcome (and resolve the actor lazily per write, as today). Lives + in the Akka-aware Host assembly; the OpcUaServer assembly stays Akka-free. + +3. **Node-manager seam swap** — replace the `Action NodeWriteRouter` + (`OtOpcUaNodeManager.cs:100`) and `OtOpcUaSdkServer.SetNodeWriteRouter` (`OtOpcUaSdkServer.cs:52`) + with a settable `IOpcUaNodeWriteGateway` (volatile field set by the host at `StartAsync`, exactly + the same lifecycle as the current router — null ⇒ writes unavailable). The host wires the + production gateway in place of the current `SetNodeWriteRouter(...)` lambda + (`OtOpcUaServerHostedService.cs:145`). + +4. **`OnEquipmentTagWrite` change** (`OtOpcUaNodeManager.cs:604`): after the `WriteOperate` gate + passes, **capture the prior value** (`((BaseDataVariableState)node).Value` + its `StatusCode` — + the SDK hasn't applied the optimistic value yet at handler entry), then kick off + `gateway.WriteAsync(nodeKey, writtenValue, ct)` **fire-and-forget** (still non-blocking under the + Lock) with a continuation that performs the correction on failure. Return `ServiceResult.Good` as + today. + +### The correction — compare-and-revert + +In the continuation (runs on a thread-pool thread *after* the original Write returned, so the `Lock` +is free to re-acquire briefly): + +- If `outcome.Success` → **do nothing** (the poll path keeps polling tags fresh, exactly as today). +- If `!outcome.Success` → **compare-and-revert**: only revert if the node **still holds the optimistic + value** (`currentValue` equals the value we wrote). This avoids clobbering a fresh poll that already + corrected a polling tag between the optimistic write and the failure. When it still holds the + optimistic value, revert the node to the captured **prior value** (with its prior `StatusCode`, + fresh `now` timestamp) — for a failed write the device value is unchanged, so the prior value *is* + the ground truth (no device re-read needed). Keep the existing failure log line. + +Net effect: a device-rejected write to a **static** attribute bounces the value back on the +subscription (phantom gone); **polling** tags behave exactly as today. + +### Where the revert is applied + +`OnEquipmentTagWrite` is inside the node manager, which owns the nodes and the `Lock`. The continuation +applies the revert directly (the same `lock (Lock) { variable.Value = …; variable.StatusCode = …; +variable.ClearChangeMasks(…); }` shape the poll/`WriteValue` path already uses), so no new sink call is +strictly required — though the revert MAY be funnelled through the node manager's existing internal +value-update helper for consistency. (Implementation plan decides the exact internal call; the +behaviour is fixed here.) + +## Data flow + +``` +client Write ─▶ SDK (holds Lock) ─▶ OnEquipmentTagWrite + │ │ WriteOperate gate (fail-closed) + │ │ capture priorValue + priorStatus + │ │ gateway.WriteAsync(nodeId, value, ct) ── fire-and-forget + │ └─ return Good (SDK applies optimistic value) + ▼ + (later, off-Lock continuation) + outcome = await WriteAsync ── via DriverHostActor.RouteNodeWrite → NodeWriteResult + success? ── yes ▶ nothing (poll keeps it fresh) + └ no ▶ if node still == optimistic value: revert to priorValue (+ log) +``` + +## Edge cases & decisions + +- **Compare-and-revert race:** between optimistic-apply and the failure, a poll may overwrite the + node. We revert *only* if the current node value still equals the optimistic value. Best-effort + value equality (`Equals` on the boxed values); a missed revert just means a polling tag self-corrects + on its own next poll anyway — the only case that *needs* the revert is the static attribute, which + never gets overwritten, so the equality check holds there. +- **Redundancy / not-primary:** `RouteNodeWrite` already primary-gates (returns `NodeWriteResult(false, + "not primary")`). A secondary doesn't serve client writes (clients connect to the primary), and the + correction is local to the node that received the write. A "not primary" outcome is just another + failure → compare-and-revert on the local node. No special handling. +- **Timeout / actor unreachable:** the gateway maps a faulted/timed-out Ask to + `NodeWriteOutcome(false, "write timeout"|…)` → treated as failure → revert. (A genuine device write + that succeeded but whose ack was lost would be wrongly reverted, then re-corrected by the next poll + for polling tags; for static attributes this is an acceptable, rare, conservative bias toward + showing the last-known device value.) +- **No double-apply:** the correction never *writes to the device* — it only updates the local node. + So it can't loop. +- **`writtenValue` capture:** the existing handler already captures `value` into a local for the + closure (`OtOpcUaNodeManager.cs:612`); the prior value is a second capture taken before returning. + +## Testing strategy + +- **Pure decision extraction:** extract the correction decision as a pure static (mirroring how + `EvaluateEquipmentWrite` was extracted at `OtOpcUaNodeManager.cs:637`) — e.g. + `ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) → bool` + (+ the revert arguments). Unit tests (xUnit + Shouldly): fail + node still optimistic → revert; + fail + node moved on → no revert; success → no revert; null-value edge cases. +- **Gateway:** unit-test the production gateway's `NodeWriteResult → NodeWriteOutcome` mapping + (success, failure-with-reason, timeout/faulted Ask → failure) with an Akka.TestKit probe standing + in for the `DriverHostActor`; and the `Null`/unwired default → `(false, "writes unavailable")`. +- **No bUnit / no Razor** — this is server-side only. +- **Live /run gate:** write to a **static** writable tag whose device write **deterministically + fails**, and confirm via Client.CLI subscribe that the node **reverts** to its prior value; confirm a + **successful** write stays put. The plan picks a reliably-failing write (candidates: a Modbus + illegal/out-of-range write on the `:5020` sim; or a Galaxy `ViewOnly` attribute surfaced as a + ReadWrite equipment tag so the gateway commit is refused). docker-dev is LOCAL; rebuild central on + the branch + redeploy as in prior milestones. + +## Out of scope (deferred; per the chosen mechanism) + +- Bad-quality blip on failure (keep the value but flag `StatusCode = Bad`). +- OPC UA `AuditWriteUpdateEvent` on failure. +- Synchronous structural fail-fast (rejecting not-primary / driver-down / unmapped writes in the Write + *response* — needs a Lock-readable driver/primary snapshot). + +These remain independently addable later if richer signalling is wanted. + +## Touched code (anticipated) + +- NEW `src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs` (interface + + `NodeWriteOutcome` + `Null…` default). +- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs` — replace the `Action` router with + the gateway; capture prior value + the compare-and-revert continuation in `OnEquipmentTagWrite`; + add the pure `ShouldRevert` decision. +- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs` — `SetNodeWriteRouter` → + `SetNodeWriteGateway`. +- `src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs` — wire the production + gateway impl instead of the current router lambda; clear it on `StopAsync`. +- NEW production gateway impl (Host assembly) — Asks `RouteNodeWrite`, maps to `NodeWriteOutcome`. +- Tests in the OpcUaServer test project + the Host/Runtime test project. + +**No EF/Configuration schema change, no migration.** `DriverHostActor.RouteNodeWrite` / +`NodeWriteResult` and the driver write path are unchanged — this design only changes how the *outcome* +is consumed on the server side.