docs(design): write-outcome self-correction (surface real device-write status, #5)
v2-ci / build (push) Failing after 42s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped

This commit is contained in:
Joseph Doherty
2026-06-14 01:13:38 -04:00
parent 945c238039
commit 8a79cb0ec7
@@ -0,0 +1,193 @@
# Surface real device-write status (write-outcome self-correction) — Design
**Date:** 2026-06-14
**Status:** Approved (brainstorming) — ready for implementation plan
**Closes:** `pending.md` follow-up #5 ("surface real device-write status to the client" / the "optimistic-write phantom")
## Goal
When an authorized inbound OPC UA operator write to a writable equipment-tag node **fails at the
device**, surface that to the client by **reverting the node to its real pre-write value** on the
subscription — instead of leaving the optimistic value lingering as a phantom the device never
accepted.
## Background — the optimistic-write phantom
The inbound write pipeline (shipped in Milestone 1b, `8d8c05f5`):
1. A client writes a writable equipment-tag node. The OPC UA SDK's `CustomNodeManager2.Write` holds
the node-manager `Lock` and invokes the node's `OnWriteValue` handler,
`OtOpcUaNodeManager.OnEquipmentTagWrite` (`src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs:604`).
2. `EvaluateEquipmentWrite` (`OtOpcUaNodeManager.cs:637`) gates on the `WriteOperate` data-plane role
(fails closed → `BadUserAccessDenied`), then **fire-and-forgets** the dispatch via the
`Action<string,object?> NodeWriteRouter` (`OtOpcUaNodeManager.cs:100`) and returns
`ServiceResult.Good`. The SDK then applies the client value to the node optimistically.
3. The router lambda (`src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs:145`)
Asks the local `DriverHostActor` a bounded `RouteNodeWrite` (10 s) →
`NodeWriteResult(Success, Reason)` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:130,136`),
and in its `ContinueWith` **only logs** failures.
**Why fire-and-forget is mandatory:** `OnWriteValue` runs under the node-manager `Lock`. Blocking on
the device round-trip there would freeze every address-space operation (reads, subscription
notifications, the publish path) for up to the Ask timeout. So the synchronous Write *response*
cannot carry the device status — OPC UA writes are optimistic by protocol.
**The phantom:** a device-rejected write still returned `Good`, and the SDK kept the client value on
the node. For a **polling** tag the next poll republishes the real register value (self-corrects, so
the phantom is transient). For a **static** attribute that never re-pushes (e.g. many Galaxy
attributes), the wrong value **lingers forever** — a value the device never accepted.
## Constraint that shapes the design
The real device status is only available **after** the async round-trip, and we can't block under the
`Lock`. Therefore the only honest place to surface it is the node's **value/quality on the
subscription**, after the fact. And the **real pre-write value** is only knowable at `OnWriteValue`
time inside the node manager — once the SDK applies the optimistic value, the prior value is gone. So
the node manager must (a) capture the prior value at write time and (b) own the post-round-trip
correction. That requires the dispatch to **return the outcome** to the node manager — which the
current fire-and-forget `Action<string,object?>` router cannot do.
## Approved approach — complete the B3 gateway + async self-correct
Reintroduce the originally-designed (B3) write gateway that returns the outcome, and have the node
manager own a compare-and-revert correction.
### Components
1. **`IOpcUaNodeWriteGateway` (NEW, Commons)** —
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs`:
```csharp
public interface IOpcUaNodeWriteGateway
{
Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct);
}
public readonly record struct NodeWriteOutcome(bool Success, string? Reason);
```
Lives in Commons (Akka-free), the same layer as `IOpcUaAddressSpaceSink`. A `NullOpcUaNodeWriteGateway`
default returns `NodeWriteOutcome(false, "writes unavailable")` (matching today's
no-router-wired `BadNotWritable`).
2. **Production impl (Host)** — Asks the local `DriverHostActor` `RouteNodeWrite` (the existing
bounded 10 s Ask) and maps `NodeWriteResult` → `NodeWriteOutcome`. This is the current router
lambda repackaged to *return* the outcome (and resolve the actor lazily per write, as today). Lives
in the Akka-aware Host assembly; the OpcUaServer assembly stays Akka-free.
3. **Node-manager seam swap** — replace the `Action<string,object?> NodeWriteRouter`
(`OtOpcUaNodeManager.cs:100`) and `OtOpcUaSdkServer.SetNodeWriteRouter` (`OtOpcUaSdkServer.cs:52`)
with a settable `IOpcUaNodeWriteGateway` (volatile field set by the host at `StartAsync`, exactly
the same lifecycle as the current router — null ⇒ writes unavailable). The host wires the
production gateway in place of the current `SetNodeWriteRouter(...)` lambda
(`OtOpcUaServerHostedService.cs:145`).
4. **`OnEquipmentTagWrite` change** (`OtOpcUaNodeManager.cs:604`): after the `WriteOperate` gate
passes, **capture the prior value** (`((BaseDataVariableState)node).Value` + its `StatusCode` —
the SDK hasn't applied the optimistic value yet at handler entry), then kick off
`gateway.WriteAsync(nodeKey, writtenValue, ct)` **fire-and-forget** (still non-blocking under the
Lock) with a continuation that performs the correction on failure. Return `ServiceResult.Good` as
today.
### The correction — compare-and-revert
In the continuation (runs on a thread-pool thread *after* the original Write returned, so the `Lock`
is free to re-acquire briefly):
- If `outcome.Success` → **do nothing** (the poll path keeps polling tags fresh, exactly as today).
- If `!outcome.Success` → **compare-and-revert**: only revert if the node **still holds the optimistic
value** (`currentValue` equals the value we wrote). This avoids clobbering a fresh poll that already
corrected a polling tag between the optimistic write and the failure. When it still holds the
optimistic value, revert the node to the captured **prior value** (with its prior `StatusCode`,
fresh `now` timestamp) — for a failed write the device value is unchanged, so the prior value *is*
the ground truth (no device re-read needed). Keep the existing failure log line.
Net effect: a device-rejected write to a **static** attribute bounces the value back on the
subscription (phantom gone); **polling** tags behave exactly as today.
### Where the revert is applied
`OnEquipmentTagWrite` is inside the node manager, which owns the nodes and the `Lock`. The continuation
applies the revert directly (the same `lock (Lock) { variable.Value = …; variable.StatusCode = …;
variable.ClearChangeMasks(…); }` shape the poll/`WriteValue` path already uses), so no new sink call is
strictly required — though the revert MAY be funnelled through the node manager's existing internal
value-update helper for consistency. (Implementation plan decides the exact internal call; the
behaviour is fixed here.)
## Data flow
```
client Write ─▶ SDK (holds Lock) ─▶ OnEquipmentTagWrite
│ │ WriteOperate gate (fail-closed)
│ │ capture priorValue + priorStatus
│ │ gateway.WriteAsync(nodeId, value, ct) ── fire-and-forget
│ └─ return Good (SDK applies optimistic value)
(later, off-Lock continuation)
outcome = await WriteAsync ── via DriverHostActor.RouteNodeWrite → NodeWriteResult
success? ── yes ▶ nothing (poll keeps it fresh)
└ no ▶ if node still == optimistic value: revert to priorValue (+ log)
```
## Edge cases & decisions
- **Compare-and-revert race:** between optimistic-apply and the failure, a poll may overwrite the
node. We revert *only* if the current node value still equals the optimistic value. Best-effort
value equality (`Equals` on the boxed values); a missed revert just means a polling tag self-corrects
on its own next poll anyway — the only case that *needs* the revert is the static attribute, which
never gets overwritten, so the equality check holds there.
- **Redundancy / not-primary:** `RouteNodeWrite` already primary-gates (returns `NodeWriteResult(false,
"not primary")`). A secondary doesn't serve client writes (clients connect to the primary), and the
correction is local to the node that received the write. A "not primary" outcome is just another
failure → compare-and-revert on the local node. No special handling.
- **Timeout / actor unreachable:** the gateway maps a faulted/timed-out Ask to
`NodeWriteOutcome(false, "write timeout"|…)` → treated as failure → revert. (A genuine device write
that succeeded but whose ack was lost would be wrongly reverted, then re-corrected by the next poll
for polling tags; for static attributes this is an acceptable, rare, conservative bias toward
showing the last-known device value.)
- **No double-apply:** the correction never *writes to the device* — it only updates the local node.
So it can't loop.
- **`writtenValue` capture:** the existing handler already captures `value` into a local for the
closure (`OtOpcUaNodeManager.cs:612`); the prior value is a second capture taken before returning.
## Testing strategy
- **Pure decision extraction:** extract the correction decision as a pure static (mirroring how
`EvaluateEquipmentWrite` was extracted at `OtOpcUaNodeManager.cs:637`) — e.g.
`ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) → bool`
(+ the revert arguments). Unit tests (xUnit + Shouldly): fail + node still optimistic → revert;
fail + node moved on → no revert; success → no revert; null-value edge cases.
- **Gateway:** unit-test the production gateway's `NodeWriteResult → NodeWriteOutcome` mapping
(success, failure-with-reason, timeout/faulted Ask → failure) with an Akka.TestKit probe standing
in for the `DriverHostActor`; and the `Null`/unwired default → `(false, "writes unavailable")`.
- **No bUnit / no Razor** — this is server-side only.
- **Live /run gate:** write to a **static** writable tag whose device write **deterministically
fails**, and confirm via Client.CLI subscribe that the node **reverts** to its prior value; confirm a
**successful** write stays put. The plan picks a reliably-failing write (candidates: a Modbus
illegal/out-of-range write on the `:5020` sim; or a Galaxy `ViewOnly` attribute surfaced as a
ReadWrite equipment tag so the gateway commit is refused). docker-dev is LOCAL; rebuild central on
the branch + redeploy as in prior milestones.
## Out of scope (deferred; per the chosen mechanism)
- Bad-quality blip on failure (keep the value but flag `StatusCode = Bad`).
- OPC UA `AuditWriteUpdateEvent` on failure.
- Synchronous structural fail-fast (rejecting not-primary / driver-down / unmapped writes in the Write
*response* — needs a Lock-readable driver/primary snapshot).
These remain independently addable later if richer signalling is wanted.
## Touched code (anticipated)
- NEW `src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs` (interface +
`NodeWriteOutcome` + `Null…` default).
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs` — replace the `Action` router with
the gateway; capture prior value + the compare-and-revert continuation in `OnEquipmentTagWrite`;
add the pure `ShouldRevert` decision.
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs` — `SetNodeWriteRouter` →
`SetNodeWriteGateway`.
- `src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs` — wire the production
gateway impl instead of the current router lambda; clear it on `StopAsync`.
- NEW production gateway impl (Host assembly) — Asks `RouteNodeWrite`, maps to `NodeWriteOutcome`.
- Tests in the OpcUaServer test project + the Host/Runtime test project.
**No EF/Configuration schema change, no migration.** `DriverHostActor.RouteNodeWrite` /
`NodeWriteResult` and the driver write path are unchanged — this design only changes how the *outcome*
is consumed on the server side.