docs(design): write-outcome self-correction (surface real device-write status, #5)
v2-ci / build (push) Failing after 42s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
v2-ci / build (push) Failing after 42s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
This commit is contained in:
@@ -0,0 +1,193 @@
|
||||
# Surface real device-write status (write-outcome self-correction) — Design
|
||||
|
||||
**Date:** 2026-06-14
|
||||
**Status:** Approved (brainstorming) — ready for implementation plan
|
||||
**Closes:** `pending.md` follow-up #5 ("surface real device-write status to the client" / the "optimistic-write phantom")
|
||||
|
||||
## Goal
|
||||
|
||||
When an authorized inbound OPC UA operator write to a writable equipment-tag node **fails at the
|
||||
device**, surface that to the client by **reverting the node to its real pre-write value** on the
|
||||
subscription — instead of leaving the optimistic value lingering as a phantom the device never
|
||||
accepted.
|
||||
|
||||
## Background — the optimistic-write phantom
|
||||
|
||||
The inbound write pipeline (shipped in Milestone 1b, `8d8c05f5`):
|
||||
|
||||
1. A client writes a writable equipment-tag node. The OPC UA SDK's `CustomNodeManager2.Write` holds
|
||||
the node-manager `Lock` and invokes the node's `OnWriteValue` handler,
|
||||
`OtOpcUaNodeManager.OnEquipmentTagWrite` (`src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs:604`).
|
||||
2. `EvaluateEquipmentWrite` (`OtOpcUaNodeManager.cs:637`) gates on the `WriteOperate` data-plane role
|
||||
(fails closed → `BadUserAccessDenied`), then **fire-and-forgets** the dispatch via the
|
||||
`Action<string,object?> NodeWriteRouter` (`OtOpcUaNodeManager.cs:100`) and returns
|
||||
`ServiceResult.Good`. The SDK then applies the client value to the node optimistically.
|
||||
3. The router lambda (`src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs:145`)
|
||||
Asks the local `DriverHostActor` a bounded `RouteNodeWrite` (10 s) →
|
||||
`NodeWriteResult(Success, Reason)` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:130,136`),
|
||||
and in its `ContinueWith` **only logs** failures.
|
||||
|
||||
**Why fire-and-forget is mandatory:** `OnWriteValue` runs under the node-manager `Lock`. Blocking on
|
||||
the device round-trip there would freeze every address-space operation (reads, subscription
|
||||
notifications, the publish path) for up to the Ask timeout. So the synchronous Write *response*
|
||||
cannot carry the device status — OPC UA writes are optimistic by protocol.
|
||||
|
||||
**The phantom:** a device-rejected write still returned `Good`, and the SDK kept the client value on
|
||||
the node. For a **polling** tag the next poll republishes the real register value (self-corrects, so
|
||||
the phantom is transient). For a **static** attribute that never re-pushes (e.g. many Galaxy
|
||||
attributes), the wrong value **lingers forever** — a value the device never accepted.
|
||||
|
||||
## Constraint that shapes the design
|
||||
|
||||
The real device status is only available **after** the async round-trip, and we can't block under the
|
||||
`Lock`. Therefore the only honest place to surface it is the node's **value/quality on the
|
||||
subscription**, after the fact. And the **real pre-write value** is only knowable at `OnWriteValue`
|
||||
time inside the node manager — once the SDK applies the optimistic value, the prior value is gone. So
|
||||
the node manager must (a) capture the prior value at write time and (b) own the post-round-trip
|
||||
correction. That requires the dispatch to **return the outcome** to the node manager — which the
|
||||
current fire-and-forget `Action<string,object?>` router cannot do.
|
||||
|
||||
## Approved approach — complete the B3 gateway + async self-correct
|
||||
|
||||
Reintroduce the originally-designed (B3) write gateway that returns the outcome, and have the node
|
||||
manager own a compare-and-revert correction.
|
||||
|
||||
### Components
|
||||
|
||||
1. **`IOpcUaNodeWriteGateway` (NEW, Commons)** —
|
||||
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs`:
|
||||
```csharp
|
||||
public interface IOpcUaNodeWriteGateway
|
||||
{
|
||||
Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct);
|
||||
}
|
||||
public readonly record struct NodeWriteOutcome(bool Success, string? Reason);
|
||||
```
|
||||
Lives in Commons (Akka-free), the same layer as `IOpcUaAddressSpaceSink`. A `NullOpcUaNodeWriteGateway`
|
||||
default returns `NodeWriteOutcome(false, "writes unavailable")` (matching today's
|
||||
no-router-wired `BadNotWritable`).
|
||||
|
||||
2. **Production impl (Host)** — Asks the local `DriverHostActor` `RouteNodeWrite` (the existing
|
||||
bounded 10 s Ask) and maps `NodeWriteResult` → `NodeWriteOutcome`. This is the current router
|
||||
lambda repackaged to *return* the outcome (and resolve the actor lazily per write, as today). Lives
|
||||
in the Akka-aware Host assembly; the OpcUaServer assembly stays Akka-free.
|
||||
|
||||
3. **Node-manager seam swap** — replace the `Action<string,object?> NodeWriteRouter`
|
||||
(`OtOpcUaNodeManager.cs:100`) and `OtOpcUaSdkServer.SetNodeWriteRouter` (`OtOpcUaSdkServer.cs:52`)
|
||||
with a settable `IOpcUaNodeWriteGateway` (volatile field set by the host at `StartAsync`, exactly
|
||||
the same lifecycle as the current router — null ⇒ writes unavailable). The host wires the
|
||||
production gateway in place of the current `SetNodeWriteRouter(...)` lambda
|
||||
(`OtOpcUaServerHostedService.cs:145`).
|
||||
|
||||
4. **`OnEquipmentTagWrite` change** (`OtOpcUaNodeManager.cs:604`): after the `WriteOperate` gate
|
||||
passes, **capture the prior value** (`((BaseDataVariableState)node).Value` + its `StatusCode` —
|
||||
the SDK hasn't applied the optimistic value yet at handler entry), then kick off
|
||||
`gateway.WriteAsync(nodeKey, writtenValue, ct)` **fire-and-forget** (still non-blocking under the
|
||||
Lock) with a continuation that performs the correction on failure. Return `ServiceResult.Good` as
|
||||
today.
|
||||
|
||||
### The correction — compare-and-revert
|
||||
|
||||
In the continuation (runs on a thread-pool thread *after* the original Write returned, so the `Lock`
|
||||
is free to re-acquire briefly):
|
||||
|
||||
- If `outcome.Success` → **do nothing** (the poll path keeps polling tags fresh, exactly as today).
|
||||
- If `!outcome.Success` → **compare-and-revert**: only revert if the node **still holds the optimistic
|
||||
value** (`currentValue` equals the value we wrote). This avoids clobbering a fresh poll that already
|
||||
corrected a polling tag between the optimistic write and the failure. When it still holds the
|
||||
optimistic value, revert the node to the captured **prior value** (with its prior `StatusCode`,
|
||||
fresh `now` timestamp) — for a failed write the device value is unchanged, so the prior value *is*
|
||||
the ground truth (no device re-read needed). Keep the existing failure log line.
|
||||
|
||||
Net effect: a device-rejected write to a **static** attribute bounces the value back on the
|
||||
subscription (phantom gone); **polling** tags behave exactly as today.
|
||||
|
||||
### Where the revert is applied
|
||||
|
||||
`OnEquipmentTagWrite` is inside the node manager, which owns the nodes and the `Lock`. The continuation
|
||||
applies the revert directly (the same `lock (Lock) { variable.Value = …; variable.StatusCode = …;
|
||||
variable.ClearChangeMasks(…); }` shape the poll/`WriteValue` path already uses), so no new sink call is
|
||||
strictly required — though the revert MAY be funnelled through the node manager's existing internal
|
||||
value-update helper for consistency. (Implementation plan decides the exact internal call; the
|
||||
behaviour is fixed here.)
|
||||
|
||||
## Data flow
|
||||
|
||||
```
|
||||
client Write ─▶ SDK (holds Lock) ─▶ OnEquipmentTagWrite
|
||||
│ │ WriteOperate gate (fail-closed)
|
||||
│ │ capture priorValue + priorStatus
|
||||
│ │ gateway.WriteAsync(nodeId, value, ct) ── fire-and-forget
|
||||
│ └─ return Good (SDK applies optimistic value)
|
||||
▼
|
||||
(later, off-Lock continuation)
|
||||
outcome = await WriteAsync ── via DriverHostActor.RouteNodeWrite → NodeWriteResult
|
||||
success? ── yes ▶ nothing (poll keeps it fresh)
|
||||
└ no ▶ if node still == optimistic value: revert to priorValue (+ log)
|
||||
```
|
||||
|
||||
## Edge cases & decisions
|
||||
|
||||
- **Compare-and-revert race:** between optimistic-apply and the failure, a poll may overwrite the
|
||||
node. We revert *only* if the current node value still equals the optimistic value. Best-effort
|
||||
value equality (`Equals` on the boxed values); a missed revert just means a polling tag self-corrects
|
||||
on its own next poll anyway — the only case that *needs* the revert is the static attribute, which
|
||||
never gets overwritten, so the equality check holds there.
|
||||
- **Redundancy / not-primary:** `RouteNodeWrite` already primary-gates (returns `NodeWriteResult(false,
|
||||
"not primary")`). A secondary doesn't serve client writes (clients connect to the primary), and the
|
||||
correction is local to the node that received the write. A "not primary" outcome is just another
|
||||
failure → compare-and-revert on the local node. No special handling.
|
||||
- **Timeout / actor unreachable:** the gateway maps a faulted/timed-out Ask to
|
||||
`NodeWriteOutcome(false, "write timeout"|…)` → treated as failure → revert. (A genuine device write
|
||||
that succeeded but whose ack was lost would be wrongly reverted, then re-corrected by the next poll
|
||||
for polling tags; for static attributes this is an acceptable, rare, conservative bias toward
|
||||
showing the last-known device value.)
|
||||
- **No double-apply:** the correction never *writes to the device* — it only updates the local node.
|
||||
So it can't loop.
|
||||
- **`writtenValue` capture:** the existing handler already captures `value` into a local for the
|
||||
closure (`OtOpcUaNodeManager.cs:612`); the prior value is a second capture taken before returning.
|
||||
|
||||
## Testing strategy
|
||||
|
||||
- **Pure decision extraction:** extract the correction decision as a pure static (mirroring how
|
||||
`EvaluateEquipmentWrite` was extracted at `OtOpcUaNodeManager.cs:637`) — e.g.
|
||||
`ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) → bool`
|
||||
(+ the revert arguments). Unit tests (xUnit + Shouldly): fail + node still optimistic → revert;
|
||||
fail + node moved on → no revert; success → no revert; null-value edge cases.
|
||||
- **Gateway:** unit-test the production gateway's `NodeWriteResult → NodeWriteOutcome` mapping
|
||||
(success, failure-with-reason, timeout/faulted Ask → failure) with an Akka.TestKit probe standing
|
||||
in for the `DriverHostActor`; and the `Null`/unwired default → `(false, "writes unavailable")`.
|
||||
- **No bUnit / no Razor** — this is server-side only.
|
||||
- **Live /run gate:** write to a **static** writable tag whose device write **deterministically
|
||||
fails**, and confirm via Client.CLI subscribe that the node **reverts** to its prior value; confirm a
|
||||
**successful** write stays put. The plan picks a reliably-failing write (candidates: a Modbus
|
||||
illegal/out-of-range write on the `:5020` sim; or a Galaxy `ViewOnly` attribute surfaced as a
|
||||
ReadWrite equipment tag so the gateway commit is refused). docker-dev is LOCAL; rebuild central on
|
||||
the branch + redeploy as in prior milestones.
|
||||
|
||||
## Out of scope (deferred; per the chosen mechanism)
|
||||
|
||||
- Bad-quality blip on failure (keep the value but flag `StatusCode = Bad`).
|
||||
- OPC UA `AuditWriteUpdateEvent` on failure.
|
||||
- Synchronous structural fail-fast (rejecting not-primary / driver-down / unmapped writes in the Write
|
||||
*response* — needs a Lock-readable driver/primary snapshot).
|
||||
|
||||
These remain independently addable later if richer signalling is wanted.
|
||||
|
||||
## Touched code (anticipated)
|
||||
|
||||
- NEW `src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs` (interface +
|
||||
`NodeWriteOutcome` + `Null…` default).
|
||||
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs` — replace the `Action` router with
|
||||
the gateway; capture prior value + the compare-and-revert continuation in `OnEquipmentTagWrite`;
|
||||
add the pure `ShouldRevert` decision.
|
||||
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs` — `SetNodeWriteRouter` →
|
||||
`SetNodeWriteGateway`.
|
||||
- `src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs` — wire the production
|
||||
gateway impl instead of the current router lambda; clear it on `StopAsync`.
|
||||
- NEW production gateway impl (Host assembly) — Asks `RouteNodeWrite`, maps to `NodeWriteOutcome`.
|
||||
- Tests in the OpcUaServer test project + the Host/Runtime test project.
|
||||
|
||||
**No EF/Configuration schema change, no migration.** `DriverHostActor.RouteNodeWrite` /
|
||||
`NodeWriteResult` and the driver write path are unchanged — this design only changes how the *outcome*
|
||||
is consumed on the server side.
|
||||
Reference in New Issue
Block a user