8a79cb0ec7
v2-ci / build (push) Failing after 42s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
194 lines
12 KiB
Markdown
194 lines
12 KiB
Markdown
# Surface real device-write status (write-outcome self-correction) — Design
|
|
|
|
**Date:** 2026-06-14
|
|
**Status:** Approved (brainstorming) — ready for implementation plan
|
|
**Closes:** `pending.md` follow-up #5 ("surface real device-write status to the client" / the "optimistic-write phantom")
|
|
|
|
## Goal
|
|
|
|
When an authorized inbound OPC UA operator write to a writable equipment-tag node **fails at the
|
|
device**, surface that to the client by **reverting the node to its real pre-write value** on the
|
|
subscription — instead of leaving the optimistic value lingering as a phantom the device never
|
|
accepted.
|
|
|
|
## Background — the optimistic-write phantom
|
|
|
|
The inbound write pipeline (shipped in Milestone 1b, `8d8c05f5`):
|
|
|
|
1. A client writes a writable equipment-tag node. The OPC UA SDK's `CustomNodeManager2.Write` holds
|
|
the node-manager `Lock` and invokes the node's `OnWriteValue` handler,
|
|
`OtOpcUaNodeManager.OnEquipmentTagWrite` (`src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs:604`).
|
|
2. `EvaluateEquipmentWrite` (`OtOpcUaNodeManager.cs:637`) gates on the `WriteOperate` data-plane role
|
|
(fails closed → `BadUserAccessDenied`), then **fire-and-forgets** the dispatch via the
|
|
`Action<string,object?> NodeWriteRouter` (`OtOpcUaNodeManager.cs:100`) and returns
|
|
`ServiceResult.Good`. The SDK then applies the client value to the node optimistically.
|
|
3. The router lambda (`src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs:145`)
|
|
Asks the local `DriverHostActor` a bounded `RouteNodeWrite` (10 s) →
|
|
`NodeWriteResult(Success, Reason)` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs:130,136`),
|
|
and in its `ContinueWith` **only logs** failures.
|
|
|
|
**Why fire-and-forget is mandatory:** `OnWriteValue` runs under the node-manager `Lock`. Blocking on
|
|
the device round-trip there would freeze every address-space operation (reads, subscription
|
|
notifications, the publish path) for up to the Ask timeout. So the synchronous Write *response*
|
|
cannot carry the device status — OPC UA writes are optimistic by protocol.
|
|
|
|
**The phantom:** a device-rejected write still returned `Good`, and the SDK kept the client value on
|
|
the node. For a **polling** tag the next poll republishes the real register value (self-corrects, so
|
|
the phantom is transient). For a **static** attribute that never re-pushes (e.g. many Galaxy
|
|
attributes), the wrong value **lingers forever** — a value the device never accepted.
|
|
|
|
## Constraint that shapes the design
|
|
|
|
The real device status is only available **after** the async round-trip, and we can't block under the
|
|
`Lock`. Therefore the only honest place to surface it is the node's **value/quality on the
|
|
subscription**, after the fact. And the **real pre-write value** is only knowable at `OnWriteValue`
|
|
time inside the node manager — once the SDK applies the optimistic value, the prior value is gone. So
|
|
the node manager must (a) capture the prior value at write time and (b) own the post-round-trip
|
|
correction. That requires the dispatch to **return the outcome** to the node manager — which the
|
|
current fire-and-forget `Action<string,object?>` router cannot do.
|
|
|
|
## Approved approach — complete the B3 gateway + async self-correct
|
|
|
|
Reintroduce the originally-designed (B3) write gateway that returns the outcome, and have the node
|
|
manager own a compare-and-revert correction.
|
|
|
|
### Components
|
|
|
|
1. **`IOpcUaNodeWriteGateway` (NEW, Commons)** —
|
|
`src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs`:
|
|
```csharp
|
|
public interface IOpcUaNodeWriteGateway
|
|
{
|
|
Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct);
|
|
}
|
|
public readonly record struct NodeWriteOutcome(bool Success, string? Reason);
|
|
```
|
|
Lives in Commons (Akka-free), the same layer as `IOpcUaAddressSpaceSink`. A `NullOpcUaNodeWriteGateway`
|
|
default returns `NodeWriteOutcome(false, "writes unavailable")` (matching today's
|
|
no-router-wired `BadNotWritable`).
|
|
|
|
2. **Production impl (Host)** — Asks the local `DriverHostActor` `RouteNodeWrite` (the existing
|
|
bounded 10 s Ask) and maps `NodeWriteResult` → `NodeWriteOutcome`. This is the current router
|
|
lambda repackaged to *return* the outcome (and resolve the actor lazily per write, as today). Lives
|
|
in the Akka-aware Host assembly; the OpcUaServer assembly stays Akka-free.
|
|
|
|
3. **Node-manager seam swap** — replace the `Action<string,object?> NodeWriteRouter`
|
|
(`OtOpcUaNodeManager.cs:100`) and `OtOpcUaSdkServer.SetNodeWriteRouter` (`OtOpcUaSdkServer.cs:52`)
|
|
with a settable `IOpcUaNodeWriteGateway` (volatile field set by the host at `StartAsync`, exactly
|
|
the same lifecycle as the current router — null ⇒ writes unavailable). The host wires the
|
|
production gateway in place of the current `SetNodeWriteRouter(...)` lambda
|
|
(`OtOpcUaServerHostedService.cs:145`).
|
|
|
|
4. **`OnEquipmentTagWrite` change** (`OtOpcUaNodeManager.cs:604`): after the `WriteOperate` gate
|
|
passes, **capture the prior value** (`((BaseDataVariableState)node).Value` + its `StatusCode` —
|
|
the SDK hasn't applied the optimistic value yet at handler entry), then kick off
|
|
`gateway.WriteAsync(nodeKey, writtenValue, ct)` **fire-and-forget** (still non-blocking under the
|
|
Lock) with a continuation that performs the correction on failure. Return `ServiceResult.Good` as
|
|
today.
|
|
|
|
### The correction — compare-and-revert
|
|
|
|
In the continuation (runs on a thread-pool thread *after* the original Write returned, so the `Lock`
|
|
is free to re-acquire briefly):
|
|
|
|
- If `outcome.Success` → **do nothing** (the poll path keeps polling tags fresh, exactly as today).
|
|
- If `!outcome.Success` → **compare-and-revert**: only revert if the node **still holds the optimistic
|
|
value** (`currentValue` equals the value we wrote). This avoids clobbering a fresh poll that already
|
|
corrected a polling tag between the optimistic write and the failure. When it still holds the
|
|
optimistic value, revert the node to the captured **prior value** (with its prior `StatusCode`,
|
|
fresh `now` timestamp) — for a failed write the device value is unchanged, so the prior value *is*
|
|
the ground truth (no device re-read needed). Keep the existing failure log line.
|
|
|
|
Net effect: a device-rejected write to a **static** attribute bounces the value back on the
|
|
subscription (phantom gone); **polling** tags behave exactly as today.
|
|
|
|
### Where the revert is applied
|
|
|
|
`OnEquipmentTagWrite` is inside the node manager, which owns the nodes and the `Lock`. The continuation
|
|
applies the revert directly (the same `lock (Lock) { variable.Value = …; variable.StatusCode = …;
|
|
variable.ClearChangeMasks(…); }` shape the poll/`WriteValue` path already uses), so no new sink call is
|
|
strictly required — though the revert MAY be funnelled through the node manager's existing internal
|
|
value-update helper for consistency. (Implementation plan decides the exact internal call; the
|
|
behaviour is fixed here.)
|
|
|
|
## Data flow
|
|
|
|
```
|
|
client Write ─▶ SDK (holds Lock) ─▶ OnEquipmentTagWrite
|
|
│ │ WriteOperate gate (fail-closed)
|
|
│ │ capture priorValue + priorStatus
|
|
│ │ gateway.WriteAsync(nodeId, value, ct) ── fire-and-forget
|
|
│ └─ return Good (SDK applies optimistic value)
|
|
▼
|
|
(later, off-Lock continuation)
|
|
outcome = await WriteAsync ── via DriverHostActor.RouteNodeWrite → NodeWriteResult
|
|
success? ── yes ▶ nothing (poll keeps it fresh)
|
|
└ no ▶ if node still == optimistic value: revert to priorValue (+ log)
|
|
```
|
|
|
|
## Edge cases & decisions
|
|
|
|
- **Compare-and-revert race:** between optimistic-apply and the failure, a poll may overwrite the
|
|
node. We revert *only* if the current node value still equals the optimistic value. Best-effort
|
|
value equality (`Equals` on the boxed values); a missed revert just means a polling tag self-corrects
|
|
on its own next poll anyway — the only case that *needs* the revert is the static attribute, which
|
|
never gets overwritten, so the equality check holds there.
|
|
- **Redundancy / not-primary:** `RouteNodeWrite` already primary-gates (returns `NodeWriteResult(false,
|
|
"not primary")`). A secondary doesn't serve client writes (clients connect to the primary), and the
|
|
correction is local to the node that received the write. A "not primary" outcome is just another
|
|
failure → compare-and-revert on the local node. No special handling.
|
|
- **Timeout / actor unreachable:** the gateway maps a faulted/timed-out Ask to
|
|
`NodeWriteOutcome(false, "write timeout"|…)` → treated as failure → revert. (A genuine device write
|
|
that succeeded but whose ack was lost would be wrongly reverted, then re-corrected by the next poll
|
|
for polling tags; for static attributes this is an acceptable, rare, conservative bias toward
|
|
showing the last-known device value.)
|
|
- **No double-apply:** the correction never *writes to the device* — it only updates the local node.
|
|
So it can't loop.
|
|
- **`writtenValue` capture:** the existing handler already captures `value` into a local for the
|
|
closure (`OtOpcUaNodeManager.cs:612`); the prior value is a second capture taken before returning.
|
|
|
|
## Testing strategy
|
|
|
|
- **Pure decision extraction:** extract the correction decision as a pure static (mirroring how
|
|
`EvaluateEquipmentWrite` was extracted at `OtOpcUaNodeManager.cs:637`) — e.g.
|
|
`ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) → bool`
|
|
(+ the revert arguments). Unit tests (xUnit + Shouldly): fail + node still optimistic → revert;
|
|
fail + node moved on → no revert; success → no revert; null-value edge cases.
|
|
- **Gateway:** unit-test the production gateway's `NodeWriteResult → NodeWriteOutcome` mapping
|
|
(success, failure-with-reason, timeout/faulted Ask → failure) with an Akka.TestKit probe standing
|
|
in for the `DriverHostActor`; and the `Null`/unwired default → `(false, "writes unavailable")`.
|
|
- **No bUnit / no Razor** — this is server-side only.
|
|
- **Live /run gate:** write to a **static** writable tag whose device write **deterministically
|
|
fails**, and confirm via Client.CLI subscribe that the node **reverts** to its prior value; confirm a
|
|
**successful** write stays put. The plan picks a reliably-failing write (candidates: a Modbus
|
|
illegal/out-of-range write on the `:5020` sim; or a Galaxy `ViewOnly` attribute surfaced as a
|
|
ReadWrite equipment tag so the gateway commit is refused). docker-dev is LOCAL; rebuild central on
|
|
the branch + redeploy as in prior milestones.
|
|
|
|
## Out of scope (deferred; per the chosen mechanism)
|
|
|
|
- Bad-quality blip on failure (keep the value but flag `StatusCode = Bad`).
|
|
- OPC UA `AuditWriteUpdateEvent` on failure.
|
|
- Synchronous structural fail-fast (rejecting not-primary / driver-down / unmapped writes in the Write
|
|
*response* — needs a Lock-readable driver/primary snapshot).
|
|
|
|
These remain independently addable later if richer signalling is wanted.
|
|
|
|
## Touched code (anticipated)
|
|
|
|
- NEW `src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs` (interface +
|
|
`NodeWriteOutcome` + `Null…` default).
|
|
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs` — replace the `Action` router with
|
|
the gateway; capture prior value + the compare-and-revert continuation in `OnEquipmentTagWrite`;
|
|
add the pure `ShouldRevert` decision.
|
|
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs` — `SetNodeWriteRouter` →
|
|
`SetNodeWriteGateway`.
|
|
- `src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs` — wire the production
|
|
gateway impl instead of the current router lambda; clear it on `StopAsync`.
|
|
- NEW production gateway impl (Host assembly) — Asks `RouteNodeWrite`, maps to `NodeWriteOutcome`.
|
|
- Tests in the OpcUaServer test project + the Host/Runtime test project.
|
|
|
|
**No EF/Configuration schema change, no migration.** `DriverHostActor.RouteNodeWrite` /
|
|
`NodeWriteResult` and the driver write path are unchanged — this design only changes how the *outcome*
|
|
is consumed on the server side.
|