The plan + task list for the write-outcome self-correction work (B1, already
shipped via master 1d797c1c). Its design-doc counterpart is already committed;
this adds the matching plan artifacts, consistent with the other docs/plans/.
22 KiB
Write-Outcome Self-Correction — Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
Goal: When an authorized inbound OPC UA operator write to a writable equipment-tag node fails at the device, revert the node to its real pre-write value on the subscription — killing the "optimistic-write phantom" (pending.md #5). Approved design: docs/plans/2026-06-14-write-outcome-self-correction-design.md (committed master 8a79cb0e).
Architecture: Complete the originally-designed (B3) write gateway. Add IOpcUaNodeWriteGateway { Task<NodeWriteOutcome> WriteAsync(...) } (Commons), an Akka adapter ActorNodeWriteGateway (Runtime) that Asks the existing DriverHostActor.RouteNodeWrite, and swap the node manager's fire-and-forget Action router for the gateway. The node manager captures the prior value at OnWriteValue time and, on a failed async outcome, compare-and-reverts the node (only if it still holds the optimistic value, so a fresh poll isn't clobbered). The dispatch stays non-blocking under the node-manager Lock; the correction runs in the off-Lock continuation.
Tech Stack: C# / .NET 10, OPC UA Foundation stack (CustomNodeManager2), Akka.NET, xUnit + Shouldly + Akka.TestKit. TreatWarningsAsErrors (production projects).
Branch: feat/write-outcome-self-correction off master (8a79cb0e).
Hard rules (every task): stage by path (never git add .); never stage sql_login.txt / src/Server/.../pki/ / pending.md / current.md / docker-dev/docker-compose.yml; never echo secrets; no force-push / no --no-verify; NO Configuration entity / EF migration change; NO bUnit (this is server-side only). Done = build clean + dotnet test green + the live /run gate (T5) passes.
Task DAG
T0 (branch)
└─ T1 Commons: IOpcUaNodeWriteGateway + NodeWriteOutcome + Null default [standard]
├─ T2 Runtime: ActorNodeWriteGateway (Asks RouteNodeWrite) + TestKit [standard] ∥ T3
└─ T3 OpcUaServer: node-manager seam swap + capture-prior + revert [high-risk] ∥ T2
└─ T4 Host: wire ActorNodeWriteGateway; SetNodeWriteGateway/clear [standard] (needs T2+T3)
└─ T5 Live /run: failing write reverts, success stays [verification]
T2 and T3 are both blockedBy T1 and touch different assemblies (Runtime vs OpcUaServer) — Parallelizable with each other. T4 needs both. T5 needs T4.
Task T0: Create feature branch
Classification: trivial Estimated implement time: ~1 min Parallelizable with: none (gates all)
Files: none (git only)
git checkout -b feat/write-outcome-self-correction master
git rev-parse --abbrev-ref HEAD # expect: feat/write-outcome-self-correction
Do NOT touch the working-tree docker-dev/docker-compose.yml / pending.md (leave modified+unstaged).
Task T1: Commons — IOpcUaNodeWriteGateway + NodeWriteOutcome + Null default
Classification: standard Estimated implement time: ~3 min Parallelizable with: none (T2 + T3 depend on it)
Files:
- Create:
src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs - Test: the Commons test project (the one that tests
IOpcUaAddressSpaceSink/EquipmentNodeIds— locate it undertests/Core/, e.g.*.Commons.Tests; confirm exact name)
Context: mirrors IOpcUaAddressSpaceSink.cs (same folder) — a Commons-level (Akka-free) contract with a Null… default. The write path will use this instead of the Action<string,object?> router so the device outcome can flow back to the node manager.
Step 1: Write the interface + outcome + null default
namespace ZB.MOM.WW.OtOpcUa.Commons.OpcUa;
/// <summary>
/// Reverse-path gateway for an inbound OPC UA operator write to a writable equipment-tag node.
/// The node manager calls <see cref="WriteAsync"/> fire-and-forget from its OnWriteValue handler
/// (which runs under the node-manager Lock, so the call MUST return promptly — it kicks off an
/// asynchronous route and the Task completes later) and uses the resolved
/// <see cref="NodeWriteOutcome"/> to self-correct the node on failure.
/// </summary>
public interface IOpcUaNodeWriteGateway
{
/// <summary>Route a write of <paramref name="value"/> to the driver backing node
/// <paramref name="nodeId"/>; the Task resolves with the device-write outcome.</summary>
Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct);
}
/// <summary>Outcome of routing an inbound node write to the backing driver.</summary>
/// <param name="Success">True when the driver accepted the write.</param>
/// <param name="Reason">Failure detail when <paramref name="Success"/> is false; null on success.</param>
public readonly record struct NodeWriteOutcome(bool Success, string? Reason);
/// <summary>No-op gateway: every write resolves to "writes unavailable" (matches the legacy
/// no-router-wired <c>BadNotWritable</c>). Used as the default before the host wires the real one.</summary>
public sealed class NullOpcUaNodeWriteGateway : IOpcUaNodeWriteGateway
{
/// <summary>Singleton instance.</summary>
public static readonly NullOpcUaNodeWriteGateway Instance = new();
private NullOpcUaNodeWriteGateway() { }
/// <inheritdoc />
public Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct) =>
Task.FromResult(new NodeWriteOutcome(false, "writes unavailable"));
}
Step 2: Unit test the Null default
[Fact]
public async Task NullGateway_returns_writes_unavailable()
{
var outcome = await NullOpcUaNodeWriteGateway.Instance.WriteAsync("ns=2;s=x", 1, default);
outcome.Success.ShouldBeFalse();
outcome.Reason.ShouldBe("writes unavailable");
}
Step 3: Build + test + commit
dotnet build src/Core/ZB.MOM.WW.OtOpcUa.Commons/ZB.MOM.WW.OtOpcUa.Commons.csproj
dotnet test <commons-test-project> --filter "FullyQualifiedName~NullGateway"
git add src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs <commons-test-file>
git commit -m "feat(commons): IOpcUaNodeWriteGateway + NodeWriteOutcome for write-outcome routing"
Task T2: Runtime — ActorNodeWriteGateway (Asks RouteNodeWrite)
Classification: standard Estimated implement time: ~5 min Parallelizable with: T3
Files:
- Create:
src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/ActorNodeWriteGateway.cs - Test:
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/ActorNodeWriteGatewayTests.cs(NEW)
Context: this is the current host router lambda (OtOpcUaServerHostedService.cs:145-161) repackaged as a named IOpcUaNodeWriteGateway that returns the outcome. It lives in Runtime (Akka-aware, co-located with DriverHostActor + RouteNodeWrite/NodeWriteResult), so it has an Akka.TestKit home and keeps the Host thin. It resolves the DriverHostActor lazily per write via the actor registry (StartAsync ordering — same reason the current lambda does), does the bounded 10 s Ask, logs failures (the node manager has no logger), and maps NodeWriteResult → NodeWriteOutcome.
Step 1: Write the failing TestKit tests
Mirror the harness in DriverHostActorWriteRoutingTests.cs (a TestProbe standing in for the DriverHostActor, registered under DriverHostActorKey in a test IActorRegistry — reuse whatever registry fake the suite uses, or construct the gateway with a direct IActorRef overload to keep the test simple). Cases:
WriteAsync_maps_successful_NodeWriteResult_to_success— probe repliesNodeWriteResult(true, null)→ outcome(true, null).WriteAsync_maps_failed_NodeWriteResult_to_failure— probe repliesNodeWriteResult(false, "not primary")→ outcome(false, "not primary").WriteAsync_maps_timeout_to_failure— probe never replies → outcome(false, <non-null, e.g. "write timeout">)(use a short Ask timeout in the test, or an injected timeout).WriteAsync_with_no_DriverHostActor_returns_failure— registry can't resolve →(false, "writes unavailable"|…).
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ActorNodeWriteGateway" # expect FAIL (type missing)
Step 2: Implement
public sealed class ActorNodeWriteGateway : IOpcUaNodeWriteGateway
{
private readonly Func<IActorRef?> _resolveDriverHost; // lazy per-write resolve (StartAsync ordering)
private readonly TimeSpan _askTimeout;
private readonly ILogger _logger;
public ActorNodeWriteGateway(Func<IActorRef?> resolveDriverHost, ILogger logger, TimeSpan? askTimeout = null)
{ _resolveDriverHost = …; _logger = …; _askTimeout = askTimeout ?? TimeSpan.FromSeconds(10); }
public async Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct)
{
var driverHost = _resolveDriverHost();
if (driverHost is null)
{
_logger.LogWarning("Inbound write to {NodeId} dropped: no DriverHostActor registered", nodeId);
return new NodeWriteOutcome(false, "writes unavailable");
}
try
{
var result = await driverHost.Ask<DriverHostActor.NodeWriteResult>(
new DriverHostActor.RouteNodeWrite(nodeId, value), _askTimeout, ct).ConfigureAwait(false);
if (!result.Success)
_logger.LogWarning("Operator write to {NodeId} rejected: {Reason}", nodeId, result.Reason);
return new NodeWriteOutcome(result.Success, result.Reason);
}
catch (Exception ex) // AskTimeoutException / actor faults
{
_logger.LogWarning(ex, "Operator write to {NodeId} failed or timed out", nodeId);
return new NodeWriteOutcome(false, "write timeout");
}
}
}
(Provide whatever constructor shape best fits the test + the host's registry resolve — a Func<IActorRef?> keeps it unit-testable with a direct probe while supporting the host's lazy _actorRegistry.TryGet<DriverHostActorKey>.)
Step 3: test + build + commit
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ActorNodeWriteGateway"
dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ZB.MOM.WW.OtOpcUa.Runtime.csproj
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/ActorNodeWriteGateway.cs tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/ActorNodeWriteGatewayTests.cs
git commit -m "feat(runtime): ActorNodeWriteGateway — Asks RouteNodeWrite, returns NodeWriteOutcome"
Task T3: OpcUaServer — node-manager seam swap + capture-prior + compare-and-revert
Classification: high-risk Estimated implement time: ~5 min Parallelizable with: T2
Files:
- Modify:
src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs(NodeWriteRouter100-104;OnEquipmentTagWrite604-617;EvaluateEquipmentWrite637-657; theWriteValuenode-update pattern 122-141 to mirror) - Modify:
src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs(SetNodeWriteRouter52-57) - Test:
tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests/(the existingEvaluateEquipmentWritetest class)
Context: the node manager has no logger (new OtOpcUaNodeManager(server, configuration)), so the revert is silent (the gateway logs the failure). The continuation runs off-Lock (after the original Write returns) and re-acquires Lock to revert — the same shape WriteValue (122-141) already uses for the poll path.
Step 1 (TDD): pure-decision unit tests first
Add tests to the existing OnEquipmentTagWrite/EvaluateEquipmentWrite test class for the two pure statics this task introduces:
// gate (role + availability) — replaces the old EvaluateEquipmentWrite(identity, route) shape
[Fact] public void Gate_denies_when_identity_null() =>
OtOpcUaNodeManager.EvaluateEquipmentWriteGate(null, gatewayWired: true)!.StatusCode.ShouldBe((StatusCode)StatusCodes.BadUserAccessDenied);
[Fact] public void Gate_denies_when_role_missing() => /* identity without WriteOperate → BadUserAccessDenied */ ;
[Fact] public void Gate_unavailable_when_no_gateway() =>
OtOpcUaNodeManager.EvaluateEquipmentWriteGate(WriteOperateIdentity(), gatewayWired: false)!.StatusCode.ShouldBe((StatusCode)StatusCodes.BadNotWritable);
[Fact] public void Gate_passes_returns_null() =>
OtOpcUaNodeManager.EvaluateEquipmentWriteGate(WriteOperateIdentity(), gatewayWired: true).ShouldBeNull();
// compare-and-revert decision
[Theory]
[InlineData(true, "v", "v", false)] // success → never revert
[InlineData(false, "v", "v", true)] // fail + node still holds optimistic value → revert
[InlineData(false, "w", "v", false)] // fail + node moved on (poll) → don't clobber
public void ShouldRevert_decides(bool success, string current, string optimistic, bool expected) =>
OtOpcUaNodeManager.ShouldRevert(new NodeWriteOutcome(success, null), current, optimistic).ShouldBe(expected);
Run → expect FAIL (methods don't exist yet).
Step 2: Implement the seam swap + the two pure statics + the continuation
In OtOpcUaNodeManager.cs:
- Replace the
Action<string,object?> NodeWriteRouterproperty + itsvolatilebacking field with anIOpcUaNodeWriteGateway(defaultNullOpcUaNodeWriteGateway.Instance), volatile-backed, e.g.:private volatile IOpcUaNodeWriteGateway _nodeWriteGateway = NullOpcUaNodeWriteGateway.Instance; public IOpcUaNodeWriteGateway NodeWriteGateway { get => _nodeWriteGateway; set => _nodeWriteGateway = value ?? NullOpcUaNodeWriteGateway.Instance; } - Replace
OnEquipmentTagWrite+EvaluateEquipmentWritewith:(Confirmprivate ServiceResult OnEquipmentTagWrite( ISystemContext context, NodeState node, NumericRange indexRange, QualifiedName dataEncoding, ref object value, ref StatusCode statusCode, ref DateTime timestamp) { var identity = (context as ISessionOperationContext)?.UserIdentity as RoleCarryingUserIdentity; var gateway = _nodeWriteGateway; var gate = EvaluateEquipmentWriteGate(identity, gateway is not NullOpcUaNodeWriteGateway); if (gate is not null) return gate; // Capture optimistic value + the REAL prior value/status BEFORE the SDK applies the write. var optimisticValue = value; var nodeKey = node.NodeId.Identifier?.ToString() ?? string.Empty; object? priorValue = null; StatusCode priorStatus = StatusCodes.Good; if (node is BaseDataVariableState v) { priorValue = v.Value; priorStatus = v.StatusCode; } // Fire-and-forget (MUST NOT block under Lock). On a failed outcome, compare-and-revert. _ = gateway.WriteAsync(nodeKey, optimisticValue, CancellationToken.None) .ContinueWith(t => { var outcome = t.IsCompletedSuccessfully ? t.Result : new NodeWriteOutcome(false, "write dispatch faulted"); RevertOptimisticWriteIfNeeded(nodeKey, outcome, optimisticValue, priorValue, priorStatus); }, CancellationToken.None, TaskContinuationOptions.None, TaskScheduler.Default); return ServiceResult.Good; } internal static ServiceResult? EvaluateEquipmentWriteGate(RoleCarryingUserIdentity? identity, bool gatewayWired) { if (identity is null || !identity.Roles.Contains(OpcUaDataPlaneRoles.WriteOperate, StringComparer.OrdinalIgnoreCase)) return new ServiceResult(StatusCodes.BadUserAccessDenied); if (!gatewayWired) return new ServiceResult(StatusCodes.BadNotWritable, "writes unavailable"); return null; } internal static bool ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) => !outcome.Success && Equals(currentNodeValue, optimisticValue); private void RevertOptimisticWriteIfNeeded( string nodeId, NodeWriteOutcome outcome, object? optimisticValue, object? priorValue, StatusCode priorStatus) { lock (Lock) { if (!_variables.TryGetValue(nodeId, out var variable)) return; // rebuilt/removed → no-op if (!ShouldRevert(outcome, variable.Value, optimisticValue)) return; variable.Value = priorValue; variable.StatusCode = priorStatus; variable.Timestamp = DateTime.UtcNow; variable.ClearChangeMasks(SystemContext, includeChildren: false); } }_variablesis the dictionaryWriteValueuses at 130 — reuse it. Confirmnodeis/casts toBaseDataVariableState. Use the node manager's existingusings.) - In
OtOpcUaSdkServer.cs, renameSetNodeWriteRouter(Action<string,object?>?)→SetNodeWriteGateway(IOpcUaNodeWriteGateway?)setting_otOpcUaNodeManager.NodeWriteGateway. Update the XML doc.
Step 3: update the old EvaluateEquipmentWrite tests to the new gate shape (they currently assert the (identity, route) form). Keep the role-gate behaviour identical; only the signature changed.
Step 4: run + build + commit
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests --filter "FullyQualifiedName~EquipmentWrite|FullyQualifiedName~ShouldRevert"
dotnet build src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/ZB.MOM.WW.OtOpcUa.OpcUaServer.csproj
git add src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests/
git commit -m "feat(opcua): write-outcome self-correction — capture prior + compare-and-revert on failure"
Task T4: Host — wire ActorNodeWriteGateway
Classification: standard Estimated implement time: ~3 min Parallelizable with: none (needs T2 + T3)
Files:
- Modify:
src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs(StartAsyncrouter-wiring 133-161;StopAsync178-187)
Context: replace the inline SetNodeWriteRouter((nodeId,value)=>{ Ask… }) lambda (145-161) with construction of the ActorNodeWriteGateway (lazy registry resolve + a host logger) and SetNodeWriteGateway(gateway). Clear it on StopAsync (185 SetNodeWriteRouter(null) → SetNodeWriteGateway(null)).
Step 1: Rewire
// replaces lines 145-161
var writeGateway = new ActorNodeWriteGateway(
resolveDriverHost: () => _actorRegistry.TryGet<DriverHostActorKey>(out var h) ? h : null,
logger: _loggerFactory.CreateLogger<ActorNodeWriteGateway>());
_server.SetNodeWriteGateway(writeGateway);
And in StopAsync: _server?.SetNodeWriteGateway(null);
Keep the existing explanatory comment block (133-144) — it still describes the fire-and-forget + lazy-resolve rationale; trim the "logs failures from a continuation" clause to reflect the gateway now owning that.
Step 2: build the Host + a focused integration/host test if one exists
dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Host/ZB.MOM.WW.OtOpcUa.Host.csproj
(No unit test here — it's DI wiring; covered by T2's gateway tests + the T5 live gate. If Host.IntegrationTests has a boot-smoke test, ensure it still passes.)
Step 3: commit
git add src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs
git commit -m "feat(host): wire ActorNodeWriteGateway for write-outcome self-correction"
Task T5: Live /run verification
Classification: verification Estimated implement time: ~10 min (interactive) Parallelizable with: none (needs T4)
Goal: prove on the LOCAL docker-dev rig that a failed device write reverts the node, and a successful write stays.
Setup: docker-dev is LOCAL (OrbStack). Rebuild central on the branch + redeploy:
docker compose -f docker-dev/docker-compose.yml up -d --build migrator central-1 central-2
curl -s -X POST http://localhost:9200/api/deployments -H "X-Api-Key: docker-dev-deploy-key"
If a driver is stuck faulted, restart central-1/central-2 so it respawns from the artifact (the reconfigure-while-faulted gotcha).
Pick a deterministically-FAILING write (the plan's executor chooses whichever the rig makes easy):
- Modbus illegal write: point a writable equipment tag at a Modbus address the
:5020sim rejects (out-of-range / illegal-data-address), so the driver write returns Bad. Sims on10.100.0.35(/opt/otopcua-modbus). - Galaxy ViewOnly: surface a Galaxy
ViewOnlyattribute as a ReadWrite equipment tag — the gateway commit is refused. (TestMachine_002attrs from the Galaxy discovery probe.)
Verify with Client.CLI (bind as the opc-writeop/WriteOperate LDAP user — see docs/Client.CLI.md; shared GLAuth 10.100.0.35:3893, users password password):
subscribethe target node; record the current (real) value.writea new value through the authorized session → the write returns Good (optimistic) and the node briefly shows the new value, then reverts to the prior value on the subscription when the device write fails. ✅ phantom gone.- Repeat on a successful writable tag (e.g. the proven Modbus HR[200] scratch) → the value stays (no spurious revert). ✅ no regression.
- Confirm the gateway logged the rejection (
docker compose -f docker-dev/docker-compose.yml logs central-1 | grep -i "rejected\|write").
Then finish via superpowers-extended-cc:finishing-a-development-branch (intent: merge-to-master + push). After merge, update pending.md to mark #5 resolved.
Final verification (after T1–T4, before T5)
dotnet build ZB.MOM.WW.OtOpcUa.slnx # 0 errors
dotnet test tests/Core/<commons-test-project>
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests
Out of scope (per the approved design / mechanism choice)
- Bad-quality blip on failure; OPC UA
AuditWriteUpdateEvent; synchronous structural fail-fast. Independently addable later. DriverHostActor.RouteNodeWrite/NodeWriteResultand the driver write path are UNCHANGED — this only changes how the outcome is consumed server-side.