Files
lmxopcua/docs/plans/2026-06-14-write-outcome-self-correction-plan.md
T
Joseph Doherty 43b3769a1d
v2-ci / build (push) Failing after 32s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
docs(plans): add write-outcome self-correction implementation plan
The plan + task list for the write-outcome self-correction work (B1, already
shipped via master 1d797c1c). Its design-doc counterpart is already committed;
this adds the matching plan artifacts, consistent with the other docs/plans/.
2026-06-15 05:57:15 -04:00

22 KiB
Raw Blame History

Write-Outcome Self-Correction — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.

Goal: When an authorized inbound OPC UA operator write to a writable equipment-tag node fails at the device, revert the node to its real pre-write value on the subscription — killing the "optimistic-write phantom" (pending.md #5). Approved design: docs/plans/2026-06-14-write-outcome-self-correction-design.md (committed master 8a79cb0e).

Architecture: Complete the originally-designed (B3) write gateway. Add IOpcUaNodeWriteGateway { Task<NodeWriteOutcome> WriteAsync(...) } (Commons), an Akka adapter ActorNodeWriteGateway (Runtime) that Asks the existing DriverHostActor.RouteNodeWrite, and swap the node manager's fire-and-forget Action router for the gateway. The node manager captures the prior value at OnWriteValue time and, on a failed async outcome, compare-and-reverts the node (only if it still holds the optimistic value, so a fresh poll isn't clobbered). The dispatch stays non-blocking under the node-manager Lock; the correction runs in the off-Lock continuation.

Tech Stack: C# / .NET 10, OPC UA Foundation stack (CustomNodeManager2), Akka.NET, xUnit + Shouldly + Akka.TestKit. TreatWarningsAsErrors (production projects).

Branch: feat/write-outcome-self-correction off master (8a79cb0e).

Hard rules (every task): stage by path (never git add .); never stage sql_login.txt / src/Server/.../pki/ / pending.md / current.md / docker-dev/docker-compose.yml; never echo secrets; no force-push / no --no-verify; NO Configuration entity / EF migration change; NO bUnit (this is server-side only). Done = build clean + dotnet test green + the live /run gate (T5) passes.


Task DAG

T0 (branch)
 └─ T1  Commons: IOpcUaNodeWriteGateway + NodeWriteOutcome + Null default      [standard]
     ├─ T2  Runtime: ActorNodeWriteGateway (Asks RouteNodeWrite) + TestKit     [standard]   ∥ T3
     └─ T3  OpcUaServer: node-manager seam swap + capture-prior + revert       [high-risk]  ∥ T2
         └─ T4  Host: wire ActorNodeWriteGateway; SetNodeWriteGateway/clear    [standard]   (needs T2+T3)
             └─ T5  Live /run: failing write reverts, success stays           [verification]

T2 and T3 are both blockedBy T1 and touch different assemblies (Runtime vs OpcUaServer) — Parallelizable with each other. T4 needs both. T5 needs T4.


Task T0: Create feature branch

Classification: trivial Estimated implement time: ~1 min Parallelizable with: none (gates all)

Files: none (git only)

git checkout -b feat/write-outcome-self-correction master
git rev-parse --abbrev-ref HEAD    # expect: feat/write-outcome-self-correction

Do NOT touch the working-tree docker-dev/docker-compose.yml / pending.md (leave modified+unstaged).


Task T1: Commons — IOpcUaNodeWriteGateway + NodeWriteOutcome + Null default

Classification: standard Estimated implement time: ~3 min Parallelizable with: none (T2 + T3 depend on it)

Files:

  • Create: src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs
  • Test: the Commons test project (the one that tests IOpcUaAddressSpaceSink / EquipmentNodeIds — locate it under tests/Core/, e.g. *.Commons.Tests; confirm exact name)

Context: mirrors IOpcUaAddressSpaceSink.cs (same folder) — a Commons-level (Akka-free) contract with a Null… default. The write path will use this instead of the Action<string,object?> router so the device outcome can flow back to the node manager.

Step 1: Write the interface + outcome + null default

namespace ZB.MOM.WW.OtOpcUa.Commons.OpcUa;

/// <summary>
///     Reverse-path gateway for an inbound OPC UA operator write to a writable equipment-tag node.
///     The node manager calls <see cref="WriteAsync"/> fire-and-forget from its OnWriteValue handler
///     (which runs under the node-manager Lock, so the call MUST return promptly — it kicks off an
///     asynchronous route and the Task completes later) and uses the resolved
///     <see cref="NodeWriteOutcome"/> to self-correct the node on failure.
/// </summary>
public interface IOpcUaNodeWriteGateway
{
    /// <summary>Route a write of <paramref name="value"/> to the driver backing node
    /// <paramref name="nodeId"/>; the Task resolves with the device-write outcome.</summary>
    Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct);
}

/// <summary>Outcome of routing an inbound node write to the backing driver.</summary>
/// <param name="Success">True when the driver accepted the write.</param>
/// <param name="Reason">Failure detail when <paramref name="Success"/> is false; null on success.</param>
public readonly record struct NodeWriteOutcome(bool Success, string? Reason);

/// <summary>No-op gateway: every write resolves to "writes unavailable" (matches the legacy
/// no-router-wired <c>BadNotWritable</c>). Used as the default before the host wires the real one.</summary>
public sealed class NullOpcUaNodeWriteGateway : IOpcUaNodeWriteGateway
{
    /// <summary>Singleton instance.</summary>
    public static readonly NullOpcUaNodeWriteGateway Instance = new();
    private NullOpcUaNodeWriteGateway() { }
    /// <inheritdoc />
    public Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct) =>
        Task.FromResult(new NodeWriteOutcome(false, "writes unavailable"));
}

Step 2: Unit test the Null default

[Fact]
public async Task NullGateway_returns_writes_unavailable()
{
    var outcome = await NullOpcUaNodeWriteGateway.Instance.WriteAsync("ns=2;s=x", 1, default);
    outcome.Success.ShouldBeFalse();
    outcome.Reason.ShouldBe("writes unavailable");
}

Step 3: Build + test + commit

dotnet build src/Core/ZB.MOM.WW.OtOpcUa.Commons/ZB.MOM.WW.OtOpcUa.Commons.csproj
dotnet test <commons-test-project> --filter "FullyQualifiedName~NullGateway"
git add src/Core/ZB.MOM.WW.OtOpcUa.Commons/OpcUa/IOpcUaNodeWriteGateway.cs <commons-test-file>
git commit -m "feat(commons): IOpcUaNodeWriteGateway + NodeWriteOutcome for write-outcome routing"

Task T2: Runtime — ActorNodeWriteGateway (Asks RouteNodeWrite)

Classification: standard Estimated implement time: ~5 min Parallelizable with: T3

Files:

  • Create: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/ActorNodeWriteGateway.cs
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/ActorNodeWriteGatewayTests.cs (NEW)

Context: this is the current host router lambda (OtOpcUaServerHostedService.cs:145-161) repackaged as a named IOpcUaNodeWriteGateway that returns the outcome. It lives in Runtime (Akka-aware, co-located with DriverHostActor + RouteNodeWrite/NodeWriteResult), so it has an Akka.TestKit home and keeps the Host thin. It resolves the DriverHostActor lazily per write via the actor registry (StartAsync ordering — same reason the current lambda does), does the bounded 10 s Ask, logs failures (the node manager has no logger), and maps NodeWriteResult → NodeWriteOutcome.

Step 1: Write the failing TestKit tests

Mirror the harness in DriverHostActorWriteRoutingTests.cs (a TestProbe standing in for the DriverHostActor, registered under DriverHostActorKey in a test IActorRegistry — reuse whatever registry fake the suite uses, or construct the gateway with a direct IActorRef overload to keep the test simple). Cases:

  • WriteAsync_maps_successful_NodeWriteResult_to_success — probe replies NodeWriteResult(true, null) → outcome (true, null).
  • WriteAsync_maps_failed_NodeWriteResult_to_failure — probe replies NodeWriteResult(false, "not primary") → outcome (false, "not primary").
  • WriteAsync_maps_timeout_to_failure — probe never replies → outcome (false, <non-null, e.g. "write timeout">) (use a short Ask timeout in the test, or an injected timeout).
  • WriteAsync_with_no_DriverHostActor_returns_failure — registry can't resolve → (false, "writes unavailable"|…).
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ActorNodeWriteGateway"   # expect FAIL (type missing)

Step 2: Implement

public sealed class ActorNodeWriteGateway : IOpcUaNodeWriteGateway
{
    private readonly Func<IActorRef?> _resolveDriverHost;   // lazy per-write resolve (StartAsync ordering)
    private readonly TimeSpan _askTimeout;
    private readonly ILogger _logger;

    public ActorNodeWriteGateway(Func<IActorRef?> resolveDriverHost, ILogger logger, TimeSpan? askTimeout = null)
    { _resolveDriverHost = ; _logger = ; _askTimeout = askTimeout ?? TimeSpan.FromSeconds(10); }

    public async Task<NodeWriteOutcome> WriteAsync(string nodeId, object? value, CancellationToken ct)
    {
        var driverHost = _resolveDriverHost();
        if (driverHost is null)
        {
            _logger.LogWarning("Inbound write to {NodeId} dropped: no DriverHostActor registered", nodeId);
            return new NodeWriteOutcome(false, "writes unavailable");
        }
        try
        {
            var result = await driverHost.Ask<DriverHostActor.NodeWriteResult>(
                new DriverHostActor.RouteNodeWrite(nodeId, value), _askTimeout, ct).ConfigureAwait(false);
            if (!result.Success)
                _logger.LogWarning("Operator write to {NodeId} rejected: {Reason}", nodeId, result.Reason);
            return new NodeWriteOutcome(result.Success, result.Reason);
        }
        catch (Exception ex)   // AskTimeoutException / actor faults
        {
            _logger.LogWarning(ex, "Operator write to {NodeId} failed or timed out", nodeId);
            return new NodeWriteOutcome(false, "write timeout");
        }
    }
}

(Provide whatever constructor shape best fits the test + the host's registry resolve — a Func<IActorRef?> keeps it unit-testable with a direct probe while supporting the host's lazy _actorRegistry.TryGet<DriverHostActorKey>.)

Step 3: test + build + commit

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~ActorNodeWriteGateway"
dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ZB.MOM.WW.OtOpcUa.Runtime.csproj
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/ActorNodeWriteGateway.cs tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/ActorNodeWriteGatewayTests.cs
git commit -m "feat(runtime): ActorNodeWriteGateway — Asks RouteNodeWrite, returns NodeWriteOutcome"

Task T3: OpcUaServer — node-manager seam swap + capture-prior + compare-and-revert

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: T2

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs (NodeWriteRouter 100-104; OnEquipmentTagWrite 604-617; EvaluateEquipmentWrite 637-657; the WriteValue node-update pattern 122-141 to mirror)
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs (SetNodeWriteRouter 52-57)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests/ (the existing EvaluateEquipmentWrite test class)

Context: the node manager has no logger (new OtOpcUaNodeManager(server, configuration)), so the revert is silent (the gateway logs the failure). The continuation runs off-Lock (after the original Write returns) and re-acquires Lock to revert — the same shape WriteValue (122-141) already uses for the poll path.

Step 1 (TDD): pure-decision unit tests first

Add tests to the existing OnEquipmentTagWrite/EvaluateEquipmentWrite test class for the two pure statics this task introduces:

// gate (role + availability) — replaces the old EvaluateEquipmentWrite(identity, route) shape
[Fact] public void Gate_denies_when_identity_null() =>
    OtOpcUaNodeManager.EvaluateEquipmentWriteGate(null, gatewayWired: true)!.StatusCode.ShouldBe((StatusCode)StatusCodes.BadUserAccessDenied);
[Fact] public void Gate_denies_when_role_missing() => /* identity without WriteOperate → BadUserAccessDenied */ ;
[Fact] public void Gate_unavailable_when_no_gateway() =>
    OtOpcUaNodeManager.EvaluateEquipmentWriteGate(WriteOperateIdentity(), gatewayWired: false)!.StatusCode.ShouldBe((StatusCode)StatusCodes.BadNotWritable);
[Fact] public void Gate_passes_returns_null() =>
    OtOpcUaNodeManager.EvaluateEquipmentWriteGate(WriteOperateIdentity(), gatewayWired: true).ShouldBeNull();

// compare-and-revert decision
[Theory]
[InlineData(true,  "v", "v", false)]   // success → never revert
[InlineData(false, "v", "v", true)]    // fail + node still holds optimistic value → revert
[InlineData(false, "w", "v", false)]   // fail + node moved on (poll) → don't clobber
public void ShouldRevert_decides(bool success, string current, string optimistic, bool expected) =>
    OtOpcUaNodeManager.ShouldRevert(new NodeWriteOutcome(success, null), current, optimistic).ShouldBe(expected);

Run → expect FAIL (methods don't exist yet).

Step 2: Implement the seam swap + the two pure statics + the continuation

In OtOpcUaNodeManager.cs:

  • Replace the Action<string,object?> NodeWriteRouter property + its volatile backing field with an IOpcUaNodeWriteGateway (default NullOpcUaNodeWriteGateway.Instance), volatile-backed, e.g.:
    private volatile IOpcUaNodeWriteGateway _nodeWriteGateway = NullOpcUaNodeWriteGateway.Instance;
    public IOpcUaNodeWriteGateway NodeWriteGateway { get => _nodeWriteGateway; set => _nodeWriteGateway = value ?? NullOpcUaNodeWriteGateway.Instance; }
    
  • Replace OnEquipmentTagWrite + EvaluateEquipmentWrite with:
    private ServiceResult OnEquipmentTagWrite(
        ISystemContext context, NodeState node, NumericRange indexRange, QualifiedName dataEncoding,
        ref object value, ref StatusCode statusCode, ref DateTime timestamp)
    {
        var identity = (context as ISessionOperationContext)?.UserIdentity as RoleCarryingUserIdentity;
        var gateway = _nodeWriteGateway;
        var gate = EvaluateEquipmentWriteGate(identity, gateway is not NullOpcUaNodeWriteGateway);
        if (gate is not null) return gate;
    
        // Capture optimistic value + the REAL prior value/status BEFORE the SDK applies the write.
        var optimisticValue = value;
        var nodeKey = node.NodeId.Identifier?.ToString() ?? string.Empty;
        object? priorValue = null; StatusCode priorStatus = StatusCodes.Good;
        if (node is BaseDataVariableState v) { priorValue = v.Value; priorStatus = v.StatusCode; }
    
        // Fire-and-forget (MUST NOT block under Lock). On a failed outcome, compare-and-revert.
        _ = gateway.WriteAsync(nodeKey, optimisticValue, CancellationToken.None)
            .ContinueWith(t =>
            {
                var outcome = t.IsCompletedSuccessfully ? t.Result : new NodeWriteOutcome(false, "write dispatch faulted");
                RevertOptimisticWriteIfNeeded(nodeKey, outcome, optimisticValue, priorValue, priorStatus);
            }, CancellationToken.None, TaskContinuationOptions.None, TaskScheduler.Default);
    
        return ServiceResult.Good;
    }
    
    internal static ServiceResult? EvaluateEquipmentWriteGate(RoleCarryingUserIdentity? identity, bool gatewayWired)
    {
        if (identity is null || !identity.Roles.Contains(OpcUaDataPlaneRoles.WriteOperate, StringComparer.OrdinalIgnoreCase))
            return new ServiceResult(StatusCodes.BadUserAccessDenied);
        if (!gatewayWired)
            return new ServiceResult(StatusCodes.BadNotWritable, "writes unavailable");
        return null;
    }
    
    internal static bool ShouldRevert(NodeWriteOutcome outcome, object? currentNodeValue, object? optimisticValue) =>
        !outcome.Success && Equals(currentNodeValue, optimisticValue);
    
    private void RevertOptimisticWriteIfNeeded(
        string nodeId, NodeWriteOutcome outcome, object? optimisticValue, object? priorValue, StatusCode priorStatus)
    {
        lock (Lock)
        {
            if (!_variables.TryGetValue(nodeId, out var variable)) return;     // rebuilt/removed → no-op
            if (!ShouldRevert(outcome, variable.Value, optimisticValue)) return;
            variable.Value = priorValue;
            variable.StatusCode = priorStatus;
            variable.Timestamp = DateTime.UtcNow;
            variable.ClearChangeMasks(SystemContext, includeChildren: false);
        }
    }
    
    (Confirm _variables is the dictionary WriteValue uses at 130 — reuse it. Confirm node is/casts to BaseDataVariableState. Use the node manager's existing usings.)
  • In OtOpcUaSdkServer.cs, rename SetNodeWriteRouter(Action<string,object?>?)SetNodeWriteGateway(IOpcUaNodeWriteGateway?) setting _otOpcUaNodeManager.NodeWriteGateway. Update the XML doc.

Step 3: update the old EvaluateEquipmentWrite tests to the new gate shape (they currently assert the (identity, route) form). Keep the role-gate behaviour identical; only the signature changed.

Step 4: run + build + commit

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests --filter "FullyQualifiedName~EquipmentWrite|FullyQualifiedName~ShouldRevert"
dotnet build src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/ZB.MOM.WW.OtOpcUa.OpcUaServer.csproj
git add src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaNodeManager.cs src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/OtOpcUaSdkServer.cs tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests/
git commit -m "feat(opcua): write-outcome self-correction — capture prior + compare-and-revert on failure"

Task T4: Host — wire ActorNodeWriteGateway

Classification: standard Estimated implement time: ~3 min Parallelizable with: none (needs T2 + T3)

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs (StartAsync router-wiring 133-161; StopAsync 178-187)

Context: replace the inline SetNodeWriteRouter((nodeId,value)=>{ Ask… }) lambda (145-161) with construction of the ActorNodeWriteGateway (lazy registry resolve + a host logger) and SetNodeWriteGateway(gateway). Clear it on StopAsync (185 SetNodeWriteRouter(null)SetNodeWriteGateway(null)).

Step 1: Rewire

// replaces lines 145-161
var writeGateway = new ActorNodeWriteGateway(
    resolveDriverHost: () => _actorRegistry.TryGet<DriverHostActorKey>(out var h) ? h : null,
    logger: _loggerFactory.CreateLogger<ActorNodeWriteGateway>());
_server.SetNodeWriteGateway(writeGateway);

And in StopAsync: _server?.SetNodeWriteGateway(null);

Keep the existing explanatory comment block (133-144) — it still describes the fire-and-forget + lazy-resolve rationale; trim the "logs failures from a continuation" clause to reflect the gateway now owning that.

Step 2: build the Host + a focused integration/host test if one exists

dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Host/ZB.MOM.WW.OtOpcUa.Host.csproj

(No unit test here — it's DI wiring; covered by T2's gateway tests + the T5 live gate. If Host.IntegrationTests has a boot-smoke test, ensure it still passes.)

Step 3: commit

git add src/Server/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OtOpcUaServerHostedService.cs
git commit -m "feat(host): wire ActorNodeWriteGateway for write-outcome self-correction"

Task T5: Live /run verification

Classification: verification Estimated implement time: ~10 min (interactive) Parallelizable with: none (needs T4)

Goal: prove on the LOCAL docker-dev rig that a failed device write reverts the node, and a successful write stays.

Setup: docker-dev is LOCAL (OrbStack). Rebuild central on the branch + redeploy:

docker compose -f docker-dev/docker-compose.yml up -d --build migrator central-1 central-2
curl -s -X POST http://localhost:9200/api/deployments -H "X-Api-Key: docker-dev-deploy-key"

If a driver is stuck faulted, restart central-1/central-2 so it respawns from the artifact (the reconfigure-while-faulted gotcha).

Pick a deterministically-FAILING write (the plan's executor chooses whichever the rig makes easy):

  • Modbus illegal write: point a writable equipment tag at a Modbus address the :5020 sim rejects (out-of-range / illegal-data-address), so the driver write returns Bad. Sims on 10.100.0.35 (/opt/otopcua-modbus).
  • Galaxy ViewOnly: surface a Galaxy ViewOnly attribute as a ReadWrite equipment tag — the gateway commit is refused. (TestMachine_002 attrs from the Galaxy discovery probe.)

Verify with Client.CLI (bind as the opc-writeop/WriteOperate LDAP user — see docs/Client.CLI.md; shared GLAuth 10.100.0.35:3893, users password password):

  1. subscribe the target node; record the current (real) value.
  2. write a new value through the authorized session → the write returns Good (optimistic) and the node briefly shows the new value, then reverts to the prior value on the subscription when the device write fails. phantom gone.
  3. Repeat on a successful writable tag (e.g. the proven Modbus HR[200] scratch) → the value stays (no spurious revert). no regression.
  4. Confirm the gateway logged the rejection (docker compose -f docker-dev/docker-compose.yml logs central-1 | grep -i "rejected\|write").

Then finish via superpowers-extended-cc:finishing-a-development-branch (intent: merge-to-master + push). After merge, update pending.md to mark #5 resolved.


Final verification (after T1T4, before T5)

dotnet build ZB.MOM.WW.OtOpcUa.slnx                 # 0 errors
dotnet test tests/Core/<commons-test-project>
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests

Out of scope (per the approved design / mechanism choice)

  • Bad-quality blip on failure; OPC UA AuditWriteUpdateEvent; synchronous structural fail-fast. Independently addable later.
  • DriverHostActor.RouteNodeWrite / NodeWriteResult and the driver write path are UNCHANGED — this only changes how the outcome is consumed server-side.