Files
lmxopcua/docs/plans/2026-06-07-otopcua-followons.md
T

19 KiB

OtOpcUa Follow-ons Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.

Goal: Ship three independent hardening items from docs/plans/2026-06-07-otopcua-followons-design.md: (1) fix the DriverInstanceActor subscribe ActorContext race, (2) reject Tag↔VirtualTag NodeId collisions at deploy via a surgical validator gate, (3) make the docker-dev stack survive a constrained host.

Architecture: All three live in OtOpcUa, on branch feat/otopcua-followons (off master 4c221ce). They are independent (no cross-deps except T3→T2). #1 is a Runtime actor fix; #2 wires the (currently dormant) DraftValidator into the deploy handler but gates only on the new collision rule (other dormant rules run but don't block — avoids breaking the existing non-canonical company overlay); #3 is docker-dev config only.

Tech Stack: .NET 10, Akka.NET (TestKit), EF Core (OtOpcUaConfigDbContext), xUnit + Shouldly, Docker Compose.

Decisions locked (from brainstorming): #2 = surgical gate (reject only on EquipmentSignalNameCollision; build the DraftSnapshot + wire DraftValidator.Validate so it's future-ready, but filter the gate to the one code). #3 = quiet EF/AspNetCore logging + per-service mem_limit/mem_reservation + README (no lite profile).

Environment caveat: docker-dev is being actively reconfigured by the user (DB reset, single-mesh, fresh-volume bootstrap). Do not rely on live-DB state for tests — #2's tests are unit/TestKit with synthetic snapshots; #3 is verified by bringing the stack up. Coordinate #3's edits with the in-flight docker-dev commits on master (keep them additive).


Task graph

T1 (#1 subscribe race)        ┐ independent, disjoint files
T2 (#2 snapshot field + rule) ┤ → T3 (#2 deploy gate)
T4 (#3 docker-dev config)     ┘

T1, T2, T4 are mutually parallelizable. T3 depends on T2.


Task 1: Fix the DriverInstanceActor subscribe ActorContext race

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: Task 2, Task 4

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs (HandleSubscribeAsync ~349-384, UnsubscribeAsync ~386-403, HandleWriteAsync ~315-345 — drop ConfigureAwait(false))
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs

Root cause (recap): HandleSubscribeAsync reads Sender (= Context.Sender) on line 362 after a conditional await UnsubscribeAsync().ConfigureAwait(false) on line 359. ConfigureAwait(false) resumes the continuation off Akka's ActorTaskScheduler, so the re-subscribe path reads Context.Sender on a thread-pool thread → no active ActorContext.

Step 1 — Write the failing test. In DriverInstanceActorTests.cs (read it first for the existing fake-ISubscribable driver + TestKit harness pattern; reuse the fake driver that drives the actor to the Connected state). Add a test that subscribes, then subscribes AGAIN (so _subscriptionHandle is not null → drives the line-359 unsubscribe-then-resubscribe path):

[Fact]
public void Resubscribe_does_not_throw_no_active_ActorContext()
{
    var (actor, _) = CreateConnectedSubscribableDriver(); // mirror the existing helper
    actor.Tell(new DriverInstanceActor.Subscribe(new[] { "ref.A" }, TimeSpan.FromSeconds(1)));
    ExpectMsg<DriverInstanceActor.SubscriptionEstablished>();
    // Second subscribe exercises the await-Unsubscribe-then-read-Sender path that threw.
    actor.Tell(new DriverInstanceActor.Subscribe(new[] { "ref.B" }, TimeSpan.FromSeconds(1)));
    ExpectMsg<DriverInstanceActor.SubscriptionEstablished>(TimeSpan.FromSeconds(5));
}

(Match the real message type names/namespaces from the actor. If the existing tests already have a "subscribe" test, clone its setup.)

Step 2 — Run, verify it fails. dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ --filter "FullyQualifiedName~DriverInstanceActor" — expect the new test to fail (the second subscribe throws no active ActorContext, so no SubscriptionEstablished arrives → ExpectMsg times out).

Step 3 — Implement. In HandleSubscribeAsync, move the Sender/Self capture to the top, before the ISubscribable check and the conditional unsubscribe await; use the captured locals everywhere after; and drop .ConfigureAwait(false) from the awaits inside this and the sibling ReceiveAsync handlers:

private async Task HandleSubscribeAsync(Subscribe msg)
{
    var replyTo = Sender;          // capture BEFORE any await (Context.Sender is invalid post-await)
    var self = Self;
    if (_driver is not ISubscribable subscribable)
    {
        replyTo.Tell(new SubscriptionFailed("Driver does not implement ISubscribable"));
        return;
    }
    if (_subscriptionHandle is not null)
    {
        await UnsubscribeAsync();   // no ConfigureAwait(false) — resume on the actor context
    }
    try
    {
        _dataChangeHandler = (_, args) => self.Tell(new DataChangeForward(args.FullReference, args.Snapshot));
        subscribable.OnDataChange += _dataChangeHandler;
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
        _subscriptionHandle = await subscribable
            .SubscribeAsync(msg.FullReferences, msg.PublishingInterval, cts.Token);   // no ConfigureAwait(false)
        replyTo.Tell(new SubscriptionEstablished(_subscriptionHandle.DiagnosticId, msg.FullReferences.Count));
        _log.Info("DriverInstance {Id}: subscribed to {Count} refs ({Diag})",
            _driverInstanceId, msg.FullReferences.Count, _subscriptionHandle.DiagnosticId);
    }
    catch (Exception ex)
    {
        DetachSubscription();
        _log.Warning(ex, "DriverInstance {Id}: subscribe failed", _driverInstanceId);
        replyTo.Tell(new SubscriptionFailed(ex.Message));
    }
}

Also remove .ConfigureAwait(false) from the awaits in UnsubscribeAsync (line ~396) and HandleWriteAsync (line ~330) so every ReceiveAsync continuation resumes on the actor context. (HandleApplyDeltaAsync already captures Sender first and has no ConfigureAwait(false) — leave it.) Audit each handler: no raw Sender/Self/Context access past any await.

Step 4 — Run, verify it passes. Same filter → green. Then run the broader Runtime driver slice: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ --filter "FullyQualifiedName~DriverInstance|FullyQualifiedName~DriverHost" — no regression.

Step 5 — Commit. git add … && git commit -m "fix(runtime): capture Sender before await in DriverInstanceActor subscribe (no-ActorContext race)"


Task 2: Add VirtualTags to DraftSnapshot + the collision rule

Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 1, Task 4

Files:

  • Modify: src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftSnapshot.cs (add VirtualTags)
  • Modify: src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs (add rule + wire into Validate)
  • Test: tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/DraftValidatorTests.cs

Step 1 — Write failing tests. In DraftValidatorTests.cs (read it for the snapshot-builder helper the existing tests use — they construct DraftSnapshot via object initializer with the entities). Add:

[Fact]
public void Tag_and_VirtualTag_same_equipment_and_name_collide()
{
    var draft = MakeDraft(/* one Equipment eq-1 */) with-ish initializer:
        Tags = [ new Tag { TagId="t1", EquipmentId="eq-1", Name="speed", DataType="Float64", FolderPath=null, /*required fields*/ } ],
        VirtualTags = [ new VirtualTag { VirtualTagId="v1", EquipmentId="eq-1", Name="speed", DataType="Float64", ScriptId="s1" } ];
    DraftValidator.Validate(draft).ShouldContain(e => e.Code == "EquipmentSignalNameCollision");
}

[Fact]
public void Same_name_different_folder_does_not_collide()
{
    // Tag under FolderPath "metrics" vs VirtualTag at equipment root → different NodeId → OK
    ... Tags=[Tag{EquipmentId="eq-1",Name="speed",FolderPath="metrics"}], VirtualTags=[VirtualTag{EquipmentId="eq-1",Name="speed"}]
    DraftValidator.Validate(draft).ShouldNotContain(e => e.Code == "EquipmentSignalNameCollision");
}

[Fact]
public void SystemPlatform_tag_sharing_name_with_equipment_vtag_is_excluded()
{
    // Tag with EquipmentId == null (galaxy/SystemPlatform) never collides with an equipment vtag
    ... Tags=[Tag{EquipmentId=null,Name="speed"}], VirtualTags=[VirtualTag{EquipmentId="eq-1",Name="speed"}]
    DraftValidator.Validate(draft).ShouldNotContain(e => e.Code == "EquipmentSignalNameCollision");
}

(Fill in every required/NOT-NULL field on Tag/VirtualTag/Equipment — read the entity classes src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{Tag,VirtualTag,Equipment}.cs. Mirror how existing tests build entities.)

Step 2 — Run, verify fail (compile error: VirtualTags not on DraftSnapshot; then rule missing). dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/ --filter "FullyQualifiedName~DraftValidator"

Step 3 — Implement. In DraftSnapshot.cs, after the Tags member:

/// <summary>Equipment-bound VirtualTags (script-derived signals). Used by the NodeId-collision rule.</summary>
public IReadOnlyList<VirtualTag> VirtualTags { get; init; } = [];

In DraftValidator.cs, add the rule and call it from Validate:

private static void ValidateNoEquipmentSignalNameCollision(DraftSnapshot draft, List<ValidationError> errors)
{
    // The materialiser keys equipment signal variables on the folder-scoped NodeId
    // "{EquipmentId}[/{FolderPath}]/{Name}". Tag (EquipmentId != null) and VirtualTag share that
    // node space with no DB-level cross-table uniqueness, so the same key from both collides.
    static string Key(string eq, string? folder, string name) =>
        string.IsNullOrWhiteSpace(folder) ? $"{eq}/{name}" : $"{eq}/{folder}/{name}";

    var signals = draft.Tags
        .Where(t => t.EquipmentId is not null)
        .Select(t => (Key: Key(t.EquipmentId!, t.FolderPath, t.Name), Eq: t.EquipmentId!, t.Name))
        .Concat(draft.VirtualTags
            .Select(v => (Key: Key(v.EquipmentId, null, v.Name), Eq: v.EquipmentId, v.Name)));

    foreach (var g in signals.GroupBy(s => s.Key, StringComparer.Ordinal).Where(g => g.Count() > 1))
    {
        var f = g.First();
        errors.Add(new("EquipmentSignalNameCollision",
            $"{g.Count()} signals collide on OPC UA NodeId '{g.Key}' (equipment '{f.Eq}', name '{f.Name}'); " +
            "a Name must be unique across Tag and VirtualTag within an equipment+folder",
            f.Eq));
    }
}

Add ValidateNoEquipmentSignalNameCollision(draft, errors); to the Validate method's rule list (around line 33). Confirm the VirtualTag entity exposes EquipmentId, Name (it does); it has no FolderPath (so vtags use null folder).

Step 4 — Run, verify pass. Same filter → green; then the whole Configuration.Tests suite to confirm no regression.

Step 5 — Commit. git add … && git commit -m "feat(config): DraftValidator rule + DraftSnapshot.VirtualTags for Tag/VirtualTag NodeId collisions"


Task 3: Build DraftSnapshot from the DB + wire the surgical deploy gate

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: none (depends on Task 2)

Files:

  • Create: src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftSnapshotFactory.cs
  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs (HandleStartDeploymentAsync, before the ConfigComposer.SnapshotAndFlattenAsync seal)
  • Read first: AdminOperationsActor.cs (the db in scope + the StartDeploymentOutcome enum incl. Rejected + the StartDeploymentResult shape), and src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Api/DeployApiEndpoints.cs (confirm Rejected maps to a non-2xx HTTP — if not, map it, e.g. 422).
  • Test: tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/DraftSnapshotFactoryTests.cs (new) + extend an AdminOperationsActor test if one exists (TestKit) to assert a colliding config is Rejected.

Design: Build a DraftSnapshot from the live config DB and run the FULL DraftValidator.Validate, but gate the deploy only on EquipmentSignalNameCollision — other (dormant, possibly-failing) rules run but do not block. This wires the validator for future activation without breaking the non-canonical company overlay.

Step 1 — Write failing tests. DraftSnapshotFactoryTests.cs: using an in-memory OtOpcUaConfigDbContext (mirror how other Configuration.Tests build one) seeded with one Equipment + a Tag and a VirtualTag sharing (EquipmentId, Name), assert DraftSnapshotFactory.FromConfigDbAsync(db) returns a snapshot whose Tags and VirtualTags are populated, and DraftValidator.Validate(snapshot) contains EquipmentSignalNameCollision. If AdminOperationsActor has a TestKit test harness, add: seed a colliding config → StartDeploymentStartDeploymentResult outcome is Rejected with the collision message; seed a non-colliding config → Accepted (sealed).

Step 2 — Run, verify fail.

Step 3 — Implement. DraftSnapshotFactory.cs — load the tables the rules touch (so they don't NRE), gate-relevant ones fully populated:

public static class DraftSnapshotFactory
{
    public static async Task<DraftSnapshot> FromConfigDbAsync(OtOpcUaConfigDbContext db, CancellationToken ct = default)
        => new DraftSnapshot
        {
            GenerationId = 0,            // generation model dropped; placeholder (no rule reads it)
            ClusterId = string.Empty,    // global snapshot; rules compare entity ClusterId fields, not this
            Namespaces        = await db.Namespaces.AsNoTracking().ToListAsync(ct),
            DriverInstances   = await db.DriverInstances.AsNoTracking().ToListAsync(ct),
            Devices           = await db.Devices.AsNoTracking().ToListAsync(ct),
            UnsAreas          = await db.UnsAreas.AsNoTracking().ToListAsync(ct),
            UnsLines          = await db.UnsLines.AsNoTracking().ToListAsync(ct),
            Equipment         = await db.Equipment.AsNoTracking().ToListAsync(ct),
            Tags              = await db.Tags.AsNoTracking().ToListAsync(ct),
            VirtualTags       = await db.VirtualTags.AsNoTracking().ToListAsync(ct),
            PollGroups        = await db.PollGroups.AsNoTracking().ToListAsync(ct),
            PriorEquipment    = [],      // no prior generation in the live model (UUID-immutability rule no-ops)
            ActiveReservations= await db.ExternalIdReservations.AsNoTracking().ToListAsync(ct),
            // Enterprise/Site left null — only PathLength reads them (under-counts; acceptable, not gated here)
        };
}

(Confirm the DbSet names: db.ExternalIdReservations etc. — read OtOpcUaConfigDbContext. Adjust if different.)

In AdminOperationsActor.HandleStartDeploymentAsync, immediately before ConfigComposer.SnapshotAndFlattenAsync(db):

var draft = await DraftSnapshotFactory.FromConfigDbAsync(db);
var collisions = DraftValidator.Validate(draft)
    .Where(e => e.Code == "EquipmentSignalNameCollision")
    .ToList();
if (collisions.Count > 0)
{
    var summary = string.Join("; ", collisions.Select(e => e.Message));
    _log.Warning("StartDeployment rejected: {Summary}", summary);
    replyTo.Tell(new StartDeploymentResult(StartDeploymentOutcome.Rejected, /* id */ null, summary /* match the result shape */));
    return;
}

(Match the real StartDeploymentResult constructor + how replyTo/the failure path is shaped at lines ~108-119.) If DeployApiEndpoints doesn't already map Rejected to a non-2xx, map it to 422 Unprocessable Entity with the message.

Step 4 — Run, verify pass. Configuration.Tests + the AdminOperations slice green. Build the Host project to confirm the endpoint mapping compiles.

Step 5 — Commit. git add … && git commit -m "feat(deploy): reject Tag/VirtualTag NodeId collisions at deploy (surgical DraftValidator gate)"


Task 4: docker-dev resource hardening (logging + memory + README)

Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 1, Task 2

Files:

  • Modify: docker-dev/docker-compose.yml (the host-service env + per-service limits; use the &otopcua-host anchor / per-service blocks)
  • Modify: docker-dev/README.md (memory note)
  • Read first: the current compose to see the shared anchor + each host service's environment block, and rebase/coordinate with the in-flight docker-dev commits on master (keep edits additive).

Step 1 — Quiet logging. Add to the host services' environment (prefer the &otopcua-host anchor so all inherit; if envs are per-service, add to each of central-1/2, site-a-1/2, site-b-1/2):

      Logging__LogLevel__Microsoft.EntityFrameworkCore: "Warning"
      Logging__LogLevel__Microsoft.AspNetCore: "Warning"

(Keep Default/app loggers at their current level. This removes the per-poll SELECT FROM Deployment + Executed DbCommand flood that drove the heartbeat thread-starvation.)

Step 2 — Memory bounds. Add to each host service (sized with headroom; start from a measured steady-state — see Step 3):

    mem_limit: 1500m
    mem_reservation: 768m

(Use Compose v2 top-level mem_limit/mem_reservation keys, matching whatever resource style the file already uses; if it uses deploy.resources.limits, match that instead.)

Step 3 — Size + verify (operational). Bring up the full stack: cd docker-dev && docker compose up -d. Wait for the cluster to form, then docker stats --no-stream to read steady-state RSS per node; set mem_limit ≈ peak + ~30% headroom (adjust the Step-2 numbers). Confirm: no OOMKilled (docker inspect … --format '{{.State.OOMKilled}}' = false on all), :9200 healthy, :4840 listening, and the EF SQL flood is gone from docker compose logs central-1.

Step 4 — README. Add a short note to docker-dev/README.md: the full single-mesh stack needs ≈ <measured total> of Docker Desktop VM memory; on a constrained host, raise the VM memory or run fewer host services. Mention the EF log level is pinned to Warning in dev.

Step 5 — Commit. git add docker-dev/docker-compose.yml docker-dev/README.md && git commit -m "fix(docker-dev): pin EF/AspNetCore logs to Warning + per-service mem limits to stop OOM/starvation"


After all tasks

Run the affected test projects (Runtime.Tests, Configuration.Tests) + build the Host project, then use superpowers-extended-cc:finishing-a-development-branch: verify green, then present merge/PR/keep/discard for feat/otopcua-followons. Merge/push only on the user's explicit go (the user manages their own integration in this repo — coordinate, given concurrent docker-dev work on master).