19 KiB
OtOpcUa Follow-ons Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
Goal: Ship three independent hardening items from docs/plans/2026-06-07-otopcua-followons-design.md: (1) fix the DriverInstanceActor subscribe ActorContext race, (2) reject Tag↔VirtualTag NodeId collisions at deploy via a surgical validator gate, (3) make the docker-dev stack survive a constrained host.
Architecture: All three live in OtOpcUa, on branch feat/otopcua-followons (off master 4c221ce). They are independent (no cross-deps except T3→T2). #1 is a Runtime actor fix; #2 wires the (currently dormant) DraftValidator into the deploy handler but gates only on the new collision rule (other dormant rules run but don't block — avoids breaking the existing non-canonical company overlay); #3 is docker-dev config only.
Tech Stack: .NET 10, Akka.NET (TestKit), EF Core (OtOpcUaConfigDbContext), xUnit + Shouldly, Docker Compose.
Decisions locked (from brainstorming): #2 = surgical gate (reject only on EquipmentSignalNameCollision; build the DraftSnapshot + wire DraftValidator.Validate so it's future-ready, but filter the gate to the one code). #3 = quiet EF/AspNetCore logging + per-service mem_limit/mem_reservation + README (no lite profile).
Environment caveat: docker-dev is being actively reconfigured by the user (DB reset, single-mesh, fresh-volume bootstrap). Do not rely on live-DB state for tests — #2's tests are unit/TestKit with synthetic snapshots; #3 is verified by bringing the stack up. Coordinate #3's edits with the in-flight docker-dev commits on master (keep them additive).
Task graph
T1 (#1 subscribe race) ┐ independent, disjoint files
T2 (#2 snapshot field + rule) ┤ → T3 (#2 deploy gate)
T4 (#3 docker-dev config) ┘
T1, T2, T4 are mutually parallelizable. T3 depends on T2.
Task 1: Fix the DriverInstanceActor subscribe ActorContext race
Classification: high-risk Estimated implement time: ~5 min Parallelizable with: Task 2, Task 4
Files:
- Modify:
src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs(HandleSubscribeAsync~349-384,UnsubscribeAsync~386-403,HandleWriteAsync~315-345 — dropConfigureAwait(false)) - Test:
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs
Root cause (recap): HandleSubscribeAsync reads Sender (= Context.Sender) on line 362 after a conditional await UnsubscribeAsync().ConfigureAwait(false) on line 359. ConfigureAwait(false) resumes the continuation off Akka's ActorTaskScheduler, so the re-subscribe path reads Context.Sender on a thread-pool thread → no active ActorContext.
Step 1 — Write the failing test. In DriverInstanceActorTests.cs (read it first for the existing fake-ISubscribable driver + TestKit harness pattern; reuse the fake driver that drives the actor to the Connected state). Add a test that subscribes, then subscribes AGAIN (so _subscriptionHandle is not null → drives the line-359 unsubscribe-then-resubscribe path):
[Fact]
public void Resubscribe_does_not_throw_no_active_ActorContext()
{
var (actor, _) = CreateConnectedSubscribableDriver(); // mirror the existing helper
actor.Tell(new DriverInstanceActor.Subscribe(new[] { "ref.A" }, TimeSpan.FromSeconds(1)));
ExpectMsg<DriverInstanceActor.SubscriptionEstablished>();
// Second subscribe exercises the await-Unsubscribe-then-read-Sender path that threw.
actor.Tell(new DriverInstanceActor.Subscribe(new[] { "ref.B" }, TimeSpan.FromSeconds(1)));
ExpectMsg<DriverInstanceActor.SubscriptionEstablished>(TimeSpan.FromSeconds(5));
}
(Match the real message type names/namespaces from the actor. If the existing tests already have a "subscribe" test, clone its setup.)
Step 2 — Run, verify it fails. dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ --filter "FullyQualifiedName~DriverInstanceActor" — expect the new test to fail (the second subscribe throws no active ActorContext, so no SubscriptionEstablished arrives → ExpectMsg times out).
Step 3 — Implement. In HandleSubscribeAsync, move the Sender/Self capture to the top, before the ISubscribable check and the conditional unsubscribe await; use the captured locals everywhere after; and drop .ConfigureAwait(false) from the awaits inside this and the sibling ReceiveAsync handlers:
private async Task HandleSubscribeAsync(Subscribe msg)
{
var replyTo = Sender; // capture BEFORE any await (Context.Sender is invalid post-await)
var self = Self;
if (_driver is not ISubscribable subscribable)
{
replyTo.Tell(new SubscriptionFailed("Driver does not implement ISubscribable"));
return;
}
if (_subscriptionHandle is not null)
{
await UnsubscribeAsync(); // no ConfigureAwait(false) — resume on the actor context
}
try
{
_dataChangeHandler = (_, args) => self.Tell(new DataChangeForward(args.FullReference, args.Snapshot));
subscribable.OnDataChange += _dataChangeHandler;
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
_subscriptionHandle = await subscribable
.SubscribeAsync(msg.FullReferences, msg.PublishingInterval, cts.Token); // no ConfigureAwait(false)
replyTo.Tell(new SubscriptionEstablished(_subscriptionHandle.DiagnosticId, msg.FullReferences.Count));
_log.Info("DriverInstance {Id}: subscribed to {Count} refs ({Diag})",
_driverInstanceId, msg.FullReferences.Count, _subscriptionHandle.DiagnosticId);
}
catch (Exception ex)
{
DetachSubscription();
_log.Warning(ex, "DriverInstance {Id}: subscribe failed", _driverInstanceId);
replyTo.Tell(new SubscriptionFailed(ex.Message));
}
}
Also remove .ConfigureAwait(false) from the awaits in UnsubscribeAsync (line ~396) and HandleWriteAsync (line ~330) so every ReceiveAsync continuation resumes on the actor context. (HandleApplyDeltaAsync already captures Sender first and has no ConfigureAwait(false) — leave it.) Audit each handler: no raw Sender/Self/Context access past any await.
Step 4 — Run, verify it passes. Same filter → green. Then run the broader Runtime driver slice: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ --filter "FullyQualifiedName~DriverInstance|FullyQualifiedName~DriverHost" — no regression.
Step 5 — Commit. git add … && git commit -m "fix(runtime): capture Sender before await in DriverInstanceActor subscribe (no-ActorContext race)"
Task 2: Add VirtualTags to DraftSnapshot + the collision rule
Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 1, Task 4
Files:
- Modify:
src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftSnapshot.cs(addVirtualTags) - Modify:
src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs(add rule + wire intoValidate) - Test:
tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/DraftValidatorTests.cs
Step 1 — Write failing tests. In DraftValidatorTests.cs (read it for the snapshot-builder helper the existing tests use — they construct DraftSnapshot via object initializer with the entities). Add:
[Fact]
public void Tag_and_VirtualTag_same_equipment_and_name_collide()
{
var draft = MakeDraft(/* one Equipment eq-1 */) with-ish initializer:
Tags = [ new Tag { TagId="t1", EquipmentId="eq-1", Name="speed", DataType="Float64", FolderPath=null, /*required fields*/ } ],
VirtualTags = [ new VirtualTag { VirtualTagId="v1", EquipmentId="eq-1", Name="speed", DataType="Float64", ScriptId="s1" } ];
DraftValidator.Validate(draft).ShouldContain(e => e.Code == "EquipmentSignalNameCollision");
}
[Fact]
public void Same_name_different_folder_does_not_collide()
{
// Tag under FolderPath "metrics" vs VirtualTag at equipment root → different NodeId → OK
... Tags=[Tag{EquipmentId="eq-1",Name="speed",FolderPath="metrics"}], VirtualTags=[VirtualTag{EquipmentId="eq-1",Name="speed"}]
DraftValidator.Validate(draft).ShouldNotContain(e => e.Code == "EquipmentSignalNameCollision");
}
[Fact]
public void SystemPlatform_tag_sharing_name_with_equipment_vtag_is_excluded()
{
// Tag with EquipmentId == null (galaxy/SystemPlatform) never collides with an equipment vtag
... Tags=[Tag{EquipmentId=null,Name="speed"}], VirtualTags=[VirtualTag{EquipmentId="eq-1",Name="speed"}]
DraftValidator.Validate(draft).ShouldNotContain(e => e.Code == "EquipmentSignalNameCollision");
}
(Fill in every required/NOT-NULL field on Tag/VirtualTag/Equipment — read the entity classes src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{Tag,VirtualTag,Equipment}.cs. Mirror how existing tests build entities.)
Step 2 — Run, verify fail (compile error: VirtualTags not on DraftSnapshot; then rule missing).
dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/ --filter "FullyQualifiedName~DraftValidator"
Step 3 — Implement.
In DraftSnapshot.cs, after the Tags member:
/// <summary>Equipment-bound VirtualTags (script-derived signals). Used by the NodeId-collision rule.</summary>
public IReadOnlyList<VirtualTag> VirtualTags { get; init; } = [];
In DraftValidator.cs, add the rule and call it from Validate:
private static void ValidateNoEquipmentSignalNameCollision(DraftSnapshot draft, List<ValidationError> errors)
{
// The materialiser keys equipment signal variables on the folder-scoped NodeId
// "{EquipmentId}[/{FolderPath}]/{Name}". Tag (EquipmentId != null) and VirtualTag share that
// node space with no DB-level cross-table uniqueness, so the same key from both collides.
static string Key(string eq, string? folder, string name) =>
string.IsNullOrWhiteSpace(folder) ? $"{eq}/{name}" : $"{eq}/{folder}/{name}";
var signals = draft.Tags
.Where(t => t.EquipmentId is not null)
.Select(t => (Key: Key(t.EquipmentId!, t.FolderPath, t.Name), Eq: t.EquipmentId!, t.Name))
.Concat(draft.VirtualTags
.Select(v => (Key: Key(v.EquipmentId, null, v.Name), Eq: v.EquipmentId, v.Name)));
foreach (var g in signals.GroupBy(s => s.Key, StringComparer.Ordinal).Where(g => g.Count() > 1))
{
var f = g.First();
errors.Add(new("EquipmentSignalNameCollision",
$"{g.Count()} signals collide on OPC UA NodeId '{g.Key}' (equipment '{f.Eq}', name '{f.Name}'); " +
"a Name must be unique across Tag and VirtualTag within an equipment+folder",
f.Eq));
}
}
Add ValidateNoEquipmentSignalNameCollision(draft, errors); to the Validate method's rule list (around line 33). Confirm the VirtualTag entity exposes EquipmentId, Name (it does); it has no FolderPath (so vtags use null folder).
Step 4 — Run, verify pass. Same filter → green; then the whole Configuration.Tests suite to confirm no regression.
Step 5 — Commit. git add … && git commit -m "feat(config): DraftValidator rule + DraftSnapshot.VirtualTags for Tag/VirtualTag NodeId collisions"
Task 3: Build DraftSnapshot from the DB + wire the surgical deploy gate
Classification: high-risk Estimated implement time: ~5 min Parallelizable with: none (depends on Task 2)
Files:
- Create:
src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftSnapshotFactory.cs - Modify:
src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs(HandleStartDeploymentAsync, before theConfigComposer.SnapshotAndFlattenAsyncseal) - Read first:
AdminOperationsActor.cs(thedbin scope + theStartDeploymentOutcomeenum incl.Rejected+ theStartDeploymentResultshape), andsrc/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Api/DeployApiEndpoints.cs(confirmRejectedmaps to a non-2xx HTTP — if not, map it, e.g. 422). - Test:
tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/DraftSnapshotFactoryTests.cs(new) + extend anAdminOperationsActortest if one exists (TestKit) to assert a colliding config is Rejected.
Design: Build a DraftSnapshot from the live config DB and run the FULL DraftValidator.Validate, but gate the deploy only on EquipmentSignalNameCollision — other (dormant, possibly-failing) rules run but do not block. This wires the validator for future activation without breaking the non-canonical company overlay.
Step 1 — Write failing tests.
DraftSnapshotFactoryTests.cs: using an in-memory OtOpcUaConfigDbContext (mirror how other Configuration.Tests build one) seeded with one Equipment + a Tag and a VirtualTag sharing (EquipmentId, Name), assert DraftSnapshotFactory.FromConfigDbAsync(db) returns a snapshot whose Tags and VirtualTags are populated, and DraftValidator.Validate(snapshot) contains EquipmentSignalNameCollision.
If AdminOperationsActor has a TestKit test harness, add: seed a colliding config → StartDeployment → StartDeploymentResult outcome is Rejected with the collision message; seed a non-colliding config → Accepted (sealed).
Step 2 — Run, verify fail.
Step 3 — Implement.
DraftSnapshotFactory.cs — load the tables the rules touch (so they don't NRE), gate-relevant ones fully populated:
public static class DraftSnapshotFactory
{
public static async Task<DraftSnapshot> FromConfigDbAsync(OtOpcUaConfigDbContext db, CancellationToken ct = default)
=> new DraftSnapshot
{
GenerationId = 0, // generation model dropped; placeholder (no rule reads it)
ClusterId = string.Empty, // global snapshot; rules compare entity ClusterId fields, not this
Namespaces = await db.Namespaces.AsNoTracking().ToListAsync(ct),
DriverInstances = await db.DriverInstances.AsNoTracking().ToListAsync(ct),
Devices = await db.Devices.AsNoTracking().ToListAsync(ct),
UnsAreas = await db.UnsAreas.AsNoTracking().ToListAsync(ct),
UnsLines = await db.UnsLines.AsNoTracking().ToListAsync(ct),
Equipment = await db.Equipment.AsNoTracking().ToListAsync(ct),
Tags = await db.Tags.AsNoTracking().ToListAsync(ct),
VirtualTags = await db.VirtualTags.AsNoTracking().ToListAsync(ct),
PollGroups = await db.PollGroups.AsNoTracking().ToListAsync(ct),
PriorEquipment = [], // no prior generation in the live model (UUID-immutability rule no-ops)
ActiveReservations= await db.ExternalIdReservations.AsNoTracking().ToListAsync(ct),
// Enterprise/Site left null — only PathLength reads them (under-counts; acceptable, not gated here)
};
}
(Confirm the DbSet names: db.ExternalIdReservations etc. — read OtOpcUaConfigDbContext. Adjust if different.)
In AdminOperationsActor.HandleStartDeploymentAsync, immediately before ConfigComposer.SnapshotAndFlattenAsync(db):
var draft = await DraftSnapshotFactory.FromConfigDbAsync(db);
var collisions = DraftValidator.Validate(draft)
.Where(e => e.Code == "EquipmentSignalNameCollision")
.ToList();
if (collisions.Count > 0)
{
var summary = string.Join("; ", collisions.Select(e => e.Message));
_log.Warning("StartDeployment rejected: {Summary}", summary);
replyTo.Tell(new StartDeploymentResult(StartDeploymentOutcome.Rejected, /* id */ null, summary /* match the result shape */));
return;
}
(Match the real StartDeploymentResult constructor + how replyTo/the failure path is shaped at lines ~108-119.) If DeployApiEndpoints doesn't already map Rejected to a non-2xx, map it to 422 Unprocessable Entity with the message.
Step 4 — Run, verify pass. Configuration.Tests + the AdminOperations slice green. Build the Host project to confirm the endpoint mapping compiles.
Step 5 — Commit. git add … && git commit -m "feat(deploy): reject Tag/VirtualTag NodeId collisions at deploy (surgical DraftValidator gate)"
Task 4: docker-dev resource hardening (logging + memory + README)
Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 1, Task 2
Files:
- Modify:
docker-dev/docker-compose.yml(the host-service env + per-service limits; use the&otopcua-hostanchor / per-service blocks) - Modify:
docker-dev/README.md(memory note) - Read first: the current compose to see the shared anchor + each host service's
environmentblock, and rebase/coordinate with the in-flight docker-dev commits on master (keep edits additive).
Step 1 — Quiet logging. Add to the host services' environment (prefer the &otopcua-host anchor so all inherit; if envs are per-service, add to each of central-1/2, site-a-1/2, site-b-1/2):
Logging__LogLevel__Microsoft.EntityFrameworkCore: "Warning"
Logging__LogLevel__Microsoft.AspNetCore: "Warning"
(Keep Default/app loggers at their current level. This removes the per-poll SELECT FROM Deployment + Executed DbCommand flood that drove the heartbeat thread-starvation.)
Step 2 — Memory bounds. Add to each host service (sized with headroom; start from a measured steady-state — see Step 3):
mem_limit: 1500m
mem_reservation: 768m
(Use Compose v2 top-level mem_limit/mem_reservation keys, matching whatever resource style the file already uses; if it uses deploy.resources.limits, match that instead.)
Step 3 — Size + verify (operational). Bring up the full stack: cd docker-dev && docker compose up -d. Wait for the cluster to form, then docker stats --no-stream to read steady-state RSS per node; set mem_limit ≈ peak + ~30% headroom (adjust the Step-2 numbers). Confirm: no OOMKilled (docker inspect … --format '{{.State.OOMKilled}}' = false on all), :9200 healthy, :4840 listening, and the EF SQL flood is gone from docker compose logs central-1.
Step 4 — README. Add a short note to docker-dev/README.md: the full single-mesh stack needs ≈ <measured total> of Docker Desktop VM memory; on a constrained host, raise the VM memory or run fewer host services. Mention the EF log level is pinned to Warning in dev.
Step 5 — Commit. git add docker-dev/docker-compose.yml docker-dev/README.md && git commit -m "fix(docker-dev): pin EF/AspNetCore logs to Warning + per-service mem limits to stop OOM/starvation"
After all tasks
Run the affected test projects (Runtime.Tests, Configuration.Tests) + build the Host project, then use superpowers-extended-cc:finishing-a-development-branch: verify green, then present merge/PR/keep/discard for feat/otopcua-followons. Merge/push only on the user's explicit go (the user manages their own integration in this repo — coordinate, given concurrent docker-dev work on master).