diff --git a/docs/plans/2026-06-07-otopcua-followons.md b/docs/plans/2026-06-07-otopcua-followons.md new file mode 100644 index 00000000..7055bc55 --- /dev/null +++ b/docs/plans/2026-06-07-otopcua-followons.md @@ -0,0 +1,286 @@ +# OtOpcUa Follow-ons Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task. + +**Goal:** Ship three independent hardening items from `docs/plans/2026-06-07-otopcua-followons-design.md`: (1) fix the `DriverInstanceActor` subscribe ActorContext race, (2) reject Tag↔VirtualTag NodeId collisions at deploy via a surgical validator gate, (3) make the docker-dev stack survive a constrained host. + +**Architecture:** All three live in OtOpcUa, on branch `feat/otopcua-followons` (off `master` `4c221ce`). They are independent (no cross-deps except T3→T2). #1 is a Runtime actor fix; #2 wires the (currently dormant) `DraftValidator` into the deploy handler but gates *only* on the new collision rule (other dormant rules run but don't block — avoids breaking the existing non-canonical company overlay); #3 is docker-dev config only. + +**Tech Stack:** .NET 10, Akka.NET (TestKit), EF Core (`OtOpcUaConfigDbContext`), xUnit + Shouldly, Docker Compose. + +**Decisions locked (from brainstorming):** #2 = **surgical gate** (reject only on `EquipmentSignalNameCollision`; build the `DraftSnapshot` + wire `DraftValidator.Validate` so it's future-ready, but filter the gate to the one code). #3 = quiet EF/AspNetCore logging + per-service `mem_limit`/`mem_reservation` + README (no lite profile). + +**Environment caveat:** docker-dev is being actively reconfigured by the user (DB reset, single-mesh, fresh-volume bootstrap). Do **not** rely on live-DB state for tests — #2's tests are unit/TestKit with synthetic snapshots; #3 is verified by bringing the stack up. Coordinate #3's edits with the in-flight docker-dev commits on master (keep them additive). + +--- + +## Task graph + +``` +T1 (#1 subscribe race) ┐ independent, disjoint files +T2 (#2 snapshot field + rule) ┤ → T3 (#2 deploy gate) +T4 (#3 docker-dev config) ┘ +``` +T1, T2, T4 are mutually parallelizable. T3 depends on T2. + +--- + +### Task 1: Fix the `DriverInstanceActor` subscribe ActorContext race + +**Classification:** high-risk +**Estimated implement time:** ~5 min +**Parallelizable with:** Task 2, Task 4 + +**Files:** +- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` (`HandleSubscribeAsync` ~349-384, `UnsubscribeAsync` ~386-403, `HandleWriteAsync` ~315-345 — drop `ConfigureAwait(false)`) +- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorTests.cs` + +**Root cause (recap):** `HandleSubscribeAsync` reads `Sender` (= `Context.Sender`) on line 362 *after* a conditional `await UnsubscribeAsync().ConfigureAwait(false)` on line 359. `ConfigureAwait(false)` resumes the continuation off Akka's `ActorTaskScheduler`, so the re-subscribe path reads `Context.Sender` on a thread-pool thread → `no active ActorContext`. + +**Step 1 — Write the failing test.** In `DriverInstanceActorTests.cs` (read it first for the existing fake-`ISubscribable` driver + TestKit harness pattern; reuse the fake driver that drives the actor to the `Connected` state). Add a test that subscribes, then subscribes AGAIN (so `_subscriptionHandle is not null` → drives the line-359 unsubscribe-then-resubscribe path): +```csharp +[Fact] +public void Resubscribe_does_not_throw_no_active_ActorContext() +{ + var (actor, _) = CreateConnectedSubscribableDriver(); // mirror the existing helper + actor.Tell(new DriverInstanceActor.Subscribe(new[] { "ref.A" }, TimeSpan.FromSeconds(1))); + ExpectMsg(); + // Second subscribe exercises the await-Unsubscribe-then-read-Sender path that threw. + actor.Tell(new DriverInstanceActor.Subscribe(new[] { "ref.B" }, TimeSpan.FromSeconds(1))); + ExpectMsg(TimeSpan.FromSeconds(5)); +} +``` +(Match the real message type names/namespaces from the actor. If the existing tests already have a "subscribe" test, clone its setup.) + +**Step 2 — Run, verify it fails.** `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ --filter "FullyQualifiedName~DriverInstanceActor"` — expect the new test to fail (the second subscribe throws `no active ActorContext`, so no `SubscriptionEstablished` arrives → `ExpectMsg` times out). + +**Step 3 — Implement.** In `HandleSubscribeAsync`, move the `Sender`/`Self` capture to the **top**, before the `ISubscribable` check and the conditional unsubscribe await; use the captured locals everywhere after; and drop `.ConfigureAwait(false)` from the awaits inside this and the sibling `ReceiveAsync` handlers: +```csharp +private async Task HandleSubscribeAsync(Subscribe msg) +{ + var replyTo = Sender; // capture BEFORE any await (Context.Sender is invalid post-await) + var self = Self; + if (_driver is not ISubscribable subscribable) + { + replyTo.Tell(new SubscriptionFailed("Driver does not implement ISubscribable")); + return; + } + if (_subscriptionHandle is not null) + { + await UnsubscribeAsync(); // no ConfigureAwait(false) — resume on the actor context + } + try + { + _dataChangeHandler = (_, args) => self.Tell(new DataChangeForward(args.FullReference, args.Snapshot)); + subscribable.OnDataChange += _dataChangeHandler; + using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10)); + _subscriptionHandle = await subscribable + .SubscribeAsync(msg.FullReferences, msg.PublishingInterval, cts.Token); // no ConfigureAwait(false) + replyTo.Tell(new SubscriptionEstablished(_subscriptionHandle.DiagnosticId, msg.FullReferences.Count)); + _log.Info("DriverInstance {Id}: subscribed to {Count} refs ({Diag})", + _driverInstanceId, msg.FullReferences.Count, _subscriptionHandle.DiagnosticId); + } + catch (Exception ex) + { + DetachSubscription(); + _log.Warning(ex, "DriverInstance {Id}: subscribe failed", _driverInstanceId); + replyTo.Tell(new SubscriptionFailed(ex.Message)); + } +} +``` +Also remove `.ConfigureAwait(false)` from the awaits in `UnsubscribeAsync` (line ~396) and `HandleWriteAsync` (line ~330) so every `ReceiveAsync` continuation resumes on the actor context. (`HandleApplyDeltaAsync` already captures `Sender` first and has no `ConfigureAwait(false)` — leave it.) Audit each handler: no raw `Sender`/`Self`/`Context` access past any await. + +**Step 4 — Run, verify it passes.** Same filter → green. Then run the broader Runtime driver slice: `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ --filter "FullyQualifiedName~DriverInstance|FullyQualifiedName~DriverHost"` — no regression. + +**Step 5 — Commit.** `git add … && git commit -m "fix(runtime): capture Sender before await in DriverInstanceActor subscribe (no-ActorContext race)"` + +--- + +### Task 2: Add `VirtualTags` to `DraftSnapshot` + the collision rule + +**Classification:** standard +**Estimated implement time:** ~5 min +**Parallelizable with:** Task 1, Task 4 + +**Files:** +- Modify: `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftSnapshot.cs` (add `VirtualTags`) +- Modify: `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs` (add rule + wire into `Validate`) +- Test: `tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/DraftValidatorTests.cs` + +**Step 1 — Write failing tests.** In `DraftValidatorTests.cs` (read it for the snapshot-builder helper the existing tests use — they construct `DraftSnapshot` via object initializer with the entities). Add: +```csharp +[Fact] +public void Tag_and_VirtualTag_same_equipment_and_name_collide() +{ + var draft = MakeDraft(/* one Equipment eq-1 */) with-ish initializer: + Tags = [ new Tag { TagId="t1", EquipmentId="eq-1", Name="speed", DataType="Float64", FolderPath=null, /*required fields*/ } ], + VirtualTags = [ new VirtualTag { VirtualTagId="v1", EquipmentId="eq-1", Name="speed", DataType="Float64", ScriptId="s1" } ]; + DraftValidator.Validate(draft).ShouldContain(e => e.Code == "EquipmentSignalNameCollision"); +} + +[Fact] +public void Same_name_different_folder_does_not_collide() +{ + // Tag under FolderPath "metrics" vs VirtualTag at equipment root → different NodeId → OK + ... Tags=[Tag{EquipmentId="eq-1",Name="speed",FolderPath="metrics"}], VirtualTags=[VirtualTag{EquipmentId="eq-1",Name="speed"}] + DraftValidator.Validate(draft).ShouldNotContain(e => e.Code == "EquipmentSignalNameCollision"); +} + +[Fact] +public void SystemPlatform_tag_sharing_name_with_equipment_vtag_is_excluded() +{ + // Tag with EquipmentId == null (galaxy/SystemPlatform) never collides with an equipment vtag + ... Tags=[Tag{EquipmentId=null,Name="speed"}], VirtualTags=[VirtualTag{EquipmentId="eq-1",Name="speed"}] + DraftValidator.Validate(draft).ShouldNotContain(e => e.Code == "EquipmentSignalNameCollision"); +} +``` +(Fill in every `required`/NOT-NULL field on `Tag`/`VirtualTag`/`Equipment` — read the entity classes `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{Tag,VirtualTag,Equipment}.cs`. Mirror how existing tests build entities.) + +**Step 2 — Run, verify fail** (compile error: `VirtualTags` not on `DraftSnapshot`; then rule missing). +`dotnet test tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/ --filter "FullyQualifiedName~DraftValidator"` + +**Step 3 — Implement.** +In `DraftSnapshot.cs`, after the `Tags` member: +```csharp +/// Equipment-bound VirtualTags (script-derived signals). Used by the NodeId-collision rule. +public IReadOnlyList VirtualTags { get; init; } = []; +``` +In `DraftValidator.cs`, add the rule and call it from `Validate`: +```csharp +private static void ValidateNoEquipmentSignalNameCollision(DraftSnapshot draft, List errors) +{ + // The materialiser keys equipment signal variables on the folder-scoped NodeId + // "{EquipmentId}[/{FolderPath}]/{Name}". Tag (EquipmentId != null) and VirtualTag share that + // node space with no DB-level cross-table uniqueness, so the same key from both collides. + static string Key(string eq, string? folder, string name) => + string.IsNullOrWhiteSpace(folder) ? $"{eq}/{name}" : $"{eq}/{folder}/{name}"; + + var signals = draft.Tags + .Where(t => t.EquipmentId is not null) + .Select(t => (Key: Key(t.EquipmentId!, t.FolderPath, t.Name), Eq: t.EquipmentId!, t.Name)) + .Concat(draft.VirtualTags + .Select(v => (Key: Key(v.EquipmentId, null, v.Name), Eq: v.EquipmentId, v.Name))); + + foreach (var g in signals.GroupBy(s => s.Key, StringComparer.Ordinal).Where(g => g.Count() > 1)) + { + var f = g.First(); + errors.Add(new("EquipmentSignalNameCollision", + $"{g.Count()} signals collide on OPC UA NodeId '{g.Key}' (equipment '{f.Eq}', name '{f.Name}'); " + + "a Name must be unique across Tag and VirtualTag within an equipment+folder", + f.Eq)); + } +} +``` +Add `ValidateNoEquipmentSignalNameCollision(draft, errors);` to the `Validate` method's rule list (around line 33). Confirm the `VirtualTag` entity exposes `EquipmentId`, `Name` (it does); it has no `FolderPath` (so vtags use `null` folder). + +**Step 4 — Run, verify pass.** Same filter → green; then the whole Configuration.Tests suite to confirm no regression. + +**Step 5 — Commit.** `git add … && git commit -m "feat(config): DraftValidator rule + DraftSnapshot.VirtualTags for Tag/VirtualTag NodeId collisions"` + +--- + +### Task 3: Build `DraftSnapshot` from the DB + wire the surgical deploy gate + +**Classification:** high-risk +**Estimated implement time:** ~5 min +**Parallelizable with:** none (depends on Task 2) + +**Files:** +- Create: `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftSnapshotFactory.cs` +- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs` (`HandleStartDeploymentAsync`, before the `ConfigComposer.SnapshotAndFlattenAsync` seal) +- Read first: `AdminOperationsActor.cs` (the `db` in scope + the `StartDeploymentOutcome` enum incl. `Rejected` + the `StartDeploymentResult` shape), and `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Api/DeployApiEndpoints.cs` (confirm `Rejected` maps to a non-2xx HTTP — if not, map it, e.g. 422). +- Test: `tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/DraftSnapshotFactoryTests.cs` (new) + extend an `AdminOperationsActor` test if one exists (TestKit) to assert a colliding config is Rejected. + +**Design:** Build a `DraftSnapshot` from the live config DB and run the FULL `DraftValidator.Validate`, but **gate the deploy only on `EquipmentSignalNameCollision`** — other (dormant, possibly-failing) rules run but do not block. This wires the validator for future activation without breaking the non-canonical company overlay. + +**Step 1 — Write failing tests.** +`DraftSnapshotFactoryTests.cs`: using an in-memory `OtOpcUaConfigDbContext` (mirror how other Configuration.Tests build one) seeded with one Equipment + a Tag and a VirtualTag sharing (EquipmentId, Name), assert `DraftSnapshotFactory.FromConfigDbAsync(db)` returns a snapshot whose `Tags` and `VirtualTags` are populated, and `DraftValidator.Validate(snapshot)` contains `EquipmentSignalNameCollision`. +If `AdminOperationsActor` has a TestKit test harness, add: seed a colliding config → `StartDeployment` → `StartDeploymentResult` outcome is `Rejected` with the collision message; seed a non-colliding config → `Accepted` (sealed). + +**Step 2 — Run, verify fail.** + +**Step 3 — Implement.** +`DraftSnapshotFactory.cs` — load the tables the rules touch (so they don't NRE), gate-relevant ones fully populated: +```csharp +public static class DraftSnapshotFactory +{ + public static async Task FromConfigDbAsync(OtOpcUaConfigDbContext db, CancellationToken ct = default) + => new DraftSnapshot + { + GenerationId = 0, // generation model dropped; placeholder (no rule reads it) + ClusterId = string.Empty, // global snapshot; rules compare entity ClusterId fields, not this + Namespaces = await db.Namespaces.AsNoTracking().ToListAsync(ct), + DriverInstances = await db.DriverInstances.AsNoTracking().ToListAsync(ct), + Devices = await db.Devices.AsNoTracking().ToListAsync(ct), + UnsAreas = await db.UnsAreas.AsNoTracking().ToListAsync(ct), + UnsLines = await db.UnsLines.AsNoTracking().ToListAsync(ct), + Equipment = await db.Equipment.AsNoTracking().ToListAsync(ct), + Tags = await db.Tags.AsNoTracking().ToListAsync(ct), + VirtualTags = await db.VirtualTags.AsNoTracking().ToListAsync(ct), + PollGroups = await db.PollGroups.AsNoTracking().ToListAsync(ct), + PriorEquipment = [], // no prior generation in the live model (UUID-immutability rule no-ops) + ActiveReservations= await db.ExternalIdReservations.AsNoTracking().ToListAsync(ct), + // Enterprise/Site left null — only PathLength reads them (under-counts; acceptable, not gated here) + }; +} +``` +(Confirm the DbSet names: `db.ExternalIdReservations` etc. — read `OtOpcUaConfigDbContext`. Adjust if different.) + +In `AdminOperationsActor.HandleStartDeploymentAsync`, immediately before `ConfigComposer.SnapshotAndFlattenAsync(db)`: +```csharp +var draft = await DraftSnapshotFactory.FromConfigDbAsync(db); +var collisions = DraftValidator.Validate(draft) + .Where(e => e.Code == "EquipmentSignalNameCollision") + .ToList(); +if (collisions.Count > 0) +{ + var summary = string.Join("; ", collisions.Select(e => e.Message)); + _log.Warning("StartDeployment rejected: {Summary}", summary); + replyTo.Tell(new StartDeploymentResult(StartDeploymentOutcome.Rejected, /* id */ null, summary /* match the result shape */)); + return; +} +``` +(Match the real `StartDeploymentResult` constructor + how `replyTo`/the failure path is shaped at lines ~108-119.) If `DeployApiEndpoints` doesn't already map `Rejected` to a non-2xx, map it to **422 Unprocessable Entity** with the message. + +**Step 4 — Run, verify pass.** Configuration.Tests + the AdminOperations slice green. Build the Host project to confirm the endpoint mapping compiles. + +**Step 5 — Commit.** `git add … && git commit -m "feat(deploy): reject Tag/VirtualTag NodeId collisions at deploy (surgical DraftValidator gate)"` + +--- + +### Task 4: docker-dev resource hardening (logging + memory + README) + +**Classification:** standard +**Estimated implement time:** ~5 min +**Parallelizable with:** Task 1, Task 2 + +**Files:** +- Modify: `docker-dev/docker-compose.yml` (the host-service env + per-service limits; use the `&otopcua-host` anchor / per-service blocks) +- Modify: `docker-dev/README.md` (memory note) +- Read first: the current compose to see the shared anchor + each host service's `environment` block, and **rebase/coordinate** with the in-flight docker-dev commits on master (keep edits additive). + +**Step 1 — Quiet logging.** Add to the host services' `environment` (prefer the `&otopcua-host` anchor so all inherit; if envs are per-service, add to each of central-1/2, site-a-1/2, site-b-1/2): +```yaml + Logging__LogLevel__Microsoft.EntityFrameworkCore: "Warning" + Logging__LogLevel__Microsoft.AspNetCore: "Warning" +``` +(Keep `Default`/app loggers at their current level. This removes the per-poll `SELECT FROM Deployment` + `Executed DbCommand` flood that drove the heartbeat thread-starvation.) + +**Step 2 — Memory bounds.** Add to each host service (sized with headroom; start from a measured steady-state — see Step 3): +```yaml + mem_limit: 1500m + mem_reservation: 768m +``` +(Use Compose v2 top-level `mem_limit`/`mem_reservation` keys, matching whatever resource style the file already uses; if it uses `deploy.resources.limits`, match that instead.) + +**Step 3 — Size + verify (operational).** Bring up the full stack: `cd docker-dev && docker compose up -d`. Wait for the cluster to form, then `docker stats --no-stream` to read steady-state RSS per node; set `mem_limit` ≈ peak + ~30% headroom (adjust the Step-2 numbers). Confirm: no `OOMKilled` (`docker inspect … --format '{{.State.OOMKilled}}'` = false on all), `:9200` healthy, `:4840` listening, and the EF SQL flood is gone from `docker compose logs central-1`. + +**Step 4 — README.** Add a short note to `docker-dev/README.md`: the full single-mesh stack needs ≈ `` of Docker Desktop VM memory; on a constrained host, raise the VM memory or run fewer host services. Mention the EF log level is pinned to Warning in dev. + +**Step 5 — Commit.** `git add docker-dev/docker-compose.yml docker-dev/README.md && git commit -m "fix(docker-dev): pin EF/AspNetCore logs to Warning + per-service mem limits to stop OOM/starvation"` + +--- + +## After all tasks + +Run the affected test projects (`Runtime.Tests`, `Configuration.Tests`) + build the Host project, then use **superpowers-extended-cc:finishing-a-development-branch**: verify green, then present merge/PR/keep/discard for `feat/otopcua-followons`. Merge/push only on the user's explicit go (the user manages their own integration in this repo — coordinate, given concurrent docker-dev work on master). diff --git a/docs/plans/2026-06-07-otopcua-followons.md.tasks.json b/docs/plans/2026-06-07-otopcua-followons.md.tasks.json new file mode 100644 index 00000000..1805f500 --- /dev/null +++ b/docs/plans/2026-06-07-otopcua-followons.md.tasks.json @@ -0,0 +1,10 @@ +{ + "planPath": "docs/plans/2026-06-07-otopcua-followons.md", + "tasks": [ + {"id": 1, "subject": "Task 1: Fix DriverInstanceActor subscribe ActorContext race", "status": "pending", "classification": "high-risk", "parallelizableWith": [2, 4]}, + {"id": 2, "subject": "Task 2: DraftSnapshot.VirtualTags + EquipmentSignalNameCollision rule", "status": "pending", "classification": "standard", "parallelizableWith": [1, 4]}, + {"id": 3, "subject": "Task 3: DraftSnapshotFactory + surgical deploy gate", "status": "pending", "classification": "high-risk", "blockedBy": [2]}, + {"id": 4, "subject": "Task 4: docker-dev logging + mem limits + README", "status": "pending", "classification": "standard", "parallelizableWith": [1, 2]} + ], + "lastUpdated": "2026-06-07" +}