diff --git a/docs/plans/2026-06-14-harden-milestone-1b-plan.md b/docs/plans/2026-06-14-harden-milestone-1b-plan.md new file mode 100644 index 00000000..5d23a7fa --- /dev/null +++ b/docs/plans/2026-06-14-harden-milestone-1b-plan.md @@ -0,0 +1,557 @@ +# Harden Milestone 1b — Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task. + +**Goal:** Close out the review nits left from the just-shipped Milestone 1b (equipment-tag live values + inbound operator-write pipeline) — fast-fail degraded-state writes, two micro-optimisations, a FOCAS equipment-tag parse-cache gap, a benign Galaxy dead-letter, a Galaxy reconnect stale-handle gap, two test-hardening additions, and one deployment-facing docs note. + +**Architecture:** Pure hardening — no new feature surface, **no EF/Configuration schema change, no migration**. Changes are additive fast-fail handlers on existing actor behaviours, a thread-safe cache swap, a writer cache-invalidation seam on the Galaxy reconnect path, test additions, and a docs subsection. Source of these items: `pending.md` follow-ups #3, #4, #6 (from the Milestone 1b integration review), all enumerated below. + +**Tech Stack:** C# / .NET 10, Akka.NET (DriverHostActor / DriverInstanceActor), xUnit + Shouldly + Akka.TestKit. `TreatWarningsAsErrors` is on solution-wide. + +**Branch:** `chore/harden-milestone-1b` off `master` (HEAD `02e37a6d`). + +**Hard rules (every task):** stage by path (never `git add .`); never stage `sql_login.txt` / `src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/` / `pending.md` / `current.md` / `docker-dev/docker-compose.yml`; never echo secrets; no force-push / no `--no-verify`; **NO Configuration entity / EF migration change**. Done = build clean + `dotnet test` green for every touched project. + +--- + +## Task DAG + +``` +T0 (branch) + ├─ T1 DriverHostActor: Stale fast-fail + micro-opts + raw-blob routing test [standard] + ├─ T2 DriverInstanceActor: Connecting/Reconnecting/Stubbed write fast-fail + │ + benign SubscriptionEstablished handler [standard] + ├─ T3 FOCAS: cache equipment-tag parsed addresses (ConcurrentDictionary) [small] + ├─ T4 Parity test: future-enum trap InlineData(2,false) [trivial] + ├─ T5 docs/security.md: data-plane GroupToRole requirement note [trivial] + └─ T6 Galaxy writer: invalidate handle caches on reconnect [small] +``` + +T1–T6 all `blockedBy` T0 and have **disjoint file sets** — they are mutually `Parallelizable`. Dispatch them concurrently after T0. + +--- + +### Task T0: Create feature branch + +**Classification:** trivial +**Estimated implement time:** ~1 min +**Parallelizable with:** none (gates all) + +**Files:** none (git only) + +**Step 1: Branch off master** + +```bash +git -C /Users/dohertj2/Desktop/OtOpcUa checkout -b chore/harden-milestone-1b master +git -C /Users/dohertj2/Desktop/OtOpcUa rev-parse --abbrev-ref HEAD # expect: chore/harden-milestone-1b +``` + +Do NOT touch the working-tree `docker-dev/docker-compose.yml` (leave it modified+unstaged throughout). + +--- + +### Task T1: DriverHostActor — Stale-state write fast-fail + micro-opts + raw-blob routing test + +**Classification:** standard +**Estimated implement time:** ~5 min +**Parallelizable with:** T2, T3, T4, T5, T6 + +Addresses pending.md #4(a-host), #4(b), #4(d). + +**Files:** +- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs` (Stale behaviour ~537-550; `HandleRouteNodeWrite` ContinueWith line 519; forward-map build line 754-756) +- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorWriteRoutingTests.cs` + +**Context:** `RouteNodeWrite`/`NodeWriteResult` are defined at `DriverHostActor.cs:130,136`. The running behaviour wires `HandleRouteNodeWrite`, but the **`Stale()`** behaviour (DB unreachable; lines 537-550) has no `RouteNodeWrite` handler → an inbound operator write while Stale dead-letters and the node-manager's 10 s Ask times out with a generic log (the client already got the optimistic `Good`). Add a fast-fail. Two micro-opts ride along in the same file. + +**Step 1: Add the failing Akka.TestKit test for Stale fast-fail** + +In `DriverHostActorWriteRoutingTests.cs`, add a test that drives the host into `Stale` (construct it with a DB factory that throws, or send whatever the existing tests use to reach Stale — mirror an existing Stale-path test in the file/suite) and asserts an immediate negative `NodeWriteResult`: + +```csharp +[Fact] +public void RouteNodeWrite_while_stale_fast_fails_instead_of_timing_out() +{ + // Arrange: a host that cannot reach the config DB enters Stale on first DispatchDeployment. + var host = /* spawn DriverHostActor with a throwing IDbContextFactory — reuse the suite's Stale helper */; + DriveIntoStale(host); + + // Act + host.Tell(new DriverHostActor.RouteNodeWrite("ns=2;s=EQ-x/Tag", 42), TestActor); + var reply = ExpectMsg(TimeSpan.FromSeconds(2)); + + // Assert: fast negative, not a ~10s timeout + reply.Success.ShouldBeFalse(); + reply.Reason.ShouldContain("stale"); +} +``` + +If the suite has no existing Stale helper, reach Stale the same way the existing "ignoring DispatchDeployment while Stale" coverage does; if none exists, construct the actor with a `IDbContextFactory` stub whose `CreateDbContext` throws and send the message that triggers the DB read. Keep the helper local to the test file. + +**Step 2: Run it — expect FAIL (times out / dead-letters, no reply)** + +```bash +dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests \ + --filter "FullyQualifiedName~RouteNodeWrite_while_stale_fast_fails" +``` + +**Step 3: Add the Stale fast-fail handler** + +In `DriverHostActor.cs` `Stale()` (after the existing `Receive` line 547), add: + +```csharp +// An inbound operator write can't be serviced while the config DB is unreachable — fast-fail so the +// node-manager's bounded Ask gets an immediate clear status instead of dead-lettering into a timeout. +Receive(_ => + Sender.Tell(new NodeWriteResult(false, "driver host stale (config DB unreachable)"))); +``` + +**Step 4: Micro-opt — drop `ExecuteSynchronously` on the router continuation** + +In `HandleRouteNodeWrite` (line 514-520), the `.ContinueWith(...)` passes `TaskContinuationOptions.ExecuteSynchronously`. Running the continuation synchronously on the Ask's completing thread (a pool/transport thread) is needless coupling for a tiny projection. Change it to `TaskContinuationOptions.None`: + +```csharp + .ContinueWith( + t => t.IsCompletedSuccessfully + ? new NodeWriteResult(t.Result.Success, t.Result.Reason) + : new NodeWriteResult(false, "write timeout"), + CancellationToken.None, + TaskContinuationOptions.None, + TaskScheduler.Default) + .PipeTo(replyTo); +``` + +**Step 5: Micro-opt — `List.Contains` → `HashSet` in the forward-map build** + +The forward map `_nodeIdByDriverRef` (declared as `Dictionary<.., List>`) does `if (!list.Contains(nodeId)) list.Add(nodeId);` at line 756 — O(n) per node. Convert the value type to `HashSet`: +- Change the field declaration to `Dictionary<(string,string), HashSet>` (keep `StringComparer.Ordinal` on the inner set). +- Line 754-756 becomes: + ```csharp + if (!_nodeIdByDriverRef.TryGetValue(key, out var set)) + _nodeIdByDriverRef[key] = set = new HashSet(StringComparer.Ordinal); + set.Add(nodeId); + ``` +- **Grep every consumer of `_nodeIdByDriverRef`** (`grep -n _nodeIdByDriverRef src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs`) and confirm they only `foreach`/`Count`/membership-check the value (all of which `HashSet` supports). If any consumer indexes by position (`list[i]`) leave it a `List` and skip this micro-opt — note that in the commit message. (Expected: ForwardToMux only enumerates.) + +**Step 6: Raw protocol-blob routing test (#4d)** + +The suite seeds equipment tags with `TagConfig = JsonSerializer.Serialize(new { FullName = t.FullName })` (a Galaxy-style `{FullName}` blob). The router keys purely on **NodeId**, so the TagConfig blob shape is irrelevant to routing — prove that with a raw protocol blob. Add a tag to the seeded composition (or a new focused test) whose `TagConfig` is a raw Modbus-style blob and assert the write still routes to the owning driver by NodeId: + +```csharp +[Fact] +public void RouteNodeWrite_routes_by_nodeId_regardless_of_raw_tagconfig_blob_shape() +{ + // Seed an equipment tag whose TagConfig is a RAW protocol blob (region/address), NOT {FullName}. + var rawBlob = JsonSerializer.Serialize(new { region = "HoldingRegister", address = 200, dataType = "UInt16" }); + var host = SeedHostWithEquipmentTag(driverId: "MAIN-modbus-eq", tagConfig: rawBlob, /* nodeId derived as in the suite */); + + host.Tell(new DriverHostActor.RouteNodeWrite(expectedNodeId, (ushort)7), TestActor); + + // The child receives WriteAttribute carrying the wire-ref FullName (the blob is the wire-ref here); + // assert via the suite's existing child probe that the write was forwarded (Success path). + var reply = ExpectMsg(); + reply.Success.ShouldBeTrue(); +} +``` + +Reuse the suite's existing child-probe / seed helpers (the file already exercises `RouteNodeWrite` happy-path — clone that setup and only swap the `TagConfig` to the raw blob). The point is belt-and-suspenders: routing is blob-shape-independent. + +**Step 7: Run the full routing suite — expect PASS** + +```bash +dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverHostActorWriteRouting" +``` + +**Step 8: Build + commit (stage by path)** + +```bash +git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs \ + tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorWriteRoutingTests.cs +git commit -m "fix(runtime): fast-fail RouteNodeWrite while Stale + micro-opts + raw-blob routing test" +``` + +--- + +### Task T2: DriverInstanceActor — write fast-fail in degraded states + benign SubscriptionEstablished handler + +**Classification:** standard +**Estimated implement time:** ~5 min +**Parallelizable with:** T1, T3, T4, T5, T6 + +Addresses pending.md #4(a-instance) and #6(a). + +**Files:** +- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` (Stubbed ~205-209; `Connecting()` 211-231; `Reconnecting()` 265-281; `Connected()` 233-263) +- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorWriteAndSubscribeTests.cs` (NEW file — avoids touching T1's test file) + +**Context:** +- `WriteAttribute`/`WriteAttributeResult` are at lines 41-42. Only `Connected()` handles `WriteAttribute` (line 252). When the actor is `Connecting`/`Reconnecting`/`Stubbed`, an inbound `WriteAttribute` (Asked by `DriverHostActor.HandleRouteNodeWrite` with an 8 s timeout) dead-letters → the Ask times out as a generic "write timeout". Fast-fail in those states with a clear reason. +- `ResubscribeDesired()` (line 437-443) does `Self.Tell(new Subscribe(...))`; `HandleSubscribeAsync` replies `SubscriptionEstablished` to `replyTo = Sender = Self` (line 379) → no handler → the benign dead-letter `Message [SubscriptionEstablished] from drv-… to drv-… was unhandled` seen live on Galaxy. This is the **generic** actor, so a no-op handler quiets it for every driver. + +**Step 1: Failing tests (new file)** + +```csharp +public sealed class DriverInstanceActorWriteAndSubscribeTests : TestKit +{ + [Fact] + public void WriteAttribute_while_connecting_fast_fails() + { + var actor = SpawnInConnectingState(); // see helper note below + actor.Tell(new DriverInstanceActor.WriteAttribute("Tag.Attr", 1), TestActor); + var r = ExpectMsg(TimeSpan.FromSeconds(2)); + r.Success.ShouldBeFalse(); + r.Reason.ShouldNotBeNull(); + } + + [Fact] + public void SubscriptionEstablished_is_swallowed_without_deadletter() + { + var actor = SpawnConnected(); + // subscribe with a desired set so ResubscribeDesired self-Tells Subscribe and the reply lands on Self + // Subscribe to DeadLetters and assert none of type SubscriptionEstablished arrive. + var deadLetters = CreateTestProbe(); + Sys.EventStream.Subscribe(deadLetters, typeof(DeadLetter)); + // drive a (re)subscribe ... then: + deadLetters.ExpectNoMsg(TimeSpan.FromMilliseconds(500)); + } +} +``` + +Helper note: reuse the spawn/connect helpers the existing DriverInstanceActor tests use (look for them in the Runtime.Tests `Drivers/` folder; there is established setup for stub drivers and the Connected/Reconnecting transitions). To reach `Connecting`, send `InitializeRequested` with a driver-config that the stub keeps pending, or `Become`-drive via the suite's existing harness. If reaching exact internal states is awkward, the **minimum viable** assertions are: (a) a `WriteAttribute` to a Stubbed actor returns a negative `WriteAttributeResult` quickly, and (b) no `SubscriptionEstablished` dead-letter is produced during a connect+resubscribe cycle. + +**Step 2: Run — expect FAIL** + +```bash +dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverInstanceActorWriteAndSubscribe" +``` + +**Step 3: Add write fast-fail to the three degraded behaviours** + +Add this line to `Stubbed` (after line 208), `Connecting()` (after line 230), and `Reconnecting()` (after line 279): + +```csharp +Receive(_ => + Sender.Tell(new WriteAttributeResult(false, "driver not connected"))); +``` + +(Stubbed may use a reason like `"driver stubbed"` — implementer's call; the host maps either into a `NodeWriteResult(false, …)`.) + +**Step 4: Add the benign SubscriptionEstablished handler** + +In `Connected()` (alongside the other `Receive`s, e.g. after line 262) add a swallow-with-trace so the self-reply from `ResubscribeDesired` doesn't dead-letter: + +```csharp +// ResubscribeDesired self-Tells Subscribe; HandleSubscribeAsync replies SubscriptionEstablished to the +// sender, which on the self-resubscribe path is Self. Swallow it (trace only) so it doesn't dead-letter. +Receive(msg => + _log.Debug("DriverInstance {Id}: subscription established ({Count} refs, {Diag})", + _driverInstanceId, msg.ReferenceCount, msg.DiagnosticId)); +``` + +Add the same one-liner to `Reconnecting()` and `Connecting()` to cover the race where the async subscribe reply lands after a state flip (cheap, same file). Do NOT alter where/how `SubscriptionEstablished` is *sent* — a host that genuinely Asks `Subscribe` still receives its reply (the handler only runs when the message arrives at this actor as the asker). + +**Step 5: Run — expect PASS, then build the project** + +```bash +dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverInstanceActorWriteAndSubscribe" +dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ZB.MOM.WW.OtOpcUa.Runtime.csproj +``` + +**Step 6: Commit (stage by path)** + +```bash +git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs \ + tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorWriteAndSubscribeTests.cs +git commit -m "fix(runtime): fast-fail writes in degraded driver states + swallow self SubscriptionEstablished" +``` + +--- + +### Task T3: FOCAS — cache equipment-tag parsed addresses + +**Classification:** small +**Estimated implement time:** ~4 min +**Parallelizable with:** T1, T2, T4, T5, T6 + +Addresses pending.md #4(c). + +**Files:** +- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasDriver.cs` (cache field 37-38; read path 275-280; write path 337-342; init seed 120) +- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/` (add to an existing FocasDriver read/write test class) + +**Context:** `_parsedAddressesByTagName` (a plain `Dictionary`, line 37) is seeded only from `_options.Tags` at init (line 120). Equipment tags arrive via `_resolver` (line 66) as transient defs whose `Name` is the raw ref — never seeded — so both `ReadAsync` (275-280) and `WriteAsync` (337-342) hit the `TryGetValue` miss and **re-parse `FocasAddress.TryParse(def.Address)` every call without storing**. Fix: make the cache a `ConcurrentDictionary` (reads run on the poll loop, writes on the host Ask thread — concurrent access) and `GetOrAdd` so a transient def's address is parsed once. + +**Step 1: Failing test — second resolve must reuse the cache** + +Add an `internal` test seam to `FocasDriver` so the test can observe caching without a parser spy: + +```csharp +// In FocasDriver (internal, for tests via InternalsVisibleTo — confirm the attribute exists for the .Tests asm) +internal bool IsParsedAddressCached(string reference) => _parsedAddressesByTagName.ContainsKey(reference); +``` + +Test (equipment-tag ref, not an authored tag): + +```csharp +[Fact] +public async Task Equipment_tag_address_is_parsed_once_and_cached() +{ + var driver = /* construct FocasDriver with a fake client + one device, no authored Tags */; + await driver.InitializeAsync(configJson, default); + var equipRef = "{\"address\":\"R100\",\"dataType\":\"Int16\",\"deviceHostAddress\":\"focas://10.0.0.5\",\"writable\":true}"; // matches FocasEquipmentTagParser keys + + driver.IsParsedAddressCached(equipRef).ShouldBeFalse(); // not seeded at init + await driver.ReadAsync(new[] { equipRef }, default); // first resolve parses + caches + driver.IsParsedAddressCached(equipRef).ShouldBeTrue(); // cached for subsequent hot calls +} +``` + +Use the exact equipment-ref JSON shape that `FocasEquipmentTagParser.TryParse` accepts (open that parser to copy its camelCase keys + required fields). If `InternalsVisibleTo` for the `.Tests` assembly isn't already present on the FOCAS driver project, add it (mirror how the other drivers expose internals to their test projects). + +**Step 2: Run — expect FAIL (second assert false: never cached)** + +```bash +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests --filter "FullyQualifiedName~Equipment_tag_address_is_parsed_once" +``` + +**Step 3: Implement the cache** + +- Change the field (line 37-38) to: + ```csharp + private readonly ConcurrentDictionary _parsedAddressesByTagName = + new(StringComparer.OrdinalIgnoreCase); + ``` + (add `using System.Collections.Concurrent;` if absent). +- Add a private helper: + ```csharp + private FocasAddress ResolveParsedAddress(FocasTagDefinition def) => + _parsedAddressesByTagName.GetOrAdd(def.Name, _ => + FocasAddress.TryParse(def.Address) + ?? throw new InvalidOperationException( + $"FOCAS tag '{def.Name}' has malformed Address '{def.Address}'.")); + ``` + (The throwing factory propagates and does **not** cache a bad parse — preserving the existing fail behaviour. `GetOrAdd` may invoke the factory more than once under race but only stores one — fine, parsing is pure.) +- Replace the read-path block (275-280) and the write-path block (337-342) with `var parsed = ResolveParsedAddress(def);`. +- Line 120 (`_parsedAddressesByTagName[tag.Name] = parsed;`) compiles unchanged on `ConcurrentDictionary`; keep it (init pre-validates + seeds authored tags). + +**Step 4: Run — expect PASS, build the project** + +```bash +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests --filter "FullyQualifiedName~Equipment_tag_address_is_parsed_once" +dotnet build src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.csproj +``` + +**Step 5: Commit (stage by path)** + +```bash +git add src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/ +git commit -m "perf(focas): cache equipment-tag parsed addresses (no per-call reparse)" +``` + +--- + +### Task T4: Parity test — future-enum trap + +**Classification:** trivial +**Estimated implement time:** ~2 min +**Parallelizable with:** T1, T2, T3, T5, T6 + +Addresses pending.md #4(e). + +**Files:** +- Test only: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactAliasParityTests.cs` (numeric-AccessLevel theory ~73-104) + +**Context:** `TagAccessLevel` has `Read = 0`, `ReadWrite = 1` (`src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Enums/TagAccessLevel.cs`). The parity test maps numeric AccessLevel → `Writable` with `[InlineData(1, true)]`/`[InlineData(0, false)]`. Add a future-enum guard so an as-yet-undefined value maps fail-safe to non-writable (only `ReadWrite==1` should yield a writable node). + +**Step 1: Add the InlineData row** + +In the numeric-AccessLevel `[Theory]` (the one with `[InlineData(1, true)]` / `[InlineData(0, false)]`), add: + +```csharp +[InlineData(2, false)] // future/unknown AccessLevel must map fail-safe to read-only (only ReadWrite⇒writable) +``` + +**Step 2: Run — expect PASS (confirms the derivation is `== ReadWrite`, not `!= Read`)** + +```bash +dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DeploymentArtifactAliasParity" +``` + +If it FAILS, the production derivation is using `!= Read` (so `2` wrongly maps to writable) — that is a real fail-safety bug; STOP and surface it (the fix would be a one-line `== TagAccessLevel.ReadWrite` in both Phase7Composer and DeploymentArtifact, but only with reviewer sign-off since it touches the byte-parity contract). + +**Step 3: Commit (stage by path)** + +```bash +git add tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactAliasParityTests.cs +git commit -m "test(runtime): future-enum trap for AccessLevel→Writable parity" +``` + +--- + +### Task T5: docs/security.md — data-plane GroupToRole requirement + +**Classification:** trivial +**Estimated implement time:** ~3 min +**Parallelizable with:** T1, T2, T3, T4, T6 + +Addresses pending.md #3. + +**Files:** +- Modify: `docs/security.md` (Data-Plane Authorization section starts line 167; `GroupToRole` is described at line 163 in the LDAP section; troubleshooting at 313-315) + +**Context:** OPC UA session roles are unioned from two sources: the DB `LdapGroupRoleMapping` (whose `Role` column is the **`AdminRole`** enum — Administrator/Designer/Viewer only) AND the appsettings `Security:Ldap:GroupToRole` (free-form `string→string`). The data-plane gates (`WriteOperate`, `WriteTune`, `WriteConfigure`, `AlarmAck`/`AlarmAcknowledge`, `ReadOnly`) read literal **role strings** that the AdminRole-typed DB mapping **cannot produce** — so a deployment MUST map its data-plane LDAP groups → data-plane role strings via `GroupToRole`, or write-through and OPC UA alarm-ack are silently inert (every write → `BadUserAccessDenied`, default-deny). Line 163 currently frames `GroupToRole` as mapping only to *Admin* roles — that's the misleading bit to correct. + +**Step 1: Add a "Role grant source (data-plane)" subsection** + +Under **## Data-Plane Authorization** (after the "Full model" subsection ~line 241, or right after the "Permission flags" mapping ~216), add a subsection. Suggested content (match the doc's voice; keep it tight): + +```markdown +### Role grant source (data-plane) + +Data-plane roles come from `Security:Ldap:GroupToRole` (appsettings), **not** from the Config-DB +`LdapGroupRoleMapping` table. That table's `Role` column is the `AdminRole` enum +(`Administrator`/`Designer`/`Viewer`) and supplies **control-plane** roles only — it cannot emit the +data-plane role strings the OPC UA gates read (`ReadOnly`, `WriteOperate`, `WriteTune`, +`WriteConfigure`, `AlarmAcknowledge`). A deployment therefore **must** map its data-plane LDAP groups +to those role strings via `GroupToRole`, e.g.: + +```json +"GroupToRole": { + "ot-operators": "WriteOperate", + "ot-tuners": "WriteTune", + "ot-engineers": "WriteConfigure", + "ot-alarm-ack": "AlarmAcknowledge", + "ot-readonly": "ReadOnly" +} +``` + +If this mapping is absent the data-plane evaluator is strictly default-deny: inbound operator +writes and OPC UA Part-9 alarm acknowledgement all return `BadUserAccessDenied` even for users who +authenticate successfully. (The same requirement gates the pre-existing scripted-alarm ack path.) +``` + +Also tighten line 163 (the `GroupToRole` description in the LDAP section) so it notes the values are control-plane **and** data-plane role strings, cross-referencing the new subsection — a one-clause edit, not a rewrite. + +**Step 2: Verify the doc renders / no broken anchors, then commit** + +```bash +git add docs/security.md +git commit -m "docs(security): document the GroupToRole data-plane role requirement" +``` + +(No test; verification = re-read the section for accuracy against the permission-flag names at lines 186-216.) + +--- + +### Task T6: Galaxy writer — invalidate handle caches on reconnect + +**Classification:** small +**Estimated implement time:** ~5 min +**Parallelizable with:** T1, T2, T3, T4, T5 + +Addresses pending.md #6(b). + +**Files:** +- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/IGalaxyDataWriter.cs` (interface — add method) +- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GatewayGalaxyDataWriter.cs` (caches 28-32; implement clear) +- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/TracedGalaxyDataWriter.cs` (decorator — delegate) +- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs` (`ReopenAsync` 291-296; `_dataWriter` set at 255-257) +- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/Runtime/` (GatewayGalaxyDataWriter test class) + +**Context:** `GatewayGalaxyDataWriter` caches `_itemHandles` (ref→gw item handle, line 28) and `_supervisedHandles` (handle→advised, line 32). `_dataWriter` is built **once** in `BuildProductionRuntimeAsync` (255-257) and reused across reconnects. On reconnect, `ReopenAsync` (291-296) calls `_ownedMxSession.RecreateAsync` — which disposes + rebuilds the session, so all prior gw item handles are **dead**. The writer's caches still hold the stale handles, so a write right after a reconnect AddItem-skips and uses a dead handle. Add a cache-invalidation seam called on reopen. + +**Step 1: Failing test — invalidation forces a fresh AddItem on the next write** + +In the existing `GatewayGalaxyDataWriter` test class (it already constructs the writer over a fake/owned session — reuse that harness), add: + +```csharp +[Fact] +public async Task InvalidateHandleCaches_forces_AddItem_again_on_next_write() +{ + var (writer, fakeSession) = BuildWriterWithFakeSession(); // reuse existing harness + await writer.WriteAsync(new[] { new WriteRequest("Tag.Attr", 1) }, _ => SecurityClassification.Operate, default); + fakeSession.AddItemCallCount.ShouldBe(1); + + writer.InvalidateHandleCaches(); + + await writer.WriteAsync(new[] { new WriteRequest("Tag.Attr", 2) }, _ => SecurityClassification.Operate, default); + fakeSession.AddItemCallCount.ShouldBe(2); // re-AddItem'd because the cache was cleared +} +``` + +If the existing fake session doesn't count `AddItem`, either extend it with a counter or assert via an internal seam on the writer (`internal int CachedItemHandleCount => _itemHandles.Count;`): assert `>0` after a write, `0` after `InvalidateHandleCaches()`. Pick whichever the existing harness makes cleanest. + +**Step 2: Run — expect FAIL (no such method)** + +```bash +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests --filter "FullyQualifiedName~InvalidateHandleCaches" +``` + +**Step 3: Add the method to the interface + impl + decorator** + +- `IGalaxyDataWriter`: add + ```csharp + /// Drop cached gw item handles + supervisory-advise state. Call after a session + /// reconnect — the old handles are dead, so the next write must re-AddItem + re-advise. + void InvalidateHandleCaches(); + ``` +- `GatewayGalaxyDataWriter`: + ```csharp + public void InvalidateHandleCaches() + { + _itemHandles.Clear(); + _supervisedHandles.Clear(); + } + ``` +- `TracedGalaxyDataWriter` (decorator): delegate — `public void InvalidateHandleCaches() => _inner.InvalidateHandleCaches();` (match the field name the decorator uses for its inner writer). + +**Step 4: Call it on reopen** + +In `GalaxyDriver.ReopenAsync` (291-296), after `RecreateAsync` succeeds: + +```csharp +private async Task ReopenAsync(CancellationToken cancellationToken) +{ + if (_ownedMxSession is null) return; + var clientOptions = BuildClientOptions(_options.Gateway); + await _ownedMxSession.RecreateAsync(clientOptions, cancellationToken).ConfigureAwait(false); + // The recreated session invalidates every prior gw item handle; drop the writer's handle/advise + // caches so the next write re-AddItems + re-AdviseSupervisory against the fresh session. + _dataWriter?.InvalidateHandleCaches(); +} +``` + +(`_dataWriter` is the `TracedGalaxyDataWriter`-wrapped field set at 255-257; the call flows through to the inner `GatewayGalaxyDataWriter`.) + +**Step 5: Run — expect PASS, run the full Galaxy suite, build** + +```bash +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests --filter "FullyQualifiedName~InvalidateHandleCaches" +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests +``` + +**Step 6: Commit (stage by path)** + +```bash +git add src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/ tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/ +git commit -m "fix(galaxy): invalidate writer handle caches on session reconnect" +``` + +--- + +## Final verification (after T1–T6) + +```bash +dotnet build ZB.MOM.WW.OtOpcUa.slnx # 0 errors (TreatWarningsAsErrors) +dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests +dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests +``` + +These are all unit-level changes (no live docker-dev `/run` gate needed — none touch Razor/JS or the wire protocol behaviourally; the fast-fail paths and cache changes are exercised by Akka.TestKit + unit tests). Then finish via **superpowers-extended-cc:finishing-a-development-branch** (intent: merge-to-master + push). After merge, update `pending.md` to mark #3, #4, #6 resolved. + +## Out of scope (left in pending.md) + +- #4 sub-item: FOCAS — covered (T3). The "surface real device-write status" item (#5) is a **separate** chosen-later effort, not part of this cluster. +- #7 driver-reconfigure-while-faulted — its own high-risk actor-state-machine design/plan. +- #1/#2 Phase B/C; #8 rig cleanups — unchanged. diff --git a/docs/plans/2026-06-14-harden-milestone-1b-plan.md.tasks.json b/docs/plans/2026-06-14-harden-milestone-1b-plan.md.tasks.json new file mode 100644 index 00000000..ed8e8f40 --- /dev/null +++ b/docs/plans/2026-06-14-harden-milestone-1b-plan.md.tasks.json @@ -0,0 +1,13 @@ +{ + "planPath": "docs/plans/2026-06-14-harden-milestone-1b-plan.md", + "tasks": [ + {"id": "354", "subject": "T0: Create feature branch chore/harden-milestone-1b", "classification": "trivial", "status": "completed"}, + {"id": "355", "subject": "T1: DriverHostActor Stale write fast-fail + micro-opts + raw-blob routing test", "classification": "standard", "status": "completed", "blockedBy": ["354"], "commits": ["4cda275b", "99eea0b4"]}, + {"id": "356", "subject": "T2: DriverInstanceActor degraded-state write fast-fail + benign SubscriptionEstablished handler", "classification": "standard", "status": "completed", "blockedBy": ["354"], "commits": ["42b4a923", "7e405e94"]}, + {"id": "357", "subject": "T3: FOCAS cache equipment-tag parsed addresses", "classification": "small", "status": "completed", "blockedBy": ["354"], "commits": ["46f559f5"]}, + {"id": "358", "subject": "T4: Parity test future-enum trap InlineData(2,false)", "classification": "trivial", "status": "completed", "blockedBy": ["354"], "commits": ["97d82dda"]}, + {"id": "359", "subject": "T5: docs/security.md data-plane GroupToRole requirement note", "classification": "trivial", "status": "completed", "blockedBy": ["354"], "commits": ["190ef34e"]}, + {"id": "360", "subject": "T6: Galaxy writer invalidate handle caches on reconnect", "classification": "small", "status": "completed", "blockedBy": ["354"], "commits": ["f77488ee"]} + ], + "lastUpdated": "2026-06-14" +}