Files
lmxopcua/docs/plans/2026-06-14-harden-milestone-1b-plan.md
T
Joseph Doherty 945c238039
v2-ci / build (push) Failing after 48s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
docs(plans): harden-milestone-1b plan + task record
2026-06-14 00:51:05 -04:00

558 lines
30 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Harden Milestone 1b — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
**Goal:** Close out the review nits left from the just-shipped Milestone 1b (equipment-tag live values + inbound operator-write pipeline) — fast-fail degraded-state writes, two micro-optimisations, a FOCAS equipment-tag parse-cache gap, a benign Galaxy dead-letter, a Galaxy reconnect stale-handle gap, two test-hardening additions, and one deployment-facing docs note.
**Architecture:** Pure hardening — no new feature surface, **no EF/Configuration schema change, no migration**. Changes are additive fast-fail handlers on existing actor behaviours, a thread-safe cache swap, a writer cache-invalidation seam on the Galaxy reconnect path, test additions, and a docs subsection. Source of these items: `pending.md` follow-ups #3, #4, #6 (from the Milestone 1b integration review), all enumerated below.
**Tech Stack:** C# / .NET 10, Akka.NET (DriverHostActor / DriverInstanceActor), xUnit + Shouldly + Akka.TestKit. `TreatWarningsAsErrors` is on solution-wide.
**Branch:** `chore/harden-milestone-1b` off `master` (HEAD `02e37a6d`).
**Hard rules (every task):** stage by path (never `git add .`); never stage `sql_login.txt` / `src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/` / `pending.md` / `current.md` / `docker-dev/docker-compose.yml`; never echo secrets; no force-push / no `--no-verify`; **NO Configuration entity / EF migration change**. Done = build clean + `dotnet test` green for every touched project.
---
## Task DAG
```
T0 (branch)
├─ T1 DriverHostActor: Stale fast-fail + micro-opts + raw-blob routing test [standard]
├─ T2 DriverInstanceActor: Connecting/Reconnecting/Stubbed write fast-fail
│ + benign SubscriptionEstablished handler [standard]
├─ T3 FOCAS: cache equipment-tag parsed addresses (ConcurrentDictionary) [small]
├─ T4 Parity test: future-enum trap InlineData(2,false) [trivial]
├─ T5 docs/security.md: data-plane GroupToRole requirement note [trivial]
└─ T6 Galaxy writer: invalidate handle caches on reconnect [small]
```
T1T6 all `blockedBy` T0 and have **disjoint file sets** — they are mutually `Parallelizable`. Dispatch them concurrently after T0.
---
### Task T0: Create feature branch
**Classification:** trivial
**Estimated implement time:** ~1 min
**Parallelizable with:** none (gates all)
**Files:** none (git only)
**Step 1: Branch off master**
```bash
git -C /Users/dohertj2/Desktop/OtOpcUa checkout -b chore/harden-milestone-1b master
git -C /Users/dohertj2/Desktop/OtOpcUa rev-parse --abbrev-ref HEAD # expect: chore/harden-milestone-1b
```
Do NOT touch the working-tree `docker-dev/docker-compose.yml` (leave it modified+unstaged throughout).
---
### Task T1: DriverHostActor — Stale-state write fast-fail + micro-opts + raw-blob routing test
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** T2, T3, T4, T5, T6
Addresses pending.md #4(a-host), #4(b), #4(d).
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs` (Stale behaviour ~537-550; `HandleRouteNodeWrite` ContinueWith line 519; forward-map build line 754-756)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorWriteRoutingTests.cs`
**Context:** `RouteNodeWrite`/`NodeWriteResult` are defined at `DriverHostActor.cs:130,136`. The running behaviour wires `HandleRouteNodeWrite`, but the **`Stale()`** behaviour (DB unreachable; lines 537-550) has no `RouteNodeWrite` handler → an inbound operator write while Stale dead-letters and the node-manager's 10 s Ask times out with a generic log (the client already got the optimistic `Good`). Add a fast-fail. Two micro-opts ride along in the same file.
**Step 1: Add the failing Akka.TestKit test for Stale fast-fail**
In `DriverHostActorWriteRoutingTests.cs`, add a test that drives the host into `Stale` (construct it with a DB factory that throws, or send whatever the existing tests use to reach Stale — mirror an existing Stale-path test in the file/suite) and asserts an immediate negative `NodeWriteResult`:
```csharp
[Fact]
public void RouteNodeWrite_while_stale_fast_fails_instead_of_timing_out()
{
// Arrange: a host that cannot reach the config DB enters Stale on first DispatchDeployment.
var host = /* spawn DriverHostActor with a throwing IDbContextFactory — reuse the suite's Stale helper */;
DriveIntoStale(host);
// Act
host.Tell(new DriverHostActor.RouteNodeWrite("ns=2;s=EQ-x/Tag", 42), TestActor);
var reply = ExpectMsg<DriverHostActor.NodeWriteResult>(TimeSpan.FromSeconds(2));
// Assert: fast negative, not a ~10s timeout
reply.Success.ShouldBeFalse();
reply.Reason.ShouldContain("stale");
}
```
If the suite has no existing Stale helper, reach Stale the same way the existing "ignoring DispatchDeployment while Stale" coverage does; if none exists, construct the actor with a `IDbContextFactory` stub whose `CreateDbContext` throws and send the message that triggers the DB read. Keep the helper local to the test file.
**Step 2: Run it — expect FAIL (times out / dead-letters, no reply)**
```bash
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests \
--filter "FullyQualifiedName~RouteNodeWrite_while_stale_fast_fails"
```
**Step 3: Add the Stale fast-fail handler**
In `DriverHostActor.cs` `Stale()` (after the existing `Receive<RedundancyStateChanged>` line 547), add:
```csharp
// An inbound operator write can't be serviced while the config DB is unreachable — fast-fail so the
// node-manager's bounded Ask gets an immediate clear status instead of dead-lettering into a timeout.
Receive<RouteNodeWrite>(_ =>
Sender.Tell(new NodeWriteResult(false, "driver host stale (config DB unreachable)")));
```
**Step 4: Micro-opt — drop `ExecuteSynchronously` on the router continuation**
In `HandleRouteNodeWrite` (line 514-520), the `.ContinueWith(...)` passes `TaskContinuationOptions.ExecuteSynchronously`. Running the continuation synchronously on the Ask's completing thread (a pool/transport thread) is needless coupling for a tiny projection. Change it to `TaskContinuationOptions.None`:
```csharp
.ContinueWith(
t => t.IsCompletedSuccessfully
? new NodeWriteResult(t.Result.Success, t.Result.Reason)
: new NodeWriteResult(false, "write timeout"),
CancellationToken.None,
TaskContinuationOptions.None,
TaskScheduler.Default)
.PipeTo(replyTo);
```
**Step 5: Micro-opt — `List.Contains` → `HashSet` in the forward-map build**
The forward map `_nodeIdByDriverRef` (declared as `Dictionary<.., List<string>>`) does `if (!list.Contains(nodeId)) list.Add(nodeId);` at line 756 — O(n) per node. Convert the value type to `HashSet<string>`:
- Change the field declaration to `Dictionary<(string,string), HashSet<string>>` (keep `StringComparer.Ordinal` on the inner set).
- Line 754-756 becomes:
```csharp
if (!_nodeIdByDriverRef.TryGetValue(key, out var set))
_nodeIdByDriverRef[key] = set = new HashSet<string>(StringComparer.Ordinal);
set.Add(nodeId);
```
- **Grep every consumer of `_nodeIdByDriverRef`** (`grep -n _nodeIdByDriverRef src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs`) and confirm they only `foreach`/`Count`/membership-check the value (all of which `HashSet` supports). If any consumer indexes by position (`list[i]`) leave it a `List` and skip this micro-opt — note that in the commit message. (Expected: ForwardToMux only enumerates.)
**Step 6: Raw protocol-blob routing test (#4d)**
The suite seeds equipment tags with `TagConfig = JsonSerializer.Serialize(new { FullName = t.FullName })` (a Galaxy-style `{FullName}` blob). The router keys purely on **NodeId**, so the TagConfig blob shape is irrelevant to routing — prove that with a raw protocol blob. Add a tag to the seeded composition (or a new focused test) whose `TagConfig` is a raw Modbus-style blob and assert the write still routes to the owning driver by NodeId:
```csharp
[Fact]
public void RouteNodeWrite_routes_by_nodeId_regardless_of_raw_tagconfig_blob_shape()
{
// Seed an equipment tag whose TagConfig is a RAW protocol blob (region/address), NOT {FullName}.
var rawBlob = JsonSerializer.Serialize(new { region = "HoldingRegister", address = 200, dataType = "UInt16" });
var host = SeedHostWithEquipmentTag(driverId: "MAIN-modbus-eq", tagConfig: rawBlob, /* nodeId derived as in the suite */);
host.Tell(new DriverHostActor.RouteNodeWrite(expectedNodeId, (ushort)7), TestActor);
// The child receives WriteAttribute carrying the wire-ref FullName (the blob is the wire-ref here);
// assert via the suite's existing child probe that the write was forwarded (Success path).
var reply = ExpectMsg<DriverHostActor.NodeWriteResult>();
reply.Success.ShouldBeTrue();
}
```
Reuse the suite's existing child-probe / seed helpers (the file already exercises `RouteNodeWrite` happy-path — clone that setup and only swap the `TagConfig` to the raw blob). The point is belt-and-suspenders: routing is blob-shape-independent.
**Step 7: Run the full routing suite — expect PASS**
```bash
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverHostActorWriteRouting"
```
**Step 8: Build + commit (stage by path)**
```bash
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs \
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorWriteRoutingTests.cs
git commit -m "fix(runtime): fast-fail RouteNodeWrite while Stale + micro-opts + raw-blob routing test"
```
---
### Task T2: DriverInstanceActor — write fast-fail in degraded states + benign SubscriptionEstablished handler
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** T1, T3, T4, T5, T6
Addresses pending.md #4(a-instance) and #6(a).
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` (Stubbed ~205-209; `Connecting()` 211-231; `Reconnecting()` 265-281; `Connected()` 233-263)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorWriteAndSubscribeTests.cs` (NEW file — avoids touching T1's test file)
**Context:**
- `WriteAttribute`/`WriteAttributeResult` are at lines 41-42. Only `Connected()` handles `WriteAttribute` (line 252). When the actor is `Connecting`/`Reconnecting`/`Stubbed`, an inbound `WriteAttribute` (Asked by `DriverHostActor.HandleRouteNodeWrite` with an 8 s timeout) dead-letters → the Ask times out as a generic "write timeout". Fast-fail in those states with a clear reason.
- `ResubscribeDesired()` (line 437-443) does `Self.Tell(new Subscribe(...))`; `HandleSubscribeAsync` replies `SubscriptionEstablished` to `replyTo = Sender = Self` (line 379) → no handler → the benign dead-letter `Message [SubscriptionEstablished] from drv-… to drv-… was unhandled` seen live on Galaxy. This is the **generic** actor, so a no-op handler quiets it for every driver.
**Step 1: Failing tests (new file)**
```csharp
public sealed class DriverInstanceActorWriteAndSubscribeTests : TestKit
{
[Fact]
public void WriteAttribute_while_connecting_fast_fails()
{
var actor = SpawnInConnectingState(); // see helper note below
actor.Tell(new DriverInstanceActor.WriteAttribute("Tag.Attr", 1), TestActor);
var r = ExpectMsg<DriverInstanceActor.WriteAttributeResult>(TimeSpan.FromSeconds(2));
r.Success.ShouldBeFalse();
r.Reason.ShouldNotBeNull();
}
[Fact]
public void SubscriptionEstablished_is_swallowed_without_deadletter()
{
var actor = SpawnConnected();
// subscribe with a desired set so ResubscribeDesired self-Tells Subscribe and the reply lands on Self
// Subscribe to DeadLetters and assert none of type SubscriptionEstablished arrive.
var deadLetters = CreateTestProbe();
Sys.EventStream.Subscribe(deadLetters, typeof(DeadLetter));
// drive a (re)subscribe ... then:
deadLetters.ExpectNoMsg(TimeSpan.FromMilliseconds(500));
}
}
```
Helper note: reuse the spawn/connect helpers the existing DriverInstanceActor tests use (look for them in the Runtime.Tests `Drivers/` folder; there is established setup for stub drivers and the Connected/Reconnecting transitions). To reach `Connecting`, send `InitializeRequested` with a driver-config that the stub keeps pending, or `Become`-drive via the suite's existing harness. If reaching exact internal states is awkward, the **minimum viable** assertions are: (a) a `WriteAttribute` to a Stubbed actor returns a negative `WriteAttributeResult` quickly, and (b) no `SubscriptionEstablished` dead-letter is produced during a connect+resubscribe cycle.
**Step 2: Run — expect FAIL**
```bash
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverInstanceActorWriteAndSubscribe"
```
**Step 3: Add write fast-fail to the three degraded behaviours**
Add this line to `Stubbed` (after line 208), `Connecting()` (after line 230), and `Reconnecting()` (after line 279):
```csharp
Receive<WriteAttribute>(_ =>
Sender.Tell(new WriteAttributeResult(false, "driver not connected")));
```
(Stubbed may use a reason like `"driver stubbed"` — implementer's call; the host maps either into a `NodeWriteResult(false, …)`.)
**Step 4: Add the benign SubscriptionEstablished handler**
In `Connected()` (alongside the other `Receive`s, e.g. after line 262) add a swallow-with-trace so the self-reply from `ResubscribeDesired` doesn't dead-letter:
```csharp
// ResubscribeDesired self-Tells Subscribe; HandleSubscribeAsync replies SubscriptionEstablished to the
// sender, which on the self-resubscribe path is Self. Swallow it (trace only) so it doesn't dead-letter.
Receive<SubscriptionEstablished>(msg =>
_log.Debug("DriverInstance {Id}: subscription established ({Count} refs, {Diag})",
_driverInstanceId, msg.ReferenceCount, msg.DiagnosticId));
```
Add the same one-liner to `Reconnecting()` and `Connecting()` to cover the race where the async subscribe reply lands after a state flip (cheap, same file). Do NOT alter where/how `SubscriptionEstablished` is *sent* — a host that genuinely Asks `Subscribe` still receives its reply (the handler only runs when the message arrives at this actor as the asker).
**Step 5: Run — expect PASS, then build the project**
```bash
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverInstanceActorWriteAndSubscribe"
dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ZB.MOM.WW.OtOpcUa.Runtime.csproj
```
**Step 6: Commit (stage by path)**
```bash
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs \
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorWriteAndSubscribeTests.cs
git commit -m "fix(runtime): fast-fail writes in degraded driver states + swallow self SubscriptionEstablished"
```
---
### Task T3: FOCAS — cache equipment-tag parsed addresses
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** T1, T2, T4, T5, T6
Addresses pending.md #4(c).
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasDriver.cs` (cache field 37-38; read path 275-280; write path 337-342; init seed 120)
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/` (add to an existing FocasDriver read/write test class)
**Context:** `_parsedAddressesByTagName` (a plain `Dictionary`, line 37) is seeded only from `_options.Tags` at init (line 120). Equipment tags arrive via `_resolver` (line 66) as transient defs whose `Name` is the raw ref — never seeded — so both `ReadAsync` (275-280) and `WriteAsync` (337-342) hit the `TryGetValue` miss and **re-parse `FocasAddress.TryParse(def.Address)` every call without storing**. Fix: make the cache a `ConcurrentDictionary` (reads run on the poll loop, writes on the host Ask thread — concurrent access) and `GetOrAdd` so a transient def's address is parsed once.
**Step 1: Failing test — second resolve must reuse the cache**
Add an `internal` test seam to `FocasDriver` so the test can observe caching without a parser spy:
```csharp
// In FocasDriver (internal, for tests via InternalsVisibleTo — confirm the attribute exists for the .Tests asm)
internal bool IsParsedAddressCached(string reference) => _parsedAddressesByTagName.ContainsKey(reference);
```
Test (equipment-tag ref, not an authored tag):
```csharp
[Fact]
public async Task Equipment_tag_address_is_parsed_once_and_cached()
{
var driver = /* construct FocasDriver with a fake client + one device, no authored Tags */;
await driver.InitializeAsync(configJson, default);
var equipRef = "{\"address\":\"R100\",\"dataType\":\"Int16\",\"deviceHostAddress\":\"focas://10.0.0.5\",\"writable\":true}"; // matches FocasEquipmentTagParser keys
driver.IsParsedAddressCached(equipRef).ShouldBeFalse(); // not seeded at init
await driver.ReadAsync(new[] { equipRef }, default); // first resolve parses + caches
driver.IsParsedAddressCached(equipRef).ShouldBeTrue(); // cached for subsequent hot calls
}
```
Use the exact equipment-ref JSON shape that `FocasEquipmentTagParser.TryParse` accepts (open that parser to copy its camelCase keys + required fields). If `InternalsVisibleTo` for the `.Tests` assembly isn't already present on the FOCAS driver project, add it (mirror how the other drivers expose internals to their test projects).
**Step 2: Run — expect FAIL (second assert false: never cached)**
```bash
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests --filter "FullyQualifiedName~Equipment_tag_address_is_parsed_once"
```
**Step 3: Implement the cache**
- Change the field (line 37-38) to:
```csharp
private readonly ConcurrentDictionary<string, FocasAddress> _parsedAddressesByTagName =
new(StringComparer.OrdinalIgnoreCase);
```
(add `using System.Collections.Concurrent;` if absent).
- Add a private helper:
```csharp
private FocasAddress ResolveParsedAddress(FocasTagDefinition def) =>
_parsedAddressesByTagName.GetOrAdd(def.Name, _ =>
FocasAddress.TryParse(def.Address)
?? throw new InvalidOperationException(
$"FOCAS tag '{def.Name}' has malformed Address '{def.Address}'."));
```
(The throwing factory propagates and does **not** cache a bad parse — preserving the existing fail behaviour. `GetOrAdd` may invoke the factory more than once under race but only stores one — fine, parsing is pure.)
- Replace the read-path block (275-280) and the write-path block (337-342) with `var parsed = ResolveParsedAddress(def);`.
- Line 120 (`_parsedAddressesByTagName[tag.Name] = parsed;`) compiles unchanged on `ConcurrentDictionary`; keep it (init pre-validates + seeds authored tags).
**Step 4: Run — expect PASS, build the project**
```bash
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests --filter "FullyQualifiedName~Equipment_tag_address_is_parsed_once"
dotnet build src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.csproj
```
**Step 5: Commit (stage by path)**
```bash
git add src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/
git commit -m "perf(focas): cache equipment-tag parsed addresses (no per-call reparse)"
```
---
### Task T4: Parity test — future-enum trap
**Classification:** trivial
**Estimated implement time:** ~2 min
**Parallelizable with:** T1, T2, T3, T5, T6
Addresses pending.md #4(e).
**Files:**
- Test only: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactAliasParityTests.cs` (numeric-AccessLevel theory ~73-104)
**Context:** `TagAccessLevel` has `Read = 0`, `ReadWrite = 1` (`src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Enums/TagAccessLevel.cs`). The parity test maps numeric AccessLevel → `Writable` with `[InlineData(1, true)]`/`[InlineData(0, false)]`. Add a future-enum guard so an as-yet-undefined value maps fail-safe to non-writable (only `ReadWrite==1` should yield a writable node).
**Step 1: Add the InlineData row**
In the numeric-AccessLevel `[Theory]` (the one with `[InlineData(1, true)]` / `[InlineData(0, false)]`), add:
```csharp
[InlineData(2, false)] // future/unknown AccessLevel must map fail-safe to read-only (only ReadWrite⇒writable)
```
**Step 2: Run — expect PASS (confirms the derivation is `== ReadWrite`, not `!= Read`)**
```bash
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DeploymentArtifactAliasParity"
```
If it FAILS, the production derivation is using `!= Read` (so `2` wrongly maps to writable) — that is a real fail-safety bug; STOP and surface it (the fix would be a one-line `== TagAccessLevel.ReadWrite` in both Phase7Composer and DeploymentArtifact, but only with reviewer sign-off since it touches the byte-parity contract).
**Step 3: Commit (stage by path)**
```bash
git add tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactAliasParityTests.cs
git commit -m "test(runtime): future-enum trap for AccessLevel→Writable parity"
```
---
### Task T5: docs/security.md — data-plane GroupToRole requirement
**Classification:** trivial
**Estimated implement time:** ~3 min
**Parallelizable with:** T1, T2, T3, T4, T6
Addresses pending.md #3.
**Files:**
- Modify: `docs/security.md` (Data-Plane Authorization section starts line 167; `GroupToRole` is described at line 163 in the LDAP section; troubleshooting at 313-315)
**Context:** OPC UA session roles are unioned from two sources: the DB `LdapGroupRoleMapping` (whose `Role` column is the **`AdminRole`** enum — Administrator/Designer/Viewer only) AND the appsettings `Security:Ldap:GroupToRole` (free-form `string→string`). The data-plane gates (`WriteOperate`, `WriteTune`, `WriteConfigure`, `AlarmAck`/`AlarmAcknowledge`, `ReadOnly`) read literal **role strings** that the AdminRole-typed DB mapping **cannot produce** — so a deployment MUST map its data-plane LDAP groups → data-plane role strings via `GroupToRole`, or write-through and OPC UA alarm-ack are silently inert (every write → `BadUserAccessDenied`, default-deny). Line 163 currently frames `GroupToRole` as mapping only to *Admin* roles — that's the misleading bit to correct.
**Step 1: Add a "Role grant source (data-plane)" subsection**
Under **## Data-Plane Authorization** (after the "Full model" subsection ~line 241, or right after the "Permission flags" mapping ~216), add a subsection. Suggested content (match the doc's voice; keep it tight):
```markdown
### Role grant source (data-plane)
Data-plane roles come from `Security:Ldap:GroupToRole` (appsettings), **not** from the Config-DB
`LdapGroupRoleMapping` table. That table's `Role` column is the `AdminRole` enum
(`Administrator`/`Designer`/`Viewer`) and supplies **control-plane** roles only — it cannot emit the
data-plane role strings the OPC UA gates read (`ReadOnly`, `WriteOperate`, `WriteTune`,
`WriteConfigure`, `AlarmAcknowledge`). A deployment therefore **must** map its data-plane LDAP groups
to those role strings via `GroupToRole`, e.g.:
```json
"GroupToRole": {
"ot-operators": "WriteOperate",
"ot-tuners": "WriteTune",
"ot-engineers": "WriteConfigure",
"ot-alarm-ack": "AlarmAcknowledge",
"ot-readonly": "ReadOnly"
}
```
If this mapping is absent the data-plane evaluator is strictly default-deny: inbound operator
writes and OPC UA Part-9 alarm acknowledgement all return `BadUserAccessDenied` even for users who
authenticate successfully. (The same requirement gates the pre-existing scripted-alarm ack path.)
```
Also tighten line 163 (the `GroupToRole` description in the LDAP section) so it notes the values are control-plane **and** data-plane role strings, cross-referencing the new subsection — a one-clause edit, not a rewrite.
**Step 2: Verify the doc renders / no broken anchors, then commit**
```bash
git add docs/security.md
git commit -m "docs(security): document the GroupToRole data-plane role requirement"
```
(No test; verification = re-read the section for accuracy against the permission-flag names at lines 186-216.)
---
### Task T6: Galaxy writer — invalidate handle caches on reconnect
**Classification:** small
**Estimated implement time:** ~5 min
**Parallelizable with:** T1, T2, T3, T4, T5
Addresses pending.md #6(b).
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/IGalaxyDataWriter.cs` (interface — add method)
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GatewayGalaxyDataWriter.cs` (caches 28-32; implement clear)
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/TracedGalaxyDataWriter.cs` (decorator — delegate)
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs` (`ReopenAsync` 291-296; `_dataWriter` set at 255-257)
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/Runtime/` (GatewayGalaxyDataWriter test class)
**Context:** `GatewayGalaxyDataWriter` caches `_itemHandles` (ref→gw item handle, line 28) and `_supervisedHandles` (handle→advised, line 32). `_dataWriter` is built **once** in `BuildProductionRuntimeAsync` (255-257) and reused across reconnects. On reconnect, `ReopenAsync` (291-296) calls `_ownedMxSession.RecreateAsync` — which disposes + rebuilds the session, so all prior gw item handles are **dead**. The writer's caches still hold the stale handles, so a write right after a reconnect AddItem-skips and uses a dead handle. Add a cache-invalidation seam called on reopen.
**Step 1: Failing test — invalidation forces a fresh AddItem on the next write**
In the existing `GatewayGalaxyDataWriter` test class (it already constructs the writer over a fake/owned session — reuse that harness), add:
```csharp
[Fact]
public async Task InvalidateHandleCaches_forces_AddItem_again_on_next_write()
{
var (writer, fakeSession) = BuildWriterWithFakeSession(); // reuse existing harness
await writer.WriteAsync(new[] { new WriteRequest("Tag.Attr", 1) }, _ => SecurityClassification.Operate, default);
fakeSession.AddItemCallCount.ShouldBe(1);
writer.InvalidateHandleCaches();
await writer.WriteAsync(new[] { new WriteRequest("Tag.Attr", 2) }, _ => SecurityClassification.Operate, default);
fakeSession.AddItemCallCount.ShouldBe(2); // re-AddItem'd because the cache was cleared
}
```
If the existing fake session doesn't count `AddItem`, either extend it with a counter or assert via an internal seam on the writer (`internal int CachedItemHandleCount => _itemHandles.Count;`): assert `>0` after a write, `0` after `InvalidateHandleCaches()`. Pick whichever the existing harness makes cleanest.
**Step 2: Run — expect FAIL (no such method)**
```bash
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests --filter "FullyQualifiedName~InvalidateHandleCaches"
```
**Step 3: Add the method to the interface + impl + decorator**
- `IGalaxyDataWriter`: add
```csharp
/// <summary>Drop cached gw item handles + supervisory-advise state. Call after a session
/// reconnect — the old handles are dead, so the next write must re-AddItem + re-advise.</summary>
void InvalidateHandleCaches();
```
- `GatewayGalaxyDataWriter`:
```csharp
public void InvalidateHandleCaches()
{
_itemHandles.Clear();
_supervisedHandles.Clear();
}
```
- `TracedGalaxyDataWriter` (decorator): delegate — `public void InvalidateHandleCaches() => _inner.InvalidateHandleCaches();` (match the field name the decorator uses for its inner writer).
**Step 4: Call it on reopen**
In `GalaxyDriver.ReopenAsync` (291-296), after `RecreateAsync` succeeds:
```csharp
private async Task ReopenAsync(CancellationToken cancellationToken)
{
if (_ownedMxSession is null) return;
var clientOptions = BuildClientOptions(_options.Gateway);
await _ownedMxSession.RecreateAsync(clientOptions, cancellationToken).ConfigureAwait(false);
// The recreated session invalidates every prior gw item handle; drop the writer's handle/advise
// caches so the next write re-AddItems + re-AdviseSupervisory against the fresh session.
_dataWriter?.InvalidateHandleCaches();
}
```
(`_dataWriter` is the `TracedGalaxyDataWriter`-wrapped field set at 255-257; the call flows through to the inner `GatewayGalaxyDataWriter`.)
**Step 5: Run — expect PASS, run the full Galaxy suite, build**
```bash
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests --filter "FullyQualifiedName~InvalidateHandleCaches"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests
```
**Step 6: Commit (stage by path)**
```bash
git add src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/ tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/
git commit -m "fix(galaxy): invalidate writer handle caches on session reconnect"
```
---
## Final verification (after T1T6)
```bash
dotnet build ZB.MOM.WW.OtOpcUa.slnx # 0 errors (TreatWarningsAsErrors)
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests
```
These are all unit-level changes (no live docker-dev `/run` gate needed — none touch Razor/JS or the wire protocol behaviourally; the fast-fail paths and cache changes are exercised by Akka.TestKit + unit tests). Then finish via **superpowers-extended-cc:finishing-a-development-branch** (intent: merge-to-master + push). After merge, update `pending.md` to mark #3, #4, #6 resolved.
## Out of scope (left in pending.md)
- #4 sub-item: FOCAS — covered (T3). The "surface real device-write status" item (#5) is a **separate** chosen-later effort, not part of this cluster.
- #7 driver-reconfigure-while-faulted — its own high-risk actor-state-machine design/plan.
- #1/#2 Phase B/C; #8 rig cleanups — unchanged.