Files
lmxopcua/docs/plans/2026-06-14-harden-milestone-1b-plan.md
T
Joseph Doherty 945c238039
v2-ci / build (push) Failing after 48s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
docs(plans): harden-milestone-1b plan + task record
2026-06-14 00:51:05 -04:00

30 KiB
Raw Blame History

Harden Milestone 1b — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.

Goal: Close out the review nits left from the just-shipped Milestone 1b (equipment-tag live values + inbound operator-write pipeline) — fast-fail degraded-state writes, two micro-optimisations, a FOCAS equipment-tag parse-cache gap, a benign Galaxy dead-letter, a Galaxy reconnect stale-handle gap, two test-hardening additions, and one deployment-facing docs note.

Architecture: Pure hardening — no new feature surface, no EF/Configuration schema change, no migration. Changes are additive fast-fail handlers on existing actor behaviours, a thread-safe cache swap, a writer cache-invalidation seam on the Galaxy reconnect path, test additions, and a docs subsection. Source of these items: pending.md follow-ups #3, #4, #6 (from the Milestone 1b integration review), all enumerated below.

Tech Stack: C# / .NET 10, Akka.NET (DriverHostActor / DriverInstanceActor), xUnit + Shouldly + Akka.TestKit. TreatWarningsAsErrors is on solution-wide.

Branch: chore/harden-milestone-1b off master (HEAD 02e37a6d).

Hard rules (every task): stage by path (never git add .); never stage sql_login.txt / src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/ / pending.md / current.md / docker-dev/docker-compose.yml; never echo secrets; no force-push / no --no-verify; NO Configuration entity / EF migration change. Done = build clean + dotnet test green for every touched project.


Task DAG

T0 (branch)
 ├─ T1  DriverHostActor: Stale fast-fail + micro-opts + raw-blob routing test   [standard]
 ├─ T2  DriverInstanceActor: Connecting/Reconnecting/Stubbed write fast-fail
 │       + benign SubscriptionEstablished handler                                [standard]
 ├─ T3  FOCAS: cache equipment-tag parsed addresses (ConcurrentDictionary)       [small]
 ├─ T4  Parity test: future-enum trap InlineData(2,false)                        [trivial]
 ├─ T5  docs/security.md: data-plane GroupToRole requirement note                [trivial]
 └─ T6  Galaxy writer: invalidate handle caches on reconnect                     [small]

T1T6 all blockedBy T0 and have disjoint file sets — they are mutually Parallelizable. Dispatch them concurrently after T0.


Task T0: Create feature branch

Classification: trivial Estimated implement time: ~1 min Parallelizable with: none (gates all)

Files: none (git only)

Step 1: Branch off master

git -C /Users/dohertj2/Desktop/OtOpcUa checkout -b chore/harden-milestone-1b master
git -C /Users/dohertj2/Desktop/OtOpcUa rev-parse --abbrev-ref HEAD   # expect: chore/harden-milestone-1b

Do NOT touch the working-tree docker-dev/docker-compose.yml (leave it modified+unstaged throughout).


Task T1: DriverHostActor — Stale-state write fast-fail + micro-opts + raw-blob routing test

Classification: standard Estimated implement time: ~5 min Parallelizable with: T2, T3, T4, T5, T6

Addresses pending.md #4(a-host), #4(b), #4(d).

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs (Stale behaviour ~537-550; HandleRouteNodeWrite ContinueWith line 519; forward-map build line 754-756)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorWriteRoutingTests.cs

Context: RouteNodeWrite/NodeWriteResult are defined at DriverHostActor.cs:130,136. The running behaviour wires HandleRouteNodeWrite, but the Stale() behaviour (DB unreachable; lines 537-550) has no RouteNodeWrite handler → an inbound operator write while Stale dead-letters and the node-manager's 10 s Ask times out with a generic log (the client already got the optimistic Good). Add a fast-fail. Two micro-opts ride along in the same file.

Step 1: Add the failing Akka.TestKit test for Stale fast-fail

In DriverHostActorWriteRoutingTests.cs, add a test that drives the host into Stale (construct it with a DB factory that throws, or send whatever the existing tests use to reach Stale — mirror an existing Stale-path test in the file/suite) and asserts an immediate negative NodeWriteResult:

[Fact]
public void RouteNodeWrite_while_stale_fast_fails_instead_of_timing_out()
{
    // Arrange: a host that cannot reach the config DB enters Stale on first DispatchDeployment.
    var host = /* spawn DriverHostActor with a throwing IDbContextFactory — reuse the suite's Stale helper */;
    DriveIntoStale(host);

    // Act
    host.Tell(new DriverHostActor.RouteNodeWrite("ns=2;s=EQ-x/Tag", 42), TestActor);
    var reply = ExpectMsg<DriverHostActor.NodeWriteResult>(TimeSpan.FromSeconds(2));

    // Assert: fast negative, not a ~10s timeout
    reply.Success.ShouldBeFalse();
    reply.Reason.ShouldContain("stale");
}

If the suite has no existing Stale helper, reach Stale the same way the existing "ignoring DispatchDeployment while Stale" coverage does; if none exists, construct the actor with a IDbContextFactory stub whose CreateDbContext throws and send the message that triggers the DB read. Keep the helper local to the test file.

Step 2: Run it — expect FAIL (times out / dead-letters, no reply)

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests \
  --filter "FullyQualifiedName~RouteNodeWrite_while_stale_fast_fails"

Step 3: Add the Stale fast-fail handler

In DriverHostActor.cs Stale() (after the existing Receive<RedundancyStateChanged> line 547), add:

// An inbound operator write can't be serviced while the config DB is unreachable — fast-fail so the
// node-manager's bounded Ask gets an immediate clear status instead of dead-lettering into a timeout.
Receive<RouteNodeWrite>(_ =>
    Sender.Tell(new NodeWriteResult(false, "driver host stale (config DB unreachable)")));

Step 4: Micro-opt — drop ExecuteSynchronously on the router continuation

In HandleRouteNodeWrite (line 514-520), the .ContinueWith(...) passes TaskContinuationOptions.ExecuteSynchronously. Running the continuation synchronously on the Ask's completing thread (a pool/transport thread) is needless coupling for a tiny projection. Change it to TaskContinuationOptions.None:

            .ContinueWith(
                t => t.IsCompletedSuccessfully
                    ? new NodeWriteResult(t.Result.Success, t.Result.Reason)
                    : new NodeWriteResult(false, "write timeout"),
                CancellationToken.None,
                TaskContinuationOptions.None,
                TaskScheduler.Default)
            .PipeTo(replyTo);

Step 5: Micro-opt — List.ContainsHashSet in the forward-map build

The forward map _nodeIdByDriverRef (declared as Dictionary<.., List<string>>) does if (!list.Contains(nodeId)) list.Add(nodeId); at line 756 — O(n) per node. Convert the value type to HashSet<string>:

  • Change the field declaration to Dictionary<(string,string), HashSet<string>> (keep StringComparer.Ordinal on the inner set).
  • Line 754-756 becomes:
    if (!_nodeIdByDriverRef.TryGetValue(key, out var set))
        _nodeIdByDriverRef[key] = set = new HashSet<string>(StringComparer.Ordinal);
    set.Add(nodeId);
    
  • Grep every consumer of _nodeIdByDriverRef (grep -n _nodeIdByDriverRef src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs) and confirm they only foreach/Count/membership-check the value (all of which HashSet supports). If any consumer indexes by position (list[i]) leave it a List and skip this micro-opt — note that in the commit message. (Expected: ForwardToMux only enumerates.)

Step 6: Raw protocol-blob routing test (#4d)

The suite seeds equipment tags with TagConfig = JsonSerializer.Serialize(new { FullName = t.FullName }) (a Galaxy-style {FullName} blob). The router keys purely on NodeId, so the TagConfig blob shape is irrelevant to routing — prove that with a raw protocol blob. Add a tag to the seeded composition (or a new focused test) whose TagConfig is a raw Modbus-style blob and assert the write still routes to the owning driver by NodeId:

[Fact]
public void RouteNodeWrite_routes_by_nodeId_regardless_of_raw_tagconfig_blob_shape()
{
    // Seed an equipment tag whose TagConfig is a RAW protocol blob (region/address), NOT {FullName}.
    var rawBlob = JsonSerializer.Serialize(new { region = "HoldingRegister", address = 200, dataType = "UInt16" });
    var host = SeedHostWithEquipmentTag(driverId: "MAIN-modbus-eq", tagConfig: rawBlob, /* nodeId derived as in the suite */);

    host.Tell(new DriverHostActor.RouteNodeWrite(expectedNodeId, (ushort)7), TestActor);

    // The child receives WriteAttribute carrying the wire-ref FullName (the blob is the wire-ref here);
    // assert via the suite's existing child probe that the write was forwarded (Success path).
    var reply = ExpectMsg<DriverHostActor.NodeWriteResult>();
    reply.Success.ShouldBeTrue();
}

Reuse the suite's existing child-probe / seed helpers (the file already exercises RouteNodeWrite happy-path — clone that setup and only swap the TagConfig to the raw blob). The point is belt-and-suspenders: routing is blob-shape-independent.

Step 7: Run the full routing suite — expect PASS

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverHostActorWriteRouting"

Step 8: Build + commit (stage by path)

git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorWriteRoutingTests.cs
git commit -m "fix(runtime): fast-fail RouteNodeWrite while Stale + micro-opts + raw-blob routing test"

Task T2: DriverInstanceActor — write fast-fail in degraded states + benign SubscriptionEstablished handler

Classification: standard Estimated implement time: ~5 min Parallelizable with: T1, T3, T4, T5, T6

Addresses pending.md #4(a-instance) and #6(a).

Files:

  • Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs (Stubbed ~205-209; Connecting() 211-231; Reconnecting() 265-281; Connected() 233-263)
  • Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorWriteAndSubscribeTests.cs (NEW file — avoids touching T1's test file)

Context:

  • WriteAttribute/WriteAttributeResult are at lines 41-42. Only Connected() handles WriteAttribute (line 252). When the actor is Connecting/Reconnecting/Stubbed, an inbound WriteAttribute (Asked by DriverHostActor.HandleRouteNodeWrite with an 8 s timeout) dead-letters → the Ask times out as a generic "write timeout". Fast-fail in those states with a clear reason.
  • ResubscribeDesired() (line 437-443) does Self.Tell(new Subscribe(...)); HandleSubscribeAsync replies SubscriptionEstablished to replyTo = Sender = Self (line 379) → no handler → the benign dead-letter Message [SubscriptionEstablished] from drv-… to drv-… was unhandled seen live on Galaxy. This is the generic actor, so a no-op handler quiets it for every driver.

Step 1: Failing tests (new file)

public sealed class DriverInstanceActorWriteAndSubscribeTests : TestKit
{
    [Fact]
    public void WriteAttribute_while_connecting_fast_fails()
    {
        var actor = SpawnInConnectingState();   // see helper note below
        actor.Tell(new DriverInstanceActor.WriteAttribute("Tag.Attr", 1), TestActor);
        var r = ExpectMsg<DriverInstanceActor.WriteAttributeResult>(TimeSpan.FromSeconds(2));
        r.Success.ShouldBeFalse();
        r.Reason.ShouldNotBeNull();
    }

    [Fact]
    public void SubscriptionEstablished_is_swallowed_without_deadletter()
    {
        var actor = SpawnConnected();
        // subscribe with a desired set so ResubscribeDesired self-Tells Subscribe and the reply lands on Self
        // Subscribe to DeadLetters and assert none of type SubscriptionEstablished arrive.
        var deadLetters = CreateTestProbe();
        Sys.EventStream.Subscribe(deadLetters, typeof(DeadLetter));
        // drive a (re)subscribe ... then:
        deadLetters.ExpectNoMsg(TimeSpan.FromMilliseconds(500));
    }
}

Helper note: reuse the spawn/connect helpers the existing DriverInstanceActor tests use (look for them in the Runtime.Tests Drivers/ folder; there is established setup for stub drivers and the Connected/Reconnecting transitions). To reach Connecting, send InitializeRequested with a driver-config that the stub keeps pending, or Become-drive via the suite's existing harness. If reaching exact internal states is awkward, the minimum viable assertions are: (a) a WriteAttribute to a Stubbed actor returns a negative WriteAttributeResult quickly, and (b) no SubscriptionEstablished dead-letter is produced during a connect+resubscribe cycle.

Step 2: Run — expect FAIL

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverInstanceActorWriteAndSubscribe"

Step 3: Add write fast-fail to the three degraded behaviours

Add this line to Stubbed (after line 208), Connecting() (after line 230), and Reconnecting() (after line 279):

Receive<WriteAttribute>(_ =>
    Sender.Tell(new WriteAttributeResult(false, "driver not connected")));

(Stubbed may use a reason like "driver stubbed" — implementer's call; the host maps either into a NodeWriteResult(false, …).)

Step 4: Add the benign SubscriptionEstablished handler

In Connected() (alongside the other Receives, e.g. after line 262) add a swallow-with-trace so the self-reply from ResubscribeDesired doesn't dead-letter:

// ResubscribeDesired self-Tells Subscribe; HandleSubscribeAsync replies SubscriptionEstablished to the
// sender, which on the self-resubscribe path is Self. Swallow it (trace only) so it doesn't dead-letter.
Receive<SubscriptionEstablished>(msg =>
    _log.Debug("DriverInstance {Id}: subscription established ({Count} refs, {Diag})",
        _driverInstanceId, msg.ReferenceCount, msg.DiagnosticId));

Add the same one-liner to Reconnecting() and Connecting() to cover the race where the async subscribe reply lands after a state flip (cheap, same file). Do NOT alter where/how SubscriptionEstablished is sent — a host that genuinely Asks Subscribe still receives its reply (the handler only runs when the message arrives at this actor as the asker).

Step 5: Run — expect PASS, then build the project

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverInstanceActorWriteAndSubscribe"
dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ZB.MOM.WW.OtOpcUa.Runtime.csproj

Step 6: Commit (stage by path)

git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs \
        tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorWriteAndSubscribeTests.cs
git commit -m "fix(runtime): fast-fail writes in degraded driver states + swallow self SubscriptionEstablished"

Task T3: FOCAS — cache equipment-tag parsed addresses

Classification: small Estimated implement time: ~4 min Parallelizable with: T1, T2, T4, T5, T6

Addresses pending.md #4(c).

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasDriver.cs (cache field 37-38; read path 275-280; write path 337-342; init seed 120)
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/ (add to an existing FocasDriver read/write test class)

Context: _parsedAddressesByTagName (a plain Dictionary, line 37) is seeded only from _options.Tags at init (line 120). Equipment tags arrive via _resolver (line 66) as transient defs whose Name is the raw ref — never seeded — so both ReadAsync (275-280) and WriteAsync (337-342) hit the TryGetValue miss and re-parse FocasAddress.TryParse(def.Address) every call without storing. Fix: make the cache a ConcurrentDictionary (reads run on the poll loop, writes on the host Ask thread — concurrent access) and GetOrAdd so a transient def's address is parsed once.

Step 1: Failing test — second resolve must reuse the cache

Add an internal test seam to FocasDriver so the test can observe caching without a parser spy:

// In FocasDriver (internal, for tests via InternalsVisibleTo — confirm the attribute exists for the .Tests asm)
internal bool IsParsedAddressCached(string reference) => _parsedAddressesByTagName.ContainsKey(reference);

Test (equipment-tag ref, not an authored tag):

[Fact]
public async Task Equipment_tag_address_is_parsed_once_and_cached()
{
    var driver = /* construct FocasDriver with a fake client + one device, no authored Tags */;
    await driver.InitializeAsync(configJson, default);
    var equipRef = "{\"address\":\"R100\",\"dataType\":\"Int16\",\"deviceHostAddress\":\"focas://10.0.0.5\",\"writable\":true}"; // matches FocasEquipmentTagParser keys

    driver.IsParsedAddressCached(equipRef).ShouldBeFalse();           // not seeded at init
    await driver.ReadAsync(new[] { equipRef }, default);              // first resolve parses + caches
    driver.IsParsedAddressCached(equipRef).ShouldBeTrue();            // cached for subsequent hot calls
}

Use the exact equipment-ref JSON shape that FocasEquipmentTagParser.TryParse accepts (open that parser to copy its camelCase keys + required fields). If InternalsVisibleTo for the .Tests assembly isn't already present on the FOCAS driver project, add it (mirror how the other drivers expose internals to their test projects).

Step 2: Run — expect FAIL (second assert false: never cached)

dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests --filter "FullyQualifiedName~Equipment_tag_address_is_parsed_once"

Step 3: Implement the cache

  • Change the field (line 37-38) to:
    private readonly ConcurrentDictionary<string, FocasAddress> _parsedAddressesByTagName =
        new(StringComparer.OrdinalIgnoreCase);
    
    (add using System.Collections.Concurrent; if absent).
  • Add a private helper:
    private FocasAddress ResolveParsedAddress(FocasTagDefinition def) =>
        _parsedAddressesByTagName.GetOrAdd(def.Name, _ =>
            FocasAddress.TryParse(def.Address)
                ?? throw new InvalidOperationException(
                    $"FOCAS tag '{def.Name}' has malformed Address '{def.Address}'."));
    
    (The throwing factory propagates and does not cache a bad parse — preserving the existing fail behaviour. GetOrAdd may invoke the factory more than once under race but only stores one — fine, parsing is pure.)
  • Replace the read-path block (275-280) and the write-path block (337-342) with var parsed = ResolveParsedAddress(def);.
  • Line 120 (_parsedAddressesByTagName[tag.Name] = parsed;) compiles unchanged on ConcurrentDictionary; keep it (init pre-validates + seeds authored tags).

Step 4: Run — expect PASS, build the project

dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests --filter "FullyQualifiedName~Equipment_tag_address_is_parsed_once"
dotnet build src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.csproj

Step 5: Commit (stage by path)

git add src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/
git commit -m "perf(focas): cache equipment-tag parsed addresses (no per-call reparse)"

Task T4: Parity test — future-enum trap

Classification: trivial Estimated implement time: ~2 min Parallelizable with: T1, T2, T3, T5, T6

Addresses pending.md #4(e).

Files:

  • Test only: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactAliasParityTests.cs (numeric-AccessLevel theory ~73-104)

Context: TagAccessLevel has Read = 0, ReadWrite = 1 (src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Enums/TagAccessLevel.cs). The parity test maps numeric AccessLevel → Writable with [InlineData(1, true)]/[InlineData(0, false)]. Add a future-enum guard so an as-yet-undefined value maps fail-safe to non-writable (only ReadWrite==1 should yield a writable node).

Step 1: Add the InlineData row

In the numeric-AccessLevel [Theory] (the one with [InlineData(1, true)] / [InlineData(0, false)]), add:

[InlineData(2, false)]   // future/unknown AccessLevel must map fail-safe to read-only (only ReadWrite⇒writable)

Step 2: Run — expect PASS (confirms the derivation is == ReadWrite, not != Read)

dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DeploymentArtifactAliasParity"

If it FAILS, the production derivation is using != Read (so 2 wrongly maps to writable) — that is a real fail-safety bug; STOP and surface it (the fix would be a one-line == TagAccessLevel.ReadWrite in both Phase7Composer and DeploymentArtifact, but only with reviewer sign-off since it touches the byte-parity contract).

Step 3: Commit (stage by path)

git add tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactAliasParityTests.cs
git commit -m "test(runtime): future-enum trap for AccessLevel→Writable parity"

Task T5: docs/security.md — data-plane GroupToRole requirement

Classification: trivial Estimated implement time: ~3 min Parallelizable with: T1, T2, T3, T4, T6

Addresses pending.md #3.

Files:

  • Modify: docs/security.md (Data-Plane Authorization section starts line 167; GroupToRole is described at line 163 in the LDAP section; troubleshooting at 313-315)

Context: OPC UA session roles are unioned from two sources: the DB LdapGroupRoleMapping (whose Role column is the AdminRole enum — Administrator/Designer/Viewer only) AND the appsettings Security:Ldap:GroupToRole (free-form string→string). The data-plane gates (WriteOperate, WriteTune, WriteConfigure, AlarmAck/AlarmAcknowledge, ReadOnly) read literal role strings that the AdminRole-typed DB mapping cannot produce — so a deployment MUST map its data-plane LDAP groups → data-plane role strings via GroupToRole, or write-through and OPC UA alarm-ack are silently inert (every write → BadUserAccessDenied, default-deny). Line 163 currently frames GroupToRole as mapping only to Admin roles — that's the misleading bit to correct.

Step 1: Add a "Role grant source (data-plane)" subsection

Under ## Data-Plane Authorization (after the "Full model" subsection ~line 241, or right after the "Permission flags" mapping ~216), add a subsection. Suggested content (match the doc's voice; keep it tight):

### Role grant source (data-plane)

Data-plane roles come from `Security:Ldap:GroupToRole` (appsettings), **not** from the Config-DB
`LdapGroupRoleMapping` table. That table's `Role` column is the `AdminRole` enum
(`Administrator`/`Designer`/`Viewer`) and supplies **control-plane** roles only — it cannot emit the
data-plane role strings the OPC UA gates read (`ReadOnly`, `WriteOperate`, `WriteTune`,
`WriteConfigure`, `AlarmAcknowledge`). A deployment therefore **must** map its data-plane LDAP groups
to those role strings via `GroupToRole`, e.g.:

```json
"GroupToRole": {
  "ot-operators":  "WriteOperate",
  "ot-tuners":     "WriteTune",
  "ot-engineers":  "WriteConfigure",
  "ot-alarm-ack":  "AlarmAcknowledge",
  "ot-readonly":   "ReadOnly"
}

If this mapping is absent the data-plane evaluator is strictly default-deny: inbound operator writes and OPC UA Part-9 alarm acknowledgement all return BadUserAccessDenied even for users who authenticate successfully. (The same requirement gates the pre-existing scripted-alarm ack path.)


Also tighten line 163 (the `GroupToRole` description in the LDAP section) so it notes the values are control-plane **and** data-plane role strings, cross-referencing the new subsection — a one-clause edit, not a rewrite.

**Step 2: Verify the doc renders / no broken anchors, then commit**

```bash
git add docs/security.md
git commit -m "docs(security): document the GroupToRole data-plane role requirement"

(No test; verification = re-read the section for accuracy against the permission-flag names at lines 186-216.)


Task T6: Galaxy writer — invalidate handle caches on reconnect

Classification: small Estimated implement time: ~5 min Parallelizable with: T1, T2, T3, T4, T5

Addresses pending.md #6(b).

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/IGalaxyDataWriter.cs (interface — add method)
  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GatewayGalaxyDataWriter.cs (caches 28-32; implement clear)
  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/TracedGalaxyDataWriter.cs (decorator — delegate)
  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs (ReopenAsync 291-296; _dataWriter set at 255-257)
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/Runtime/ (GatewayGalaxyDataWriter test class)

Context: GatewayGalaxyDataWriter caches _itemHandles (ref→gw item handle, line 28) and _supervisedHandles (handle→advised, line 32). _dataWriter is built once in BuildProductionRuntimeAsync (255-257) and reused across reconnects. On reconnect, ReopenAsync (291-296) calls _ownedMxSession.RecreateAsync — which disposes + rebuilds the session, so all prior gw item handles are dead. The writer's caches still hold the stale handles, so a write right after a reconnect AddItem-skips and uses a dead handle. Add a cache-invalidation seam called on reopen.

Step 1: Failing test — invalidation forces a fresh AddItem on the next write

In the existing GatewayGalaxyDataWriter test class (it already constructs the writer over a fake/owned session — reuse that harness), add:

[Fact]
public async Task InvalidateHandleCaches_forces_AddItem_again_on_next_write()
{
    var (writer, fakeSession) = BuildWriterWithFakeSession();   // reuse existing harness
    await writer.WriteAsync(new[] { new WriteRequest("Tag.Attr", 1) }, _ => SecurityClassification.Operate, default);
    fakeSession.AddItemCallCount.ShouldBe(1);

    writer.InvalidateHandleCaches();

    await writer.WriteAsync(new[] { new WriteRequest("Tag.Attr", 2) }, _ => SecurityClassification.Operate, default);
    fakeSession.AddItemCallCount.ShouldBe(2);   // re-AddItem'd because the cache was cleared
}

If the existing fake session doesn't count AddItem, either extend it with a counter or assert via an internal seam on the writer (internal int CachedItemHandleCount => _itemHandles.Count;): assert >0 after a write, 0 after InvalidateHandleCaches(). Pick whichever the existing harness makes cleanest.

Step 2: Run — expect FAIL (no such method)

dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests --filter "FullyQualifiedName~InvalidateHandleCaches"

Step 3: Add the method to the interface + impl + decorator

  • IGalaxyDataWriter: add
    /// <summary>Drop cached gw item handles + supervisory-advise state. Call after a session
    /// reconnect — the old handles are dead, so the next write must re-AddItem + re-advise.</summary>
    void InvalidateHandleCaches();
    
  • GatewayGalaxyDataWriter:
    public void InvalidateHandleCaches()
    {
        _itemHandles.Clear();
        _supervisedHandles.Clear();
    }
    
  • TracedGalaxyDataWriter (decorator): delegate — public void InvalidateHandleCaches() => _inner.InvalidateHandleCaches(); (match the field name the decorator uses for its inner writer).

Step 4: Call it on reopen

In GalaxyDriver.ReopenAsync (291-296), after RecreateAsync succeeds:

private async Task ReopenAsync(CancellationToken cancellationToken)
{
    if (_ownedMxSession is null) return;
    var clientOptions = BuildClientOptions(_options.Gateway);
    await _ownedMxSession.RecreateAsync(clientOptions, cancellationToken).ConfigureAwait(false);
    // The recreated session invalidates every prior gw item handle; drop the writer's handle/advise
    // caches so the next write re-AddItems + re-AdviseSupervisory against the fresh session.
    _dataWriter?.InvalidateHandleCaches();
}

(_dataWriter is the TracedGalaxyDataWriter-wrapped field set at 255-257; the call flows through to the inner GatewayGalaxyDataWriter.)

Step 5: Run — expect PASS, run the full Galaxy suite, build

dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests --filter "FullyQualifiedName~InvalidateHandleCaches"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests

Step 6: Commit (stage by path)

git add src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/ tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/
git commit -m "fix(galaxy): invalidate writer handle caches on session reconnect"

Final verification (after T1T6)

dotnet build ZB.MOM.WW.OtOpcUa.slnx                 # 0 errors (TreatWarningsAsErrors)
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests

These are all unit-level changes (no live docker-dev /run gate needed — none touch Razor/JS or the wire protocol behaviourally; the fast-fail paths and cache changes are exercised by Akka.TestKit + unit tests). Then finish via superpowers-extended-cc:finishing-a-development-branch (intent: merge-to-master + push). After merge, update pending.md to mark #3, #4, #6 resolved.

Out of scope (left in pending.md)

  • #4 sub-item: FOCAS — covered (T3). The "surface real device-write status" item (#5) is a separate chosen-later effort, not part of this cluster.
  • #7 driver-reconfigure-while-faulted — its own high-risk actor-state-machine design/plan.
  • #1/#2 Phase B/C; #8 rig cleanups — unchanged.