30 KiB
Harden Milestone 1b — Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
Goal: Close out the review nits left from the just-shipped Milestone 1b (equipment-tag live values + inbound operator-write pipeline) — fast-fail degraded-state writes, two micro-optimisations, a FOCAS equipment-tag parse-cache gap, a benign Galaxy dead-letter, a Galaxy reconnect stale-handle gap, two test-hardening additions, and one deployment-facing docs note.
Architecture: Pure hardening — no new feature surface, no EF/Configuration schema change, no migration. Changes are additive fast-fail handlers on existing actor behaviours, a thread-safe cache swap, a writer cache-invalidation seam on the Galaxy reconnect path, test additions, and a docs subsection. Source of these items: pending.md follow-ups #3, #4, #6 (from the Milestone 1b integration review), all enumerated below.
Tech Stack: C# / .NET 10, Akka.NET (DriverHostActor / DriverInstanceActor), xUnit + Shouldly + Akka.TestKit. TreatWarningsAsErrors is on solution-wide.
Branch: chore/harden-milestone-1b off master (HEAD 02e37a6d).
Hard rules (every task): stage by path (never git add .); never stage sql_login.txt / src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/ / pending.md / current.md / docker-dev/docker-compose.yml; never echo secrets; no force-push / no --no-verify; NO Configuration entity / EF migration change. Done = build clean + dotnet test green for every touched project.
Task DAG
T0 (branch)
├─ T1 DriverHostActor: Stale fast-fail + micro-opts + raw-blob routing test [standard]
├─ T2 DriverInstanceActor: Connecting/Reconnecting/Stubbed write fast-fail
│ + benign SubscriptionEstablished handler [standard]
├─ T3 FOCAS: cache equipment-tag parsed addresses (ConcurrentDictionary) [small]
├─ T4 Parity test: future-enum trap InlineData(2,false) [trivial]
├─ T5 docs/security.md: data-plane GroupToRole requirement note [trivial]
└─ T6 Galaxy writer: invalidate handle caches on reconnect [small]
T1–T6 all blockedBy T0 and have disjoint file sets — they are mutually Parallelizable. Dispatch them concurrently after T0.
Task T0: Create feature branch
Classification: trivial Estimated implement time: ~1 min Parallelizable with: none (gates all)
Files: none (git only)
Step 1: Branch off master
git -C /Users/dohertj2/Desktop/OtOpcUa checkout -b chore/harden-milestone-1b master
git -C /Users/dohertj2/Desktop/OtOpcUa rev-parse --abbrev-ref HEAD # expect: chore/harden-milestone-1b
Do NOT touch the working-tree docker-dev/docker-compose.yml (leave it modified+unstaged throughout).
Task T1: DriverHostActor — Stale-state write fast-fail + micro-opts + raw-blob routing test
Classification: standard Estimated implement time: ~5 min Parallelizable with: T2, T3, T4, T5, T6
Addresses pending.md #4(a-host), #4(b), #4(d).
Files:
- Modify:
src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs(Stale behaviour ~537-550;HandleRouteNodeWriteContinueWith line 519; forward-map build line 754-756) - Test:
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorWriteRoutingTests.cs
Context: RouteNodeWrite/NodeWriteResult are defined at DriverHostActor.cs:130,136. The running behaviour wires HandleRouteNodeWrite, but the Stale() behaviour (DB unreachable; lines 537-550) has no RouteNodeWrite handler → an inbound operator write while Stale dead-letters and the node-manager's 10 s Ask times out with a generic log (the client already got the optimistic Good). Add a fast-fail. Two micro-opts ride along in the same file.
Step 1: Add the failing Akka.TestKit test for Stale fast-fail
In DriverHostActorWriteRoutingTests.cs, add a test that drives the host into Stale (construct it with a DB factory that throws, or send whatever the existing tests use to reach Stale — mirror an existing Stale-path test in the file/suite) and asserts an immediate negative NodeWriteResult:
[Fact]
public void RouteNodeWrite_while_stale_fast_fails_instead_of_timing_out()
{
// Arrange: a host that cannot reach the config DB enters Stale on first DispatchDeployment.
var host = /* spawn DriverHostActor with a throwing IDbContextFactory — reuse the suite's Stale helper */;
DriveIntoStale(host);
// Act
host.Tell(new DriverHostActor.RouteNodeWrite("ns=2;s=EQ-x/Tag", 42), TestActor);
var reply = ExpectMsg<DriverHostActor.NodeWriteResult>(TimeSpan.FromSeconds(2));
// Assert: fast negative, not a ~10s timeout
reply.Success.ShouldBeFalse();
reply.Reason.ShouldContain("stale");
}
If the suite has no existing Stale helper, reach Stale the same way the existing "ignoring DispatchDeployment while Stale" coverage does; if none exists, construct the actor with a IDbContextFactory stub whose CreateDbContext throws and send the message that triggers the DB read. Keep the helper local to the test file.
Step 2: Run it — expect FAIL (times out / dead-letters, no reply)
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests \
--filter "FullyQualifiedName~RouteNodeWrite_while_stale_fast_fails"
Step 3: Add the Stale fast-fail handler
In DriverHostActor.cs Stale() (after the existing Receive<RedundancyStateChanged> line 547), add:
// An inbound operator write can't be serviced while the config DB is unreachable — fast-fail so the
// node-manager's bounded Ask gets an immediate clear status instead of dead-lettering into a timeout.
Receive<RouteNodeWrite>(_ =>
Sender.Tell(new NodeWriteResult(false, "driver host stale (config DB unreachable)")));
Step 4: Micro-opt — drop ExecuteSynchronously on the router continuation
In HandleRouteNodeWrite (line 514-520), the .ContinueWith(...) passes TaskContinuationOptions.ExecuteSynchronously. Running the continuation synchronously on the Ask's completing thread (a pool/transport thread) is needless coupling for a tiny projection. Change it to TaskContinuationOptions.None:
.ContinueWith(
t => t.IsCompletedSuccessfully
? new NodeWriteResult(t.Result.Success, t.Result.Reason)
: new NodeWriteResult(false, "write timeout"),
CancellationToken.None,
TaskContinuationOptions.None,
TaskScheduler.Default)
.PipeTo(replyTo);
Step 5: Micro-opt — List.Contains → HashSet in the forward-map build
The forward map _nodeIdByDriverRef (declared as Dictionary<.., List<string>>) does if (!list.Contains(nodeId)) list.Add(nodeId); at line 756 — O(n) per node. Convert the value type to HashSet<string>:
- Change the field declaration to
Dictionary<(string,string), HashSet<string>>(keepStringComparer.Ordinalon the inner set). - Line 754-756 becomes:
if (!_nodeIdByDriverRef.TryGetValue(key, out var set)) _nodeIdByDriverRef[key] = set = new HashSet<string>(StringComparer.Ordinal); set.Add(nodeId); - Grep every consumer of
_nodeIdByDriverRef(grep -n _nodeIdByDriverRef src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs) and confirm they onlyforeach/Count/membership-check the value (all of whichHashSetsupports). If any consumer indexes by position (list[i]) leave it aListand skip this micro-opt — note that in the commit message. (Expected: ForwardToMux only enumerates.)
Step 6: Raw protocol-blob routing test (#4d)
The suite seeds equipment tags with TagConfig = JsonSerializer.Serialize(new { FullName = t.FullName }) (a Galaxy-style {FullName} blob). The router keys purely on NodeId, so the TagConfig blob shape is irrelevant to routing — prove that with a raw protocol blob. Add a tag to the seeded composition (or a new focused test) whose TagConfig is a raw Modbus-style blob and assert the write still routes to the owning driver by NodeId:
[Fact]
public void RouteNodeWrite_routes_by_nodeId_regardless_of_raw_tagconfig_blob_shape()
{
// Seed an equipment tag whose TagConfig is a RAW protocol blob (region/address), NOT {FullName}.
var rawBlob = JsonSerializer.Serialize(new { region = "HoldingRegister", address = 200, dataType = "UInt16" });
var host = SeedHostWithEquipmentTag(driverId: "MAIN-modbus-eq", tagConfig: rawBlob, /* nodeId derived as in the suite */);
host.Tell(new DriverHostActor.RouteNodeWrite(expectedNodeId, (ushort)7), TestActor);
// The child receives WriteAttribute carrying the wire-ref FullName (the blob is the wire-ref here);
// assert via the suite's existing child probe that the write was forwarded (Success path).
var reply = ExpectMsg<DriverHostActor.NodeWriteResult>();
reply.Success.ShouldBeTrue();
}
Reuse the suite's existing child-probe / seed helpers (the file already exercises RouteNodeWrite happy-path — clone that setup and only swap the TagConfig to the raw blob). The point is belt-and-suspenders: routing is blob-shape-independent.
Step 7: Run the full routing suite — expect PASS
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverHostActorWriteRouting"
Step 8: Build + commit (stage by path)
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs \
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverHostActorWriteRoutingTests.cs
git commit -m "fix(runtime): fast-fail RouteNodeWrite while Stale + micro-opts + raw-blob routing test"
Task T2: DriverInstanceActor — write fast-fail in degraded states + benign SubscriptionEstablished handler
Classification: standard Estimated implement time: ~5 min Parallelizable with: T1, T3, T4, T5, T6
Addresses pending.md #4(a-instance) and #6(a).
Files:
- Modify:
src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs(Stubbed ~205-209;Connecting()211-231;Reconnecting()265-281;Connected()233-263) - Test:
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorWriteAndSubscribeTests.cs(NEW file — avoids touching T1's test file)
Context:
WriteAttribute/WriteAttributeResultare at lines 41-42. OnlyConnected()handlesWriteAttribute(line 252). When the actor isConnecting/Reconnecting/Stubbed, an inboundWriteAttribute(Asked byDriverHostActor.HandleRouteNodeWritewith an 8 s timeout) dead-letters → the Ask times out as a generic "write timeout". Fast-fail in those states with a clear reason.ResubscribeDesired()(line 437-443) doesSelf.Tell(new Subscribe(...));HandleSubscribeAsyncrepliesSubscriptionEstablishedtoreplyTo = Sender = Self(line 379) → no handler → the benign dead-letterMessage [SubscriptionEstablished] from drv-… to drv-… was unhandledseen live on Galaxy. This is the generic actor, so a no-op handler quiets it for every driver.
Step 1: Failing tests (new file)
public sealed class DriverInstanceActorWriteAndSubscribeTests : TestKit
{
[Fact]
public void WriteAttribute_while_connecting_fast_fails()
{
var actor = SpawnInConnectingState(); // see helper note below
actor.Tell(new DriverInstanceActor.WriteAttribute("Tag.Attr", 1), TestActor);
var r = ExpectMsg<DriverInstanceActor.WriteAttributeResult>(TimeSpan.FromSeconds(2));
r.Success.ShouldBeFalse();
r.Reason.ShouldNotBeNull();
}
[Fact]
public void SubscriptionEstablished_is_swallowed_without_deadletter()
{
var actor = SpawnConnected();
// subscribe with a desired set so ResubscribeDesired self-Tells Subscribe and the reply lands on Self
// Subscribe to DeadLetters and assert none of type SubscriptionEstablished arrive.
var deadLetters = CreateTestProbe();
Sys.EventStream.Subscribe(deadLetters, typeof(DeadLetter));
// drive a (re)subscribe ... then:
deadLetters.ExpectNoMsg(TimeSpan.FromMilliseconds(500));
}
}
Helper note: reuse the spawn/connect helpers the existing DriverInstanceActor tests use (look for them in the Runtime.Tests Drivers/ folder; there is established setup for stub drivers and the Connected/Reconnecting transitions). To reach Connecting, send InitializeRequested with a driver-config that the stub keeps pending, or Become-drive via the suite's existing harness. If reaching exact internal states is awkward, the minimum viable assertions are: (a) a WriteAttribute to a Stubbed actor returns a negative WriteAttributeResult quickly, and (b) no SubscriptionEstablished dead-letter is produced during a connect+resubscribe cycle.
Step 2: Run — expect FAIL
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverInstanceActorWriteAndSubscribe"
Step 3: Add write fast-fail to the three degraded behaviours
Add this line to Stubbed (after line 208), Connecting() (after line 230), and Reconnecting() (after line 279):
Receive<WriteAttribute>(_ =>
Sender.Tell(new WriteAttributeResult(false, "driver not connected")));
(Stubbed may use a reason like "driver stubbed" — implementer's call; the host maps either into a NodeWriteResult(false, …).)
Step 4: Add the benign SubscriptionEstablished handler
In Connected() (alongside the other Receives, e.g. after line 262) add a swallow-with-trace so the self-reply from ResubscribeDesired doesn't dead-letter:
// ResubscribeDesired self-Tells Subscribe; HandleSubscribeAsync replies SubscriptionEstablished to the
// sender, which on the self-resubscribe path is Self. Swallow it (trace only) so it doesn't dead-letter.
Receive<SubscriptionEstablished>(msg =>
_log.Debug("DriverInstance {Id}: subscription established ({Count} refs, {Diag})",
_driverInstanceId, msg.ReferenceCount, msg.DiagnosticId));
Add the same one-liner to Reconnecting() and Connecting() to cover the race where the async subscribe reply lands after a state flip (cheap, same file). Do NOT alter where/how SubscriptionEstablished is sent — a host that genuinely Asks Subscribe still receives its reply (the handler only runs when the message arrives at this actor as the asker).
Step 5: Run — expect PASS, then build the project
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DriverInstanceActorWriteAndSubscribe"
dotnet build src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ZB.MOM.WW.OtOpcUa.Runtime.csproj
Step 6: Commit (stage by path)
git add src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs \
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DriverInstanceActorWriteAndSubscribeTests.cs
git commit -m "fix(runtime): fast-fail writes in degraded driver states + swallow self SubscriptionEstablished"
Task T3: FOCAS — cache equipment-tag parsed addresses
Classification: small Estimated implement time: ~4 min Parallelizable with: T1, T2, T4, T5, T6
Addresses pending.md #4(c).
Files:
- Modify:
src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasDriver.cs(cache field 37-38; read path 275-280; write path 337-342; init seed 120) - Test:
tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/(add to an existing FocasDriver read/write test class)
Context: _parsedAddressesByTagName (a plain Dictionary, line 37) is seeded only from _options.Tags at init (line 120). Equipment tags arrive via _resolver (line 66) as transient defs whose Name is the raw ref — never seeded — so both ReadAsync (275-280) and WriteAsync (337-342) hit the TryGetValue miss and re-parse FocasAddress.TryParse(def.Address) every call without storing. Fix: make the cache a ConcurrentDictionary (reads run on the poll loop, writes on the host Ask thread — concurrent access) and GetOrAdd so a transient def's address is parsed once.
Step 1: Failing test — second resolve must reuse the cache
Add an internal test seam to FocasDriver so the test can observe caching without a parser spy:
// In FocasDriver (internal, for tests via InternalsVisibleTo — confirm the attribute exists for the .Tests asm)
internal bool IsParsedAddressCached(string reference) => _parsedAddressesByTagName.ContainsKey(reference);
Test (equipment-tag ref, not an authored tag):
[Fact]
public async Task Equipment_tag_address_is_parsed_once_and_cached()
{
var driver = /* construct FocasDriver with a fake client + one device, no authored Tags */;
await driver.InitializeAsync(configJson, default);
var equipRef = "{\"address\":\"R100\",\"dataType\":\"Int16\",\"deviceHostAddress\":\"focas://10.0.0.5\",\"writable\":true}"; // matches FocasEquipmentTagParser keys
driver.IsParsedAddressCached(equipRef).ShouldBeFalse(); // not seeded at init
await driver.ReadAsync(new[] { equipRef }, default); // first resolve parses + caches
driver.IsParsedAddressCached(equipRef).ShouldBeTrue(); // cached for subsequent hot calls
}
Use the exact equipment-ref JSON shape that FocasEquipmentTagParser.TryParse accepts (open that parser to copy its camelCase keys + required fields). If InternalsVisibleTo for the .Tests assembly isn't already present on the FOCAS driver project, add it (mirror how the other drivers expose internals to their test projects).
Step 2: Run — expect FAIL (second assert false: never cached)
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests --filter "FullyQualifiedName~Equipment_tag_address_is_parsed_once"
Step 3: Implement the cache
- Change the field (line 37-38) to:
(add
private readonly ConcurrentDictionary<string, FocasAddress> _parsedAddressesByTagName = new(StringComparer.OrdinalIgnoreCase);using System.Collections.Concurrent;if absent). - Add a private helper:
(The throwing factory propagates and does not cache a bad parse — preserving the existing fail behaviour.
private FocasAddress ResolveParsedAddress(FocasTagDefinition def) => _parsedAddressesByTagName.GetOrAdd(def.Name, _ => FocasAddress.TryParse(def.Address) ?? throw new InvalidOperationException( $"FOCAS tag '{def.Name}' has malformed Address '{def.Address}'."));GetOrAddmay invoke the factory more than once under race but only stores one — fine, parsing is pure.) - Replace the read-path block (275-280) and the write-path block (337-342) with
var parsed = ResolveParsedAddress(def);. - Line 120 (
_parsedAddressesByTagName[tag.Name] = parsed;) compiles unchanged onConcurrentDictionary; keep it (init pre-validates + seeds authored tags).
Step 4: Run — expect PASS, build the project
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests --filter "FullyQualifiedName~Equipment_tag_address_is_parsed_once"
dotnet build src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.csproj
Step 5: Commit (stage by path)
git add src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/
git commit -m "perf(focas): cache equipment-tag parsed addresses (no per-call reparse)"
Task T4: Parity test — future-enum trap
Classification: trivial Estimated implement time: ~2 min Parallelizable with: T1, T2, T3, T5, T6
Addresses pending.md #4(e).
Files:
- Test only:
tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactAliasParityTests.cs(numeric-AccessLevel theory ~73-104)
Context: TagAccessLevel has Read = 0, ReadWrite = 1 (src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Enums/TagAccessLevel.cs). The parity test maps numeric AccessLevel → Writable with [InlineData(1, true)]/[InlineData(0, false)]. Add a future-enum guard so an as-yet-undefined value maps fail-safe to non-writable (only ReadWrite==1 should yield a writable node).
Step 1: Add the InlineData row
In the numeric-AccessLevel [Theory] (the one with [InlineData(1, true)] / [InlineData(0, false)]), add:
[InlineData(2, false)] // future/unknown AccessLevel must map fail-safe to read-only (only ReadWrite⇒writable)
Step 2: Run — expect PASS (confirms the derivation is == ReadWrite, not != Read)
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter "FullyQualifiedName~DeploymentArtifactAliasParity"
If it FAILS, the production derivation is using != Read (so 2 wrongly maps to writable) — that is a real fail-safety bug; STOP and surface it (the fix would be a one-line == TagAccessLevel.ReadWrite in both Phase7Composer and DeploymentArtifact, but only with reviewer sign-off since it touches the byte-parity contract).
Step 3: Commit (stage by path)
git add tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Drivers/DeploymentArtifactAliasParityTests.cs
git commit -m "test(runtime): future-enum trap for AccessLevel→Writable parity"
Task T5: docs/security.md — data-plane GroupToRole requirement
Classification: trivial Estimated implement time: ~3 min Parallelizable with: T1, T2, T3, T4, T6
Addresses pending.md #3.
Files:
- Modify:
docs/security.md(Data-Plane Authorization section starts line 167;GroupToRoleis described at line 163 in the LDAP section; troubleshooting at 313-315)
Context: OPC UA session roles are unioned from two sources: the DB LdapGroupRoleMapping (whose Role column is the AdminRole enum — Administrator/Designer/Viewer only) AND the appsettings Security:Ldap:GroupToRole (free-form string→string). The data-plane gates (WriteOperate, WriteTune, WriteConfigure, AlarmAck/AlarmAcknowledge, ReadOnly) read literal role strings that the AdminRole-typed DB mapping cannot produce — so a deployment MUST map its data-plane LDAP groups → data-plane role strings via GroupToRole, or write-through and OPC UA alarm-ack are silently inert (every write → BadUserAccessDenied, default-deny). Line 163 currently frames GroupToRole as mapping only to Admin roles — that's the misleading bit to correct.
Step 1: Add a "Role grant source (data-plane)" subsection
Under ## Data-Plane Authorization (after the "Full model" subsection ~line 241, or right after the "Permission flags" mapping ~216), add a subsection. Suggested content (match the doc's voice; keep it tight):
### Role grant source (data-plane)
Data-plane roles come from `Security:Ldap:GroupToRole` (appsettings), **not** from the Config-DB
`LdapGroupRoleMapping` table. That table's `Role` column is the `AdminRole` enum
(`Administrator`/`Designer`/`Viewer`) and supplies **control-plane** roles only — it cannot emit the
data-plane role strings the OPC UA gates read (`ReadOnly`, `WriteOperate`, `WriteTune`,
`WriteConfigure`, `AlarmAcknowledge`). A deployment therefore **must** map its data-plane LDAP groups
to those role strings via `GroupToRole`, e.g.:
```json
"GroupToRole": {
"ot-operators": "WriteOperate",
"ot-tuners": "WriteTune",
"ot-engineers": "WriteConfigure",
"ot-alarm-ack": "AlarmAcknowledge",
"ot-readonly": "ReadOnly"
}
If this mapping is absent the data-plane evaluator is strictly default-deny: inbound operator
writes and OPC UA Part-9 alarm acknowledgement all return BadUserAccessDenied even for users who
authenticate successfully. (The same requirement gates the pre-existing scripted-alarm ack path.)
Also tighten line 163 (the `GroupToRole` description in the LDAP section) so it notes the values are control-plane **and** data-plane role strings, cross-referencing the new subsection — a one-clause edit, not a rewrite.
**Step 2: Verify the doc renders / no broken anchors, then commit**
```bash
git add docs/security.md
git commit -m "docs(security): document the GroupToRole data-plane role requirement"
(No test; verification = re-read the section for accuracy against the permission-flag names at lines 186-216.)
Task T6: Galaxy writer — invalidate handle caches on reconnect
Classification: small Estimated implement time: ~5 min Parallelizable with: T1, T2, T3, T4, T5
Addresses pending.md #6(b).
Files:
- Modify:
src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/IGalaxyDataWriter.cs(interface — add method) - Modify:
src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GatewayGalaxyDataWriter.cs(caches 28-32; implement clear) - Modify:
src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/TracedGalaxyDataWriter.cs(decorator — delegate) - Modify:
src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs(ReopenAsync291-296;_dataWriterset at 255-257) - Test:
tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/Runtime/(GatewayGalaxyDataWriter test class)
Context: GatewayGalaxyDataWriter caches _itemHandles (ref→gw item handle, line 28) and _supervisedHandles (handle→advised, line 32). _dataWriter is built once in BuildProductionRuntimeAsync (255-257) and reused across reconnects. On reconnect, ReopenAsync (291-296) calls _ownedMxSession.RecreateAsync — which disposes + rebuilds the session, so all prior gw item handles are dead. The writer's caches still hold the stale handles, so a write right after a reconnect AddItem-skips and uses a dead handle. Add a cache-invalidation seam called on reopen.
Step 1: Failing test — invalidation forces a fresh AddItem on the next write
In the existing GatewayGalaxyDataWriter test class (it already constructs the writer over a fake/owned session — reuse that harness), add:
[Fact]
public async Task InvalidateHandleCaches_forces_AddItem_again_on_next_write()
{
var (writer, fakeSession) = BuildWriterWithFakeSession(); // reuse existing harness
await writer.WriteAsync(new[] { new WriteRequest("Tag.Attr", 1) }, _ => SecurityClassification.Operate, default);
fakeSession.AddItemCallCount.ShouldBe(1);
writer.InvalidateHandleCaches();
await writer.WriteAsync(new[] { new WriteRequest("Tag.Attr", 2) }, _ => SecurityClassification.Operate, default);
fakeSession.AddItemCallCount.ShouldBe(2); // re-AddItem'd because the cache was cleared
}
If the existing fake session doesn't count AddItem, either extend it with a counter or assert via an internal seam on the writer (internal int CachedItemHandleCount => _itemHandles.Count;): assert >0 after a write, 0 after InvalidateHandleCaches(). Pick whichever the existing harness makes cleanest.
Step 2: Run — expect FAIL (no such method)
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests --filter "FullyQualifiedName~InvalidateHandleCaches"
Step 3: Add the method to the interface + impl + decorator
IGalaxyDataWriter: add/// <summary>Drop cached gw item handles + supervisory-advise state. Call after a session /// reconnect — the old handles are dead, so the next write must re-AddItem + re-advise.</summary> void InvalidateHandleCaches();GatewayGalaxyDataWriter:public void InvalidateHandleCaches() { _itemHandles.Clear(); _supervisedHandles.Clear(); }TracedGalaxyDataWriter(decorator): delegate —public void InvalidateHandleCaches() => _inner.InvalidateHandleCaches();(match the field name the decorator uses for its inner writer).
Step 4: Call it on reopen
In GalaxyDriver.ReopenAsync (291-296), after RecreateAsync succeeds:
private async Task ReopenAsync(CancellationToken cancellationToken)
{
if (_ownedMxSession is null) return;
var clientOptions = BuildClientOptions(_options.Gateway);
await _ownedMxSession.RecreateAsync(clientOptions, cancellationToken).ConfigureAwait(false);
// The recreated session invalidates every prior gw item handle; drop the writer's handle/advise
// caches so the next write re-AddItems + re-AdviseSupervisory against the fresh session.
_dataWriter?.InvalidateHandleCaches();
}
(_dataWriter is the TracedGalaxyDataWriter-wrapped field set at 255-257; the call flows through to the inner GatewayGalaxyDataWriter.)
Step 5: Run — expect PASS, run the full Galaxy suite, build
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests --filter "FullyQualifiedName~InvalidateHandleCaches"
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests
Step 6: Commit (stage by path)
git add src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/ tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/
git commit -m "fix(galaxy): invalidate writer handle caches on session reconnect"
Final verification (after T1–T6)
dotnet build ZB.MOM.WW.OtOpcUa.slnx # 0 errors (TreatWarningsAsErrors)
dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests
dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests
These are all unit-level changes (no live docker-dev /run gate needed — none touch Razor/JS or the wire protocol behaviourally; the fast-fail paths and cache changes are exercised by Akka.TestKit + unit tests). Then finish via superpowers-extended-cc:finishing-a-development-branch (intent: merge-to-master + push). After merge, update pending.md to mark #3, #4, #6 resolved.
Out of scope (left in pending.md)
- #4 sub-item: FOCAS — covered (T3). The "surface real device-write status" item (#5) is a separate chosen-later effort, not part of this cluster.
- #7 driver-reconfigure-while-faulted — its own high-risk actor-state-machine design/plan.
- #1/#2 Phase B/C; #8 rig cleanups — unchanged.