lmxopcua/pending.md
Joseph Doherty cd20c3c064
docs: refresh pending.md for compaction (Phase C shipped; open-items digest)
2026-06-14 22:23:58 -04:00
# Pending — open follow-ups & deferrals
As of 2026-06-14. master HEAD `c24abc8a` (synced with origin; feature branch `feat/galaxy-phase-c-historian` ff-merged + deleted). Working tree is clean except the expected DISK-ONLY files: `docker-dev/docker-compose.yml` (M — uncommitted rig config, never staged) and `pending.md` (M — these notes, never staged), plus two untracked pre-existing `docs/plans/2026-06-14-write-outcome-self-correction-plan.md*` docs. HARD RULE: never `git add .`; never stage `pending.md` / `current.md` / `docker-dev/docker-compose.yml` / `sql_login.txt` / `src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/`; never commit secrets.
**GALAXY PHASE C — SERVER-SIDE OPC UA HistoryRead MERGED + PUSHED to master `c24abc8a` (2026-06-14, fast-forward, 14 commits = 2 design/plan docs + 12 feature/test/doc).** The server now answers OPC UA **HistoryRead** (Raw / Processed / AtTime over historized variable nodes; **Events** over alarm-owning equipment-folder event-notifier nodes) for any equipment tag flagged historized, driver-agnostically, by dispatching to the registered `IHistorianDataSource` (the Wonderware historian TCP client, which already implemented that interface). **NO EF migration** — the flag rides in the existing `TagConfig` JSON blob (`{"FullName":"…","isHistorized":true,"historianTagname":"…"?}`, the Phase-B `alarm`-object carrier), `historianTagname` defaults to the tag's driver `FullName`. Design/plan `docs/plans/2026-06-14-galaxy-phase-c-historian-*.md`; guide `docs/Historian.md`. Pipeline: `Phase7Composer.ExtractTagHistorize` + byte-parity `DeploymentArtifact.ExtractTagHistorize` → `EquipmentTagPlan.{IsHistorized,HistorianTagname}` → `Phase7Applier` resolves `IsHistorized ? (HistorianTagname ?? FullName) : null` → sink seam (`IOpcUaAddressSpaceSink.EnsureVariable` +`string? historianTagname`) → `OtOpcUaNodeManager.EnsureVariable` sets `Historizing`+`AccessLevels.HistoryRead`+registers `_historizedTagnames`; the node manager overrides the four `CustomNodeManager2` HistoryRead virtuals, block-bridging to the `HistorianDataSource` property (volatile, default `NullHistorianDataSource.Instance`). DI mirrors `AlarmHistorian`: `AddServerHistorian` (config-gated, `ServerHistorian` appsettings section, Null default via `TryAddSingleton`) + `OtOpcUaSdkServer.SetHistorianDataSource` + Host `Program.cs`/`OtOpcUaServerHostedService` Start/Stop wiring. **Graceful degrade:** historized node + Null/unconfigured source → `Good_NoData` (empty), non-historized node → `BadHistoryOperationUnsupported`.
**KEY FACTS/GOTCHAS:** the SDK base filters event reads by the `EventNotifier.HistoryRead` bit (variable nodes never reach the events arm); the SDK master propagates `errors[i].Code → results[i].StatusCode` (confirmed by decompiling `MasterNodeManager.HistoryReadAsync`), so the override signals per-node status via `errors[handle.Index]`; the two `HistoryReadResult` types (SDK `Opc.Ua.HistoryReadResult` vs Core.Abstractions DTO) are aliased `SdkHistoryReadResult`/`HistorianRead`; `ReadRawModifiedDetails.IsReadModified` defaults TRUE (Initialize() sets it) so a plain raw read must clear it — modified-history is unsupported; the events arm registers folder→sourceName (= equipment id) only when a non-Null historian is wired at promotion time (Host wires the source at StartAsync before any deployment materialises, so normal boot ordering is correct). Built via subagent-driven dev (T1/T2/T5 standard parallel review, T3/T4 high-risk serial spec→code, + a final integration review READY-TO-MERGE). **Build clean (0 errors); OpcUaServer.Tests 152/0, Runtime.Tests 234/0, Core.Abstractions.Tests 88/0.** **LIVE `/run` GATE (T7) DEFERRED — operator-driven: it needs the Wonderware sidecar + AVEVA Historian on the WW Historian VM `10.100.0.48`, which is NOT on the local docker-dev rig.** When run: author a historized Galaxy tag (`TagConfig` `"isHistorized":true`), set `ServerHistorian:Enabled=true` → sidecar (Host/Port/SharedSecret/TLS), deploy on `MAIN-galaxy-eq`, then `Client.CLI historyread -n "ns=2;s=<equip>/<tag>" --start … --end …` → samples; a non-historized tag → `BadHistoryOperationUnsupported`. KNOWN follow-ups (non-blocking, documented): single-shot reads only (no server-managed continuation-point paging yet); no modified-value history; no explicit timeout at the block-bridge call site (bounded by the `WonderwareHistorianClient` 30s `CallTimeout`); the StopAsync→DisposeAsync warm-shutdown window returns Good_NoData (mirrors the `SetNodeWriteGateway` pattern).
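
A minimal Python sketch of the Phase C resolution rule and graceful-degrade mapping described above. It is an illustration only, not the project's C# code: the function names are hypothetical, and the `DISPATCH_TO_HISTORIAN` marker is a placeholder for "hand the read to the registered source", whose real status depends on the historian.

```python
import json
from typing import Optional


def resolve_historian_tagname(tag_config_json: str) -> Optional[str]:
    """Mirror `IsHistorized ? (HistorianTagname ?? FullName) : null`."""
    cfg = json.loads(tag_config_json)
    if not cfg.get("isHistorized", False):
        return None  # non-historized tag: no historian tagname is registered
    override = cfg.get("historianTagname")
    # historianTagname is optional and defaults to the driver FullName.
    return override if override is not None else cfg["FullName"]


def history_read_status(historian_tagname: Optional[str], source_is_null: bool) -> str:
    """Map a node's historized state and the wired source to the degrade status."""
    if historian_tagname is None:
        return "BadHistoryOperationUnsupported"
    if source_is_null:
        return "Good_NoData"  # historized node, Null/unconfigured source: empty result
    return "DISPATCH_TO_HISTORIAN"  # hypothetical marker, not a real status code


# Example tag configs (the tag names are made up for illustration).
plain = '{"FullName":"Tank.Level"}'
historized = '{"FullName":"Tank.Level","isHistorized":true}'
renamed = '{"FullName":"Tank.Level","isHistorized":true,"historianTagname":"TankLevelHist"}'
```

Because the flag lives in the JSON blob, the resolution needs nothing beyond the blob itself, which is consistent with the "NO EF migration" note.
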
**DRIVER-RECONFIGURE-WHILE-FAULTED (#7) MERGED + PUSHED to master `56f73e49` (2026-06-14, fast-forward, 5 commits = 2 docs + fix + review-nit + task-status).** A `DriverInstanceActor` stuck `Connecting`/`Reconnecting` now **adopts a corrected config delivered via `ApplyDelta`** and re-initialises with it, instead of dead-lettering the message and retrying the stale config forever (old workaround = restart node). Design/plan `docs/plans/2026-06-14-driver-reconfigure-while-faulted-*.md`. Mechanism (approach B): a monotonic `_initGeneration` tags each `InitializeAsync`; `InitializeSucceeded(int Generation)`/`InitializeFailed(string Reason, int Generation)` carry it; the `Connecting`/`Reconnecting` result handlers **drop superseded (stale-generation) results** so a corrected config always wins against an old init still in flight; a new `AdoptConfigDuringInit` (wired into both not-connected states) calls `InitializeAsync(newConfig)` (swaps `_currentConfigJson`, bumps the generation, retries immediately) + replies `ApplyResult(true,…)`. Contained to ONE file (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`) + its test — NO host/contract/EF change; `Connected`/`Stubbed` `ApplyDelta` paths untouched; the two result records have ZERO external consumers (grep-verified) so the shape change is fully contained. Built via subagent-driven dev (high-risk full chain: spec ✅ · code ✅ · final integration ✅ — the integration pass traced the no-strand lifecycle in both states, host↔child contract, subscription/desired-refs/alarm safety, health/redundancy, double-adopt, and test fidelity, all clean). 
**Build clean (0 errors); Runtime 224/224.** **LIVE `/run` GATE DEFERRED per user ("skip the live test part").** When run: put `MAIN-opcua-eq` into a faulted/`Reconnecting` state via a bad `DriverConfig`, deploy a corrected config (`POST http://localhost:9200/api/deployments`, `X-Api-Key: docker-dev-deploy-key`), and confirm from central-1 logs that the driver adopts the new config + connects WITHOUT a node restart.
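
The generation guard above can be sketched in Python as follows. This is a simplified model, assuming a single-threaded message loop like an actor's; the class and method names are hypothetical stand-ins for `_initGeneration`, `InitializeAsync`, `AdoptConfigDuringInit`, and the `InitializeSucceeded`/`InitializeFailed` handlers, not the actual `DriverInstanceActor` code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InitResult:
    succeeded: bool
    generation: int            # generation of the attempt that produced this result
    reason: Optional[str] = None


class DriverInitState:
    def __init__(self, config: str) -> None:
        self.current_config = config
        self.init_generation = 0   # monotonic counter, bumped on every attempt
        self.state = "Connecting"

    def initialize(self, config: str) -> int:
        """Start an init attempt with `config` and return its generation tag."""
        self.current_config = config
        self.init_generation += 1
        return self.init_generation

    def adopt_config_during_init(self, new_config: str) -> int:
        """Adopt a corrected config while not connected: swap, bump, retry."""
        return self.initialize(new_config)

    def on_result(self, result: InitResult) -> bool:
        """Apply a result; return False when it is superseded and dropped."""
        if result.generation != self.init_generation:
            return False  # stale result from an older attempt still in flight
        self.state = "Connected" if result.succeeded else "Reconnecting"
        return True


actor = DriverInitState("bad-config")
old_gen = actor.initialize("bad-config")
new_gen = actor.adopt_config_during_init("good-config")
# The old attempt fails late; its result is dropped, so the state is unchanged.
dropped = actor.on_result(InitResult(False, old_gen, "connection refused"))
applied = actor.on_result(InitResult(True, new_gen))
```

Tagging results with the generation, rather than cancelling the old attempt, means a late failure from the stale config cannot overwrite the outcome of the corrected one.
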
**GALAXY PHASE B — NATIVE ALARMS ON THE EQUIPMENT-TAG PATH MERGED + PUSHED to master `f9be3843` (2026-06-14, fast-forward, 12 feature + 2 doc commits).** A Galaxy equipment `Tag` marked as a native alarm via its `TagConfig.alarm` object — `{"FullName":"tag.attr","alarm":{"alarmType":"OffNormalAlarm","severity":700}}`, **NO EF migration** — now materialises a real OPC UA Part 9 `AlarmConditionState` under its equipment folder, driven live by the driver's `IAlarmSource.OnAlarmEvent`; transitions fan out to `/alerts` + the historian (Primary-gated). Design/plan `docs/plans/2026-06-14-galaxy-phase-b-native-alarms-*.md`. NEW seam (mirrors the scripted-alarm seam, reuses the condition sink UNCHANGED): `AlarmEventArgs.Kind` (additive contract; Galaxy populates it) → `DriverInstanceActor` subscribes `OnAlarmEvent` → `AttributeAlarmPublished` → `DriverHostActor._alarmNodeIdByDriverRef` + `NativeAlarmProjector` (transition→`AlarmConditionSnapshot`) → `OpcUaPublishActor.AlarmStateUpdate` → reused `OtOpcUaNodeManager.WriteAlarmCondition`. Built via subagent-driven dev (full per-task review chain). **Build clean; Core.Abstractions 81/0, OpcUaServer 118/0, Runtime 222/0, Galaxy 262/0(+1 live-gw skip).** **THE FINAL INTEGRATION REVIEW CAUGHT A CRITICAL SEAM BUG every unit test missed (fixed in `f9be3843`):** the alarm map is keyed by the dotted `FullName`, but `GalaxyDriver` puts the BARE owning object in `AlarmEventArgs.SourceNodeId` and the DOTTED alarm ref in `ConditionId` (`GalaxyDriver.cs:1148-1149`; `AlarmFullReference`) — so `ForwardNativeAlarm` MUST resolve on `msg.Args.ConditionId` (= `AlarmFullReference` = the authored `FullName`), NOT `SourceNodeId`. The unit test had masked it by setting `SourceNodeId==FullName` (never true in prod); it is now production-shaped (`SourceNodeId="Temp"`, `ConditionId="Temp.HiHi"=FullName`) so it genuinely guards the seam.
**LIVE `/run` GATE (T9) NOT YET DONE — user-driven, deferred at merge (user choice "merge now").** When run: author a Galaxy alarm equipment tag whose `FullName` EXACTLY matches the gateway's `AlarmFullReference` (discoverable via the Galaxy picker/a probe), deploy on `MAIN-galaxy-eq`, trip the alarm → Part 9 condition goes active under the equipment + the `/alerts` row appears; clear → inactive. RESIDUAL non-blocking follow-ups (review-surfaced): (a) `DetachSubscription` alarm-coupling doc note + a dead-letter-during-reconnect regression test (WS-4b); (b) ack/comment-path test + assert `evt.Comment` (WS-5); (c) a `docs/ScriptedAlarms.md` note that authored severity (1..1000, seeds the condition at materialise) snaps to the projector's 4-bucket value (200/500/700/900) on the first transition; (d) DEFERRED by design: inbound device-ack (client Ack → `IAlarmSource.AcknowledgeAsync` → AVEVA), driving `SubscribeAlarmsAsync` from the materialised alarm-ref set (Galaxy doesn't need it), AdminUI Galaxy-picker `alarm` pre-fill, carrying raw OPC UA severity end-to-end. Phase **C** (server `HistoryRead`) **DONE — merged `c24abc8a` (2026-06-14); see the Phase C banner at top.**
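
The seam bug above comes down to which field keys the alarm map. A small Python sketch under stated assumptions: the dataclass and function are hypothetical stand-ins for `AlarmEventArgs` and the lookup in `ForwardNativeAlarm`, and the node id string is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AlarmEventArgs:
    source_node_id: str  # bare owning object, e.g. "Temp"
    condition_id: str    # dotted alarm reference, e.g. "Temp.HiHi"


def resolve_alarm_node(alarm_node_by_ref: Dict[str, str],
                       args: AlarmEventArgs) -> Optional[str]:
    """Look up the materialised alarm node by the dotted reference.

    The map is keyed by the authored FullName, which equals ConditionId,
    so SourceNodeId must not be used as the key.
    """
    return alarm_node_by_ref.get(args.condition_id)


# Production-shaped event: the two identifiers differ.
alarm_map = {"Temp.HiHi": "ns=2;s=EQ-example/TempHiHi"}  # hypothetical node id
event = AlarmEventArgs(source_node_id="Temp", condition_id="Temp.HiHi")

by_condition = resolve_alarm_node(alarm_map, event)
by_source = alarm_map.get(event.source_node_id)  # the buggy lookup finds nothing
```

A test that sets both identifiers to the same string passes with either lookup, which is why the fixture has to keep them distinct.
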
**WRITE-OUTCOME SELF-CORRECTION (#5) MERGED + PUSHED to master `1d797c1c` (2026-06-14, fast-forward, 6 commits).** A failed inbound device write now reverts the node to its real pre-write value (compare-and-revert) instead of leaving the optimistic-`Good` phantom. Design `docs/plans/2026-06-14-write-outcome-self-correction-design.md`; plan `…-plan.md`. NEW `IOpcUaNodeWriteGateway`/`NodeWriteOutcome` (Commons) + `ActorNodeWriteGateway` (Runtime, Asks `RouteNodeWrite`, returns the outcome) replace the fire-and-forget `Action` router; `OnEquipmentTagWrite` captures the prior value + fires an off-Lock (`RunContinuationsAsynchronously`) continuation that reverts on a failed outcome IF the node still holds the optimistic value (`ShouldRevert`). Build clean; Commons 39/0, Runtime 201/0, OpcUaServer 111/0. High-risk review verified the prior-value capture against the actual UA-.NETStandard source (`OnWriteValue` fires before `m_value=value`). **LIVE-PROVEN end-to-end** via a local Modbus exception-injector (FC06 reject on HR[20]): authorized failing write → gateway-logged `0x808B0000` reject → node reverts 99→20; authorized success (HR200=7777) stays; anon → BadUserAccessDenied. KEY FINDING: the **Galaxy gateway worker's `ExecuteWrite` is fire-and-forget** (returns OK without awaiting the MXAccess commit), so Galaxy writes ALWAYS return Success at the OPC UA layer and can NEVER surface a device-write failure to this revert — only protocol drivers (which await + return real status) can; this is the same gateway-side limitation noted under "optimistic-write phantom" (out of our scope). Two MINOR deferred follow-ups remain: a Bad-quality blip / OPC UA AuditWriteUpdateEvent on failure, and synchronous structural fail-fast (both explicitly out of scope per the chosen mechanism).
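
The compare-and-revert rule above can be sketched in Python like this. It is a simplified, synchronous model: the real continuation runs off the node-manager Lock after the gateway returns, and the `Node` class and function names here are hypothetical.

```python
class Node:
    """Stand-in for a variable node holding a single value."""

    def __init__(self, value):
        self.value = value


def should_revert(current_value, optimistic_value, write_succeeded: bool) -> bool:
    """Revert only on failure, and only if the node still holds the written value."""
    return (not write_succeeded) and current_value == optimistic_value


def apply_write_outcome(node: Node, prior_value, optimistic_value,
                        write_succeeded: bool) -> bool:
    """Run after the device write completes; return True when a revert happened."""
    if should_revert(node.value, optimistic_value, write_succeeded):
        node.value = prior_value
        return True
    return False


# Failed device write: the node goes back from 99 to 20.
node = Node(20)
prior = node.value
node.value = 99                      # optimistic local apply
reverted = apply_write_outcome(node, prior, 99, write_succeeded=False)

# A fresh value arrived before the failure was reported: leave it alone.
raced = Node(20)
raced.value = 99
raced.value = 42                     # newer value published in between
raced_reverted = apply_write_outcome(raced, 20, 99, write_succeeded=False)
```

The comparison against the optimistic value is what keeps a late failure from clobbering a newer value that arrived in between.
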
**HARDEN MILESTONE 1b cluster MERGED + PUSHED to master `945c2380` (2026-06-14, fast-forward, 9 commits).** Follow-ups #3 (data-plane role docs), #4 (write-pipeline review nits), and #6 (Galaxy driver nits) below are CLOSED. Plan: `docs/plans/2026-06-14-harden-milestone-1b-plan.md`. Build clean; Runtime 197/0, FOCAS 185/0, Galaxy 257/0(+1 pre-existing skip); final integration review READY-TO-MERGE. Two MINOR residual follow-ups surfaced by review (both deferred, non-blocking): (a) a *driver-level* regression test that `GalaxyDriver.ReopenAsync` actually calls `InvalidateHandleCaches` — needs a live gw (`RecreateAsync` can't be faked), so it's an integration test; (b) stub-driver test-class duplication between `DriverInstanceActorTests` + `DriverInstanceActorWriteAndSubscribeTests` (hygiene — extract a shared harness).
## STATE SUMMARY (post-compaction pickup)
**ALL feature work is SHIPPED + PUSHED to master `c24abc8a` (synced with origin). Nothing is blocking.** Milestone 1b (equipment-tag live values: live READ + authorized inbound WRITE across OpcUaClient / the 6 protocol drivers / Galaxy, via the `FullName→NodeId` router) is COMPLETE, and all three Galaxy phases shipped: **A** standard Equipment driver `c3c56172`, **B** native alarms `f9be3843`, **C** server-side HistoryRead `c24abc8a`. The session's cluster-harden / write-outcome-self-correction / driver-reconfigure-while-faulted follow-ups merged too (`945c2380` / `1d797c1c` / `56f73e49`). The six banners above carry each feature's mechanism + gotchas + deferred live gate; the closed open-follow-ups #1-#7 (Phase B, Phase C, data-plane role docs, write-pipeline nits, write-outcome, Galaxy driver nits, reconfigure-while-faulted) are all DONE at those SHAs.
**The ONLY genuinely open items (all user-driven / deferred — pick up here):**
1. **User-driven live `/run` gates** — the agent does NOT sign in; all code is merged + unit-verified; these are the operator's end-to-end confirmations:
- **Phase C HistoryRead** (T7) — needs the Wonderware sidecar + AVEVA on the WW Historian VM `10.100.0.48` (NOT on the local docker-dev rig). Recipe: Phase C banner + `docs/Historian.md`.
- **Phase B native alarms** (T9) — author a Galaxy alarm tag whose `FullName` == the gateway `AlarmFullReference`, deploy on `MAIN-galaxy-eq`, trip → Part 9 condition + `/alerts` row. Recipe: Phase B banner.
- **Driver-reconfigure-while-faulted** — fault `MAIN-opcua-eq` with a bad config, deploy a corrected one (`POST http://localhost:9200/api/deployments`, `X-Api-Key: docker-dev-deploy-key`), confirm it adopts WITHOUT a node restart. Recipe: that banner.
2. **Rig cleanups** (operational, user-deferred) — see "Operational deferral" at the bottom.
3. **Minor non-blocking residual follow-ups** (review-surfaced, all explicitly deferred, none gate anything): Phase B residuals (a)-(d) in its banner; write-outcome residuals (Bad-quality blip / AuditWriteUpdateEvent / synchronous fail-fast); harden-1b two residuals (`945c2380` banner: Galaxy-reopen integration test, stub-driver test-class de-dup); Phase C documented follow-ups (no continuation-point paging, no modified-value history, block-bridge timeout bounded only by the client's 30s `CallTimeout`); the data-plane `GroupToRole` production-default note; Galaxy `_itemHandles`/`_supervisedHandles` not cleared on reconnect + the cosmetic `SubscriptionEstablished` self-dead-letter.
**No queued feature remains** — Milestone 1b + Galaxy A/B/C were the headline deliverables and are all done. Future directions (NOT requested): the Phase C HistoryRead follow-ups above, or new driver/UNS work.
---
The **six historian code follow-ups** (HistorizeToAveva opt-out, drain/capacity/retention config
knobs, SharedSecret/DatabasePath/non-positive-knob startup validation, operator-recording for
shelve/enable/disable, and the `SqliteStoreAndForwardSink` thread-safety nits) were **all resolved**
on branch `feat/alarm-historian-followups` (plan: `docs/plans/2026-06-11-alarm-historian-followups.md`).
They are no longer listed here.
## Equipment-tag live values — MILESTONE 1b COMPLETE (2026-06-13)
The Galaxy standard-driver effort shipped Phase A (`c3c56172`) + the **`FullName→NodeId` live-value ROUTER** (`c4435e4f`, both pushed). The router is done + verified (322 tests + integration review READY-TO-MERGE). **All three driver-publish gaps are now CLOSED** — an equipment tag bound to OpcUaClient, any protocol driver, OR Galaxy publishes a live value delivered by the router (full detail in `current.md` "Milestone 1b" + `docs/plans/2026-06-13-equipment-tag-live-values-design.md`):
1. ~~**OpcUaClient has NO factory (real bug — always stubbed).**~~ **DONE — SHIPPED+PUSHED master `22d553af` 2026-06-13.** Added `OpcUaClientDriverFactoryExtensions` (mirror Modbus) + registered it in `DriverFactoryBootstrap`. **First live equipment-tag value PROVEN end-to-end:** OpcUaClient driver `MAIN-opcua-eq` spawns `stub=False`, connects to opc-plc, subscribes to `ns=3;s=FastUInt1`; the `FullName→NodeId` router (`c4435e4f`) delivers it to the materialised variable `ns=2;s=EQ-55297329838d/FastUInt1`, which reads a live **changing** value (10135→10141, Good) via Client.CLI. Design/plan `docs/plans/2026-06-13-opcuaclient-factory-*.md`. Two incidental findings while live-verifying (see below).
2. ~~**Protocol drivers (Modbus/S7/AbCip/…) — equipment-tag↔driver tag-table linkage unbuilt.**~~ **DONE — SHIPPED+PUSHED master `8d8c05f5` 2026-06-13 (+ full inbound operator WRITE pipeline).** Approach B (driver-side direct-ref): a shared `EquipmentTagRefResolver<TDef>` (Core.Abstractions) resolves an equipment-tag ref (the raw `TagConfig` JSON blob the router already keys on) into a transient driver tag-def on a `_tagsByName` miss — wired into READ + WRITE for **all six** drivers (Modbus/S7/AbCip/AbLegacy/TwinCAT/FOCAS), each with a hardened never-throw `<Driver>EquipmentTagParser`. **Part B (write-through):** writable nodes (`Tag.AccessLevel==ReadWrite` → `CurrentReadWrite`, byte-parity in Phase7Composer+DeploymentArtifact), an `OnWriteValue` gate on the `WriteOperate` data-plane role (mirrors the alarm-ack bridge; fire-and-forget dispatch since the SDK holds the node-manager Lock during `OnWriteValue`), a `NodeWriteRouter` on the node manager, and `DriverHostActor.RouteNodeWrite` (NodeId→driver reverse map, primary-gated). **LIVE-PROVEN end-to-end:** Modbus equipment tag (HR[100]) reads a live changing value; an authorized write (`opc-writeop`/WriteOperate) to HR[200] changes the register + persists; an anonymous write → BadUserAccessDenied. Design/plan `docs/plans/2026-06-13-protocol-equipment-tag-linkage-*.md`. Findings + rig artifacts below.
3. ~~**Galaxy — needs a reachable mxaccessgw.**~~ **DONE — LIVE-PROVEN 2026-06-13 (no code change; config-only).** The code-investigation confirmed Galaxy was already fully wired: `GalaxyDriverFactoryExtensions` IS registered in `DriverFactoryBootstrap.cs:103` (not the missing-factory bug OpcUaClient had), and the Galaxy driver keys subscriptions on the FullReference (`tag_name.AttributeName`) DIRECTLY (no `_tagsByName` miss). Gap (c) was purely a misconfigured dev driver-instance + placeholder tag ref + unset key — ALL data in existing columns, NO EF/schema change. Fixes applied to the dev rig (`otopcua-dev-sql-1`/`OtOpcUa`): `MAIN-galaxy-eq` `DriverConfig` `gateway.endpoint` `https://10.100.0.35:5001` → `http://10.100.0.48:5120`, `useTls` `true` → `false`, `apiKeySecretRef` `env:MX_API_KEY`(unset)→`env:GALAXY_MXGW_API_KEY` (the var the compose already wires on every node); `GalaxyTestTag` `TagConfig.FullName` `TestMachine_002.SomeAttr`(placeholder)→`TestMachine_002.TestDuration` (a real galaxy Float attr). The gateway API key was injected via **ephemeral shell env** at `docker compose up -d --no-deps --force-recreate central-1 central-2` time (NEVER written to a tracked file; the compose's `${GALAXY_MXGW_API_KEY:-stale-default}` substitution picks it up — the running containers carry the real key only until the next recreate-without-the-env-var). **Live (central-1 logs):** `spawned GalaxyMxGateway driver MAIN-galaxy-eq (stub=False)` → `GalaxyMxSession connected — clientName=OtOpcUa` (auth OK) → `initialized — endpoint=http://10.100.0.48:5120` → `subscribed to 1 refs (galaxy-sub-1)` (TestMachine_002.TestDuration accepted, no BadNodeIdUnknown). **Value:** `Client.CLI read ns=2;s=EQ-55297329838d/GalaxyTestTag` → Value `0`, Status `0x00000000` (Good), Source Time `2026-05-07T07:14:26Z` (a real galaxy timestamp — a genuine attribute snapshot, NOT BadWaitingForInitialData; static because that attr isn't actively moving).
Restore-the-rig SQL saved at `/tmp/galaxy-gapc-snapshot.sql`. **Milestone 1b is now COMPLETE — all three gaps closed.** Findings/follow-ups below.
Then: Phase **B** = native `IAlarmSource` alarms on the equipment-tag path **(DONE — `f9be3843`)**; Phase **C** = server-side `HistoryRead` backend over the Wonderware reader **(DONE — `c24abc8a`, 2026-06-14; design sections in `docs/plans/2026-06-12-galaxy-standard-driver-design.md` + the dedicated `docs/plans/2026-06-14-galaxy-phase-c-historian-*.md`)**.
### Findings + follow-ups from the Galaxy gap-(c) live-verify (2026-06-13)
- **Benign dead-letter (minor, pre-existing in the Galaxy driver — NOT introduced here).** On subscribe the driver logs: `Message [SubscriptionEstablished] from drv-MAIN-galaxy-eq to drv-MAIN-galaxy-eq was unhandled. [N] dead letters`. The Galaxy `DriverInstanceActor`/driver sends itself a `SubscriptionEstablished` message that has no `Receive<>` handler. Harmless (the subscription IS established + delivering values), but noisy — add a handler (or stop self-Telling it). Cosmetic.
- **CHANGING-value read PROVEN (2026-06-13).** Repointed `GalaxyTestTag.FullName` → `TestMachine_002.TestChangingInt` (a script-driven Integer, `sec=Operate`): three Client.CLI reads returned **810 → 787 → 764** with **real galaxy source timestamps advancing ~7s each** (`02:28:41`/`:48`/`:55`) — a genuine live moving Galaxy value through the router (not optimistic/phantom). The dev rig is now left with `GalaxyTestTag` pointing here (DataType `Int32`, AccessLevel `Read`). Discovery was done with a throwaway probe (now deleted) using `GalaxyDriverBrowser.OpenAsync` → `session.AttributesAsync("TestMachine_002")`, which lists every attribute's `SecurityClass` (`ViewOnly`=read-only; `FreeAccess`/`Operate`/`Tune`/`Configure`=writable). Useful attrs on `TestMachine_002`: `TestFloat`(Float,Operate), `TestDouble`(Double,Operate), `TestChangingInt`(Integer,Operate,**moves**), `TestDuration`(**ElapsedTime**,Operate), `AlarmInhibit`(Boolean,FreeAccess).
- **GALAXY WRITE-THROUGH — FIXED + MERGED to master `f05b5d79` (`AdviseSupervisory` before raw `Write`).** Symptom was every Galaxy operator write returning `MxaccessFailure "ArgumentException: HRESULT 0x80070057"` (`E_INVALIDARG`). **TWO-LAYER root cause (debugged by SSH-reading the gateway source on 10.100.0.48 `C:\Users\dohertj2\Desktop\mxaccessgw` — there is NO live gateway file log: console-only/uncaptured, NSSM `stdout.log` stale, dashboard :5130 is Blazor/no-REST):** (1) the writer `AddItem`'d an UN-advised handle → MXAccess `Write` threw E_INVALIDARG (worker chain `ExecuteWrite` → `MxAccessSession.Write` → `MxAccessComServer.Write` → `AsProxyServer().Write(...)`). (2) DEEPER — a plain `Write` runs with **no user login** (`WriteUserId`=0), and MXAccess only **COMMITS** such a write when the item is advised in **SUPERVISORY** mode; a *regular* `Advise` removed the E_INVALIDARG but never committed (proven by a persistence check: read-back showed the value, but a `--force-recreate`+fresh-resubscribe reverted to the original `0 @ 2026-05-07`; the worker's `ExecuteWrite` is fire-and-forget, returns OK without awaiting `OnWriteComplete`). Confirmed against the sister **ScadaBridge** driver (`~/Desktop/ScadaBridge/.../RealMxGatewayClient.cs`): it commits the OTHER way — a configured **non-zero `WriteUserId`** + regular `Advise` + `WriteBulk`. We have no galaxy login → supervisory context. **FIX:** `GatewayGalaxyDataWriter` calls `AdviseSupervisory` (raw `MxCommand{Kind=AdviseSupervisory, AdviseSupervisory=new AdviseSupervisoryCommand{ServerHandle,ItemHandle}}` via `session.InvokeAsync`, mirroring `InvokeWriteSecuredAsync`; idempotent per handle via `_supervisedHandles`) before each raw `WriteRawAsync`; `SecuredWrite`/`VerifiedWrite` tags keep their own user-identity path (`NeedsSecuredWrite` unchanged — WriteSecured is ONLY for those special-security tags). The dead-end "reuse the subscription's advised handle" resolver attempt was reverted.
**LIVE-PROVEN:** authorized write (`opc-writeop`/WriteOperate) of `TestMachine_002.TestFloat`=1234.5 then 8888.25 COMMITS + **PERSISTS across recreate/re-subscribe** (galaxy-sourced timestamp); anonymous → `BadUserAccessDenied`. 254 Galaxy tests green; central `--build` clean. **OPEN follow-ups from this:** (a) the worker's fire-and-forget `ExecuteWrite` can't surface an async write failure — with supervisory advise the write commits, but only a read-back confirms a *specific* write (gateway-side; out of our scope). (b) `_itemHandles`/`_supervisedHandles` caches aren't cleared on reconnect (pre-existing for `_itemHandles`) — a write right after a reconnect could use a stale handle; minor.
- **OPTIMISTIC-WRITE PHANTOM (open follow-up — surface real write status to the client).** The inbound write dispatch is fire-and-forget: it returns optimistic `Good` before the driver result (required — `OnWriteValue` runs under the node-manager Lock), and the SDK applies the written value to the node locally. So a write whose DEVICE write FAILS still returns `Good`, and for a STATIC attribute that never re-pushes, the wrong value LINGERS (a phantom the device never accepted). The pipeline already computes the real status in `NodeWriteResult.Success/Reason` but only LOGS it — consider surfacing it to the client. (How it was caught live: a failed Galaxy write showed the written value on read-back with a SERVER-clock source timestamp + a `rejected` driver log; a committed write shows a GALAXY-clock timestamp + no rejection, and persists across a re-subscribe.)
- **Dev-rig Galaxy config is CORRECT + WORKING (left in place).** The `MAIN-galaxy-eq` driver-instance is deployed and connecting to the live gateway `http://10.100.0.48:5120`. `GalaxyTestTag` (on `EQ-55297329838d`/filler-02, `nw-uns` namespace) is currently `{"FullName":"TestMachine_002.TestFloat"}`, DataType `Float`, AccessLevel `ReadWrite` (the write demo; galaxy now holds the last written value `8888.25`). Other useful `TestMachine_002` attrs (from the discovery probe): `TestChangingInt`(Integer,Operate,**moves on its own** — the live-changing READ demo), `TestDouble`(Double,Operate), `TestDuration`(**ElapsedTime**,Operate — reads as Float but a Float write is a type-mismatch), `AlarmInhibit`(Boolean,FreeAccess). To restore the original placeholder tag: `/tmp/galaxy-gapc-snapshot.sql`. The base seed `docker-dev/seed/seed-clusters.sql` still seeds the *legacy* SystemPlatform-namespace Galaxy driver (`MAIN-galaxy-mxgw`, tags `TestMachine_001.TestAlarm001..003`) — pre-Phase-A model, untouched/separate. **The injected gateway key is EPHEMERAL** — key=`mxgw_otopcuakey2_so0…` is supplied via shell env `GALAXY_MXGW_API_KEY='…'` at `docker compose up --no-deps --force-recreate central-1 central-2`; a recreate WITHOUT it re-exported falls back to the compose's stale default and Galaxy auth fails. ORDER on a redeploy: POST deploy FIRST, THEN recreate (a faulted driver ignores `ApplyDelta`).
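
The `AdviseSupervisory`-before-`Write` fix in the write-through bullet above, together with its stale-handle follow-up, can be modelled with a short Python sketch. The class is hypothetical: the two callbacks stand in for the gateway's advise and raw-write calls, and `invalidate_handle_caches` illustrates what clearing the per-handle cache on reconnect would look like, not what the driver currently does.

```python
from typing import Callable, List, Set, Tuple


class SupervisedWriter:
    def __init__(self,
                 advise_supervisory: Callable[[int], None],
                 write_raw: Callable[[int, object], None]) -> None:
        self._advise_supervisory = advise_supervisory
        self._write_raw = write_raw
        self._supervised_handles: Set[int] = set()

    def write(self, item_handle: int, value: object) -> None:
        # Advise once per handle (idempotent), then perform the raw write.
        if item_handle not in self._supervised_handles:
            self._advise_supervisory(item_handle)
            self._supervised_handles.add(item_handle)
        self._write_raw(item_handle, value)

    def invalidate_handle_caches(self) -> None:
        # After a reconnect the old handles are meaningless; forget them.
        self._supervised_handles.clear()


calls: List[Tuple[str, int]] = []
writer = SupervisedWriter(
    advise_supervisory=lambda h: calls.append(("advise", h)),
    write_raw=lambda h, v: calls.append(("write", h)),
)
writer.write(7, 1234.5)
writer.write(7, 8888.25)           # same handle: no second advise
writer.invalidate_handle_caches()  # simulate a reconnect
writer.write(7, 1.0)               # advised again after the cache reset
```
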
### Findings + follow-ups from the protocol-linkage + write-through work (2026-06-13)
- **DATA-PLANE ROLE CONFIG REQUIREMENT (important, deployment-facing).** The OPC UA session's roles come from two sources unioned: the **DB `LdapGroupRoleMapping`** (its `Role` column is the **`AdminRole` enum** — Administrator/Designer/Viewer only, for the AdminUI) AND the **appsettings `Security:Ldap:GroupToRole`** baseline (free-form `string→string`). The OPC UA **data-plane** gates (`WriteOperate`, `AlarmAck`, …) read literal role STRINGS that the AdminRole-typed DB mapping **cannot** produce — so a deployment MUST map its LDAP data-plane groups → data-plane role strings via `GroupToRole`, or write-through (and scripted-alarm OPC UA ack) is inert (every write → `BadUserAccessDenied`). The shared dev GLAuth already has dedicated groups+users (group `WriteOperate`, user `opc-writeop`, `multi-role` in all; `opc-readonly`); the dev rig just never seeded the `GroupToRole`. **Consider a docs note (and/or a documented default) so production deployments wire this.** (Same latent requirement applies to the pre-existing alarm-ack gate.)
- **Write-pipeline review follow-ups (non-blocking, from the final integration review):** (a) `DriverHostActor.Stale` (and `DriverInstanceActor.Connecting`/`Reconnecting`) have **no `RouteNodeWrite`/`WriteAttribute` handler** → an operator write while stale/reconnecting dead-letters and the 10s Ask times out with a generic log (client got optimistic Good). Add fast-fail handlers returning a clear status. (b) Drop `TaskContinuationOptions.ExecuteSynchronously` on the router `ContinueWith`; `List.Contains` → `HashSet` in the forward-map build (micro). (c) FOCAS re-parses the address on every equipment-tag write (`_parsedAddressesByTagName` miss; perf only, rare). (d) `DriverHostActorWriteRoutingTests` seeds a Galaxy-style `{"FullName":...}` artifact, not a raw protocol-driver TagConfig blob — add a raw-blob case for belt-and-suspenders (runtime path is identical + live-verified). (e) Task-9 parity test is a faithful simulation of `ConfigComposer` (`ToSnapshot` casts AccessLevel to int) not a through-the-real-serializer proof; add an `InlineData(2,false)` future-enum trap.
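
The data-plane role requirement in the first bullet above can be illustrated with a small Python sketch of the two-source union. The function names and the `OtAdmins` group are hypothetical; the `WriteOperate` group and role strings come from the notes, and the real resolution happens in the server's LDAP security layer.

```python
from typing import Dict, Iterable, Set


def session_roles(user_groups: Iterable[str],
                  db_admin_mapping: Dict[str, str],
                  group_to_role: Dict[str, str]) -> Set[str]:
    """Union the DB AdminRole mapping with the appsettings GroupToRole baseline."""
    roles: Set[str] = set()
    for group in user_groups:
        if group in db_admin_mapping:
            roles.add(db_admin_mapping[group])   # Administrator/Designer/Viewer only
        if group in group_to_role:
            roles.add(group_to_role[group])      # free-form role strings
    return roles


def may_write(roles: Set[str]) -> bool:
    """The write gate checks for the literal data-plane role string."""
    return "WriteOperate" in roles


groups = ["OtAdmins", "WriteOperate"]            # "OtAdmins" is a made-up group
db_mapping = {"OtAdmins": "Administrator"}

# Without a GroupToRole entry the data-plane role can never appear.
without_baseline = session_roles(groups, db_mapping, {})
# An identity entry like the dev rig's compose settings supplies it.
with_baseline = session_roles(groups, db_mapping, {"WriteOperate": "WriteOperate"})
```

Since the DB column is typed to the AdminUI enum, only the `GroupToRole` side of the union can produce `WriteOperate`, which is why an unseeded deployment denies every write.
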
### Dev-rig artifacts created for the protocol-linkage live-verify (left in place, NOT committed)
- **`docker-dev/docker-compose.yml`** gained `Security__Ldap__GroupToRole__{ReadOnly,WriteOperate,WriteTune,WriteConfigure,AlarmAck}` identity entries on both central nodes (needed for data-plane roles — see above). **Uncommitted** (rig config; the file was already modified at session start).
- **DB seeds** on `otopcua-dev-sql-1`/`OtOpcUa`: driver `MAIN-modbus-eq` (DriverType=Modbus, `{"Host":"10.100.0.35","Port":5020,"UnitId":1,"Tags":[]}`, namespace `nw-uns`, cluster MAIN) + tags `tag-modbus-hr100` (HR[100] auto-increment, Read — read demo) and `tag-modbus-hr200` (HR[200] scratch, ReadWrite — write demo), both on equipment `EQ-55297329838d` (filler-02). The pymodbus `standard` sim (`10.100.0.35:5020`) serves HR[0..31]=addr-as-value, HR[100]=auto-increment, HR[200..209]=writable scratch.
### Incidental findings from the OpcUaClient live-verify (2026-06-13)
- **Driver-reconfigure-while-faulted gap — FIXED + MERGED `56f73e49` (2026-06-14).** (Was: a `DriverInstanceActor` stuck in `Reconnecting`/`Connecting` had **no `ApplyDelta` handler**, so a corrected config dead-lettered and the actor retried the OLD `_currentConfigJson` forever; workaround = restart the node.) Now `Connecting`/`Reconnecting` handle `ApplyDelta` via `AdoptConfigDuringInit`, re-initialising with the new config; a monotonic `_initGeneration` guard supersedes the in-flight old init so the corrected config always wins. See the top banner + `docs/plans/2026-06-14-driver-reconfigure-while-faulted-*.md`. Live `/run` gate deferred (user choice).
- **Dev-rig config edit applied directly in DB.** The `MAIN-opcua-eq` `DriverConfig.targetNamespaceKind` was `0` (Equipment, which requires a `UnsMappingTable` → `InitializeAsync` rejected it). Set to `1` (SystemPlatform — the direct-ref mode the equipment-tag model wants; empty `unsMappingTable:{}` passes validation) via a direct `JSON_MODIFY` UPDATE on `otopcua-dev-sql-1` (DB `OtOpcUa`, `SET QUOTED_IDENTIFIER ON` required for JSON fns; `sqlcmd -h` and `-y 0` are mutually exclusive — pick one). The AdminUI driver-edit combobox for "Target namespace kind" did **not** persist the change (suspected live-only Blazor binding bug — unverified; the DB edit sidestepped it). Deploy snapshots the **live** config DB directly (`AdminOperationsActor` → `DraftSnapshotFactory.FromConfigDbAsync` + `ConfigComposer.SnapshotAndFlattenAsync`), so a DB edit flows through on the next `POST /api/deployments` (new revisionHash).
## Operational deferral (user choice)
1. **docker-dev rig cleanup (round-1 T9) deferred.** The local docker-dev rig still has the
live-verify seed artifacts deployed: the `t12-overheat` scripted alarm, the
`SC-ba675b168a85` predicate script, the `layer0-logcheck` vtag/script, and filler-02's
modified `cycle-time-s` line. Left as-is to inspect the working double-emit fix. **To clean
up:** delete those artifacts in the AdminUI (or DB), revert filler-02's `cycle-time-s` to
`return ctx.GetTag("TestMachine_002.TestDuration").Value;`, then redeploy
(`POST http://localhost:9200/api/deployments`, header `X-Api-Key: docker-dev-deploy-key`).
2. **Equipment-tag live-value verify artifacts (left in place — all now FUNCTIONAL).** The docker-dev
rig carries verify artifacts under the `nw-uns` Equipment namespace on `EQ-55297329838d`
(filler-02), all three now working: `MAIN-galaxy-eq` (GalaxyMxGateway → live gateway, `GalaxyTestTag`
= `TestMachine_002.TestFloat` RW, write-proven; needs the ephemeral `GALAXY_MXGW_API_KEY` re-exported
on recreate — see Galaxy dev-rig note above); `MAIN-opcua-eq` (OpcUaClient, factory shipped `22d553af`)
+ `FastUInt1` tag (`{"FullName":"ns=3;s=FastUInt1"}`); `MAIN-modbus-eq` (Modbus) + `tag-modbus-hr100`
(read) / `tag-modbus-hr200` (RW write demo). docker-dev is **LOCAL on this Mac** (OrbStack); central-1
@ `localhost:4840`/AdminUI+deploy @ `localhost:9200`, sql @ `localhost:14330` (sa/`OtOpcUa!Dev123`),
login disabled. Sims for the protocol/opcua verifies run on the docker host `10.100.0.35`
(`otopcua-pymodbus-standard` :5020, `otopcua-opc-plc` :50000) — leave up or `docker compose down` per
`/opt/otopcua-modbus` + `/opt/otopcua-opcuaclient`. Phase A backup
`OtOpcUa-prePhaseA-20260612-224908.bak` is in the SQL volume `/var/opt/mssql/backup/`.