lmxopcua/pending.md
Joseph Doherty cd20c3c064
docs: refresh pending.md for compaction (Phase C shipped; open-items digest)
2026-06-14 22:23:58 -04:00


Pending — open follow-ups & deferrals

As of 2026-06-14. master HEAD c24abc8a (synced with origin; feature branch feat/galaxy-phase-c-historian ff-merged + deleted). Working tree is clean except the expected DISK-ONLY files: docker-dev/docker-compose.yml (M — uncommitted rig config, never staged) and pending.md (M — these notes, never staged), plus two untracked pre-existing docs/plans/2026-06-14-write-outcome-self-correction-plan.md* docs. HARD RULE: never git add .; never stage pending.md / current.md / docker-dev/docker-compose.yml / sql_login.txt / src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/; never commit secrets.

GALAXY PHASE C — SERVER-SIDE OPC UA HistoryRead MERGED + PUSHED to master c24abc8a (2026-06-14, fast-forward, 14 commits = 2 design/plan docs + 12 feature/test/doc). The server now answers OPC UA HistoryRead (Raw / Processed / AtTime over historized variable nodes; Events over alarm-owning equipment-folder event-notifier nodes) for any equipment tag flagged historized, driver-agnostically, by dispatching to the registered IHistorianDataSource (the Wonderware historian TCP client, which already implemented that interface). NO EF migration — the flag rides in the existing TagConfig JSON blob ({"FullName":"…","isHistorized":true,"historianTagname":"…"?}, the Phase-B alarm-object carrier); historianTagname defaults to the tag's driver FullName. Design/plan docs/plans/2026-06-14-galaxy-phase-c-historian-*.md; guide docs/Historian.md. Pipeline: Phase7Composer.ExtractTagHistorize + byte-parity DeploymentArtifact.ExtractTagHistorize → EquipmentTagPlan.{IsHistorized,HistorianTagname} → Phase7Applier resolves IsHistorized ? (HistorianTagname ?? FullName) : null → sink seam (IOpcUaAddressSpaceSink.EnsureVariable +string? historianTagname) → OtOpcUaNodeManager.EnsureVariable sets Historizing+AccessLevels.HistoryRead+registers _historizedTagnames; the node manager overrides the four CustomNodeManager2 HistoryRead virtuals, block-bridging to the HistorianDataSource property (volatile, default NullHistorianDataSource.Instance). DI mirrors AlarmHistorian: AddServerHistorian (config-gated, ServerHistorian appsettings section, Null default via TryAddSingleton) + OtOpcUaSdkServer.SetHistorianDataSource + Host Program.cs/OtOpcUaServerHostedService Start/Stop wiring. Graceful degrade: historized node + Null/unconfigured source → Good_NoData (empty), non-historized node → BadHistoryOperationUnsupported.
KEY FACTS/GOTCHAS: the SDK base filters event reads by the EventNotifier.HistoryRead bit (variable nodes never reach the events arm); the SDK master propagates errors[i].Code → results[i].StatusCode (confirmed by decompiling MasterNodeManager.HistoryReadAsync), so the override signals per-node status via errors[handle.Index]; the two HistoryReadResult types (SDK Opc.Ua.HistoryReadResult vs Core.Abstractions DTO) are aliased SdkHistoryReadResult/HistorianRead; ReadRawModifiedDetails.IsReadModified defaults TRUE (Initialize() sets it) so a plain raw read must clear it — modified-history is unsupported; the events arm registers folder→sourceName (= equipment id) only when a non-Null historian is wired at promotion time (Host wires the source at StartAsync before any deployment materialises, so normal boot ordering is correct). Built via subagent-driven dev (T1/T2/T5 standard parallel review, T3/T4 high-risk serial spec→code, + a final integration review READY-TO-MERGE). Build clean (0 errors); OpcUaServer.Tests 152/0, Runtime.Tests 234/0, Core.Abstractions.Tests 88/0. LIVE /run GATE (T7) DEFERRED — operator-driven: it needs the Wonderware sidecar + AVEVA Historian on the WW Historian VM 10.100.0.48, which is NOT on the local docker-dev rig. When run: author a historized Galaxy tag (TagConfig "isHistorized":true), set ServerHistorian:Enabled=true → sidecar (Host/Port/SharedSecret/TLS), deploy on MAIN-galaxy-eq, then Client.CLI historyread -n "ns=2;s=<equip>/<tag>" --start … --end … → samples; a non-historized tag → BadHistoryOperationUnsupported. KNOWN follow-ups (non-blocking, documented): single-shot reads only (no server-managed continuation-point paging yet); no modified-value history; no explicit timeout at the block-bridge call site (bounded by the WonderwareHistorianClient 30s CallTimeout); the StopAsync→DisposeAsync warm-shutdown window returns Good_NoData (mirrors the SetNodeWriteGateway pattern).
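The tagname resolution rule in the Phase C pipeline (IsHistorized ? (HistorianTagname ?? FullName) : null) can be sketched in a few lines. This is a Python illustration of the rule only, not the C# code in Phase7Applier; the function name resolve_historian_tagname and the sample tag names are hypothetical.

```python
import json
from typing import Optional


def resolve_historian_tagname(tag_config_json: str) -> Optional[str]:
    """Return the historian tagname for a historized tag, or None otherwise.

    Mirrors: IsHistorized ? (HistorianTagname ?? FullName) : null
    """
    cfg = json.loads(tag_config_json)
    if not cfg.get("isHistorized", False):
        # Not historized: the node would not be registered for HistoryRead.
        return None
    override = cfg.get("historianTagname")
    # An explicit historianTagname wins; otherwise fall back to FullName.
    return override if override is not None else cfg["FullName"]
```

A tag without the flag resolves to None, which corresponds to the non-historized branch that answers BadHistoryOperationUnsupported.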

DRIVER-RECONFIGURE-WHILE-FAULTED (#7) MERGED + PUSHED to master 56f73e49 (2026-06-14, fast-forward, 5 commits = 2 docs + fix + review-nit + task-status). A DriverInstanceActor stuck Connecting/Reconnecting now adopts a corrected config delivered via ApplyDelta and re-initialises with it, instead of dead-lettering the message and retrying the stale config forever (old workaround = restart node). Design/plan docs/plans/2026-06-14-driver-reconfigure-while-faulted-*.md. Mechanism (approach B): a monotonic _initGeneration tags each InitializeAsync; InitializeSucceeded(int Generation)/InitializeFailed(string Reason, int Generation) carry it; the Connecting/Reconnecting result handlers drop superseded (stale-generation) results so a corrected config always wins against an old init still in flight; a new AdoptConfigDuringInit (wired into both not-connected states) calls InitializeAsync(newConfig) (swaps _currentConfigJson, bumps the generation, retries immediately) + replies ApplyResult(true,…). Contained to ONE file (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs) + its test — NO host/contract/EF change; Connected/Stubbed ApplyDelta paths untouched; the two result records have ZERO external consumers (grep-verified) so the shape change is fully contained. Built via subagent-driven dev (high-risk full chain: spec · code · final integration — the integration pass traced the no-strand lifecycle in both states, host↔child contract, subscription/desired-refs/alarm safety, health/redundancy, double-adopt, and test fidelity, all clean). Build clean (0 errors); Runtime 224/224. LIVE /run GATE DEFERRED per user ("skip the live test part"). When run: put MAIN-opcua-eq into a faulted/Reconnecting state via a bad DriverConfig, deploy a corrected config (POST http://localhost:9200/api/deployments, X-Api-Key: docker-dev-deploy-key), and confirm from central-1 logs that the driver adopts the new config + connects WITHOUT a node restart.
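The generation guard described above can be modelled as a small state machine. The sketch below is a simplified Python model, assuming a single-threaded mailbox like an actor's; the class and method names are illustrative and do not reproduce DriverInstanceActor.

```python
from dataclasses import dataclass


@dataclass
class InitResult:
    ok: bool
    generation: int
    reason: str = ""


class DriverInitModel:
    """Toy model of the _initGeneration guard (names are hypothetical)."""

    def __init__(self) -> None:
        self.config = ""
        self.generation = 0
        self.state = "Connecting"

    def initialize(self, config: str) -> int:
        # Swap the config and bump the generation; the returned number
        # tags whichever result this attempt eventually produces.
        self.config = config
        self.generation += 1
        self.state = "Connecting"
        return self.generation

    def adopt_config_during_init(self, new_config: str) -> int:
        # A corrected config arriving while not connected restarts init at once.
        return self.initialize(new_config)

    def on_result(self, result: InitResult) -> bool:
        """Apply a result; return False when it is dropped as superseded."""
        if result.generation != self.generation:
            return False
        self.state = "Connected" if result.ok else "Reconnecting"
        return True
```

Because a late failure from the old attempt carries the old generation, it is ignored and cannot push the actor back onto the stale config.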

GALAXY PHASE B — NATIVE ALARMS ON THE EQUIPMENT-TAG PATH MERGED + PUSHED to master f9be3843 (2026-06-14, fast-forward, 12 feature + 2 doc commits). A Galaxy equipment Tag marked as a native alarm via its TagConfig.alarm object — {"FullName":"tag.attr","alarm":{"alarmType":"OffNormalAlarm","severity":700}}, NO EF migration — now materialises a real OPC UA Part 9 AlarmConditionState under its equipment folder, driven live by the driver's IAlarmSource.OnAlarmEvent; transitions fan out to /alerts + the historian (Primary-gated). Design/plan docs/plans/2026-06-14-galaxy-phase-b-native-alarms-*.md. NEW seam (mirrors the scripted-alarm seam, reuses the condition sink UNCHANGED): AlarmEventArgs.Kind (additive contract; Galaxy populates it) → DriverInstanceActor subscribes OnAlarmEvent → AttributeAlarmPublished → DriverHostActor._alarmNodeIdByDriverRef + NativeAlarmProjector (transition→AlarmConditionSnapshot) → OpcUaPublishActor.AlarmStateUpdate → reused OtOpcUaNodeManager.WriteAlarmCondition. Built via subagent-driven dev (full per-task review chain). Build clean; Core.Abstractions 81/0, OpcUaServer 118/0, Runtime 222/0, Galaxy 262/0(+1 live-gw skip). THE FINAL INTEGRATION REVIEW CAUGHT A CRITICAL SEAM BUG every unit test missed (fixed in f9be3843): the alarm map is keyed by the dotted FullName, but GalaxyDriver puts the BARE owning object in AlarmEventArgs.SourceNodeId and the DOTTED alarm ref in ConditionId (GalaxyDriver.cs:1148-1149; AlarmFullReference) — so ForwardNativeAlarm MUST resolve on msg.Args.ConditionId (= AlarmFullReference = the authored FullName), NOT SourceNodeId. The unit test had masked it by setting SourceNodeId==FullName (never true in prod); it is now production-shaped (SourceNodeId="Temp", ConditionId="Temp.HiHi"=FullName) so it genuinely guards the seam. LIVE /run GATE (T9) NOT YET DONE — user-driven, deferred at merge (user choice "merge now").
When run: author a Galaxy alarm equipment tag whose FullName EXACTLY matches the gateway's AlarmFullReference (discoverable via the Galaxy picker/a probe), deploy on MAIN-galaxy-eq, trip the alarm → Part 9 condition goes active under the equipment + the /alerts row appears; clear → inactive. RESIDUAL non-blocking follow-ups (review-surfaced): (a) DetachSubscription alarm-coupling doc note + a dead-letter-during-reconnect regression test (WS-4b); (b) ack/comment-path test + assert evt.Comment (WS-5); (c) a docs/ScriptedAlarms.md note that authored severity (1..1000, seeds the condition at materialise) snaps to the projector's 4-bucket value (200/500/700/900) on the first transition; (d) DEFERRED by design: inbound device-ack (client Ack → IAlarmSource.AcknowledgeAsync → AVEVA), driving SubscribeAlarmsAsync from the materialised alarm-ref set (Galaxy doesn't need it), AdminUI Galaxy-picker alarm pre-fill, carrying raw OPC UA severity end-to-end. Phase C (server HistoryRead) DONE — merged c24abc8a (2026-06-14); see the Phase C banner at top.
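The seam bug in the Phase B banner comes down to which field is used as the lookup key. The Python sketch below shows the production-shaped case, assuming a plain dictionary stands in for _alarmNodeIdByDriverRef; the node id string and the helper name resolve_alarm_node are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AlarmEventArgs:
    source_node_id: str  # bare owning object, e.g. "Temp"
    condition_id: str    # dotted alarm reference, e.g. "Temp.HiHi"


def resolve_alarm_node(alarm_node_by_ref: Dict[str, str],
                       args: AlarmEventArgs) -> Optional[str]:
    # The map is keyed by the authored dotted FullName, which equals
    # the ConditionId, so the lookup must not use source_node_id.
    return alarm_node_by_ref.get(args.condition_id)


alarm_map = {"Temp.HiHi": "ns=2;s=EQ-example/TempHiHi"}  # hypothetical node id
event = AlarmEventArgs(source_node_id="Temp", condition_id="Temp.HiHi")
```

With this event shape a lookup on source_node_id finds nothing, which is why a test that sets SourceNodeId equal to FullName could not catch the original bug.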

WRITE-OUTCOME SELF-CORRECTION (#5) MERGED + PUSHED to master 1d797c1c (2026-06-14, fast-forward, 6 commits). A failed inbound device write now reverts the node to its real pre-write value (compare-and-revert) instead of leaving the optimistic-Good phantom. Design docs/plans/2026-06-14-write-outcome-self-correction-design.md; plan …-plan.md. NEW IOpcUaNodeWriteGateway/NodeWriteOutcome (Commons) + ActorNodeWriteGateway (Runtime, Asks RouteNodeWrite, returns the outcome) replace the fire-and-forget Action router; OnEquipmentTagWrite captures the prior value + fires an off-Lock (RunContinuationsAsynchronously) continuation that reverts on a failed outcome IF the node still holds the optimistic value (ShouldRevert). Build clean; Commons 39/0, Runtime 201/0, OpcUaServer 111/0. High-risk review verified the prior-value capture against the actual UA-.NETStandard source (OnWriteValue fires before m_value=value). LIVE-PROVEN end-to-end via a local Modbus exception-injector (FC06 reject on HR[20]): authorized failing write → gateway-logged 0x808B0000 reject → node reverts 99→20; authorized success (HR200=7777) stays; anon → BadUserAccessDenied. KEY FINDING: the Galaxy gateway worker's ExecuteWrite is fire-and-forget (returns OK without awaiting the MXAccess commit), so Galaxy writes ALWAYS return Success at the OPC UA layer and can NEVER surface a device-write failure to this revert — only protocol drivers (which await + return real status) can; this is the same gateway-side limitation noted under "optimistic-write phantom" (out of our scope). Two MINOR deferred follow-ups remain: a Bad-quality blip / OPC UA AuditWriteUpdateEvent on failure, and synchronous structural fail-fast (both explicitly out of scope per the chosen mechanism).
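The compare-and-revert rule can be illustrated with a minimal Python model. It assumes the continuation runs after the SDK has applied the optimistic value; Node, should_revert, and apply_write_outcome are illustrative names, not the C# members in OnEquipmentTagWrite.

```python
class Node:
    """Stand-in for an OPC UA variable node holding a single value."""

    def __init__(self, value):
        self.value = value


def should_revert(current, optimistic) -> bool:
    # Revert only if nothing else has overwritten the optimistic value.
    return current == optimistic


def apply_write_outcome(node: Node, prior, optimistic, success: bool) -> bool:
    """Return True if the node was reverted to its pre-write value."""
    if success:
        return False
    if not should_revert(node.value, optimistic):
        # A newer value (for example a driver push) already replaced the
        # optimistic one, so reverting would clobber real data.
        return False
    node.value = prior
    return True
```

In the live proof described above this corresponds to the 99→20 revert: prior is 20, optimistic is 99, and the failed outcome restores 20.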

HARDEN MILESTONE 1b cluster MERGED + PUSHED to master 945c2380 (2026-06-14, fast-forward, 9 commits). Follow-ups #3 (data-plane role docs), #4 (write-pipeline review nits), and #6 (Galaxy driver nits) below are CLOSED. Plan: docs/plans/2026-06-14-harden-milestone-1b-plan.md. Build clean; Runtime 197/0, FOCAS 185/0, Galaxy 257/0(+1 pre-existing skip); final integration review READY-TO-MERGE. Two MINOR residual follow-ups surfaced by review (both deferred, non-blocking): (a) a driver-level regression test that GalaxyDriver.ReopenAsync actually calls InvalidateHandleCaches — needs a live gw (RecreateAsync can't be faked), so it's an integration test; (b) stub-driver test-class duplication between DriverInstanceActorTests + DriverInstanceActorWriteAndSubscribeTests (hygiene — extract a shared harness).

STATE SUMMARY (post-compaction pickup)

ALL feature work is SHIPPED + PUSHED to master c24abc8a (synced with origin). Nothing is blocking. Milestone 1b (equipment-tag live values: live READ + authorized inbound WRITE across OpcUaClient / the 6 protocol drivers / Galaxy, via the FullName→NodeId router) is COMPLETE, and all three Galaxy phases shipped: A standard Equipment driver c3c56172, B native alarms f9be3843, C server-side HistoryRead c24abc8a. The session's cluster-harden / write-outcome-self-correction / driver-reconfigure-while-faulted follow-ups merged too (945c2380 / 1d797c1c / 56f73e49). The six banners above carry each feature's mechanism + gotchas + deferred live gate; the closed open-follow-ups #1–#7 (Phase B, Phase C, data-plane role docs, write-pipeline nits, write-outcome, Galaxy driver nits, reconfigure-while-faulted) are all DONE at those SHAs.

The ONLY genuinely open items (all user-driven / deferred — pick up here):

  1. User-driven live /run gates — the agent does NOT sign in; all code is merged + unit-verified; these are the operator's end-to-end confirmations:
    • Phase C HistoryRead (T7) — needs the Wonderware sidecar + AVEVA on the WW Historian VM 10.100.0.48 (NOT on the local docker-dev rig). Recipe: Phase C banner + docs/Historian.md.
    • Phase B native alarms (T9) — author a Galaxy alarm tag whose FullName == the gateway AlarmFullReference, deploy on MAIN-galaxy-eq, trip → Part 9 condition + /alerts row. Recipe: Phase B banner.
    • Driver-reconfigure-while-faulted — fault MAIN-opcua-eq with a bad config, deploy a corrected one (POST http://localhost:9200/api/deployments, X-Api-Key: docker-dev-deploy-key), confirm it adopts WITHOUT a node restart. Recipe: that banner.
  2. Rig cleanups (operational, user-deferred) — see "Operational deferral" at the bottom.
  3. Minor non-blocking residual follow-ups (review-surfaced, all explicitly deferred, none gate anything): Phase B residuals (a)–(d) in its banner; write-outcome residuals (Bad-quality blip / AuditWriteUpdateEvent / synchronous fail-fast); harden-1b two residuals (945c2380 banner: Galaxy-reopen integration test, stub-driver test-class de-dup); Phase C documented follow-ups (no continuation-point paging, no modified-value history, block-bridge timeout bounded only by the client's 30s CallTimeout); the data-plane GroupToRole production-default note; Galaxy _itemHandles/_supervisedHandles not cleared on reconnect + the cosmetic SubscriptionEstablished self-dead-letter.
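For the driver-reconfigure live gate in item 1, the deploy call can be scripted. The following is a hedged Python sketch that only uses the URL and header named in these notes; the empty request body and the helper names build_deploy_request and trigger_deploy are assumptions, since the notes do not state what payload the endpoint expects.

```python
import urllib.request

DEPLOY_URL = "http://localhost:9200/api/deployments"
API_KEY = "docker-dev-deploy-key"  # dev-rig key quoted in these notes


def build_deploy_request(url: str = DEPLOY_URL,
                         api_key: str = API_KEY) -> urllib.request.Request:
    # Assumption: the endpoint accepts a POST with an empty body.
    return urllib.request.Request(
        url, data=b"", method="POST", headers={"X-Api-Key": api_key}
    )


def trigger_deploy(timeout: float = 30.0) -> int:
    """Send the deploy request to the local rig and return the HTTP status."""
    with urllib.request.urlopen(build_deploy_request(), timeout=timeout) as resp:
        return resp.status
```

Calling trigger_deploy() only makes sense with the docker-dev rig running; confirming that the driver adopted the new config still requires reading the central-1 logs.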

No queued feature remains — Milestone 1b + Galaxy A/B/C were the headline deliverables and are all done. Future directions (NOT requested): the Phase C HistoryRead follow-ups above, or new driver/UNS work.


The six historian code follow-ups (HistorizeToAveva opt-out, drain/capacity/retention config knobs, SharedSecret/DatabasePath/non-positive-knob startup validation, operator-recording for shelve/enable/disable, and the SqliteStoreAndForwardSink thread-safety nits) were all resolved on branch feat/alarm-historian-followups (plan: docs/plans/2026-06-11-alarm-historian-followups.md). They are no longer listed here.

Equipment-tag live values — MILESTONE 1b COMPLETE (2026-06-13)

The Galaxy standard-driver effort shipped Phase A (c3c56172) + the FullName→NodeId live-value ROUTER (c4435e4f, both pushed). The router is done + verified (322 tests + integration review READY-TO-MERGE). All three driver-publish gaps are now CLOSED — an equipment tag bound to OpcUaClient, any protocol driver, OR Galaxy publishes a live value delivered by the router (full detail in current.md "Milestone 1b" + docs/plans/2026-06-13-equipment-tag-live-values-design.md):

  1. OpcUaClient has NO factory (real bug — always stubbed). DONE — SHIPPED+PUSHED master 22d553af 2026-06-13. Added OpcUaClientDriverFactoryExtensions (mirror Modbus) + registered it in DriverFactoryBootstrap. First live equipment-tag value PROVEN end-to-end: OpcUaClient driver MAIN-opcua-eq spawns stub=False, connects to opc-plc, subscribes to ns=3;s=FastUInt1; the FullName→NodeId router (c4435e4f) delivers it to the materialised variable ns=2;s=EQ-55297329838d/FastUInt1, which reads a live changing value (10135→10141, Good) via Client.CLI. Design/plan docs/plans/2026-06-13-opcuaclient-factory-*.md. Two incidental findings while live-verifying (see below).
  2. Protocol drivers (Modbus/S7/AbCip/…) — equipment-tag↔driver tag-table linkage unbuilt. DONE — SHIPPED+PUSHED master 8d8c05f5 2026-06-13 (+ full inbound operator WRITE pipeline). Approach B (driver-side direct-ref): a shared EquipmentTagRefResolver<TDef> (Core.Abstractions) resolves an equipment-tag ref (the raw TagConfig JSON blob the router already keys on) into a transient driver tag-def on a _tagsByName miss — wired into READ + WRITE for all six drivers (Modbus/S7/AbCip/AbLegacy/TwinCAT/FOCAS), each with a hardened never-throw <Driver>EquipmentTagParser. Part B (write-through): writable nodes (Tag.AccessLevel==ReadWrite → CurrentReadWrite, byte-parity in Phase7Composer+DeploymentArtifact), an OnWriteValue gate on the WriteOperate data-plane role (mirrors the alarm-ack bridge; fire-and-forget dispatch since the SDK holds the node-manager Lock during OnWriteValue), a NodeWriteRouter on the node manager, and DriverHostActor.RouteNodeWrite (NodeId→driver reverse map, primary-gated). LIVE-PROVEN end-to-end: Modbus equipment tag (HR[100]) reads a live changing value; an authorized write (opc-writeop/WriteOperate) to HR[200] changes the register + persists; an anonymous write → BadUserAccessDenied. Design/plan docs/plans/2026-06-13-protocol-equipment-tag-linkage-*.md. Findings + rig artifacts below.
  3. Galaxy — needs a reachable mxaccessgw. DONE — LIVE-PROVEN 2026-06-13 (no code change; config-only). The code-investigation confirmed Galaxy was already fully wired: GalaxyDriverFactoryExtensions IS registered in DriverFactoryBootstrap.cs:103 (not the missing-factory bug OpcUaClient had), and the Galaxy driver keys subscriptions on the FullReference (tag_name.AttributeName) DIRECTLY (no _tagsByName miss). Gap (c) was purely a misconfigured dev driver-instance + placeholder tag ref + unset key — ALL data in existing columns, NO EF/schema change. Fixes applied to the dev rig (otopcua-dev-sql-1/OtOpcUa): MAIN-galaxy-eq DriverConfig gateway.endpoint https://10.100.0.35:5001 → http://10.100.0.48:5120, useTls true → false, apiKeySecretRef env:MX_API_KEY(unset)→env:GALAXY_MXGW_API_KEY (the var the compose already wires on every node); GalaxyTestTag TagConfig.FullName TestMachine_002.SomeAttr(placeholder)→TestMachine_002.TestDuration (a real galaxy Float attr). The gateway API key was injected via ephemeral shell env at docker compose up -d --no-deps --force-recreate central-1 central-2 time (NEVER written to a tracked file; the compose's ${GALAXY_MXGW_API_KEY:-stale-default} substitution picks it up — the running containers carry the real key only until the next recreate-without-the-env-var). Live (central-1 logs): spawned GalaxyMxGateway driver MAIN-galaxy-eq (stub=False) → GalaxyMxSession connected — clientName=OtOpcUa (auth OK) → initialized — endpoint=http://10.100.0.48:5120 → subscribed to 1 refs (galaxy-sub-1) (TestMachine_002.TestDuration accepted, no BadNodeIdUnknown). Value: Client.CLI read ns=2;s=EQ-55297329838d/GalaxyTestTag → Value 0, Status 0x00000000 (Good), Source Time 2026-05-07T07:14:26Z (a real galaxy timestamp — a genuine attribute snapshot, NOT BadWaitingForInitialData; static because that attr isn't actively moving). Restore-the-rig SQL saved at /tmp/galaxy-gapc-snapshot.sql. Milestone 1b is now COMPLETE — all three gaps closed. Findings/follow-ups below.
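The direct-ref resolution from item 2 (resolve an equipment-tag ref into a transient tag-def on a _tagsByName miss, through a never-throw parser) might look like the Python sketch below. It models the idea only: the "address" key in the blob and the parse_modbus_ref helper are hypothetical, because these notes do not show the real per-driver TagConfig shape.

```python
import json
from typing import Callable, Dict, Generic, Optional, TypeVar

TDef = TypeVar("TDef")


class EquipmentTagRefResolver(Generic[TDef]):
    """Look up a configured tag first, then try to parse the ref itself."""

    def __init__(self, tags_by_name: Dict[str, TDef],
                 parser: Callable[[str], Optional[TDef]]) -> None:
        self._tags_by_name = tags_by_name
        self._parser = parser

    def resolve(self, ref: str) -> Optional[TDef]:
        known = self._tags_by_name.get(ref)
        if known is not None:
            return known
        try:
            # Miss: treat the ref as a raw TagConfig blob and build a
            # transient definition that is not added to the tag table.
            return self._parser(ref)
        except Exception:
            # Never-throw contract: a bad blob resolves to "unknown tag".
            return None


def parse_modbus_ref(ref: str) -> Optional[dict]:
    """Hypothetical parser; the real blob fields are not given in the notes."""
    try:
        cfg = json.loads(ref)
    except ValueError:
        return None
    if not isinstance(cfg, dict) or not isinstance(cfg.get("address"), int):
        return None
    return {"address": cfg["address"]}
```

Keeping the parsed definition transient means the configured tag table stays the single source of truth for named tags.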

Then: Phase B = native IAlarmSource alarms on the equipment-tag path (DONE — f9be3843); Phase C = server-side HistoryRead backend over the Wonderware reader (DONE — c24abc8a, 2026-06-14; design sections in docs/plans/2026-06-12-galaxy-standard-driver-design.md + the dedicated docs/plans/2026-06-14-galaxy-phase-c-historian-*.md).

Findings + follow-ups from the Galaxy gap-(c) live-verify (2026-06-13)

  • Benign dead-letter (minor, pre-existing in the Galaxy driver — NOT introduced here). On subscribe the driver logs: Message [SubscriptionEstablished] from drv-MAIN-galaxy-eq to drv-MAIN-galaxy-eq was unhandled. [N] dead letters. The Galaxy DriverInstanceActor/driver sends itself a SubscriptionEstablished message that has no Receive<> handler. Harmless (the subscription IS established + delivering values), but noisy — add a handler (or stop self-Telling it). Cosmetic.
  • CHANGING-value read PROVEN (2026-06-13). Repointed GalaxyTestTag.FullName → TestMachine_002.TestChangingInt (a script-driven Integer, sec=Operate): three Client.CLI reads returned 810 → 787 → 764 with real galaxy source timestamps advancing ~7s each (02:28:41/:48/:55) — a genuine live moving Galaxy value through the router (not optimistic/phantom). The dev rig is now left with GalaxyTestTag pointing here (DataType Int32, AccessLevel Read). Discovery was done with a throwaway probe (now deleted) using GalaxyDriverBrowser.OpenAsync → session.AttributesAsync("TestMachine_002"), which lists every attribute's SecurityClass (ViewOnly=read-only; FreeAccess/Operate/Tune/Configure=writable). Useful attrs on TestMachine_002: TestFloat(Float,Operate), TestDouble(Double,Operate), TestChangingInt(Integer,Operate,moves), TestDuration(ElapsedTime,Operate), AlarmInhibit(Boolean,FreeAccess).
  • GALAXY WRITE-THROUGH — FIXED + MERGED to master f05b5d79 (AdviseSupervisory before raw Write). Symptom was every Galaxy operator write returning MxaccessFailure "ArgumentException: HRESULT 0x80070057" (E_INVALIDARG). TWO-LAYER root cause (debugged by SSH-reading the gateway source on 10.100.0.48 C:\Users\dohertj2\Desktop\mxaccessgw — there is NO live gateway file log: console-only/uncaptured, NSSM stdout.log stale, dashboard :5130 is Blazor/no-REST): (1) the writer AddItem'd an UN-advised handle → MXAccess Write threw E_INVALIDARG (worker chain ExecuteWrite → MxAccessSession.Write → MxAccessComServer.Write → AsProxyServer().Write(...)). (2) DEEPER — a plain Write runs with no user login (WriteUserId=0), and MXAccess only COMMITS such a write when the item is advised in SUPERVISORY mode; a regular Advise removed the E_INVALIDARG but never committed (proven by a persistence check: read-back showed the value, but a --force-recreate+fresh-resubscribe reverted to the original 0 @ 2026-05-07; the worker's ExecuteWrite is fire-and-forget, returns OK without awaiting OnWriteComplete). Confirmed against the sister ScadaBridge driver (~/Desktop/ScadaBridge/.../RealMxGatewayClient.cs): it commits the OTHER way — a configured non-zero WriteUserId + regular Advise + WriteBulk. We have no galaxy login → supervisory context. FIX: GatewayGalaxyDataWriter calls AdviseSupervisory (raw MxCommand{Kind=AdviseSupervisory, AdviseSupervisory=new AdviseSupervisoryCommand{ServerHandle,ItemHandle}} via session.InvokeAsync, mirroring InvokeWriteSecuredAsync; idempotent per handle via _supervisedHandles) before each raw WriteRawAsync; SecuredWrite/VerifiedWrite tags keep their own user-identity path (NeedsSecuredWrite unchanged — WriteSecured is ONLY for those special-security tags). The dead-end "reuse the subscription's advised handle" resolver attempt was reverted.
LIVE-PROVEN: authorized write (opc-writeop/WriteOperate) of TestMachine_002.TestFloat=1234.5 then 8888.25 COMMITS + PERSISTS across recreate/re-subscribe (galaxy-sourced timestamp); anonymous → BadUserAccessDenied. 254 Galaxy tests green; central --build clean. OPEN follow-ups from this: (a) the worker's fire-and-forget ExecuteWrite can't surface an async write failure — with supervisory advise the write commits, but only a read-back confirms a specific write (gateway-side; out of our scope). (b) _itemHandles/_supervisedHandles caches aren't cleared on reconnect (pre-existing for _itemHandles) — a write right after a reconnect could use a stale handle; minor.
  • OPTIMISTIC-WRITE PHANTOM (open follow-up — surface real write status to the client). The inbound write dispatch is fire-and-forget: it returns optimistic Good before the driver result (required — OnWriteValue runs under the node-manager Lock), and the SDK applies the written value to the node locally. So a write whose DEVICE write FAILS still returns Good, and for a STATIC attribute that never re-pushes, the wrong value LINGERS (a phantom the device never accepted). The pipeline already computes the real status in NodeWriteResult.Success/Reason but only LOGS it — consider surfacing it to the client. (How it was caught live: a failed Galaxy write showed the written value on read-back with a SERVER-clock source timestamp + a rejected driver log; a committed write shows a GALAXY-clock timestamp + no rejection, and persists across a re-subscribe.)
  • Dev-rig Galaxy config is CORRECT + WORKING (left in place). The MAIN-galaxy-eq driver-instance is deployed and connecting to the live gateway http://10.100.0.48:5120. GalaxyTestTag (on EQ-55297329838d/filler-02, nw-uns namespace) is currently {"FullName":"TestMachine_002.TestFloat"}, DataType Float, AccessLevel ReadWrite (the write demo; galaxy now holds the last written value 8888.25). Other useful TestMachine_002 attrs (from the discovery probe): TestChangingInt(Integer,Operate,moves on its own — the live-changing READ demo), TestDouble(Double,Operate), TestDuration(ElapsedTime,Operate — reads as Float but a Float write is a type-mismatch), AlarmInhibit(Boolean,FreeAccess). To restore the original placeholder tag: /tmp/galaxy-gapc-snapshot.sql. The base seed docker-dev/seed/seed-clusters.sql still seeds the legacy SystemPlatform-namespace Galaxy driver (MAIN-galaxy-mxgw, tags TestMachine_001.TestAlarm001..003) — pre-Phase-A model, untouched/separate. The injected gateway key is EPHEMERAL — key=mxgw_otopcuakey2_so0… is supplied via shell env GALAXY_MXGW_API_KEY='…' at docker compose up --no-deps --force-recreate central-1 central-2; a recreate WITHOUT it re-exported falls back to the compose's stale default and Galaxy auth fails. ORDER on a redeploy: POST deploy FIRST, THEN recreate (a faulted driver ignores ApplyDelta).
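The write-through fix above (AdviseSupervisory once per handle, then the raw write) can be illustrated with a small Python model. FakeSession and DataWriterModel are stand-ins invented for this sketch; they do not reproduce the gateway client API or GatewayGalaxyDataWriter.

```python
class FakeSession:
    """Records calls so the ordering can be inspected."""

    def __init__(self) -> None:
        self.calls = []

    def advise_supervisory(self, handle: int) -> None:
        self.calls.append(("advise", handle))

    def write_raw(self, handle: int, value) -> None:
        self.calls.append(("write", handle, value))


class DataWriterModel:
    def __init__(self, session: FakeSession) -> None:
        self._session = session
        self._supervised_handles = set()  # models _supervisedHandles

    def write(self, handle: int, value) -> None:
        # Advise in supervisory mode only the first time a handle is used,
        # so that a write with no user login is actually committed.
        if handle not in self._supervised_handles:
            self._session.advise_supervisory(handle)
            self._supervised_handles.add(handle)
        self._session.write_raw(handle, value)
```

The model also makes open follow-up (b) visible: the set survives a reconnect unless something clears it, so a stale handle would skip the advise step.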

Findings + follow-ups from the protocol-linkage + write-through work (2026-06-13)

  • DATA-PLANE ROLE CONFIG REQUIREMENT (important, deployment-facing). The OPC UA session's roles come from two sources unioned: the DB LdapGroupRoleMapping (its Role column is the AdminRole enum — Administrator/Designer/Viewer only, for the AdminUI) AND the appsettings Security:Ldap:GroupToRole baseline (free-form string→string). The OPC UA data-plane gates (WriteOperate, AlarmAck, …) read literal role STRINGS that the AdminRole-typed DB mapping cannot produce — so a deployment MUST map its LDAP data-plane groups → data-plane role strings via GroupToRole, or write-through (and scripted-alarm OPC UA ack) is inert (every write → BadUserAccessDenied). The shared dev GLAuth already has dedicated groups+users (group WriteOperate, user opc-writeop, multi-role in all; opc-readonly); the dev rig just never seeded the GroupToRole. Consider a docs note (and/or a documented default) so production deployments wire this. (Same latent requirement applies to the pre-existing alarm-ack gate.)
  • Write-pipeline review follow-ups (non-blocking, from the final integration review): (a) DriverHostActor.Stale (and DriverInstanceActor.Connecting/Reconnecting) have no RouteNodeWrite/WriteAttribute handler → an operator write while stale/reconnecting dead-letters and the 10s Ask times out with a generic log (client got optimistic Good). Add fast-fail handlers returning a clear status. (b) Drop TaskContinuationOptions.ExecuteSynchronously on the router ContinueWith; List.Contains → HashSet in the forward-map build (micro). (c) FOCAS re-parses the address on every equipment-tag write (_parsedAddressesByTagName miss; perf only, rare). (d) DriverHostActorWriteRoutingTests seeds a Galaxy-style {"FullName":...} artifact, not a raw protocol-driver TagConfig blob — add a raw-blob case for belt-and-suspenders (runtime path is identical + live-verified). (e) Task-9 parity test is a faithful simulation of ConfigComposer (ToSnapshot casts AccessLevel to int) not a through-the-real-serializer proof; add an InlineData(2,false) future-enum trap.
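The data-plane role requirement above follows from how the two role sources are unioned. The Python sketch below is a simplified model that assumes each group maps to at most one role per source; session_roles and can_write are illustrative names, not the server's API.

```python
from typing import Dict, Iterable, Set


def session_roles(user_groups: Iterable[str],
                  db_mapping: Dict[str, str],
                  group_to_role: Dict[str, str]) -> Set[str]:
    """Union of the DB (AdminRole-typed) mapping and the GroupToRole baseline."""
    roles: Set[str] = set()
    for group in user_groups:
        if group in db_mapping:
            roles.add(db_mapping[group])      # Administrator/Designer/Viewer only
        if group in group_to_role:
            roles.add(group_to_role[group])   # free-form data-plane strings
    return roles


def can_write(roles: Set[str]) -> bool:
    # The write gate checks the literal data-plane role string.
    return "WriteOperate" in roles
```

Without a GroupToRole entry the user's roles never contain the literal string the gate reads, so every write is denied even though the LDAP group exists.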

Dev-rig artifacts created for the protocol-linkage live-verify (left in place, NOT committed)

  • docker-dev/docker-compose.yml gained Security__Ldap__GroupToRole__{ReadOnly,WriteOperate,WriteTune,WriteConfigure,AlarmAck} identity entries on both central nodes (needed for data-plane roles — see above). Uncommitted (rig config; the file was already modified at session start).
  • DB seeds on otopcua-dev-sql-1/OtOpcUa: driver MAIN-modbus-eq (DriverType=Modbus, {"Host":"10.100.0.35","Port":5020,"UnitId":1,"Tags":[]}, namespace nw-uns, cluster MAIN) + tags tag-modbus-hr100 (HR[100] auto-increment, Read — read demo) and tag-modbus-hr200 (HR[200] scratch, ReadWrite — write demo), both on equipment EQ-55297329838d (filler-02). The pymodbus standard sim (10.100.0.35:5020) serves HR[0..31]=addr-as-value, HR[100]=auto-increment, HR[200..209]=writable scratch.

Incidental findings from the OpcUaClient live-verify (2026-06-13)

  • Driver-reconfigure-while-faulted gap — FIXED + MERGED 56f73e49 (2026-06-14). (Was: a DriverInstanceActor stuck in Reconnecting/Connecting had no ApplyDelta handler, so a corrected config dead-lettered and the actor retried the OLD _currentConfigJson forever; workaround = restart the node.) Now Connecting/Reconnecting handle ApplyDelta via AdoptConfigDuringInit, re-initialising with the new config; a monotonic _initGeneration guard supersedes the in-flight old init so the corrected config always wins. See the top banner + docs/plans/2026-06-14-driver-reconfigure-while-faulted-*.md. Live /run gate deferred (user choice).
  • Dev-rig config edit applied directly in DB. The MAIN-opcua-eq DriverConfig.targetNamespaceKind was 0 (Equipment, which requires a UnsMappingTable → InitializeAsync rejected it). Set to 1 (SystemPlatform — the direct-ref mode the equipment-tag model wants; empty unsMappingTable:{} passes validation) via a direct JSON_MODIFY UPDATE on otopcua-dev-sql-1 (DB OtOpcUa, SET QUOTED_IDENTIFIER ON required for JSON fns; sqlcmd -h and -y 0 are mutually exclusive — pick one). The AdminUI driver-edit combobox for "Target namespace kind" did not persist the change (suspected live-only Blazor binding bug — unverified; the DB edit sidestepped it). Deploy snapshots the live config DB directly (AdminOperationsActor → DraftSnapshotFactory.FromConfigDbAsync + ConfigComposer.SnapshotAndFlattenAsync), so a DB edit flows through on the next POST /api/deployments (new revisionHash).

Operational deferral (user choice)

  1. docker-dev rig cleanup (round-1 T9) deferred. The local docker-dev rig still has the live-verify seed artifacts deployed: the t12-overheat scripted alarm, the SC-ba675b168a85 predicate script, the layer0-logcheck vtag/script, and filler-02's modified cycle-time-s line. Left as-is to inspect the working double-emit fix. To clean up: delete those artifacts in the AdminUI (or DB), revert filler-02's cycle-time-s to return ctx.GetTag("TestMachine_002.TestDuration").Value;, then redeploy (POST http://localhost:9200/api/deployments, header X-Api-Key: docker-dev-deploy-key).

  2. Equipment-tag live-value verify artifacts (left in place — all now FUNCTIONAL). The docker-dev rig carries verify artifacts under the nw-uns Equipment namespace on EQ-55297329838d (filler-02), all three now working: MAIN-galaxy-eq (GalaxyMxGateway → live gateway, GalaxyTestTag = TestMachine_002.TestFloat RW, write-proven; needs the ephemeral GALAXY_MXGW_API_KEY re-exported on recreate — see Galaxy dev-rig note above); MAIN-opcua-eq (OpcUaClient, factory shipped 22d553af) + FastUInt1 tag ({"FullName":"ns=3;s=FastUInt1"}); MAIN-modbus-eq (Modbus) + tag-modbus-hr100 (read) / tag-modbus-hr200 (RW write demo). docker-dev is LOCAL on this Mac (OrbStack); central-1 @ localhost:4840/AdminUI+deploy @ localhost:9200, sql @ localhost:14330 (sa/OtOpcUa!Dev123), login disabled. Sims for the protocol/opcua verifies run on the docker host 10.100.0.35 (otopcua-pymodbus-standard :5020, otopcua-opc-plc :50000) — leave up or docker compose down per /opt/otopcua-modbus + /opt/otopcua-opcuaclient. Phase A backup OtOpcUa-prePhaseA-20260612-224908.bak is in the SQL volume /var/opt/mssql/backup/.