Snapshot of Milestone 1b completion (equipment-tag live values across OpcUaClient / protocol drivers / Galaxy, read + write) and the open follow-ups, for cross-session continuity.
Pending — open follow-ups & deferrals
As of 2026-06-13. master HEAD f05b5d79 (synced with origin). Working tree: only docker-dev/docker-compose.yml (uncommitted rig config) + untracked current.md/pending.md.
STATE SUMMARY (for compaction pickup)
Milestone 1b (equipment-tag live values) is COMPLETE — an equipment tag bound to OpcUaClient, any of the 6 protocol drivers, OR Galaxy now READS a live value AND (authorized) WRITES it back, all delivered by the FullName→NodeId router (c4435e4f). Shipped this session, all pushed to master:
- OpcUaClient factory 22d553af; protocol-driver linkage + inbound write pipeline 8d8c05f5; Galaxy gap-(c) config-only (no commit); Galaxy write-through f05b5d79 (AdviseSupervisory before raw Write).
OPEN FOLLOW-UPS (none blocking; pick up here):
- Phase B: native IAlarmSource alarms on the equipment-tag path (port GenericDriverNodeManager's forwarder onto MaterialiseEquipmentTags). Deferred; design section in docs/plans/2026-06-12-galaxy-standard-driver-design.md.
- Phase C: server-side HistoryRead backend over the Wonderware reader. Deferred; same design doc.
- Data-plane role config (deployment-facing): document that Security:Ldap:GroupToRole MUST map data-plane LDAP groups → role strings (WriteOperate/AlarmAck/…), else write-through + OPC UA alarm-ack are silently inert. Detail below.
- Write-pipeline review nits: fast-fail RouteNodeWrite/WriteAttribute in DriverHostActor.Stale + DriverInstanceActor.Connecting/Reconnecting; drop ExecuteSynchronously; List.Contains → HashSet; FOCAS per-write reparse; raw-blob routing test; Task-9 parity future-enum trap. Detail below.
- Surface real device-write status to the client: the inbound write is fire-and-forget optimistic Good (the "optimistic-write phantom"); NodeWriteResult.Success/Reason exists but is only logged. Detail below.
- Galaxy driver nits: benign SubscriptionEstablished self-dead-letter (cosmetic); writer _itemHandles/_supervisedHandles caches not cleared on reconnect (stale-handle risk right after a reconnect). Detail in the Galaxy findings below.
- Driver-reconfigure-while-faulted: a Reconnecting/Connecting DriverInstanceActor ignores ApplyDelta (retries old config forever; workaround = restart node). High-risk actor-state-machine change → own design/plan. Detail below.
- Rig cleanups (operational, user-deferred): see bottom.
The six historian code follow-ups (HistorizeToAveva opt-out, drain/capacity/retention config
knobs, SharedSecret/DatabasePath/non-positive-knob startup validation, operator-recording for
shelve/enable/disable, and the SqliteStoreAndForwardSink thread-safety nits) were all resolved
on branch feat/alarm-historian-followups (plan: docs/plans/2026-06-11-alarm-historian-followups.md).
They are no longer listed here.
Equipment-tag live values — MILESTONE 1b COMPLETE (2026-06-13)
The Galaxy standard-driver effort shipped Phase A (c3c56172) + the FullName→NodeId live-value ROUTER (c4435e4f, both pushed). The router is done + verified (322 tests + integration review READY-TO-MERGE). All three driver-publish gaps are now CLOSED — an equipment tag bound to OpcUaClient, any protocol driver, OR Galaxy publishes a live value delivered by the router (full detail in current.md "Milestone 1b" + docs/plans/2026-06-13-equipment-tag-live-values-design.md):
- OpcUaClient has NO factory (real bug: always stubbed). DONE: SHIPPED+PUSHED master 22d553af 2026-06-13. Added OpcUaClientDriverFactoryExtensions (mirror Modbus) + registered it in DriverFactoryBootstrap. First live equipment-tag value PROVEN end-to-end: OpcUaClient driver MAIN-opcua-eq spawns stub=False, connects to opc-plc, subscribes to ns=3;s=FastUInt1; the FullName→NodeId router (c4435e4f) delivers it to the materialised variable ns=2;s=EQ-55297329838d/FastUInt1, which reads a live changing value (10135→10141, Good) via Client.CLI. Design/plan docs/plans/2026-06-13-opcuaclient-factory-*.md. Two incidental findings while live-verifying (see below).
- Protocol drivers (Modbus/S7/AbCip/…): equipment-tag↔driver tag-table linkage unbuilt. DONE: SHIPPED+PUSHED master 8d8c05f5 2026-06-13 (+ full inbound operator WRITE pipeline). Approach B (driver-side direct-ref): a shared EquipmentTagRefResolver<TDef> (Core.Abstractions) resolves an equipment-tag ref (the raw TagConfig JSON blob the router already keys on) into a transient driver tag-def on a _tagsByName miss, wired into READ + WRITE for all six drivers (Modbus/S7/AbCip/AbLegacy/TwinCAT/FOCAS), each with a hardened never-throw <Driver>EquipmentTagParser. Part B (write-through): writable nodes (Tag.AccessLevel==ReadWrite → CurrentReadWrite, byte-parity in Phase7Composer+DeploymentArtifact), an OnWriteValue gate on the WriteOperate data-plane role (mirrors the alarm-ack bridge; fire-and-forget dispatch since the SDK holds the node-manager Lock during OnWriteValue), a NodeWriteRouter on the node manager, and DriverHostActor.RouteNodeWrite (NodeId→driver reverse map, primary-gated). LIVE-PROVEN end-to-end: Modbus equipment tag (HR[100]) reads a live changing value; an authorized write (opc-writeop/WriteOperate) to HR[200] changes the register + persists; an anonymous write → BadUserAccessDenied. Design/plan docs/plans/2026-06-13-protocol-equipment-tag-linkage-*.md. Findings + rig artifacts below.
- Galaxy: needs a reachable mxaccessgw. DONE: LIVE-PROVEN 2026-06-13 (no code change; config-only). The code-investigation confirmed Galaxy was already fully wired: GalaxyDriverFactoryExtensions IS registered in DriverFactoryBootstrap.cs:103 (not the missing-factory bug OpcUaClient had), and the Galaxy driver keys subscriptions on the FullReference (tag_name.AttributeName) DIRECTLY (no _tagsByName miss). Gap (c) was purely a misconfigured dev driver-instance + placeholder tag ref + unset key: ALL data in existing columns, NO EF/schema change. Fixes applied to the dev rig (otopcua-dev-sql-1/OtOpcUa): MAIN-galaxy-eq DriverConfig gateway.endpoint https://10.100.0.35:5001 → http://10.100.0.48:5120, useTls true → false, apiKeySecretRef env:MX_API_KEY (unset) → env:GALAXY_MXGW_API_KEY (the var the compose already wires on every node); GalaxyTestTag TagConfig.FullName TestMachine_002.SomeAttr (placeholder) → TestMachine_002.TestDuration (a real galaxy Float attr). The gateway API key was injected via ephemeral shell env at docker compose up -d --no-deps --force-recreate central-1 central-2 time (NEVER written to a tracked file; the compose's ${GALAXY_MXGW_API_KEY:-stale-default} substitution picks it up, so the running containers carry the real key only until the next recreate-without-the-env-var). Live (central-1 logs): "spawned GalaxyMxGateway driver MAIN-galaxy-eq (stub=False)" → "GalaxyMxSession connected — clientName=OtOpcUa" (auth OK) → "initialized — endpoint=http://10.100.0.48:5120" → "subscribed to 1 refs (galaxy-sub-1)" (TestMachine_002.TestDuration accepted, no BadNodeIdUnknown). Value: Client.CLI read ns=2;s=EQ-55297329838d/GalaxyTestTag → Value 0, Status 0x00000000 (Good), Source Time 2026-05-07T07:14:26Z (a real galaxy timestamp: a genuine attribute snapshot, NOT BadWaitingForInitialData; static because that attr isn't actively moving). Restore-the-rig SQL saved at /tmp/galaxy-gapc-snapshot.sql. Milestone 1b is now COMPLETE: all three gaps closed. Findings/follow-ups below.
Then: Phase B = native IAlarmSource alarms on the equipment-tag path; Phase C = server-side HistoryRead backend over the Wonderware reader (both deferred, design sections in docs/plans/2026-06-12-galaxy-standard-driver-design.md).
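The driver-side direct-ref approach (Approach B above) boils down to "look the ref up in the configured tag table, and on a miss try to parse the ref itself into a transient tag definition, without ever throwing". A minimal Python sketch of that lookup order follows. It is an illustration only, not the shipped C# code: the class shape, the `ModbusTagDef` fields, and the `Region`/`Address` JSON keys are assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Generic, Optional, TypeVar

TDef = TypeVar("TDef")


class EquipmentTagRefResolver(Generic[TDef]):
    """Hypothetical analogue of the shared resolver: table hit first, parse on miss."""

    def __init__(self, tags_by_name: Dict[str, TDef],
                 parser: Callable[[str], Optional[TDef]]) -> None:
        self._tags_by_name = tags_by_name
        self._parser = parser

    def resolve(self, ref: str) -> Optional[TDef]:
        hit = self._tags_by_name.get(ref)
        if hit is not None:
            return hit            # configured driver tag: use it as-is
        return self._parser(ref)  # miss: treat the ref as a raw TagConfig blob


@dataclass(frozen=True)
class ModbusTagDef:
    region: str   # assumed field, e.g. "HR" for holding registers
    address: int  # assumed field


def parse_modbus_ref(raw: str) -> Optional[ModbusTagDef]:
    """Never-throw parser: any malformed blob yields None instead of raising."""
    try:
        blob = json.loads(raw)
        return ModbusTagDef(region=str(blob["Region"]), address=int(blob["Address"]))
    except (ValueError, KeyError, TypeError):
        return None
```

The transient definition is not added to the table in this sketch, so a bad ref costs one failed parse per call rather than poisoning the configured tags.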
Findings + follow-ups from the Galaxy gap-(c) live-verify (2026-06-13)
- Benign dead-letter (minor, pre-existing in the Galaxy driver; NOT introduced here). On subscribe the driver logs: "Message [SubscriptionEstablished] from drv-MAIN-galaxy-eq to drv-MAIN-galaxy-eq was unhandled. [N] dead letters." The GalaxyDriverInstanceActor/driver sends itself a SubscriptionEstablished message that has no Receive<> handler. Harmless (the subscription IS established + delivering values), but noisy: add a handler (or stop self-Telling it). Cosmetic.
- CHANGING-value read PROVEN (2026-06-13). Repointed GalaxyTestTag.FullName → TestMachine_002.TestChangingInt (a script-driven Integer, sec=Operate): three Client.CLI reads returned 810 → 787 → 764 with real galaxy source timestamps advancing ~7s each (02:28:41/:48/:55), a genuine live moving Galaxy value through the router (not optimistic/phantom). At that point the dev rig was left with GalaxyTestTag pointing here (DataType Int32, AccessLevel Read); it was later repointed to TestFloat for the write demo (see the dev-rig note at the end of this list). Discovery was done with a throwaway probe (now deleted) using GalaxyDriverBrowser.OpenAsync → session.AttributesAsync("TestMachine_002"), which lists every attribute's SecurityClass (ViewOnly = read-only; FreeAccess/Operate/Tune/Configure = writable). Useful attrs on TestMachine_002: TestFloat (Float, Operate), TestDouble (Double, Operate), TestChangingInt (Integer, Operate, moves), TestDuration (ElapsedTime, Operate), AlarmInhibit (Boolean, FreeAccess).
- GALAXY WRITE-THROUGH: FIXED + MERGED to master f05b5d79 (AdviseSupervisory before raw Write). Symptom was every Galaxy operator write returning MxaccessFailure "ArgumentException: HRESULT 0x80070057" (E_INVALIDARG). TWO-LAYER root cause (debugged by SSH-reading the gateway source on 10.100.0.48 C:\Users\dohertj2\Desktop\mxaccessgw; there is NO live gateway file log: console-only/uncaptured, NSSM stdout.log stale, dashboard :5130 is Blazor/no-REST): (1) the writer AddItem'd an UN-advised handle → MXAccess Write threw E_INVALIDARG (worker chain ExecuteWrite → MxAccessSession.Write → MxAccessComServer.Write → AsProxyServer().Write(...)). (2) DEEPER: a plain Write runs with no user login (WriteUserId=0), and MXAccess only COMMITS such a write when the item is advised in SUPERVISORY mode; a regular Advise removed the E_INVALIDARG but never committed (proven by a persistence check: read-back showed the value, but a --force-recreate + fresh-resubscribe reverted to the original 0 @ 2026-05-07; the worker's ExecuteWrite is fire-and-forget, returns OK without awaiting OnWriteComplete). Confirmed against the sister ScadaBridge driver (~/Desktop/ScadaBridge/.../RealMxGatewayClient.cs): it commits the OTHER way, with a configured non-zero WriteUserId + regular Advise + WriteBulk. We have no galaxy login → supervisory context. FIX: GatewayGalaxyDataWriter calls AdviseSupervisory (raw MxCommand{Kind=AdviseSupervisory, AdviseSupervisory=new AdviseSupervisoryCommand{ServerHandle,ItemHandle}} via session.InvokeAsync, mirroring InvokeWriteSecuredAsync; idempotent per handle via _supervisedHandles) before each raw WriteRawAsync; SecuredWrite/VerifiedWrite tags keep their own user-identity path (NeedsSecuredWrite unchanged: WriteSecured is ONLY for those special-security tags). The dead-end "reuse the subscription's advised handle" resolver attempt was reverted. LIVE-PROVEN: authorized write (opc-writeop/WriteOperate) of TestMachine_002.TestFloat = 1234.5 then 8888.25 COMMITS + PERSISTS across recreate/re-subscribe (galaxy-sourced timestamp); anonymous → BadUserAccessDenied. 254 Galaxy tests green; central --build clean. OPEN follow-ups from this: (a) the worker's fire-and-forget ExecuteWrite can't surface an async write failure: with supervisory advise the write commits, but only a read-back confirms a specific write (gateway-side; out of our scope). (b) _itemHandles/_supervisedHandles caches aren't cleared on reconnect (pre-existing for _itemHandles), so a write right after a reconnect could use a stale handle; minor.
- OPTIMISTIC-WRITE PHANTOM (open follow-up: surface real write status to the client). The inbound write dispatch is fire-and-forget: it returns optimistic Good before the driver result (required, because OnWriteValue runs under the node-manager Lock), and the SDK applies the written value to the node locally. So a write whose DEVICE write FAILS still returns Good, and for a STATIC attribute that never re-pushes, the wrong value LINGERS (a phantom the device never accepted). The pipeline already computes the real status in NodeWriteResult.Success/Reason but only LOGS it; consider surfacing it to the client. (How it was caught live: a failed Galaxy write showed the written value on read-back with a SERVER-clock source timestamp + a rejected driver log; a committed write shows a GALAXY-clock timestamp + no rejection, and persists across a re-subscribe.)
- Dev-rig Galaxy config is CORRECT + WORKING (left in place). The MAIN-galaxy-eq driver-instance is deployed and connecting to the live gateway http://10.100.0.48:5120. GalaxyTestTag (on EQ-55297329838d/filler-02, nw-uns namespace) is currently {"FullName":"TestMachine_002.TestFloat"}, DataType Float, AccessLevel ReadWrite (the write demo; galaxy now holds the last written value 8888.25). Other useful TestMachine_002 attrs (from the discovery probe): TestChangingInt (Integer, Operate, moves on its own; the live-changing READ demo), TestDouble (Double, Operate), TestDuration (ElapsedTime, Operate; reads as Float but a Float write is a type-mismatch), AlarmInhibit (Boolean, FreeAccess). To restore the original placeholder tag: /tmp/galaxy-gapc-snapshot.sql. The base seed docker-dev/seed/seed-clusters.sql still seeds the legacy SystemPlatform-namespace Galaxy driver (MAIN-galaxy-mxgw, tags TestMachine_001.TestAlarm001..003), which is the pre-Phase-A model, untouched/separate. The injected gateway key is EPHEMERAL: key=mxgw_otopcuakey2_so0… is supplied via shell env GALAXY_MXGW_API_KEY='…' at docker compose up --no-deps --force-recreate central-1 central-2; a recreate WITHOUT it re-exported falls back to the compose's stale default and Galaxy auth fails. ORDER on a redeploy: POST deploy FIRST, THEN recreate (a faulted driver ignores ApplyDelta).
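The write-through fix and its open cache follow-up can be pictured with a small model: advise each item handle in supervisory mode exactly once before writing, and drop both handle caches when the session is replaced. The Python sketch below is a simplified stand-in, not the GatewayGalaxyDataWriter code. `FakeSession`, its method names, and `on_reconnect` are hypothetical; the reconnect clearing in particular is the proposed follow-up, not current behaviour.

```python
from typing import Dict, List, Set, Tuple


class FakeSession:
    """Stand-in for a gateway session; it only records the commands it receives."""

    def __init__(self) -> None:
        self.commands: List[Tuple[str, int]] = []
        self._next_handle = 1

    def add_item(self, full_ref: str) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self.commands.append(("AddItem", handle))
        return handle

    def advise_supervisory(self, handle: int) -> None:
        self.commands.append(("AdviseSupervisory", handle))

    def write(self, handle: int, value: object) -> None:
        self.commands.append(("Write", handle))


class SupervisoryWriter:
    def __init__(self, session: FakeSession) -> None:
        self._session = session
        self._item_handles: Dict[str, int] = {}  # FullReference -> item handle
        self._supervised: Set[int] = set()       # handles already advised

    def write(self, full_ref: str, value: object) -> None:
        handle = self._item_handles.get(full_ref)
        if handle is None:
            handle = self._session.add_item(full_ref)
            self._item_handles[full_ref] = handle
        if handle not in self._supervised:       # idempotent per handle
            self._session.advise_supervisory(handle)
            self._supervised.add(handle)
        self._session.write(handle, value)

    def on_reconnect(self, session: FakeSession) -> None:
        # Proposed follow-up: handles from the old session are meaningless now.
        self._session = session
        self._item_handles.clear()
        self._supervised.clear()
```

Two writes to the same reference produce one AddItem, one AdviseSupervisory and two Writes; after `on_reconnect` the first write re-adds and re-advises against the new session instead of reusing a stale handle.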
Findings + follow-ups from the protocol-linkage + write-through work (2026-06-13)
- DATA-PLANE ROLE CONFIG REQUIREMENT (important, deployment-facing). The OPC UA session's roles come from two sources unioned: the DB LdapGroupRoleMapping (its Role column is the AdminRole enum: Administrator/Designer/Viewer only, for the AdminUI) AND the appsettings Security:Ldap:GroupToRole baseline (free-form string→string). The OPC UA data-plane gates (WriteOperate, AlarmAck, …) read literal role STRINGS that the AdminRole-typed DB mapping cannot produce, so a deployment MUST map its LDAP data-plane groups → data-plane role strings via GroupToRole, or write-through (and scripted-alarm OPC UA ack) is inert (every write → BadUserAccessDenied). The shared dev GLAuth already has dedicated groups+users (group WriteOperate, user opc-writeop, multi-role in all; opc-readonly); the dev rig just never seeded the GroupToRole. Consider a docs note (and/or a documented default) so production deployments wire this. (Same latent requirement applies to the pre-existing alarm-ack gate.)
- Write-pipeline review follow-ups (non-blocking, from the final integration review): (a) DriverHostActor.Stale (and DriverInstanceActor.Connecting/Reconnecting) have no RouteNodeWrite/WriteAttribute handler → an operator write while stale/reconnecting dead-letters and the 10s Ask times out with a generic log (client got optimistic Good). Add fast-fail handlers returning a clear status. (b) Drop TaskContinuationOptions.ExecuteSynchronously on the router ContinueWith; List.Contains → HashSet in the forward-map build (micro). (c) FOCAS re-parses the address on every equipment-tag write (_parsedAddressesByTagName miss; perf only, rare). (d) DriverHostActorWriteRoutingTests seeds a Galaxy-style {"FullName":...} artifact, not a raw protocol-driver TagConfig blob; add a raw-blob case for belt-and-suspenders (runtime path is identical + live-verified). (e) Task-9 parity test is a faithful simulation of ConfigComposer (ToSnapshot casts AccessLevel to int), not a through-the-real-serializer proof; add an InlineData(2,false) future-enum trap.
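To make the optimistic-Good behaviour and the "surface the real status" idea concrete, here is a small Python model of the inbound write path. It is a sketch under assumptions, not the node manager's code: the `Node` type, the `"Bad"` status string, and the choice to revert the value on failure are illustrative only. The notes so far only say the real result is logged and that surfacing it should be considered.

```python
from dataclasses import dataclass
from typing import Callable

GOOD = "Good"
BAD = "Bad"  # placeholder; the actual status code to surface is undecided


@dataclass
class NodeWriteResult:
    success: bool
    reason: str = ""


@dataclass
class Node:
    value: object
    status: str = GOOD


ResultCallback = Callable[[NodeWriteResult], None]


def on_write_value(node: Node, new_value: object,
                   dispatch: Callable[[object, ResultCallback], None]) -> str:
    """Return Good immediately; reconcile the node when the driver result arrives."""
    previous = node.value
    node.value = new_value  # the SDK applies the written value locally

    def on_result(result: NodeWriteResult) -> None:
        if not result.success:
            # One possible way to kill the phantom: restore the old value
            # and expose the failure instead of only logging it.
            node.value = previous
            node.status = BAD

    dispatch(new_value, on_result)  # fire-and-forget: no waiting under the lock
    return GOOD
```

A real implementation would also have to handle a second write arriving before the first result, which this sketch ignores; reverting blindly to `previous` would be wrong in that case.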
Dev-rig artifacts created for the protocol-linkage live-verify (left in place, NOT committed)
- docker-dev/docker-compose.yml gained Security__Ldap__GroupToRole__{ReadOnly,WriteOperate,WriteTune,WriteConfigure,AlarmAck} identity entries on both central nodes (needed for data-plane roles; see above). Uncommitted (rig config; the file was already modified at session start).
- DB seeds on otopcua-dev-sql-1/OtOpcUa: driver MAIN-modbus-eq (DriverType=Modbus, {"Host":"10.100.0.35","Port":5020,"UnitId":1,"Tags":[]}, namespace nw-uns, cluster MAIN) + tags tag-modbus-hr100 (HR[100] auto-increment, Read; read demo) and tag-modbus-hr200 (HR[200] scratch, ReadWrite; write demo), both on equipment EQ-55297329838d (filler-02). The pymodbus standard sim (10.100.0.35:5020) serves HR[0..31]=addr-as-value, HR[100]=auto-increment, HR[200..209]=writable scratch.
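The GroupToRole compose entries matter because the session's roles are a union of two sources, and only the free-form GroupToRole map can yield data-plane strings such as WriteOperate. The Python sketch below models that union and the write gate. It assumes the usual .NET convention that a double underscore in an environment variable name stands for the `:` separator of a configuration key; the function names and the sample group names are hypothetical.

```python
from typing import Dict, Iterable, Mapping, Set

PREFIX = "Security__Ldap__GroupToRole__"


def group_to_role_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Collect the group -> role baseline from compose-style environment entries."""
    return {key[len(PREFIX):]: value
            for key, value in env.items() if key.startswith(PREFIX)}


def session_roles(groups: Iterable[str],
                  db_mapping: Mapping[str, str],
                  group_to_role: Mapping[str, str]) -> Set[str]:
    """Union of the AdminRole-typed DB mapping and the free-form baseline."""
    roles: Set[str] = set()
    for group in groups:
        if group in db_mapping:
            roles.add(db_mapping[group])     # Administrator / Designer / Viewer only
        if group in group_to_role:
            roles.add(group_to_role[group])  # data-plane strings come from here
    return roles


def may_write(roles: Set[str]) -> bool:
    return "WriteOperate" in roles           # literal role-string gate


# Identity entries, as on the dev rig: the group name doubles as the role name.
env = {PREFIX + "WriteOperate": "WriteOperate", PREFIX + "AlarmAck": "AlarmAck"}
db = {"ot-admins": "Administrator"}          # hypothetical AdminUI mapping
```

With `group_to_role_from_env(env)` a member of the WriteOperate group passes `may_write`; with an empty baseline the same user is denied even if the DB mapping makes them an Administrator, which is the silent-inert failure mode described above.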
Incidental findings from the OpcUaClient live-verify (2026-06-13)
- Driver-reconfigure-while-faulted gap (real, pre-existing, NOT fixed). When a DriverInstanceActor is stuck in Reconnecting (init keeps failing) and the operator deploys a corrected config, DriverHostActor sends ApplyDelta, but the Reconnecting behavior (DriverInstanceActor.cs ~L266) has no ApplyDelta handler, so it's dead-lettered and the actor keeps retrying the OLD _currentConfigJson forever. Workaround: restart the node (respawns the driver actor fresh from the current deployment artifact). Proper fix = handle ApplyDelta in Reconnecting (and Connecting) to adopt the new config mid-retry. Touches the actor state machine → its own design/plan (high-risk). Surfaced because the dev MAIN-opcua-eq driver was already faulted from a prior bad config.
- Dev-rig config edit applied directly in DB. The MAIN-opcua-eq DriverConfig.targetNamespaceKind was 0 (Equipment, which requires a UnsMappingTable → InitializeAsync rejected it). Set to 1 (SystemPlatform: the direct-ref mode the equipment-tag model wants; empty unsMappingTable:{} passes validation) via a direct JSON_MODIFY UPDATE on otopcua-dev-sql-1 (DB OtOpcUa, SET QUOTED_IDENTIFIER ON required for JSON fns; sqlcmd -h and -y 0 are mutually exclusive, so pick one). The AdminUI driver-edit combobox for "Target namespace kind" did not persist the change (suspected live-only Blazor binding bug, unverified; the DB edit sidestepped it). Deploy snapshots the live config DB directly (AdminOperationsActor → DraftSnapshotFactory.FromConfigDbAsync + ConfigComposer.SnapshotAndFlattenAsync), so a DB edit flows through on the next POST /api/deployments (new revisionHash).
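The reconfigure-while-faulted gap is easiest to see as a tiny state machine: while the driver is in a retry state, a config delta is either dropped (today) or adopted for the next attempt (the proposed fix). The Python model below is a deliberately reduced sketch, not the actor code. The `adopt_while_faulted` switch and all names are hypothetical and exist only to contrast the two behaviours.

```python
from typing import Callable

RETRY_STATES = ("Connecting", "Reconnecting")


class DriverInstance:
    """Toy model of a driver that retries init with whatever config it holds."""

    def __init__(self, config: str, try_init: Callable[[str], bool],
                 adopt_while_faulted: bool) -> None:
        self.state = "Connecting"
        self.config = config
        self._try_init = try_init
        self._adopt_while_faulted = adopt_while_faulted

    def retry(self) -> None:
        # One retry tick: success moves to Connected, failure keeps retrying.
        self.state = "Connected" if self._try_init(self.config) else "Reconnecting"

    def apply_delta(self, new_config: str) -> bool:
        if self.state in RETRY_STATES and not self._adopt_while_faulted:
            return False          # current behaviour: message is dead-lettered
        self.config = new_config  # proposed: adopt the new config mid-retry
        return True


def init_ok(config: str) -> bool:
    return config == "good"       # stand-in for a real InitializeAsync outcome
```

With `adopt_while_faulted=False` a driver started on a bad config never leaves Reconnecting no matter how many corrected deltas arrive, which matches the restart-the-node workaround; with `True` the next retry after the delta succeeds.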
Operational deferral (user choice)
- docker-dev rig cleanup (round-1 T9) deferred. The local docker-dev rig still has the live-verify seed artifacts deployed: the t12-overheat scripted alarm, the SC-ba675b168a85 predicate script, the layer0-logcheck vtag/script, and filler-02's modified cycle-time-s line. Left as-is to inspect the working double-emit fix. To clean up: delete those artifacts in the AdminUI (or DB), revert filler-02's cycle-time-s to return ctx.GetTag("TestMachine_002.TestDuration").Value;, then redeploy (POST http://localhost:9200/api/deployments, header X-Api-Key: docker-dev-deploy-key).
- Equipment-tag live-value verify artifacts (left in place; all now FUNCTIONAL). The docker-dev rig carries verify artifacts under the nw-uns Equipment namespace on EQ-55297329838d (filler-02), all three now working: MAIN-galaxy-eq (GalaxyMxGateway → live gateway, GalaxyTestTag = TestMachine_002.TestFloat RW, write-proven; needs the ephemeral GALAXY_MXGW_API_KEY re-exported on recreate, see Galaxy dev-rig note above); MAIN-opcua-eq (OpcUaClient, factory shipped 22d553af) FastUInt1 tag ({"FullName":"ns=3;s=FastUInt1"}); MAIN-modbus-eq (Modbus) + tag-modbus-hr100 (read) / tag-modbus-hr200 (RW write demo). docker-dev is LOCAL on this Mac (OrbStack); central-1 @localhost:4840 / AdminUI+deploy @localhost:9200, sql @localhost:14330 (sa/OtOpcUa!Dev123), login disabled. Sims for the protocol/opcua verifies run on the docker host 10.100.0.35 (otopcua-pymodbus-standard:5020, otopcua-opc-plc:50000); leave up or docker compose down per /opt/otopcua-modbus + /opt/otopcua-opcuaclient. Phase A backup OtOpcUa-prePhaseA-20260612-224908.bak is in the SQL volume /var/opt/mssql/backup/.