38 KiB
OtOpcUa equipment-tag data-plane — why live FOCAS values don't surface (investigation plan)
Date: 2026-06-25
Companion to: 2026-06-25-focas-pdu-v3-implementation-plan.md (the FOCAS PDU-v3 driver work — DONE + deployed) and ../deployments/wonder-app-vd03-makino-z-34184.md.
Goal: make the deployed Makino tags (parts-count/parts-required = MACRO:3901/3902) actually carry live values over OPC UA, and make the FOCAS FixedTree (Identity/Axes/Timers/…) appear in the served address space.
The symptom (precise, observed 2026-06-25 after the v3 deploy)
Against opc.tcp://wonder-app-vd03.zmr.zimmer.com:4840/OtOpcUa (Security None, anonymous, DisableLogin):
readAND a 30 ssubscribeofns=2;s=EQ-3686c0272279/parts-count(and/parts-required) return0x80320000Bad_WaitingForInitialData — never a value.- A recursive browse of the whole served tree shows ONLY
machining → makino → z-34184 → {parts-count, parts-required}— no FixedTree nodes (Identity/Axes/Timers/Program/…). - Unchanged by host-restart, a full AdminUI re-deploy (
12e0d528, Sealed/In-sync), and a driverRestart. - A ~15 s box-side watch saw no 250 ms-cadence TCP connection to
10.201.31.5:8193— only the occasional probe connect.
What is RULED OUT (do not re-investigate)
- The FOCAS PDU-v3 driver. Proven working: the rewritten
FocasDriverProbedoes a real initiate +cnc_statinfoand the AdminUI shows DRIVER STATUS: HEALTHY, last success Ns ago; and on the dev box every surface (sysinfo / axes / dynamic / macros / timer / PMC / servo / alarms) reads correctly through the sameWireFocasClient. The wire path talks v3 to the Makino. - Deployment / Akka roles. Re-deploy sealed with no "task canceled"; the node has admin+driver roles; the address space (the 2 UNS tags) is served.
- Reachability. TCP 8193 reachable from the box; OPC 4840 reachable from the dev box.
So the problem is the OtOpcUa data plane: the driver's control/probe plane runs, but its data plane (the FixedTree poll loop output + the equipment-tag value source) is not reaching the served address space. This was present last session too — it was masked by the v3 reject; with v3 now working it stands alone.
Leading hypotheses (to confirm/refute, not assume)
- H1 — FixedTree bootstrap is stuck/throwing.
FocasDriver.FixedTreeLoopAsyncis started atFocasDriver.cs:160insideInitializeAsync(which demonstrably ran — the probe loop at:137is alive). Its bootstrapwhile (state.FixedTreeCache is null)loop (~:650) callsGetSysInfoAsync/GetAxisNamesAsync/optional probes; if one throws it retries every 2 s and the cache never populates → no FixedTree nodes, no poll. The brief 2 s-spaced connects could explain "no steady 250 ms connection." - H2 — FixedTree nodes aren't projected under the Equipment model. Galaxy/FOCAS are now standard Equipment-kind drivers; the served tree is the UNS/Equipment projection (configured tags only). Driver
DiscoverAsyncFixedTree auto-nodes may only live in a raw-driver namespace that the UNS browse doesn't show — i.e. FixedTree-under-Equipment may be unsupported-by-design, not a bug. - H3 — equipment-tag value source never primed. A point-read returns the server's cached value; with no active poll/subscription seeding it, that's
WaitingForInitialData. But a 30 ssubscribeALSO stayed bad — so either the monitored-item →FocasDriver.ReadAsyncwiring isn't firing, orReadAsync(MACRO:3901)returns no value on the box. (Analogy: the ScadaBridge DCL seed-ordering / static-tag false-bad class of bug — see that memory.) - H4 — a data-plane gate. A role/active-node gate (cf. "data-plane roles need
Security:Ldap:GroupToRolemapped or write/ack is inert") could suppress the read/poll on this single fused node under DisableLogin.
⚠️ Observability is the bottleneck
There are currently no OtOpcUa app logs on the box (Windows eventlog shows only NSSM start/stop; Serilog has no file sink in this deployment — cf. the MxGateway windev note). Getting runtime logs is the single highest-leverage step and gates H1/H3/H4.
Phase 1 — Offline code trace of the data plane (do FIRST; free, no box/CNC)
Trace, in the repo on the dev box, the two paths end-to-end. This may reveal the root cause with zero box access.
- Equipment-tag read path: from an OPC UA
Read/CreateMonitoredItemonns=2;s=EQ-…/parts-count→ the server's address-space node → the "FullName→NodeId router" →FocasDriver.ReadAsync. Find: who invokes the driver read, whether point-reads vs subscriptions differ, and where/whether a value is cached vsWaitingForInitialDatais returned. Entry points:FocasDriver.cs:247(resolver-produced equipment tags),FocasDriver.ReadAsync, and the server-side poll-group / monitored-item sampler (grepPollGroup,MonitoredItem, the Equipment namespace value provider). - FixedTree path:
FocasDriver.FixedTreeLoopAsync(:644) →state.FixedTreeCache→TryReadFixedTree→ how those nodes are added to the address space. Determine whether FixedTree nodes are emitted into the Equipment/UNS projection at all, or only a raw-driver namespace (resolves H2). GrepDiscoverAsync,FixedTreeReference, and the address-space builder for the Equipment kind. - Driver lifecycle/gates: confirm what conditions start/suppress the data poll vs the probe loop (active-node, role, enabled flags) —
FocasDriver.cs:87–181, the cluster/active-node gate.
- Output: a written root-cause hypothesis (which of H1–H4) + the specific code site, OR a precise list of what only box logs can disambiguate.
✅ Phase 1 FINDINGS (2026-06-25, offline subagent code-trace — DONE)
| Hypothesis | Verdict | Confidence |
|---|---|---|
| H2 — FixedTree nodes not projected under Equipment model | CONFIRMED root cause of symptom #2 — by-design | proven from code |
| H3 — equipment-tag value source never primed | CONFIRMED mechanism of symptom #1; routing code is self-consistent → break is runtime, not static | mechanism proven; exact break needs box logs |
| H4 — role/active-node data-plane gate | REFUTED for the read/subscribe path | proven from code |
| H1 — FixedTree bootstrap stuck/throwing | best fit for symptom #3 (no 250 ms connect), but moot for #2 and not the cause of #1 | strongly-suggested; needs logs |
H5 (new) — FOCAS tag with blank deviceHostAddress → BadNodeIdUnknown |
latent trap — yields a Bad value, not WaitingForInitialData, so it's the next failure, not the current symptom | proven from code |
Why symptom #2 (no FixedTree nodes): FixedTree nodes are emitted only by FocasDriver.DiscoverAsync (FocasDriver.cs:391-494, gated on FixedTreeCache), and DiscoverAsync is never invoked by the deployed equipment-kind address-space build. Production materializes the served tree purely from Config-DB Tag rows (EquipmentNodeWalker.cs:24-25,161-165; OpcUaPublishActor.RebuildAddressSpace → MaterialiseEquipmentTags(composition) at OpcUaPublishActor.cs:326-337). No runtime caller of IDriver.DiscoverAsync exists. ⇒ By-design gap, not a bug. Surfacing FixedTree under the Equipment tree is a feature (author explicit Tag rows per signal, or extend the composition pipeline to merge ITagDiscovery output — a substantial AddressSpaceComposer/AddressSpaceApplier change).
Why symptom #1 (0x80320000 even under a 30 s subscribe): equipment-tag node value is push-only. 0x80320000 is the materialization seed (OtOpcUaNodeManager.EnsureVariable:1400-1402); there is no OnReadValue/server-side sampler for equipment tags, so both Read and MonitoredItem serve the cached node value. The seed is overwritten only by ForwardToMux → WriteValue (OtOpcUaNodeManager.cs:261-281), which writes even Bad/null. StatusFromQuality maps Bad → 0x80000000, not 0x80320000. ⇒ A persistent 0x80320000 proves no AttributeValueUpdate ever landed for those two NodeIds. The routing chain (DriverHostActor.PushDesiredSubscriptions → _nodeIdByDriverRef → DriverInstanceActor.ResubscribeDesired → PollGroupEngine forceRaise → OnDataChangeForward → ForwardToMux) is provably self-consistent for FOCAS (FocasTagConfigModel writes no top-level FullName → ExtractTagFullName returns the raw JSON blob → FocasEquipmentTagParser.TryParse resolves it; subscribed ref == published ref == routing-map key == NodeId, all from the same t.FullName). So the static mapping is refuted; the break is runtime: either the FOCAS child was never subscribed those refs, or _nodeIdByDriverRef lacked the (DriverInstanceId, FullName) key at publish time — most likely a DriverInstanceId attribution mismatch between the equipment Tag rows and the spawned DriverInstance.
What only box logs (or a ConfigDb snapshot read) can disambiguate for symptom #1 — capture these on the driver-role node after a deploy:
DriverHostActorinfo "SubscribeBulk pushed {Refs} references across {Drivers} driver(s)" (DriverHostActor.cs:1058) —Refs<2⇒ equipment TagDriverInstanceId≠ spawned FOCAS child id (attribution mismatch).DriverInstanceActorinfo "subscribed to {Count} refs" (DriverInstanceActor.cs:571) — absent/0⇒_desiredRefsempty or never reachedConnected.DriverHostActordebug "no equipment-tag NodeId for ({Driver},{Ref}) — value dropped" (DriverHostActor.cs:550) — present ⇒_nodeIdByDriverRefkey miss.FocasDriverdebug "FOCAS fixed-tree bootstrap failed … retrying" every ~2 s (FocasDriver.cs:686) ⇒ confirms H1 (cnc_sysinfo/cnc_rdaxisnamefailing on the Makino). Needs Debug level + logger-constructed wire client.- A driver-side read of
MACRO:3901(live Makino):BadNodeIdUnknown⇒ H5 blank-deviceHostAddresstrap; Good value ⇒ wire path fine, gap is purely the OPC-UA equipment subscription wiring.
Cheaper-than-Phase-2 lead: the prime suspect (Tag-row DriverInstanceId vs DriverInstance row mismatch) is checkable by reading the deployed ConfigDb snapshot (OtOpcUaConfig on wonder-sql-vd03:1433) — no service restart, no log-sink change. Worth doing before/instead of the Phase 2 log-sink change.
Suggested minimal-fix shapes (NOT implemented):
- Symptom #1: depends on which log line/row fires — if attribution miss, it's a config/snapshot re-bind (no code). If a real wiring gap, localize first.
- H5 latent: in
FocasDriver.ReadAsync/WriteAsyncdevice lookup (:289,:346), whenDeviceHostAddressis blank and_devices.Count==1, resolve to the sole device instead ofBadNodeIdUnknown(or make the AdminUI field mandatory). - Symptom #2 (H2): no fix — by-design; surfacing FixedTree is a feature decision.
✅ ConfigDb snapshot read (2026-06-25, symptom #1 path — chosen over box logs)
Read live OtOpcUaConfig on wonder-sql-vd03 (query run on-box so the SQL password never left the box). Ground truth:
- Equipment
EQ-3686c0272279(z-34184):DriverInstanceId=focas-z-34184,DeviceId=NULL,Enabled=True,UnsLineId=makino. - Tags
parts-count/parts-required: bothDriverInstanceId=focas-z-34184,DeviceId=NULL,DataType=Double,AccessLevel=Read,FolderPath=NULL,PollGroupId=NULL,TagConfig={"deviceHostAddress":"10.201.31.5:8193","address":"MACRO:390{1,2}","dataType":"Float64"}. - DriverInstance: one row
focas-z-34184(DriverType=Focas,Enabled=True, nsdev-equipment, clusterDEV). - Device table: 0 rows — NORMAL:
FocasDriverbuilds_devicesfrom_options.Devices(DriverConfig JSON) not the Device table (FocasDriver.cs:97).
Refuted by this read: prime-suspect DriverInstanceId attribution mismatch (matches exactly) and H5 blank-deviceHostAddress (present). The deployed config is clean. ⇒ symptom #1 is a pure value-flow-plumbing break. New live leads: PollGroupId=NULL on both tags (is a poll group required to subscribe/poll?) and the resolver-registration path (equipment-tag refs are "resolver-produced, not seeded at InitializeAsync" per FocasDriver.cs:247 — does poll-time TryResolve of the JSON-blob ref ever succeed?). A second offline subagent trace of DriverHostActor↔DriverInstanceActor↔PollGroupEngine↔resolver is running to pin the exact broken link.
✅ FixedTree feature (symptom #2) — BUILT 2026-06-26 (architecture reality below)
Mapped the composition pipeline. Two address-space paths exist: (1) Equipment/UNS projection AddressSpaceComposer.Compose (config entities only) → AddressSpaceApplier.MaterialiseEquipmentTags → the served ns=2 tree where EQ-… lives; (2) raw-driver namespace GenericDriverNodeManager.BuildAddressSpaceAsync → driver.DiscoverAsync(IAddressSpaceBuilder). Path 2 is DEAD: BuildAddressSpaceAsync has no runtime caller and OpcUaApplicationHost.PopulateAddressSpaces (its referenced caller) no longer exists. Even GalaxyDriver.DiscoverAsync (:588) is reachable only via that dead path — Galaxy surfaces its hierarchy by being authored as config equipment/tags, not via discovery. ⇒ In the current Equipment-kind model every served node is config-driven; ITagDiscovery/DiscoverAsync is legacy/dead for serving. So "build the FixedTree feature" is NOT re-wiring an existing path — it's a new dynamic-node-injection capability into the Equipment projection, and it must solve a timing problem: composition runs at deploy/apply (before the driver connects), but FixedTree data only exists after the driver's async FixedTreeCache bootstrap. The far cheaper alternative that yields the same visible result is to author FixedTree signals as config Tag rows (each bound to a FOCAS fixed-tree reference) — same mechanism every other equipment tag uses. (The user chose to build the dynamic feature over the config-rows alternative.)
✅ BUILT (2026-06-26). Implemented as a generic post-connect ITagDiscovery injection pipeline: when a driver reaches Connected, DriverInstanceActor runs bounded re-discovery into a capturing IAddressSpaceBuilder and ships DiscoveredNodesReady to DriverHostActor, which maps the nodes under the equipment (EQ-…/FOCAS/…, read-only), extends the _nodeIdByDriverRef routing map, and tells OpcUaPublishActor to incrementally materialise them — reusing the existing materialize→subscribe→poll→push pipeline (no full rebuild). Survives redeploys (re-applied at the tail of PushDesiredSubscriptions) and restarts (re-discovered on reconnect). Design: 2026-06-26-otopcua-fixedtree-equipment-injection-design.md; implementation plan (11 bite-sized tasks, all green): 2026-06-26-otopcua-fixedtree-equipment-injection.md. Offline-complete on branch feat/focas-fixedtree-equipment-injection (solution build 0 errors / 0 warnings; Runtime.Tests 312, OpcUaServer.Tests 304, FOCAS 247 + an end-to-end injection+value-flow test, all green). The review chain caught + fixed three real defects (a DriverDataType→OPC-UA-type string mismatch, a Server.ReportEvent-under-lock deadlock, and a ConfigureAwait(false) off-actor-context crash for async drivers). Live wonder validation pending (deploy the current host + browse ns=2;s=EQ-3686c0272279/FOCAS/Identity/SeriesNumber, …/FOCAS/Axes/X/AbsolutePosition).
🎯 ROOT CAUSE — symptom #1 (CONFIRMED, 2026-06-25, 2nd subagent trace + code verify)
The FOCAS poll read hangs forever because (1) all wire I/O for a device shares one socket with NO serialization, and (2) the steady-state read has NO timeout.
- Unsynchronized shared socket:
FocasDriver.EnsureConnectedAsync(:1101-1128) returns the singledevice.Clientwith no I/O mutex.DeviceState.ProbeLock(:1172) only synchronizes probes with each other. Four independent loops issue wire ops on that same socket concurrently: the equipment poll (PollGroupEngine, 1 s —SubscriptionPublishingIntervalconstDriverHostActor.cs:58), the FixedTree loop (250 ms / 2 s bootstrap), the probe (5 s), the recycle loop. FOCAS/2 is strict request-response on one socket → concurrent unsynchronized reads interleave; one reader consumes another's response PDU and the victim read blocks waiting for bytes that never come. - No read timeout: the poll calls
FocasDriver.ReadAsync(:295-308) →WireFocasClient.ReadAsync(:417) →ReadMacroAsyncwithout the optionaltimeout→FocasWireClient.CreateCallTimeout(:843-848) skipsCancelAfterwhentimeoutis null →ReadExpectedPduAsyncawaits the socket under a token that never fires. The blocked firstforceRaisepoll (PollGroupEngine.cs:119) never completes →onChangenever fires → theBadWaitingForInitialDatamaterialization seed (OtOpcUaNodeManager.cs:1400) is never overwritten → permanent0x80320000. (Connect:1119and probe:604-610ARE bounded — only the steady-state read isn't.)
Why it works on the dev box but not deployed: the dev-box harness/CLI does ONE read at a time with no FixedTree loop running → no socket collision → macros read fine (proven in the v3 work). Deployed, the FixedTree loop races the equipment poll on one socket → collision → hung read.
Why both tags fail together: they're in one driver's single 1 s subscription batch; PollOnceAsync reads them sequentially and hangs on the first, so neither reaches onChange.
Ruled out: PollGroupId=NULL is a red herring — verified at ConfigComposer.cs:38 (snapshots all tags, no poll-group filter), AddressSpaceComposer.Compose (never reads PollGroupId), and DriverHostActor.PushDesiredSubscriptions:965-973 (groups by DriverInstanceId, constant 1 s interval, no poll-group keying). Routing-key mismatch ruled out: subscribe ref == routing key == onChange ref (all from the same FullName). Not-subscribed ruled out by elimination: a served node implies it's in the same ParseComposition artifact used for subscribe; and every deterministic ReadAsync outcome pushes a specific Bad code (BadNodeIdUnknown/BadCommunicationError/BadNotSupported), none of which is 0x80320000 — only a hung (never-returning) read leaves the seed intact.
PROPOSED FIX (two parts, FOCAS-driver-only, single managed DLL, no migration/proto change — same low-blast-radius deploy path as the v3 work):
- Serialize per-device wire I/O — an async gate (
SemaphoreSlim) onDeviceStateheld around eachEnsureConnected + wire opso the equipment poll, FixedTree loop, probe, and recycle never collide on the one socket. This is what makes values actually read Good. MUST be paired with #2 (a lock around an unbounded read would deadlock all I/O). - Bound every steady-state wire read/write with
_options.Timeout(mirror the probe's linked CTS at:604-610, or threadtimeoutthroughWireFocasClient.ReadAsync→ReadMacroAsync). Converts any stall into a recoverableBadCommunicationErrorpush (overwrites the seed, downgrades health → observable + self-healing) instead of permanent silence.
Confidence: the two structural gaps are proven from code; that their interaction is the active trigger is strongly-suggested (works single-threaded on dev box, hangs deployed; fits every observation). Definitive live confirmation = the fix makes the tags leave 0x80320000 (Good, or recoverable BadComm if the macro read genuinely fails). This is a debugging fix headed for a production CNC node — implement + unit-test (serialization + read-timeout) locally, then deploy to wonder and verify, per the diagnose-before-deploy discipline.
Implementation + deploy status (2026-06-26)
- Fix implemented on branch
fix/focas-poll-io-serialization(offfeat/focas-pdu-v3): newSynchronizedFocasClientdecorator (per-deviceSemaphoreSlimgate + per-call_options.Timeout) wired intoFocasDriver.EnsureConnectedAsync;ReadAsync/WriteAsyncnow map a per-call timeout (OCE while caller token live) →BadCommunicationErrorinstead of rethrowing. 243 FOCAS tests green (8 new inFocasIoSerializationTests.cs+ 1 sibling inFocasReadWriteTests), full solution builds 0 errors. - Live baseline re-confirmed via OtOpcUa CLI from the Mac: both tags
[80320000]. - Deployed the Release DLL to wonder (single managed-DLL swap; v3 DLL 222208 B → fix DLL 228864 B; backup
E:\ApiInstall\OtOpcUa\_focasbak-pre-iofix-20260626T041913.dll; service Running, OPC 4840 listening). - BLOCKER for verification: after the swap+restart the FOCAS driver is not connecting at all (zero TCP to
10.201.31.5:8193at 150 s uptime — not even a probe connect), so tags still[80320000]. - Deploy API enabled (user-approved): added
Security__DeployApiKey(64-char generated, value on-box only) to theOtOpcUaHostserviceEnvironmentREG_MULTI_SZ + restart; env backed up to_envbak-pre-deploykey-20260626T043001.txt. HeadlessPOST http://localhost:9000/api/deployments(key read from registry on-box) returned HTTP 202 Accepted, sealed deploymentef384b04…. - Tags STILL
[80320000]after the sealed deploy. Root of the live blocker (from box logs): the service writes Serilog toC:\Windows\System32\logs\otopcua-<date>.log(relative-path-from-service-CWD bug — pending.md D2). Today's log shows the admin/publish side only:Phase7Applier: hierarchy materialised (areas=1,lines=1,equipment=1)+equipment tags materialised (tags=2)at each restart (04:19, 04:30) — which is why the 2 tags are served with the0x80320000seed. But across the entire day's log there are ZERO driver-side lines:DriverHost=0,SubscribeBulk=0,subscribed to=0,Focas=0,GenericDriver=0 (the 13DriverInstancehits are all SQLFROM [DriverInstance]). ⇒ the DriverHostActor / driver-role side is not spawning the FOCAS driver at all on this node today, so there is no poll, no value push, permanent seed. Health:/healthzHealthy, Akka member Up,admin-leaderActive; roles env intact (Cluster__Roles__0=admin,__1=driver, seed=self:4053);DOTNET_ENVIRONMENT=Production. The admin half is fully alive; only the driver half is silent. - This is a SEPARATE blocker from the I/O fix (which is correct, unit-tested, and deployed — but cannot be exercised until the driver actually polls).
Driver-host activation diagnosis (2026-06-26, Debug-logging pass + code reads)
- Debug Serilog enabled (env
Serilog__MinimumLevel__Default=Debug+ EF/Microsoft/System→Warning; env bak_envbak-pre-debug-*.txt). Revealed: NO exceptions / load errors anywhere (my fix DLL is NOT the cause), but also ZERO Akka lines in Serilog (Akka/Member/akka.tcp/singleton= 0). DriverHostActorlogs viaContext.GetLogger()(AkkaILoggingAdapter), and there is NOAkka.Logger.Serilogbridge anywhere in the app (verified across*.cs/*.csproj/HOCON). So Akka actor logs go to the default console logger → discarded for a Windows service → invisible in BOTH Production AND Development.appsettings.Development.jsononly changes Serilog min-levels (+Ldap:DevStubMode=true); it adds no Akka bridge. ⇒DOTNET_ENVIRONMENT=Developmentwould NOT surface the driver host (held off — futile + flips DevStubMode on prod).- The
DriverHostActorIS instantiated (unconditionally, samesystem.ActorOfblock as the workingOpcUaPublishActoratRuntime/ServiceCollectionExtensions.cs:~232). So it runs — it just isn't producing a connected FOCAS driver, and logs via Akka so the reason is invisible. - STRONG HYPOTHESIS (explains everything): on bootstrap the
DriverHostActorrecovers its persisted state (its own logs nameentering Steady/recovered Applied state at rev {Rev}), but a process restart has already killed the liveDriverInstanceActors + their FOCAS sockets — and recovery does NOT re-instantiate them. Drivers re-spawn only on a deployment dispatch whose revision changes. This config is now stable (revisionHash 924b59097eba…is identical across0c2db588/ef384b04/a2a84646— everyPOST /api/deploymentsreturns 202 but with the SAME rev), so repeated deploys are no-ops for the driver host → after any restart there are no live drivers and nothing re-spawns them → zero Makino TCP → tags frozen at the0x80320000materialization seed. The prior session got HEALTHY because the config was being authored (each deploy had a NEW rev → forced a spawn); once authoring stopped, a restart leaves the driver side dark. ⇒ likely the real reason "host-restart didn't change it" in the original symptom. - Implication: to exercise the I/O fix live, force a driver re-spawn by bumping the revision (e.g. toggle the FOCAS DriverInstance
Enabledoff→on, or any benign config edit, then deploy) — reversible. Separately, "restart should re-spawn applied drivers" + "Akka logs should bridge to Serilog" are real OtOpcUa robustness/observability gaps (the latter = the observability bottleneck this plan flagged). - Box env mutations still in place (to revert when done):
Security__DeployApiKey(deploy API),Serilog__MinimumLevel__*(Debug). Backups:_envbak-pre-deploykey-*,_envbak-pre-debug-*.
Akka→Serilog observability attempt (2026-06-26) — code done, prod deploy BLOCKED by self-contained layout
- Code change (correct, builds 0 errors): added
Akka.Logger.Serilog1.5.60 (deps all satisfied by existing pins — Akka 1.5.62 / Serilog 4.3.1) toDirectory.Packages.props+ the Cluster csproj, andloggers=["Akka.Logger.Serilog.SerilogLogger, …"]+loglevel=DEBUG+logger-startup-timeout=30sto the embeddedResources/akka.conf. This bridges the DriverHostActor's AkkaILoggingAdapterinto the Serilog file sink. - Prod deploy FAILED TWICE (both auto-rolled-back; service is UP + 4840 listening the whole time after each):
- Cluster.dll + Akka.Logger.Serilog.dll swap → crash
FileNotFoundException: Could not load 'Akka.Logger.Serilog 1.5.60'fromSerilog.Settings.Configuration.ConfigurationReader.LoadConfigurationAssemblies(the deployedOtOpcUa.Host.deps.jsondidn't list the new assembly; Serilog's*.Serilogconfig-assembly scan then fails fatally). - Added the updated
OtOpcUa.Host.deps.json→ crash "Could not resolve CoreCLR path." ⇒ the box deployment is SELF-CONTAINED (bundles its own runtime), so adotnet build(framework-dependent) deps.json breaks the apphost's runtime resolution.
- Cluster.dll + Akka.Logger.Serilog.dll swap → crash
- Conclusion: adding a NEW assembly to this box requires a full self-contained publish-overlay (match the box layout; preserve
appsettings*/data\), NOT a DLL/deps.json swap. That's a heavy/risky prod op for what is diagnostic observability of the (separate, pre-existing) driver-host re-spawn issue. - Prod state now (verified healthy): FOCAS I/O-fix DLL still deployed (228864 B — untouched by these rollbacks), Cluster.dll + deps.json restored to baseline,
Akka.Logger.Serilog.dllrenamed.disabled, service Running + OPC 4840 listening. Deploy-API key + Debug-Serilog env still set. - RECOMMENDED PIVOT: the cheap, no-prod-change way to validate the I/O fix is to force a driver re-spawn via a benign config rev-bump (the leading hypothesis); and the proper driver-host re-spawn fix + the Akka-observability are best developed in a local docker-dev repro (safe, full logging, one clean publish-overlay deploy at the end) rather than iterated on the production CNC node. (User chose: move proper fix to local docker-dev.)
🎯 ROOT CAUSE of the driver-not-spawning — STALE DEPLOYED BINARY (CONFIRMED 2026-06-26)
The DriverHostActor's bootstrap-recovery already re-spawns drivers in the current source: BootstrapRecover case Applied: logs "recovered Applied state at rev" and calls RestoreApplied() (DriverHostActor.cs:449-457), which does ReconcileDrivers (re-spawn) + RebuildAddressSpace + PushDesiredSubscriptions (:910-918) — comment: "the in-memory driver children + OPC UA address space were lost on restart … re-spawn … instead of waiting for a config change (whose identical-config revision would no-op)." Added by b1b3f3ff (2026-06-06) + 397f9b78 (2026-06-07); both ancestors of feat/focas-pdu-v3; covered by existing Restore_on_bootstrap_* tests.
- The DEPLOYED wonder
Runtime.dllis the June-16 base install (mtime 2026-06-16 13:05:46; only the FOCAS driver DLL was ever swapped). Binary string scan:RestoreApplied/ReconcileDriversmethod names PRESENT, but the bootstrap-restore log strings"recovered Applied state at rev"+"restored served state for applied deployment"are ABSENT ⇒ the deployed binary predatesb1b3f3ffand does NOT restore served state / re-spawn drivers on bootstrap. That is exactly why, on this stable-config box, a restart leaves the driver side dark (and why only config-authoring re-deploys ever lit it up). - ⇒ No new driver-host fix is needed — it already exists in source. The proper remedy = deploy the current Host (full self-contained publish-overlay onto
E:\ApiInstall\OtOpcUa, preservingappsettings*/data\), which in one shot delivers: (1)RestoreApplied-on-bootstrap (drivers re-spawn after restart), (2) the FOCAS I/O serialization fix (symptom #1), (3) the Akka→Serilog bridge (observability). The earlier DLL-swap crashes were purely the self-contained-layout mismatch — a full publish-overlay is internally consistent and avoids them. - PLAN: confirm the current Host build boots cleanly + re-spawns drivers on restart in docker-dev (safe), then do ONE self-contained publish-overlay to wonder + verify tags leave
0x80320000.
✅ docker-dev confirmation (2026-06-26) — current build re-spawns on bootstrap + Akka bridge works
Built the current source into the docker-dev image (otopcua-host:dev), booted central-1 (fused admin+driver, like wonder). Results:
- Boots cleanly with
Akka.Logger.Serilog(the prod DLL-swap crashes were purely the self-contained-layout mismatch, now confirmed). - Akka→Serilog bridge works — but needed TWO fixes beyond the package: (1) Akka.Hosting owns logger setup so HOCON
akka.loggersis ignored → wire viaConfigureLoggers(setup => { setup.LogLevel=DebugLevel; setup.ClearLoggers(); setup.AddLogger<SerilogLogger>(); })inWithOtOpcUaClusterBootstrap; (2)AddZbSerilogregisters Serilog as the MEL provider but does NOT set the staticSerilog.Log.Logger(whichAkka.Logger.Serilogwrites to, AND which the Program.cs startup banner uses) → setSerilog.Log.Logger = app.Services.GetRequiredService<Serilog.ILogger>()in Program.cs right afterBuild(). With both, the startup banner + full Akka cluster/DriverHost logs now emit. (TheLog.Logger-unset gap is a latent bug in the sharedAddZbSeriloglib affecting all 3 apps' static-Logcalls — worth a follow-up there.) RestoreAppliedre-spawns drivers on bootstrap — CONFIRMED live:DriverHost central-1: recovered Applied state at rev …→spawned GalaxyMxGateway/Modbus/OpcUaClient driver(×3) →SubscribeBulk pushed 5 references across 3 driver(s)→restored served state for applied deployment … on bootstrap→DriverInstance …: subscribed to N refs. These are the exact strings ABSENT from the stale wonder binary. ⇒ deploying the current Host fixes the wonder driver-not-spawning blocker.
Deploy plan (self-contained publish-overlay → wonder)
- Local:
dotnet publish …Host.csproj -c Release -r win-x64 --self-contained true -p:PublishSingleFile=false(self-contained = brings its own runtime; no version-match concern with the box). - Zip + SFTP to
win64. Box overlay (self-healing): full app-dir backup → robocopy publish overE:\ApiInstall\OtOpcUaEXCLUDINGpki\(OPC server cert) +appsettings*.json→ start → verify (4840 + tags) → AUTO-ROLLBACK from backup if unhealthy. Env (deploy key, Debug logging) is registry-side, untouched by the file overlay. - Then revert the Debug-Serilog env (now superseded; verbose) and verify
parts-count/parts-requiredread Good (or recoverable BadComm), and that the FixedTree/driver values flow.
✅✅ SYMPTOM #1 FIXED + LIVE-VALIDATED ON WONDER (2026-06-26)
After the self-contained overlay (current Host) + two light single-DLL FOCAS swaps, the validation revealed — and each fix peeled back — a cascade of latent FOCAS-config-vs-driver mismatches the stale binary had masked. Final state, all live on wonder:
DriverHost …: recovered Applied state … → spawned Focas driver focas-z-34184 (stub=false) → DriverInstance focas-z-34184: connected → subscribed to 2 refs— two ESTABLISHED TCP sessions to10.201.31.5:8193.read ns=2;s=EQ-3686c0272279/parts-count→ Value 0, Status0x00000000(Good);parts-required→ Good. (Value 0 is correct on the idle machine — status, not magnitude.) The original0x80320000is gone.
The complete fix chain (all on fix/focas-poll-io-serialization, deployed):
- FOCAS I/O serialization + read timeout (
SynchronizedFocasClient) — the original diagnosed root cause: prevents the poll read hanging on the shared socket. RestoreApplied-on-bootstrap — already in source sinceb1b3f3ff; the wonder fix was deploying the current Host over the stale June-16 binary so the driver re-spawns on restart.- Akka→Serilog bridge (
ConfigureLoggers().AddLogger<SerilogLogger>()+ set staticSerilog.Log.Loggerin Program.cs) — made the driver-host actor observable; this is what surfaced the next two issues. FlexibleStringConverteron the FOCAS configSeries— the AdminUI persists the enum as a number ("series":6); the factory now tolerates number-or-string instead of throwing → stub.- Scheme-less host tolerance in
FocasHostAddress.TryParse— the AdminUI persistshostAddressas a bareip:port;TryParsenow accepts it (canonicalfocas://unchanged) instead of failing init.
- FOCAS test suite 247 green; each fix carries a regression test.
- Follow-up (product quality): the AdminUI authors FOCAS configs (
seriesas number,hostAddresswithoutfocas://) that the driver only now tolerates — the AdminUI↔driver config-format mismatch is worth reconciling at the source. Also: the sharedAddZbSerilognot setting staticSerilog.Log.Loggeris a latent gap across all 3 apps. The FixedTree-under-Equipment feature (task #14) is now BUILT (offline-complete; see the 2026-06-26 design + implementation-plan docs above) — live wonder validation pending.
Phase 2 — Get OtOpcUa runtime logs on wonder
Make the Host emit driver-level logs so the data plane is observable. Options (least invasive first): point the service at a Serilog file sink via config/env, or temporarily run with DOTNET_ENVIRONMENT=Development (file sink + dev errors — cf. MxGateway note), or add a console capture. Preserve appsettings*/data\; restore the env after. Then read: did InitializeAsync start the FixedTree loop, does the bootstrap throw (and on which call), is ReadAsync invoked for the equipment tags, what does it return.
- Access: servecli
:2222, key~/.ssh/servecli_wonder,scratchpad/wonder-ps.sh(base64 PS over the cmd PTY), SFTP rootC:\Users\dohertj2\Desktop\win64. - Output: the actual runtime behavior of the data poll + equipment read on the box.
Phase 3 — Local reproduction in docker-dev (isolate FOCAS-specific vs general)
Reproduce off the production box: in the docker-dev OtOpcUa, configure an equipment tag bound to a driver and check whether values flow at all. Use a driver with an easy local source (e.g. Modbus against a local sim, or the FOCAS wire client against a v1 responder if one can be stood up). If equipment-tag values flow for another driver locally but not FOCAS → FOCAS-specific; if they don't flow for any → a general equipment-projection/data-plane gap. A local repro gives a full-logging debug loop.
- Output: scope (FOCAS-only vs general) + a local failing case to fix against.
Phase 4 — Root cause + minimal fix
From Phases 1–3, fix the smallest thing that makes the equipment tag carry a value and (if H2 is not by-design) the FixedTree surface. Likely shapes: a swallowed bootstrap exception; a seed/poll-group wiring gap; an Equipment-projection that should include FixedTree/driver auto-nodes; or a data-plane gate that needs config on the wonder node.
Phase 5 — Validate
- Local (docker-dev) green where reproduced; unit/integration tests for the fixed path.
- Live: re-deploy to wonder, then via the OtOpcUa CLI confirm
parts-count/parts-requiredread Good (value 0 is correct on this idle machine — assert status, not magnitude) and, if applicable, FixedTree Identity/Axes nodes appear with live values. The live Makino helps for FixedTree but is NOT blocking for the value-flow plumbing (macro values are 0 regardless), so this is lower time-pressure than the v3 capture was.
Phase 6 — Docs + commit
Update this plan's status + the deployment doc; commit on a branch (separate from feat/focas-pdu-v3 if the fix is OtOpcUa-core rather than FOCAS-driver). Push per the repo flow when asked.
Notes
- Execution approach: this is a debugging investigation (unknown root cause) — diagnose before fixing; reproduce before claiming a fix; change the smallest thing. Don't deploy a guessed fix to the production CNC node.
- Key node IDs / endpoints: equipment
ns=2;s=EQ-3686c0272279; tags…/parts-count(MACRO:3901),…/parts-required(MACRO:3902); OPCopc.tcp://wonder-app-vd03.zmr.zimmer.com:4840/OtOpcUa; AdminUIhttp://wonder-app-vd03.zmr.zimmer.com:9000(DisableLogin); deploy = AdminUI/deployments→ "Deploy current configuration" (Blazo→Akka; no headless API). - Reusable tools:
scratchpad/focas-status/(live IFocasClient harness),scratchpad/wonder-ps.sh,scratchpad/deploy-focas-v3.ps1; the OtOpcUa CLIsrc/Client/ZB.MOM.WW.OtOpcUa.Client.CLI(read/subscribe/browse --recursive).