Files
lmxopcua/docs/plans/2026-06-25-otopcua-equipment-dataplane-investigation.md
T

38 KiB
Raw Blame History

OtOpcUa equipment-tag data-plane — why live FOCAS values don't surface (investigation plan)

Date: 2026-06-25 Companion to: 2026-06-25-focas-pdu-v3-implementation-plan.md (the FOCAS PDU-v3 driver work — DONE + deployed) and ../deployments/wonder-app-vd03-makino-z-34184.md. Goal: make the deployed Makino tags (parts-count/parts-required = MACRO:3901/3902) actually carry live values over OPC UA, and make the FOCAS FixedTree (Identity/Axes/Timers/…) appear in the served address space.

The symptom (precise, observed 2026-06-25 after the v3 deploy)

Against opc.tcp://wonder-app-vd03.zmr.zimmer.com:4840/OtOpcUa (Security None, anonymous, DisableLogin):

  • read AND a 30 s subscribe of ns=2;s=EQ-3686c0272279/parts-count (and /parts-required) return 0x80320000 Bad_WaitingForInitialData — never a value.
  • A recursive browse of the whole served tree shows ONLY machining → makino → z-34184 → {parts-count, parts-required}no FixedTree nodes (Identity/Axes/Timers/Program/…).
  • Unchanged by host-restart, a full AdminUI re-deploy (12e0d528, Sealed/In-sync), and a driver Restart.
  • A ~15 s box-side watch saw no 250 ms-cadence TCP connection to 10.201.31.5:8193 — only the occasional probe connect.

What is RULED OUT (do not re-investigate)

  • The FOCAS PDU-v3 driver. Proven working: the rewritten FocasDriverProbe does a real initiate + cnc_statinfo and the AdminUI shows DRIVER STATUS: HEALTHY, last success Ns ago; and on the dev box every surface (sysinfo / axes / dynamic / macros / timer / PMC / servo / alarms) reads correctly through the same WireFocasClient. The wire path talks v3 to the Makino.
  • Deployment / Akka roles. Re-deploy sealed with no "task canceled"; the node has admin+driver roles; the address space (the 2 UNS tags) is served.
  • Reachability. TCP 8193 reachable from the box; OPC 4840 reachable from the dev box.

So the problem is the OtOpcUa data plane: the driver's control/probe plane runs, but its data plane (the FixedTree poll loop output + the equipment-tag value source) is not reaching the served address space. This was present last session too — it was masked by the v3 reject; with v3 now working it stands alone.

Leading hypotheses (to confirm/refute, not assume)

  • H1 — FixedTree bootstrap is stuck/throwing. FocasDriver.FixedTreeLoopAsync is started at FocasDriver.cs:160 inside InitializeAsync (which demonstrably ran — the probe loop at :137 is alive). Its bootstrap while (state.FixedTreeCache is null) loop (~:650) calls GetSysInfoAsync/GetAxisNamesAsync/optional probes; if one throws it retries every 2 s and the cache never populates → no FixedTree nodes, no poll. The brief 2 s-spaced connects could explain "no steady 250 ms connection."
  • H2 — FixedTree nodes aren't projected under the Equipment model. Galaxy/FOCAS are now standard Equipment-kind drivers; the served tree is the UNS/Equipment projection (configured tags only). Driver DiscoverAsync FixedTree auto-nodes may only live in a raw-driver namespace that the UNS browse doesn't show — i.e. FixedTree-under-Equipment may be unsupported-by-design, not a bug.
  • H3 — equipment-tag value source never primed. A point-read returns the server's cached value; with no active poll/subscription seeding it, that's WaitingForInitialData. But a 30 s subscribe ALSO stayed bad — so either the monitored-item → FocasDriver.ReadAsync wiring isn't firing, or ReadAsync(MACRO:3901) returns no value on the box. (Analogy: the ScadaBridge DCL seed-ordering / static-tag false-bad class of bug — see that memory.)
  • H4 — a data-plane gate. A role/active-node gate (cf. "data-plane roles need Security:Ldap:GroupToRole mapped or write/ack is inert") could suppress the read/poll on this single fused node under DisableLogin.

⚠️ Observability is the bottleneck

There are currently no OtOpcUa app logs on the box (Windows eventlog shows only NSSM start/stop; Serilog has no file sink in this deployment — cf. the MxGateway windev note). Getting runtime logs is the single highest-leverage step and gates H1/H3/H4.


Phase 1 — Offline code trace of the data plane (do FIRST; free, no box/CNC)

Trace, in the repo on the dev box, the two paths end-to-end. This may reveal the root cause with zero box access.

  1. Equipment-tag read path: from an OPC UA Read/CreateMonitoredItem on ns=2;s=EQ-…/parts-count → the server's address-space node → the "FullName→NodeId router" → FocasDriver.ReadAsync. Find: who invokes the driver read, whether point-reads vs subscriptions differ, and where/whether a value is cached vs WaitingForInitialData is returned. Entry points: FocasDriver.cs:247 (resolver-produced equipment tags), FocasDriver.ReadAsync, and the server-side poll-group / monitored-item sampler (grep PollGroup, MonitoredItem, the Equipment namespace value provider).
  2. FixedTree path: FocasDriver.FixedTreeLoopAsync (:644) → state.FixedTreeCacheTryReadFixedTree → how those nodes are added to the address space. Determine whether FixedTree nodes are emitted into the Equipment/UNS projection at all, or only a raw-driver namespace (resolves H2). Grep DiscoverAsync, FixedTreeReference, and the address-space builder for the Equipment kind.
  3. Driver lifecycle/gates: confirm what conditions start/suppress the data poll vs the probe loop (active-node, role, enabled flags) — FocasDriver.cs:87181, the cluster/active-node gate.
  • Output: a written root-cause hypothesis (which of H1H4) + the specific code site, OR a precise list of what only box logs can disambiguate.

Phase 1 FINDINGS (2026-06-25, offline subagent code-trace — DONE)

Hypothesis Verdict Confidence
H2 — FixedTree nodes not projected under Equipment model CONFIRMED root cause of symptom #2 — by-design proven from code
H3 — equipment-tag value source never primed CONFIRMED mechanism of symptom #1; routing code is self-consistent → break is runtime, not static mechanism proven; exact break needs box logs
H4 — role/active-node data-plane gate REFUTED for the read/subscribe path proven from code
H1 — FixedTree bootstrap stuck/throwing best fit for symptom #3 (no 250 ms connect), but moot for #2 and not the cause of #1 strongly-suggested; needs logs
H5 (new) — FOCAS tag with blank deviceHostAddressBadNodeIdUnknown latent trap — yields a Bad value, not WaitingForInitialData, so it's the next failure, not the current symptom proven from code

Why symptom #2 (no FixedTree nodes): FixedTree nodes are emitted only by FocasDriver.DiscoverAsync (FocasDriver.cs:391-494, gated on FixedTreeCache), and DiscoverAsync is never invoked by the deployed equipment-kind address-space build. Production materializes the served tree purely from Config-DB Tag rows (EquipmentNodeWalker.cs:24-25,161-165; OpcUaPublishActor.RebuildAddressSpaceMaterialiseEquipmentTags(composition) at OpcUaPublishActor.cs:326-337). No runtime caller of IDriver.DiscoverAsync exists. ⇒ By-design gap, not a bug. Surfacing FixedTree under the Equipment tree is a feature (author explicit Tag rows per signal, or extend the composition pipeline to merge ITagDiscovery output — a substantial AddressSpaceComposer/AddressSpaceApplier change).

Why symptom #1 (0x80320000 even under a 30 s subscribe): equipment-tag node value is push-only. 0x80320000 is the materialization seed (OtOpcUaNodeManager.EnsureVariable:1400-1402); there is no OnReadValue/server-side sampler for equipment tags, so both Read and MonitoredItem serve the cached node value. The seed is overwritten only by ForwardToMux → WriteValue (OtOpcUaNodeManager.cs:261-281), which writes even Bad/null. StatusFromQuality maps Bad → 0x80000000, not 0x80320000. ⇒ A persistent 0x80320000 proves no AttributeValueUpdate ever landed for those two NodeIds. The routing chain (DriverHostActor.PushDesiredSubscriptions_nodeIdByDriverRefDriverInstanceActor.ResubscribeDesiredPollGroupEngine forceRaise → OnDataChangeForwardForwardToMux) is provably self-consistent for FOCAS (FocasTagConfigModel writes no top-level FullNameExtractTagFullName returns the raw JSON blob → FocasEquipmentTagParser.TryParse resolves it; subscribed ref == published ref == routing-map key == NodeId, all from the same t.FullName). So the static mapping is refuted; the break is runtime: either the FOCAS child was never subscribed those refs, or _nodeIdByDriverRef lacked the (DriverInstanceId, FullName) key at publish time — most likely a DriverInstanceId attribution mismatch between the equipment Tag rows and the spawned DriverInstance.

What only box logs (or a ConfigDb snapshot read) can disambiguate for symptom #1 — capture these on the driver-role node after a deploy:

  1. DriverHostActor info "SubscribeBulk pushed {Refs} references across {Drivers} driver(s)" (DriverHostActor.cs:1058) — Refs<2 ⇒ equipment Tag DriverInstanceId ≠ spawned FOCAS child id (attribution mismatch).
  2. DriverInstanceActor info "subscribed to {Count} refs" (DriverInstanceActor.cs:571) — absent/0_desiredRefs empty or never reached Connected.
  3. DriverHostActor debug "no equipment-tag NodeId for ({Driver},{Ref}) — value dropped" (DriverHostActor.cs:550) — present ⇒ _nodeIdByDriverRef key miss.
  4. FocasDriver debug "FOCAS fixed-tree bootstrap failed … retrying" every ~2 s (FocasDriver.cs:686) ⇒ confirms H1 (cnc_sysinfo/cnc_rdaxisname failing on the Makino). Needs Debug level + logger-constructed wire client.
  5. A driver-side read of MACRO:3901 (live Makino): BadNodeIdUnknown ⇒ H5 blank-deviceHostAddress trap; Good value ⇒ wire path fine, gap is purely the OPC-UA equipment subscription wiring.

Cheaper-than-Phase-2 lead: the prime suspect (Tag-row DriverInstanceId vs DriverInstance row mismatch) is checkable by reading the deployed ConfigDb snapshot (OtOpcUaConfig on wonder-sql-vd03:1433) — no service restart, no log-sink change. Worth doing before/instead of the Phase 2 log-sink change.

Suggested minimal-fix shapes (NOT implemented):

  • Symptom #1: depends on which log line/row fires — if attribution miss, it's a config/snapshot re-bind (no code). If a real wiring gap, localize first.
  • H5 latent: in FocasDriver.ReadAsync/WriteAsync device lookup (:289,:346), when DeviceHostAddress is blank and _devices.Count==1, resolve to the sole device instead of BadNodeIdUnknown (or make the AdminUI field mandatory).
  • Symptom #2 (H2): no fix — by-design; surfacing FixedTree is a feature decision.

ConfigDb snapshot read (2026-06-25, symptom #1 path — chosen over box logs)

Read live OtOpcUaConfig on wonder-sql-vd03 (query run on-box so the SQL password never left the box). Ground truth:

  • Equipment EQ-3686c0272279 (z-34184): DriverInstanceId=focas-z-34184, DeviceId=NULL, Enabled=True, UnsLineId=makino.
  • Tags parts-count/parts-required: both DriverInstanceId=focas-z-34184, DeviceId=NULL, DataType=Double, AccessLevel=Read, FolderPath=NULL, PollGroupId=NULL, TagConfig={"deviceHostAddress":"10.201.31.5:8193","address":"MACRO:390{1,2}","dataType":"Float64"}.
  • DriverInstance: one row focas-z-34184 (DriverType=Focas, Enabled=True, ns dev-equipment, cluster DEV).
  • Device table: 0 rows — NORMAL: FocasDriver builds _devices from _options.Devices (DriverConfig JSON) not the Device table (FocasDriver.cs:97).

Refuted by this read: prime-suspect DriverInstanceId attribution mismatch (matches exactly) and H5 blank-deviceHostAddress (present). The deployed config is clean. ⇒ symptom #1 is a pure value-flow-plumbing break. New live leads: PollGroupId=NULL on both tags (is a poll group required to subscribe/poll?) and the resolver-registration path (equipment-tag refs are "resolver-produced, not seeded at InitializeAsync" per FocasDriver.cs:247 — does poll-time TryResolve of the JSON-blob ref ever succeed?). A second offline subagent trace of DriverHostActor↔DriverInstanceActor↔PollGroupEngine↔resolver is running to pin the exact broken link.

FixedTree feature (symptom #2) — BUILT 2026-06-26 (architecture reality below)

Mapped the composition pipeline. Two address-space paths exist: (1) Equipment/UNS projection AddressSpaceComposer.Compose (config entities only) → AddressSpaceApplier.MaterialiseEquipmentTags → the served ns=2 tree where EQ-… lives; (2) raw-driver namespace GenericDriverNodeManager.BuildAddressSpaceAsyncdriver.DiscoverAsync(IAddressSpaceBuilder). Path 2 is DEAD: BuildAddressSpaceAsync has no runtime caller and OpcUaApplicationHost.PopulateAddressSpaces (its referenced caller) no longer exists. Even GalaxyDriver.DiscoverAsync (:588) is reachable only via that dead path — Galaxy surfaces its hierarchy by being authored as config equipment/tags, not via discovery. ⇒ In the current Equipment-kind model every served node is config-driven; ITagDiscovery/DiscoverAsync is legacy/dead for serving. So "build the FixedTree feature" is NOT re-wiring an existing path — it's a new dynamic-node-injection capability into the Equipment projection, and it must solve a timing problem: composition runs at deploy/apply (before the driver connects), but FixedTree data only exists after the driver's async FixedTreeCache bootstrap. The far cheaper alternative that yields the same visible result is to author FixedTree signals as config Tag rows (each bound to a FOCAS fixed-tree reference) — same mechanism every other equipment tag uses. (The user chose to build the dynamic feature over the config-rows alternative.)

BUILT (2026-06-26). Implemented as a generic post-connect ITagDiscovery injection pipeline: when a driver reaches Connected, DriverInstanceActor runs bounded re-discovery into a capturing IAddressSpaceBuilder and ships DiscoveredNodesReady to DriverHostActor, which maps the nodes under the equipment (EQ-…/FOCAS/…, read-only), extends the _nodeIdByDriverRef routing map, and tells OpcUaPublishActor to incrementally materialise them — reusing the existing materialize→subscribe→poll→push pipeline (no full rebuild). Survives redeploys (re-applied at the tail of PushDesiredSubscriptions) and restarts (re-discovered on reconnect). Design: 2026-06-26-otopcua-fixedtree-equipment-injection-design.md; implementation plan (11 bite-sized tasks, all green): 2026-06-26-otopcua-fixedtree-equipment-injection.md. Offline-complete on branch feat/focas-fixedtree-equipment-injection (solution build 0 errors / 0 warnings; Runtime.Tests 312, OpcUaServer.Tests 304, FOCAS 247 + an end-to-end injection+value-flow test, all green). The review chain caught + fixed three real defects (a DriverDataType→OPC-UA-type string mismatch, a Server.ReportEvent-under-lock deadlock, and a ConfigureAwait(false) off-actor-context crash for async drivers). Live wonder validation pending (deploy the current host + browse ns=2;s=EQ-3686c0272279/FOCAS/Identity/SeriesNumber, …/FOCAS/Axes/X/AbsolutePosition).

🎯 ROOT CAUSE — symptom #1 (CONFIRMED, 2026-06-25, 2nd subagent trace + code verify)

The FOCAS poll read hangs forever because (1) all wire I/O for a device shares one socket with NO serialization, and (2) the steady-state read has NO timeout.

  • Unsynchronized shared socket: FocasDriver.EnsureConnectedAsync (:1101-1128) returns the single device.Client with no I/O mutex. DeviceState.ProbeLock (:1172) only synchronizes probes with each other. Four independent loops issue wire ops on that same socket concurrently: the equipment poll (PollGroupEngine, 1 s — SubscriptionPublishingInterval const DriverHostActor.cs:58), the FixedTree loop (250 ms / 2 s bootstrap), the probe (5 s), the recycle loop. FOCAS/2 is strict request-response on one socket → concurrent unsynchronized reads interleave; one reader consumes another's response PDU and the victim read blocks waiting for bytes that never come.
  • No read timeout: the poll calls FocasDriver.ReadAsync (:295-308) → WireFocasClient.ReadAsync (:417) → ReadMacroAsync without the optional timeoutFocasWireClient.CreateCallTimeout (:843-848) skips CancelAfter when timeout is null → ReadExpectedPduAsync awaits the socket under a token that never fires. The blocked first forceRaise poll (PollGroupEngine.cs:119) never completes → onChange never fires → the BadWaitingForInitialData materialization seed (OtOpcUaNodeManager.cs:1400) is never overwritten → permanent 0x80320000. (Connect :1119 and probe :604-610 ARE bounded — only the steady-state read isn't.)

Why it works on the dev box but not deployed: the dev-box harness/CLI does ONE read at a time with no FixedTree loop running → no socket collision → macros read fine (proven in the v3 work). Deployed, the FixedTree loop races the equipment poll on one socket → collision → hung read.

Why both tags fail together: they're in one driver's single 1 s subscription batch; PollOnceAsync reads them sequentially and hangs on the first, so neither reaches onChange.

Ruled out: PollGroupId=NULL is a red herring — verified at ConfigComposer.cs:38 (snapshots all tags, no poll-group filter), AddressSpaceComposer.Compose (never reads PollGroupId), and DriverHostActor.PushDesiredSubscriptions:965-973 (groups by DriverInstanceId, constant 1 s interval, no poll-group keying). Routing-key mismatch ruled out: subscribe ref == routing key == onChange ref (all from the same FullName). Not-subscribed ruled out by elimination: a served node implies it's in the same ParseComposition artifact used for subscribe; and every deterministic ReadAsync outcome pushes a specific Bad code (BadNodeIdUnknown/BadCommunicationError/BadNotSupported), none of which is 0x80320000 — only a hung (never-returning) read leaves the seed intact.

PROPOSED FIX (two parts, FOCAS-driver-only, single managed DLL, no migration/proto change — same low-blast-radius deploy path as the v3 work):

  1. Serialize per-device wire I/O — an async gate (SemaphoreSlim) on DeviceState held around each EnsureConnected + wire op so the equipment poll, FixedTree loop, probe, and recycle never collide on the one socket. This is what makes values actually read Good. MUST be paired with #2 (a lock around an unbounded read would deadlock all I/O).
  2. Bound every steady-state wire read/write with _options.Timeout (mirror the probe's linked CTS at :604-610, or thread timeout through WireFocasClient.ReadAsyncReadMacroAsync). Converts any stall into a recoverable BadCommunicationError push (overwrites the seed, downgrades health → observable + self-healing) instead of permanent silence.

Confidence: the two structural gaps are proven from code; that their interaction is the active trigger is strongly-suggested (works single-threaded on dev box, hangs deployed; fits every observation). Definitive live confirmation = the fix makes the tags leave 0x80320000 (Good, or recoverable BadComm if the macro read genuinely fails). This is a debugging fix headed for a production CNC node — implement + unit-test (serialization + read-timeout) locally, then deploy to wonder and verify, per the diagnose-before-deploy discipline.

Implementation + deploy status (2026-06-26)

  • Fix implemented on branch fix/focas-poll-io-serialization (off feat/focas-pdu-v3): new SynchronizedFocasClient decorator (per-device SemaphoreSlim gate + per-call _options.Timeout) wired into FocasDriver.EnsureConnectedAsync; ReadAsync/WriteAsync now map a per-call timeout (OCE while caller token live) → BadCommunicationError instead of rethrowing. 243 FOCAS tests green (8 new in FocasIoSerializationTests.cs + 1 sibling in FocasReadWriteTests), full solution builds 0 errors.
  • Live baseline re-confirmed via OtOpcUa CLI from the Mac: both tags [80320000].
  • Deployed the Release DLL to wonder (single managed-DLL swap; v3 DLL 222208 B → fix DLL 228864 B; backup E:\ApiInstall\OtOpcUa\_focasbak-pre-iofix-20260626T041913.dll; service Running, OPC 4840 listening).
  • BLOCKER for verification: after the swap+restart the FOCAS driver is not connecting at all (zero TCP to 10.201.31.5:8193 at 150 s uptime — not even a probe connect), so tags still [80320000].
  • Deploy API enabled (user-approved): added Security__DeployApiKey (64-char generated, value on-box only) to the OtOpcUaHost service Environment REG_MULTI_SZ + restart; env backed up to _envbak-pre-deploykey-20260626T043001.txt. Headless POST http://localhost:9000/api/deployments (key read from registry on-box) returned HTTP 202 Accepted, sealed deployment ef384b04….
  • Tags STILL [80320000] after the sealed deploy. Root of the live blocker (from box logs): the service writes Serilog to C:\Windows\System32\logs\otopcua-<date>.log (relative-path-from-service-CWD bug — pending.md D2). Today's log shows the admin/publish side only: Phase7Applier: hierarchy materialised (areas=1,lines=1,equipment=1) + equipment tags materialised (tags=2) at each restart (04:19, 04:30) — which is why the 2 tags are served with the 0x80320000 seed. But across the entire day's log there are ZERO driver-side lines: DriverHost=0, SubscribeBulk=0, subscribed to=0, Focas=0, GenericDriver=0 (the 13 DriverInstance hits are all SQL FROM [DriverInstance]). ⇒ the DriverHostActor / driver-role side is not spawning the FOCAS driver at all on this node today, so there is no poll, no value push, permanent seed. Health: /healthz Healthy, Akka member Up, admin-leader Active; roles env intact (Cluster__Roles__0=admin, __1=driver, seed=self:4053); DOTNET_ENVIRONMENT=Production. The admin half is fully alive; only the driver half is silent.
  • This is a SEPARATE blocker from the I/O fix (which is correct, unit-tested, and deployed — but cannot be exercised until the driver actually polls).

Driver-host activation diagnosis (2026-06-26, Debug-logging pass + code reads)

  • Debug Serilog enabled (env Serilog__MinimumLevel__Default=Debug + EF/Microsoft/System→Warning; env bak _envbak-pre-debug-*.txt). Revealed: NO exceptions / load errors anywhere (my fix DLL is NOT the cause), but also ZERO Akka lines in Serilog (Akka/Member/akka.tcp/singleton = 0).
  • DriverHostActor logs via Context.GetLogger() (Akka ILoggingAdapter), and there is NO Akka.Logger.Serilog bridge anywhere in the app (verified across *.cs/*.csproj/HOCON). So Akka actor logs go to the default console logger → discarded for a Windows service → invisible in BOTH Production AND Development. appsettings.Development.json only changes Serilog min-levels (+ Ldap:DevStubMode=true); it adds no Akka bridge. ⇒ DOTNET_ENVIRONMENT=Development would NOT surface the driver host (held off — futile + flips DevStubMode on prod).
  • The DriverHostActor IS instantiated (unconditionally, same system.ActorOf block as the working OpcUaPublishActor at Runtime/ServiceCollectionExtensions.cs:~232). So it runs — it just isn't producing a connected FOCAS driver, and logs via Akka so the reason is invisible.
  • STRONG HYPOTHESIS (explains everything): on bootstrap the DriverHostActor recovers its persisted state (its own logs name entering Steady / recovered Applied state at rev {Rev}), but a process restart has already killed the live DriverInstanceActors + their FOCAS sockets — and recovery does NOT re-instantiate them. Drivers re-spawn only on a deployment dispatch whose revision changes. This config is now stable (revisionHash 924b59097eba… is identical across 0c2db588/ef384b04/a2a84646 — every POST /api/deployments returns 202 but with the SAME rev), so repeated deploys are no-ops for the driver host → after any restart there are no live drivers and nothing re-spawns them → zero Makino TCP → tags frozen at the 0x80320000 materialization seed. The prior session got HEALTHY because the config was being authored (each deploy had a NEW rev → forced a spawn); once authoring stopped, a restart leaves the driver side dark. ⇒ likely the real reason "host-restart didn't change it" in the original symptom.
  • Implication: to exercise the I/O fix live, force a driver re-spawn by bumping the revision (e.g. toggle the FOCAS DriverInstance Enabled off→on, or any benign config edit, then deploy) — reversible. Separately, "restart should re-spawn applied drivers" + "Akka logs should bridge to Serilog" are real OtOpcUa robustness/observability gaps (the latter = the observability bottleneck this plan flagged).
  • Box env mutations still in place (to revert when done): Security__DeployApiKey (deploy API), Serilog__MinimumLevel__* (Debug). Backups: _envbak-pre-deploykey-*, _envbak-pre-debug-*.

Akka→Serilog observability attempt (2026-06-26) — code done, prod deploy BLOCKED by self-contained layout

  • Code change (correct, builds 0 errors): added Akka.Logger.Serilog 1.5.60 (deps all satisfied by existing pins — Akka 1.5.62 / Serilog 4.3.1) to Directory.Packages.props + the Cluster csproj, and loggers=["Akka.Logger.Serilog.SerilogLogger, …"] + loglevel=DEBUG + logger-startup-timeout=30s to the embedded Resources/akka.conf. This bridges the DriverHostActor's Akka ILoggingAdapter into the Serilog file sink.
  • Prod deploy FAILED TWICE (both auto-rolled-back; service is UP + 4840 listening the whole time after each):
    1. Cluster.dll + Akka.Logger.Serilog.dll swap → crash FileNotFoundException: Could not load 'Akka.Logger.Serilog 1.5.60' from Serilog.Settings.Configuration.ConfigurationReader.LoadConfigurationAssemblies (the deployed OtOpcUa.Host.deps.json didn't list the new assembly; Serilog's *.Serilog config-assembly scan then fails fatally).
    2. Added the updated OtOpcUa.Host.deps.json → crash "Could not resolve CoreCLR path." ⇒ the box deployment is SELF-CONTAINED (bundles its own runtime), so a dotnet build (framework-dependent) deps.json breaks the apphost's runtime resolution.
  • Conclusion: adding a NEW assembly to this box requires a full self-contained publish-overlay (match the box layout; preserve appsettings*/data\), NOT a DLL/deps.json swap. That's a heavy/risky prod op for what is diagnostic observability of the (separate, pre-existing) driver-host re-spawn issue.
  • Prod state now (verified healthy): FOCAS I/O-fix DLL still deployed (228864 B — untouched by these rollbacks), Cluster.dll + deps.json restored to baseline, Akka.Logger.Serilog.dll renamed .disabled, service Running + OPC 4840 listening. Deploy-API key + Debug-Serilog env still set.
  • RECOMMENDED PIVOT: the cheap, no-prod-change way to validate the I/O fix is to force a driver re-spawn via a benign config rev-bump (the leading hypothesis); and the proper driver-host re-spawn fix + the Akka-observability are best developed in a local docker-dev repro (safe, full logging, one clean publish-overlay deploy at the end) rather than iterated on the production CNC node. (User chose: move proper fix to local docker-dev.)

🎯 ROOT CAUSE of the driver-not-spawning — STALE DEPLOYED BINARY (CONFIRMED 2026-06-26)

The DriverHostActor's bootstrap-recovery already re-spawns drivers in the current source: BootstrapRecover case Applied: logs "recovered Applied state at rev" and calls RestoreApplied() (DriverHostActor.cs:449-457), which does ReconcileDrivers (re-spawn) + RebuildAddressSpace + PushDesiredSubscriptions (:910-918) — comment: "the in-memory driver children + OPC UA address space were lost on restart … re-spawn … instead of waiting for a config change (whose identical-config revision would no-op)." Added by b1b3f3ff (2026-06-06) + 397f9b78 (2026-06-07); both ancestors of feat/focas-pdu-v3; covered by existing Restore_on_bootstrap_* tests.

  • The DEPLOYED wonder Runtime.dll is the June-16 base install (mtime 2026-06-16 13:05:46; only the FOCAS driver DLL was ever swapped). Binary string scan: RestoreApplied/ReconcileDrivers method names PRESENT, but the bootstrap-restore log strings "recovered Applied state at rev" + "restored served state for applied deployment" are ABSENT ⇒ the deployed binary predates b1b3f3ff and does NOT restore served state / re-spawn drivers on bootstrap. That is exactly why, on this stable-config box, a restart leaves the driver side dark (and why only config-authoring re-deploys ever lit it up).
  • ⇒ No new driver-host fix is needed — it already exists in source. The proper remedy = deploy the current Host (full self-contained publish-overlay onto E:\ApiInstall\OtOpcUa, preserving appsettings*/data\), which in one shot delivers: (1) RestoreApplied-on-bootstrap (drivers re-spawn after restart), (2) the FOCAS I/O serialization fix (symptom #1), (3) the Akka→Serilog bridge (observability). The earlier DLL-swap crashes were purely the self-contained-layout mismatch — a full publish-overlay is internally consistent and avoids them.
  • PLAN: confirm the current Host build boots cleanly + re-spawns drivers on restart in docker-dev (safe), then do ONE self-contained publish-overlay to wonder + verify tags leave 0x80320000.

docker-dev confirmation (2026-06-26) — current build re-spawns on bootstrap + Akka bridge works

Built the current source into the docker-dev image (otopcua-host:dev), booted central-1 (fused admin+driver, like wonder). Results:

  • Boots cleanly with Akka.Logger.Serilog (the prod DLL-swap crashes were purely the self-contained-layout mismatch, now confirmed).
  • Akka→Serilog bridge works — but needed TWO fixes beyond the package: (1) Akka.Hosting owns logger setup so HOCON akka.loggers is ignored → wire via ConfigureLoggers(setup => { setup.LogLevel=DebugLevel; setup.ClearLoggers(); setup.AddLogger<SerilogLogger>(); }) in WithOtOpcUaClusterBootstrap; (2) AddZbSerilog registers Serilog as the MEL provider but does NOT set the static Serilog.Log.Logger (which Akka.Logger.Serilog writes to, AND which the Program.cs startup banner uses) → set Serilog.Log.Logger = app.Services.GetRequiredService<Serilog.ILogger>() in Program.cs right after Build(). With both, the startup banner + full Akka cluster/DriverHost logs now emit. (The Log.Logger-unset gap is a latent bug in the shared AddZbSerilog lib affecting all 3 apps' static-Log calls — worth a follow-up there.)
  • RestoreApplied re-spawns drivers on bootstrap — CONFIRMED live: DriverHost central-1: recovered Applied state at rev …spawned GalaxyMxGateway/Modbus/OpcUaClient driver (×3) → SubscribeBulk pushed 5 references across 3 driver(s)restored served state for applied deployment … on bootstrapDriverInstance …: subscribed to N refs. These are the exact strings ABSENT from the stale wonder binary. ⇒ deploying the current Host fixes the wonder driver-not-spawning blocker.

Deploy plan (self-contained publish-overlay → wonder)

  • Local: dotnet publish …Host.csproj -c Release -r win-x64 --self-contained true -p:PublishSingleFile=false (self-contained = brings its own runtime; no version-match concern with the box).
  • Zip + SFTP to win64. Box overlay (self-healing): full app-dir backup → robocopy publish over E:\ApiInstall\OtOpcUa EXCLUDING pki\ (OPC server cert) + appsettings*.json → start → verify (4840 + tags) → AUTO-ROLLBACK from backup if unhealthy. Env (deploy key, Debug logging) is registry-side, untouched by the file overlay.
  • Then revert the Debug-Serilog env (now superseded; verbose) and verify parts-count/parts-required read Good (or recoverable BadComm), and that the FixedTree/driver values flow.

SYMPTOM #1 FIXED + LIVE-VALIDATED ON WONDER (2026-06-26)

After the self-contained overlay (current Host) + two light single-DLL FOCAS swaps, the validation revealed — and each fix peeled back — a cascade of latent FOCAS-config-vs-driver mismatches the stale binary had masked. Final state, all live on wonder:

  • DriverHost …: recovered Applied state … → spawned Focas driver focas-z-34184 (stub=false) → DriverInstance focas-z-34184: connected → subscribed to 2 refstwo ESTABLISHED TCP sessions to 10.201.31.5:8193.
  • read ns=2;s=EQ-3686c0272279/parts-count → Value 0, Status 0x00000000 (Good); parts-required → Good. (Value 0 is correct on the idle machine — status, not magnitude.) The original 0x80320000 is gone.

The complete fix chain (all on fix/focas-poll-io-serialization, deployed):

  1. FOCAS I/O serialization + read timeout (SynchronizedFocasClient) — the original diagnosed root cause: prevents the poll read hanging on the shared socket.
  2. RestoreApplied-on-bootstrap — already in source since b1b3f3ff; the wonder fix was deploying the current Host over the stale June-16 binary so the driver re-spawns on restart.
  3. Akka→Serilog bridge (ConfigureLoggers().AddLogger<SerilogLogger>() + set static Serilog.Log.Logger in Program.cs) — made the driver-host actor observable; this is what surfaced the next two issues.
  4. FlexibleStringConverter on the FOCAS config Series — the AdminUI persists the enum as a number ("series":6); the factory now tolerates number-or-string instead of throwing → stub.
  5. Scheme-less host tolerance in FocasHostAddress.TryParse — the AdminUI persists hostAddress as a bare ip:port; TryParse now accepts it (canonical focas:// unchanged) instead of failing init.
  • FOCAS test suite 247 green; each fix carries a regression test.
  • Follow-up (product quality): the AdminUI authors FOCAS configs (series as number, hostAddress without focas://) that the driver only now tolerates — the AdminUI↔driver config-format mismatch is worth reconciling at the source. Also: the shared AddZbSerilog not setting static Serilog.Log.Logger is a latent gap across all 3 apps. The FixedTree-under-Equipment feature (task #14) is now BUILT (offline-complete; see the 2026-06-26 design + implementation-plan docs above) — live wonder validation pending.

Phase 2 — Get OtOpcUa runtime logs on wonder

Make the Host emit driver-level logs so the data plane is observable. Options (least invasive first): point the service at a Serilog file sink via config/env, or temporarily run with DOTNET_ENVIRONMENT=Development (file sink + dev errors — cf. MxGateway note), or add a console capture. Preserve appsettings*/data\; restore the env after. Then read: did InitializeAsync start the FixedTree loop, does the bootstrap throw (and on which call), is ReadAsync invoked for the equipment tags, what does it return.

  • Access: servecli :2222, key ~/.ssh/servecli_wonder, scratchpad/wonder-ps.sh (base64 PS over the cmd PTY), SFTP root C:\Users\dohertj2\Desktop\win64.
  • Output: the actual runtime behavior of the data poll + equipment read on the box.

Phase 3 — Local reproduction in docker-dev (isolate FOCAS-specific vs general)

Reproduce off the production box: in the docker-dev OtOpcUa, configure an equipment tag bound to a driver and check whether values flow at all. Use a driver with an easy local source (e.g. Modbus against a local sim, or the FOCAS wire client against a v1 responder if one can be stood up). If equipment-tag values flow for another driver locally but not FOCAS → FOCAS-specific; if they don't flow for any → a general equipment-projection/data-plane gap. A local repro gives a full-logging debug loop.

  • Output: scope (FOCAS-only vs general) + a local failing case to fix against.

Phase 4 — Root cause + minimal fix

From Phases 13, fix the smallest thing that makes the equipment tag carry a value and (if H2 is not by-design) the FixedTree surface. Likely shapes: a swallowed bootstrap exception; a seed/poll-group wiring gap; an Equipment-projection that should include FixedTree/driver auto-nodes; or a data-plane gate that needs config on the wonder node.

Phase 5 — Validate

  • Local (docker-dev) green where reproduced; unit/integration tests for the fixed path.
  • Live: re-deploy to wonder, then via the OtOpcUa CLI confirm parts-count/parts-required read Good (value 0 is correct on this idle machine — assert status, not magnitude) and, if applicable, FixedTree Identity/Axes nodes appear with live values. The live Makino helps for FixedTree but is NOT blocking for the value-flow plumbing (macro values are 0 regardless), so this is lower time-pressure than the v3 capture was.

Phase 6 — Docs + commit

Update this plan's status + the deployment doc; commit on a branch (separate from feat/focas-pdu-v3 if the fix is OtOpcUa-core rather than FOCAS-driver). Push per the repo flow when asked.


Notes

  • Execution approach: this is a debugging investigation (unknown root cause) — diagnose before fixing; reproduce before claiming a fix; change the smallest thing. Don't deploy a guessed fix to the production CNC node.
  • Key node IDs / endpoints: equipment ns=2;s=EQ-3686c0272279; tags …/parts-count (MACRO:3901), …/parts-required (MACRO:3902); OPC opc.tcp://wonder-app-vd03.zmr.zimmer.com:4840/OtOpcUa; AdminUI http://wonder-app-vd03.zmr.zimmer.com:9000 (DisableLogin); deploy = AdminUI /deployments → "Deploy current configuration" (Blazo→Akka; no headless API).
  • Reusable tools: scratchpad/focas-status/ (live IFocasClient harness), scratchpad/wonder-ps.sh, scratchpad/deploy-focas-v3.ps1; the OtOpcUa CLI src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI (read/subscribe/browse --recursive).