Files
lmxopcua/docs/plans/2026-06-25-otopcua-equipment-dataplane-investigation.md
T

181 lines
38 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# OtOpcUa equipment-tag data-plane — why live FOCAS values don't surface (investigation plan)
**Date:** 2026-06-25
**Companion to:** [`2026-06-25-focas-pdu-v3-implementation-plan.md`](2026-06-25-focas-pdu-v3-implementation-plan.md) (the FOCAS PDU-v3 driver work — DONE + deployed) and [`../deployments/wonder-app-vd03-makino-z-34184.md`](../deployments/wonder-app-vd03-makino-z-34184.md).
**Goal:** make the deployed Makino tags (`parts-count`/`parts-required` = `MACRO:3901/3902`) actually carry live values over OPC UA, and make the FOCAS FixedTree (Identity/Axes/Timers/…) appear in the served address space.
## The symptom (precise, observed 2026-06-25 after the v3 deploy)
Against `opc.tcp://wonder-app-vd03.zmr.zimmer.com:4840/OtOpcUa` (Security None, anonymous, DisableLogin):
- `read` AND a 30 s `subscribe` of `ns=2;s=EQ-3686c0272279/parts-count` (and `/parts-required`) return **`0x80320000` Bad_WaitingForInitialData** — never a value.
- A recursive browse of the whole served tree shows ONLY `machining → makino → z-34184 → {parts-count, parts-required}`**no FixedTree nodes** (Identity/Axes/Timers/Program/…).
- Unchanged by host-restart, a full AdminUI re-deploy (`12e0d528`, Sealed/In-sync), and a driver `Restart`.
- A ~15 s box-side watch saw **no 250 ms-cadence TCP connection** to `10.201.31.5:8193` — only the occasional probe connect.
## What is RULED OUT (do not re-investigate)
- **The FOCAS PDU-v3 driver.** Proven working: the rewritten `FocasDriverProbe` does a real initiate + `cnc_statinfo` and the AdminUI shows **DRIVER STATUS: HEALTHY, last success Ns ago**; and on the dev box every surface (sysinfo / axes / dynamic / macros / timer / PMC / servo / alarms) reads correctly through the same `WireFocasClient`. The wire path talks v3 to the Makino.
- **Deployment / Akka roles.** Re-deploy sealed with no "task canceled"; the node has admin+driver roles; the address space (the 2 UNS tags) is served.
- **Reachability.** TCP 8193 reachable from the box; OPC 4840 reachable from the dev box.
So the problem is the OtOpcUa **data plane**: the driver's *control/probe plane* runs, but its *data plane* (the FixedTree poll loop output + the equipment-tag value source) is not reaching the served address space. This was present last session too — it was masked by the v3 reject; with v3 now working it stands alone.
## Leading hypotheses (to confirm/refute, not assume)
- **H1 — FixedTree bootstrap is stuck/throwing.** `FocasDriver.FixedTreeLoopAsync` is started at `FocasDriver.cs:160` inside `InitializeAsync` (which demonstrably ran — the probe loop at `:137` is alive). Its bootstrap `while (state.FixedTreeCache is null)` loop (~`:650`) calls `GetSysInfoAsync`/`GetAxisNamesAsync`/optional probes; if one throws it retries every 2 s and the cache never populates → no FixedTree nodes, no poll. The brief 2 s-spaced connects could explain "no steady 250 ms connection."
- **H2 — FixedTree nodes aren't projected under the Equipment model.** Galaxy/FOCAS are now *standard Equipment-kind drivers*; the served tree is the UNS/Equipment projection (configured tags only). Driver `DiscoverAsync` FixedTree auto-nodes may only live in a raw-driver namespace that the UNS browse doesn't show — i.e. FixedTree-under-Equipment may be unsupported-by-design, not a bug.
- **H3 — equipment-tag value source never primed.** A point-read returns the server's cached value; with no active poll/subscription seeding it, that's `WaitingForInitialData`. But a 30 s `subscribe` ALSO stayed bad — so either the monitored-item → `FocasDriver.ReadAsync` wiring isn't firing, or `ReadAsync(MACRO:3901)` returns no value on the box. (Analogy: the ScadaBridge DCL seed-ordering / static-tag false-bad class of bug — see that memory.)
- **H4 — a data-plane gate.** A role/active-node gate (cf. "data-plane roles need `Security:Ldap:GroupToRole` mapped or write/ack is inert") could suppress the read/poll on this single fused node under DisableLogin.
## ⚠️ Observability is the bottleneck
There are currently **no OtOpcUa app logs on the box** (Windows eventlog shows only NSSM start/stop; Serilog has no file sink in this deployment — cf. the MxGateway windev note). **Getting runtime logs is the single highest-leverage step** and gates H1/H3/H4.
---
## Phase 1 — Offline code trace of the data plane (do FIRST; free, no box/CNC)
Trace, in the repo on the dev box, the two paths end-to-end. This may reveal the root cause with zero box access.
1. **Equipment-tag read path:** from an OPC UA `Read`/`CreateMonitoredItem` on `ns=2;s=EQ-…/parts-count` → the server's address-space node → the "FullName→NodeId router" → `FocasDriver.ReadAsync`. Find: who invokes the driver read, whether point-reads vs subscriptions differ, and where/whether a value is cached vs `WaitingForInitialData` is returned. Entry points: `FocasDriver.cs:247` (resolver-produced equipment tags), `FocasDriver.ReadAsync`, and the server-side poll-group / monitored-item sampler (grep `PollGroup`, `MonitoredItem`, the Equipment namespace value provider).
2. **FixedTree path:** `FocasDriver.FixedTreeLoopAsync` (`:644`) → `state.FixedTreeCache``TryReadFixedTree` → how those nodes are added to the address space. Determine whether FixedTree nodes are emitted into the **Equipment/UNS projection** at all, or only a raw-driver namespace (resolves H2). Grep `DiscoverAsync`, `FixedTreeReference`, and the address-space builder for the Equipment kind.
3. **Driver lifecycle/gates:** confirm what conditions start/suppress the data poll vs the probe loop (active-node, role, enabled flags) — `FocasDriver.cs:87181`, the cluster/active-node gate.
- **Output:** a written root-cause hypothesis (which of H1H4) + the specific code site, OR a precise list of what only box logs can disambiguate.
### ✅ Phase 1 FINDINGS (2026-06-25, offline subagent code-trace — DONE)
| Hypothesis | Verdict | Confidence |
|---|---|---|
| **H2** — FixedTree nodes not projected under Equipment model | **CONFIRMED root cause of symptom #2 — by-design** | proven from code |
| **H3** — equipment-tag value source never primed | **CONFIRMED mechanism of symptom #1**; routing code is self-consistent → break is **runtime**, not static | mechanism proven; exact break needs box logs |
| **H4** — role/active-node data-plane gate | **REFUTED** for the read/subscribe path | proven from code |
| **H1** — FixedTree bootstrap stuck/throwing | best fit for symptom #3 (no 250 ms connect), but **moot** for #2 and **not** the cause of #1 | strongly-suggested; needs logs |
| **H5 (new)** — FOCAS tag with blank `deviceHostAddress``BadNodeIdUnknown` | **latent trap** — yields a *Bad* value, not WaitingForInitialData, so it's the *next* failure, not the current symptom | proven from code |
**Why symptom #2 (no FixedTree nodes):** FixedTree nodes are emitted **only** by `FocasDriver.DiscoverAsync` (`FocasDriver.cs:391-494`, gated on `FixedTreeCache`), and `DiscoverAsync` is **never invoked** by the deployed equipment-kind address-space build. Production materializes the served tree **purely from Config-DB Tag rows** (`EquipmentNodeWalker.cs:24-25,161-165`; `OpcUaPublishActor.RebuildAddressSpace``MaterialiseEquipmentTags(composition)` at `OpcUaPublishActor.cs:326-337`). No runtime caller of `IDriver.DiscoverAsync` exists. ⇒ **By-design gap, not a bug.** Surfacing FixedTree under the Equipment tree is a *feature* (author explicit Tag rows per signal, or extend the composition pipeline to merge `ITagDiscovery` output — a substantial `AddressSpaceComposer`/`AddressSpaceApplier` change).
**Why symptom #1 (`0x80320000` even under a 30 s subscribe):** equipment-tag node value is **push-only**. `0x80320000` is the materialization **seed** (`OtOpcUaNodeManager.EnsureVariable:1400-1402`); there is **no `OnReadValue`/server-side sampler** for equipment tags, so both Read and MonitoredItem serve the cached node value. The seed is overwritten **only** by `ForwardToMux → WriteValue` (`OtOpcUaNodeManager.cs:261-281`), which writes even Bad/null. `StatusFromQuality` maps Bad → `0x80000000`, **not** `0x80320000`. ⇒ A persistent `0x80320000` **proves no `AttributeValueUpdate` ever landed** for those two NodeIds. The routing chain (`DriverHostActor.PushDesiredSubscriptions``_nodeIdByDriverRef``DriverInstanceActor.ResubscribeDesired``PollGroupEngine` forceRaise → `OnDataChangeForward``ForwardToMux`) is **provably self-consistent for FOCAS** (`FocasTagConfigModel` writes no top-level `FullName``ExtractTagFullName` returns the raw JSON blob → `FocasEquipmentTagParser.TryParse` resolves it; subscribed ref == published ref == routing-map key == NodeId, all from the same `t.FullName`). So the static mapping is **refuted**; the break is **runtime**: either the FOCAS child was never subscribed those refs, or `_nodeIdByDriverRef` lacked the `(DriverInstanceId, FullName)` key at publish time — most likely a **`DriverInstanceId` attribution mismatch** between the equipment Tag rows and the spawned `DriverInstance`.
**What only box logs (or a ConfigDb snapshot read) can disambiguate for symptom #1 — capture these on the driver-role node after a deploy:**
1. `DriverHostActor` info *"SubscribeBulk pushed {Refs} references across {Drivers} driver(s)"* (`DriverHostActor.cs:1058`) — `Refs<2` ⇒ equipment Tag `DriverInstanceId` ≠ spawned FOCAS child id (attribution mismatch).
2. `DriverInstanceActor` info *"subscribed to {Count} refs"* (`DriverInstanceActor.cs:571`) — absent/`0``_desiredRefs` empty or never reached `Connected`.
3. `DriverHostActor` debug *"no equipment-tag NodeId for ({Driver},{Ref}) — value dropped"* (`DriverHostActor.cs:550`) — present ⇒ `_nodeIdByDriverRef` key miss.
4. `FocasDriver` debug *"FOCAS fixed-tree bootstrap failed … retrying"* every ~2 s (`FocasDriver.cs:686`) ⇒ confirms H1 (`cnc_sysinfo`/`cnc_rdaxisname` failing on the Makino). Needs Debug level + logger-constructed wire client.
5. A driver-side read of `MACRO:3901` (live Makino): `BadNodeIdUnknown` ⇒ H5 blank-`deviceHostAddress` trap; Good value ⇒ wire path fine, gap is purely the OPC-UA equipment subscription wiring.
**Cheaper-than-Phase-2 lead:** the prime suspect (Tag-row `DriverInstanceId` vs `DriverInstance` row mismatch) is checkable by reading the **deployed ConfigDb snapshot** (`OtOpcUaConfig` on `wonder-sql-vd03:1433`) — no service restart, no log-sink change. Worth doing before/instead of the Phase 2 log-sink change.
**Suggested minimal-fix shapes (NOT implemented):**
- Symptom #1: depends on which log line/row fires — if attribution miss, it's a **config/snapshot re-bind** (no code). If a real wiring gap, localize first.
- H5 latent: in `FocasDriver.ReadAsync`/`WriteAsync` device lookup (`:289`,`:346`), when `DeviceHostAddress` is blank and `_devices.Count==1`, resolve to the sole device instead of `BadNodeIdUnknown` (or make the AdminUI field mandatory).
- Symptom #2 (H2): no fix — by-design; surfacing FixedTree is a feature decision.
### ✅ ConfigDb snapshot read (2026-06-25, symptom #1 path — chosen over box logs)
Read live `OtOpcUaConfig` on `wonder-sql-vd03` (query run on-box so the SQL password never left the box). Ground truth:
- **Equipment** `EQ-3686c0272279` (`z-34184`): `DriverInstanceId=focas-z-34184`, `DeviceId=NULL`, `Enabled=True`, `UnsLineId=makino`.
- **Tags** `parts-count`/`parts-required`: both `DriverInstanceId=focas-z-34184`, `DeviceId=NULL`, `DataType=Double`, `AccessLevel=Read`, `FolderPath=NULL`, **`PollGroupId=NULL`**, `TagConfig={"deviceHostAddress":"10.201.31.5:8193","address":"MACRO:390{1,2}","dataType":"Float64"}`.
- **DriverInstance**: one row `focas-z-34184` (`DriverType=Focas`, `Enabled=True`, ns `dev-equipment`, cluster `DEV`).
- **Device** table: **0 rows** — NORMAL: `FocasDriver` builds `_devices` from `_options.Devices` (DriverConfig JSON) not the Device table (`FocasDriver.cs:97`).
**Refuted by this read:** prime-suspect `DriverInstanceId` attribution mismatch (matches exactly) and H5 blank-`deviceHostAddress` (present). The deployed config is **clean**. ⇒ symptom #1 is a pure value-flow-plumbing break. New live leads: **`PollGroupId=NULL`** on both tags (is a poll group required to subscribe/poll?) and the **resolver-registration** path (equipment-tag refs are "resolver-produced, not seeded at `InitializeAsync`" per `FocasDriver.cs:247` — does poll-time `TryResolve` of the JSON-blob ref ever succeed?). A second offline subagent trace of DriverHostActor↔DriverInstanceActor↔PollGroupEngine↔resolver is running to pin the exact broken link.
### ✅ FixedTree feature (symptom #2) — BUILT 2026-06-26 (architecture reality below)
Mapped the composition pipeline. Two address-space paths exist: (1) **Equipment/UNS projection** `AddressSpaceComposer.Compose` (config entities only) → `AddressSpaceApplier.MaterialiseEquipmentTags` → the served `ns=2` tree where `EQ-…` lives; (2) **raw-driver namespace** `GenericDriverNodeManager.BuildAddressSpaceAsync``driver.DiscoverAsync(IAddressSpaceBuilder)`. **Path 2 is DEAD: `BuildAddressSpaceAsync` has no runtime caller and `OpcUaApplicationHost.PopulateAddressSpaces` (its referenced caller) no longer exists.** Even `GalaxyDriver.DiscoverAsync` (`:588`) is reachable only via that dead path — Galaxy surfaces its hierarchy by being **authored as config equipment/tags**, not via discovery. ⇒ In the current Equipment-kind model **every served node is config-driven; `ITagDiscovery`/`DiscoverAsync` is legacy/dead for serving.** So "build the FixedTree feature" is NOT re-wiring an existing path — it's a **new dynamic-node-injection capability** into the Equipment projection, and it must solve a **timing problem**: composition runs at deploy/apply (before the driver connects), but FixedTree data only exists after the driver's async `FixedTreeCache` bootstrap. The far cheaper alternative that yields the same visible result is to **author FixedTree signals as config Tag rows** (each bound to a FOCAS fixed-tree reference) — same mechanism every other equipment tag uses. (The user chose to **build the dynamic feature** over the config-rows alternative.)
**✅ BUILT (2026-06-26).** Implemented as a generic **post-connect `ITagDiscovery` injection pipeline**: when a driver reaches `Connected`, `DriverInstanceActor` runs bounded re-discovery into a capturing `IAddressSpaceBuilder` and ships `DiscoveredNodesReady` to `DriverHostActor`, which maps the nodes under the equipment (`EQ-…/FOCAS/…`, read-only), extends the `_nodeIdByDriverRef` routing map, and tells `OpcUaPublishActor` to incrementally materialise them — reusing the existing materialize→subscribe→poll→push pipeline (no full rebuild). Survives redeploys (re-applied at the tail of `PushDesiredSubscriptions`) and restarts (re-discovered on reconnect). Design: [`2026-06-26-otopcua-fixedtree-equipment-injection-design.md`](2026-06-26-otopcua-fixedtree-equipment-injection-design.md); implementation plan (11 bite-sized tasks, all green): [`2026-06-26-otopcua-fixedtree-equipment-injection.md`](2026-06-26-otopcua-fixedtree-equipment-injection.md). **Offline-complete** on branch `feat/focas-fixedtree-equipment-injection` (solution build 0 errors / 0 warnings; Runtime.Tests 312, OpcUaServer.Tests 304, FOCAS 247 + an end-to-end injection+value-flow test, all green). The review chain caught + fixed three real defects (a `DriverDataType→OPC-UA-type` string mismatch, a `Server.ReportEvent`-under-lock deadlock, and a `ConfigureAwait(false)` off-actor-context crash for async drivers). **Live wonder validation pending** (deploy the current host + browse `ns=2;s=EQ-3686c0272279/FOCAS/Identity/SeriesNumber`, `…/FOCAS/Axes/X/AbsolutePosition`).
### 🎯 ROOT CAUSE — symptom #1 (CONFIRMED, 2026-06-25, 2nd subagent trace + code verify)
**The FOCAS poll read hangs forever because (1) all wire I/O for a device shares one socket with NO serialization, and (2) the steady-state read has NO timeout.**
- **Unsynchronized shared socket:** `FocasDriver.EnsureConnectedAsync` (`:1101-1128`) returns the single `device.Client` with no I/O mutex. `DeviceState.ProbeLock` (`:1172`) only synchronizes probes with each other. Four independent loops issue wire ops on that same socket concurrently: the equipment poll (`PollGroupEngine`, 1 s — `SubscriptionPublishingInterval` const `DriverHostActor.cs:58`), the FixedTree loop (250 ms / 2 s bootstrap), the probe (5 s), the recycle loop. FOCAS/2 is strict request-response on one socket → concurrent unsynchronized reads interleave; one reader consumes another's response PDU and the victim read blocks waiting for bytes that never come.
- **No read timeout:** the poll calls `FocasDriver.ReadAsync` (`:295-308`) → `WireFocasClient.ReadAsync` (`:417`) → `ReadMacroAsync` **without** the optional `timeout``FocasWireClient.CreateCallTimeout` (`:843-848`) skips `CancelAfter` when `timeout` is null → `ReadExpectedPduAsync` awaits the socket under a token that never fires. The blocked first `forceRaise` poll (`PollGroupEngine.cs:119`) never completes → `onChange` never fires → the `BadWaitingForInitialData` materialization seed (`OtOpcUaNodeManager.cs:1400`) is never overwritten → permanent `0x80320000`. (Connect `:1119` and probe `:604-610` ARE bounded — only the steady-state read isn't.)
**Why it works on the dev box but not deployed:** the dev-box harness/CLI does ONE read at a time with no FixedTree loop running → no socket collision → macros read fine (proven in the v3 work). Deployed, the FixedTree loop races the equipment poll on one socket → collision → hung read.
**Why both tags fail together:** they're in one driver's single 1 s subscription batch; `PollOnceAsync` reads them sequentially and hangs on the first, so neither reaches `onChange`.
**Ruled out:** `PollGroupId=NULL` is a red herring — verified at `ConfigComposer.cs:38` (snapshots all tags, no poll-group filter), `AddressSpaceComposer.Compose` (never reads `PollGroupId`), and `DriverHostActor.PushDesiredSubscriptions:965-973` (groups by `DriverInstanceId`, constant 1 s interval, no poll-group keying). Routing-key mismatch ruled out: subscribe ref == routing key == `onChange` ref (all from the same `FullName`). Not-subscribed ruled out by elimination: a *served* node implies it's in the same `ParseComposition` artifact used for subscribe; and every *deterministic* `ReadAsync` outcome pushes a specific Bad code (`BadNodeIdUnknown`/`BadCommunicationError`/`BadNotSupported`), none of which is `0x80320000` — only a hung (never-returning) read leaves the seed intact.
**PROPOSED FIX (two parts, FOCAS-driver-only, single managed DLL, no migration/proto change — same low-blast-radius deploy path as the v3 work):**
1. **Serialize per-device wire I/O** — an async gate (`SemaphoreSlim`) on `DeviceState` held around each `EnsureConnected + wire op` so the equipment poll, FixedTree loop, probe, and recycle never collide on the one socket. *This is what makes values actually read Good.* MUST be paired with #2 (a lock around an unbounded read would deadlock all I/O).
2. **Bound every steady-state wire read/write** with `_options.Timeout` (mirror the probe's linked CTS at `:604-610`, or thread `timeout` through `WireFocasClient.ReadAsync``ReadMacroAsync`). Converts any stall into a recoverable `BadCommunicationError` push (overwrites the seed, downgrades health → observable + self-healing) instead of permanent silence.
**Confidence:** the two structural gaps are *proven from code*; that their interaction is the active trigger is *strongly-suggested* (works single-threaded on dev box, hangs deployed; fits every observation). Definitive live confirmation = the fix makes the tags leave `0x80320000` (Good, or recoverable BadComm if the macro read genuinely fails). **This is a debugging fix headed for a production CNC node — implement + unit-test (serialization + read-timeout) locally, then deploy to wonder and verify, per the diagnose-before-deploy discipline.**
### Implementation + deploy status (2026-06-26)
- **Fix implemented** on branch `fix/focas-poll-io-serialization` (off `feat/focas-pdu-v3`): new `SynchronizedFocasClient` decorator (per-device `SemaphoreSlim` gate + per-call `_options.Timeout`) wired into `FocasDriver.EnsureConnectedAsync`; `ReadAsync`/`WriteAsync` now map a per-call timeout (OCE while caller token live) → `BadCommunicationError` instead of rethrowing. **243 FOCAS tests green** (8 new in `FocasIoSerializationTests.cs` + 1 sibling in `FocasReadWriteTests`), full solution builds 0 errors.
- **Live baseline re-confirmed** via OtOpcUa CLI from the Mac: both tags `[80320000]`.
- **Deployed** the Release DLL to wonder (single managed-DLL swap; v3 DLL 222208 B → fix DLL 228864 B; backup `E:\ApiInstall\OtOpcUa\_focasbak-pre-iofix-20260626T041913.dll`; service Running, OPC 4840 listening).
- **BLOCKER for verification:** after the swap+restart the FOCAS driver is **not connecting at all** (zero TCP to `10.201.31.5:8193` at 150 s uptime — not even a probe connect), so tags still `[80320000]`.
- **Deploy API enabled** (user-approved): added `Security__DeployApiKey` (64-char generated, value on-box only) to the `OtOpcUaHost` service `Environment` REG_MULTI_SZ + restart; env backed up to `_envbak-pre-deploykey-20260626T043001.txt`. Headless `POST http://localhost:9000/api/deployments` (key read from registry on-box) returned **HTTP 202 Accepted**, sealed deployment `ef384b04…`.
- **Tags STILL `[80320000]`** after the sealed deploy. **Root of the live blocker (from box logs):** the service writes Serilog to `C:\Windows\System32\logs\otopcua-<date>.log` (relative-path-from-service-CWD bug — pending.md D2). Today's log shows the **admin/publish side only**: `Phase7Applier: hierarchy materialised (areas=1,lines=1,equipment=1)` + `equipment tags materialised (tags=2)` at each restart (04:19, 04:30) — which is why the 2 tags are served with the `0x80320000` seed. But across the **entire** day's log there are **ZERO** driver-side lines: `DriverHost`=0, `SubscribeBulk`=0, `subscribed to`=0, `Focas`=0, `GenericDriver`=0 (the 13 `DriverInstance` hits are all SQL `FROM [DriverInstance]`). **⇒ the DriverHostActor / driver-role side is not spawning the FOCAS driver at all on this node today**, so there is no poll, no value push, permanent seed. Health: `/healthz` Healthy, Akka member **Up**, `admin-leader` **Active**; roles env intact (`Cluster__Roles__0=admin`, `__1=driver`, seed=self:4053); `DOTNET_ENVIRONMENT=Production`. The admin half is fully alive; only the driver half is silent.
- **This is a SEPARATE blocker from the I/O fix** (which is correct, unit-tested, and deployed — but cannot be exercised until the driver actually polls).
### Driver-host activation diagnosis (2026-06-26, Debug-logging pass + code reads)
- **Debug Serilog enabled** (env `Serilog__MinimumLevel__Default=Debug` + EF/Microsoft/System→Warning; env bak `_envbak-pre-debug-*.txt`). Revealed: **NO exceptions / load errors anywhere** (my fix DLL is NOT the cause), but also **ZERO Akka lines** in Serilog (`Akka`/`Member`/`akka.tcp`/`singleton` = 0).
- **`DriverHostActor` logs via `Context.GetLogger()` (Akka `ILoggingAdapter`), and there is NO `Akka.Logger.Serilog` bridge anywhere in the app** (verified across `*.cs`/`*.csproj`/HOCON). So Akka actor logs go to the default console logger → **discarded for a Windows service → invisible in BOTH Production AND Development.** `appsettings.Development.json` only changes Serilog min-levels (+ `Ldap:DevStubMode=true`); it adds no Akka bridge. ⇒ **`DOTNET_ENVIRONMENT=Development` would NOT surface the driver host** (held off — futile + flips DevStubMode on prod).
- **The `DriverHostActor` IS instantiated** (unconditionally, same `system.ActorOf` block as the working `OpcUaPublishActor` at `Runtime/ServiceCollectionExtensions.cs:~232`). So it runs — it just isn't producing a connected FOCAS driver, and logs via Akka so the reason is invisible.
- **STRONG HYPOTHESIS (explains everything):** on bootstrap the `DriverHostActor` *recovers* its persisted state (its own logs name `entering Steady` / `recovered Applied state at rev {Rev}`), but a process restart has already killed the live `DriverInstanceActor`s + their FOCAS sockets — and **recovery does NOT re-instantiate them**. Drivers re-spawn only on a deployment dispatch whose **revision changes**. This config is now **stable** (`revisionHash 924b59097eba…` is identical across `0c2db588`/`ef384b04`/`a2a84646` — every `POST /api/deployments` returns 202 but with the SAME rev), so repeated deploys are no-ops for the driver host → after any restart there are **no live drivers** and nothing re-spawns them → zero Makino TCP → tags frozen at the `0x80320000` materialization seed. The **prior session got HEALTHY because the config was being *authored*** (each deploy had a NEW rev → forced a spawn); once authoring stopped, a restart leaves the driver side dark. ⇒ likely the real reason "host-restart didn't change it" in the original symptom.
- **Implication:** to exercise the I/O fix live, force a driver re-spawn by bumping the revision (e.g. toggle the FOCAS DriverInstance `Enabled` off→on, or any benign config edit, then deploy) — reversible. Separately, "restart should re-spawn applied drivers" + "Akka logs should bridge to Serilog" are real OtOpcUa robustness/observability gaps (the latter = the observability bottleneck this plan flagged).
- **Box env mutations still in place (to revert when done):** `Security__DeployApiKey` (deploy API), `Serilog__MinimumLevel__*` (Debug). Backups: `_envbak-pre-deploykey-*`, `_envbak-pre-debug-*`.
### Akka→Serilog observability attempt (2026-06-26) — code done, prod deploy BLOCKED by self-contained layout
- **Code change (correct, builds 0 errors):** added `Akka.Logger.Serilog` 1.5.60 (deps all satisfied by existing pins — Akka 1.5.62 / Serilog 4.3.1) to `Directory.Packages.props` + the Cluster csproj, and `loggers=["Akka.Logger.Serilog.SerilogLogger, …"]` + `loglevel=DEBUG` + `logger-startup-timeout=30s` to the embedded `Resources/akka.conf`. This bridges the DriverHostActor's Akka `ILoggingAdapter` into the Serilog file sink.
- **Prod deploy FAILED TWICE (both auto-rolled-back; service is UP + 4840 listening the whole time after each):**
1. Cluster.dll + Akka.Logger.Serilog.dll swap → crash `FileNotFoundException: Could not load 'Akka.Logger.Serilog 1.5.60'` from `Serilog.Settings.Configuration.ConfigurationReader.LoadConfigurationAssemblies` (the deployed `OtOpcUa.Host.deps.json` didn't list the new assembly; Serilog's `*.Serilog` config-assembly scan then fails fatally).
2. Added the updated `OtOpcUa.Host.deps.json` → crash **"Could not resolve CoreCLR path."** ⇒ the box deployment is **SELF-CONTAINED** (bundles its own runtime), so a `dotnet build` (framework-dependent) deps.json breaks the apphost's runtime resolution.
- **Conclusion:** adding a NEW assembly to this box requires a **full self-contained publish-overlay** (match the box layout; preserve `appsettings*`/`data\`), NOT a DLL/deps.json swap. That's a heavy/risky prod op for what is *diagnostic* observability of the (separate, pre-existing) driver-host re-spawn issue.
- **Prod state now (verified healthy):** FOCAS **I/O-fix DLL still deployed** (228864 B — untouched by these rollbacks), Cluster.dll + deps.json restored to baseline, `Akka.Logger.Serilog.dll` renamed `.disabled`, service Running + OPC 4840 listening. Deploy-API key + Debug-Serilog env still set.
- **RECOMMENDED PIVOT:** the cheap, no-prod-change way to validate the I/O fix is to **force a driver re-spawn via a benign config rev-bump** (the leading hypothesis); and the *proper* driver-host re-spawn fix + the Akka-observability are best developed in a **local docker-dev repro** (safe, full logging, one clean publish-overlay deploy at the end) rather than iterated on the production CNC node. **(User chose: move proper fix to local docker-dev.)**
### 🎯 ROOT CAUSE of the driver-not-spawning — STALE DEPLOYED BINARY (CONFIRMED 2026-06-26)
The DriverHostActor's bootstrap-recovery **already re-spawns drivers in the current source**: `BootstrapRecover` `case Applied:` logs `"recovered Applied state at rev"` and calls `RestoreApplied()` (`DriverHostActor.cs:449-457`), which does `ReconcileDrivers` (re-spawn) + `RebuildAddressSpace` + `PushDesiredSubscriptions` (`:910-918`) — comment: *"the in-memory driver children + OPC UA address space were lost on restart … re-spawn … instead of waiting for a config change (whose identical-config revision would no-op)."* Added by `b1b3f3ff` (2026-06-06) + `397f9b78` (2026-06-07); both ancestors of `feat/focas-pdu-v3`; covered by existing `Restore_on_bootstrap_*` tests.
- **The DEPLOYED wonder `Runtime.dll` is the June-16 base install** (`mtime 2026-06-16 13:05:46`; only the FOCAS *driver* DLL was ever swapped). Binary string scan: `RestoreApplied`/`ReconcileDrivers` method names PRESENT, but the bootstrap-restore log strings `"recovered Applied state at rev"` + `"restored served state for applied deployment"` are **ABSENT** ⇒ the deployed binary **predates b1b3f3ff** and does **NOT** restore served state / re-spawn drivers on bootstrap. **That is exactly why, on this stable-config box, a restart leaves the driver side dark** (and why only config-authoring re-deploys ever lit it up).
- **⇒ No new driver-host fix is needed — it already exists in source.** The proper remedy = **deploy the current Host** (full self-contained publish-overlay onto `E:\ApiInstall\OtOpcUa`, preserving `appsettings*`/`data\`), which in one shot delivers: (1) `RestoreApplied`-on-bootstrap (drivers re-spawn after restart), (2) the FOCAS I/O serialization fix (symptom #1), (3) the Akka→Serilog bridge (observability). The earlier DLL-swap crashes were purely the self-contained-layout mismatch — a full publish-overlay is internally consistent and avoids them.
- **PLAN:** confirm the current Host build boots cleanly + re-spawns drivers on restart in **docker-dev** (safe), then do ONE self-contained publish-overlay to wonder + verify tags leave `0x80320000`.
### ✅ docker-dev confirmation (2026-06-26) — current build re-spawns on bootstrap + Akka bridge works
Built the current source into the docker-dev image (`otopcua-host:dev`), booted `central-1` (fused admin+driver, like wonder). Results:
- **Boots cleanly** with `Akka.Logger.Serilog` (the prod DLL-swap crashes were purely the self-contained-layout mismatch, now confirmed).
- **Akka→Serilog bridge works** — but needed TWO fixes beyond the package: (1) Akka.Hosting owns logger setup so HOCON `akka.loggers` is ignored → wire via `ConfigureLoggers(setup => { setup.LogLevel=DebugLevel; setup.ClearLoggers(); setup.AddLogger<SerilogLogger>(); })` in `WithOtOpcUaClusterBootstrap`; (2) `AddZbSerilog` registers Serilog as the MEL provider but does NOT set the static `Serilog.Log.Logger` (which `Akka.Logger.Serilog` writes to, AND which the Program.cs startup banner uses) → set `Serilog.Log.Logger = app.Services.GetRequiredService<Serilog.ILogger>()` in Program.cs right after `Build()`. With both, the startup banner + full Akka cluster/DriverHost logs now emit. *(The `Log.Logger`-unset gap is a latent bug in the shared `AddZbSerilog` lib affecting all 3 apps' static-`Log` calls — worth a follow-up there.)*
- **`RestoreApplied` re-spawns drivers on bootstrap — CONFIRMED live:** `DriverHost central-1: recovered Applied state at rev …``spawned GalaxyMxGateway/Modbus/OpcUaClient driver` (×3) → `SubscribeBulk pushed 5 references across 3 driver(s)``restored served state for applied deployment … on bootstrap``DriverInstance …: subscribed to N refs`. These are the exact strings ABSENT from the stale wonder binary. ⇒ deploying the current Host fixes the wonder driver-not-spawning blocker.
### Deploy plan (self-contained publish-overlay → wonder)
- Local: `dotnet publish …Host.csproj -c Release -r win-x64 --self-contained true -p:PublishSingleFile=false` (self-contained = brings its own runtime; no version-match concern with the box).
- Zip + SFTP to `win64`. Box overlay (self-healing): full app-dir backup → robocopy publish over `E:\ApiInstall\OtOpcUa` **EXCLUDING `pki\` (OPC server cert) + `appsettings*.json`** → start → verify (4840 + tags) → AUTO-ROLLBACK from backup if unhealthy. Env (deploy key, Debug logging) is registry-side, untouched by the file overlay.
- Then revert the Debug-Serilog env (now superseded; verbose) and verify `parts-count`/`parts-required` read Good (or recoverable BadComm), and that the FixedTree/driver values flow.
### ✅✅ SYMPTOM #1 FIXED + LIVE-VALIDATED ON WONDER (2026-06-26)
After the self-contained overlay (current Host) + two light single-DLL FOCAS swaps, the validation revealed — and each fix peeled back — a **cascade of latent FOCAS-config-vs-driver mismatches** the stale binary had masked. Final state, all live on wonder:
- `DriverHost …: recovered Applied state … → spawned Focas driver focas-z-34184 (stub=false) → DriverInstance focas-z-34184: connected → subscribed to 2 refs`**two ESTABLISHED TCP sessions to `10.201.31.5:8193`**.
- **`read ns=2;s=EQ-3686c0272279/parts-count` → Value 0, Status `0x00000000` (Good)**; `parts-required` → Good. (Value 0 is correct on the idle machine — status, not magnitude.) The original `0x80320000` is gone.
**The complete fix chain (all on `fix/focas-poll-io-serialization`, deployed):**
1. **FOCAS I/O serialization + read timeout** (`SynchronizedFocasClient`) — the original diagnosed root cause: prevents the poll read hanging on the shared socket.
2. **`RestoreApplied`-on-bootstrap** — already in source since `b1b3f3ff`; the wonder fix was deploying the current Host over the stale June-16 binary so the driver re-spawns on restart.
3. **Akka→Serilog bridge** (`ConfigureLoggers().AddLogger<SerilogLogger>()` + set static `Serilog.Log.Logger` in Program.cs) — made the driver-host actor observable; this is what surfaced the next two issues.
4. **`FlexibleStringConverter`** on the FOCAS config `Series` — the AdminUI persists the enum as a number (`"series":6`); the factory now tolerates number-or-string instead of throwing → stub.
5. **Scheme-less host tolerance** in `FocasHostAddress.TryParse` — the AdminUI persists `hostAddress` as a bare `ip:port`; `TryParse` now accepts it (canonical `focas://` unchanged) instead of failing init.
- FOCAS test suite **247 green**; each fix carries a regression test.
- **Follow-up (product quality):** the AdminUI authors FOCAS configs (`series` as number, `hostAddress` without `focas://`) that the driver only now tolerates — the AdminUI↔driver config-format mismatch is worth reconciling at the source. Also: the shared `AddZbSerilog` not setting static `Serilog.Log.Logger` is a latent gap across all 3 apps. The FixedTree-under-Equipment feature (task #14) is now **BUILT** (offline-complete; see the 2026-06-26 design + implementation-plan docs above) — live wonder validation pending.
## Phase 2 — Get OtOpcUa runtime logs on wonder
Make the Host emit driver-level logs so the data plane is observable. Options (least invasive first): point the service at a Serilog file sink via config/env, or temporarily run with `DOTNET_ENVIRONMENT=Development` (file sink + dev errors — cf. MxGateway note), or add a console capture. Preserve `appsettings*`/`data\`; restore the env after. Then read: did `InitializeAsync` start the FixedTree loop, does the bootstrap throw (and on which call), is `ReadAsync` invoked for the equipment tags, what does it return.
- **Access:** servecli `:2222`, key `~/.ssh/servecli_wonder`, `scratchpad/wonder-ps.sh` (base64 PS over the cmd PTY), SFTP root `C:\Users\dohertj2\Desktop\win64`.
- **Output:** the actual runtime behavior of the data poll + equipment read on the box.
## Phase 3 — Local reproduction in docker-dev (isolate FOCAS-specific vs general)
Reproduce off the production box: in the docker-dev OtOpcUa, configure an equipment tag bound to a driver and check whether values flow at all. Use a driver with an easy local source (e.g. Modbus against a local sim, or the FOCAS wire client against a v1 responder if one can be stood up). If equipment-tag values flow for another driver locally but not FOCAS → FOCAS-specific; if they don't flow for any → a general equipment-projection/data-plane gap. A local repro gives a full-logging debug loop.
- **Output:** scope (FOCAS-only vs general) + a local failing case to fix against.
## Phase 4 — Root cause + minimal fix
From Phases 13, fix the smallest thing that makes the equipment tag carry a value and (if H2 is not by-design) the FixedTree surface. Likely shapes: a swallowed bootstrap exception; a seed/poll-group wiring gap; an Equipment-projection that should include FixedTree/driver auto-nodes; or a data-plane gate that needs config on the wonder node.
## Phase 5 — Validate
- Local (docker-dev) green where reproduced; unit/integration tests for the fixed path.
- Live: re-deploy to wonder, then via the OtOpcUa CLI confirm `parts-count`/`parts-required` read **Good** (value 0 is correct on this idle machine — assert status, not magnitude) and, if applicable, FixedTree Identity/Axes nodes appear with live values. The live Makino helps for FixedTree but is NOT blocking for the value-flow plumbing (macro values are 0 regardless), so this is lower time-pressure than the v3 capture was.
## Phase 6 — Docs + commit
Update this plan's status + the deployment doc; commit on a branch (separate from `feat/focas-pdu-v3` if the fix is OtOpcUa-core rather than FOCAS-driver). Push per the repo flow when asked.
---
## Notes
- **Execution approach:** this is a *debugging* investigation (unknown root cause) — diagnose before fixing; reproduce before claiming a fix; change the smallest thing. Don't deploy a guessed fix to the production CNC node.
- **Key node IDs / endpoints:** equipment `ns=2;s=EQ-3686c0272279`; tags `…/parts-count` (`MACRO:3901`), `…/parts-required` (`MACRO:3902`); OPC `opc.tcp://wonder-app-vd03.zmr.zimmer.com:4840/OtOpcUa`; AdminUI `http://wonder-app-vd03.zmr.zimmer.com:9000` (DisableLogin); deploy = AdminUI `/deployments` → "Deploy current configuration" (Blazo→Akka; no headless API).
- **Reusable tools:** `scratchpad/focas-status/` (live IFocasClient harness), `scratchpad/wonder-ps.sh`, `scratchpad/deploy-focas-v3.ps1`; the OtOpcUa CLI `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` (`read`/`subscribe`/`browse --recursive`).