Compare commits
3 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 74e07225c9 | |||
| 6600ce9940 | |||
| 235b8b8e6d |
@@ -8,6 +8,7 @@
|
|||||||
<PackageVersion Include="Akka.Cluster.Hosting" Version="1.5.62" />
|
<PackageVersion Include="Akka.Cluster.Hosting" Version="1.5.62" />
|
||||||
<PackageVersion Include="Akka.Cluster.Tools" Version="1.5.62" />
|
<PackageVersion Include="Akka.Cluster.Tools" Version="1.5.62" />
|
||||||
<PackageVersion Include="Akka.Hosting" Version="1.5.62" />
|
<PackageVersion Include="Akka.Hosting" Version="1.5.62" />
|
||||||
|
<PackageVersion Include="Akka.Logger.Serilog" Version="1.5.60" />
|
||||||
<PackageVersion Include="Akka.Remote" Version="1.5.62" />
|
<PackageVersion Include="Akka.Remote" Version="1.5.62" />
|
||||||
<PackageVersion Include="Akka.Remote.Hosting" Version="1.5.62" />
|
<PackageVersion Include="Akka.Remote.Hosting" Version="1.5.62" />
|
||||||
<PackageVersion Include="Akka.Streams" Version="1.5.62" />
|
<PackageVersion Include="Akka.Streams" Version="1.5.62" />
|
||||||
|
|||||||
@@ -0,0 +1,178 @@
|
|||||||
|
# OtOpcUa equipment-tag data-plane — why live FOCAS values don't surface (investigation plan)
|
||||||
|
|
||||||
|
**Date:** 2026-06-25
|
||||||
|
**Companion to:** [`2026-06-25-focas-pdu-v3-implementation-plan.md`](2026-06-25-focas-pdu-v3-implementation-plan.md) (the FOCAS PDU-v3 driver work — DONE + deployed) and [`../deployments/wonder-app-vd03-makino-z-34184.md`](../deployments/wonder-app-vd03-makino-z-34184.md).
|
||||||
|
**Goal:** make the deployed Makino tags (`parts-count`/`parts-required` = `MACRO:3901/3902`) actually carry live values over OPC UA, and make the FOCAS FixedTree (Identity/Axes/Timers/…) appear in the served address space.
|
||||||
|
|
||||||
|
## The symptom (precise, observed 2026-06-25 after the v3 deploy)
|
||||||
|
Against `opc.tcp://wonder-app-vd03.zmr.zimmer.com:4840/OtOpcUa` (Security None, anonymous, DisableLogin):
|
||||||
|
- `read` AND a 30 s `subscribe` of `ns=2;s=EQ-3686c0272279/parts-count` (and `/parts-required`) return **`0x80320000` Bad_WaitingForInitialData** — never a value.
|
||||||
|
- A recursive browse of the whole served tree shows ONLY `machining → makino → z-34184 → {parts-count, parts-required}` — **no FixedTree nodes** (Identity/Axes/Timers/Program/…).
|
||||||
|
- Unchanged by host-restart, a full AdminUI re-deploy (`12e0d528`, Sealed/In-sync), and a driver `Restart`.
|
||||||
|
- A ~15 s box-side watch saw **no 250 ms-cadence TCP connection** to `10.201.31.5:8193` — only the occasional probe connect.
|
||||||
|
|
||||||
|
## What is RULED OUT (do not re-investigate)
|
||||||
|
- **The FOCAS PDU-v3 driver.** Proven working: the rewritten `FocasDriverProbe` does a real initiate + `cnc_statinfo` and the AdminUI shows **DRIVER STATUS: HEALTHY, last success Ns ago**; and on the dev box every surface (sysinfo / axes / dynamic / macros / timer / PMC / servo / alarms) reads correctly through the same `WireFocasClient`. The wire path talks v3 to the Makino.
|
||||||
|
- **Deployment / Akka roles.** Re-deploy sealed with no "task canceled"; the node has admin+driver roles; the address space (the 2 UNS tags) is served.
|
||||||
|
- **Reachability.** TCP 8193 reachable from the box; OPC 4840 reachable from the dev box.
|
||||||
|
|
||||||
|
So the problem is the OtOpcUa **data plane**: the driver's *control/probe plane* runs, but its *data plane* (the FixedTree poll loop output + the equipment-tag value source) is not reaching the served address space. This was present last session too — it was masked by the v3 reject; with v3 now working it stands alone.
|
||||||
|
|
||||||
|
## Leading hypotheses (to confirm/refute, not assume)
|
||||||
|
- **H1 — FixedTree bootstrap is stuck/throwing.** `FocasDriver.FixedTreeLoopAsync` is started at `FocasDriver.cs:160` inside `InitializeAsync` (which demonstrably ran — the probe loop at `:137` is alive). Its bootstrap `while (state.FixedTreeCache is null)` loop (~`:650`) calls `GetSysInfoAsync`/`GetAxisNamesAsync`/optional probes; if one throws it retries every 2 s and the cache never populates → no FixedTree nodes, no poll. The brief 2 s-spaced connects could explain "no steady 250 ms connection."
|
||||||
|
- **H2 — FixedTree nodes aren't projected under the Equipment model.** Galaxy/FOCAS are now *standard Equipment-kind drivers*; the served tree is the UNS/Equipment projection (configured tags only). Driver `DiscoverAsync` FixedTree auto-nodes may only live in a raw-driver namespace that the UNS browse doesn't show — i.e. FixedTree-under-Equipment may be unsupported-by-design, not a bug.
|
||||||
|
- **H3 — equipment-tag value source never primed.** A point-read returns the server's cached value; with no active poll/subscription seeding it, that's `WaitingForInitialData`. But a 30 s `subscribe` ALSO stayed bad — so either the monitored-item → `FocasDriver.ReadAsync` wiring isn't firing, or `ReadAsync(MACRO:3901)` returns no value on the box. (Analogy: the ScadaBridge DCL seed-ordering / static-tag false-bad class of bug — see that memory.)
|
||||||
|
- **H4 — a data-plane gate.** A role/active-node gate (cf. "data-plane roles need `Security:Ldap:GroupToRole` mapped or write/ack is inert") could suppress the read/poll on this single fused node under DisableLogin.
|
||||||
|
|
||||||
|
## ⚠️ Observability is the bottleneck
|
||||||
|
There are currently **no OtOpcUa app logs on the box** (Windows eventlog shows only NSSM start/stop; Serilog has no file sink in this deployment — cf. the MxGateway windev note). **Getting runtime logs is the single highest-leverage step** and gates H1/H3/H4.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1 — Offline code trace of the data plane (do FIRST; free, no box/CNC)
|
||||||
|
Trace, in the repo on the dev box, the two paths end-to-end. This may reveal the root cause with zero box access.
|
||||||
|
1. **Equipment-tag read path:** from an OPC UA `Read`/`CreateMonitoredItem` on `ns=2;s=EQ-…/parts-count` → the server's address-space node → the "FullName→NodeId router" → `FocasDriver.ReadAsync`. Find: who invokes the driver read, whether point-reads vs subscriptions differ, and where/whether a value is cached vs `WaitingForInitialData` is returned. Entry points: `FocasDriver.cs:247` (resolver-produced equipment tags), `FocasDriver.ReadAsync`, and the server-side poll-group / monitored-item sampler (grep `PollGroup`, `MonitoredItem`, the Equipment namespace value provider).
|
||||||
|
2. **FixedTree path:** `FocasDriver.FixedTreeLoopAsync` (`:644`) → `state.FixedTreeCache` → `TryReadFixedTree` → how those nodes are added to the address space. Determine whether FixedTree nodes are emitted into the **Equipment/UNS projection** at all, or only a raw-driver namespace (resolves H2). Grep `DiscoverAsync`, `FixedTreeReference`, and the address-space builder for the Equipment kind.
|
||||||
|
3. **Driver lifecycle/gates:** confirm what conditions start/suppress the data poll vs the probe loop (active-node, role, enabled flags) — `FocasDriver.cs:87–181`, the cluster/active-node gate.
|
||||||
|
- **Output:** a written root-cause hypothesis (which of H1–H4) + the specific code site, OR a precise list of what only box logs can disambiguate.
|
||||||
|
|
||||||
|
### ✅ Phase 1 FINDINGS (2026-06-25, offline subagent code-trace — DONE)
|
||||||
|
|
||||||
|
| Hypothesis | Verdict | Confidence |
|
||||||
|
|---|---|---|
|
||||||
|
| **H2** — FixedTree nodes not projected under Equipment model | **CONFIRMED root cause of symptom #2 — by-design** | proven from code |
|
||||||
|
| **H3** — equipment-tag value source never primed | **CONFIRMED mechanism of symptom #1**; routing code is self-consistent → break is **runtime**, not static | mechanism proven; exact break needs box logs |
|
||||||
|
| **H4** — role/active-node data-plane gate | **REFUTED** for the read/subscribe path | proven from code |
|
||||||
|
| **H1** — FixedTree bootstrap stuck/throwing | best fit for symptom #3 (no 250 ms connect), but **moot** for #2 and **not** the cause of #1 | strongly-suggested; needs logs |
|
||||||
|
| **H5 (new)** — FOCAS tag with blank `deviceHostAddress` → `BadNodeIdUnknown` | **latent trap** — yields a *Bad* value, not WaitingForInitialData, so it's the *next* failure, not the current symptom | proven from code |
|
||||||
|
|
||||||
|
**Why symptom #2 (no FixedTree nodes):** FixedTree nodes are emitted **only** by `FocasDriver.DiscoverAsync` (`FocasDriver.cs:391-494`, gated on `FixedTreeCache`), and `DiscoverAsync` is **never invoked** by the deployed equipment-kind address-space build. Production materializes the served tree **purely from Config-DB Tag rows** (`EquipmentNodeWalker.cs:24-25,161-165`; `OpcUaPublishActor.RebuildAddressSpace` → `MaterialiseEquipmentTags(composition)` at `OpcUaPublishActor.cs:326-337`). No runtime caller of `IDriver.DiscoverAsync` exists. ⇒ **By-design gap, not a bug.** Surfacing FixedTree under the Equipment tree is a *feature* (author explicit Tag rows per signal, or extend the composition pipeline to merge `ITagDiscovery` output — a substantial `AddressSpaceComposer`/`AddressSpaceApplier` change).
|
||||||
|
|
||||||
|
**Why symptom #1 (`0x80320000` even under a 30 s subscribe):** equipment-tag node value is **push-only**. `0x80320000` is the materialization **seed** (`OtOpcUaNodeManager.EnsureVariable:1400-1402`); there is **no `OnReadValue`/server-side sampler** for equipment tags, so both Read and MonitoredItem serve the cached node value. The seed is overwritten **only** by `ForwardToMux → WriteValue` (`OtOpcUaNodeManager.cs:261-281`), which writes even Bad/null. `StatusFromQuality` maps Bad → `0x80000000`, **not** `0x80320000`. ⇒ A persistent `0x80320000` **proves no `AttributeValueUpdate` ever landed** for those two NodeIds. The routing chain (`DriverHostActor.PushDesiredSubscriptions` → `_nodeIdByDriverRef` → `DriverInstanceActor.ResubscribeDesired` → `PollGroupEngine` forceRaise → `OnDataChangeForward` → `ForwardToMux`) is **provably self-consistent for FOCAS** (`FocasTagConfigModel` writes no top-level `FullName` → `ExtractTagFullName` returns the raw JSON blob → `FocasEquipmentTagParser.TryParse` resolves it; subscribed ref == published ref == routing-map key == NodeId, all from the same `t.FullName`). So the static mapping is **refuted**; the break is **runtime**: either the FOCAS child was never subscribed those refs, or `_nodeIdByDriverRef` lacked the `(DriverInstanceId, FullName)` key at publish time — most likely a **`DriverInstanceId` attribution mismatch** between the equipment Tag rows and the spawned `DriverInstance`.
|
||||||
|
|
||||||
|
**What only box logs (or a ConfigDb snapshot read) can disambiguate for symptom #1 — capture these on the driver-role node after a deploy:**
|
||||||
|
1. `DriverHostActor` info *"SubscribeBulk pushed {Refs} references across {Drivers} driver(s)"* (`DriverHostActor.cs:1058`) — `Refs<2` ⇒ equipment Tag `DriverInstanceId` ≠ spawned FOCAS child id (attribution mismatch).
|
||||||
|
2. `DriverInstanceActor` info *"subscribed to {Count} refs"* (`DriverInstanceActor.cs:571`) — absent/`0` ⇒ `_desiredRefs` empty or never reached `Connected`.
|
||||||
|
3. `DriverHostActor` debug *"no equipment-tag NodeId for ({Driver},{Ref}) — value dropped"* (`DriverHostActor.cs:550`) — present ⇒ `_nodeIdByDriverRef` key miss.
|
||||||
|
4. `FocasDriver` debug *"FOCAS fixed-tree bootstrap failed … retrying"* every ~2 s (`FocasDriver.cs:686`) ⇒ confirms H1 (`cnc_sysinfo`/`cnc_rdaxisname` failing on the Makino). Needs Debug level + logger-constructed wire client.
|
||||||
|
5. A driver-side read of `MACRO:3901` (live Makino): `BadNodeIdUnknown` ⇒ H5 blank-`deviceHostAddress` trap; Good value ⇒ wire path fine, gap is purely the OPC-UA equipment subscription wiring.
|
||||||
|
|
||||||
|
**Cheaper-than-Phase-2 lead:** the prime suspect (Tag-row `DriverInstanceId` vs `DriverInstance` row mismatch) is checkable by reading the **deployed ConfigDb snapshot** (`OtOpcUaConfig` on `wonder-sql-vd03:1433`) — no service restart, no log-sink change. Worth doing before/instead of the Phase 2 log-sink change.
|
||||||
|
|
||||||
|
**Suggested minimal-fix shapes (NOT implemented):**
|
||||||
|
- Symptom #1: depends on which log line/row fires — if attribution miss, it's a **config/snapshot re-bind** (no code). If a real wiring gap, localize first.
|
||||||
|
- H5 latent: in `FocasDriver.ReadAsync`/`WriteAsync` device lookup (`:289`,`:346`), when `DeviceHostAddress` is blank and `_devices.Count==1`, resolve to the sole device instead of `BadNodeIdUnknown` (or make the AdminUI field mandatory).
|
||||||
|
- Symptom #2 (H2): no fix — by-design; surfacing FixedTree is a feature decision.
|
||||||
|
|
||||||
|
### ✅ ConfigDb snapshot read (2026-06-25, symptom #1 path — chosen over box logs)
|
||||||
|
Read live `OtOpcUaConfig` on `wonder-sql-vd03` (query run on-box so the SQL password never left the box). Ground truth:
|
||||||
|
- **Equipment** `EQ-3686c0272279` (`z-34184`): `DriverInstanceId=focas-z-34184`, `DeviceId=NULL`, `Enabled=True`, `UnsLineId=makino`.
|
||||||
|
- **Tags** `parts-count`/`parts-required`: both `DriverInstanceId=focas-z-34184`, `DeviceId=NULL`, `DataType=Double`, `AccessLevel=Read`, `FolderPath=NULL`, **`PollGroupId=NULL`**, `TagConfig={"deviceHostAddress":"10.201.31.5:8193","address":"MACRO:390{1,2}","dataType":"Float64"}`.
|
||||||
|
- **DriverInstance**: one row `focas-z-34184` (`DriverType=Focas`, `Enabled=True`, ns `dev-equipment`, cluster `DEV`).
|
||||||
|
- **Device** table: **0 rows** — NORMAL: `FocasDriver` builds `_devices` from `_options.Devices` (DriverConfig JSON) not the Device table (`FocasDriver.cs:97`).
|
||||||
|
|
||||||
|
**Refuted by this read:** prime-suspect `DriverInstanceId` attribution mismatch (matches exactly) and H5 blank-`deviceHostAddress` (present). The deployed config is **clean**. ⇒ symptom #1 is a pure value-flow-plumbing break. New live leads: **`PollGroupId=NULL`** on both tags (is a poll group required to subscribe/poll?) and the **resolver-registration** path (equipment-tag refs are "resolver-produced, not seeded at `InitializeAsync`" per `FocasDriver.cs:247` — does poll-time `TryResolve` of the JSON-blob ref ever succeed?). A second offline subagent trace of DriverHostActor↔DriverInstanceActor↔PollGroupEngine↔resolver is running to pin the exact broken link.
|
||||||
|
|
||||||
|
### ⚠️ FixedTree feature (symptom #2 — user chose "build the feature") — ARCHITECTURE REALITY
|
||||||
|
Mapped the composition pipeline. Two address-space paths exist: (1) **Equipment/UNS projection** `AddressSpaceComposer.Compose` (config entities only) → `AddressSpaceApplier.MaterialiseEquipmentTags` → the served `ns=2` tree where `EQ-…` lives; (2) **raw-driver namespace** `GenericDriverNodeManager.BuildAddressSpaceAsync` → `driver.DiscoverAsync(IAddressSpaceBuilder)`. **Path 2 is DEAD: `BuildAddressSpaceAsync` has no runtime caller and `OpcUaApplicationHost.PopulateAddressSpaces` (its referenced caller) no longer exists.** Even `GalaxyDriver.DiscoverAsync` (`:588`) is reachable only via that dead path — Galaxy surfaces its hierarchy by being **authored as config equipment/tags**, not via discovery. ⇒ In the current Equipment-kind model **every served node is config-driven; `ITagDiscovery`/`DiscoverAsync` is legacy/dead for serving.** So "build the FixedTree feature" is NOT re-wiring an existing path — it's a **new dynamic-node-injection capability** into the Equipment projection, and it must solve a **timing problem**: composition runs at deploy/apply (before the driver connects), but FixedTree data only exists after the driver's async `FixedTreeCache` bootstrap. The far cheaper alternative that yields the same visible result is to **author FixedTree signals as config Tag rows** (each bound to a FOCAS fixed-tree reference) — same mechanism every other equipment tag uses. **Recommend re-confirming scope with the user given this cost delta before building.**
|
||||||
|
|
||||||
|
### 🎯 ROOT CAUSE — symptom #1 (CONFIRMED, 2026-06-25, 2nd subagent trace + code verify)
|
||||||
|
**The FOCAS poll read hangs forever because (1) all wire I/O for a device shares one socket with NO serialization, and (2) the steady-state read has NO timeout.**
|
||||||
|
|
||||||
|
- **Unsynchronized shared socket:** `FocasDriver.EnsureConnectedAsync` (`:1101-1128`) returns the single `device.Client` with no I/O mutex. `DeviceState.ProbeLock` (`:1172`) only synchronizes probes with each other. Four independent loops issue wire ops on that same socket concurrently: the equipment poll (`PollGroupEngine`, 1 s — `SubscriptionPublishingInterval` const `DriverHostActor.cs:58`), the FixedTree loop (250 ms / 2 s bootstrap), the probe (5 s), the recycle loop. FOCAS/2 is strict request-response on one socket → concurrent unsynchronized reads interleave; one reader consumes another's response PDU and the victim read blocks waiting for bytes that never come.
|
||||||
|
- **No read timeout:** the poll calls `FocasDriver.ReadAsync` (`:295-308`) → `WireFocasClient.ReadAsync` (`:417`) → `ReadMacroAsync` **without** the optional `timeout` → `FocasWireClient.CreateCallTimeout` (`:843-848`) skips `CancelAfter` when `timeout` is null → `ReadExpectedPduAsync` awaits the socket under a token that never fires. The blocked first `forceRaise` poll (`PollGroupEngine.cs:119`) never completes → `onChange` never fires → the `BadWaitingForInitialData` materialization seed (`OtOpcUaNodeManager.cs:1400`) is never overwritten → permanent `0x80320000`. (Connect `:1119` and probe `:604-610` ARE bounded — only the steady-state read isn't.)
|
||||||
|
|
||||||
|
**Why it works on the dev box but not deployed:** the dev-box harness/CLI does ONE read at a time with no FixedTree loop running → no socket collision → macros read fine (proven in the v3 work). Deployed, the FixedTree loop races the equipment poll on one socket → collision → hung read.
|
||||||
|
|
||||||
|
**Why both tags fail together:** they're in one driver's single 1 s subscription batch; `PollOnceAsync` reads them sequentially and hangs on the first, so neither reaches `onChange`.
|
||||||
|
|
||||||
|
**Ruled out:** `PollGroupId=NULL` is a red herring — verified at `ConfigComposer.cs:38` (snapshots all tags, no poll-group filter), `AddressSpaceComposer.Compose` (never reads `PollGroupId`), and `DriverHostActor.PushDesiredSubscriptions:965-973` (groups by `DriverInstanceId`, constant 1 s interval, no poll-group keying). Routing-key mismatch ruled out: subscribe ref == routing key == `onChange` ref (all from the same `FullName`). Not-subscribed ruled out by elimination: a *served* node implies it's in the same `ParseComposition` artifact used for subscribe; and every *deterministic* `ReadAsync` outcome pushes a specific Bad code (`BadNodeIdUnknown`/`BadCommunicationError`/`BadNotSupported`), none of which is `0x80320000` — only a hung (never-returning) read leaves the seed intact.
|
||||||
|
|
||||||
|
**PROPOSED FIX (two parts, FOCAS-driver-only, single managed DLL, no migration/proto change — same low-blast-radius deploy path as the v3 work):**
|
||||||
|
1. **Serialize per-device wire I/O** — an async gate (`SemaphoreSlim`) on `DeviceState` held around each `EnsureConnected + wire op` so the equipment poll, FixedTree loop, probe, and recycle never collide on the one socket. *This is what makes values actually read Good.* MUST be paired with #2 (a lock around an unbounded read would deadlock all I/O).
|
||||||
|
2. **Bound every steady-state wire read/write** with `_options.Timeout` (mirror the probe's linked CTS at `:604-610`, or thread `timeout` through `WireFocasClient.ReadAsync`→`ReadMacroAsync`). Converts any stall into a recoverable `BadCommunicationError` push (overwrites the seed, downgrades health → observable + self-healing) instead of permanent silence.
|
||||||
|
|
||||||
|
**Confidence:** the two structural gaps are *proven from code*; that their interaction is the active trigger is *strongly-suggested* (works single-threaded on dev box, hangs deployed; fits every observation). Definitive live confirmation = the fix makes the tags leave `0x80320000` (Good, or recoverable BadComm if the macro read genuinely fails). **This is a debugging fix headed for a production CNC node — implement + unit-test (serialization + read-timeout) locally, then deploy to wonder and verify, per the diagnose-before-deploy discipline.**
|
||||||
|
|
||||||
|
### Implementation + deploy status (2026-06-26)
|
||||||
|
- **Fix implemented** on branch `fix/focas-poll-io-serialization` (off `feat/focas-pdu-v3`): new `SynchronizedFocasClient` decorator (per-device `SemaphoreSlim` gate + per-call `_options.Timeout`) wired into `FocasDriver.EnsureConnectedAsync`; `ReadAsync`/`WriteAsync` now map a per-call timeout (OCE while caller token live) → `BadCommunicationError` instead of rethrowing. **243 FOCAS tests green** (8 new in `FocasIoSerializationTests.cs` + 1 sibling in `FocasReadWriteTests`), full solution builds 0 errors.
|
||||||
|
- **Live baseline re-confirmed** via OtOpcUa CLI from the Mac: both tags `[80320000]`.
|
||||||
|
- **Deployed** the Release DLL to wonder (single managed-DLL swap; v3 DLL 222208 B → fix DLL 228864 B; backup `E:\ApiInstall\OtOpcUa\_focasbak-pre-iofix-20260626T041913.dll`; service Running, OPC 4840 listening).
|
||||||
|
- **BLOCKER for verification:** after the swap+restart the FOCAS driver is **not connecting at all** (zero TCP to `10.201.31.5:8193` at 150 s uptime — not even a probe connect), so tags still `[80320000]`.
|
||||||
|
- **Deploy API enabled** (user-approved): added `Security__DeployApiKey` (64-char generated, value on-box only) to the `OtOpcUaHost` service `Environment` REG_MULTI_SZ + restart; env backed up to `_envbak-pre-deploykey-20260626T043001.txt`. Headless `POST http://localhost:9000/api/deployments` (key read from registry on-box) returned **HTTP 202 Accepted**, sealed deployment `ef384b04…`.
|
||||||
|
- **Tags STILL `[80320000]`** after the sealed deploy. **Root of the live blocker (from box logs):** the service writes Serilog to `C:\Windows\System32\logs\otopcua-<date>.log` (relative-path-from-service-CWD bug — pending.md D2). Today's log shows the **admin/publish side only**: `Phase7Applier: hierarchy materialised (areas=1,lines=1,equipment=1)` + `equipment tags materialised (tags=2)` at each restart (04:19, 04:30) — which is why the 2 tags are served with the `0x80320000` seed. But across the **entire** day's log there are **ZERO** driver-side lines: `DriverHost`=0, `SubscribeBulk`=0, `subscribed to`=0, `Focas`=0, `GenericDriver`=0 (the 13 `DriverInstance` hits are all SQL `FROM [DriverInstance]`). **⇒ the DriverHostActor / driver-role side is not spawning the FOCAS driver at all on this node today**, so there is no poll, no value push, permanent seed. Health: `/healthz` Healthy, Akka member **Up**, `admin-leader` **Active**; roles env intact (`Cluster__Roles__0=admin`, `__1=driver`, seed=self:4053); `DOTNET_ENVIRONMENT=Production`. The admin half is fully alive; only the driver half is silent.
|
||||||
|
- **This is a SEPARATE blocker from the I/O fix** (which is correct, unit-tested, and deployed — but cannot be exercised until the driver actually polls).
|
||||||
|
|
||||||
|
### Driver-host activation diagnosis (2026-06-26, Debug-logging pass + code reads)
|
||||||
|
- **Debug Serilog enabled** (env `Serilog__MinimumLevel__Default=Debug` + EF/Microsoft/System→Warning; env bak `_envbak-pre-debug-*.txt`). Revealed: **NO exceptions / load errors anywhere** (my fix DLL is NOT the cause), but also **ZERO Akka lines** in Serilog (`Akka`/`Member`/`akka.tcp`/`singleton` = 0).
|
||||||
|
- **`DriverHostActor` logs via `Context.GetLogger()` (Akka `ILoggingAdapter`), and there is NO `Akka.Logger.Serilog` bridge anywhere in the app** (verified across `*.cs`/`*.csproj`/HOCON). So Akka actor logs go to the default console logger → **discarded for a Windows service → invisible in BOTH Production AND Development.** `appsettings.Development.json` only changes Serilog min-levels (+ `Ldap:DevStubMode=true`); it adds no Akka bridge. ⇒ **`DOTNET_ENVIRONMENT=Development` would NOT surface the driver host** (held off — futile + flips DevStubMode on prod).
|
||||||
|
- **The `DriverHostActor` IS instantiated** (unconditionally, same `system.ActorOf` block as the working `OpcUaPublishActor` at `Runtime/ServiceCollectionExtensions.cs:~232`). So it runs — it just isn't producing a connected FOCAS driver, and logs via Akka so the reason is invisible.
|
||||||
|
- **STRONG HYPOTHESIS (explains everything):** on bootstrap the `DriverHostActor` *recovers* its persisted state (its own logs name `entering Steady` / `recovered Applied state at rev {Rev}`), but a process restart has already killed the live `DriverInstanceActor`s + their FOCAS sockets — and **recovery does NOT re-instantiate them**. Drivers re-spawn only on a deployment dispatch whose **revision changes**. This config is now **stable** (`revisionHash 924b59097eba…` is identical across `0c2db588`/`ef384b04`/`a2a84646` — every `POST /api/deployments` returns 202 but with the SAME rev), so repeated deploys are no-ops for the driver host → after any restart there are **no live drivers** and nothing re-spawns them → zero Makino TCP → tags frozen at the `0x80320000` materialization seed. The **prior session got HEALTHY because the config was being *authored*** (each deploy had a NEW rev → forced a spawn); once authoring stopped, a restart leaves the driver side dark. ⇒ likely the real reason "host-restart didn't change it" in the original symptom.
|
||||||
|
- **Implication:** to exercise the I/O fix live, force a driver re-spawn by bumping the revision (e.g. toggle the FOCAS DriverInstance `Enabled` off→on, or any benign config edit, then deploy) — reversible. Separately, "restart should re-spawn applied drivers" + "Akka logs should bridge to Serilog" are real OtOpcUa robustness/observability gaps (the latter = the observability bottleneck this plan flagged).
|
||||||
|
- **Box env mutations still in place (to revert when done):** `Security__DeployApiKey` (deploy API), `Serilog__MinimumLevel__*` (Debug). Backups: `_envbak-pre-deploykey-*`, `_envbak-pre-debug-*`.
|
||||||
|
|
||||||
|
### Akka→Serilog observability attempt (2026-06-26) — code done, prod deploy BLOCKED by self-contained layout
|
||||||
|
- **Code change (correct, builds 0 errors):** added `Akka.Logger.Serilog` 1.5.60 (deps all satisfied by existing pins — Akka 1.5.62 / Serilog 4.3.1) to `Directory.Packages.props` + the Cluster csproj, and `loggers=["Akka.Logger.Serilog.SerilogLogger, …"]` + `loglevel=DEBUG` + `logger-startup-timeout=30s` to the embedded `Resources/akka.conf`. This bridges the DriverHostActor's Akka `ILoggingAdapter` into the Serilog file sink.
|
||||||
|
- **Prod deploy FAILED TWICE (both auto-rolled-back; service is UP + 4840 listening the whole time after each):**
|
||||||
|
1. Cluster.dll + Akka.Logger.Serilog.dll swap → crash `FileNotFoundException: Could not load 'Akka.Logger.Serilog 1.5.60'` from `Serilog.Settings.Configuration.ConfigurationReader.LoadConfigurationAssemblies` (the deployed `OtOpcUa.Host.deps.json` didn't list the new assembly; Serilog's `*.Serilog` config-assembly scan then fails fatally).
|
||||||
|
2. Added the updated `OtOpcUa.Host.deps.json` → crash **"Could not resolve CoreCLR path."** ⇒ the box deployment is **SELF-CONTAINED** (bundles its own runtime), so a `dotnet build` (framework-dependent) deps.json breaks the apphost's runtime resolution.
|
||||||
|
- **Conclusion:** adding a NEW assembly to this box requires a **full self-contained publish-overlay** (match the box layout; preserve `appsettings*`/`data\`), NOT a DLL/deps.json swap. That's a heavy/risky prod op for what is *diagnostic* observability of the (separate, pre-existing) driver-host re-spawn issue.
|
||||||
|
- **Prod state now (verified healthy):** FOCAS **I/O-fix DLL still deployed** (228864 B — untouched by these rollbacks), Cluster.dll + deps.json restored to baseline, `Akka.Logger.Serilog.dll` renamed `.disabled`, service Running + OPC 4840 listening. Deploy-API key + Debug-Serilog env still set.
|
||||||
|
- **RECOMMENDED PIVOT:** the cheap, no-prod-change way to validate the I/O fix is to **force a driver re-spawn via a benign config rev-bump** (the leading hypothesis); and the *proper* driver-host re-spawn fix + the Akka-observability are best developed in a **local docker-dev repro** (safe, full logging, one clean publish-overlay deploy at the end) rather than iterated on the production CNC node. **(User chose: move proper fix to local docker-dev.)**
|
||||||
|
|
||||||
|
### 🎯 ROOT CAUSE of the driver-not-spawning — STALE DEPLOYED BINARY (CONFIRMED 2026-06-26)
|
||||||
|
The DriverHostActor's bootstrap-recovery **already re-spawns drivers in the current source**: `BootstrapRecover` `case Applied:` logs `"recovered Applied state at rev"` and calls `RestoreApplied()` (`DriverHostActor.cs:449-457`), which does `ReconcileDrivers` (re-spawn) + `RebuildAddressSpace` + `PushDesiredSubscriptions` (`:910-918`) — comment: *"the in-memory driver children + OPC UA address space were lost on restart … re-spawn … instead of waiting for a config change (whose identical-config revision would no-op)."* Added by `b1b3f3ff` (2026-06-06) + `397f9b78` (2026-06-07); both ancestors of `feat/focas-pdu-v3`; covered by existing `Restore_on_bootstrap_*` tests.
|
||||||
|
- **The DEPLOYED wonder `Runtime.dll` is the June-16 base install** (`mtime 2026-06-16 13:05:46`; only the FOCAS *driver* DLL was ever swapped). Binary string scan: `RestoreApplied`/`ReconcileDrivers` method names PRESENT, but the bootstrap-restore log strings `"recovered Applied state at rev"` + `"restored served state for applied deployment"` are **ABSENT** ⇒ the deployed binary **predates b1b3f3ff** and does **NOT** restore served state / re-spawn drivers on bootstrap. **That is exactly why, on this stable-config box, a restart leaves the driver side dark** (and why only config-authoring re-deploys ever lit it up).
|
||||||
|
- **⇒ No new driver-host fix is needed — it already exists in source.** The proper remedy = **deploy the current Host** (full self-contained publish-overlay onto `E:\ApiInstall\OtOpcUa`, preserving `appsettings*`/`data\`), which in one shot delivers: (1) `RestoreApplied`-on-bootstrap (drivers re-spawn after restart), (2) the FOCAS I/O serialization fix (symptom #1), (3) the Akka→Serilog bridge (observability). The earlier DLL-swap crashes were purely the self-contained-layout mismatch — a full publish-overlay is internally consistent and avoids them.
|
||||||
|
- **PLAN:** confirm the current Host build boots cleanly + re-spawns drivers on restart in **docker-dev** (safe), then do ONE self-contained publish-overlay to wonder + verify tags leave `0x80320000`.
|
||||||
|
|
||||||
|
### ✅ docker-dev confirmation (2026-06-26) — current build re-spawns on bootstrap + Akka bridge works
|
||||||
|
Built the current source into the docker-dev image (`otopcua-host:dev`), booted `central-1` (fused admin+driver, like wonder). Results:
|
||||||
|
- **Boots cleanly** with `Akka.Logger.Serilog` (the prod DLL-swap crashes were purely the self-contained-layout mismatch, now confirmed).
|
||||||
|
- **Akka→Serilog bridge works** — but needed TWO fixes beyond the package: (1) Akka.Hosting owns logger setup so HOCON `akka.loggers` is ignored → wire via `ConfigureLoggers(setup => { setup.LogLevel=DebugLevel; setup.ClearLoggers(); setup.AddLogger<SerilogLogger>(); })` in `WithOtOpcUaClusterBootstrap`; (2) `AddZbSerilog` registers Serilog as the MEL provider but does NOT set the static `Serilog.Log.Logger` (which `Akka.Logger.Serilog` writes to, AND which the Program.cs startup banner uses) → set `Serilog.Log.Logger = app.Services.GetRequiredService<Serilog.ILogger>()` in Program.cs right after `Build()`. With both, the startup banner + full Akka cluster/DriverHost logs now emit. *(The `Log.Logger`-unset gap is a latent bug in the shared `AddZbSerilog` lib affecting all 3 apps' static-`Log` calls — worth a follow-up there.)*
|
||||||
|
- **`RestoreApplied` re-spawns drivers on bootstrap — CONFIRMED live:** `DriverHost central-1: recovered Applied state at rev …` → `spawned GalaxyMxGateway/Modbus/OpcUaClient driver` (×3) → `SubscribeBulk pushed 5 references across 3 driver(s)` → `restored served state for applied deployment … on bootstrap` → `DriverInstance …: subscribed to N refs`. These are the exact strings ABSENT from the stale wonder binary. ⇒ deploying the current Host fixes the wonder driver-not-spawning blocker.
|
||||||
|
|
||||||
|
### Deploy plan (self-contained publish-overlay → wonder)
|
||||||
|
- Local: `dotnet publish …Host.csproj -c Release -r win-x64 --self-contained true -p:PublishSingleFile=false` (self-contained = brings its own runtime; no version-match concern with the box).
|
||||||
|
- Zip + SFTP to `win64`. Box overlay (self-healing): full app-dir backup → robocopy publish over `E:\ApiInstall\OtOpcUa` **EXCLUDING `pki\` (OPC server cert) + `appsettings*.json`** → start → verify (4840 + tags) → AUTO-ROLLBACK from backup if unhealthy. Env (deploy key, Debug logging) is registry-side, untouched by the file overlay.
|
||||||
|
- Then revert the Debug-Serilog env (now superseded; verbose) and verify `parts-count`/`parts-required` read Good (or recoverable BadComm), and that the FixedTree/driver values flow.
|
||||||
|
|
||||||
|
### ✅✅ SYMPTOM #1 FIXED + LIVE-VALIDATED ON WONDER (2026-06-26)
|
||||||
|
After the self-contained overlay (current Host) + two light single-DLL FOCAS swaps, the validation revealed — and each fix peeled back — a **cascade of latent FOCAS-config-vs-driver mismatches** the stale binary had masked. Final state, all live on wonder:
|
||||||
|
- `DriverHost …: recovered Applied state … → spawned Focas driver focas-z-34184 (stub=false) → DriverInstance focas-z-34184: connected → subscribed to 2 refs` — **two ESTABLISHED TCP sessions to `10.201.31.5:8193`**.
|
||||||
|
- **`read ns=2;s=EQ-3686c0272279/parts-count` → Value 0, Status `0x00000000` (Good)**; `parts-required` → Good. (Value 0 is correct on the idle machine — status, not magnitude.) The original `0x80320000` is gone.
|
||||||
|
|
||||||
|
**The complete fix chain (all on `fix/focas-poll-io-serialization`, deployed):**
|
||||||
|
1. **FOCAS I/O serialization + read timeout** (`SynchronizedFocasClient`) — the original diagnosed root cause: prevents the poll read hanging on the shared socket.
|
||||||
|
2. **`RestoreApplied`-on-bootstrap** — already in source since `b1b3f3ff`; the wonder fix was deploying the current Host over the stale June-16 binary so the driver re-spawns on restart.
|
||||||
|
3. **Akka→Serilog bridge** (`ConfigureLoggers().AddLogger<SerilogLogger>()` + set static `Serilog.Log.Logger` in Program.cs) — made the driver-host actor observable; this is what surfaced the next two issues.
|
||||||
|
4. **`FlexibleStringConverter`** on the FOCAS config `Series` — the AdminUI persists the enum as a number (`"series":6`); the factory now tolerates number-or-string instead of throwing → stub.
|
||||||
|
5. **Scheme-less host tolerance** in `FocasHostAddress.TryParse` — the AdminUI persists `hostAddress` as a bare `ip:port`; `TryParse` now accepts it (canonical `focas://` unchanged) instead of failing init.
|
||||||
|
- FOCAS test suite **247 green**; each fix carries a regression test.
|
||||||
|
- **Follow-up (product quality):** the AdminUI authors FOCAS configs (`series` as number, `hostAddress` without `focas://`) that the driver only now tolerates — the AdminUI↔driver config-format mismatch is worth reconciling at the source. Also: the shared `AddZbSerilog` not setting static `Serilog.Log.Logger` is a latent gap across all 3 apps. And the FixedTree-under-Equipment feature (task #14) is still pending.
|
||||||
|
|
||||||
|
## Phase 2 — Get OtOpcUa runtime logs on wonder
|
||||||
|
Make the Host emit driver-level logs so the data plane is observable. Options (least invasive first): point the service at a Serilog file sink via config/env, or temporarily run with `DOTNET_ENVIRONMENT=Development` (file sink + dev errors — cf. MxGateway note), or add a console capture. Preserve `appsettings*`/`data\`; restore the env after. Then read: did `InitializeAsync` start the FixedTree loop, does the bootstrap throw (and on which call), is `ReadAsync` invoked for the equipment tags, what does it return.
|
||||||
|
- **Access:** servecli `:2222`, key `~/.ssh/servecli_wonder`, `scratchpad/wonder-ps.sh` (base64 PS over the cmd PTY), SFTP root `C:\Users\dohertj2\Desktop\win64`.
|
||||||
|
- **Output:** the actual runtime behavior of the data poll + equipment read on the box.
|
||||||
|
|
||||||
|
## Phase 3 — Local reproduction in docker-dev (isolate FOCAS-specific vs general)
|
||||||
|
Reproduce off the production box: in the docker-dev OtOpcUa, configure an equipment tag bound to a driver and check whether values flow at all. Use a driver with an easy local source (e.g. Modbus against a local sim, or the FOCAS wire client against a v1 responder if one can be stood up). If equipment-tag values flow for another driver locally but not FOCAS → FOCAS-specific; if they don't flow for any → a general equipment-projection/data-plane gap. A local repro gives a full-logging debug loop.
|
||||||
|
- **Output:** scope (FOCAS-only vs general) + a local failing case to fix against.
|
||||||
|
|
||||||
|
## Phase 4 — Root cause + minimal fix
|
||||||
|
From Phases 1–3, fix the smallest thing that makes the equipment tag carry a value and (if H2 is not by-design) the FixedTree surface. Likely shapes: a swallowed bootstrap exception; a seed/poll-group wiring gap; an Equipment-projection that should include FixedTree/driver auto-nodes; or a data-plane gate that needs config on the wonder node.
|
||||||
|
|
||||||
|
## Phase 5 — Validate
|
||||||
|
- Local (docker-dev) green where reproduced; unit/integration tests for the fixed path.
|
||||||
|
- Live: re-deploy to wonder, then via the OtOpcUa CLI confirm `parts-count`/`parts-required` read **Good** (value 0 is correct on this idle machine — assert status, not magnitude) and, if applicable, FixedTree Identity/Axes nodes appear with live values. The live Makino helps for FixedTree but is NOT blocking for the value-flow plumbing (macro values are 0 regardless), so this is lower time-pressure than the v3 capture was.
|
||||||
|
|
||||||
|
## Phase 6 — Docs + commit
|
||||||
|
Update this plan's status + the deployment doc; commit on a branch (separate from `feat/focas-pdu-v3` if the fix is OtOpcUa-core rather than FOCAS-driver). Push per the repo flow when asked.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
- **Execution approach:** this is a *debugging* investigation (unknown root cause) — diagnose before fixing; reproduce before claiming a fix; change the smallest thing. Don't deploy a guessed fix to the production CNC node.
|
||||||
|
- **Key node IDs / endpoints:** equipment `ns=2;s=EQ-3686c0272279`; tags `…/parts-count` (`MACRO:3901`), `…/parts-required` (`MACRO:3902`); OPC `opc.tcp://wonder-app-vd03.zmr.zimmer.com:4840/OtOpcUa`; AdminUI `http://wonder-app-vd03.zmr.zimmer.com:9000` (DisableLogin); deploy = AdminUI `/deployments` → "Deploy current configuration" (Blazo→Akka; no headless API).
|
||||||
|
- **Reusable tools:** `scratchpad/focas-status/` (live IFocasClient harness), `scratchpad/wonder-ps.sh`, `scratchpad/deploy-focas-v3.ps1`; the OtOpcUa CLI `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` (`read`/`subscribe`/`browse --recursive`).
|
||||||
@@ -7,6 +7,12 @@
|
|||||||
# Any divergence from these defaults must be deliberate and recorded in docs/v2/Architecture.md.
|
# Any divergence from these defaults must be deliberate and recorded in docs/v2/Architecture.md.
|
||||||
|
|
||||||
akka {
|
akka {
|
||||||
|
# Akka logger wiring (route ILoggingAdapter → Serilog) is configured via Akka.Hosting's
|
||||||
|
# ConfigureLoggers in ServiceCollectionExtensions.WithOtOpcUaClusterBootstrap — HOCON
|
||||||
|
# `akka.loggers` alone is not honored by Akka.Hosting. logger-startup-timeout is kept here
|
||||||
|
# since the Serilog logger can be slow to initialize at startup.
|
||||||
|
logger-startup-timeout = 30s
|
||||||
|
|
||||||
extensions = [
|
extensions = [
|
||||||
"Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubExtensionProvider, Akka.Cluster.Tools"
|
"Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubExtensionProvider, Akka.Cluster.Tools"
|
||||||
]
|
]
|
||||||
|
|||||||
@@ -1,5 +1,7 @@
|
|||||||
using Akka.Cluster.Hosting;
|
using Akka.Cluster.Hosting;
|
||||||
|
using Akka.Event;
|
||||||
using Akka.Hosting;
|
using Akka.Hosting;
|
||||||
|
using Akka.Logger.Serilog;
|
||||||
using Akka.Remote.Hosting;
|
using Akka.Remote.Hosting;
|
||||||
using Microsoft.Extensions.Configuration;
|
using Microsoft.Extensions.Configuration;
|
||||||
using Microsoft.Extensions.DependencyInjection;
|
using Microsoft.Extensions.DependencyInjection;
|
||||||
@@ -53,6 +55,19 @@ public static class ServiceCollectionExtensions
|
|||||||
|
|
||||||
builder.AddHocon(HoconLoader.LoadBaseConfig(), HoconAddMode.Append);
|
builder.AddHocon(HoconLoader.LoadBaseConfig(), HoconAddMode.Append);
|
||||||
|
|
||||||
|
// Route Akka's internal ILoggingAdapter (DriverHostActor, DriverInstanceActor, cluster
|
||||||
|
// events, …) into Serilog so those logs reach the same sinks as the MEL/Serilog application
|
||||||
|
// logs. Akka.Hosting owns logger setup, so HOCON `akka.loggers` alone is not honored — the
|
||||||
|
// logger must be registered through ConfigureLoggers. Without this the actor graph logs only
|
||||||
|
// to the default StandardOutLogger (discarded under the Windows service host), which is why
|
||||||
|
// the driver-role actors were invisible during the 2026-06 data-plane investigation.
|
||||||
|
builder.ConfigureLoggers(setup =>
|
||||||
|
{
|
||||||
|
setup.LogLevel = LogLevel.DebugLevel;
|
||||||
|
setup.ClearLoggers();
|
||||||
|
setup.AddLogger<SerilogLogger>();
|
||||||
|
});
|
||||||
|
|
||||||
builder.WithRemoting(new RemoteOptions
|
builder.WithRemoting(new RemoteOptions
|
||||||
{
|
{
|
||||||
HostName = options.Hostname,
|
HostName = options.Hostname,
|
||||||
|
|||||||
@@ -10,6 +10,7 @@
|
|||||||
<PackageReference Include="Akka.Cluster"/>
|
<PackageReference Include="Akka.Cluster"/>
|
||||||
<PackageReference Include="Akka.Cluster.Hosting"/>
|
<PackageReference Include="Akka.Cluster.Hosting"/>
|
||||||
<PackageReference Include="Akka.Cluster.Tools"/>
|
<PackageReference Include="Akka.Cluster.Tools"/>
|
||||||
|
<PackageReference Include="Akka.Logger.Serilog"/>
|
||||||
<PackageReference Include="Akka.Remote.Hosting"/>
|
<PackageReference Include="Akka.Remote.Hosting"/>
|
||||||
<PackageReference Include="Microsoft.Extensions.Hosting"/>
|
<PackageReference Include="Microsoft.Extensions.Hosting"/>
|
||||||
<PackageReference Include="Microsoft.Extensions.Options.ConfigurationExtensions"/>
|
<PackageReference Include="Microsoft.Extensions.Options.ConfigurationExtensions"/>
|
||||||
|
|||||||
@@ -306,7 +306,16 @@ public sealed class FocasDriver : IDriver, IReadable, IWritable, ITagDiscovery,
|
|||||||
Volatile.Read(ref _health).LastSuccessfulRead,
|
Volatile.Read(ref _health).LastSuccessfulRead,
|
||||||
$"FOCAS status 0x{status:X8} reading {reference}"));
|
$"FOCAS status 0x{status:X8} reading {reference}"));
|
||||||
}
|
}
|
||||||
catch (OperationCanceledException) { throw; }
|
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { throw; }
|
||||||
|
catch (OperationCanceledException)
|
||||||
|
{
|
||||||
|
// Per-call timeout (not external cancellation) — the read stalled past the device
|
||||||
|
// Timeout budget. Surface a recoverable comm error so the BadWaitingForInitialData
|
||||||
|
// seed is overwritten and health degrades, instead of the read hanging forever.
|
||||||
|
results[i] = new DataValueSnapshot(null, FocasStatusMapper.BadCommunicationError, null, now);
|
||||||
|
Volatile.Write(ref _health, new DriverHealth(DriverState.Degraded,
|
||||||
|
Volatile.Read(ref _health).LastSuccessfulRead, $"FOCAS read timed out for {reference}"));
|
||||||
|
}
|
||||||
catch (Exception ex)
|
catch (Exception ex)
|
||||||
{
|
{
|
||||||
results[i] = new DataValueSnapshot(null, FocasStatusMapper.BadCommunicationError, null, now);
|
results[i] = new DataValueSnapshot(null, FocasStatusMapper.BadCommunicationError, null, now);
|
||||||
@@ -356,7 +365,15 @@ public sealed class FocasDriver : IDriver, IReadable, IWritable, ITagDiscovery,
|
|||||||
var status = await client.WriteAsync(parsed, def.DataType, w.Value, cancellationToken).ConfigureAwait(false);
|
var status = await client.WriteAsync(parsed, def.DataType, w.Value, cancellationToken).ConfigureAwait(false);
|
||||||
results[i] = new WriteResult(status);
|
results[i] = new WriteResult(status);
|
||||||
}
|
}
|
||||||
catch (OperationCanceledException) { throw; }
|
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { throw; }
|
||||||
|
catch (OperationCanceledException)
|
||||||
|
{
|
||||||
|
// Per-call timeout (not external cancellation) — the write stalled past the device
|
||||||
|
// Timeout budget. Surface a recoverable comm error rather than aborting the batch.
|
||||||
|
results[i] = new WriteResult(FocasStatusMapper.BadCommunicationError);
|
||||||
|
Volatile.Write(ref _health, new DriverHealth(DriverState.Degraded,
|
||||||
|
Volatile.Read(ref _health).LastSuccessfulRead, $"FOCAS write timed out for {w.FullReference}"));
|
||||||
|
}
|
||||||
catch (NotSupportedException nse)
|
catch (NotSupportedException nse)
|
||||||
{
|
{
|
||||||
results[i] = new WriteResult(FocasStatusMapper.BadNotSupported);
|
results[i] = new WriteResult(FocasStatusMapper.BadNotSupported);
|
||||||
@@ -1113,7 +1130,11 @@ public sealed class FocasDriver : IDriver, IReadable, IWritable, ITagDiscovery,
|
|||||||
device.Client = null;
|
device.Client = null;
|
||||||
}
|
}
|
||||||
|
|
||||||
device.Client = _clientFactory.Create();
|
// Wrap the raw wire client so every operation on the device's single FOCAS/2 socket is
|
||||||
|
// serialized (request→response on one socket cannot interleave) and time-bounded. Without
|
||||||
|
// this, the equipment poll, fixed-tree loop, probe, and recycle loop collide on the shared
|
||||||
|
// socket and a stalled read blocks forever — leaving bound tags at BadWaitingForInitialData.
|
||||||
|
device.Client = new SynchronizedFocasClient(_clientFactory.Create(), _options.Timeout);
|
||||||
try
|
try
|
||||||
{
|
{
|
||||||
await device.Client.ConnectAsync(device.ParsedAddress, _options.Timeout, ct).ConfigureAwait(false);
|
await device.Client.ConnectAsync(device.ParsedAddress, _options.Timeout, ct).ConfigureAwait(false);
|
||||||
|
|||||||
@@ -195,12 +195,41 @@ public static class FocasDriverFactoryExtensions
|
|||||||
AllowTrailingCommas = true,
|
AllowTrailingCommas = true,
|
||||||
};
|
};
|
||||||
|
|
||||||
|
/// <summary>
|
||||||
|
/// Reads a JSON property as a string, tolerating a JSON <b>number</b> token as well. The
|
||||||
|
/// AdminUI persists the FOCAS <c>Series</c> enum as its integer value (e.g. <c>"series":6</c>),
|
||||||
|
/// while this DTO models <c>Series</c> as a string handed to <see cref="ParseSeries"/>
|
||||||
|
/// (Enum.TryParse accepts the numeric form). Without this, System.Text.Json throws
|
||||||
|
/// "Cannot get the value of a token type 'Number' as a string" on the bare number and the
|
||||||
|
/// driver falls back to a stub. Accepts string / number / null and emits a string.
|
||||||
|
/// </summary>
|
||||||
|
internal sealed class FlexibleStringConverter : JsonConverter<string?>
|
||||||
|
{
|
||||||
|
public override string? Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options) =>
|
||||||
|
reader.TokenType switch
|
||||||
|
{
|
||||||
|
JsonTokenType.String => reader.GetString(),
|
||||||
|
JsonTokenType.Number => reader.TryGetInt64(out var n)
|
||||||
|
? n.ToString(System.Globalization.CultureInfo.InvariantCulture)
|
||||||
|
: reader.GetDouble().ToString(System.Globalization.CultureInfo.InvariantCulture),
|
||||||
|
JsonTokenType.Null => null,
|
||||||
|
_ => throw new JsonException($"Expected string, number, or null but got {reader.TokenType}."),
|
||||||
|
};
|
||||||
|
|
||||||
|
public override void Write(Utf8JsonWriter writer, string? value, JsonSerializerOptions options)
|
||||||
|
{
|
||||||
|
if (value is null) writer.WriteNullValue();
|
||||||
|
else writer.WriteStringValue(value);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
internal sealed class FocasDriverConfigDto
|
internal sealed class FocasDriverConfigDto
|
||||||
{
|
{
|
||||||
/// <summary>Gets or sets the FOCAS client factory backend name (e.g. "wire" or "stub").</summary>
|
/// <summary>Gets or sets the FOCAS client factory backend name (e.g. "wire" or "stub").</summary>
|
||||||
public string? Backend { get; init; }
|
public string? Backend { get; init; }
|
||||||
|
|
||||||
/// <summary>Gets or sets the CNC series for this driver.</summary>
|
/// <summary>Gets or sets the CNC series for this driver.</summary>
|
||||||
|
[JsonConverter(typeof(FlexibleStringConverter))]
|
||||||
public string? Series { get; init; }
|
public string? Series { get; init; }
|
||||||
|
|
||||||
/// <summary>Gets or sets the operation timeout in milliseconds.</summary>
|
/// <summary>Gets or sets the operation timeout in milliseconds.</summary>
|
||||||
@@ -234,6 +263,7 @@ public static class FocasDriverFactoryExtensions
|
|||||||
public string? DeviceName { get; init; }
|
public string? DeviceName { get; init; }
|
||||||
|
|
||||||
/// <summary>Gets or sets the CNC series for this device (overrides top-level series if provided).</summary>
|
/// <summary>Gets or sets the CNC series for this device (overrides top-level series if provided).</summary>
|
||||||
|
[JsonConverter(typeof(FlexibleStringConverter))]
|
||||||
public string? Series { get; init; }
|
public string? Series { get; init; }
|
||||||
|
|
||||||
/// <summary>
|
/// <summary>
|
||||||
|
|||||||
@@ -21,9 +21,19 @@ public sealed record FocasHostAddress(string Host, int Port)
|
|||||||
{
|
{
|
||||||
if (string.IsNullOrWhiteSpace(value)) return null;
|
if (string.IsNullOrWhiteSpace(value)) return null;
|
||||||
const string prefix = "focas://";
|
const string prefix = "focas://";
|
||||||
if (!value.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)) return null;
|
|
||||||
|
|
||||||
var body = value[prefix.Length..];
|
// Canonical form is focas://{ip}[:{port}], but the AdminUI persists the device host as a
|
||||||
|
// scheme-less "{ip}[:{port}]" (e.g. "10.201.31.5:8193"). Accept that too: take the body
|
||||||
|
// after focas:// when present, else the whole value when it carries NO other URI scheme
|
||||||
|
// (a "://" that isn't ours — e.g. http:// — is still rejected). The host-contains-colon
|
||||||
|
// guard below then rejects malformed scheme typos like "focas:10.0.0.5:8193".
|
||||||
|
string body;
|
||||||
|
if (value.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
|
||||||
|
body = value[prefix.Length..];
|
||||||
|
else if (!value.Contains("://", StringComparison.Ordinal))
|
||||||
|
body = value;
|
||||||
|
else
|
||||||
|
return null;
|
||||||
if (string.IsNullOrEmpty(body)) return null;
|
if (string.IsNullOrEmpty(body)) return null;
|
||||||
|
|
||||||
var colonIdx = body.LastIndexOf(':');
|
var colonIdx = body.LastIndexOf(':');
|
||||||
@@ -39,7 +49,9 @@ public sealed record FocasHostAddress(string Host, int Port)
|
|||||||
{
|
{
|
||||||
host = body;
|
host = body;
|
||||||
}
|
}
|
||||||
if (string.IsNullOrEmpty(host)) return null;
|
// Empty host, or a host still carrying a colon (e.g. the malformed "focas:10.0.0.5" left
|
||||||
|
// when someone wrote "focas:10.0.0.5:8193" without the //), is invalid.
|
||||||
|
if (string.IsNullOrEmpty(host) || host.Contains(':', StringComparison.Ordinal)) return null;
|
||||||
return new FocasHostAddress(host, port);
|
return new FocasHostAddress(host, port);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -0,0 +1,152 @@
|
|||||||
|
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS;
|
||||||
|
|
||||||
|
/// <summary>
|
||||||
|
/// Decorates an <see cref="IFocasClient"/> so that every wire operation on the device's
|
||||||
|
/// single FOCAS/2 socket is (1) <b>serialized</b> against all other operations and
|
||||||
|
/// (2) <b>time-bounded</b>.
|
||||||
|
/// </summary>
|
||||||
|
/// <remarks>
|
||||||
|
/// <para>FOCAS/2 over TCP:8193 is a strict request→response protocol on ONE socket. The
|
||||||
|
/// driver holds a single <see cref="IFocasClient"/> per device, but several independent loops
|
||||||
|
/// read from it concurrently — the equipment poll (<see cref="FocasDriver.ReadAsync"/>), the
|
||||||
|
/// fixed-tree loop (<c>FixedTreeLoopAsync</c>), the connectivity probe, and the recycle loop.
|
||||||
|
/// Without serialization, two reads interleave their <c>send(request); read(response)</c> on the
|
||||||
|
/// same socket: one reader consumes the other's response PDU and the victim then blocks forever
|
||||||
|
/// waiting for bytes that never arrive — leaving the bound OPC UA node stuck at
|
||||||
|
/// <c>BadWaitingForInitialData</c>. This was the root cause of FOCAS equipment tags never
|
||||||
|
/// surfacing a value while the probe reported HEALTHY (the probe reads work single-threaded on a
|
||||||
|
/// dev box, but collide deployed once the fixed-tree loop runs concurrently).</para>
|
||||||
|
///
|
||||||
|
/// <para>The gate (<see cref="SemaphoreSlim"/> of count 1) makes each request→response atomic on
|
||||||
|
/// the socket. The per-call timeout ensures a stalled response can never hold the gate — and thus
|
||||||
|
/// the socket — indefinitely; a hung read surfaces as a recoverable error at the configured
|
||||||
|
/// <c>Timeout</c> budget instead of permanent silence. The gate and timeout are paired
|
||||||
|
/// deliberately: a lock around an <i>unbounded</i> read would deadlock all I/O for the device.</para>
|
||||||
|
///
|
||||||
|
/// <para><see cref="ConnectAsync"/> and <see cref="ProbeAsync"/> are serialized but NOT bounded by
|
||||||
|
/// this decorator's call timeout — they carry their own budgets (the connect timeout argument and
|
||||||
|
/// the probe's caller-supplied linked token respectively), and double-bounding would shrink them.</para>
|
||||||
|
/// </remarks>
|
||||||
|
public sealed class SynchronizedFocasClient : IFocasClient
|
||||||
|
{
|
||||||
|
private readonly IFocasClient _inner;
|
||||||
|
private readonly TimeSpan _callTimeout;
|
||||||
|
private readonly SemaphoreSlim _gate = new(1, 1);
|
||||||
|
|
||||||
|
/// <summary>Wraps <paramref name="inner"/> with per-device serialization + a per-call timeout.</summary>
|
||||||
|
/// <param name="inner">The underlying FOCAS client to serialize access to.</param>
|
||||||
|
/// <param name="callTimeout">
|
||||||
|
/// The budget applied to each data read/write. <see cref="TimeSpan.Zero"/> or negative disables
|
||||||
|
/// the per-call timeout (callers' own cancellation tokens still apply).
|
||||||
|
/// </param>
|
||||||
|
public SynchronizedFocasClient(IFocasClient inner, TimeSpan callTimeout)
|
||||||
|
{
|
||||||
|
_inner = inner ?? throw new ArgumentNullException(nameof(inner));
|
||||||
|
_callTimeout = callTimeout;
|
||||||
|
}
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public bool IsConnected => _inner.IsConnected;
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task ConnectAsync(FocasHostAddress address, TimeSpan timeout, CancellationToken cancellationToken) =>
|
||||||
|
RunGatedAsync(ct => _inner.ConnectAsync(address, timeout, ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<bool> ProbeAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunGatedAsync(ct => _inner.ProbeAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<(object? value, uint status)> ReadAsync(
|
||||||
|
FocasAddress address, FocasDataType type, CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.ReadAsync(address, type, ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<uint> WriteAsync(
|
||||||
|
FocasAddress address, FocasDataType type, object? value, CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.WriteAsync(address, type, value, ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<IReadOnlyList<FocasActiveAlarm>> ReadAlarmsAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.ReadAlarmsAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<FocasSysInfo> GetSysInfoAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetSysInfoAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<IReadOnlyList<FocasAxisName>> GetAxisNamesAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetAxisNamesAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<IReadOnlyList<FocasSpindleName>> GetSpindleNamesAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetSpindleNamesAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<FocasDynamicSnapshot> ReadDynamicAsync(int axisIndex, CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.ReadDynamicAsync(axisIndex, ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<FocasProgramInfo> GetProgramInfoAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetProgramInfoAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<FocasTimer> GetTimerAsync(FocasTimerKind kind, CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetTimerAsync(kind, ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<IReadOnlyList<FocasServoLoad>> GetServoLoadsAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetServoLoadsAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<IReadOnlyList<int>> GetSpindleLoadsAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetSpindleLoadsAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<IReadOnlyList<int>> GetSpindleMaxRpmsAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetSpindleMaxRpmsAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public Task<IReadOnlyList<int>> GetPositionFiguresAsync(CancellationToken cancellationToken) =>
|
||||||
|
RunBoundedAsync(ct => _inner.GetPositionFiguresAsync(ct), cancellationToken);
|
||||||
|
|
||||||
|
/// <inheritdoc />
|
||||||
|
public void Dispose()
|
||||||
|
{
|
||||||
|
_inner.Dispose();
|
||||||
|
_gate.Dispose();
|
||||||
|
}
|
||||||
|
|
||||||
|
// Gate only — the caller already governs the budget (connect timeout arg / probe linked token).
|
||||||
|
private async Task<T> RunGatedAsync<T>(Func<CancellationToken, Task<T>> op, CancellationToken ct)
|
||||||
|
{
|
||||||
|
await _gate.WaitAsync(ct).ConfigureAwait(false);
|
||||||
|
try { return await op(ct).ConfigureAwait(false); }
|
||||||
|
finally { _gate.Release(); }
|
||||||
|
}
|
||||||
|
|
||||||
|
private async Task RunGatedAsync(Func<CancellationToken, Task> op, CancellationToken ct)
|
||||||
|
{
|
||||||
|
await _gate.WaitAsync(ct).ConfigureAwait(false);
|
||||||
|
try { await op(ct).ConfigureAwait(false); }
|
||||||
|
finally { _gate.Release(); }
|
||||||
|
}
|
||||||
|
|
||||||
|
// Gate + per-call timeout. A fired timeout surfaces as OperationCanceledException whose token is
|
||||||
|
// the linked (not the caller's) token — callers distinguish it from real cancellation by testing
|
||||||
|
// their own token's IsCancellationRequested.
|
||||||
|
private async Task<T> RunBoundedAsync<T>(Func<CancellationToken, Task<T>> op, CancellationToken ct)
|
||||||
|
{
|
||||||
|
await _gate.WaitAsync(ct).ConfigureAwait(false);
|
||||||
|
try
|
||||||
|
{
|
||||||
|
if (_callTimeout <= TimeSpan.Zero)
|
||||||
|
return await op(ct).ConfigureAwait(false);
|
||||||
|
|
||||||
|
using var linked = CancellationTokenSource.CreateLinkedTokenSource(ct);
|
||||||
|
linked.CancelAfter(_callTimeout);
|
||||||
|
return await op(linked.Token).ConfigureAwait(false);
|
||||||
|
}
|
||||||
|
finally { _gate.Release(); }
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -228,6 +228,14 @@ builder.Services.AddOtOpcUaHealth();
|
|||||||
builder.Services.AddOtOpcUaObservability(builder.Configuration);
|
builder.Services.AddOtOpcUaObservability(builder.Configuration);
|
||||||
|
|
||||||
var app = builder.Build();
|
var app = builder.Build();
|
||||||
|
|
||||||
|
// AddZbSerilog registers Serilog as the MEL logging provider but does NOT assign the static
|
||||||
|
// Serilog.Log.Logger. Set it from the DI root logger so (1) static Log.* calls like the startup
|
||||||
|
// banner below emit, and (2) Akka.Logger.Serilog's SerilogLogger — which writes to Log.Logger —
|
||||||
|
// routes the actor graph's logs (DriverHostActor et al.) to the configured sinks. Must run before
|
||||||
|
// app.RunAsync() starts the ActorSystem (the Akka logger captures Log.Logger at system start).
|
||||||
|
Serilog.Log.Logger = app.Services.GetRequiredService<Serilog.ILogger>();
|
||||||
|
|
||||||
app.UseSerilogRequestLogging();
|
app.UseSerilogRequestLogging();
|
||||||
|
|
||||||
// Razor class library static assets (_content/<libname>/...) are served via endpoint
|
// Razor class library static assets (_content/<libname>/...) are served via endpoint
|
||||||
|
|||||||
@@ -70,9 +70,10 @@ public sealed class FocasDriverProbeTests
|
|||||||
[Fact]
|
[Fact]
|
||||||
public async Task MalformedHostAddress_Returns_OkFalse_WithNoHostPortMessage()
|
public async Task MalformedHostAddress_Returns_OkFalse_WithNoHostPortMessage()
|
||||||
{
|
{
|
||||||
// "not-a-focas-url" is not a focas:// URL — TryParse returns null.
|
// A foreign URI scheme ("http://…") is rejected by TryParse → null. (A bare
|
||||||
|
// "{ip}[:{port}]" without a scheme is now tolerated, so it can't be the malformed case.)
|
||||||
var result = await Probe.ProbeAsync(
|
var result = await Probe.ProbeAsync(
|
||||||
"{\"devices\":[{\"hostAddress\":\"not-a-focas-url\"}]}",
|
"{\"devices\":[{\"hostAddress\":\"http://10.0.0.5/\"}]}",
|
||||||
TimeSpan.FromSeconds(3),
|
TimeSpan.FromSeconds(3),
|
||||||
TestContext.Current.CancellationToken);
|
TestContext.Current.CancellationToken);
|
||||||
|
|
||||||
|
|||||||
@@ -38,6 +38,25 @@ public sealed class FocasFactoryConfigTests
|
|||||||
drv.Options.FixedTree.TimerPollInterval.ShouldBe(TimeSpan.FromSeconds(30));
|
drv.Options.FixedTree.TimerPollInterval.ShouldBe(TimeSpan.FromSeconds(30));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// <summary>
|
||||||
|
/// The AdminUI persists FocasCncSeries as its integer value (e.g. <c>"series":6</c> = Thirty_i) —
|
||||||
|
/// a bare JSON number. The factory must tolerate it (via FlexibleStringConverter) and build the
|
||||||
|
/// real driver, not throw + fall back to a stub. Regression for the 2026-06-26 wonder data-plane
|
||||||
|
/// deploy where the driver stubbed on "Cannot get the value of a token type 'Number' as a string".
|
||||||
|
/// </summary>
|
||||||
|
[Fact]
|
||||||
|
public void CreateInstance_accepts_numeric_Series_from_AdminUI_serialization()
|
||||||
|
{
|
||||||
|
const string json = """
|
||||||
|
{"Backend":"wire","series":6,"devices":[{"hostAddress":"10.0.0.5:8193","deviceName":"Makino","series":6,"positionDecimalPlaces":0}]}
|
||||||
|
""";
|
||||||
|
|
||||||
|
var drv = FocasDriverFactoryExtensions.CreateInstance("drv-1", json);
|
||||||
|
|
||||||
|
drv.Options.Devices.ShouldHaveSingleItem();
|
||||||
|
drv.Options.Devices[0].Series.ShouldBe(FocasCncSeries.Thirty_i);
|
||||||
|
}
|
||||||
|
|
||||||
/// <summary>Verifies that the AlarmProjection configuration section is mapped to driver options.</summary>
|
/// <summary>Verifies that the AlarmProjection configuration section is mapped to driver options.</summary>
|
||||||
[Fact]
|
[Fact]
|
||||||
public void CreateInstance_maps_AlarmProjection_section_onto_options()
|
public void CreateInstance_maps_AlarmProjection_section_onto_options()
|
||||||
|
|||||||
@@ -0,0 +1,207 @@
|
|||||||
|
using System.Diagnostics;
|
||||||
|
using Shouldly;
|
||||||
|
using Xunit;
|
||||||
|
using ZB.MOM.WW.OtOpcUa.Driver.FOCAS;
|
||||||
|
|
||||||
|
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests;
|
||||||
|
|
||||||
|
/// <summary>
|
||||||
|
/// Coverage for the FOCAS data-plane fix (2026-06-25 equipment-tag investigation): all wire I/O
|
||||||
|
/// on a device's single FOCAS/2 socket must be serialized (request→response cannot interleave)
|
||||||
|
/// and every steady-state read/write must be time-bounded so a stalled CNC read surfaces as a
|
||||||
|
/// recoverable error instead of hanging forever at BadWaitingForInitialData. See
|
||||||
|
/// <c>docs/plans/2026-06-25-otopcua-equipment-dataplane-investigation.md</c>.
|
||||||
|
/// </summary>
|
||||||
|
[Trait("Category", "Unit")]
|
||||||
|
public sealed class FocasIoSerializationTests
|
||||||
|
{
|
||||||
|
private static readonly FocasAddress Macro500 = new(FocasAreaKind.Macro, null, 500, null);
|
||||||
|
|
||||||
|
// ---- SynchronizedFocasClient: serialization ----
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task Concurrent_reads_are_serialized_onto_the_inner_client()
|
||||||
|
{
|
||||||
|
var inner = new RecordingClient { ReadDelay = TimeSpan.FromMilliseconds(20) };
|
||||||
|
await using var _ = NoopDispose(inner);
|
||||||
|
var client = new SynchronizedFocasClient(inner, TimeSpan.FromSeconds(5));
|
||||||
|
|
||||||
|
var reads = Enumerable.Range(0, 8)
|
||||||
|
.Select(_ => client.ReadAsync(Macro500, FocasDataType.Float64, CancellationToken.None));
|
||||||
|
await Task.WhenAll(reads);
|
||||||
|
|
||||||
|
inner.MaxConcurrency.ShouldBe(1); // never more than one wire op on the socket at a time
|
||||||
|
inner.ReadCount.ShouldBe(8);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ---- SynchronizedFocasClient: per-call timeout ----
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task A_hung_read_is_bounded_by_the_call_timeout()
|
||||||
|
{
|
||||||
|
var inner = new RecordingClient { BlockReadUntilCancelled = true };
|
||||||
|
var client = new SynchronizedFocasClient(inner, TimeSpan.FromMilliseconds(100));
|
||||||
|
|
||||||
|
var sw = Stopwatch.StartNew();
|
||||||
|
await Should.ThrowAsync<OperationCanceledException>(
|
||||||
|
() => client.ReadAsync(Macro500, FocasDataType.Float64, CancellationToken.None));
|
||||||
|
sw.Stop();
|
||||||
|
|
||||||
|
sw.Elapsed.ShouldBeLessThan(TimeSpan.FromSeconds(2)); // bounded, not the indefinite OS TCP wait
|
||||||
|
}
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task A_hung_read_does_not_hold_the_socket_for_the_next_call()
|
||||||
|
{
|
||||||
|
// The gate must be released when a bounded call times out, otherwise one stall would wedge
|
||||||
|
// every subsequent op on the device. Read #1 hangs (times out); read #2 must still proceed.
|
||||||
|
var inner = new TimeoutThenServeClient { FirstCallBlocks = true };
|
||||||
|
var client = new SynchronizedFocasClient(inner, TimeSpan.FromMilliseconds(100));
|
||||||
|
|
||||||
|
await Should.ThrowAsync<OperationCanceledException>(
|
||||||
|
() => client.ReadAsync(Macro500, FocasDataType.Float64, CancellationToken.None));
|
||||||
|
|
||||||
|
var (value, status) = await client.ReadAsync(Macro500, FocasDataType.Float64, CancellationToken.None);
|
||||||
|
status.ShouldBe(FocasStatusMapper.Good);
|
||||||
|
value.ShouldBe(42);
|
||||||
|
}
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task Probe_is_not_bounded_by_the_call_timeout()
|
||||||
|
{
|
||||||
|
// Connect/Probe carry their own budgets; the decorator must not shrink them to its read budget.
|
||||||
|
var inner = new RecordingClient { ProbeDelay = TimeSpan.FromMilliseconds(200) };
|
||||||
|
var client = new SynchronizedFocasClient(inner, TimeSpan.FromMilliseconds(50));
|
||||||
|
|
||||||
|
var result = await client.ProbeAsync(CancellationToken.None);
|
||||||
|
|
||||||
|
result.ShouldBeTrue();
|
||||||
|
}
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task Zero_call_timeout_disables_the_per_call_bound()
|
||||||
|
{
|
||||||
|
var inner = new RecordingClient { ReadDelay = TimeSpan.FromMilliseconds(120) };
|
||||||
|
var client = new SynchronizedFocasClient(inner, TimeSpan.Zero);
|
||||||
|
|
||||||
|
var (value, status) = await client.ReadAsync(Macro500, FocasDataType.Float64, CancellationToken.None);
|
||||||
|
|
||||||
|
status.ShouldBe(FocasStatusMapper.Good);
|
||||||
|
value.ShouldBe(42);
|
||||||
|
}
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public void Dispose_disposes_the_inner_client()
|
||||||
|
{
|
||||||
|
var inner = new RecordingClient();
|
||||||
|
var client = new SynchronizedFocasClient(inner, TimeSpan.FromSeconds(1));
|
||||||
|
|
||||||
|
client.Dispose();
|
||||||
|
|
||||||
|
inner.DisposeCount.ShouldBe(1);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ---- Driver level: a timed-out read overwrites the seed with a recoverable status ----
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task Driver_read_that_times_out_returns_BadCommunicationError_not_a_hang()
|
||||||
|
{
|
||||||
|
var factory = new FakeFocasClientFactory { Customise = () => new RecordingClient { BlockReadUntilCancelled = true } };
|
||||||
|
var drv = new FocasDriver(new FocasDriverOptions
|
||||||
|
{
|
||||||
|
Devices = [new FocasDeviceOptions("focas://10.0.0.5:8193")],
|
||||||
|
Tags = [new FocasTagDefinition("CustomVar", "focas://10.0.0.5:8193", "MACRO:500", FocasDataType.Float64)],
|
||||||
|
Probe = new FocasProbeOptions { Enabled = false },
|
||||||
|
Timeout = TimeSpan.FromMilliseconds(150),
|
||||||
|
}, "drv-1", factory);
|
||||||
|
await drv.InitializeAsync("{}", CancellationToken.None);
|
||||||
|
|
||||||
|
var sw = Stopwatch.StartNew();
|
||||||
|
var snap = (await drv.ReadAsync(["CustomVar"], CancellationToken.None)).Single();
|
||||||
|
sw.Stop();
|
||||||
|
|
||||||
|
snap.StatusCode.ShouldBe(FocasStatusMapper.BadCommunicationError);
|
||||||
|
sw.Elapsed.ShouldBeLessThan(TimeSpan.FromSeconds(2)); // bounded by Timeout, not hung
|
||||||
|
}
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task Driver_read_does_not_propagate_a_call_timeout_as_cancellation()
|
||||||
|
{
|
||||||
|
// The per-call timeout must NOT bubble out of ReadAsync as OperationCanceledException — that
|
||||||
|
// would abort the whole poll batch. It must be caught and turned into a per-tag Bad status.
|
||||||
|
var factory = new FakeFocasClientFactory { Customise = () => new RecordingClient { BlockReadUntilCancelled = true } };
|
||||||
|
var drv = new FocasDriver(new FocasDriverOptions
|
||||||
|
{
|
||||||
|
Devices = [new FocasDeviceOptions("focas://10.0.0.5:8193")],
|
||||||
|
Tags = [new FocasTagDefinition("CustomVar", "focas://10.0.0.5:8193", "MACRO:500", FocasDataType.Float64)],
|
||||||
|
Probe = new FocasProbeOptions { Enabled = false },
|
||||||
|
Timeout = TimeSpan.FromMilliseconds(120),
|
||||||
|
}, "drv-1", factory);
|
||||||
|
await drv.InitializeAsync("{}", CancellationToken.None);
|
||||||
|
|
||||||
|
// Should complete (not throw) with a Bad snapshot, even though the caller's token is never cancelled.
|
||||||
|
var snaps = await drv.ReadAsync(["CustomVar"], CancellationToken.None);
|
||||||
|
snaps.Single().StatusCode.ShouldBe(FocasStatusMapper.BadCommunicationError);
|
||||||
|
}
|
||||||
|
|
||||||
|
private static DisposeGuard NoopDispose(IDisposable d) => new(d);
|
||||||
|
|
||||||
|
private sealed class DisposeGuard(IDisposable inner) : IAsyncDisposable
|
||||||
|
{
|
||||||
|
public ValueTask DisposeAsync() { inner.Dispose(); return ValueTask.CompletedTask; }
|
||||||
|
}
|
||||||
|
|
||||||
|
/// <summary>Fake that records concurrency + optionally delays/blocks reads and probes.</summary>
|
||||||
|
private class RecordingClient : FakeFocasClient
|
||||||
|
{
|
||||||
|
private int _current;
|
||||||
|
public int MaxConcurrency;
|
||||||
|
public int ReadCount;
|
||||||
|
public TimeSpan ReadDelay = TimeSpan.Zero;
|
||||||
|
public bool BlockReadUntilCancelled;
|
||||||
|
public TimeSpan ProbeDelay = TimeSpan.Zero;
|
||||||
|
|
||||||
|
public override async Task<(object? value, uint status)> ReadAsync(
|
||||||
|
FocasAddress address, FocasDataType type, CancellationToken ct)
|
||||||
|
{
|
||||||
|
Interlocked.Increment(ref ReadCount);
|
||||||
|
var observed = Interlocked.Increment(ref _current);
|
||||||
|
InterlockedMax(ref MaxConcurrency, observed);
|
||||||
|
try
|
||||||
|
{
|
||||||
|
if (BlockReadUntilCancelled) await Task.Delay(Timeout.Infinite, ct).ConfigureAwait(false);
|
||||||
|
else if (ReadDelay > TimeSpan.Zero) await Task.Delay(ReadDelay, ct).ConfigureAwait(false);
|
||||||
|
return ((object?)42, FocasStatusMapper.Good);
|
||||||
|
}
|
||||||
|
finally { Interlocked.Decrement(ref _current); }
|
||||||
|
}
|
||||||
|
|
||||||
|
public override async Task<bool> ProbeAsync(CancellationToken ct)
|
||||||
|
{
|
||||||
|
if (ProbeDelay > TimeSpan.Zero) await Task.Delay(ProbeDelay, ct).ConfigureAwait(false);
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
|
||||||
|
private static void InterlockedMax(ref int target, int value)
|
||||||
|
{
|
||||||
|
int seen;
|
||||||
|
do { seen = Volatile.Read(ref target); if (value <= seen) return; }
|
||||||
|
while (Interlocked.CompareExchange(ref target, value, seen) != seen);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// <summary>First read blocks until cancelled; subsequent reads serve a Good value immediately.</summary>
|
||||||
|
private sealed class TimeoutThenServeClient : FakeFocasClient
|
||||||
|
{
|
||||||
|
public bool FirstCallBlocks;
|
||||||
|
private int _calls;
|
||||||
|
|
||||||
|
public override async Task<(object? value, uint status)> ReadAsync(
|
||||||
|
FocasAddress address, FocasDataType type, CancellationToken ct)
|
||||||
|
{
|
||||||
|
var n = Interlocked.Increment(ref _calls);
|
||||||
|
if (n == 1 && FirstCallBlocks) await Task.Delay(Timeout.Infinite, ct).ConfigureAwait(false);
|
||||||
|
return ((object?)42, FocasStatusMapper.Good);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -245,6 +245,32 @@ public sealed class FocasReadWriteTests
|
|||||||
/// <summary>Verifies that cancellation signals are propagated.</summary>
|
/// <summary>Verifies that cancellation signals are propagated.</summary>
|
||||||
[Fact]
|
[Fact]
|
||||||
public async Task Cancellation_propagates()
|
public async Task Cancellation_propagates()
|
||||||
|
{
|
||||||
|
var (drv, factory) = NewDriver(
|
||||||
|
new FocasTagDefinition("X", "focas://10.0.0.5:8193", "R100", FocasDataType.Byte));
|
||||||
|
await drv.InitializeAsync("{}", CancellationToken.None);
|
||||||
|
using var cts = new CancellationTokenSource();
|
||||||
|
cts.Cancel();
|
||||||
|
factory.Customise = () => new FakeFocasClient
|
||||||
|
{
|
||||||
|
ThrowOnRead = true,
|
||||||
|
Exception = new OperationCanceledException(cts.Token),
|
||||||
|
};
|
||||||
|
|
||||||
|
// A CANCELLATION of the caller's token must propagate (abort the read). This is distinct
|
||||||
|
// from a per-call timeout — an OCE raised while the caller's token is still live is swallowed
|
||||||
|
// to a per-tag BadCommunicationError (see Swallows_a_spurious_read_OCE_when_caller_not_cancelled).
|
||||||
|
await Should.ThrowAsync<OperationCanceledException>(
|
||||||
|
() => drv.ReadAsync(["X"], cts.Token));
|
||||||
|
}
|
||||||
|
|
||||||
|
/// <summary>
|
||||||
|
/// An OperationCanceledException from the wire read while the CALLER'S token is NOT cancelled
|
||||||
|
/// (e.g. a per-call timeout firing) must be turned into a per-tag BadCommunicationError, not
|
||||||
|
/// propagated — otherwise one stalled tag would abort the whole poll batch.
|
||||||
|
/// </summary>
|
||||||
|
[Fact]
|
||||||
|
public async Task Swallows_a_spurious_read_OCE_when_caller_not_cancelled()
|
||||||
{
|
{
|
||||||
var (drv, factory) = NewDriver(
|
var (drv, factory) = NewDriver(
|
||||||
new FocasTagDefinition("X", "focas://10.0.0.5:8193", "R100", FocasDataType.Byte));
|
new FocasTagDefinition("X", "focas://10.0.0.5:8193", "R100", FocasDataType.Byte));
|
||||||
@@ -255,8 +281,8 @@ public sealed class FocasReadWriteTests
|
|||||||
Exception = new OperationCanceledException(),
|
Exception = new OperationCanceledException(),
|
||||||
};
|
};
|
||||||
|
|
||||||
await Should.ThrowAsync<OperationCanceledException>(
|
var snap = (await drv.ReadAsync(["X"], CancellationToken.None)).Single();
|
||||||
() => drv.ReadAsync(["X"], CancellationToken.None));
|
snap.StatusCode.ShouldBe(FocasStatusMapper.BadCommunicationError);
|
||||||
}
|
}
|
||||||
|
|
||||||
/// <summary>Verifies that ShutdownAsync disposes the client.</summary>
|
/// <summary>Verifies that ShutdownAsync disposes the client.</summary>
|
||||||
|
|||||||
@@ -20,6 +20,9 @@ public sealed class FocasScaffoldingTests
|
|||||||
[InlineData("focas://cnc-01.factory.internal:8193", "cnc-01.factory.internal", 8193)]
|
[InlineData("focas://cnc-01.factory.internal:8193", "cnc-01.factory.internal", 8193)]
|
||||||
[InlineData("focas://10.0.0.5:12345", "10.0.0.5", 12345)]
|
[InlineData("focas://10.0.0.5:12345", "10.0.0.5", 12345)]
|
||||||
[InlineData("FOCAS://10.0.0.5:8193", "10.0.0.5", 8193)] // case-insensitive scheme
|
[InlineData("FOCAS://10.0.0.5:8193", "10.0.0.5", 8193)] // case-insensitive scheme
|
||||||
|
[InlineData("10.201.31.5:8193", "10.201.31.5", 8193)] // scheme-less (AdminUI-persisted form)
|
||||||
|
[InlineData("10.0.0.5", "10.0.0.5", 8193)] // scheme-less, default port
|
||||||
|
[InlineData("cnc-01.factory.internal:8193", "cnc-01.factory.internal", 8193)] // scheme-less hostname
|
||||||
public void HostAddress_parses_valid(string input, string host, int port)
|
public void HostAddress_parses_valid(string input, string host, int port)
|
||||||
{
|
{
|
||||||
var parsed = FocasHostAddress.TryParse(input);
|
var parsed = FocasHostAddress.TryParse(input);
|
||||||
@@ -224,9 +227,11 @@ public sealed class FocasScaffoldingTests
|
|||||||
[Fact]
|
[Fact]
|
||||||
public async Task InitializeAsync_malformed_address_faults()
|
public async Task InitializeAsync_malformed_address_faults()
|
||||||
{
|
{
|
||||||
|
// A non-focas:// URI scheme is rejected by TryParse (a bare "{ip}[:{port}]" is now
|
||||||
|
// tolerated, so the malformed case must carry a foreign scheme).
|
||||||
var drv = new FocasDriver(new FocasDriverOptions
|
var drv = new FocasDriver(new FocasDriverOptions
|
||||||
{
|
{
|
||||||
Devices = [new FocasDeviceOptions("not-an-address")],
|
Devices = [new FocasDeviceOptions("http://10.0.0.5/")],
|
||||||
}, "drv-1");
|
}, "drv-1");
|
||||||
|
|
||||||
await Should.ThrowAsync<InvalidOperationException>(
|
await Should.ThrowAsync<InvalidOperationException>(
|
||||||
|
|||||||
Reference in New Issue
Block a user