lmxopcua/docs/v2/implementation/focas-isolation-plan.md
Joseph Doherty a6be2f77b5 FOCAS version-matrix stabilization (PR 1 of #220 split) — ship the cheap half of the hardware-free stability gap ahead of the Tier-C out-of-process split. Without any CNC or simulator on the bench, the highest-leverage move is to catch operator config errors at init time instead of at steady-state per-read. Adds FocasCncSeries enum (Unknown/16i/0i-D/0i-F family/30i family/PowerMotion-i) + FocasCapabilityMatrix static class that encodes the per-series documented ranges for macro variables (cnc_rdmacro/wrmacro), parameters (cnc_rdparam/wrparam), and PMC letters + byte ceilings (pmc_rdpmcrng/wrpmcrng) straight from the Fanuc FOCAS Developer Kit. FocasDeviceOptions gains a Series knob (defaults Unknown = permissive so pre-matrix configs don't break on upgrade). FocasDriver.InitializeAsync now calls FocasAddress.TryParse on every tag + runs FocasCapabilityMatrix.Validate against the owning device's declared series, throwing InvalidOperationException with a reason string that names both the series and the documented limit ("Parameter #30000 is outside the documented range [0, 29999] for Thirty_i") so an operator can tell whether the mismatch is in the config or in their declared CNC model. Unknown series skips validation entirely. Ships 46 new theory cases in FocasCapabilityMatrixTests.cs — covering every boundary in the matrix (widen 16i->0i-F: macro ceiling 999->9999, param 9999->14999; widen 0i-F->30i: PMC letters +K+T; PMC-number 16i=999/0i-D=1999/0i-F=9999/30i=59999), permissive Unknown-series behavior, rejection-message content, and case-insensitive PMC-letter matching. Widening a range without updating docs/v2/focas-version-matrix.md fails a test because every InlineData cites the row it reflects. Full FOCAS test suite stays at 165/165 passing (119 existing + 46 new). 
Also authors docs/v2/focas-version-matrix.md as the authoritative range reference with per-function citations, CNC-series era context, error-surface shape, and the link back to the matrix code; docs/v2/implementation/focas-isolation-plan.md as the multi-PR plan for #220 Tier-C isolation (Shared contracts -> Host skeleton -> move Fwlib32 calls -> Supervisor+respawn -> MMF+ops glue, 2200-3200 LOC across 5 PRs mirroring the Galaxy Tier-C topology); and promotes docs/drivers/FOCAS-Test-Fixture.md from "version-matrix coverage = no" to explicit coverage via the new test file + cross-links to the matrix and isolation-plan docs. Leaves task #220 open since isolation itself (the expensive half) is still ahead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:44:37 -04:00


FOCAS Tier-C isolation — plan for task #220

Status: DRAFT — not yet started. Tracks the multi-PR work to move Fwlib32.dll behind an out-of-process host, mirroring the Galaxy Tier-C split in phase-2-galaxy-out-of-process.md.

Pre-reqs shipped (this PR): version matrix + pre-flight validation + unit tests. Those close the cheap half of the hardware-free stability gap. Tier-C closes the expensive half.

Why isolate

Fwlib32.dll is a proprietary Fanuc library with no source, no symbols, and a documented habit of crashing the hosting process on network errors, on malformed responses, and during handle recycling. Today the FOCAS driver runs in-process with the OPC UA server, so a crash inside the Fanuc DLL takes every driver down with it, including ones that have nothing to do with FOCAS. Galaxy has the same class of problem and solved it with the Tier-C pattern (host service + proxy driver + named-pipe IPC); FOCAS should follow that playbook.

Topology (target)

+-------------------------------------+         +--------------------------+
|  OtOpcUa.Server (.NET 10 x64)       |         | OtOpcUaFocasHost         |
|                                     |  pipe   | (.NET 4.8 x86 Windows    |
|  ZB.MOM.WW.OtOpcUa.Driver.FOCAS     | <-----> |  service)                |
|    - FocasProxyDriver (in-proc)     |         |                          |
|    - supervisor / respawn / BackPr  |         |  Fwlib32.dll + session   |
|                                     |         |  handles + STA thread    |
+-------------------------------------+         +--------------------------+

Why .NET 4.8 x86 for the host: Fwlib32.dll ships as 32-bit only. The Galaxy.Host is already .NET 4.8 x86 for the same reason (MXAccess COM bitness), so the NSSM wrapper pattern transfers directly.

Three new projects

| Project | TFM | Role |
| --- | --- | --- |
| ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Shared | netstandard2.0 | MessagePack DTOs: FocasReadRequest, FocasReadResponse, FocasSubscribeRequest, FocasPmcBitWriteRequest, etc. Same assembly referenced by .NET 10 + .NET 4.8 so the wire format stays identical. |
| ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host | net48 x86 | Windows service. Owns the Fwlib32 session handles + STA thread + handle-recycling loop. Pipe server + per-call auth (same ACL + caller SID + shared secret pattern as Galaxy.Host). |
| ZB.MOM.WW.OtOpcUa.Driver.FOCAS (existing) | net10.0 | Collapses to a proxy that forwards each IReadable / IWritable / ISubscribable call over the pipe. FocasCapabilityMatrix + FocasAddress stay here; pre-flight runs before any IPC. |
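The ordering in the last row (pre-flight first, IPC second) can be sketched as follows. This is a minimal illustration in Python rather than the driver's C#, and it is not the actual FocasCapabilityMatrix code. The parameter ceilings and the rejection message come from the commit message for this change; the lower bound of 0 for every series, the series spellings other than Thirty_i, and the names RecordingPipe and initialize are assumptions made for the sketch.

```python
# Parameter ceilings as quoted in the commit message; the lower bound of 0
# for every series and the key spellings "Sixteen_i" / "Zero_i_F" are
# assumptions made for this sketch.
PARAM_RANGES = {
    "Sixteen_i": (0, 9999),
    "Zero_i_F": (0, 14999),
    "Thirty_i": (0, 29999),
}


def validate_parameter(series, number):
    """Return None when the parameter is allowed, else a reason string."""
    if series == "Unknown":
        return None  # permissive default: Unknown skips validation entirely
    low, high = PARAM_RANGES[series]
    if low <= number <= high:
        return None
    return (f"Parameter #{number} is outside the documented range "
            f"[{low}, {high}] for {series}")


class RecordingPipe:
    """Hypothetical stand-in for the named pipe; records what was sent."""

    def __init__(self):
        self.sent = []

    def send(self, request):
        self.sent.append(request)


def initialize(pipe, series, parameter_numbers):
    """Validate every configured parameter, then open the session over IPC."""
    for number in parameter_numbers:
        reason = validate_parameter(series, number)
        if reason is not None:
            # Fail before any request reaches the Host process.
            raise ValueError(reason)
    pipe.send(("open_session", series))
```

With this ordering a misconfigured tag fails at init time without the Host ever seeing a request, which keeps config errors cheap to diagnose even when the Host is down.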

Supervisor responsibilities (in the Proxy)

Mirrors Galaxy.Proxy 1:1:

  1. Start the Host process on first InitializeAsync (NSSM-wrapped service in production, direct spawn in dev) + heartbeat every 5s.
  2. If heartbeat misses 3× in a row, fan out BadCommunicationError to every subscription and respawn with exponential backoff (1s / 2s / 4s / max 30s).
  3. Crash-loop circuit breaker: 5 respawns in 60s → drop to BadDeviceFailure steady state until operator resets.
  4. Post-mortem MMF: on Host exit, Host writes its last-N operations + session state to an MMF the Proxy reads to log context.
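The respawn policy in items 2 and 3 can be sketched as two small pieces: a capped doubling backoff and a sliding-window crash-loop breaker. This is an illustrative Python sketch, not the Galaxy.Proxy implementation. The names backoff_delays and CrashLoopBreaker are hypothetical, and the reading that the breaker trips on the fifth respawn inside the window is an assumption.

```python
from collections import deque


def backoff_delays(base=1.0, cap=30.0):
    """Yield respawn delays that double from `base` and never exceed `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


class CrashLoopBreaker:
    """Trips when `limit` respawns fall inside a sliding `window` of seconds."""

    def __init__(self, limit=5, window=60.0):
        self.limit = limit
        self.window = window
        self.respawns = deque()
        self.tripped = False

    def record_respawn(self, now):
        """Record a respawn at time `now`; return True if respawning may go on."""
        if self.tripped:
            return False
        self.respawns.append(now)
        # Forget respawns that have aged out of the window.
        while self.respawns and now - self.respawns[0] >= self.window:
            self.respawns.popleft()
        if len(self.respawns) >= self.limit:
            # Assumed behaviour: stay tripped until an operator resets.
            self.tripped = True
        return not self.tripped

    def reset(self):
        """Operator reset: clear the history and allow respawns again."""
        self.respawns.clear()
        self.tripped = False
```

Passing the clock in as `now` rather than reading it inside the class keeps the breaker testable without sleeping, which matters for the hardware-free test strategy described later in this plan.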

IPC surface (approximate)

Every FocasDriver method that today calls into Fwlib32 directly becomes an ExecuteAsync call with a typed request:

| Today (in-process) | Tier-C (IPC) | Notes |
| --- | --- | --- |
| FocasTagReader.Read(tag) | client.Execute(new FocasReadRequest(session, address)) | |
| FocasTagWriter.Write(tag, value) | client.Execute(new FocasWriteRequest(...)) | |
| FocasPmcBitRmw.Write(tag, bit, value) | client.Execute(new FocasPmcBitWriteRequest(...)) | RMW happens in Host so the critical section stays on one process |
| FocasConnectivityProbe.ProbeAsync | client.Execute(new FocasProbeRequest()) | |
| FocasSubscriber.Subscribe(tags) | client.Execute(new FocasSubscribeRequest(tags)) | Host owns the poll loop + streams changes back as FocasDataChangedNotification over the pipe |
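Why the PMC bit write stays in the Host is easier to see in code: the read, the bit change, and the write-back have to run under one lock, and a lock only protects callers inside the same process. The sketch below is illustrative Python, not the FocasPmcBitRmw implementation. PmcByteStore is a hypothetical stand-in for PMC memory reached through Fwlib32, and the string address format is an assumption.

```python
import threading


class PmcByteStore:
    """Hypothetical stand-in for PMC bytes the Host reaches through Fwlib32."""

    def __init__(self):
        self._bytes = {}

    def read(self, address):
        return self._bytes.get(address, 0)

    def write(self, address, value):
        self._bytes[address] = value & 0xFF


class PmcBitWriter:
    """Read-modify-write of a single bit, serialised by one in-process lock."""

    def __init__(self, store):
        self._store = store
        self._lock = threading.Lock()

    def write_bit(self, address, bit, value):
        if not 0 <= bit <= 7:
            raise ValueError("bit index must be in 0..7")
        with self._lock:
            # Read, modify, and write back without letting another
            # writer interleave between the steps.
            current = self._store.read(address)
            mask = 1 << bit
            updated = (current | mask) if value else (current & ~mask)
            self._store.write(address, updated)
            return updated & 0xFF
```

If the Proxy did the read and the write as two separate IPC calls, two writers targeting different bits of the same byte could overwrite each other's change; keeping the whole sequence in the Host avoids that.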

Subscription streaming is the non-obvious piece: the Host polls on its own timer + pushes change notifications so the Proxy doesn't round-trip per poll. Matches Driver.Galaxy.Host subscription forwarding.
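One poll cycle on the Host side might look like the sketch below: read every subscribed tag, compare against the last value seen, and emit a notification only for the tags that changed. This is a hedged Python illustration, not the Driver.Galaxy.Host forwarding code. The function name poll_once and the (tag, value) tuple shape are assumptions standing in for FocasDataChangedNotification.

```python
def poll_once(read_tag, tags, last_values):
    """Read each subscribed tag once and return the (tag, value) pairs that changed.

    `read_tag` is a callable standing in for the Fwlib32 read; `last_values`
    is a dict the caller keeps between polls and that this function updates.
    """
    changes = []
    for tag in tags:
        value = read_tag(tag)
        if tag not in last_values or last_values[tag] != value:
            last_values[tag] = value
            changes.append((tag, value))
    return changes
```

The Host would call this on its own timer and push the returned list over the pipe, so an unchanged tag costs no pipe traffic at all.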

PR sequence (proposed)

  1. PR A: shared contracts. Create Driver.FOCAS.Shared with the MessagePack DTOs. No behaviour change. ~200 LOC + round-trip tests for each DTO.
  2. PR B: Host project skeleton. Create Driver.FOCAS.Host .NET 4.8 x86 project, NSSM wrapper, pipe server scaffold with the same ACL + caller-SID + shared secret plumbing as Galaxy.Host. No Fwlib32 wiring yet; returns NotImplemented for everything. ~400 LOC.
  3. PR C: Move Fwlib32 calls into Host. Move FocasNativeSession, FocasTagReader, FocasTagWriter, FocasPmcBitRmw + the STA thread into the Host. Proxy forwards over IPC. This is the biggest PR, probably 800-1500 LOC of move-with-translation. Existing unit tests keep passing because IFocasTagFactory is the DI seam the tests inject against.
  4. PR D: Supervisor + respawn. Proxy-side heartbeat + respawn + crash-loop circuit breaker + BackPressure fan-out on Host death. ~500 LOC + chaos tests.
  5. PR E: Post-mortem MMF + operational glue. MMF writer in Host, reader in Proxy. Install scripts for the new OtOpcUaFocasHost Windows service. Docs. ~300 LOC.

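Total estimate: 2200-3200 LOC across 5 PRs. The per-PR figures above sum to roughly 2200-2900 LOC; the round-trip and chaos tests come on top of that. Consistent with Galaxy Tier-C but narrower since FOCAS has no Historian + no alarm history.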

Testing without hardware

Same constraint as today: no CNC, no simulator. The isolation work itself is verifiable without Fwlib32 actually being called:

  • Pipe contract: PR A's MessagePack round-trip tests cover every DTO.
  • Supervisor: PR D uses a FakeFocasHost stub that can be told to crash, hang, or miss heartbeats. The supervisor's respawn + circuit-breaker behaviour is fully testable against the stub.
  • IPC ACL + auth: reuse the Galaxy.Host's existing test harness pattern — negative tests attempt to connect as the wrong user and assert rejection.
  • Fwlib32 integration itself: still untestable without hardware. When a real CNC becomes available, the smoke tests already scaffolded in tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.IntegrationTests/ run against it via FOCAS_ENDPOINT.
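The supervisor bullet can be made concrete with a stub host and a miss counter. The sketch below is illustrative Python, not the planned PR D code. The class shapes, the mode names, and the choice to model a hang as a missed heartbeat are assumptions; only the name FakeFocasHost and the three-misses threshold come from this plan.

```python
class FakeFocasHost:
    """Stub host whose heartbeat behaviour a test can switch at will."""

    def __init__(self):
        self.mode = "healthy"  # "healthy", "mute", "hung", or "crashed"

    def heartbeat(self):
        if self.mode == "crashed":
            raise ConnectionError("pipe closed")
        # A hung or mute host is modelled as a heartbeat that never answers
        # in time, so the caller sees a miss rather than blocking.
        return self.mode == "healthy"


class HeartbeatMonitor:
    """Counts consecutive misses and reports when a respawn is due."""

    def __init__(self, host, max_misses=3):
        self.host = host
        self.max_misses = max_misses
        self.misses = 0

    def tick(self):
        """Run one heartbeat; return True when the host should be respawned."""
        try:
            ok = self.host.heartbeat()
        except ConnectionError:
            ok = False
        self.misses = 0 if ok else self.misses + 1
        return self.misses >= self.max_misses
```

A chaos test then becomes a script of mode changes and tick calls, with no real process, pipe, or timer involved.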

Decisions to confirm before starting

  • Sharing transport code with Galaxy.Host: should the pipe server + ACL + shared-secret + MMF plumbing go into a common Core.Hosting.Tier-C project both hosts reference? Probably yes; deferred until PR B is drafted because the right abstraction only becomes visible after two uses.
  • Handle-recycling cadence: Fwlib32 session handles leak memory over weeks per the Fanuc-published defect list. Galaxy recycles MXAccess handles on a 24h timer; FOCAS should mirror that, but the trigger point (idle vs scheduled) needs operator input.
  • Per-CNC Host process vs one Host serving N CNCs: one-per-CNC isolates blast radius but scales poorly past ~20 machines; a shared Host scales but one bad CNC can wedge the lot. Start with a shared Host + document the blast-radius trade-off; revisit if operators hit it.

Non-goals

  • Simulator work. open_focas + other OSS FOCAS simulators are untested + not maintained; not worth chasing vs. waiting for real hardware.
  • Changing the public FocasDriverOptions shape beyond what already shipped (the Series knob). Operator config continues to look the same after the split — the Tier-C topology is invisible from appsettings.json.
  • Historian / long-term history integration. FOCAS driver doesn't implement IHistoryProvider + there's no plan to add it.

References