Driver Stability & Isolation — OtOpcUa v2
Status: DRAFT — companion to plan.md. Defines the stability tier model, per-driver hosting decisions, cross-cutting protections every driver process must apply, and the canonical worked example (FOCAS) for the high-risk tier.
Branch: v2
Created: 2026-04-17
Problem Statement
The v2 plan adds eight drivers spanning pure managed code (Modbus, OPC UA Client), wrapped C libraries (libplctag for AB CIP/Legacy, S7netplus for Siemens, Beckhoff.TwinCAT.Ads for ADS), heavy native/COM with thread affinity (Galaxy MXAccess), and black-box vendor DLLs (FANUC Fwlib64.dll for FOCAS).
These do not all carry the same failure profile, but the v1 plan treats them uniformly: every driver runs in-process in the .NET 10 server except Galaxy (isolated only because of its 32-bit COM constraint). This means:
- An `AccessViolationException` from `Fwlib64.dll` — uncatchable by managed code in modern .NET — tears down the whole OPC UA server, all subscriptions, and every other driver with it.
- A native handle leak (a FOCAS `cnc_allclibhndl3` not paired with `cnc_freelibhndl`, or libplctag tag handles not freed) accumulates against the server process, not the driver.
- A thread-affinity bug (calling Fwlib on two threads against the same handle) corrupts state for every other driver sharing the process.
- Polly's circuit breaker handles transient errors; it does nothing for process death or resource exhaustion.
Driver stability needs to be a first-class architectural concern, not a per-driver afterthought.
Stability Tier Model
Every driver is assigned to one of three tiers based on the trust level of its dependency stack:
Tier A — Pure Managed
Drivers whose entire dependency chain is verifiable .NET. Standard exception handling and Polly are sufficient. Run in-process in the main server.
| Driver | Stack | Notes |
|---|---|---|
| Modbus TCP | NModbus (pure managed) | Sockets only |
| OPC UA Client | OPC Foundation .NETStandard SDK (pure managed) | Reference-grade SDK |
Tier B — Wrapped Native, Mature
Drivers that P/Invoke into a mature, well-maintained native library, or use a managed wrapper that has limited native bits (router, transport). Run in-process with the cross-cutting protections from §3 mandatory: SafeHandle for every native resource, memory watchdog, bounded queues. Any driver in this tier may be promoted to Tier C if production data shows leaks or crashes.
| Driver | Stack | Notes |
|---|---|---|
| Siemens S7 | S7netplus (mostly managed) | Sockets + small native helpers |
| AB CIP | libplctag (C library via P/Invoke) | Mature, widely deployed; manages its own threads |
| AB Legacy | libplctag (same as CIP) | Same library, different protocol mode |
| TwinCAT | Beckhoff.TwinCAT.Ads v6 + AmsTcpIpRouter | Mostly managed; native callback pump for ADS notifications |
Tier C — Heavy Native / COM / Thread-Affinity
Drivers whose dependency is a black-box vendor DLL, COM object with apartment requirements, or any code where a fault is likely uncatchable. Run as a separate Windows service behind the Galaxy.Proxy/Host/Shared pattern. A crash isolates to that driver's process; the main server fans out Bad quality on the affected nodes and respawns the host.
| Driver | Stack | Reason for Tier C |
|---|---|---|
| Galaxy | MXAccess COM (.NET 4.8 x86) | Bitness mismatch + COM/STA + long history of native quirks |
| FOCAS | Fwlib64.dll P/Invoke | Black-box vendor DLL, handle-affinity, thread-unsafe per handle, no public SLA |
Cross-Cutting Protections
Two distinct protection sets, scoped by hosting mode rather than applied uniformly. This split exists because process-level signals (RSS watchdog, recycle, kill) act on a process, not a driver — applying them in the shared server process would let a leak in one in-proc driver knock out every other driver, every session, and the OPC UA endpoint. That contradicts the v2 isolation invariant. Process-level protections therefore apply only to isolated host processes (Tier C); in-process drivers (Tier A/B) get a different set of guards that operate at the driver-instance level.
Universal — apply to every driver regardless of tier
SafeHandle for every native resource
Every native handle (the FOCAS handle returned by `cnc_allclibhndl3`, libplctag tag handles, COM IUnknown refs, OS file/socket handles we pass through P/Invoke) is wrapped in a `SafeHandle` subclass with a finalizer that calls the release function (for FOCAS, `cnc_freelibhndl`). This guarantees release even when:
- The owning thread crashes
- A `using` block is bypassed by an exception we forgot to catch
- The driver host process is shutting down ungracefully
`Marshal.ReleaseComObject` calls go through `CriticalFinalizerObject` to honor finalizer ordering during AppDomain unload.
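The wrapper pattern can be sketched in a few lines. `NativeApi.Release` below is a hypothetical stand-in for the real release call (`cnc_freelibhndl`, libplctag's tag destroy, a COM release) so the sketch is runnable without vendor DLLs:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical native API surface; a real driver would P/Invoke the vendor DLL.
internal static class NativeApi
{
    public static int ReleaseCount; // exposed only so the sketch is observable
    public static int Release(IntPtr h) { ReleaseCount++; return 0; }
}

internal sealed class DriverSafeHandle : SafeHandle
{
    public DriverSafeHandle(IntPtr raw) : base(IntPtr.Zero, ownsHandle: true)
        => SetHandle(raw);

    public override bool IsInvalid => handle == IntPtr.Zero;

    // Invoked by Dispose() or, failing that, by the finalizer thread, so the
    // native resource is released even when the owning thread crashed or a
    // using block was bypassed by an uncaught exception.
    protected override bool ReleaseHandle() => NativeApi.Release(handle) == 0;
}
```

Disposing (or finalizing) the wrapper is the single release path; driver code never calls the native free function directly.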
Bounded operation queues (per device, per driver instance)
Every driver-instance/device pairing has a bounded outgoing-operation queue (default 1000 entries). When the queue is full, new operations fail fast with BadResourceUnavailable rather than backing up unboundedly against a slow or dead device. Polly's circuit breaker also opens, surfacing the device-down state to the dashboard.
This prevents the canonical "device went offline → reads pile up → driver eats all RAM" failure mode. Crucially, it operates per device in the in-process case so one stuck device cannot starve another driver's queue or accumulate against the shared server's heap.
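The fail-fast behaviour falls out naturally from a bounded channel: a sketch (names like `DeviceQueue` are illustrative; the real driver maps a rejected enqueue to the `BadResourceUnavailable` StatusCode):

```csharp
using System;
using System.Threading.Channels;

// One instance per driver-instance/device pairing.
public sealed class DeviceQueue<T>
{
    private readonly Channel<T> _channel;

    public DeviceQueue(int capacity = 1000) =>
        _channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            // Writers use TryWrite, so "Wait" never actually blocks them;
            // a full queue just makes TryWrite return false.
            FullMode = BoundedChannelFullMode.Wait
        });

    // false = queue full: caller fails the operation fast instead of
    // backing up unboundedly against a slow or dead device.
    public bool TryEnqueue(T op) => _channel.Writer.TryWrite(op);

    public ChannelReader<T> Reader => _channel.Reader;
}
```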
Crash-loop circuit breaker
If a driver host crashes 3 times within 5 minutes, the supervisor stops respawning, leaves the driver's nodes in Bad quality, raises an operator alert, and starts an escalating cooldown before attempting auto-reset. This balances "unattended sites need recovery without an operator on console" against "don't silently mask a persistent problem."
| Trip sequence | Cooldown before auto-reset |
|---|---|
| First trip | 1 hour |
| Re-trips within 10 min of an auto-reset | 4 hours |
| Re-trips after the 4 h cooldown | 24 hours, manual reset required via Admin UI |
Every trip raises a sticky operator alert that does not auto-clear when the cooldown elapses — only manual acknowledgment clears it. So even if recovery is automatic, "we crash-looped 3 times overnight" stays visible the next morning. The auto-reset path keeps unattended plants running; the sticky alert + 24 h manual-only floor prevents the breaker from becoming a "silent retry forever" mechanism.
For Tier A/B (in-process) drivers, the "crash" being counted is a driver-instance reset (capability-level reinitialization, not a process exit). For Tier C drivers, it's a host process exit.
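The escalating-cooldown schedule above reduces to a small pure function. This sketch uses hypothetical names (`CrashLoopPolicy`, `CooldownFor`) and collapses the re-trip-window bookkeeping into a consecutive trip counter; supervisor wiring and the sticky-alert plumbing are omitted:

```csharp
using System;

public static class CrashLoopPolicy
{
    // A trip within this window of an auto-reset counts as a re-trip.
    public static readonly TimeSpan ReTripWindow = TimeSpan.FromMinutes(10);

    // tripNumber is 1-based: how many times in a row the breaker has opened.
    public static (TimeSpan cooldown, bool manualResetRequired) CooldownFor(int tripNumber) =>
        tripNumber switch
        {
            1 => (TimeSpan.FromHours(1), false),
            2 => (TimeSpan.FromHours(4), false),
            _ => (TimeSpan.FromHours(24), true), // manual reset via Admin UI
        };
}
```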
In-process only (Tier A/B) — driver-instance allocation tracking
In-process drivers cannot be recycled by killing the server process — that would take down every other driver, every session, and the OPC UA endpoint. RSS watchdogs and scheduled recycle therefore do not apply to Tier A/B. Instead, each driver instance is monitored at a finer grain:
- Per-instance allocation tracking: drivers expose a `GetMemoryFootprint()` capability returning bytes attributable to their own caches (symbol cache, subscription items, queued operations). The Core polls this every 30 s and logs growth slope per driver instance.
- Soft-limit on cached state: each driver declares a memory budget for its caches in `DriverConfig`. On breach, the Core asks the driver to flush optional caches (e.g. discard symbol cache, force re-discovery). No process action.
- Escalation rule: if a driver instance's footprint cannot be bounded by cache flushing — or if growth is in opaque allocations the driver can't account for — that driver is a candidate for promotion to Tier C. Process recycle is the only safe leak remediation, and the only way to apply process recycle to a single driver is to give it its own process.
- No process kill on a Tier A/B driver. Ever. The only Core-initiated recovery is asking the driver to reset its own state via `IDriver.Reinitialize()`. If that fails, the driver instance is marked Faulted, its nodes go Bad quality, and the operator is alerted. The server process keeps running for everyone else.
Isolated host only (Tier C) — process-level protections
These act on the host process. They cannot affect any other driver or the main server, because each Tier C driver has its own process.
Per-host memory watchdog
Each host process measures baseline RSS after warm-up (post-discovery, post-first-poll). A monitor thread samples RSS every 30 s and tracks both a multiplier of baseline and an absolute hard ceiling.
| Threshold | Action |
|---|---|
| 1.5× baseline OR baseline + 50 MB (whichever larger) | Log warning, surface in status dashboard |
| 3× baseline OR baseline + 200 MB (whichever larger) | Trigger soft recycle (graceful drain → exit → respawn) |
| 1 GB absolute hard ceiling | Force-kill driver process, supervisor respawns |
| Slope > 2 MB/min sustained 30 min | Treat as leak signal, soft recycle even below absolute threshold |
The "whichever larger" floor prevents spurious triggers when baseline is tiny — a 30 MB FOCAS Host shouldn't recycle at 45 MB just because the multiplier says so. All thresholds are per-driver-type defaults, overridable per-driver-instance in central config. Only valid for isolated hosts — never apply to the main server process.
Heartbeat between proxy and host
The proxy in the main server sends a heartbeat ping to the driver host every 2 s and expects a reply within 1 s. Three consecutive misses → proxy declares the host dead (6 s total detection latency), fans out Bad quality on all of that driver's nodes, and asks the supervisor to respawn.
2 s is fast enough that subscribers on a 1 s OPC UA publishing interval see Bad quality within one or two missed publish cycles, but slow enough that GC pauses (typically <500 ms even on bad days) and Windows pipe scheduling jitter don't generate false positives. The 3-miss tolerance absorbs single-cycle noise.
The heartbeat is on a separate named-pipe channel from the data-plane RPCs so a stuck data-plane operation doesn't mask host death. Cadence and miss-count are tunable per-driver-instance in central config.
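The proxy-side miss counting is small enough to sketch (hypothetical `HeartbeatMonitor`; pipe I/O and the 2 s timer are omitted — at a 2 s cadence, three misses gives the ~6 s detection latency described above):

```csharp
public sealed class HeartbeatMonitor
{
    public const int MissThreshold = 3; // tunable per-driver-instance

    private int _consecutiveMisses;

    public bool HostDead { get; private set; }

    // Any reply within the 1 s deadline resets the counter.
    public void OnReply() => _consecutiveMisses = 0;

    public void OnMissedDeadline()
    {
        if (++_consecutiveMisses >= MissThreshold)
            HostDead = true; // fan out Bad quality, ask the supervisor to respawn
    }
}
```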
Scheduled recycling
Each Tier C host process is recycled on a schedule (default 24 h, configurable per driver type). The recycle is a soft drain → exit → respawn, identical to a watchdog-triggered recycle. Defensive measure against slow leaks that stay below the watchdog thresholds.
Post-mortem log
Each driver process writes a ring buffer of the last 1000 operations to a memory-mapped file (`%ProgramData%\OtOpcUa\driver-postmortem\<driver>.mmf`):

    timestamp | handle/connection ID | operation | args summary | return code | duration
On graceful shutdown, the ring is flushed to a rotating log. On a hard crash (including AV), the supervisor reads the MMF after the corpse is gone and attaches the tail to the crash event reported on the dashboard. Without this, post-mortem of a Fwlib AV is impossible.
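The ring itself is simple. An in-memory sketch of the last-N-operations buffer (hypothetical `PostMortemRing`; the real host backs the slots with the memory-mapped file so the supervisor can read them after a hard crash):

```csharp
using System;
using System.Collections.Generic;

public sealed class PostMortemRing
{
    private readonly string[] _slots;
    private long _next; // monotonically increasing write cursor

    public PostMortemRing(int capacity = 1000) => _slots = new string[capacity];

    // Overwrites the oldest entry once the ring is full.
    public void Record(string line) =>
        _slots[(int)(_next++ % _slots.Length)] = line;

    // Oldest-to-newest tail, as the supervisor would reassemble it post-crash.
    public IReadOnlyList<string> Tail()
    {
        long count = Math.Min(_next, _slots.Length);
        var result = new List<string>((int)count);
        for (long i = _next - count; i < _next; i++)
            result.Add(_slots[(int)(i % _slots.Length)]);
        return result;
    }
}
```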
Out-of-Process Driver Pattern (Generalized)
This is the Galaxy.Proxy/Host/Shared layout from plan.md §3, lifted to a reusable pattern for every Tier C driver. Two new projects per Tier C driver beyond the in-process driver projects:
src/
ZB.MOM.WW.OtOpcUa.Driver.<Name>.Proxy/ # In main server: implements IDriver, forwards over IPC
ZB.MOM.WW.OtOpcUa.Driver.<Name>.Host/ # Separate Windows service: actual driver implementation
ZB.MOM.WW.OtOpcUa.Driver.<Name>.Shared/ # IPC message contracts (.NET Standard 2.0)
Common contract for a Tier C host:
- Hosted as a Windows service with `Microsoft.Extensions.Hosting`
- Named-pipe IPC server (named pipes already established for Galaxy in §3)
- MessagePack-serialized contracts in `<Name>.Shared`
- Heartbeat endpoint on a separate pipe from the data plane
- Memory watchdog runs in-process and triggers `Environment.Exit(2)` on threshold breach
- Post-mortem MMF writer initialized on startup
- Standard supervisor protocol: respawn-with-backoff, crash-loop circuit breaker
Common contract for the proxy in the main server:
- Implements `IDriver` + capability interfaces; forwards every call over IPC
- Owns the heartbeat sender and host liveness state
- Fans out Bad quality on all nodes when the host is declared dead
- Owns the supervisor that respawns the host process
- Exposes host status (Up / Down / Recycling / CircuitOpen) to the status dashboard
IPC Security (mandatory for every Tier C driver)
Named pipes default to allowing connections from any local user. Without explicit ACLs, any process on the host machine that knows the pipe name could connect, bypass the OPC UA server's authentication and authorization layers, and issue reads, writes, or alarm acknowledgments directly against the driver host. This is a real privilege-escalation surface — a service account with no OPC UA permissions could write field values it should never have access to. Every Tier C driver enforces the following:
- Pipe ACL: the host creates the pipe with a `PipeSecurity` ACL that grants `ReadWrite | Synchronize` only to the OtOpcUa server's service principal SID. All other local users — including LocalSystem and Administrators — are explicitly denied. The ACL is set at pipe-creation time so it's atomic with the pipe being listenable.
- Caller identity verification: on each new pipe connection, the host calls `NamedPipeServerStream.GetImpersonationUserName()` (or impersonates and inspects the token) and verifies the connected client's SID matches the configured server service SID. Mismatches are logged and the connection is dropped before any RPC frame is read.
- Per-message authorization context: every RPC frame includes the operation's authenticated OPC UA principal (forwarded by the Core after it has done its own authn/authz). The host treats this as input only — the driver-level authorization (e.g. "is this principal allowed to write Tune attributes?") is performed by the Core, but the host's own audit log records the principal so post-incident attribution is possible.
- No anonymous endpoints: the heartbeat pipe has the same ACL as the data-plane pipe. There are no "open" pipes a generic client can probe.
- Defense-in-depth shared secret: the supervisor generates a per-host-process random secret at spawn time, passes it to both proxy and host via command-line args (or a parent-pipe handshake), and the host requires it on the first frame of every connection. This is belt-and-suspenders for the case where pipe ACLs are misconfigured during deployment.
Configuration: the server service SID is read from appsettings.json (Hosting.ServiceAccountSid) and validated against the actual running identity at startup. Mismatch fails startup loudly rather than producing a silently-insecure pipe.
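The shared-secret handshake above can be sketched with two pure helpers (hypothetical `SpawnSecret`; pipe and process-spawn plumbing omitted). Constant-time comparison avoids leaking the secret through a timing side channel:

```csharp
using System;
using System.Security.Cryptography;

public static class SpawnSecret
{
    // Supervisor mints one fresh secret per host-process spawn and hands it
    // to both proxy and host out-of-band (args or parent-pipe handshake).
    public static string Generate() =>
        Convert.ToBase64String(RandomNumberGenerator.GetBytes(32));

    // Host checks the first frame of every connection before reading any RPC.
    public static bool FirstFrameValid(byte[] presented, byte[] expected) =>
        CryptographicOperations.FixedTimeEquals(presented, expected);
}
```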
For Galaxy, this pattern is retroactively required (the v1 named-pipe IPC predates this contract and must be hardened during the Phase 2 refactor). For FOCAS and any future Tier C driver, IPC security is part of the initial implementation, not an add-on.
Reusability
For Galaxy, this pattern is already specified. For FOCAS, the same three projects appear in §5 below. Future Tier C escalations (e.g. if libplctag develops a stability problem) reuse the same template.
FOCAS — Deep Dive (Canonical Tier C Worked Example)
FOCAS is the most exposed driver in the v2 plan: a black-box vendor DLL (Fwlib64.dll), handle-based API with per-handle thread-affinity, no public stability SLA, and a target market (CNC integrations) where periodic-restart workarounds are common practice. The protections below are not theoretical — every one is a known FOCAS failure mode.
Project Layout
src/
ZB.MOM.WW.OtOpcUa.Driver.Focas.Proxy/ # .NET 10 x64 in main server
ZB.MOM.WW.OtOpcUa.Driver.Focas.Host/ # .NET 10 x64 separate Windows service
ZB.MOM.WW.OtOpcUa.Driver.Focas.Shared/ # .NET Standard 2.0 IPC contracts
ZB.MOM.WW.OtOpcUa.Driver.Focas.TestStub/ # Stub FOCAS server for dev/CI (per test-data-sources.md)
The Host process is the only place Fwlib64.dll is loaded. Every concern below is a Host-internal concern.
Handle Pool
One Fwlib handle per CNC connection. Pool design:
- `FocasHandle : SafeHandle` wraps the integer handle returned by `cnc_allclibhndl3`. Finalizer calls `cnc_freelibhndl`. Use of the handle inside the wrapper goes through `DangerousAddRef`/`DangerousRelease` to prevent finalization mid-call.
- Per-handle lock. Fwlib is thread-unsafe per handle — one mutex per `FocasHandle`; every API call acquires it. Lock fairness is FIFO so polling and write requests don't starve each other.
- Pool size of 1 per CNC by default. FANUC controllers typically allow 4–8 concurrent FOCAS sessions; we don't need parallelism inside one driver-to-CNC link unless profiling shows it. Configurable per device.
- Health probe. A background task issues `cnc_sysinfo` against each handle every 30 s. Failure → release the handle, mark the device disconnected, let normal reconnect logic re-establish.
- TTL. Each handle is forcibly recycled every 6 h (configurable) regardless of health. Defensive against slow Fwlib state corruption.
- Acquire timeout. Handle-lock acquisition has a 10 s timeout. Timeout = treat the handle as wedged, kill it, mark the device disconnected. (Real FOCAS calls have hung indefinitely in production reports.)
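The per-handle lock plus acquire timeout can be sketched as follows. `HandleGate` is a hypothetical name, and a `SemaphoreSlim` stands in for the per-handle mutex (note: `SemaphoreSlim` is not strictly FIFO-fair, so a production version would need a fair wait queue):

```csharp
using System;
using System.Threading;

public sealed class HandleGate
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly TimeSpan _acquireTimeout;

    public HandleGate(TimeSpan? acquireTimeout = null) =>
        _acquireTimeout = acquireTimeout ?? TimeSpan.FromSeconds(10);

    public bool Wedged { get; private set; }

    // Every Fwlib call against this handle funnels through here.
    public T Call<T>(Func<T> fwlibCall)
    {
        if (!_gate.Wait(_acquireTimeout))
        {
            // Acquire timed out: treat the handle as wedged so the pool
            // retires it and marks the device disconnected.
            Wedged = true;
            throw new TimeoutException("FOCAS handle wedged; retiring it");
        }
        try { return fwlibCall(); }
        finally { _gate.Release(); }
    }
}
```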
Thread Serialization
The Host runs a single-threaded scheduler with handle-affinity dispatch: each pending operation is tagged with the target handle, and a dedicated worker thread per handle drains its queue. Two consequences:
- Zero parallel calls into Fwlib for the same handle (correctness).
- A single slow CNC's queue can grow without blocking other CNCs' workers (isolation).
The bounded outgoing queue from §3 is per-handle, not process-global, so one stuck CNC can't starve another's queue capacity.
Memory Watchdog Thresholds (FOCAS-specific)
FOCAS baseline is small (~30–50 MB after discovery on a typical 32-axis machine). Defaults tighter than the global protection — FOCAS workloads should be stable, so any meaningful growth is a leak signal worth acting on early.
| Threshold | Action |
|---|---|
| 1.5× baseline OR baseline + 25 MB (whichever larger) | Warning |
| 2× baseline OR baseline + 75 MB (whichever larger) | Soft recycle |
| 300 MB absolute hard ceiling | Force-kill |
| Slope > 1 MB/min sustained 15 min | Soft recycle |
Same multiplier + floor + hard-ceiling pattern as the global default; tighter ratios and a lower hard ceiling because the workload profile is well-bounded.
Recycle Policy
Soft recycle in the Host distinguishes between operations queued in managed code (safely cancellable) and operations currently inside Fwlib64.dll (not safely cancellable — Fwlib calls have no cancellation mechanism, and freeing a handle while a native call is using it is undefined behavior, exactly the AV path the isolation is meant to prevent).
Sequence:
- Stop accepting new IPC requests (pipe rejects with `BadServerHalted`)
- Cancel queued (not-yet-dispatched) operations: return `BadCommunicationError` to the proxy
- Wait up to 10 s grace for any handle's worker thread to return from its current native call
- For handles whose worker thread returned within grace: call `cnc_freelibhndl` on the handle, dispose `FocasHandle`
- For handles still inside a native call after grace: do NOT call `cnc_freelibhndl` — leave the handle wrapper marked Abandoned, skip clean release. The OS reclaims the file descriptors and TCP sockets when the process exits; the CNC's session count decrements on its own connection timeout (typically 30–60 s)
- Flush the post-mortem ring buffer to disk; record which handles were Abandoned and why
- If any handle was Abandoned → escalate from soft recycle to hard exit: `Environment.Exit(2)` rather than `Environment.Exit(0)`. The supervisor logs this as an unclean recycle and applies the crash-loop circuit breaker to it (an Abandoned handle indicates a wedged Fwlib call, which is the kind of state that justifies treating the recycle as "this driver is in trouble")
- If all handles released cleanly → `Environment.Exit(0)` and the supervisor respawns normally
Recycle triggers (any one):
- Memory watchdog threshold breach
- Scheduled (daily 03:00 local by default)
- Operator command via Admin UI
- Crash-loop circuit breaker fired and reset (manual reset)
Recycle frequency cap: 1/hour. More than that = page operator instead of thrashing.
Why we never free a handle with an active native call
Calling cnc_freelibhndl on a handle while another thread is mid-call inside cnc_* against that same handle is undefined behavior per FANUC's docs (handle is not thread-safe; release races with use). The most likely outcome is an immediate AV inside Fwlib — which is precisely the scenario the entire Tier C isolation is designed to contain. The defensive choice is: if we can't release cleanly within the grace window, accept the handle leak (bounded by process lifetime) and let process exit do what we can't safely do from managed code.
This means a wedged Fwlib call always escalates to process exit. There is no in-process recovery path for a hung native call — the only correct response is to let the process die and have the supervisor start a fresh one.
What Survives a Recycle
| State | Survives? | How |
|---|---|---|
| Subscription set | ✔ | Proxy re-issues subscribe on host startup |
| Last-known values | ✔ (cached in proxy) | Surfaced as Bad quality during recycle window |
| In-flight reads | ✗ | Proxy returns BadCommunicationError; OPC UA client retries |
| In-flight writes | ✗ | Per Polly write-retry policy: NOT auto-retried; OPC UA client decides |
| Handle TTL clocks | ✗ (intentional) | Fresh handles after recycle, fresh TTL |
Recovery Sequence After Crash
- Supervisor detects host exit (heartbeat timeout or process exit code)
- Supervisor reads post-mortem MMF, attaches tail to a crash event
- Proxy fans out Bad quality on all FOCAS device nodes
- Backoff before respawn: 5 s → 15 s → 60 s (capped)
- Spawn new Host process
- Host re-discovers (functional structure is fixed; PMC/macro discovery from central config), re-subscribes
- Quality returns to Good as values arrive
- 3 crashes in 5 minutes → crash-loop circuit opens. Supervisor stops respawning, leaves Bad quality in place, raises operator alert. Manual reset required via Admin UI.
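The backoff step in the sequence above is a small lookup (hypothetical `RespawnBackoff`; the supervisor loop itself is omitted):

```csharp
using System;

public static class RespawnBackoff
{
    private static readonly TimeSpan[] Steps =
    {
        TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(60),
    };

    // attempt is 1-based; attempts beyond the table stay at the 60 s cap.
    public static TimeSpan DelayFor(int attempt) =>
        Steps[Math.Min(Math.Max(attempt, 1), Steps.Length) - 1];
}
```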
Post-Mortem Log Contents (FOCAS-specific)
In addition to the generic last-N-operations ring, the FOCAS Host post-mortem captures:
- Active handle pool snapshot (handle ID, target IP, age, last-call timestamp, consecutive failures)
- Handle health probe history (last 100 results)
- Memory samples (last 60 — 30 minutes at 30 s cadence)
- Recycle history (last 10 recycles with trigger reason)
- Last 50 IPC requests received (for correlating crashes to specific operator actions)
This makes post-mortem of an AccessViolationException actionable — without it, a Fwlib AV is essentially undebuggable.
Test Coverage for FOCAS Stability
There are two distinct test surfaces here, and an earlier draft conflated them. Splitting them honestly:
Surface 1 — Functional protocol coverage via the TCP stub
The Driver.Focas.TestStub (per test-data-sources.md §6) is a TCP listener that mimics a CNC over the FOCAS wire protocol. It can exercise everything that travels over the network:
- Inject network slow — stub adds latency on FOCAS responses, exercising the bounded queue, Polly timeout, and handle-lock acquire timeout
- Inject network hang — stub stops responding mid-call (TCP keeps the socket open but never writes), exercising the per-call grace window and the wedged-handle → hard-exit escalation
- Inject protocol error — stub returns FOCAS error codes (`EW_HANDLE`, `EW_SOCKET`, etc.) at chosen call boundaries, exercising error-code → StatusCode mapping and Polly retry policies
- Inject disconnect — stub closes the TCP socket, exercising the reconnect path and Bad-quality fan-out
This covers the majority of stability paths because most FOCAS failure modes manifest as the network behaving badly — the Fwlib library itself tends to be stable when its CNC behaves; the trouble is that real CNCs misbehave often.
Surface 2 — Native fault injection via a separate shim
Native AVs and native handle leaks cannot be triggered through a TCP stub — they live inside Fwlib64.dll, on the host side of the P/Invoke boundary. Faking them requires a separate mechanism:
- `Driver.Focas.FaultShim` project — a small native DLL named `Fwlib64.dll` (test-only build configuration) that exports the same FOCAS API surface but, instead of calling FANUC's library, performs configurable fault behaviors: deliberately raise an AV at a chosen call site, return success but never release allocated buffers (leak), return success on `cnc_freelibhndl` but keep the handle table populated (orphan handle), etc.
- Activated by binding redirect / DLL search path order in the Host's test fixture only; production builds load FANUC's real `Fwlib64.dll`.
- Tested paths: supervisor respawn after AV, post-mortem MMF readability after hard crash, watchdog → recycle path on simulated leaks, Abandoned-handle path when the shim simulates a wedged native call.
The Host code is unchanged between the two surfaces — it just experiences different symptoms depending on which DLL it loaded. Honest framing of test coverage: the TCP stub covers ~80% of real-world FOCAS failures (network/protocol); the FaultShim covers the remaining ~20% (native crashes/leaks). Hardware/manual testing on a real CNC remains the only validation path for vendor-specific Fwlib quirks that neither stub can predict.
Galaxy — Deep Dive (Tier C, COM/STA Worked Example)
Galaxy is the second Tier C driver and the only one bound to .NET 4.8 x86 (MXAccess COM has no 64-bit variant). Unlike FOCAS, Galaxy carries 12+ years of v1 production history, so the failure surface is well-mapped — most of the protections below close known incident classes rather than guarding against speculative ones. The four findings closed in commit `c76ab8f` (stability-review 2026-04-13) are concrete examples — among them: a failed runtime probe subscription leaving a phantom entry that flipped `Tick()` to Stopped and fanned out false BadOutOfService quality, sync-over-async on the OPC UA stack thread, and fire-and-forget alarm tasks racing shutdown.
Project Layout
src/
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ # .NET 10 x64 in main server
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ # .NET 4.8 x86 separate Windows service
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ # .NET Standard 2.0 IPC contracts
The Host is the only place MXAccess COM objects, the Galaxy SQL Server connection, and the optional Wonderware Historian SDK are loaded. Bitness mismatch with the .NET 10 x64 main server is the original isolation reason; Tier C stability isolation is the layered reason.
STA Thread + Win32 Message Pump (the foundation)
Every MXAccess COM call must execute on a dedicated STA thread that runs a GetMessage/DispatchMessage loop, because MXAccess delivers OnDataChange / OnWriteComplete / advisory callbacks via window messages. This is non-negotiable — calls from the wrong apartment fail or, worse, cross-thread COM marshaling silently corrupts state.
- One STA thread per Host process owns all `LMXProxyServer` instances and all advisory subscriptions
- Work item dispatch uses `PostThreadMessage(WM_APP)` to marshal incoming IPC requests onto the STA thread
- Pump shutdown posts `WM_QUIT` only after all outstanding work items have completed, preventing torn-down COM proxies from receiving callbacks
- Pump health is itself probed: the proxy sends a no-op work item every 10 s and expects a round-trip; a missing round-trip = pump wedged = trigger recycle
The pattern is the same as the v1 StaComThread in ZB.MOM.WW.LmxProxy.Host — proven at this point and not a place for invention.
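The dispatch idea (all COM calls funneled onto one dedicated thread) can be shown with a cross-platform stand-in. This sketch replaces the Win32 `GetMessage`/`DispatchMessage` pump and `PostThreadMessage` with a blocking work queue, and omits `SetApartmentState(ApartmentState.STA)`, exception marshaling, and pump-health probing; it illustrates only the single-thread affinity guarantee:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class SingleThreadDispatcher : IDisposable
{
    private readonly BlockingCollection<Action> _work = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public SingleThreadDispatcher()
    {
        // The one thread that would own every COM object and subscription.
        _thread = new Thread(() =>
        {
            foreach (var action in _work.GetConsumingEnumerable())
                action();
        }) { IsBackground = true };
        _thread.Start();
    }

    // Marshal a call onto the dedicated thread and wait for its result.
    public T Invoke<T>(Func<T> func)
    {
        var done = new ManualResetEventSlim();
        T result = default!;
        _work.Add(() => { result = func(); done.Set(); });
        done.Wait();
        return result;
    }

    // Shutdown drains remaining work first, mirroring "post WM_QUIT only
    // after all outstanding work items have completed".
    public void Dispose() { _work.CompleteAdding(); _thread.Join(); }
}
```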
COM Object Lifetime
MXAccess COM objects (LMXProxyServer connection handles, item handles) accumulate native references that the GC does not track. Leaks here are silent until the Host runs out of handles or the Galaxy refuses new advisory subscriptions.
- `MxAccessHandle : SafeHandle` wraps each `LMXProxyServer` connection. Finalizer calls `Marshal.ReleaseComObject` until refcount = 0, then `UnregisterProxy`.
- Subscription handles wrapped per item; `RemoveAdvise` + `RemoveItem` on dispose, in that order (event handlers must be unwired before the item handle goes away — undefined behavior otherwise).
- `CriticalFinalizerObject` for handle wrappers so finalizer ordering during AppDomain unload is predictable.
- Pre-shutdown drain: on Host stop, the Proxy first cancels all subscriptions cleanly via the STA pump (`AdviseSupervisory(stop)` → `RemoveItem` → `UnregisterProxy`). Only then does the Host exit. Fire-and-forget shutdown is a known v1 bug class — the four 2026-04-13 stability findings include "alarm auto-subscribe and transferred-subscription restore no longer race shutdown as untracked fire-and-forget tasks."
Subscription State and Reconnect
Galaxy's MXAccess advisory subscriptions are stateful — once established, Galaxy pushes value updates until RemoveAdvise. Network disconnects, Galaxy redeployments, and Platform/AppEngine restarts all break the subscription stream and require replay.
- Subscription registry in the Host: every `AddItem` + `AdviseSupervisory` is recorded so reconnect can replay
- Reconnect trigger: connection-health probe (see below) detects loss → marks subscriptions Disconnected → fans out Bad quality via Proxy → enters reconnect loop
- Replay order: register proxy → re-add items → re-advise. Order matters; re-advising an item that was never re-added wedges silently.
- Quality fan-out during the reconnect window respects host scope — per the same 2026-04-13 findings, a stopped DevAppEngine must not let a recovering DevPlatform's startup callback wipe Bad quality on the still-stopped engine's variables. Cross-host quality clear is gated on a host-status check.
- Symbol-version-changed equivalent: a Galaxy `time_of_last_deploy` change → driver invokes `IRediscoverable` → rebuild the affected subtree only (per the Galaxy platform scope filter, commit `bc282b6`)
Connection Health Probe (GalaxyRuntimeProbeManager)
A dedicated probe subscribes to a synthetic per-host runtime-status attribute (Platform/Engine ScanState). Probe state drives:
- Bad-quality fan-out when a host (Platform or AppEngine) reports Stopped
- Quality restoration when state transitions back to Running, scoped to that host's subtree only (not Galaxy-wide — closes the 2026-04-13 finding about a Running→Unknown→Running callback wiping sibling state)
- Probe failure handling: a failed probe subscription must NOT leave a phantom entry that `Tick()` flips to Stopped — phantom probes are an accidental Bad-quality source. Closed in `c76ab8f`.
Memory Watchdog Thresholds (Galaxy-specific)
Galaxy baseline depends heavily on Galaxy size. The platform scope filter (commit bc282b6) reduced a dev Galaxy's footprint from 49 objects / 4206 attributes (full Galaxy) to 3 objects / 386 attributes (local subtree). Real production Galaxies vary from a few hundred to tens of thousands of attributes.
| Threshold | Action |
|---|---|
| 1.5× baseline (per-instance, after warm-up) | Warning |
| 2× baseline OR baseline + 200 MB (whichever larger) | Soft recycle |
| 1.5 GB absolute hard ceiling | Force-kill |
| Slope > 5 MB/min sustained 30 min | Soft recycle |
Higher hard ceiling than FOCAS (1.5 GB vs 300 MB) because legitimate Galaxy baselines are larger. Same multiplier-with-floor pattern. The slope threshold is more permissive (5 MB/min vs 1 MB/min) because Galaxy's address-space rebuild on redeploy can transiently allocate large amounts.
Recycle Policy (COM-specific)
Soft recycle distinguishes between work items queued for the STA pump (cancellable before dispatch) and MXAccess calls in flight on the STA thread (not cancellable — COM has no abort).
- Stop accepting new IPC requests
- Cancel queued (not-yet-dispatched) STA work items
- Wait up to 15 s grace for the in-flight STA call to return (longer than FOCAS because some MXAccess calls — bulk attribute reads, large hierarchy traversals — legitimately take seconds)
- For each subscription: post `RemoveAdvise` → `RemoveItem` → release the item handle, in that order, on the STA thread
- For the proxy connection: post `UnregisterProxy` → `Marshal.ReleaseComObject` until refcount = 0 → release `MxAccessHandle`
- STA pump shutdown: post `WM_QUIT` only after all of the above have completed
- Flush the post-mortem ring buffer
- If the STA pump did not exit within 5 s of `WM_QUIT` → escalate to `Environment.Exit(2)`. A wedged COM call cannot be recovered cleanly; same logic as the FOCAS Abandoned-handle escalation.
- If clean → `Environment.Exit(0)`, supervisor respawns
Recycle frequency cap is the same as FOCAS (1/hour). Scheduled recycle defaults to 24 h.
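The shape of the soft recycle is grace-then-escalate: a bounded wait for the in-flight call, an ordered quit, and a hard exit if the pump wedges. A transport-agnostic sketch of that control flow, with `threading.Event` standing in for STA pump signals and the timeouts parameterized for testability (all names illustrative):

```python
import threading

def soft_recycle(inflight_done: threading.Event,
                 post_quit,                    # callable: posts the WM_QUIT equivalent
                 pump_exited: threading.Event,
                 flush_postmortem,             # callable: flushes the ring buffer
                 grace_s: float = 15.0,
                 quit_s: float = 5.0) -> int:
    """Return the process exit code for a soft recycle: 0 = clean, 2 = wedged."""
    # COM has no abort: we can only wait out the in-flight call, not cancel it.
    inflight_done.wait(timeout=grace_s)        # 15 s grace for in-flight STA call
    post_quit()                                # quit only after ordered teardown
    flush_postmortem()
    if not pump_exited.wait(timeout=quit_s):   # 5 s for the pump to drain and exit
        return 2                               # wedged COM call: escalate, hard exit
    return 0                                   # clean: supervisor respawns
```

The exit code is the contract with the supervisor: 0 means a scheduled respawn, 2 means the corpse should be treated like a crash (post-mortem read, backoff, crash-loop accounting).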
What Survives a Galaxy Recycle
| State | Survives? | How |
|---|---|---|
| Address space (built from Galaxy DB) | ✔ | Proxy caches the last built tree; rebuild from DB on host startup |
| Subscription set | ✔ | Proxy re-issues subscribe on host startup |
| Last-known values | ✔ (in proxy cache) | Surfaced as Bad quality during recycle window |
| Alarm state | partial | Active alarm registry replayed; AlarmTracking re-subscribes |
| In-flight reads | ✗ | BadCommunicationError; client retries |
| In-flight writes | ✗ | Per Polly write-retry policy: not auto-retried |
| Historian subscriptions | ✗ | Re-established on next HistoryRead |
| `time_of_last_deploy` watermark | ✔ | Cached in proxy; resync on startup avoids spurious full rebuild |
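The "last-known values" row is the proxy-side cache in miniature: values persist across the recycle window but are served with Bad quality, scoped to the lost host's subtree only. A small sketch of that behavior (node naming and API are illustrative assumptions):

```python
class ProxyValueCache:
    """Proxy-side last-known-value cache (sketch; not the real driver API)."""

    def __init__(self):
        self._values = {}                      # node id -> (value, quality)

    def on_value(self, node: str, value):
        self._values[node] = (value, "Good")   # fresh data always restores Good

    def on_host_lost(self, host_prefix: str):
        # Recycle window: keep last-known values, degrade quality to Bad,
        # scoped to the lost host's subtree only (never Galaxy-wide).
        for node, (value, _) in list(self._values.items()):
            if node.startswith(host_prefix):
                self._values[node] = (value, "Bad")

    def read(self, node: str):
        return self._values[node]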
Recovery Sequence After Crash
Same supervisor protocol as FOCAS, with one Galaxy-specific addition:
- Supervisor detects host exit
- Reads post-mortem MMF, attaches tail to crash event
- Proxy fans out Bad quality on all Galaxy nodes scoped to the lost host's platform (not necessarily every Galaxy node — multi-host respect is per the 2026-04-13 findings)
- Backoff: 5 s → 15 s → 60 s
- Spawn new Host
- Host checks `time_of_last_deploy`; if unchanged from cached watermark, skip full DB rediscovery and reuse cached hierarchy (faster recovery for the common case where the crash was unrelated to a redeploy)
- Re-register MXAccess proxy, re-add items, re-advise
- Quality returns to Good as values arrive
- 3 crashes in 5 minutes → crash-loop circuit opens (same escalating-cooldown rules as FOCAS)
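The backoff ladder and the crash-loop circuit can be sketched together: the 5-minute window is evaluated on every host exit, and three crashes inside it open the circuit instead of scheduling a respawn (class and method names are illustrative):

```python
import time
from collections import deque

class Supervisor:
    """Respawn policy sketch: 5/15/60 s backoff plus 3-crashes-in-5-min circuit."""
    BACKOFFS = [5, 15, 60]

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._crashes = deque()        # timestamps of recent host exits
        self._consecutive = 0          # resets once the host is healthy again

    def on_host_exit(self):
        now = self._clock()
        self._crashes.append(now)
        while self._crashes and now - self._crashes[0] > 300:
            self._crashes.popleft()    # drop crashes older than the 5-min window
        if len(self._crashes) >= 3:
            return "circuit-open"      # stop respawning; escalating cooldown applies
        delay = self.BACKOFFS[min(self._consecutive, len(self.BACKOFFS) - 1)]
        self._consecutive += 1
        return delay                   # seconds to wait before spawning a new Host

    def on_host_healthy(self):
        self._consecutive = 0          # backoff ladder restarts after recovery
```

With an injected clock the policy is trivially unit-testable, which matters given the regression-test stance this document takes on the 2026-04-13 findings.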
Post-Mortem Log Contents (Galaxy-specific)
In addition to the universal last-N-operations ring:
- STA pump state snapshot: thread ID, last-message-dispatched timestamp, queue depth
- Active subscription count + breakdown by host (Platform/AppEngine)
- `MxAccessHandle` refcount snapshot for every live handle
- Last 100 probe results with host status transitions
- Last redeploy event timestamp (from `time_of_last_deploy` polling)
- Galaxy DB connection state (last query duration, last error)
- Historian connection state if HDA enabled
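Decision #69's memory-mapped post-mortem log only works if every append lands in the mapping immediately and the write cursor is committed last, so a hard death (including an AV) loses at most the final record. A plain-file Python sketch of that ring layout (sizes and names illustrative; a real host would use a named shared mapping the supervisor opens after the corpse is gone):

```python
import mmap
import struct

class PostMortemRing:
    """File-backed ring buffer: 8-byte write cursor, then a fixed-size byte ring.
    Writes go straight into the mapping, so a reader can recover the tail even
    after a hard process death."""
    HEADER = 8

    def __init__(self, path: str, size: int = 4096):
        self._size = size
        with open(path, "a+b") as f:
            f.truncate(self.HEADER + size)     # new file is zero-filled: cursor = 0
        self._f = open(path, "r+b")
        self._mm = mmap.mmap(self._f.fileno(), self.HEADER + size)

    def append(self, line: str):
        data = (line + "\n").encode()
        (pos,) = struct.unpack_from("<Q", self._mm, 0)
        for b in data:
            off = self.HEADER + (pos % self._size)
            self._mm[off:off + 1] = bytes([b])
            pos += 1
        # Cursor committed last: a torn write loses at most the final record.
        struct.pack_into("<Q", self._mm, 0, pos)

    def tail(self) -> bytes:
        """Reconstruct the last <= size bytes in write order (supervisor side)."""
        (pos,) = struct.unpack_from("<Q", self._mm, 0)
        if pos <= self._size:
            return bytes(self._mm[self.HEADER:self.HEADER + pos])
        cut = pos % self._size
        ring = bytes(self._mm[self.HEADER:self.HEADER + self._size])
        return ring[cut:] + ring[:cut]
```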
Test Coverage for Galaxy Stability
Galaxy is the easiest of the Tier C drivers to test because the dev machine already has a real Galaxy. Three test surfaces:
- Real Galaxy on dev machine (per `test-data-sources.md`) — the primary integration test environment. Covers MXAccess wire behavior, subscription replay, redeploy-triggered rediscovery, host status transitions.
- `Driver.Galaxy.FaultShim` — analogous to the FOCAS FaultShim, a test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding. Injects: COM exception at chosen call site, subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning unexpected refcount, STA pump deadlock simulation.
- v1 IntegrationTests parity suite — the existing v1 test suite must pass against the v2 Galaxy driver before move-behind-IPC is considered complete (decision #56). This is the primary regression net.
The 2026-04-13 stability findings should each become a regression test in the parity suite — phantom probe subscription, cross-host quality clear, sync-over-async on stack thread, fire-and-forget shutdown race. Closing those bugs without test coverage is how they come back.
Decision Additions for plan.md
Proposed new entries for the Decision Log (numbering continues from #62):
| # | Decision | Rationale |
|---|---|---|
| 63 | Driver stability tier model (A/B/C) | Drivers vary in failure profile; tier dictates hosting and protection level. See driver-stability.md |
| 64 | FOCAS is Tier C — out-of-process Windows service | Fwlib64.dll is black-box, AV uncatchable, handle-affinity, no SLA. Same Proxy/Host/Shared pattern as Galaxy |
| 65 | Cross-cutting protections mandatory in all tiers | SafeHandle, memory watchdog, bounded queues, scheduled recycle, post-mortem log apply to every driver process |
| 66 | Out-of-process driver pattern is reusable | Galaxy.Proxy/Host/Shared template generalizes to any Tier C driver; FOCAS is the second user |
| 67 | Tier B drivers may escalate to Tier C on production evidence | libplctag, S7netplus, TwinCAT.Ads start in-process; promote if leaks or crashes appear in production |
| 68 | Crash-loop circuit breaker stops respawn after 3 crashes/5 min | Prevents thrashing; requires manual reset to surface an operator-actionable problem |
| 69 | Post-mortem log via memory-mapped file | Survives hard process death (including AV); supervisor reads after corpse is gone; only viable post-mortem path for native crashes |
Resolved Defaults
The three open questions from the initial draft are resolved as follows. All values are tunable per-driver-instance in central config; the defaults are what ships out of the box.
Watchdog thresholds — hybrid multiplier + absolute floor + hard ceiling
Pure multipliers misfire on tiny baselines (a 30 MB FOCAS Host shouldn't recycle at 45 MB). Pure absolute thresholds in MB don't scale across deployment sizes. Hybrid: trigger on whichever threshold is reached first — `max(N × baseline, baseline + floor MB)` for warn/recycle, plus an absolute hard ceiling that always force-kills. Slope detection stays orthogonal — it catches slow leaks well below any level threshold.
Crash-loop reset — auto-reset with escalating cooldown, sticky alert, 24 h manual floor
Manual-only reset is too rigid for unattended plants (CNC sites don't have operators on console 24/7). Pure auto-reset after a fixed cooldown defeats the purpose of the breaker by letting it silently retry forever. Escalating cooldown (1 h → 4 h → 24 h-with-manual-reset) auto-recovers from transient problems while ensuring persistent problems eventually demand human attention. Sticky alerts that don't auto-clear keep the trail visible regardless.
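The escalating-cooldown policy reduces to a small state machine: each trip moves one tier up the ladder, the top tier demands a manual reset, and the alert stays sticky throughout. A sketch (names illustrative):

```python
class CrashLoopBreaker:
    """Escalating-cooldown sketch: 1 h -> 4 h -> 24 h, manual reset at the floor."""
    COOLDOWNS_S = [3600, 4 * 3600, 24 * 3600]

    def __init__(self):
        self._trips = 0
        self.alert_sticky = False        # alerts never auto-clear

    def on_trip(self):
        """Called when the circuit opens (3 crashes / 5 min). Returns
        (cooldown seconds, manual-reset-required)."""
        level = min(self._trips, len(self.COOLDOWNS_S) - 1)
        self._trips += 1
        self.alert_sticky = True
        manual = level == len(self.COOLDOWNS_S) - 1    # 24 h tier: human required
        return self.COOLDOWNS_S[level], manual

    def manual_reset(self):
        self._trips = 0    # ladder restarts; the sticky alert is acknowledged elsewhere
```

Transient problems clear themselves at the 1 h and 4 h tiers; anything that reaches the 24 h tier has, by construction, survived two unattended retries and now waits for a human.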
Heartbeat cadence — 2 s with 3-miss tolerance
5 s × 3 misses = 15 s detection is too slow against typical 1 s OPC UA publishing intervals (subscribers see Bad quality 15+ samples late). 1 s × 3 = 3 s is plausible but raises false-positive rate from GC pauses and Windows pipe scheduling. 2 s × 3 = 6 s is the sweet spot: subscribers see Bad quality within one or two missed publish cycles, GC pauses (~500 ms typical) and pipe jitter stay well inside the tolerance budget.
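The cadence math above is just a staleness budget: interval times tolerance gives 6 s with the shipped defaults. A sketch of the detection side (names illustrative):

```python
class HeartbeatMonitor:
    """2 s cadence, 3-miss tolerance: host declared dead ~6 s after its last beat."""

    def __init__(self, interval_s: float = 2.0, tolerance: int = 3):
        self._budget = interval_s * tolerance    # 6 s with the shipped defaults
        self._last_beat = None

    def on_beat(self, now: float):
        self._last_beat = now

    def is_alive(self, now: float) -> bool:
        if self._last_beat is None:
            return True                          # grace until the first beat arrives
        # A ~500 ms GC pause or pipe jitter consumes well under one interval,
        # so it never exhausts the three-miss budget.
        return (now - self._last_beat) <= self._budget
```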