Driver Stability & Isolation — OtOpcUa v2
Status: DRAFT — companion to plan.md. Defines the stability tier model, per-driver hosting decisions, cross-cutting protections every driver process must apply, and the canonical worked example (FOCAS) for the high-risk tier.
Branch: v2
Created: 2026-04-17
Problem Statement
The v2 plan adds eight drivers spanning pure managed code (Modbus, OPC UA Client), wrapped C libraries (libplctag for AB CIP/Legacy, S7netplus for Siemens, Beckhoff.TwinCAT.Ads for ADS), heavy native/COM with thread affinity (Galaxy MXAccess), and black-box vendor DLLs (FANUC Fwlib64.dll for FOCAS).
These do not all carry the same failure profile, but the v1 plan treats them uniformly: every driver runs in-process in the .NET 10 server except Galaxy (isolated only because of its 32-bit COM constraint). This means:
- An `AccessViolationException` from `Fwlib64.dll` — uncatchable by managed code in modern .NET — tears down the whole OPC UA server, all subscriptions, and every other driver with it.
- A native handle leak (a FOCAS `cnc_allclibhndl3` not paired with `cnc_freelibhndl`, or libplctag tag handles not freed) accumulates against the server process, not the driver.
- A thread-affinity bug (calling Fwlib on two threads against the same handle) corrupts state for every other driver sharing the process.
- Polly's circuit breaker handles transient errors; it does nothing for process death or resource exhaustion.
Driver stability needs to be a first-class architectural concern, not a per-driver afterthought.
Stability Tier Model
Every driver is assigned to one of three tiers based on the trust level of its dependency stack:
Tier A — Pure Managed
Drivers whose entire dependency chain is verifiable .NET. Standard exception handling and Polly are sufficient. Run in-process in the main server.
| Driver | Stack | Notes |
|---|---|---|
| Modbus TCP | NModbus (pure managed) | Sockets only |
| OPC UA Client | OPC Foundation .NETStandard SDK (pure managed) | Reference-grade SDK |
Tier B — Wrapped Native, Mature
Drivers that P/Invoke into a mature, well-maintained native library, or use a managed wrapper that has limited native bits (router, transport). Run in-process with the cross-cutting protections from §3 mandatory: SafeHandle for every native resource, memory watchdog, bounded queues. Any driver in this tier may be promoted to Tier C if production data shows leaks or crashes.
| Driver | Stack | Notes |
|---|---|---|
| Siemens S7 | S7netplus (mostly managed) | Sockets + small native helpers |
| AB CIP | libplctag (C library via P/Invoke) | Mature, widely deployed; manages its own threads |
| AB Legacy | libplctag (same as CIP) | Same library, different protocol mode |
| TwinCAT | Beckhoff.TwinCAT.Ads v6 + AmsTcpIpRouter | Mostly managed; native callback pump for ADS notifications |
Tier C — Heavy Native / COM / Thread-Affinity
Drivers whose dependency is a black-box vendor DLL, COM object with apartment requirements, or any code where a fault is likely uncatchable. Run as a separate Windows service behind the Galaxy.Proxy/Host/Shared pattern. A crash isolates to that driver's process; the main server fans out Bad quality on the affected nodes and respawns the host.
| Driver | Stack | Reason for Tier C |
|---|---|---|
| Galaxy | MXAccess COM (.NET 4.8 x86) | Bitness mismatch + COM/STA + long history of native quirks |
| FOCAS | Fwlib64.dll P/Invoke | Black-box vendor DLL, handle-affinity, thread-unsafe per handle, no public SLA |
Cross-Cutting Protections
Two distinct protection sets, scoped by hosting mode rather than applied uniformly. This split exists because process-level signals (RSS watchdog, recycle, kill) act on a process, not a driver — applying them in the shared server process would let a leak in one in-proc driver knock out every other driver, every session, and the OPC UA endpoint. That contradicts the v2 isolation invariant. Process-level protections therefore apply only to isolated host processes (Tier C); in-process drivers (Tier A/B) get a different set of guards that operate at the driver-instance level.
Universal — apply to every driver regardless of tier
SafeHandle for every native resource
Every native handle (the FOCAS handle returned by `cnc_allclibhndl3`, libplctag tag handles, COM IUnknown refs, OS file/socket handles we pass through P/Invoke) is wrapped in a `SafeHandle` subclass with a finalizer that calls the release function (for FOCAS, `cnc_freelibhndl`). This guarantees release even when:
- The owning thread crashes
- A `using` block is bypassed by an exception we forgot to catch
- The driver host process is shutting down ungracefully
`Marshal.ReleaseComObject` calls go through `CriticalFinalizerObject` to honor finalizer ordering during AppDomain unload.
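The wrapper pattern can be sketched in a few lines. `NativeApi.Release` below is a hypothetical stand-in for the real release call (`cnc_freelibhndl`, libplctag's tag destroy, a COM release) so the sketch is runnable without vendor DLLs:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical native API surface; a real driver would P/Invoke the vendor DLL.
internal static class NativeApi
{
    public static int ReleaseCount; // exposed only so the sketch is observable
    public static int Release(IntPtr h) { ReleaseCount++; return 0; }
}

internal sealed class DriverSafeHandle : SafeHandle
{
    public DriverSafeHandle(IntPtr raw) : base(IntPtr.Zero, ownsHandle: true)
        => SetHandle(raw);

    public override bool IsInvalid => handle == IntPtr.Zero;

    // Invoked by Dispose() or, failing that, by the finalizer thread, so the
    // native resource is released even when the owning thread crashed or a
    // using block was bypassed by an uncaught exception.
    protected override bool ReleaseHandle() => NativeApi.Release(handle) == 0;
}
```

Disposing (or finalizing) the wrapper is the single release path; driver code never calls the native free function directly.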
Bounded operation queues (per device, per driver instance)
Every driver-instance/device pairing has a bounded outgoing-operation queue (default 1000 entries). When the queue is full, new operations fail fast with BadResourceUnavailable rather than backing up unboundedly against a slow or dead device. Polly's circuit breaker also opens, surfacing the device-down state to the dashboard.
This prevents the canonical "device went offline → reads pile up → driver eats all RAM" failure mode. Crucially, it operates per device in the in-process case so one stuck device cannot starve another driver's queue or accumulate against the shared server's heap.
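The fail-fast behaviour falls out naturally from a bounded channel: a sketch (names like `DeviceQueue` are illustrative; the real driver maps a rejected enqueue to the `BadResourceUnavailable` StatusCode):

```csharp
using System;
using System.Threading.Channels;

// One instance per driver-instance/device pairing.
public sealed class DeviceQueue<T>
{
    private readonly Channel<T> _channel;

    public DeviceQueue(int capacity = 1000) =>
        _channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            // Writers use TryWrite, so "Wait" never actually blocks them;
            // a full queue just makes TryWrite return false.
            FullMode = BoundedChannelFullMode.Wait
        });

    // false = queue full: caller fails the operation fast instead of
    // backing up unboundedly against a slow or dead device.
    public bool TryEnqueue(T op) => _channel.Writer.TryWrite(op);

    public ChannelReader<T> Reader => _channel.Reader;
}
```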
Crash-loop circuit breaker
If a driver host crashes 3 times within 5 minutes, the supervisor stops respawning, leaves the driver's nodes in Bad quality, raises an operator alert, and starts an escalating cooldown before attempting auto-reset. This balances "unattended sites need recovery without an operator on console" against "don't silently mask a persistent problem."
| Trip sequence | Cooldown before auto-reset |
|---|---|
| First trip | 1 hour |
| Re-trips within 10 min of an auto-reset | 4 hours |
| Re-trips after the 4 h cooldown | 24 hours, manual reset required via Admin UI |
Every trip raises a sticky operator alert that does not auto-clear when the cooldown elapses — only manual acknowledgment clears it. So even if recovery is automatic, "we crash-looped 3 times overnight" stays visible the next morning. The auto-reset path keeps unattended plants running; the sticky alert + 24 h manual-only floor prevents the breaker from becoming a "silent retry forever" mechanism.
For Tier A/B (in-process) drivers, the "crash" being counted is a driver-instance reset (capability-level reinitialization, not a process exit). For Tier C drivers, it's a host process exit.
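The escalating-cooldown schedule above reduces to a small pure function. This sketch uses hypothetical names (`CrashLoopPolicy`, `CooldownFor`) and collapses the re-trip-window bookkeeping into a consecutive trip counter; supervisor wiring and the sticky-alert plumbing are omitted:

```csharp
using System;

public static class CrashLoopPolicy
{
    // A trip within this window of an auto-reset counts as a re-trip.
    public static readonly TimeSpan ReTripWindow = TimeSpan.FromMinutes(10);

    // tripNumber is 1-based: how many times in a row the breaker has opened.
    public static (TimeSpan cooldown, bool manualResetRequired) CooldownFor(int tripNumber) =>
        tripNumber switch
        {
            1 => (TimeSpan.FromHours(1), false),
            2 => (TimeSpan.FromHours(4), false),
            _ => (TimeSpan.FromHours(24), true), // manual reset via Admin UI
        };
}
```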
In-process only (Tier A/B) — driver-instance allocation tracking
In-process drivers cannot be recycled by killing the server process — that would take down every other driver, every session, and the OPC UA endpoint. RSS watchdogs and scheduled recycle therefore do not apply to Tier A/B. Instead, each driver instance is monitored at a finer grain:
- Per-instance allocation tracking: drivers expose a `GetMemoryFootprint()` capability returning bytes attributable to their own caches (symbol cache, subscription items, queued operations). The Core polls this every 30 s and logs growth slope per driver instance.
- Soft-limit on cached state: each driver declares a memory budget for its caches in `DriverConfig`. On breach, the Core asks the driver to flush optional caches (e.g. discard symbol cache, force re-discovery). No process action.
- Escalation rule: if a driver instance's footprint cannot be bounded by cache flushing — or if growth is in opaque allocations the driver can't account for — that driver is a candidate for promotion to Tier C. Process recycle is the only safe leak remediation, and the only way to apply process recycle to a single driver is to give it its own process.
- No process kill on a Tier A/B driver. Ever. The only Core-initiated recovery is asking the driver to reset its own state via `IDriver.Reinitialize()`. If that fails, the driver instance is marked Faulted, its nodes go Bad quality, and the operator is alerted. The server process keeps running for everyone else.
Isolated host only (Tier C) — process-level protections
These act on the host process. They cannot affect any other driver or the main server, because each Tier C driver has its own process.
Per-host memory watchdog
Each host process measures baseline RSS after warm-up (post-discovery, post-first-poll). A monitor thread samples RSS every 30 s and tracks both a multiplier of baseline and an absolute hard ceiling.
| Threshold | Action |
|---|---|
| 1.5× baseline OR baseline + 50 MB (whichever larger) | Log warning, surface in status dashboard |
| 3× baseline OR baseline + 200 MB (whichever larger) | Trigger soft recycle (graceful drain → exit → respawn) |
| 1 GB absolute hard ceiling | Force-kill driver process, supervisor respawns |
| Slope > 2 MB/min sustained 30 min | Treat as leak signal, soft recycle even below absolute threshold |
The "whichever larger" floor prevents spurious triggers when baseline is tiny — a 30 MB FOCAS Host shouldn't recycle at 45 MB just because the multiplier says so. All thresholds are per-driver-type defaults, overridable per-driver-instance in central config. Only valid for isolated hosts — never apply to the main server process.
Heartbeat between proxy and host
The proxy in the main server sends a heartbeat ping to the driver host every 2 s and expects a reply within 1 s. Three consecutive misses → proxy declares the host dead (6 s total detection latency), fans out Bad quality on all of that driver's nodes, and asks the supervisor to respawn.
2 s is fast enough that subscribers on a 1 s OPC UA publishing interval see Bad quality within one or two missed publish cycles, but slow enough that GC pauses (typically <500 ms even on bad days) and Windows pipe scheduling jitter don't generate false positives. The 3-miss tolerance absorbs single-cycle noise.
The heartbeat is on a separate named-pipe channel from the data-plane RPCs so a stuck data-plane operation doesn't mask host death. Cadence and miss-count are tunable per-driver-instance in central config.
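The proxy-side miss counting is small enough to sketch (hypothetical `HeartbeatMonitor`; pipe I/O and the 2 s timer are omitted — at a 2 s cadence, three misses gives the ~6 s detection latency described above):

```csharp
public sealed class HeartbeatMonitor
{
    public const int MissThreshold = 3; // tunable per-driver-instance

    private int _consecutiveMisses;

    public bool HostDead { get; private set; }

    // Any reply within the 1 s deadline resets the counter.
    public void OnReply() => _consecutiveMisses = 0;

    public void OnMissedDeadline()
    {
        if (++_consecutiveMisses >= MissThreshold)
            HostDead = true; // fan out Bad quality, ask the supervisor to respawn
    }
}
```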
Scheduled recycling
Each Tier C host process is recycled on a schedule (default 24 h, configurable per driver type). The recycle is a soft drain → exit → respawn, identical to a watchdog-triggered recycle. Defensive measure against slow leaks that stay below the watchdog thresholds.
Post-mortem log
Each driver process writes a ring buffer of the last 1000 operations to a memory-mapped file (`%ProgramData%\OtOpcUa\driver-postmortem\<driver>.mmf`):

    timestamp | handle/connection ID | operation | args summary | return code | duration
On graceful shutdown, the ring is flushed to a rotating log. On a hard crash (including AV), the supervisor reads the MMF after the corpse is gone and attaches the tail to the crash event reported on the dashboard. Without this, post-mortem of a Fwlib AV is impossible.
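The ring itself is simple. An in-memory sketch of the last-N-operations buffer (hypothetical `PostMortemRing`; the real host backs the slots with the memory-mapped file so the supervisor can read them after a hard crash):

```csharp
using System;
using System.Collections.Generic;

public sealed class PostMortemRing
{
    private readonly string[] _slots;
    private long _next; // monotonically increasing write cursor

    public PostMortemRing(int capacity = 1000) => _slots = new string[capacity];

    // Overwrites the oldest entry once the ring is full.
    public void Record(string line) =>
        _slots[(int)(_next++ % _slots.Length)] = line;

    // Oldest-to-newest tail, as the supervisor would reassemble it post-crash.
    public IReadOnlyList<string> Tail()
    {
        long count = Math.Min(_next, _slots.Length);
        var result = new List<string>((int)count);
        for (long i = _next - count; i < _next; i++)
            result.Add(_slots[(int)(i % _slots.Length)]);
        return result;
    }
}
```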
Out-of-Process Driver Pattern (Generalized)
This is the Galaxy.Proxy/Host/Shared layout from plan.md §3, lifted to a reusable pattern for every Tier C driver. Two new projects per Tier C driver beyond the in-process driver projects:
src/
ZB.MOM.WW.OtOpcUa.Driver.<Name>.Proxy/ # In main server: implements IDriver, forwards over IPC
ZB.MOM.WW.OtOpcUa.Driver.<Name>.Host/ # Separate Windows service: actual driver implementation
ZB.MOM.WW.OtOpcUa.Driver.<Name>.Shared/ # IPC message contracts (.NET Standard 2.0)
Common contract for a Tier C host:
- Hosted as a Windows service with `Microsoft.Extensions.Hosting`
- Named-pipe IPC server (named pipes already established for Galaxy in §3)
- MessagePack-serialized contracts in `<Name>.Shared`
- Heartbeat endpoint on a separate pipe from the data plane
- Memory watchdog runs in-process and triggers `Environment.Exit(2)` on threshold breach
- Post-mortem MMF writer initialized on startup
- Standard supervisor protocol: respawn-with-backoff, crash-loop circuit breaker
Common contract for the proxy in the main server:
- Implements `IDriver` + capability interfaces; forwards every call over IPC
- Owns the heartbeat sender and host liveness state
- Fans out Bad quality on all nodes when the host is declared dead
- Owns the supervisor that respawns the host process
- Exposes host status (Up / Down / Recycling / CircuitOpen) to the status dashboard
IPC Security (mandatory for every Tier C driver)
Named pipes default to allowing connections from any local user. Without explicit ACLs, any process on the host machine that knows the pipe name could connect, bypass the OPC UA server's authentication and authorization layers, and issue reads, writes, or alarm acknowledgments directly against the driver host. This is a real privilege-escalation surface — a service account with no OPC UA permissions could write field values it should never have access to. Every Tier C driver enforces the following:
- Pipe ACL: the host creates the pipe with a `PipeSecurity` ACL that grants `ReadWrite | Synchronize` only to the OtOpcUa server's service principal SID. All other local users — including LocalSystem and Administrators — are explicitly denied. The ACL is set at pipe-creation time so it's atomic with the pipe being listenable.
- Caller identity verification: on each new pipe connection, the host calls `NamedPipeServerStream.GetImpersonationUserName()` (or impersonates and inspects the token) and verifies the connected client's SID matches the configured server service SID. Mismatches are logged and the connection is dropped before any RPC frame is read.
- Per-message authorization context: every RPC frame includes the operation's authenticated OPC UA principal (forwarded by the Core after it has done its own authn/authz). The host treats this as input only — the driver-level authorization (e.g. "is this principal allowed to write Tune attributes?") is performed by the Core, but the host's own audit log records the principal so post-incident attribution is possible.
- No anonymous endpoints: the heartbeat pipe has the same ACL as the data-plane pipe. There are no "open" pipes a generic client can probe.
- Defense-in-depth shared secret: the supervisor generates a per-host-process random secret at spawn time, passes it to both proxy and host via command-line args (or a parent-pipe handshake), and the host requires it on the first frame of every connection. This is belt-and-suspenders for the case where pipe ACLs are misconfigured during deployment.
Configuration: the server service SID is read from appsettings.json (Hosting.ServiceAccountSid) and validated against the actual running identity at startup. Mismatch fails startup loudly rather than producing a silently-insecure pipe.
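The shared-secret handshake above can be sketched with two pure helpers (hypothetical `SpawnSecret`; pipe and process-spawn plumbing omitted). Constant-time comparison avoids leaking the secret through a timing side channel:

```csharp
using System;
using System.Security.Cryptography;

public static class SpawnSecret
{
    // Supervisor mints one fresh secret per host-process spawn and hands it
    // to both proxy and host out-of-band (args or parent-pipe handshake).
    public static string Generate() =>
        Convert.ToBase64String(RandomNumberGenerator.GetBytes(32));

    // Host checks the first frame of every connection before reading any RPC.
    public static bool FirstFrameValid(byte[] presented, byte[] expected) =>
        CryptographicOperations.FixedTimeEquals(presented, expected);
}
```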
For Galaxy, this pattern is retroactively required (the v1 named-pipe IPC predates this contract and must be hardened during the Phase 2 refactor). For FOCAS and any future Tier C driver, IPC security is part of the initial implementation, not an add-on.
Reusability
For Galaxy, this pattern is already specified. For FOCAS, the same three projects appear in §5 below. Future Tier C escalations (e.g. if libplctag develops a stability problem) reuse the same template.
FOCAS — Deep Dive (Canonical Tier C Worked Example)
FOCAS is the most exposed driver in the v2 plan: a black-box vendor DLL (Fwlib64.dll), handle-based API with per-handle thread-affinity, no public stability SLA, and a target market (CNC integrations) where periodic-restart workarounds are common practice. The protections below are not theoretical — every one is a known FOCAS failure mode.
Project Layout
src/
ZB.MOM.WW.OtOpcUa.Driver.Focas.Proxy/ # .NET 10 x64 in main server
ZB.MOM.WW.OtOpcUa.Driver.Focas.Host/ # .NET 10 x64 separate Windows service
ZB.MOM.WW.OtOpcUa.Driver.Focas.Shared/ # .NET Standard 2.0 IPC contracts
ZB.MOM.WW.OtOpcUa.Driver.Focas.TestStub/ # Stub FOCAS server for dev/CI (per test-data-sources.md)
The Host process is the only place Fwlib64.dll is loaded. Every concern below is a Host-internal concern.
Handle Pool
One Fwlib handle per CNC connection. Pool design:
- `FocasHandle : SafeHandle` wraps the integer handle returned by `cnc_allclibhndl3`. Finalizer calls `cnc_freelibhndl`. Use of the handle inside the wrapper goes through `DangerousAddRef`/`DangerousRelease` to prevent finalization mid-call.
- Per-handle lock. Fwlib is thread-unsafe per handle — one mutex per `FocasHandle`; every API call acquires it. Lock fairness is FIFO so polling and write requests don't starve each other.
- Pool size of 1 per CNC by default. FANUC controllers typically allow 4–8 concurrent FOCAS sessions; we don't need parallelism inside one driver-to-CNC link unless profiling shows it. Configurable per device.
- Health probe. A background task issues `cnc_sysinfo` against each handle every 30 s. Failure → release the handle, mark the device disconnected, let normal reconnect logic re-establish.
- TTL. Each handle is forcibly recycled every 6 h (configurable) regardless of health. Defensive against slow Fwlib state corruption.
- Acquire timeout. Handle-lock acquisition has a 10 s timeout. Timeout = treat the handle as wedged, kill it, mark the device disconnected. (Real FOCAS calls have hung indefinitely in production reports.)
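The per-handle lock plus acquire timeout can be sketched as follows. `HandleGate` is a hypothetical name, and a `SemaphoreSlim` stands in for the per-handle mutex (note: `SemaphoreSlim` is not strictly FIFO-fair, so a production version would need a fair wait queue):

```csharp
using System;
using System.Threading;

public sealed class HandleGate
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly TimeSpan _acquireTimeout;

    public HandleGate(TimeSpan? acquireTimeout = null) =>
        _acquireTimeout = acquireTimeout ?? TimeSpan.FromSeconds(10);

    public bool Wedged { get; private set; }

    // Every Fwlib call against this handle funnels through here.
    public T Call<T>(Func<T> fwlibCall)
    {
        if (!_gate.Wait(_acquireTimeout))
        {
            // Acquire timed out: treat the handle as wedged so the pool
            // retires it and marks the device disconnected.
            Wedged = true;
            throw new TimeoutException("FOCAS handle wedged; retiring it");
        }
        try { return fwlibCall(); }
        finally { _gate.Release(); }
    }
}
```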
Thread Serialization
The Host runs a single-threaded scheduler with handle-affinity dispatch: each pending operation is tagged with the target handle, and a dedicated worker thread per handle drains its queue. Two consequences:
- Zero parallel calls into Fwlib for the same handle (correctness).
- A single slow CNC's queue can grow without blocking other CNCs' workers (isolation).
The bounded outgoing queue from §3 is per-handle, not process-global, so one stuck CNC can't starve another's queue capacity.
Memory Watchdog Thresholds (FOCAS-specific)
FOCAS baseline is small (~30–50 MB after discovery on a typical 32-axis machine). Defaults tighter than the global protection — FOCAS workloads should be stable, so any meaningful growth is a leak signal worth acting on early.
| Threshold | Action |
|---|---|
| 1.5× baseline OR baseline + 25 MB (whichever larger) | Warning |
| 2× baseline OR baseline + 75 MB (whichever larger) | Soft recycle |
| 300 MB absolute hard ceiling | Force-kill |
| Slope > 1 MB/min sustained 15 min | Soft recycle |
Same multiplier + floor + hard-ceiling pattern as the global default; tighter ratios and a lower hard ceiling because the workload profile is well-bounded.
Recycle Policy
Soft recycle in the Host distinguishes between operations queued in managed code (safely cancellable) and operations currently inside Fwlib64.dll (not safely cancellable — Fwlib calls have no cancellation mechanism, and freeing a handle while a native call is using it is undefined behavior, exactly the AV path the isolation is meant to prevent).
Sequence:
- Stop accepting new IPC requests (pipe rejects with `BadServerHalted`)
- Cancel queued (not-yet-dispatched) operations: return `BadCommunicationError` to the proxy
- Wait up to 10 s grace for any handle's worker thread to return from its current native call
- For handles whose worker thread returned within grace: call `cnc_freelibhndl` on the handle, dispose `FocasHandle`
- For handles still inside a native call after grace: do NOT call `cnc_freelibhndl` — leave the handle wrapper marked Abandoned, skip clean release. The OS reclaims the file descriptors and TCP sockets when the process exits; the CNC's session count decrements on its own connection timeout (typically 30–60 s)
- Flush the post-mortem ring buffer to disk; record which handles were Abandoned and why
- If any handle was Abandoned → escalate from soft recycle to hard exit: `Environment.Exit(2)` rather than `Environment.Exit(0)`. The supervisor logs this as an unclean recycle and applies the crash-loop circuit breaker to it (an Abandoned handle indicates a wedged Fwlib call, which is the kind of state that justifies treating the recycle as "this driver is in trouble")
- If all handles released cleanly → `Environment.Exit(0)` and the supervisor respawns normally
Recycle triggers (any one):
- Memory watchdog threshold breach
- Scheduled (daily 03:00 local by default)
- Operator command via Admin UI
- Crash-loop circuit breaker fired and reset (manual reset)
Recycle frequency cap: 1/hour. More than that = page operator instead of thrashing.
Why we never free a handle with an active native call
Calling cnc_freelibhndl on a handle while another thread is mid-call inside cnc_* against that same handle is undefined behavior per FANUC's docs (handle is not thread-safe; release races with use). The most likely outcome is an immediate AV inside Fwlib — which is precisely the scenario the entire Tier C isolation is designed to contain. The defensive choice is: if we can't release cleanly within the grace window, accept the handle leak (bounded by process lifetime) and let process exit do what we can't safely do from managed code.
This means a wedged Fwlib call always escalates to process exit. There is no in-process recovery path for a hung native call — the only correct response is to let the process die and have the supervisor start a fresh one.
What Survives a Recycle
| State | Survives? | How |
|---|---|---|
| Subscription set | ✔ | Proxy re-issues subscribe on host startup |
| Last-known values | ✔ (cached in proxy) | Surfaced as Bad quality during recycle window |
| In-flight reads | ✗ | Proxy returns BadCommunicationError; OPC UA client retries |
| In-flight writes | ✗ | Per Polly write-retry policy: NOT auto-retried; OPC UA client decides |
| Handle TTL clocks | ✗ (intentional) | Fresh handles after recycle, fresh TTL |
Recovery Sequence After Crash
- Supervisor detects host exit (heartbeat timeout or process exit code)
- Supervisor reads post-mortem MMF, attaches tail to a crash event
- Proxy fans out Bad quality on all FOCAS device nodes
- Backoff before respawn: 5 s → 15 s → 60 s (capped)
- Spawn new Host process
- Host re-discovers (functional structure is fixed; PMC/macro discovery from central config), re-subscribes
- Quality returns to Good as values arrive
- 3 crashes in 5 minutes → crash-loop circuit opens. Supervisor stops respawning, leaves Bad quality in place, raises operator alert. Manual reset required via Admin UI.
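The backoff step in the sequence above is a small lookup (hypothetical `RespawnBackoff`; the supervisor loop itself is omitted):

```csharp
using System;

public static class RespawnBackoff
{
    private static readonly TimeSpan[] Steps =
    {
        TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(60),
    };

    // attempt is 1-based; attempts beyond the table stay at the 60 s cap.
    public static TimeSpan DelayFor(int attempt) =>
        Steps[Math.Min(Math.Max(attempt, 1), Steps.Length) - 1];
}
```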
Post-Mortem Log Contents (FOCAS-specific)
In addition to the generic last-N-operations ring, the FOCAS Host post-mortem captures:
- Active handle pool snapshot (handle ID, target IP, age, last-call timestamp, consecutive failures)
- Handle health probe history (last 100 results)
- Memory samples (last 60 — 30 minutes at 30 s cadence)
- Recycle history (last 10 recycles with trigger reason)
- Last 50 IPC requests received (for correlating crashes to specific operator actions)
This makes post-mortem of an AccessViolationException actionable — without it, a Fwlib AV is essentially undebuggable.
Test Coverage for FOCAS Stability
There are two distinct test surfaces here, and an earlier draft conflated them. Splitting them honestly:
Surface 1 — Functional protocol coverage via the TCP stub
The Driver.Focas.TestStub (per test-data-sources.md §6) is a TCP listener that mimics a CNC over the FOCAS wire protocol. It can exercise everything that travels over the network:
- Inject network slow — stub adds latency on FOCAS responses, exercising the bounded queue, Polly timeout, and handle-lock acquire timeout
- Inject network hang — stub stops responding mid-call (TCP keeps the socket open but never writes), exercising the per-call grace window and the wedged-handle → hard-exit escalation
- Inject protocol error — stub returns FOCAS error codes (`EW_HANDLE`, `EW_SOCKET`, etc.) at chosen call boundaries, exercising error-code → StatusCode mapping and Polly retry policies
- Inject disconnect — stub closes the TCP socket, exercising the reconnect path and Bad-quality fan-out
This covers the majority of stability paths because most FOCAS failure modes manifest as the network behaving badly — the Fwlib library itself tends to be stable when its CNC behaves; the trouble is that real CNCs misbehave often.
Surface 2 — Native fault injection via a separate shim
Native AVs and native handle leaks cannot be triggered through a TCP stub — they live inside Fwlib64.dll, on the host side of the P/Invoke boundary. Faking them requires a separate mechanism:
- `Driver.Focas.FaultShim` project — a small native DLL named `Fwlib64.dll` (test-only build configuration) that exports the same FOCAS API surface but, instead of calling FANUC's library, performs configurable fault behaviors: deliberately raise an AV at a chosen call site, return success but never release allocated buffers (leak), return success on `cnc_freelibhndl` but keep the handle table populated (orphan handle), etc.
- Activated by binding redirect / DLL search path order in the Host's test fixture only; production builds load FANUC's real `Fwlib64.dll`.
- Tested paths: supervisor respawn after AV, post-mortem MMF readability after hard crash, watchdog → recycle path on simulated leaks, Abandoned-handle path when the shim simulates a wedged native call.
The Host code is unchanged between the two surfaces — it just experiences different symptoms depending on which DLL it loaded. Honest framing of test coverage: the TCP stub covers ~80% of real-world FOCAS failures (network/protocol); the FaultShim covers the remaining ~20% (native crashes/leaks). Hardware/manual testing on a real CNC remains the only validation path for vendor-specific Fwlib quirks that neither stub can predict.
Galaxy — Deep Dive (Tier C, COM/STA Worked Example)
Galaxy is the second Tier C driver and the only one bound to .NET 4.8 x86 (MXAccess COM has no 64-bit variant). Unlike FOCAS, Galaxy carries 12+ years of v1 production history, so the failure surface is well-mapped — most of the protections below close known incident classes rather than guarding against speculative ones. The four findings closed in commit `c76ab8f` (stability-review 2026-04-13) are concrete examples — among them: a failed runtime probe subscription leaving a phantom entry that flipped `Tick()` to Stopped and fanned out false BadOutOfService quality, sync-over-async on the OPC UA stack thread, and fire-and-forget alarm tasks racing shutdown.
Project Layout
src/
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ # .NET 10 x64 in main server
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ # .NET 4.8 x86 separate Windows service
ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ # .NET Standard 2.0 IPC contracts
The Host is the only place MXAccess COM objects, the Galaxy SQL Server connection, and the optional Wonderware Historian SDK are loaded. Bitness mismatch with the .NET 10 x64 main server is the original isolation reason; Tier C stability isolation is the layered reason.
STA Thread + Win32 Message Pump (the foundation)
Every MXAccess COM call must execute on a dedicated STA thread that runs a GetMessage/DispatchMessage loop, because MXAccess delivers OnDataChange / OnWriteComplete / advisory callbacks via window messages. This is non-negotiable — calls from the wrong apartment fail or, worse, cross-thread COM marshaling silently corrupts state.
- One STA thread per Host process owns all `LMXProxyServer` instances and all advisory subscriptions
- Work item dispatch uses `PostThreadMessage(WM_APP)` to marshal incoming IPC requests onto the STA thread
- Pump shutdown posts `WM_QUIT` only after all outstanding work items have completed, preventing torn-down COM proxies from receiving callbacks
- Pump health is itself probed: the proxy sends a no-op work item every 10 s and expects a round-trip; a missing round-trip = pump wedged = trigger recycle
The pattern is the same as the v1 StaComThread in ZB.MOM.WW.LmxProxy.Host — proven at this point and not a place for invention.
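The dispatch idea (all COM calls funneled onto one dedicated thread) can be shown with a cross-platform stand-in. This sketch replaces the Win32 `GetMessage`/`DispatchMessage` pump and `PostThreadMessage` with a blocking work queue, and omits `SetApartmentState(ApartmentState.STA)`, exception marshaling, and pump-health probing; it illustrates only the single-thread affinity guarantee:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class SingleThreadDispatcher : IDisposable
{
    private readonly BlockingCollection<Action> _work = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public SingleThreadDispatcher()
    {
        // The one thread that would own every COM object and subscription.
        _thread = new Thread(() =>
        {
            foreach (var action in _work.GetConsumingEnumerable())
                action();
        }) { IsBackground = true };
        _thread.Start();
    }

    // Marshal a call onto the dedicated thread and wait for its result.
    public T Invoke<T>(Func<T> func)
    {
        var done = new ManualResetEventSlim();
        T result = default!;
        _work.Add(() => { result = func(); done.Set(); });
        done.Wait();
        return result;
    }

    // Shutdown drains remaining work first, mirroring "post WM_QUIT only
    // after all outstanding work items have completed".
    public void Dispose() { _work.CompleteAdding(); _thread.Join(); }
}
```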
COM Object Lifetime
MXAccess COM objects (LMXProxyServer connection handles, item handles) accumulate native references that the GC does not track. Leaks here are silent until the Host runs out of handles or the Galaxy refuses new advisory subscriptions.
- `MxAccessHandle : SafeHandle` wraps each `LMXProxyServer` connection. Finalizer calls `Marshal.ReleaseComObject` until refcount = 0, then `UnregisterProxy`.
- Subscription handles wrapped per item; `RemoveAdvise` + `RemoveItem` on dispose, in that order (event handlers must be unwired before the item handle goes away — undefined behavior otherwise).
- `CriticalFinalizerObject` for handle wrappers so finalizer ordering during AppDomain unload is predictable.
- Pre-shutdown drain: on Host stop, the Proxy first cancels all subscriptions cleanly via the STA pump (`AdviseSupervisory(stop)` → `RemoveItem` → `UnregisterProxy`). Only then does the Host exit. Fire-and-forget shutdown is a known v1 bug class — the four 2026-04-13 stability findings include "alarm auto-subscribe and transferred-subscription restore no longer race shutdown as untracked fire-and-forget tasks."
Subscription State and Reconnect
Galaxy's MXAccess advisory subscriptions are stateful — once established, Galaxy pushes value updates until RemoveAdvise. Network disconnects, Galaxy redeployments, and Platform/AppEngine restarts all break the subscription stream and require replay.
- Subscription registry in the Host: every `AddItem` + `AdviseSupervisory` is recorded so reconnect can replay
- Reconnect trigger: connection-health probe (see below) detects loss → marks subscriptions Disconnected → fans out Bad quality via Proxy → enters reconnect loop
- Replay order: register proxy → re-add items → re-advise. Order matters; re-advising an item that was never re-added wedges silently.
- Quality fan-out during the reconnect window respects host scope — per the same 2026-04-13 findings, a stopped DevAppEngine must not let a recovering DevPlatform's startup callback wipe Bad quality on the still-stopped engine's variables. Cross-host quality clear is gated on a host-status check.
- Symbol-version-changed equivalent: a Galaxy `time_of_last_deploy` change → driver invokes `IRediscoverable` → rebuild the affected subtree only (per the Galaxy platform scope filter, commit `bc282b6`)
Connection Health Probe (GalaxyRuntimeProbeManager)
A dedicated probe subscribes to a synthetic per-host runtime-status attribute (Platform/Engine ScanState). Probe state drives:
- Bad-quality fan-out when a host (Platform or AppEngine) reports Stopped
- Quality restoration when state transitions back to Running, scoped to that host's subtree only (not Galaxy-wide — closes the 2026-04-13 finding about a Running→Unknown→Running callback wiping sibling state)
- Probe failure handling: a failed probe subscription must NOT leave a phantom entry that `Tick()` flips to Stopped — phantom probes are an accidental Bad-quality source. Closed in `c76ab8f`.
Memory Watchdog Thresholds (Galaxy-specific)
Galaxy baseline depends heavily on Galaxy size. The platform scope filter (commit bc282b6) reduced a dev Galaxy's footprint from 49 objects / 4206 attributes (full Galaxy) to 3 objects / 386 attributes (local subtree). Real production Galaxies vary from a few hundred to tens of thousands of attributes.
| Threshold | Action |
|---|---|
| 1.5× baseline (per-instance, after warm-up) | Warning |
| 2× baseline OR baseline + 200 MB (whichever larger) | Soft recycle |
| 1.5 GB absolute hard ceiling | Force-kill |
| Slope > 5 MB/min sustained 30 min | Soft recycle |
Higher hard ceiling than FOCAS (1.5 GB vs 300 MB) because legitimate Galaxy baselines are larger. Same multiplier-with-floor pattern. The slope threshold is more permissive (5 MB/min vs 1 MB/min) because Galaxy's address-space rebuild on redeploy can transiently allocate large amounts.
Recycle Policy (COM-specific)
Soft recycle distinguishes between work items queued for the STA pump (cancellable before dispatch) and MXAccess calls in flight on the STA thread (not cancellable — COM has no abort).
- Stop accepting new IPC requests
- Cancel queued (not-yet-dispatched) STA work items
- Wait up to 15 s grace for the in-flight STA call to return (longer than FOCAS because some MXAccess calls — bulk attribute reads, large hierarchy traversals — legitimately take seconds)
- For each subscription: post `RemoveAdvise` → `RemoveItem` → release the item handle, in that order, on the STA thread
- For the proxy connection: post `UnregisterProxy` → `Marshal.ReleaseComObject` until refcount = 0 → release `MxAccessHandle`
- STA pump shutdown: post `WM_QUIT` only after all of the above have completed
- Flush the post-mortem ring buffer
- If the STA pump did not exit within 5 s of `WM_QUIT` → escalate to `Environment.Exit(2)`. A wedged COM call cannot be recovered cleanly; same logic as the FOCAS Abandoned-handle escalation.
- If clean → `Environment.Exit(0)`, supervisor respawns
Recycle frequency cap is the same as FOCAS (1/hour). Scheduled recycle defaults to 24 h.
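The shape of the soft recycle is grace-then-escalate: a bounded wait for the in-flight call, an ordered quit, and a hard exit if the pump wedges. A transport-agnostic sketch of that control flow, with `threading.Event` standing in for STA pump signals and the timeouts parameterized for testability (all names illustrative):

```python
import threading

def soft_recycle(inflight_done: threading.Event,
                 post_quit,                    # callable: posts the WM_QUIT equivalent
                 pump_exited: threading.Event,
                 flush_postmortem,             # callable: flushes the ring buffer
                 grace_s: float = 15.0,
                 quit_s: float = 5.0) -> int:
    """Return the process exit code for a soft recycle: 0 = clean, 2 = wedged."""
    # COM has no abort: we can only wait out the in-flight call, not cancel it.
    inflight_done.wait(timeout=grace_s)        # 15 s grace for in-flight STA call
    post_quit()                                # quit only after ordered teardown
    flush_postmortem()
    if not pump_exited.wait(timeout=quit_s):   # 5 s for the pump to drain and exit
        return 2                               # wedged COM call: escalate, hard exit
    return 0                                   # clean: supervisor respawns
```

The exit code is the contract with the supervisor: 0 means a scheduled respawn, 2 means the corpse should be treated like a crash (post-mortem read, backoff, crash-loop accounting).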
What Survives a Galaxy Recycle
| State | Survives? | How |
|---|---|---|
| Address space (built from Galaxy DB) | ✔ | Proxy caches the last built tree; rebuild from DB on host startup |
| Subscription set | ✔ | Proxy re-issues subscribe on host startup |
| Last-known values | ✔ (in proxy cache) | Surfaced as Bad quality during recycle window |
| Alarm state | partial | Active alarm registry replayed; AlarmTracking re-subscribes |
| In-flight reads | ✗ | BadCommunicationError; client retries |
| In-flight writes | ✗ | Per Polly write-retry policy: not auto-retried |
| Historian subscriptions | ✗ | Re-established on next HistoryRead |
| `time_of_last_deploy` watermark | ✔ | Cached in proxy; resync on startup avoids spurious full rebuild |
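The "last-known values" row is the proxy-side cache in miniature: values persist across the recycle window but are served with Bad quality, scoped to the lost host's subtree only. A small sketch of that behavior (node naming and API are illustrative assumptions):

```python
class ProxyValueCache:
    """Proxy-side last-known-value cache (sketch; not the real driver API)."""

    def __init__(self):
        self._values = {}                      # node id -> (value, quality)

    def on_value(self, node: str, value):
        self._values[node] = (value, "Good")   # fresh data always restores Good

    def on_host_lost(self, host_prefix: str):
        # Recycle window: keep last-known values, degrade quality to Bad,
        # scoped to the lost host's subtree only (never Galaxy-wide).
        for node, (value, _) in list(self._values.items()):
            if node.startswith(host_prefix):
                self._values[node] = (value, "Bad")

    def read(self, node: str):
        return self._values[node]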
Recovery Sequence After Crash
Same supervisor protocol as FOCAS, with one Galaxy-specific addition:
- Supervisor detects host exit
- Reads post-mortem MMF, attaches tail to crash event
- Proxy fans out Bad quality on all Galaxy nodes scoped to the lost host's platform (not necessarily every Galaxy node — multi-host respect is per the 2026-04-13 findings)
- Backoff: 5 s → 15 s → 60 s
- Spawn new Host
- Host checks `time_of_last_deploy`; if unchanged from cached watermark, skip full DB rediscovery and reuse cached hierarchy (faster recovery for the common case where the crash was unrelated to a redeploy)
- Re-register MXAccess proxy, re-add items, re-advise
- Quality returns to Good as values arrive
- 3 crashes in 5 minutes → crash-loop circuit opens (same escalating-cooldown rules as FOCAS)
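The backoff ladder and the crash-loop circuit can be sketched together: the 5-minute window is evaluated on every host exit, and three crashes inside it open the circuit instead of scheduling a respawn (class and method names are illustrative):

```python
import time
from collections import deque

class Supervisor:
    """Respawn policy sketch: 5/15/60 s backoff plus 3-crashes-in-5-min circuit."""
    BACKOFFS = [5, 15, 60]

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._crashes = deque()        # timestamps of recent host exits
        self._consecutive = 0          # resets once the host is healthy again

    def on_host_exit(self):
        now = self._clock()
        self._crashes.append(now)
        while self._crashes and now - self._crashes[0] > 300:
            self._crashes.popleft()    # drop crashes older than the 5-min window
        if len(self._crashes) >= 3:
            return "circuit-open"      # stop respawning; escalating cooldown applies
        delay = self.BACKOFFS[min(self._consecutive, len(self.BACKOFFS) - 1)]
        self._consecutive += 1
        return delay                   # seconds to wait before spawning a new Host

    def on_host_healthy(self):
        self._consecutive = 0          # backoff ladder restarts after recovery
```

With an injected clock the policy is trivially unit-testable, which matters given the regression-test stance this document takes on the 2026-04-13 findings.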
Post-Mortem Log Contents (Galaxy-specific)
In addition to the universal last-N-operations ring:
- STA pump state snapshot: thread ID, last-message-dispatched timestamp, queue depth
- Active subscription count + breakdown by host (Platform/AppEngine)
- `MxAccessHandle` refcount snapshot for every live handle
- Last 100 probe results with host status transitions
- Last redeploy event timestamp (from `time_of_last_deploy` polling)
- Galaxy DB connection state (last query duration, last error)
- Historian connection state if HDA enabled
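Decision #69's memory-mapped post-mortem log only works if every append lands in the mapping immediately and the write cursor is committed last, so a hard death (including an AV) loses at most the final record. A plain-file Python sketch of that ring layout (sizes and names illustrative; a real host would use a named shared mapping the supervisor opens after the corpse is gone):

```python
import mmap
import struct

class PostMortemRing:
    """File-backed ring buffer: 8-byte write cursor, then a fixed-size byte ring.
    Writes go straight into the mapping, so a reader can recover the tail even
    after a hard process death."""
    HEADER = 8

    def __init__(self, path: str, size: int = 4096):
        self._size = size
        with open(path, "a+b") as f:
            f.truncate(self.HEADER + size)     # new file is zero-filled: cursor = 0
        self._f = open(path, "r+b")
        self._mm = mmap.mmap(self._f.fileno(), self.HEADER + size)

    def append(self, line: str):
        data = (line + "\n").encode()
        (pos,) = struct.unpack_from("<Q", self._mm, 0)
        for b in data:
            off = self.HEADER + (pos % self._size)
            self._mm[off:off + 1] = bytes([b])
            pos += 1
        # Cursor committed last: a torn write loses at most the final record.
        struct.pack_into("<Q", self._mm, 0, pos)

    def tail(self) -> bytes:
        """Reconstruct the last <= size bytes in write order (supervisor side)."""
        (pos,) = struct.unpack_from("<Q", self._mm, 0)
        if pos <= self._size:
            return bytes(self._mm[self.HEADER:self.HEADER + pos])
        cut = pos % self._size
        ring = bytes(self._mm[self.HEADER:self.HEADER + self._size])
        return ring[cut:] + ring[:cut]
```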
Test Coverage for Galaxy Stability
Galaxy is the easiest of the Tier C drivers to test because the dev machine already has a real Galaxy. Three test surfaces:
- Real Galaxy on dev machine (per `test-data-sources.md`) — the primary integration test environment. Covers MXAccess wire behavior, subscription replay, redeploy-triggered rediscovery, host status transitions.
- `Driver.Galaxy.FaultShim` — analogous to the FOCAS FaultShim, a test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding. Injects: COM exception at chosen call site, subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning unexpected refcount, STA pump deadlock simulation.
- v1 IntegrationTests parity suite — the existing v1 test suite must pass against the v2 Galaxy driver before move-behind-IPC is considered complete (decision #56). This is the primary regression net.
The 2026-04-13 stability findings should each become a regression test in the parity suite — phantom probe subscription, cross-host quality clear, sync-over-async on stack thread, fire-and-forget shutdown race. Closing those bugs without test coverage is how they come back.
Decision Additions for plan.md
Proposed new entries for the Decision Log (numbering continues from #62):
| # | Decision | Rationale |
|---|---|---|
| 63 | Driver stability tier model (A/B/C) | Drivers vary in failure profile; tier dictates hosting and protection level. See driver-stability.md |
| 64 | FOCAS is Tier C — out-of-process Windows service | Fwlib64.dll is black-box, AV uncatchable, handle-affinity, no SLA. Same Proxy/Host/Shared pattern as Galaxy |
| 65 | Cross-cutting protections mandatory in all tiers | SafeHandle, memory watchdog, bounded queues, scheduled recycle, post-mortem log apply to every driver process |
| 66 | Out-of-process driver pattern is reusable | Galaxy.Proxy/Host/Shared template generalizes to any Tier C driver; FOCAS is the second user |
| 67 | Tier B drivers may escalate to Tier C on production evidence | libplctag, S7netplus, TwinCAT.Ads start in-process; promote if leaks or crashes appear in production |
| 68 | Crash-loop circuit breaker stops respawn after 3 crashes/5 min | Prevents thrashing; requires manual reset to surface an operator-actionable problem |
| 69 | Post-mortem log via memory-mapped file | Survives hard process death (including AV); supervisor reads after corpse is gone; only viable post-mortem path for native crashes |
Resolved Defaults
The three open questions from the initial draft are resolved as follows. All values are tunable per-driver-instance in central config; the defaults are what ships out of the box.
Watchdog thresholds — hybrid multiplier + absolute floor + hard ceiling
Pure multipliers misfire on tiny baselines (a 30 MB FOCAS Host shouldn't recycle at 45 MB). Pure absolute thresholds in MB don't scale across deployment sizes. Hybrid: trigger on whichever threshold is reached first — `max(N × baseline, baseline + floor MB)` for warn/recycle, plus an absolute hard ceiling that always force-kills. Slope detection stays orthogonal — it catches slow leaks well below any level threshold.
Crash-loop reset — auto-reset with escalating cooldown, sticky alert, 24 h manual floor
Manual-only reset is too rigid for unattended plants (CNC sites don't have operators on console 24/7). Pure auto-reset after a fixed cooldown defeats the purpose of the breaker by letting it silently retry forever. Escalating cooldown (1 h → 4 h → 24 h-with-manual-reset) auto-recovers from transient problems while ensuring persistent problems eventually demand human attention. Sticky alerts that don't auto-clear keep the trail visible regardless.
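The escalating-cooldown policy reduces to a small state machine: each trip moves one tier up the ladder, the top tier demands a manual reset, and the alert stays sticky throughout. A sketch (names illustrative):

```python
class CrashLoopBreaker:
    """Escalating-cooldown sketch: 1 h -> 4 h -> 24 h, manual reset at the floor."""
    COOLDOWNS_S = [3600, 4 * 3600, 24 * 3600]

    def __init__(self):
        self._trips = 0
        self.alert_sticky = False        # alerts never auto-clear

    def on_trip(self):
        """Called when the circuit opens (3 crashes / 5 min). Returns
        (cooldown seconds, manual-reset-required)."""
        level = min(self._trips, len(self.COOLDOWNS_S) - 1)
        self._trips += 1
        self.alert_sticky = True
        manual = level == len(self.COOLDOWNS_S) - 1    # 24 h tier: human required
        return self.COOLDOWNS_S[level], manual

    def manual_reset(self):
        self._trips = 0    # ladder restarts; the sticky alert is acknowledged elsewhere
```

Transient problems clear themselves at the 1 h and 4 h tiers; anything that reaches the 24 h tier has, by construction, survived two unattended retries and now waits for a human.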
Heartbeat cadence — 2 s with 3-miss tolerance
5 s × 3 misses = 15 s detection is too slow against typical 1 s OPC UA publishing intervals (subscribers see Bad quality 15+ samples late). 1 s × 3 = 3 s is plausible but raises false-positive rate from GC pauses and Windows pipe scheduling. 2 s × 3 = 6 s is the sweet spot: subscribers see Bad quality within one or two missed publish cycles, GC pauses (~500 ms typical) and pipe jitter stay well inside the tolerance budget.
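The cadence math above is just a staleness budget: interval times tolerance gives 6 s with the shipped defaults. A sketch of the detection side (names illustrative):

```python
class HeartbeatMonitor:
    """2 s cadence, 3-miss tolerance: host declared dead ~6 s after its last beat."""

    def __init__(self, interval_s: float = 2.0, tolerance: int = 3):
        self._budget = interval_s * tolerance    # 6 s with the shipped defaults
        self._last_beat = None

    def on_beat(self, now: float):
        self._last_beat = now

    def is_alive(self, now: float) -> bool:
        if self._last_beat is None:
            return True                          # grace until the first beat arrives
        # A ~500 ms GC pause or pipe jitter consumes well under one interval,
        # so it never exhausts the three-miss budget.
        return (now - self._last_beat) <= self._budget
```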