# Galaxy Driver The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); all other drivers run in-process in the main Server (.NET 10 x64). Galaxy is the exception — it runs as its own Windows service and talks to the Server over a local named pipe. For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md). ## Project Split Galaxy ships as three projects: | Project | Target | Role | |---------|--------|------| | `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides | | `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager | | `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe` — loaded in-process by the Server; every call forwards over the pipe to the Host | The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process. ## Why Out-of-Process Two reasons drive the split, per `docs/v2/plan.md`: 1. **Bitness constraint.** MXAccess is 32-bit COM only — `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit. 2. **Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md) — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host with crash-loop circuit-breaker. The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver. ## IPC Transport `GalaxyProxyDriver` → `GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server. - Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface) - Wire format: MessagePack-CSharp, length-prefixed frames - ACL: pipe is created with a DACL that grants only the Server's service identity; the Admins group is explicitly denied so a live-smoke test running from an elevated shell fails fast rather than silently bypassing the handshake - Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}` - Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.) serializes a `*Request`, awaits the matching `*Response` via a `CallAsync` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects. ## STA Thread Requirement (Host-side) MXAccess COM objects — `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption. `StaComThread` in the Host provides that thread with the apartment state set before the thread starts: ```csharp _thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true }; _thread.SetApartmentState(ApartmentState.STA); ``` Work items queue via `RunAsync(Action)` or `RunAsync(Func)` into a `ConcurrentQueue` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread — including the IPC handler thread that receives the inbound pipe request. ## Win32 Message Pump (Host-side) COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke: 1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works) 2. `GetMessage` blocks until a message arrives 3. `WM_APP` drains the work queue 4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop 5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery Without this pump MXAccess callbacks never fire and the driver delivers no live data. ## LMXProxyServer COM Object `MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle: 1. **`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle 2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject` ## Register / AddItem / AdviseSupervisory Pattern Every MXAccess data operation follows a three-step pattern, all executed on the STA thread: 1. **`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle 2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks 3. The runtime begins delivering `OnDataChange` events For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`. ## OnDataChange and OnWriteComplete Callbacks ### OnDataChange Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`: 1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress` 2. Maps the MXAccess quality code to the internal `Quality` enum 3. Checks `MXSTATUS_PROXY` for error details and adjusts quality 4. Converts the timestamp to UTC 5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to: - The stored per-tag subscription callback - Any pending one-shot read completions - The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`) ### OnWriteComplete Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource` for the item handle. If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged. ## Reconnection Logic `MxAccessClient` implements automatic reconnection through two mechanisms. ### Monitor loop `StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle: - If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync` - If connected and a probe tag is configured, it checks the probe staleness threshold ### Reconnect sequence `ReconnectAsync` performs a full disconnect-then-connect cycle: 1. Increment the reconnect counter 2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings 3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each. ## Probe Tag Health Monitoring A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data. ## Per-Host Runtime Status Probes (`.ScanState`) Separate from the connection-level probe, the driver advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down. Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields. ### How it works `GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped): 1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff. 2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**. 3. **On-change-only delivery** — `ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts. 4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped". 5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1. ### Subtree quality invalidation on transition When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value. The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables. ### Read-path short-circuit (`IsTagUnderStoppedHost`) The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing. ### Deferred dispatch — the STA deadlock **Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern. The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement. ### Dashboard and health surface - Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected) - `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields. ## Request Timeout Safety Backstop Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely. On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation. `ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3. All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time. ## Why Marshal.ReleaseComObject Is Needed The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance. ## Tag Discovery and Historical Data Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path. Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md). ## Key source files Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`): - `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump - `Backend/MxAccess/MxAccessClient.cs` — Core client (partial) - `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect - `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay - `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations - `Backend/MxAccess/MxAccessClient.EventHandlers.cs` — `OnDataChange` / `OnWriteComplete` handlers - `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor - `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper - `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup - `Backend/Historian/HistorianDataSource.cs` — `aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md)) - `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch - `Domain/IMxAccessClient.cs` — Client interface Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`): - `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …) - `Contracts/*.cs` — MessagePack DTOs for every request/response pair Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`): - `GalaxyProxyDriver.cs` — `IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient` - `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync`, reconnect on broken pipe - `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch