docs: post-PR-7.2 cleanup — audit + three-track scrub

Audit (three parallel agent passes) found 43 markdown files carrying
stale references to the deleted Galaxy.Host/Proxy/Shared projects
after the v2-mxgw merge. This commit lands the prioritized fixes.

Track 1 — high-traffic in-place rewrites (3 files, ~454 lines deleted)
- README.md (202 → 91 lines): drops .NET 4.8 / x86 / TopShelf install
  text; leads with the multi-driver .NET 10 server identity and points
  at scripts/install/Install-Services.ps1 and the parity rig.
- docs/v2/driver-specs.md §1 Galaxy (~289 → ~66 lines): replaces the
  Tier-C out-of-process spec with a Tier-A in-process description
  matching the current GalaxyDriver code, with the four-section
  GalaxyDriverOptions JSON shape pulled verbatim from
  Config/GalaxyDriverOptions.cs.
- docs/drivers/Galaxy.md (211 → 92 lines): full rewrite around the
  current Browse/Runtime/Health/Config sub-folders.

Track 2 — historical banners (5 files)
- lmx_mxgw.md, lmx_mxgw_impl.md, lmx_backend.md,
  docs/v2/Galaxy.ParityMatrix.md,
  docs/v2/implementation/phase-2-galaxy-out-of-process.md each get a
  " Completed 2026-04-30 — historical record" banner block. lmx_mxgw.md
  also fixes two dead links (`docs/Galaxy.Driver.md` and
  `docs/v2/Galaxy.Driver.md`) → `docs/drivers/Galaxy.md`.

Track 3 — v1 archive sweep (10 git mv + 1 new index + 2 in-place scrubs)
- Moved 10 v1 docs under docs/v1/ preserving subpath structure:
  AlarmTracking, Configuration, DataTypeMapping, HistoricalDataAccess,
  Subscriptions (top-level); drivers/Galaxy-Repository,
  drivers/Galaxy-Test-Fixture; reqs/GalaxyRepositoryReqs,
  reqs/MxAccessClientReqs, reqs/ServiceHostReqs.
- New docs/v1/README.md is the shared archive banner + per-file table.
- docs/README.md repointed to the v1 paths and updated to reflect the
  v2 two-process deploy shape (Server + Admin + optional
  OtOpcUaWonderwareHistorian).
- docs/v2/Galaxy.ParityRig.md got a historical banner + four inline
  scrubs marking the OtOpcUaGalaxyHost service / Driver.Galaxy.Host
  EXE / Driver.Galaxy.ParityTests project as deleted-in-PR-7.2.

The repo's live-reading surface (README + CLAUDE.md + docs/v2/) now
describes only the post-PR-7.2 architecture. v1 docs are preserved as
a labelled archive under docs/v1/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-30 08:59:59 -04:00
parent ae7106dfce
commit 006af51768
21 changed files with 322 additions and 671 deletions

View File

@@ -1,211 +1,92 @@
# Galaxy Driver
The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); all other drivers run in-process in the main Server (.NET 10 x64). Galaxy is the exception — it runs as its own Windows service and talks to the Server over a local named pipe.
The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies. It is a **Tier-A in-process driver** that runs in the OtOpcUa server's .NET 10 AnyCPU process and speaks gRPC to a separately installed `mxaccessgw` server (sibling repo at `c:\Users\dohertj2\Desktop\mxaccessgw\`). The gateway owns the MXAccess COM apartment, the STA + Win32 message pump, the Galaxy Repository SQL reader, and the Historian SDK — all the bits that need x86 / .NET Framework 4.8 / COM interop. The driver itself is platform-agnostic and contains no COM, no STA thread, and no x86 bitness constraint.
For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md).
For the driver spec (capability surface, config shape, addressing), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md). For the gateway setup recipe, see [docs/v2/Galaxy.ParityRig.md](../v2/Galaxy.ParityRig.md). For tracing, metrics, and soak profile, see [docs/v2/Galaxy.Performance.md](../v2/Galaxy.Performance.md).
## Project Split
> **Note**: the related drivers `Galaxy-Repository.md` and `Galaxy-Test-Fixture.md` describe the previous v1 / out-of-process topology and are being moved to `docs/v1/` by a parallel cleanup track. Use `Galaxy.ParityRig.md` and the `mxaccessgw` repo for current testing.
Galaxy ships as three projects:
## Architecture
| Project | Target | Role |
|---------|--------|------|
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe` — loaded in-process by the Server; every call forwards over the pipe to the Host |
The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process.
## Why Out-of-Process
Two reasons drive the split, per `docs/v2/plan.md`:
1. **Bitness constraint.** MXAccess is 32-bit COM only — `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit.
2. **Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md) — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host with crash-loop circuit-breaker.
The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver.
## IPC Transport
`GalaxyProxyDriver``GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server.
- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface)
- Wire format: MessagePack-CSharp, length-prefixed frames
- ACL: pipe is created with a DACL that grants `ReadWrite | Synchronize` only to the configured Server service-principal SID + denies `LocalSystem`. The per-connection SID check in `PipeServer.VerifyCaller` is the real authorization boundary — any caller whose impersonated token SID doesn't match the allowed SID is dropped before the first frame is read.
- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}`
- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host
Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.) serializes a `*Request`, awaits the matching `*Response` via a `CallAsync<TReq, TResp>` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects.
## STA Thread Requirement (Host-side)
MXAccess COM objects — `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption.
`StaComThread` in the Host provides that thread with the apartment state set before the thread starts:
```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```
+---------------------------------------+
| OtOpcUa.Server (.NET 10 AnyCPU) |
| GalaxyDriver (in-process) |
| ITagDiscovery / IReadable / |
| IWritable / ISubscribable / |
| IRediscoverable / |
| IHostConnectivityProbe |
+-------------------+-------------------+
|
gRPC (default http://localhost:5120)
|
v
+---------------------------------------+
| mxaccessgw (sibling repo) |
| +-------------------------------+ |
| | MxGateway.Worker (x86 net48) | |
| | STA + WM_APP pump | |
| | ArchestrA.MxAccess COM | |
| | Galaxy Repository SQL | |
| | Wonderware Historian SDK | |
| +-------------------------------+ |
+---------------------------------------+
```
Work items queue via `RunAsync(Action)` or `RunAsync<T>(Func<T>)` into a `ConcurrentQueue<Action>` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread — including the IPC handler thread that receives the inbound pipe request.
History reads + alarm-condition tracking moved server-side in PR 7.2 (`IHistoryRouter`, `AlarmConditionService`). Galaxy no longer implements `IHistoryProvider` or `IAlarmSource` of its own.
## Win32 Message Pump (Host-side)
## Project Layout
COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke:
The driver ships as a single project: `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/` (.NET 10, AnyCPU). Sub-folders:
1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` drains the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery
| Folder | Role |
|--------|------|
| `Browse/` | Static-side discovery: `GalaxyDiscoverer` walks the gateway's hierarchy + attribute-set RPCs, `DataTypeMap` and `SecurityMap` translate Galaxy types and security classifications into OPC UA equivalents, `AlarmRefBuilder` extracts alarm-bearing attribute references for the server-layer alarm engine. `IGalaxyHierarchySource` + `GatewayGalaxyHierarchySource` + `TracedGalaxyHierarchySource` decorate the gateway browse RPC; `IGalaxyDeployWatchSource` + `GatewayGalaxyDeployWatchSource` + `DeployWatcher` drive `IRediscoverable`. |
| `Runtime/` | Live data path: `EventPump` runs the gateway's `StreamEvents` RPC and fans out to subscribers via a bounded channel; `GalaxyMxSession` is the read-side handle; `GatewayGalaxySubscriber` + `GatewayGalaxyDataWriter` (each with a `Traced*` decorator) implement `ISubscribable` / `IWritable`; `SubscriptionRegistry` tracks subscription state for replay; `ReconnectSupervisor` owns the backoff loop and triggers `ReplaySubscriptions` on session loss; `StatusCodeMap` translates gateway StatusCodes to OPC UA; `MxValueDecoder` / `MxValueEncoder` handle scalar + array marshalling; `GalaxyTelemetry` + `GalaxySubscriptionHandle` round out the surface. |
| `Health/` | `HostStatusAggregator` rolls per-platform probe state into the driver's `IHostConnectivityProbe` view; `PerPlatformProbeWatcher` listens on the gateway's per-host status stream; `HostConnectivityForwarder` pushes transitions out to the server's connectivity bus. |
| `Config/` | `GalaxyDriverOptions` and the four nested option records (`GalaxyGatewayOptions`, `GalaxyMxAccessOptions`, `GalaxyRepositoryOptions`, `GalaxyReconnectOptions`). |
Without this pump MXAccess callbacks never fire and the driver delivers no live data.
Project root files:
## LMXProxyServer COM Object
- `GalaxyDriver.cs``IDriver` + capability-interface implementation; composes the Browse / Runtime / Health collaborators.
- `GalaxyDriverFactoryExtensions.cs` — DI registration helper used by the server's driver bootstrap.
`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle:
## Capability Surface
1. **`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle
2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject`
`GalaxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable, IHostConnectivityProbe, IDisposable`.
## Register / AddItem / AdviseSupervisory Pattern
| Capability | Implementation entry point |
|------------|---------------------------|
| `ITagDiscovery` | `Browse/GalaxyDiscoverer.cs` |
| `IRediscoverable` | `Browse/DeployWatcher.cs` |
| `IReadable` | `Runtime/GalaxyMxSession.cs` |
| `IWritable` | `Runtime/GatewayGalaxyDataWriter.cs` |
| `ISubscribable` | `Runtime/GatewayGalaxySubscriber.cs` (driven by `EventPump`) |
| `IHostConnectivityProbe` | `Health/HostStatusAggregator.cs` |
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
## Configuration
1. **`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks
3. The runtime begins delivering `OnDataChange` events
`DriverConfig` JSON binds to `Config/GalaxyDriverOptions.cs`. The four sections are:
For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`.
- **`Gateway`** — endpoint, API key secret ref, TLS knobs, connect/call/stream timeouts. `StreamTimeoutSeconds = 0` keeps the long-lived `StreamEvents` RPC open for the driver's lifetime.
- **`MxAccess`** — `ClientName` (must be unique per OtOpcUa instance — redundancy pairs enforce uniqueness at install time), `PublishingIntervalMs` (forwarded as `buffered_update_interval_ms` on subscribe), `WriteUserId` for ArchestrA secured-write, `EventPumpChannelCapacity` (default 50_000 — one second of headroom at 50k tags / 1Hz; tune via the `galaxy.events.dropped` metric).
- **`Repository`** — `DiscoverPageSize`, `WatchDeployEvents`.
- **`Reconnect`** — `InitialBackoffMs`, `MaxBackoffMs`, `ReplayOnSessionLost` (calls the gateway's `ReplaySubscriptions` RPC after reconnect rather than re-issuing subscribe-bulk for every tag).
## OnDataChange and OnWriteComplete Callbacks
Full per-field descriptions live in `Config/GalaxyDriverOptions.cs`. The full JSON skeleton is reproduced in [docs/v2/driver-specs.md §1](../v2/driver-specs.md).
### OnDataChange
## Reconnect + Replay
Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`:
`ReconnectSupervisor` owns an exponential-backoff loop bounded by `Reconnect.InitialBackoffMs` / `MaxBackoffMs`. On session loss it tears down the gRPC channel, redials, and — when `ReplayOnSessionLost = true` — calls the gateway's `ReplaySubscriptions` RPC with the cached subscription set from `SubscriptionRegistry` instead of re-subscribing tag-by-tag. The gateway's worker then re-issues `AdviseSupervisory` server-side under the apartment lock.
1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
- The stored per-tag subscription callback
- Any pending one-shot read completions
- The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`)
## Testing
### OnWriteComplete
- **Unit tests**: `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/` — fakes the gateway gRPC surface; covers Browse, Runtime, Health, and Config in isolation.
- **Parity rig + dev-rig walkthrough**: see [docs/v2/Galaxy.ParityRig.md](../v2/Galaxy.ParityRig.md). The rig stands up a real `mxaccessgw` against a live Galaxy and exercises the full read / write / subscribe / rediscover path.
- **Performance + soak**: see [docs/v2/Galaxy.Performance.md](../v2/Galaxy.Performance.md).
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged.
## Operational Notes
## Reconnection Logic
`MxAccessClient` implements automatic reconnection through two mechanisms.
### Monitor loop
`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold
### Reconnect sequence
`ReconnectAsync` performs a full disconnect-then-connect cycle:
1. Increment the reconnect counter
2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings
3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag
Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each.
## Probe Tag Health Monitoring
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
Separate from the connection-level probe, the driver advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields.
### How it works
`GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped):
1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**.
3. **On-change-only delivery**`ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped".
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1.
### Subtree quality invalidation on transition
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value.
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
### Read-path short-circuit (`IsTagUnderStoppedHost`)
The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]``GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
### Deferred dispatch — the STA deadlock
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement.
### Dashboard and health surface
- Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected)
- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped
See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields.
## Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time.
## Why Marshal.ReleaseComObject Is Needed
The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
## Tag Discovery and Historical Data
Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path.
Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md).
## Key source files
Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`):
- `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump
- `Backend/MxAccess/MxAccessClient.cs` — Core client (partial)
- `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect
- `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay
- `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations
- `Backend/MxAccess/MxAccessClient.EventHandlers.cs``OnDataChange` / `OnWriteComplete` handlers
- `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor
- `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper
- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `Backend/Historian/HistorianDataSource.cs``aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
- `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch
- `Domain/IMxAccessClient.cs` — Client interface
Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`):
- `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …)
- `Contracts/*.cs` — MessagePack DTOs for every request/response pair
Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`):
- `GalaxyProxyDriver.cs``IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient`
- `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync<TReq, TResp>`, reconnect on broken pipe
- `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch
- **MXAccess `ClientName` collisions**: two OtOpcUa instances sharing a `ClientName` cause the older Wonderware session to lose subscription state. Redundancy pairs (decision #149) enforce uniqueness via install scripts.
- **Channel saturation**: `galaxy.events.dropped > 0` indicates `EventPump` is back-pressured. Raise `EventPumpChannelCapacity` or investigate downstream slowness in the server-side fan-out.
- **Connectivity surface**: per-platform probe state is exposed through `IHostConnectivityProbe` and aggregated by the server's connectivity bus — there is no driver-private dashboard surface anymore. The Admin UI's Host Status panel is the consumer.