# MXAccess Bridge

The MXAccess bridge connects the OPC UA server to the AVEVA System Platform runtime through the `ArchestrA.MxAccess` COM API. It handles all COM threading requirements, translates between OPC UA read/write requests and MXAccess operations, and manages connection health.

## STA Thread Requirement

MXAccess is a COM-based API that requires a Single-Threaded Apartment (STA). All COM operations -- `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls -- must execute on the same STA thread. Calling COM objects from the wrong thread causes marshalling failures or silent data corruption.

`StaComThread` provides a dedicated STA thread with the apartment state set before the thread starts:

```csharp
_thread = new Thread(ThreadEntry)
{
    Name = "MxAccess-STA",
    IsBackground = true
};
_thread.SetApartmentState(ApartmentState.STA);
```

Work items are queued via `RunAsync(Action)` or `RunAsync(Func)`, which enqueue the work to a `ConcurrentQueue` and post a `WM_APP` message to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread.

## Win32 Message Pump

COM callbacks (like `OnDataChange`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump using P/Invoke:

1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` messages drain the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages are passed through `TranslateMessage`/`DispatchMessage` for COM callback delivery

Without this message pump, MXAccess COM callbacks would never fire and the server would receive no live data.

## LMXProxyServer COM Object

`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface.
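
A minimal sketch of what such a proxy abstraction could look like, together with an in-memory stand-in for tests. The member signatures, the `DataChanged` event, the `FakeMxProxy` class, and its `Push` helper are all assumptions made for illustration; the real `IMxProxy` in the codebase may be shaped differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of the proxy abstraction; the real IMxProxy members
// and signatures may differ.
public interface IMxProxy
{
    int Register(string clientName);
    void Unregister(int handle);
    int AddItem(int handle, string address);
    void AdviseSupervisory(int handle, int itemHandle);
    event Action<int, object> DataChanged; // (itemHandle, value)
}

// Hypothetical in-memory stand-in: needs no ArchestrA runtime or COM registration.
public sealed class FakeMxProxy : IMxProxy
{
    private readonly Dictionary<int, string> _items = new Dictionary<int, string>();
    private readonly HashSet<int> _advised = new HashSet<int>();
    private int _nextItemHandle = 1;

    public event Action<int, object> DataChanged;

    // Always hands out the same connection handle.
    public int Register(string clientName) => 1;

    public void Unregister(int handle)
    {
        _items.Clear();
        _advised.Clear();
    }

    // Maps a tag address to a fresh integer item handle.
    public int AddItem(int handle, string address)
    {
        int itemHandle = _nextItemHandle++;
        _items[itemHandle] = address;
        return itemHandle;
    }

    public void AdviseSupervisory(int handle, int itemHandle) => _advised.Add(itemHandle);

    // Test helper: simulates a runtime data change; only advised items raise the event.
    public void Push(int itemHandle, object value)
    {
        if (_advised.Contains(itemHandle))
        {
            DataChanged?.Invoke(itemHandle, value);
        }
    }
}
```

A test would call `Register`, `AddItem`, and `AdviseSupervisory` on the fake, then use `Push` to drive the code under test as if the runtime had delivered a value.
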
This abstraction allows unit tests to substitute a fake proxy without requiring the ArchestrA runtime.

The COM object lifecycle:

1. **`Register(clientName)`** -- Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, and calls `Register` to obtain a connection handle
2. **`Unregister(handle)`** -- Unwires event handlers, calls `Unregister`, and releases the COM object via `Marshal.ReleaseComObject`

## Register/AddItem/AdviseSupervisory Pattern

Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:

1. **`AddItem(handle, address)`** -- Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
2. **`AdviseSupervisory(handle, itemHandle)`** -- Subscribes the item for supervisory data change callbacks
3. The runtime begins delivering `OnDataChange` events for the item

For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value to the runtime. The `OnWriteComplete` callback confirms or rejects the write.

Cleanup reverses the pattern: `UnAdviseSupervisory` then `RemoveItem`.

## OnDataChange and OnWriteComplete Callbacks

### OnDataChange

Fired by the COM runtime on the STA thread when a subscribed tag value changes. The handler in `MxAccessClient.EventHandlers.cs`:

1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality accordingly
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
   - The stored per-tag subscription callback
   - Any pending one-shot read completions
   - The global `OnTagValueChanged` event (consumed by `LmxNodeManager`)

### OnWriteComplete

Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource` for the item handle.
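
One way to pair a write with its completion callback is a map of pending completions keyed by item handle. The sketch below assumes that design; `PendingWrites`, `Track`, and `Complete` are hypothetical names and are not taken from the codebase.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper: tracks one in-flight write per item handle.
public sealed class PendingWrites
{
    private readonly ConcurrentDictionary<int, TaskCompletionSource<bool>> _pending =
        new ConcurrentDictionary<int, TaskCompletionSource<bool>>();

    // Called just before the write is issued; the caller awaits the returned task.
    public Task<bool> Track(int itemHandle)
    {
        // RunContinuationsAsynchronously keeps the awaiting code from running
        // inline on the thread that later completes the task.
        var tcs = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[itemHandle] = tcs;
        return tcs.Task;
    }

    // Called from the write-complete handler. Returns false when nothing
    // was waiting on this item handle.
    public bool Complete(int itemHandle, bool success)
    {
        if (!_pending.TryRemove(itemHandle, out var tcs))
        {
            return false;
        }
        return tcs.TrySetResult(success);
    }
}
```

Completing with `RunContinuationsAsynchronously` matters when the callback arrives on the STA thread: the awaiting caller resumes on the thread pool rather than inside the message pump.
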
If `MXSTATUS_PROXY.success == 0`, the write is considered failed and the error detail is logged.

## Reconnection Logic

`MxAccessClient` implements automatic reconnection through two mechanisms:

### Monitor loop

`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:

- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold

### Reconnect sequence

`ReconnectAsync` performs a full disconnect-then-connect cycle:

1. Increment the reconnect counter
2. `DisconnectAsync` -- Tears down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detaches COM event handlers, calls `Unregister`, and clears all handle mappings
3. `ConnectAsync` -- Creates a fresh `LMXProxyServer`, registers, replays all stored subscriptions, and re-subscribes the probe tag

Stored subscriptions (`_storedSubscriptions`) persist across reconnects. When `ConnectAsync` succeeds, `ReplayStoredSubscriptionsAsync` iterates all stored entries and calls `AddItem` + `AdviseSupervisory` for each.

## Probe Tag Health Monitoring

A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange` callback.

The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.

## Per-Host Runtime Status Probes (`.ScanState`)

Separate from the connection-level probe above, the bridge advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy.
These probes track per-host runtime state so the dashboard can report "this specific Platform / AppEngine is off scan" and the bridge can proactively invalidate every OPC UA variable hosted by the stopped object -- preventing MxAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.

Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](Configuration.md#mxaccess) for the two config fields.

### How it works

`GalaxyRuntimeProbeManager` is owned by `LmxNodeManager` and operates on a simple three-state machine per host (Unknown / Running / Stopped):

1. **Discovery** -- After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** -- A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**.
3. **On-change-only delivery** -- `ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries -- the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** -- When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state.
   The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped."
5. **Subscribe failure rollback** -- If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Without this rollback, a failed subscribe would leave the entry in `Unknown` forever, and `Tick()` would later transition it to `Stopped` after the unknown-resolution timeout, fanning out a **false-negative** host-down signal that invalidates the subtree of a host that was never actually advised. Stability review 2026-04-13 Finding 1.

### Subtree quality invalidation on transition

When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` -- the set of every OPC UA variable transitively hosted by that Galaxy object -- and sets each variable's `StatusCode` to `BadOutOfService`. The reverse happens on **Stopped → Running**: `ClearHostVariablesBadQuality` resets each to `Good` and lets subsequent on-change MxAccess updates repopulate the values.

The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in **both** the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.

### Read-path short-circuit (`IsTagUnderStoppedHost`)

The `LmxNodeManager.Read` override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called `_mxAccessClient.ReadAsync(tagRef)` unconditionally and returned whatever VTQ the runtime reported. That created a gap: MxAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan.
The Read override now checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MxAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MxAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.

### Deferred dispatch: the STA deadlock

**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the `LmxNodeManager.Lock`, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait -- the first real deploy of this feature hung inside 30 seconds from exactly this pattern.

The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop -- **outside** any locks held by the STA path -- and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural `Lock` acquisition. No circular wait, no STA dispatch involvement.

See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-flight debugging that led to this pattern.

### Dashboard + health surface

- Dashboard **Galaxy Runtime** panel between Galaxy Info and Historian shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MxAccess transport disconnected).
- Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish bridge-owned probe count from client-driven subscriptions.
- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MxAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging.

See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields.

## Request Timeout Safety Backstop

Every sync-over-async site on the OPC UA stack thread that calls into MxAccess (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). This is a backstop: `MxAccessClient.Read/Write` already enforce inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path. The outer wrapper exists so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.

On timeout, the underlying task is **not** cancelled -- it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.

`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.

## Why Marshal.ReleaseComObject Is Needed

The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration.
`MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to release the COM reference immediately rather than waiting for the garbage collector. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.

## Key source files

- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/StaComThread.cs` -- STA thread and Win32 message pump
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.cs` -- Core client class (partial)
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Connection.cs` -- Connect, disconnect, reconnect
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Subscription.cs` -- Subscribe, unsubscribe, replay
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs` -- Read and write operations
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown` / `Running` / `Stopped` enum
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface