# MXAccess Bridge

The MXAccess bridge connects the OPC UA server to the AVEVA System Platform runtime through the `ArchestrA.MxAccess` COM API. It handles all COM threading requirements, translates between OPC UA read/write requests and MXAccess operations, and manages connection health.

## STA Thread Requirement

MXAccess is a COM-based API that requires a Single-Threaded Apartment (STA). Every COM call -- `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup -- must execute on the same STA thread. Calling the COM object from any other thread causes marshalling failures or silent data corruption.

`StaComThread` provides a dedicated STA thread with the apartment state set before the thread starts:

```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```

Work items are queued via `RunAsync(Action)` or `RunAsync<T>(Func<T>)`, which enqueue the work to a `ConcurrentQueue<Action>` and post a `WM_APP` message to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread.
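
The queue-plus-`TaskCompletionSource` idea can be sketched without any COM or Win32 dependency. The class below is a hypothetical stand-in, not the project's `StaComThread`: it swaps the `WM_APP` wake-up for a `BlockingCollection<Action>` and does not set the apartment state, so it only illustrates how a caller on any thread gets an awaitable result from work that runs on one dedicated thread. The names `DedicatedWorkerThread` and `ThreadId` are assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for StaComThread: one dedicated thread drains queued work.
// A BlockingCollection plays the role of the queue plus the WM_APP wake-up.
public sealed class DedicatedWorkerThread : IDisposable
{
    private readonly BlockingCollection<Action> _queue = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public DedicatedWorkerThread()
    {
        _thread = new Thread(Drain) { Name = "Worker-Sketch", IsBackground = true };
        _thread.Start();
    }

    public int ThreadId => _thread.ManagedThreadId;

    // Queue a function; the returned task completes with its result or its exception.
    public Task<T> RunAsync<T>(Func<T> work)
    {
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        _queue.Add(() =>
        {
            try { tcs.SetResult(work()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        return tcs.Task;
    }

    // Action overload, expressed through the generic one.
    public Task RunAsync(Action work) => RunAsync<object>(() => { work(); return null; });

    private void Drain()
    {
        // Blocks until work arrives; ends once CompleteAdding was called and the queue is empty.
        foreach (Action item in _queue.GetConsumingEnumerable()) item();
    }

    public void Dispose()
    {
        _queue.CompleteAdding();
        _thread.Join();
        _queue.Dispose();
    }
}
```

`RunContinuationsAsynchronously` keeps an awaiting caller's continuation from running inline on the worker thread, which matters when that thread must stay free to service further work.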

## Win32 Message Pump

COM callbacks (like `OnDataChange`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump using P/Invoke:

1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` messages drain the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages are passed through `TranslateMessage`/`DispatchMessage` for COM callback delivery

Without this message pump, MXAccess COM callbacks would never fire and the server would receive no live data.
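
The five steps above can be sketched with the standard `user32.dll` entry points. This is a Windows-only illustration under stated assumptions, not the project's actual `StaComThread` code: the class name `MessagePumpSketch`, the constant name `WM_APP_SHUTDOWN`, and the shape of the `Run` method are invented for the example, and the posting side (`PostThreadMessage`) is left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

// Windows-only sketch of the loop described above.
public static class MessagePumpSketch
{
    public const uint WM_APP = 0x8000;               // wake-up: drain the work queue
    public const uint WM_APP_SHUTDOWN = WM_APP + 1;  // hypothetical name: drain, then quit
    private const uint PM_NOREMOVE = 0x0000;

    [StructLayout(LayoutKind.Sequential)]
    private struct POINT { public int X; public int Y; }

    [StructLayout(LayoutKind.Sequential)]
    private struct MSG
    {
        public IntPtr hwnd;
        public uint message;
        public IntPtr wParam;
        public IntPtr lParam;
        public uint time;
        public POINT pt;
    }

    [DllImport("user32.dll", CharSet = CharSet.Auto)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool PeekMessage(out MSG msg, IntPtr hWnd, uint min, uint max, uint remove);

    [DllImport("user32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern int GetMessage(out MSG msg, IntPtr hWnd, uint min, uint max);

    [DllImport("user32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool TranslateMessage(ref MSG msg);

    [DllImport("user32.dll", CharSet = CharSet.Auto)]
    private static extern IntPtr DispatchMessage(ref MSG msg);

    [DllImport("user32.dll")]
    private static extern void PostQuitMessage(int exitCode);

    // Must be called on the STA thread; returns after WM_QUIT is retrieved.
    public static void Run(ConcurrentQueue<Action> work)
    {
        // Step 1: touching the queue makes Windows create it for this thread,
        // so that messages posted to the thread are not lost.
        PeekMessage(out _, IntPtr.Zero, 0, 0, PM_NOREMOVE);

        // Step 2: GetMessage blocks; it returns 0 for WM_QUIT and -1 on error.
        while (GetMessage(out MSG msg, IntPtr.Zero, 0, 0) > 0)
        {
            if (msg.message == WM_APP || msg.message == WM_APP_SHUTDOWN)
            {
                // Steps 3 and 4: run everything queued so far.
                while (work.TryDequeue(out Action item)) item();
                if (msg.message == WM_APP_SHUTDOWN) PostQuitMessage(0);
            }
            else
            {
                // Step 5: let COM deliver its callbacks.
                TranslateMessage(ref msg);
                DispatchMessage(ref msg);
            }
        }
    }
}
```

Declaring `GetMessage` as returning `int` rather than `bool` is deliberate: the function can return `-1`, and a `bool` return would treat that error as "keep looping".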

## LMXProxyServer COM Object

`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface. This abstraction allows unit tests to substitute a fake proxy without requiring the ArchestrA runtime.

The COM object lifecycle:

1. **`Register(clientName)`** -- Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, and calls `Register` to obtain a connection handle
2. **`Unregister(handle)`** -- Unwires event handlers, calls `Unregister`, and releases the COM object via `Marshal.ReleaseComObject`

## Register/AddItem/AdviseSupervisory Pattern

Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:

1. **`AddItem(handle, address)`** -- Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
2. **`AdviseSupervisory(handle, itemHandle)`** -- Subscribes the item for supervisory data change callbacks
3. The runtime begins delivering `OnDataChange` events for the item

For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value to the runtime. The `OnWriteComplete` callback confirms or rejects the write.

Cleanup reverses the pattern: `UnAdviseSupervisory` then `RemoveItem`.
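
The subscribe and cleanup ordering can be illustrated against a fake proxy. Everything in the sketch below is hypothetical: `IProxySketch` is a trimmed-down guess at the shape of a proxy interface (the real `IMxProxy` signatures may differ), and `RecordingProxy` simply records calls so the order can be inspected.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down proxy shape; the real IMxProxy may differ.
public interface IProxySketch
{
    int AddItem(int handle, string address);
    void AdviseSupervisory(int handle, int itemHandle);
    void UnAdviseSupervisory(int handle, int itemHandle);
    void RemoveItem(int handle, int itemHandle);
}

// In-memory fake that records call order and hands out item handles.
public sealed class RecordingProxy : IProxySketch
{
    private int _next = 1;
    public List<string> Calls { get; } = new List<string>();

    public int AddItem(int handle, string address) { Calls.Add("AddItem " + address); return _next++; }
    public void AdviseSupervisory(int handle, int itemHandle) => Calls.Add("AdviseSupervisory " + itemHandle);
    public void UnAdviseSupervisory(int handle, int itemHandle) => Calls.Add("UnAdviseSupervisory " + itemHandle);
    public void RemoveItem(int handle, int itemHandle) => Calls.Add("RemoveItem " + itemHandle);
}

public sealed class SubscriptionSketch
{
    private readonly IProxySketch _proxy;
    private readonly int _handle;
    private readonly Dictionary<string, int> _addressToHandle = new Dictionary<string, int>();

    public SubscriptionSketch(IProxySketch proxy, int connectionHandle)
    {
        _proxy = proxy;
        _handle = connectionHandle;
    }

    // Resolve the address to an item handle, then advise it.
    public int Subscribe(string address)
    {
        int itemHandle = _proxy.AddItem(_handle, address);
        _proxy.AdviseSupervisory(_handle, itemHandle);
        _addressToHandle[address] = itemHandle;
        return itemHandle;
    }

    // Cleanup in reverse order: unadvise first, then remove the item.
    public void Unsubscribe(string address)
    {
        if (!_addressToHandle.TryGetValue(address, out int itemHandle)) return;
        _proxy.UnAdviseSupervisory(_handle, itemHandle);
        _proxy.RemoveItem(_handle, itemHandle);
        _addressToHandle.Remove(address);
    }
}
```

In the real bridge each of these calls would be marshalled onto the STA thread; the sketch runs them inline to keep the ordering visible.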

## OnDataChange and OnWriteComplete Callbacks

### OnDataChange

Fired by the COM runtime on the STA thread when a subscribed tag value changes. The handler in `MxAccessClient.EventHandlers.cs`:

1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality accordingly
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
   - The stored per-tag subscription callback
   - Any pending one-shot read completions
   - The global `OnTagValueChanged` event (consumed by `LmxNodeManager`)

### OnWriteComplete

Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0`, the write is considered failed and the error detail is logged.
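
A minimal sketch of the pending-write bookkeeping follows. It assumes at most one in-flight write per item handle and reduces the status structure to a single `success` integer; the class and method names are hypothetical and are not taken from `MxAccessClient`.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class PendingWritesSketch
{
    private readonly ConcurrentDictionary<int, TaskCompletionSource<bool>> _pending =
        new ConcurrentDictionary<int, TaskCompletionSource<bool>>();

    // Called when a write is issued; the returned task completes when the callback arrives.
    // Assumption: one in-flight write per item handle (a second call replaces the first entry).
    public Task<bool> Track(int itemHandle)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[itemHandle] = tcs;
        return tcs.Task;
    }

    // Called from the write-complete callback; 'success' mirrors MXSTATUS_PROXY.success.
    // Returns false when no write was waiting on this item handle.
    public bool Complete(int itemHandle, int success)
    {
        if (!_pending.TryRemove(itemHandle, out var tcs)) return false;
        tcs.TrySetResult(success != 0);
        return true;
    }
}
```

A caller would `await` the task returned by `Track` and treat a `false` result as a rejected write.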

## Reconnection Logic

`MxAccessClient` implements automatic reconnection through two mechanisms:

### Monitor loop

`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:

- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold

### Reconnect sequence

`ReconnectAsync` performs a full disconnect-then-connect cycle:

1. Increment the reconnect counter
2. `DisconnectAsync` -- Tears down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detaches COM event handlers, calls `Unregister`, and clears all handle mappings
3. `ConnectAsync` -- Creates a fresh `LMXProxyServer`, registers, replays all stored subscriptions, and re-subscribes the probe tag

Stored subscriptions (`_storedSubscriptions`) persist across reconnects. When `ConnectAsync` succeeds, `ReplayStoredSubscriptionsAsync` iterates all stored entries and calls `AddItem` + `AdviseSupervisory` for each.

## Probe Tag Health Monitoring

A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange` callback.

The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
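
The staleness comparison amounts to a single time-span check. The helper below is a hypothetical sketch that takes the current time as a parameter so it can be exercised deterministically; the real monitor reads `DateTime.UtcNow` directly, and whether it treats an age exactly equal to the threshold as stale is an assumption here (the sketch does not).

```csharp
using System;

public static class ProbeStalenessSketch
{
    // True when the last probe update is older than the threshold.
    public static bool IsStale(DateTime lastProbeValueTimeUtc, DateTime nowUtc, int probeStaleThresholdSeconds)
    {
        TimeSpan age = nowUtc - lastProbeValueTimeUtc;
        return age > TimeSpan.FromSeconds(probeStaleThresholdSeconds);
    }
}
```

Passing the clock in, rather than reading it inside the method, is a common way to keep this kind of check unit-testable.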

## Per-Host Runtime Status Probes (`<Host>.ScanState`)

Separate from the connection-level probe above, the bridge advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so that the dashboard can report that a specific Platform or AppEngine is off scan, and so that the bridge can proactively invalidate every OPC UA variable hosted by the stopped object. This prevents MXAccess from serving stale Good-quality cached values to clients that read those tags while the host is down.

Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](Configuration.md#mxaccess) for the two config fields.

### How it works

`GalaxyRuntimeProbeManager` is owned by `LmxNodeManager` and operates on a simple three-state machine per host (Unknown / Running / Stopped):

1. **Discovery** -- After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** -- A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**.
3. **On-change-only delivery** -- `ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries; the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** -- When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped."
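
Rules 2 through 4 can be condensed into a small per-host sketch. This is an illustration under assumptions rather than the `GalaxyRuntimeProbeManager` implementation: the type names are hypothetical, quality is reduced to a boolean, and the Unknown timeout is measured from the moment the entry is created.

```csharp
using System;

public enum HostState { Unknown, Running, Stopped }

// Hypothetical per-host entry covering the predicate, the timeout, and transport gating.
public sealed class HostProbeSketch
{
    private readonly TimeSpan _unknownTimeout;
    private readonly DateTime _createdUtc;   // assumption: the timeout runs from creation
    private HostState _state = HostState.Unknown;

    public HostProbeSketch(DateTime createdUtc, int unknownTimeoutSeconds = 15)
    {
        _createdUtc = createdUtc;
        _unknownTimeout = TimeSpan.FromSeconds(unknownTimeoutSeconds);
    }

    // Transition predicate: only a good-quality boolean true counts as Running.
    public void OnProbeValue(bool qualityIsGood, object value)
    {
        bool isRunning = qualityIsGood && value is bool b && b;
        _state = isRunning ? HostState.Running : HostState.Stopped;
    }

    // Periodic tick: the only time-based transition is Unknown -> Stopped.
    // Running entries are never aged out, because ScanState is delivered on change only.
    public void Tick(DateTime nowUtc)
    {
        if (_state == HostState.Unknown && nowUtc - _createdUtc > _unknownTimeout)
            _state = HostState.Stopped;
    }

    // Transport gating: report Unknown whenever the client is not connected.
    public HostState GetSnapshot(bool transportConnected)
        => transportConnected ? _state : HostState.Unknown;
}
```

Note that a non-boolean value such as the integer `1` falls through the `value is bool b` pattern and is treated as Stopped, matching the "everything else" clause of the predicate.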

### Subtree quality invalidation on transition

When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` -- the set of every OPC UA variable transitively hosted by that Galaxy object -- and sets each variable's `StatusCode` to `BadOutOfService`. The reverse happens on **Stopped → Running**: `ClearHostVariablesBadQuality` resets each to `Good` and lets subsequent on-change MXAccess updates repopulate the values.

The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in **both** the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
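
One way to build such a map is to walk each object's host chain and add its variables to every Platform or Engine found along the way. The sketch below assumes a hypothetical row shape (`GalaxyObjectSketch`, with `IsHost` and string variable names) and treats a `HostedByGobjectId` of `0` as "no host"; only the `HostedByGobjectId` field name comes from the text above. Whether a host's own attributes belong in its own list is not stated, so the sketch adds variables to ancestors only.

```csharp
using System.Collections.Generic;

// Hypothetical row shape; only HostedByGobjectId is named in the text above.
public sealed class GalaxyObjectSketch
{
    public int GobjectId;
    public int HostedByGobjectId;   // 0 means "no host" in this sketch
    public bool IsHost;             // true for $WinPlatform / $AppEngine rows
    public List<string> Variables = new List<string>();
}

public static class HostedVariablesSketch
{
    // Returns host gobject id -> variables transitively hosted by that host.
    public static Dictionary<int, List<string>> Build(IEnumerable<GalaxyObjectSketch> objects)
    {
        var byId = new Dictionary<int, GalaxyObjectSketch>();
        foreach (var o in objects) byId[o.GobjectId] = o;

        var hosted = new Dictionary<int, List<string>>();
        foreach (var o in byId.Values)
        {
            // The visited set guards against a cyclic host chain in bad data.
            var visited = new HashSet<int> { o.GobjectId };
            int hostId = o.HostedByGobjectId;
            while (byId.TryGetValue(hostId, out var host) && visited.Add(hostId))
            {
                if (host.IsHost)
                {
                    if (!hosted.TryGetValue(hostId, out var list))
                        hosted[hostId] = list = new List<string>();
                    list.AddRange(o.Variables);
                }
                hostId = host.HostedByGobjectId;   // keep climbing: Engine, then Platform
            }
        }
        return hosted;
    }
}
```

Because the walk continues past the first host it finds, a variable under an Engine inside a Platform lands in both lists, which is what lets a Platform stop invalidate its Engines' variables.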

### Read-path short-circuit (`IsTagUnderStoppedHost`)

The `LmxNodeManager.Read` override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called `_mxAccessClient.ReadAsync(tagRef)` unconditionally and returned whatever VTQ the runtime reported. That created a gap: MXAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan.

The Read override now checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.

### Deferred dispatch: the STA deadlock

**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the `LmxNodeManager.Lock`, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. This is a classic circular wait: the first real deploy of this feature hung within 30 seconds from exactly this pattern.

The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop -- **outside** any locks held by the STA path -- and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural `Lock` acquisition. No circular wait, no STA dispatch involvement.
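
The shape of that queue-and-signal hand-off can be sketched as follows. The class is hypothetical: a private `_lock` stands in for `LmxNodeManager.Lock`, the `apply` delegate stands in for `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality`, and the sketch owns its own dispatch thread rather than reusing an existing one.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class DeferredDispatchSketch : IDisposable
{
    private readonly ConcurrentQueue<(int GobjectId, bool Stopped)> _transitions =
        new ConcurrentQueue<(int GobjectId, bool Stopped)>();
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);
    private readonly object _lock = new object();   // stands in for LmxNodeManager.Lock
    private readonly Action<int, bool> _apply;
    private readonly Thread _dispatch;
    private volatile bool _stopping;

    public DeferredDispatchSketch(Action<int, bool> apply)
    {
        _apply = apply;
        _dispatch = new Thread(Loop) { Name = "Dispatch-Sketch", IsBackground = true };
        _dispatch.Start();
    }

    // Called on the callback thread: it never takes _lock, so it cannot join a circular wait.
    public void OnTransition(int gobjectId, bool stopped)
    {
        _transitions.Enqueue((gobjectId, stopped));
        _signal.Set();
    }

    private void Loop()
    {
        while (!_stopping)
        {
            _signal.WaitOne(100);   // wake on the signal or after 100 ms
            Drain();
        }
        Drain();                    // pick up anything enqueued just before shutdown
    }

    private void Drain()
    {
        while (_transitions.TryDequeue(out var t))
        {
            // The lock is taken on the dispatch thread only, never on the callback thread.
            lock (_lock) { _apply(t.GobjectId, t.Stopped); }
        }
    }

    public void Dispose()
    {
        _stopping = true;
        _signal.Set();
        _dispatch.Join();
        _signal.Dispose();
    }
}
```

The key property is that `OnTransition` only enqueues and signals, so the thread delivering callbacks returns immediately regardless of who currently holds the lock.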

See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-flight debugging that led to this pattern.

### Dashboard + health surface

- The dashboard's **Galaxy Runtime** panel, between Galaxy Info and Historian, shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected).
- The Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish bridge-owned probe count from client-driven subscriptions.
- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MXAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging.

See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields.

## Why Marshal.ReleaseComObject Is Needed

The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block, which decrements the runtime callable wrapper's reference count immediately instead of waiting for the garbage collector. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.

## Key source files

- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs` -- STA thread and Win32 message pump
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.cs` -- Core client class (partial)
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Connection.cs` -- Connect, disconnect, reconnect
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Subscription.cs` -- Subscribe, unsubscribe, replay
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs` -- Read and write operations
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown` / `Running` / `Stopped` enum
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface