MXAccess Bridge
The MXAccess bridge connects the OPC UA server to the AVEVA System Platform runtime through the ArchestrA.MxAccess COM API. It handles all COM threading requirements, translates between OPC UA read/write requests and MXAccess operations, and manages connection health.
STA Thread Requirement
MXAccess is a COM-based API that requires a Single-Threaded Apartment (STA). All COM calls -- `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup -- must execute on the same STA thread. Invoking them from the wrong thread causes marshalling failures or silent data corruption.
StaComThread provides a dedicated STA thread with the apartment state set before the thread starts:
```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```
Work items are queued via RunAsync(Action) or RunAsync<T>(Func<T>), which enqueue the work to a ConcurrentQueue<Action> and post a WM_APP message to wake the pump. Each work item is wrapped in a TaskCompletionSource so callers can await the result from any thread.
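A minimal sketch of the queue-and-signal pattern, with an `AutoResetEvent` standing in for the `PostThreadMessage(WM_APP)` wake-up so the sketch runs anywhere (class and member names here are illustrative, not the real `StaComThread` API):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Illustrative stand-in for StaComThread: one dedicated worker thread drains a
// ConcurrentQueue<Action>; callers await results via TaskCompletionSource.
sealed class WorkQueueThread : IDisposable
{
    private readonly ConcurrentQueue<Action> _queue = new();
    private readonly AutoResetEvent _wake = new(false); // stand-in for WM_APP
    private readonly Thread _thread;
    private volatile bool _running = true;

    public WorkQueueThread()
    {
        _thread = new Thread(Pump) { Name = "Sketch-Worker", IsBackground = true };
        // The real thread also calls _thread.SetApartmentState(ApartmentState.STA)
        // before Start(); omitted so this sketch is portable.
        _thread.Start();
    }

    // Queue a Func<T>; the TaskCompletionSource lets any thread await the result.
    public Task<T> RunAsync<T>(Func<T> work)
    {
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        _queue.Enqueue(() =>
        {
            try { tcs.SetResult(work()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        _wake.Set(); // stand-in for PostThreadMessage(WM_APP)
        return tcs.Task;
    }

    private void Pump()
    {
        while (_running)
        {
            _wake.WaitOne();
            while (_queue.TryDequeue(out var item)) item();
        }
    }

    public void Dispose() { _running = false; _wake.Set(); _thread.Join(); }
}
```

The essential property is that every queued delegate executes on the one long-lived thread, which is what COM apartment affinity requires.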
Win32 Message Pump
COM callbacks (like OnDataChange) are delivered through the Windows message loop. StaComThread runs a standard Win32 message pump using P/Invoke:
- `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
- `GetMessage` blocks until a message arrives
- `WM_APP` messages drain the work queue
- `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
- All other messages are passed through `TranslateMessage`/`DispatchMessage` for COM callback delivery
Without this message pump, MXAccess COM callbacks would never fire and the server would receive no live data.
LMXProxyServer COM Object
MxProxyAdapter wraps the real ArchestrA.MxAccess.LMXProxyServer COM object behind the IMxProxy interface. This abstraction allows unit tests to substitute a fake proxy without requiring the ArchestrA runtime.
The COM object lifecycle:
- `Register(clientName)` -- Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, and calls `Register` to obtain a connection handle
- `Unregister(handle)` -- Unwires event handlers, calls `Unregister`, and releases the COM object via `Marshal.ReleaseComObject`
Register/AddItem/AdviseSupervisory Pattern
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
- `AddItem(handle, address)` -- Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
- `AdviseSupervisory(handle, itemHandle)` -- Subscribes the item for supervisory data change callbacks
- The runtime begins delivering `OnDataChange` events for the item
For writes, after AddItem + AdviseSupervisory, Write(handle, itemHandle, value, securityClassification) sends the value to the runtime. The OnWriteComplete callback confirms or rejects the write.
Cleanup reverses the pattern: UnAdviseSupervisory then RemoveItem.
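The subscribe/cleanup ordering can be sketched against a hypothetical minimal proxy interface (the real `IMxProxy` surface may differ); a recording fake makes the call order visible:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal slice of IMxProxy, for illustration only.
interface IMxProxySketch
{
    int AddItem(int handle, string address);
    void AdviseSupervisory(int handle, int itemHandle);
    void UnAdviseSupervisory(int handle, int itemHandle);
    void RemoveItem(int handle, int itemHandle);
}

// Recording fake, the same substitution trick unit tests can use.
sealed class RecordingProxy : IMxProxySketch
{
    public readonly List<string> Calls = new();
    private int _next = 100;
    public int AddItem(int h, string a) { Calls.Add($"AddItem:{a}"); return _next++; }
    public void AdviseSupervisory(int h, int i) => Calls.Add($"Advise:{i}");
    public void UnAdviseSupervisory(int h, int i) => Calls.Add($"UnAdvise:{i}");
    public void RemoveItem(int h, int i) => Calls.Add($"Remove:{i}");
}

static class ItemLifecycle
{
    // Subscribe: AddItem resolves the address to an item handle, then
    // AdviseSupervisory turns on OnDataChange delivery for it.
    public static int Subscribe(IMxProxySketch p, int handle, string address)
    {
        int item = p.AddItem(handle, address);
        p.AdviseSupervisory(handle, item);
        return item;
    }

    // Cleanup reverses the pattern: UnAdviseSupervisory, then RemoveItem.
    public static void Unsubscribe(IMxProxySketch p, int handle, int item)
    {
        p.UnAdviseSupervisory(handle, item);
        p.RemoveItem(handle, item);
    }
}
```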
OnDataChange and OnWriteComplete Callbacks
OnDataChange
Fired by the COM runtime on the STA thread when a subscribed tag value changes. The handler in MxAccessClient.EventHandlers.cs:
- Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
- Maps the MXAccess quality code to the internal `Quality` enum
- Checks `MXSTATUS_PROXY` for error details and adjusts quality accordingly
- Converts the timestamp to UTC
- Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
  - The stored per-tag subscription callback
  - Any pending one-shot read completions
  - The global `OnTagValueChanged` event (consumed by `LmxNodeManager`)
OnWriteComplete
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending TaskCompletionSource<bool> for the item handle. If MXSTATUS_PROXY.success == 0, the write is considered failed and the error detail is logged.
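A sketch of how pending writes keyed by item handle might be resolved from the callback (the `MXSTATUS_PROXY` structure is reduced to a bare success flag here, and the class name is illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative tracker: Write() registers a TaskCompletionSource per item
// handle; the COM OnWriteComplete callback resolves it on the STA thread.
sealed class WriteTracker
{
    private readonly ConcurrentDictionary<int, TaskCompletionSource<bool>> _pending = new();

    public Task<bool> TrackWrite(int itemHandle)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[itemHandle] = tcs;
        return tcs.Task;
    }

    // Called from the OnWriteComplete COM callback.
    public void OnWriteComplete(int itemHandle, int success)
    {
        if (_pending.TryRemove(itemHandle, out var tcs))
            tcs.TrySetResult(success != 0); // success == 0 means the write failed
    }
}
```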
Reconnection Logic
MxAccessClient implements automatic reconnection through two mechanisms:
Monitor loop
StartMonitor launches a background task that polls at MonitorIntervalSeconds. On each cycle:
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold
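The per-cycle decision reduces to a pure function; a sketch under assumed names (the real loop also guards against overlapping reconnect attempts):

```csharp
using System;

enum ConnState { Connected, Disconnected, Error }

static class MonitorCycle
{
    // One monitor tick's reconnect decision (illustrative shape).
    public static bool ShouldReconnect(
        ConnState state, bool autoReconnect,
        bool probeConfigured, TimeSpan sinceLastProbe, TimeSpan staleThreshold)
    {
        // Disconnected or Error with AutoReconnect enabled: reconnect.
        if ((state == ConnState.Disconnected || state == ConnState.Error) && autoReconnect)
            return true;
        // Connected but the probe tag has gone stale: force a reconnect.
        return state == ConnState.Connected && probeConfigured && sinceLastProbe > staleThreshold;
    }
}
```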
Reconnect sequence
ReconnectAsync performs a full disconnect-then-connect cycle:
- Increment the reconnect counter
- `DisconnectAsync` -- Tears down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detaches COM event handlers, calls `Unregister`, and clears all handle mappings
- `ConnectAsync` -- Creates a fresh `LMXProxyServer`, registers, replays all stored subscriptions, and re-subscribes the probe tag
Stored subscriptions (_storedSubscriptions) persist across reconnects. When ConnectAsync succeeds, ReplayStoredSubscriptionsAsync iterates all stored entries and calls AddItem + AdviseSupervisory for each.
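A sketch of the replay step, with `addItem`/`advise` delegates standing in for the real STA-thread calls (the real store also keeps per-tag callbacks; shapes here are assumptions):

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for _storedSubscriptions: entries survive a reconnect,
// and Replay re-issues AddItem + AdviseSupervisory for each stored address.
sealed class SubscriptionStore
{
    private readonly Dictionary<string, Action<object>> _stored = new();

    public void Store(string address, Action<object> callback) => _stored[address] = callback;

    // Called after ConnectAsync succeeds; rebuilds handle->address maps with
    // fresh item handles, since old handles died with the old COM object.
    public IReadOnlyDictionary<string, int> Replay(Func<string, int> addItem, Action<int> advise)
    {
        var handles = new Dictionary<string, int>();
        foreach (var address in _stored.Keys)
        {
            int item = addItem(address);
            advise(item);
            handles[address] = item;
        }
        return handles;
    }
}
```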
Probe Tag Health Monitoring
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records _lastProbeValueTime on every OnDataChange callback.
The monitor loop compares DateTime.UtcNow - _lastProbeValueTime against ProbeStaleThresholdSeconds. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
Per-Host Runtime Status Probes (<Host>.ScanState)
Separate from the connection-level probe above, the bridge advises <HostName>.ScanState on every deployed $WinPlatform and $AppEngine in the Galaxy. These probes track per-host runtime state so the dashboard can report "this specific Platform / AppEngine is off scan" and the bridge can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MxAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via MxAccess.RuntimeStatusProbesEnabled; see Configuration for the two config fields.
How it works
GalaxyRuntimeProbeManager is owned by LmxNodeManager and operates on a simple three-state machine per host (Unknown / Running / Stopped):
- Discovery — After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` ($WinPlatform) or `CategoryId == 3` ($AppEngine) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
- Transition predicate — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means Stopped.
- On-change-only delivery — `ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is Unknown → Stopped when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
- Transport gating — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped."
- Subscribe failure rollback — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Without this rollback, a failed subscribe would leave the entry in `Unknown` forever, and `Tick()` would later transition it to `Stopped` after the unknown-resolution timeout, fanning out a false-negative host-down signal that invalidates the subtree of a host that was never actually advised. Stability review 2026-04-13 Finding 1.
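The transition predicate and the Unknown-resolution timeout can be sketched as a small per-host state machine (field and method names here are illustrative, not the real `GalaxyRuntimeProbeManager` API):

```csharp
using System;

enum HostState { Unknown, Running, Stopped }

// Illustrative per-host entry: Unknown until the first callback or timeout.
sealed class HostProbe
{
    public HostState State { get; private set; } = HostState.Unknown;
    public DateTime AdvisedAtUtc { get; }
    public HostProbe(DateTime advisedAtUtc) => AdvisedAtUtc = advisedAtUtc;

    // Transition predicate: only a Good-quality boolean true means Running;
    // everything else (false, bad quality, non-bool) means Stopped.
    public void OnProbeValue(bool qualityGood, object value) =>
        State = qualityGood && value is bool b && b ? HostState.Running : HostState.Stopped;

    // Tick(): the only time-based transition is Unknown -> Stopped after the
    // initial-callback timeout. Running entries are never starved out, since
    // ScanState is delivered on change only.
    public void Tick(DateTime nowUtc, TimeSpan unknownTimeout)
    {
        if (State == HostState.Unknown && nowUtc - AdvisedAtUtc > unknownTimeout)
            State = HostState.Stopped;
    }
}
```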
Subtree quality invalidation on transition
When a host transitions Running → Stopped, the probe manager invokes a callback that walks _hostedVariables[gobjectId] — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's StatusCode to BadOutOfService. The reverse happens on Stopped → Running: ClearHostVariablesBadQuality resets each to Good and lets subsequent on-change MxAccess updates repopulate the values.
The hosted-variables map is built once per BuildAddressSpace by walking each object's HostedByGobjectId chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
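A sketch of the map-building walk, with the hierarchy reduced to `(HostId, IsPlatformOrEngine)` tuples (assumed shapes, not the real row types; `HostId == 0` stands in for "top of the chain"):

```csharp
using System.Collections.Generic;

static class HostedVariableMap
{
    // Walk each object's host chain upward; a variable is added to the list of
    // every Platform/Engine ancestor it passes, so stopping a Platform
    // transitively covers its Engines' variables too.
    public static Dictionary<int, List<string>> Build(
        IReadOnlyDictionary<int, (int HostId, bool IsPlatformOrEngine)> objects,
        IReadOnlyDictionary<int, List<string>> variablesByObject)
    {
        var hosted = new Dictionary<int, List<string>>();
        foreach (var (gobjectId, vars) in variablesByObject)
        {
            int cur = gobjectId;
            while (objects.TryGetValue(cur, out var o) && o.HostId != 0)
            {
                cur = o.HostId;
                if (objects[cur].IsPlatformOrEngine)
                {
                    if (!hosted.TryGetValue(cur, out var list)) hosted[cur] = list = new List<string>();
                    list.AddRange(vars);
                }
            }
        }
        return hosted;
    }
}
```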
Read-path short-circuit (IsTagUnderStoppedHost)
LmxNodeManager.Read override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called _mxAccessClient.ReadAsync(tagRef) unconditionally and returned whatever VTQ the runtime reported. That created a gap: MxAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan.
The Read override now checks IsTagUnderStoppedHost(tagRef) (a reverse-index lookup _hostIdsByTagRef[tagRef] → GalaxyRuntimeProbeManager.IsHostStopped(hostId)) before the MxAccess round-trip. When the owning host is Stopped, the handler returns a synthesized DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService } directly without touching MxAccess. This guarantees clients see a uniform BadOutOfService on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
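The guard itself reduces to a reverse-index lookup; a sketch under assumed container shapes (the real check lives in `LmxNodeManager` and consults `GalaxyRuntimeProbeManager.IsHostStopped`):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ReadGuard
{
    // True when any host in the tag's ancestor chain is Stopped; the Read
    // override then returns the cached value with BadOutOfService instead of
    // making the MxAccess round-trip.
    public static bool IsTagUnderStoppedHost(
        string tagRef,
        IReadOnlyDictionary<string, int[]> hostIdsByTagRef,
        Func<int, bool> isHostStopped)
    {
        return hostIdsByTagRef.TryGetValue(tagRef, out var hosts) && hosts.Any(isHostStopped);
    }
}
```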
Deferred dispatch: the STA deadlock
Critical: probe transition callbacks must not run synchronously on the STA thread that delivered the OnDataChange. MarkHostVariablesBadQuality takes the LmxNodeManager.Lock, which may be held by a worker thread currently inside Read waiting on an _mxAccessClient.ReadAsync() round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto ConcurrentQueue<(int GobjectId, bool Stopped)> and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms WaitOne loop — outside any locks held by the STA path — and then calls MarkHostVariablesBadQuality / ClearHostVariablesBadQuality under its own natural Lock acquisition. No circular wait, no STA dispatch involvement.
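A sketch of the hand-off (the real dispatch thread reuses an existing signal and 100ms loop; this isolates just the queue discipline, with illustrative names):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative deferred dispatch: the STA callback only enqueues and signals;
// the dispatch thread drains and applies transitions outside any STA-held lock.
sealed class TransitionDispatcher
{
    private readonly ConcurrentQueue<(int GobjectId, bool Stopped)> _queue = new();
    private readonly AutoResetEvent _signal = new(false);

    // Called from the probe callback on the STA thread; never takes Lock here.
    public void Enqueue(int gobjectId, bool stopped)
    {
        _queue.Enqueue((gobjectId, stopped));
        _signal.Set();
    }

    // Called from the dispatch thread's WaitOne loop; applies each transition
    // (Mark/ClearHostVariablesBadQuality in the real code) under its own
    // natural lock acquisition. Returns how many transitions were applied.
    public int Drain(Action<int, bool> apply)
    {
        int n = 0;
        while (_queue.TryDequeue(out var t)) { apply(t.GobjectId, t.Stopped); n++; }
        return n;
    }
}
```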
See the runtimestatus.md plan file and the service_info.md entry for the in-flight debugging that led to this pattern.
Dashboard + health surface
- A Dashboard Galaxy Runtime panel, placed between the Galaxy Info and Historian panels, shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MxAccess transport disconnected).
- The Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish the bridge-owned probe count from client-driven subscriptions.
- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MxAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging.
See Status Dashboard for the field table and Configuration for the two new config fields.
Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into MxAccess (Read, Write, address-space rebuild probe sync) is wrapped in a bounded SyncOverAsync.WaitSync(...) helper with timeout MxAccess.RequestTimeoutSeconds (default 30s). This is a backstop: MxAccessClient.Read/Write already enforce inner ReadTimeoutSeconds / WriteTimeoutSeconds bounds on the async path. The outer wrapper exists so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is not cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives StatusCodes.BadTimeout on the affected operation.
ConfigurationValidator enforces RequestTimeoutSeconds >= 1 and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
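The backstop's shape can be sketched as follows (the real `SyncOverAsync.WaitSync` signature is an assumption; the key property is that a timeout abandons the task rather than cancelling it):

```csharp
using System;
using System.Threading.Tasks;

static class SyncOverAsyncSketch
{
    // Bounded sync-over-async wait. On timeout the task is NOT cancelled: it
    // runs to completion on the thread pool and is abandoned, while the caller
    // gets the fallback (mapped to StatusCodes.BadTimeout in the real handler).
    public static T WaitSync<T>(Task<T> task, TimeSpan timeout, Func<T> onTimeout)
    {
        // Task.Wait(TimeSpan) returns false when the timeout elapses first;
        // a faulted task surfaces its exception here, as in a plain .Result.
        if (task.Wait(timeout)) return task.Result;
        return onTimeout();
    }
}
```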
Why Marshal.ReleaseComObject Is Needed
The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, a delayed release can leave a stale COM connection open, preventing clean re-registration. MxProxyAdapter.Unregister calls Marshal.ReleaseComObject(_lmxProxy) in a finally block to decrement the runtime callable wrapper's reference count immediately, reaching zero once no other references remain. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
Key source files
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs` -- STA thread and Win32 message pump
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.cs` -- Core client class (partial)
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Connection.cs` -- Connect, disconnect, reconnect
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Subscription.cs` -- Subscribe, unsubscribe, replay
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs` -- Read and write operations
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown`/`Running`/`Stopped` enum
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface