lmxopcua/docs/drivers/Galaxy.md
Joseph Doherty 71339307fa Doc refresh (task #203) — driver docs split + drivers index + IHistoryProvider-aware HistoricalDataAccess
Restructure the driver-facing docs to match the OtOpcUa v2 multi-driver
reality (Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, OPC UA Client
— 8 drivers total; Galaxy ships as three projects) and the capability-interface
architecture where every driver opts into IDriver + whichever of IReadable /
IWritable / ITagDiscovery / ISubscribable / IHostConnectivityProbe /
IPerCallHostResolver / IAlarmSource / IHistoryProvider / IRediscoverable it
supports. Doc scope follows the code: one-driver-specific docs scoped to that
driver, cross-driver concerns live once at the top level, per-driver specs
cross-link to docs/v2/driver-specs.md rather than duplicate.

What changed per file:

- docs/MxAccessBridge.md -> docs/drivers/Galaxy.md (git mv + rewrite): retitled
  "Galaxy Driver", reframed as one of seven drivers. Added Project Split table
  (Shared .NET Standard 2.0 / Host .NET 4.8 x86 / Proxy .NET 10) and Why
  Out-of-Process section citing both the MXAccess bitness constraint and Tier C
  stability isolation per docs/v2/plan.md section 4. Added IPC Transport
  section covering pipe naming, MessagePack framing, DACL that denies Admins,
  shared-secret handshake, heartbeat, and CallAsync<TReq,TResp> dispatch.
  Moved file paths from src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/* to
  src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/MxAccess/* and added the
  Shared + Proxy key-file tables. Added CapabilityInvoker + OTOPCUA0001
  analyzer callout. Cross-linked to drivers/README.md, Galaxy-Repository.md,
  HistoricalDataAccess.md.

- docs/GalaxyRepository.md -> docs/drivers/Galaxy-Repository.md (git mv +
  rewrite): retitled "Galaxy Repository — Tag Discovery for the Galaxy
  Driver", opened with a comparison table showing how every driver's
  ITagDiscovery source is different (AB CIP @tags walker, TwinCAT
  SymbolLoaderFactory, FOCAS CNC queries, OPC UA Client Session.Browse, etc).
  Repositioned GalaxyRepositoryService as the Galaxy driver's
  ITagDiscovery.DiscoverAsync implementation. Updated paths to
  Driver.Galaxy.Host/Backend/GalaxyRepository/*. Added IRediscoverable section
  covering the on-change-redeploy IPC path.

- docs/drivers/README.md (new): index with ground-truth driver table —
  project path, stability tier, wire library, capability-interface list, and
  one notable quirk per driver. Verified against the driver csproj files and
  class declarations on focas-pr3-remaining-capabilities (the most recent
  branch containing every driver). Galaxy gets its own dedicated docs; the
  other seven drivers cross-link to docs/v2/driver-specs.md. Lists the full
  Core.Abstractions capability surface, DriverTypeRegistry, CapabilityInvoker,
  and OTOPCUA0001 analyzer.

- docs/HistoricalDataAccess.md (rewrite): reframed around IHistoryProvider as
  a per-driver optional capability interface. Replaced v1 HistorianPluginLoader
  / AvevaHistorianPluginEntry plugin architecture with the v2 story —
  Historian.Aveva was merged into Driver.Galaxy.Host/Backend/Historian/ and
  IPC-forwarded through GalaxyProxyDriver. Documented all four IHistoryProvider
  methods (ReadRawAsync / ReadProcessedAsync / ReadAtTimeAsync /
  ReadEventsAsync), CapabilityInvoker wrapping with DriverCapability.HistoryRead,
  and the per-driver coverage matrix (Galaxy + OPC UA Client implement; the
  six protocol drivers don't and return BadHistoryOperationUnsupported). Kept
  the cluster-failover + health-counter + quality-mapping detail for the
  Galaxy Historian implementation. Flagged one gap: Proxy forwards all four
  history message kinds but the Host-side HistoryAggregateType -> AnalogSummary
  column mapping may surface GalaxyIpcException{Code="not-implemented"} on a
  given branch until the Phase 2 Galaxy out-of-process gate lands.

Driver list built against ground truth (src on focas-pr3-remaining-capabilities):
  Driver.Galaxy.{Shared,Host,Proxy}, Driver.Modbus, Driver.S7, Driver.AbCip,
  Driver.AbLegacy, Driver.TwinCAT, Driver.FOCAS, Driver.OpcUaClient.
Capability interface lists verified against each *Driver.cs class declaration.
Aveva Historian ported to Driver.Galaxy.Host/Backend/Historian/; no separate
Historian.Aveva assembly on v2 branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:33:53 -04:00

Galaxy Driver

The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the ArchestrA.MxAccess COM API plus the Galaxy Repository SQL database. It is one of eight drivers in the OtOpcUa platform (see drivers/README.md for the full list). Most of the other drivers run in-process in the main Server (.NET 10 x64); Galaxy instead runs as its own Windows service and talks to the Server over a local named pipe.

For the decision record on why Galaxy is out-of-process and how the refactor was staged, see docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver. For the full driver spec (addressing, data-type map, config shape), see docs/v2/driver-specs.md §1.

Project Split

Galaxy ships as three projects:

| Project | Target | Role |
| --- | --- | --- |
| src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ | .NET Standard 2.0 | IPC contracts (MessagePack records + MessageKind enum) referenced by both sides |
| src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ | .NET Framework 4.8 x86 | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager |
| src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ | .NET 10 (matches Server) | GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe; loaded in-process by the Server, and every call forwards over the pipe to the Host |

The Shared assembly is the only contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process.

Why Out-of-Process

Two reasons drive the split, per docs/v2/plan.md:

  1. Bitness constraint. MXAccess is 32-bit COM only — ArchestrA.MxAccess.dll in Program Files (x86)\ArchestrA\Framework\bin has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit.
  2. Tier-C stability isolation. Galaxy is classified Tier C in docs/v2/driver-stability.md — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, an AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host and applies a crash-loop circuit breaker.

The same Tier-C isolation story applies to FOCAS (decision record in docs/v2/plan.md §7), which is the second out-of-process driver.

IPC Transport

GalaxyProxyDriver → GalaxyIpcClient → named pipe → Galaxy.Host pipe server.

  • Pipe name: otopcua-galaxy-{DriverInstanceId} (localhost-only, no TCP surface)
  • Wire format: MessagePack-CSharp, length-prefixed frames
  • ACL: pipe is created with a DACL that grants only the Server's service identity; the Admins group is explicitly denied so a live-smoke test running from an elevated shell fails fast rather than silently bypassing the handshake
  • Handshake: Proxy presents a shared secret at OpenSessionRequest; Host rejects anything else with MessageKind.OpenSessionResponse{Success=false}
  • Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host

Every capability call on GalaxyProxyDriver (Read, Write, Subscribe, HistoryRead*, etc.) serializes a *Request, awaits the matching *Response via a CallAsync<TReq, TResp> helper, and rehydrates the result into the Core.Abstractions shape the Server expects.
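
The length-prefixed framing can be sketched roughly as follows. This is a minimal illustration, not the driver's actual wire code: the 4-byte little-endian prefix, the FrameSketch name, and both method names are assumptions, and the payload is treated as an opaque byte array where the real transport would carry MessagePack-serialized contracts.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical framing helper: a 4-byte little-endian length prefix followed by
// the payload bytes. The real prefix width and byte order are not stated in this doc.
public static class FrameSketch
{
    public static void WriteFrame(Stream stream, byte[] payload)
    {
        var header = new byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(header, payload.Length);
        stream.Write(header, 0, header.Length);
        stream.Write(payload, 0, payload.Length);
    }

    public static byte[] ReadFrame(Stream stream)
    {
        var header = ReadExactly(stream, 4);
        int length = BinaryPrimitives.ReadInt32LittleEndian(header);
        if (length < 0) throw new InvalidDataException("Negative frame length.");
        return ReadExactly(stream, length);
    }

    // Stream.Read may return fewer bytes than requested, so loop until the
    // whole frame section has arrived or the stream ends.
    private static byte[] ReadExactly(Stream stream, int count)
    {
        var buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException("Stream closed mid-frame.");
            offset += read;
        }
        return buffer;
    }
}
```

A request/response helper in the spirit of CallAsync<TReq, TResp> would serialize the request, call WriteFrame, then deserialize whatever ReadFrame returns. A production reader would also cap the accepted frame length before allocating.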

STA Thread Requirement (Host-side)

MXAccess COM objects — LMXProxyServer instantiation, Register, AddItem, AdviseSupervisory, Write, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption.

StaComThread in the Host provides that thread with the apartment state set before the thread starts:

```csharp
// Apartment state must be set before the thread is started.
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```

Work items queue via RunAsync(Action) or RunAsync<T>(Func<T>) into a ConcurrentQueue<Action> and post WM_APP to wake the pump. Each work item is wrapped in a TaskCompletionSource so callers can await the result from any thread — including the IPC handler thread that receives the inbound pipe request.
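
The queue-plus-TaskCompletionSource pattern described above might look like the sketch below. It is an approximation under stated assumptions: WorkerThreadSketch is a hypothetical name, and an AutoResetEvent stands in for the WM_APP wake-up so the example runs without a Win32 message pump or an STA apartment.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for StaComThread: one dedicated thread drains a queue.
// An AutoResetEvent replaces the WM_APP post used by the real pump.
public sealed class WorkerThreadSketch : IDisposable
{
    private readonly ConcurrentQueue<Action> _queue = new ConcurrentQueue<Action>();
    private readonly AutoResetEvent _wake = new AutoResetEvent(false);
    private readonly Thread _thread;
    private volatile bool _stopping;

    public WorkerThreadSketch()
    {
        _thread = new Thread(Loop) { Name = "Sketch-Worker", IsBackground = true };
        _thread.Start();
    }

    public int ManagedThreadId => _thread.ManagedThreadId;

    // Callers on any thread get a Task that completes once the work item
    // has run on the dedicated thread.
    public Task<T> RunAsync<T>(Func<T> work)
    {
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        _queue.Enqueue(() =>
        {
            try { tcs.SetResult(work()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        _wake.Set();
        return tcs.Task;
    }

    private void Loop()
    {
        while (!_stopping)
        {
            _wake.WaitOne();
            while (_queue.TryDequeue(out var item)) item();
        }
    }

    public void Dispose()
    {
        _stopping = true;
        _wake.Set();
        _thread.Join();
        _wake.Dispose();
    }
}
```

RunContinuationsAsynchronously keeps awaiting callers from resuming inline on the dedicated thread, which matters when that thread must stay free to service COM callbacks.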

Win32 Message Pump (Host-side)

COM callbacks (OnDataChange, OnWriteComplete) are delivered through the Windows message loop. StaComThread runs a standard Win32 message pump via P/Invoke:

  1. PeekMessage primes the message queue (required before PostThreadMessage works)
  2. GetMessage blocks until a message arrives
  3. WM_APP drains the work queue
  4. WM_APP + 1 drains the queue and posts WM_QUIT to exit the loop
  5. All other messages go through TranslateMessage / DispatchMessage for COM callback delivery

Without this pump MXAccess callbacks never fire and the driver delivers no live data.

LMXProxyServer COM Object

MxProxyAdapter wraps the real ArchestrA.MxAccess.LMXProxyServer COM object behind the IMxProxy interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle:

  1. Register(clientName) — Creates a new LMXProxyServer instance, wires up OnDataChange and OnWriteComplete event handlers, calls Register to obtain a connection handle
  2. Unregister(handle) — Unwires event handlers, calls Unregister, releases the COM object via Marshal.ReleaseComObject

Register / AddItem / AdviseSupervisory Pattern

Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:

  1. AddItem(handle, address) — Resolves a Galaxy tag reference (e.g., TestMachine_001.MachineID) to an integer item handle
  2. AdviseSupervisory(handle, itemHandle) — Subscribes the item for supervisory data-change callbacks
  3. The runtime begins delivering OnDataChange events

For writes, after AddItem + AdviseSupervisory, Write(handle, itemHandle, value, securityClassification) sends the value; OnWriteComplete confirms or rejects. Cleanup reverses: UnAdviseSupervisory then RemoveItem.
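
The subscribe and cleanup ordering can be illustrated against a fake proxy. Everything named here is hypothetical: IMxProxySketch is a trimmed-down stand-in whose signatures may not match the real IMxProxy, and the rollback on a failed advise is an assumption rather than documented behavior. In the real Host these calls would be marshalled onto the STA thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down proxy surface; the real IMxProxy signatures may differ.
public interface IMxProxySketch
{
    int AddItem(int handle, string address);
    void AdviseSupervisory(int handle, int itemHandle);
    void UnAdviseSupervisory(int handle, int itemHandle);
    void RemoveItem(int handle, int itemHandle);
}

// Fake that records call order, in the spirit of the Host unit-test fakes.
public sealed class RecordingProxy : IMxProxySketch
{
    private int _next = 1;
    public List<string> Calls { get; } = new List<string>();

    public int AddItem(int handle, string address) { Calls.Add("AddItem:" + address); return _next++; }
    public void AdviseSupervisory(int handle, int itemHandle) => Calls.Add("AdviseSupervisory:" + itemHandle);
    public void UnAdviseSupervisory(int handle, int itemHandle) => Calls.Add("UnAdviseSupervisory:" + itemHandle);
    public void RemoveItem(int handle, int itemHandle) => Calls.Add("RemoveItem:" + itemHandle);
}

public static class SubscriptionSketch
{
    // AddItem resolves the tag reference to an item handle, then AdviseSupervisory subscribes it.
    public static int Subscribe(IMxProxySketch proxy, int handle, string address)
    {
        int itemHandle = proxy.AddItem(handle, address);
        try { proxy.AdviseSupervisory(handle, itemHandle); }
        catch { proxy.RemoveItem(handle, itemHandle); throw; } // assumed rollback
        return itemHandle;
    }

    // Cleanup reverses the order: UnAdviseSupervisory, then RemoveItem.
    public static void Unsubscribe(IMxProxySketch proxy, int handle, int itemHandle)
    {
        proxy.UnAdviseSupervisory(handle, itemHandle);
        proxy.RemoveItem(handle, itemHandle);
    }
}
```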

OnDataChange and OnWriteComplete Callbacks

OnDataChange

Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in MxAccessClient.EventHandlers.cs:

  1. Maps the integer phItemHandle back to a tag address via _handleToAddress
  2. Maps the MXAccess quality code to the internal Quality enum
  3. Checks MXSTATUS_PROXY for error details and adjusts quality
  4. Converts the timestamp to UTC
  5. Constructs a Vtq (Value/Timestamp/Quality) and delivers it to:
    • The stored per-tag subscription callback
    • Any pending one-shot read completions
    • The global OnTagValueChanged event (consumed by the Host's subscription dispatcher, which packages changes into DataChangeEventArgs and forwards them over the pipe to GalaxyProxyDriver.OnDataChange)

OnWriteComplete

Fired when the runtime acknowledges or rejects a write. The handler resolves the pending TaskCompletionSource<bool> for the item handle. If MXSTATUS_PROXY.success == 0 the write is considered failed and the error detail is logged.
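
A minimal sketch of the pending-write bookkeeping follows. PendingWritesSketch and its method names are hypothetical, and the sketch assumes at most one outstanding write per item handle; the success flag is meant to mirror MXSTATUS_PROXY.success != 0.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical registry mapping an item handle to the write awaiting its callback.
public sealed class PendingWritesSketch
{
    private readonly ConcurrentDictionary<int, TaskCompletionSource<bool>> _pending =
        new ConcurrentDictionary<int, TaskCompletionSource<bool>>();

    // Called just before Write(...) is issued for the item handle.
    public Task<bool> Register(int itemHandle)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[itemHandle] = tcs;
        return tcs.Task;
    }

    // Called from the OnWriteComplete handler. Returns false when no write
    // was pending for the handle (for example a late or duplicate callback).
    public bool Complete(int itemHandle, bool success)
    {
        if (!_pending.TryRemove(itemHandle, out var tcs)) return false;
        return tcs.TrySetResult(success);
    }
}
```

The writer awaits the returned task, typically alongside a timeout, so a callback that never arrives does not block the caller indefinitely.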

Reconnection Logic

MxAccessClient implements automatic reconnection through two mechanisms.

Monitor loop

StartMonitor launches a background task that polls at MonitorIntervalSeconds. On each cycle:

  • If the state is Disconnected or Error and AutoReconnect is enabled, it calls ReconnectAsync
  • If connected and a probe tag is configured, it checks the probe staleness threshold

Reconnect sequence

ReconnectAsync performs a full disconnect-then-connect cycle:

  1. Increment the reconnect counter
  2. DisconnectAsync — tear down all active subscriptions (UnAdviseSupervisory + RemoveItem for each), detach COM event handlers, call Unregister, clear all handle mappings
  3. ConnectAsync — create a fresh LMXProxyServer, register, replay all stored subscriptions, re-subscribe the probe tag

Stored subscriptions (_storedSubscriptions) persist across reconnects. ReplayStoredSubscriptionsAsync iterates the stored entries and calls AddItem + AdviseSupervisory for each.

Probe Tag Health Monitoring

A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records _lastProbeValueTime on every OnDataChange. The monitor loop compares DateTime.UtcNow - _lastProbeValueTime against ProbeStaleThresholdSeconds; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
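
The staleness comparison reduces to a single predicate. The sketch below uses hypothetical names, and whether the real check uses a strict or inclusive comparison is an assumption.

```csharp
using System;

public static class ProbeStalenessSketch
{
    // True when the probe has not updated within the configured window.
    public static bool IsStale(DateTime utcNow, DateTime lastProbeValueTimeUtc, int probeStaleThresholdSeconds)
    {
        return utcNow - lastProbeValueTimeUtc > TimeSpan.FromSeconds(probeStaleThresholdSeconds);
    }
}
```

Passing the current time in as a parameter, rather than reading DateTime.UtcNow inside the method, keeps the check deterministic under test.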

Per-Host Runtime Status Probes (<Host>.ScanState)

Separate from the connection-level probe, the driver advises <HostName>.ScanState on every deployed $WinPlatform and $AppEngine in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.

Enabled by default via MxAccess.RuntimeStatusProbesEnabled; see Configuration for the two config fields.

How it works

GalaxyRuntimeProbeManager lives in Driver.Galaxy.Host alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped):

  1. Discovery — After the Host completes BuildAddressSpace, the manager filters the hierarchy to rows where CategoryId == 1 ($WinPlatform) or CategoryId == 3 ($AppEngine) and issues AdviseSupervisory for <TagName>.ScanState on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a Sync diff.
  2. Transition predicate — A probe callback is interpreted as isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b. Everything else (explicit ScanState = false, bad quality, communication errors) means Stopped.
  3. On-change-only delivery — ScanState is delivered only when the value actually changes. A stably Running host may go hours without a callback. Tick() does NOT run a starvation check on Running entries — the only time-based transition is Unknown → Stopped when the initial callback hasn't arrived within RuntimeStatusUnknownTimeoutSeconds (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
  4. Transport gating — When IMxAccessClient.State != Connected, GetSnapshot() forces every entry to Unknown. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped".
  5. Subscribe failure rollback — If SubscribeAsync throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both _byProbe and _probeByGobjectId so the probe never appears in GetSnapshot(). Stability review 2026-04-13 Finding 1.
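
The transition predicate and the single time-based transition from the list above can be sketched as a small per-host state holder. The type and member names are hypothetical, and quality is reduced to a boolean to keep the example self-contained.

```csharp
using System;

public enum HostStateSketch { Unknown, Running, Stopped }

// Hypothetical per-host entry; the real manager tracks more (name, kind, last error).
public sealed class HostProbeSketch
{
    private readonly DateTime _createdUtc;

    public HostStateSketch State { get; private set; } = HostStateSketch.Unknown;

    public HostProbeSketch(DateTime createdUtc) { _createdUtc = createdUtc; }

    // Running only for a Good-quality boolean true; anything else means Stopped.
    public void OnProbeCallback(bool qualityIsGood, object value)
    {
        bool isRunning = qualityIsGood && value is bool b && b;
        State = isRunning ? HostStateSketch.Running : HostStateSketch.Stopped;
    }

    // The only time-based transition: Unknown -> Stopped once the initial
    // callback is overdue. Running entries are never aged out here.
    public void Tick(DateTime utcNow, TimeSpan unknownTimeout)
    {
        if (State == HostStateSketch.Unknown && utcNow - _createdUtc >= unknownTimeout)
            State = HostStateSketch.Stopped;
    }
}
```

Because Tick ignores Running entries, a host that has been healthy for hours without a callback stays Running, which is the behavior item 3 describes.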

Subtree quality invalidation on transition

When a host transitions Running → Stopped, the probe manager invokes a callback that walks _hostedVariables[gobjectId] — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's StatusCode to BadOutOfService. Stopped → Running calls ClearHostVariablesBadQuality to reset each to Good so the next on-change MXAccess update repopulates the value.

The hosted-variables map is built once per BuildAddressSpace by walking each object's HostedByGobjectId chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
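
One way to picture the ancestor walk is the sketch below. The row shape, the use of 0 for "no host", and the choice to count a host's own variables under itself are all assumptions made for illustration; only the CategoryId values 1 and 3 and the HostedByGobjectId field name come from the text above.

```csharp
using System.Collections.Generic;

// Hypothetical row shape carrying only the fields the walk needs.
public sealed class GobjectRowSketch
{
    public int GobjectId;
    public int HostedByGobjectId;   // 0 means "no host" in this sketch
    public int CategoryId;          // 1 = $WinPlatform, 3 = $AppEngine
    public List<string> VariableNodeIds = new List<string>();
}

public static class HostedVariablesSketch
{
    // Maps each Platform/Engine gobject id to every variable it transitively hosts.
    public static Dictionary<int, List<string>> Build(IEnumerable<GobjectRowSketch> rows)
    {
        var byId = new Dictionary<int, GobjectRowSketch>();
        foreach (var row in rows) byId[row.GobjectId] = row;

        var map = new Dictionary<int, List<string>>();
        foreach (var row in byId.Values)
        {
            var visited = new HashSet<int>();   // guards against a cyclic chain
            var current = row;
            while (current != null && visited.Add(current.GobjectId))
            {
                if (current.CategoryId == 1 || current.CategoryId == 3)
                {
                    if (!map.TryGetValue(current.GobjectId, out var list))
                        map[current.GobjectId] = list = new List<string>();
                    list.AddRange(row.VariableNodeIds);
                }
                // Step up to the hosting object; leaves current null at the top of the chain.
                byId.TryGetValue(current.HostedByGobjectId, out current);
            }
        }
        return map;
    }
}
```

With a Platform hosting an Engine hosting an application object, the object's variables appear under both the Engine's id and the Platform's id, so stopping either host invalidates them.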

Read-path short-circuit (IsTagUnderStoppedHost)

The Host's Read handler checks IsTagUnderStoppedHost(tagRef) (a reverse-index lookup _hostIdsByTagRef[tagRef] → GalaxyRuntimeProbeManager.IsHostStopped(hostId)) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService } directly without touching MXAccess. This guarantees clients see a uniform BadOutOfService on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.

Deferred dispatch — the STA deadlock

Critical: probe transition callbacks must not run synchronously on the STA thread that delivered the OnDataChange. MarkHostVariablesBadQuality takes the subscription dispatcher lock, which may be held by a worker thread currently inside Read waiting on an _mxAccessClient.ReadAsync() round-trip that is itself waiting for the STA thread. This is a classic circular wait: the first real deploy of this feature hung within 30 seconds from exactly this pattern.

The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto ConcurrentQueue<(int GobjectId, bool Stopped)> and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms WaitOne loop — outside any locks held by the STA path — and then calls MarkHostVariablesBadQuality / ClearHostVariablesBadQuality under its own natural lock acquisition. No circular wait, no STA involvement.
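
The deferred-dispatch idea can be sketched as follows. The class and member names are hypothetical, the lock is a stand-in for the subscription dispatcher lock, and the recorded strings stand in for the MarkHostVariablesBadQuality / ClearHostVariablesBadQuality calls.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public sealed class DeferredTransitionsSketch
{
    private readonly ConcurrentQueue<(int GobjectId, bool Stopped)> _queue =
        new ConcurrentQueue<(int GobjectId, bool Stopped)>();
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);
    private readonly object _dispatcherLock = new object();

    public List<string> Applied { get; } = new List<string>();

    // Runs on the STA thread inside the probe callback: enqueue and signal only,
    // never take the dispatcher lock here.
    public void OnProbeTransition(int gobjectId, bool stopped)
    {
        _queue.Enqueue((gobjectId, stopped));
        _signal.Set();
    }

    // One iteration of the dispatch thread's loop: wait up to 100 ms, then drain.
    public void DispatchOnce()
    {
        _signal.WaitOne(TimeSpan.FromMilliseconds(100));
        while (_queue.TryDequeue(out var transition))
        {
            lock (_dispatcherLock)
            {
                Applied.Add(transition.Stopped
                    ? "MarkBad:" + transition.GobjectId
                    : "ClearBad:" + transition.GobjectId);
            }
        }
    }
}
```

The STA thread returns immediately after enqueueing, so it stays available to complete any ReadAsync round-trip that a lock holder is waiting on.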

Dashboard and health surface

  • Admin UI Galaxy Runtime panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected)
  • HealthCheckService.CheckHealth rolls overall driver health to Degraded when any host is Stopped
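
The panel color rule implies a precedence order (gray, then red, then yellow, then green), which the sketch below makes explicit. The names are hypothetical, and treating an empty host list as green is an assumption.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum PanelHostState { Unknown, Running, Stopped }

public static class RuntimePanelSketch
{
    public static string Color(bool transportConnected, IEnumerable<PanelHostState> hosts)
    {
        if (!transportConnected) return "gray";                              // MXAccess transport disconnected
        var list = hosts.ToList();
        if (list.Any(s => s == PanelHostState.Stopped)) return "red";       // any Stopped
        if (list.Any(s => s == PanelHostState.Unknown)) return "yellow";    // any Unknown, none Stopped
        return "green";                                                     // all Running
    }
}
```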

See Status Dashboard for the field table and Configuration for the config fields.

Request Timeout Safety Backstop

Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (Read, Write, address-space rebuild probe sync) is wrapped in a bounded SyncOverAsync.WaitSync(...) helper with timeout MxAccess.RequestTimeoutSeconds (default 30s). Inner ReadTimeoutSeconds / WriteTimeoutSeconds bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.

On timeout, the underlying task is not cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives StatusCodes.BadTimeout on the affected operation.
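
A bounded wait of this kind might be sketched as below. This is not the SyncOverAsync.WaitSync implementation: the name TryWaitSync, the boolean return, and the fault-observing continuation are assumptions chosen for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class WaitSyncSketch
{
    // Blocks the caller for at most `timeout`. On timeout the task is not
    // cancelled; it keeps running and its result is simply ignored.
    public static bool TryWaitSync<T>(Task<T> task, TimeSpan timeout, out T result)
    {
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }

        // Observe a later fault so the abandoned task does not surface
        // as an unobserved exception.
        task.ContinueWith(t => { var ignored = t.Exception; },
            TaskContinuationOptions.OnlyOnFaulted);

        result = default(T);
        return false;
    }
}
```

A caller on the OPC UA stack thread would translate a false return into StatusCodes.BadTimeout for the affected operation. Note that Task.Wait rethrows a faulted task's exception as an AggregateException, so callers still need their usual error handling.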

ConfigurationValidator enforces RequestTimeoutSeconds >= 1 and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.

All capability calls at the Server dispatch layer are additionally wrapped by CapabilityInvoker (Core/Resilience/), which runs them through a Polly pipeline keyed on (DriverInstanceId, HostName, DriverCapability). The OTOPCUA0001 analyzer enforces the wrap at build time.

Why Marshal.ReleaseComObject Is Needed

The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. MxProxyAdapter.Unregister calls Marshal.ReleaseComObject(_lmxProxy) in a finally block, which decrements the runtime callable wrapper's reference count immediately instead of waiting for finalization. Provided the adapter holds the only managed reference, the underlying COM server is freed before a reconnect attempt creates a new instance.

Tag Discovery and Historical Data

Tag discovery (the Galaxy Repository SQL reader + LocalPlatform scope filter) is covered in Galaxy-Repository.md. The Galaxy driver implements ITagDiscovery for the Server's bootstrap path and IRediscoverable for the on-change-redeploy path.

Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the aahClientManaged SDK and is exposed through the Galaxy driver's IHistoryProvider implementation. See HistoricalDataAccess.md.

Key source files

Host-side (.NET 4.8 x86, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/):

  • Backend/MxAccess/StaComThread.cs — STA thread and Win32 message pump
  • Backend/MxAccess/MxAccessClient.cs — Core client (partial)
  • Backend/MxAccess/MxAccessClient.Connection.cs — Connect / disconnect / reconnect
  • Backend/MxAccess/MxAccessClient.Subscription.cs — Subscribe / unsubscribe / replay
  • Backend/MxAccess/MxAccessClient.ReadWrite.cs — Read and write operations
  • Backend/MxAccess/MxAccessClient.EventHandlers.cs — OnDataChange / OnWriteComplete handlers
  • Backend/MxAccess/MxAccessClient.Monitor.cs — Background health monitor
  • Backend/MxAccess/MxProxyAdapter.cs — COM object wrapper
  • Backend/MxAccess/GalaxyRuntimeProbeManager.cs — Per-host ScanState probes, state machine, IsHostStopped lookup
  • Backend/Historian/HistorianDataSource.cs — aahClientManaged SDK wrapper (see HistoricalDataAccess.md)
  • Ipc/GalaxyIpcServer.cs — Named-pipe server, message dispatch
  • Domain/IMxAccessClient.cs — Client interface

Shared (.NET Standard 2.0, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/):

  • Contracts/MessageKind.cs — IPC message kinds (ReadRequest, HistoryReadRequest, OpenSessionResponse, …)
  • Contracts/*.cs — MessagePack DTOs for every request/response pair

Proxy-side (.NET 10, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/):

  • GalaxyProxyDriver.cs — IDriver/ITagDiscovery/IReadable/IWritable/ISubscribable/IAlarmSource/IHistoryProvider/IRediscoverable/IHostConnectivityProbe implementation; every method forwards via GalaxyIpcClient
  • Ipc/GalaxyIpcClient.cs — Named-pipe client, CallAsync<TReq, TResp>, reconnect on broken pipe
  • GalaxyProxySupervisor.cs — Host-process monitor, crash-loop circuit-breaker, Host relaunch