Galaxy Driver
The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the ArchestrA.MxAccess COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see drivers/README.md for the full list); all other drivers run in-process in the main Server (.NET 10 x64). Galaxy is the exception — it runs as its own Windows service and talks to the Server over a local named pipe.
For the decision record on why Galaxy is out-of-process and how the refactor was staged, see docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver. For the full driver spec (addressing, data-type map, config shape), see docs/v2/driver-specs.md §1.
Project Split
Galaxy ships as three projects:
| Project | Target | Role |
|---|---|---|
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 x86 | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe`, loaded in-process by the Server; every call forwards over the pipe to the Host |
The Shared assembly is the only contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process.
Why Out-of-Process
Two reasons drive the split, per docs/v2/plan.md:
- Bitness constraint. MXAccess is 32-bit COM only: `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit.
- Tier-C stability isolation. Galaxy is classified Tier C in docs/v2/driver-stability.md: the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, an AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host behind a crash-loop circuit breaker.
The same Tier-C isolation story applies to FOCAS (decision record in docs/v2/plan.md §7), which is the second out-of-process driver.
IPC Transport
GalaxyProxyDriver → GalaxyIpcClient → named pipe → Galaxy.Host pipe server.
- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface)
- Wire format: MessagePack-CSharp, length-prefixed frames
- ACL: pipe is created with a DACL that grants `ReadWrite | Synchronize` only to the configured Server service-principal SID + denies `LocalSystem`. The per-connection SID check in `PipeServer.VerifyCaller` is the real authorization boundary: any caller whose impersonated token SID doesn't match the allowed SID is dropped before the first frame is read.
- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}`
- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host
Every capability call on GalaxyProxyDriver (Read, Write, Subscribe, HistoryRead*, etc.) serializes a *Request, awaits the matching *Response via a CallAsync<TReq, TResp> helper, and rehydrates the result into the Core.Abstractions shape the Server expects.
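To make "length-prefixed frames" concrete, here is a minimal sketch of one way such framing can work over any `Stream` (a pipe stream included). The 4-byte big-endian prefix, the `MaxFrameBytes` cap, and the `FrameCodecSketch` name are assumptions for illustration; this document does not state the actual header layout used by the Shared contracts, and the MessagePack payload is treated as opaque bytes. It relies on `Stream.ReadAtLeast` / `Stream.ReadExactly`, which need .NET 7 or later.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical helper: not the project's real codec.
public static class FrameCodecSketch
{
    // Assumed sanity cap; the real limit (if any) is not documented here.
    public const int MaxFrameBytes = 16 * 1024 * 1024;

    public static void WriteFrame(Stream stream, ReadOnlySpan<byte> payload)
    {
        Span<byte> header = stackalloc byte[4];
        BinaryPrimitives.WriteInt32BigEndian(header, payload.Length); // assumed byte order
        stream.Write(header);
        stream.Write(payload);
    }

    // Returns false on a clean end-of-stream at a frame boundary (peer closed).
    public static bool TryReadFrame(Stream stream, out byte[] payload)
    {
        payload = Array.Empty<byte>();
        Span<byte> header = stackalloc byte[4];
        int read = stream.ReadAtLeast(header, header.Length, throwOnEndOfStream: false);
        if (read == 0) return false;
        if (read < header.Length) throw new EndOfStreamException("Truncated frame header.");

        int length = BinaryPrimitives.ReadInt32BigEndian(header);
        if (length < 0 || length > MaxFrameBytes)
            throw new InvalidDataException($"Frame length {length} is out of range.");

        payload = new byte[length];
        stream.ReadExactly(payload); // throws EndOfStreamException on a short body
        return true;
    }
}
```

Validating the length before allocating keeps a corrupt or hostile prefix from triggering a huge allocation, which matters even on a localhost-only pipe.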
STA Thread Requirement (Host-side)
MXAccess COM objects — LMXProxyServer instantiation, Register, AddItem, AdviseSupervisory, Write, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption.
StaComThread in the Host provides that thread with the apartment state set before the thread starts:
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
Work items queue via RunAsync(Action) or RunAsync<T>(Func<T>) into a ConcurrentQueue<Action> and post WM_APP to wake the pump. Each work item is wrapped in a TaskCompletionSource so callers can await the result from any thread — including the IPC handler thread that receives the inbound pipe request.
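The queue-plus-`TaskCompletionSource` hand-off can be sketched as follows. This is a simplified, portable stand-in, not the Host's `StaComThread`: it swaps the `ConcurrentQueue<Action>` + `WM_APP` wake-up for a `BlockingCollection<Action>` and does not set the apartment state, so it shows only how callers on any thread await work that runs on one dedicated thread. The class name `SingleThreadWorkerSketch` and its `ThreadId` property are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the dedicated COM thread.
public sealed class SingleThreadWorkerSketch : IDisposable
{
    private readonly BlockingCollection<Action> _queue = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public SingleThreadWorkerSketch()
    {
        _thread = new Thread(Pump) { Name = "Sketch-Worker", IsBackground = true };
        _thread.Start();
    }

    public int ThreadId => _thread.ManagedThreadId;

    // Queue a function; the returned task completes once the worker thread has run it.
    public Task<T> RunAsync<T>(Func<T> work)
    {
        // RunContinuationsAsynchronously keeps caller continuations off the worker thread.
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        _queue.Add(() =>
        {
            try { tcs.SetResult(work()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        return tcs.Task;
    }

    private void Pump()
    {
        // Blocks until an item arrives; exits after CompleteAdding drains the queue.
        foreach (var item in _queue.GetConsumingEnumerable()) item();
    }

    public void Dispose()
    {
        _queue.CompleteAdding();
        _thread.Join();
        _queue.Dispose();
    }
}
```

Routing exceptions into the `TaskCompletionSource` rather than letting them escape the pump keeps one failing work item from killing the thread that every other caller depends on.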
Win32 Message Pump (Host-side)
COM callbacks (OnDataChange, OnWriteComplete) are delivered through the Windows message loop. StaComThread runs a standard Win32 message pump via P/Invoke:
- `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
- `GetMessage` blocks until a message arrives
- `WM_APP` drains the work queue
- `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
- All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery
Without this pump MXAccess callbacks never fire and the driver delivers no live data.
LMXProxyServer COM Object
MxProxyAdapter wraps the real ArchestrA.MxAccess.LMXProxyServer COM object behind the IMxProxy interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle:
- `Register(clientName)`: creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle
- `Unregister(handle)`: unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject`
Register / AddItem / AdviseSupervisory Pattern
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
- `AddItem(handle, address)`: resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
- `AdviseSupervisory(handle, itemHandle)`: subscribes the item for supervisory data-change callbacks
- The runtime begins delivering `OnDataChange` events
For writes, after AddItem + AdviseSupervisory, Write(handle, itemHandle, value, securityClassification) sends the value; OnWriteComplete confirms or rejects. Cleanup reverses: UnAdviseSupervisory then RemoveItem.
OnDataChange and OnWriteComplete Callbacks
OnDataChange
Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in MxAccessClient.EventHandlers.cs:
- Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
- Maps the MXAccess quality code to the internal `Quality` enum
- Checks `MXSTATUS_PROXY` for error details and adjusts quality
- Converts the timestamp to UTC
- Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
  - The stored per-tag subscription callback
  - Any pending one-shot read completions
  - The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`)
OnWriteComplete
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending TaskCompletionSource<bool> for the item handle. If MXSTATUS_PROXY.success == 0 the write is considered failed and the error detail is logged.
Reconnection Logic
MxAccessClient implements automatic reconnection through two mechanisms.
Monitor loop
StartMonitor launches a background task that polls at MonitorIntervalSeconds. On each cycle:
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold
Reconnect sequence
ReconnectAsync performs a full disconnect-then-connect cycle:
- Increment the reconnect counter
- `DisconnectAsync`: tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings
- `ConnectAsync`: create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag
Stored subscriptions (_storedSubscriptions) persist across reconnects. ReplayStoredSubscriptionsAsync iterates the stored entries and calls AddItem + AdviseSupervisory for each.
Probe Tag Health Monitoring
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records _lastProbeValueTime on every OnDataChange. The monitor loop compares DateTime.UtcNow - _lastProbeValueTime against ProbeStaleThresholdSeconds; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
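The monitor-loop decision, including the staleness comparison, reduces to a small pure function. The sketch below is an illustration only: the `ClientStateSketch` enum lists just the three states named in this document (the real enum may have more), the parameter names are hypothetical, and it assumes a stale probe forces a reconnect regardless of `AutoReconnect`, which the text above does not spell out.

```csharp
using System;

// Only the states mentioned in this document; the real enum may differ.
public enum ClientStateSketch { Disconnected, Connected, Error }

public static class MonitorSketch
{
    // Returns true when one monitor cycle should trigger a reconnect.
    public static bool ShouldReconnect(
        ClientStateSketch state,
        bool autoReconnect,
        bool probeConfigured,
        DateTime lastProbeValueUtc,
        DateTime nowUtc,
        TimeSpan probeStaleThreshold)
    {
        if (state == ClientStateSketch.Disconnected || state == ClientStateSketch.Error)
            return autoReconnect;

        // Connected: the COM link may be alive while the runtime has stopped
        // delivering data, so a silent probe counts as a dead connection.
        if (probeConfigured)
            return nowUtc - lastProbeValueUtc > probeStaleThreshold;

        return false;
    }
}
```

Passing the clock in as `nowUtc` instead of reading `DateTime.UtcNow` inside the function keeps the staleness rule testable without sleeping.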
Per-Host Runtime Status Probes (<Host>.ScanState)
Separate from the connection-level probe, the driver advises <HostName>.ScanState on every deployed $WinPlatform and $AppEngine in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via MxAccess.RuntimeStatusProbesEnabled; see Configuration for the two config fields.
How it works
GalaxyRuntimeProbeManager lives in Driver.Galaxy.Host alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped):
- Discovery. After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
- Transition predicate. A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means Stopped.
- On-change-only delivery. `ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries; the only time-based transition is Unknown → Stopped when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
- Transport gating. When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped".
- Subscribe failure rollback. If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1.
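The transition predicate and the single time-based rule fit in a few lines. The sketch below is illustrative rather than the `GalaxyRuntimeProbeManager` code: `HostProbeEntrySketch`, its constructor argument, and the boolean `qualityIsGood` parameter (standing in for `vtq.Quality.IsGood()`) are hypothetical names, and transport gating and rollback are left out.

```csharp
using System;

public enum HostStateSketch { Unknown, Running, Stopped }

// Hypothetical per-host entry: one instance per advised <Host>.ScanState probe.
public sealed class HostProbeEntrySketch
{
    private readonly DateTime _advisedUtc;

    public HostProbeEntrySketch(DateTime advisedUtc) => _advisedUtc = advisedUtc;

    public HostStateSketch State { get; private set; } = HostStateSketch.Unknown;

    // Probe callback. Returns true only when the state actually changed.
    public bool OnProbeValue(bool qualityIsGood, object value)
    {
        // Good quality AND a boolean true means Running; anything else means Stopped.
        bool isRunning = qualityIsGood && value is bool b && b;
        return MoveTo(isRunning ? HostStateSketch.Running : HostStateSketch.Stopped);
    }

    // Periodic tick. Only Unknown entries can time out; Running entries are
    // never starved because ScanState is delivered on change only.
    public bool Tick(DateTime nowUtc, TimeSpan unknownTimeout)
    {
        if (State == HostStateSketch.Unknown && nowUtc - _advisedUtc >= unknownTimeout)
            return MoveTo(HostStateSketch.Stopped);
        return false;
    }

    private bool MoveTo(HostStateSketch next)
    {
        if (State == next) return false;
        State = next;
        return true;
    }
}
```

Returning whether the state changed lets the caller fire the subtree-invalidation callback exactly once per transition instead of on every probe callback.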
Subtree quality invalidation on transition
When a host transitions Running → Stopped, the probe manager invokes a callback that walks _hostedVariables[gobjectId] — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's StatusCode to BadOutOfService. Stopped → Running calls ClearHostVariablesBadQuality to reset each to Good so the next on-change MXAccess update repopulates the value.
The hosted-variables map is built once per BuildAddressSpace by walking each object's HostedByGobjectId chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
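One way to build such a map is sketched below. It follows one reading of the paragraph above: the walk continues past the first Platform/Engine ancestor so a variable lands in the list of every host above it. The `GalaxyObjectRowSketch` record, the `(OwnerGobjectId, TagRef)` tuple shape, and the use of tag-reference strings as the stored value are all assumptions; the real `_hostedVariables` map holds OPC UA variables.

```csharp
using System.Collections.Generic;

// Hypothetical row shape; only the three fields the walk needs.
public sealed record GalaxyObjectRowSketch(int GobjectId, int HostedByGobjectId, int CategoryId);

public static class HostedVariablesSketch
{
    // CategoryId 1 = $WinPlatform, 3 = $AppEngine (per the Discovery rule).
    private static bool IsHost(int categoryId) => categoryId == 1 || categoryId == 3;

    public static Dictionary<int, HashSet<string>> Build(
        IEnumerable<GalaxyObjectRowSketch> rows,
        IEnumerable<(int OwnerGobjectId, string TagRef)> variables)
    {
        var byId = new Dictionary<int, GalaxyObjectRowSketch>();
        foreach (var row in rows) byId[row.GobjectId] = row;

        var hosted = new Dictionary<int, HashSet<string>>();
        foreach (var (ownerId, tagRef) in variables)
        {
            var visited = new HashSet<int>(); // guards against a cyclic host chain
            int current = ownerId;
            while (byId.TryGetValue(current, out var row) && visited.Add(current))
            {
                if (IsHost(row.CategoryId))
                {
                    if (!hosted.TryGetValue(row.GobjectId, out var tags))
                        hosted[row.GobjectId] = tags = new HashSet<string>();
                    tags.Add(tagRef);
                }
                current = row.HostedByGobjectId; // step up to the hosting object
            }
        }
        return hosted;
    }
}
```

Precomputing the transitive lists makes a Running → Stopped transition a single dictionary lookup at the moment it matters, at the cost of rebuilding the map on every `BuildAddressSpace`.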
Read-path short-circuit (IsTagUnderStoppedHost)
The Host's Read handler checks IsTagUnderStoppedHost(tagRef) (a reverse-index lookup _hostIdsByTagRef[tagRef] → GalaxyRuntimeProbeManager.IsHostStopped(hostId)) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService } directly without touching MXAccess. This guarantees clients see a uniform BadOutOfService on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
Deferred dispatch — the STA deadlock
Critical: probe transition callbacks must not run synchronously on the STA thread that delivered the OnDataChange. MarkHostVariablesBadQuality takes the subscription dispatcher lock, which may be held by a worker thread currently inside Read waiting on an _mxAccessClient.ReadAsync() round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto ConcurrentQueue<(int GobjectId, bool Stopped)> and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms WaitOne loop — outside any locks held by the STA path — and then calls MarkHostVariablesBadQuality / ClearHostVariablesBadQuality under its own natural lock acquisition. No circular wait, no STA involvement.
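The shape of that fix can be sketched as below. The tuple element names match the queue type quoted above; everything else (`DeferredTransitionQueueSketch`, `WaitAndDrain`, the delegate parameters standing in for `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality`) is hypothetical, and the real dispatcher reuses an existing signal and loop rather than owning its own event.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class DeferredTransitionQueueSketch : IDisposable
{
    private readonly ConcurrentQueue<(int GobjectId, bool Stopped)> _pending =
        new ConcurrentQueue<(int GobjectId, bool Stopped)>();
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);

    // Called on the callback (STA) thread: lock-free, never blocks.
    public void Enqueue(int gobjectId, bool stopped)
    {
        _pending.Enqueue((gobjectId, stopped));
        _signal.Set();
    }

    // Called on the dispatch thread, outside any lock the callback path may need.
    // Returns the number of transitions applied in this pass.
    public int WaitAndDrain(Action<int> markBad, Action<int> clearBad, int timeoutMs = 100)
    {
        _signal.WaitOne(timeoutMs); // wake on signal or after the poll interval
        int applied = 0;
        while (_pending.TryDequeue(out var transition))
        {
            if (transition.Stopped) markBad(transition.GobjectId);
            else clearBad(transition.GobjectId);
            applied++;
        }
        return applied;
    }

    public void Dispose() => _signal.Dispose();
}
```

The key property is that `Enqueue` acquires no lock, so the thread delivering `OnDataChange` can never end up waiting on a worker that is itself waiting on that thread.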
Dashboard and health surface
- Admin UI Galaxy Runtime panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected)
- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped
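The panel-color rule is a strict precedence: transport state first, then Stopped, then Unknown. A minimal sketch, with hypothetical type and method names and an assumption that an empty host list on a connected transport counts as "all Running":

```csharp
using System.Collections.Generic;

public enum RuntimeHostStateSketch { Unknown, Running, Stopped }
public enum PanelColorSketch { Green, Yellow, Red, Gray }

public static class RuntimePanelSketch
{
    public static PanelColorSketch RollUp(
        bool transportConnected,
        IEnumerable<RuntimeHostStateSketch> hostStates)
    {
        // A disconnected transport wins over everything: per-host states are not trustworthy.
        if (!transportConnected) return PanelColorSketch.Gray;

        bool anyUnknown = false;
        foreach (var state in hostStates)
        {
            if (state == RuntimeHostStateSketch.Stopped) return PanelColorSketch.Red;
            if (state == RuntimeHostStateSketch.Unknown) anyUnknown = true;
        }
        return anyUnknown ? PanelColorSketch.Yellow : PanelColorSketch.Green;
    }
}
```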
See Status Dashboard for the field table and Configuration for the config fields.
Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (Read, Write, address-space rebuild probe sync) is wrapped in a bounded SyncOverAsync.WaitSync(...) helper with timeout MxAccess.RequestTimeoutSeconds (default 30s). Inner ReadTimeoutSeconds / WriteTimeoutSeconds bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is not cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives StatusCodes.BadTimeout on the affected operation.
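A bounded wait with abandon-on-timeout semantics can be sketched as follows. This is not the project's `SyncOverAsync.WaitSync` (its signature is not shown in this document); the `SyncOverAsyncSketch` name and the choice to surface the timeout as a `TimeoutException` for the caller to map to `StatusCodes.BadTimeout` are assumptions.

```csharp
using System;
using System.Threading.Tasks;

public static class SyncOverAsyncSketch
{
    public static T WaitSync<T>(Task<T> task, TimeSpan timeout)
    {
        // AsyncWaitHandle.WaitOne only reports completion; unlike Task.Wait it
        // does not throw when the task has faulted.
        bool completed = ((IAsyncResult)task).AsyncWaitHandle.WaitOne(timeout);
        if (!completed)
        {
            // The task is abandoned, not cancelled. Observe a late fault so it
            // does not surface later as an unobserved task exception.
            task.ContinueWith(
                t => { _ = t.Exception; },
                TaskContinuationOptions.OnlyOnFaulted);
            throw new TimeoutException($"Operation did not complete within {timeout}.");
        }

        // Completed: return the result, or rethrow the original exception unwrapped.
        return task.GetAwaiter().GetResult();
    }
}
```

Abandoning instead of cancelling keeps the wrapper independent of whether the inner operation honors a cancellation token, which is the point of a backstop.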
ConfigurationValidator enforces RequestTimeoutSeconds >= 1 and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
All capability calls at the Server dispatch layer are additionally wrapped by CapabilityInvoker (Core/Resilience/), which runs them through a Polly pipeline keyed on (DriverInstanceId, HostName, DriverCapability). The OTOPCUA0001 analyzer enforces the wrap at build time.
Why Marshal.ReleaseComObject Is Needed
The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. MxProxyAdapter.Unregister calls Marshal.ReleaseComObject(_lmxProxy) in a finally block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
Tag Discovery and Historical Data
Tag discovery (the Galaxy Repository SQL reader + LocalPlatform scope filter) is covered in Galaxy-Repository.md. The Galaxy driver is ITagDiscovery for the Server's bootstrap path and IRediscoverable for the on-change-redeploy path.
Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the aahClientManaged SDK and is exposed through the Galaxy driver's IHistoryProvider implementation. See HistoricalDataAccess.md.
Key source files
Host-side (.NET 4.8 x86, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/):
- `Backend/MxAccess/StaComThread.cs`: STA thread and Win32 message pump
- `Backend/MxAccess/MxAccessClient.cs`: Core client (partial)
- `Backend/MxAccess/MxAccessClient.Connection.cs`: Connect / disconnect / reconnect
- `Backend/MxAccess/MxAccessClient.Subscription.cs`: Subscribe / unsubscribe / replay
- `Backend/MxAccess/MxAccessClient.ReadWrite.cs`: Read and write operations
- `Backend/MxAccess/MxAccessClient.EventHandlers.cs`: `OnDataChange` / `OnWriteComplete` handlers
- `Backend/MxAccess/MxAccessClient.Monitor.cs`: Background health monitor
- `Backend/MxAccess/MxProxyAdapter.cs`: COM object wrapper
- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs`: Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `Backend/Historian/HistorianDataSource.cs`: `aahClientManaged` SDK wrapper (see HistoricalDataAccess.md)
- `Ipc/GalaxyIpcServer.cs`: Named-pipe server, message dispatch
- `Domain/IMxAccessClient.cs`: Client interface
Shared (.NET Standard 2.0, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/):
- `Contracts/MessageKind.cs`: IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …)
- `Contracts/*.cs`: MessagePack DTOs for every request/response pair
Proxy-side (.NET 10, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/):
- `GalaxyProxyDriver.cs`: `IDriver` / `ITagDiscovery` / `IReadable` / `IWritable` / `ISubscribable` / `IAlarmSource` / `IHistoryProvider` / `IRediscoverable` / `IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient`
- `Ipc/GalaxyIpcClient.cs`: Named-pipe client, `CallAsync<TReq, TResp>`, reconnect on broken pipe
- `GalaxyProxySupervisor.cs`: Host-process monitor, crash-loop circuit-breaker, Host relaunch