docs: post-PR-7.2 cleanup — audit + three-track scrub
Audit (three parallel agent passes) found 43 markdown files carrying stale references to the deleted Galaxy.Host/Proxy/Shared projects after the v2-mxgw merge. This commit lands the prioritized fixes. Track 1 — high-traffic in-place rewrites (3 files, ~454 lines deleted) - README.md (202 → 91 lines): drops .NET 4.8 / x86 / TopShelf install text; leads with the multi-driver .NET 10 server identity and points at scripts/install/Install-Services.ps1 and the parity rig. - docs/v2/driver-specs.md §1 Galaxy (~289 → ~66 lines): replaces the Tier-C out-of-process spec with a Tier-A in-process description matching the current GalaxyDriver code, with the four-section GalaxyDriverOptions JSON shape pulled verbatim from Config/GalaxyDriverOptions.cs. - docs/drivers/Galaxy.md (211 → 92 lines): full rewrite around the current Browse/Runtime/Health/Config sub-folders. Track 2 — historical banners (5 files) - lmx_mxgw.md, lmx_mxgw_impl.md, lmx_backend.md, docs/v2/Galaxy.ParityMatrix.md, docs/v2/implementation/phase-2-galaxy-out-of-process.md each get a "✅ Completed 2026-04-30 — historical record" banner block. lmx_mxgw.md also fixes two dead links (`docs/Galaxy.Driver.md` and `docs/v2/Galaxy.Driver.md`) → `docs/drivers/Galaxy.md`. Track 3 — v1 archive sweep (10 git mv + 1 new index + 2 in-place scrubs) - Moved 10 v1 docs under docs/v1/ preserving subpath structure: AlarmTracking, Configuration, DataTypeMapping, HistoricalDataAccess, Subscriptions (top-level); drivers/Galaxy-Repository, drivers/Galaxy-Test-Fixture; reqs/GalaxyRepositoryReqs, reqs/MxAccessClientReqs, reqs/ServiceHostReqs. - New docs/v1/README.md is the shared archive banner + per-file table. - docs/README.md repointed to the v1 paths and updated to reflect the v2 two-process deploy shape (Server + Admin + optional OtOpcUaWonderwareHistorian). - docs/v2/Galaxy.ParityRig.md got a historical banner + four inline scrubs marking the OtOpcUaGalaxyHost service / Driver.Galaxy.Host EXE / Driver.Galaxy.ParityTests project as deleted-in-PR-7.2. The repo's live-reading surface (README + CLAUDE.md + docs/v2/) now describes only the post-PR-7.2 architecture. v1 docs are preserved as a labelled archive under docs/v1/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,211 +1,92 @@
|
||||
# Galaxy Driver
|
||||
|
||||
The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); all other drivers run in-process in the main Server (.NET 10 x64). Galaxy is the exception — it runs as its own Windows service and talks to the Server over a local named pipe.
|
||||
The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies. It is a **Tier-A in-process driver** that runs in the OtOpcUa server's .NET 10 AnyCPU process and speaks gRPC to a separately installed `mxaccessgw` server (sibling repo at `c:\Users\dohertj2\Desktop\mxaccessgw\`). The gateway owns the MXAccess COM apartment, the STA + Win32 message pump, the Galaxy Repository SQL reader, and the Historian SDK — all the bits that need x86 / .NET Framework 4.8 / COM interop. The driver itself is platform-agnostic and contains no COM, no STA thread, and no x86 bitness constraint.
|
||||
|
||||
For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md).
|
||||
For the driver spec (capability surface, config shape, addressing), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md). For the gateway setup recipe, see [docs/v2/Galaxy.ParityRig.md](../v2/Galaxy.ParityRig.md). For tracing, metrics, and soak profile, see [docs/v2/Galaxy.Performance.md](../v2/Galaxy.Performance.md).
|
||||
|
||||
## Project Split
|
||||
> **Note**: the related drivers `Galaxy-Repository.md` and `Galaxy-Test-Fixture.md` describe the previous v1 / out-of-process topology and are being moved to `docs/v1/` by a parallel cleanup track. Use `Galaxy.ParityRig.md` and the `mxaccessgw` repo for current testing.
|
||||
|
||||
Galaxy ships as three projects:
|
||||
## Architecture
|
||||
|
||||
| Project | Target | Role |
|
||||
|---------|--------|------|
|
||||
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides |
|
||||
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager |
|
||||
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe` — loaded in-process by the Server; every call forwards over the pipe to the Host |
|
||||
|
||||
The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process.
|
||||
|
||||
## Why Out-of-Process
|
||||
|
||||
Two reasons drive the split, per `docs/v2/plan.md`:
|
||||
|
||||
1. **Bitness constraint.** MXAccess is 32-bit COM only — `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit.
|
||||
2. **Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md) — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host with crash-loop circuit-breaker.
|
||||
|
||||
The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver.
|
||||
|
||||
## IPC Transport
|
||||
|
||||
`GalaxyProxyDriver` → `GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server.
|
||||
|
||||
- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface)
|
||||
- Wire format: MessagePack-CSharp, length-prefixed frames
|
||||
- ACL: pipe is created with a DACL that grants `ReadWrite | Synchronize` only to the configured Server service-principal SID + denies `LocalSystem`. The per-connection SID check in `PipeServer.VerifyCaller` is the real authorization boundary — any caller whose impersonated token SID doesn't match the allowed SID is dropped before the first frame is read.
|
||||
- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}`
|
||||
- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host
|
||||
|
||||
Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.) serializes a `*Request`, awaits the matching `*Response` via a `CallAsync<TReq, TResp>` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects.
|
||||
|
||||
## STA Thread Requirement (Host-side)
|
||||
|
||||
MXAccess COM objects — `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption.
|
||||
|
||||
`StaComThread` in the Host provides that thread with the apartment state set before the thread starts:
|
||||
|
||||
```csharp
|
||||
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
|
||||
_thread.SetApartmentState(ApartmentState.STA);
|
||||
```
|
||||
+---------------------------------------+
|
||||
| OtOpcUa.Server (.NET 10 AnyCPU) |
|
||||
| GalaxyDriver (in-process) |
|
||||
| ITagDiscovery / IReadable / |
|
||||
| IWritable / ISubscribable / |
|
||||
| IRediscoverable / |
|
||||
| IHostConnectivityProbe |
|
||||
+-------------------+-------------------+
|
||||
|
|
||||
gRPC (default http://localhost:5120)
|
||||
|
|
||||
v
|
||||
+---------------------------------------+
|
||||
| mxaccessgw (sibling repo) |
|
||||
| +-------------------------------+ |
|
||||
| | MxGateway.Worker (x86 net48) | |
|
||||
| | STA + WM_APP pump | |
|
||||
| | ArchestrA.MxAccess COM | |
|
||||
| | Galaxy Repository SQL | |
|
||||
| | Wonderware Historian SDK | |
|
||||
| +-------------------------------+ |
|
||||
+---------------------------------------+
|
||||
```
|
||||
|
||||
Work items queue via `RunAsync(Action)` or `RunAsync<T>(Func<T>)` into a `ConcurrentQueue<Action>` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread — including the IPC handler thread that receives the inbound pipe request.
|
||||
History reads + alarm-condition tracking moved server-side in PR 7.2 (`IHistoryRouter`, `AlarmConditionService`). Galaxy no longer implements `IHistoryProvider` or `IAlarmSource` of its own.
|
||||
|
||||
## Win32 Message Pump (Host-side)
|
||||
## Project Layout
|
||||
|
||||
COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke:
|
||||
The driver ships as a single project: `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/` (.NET 10, AnyCPU). Sub-folders:
|
||||
|
||||
1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
|
||||
2. `GetMessage` blocks until a message arrives
|
||||
3. `WM_APP` drains the work queue
|
||||
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
|
||||
5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery
|
||||
| Folder | Role |
|
||||
|--------|------|
|
||||
| `Browse/` | Static-side discovery: `GalaxyDiscoverer` walks the gateway's hierarchy + attribute-set RPCs, `DataTypeMap` and `SecurityMap` translate Galaxy types and security classifications into OPC UA equivalents, `AlarmRefBuilder` extracts alarm-bearing attribute references for the server-layer alarm engine. `IGalaxyHierarchySource` + `GatewayGalaxyHierarchySource` + `TracedGalaxyHierarchySource` decorate the gateway browse RPC; `IGalaxyDeployWatchSource` + `GatewayGalaxyDeployWatchSource` + `DeployWatcher` drive `IRediscoverable`. |
|
||||
| `Runtime/` | Live data path: `EventPump` runs the gateway's `StreamEvents` RPC and fans out to subscribers via a bounded channel; `GalaxyMxSession` is the read-side handle; `GatewayGalaxySubscriber` + `GatewayGalaxyDataWriter` (each with a `Traced*` decorator) implement `ISubscribable` / `IWritable`; `SubscriptionRegistry` tracks subscription state for replay; `ReconnectSupervisor` owns the backoff loop and triggers `ReplaySubscriptions` on session loss; `StatusCodeMap` translates gateway StatusCodes to OPC UA; `MxValueDecoder` / `MxValueEncoder` handle scalar + array marshalling; `GalaxyTelemetry` + `GalaxySubscriptionHandle` round out the surface. |
|
||||
| `Health/` | `HostStatusAggregator` rolls per-platform probe state into the driver's `IHostConnectivityProbe` view; `PerPlatformProbeWatcher` listens on the gateway's per-host status stream; `HostConnectivityForwarder` pushes transitions out to the server's connectivity bus. |
|
||||
| `Config/` | `GalaxyDriverOptions` and the four nested option records (`GalaxyGatewayOptions`, `GalaxyMxAccessOptions`, `GalaxyRepositoryOptions`, `GalaxyReconnectOptions`). |
|
||||
|
||||
Without this pump MXAccess callbacks never fire and the driver delivers no live data.
|
||||
Project root files:
|
||||
|
||||
## LMXProxyServer COM Object
|
||||
- `GalaxyDriver.cs` — `IDriver` + capability-interface implementation; composes the Browse / Runtime / Health collaborators.
|
||||
- `GalaxyDriverFactoryExtensions.cs` — DI registration helper used by the server's driver bootstrap.
|
||||
|
||||
`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle:
|
||||
## Capability Surface
|
||||
|
||||
1. **`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle
|
||||
2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject`
|
||||
`GalaxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable, IHostConnectivityProbe, IDisposable`.
|
||||
|
||||
## Register / AddItem / AdviseSupervisory Pattern
|
||||
| Capability | Implementation entry point |
|
||||
|------------|---------------------------|
|
||||
| `ITagDiscovery` | `Browse/GalaxyDiscoverer.cs` |
|
||||
| `IRediscoverable` | `Browse/DeployWatcher.cs` |
|
||||
| `IReadable` | `Runtime/GalaxyMxSession.cs` |
|
||||
| `IWritable` | `Runtime/GatewayGalaxyDataWriter.cs` |
|
||||
| `ISubscribable` | `Runtime/GatewayGalaxySubscriber.cs` (driven by `EventPump`) |
|
||||
| `IHostConnectivityProbe` | `Health/HostStatusAggregator.cs` |
|
||||
|
||||
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
|
||||
## Configuration
|
||||
|
||||
1. **`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
|
||||
2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks
|
||||
3. The runtime begins delivering `OnDataChange` events
|
||||
`DriverConfig` JSON binds to `Config/GalaxyDriverOptions.cs`. The four sections are:
|
||||
|
||||
For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`.
|
||||
- **`Gateway`** — endpoint, API key secret ref, TLS knobs, connect/call/stream timeouts. `StreamTimeoutSeconds = 0` keeps the long-lived `StreamEvents` RPC open for the driver's lifetime.
|
||||
- **`MxAccess`** — `ClientName` (must be unique per OtOpcUa instance — redundancy pairs enforce uniqueness at install time), `PublishingIntervalMs` (forwarded as `buffered_update_interval_ms` on subscribe), `WriteUserId` for ArchestrA secured-write, `EventPumpChannelCapacity` (default 50_000 — one second of headroom at 50k tags / 1Hz; tune via the `galaxy.events.dropped` metric).
|
||||
- **`Repository`** — `DiscoverPageSize`, `WatchDeployEvents`.
|
||||
- **`Reconnect`** — `InitialBackoffMs`, `MaxBackoffMs`, `ReplayOnSessionLost` (calls the gateway's `ReplaySubscriptions` RPC after reconnect rather than re-issuing subscribe-bulk for every tag).
|
||||
|
||||
## OnDataChange and OnWriteComplete Callbacks
|
||||
Full per-field descriptions live in `Config/GalaxyDriverOptions.cs`. The full JSON skeleton is reproduced in [docs/v2/driver-specs.md §1](../v2/driver-specs.md).
|
||||
|
||||
### OnDataChange
|
||||
## Reconnect + Replay
|
||||
|
||||
Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`:
|
||||
`ReconnectSupervisor` owns an exponential-backoff loop bounded by `Reconnect.InitialBackoffMs` / `MaxBackoffMs`. On session loss it tears down the gRPC channel, redials, and — when `ReplayOnSessionLost = true` — calls the gateway's `ReplaySubscriptions` RPC with the cached subscription set from `SubscriptionRegistry` instead of re-subscribing tag-by-tag. The gateway's worker then re-issues `AdviseSupervisory` server-side under the apartment lock.
|
||||
|
||||
1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
|
||||
2. Maps the MXAccess quality code to the internal `Quality` enum
|
||||
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality
|
||||
4. Converts the timestamp to UTC
|
||||
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
|
||||
- The stored per-tag subscription callback
|
||||
- Any pending one-shot read completions
|
||||
- The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`)
|
||||
## Testing
|
||||
|
||||
### OnWriteComplete
|
||||
- **Unit tests**: `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/` — fakes the gateway gRPC surface; covers Browse, Runtime, Health, and Config in isolation.
|
||||
- **Parity rig + dev-rig walkthrough**: see [docs/v2/Galaxy.ParityRig.md](../v2/Galaxy.ParityRig.md). The rig stands up a real `mxaccessgw` against a live Galaxy and exercises the full read / write / subscribe / rediscover path.
|
||||
- **Performance + soak**: see [docs/v2/Galaxy.Performance.md](../v2/Galaxy.Performance.md).
|
||||
|
||||
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged.
|
||||
## Operational Notes
|
||||
|
||||
## Reconnection Logic
|
||||
|
||||
`MxAccessClient` implements automatic reconnection through two mechanisms.
|
||||
|
||||
### Monitor loop
|
||||
|
||||
`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:
|
||||
|
||||
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
|
||||
- If connected and a probe tag is configured, it checks the probe staleness threshold
|
||||
|
||||
### Reconnect sequence
|
||||
|
||||
`ReconnectAsync` performs a full disconnect-then-connect cycle:
|
||||
|
||||
1. Increment the reconnect counter
|
||||
2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings
|
||||
3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag
|
||||
|
||||
Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each.
|
||||
|
||||
## Probe Tag Health Monitoring
|
||||
|
||||
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
|
||||
|
||||
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
|
||||
|
||||
Separate from the connection-level probe, the driver advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
|
||||
|
||||
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields.
|
||||
|
||||
### How it works
|
||||
|
||||
`GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped):
|
||||
|
||||
1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
|
||||
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**.
|
||||
3. **On-change-only delivery** — `ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
|
||||
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped".
|
||||
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1.
|
||||
|
||||
### Subtree quality invalidation on transition
|
||||
|
||||
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value.
|
||||
|
||||
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
|
||||
|
||||
### Read-path short-circuit (`IsTagUnderStoppedHost`)
|
||||
|
||||
The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
|
||||
|
||||
### Deferred dispatch — the STA deadlock
|
||||
|
||||
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
|
||||
|
||||
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement.
|
||||
|
||||
### Dashboard and health surface
|
||||
|
||||
- Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected)
|
||||
- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped
|
||||
|
||||
See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields.
|
||||
|
||||
## Request Timeout Safety Backstop
|
||||
|
||||
Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
|
||||
|
||||
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
|
||||
|
||||
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
|
||||
|
||||
All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time.
|
||||
|
||||
## Why Marshal.ReleaseComObject Is Needed
|
||||
|
||||
The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
|
||||
|
||||
## Tag Discovery and Historical Data
|
||||
|
||||
Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path.
|
||||
|
||||
Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md).
|
||||
|
||||
## Key source files
|
||||
|
||||
Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`):
|
||||
|
||||
- `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump
|
||||
- `Backend/MxAccess/MxAccessClient.cs` — Core client (partial)
|
||||
- `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect
|
||||
- `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay
|
||||
- `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations
|
||||
- `Backend/MxAccess/MxAccessClient.EventHandlers.cs` — `OnDataChange` / `OnWriteComplete` handlers
|
||||
- `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor
|
||||
- `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper
|
||||
- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
|
||||
- `Backend/Historian/HistorianDataSource.cs` — `aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
|
||||
- `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch
|
||||
- `Domain/IMxAccessClient.cs` — Client interface
|
||||
|
||||
Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`):
|
||||
|
||||
- `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …)
|
||||
- `Contracts/*.cs` — MessagePack DTOs for every request/response pair
|
||||
|
||||
Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`):
|
||||
|
||||
- `GalaxyProxyDriver.cs` — `IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient`
|
||||
- `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync<TReq, TResp>`, reconnect on broken pipe
|
||||
- `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch
|
||||
- **MXAccess `ClientName` collisions**: two OtOpcUa instances sharing a `ClientName` cause the older Wonderware session to lose subscription state. Redundancy pairs (decision #149) enforce uniqueness via install scripts.
|
||||
- **Channel saturation**: `galaxy.events.dropped > 0` indicates `EventPump` is back-pressured. Raise `EventPumpChannelCapacity` or investigate downstream slowness in the server-side fan-out.
|
||||
- **Connectivity surface**: per-platform probe state is exposed through `IHostConnectivityProbe` and aggregated by the server's connectivity bus — there is no driver-private dashboard surface anymore. The Admin UI's Host Status panel is the consumer.
|
||||
|
||||
Reference in New Issue
Block a user