> **✅ Completed 2026-04-30 — historical record of the v2-mxgw migration design.**
>
> This document is the design doc that drove the migration from the
> legacy out-of-process Galaxy.Host topology to the in-process
> GalaxyDriver + mxaccessgw architecture. Option 1 (the in-process
> driver path) was selected and implemented across 39 PRs spanning
> phases 0–7, merged to master at commit `ae7106d`. For current
> architecture see `CLAUDE.md`, `docs/drivers/Galaxy.md`, and
> `docs/v2/Galaxy.Performance.md`.

# Galaxy → MxAccessGateway Migration Plan

Implements **Option 1** from `lmx_backend.md`: replace the bespoke `Galaxy.Host` + `Galaxy.Proxy` IPC pair with an **in-process Tier-A** `Driver.Galaxy` running in the .NET 10 OtOpcUa server, talking to a separately-deployed `MxGateway.Server` (mxaccessgw repo) over gRPC for live MXAccess work and Galaxy Repository browse.

## Outcome

After this work:

- `OtOpcUa.Server` is fully .NET 10 x64 — no x86 build artifacts in this repo.
- `Driver.Galaxy.Host` (Windows service, NSSM-wrapped, .NET 4.8 x86) is retired. `Driver.Galaxy.Proxy` and `Driver.Galaxy.Shared` are deleted. AVEVA platform is no longer required on the OtOpcUa box.
- A new in-process `Driver.Galaxy` lives next to `Driver.Modbus`, `Driver.OpcUaClient`, etc. It implements the same `IDriver` capability set the proxy implements today, but its body calls `MxGateway.Client` (`MxGatewayClient`, `MxGatewaySession`, `GalaxyRepositoryClient`).
- Wonderware Historian SDK access moves out of the Galaxy driver into a driver-agnostic historian data source (`Driver.Historian.Wonderware`, separate sidecar, .NET 4.8 x86). The OPC UA HA service plugs into it the same way it would plug into any future historian.
- Alarm condition tracking moves out of the driver into the OPC UA server's generic A&E subsystem. The driver only flags `IsAlarm=true` on attribute metadata and forwards live `.InAlarm`/`.Acked`/etc value changes; the server runs the AlarmCondition state machine.
- Per-platform `ScanState` probes degrade to plain attribute subscriptions — no special probe manager.

---

## Pre-flight: improvements to land in mxaccessgw first

These are **integration-quality changes** in the mxaccessgw repo that make the OtOpcUa side dramatically simpler / faster / more robust. They aren't strictly required to start, but ship enough of them before phase 3 that we're not designing around gaps.

### gw-1. Galaxy attribute metadata parity

**What's there:** `galaxy_repository.v1.DiscoverHierarchy` returns `GalaxyObject` with name, parent, category, and dynamic attributes.

**What's missing for OtOpcUa:** every field today's `MxAccessGalaxyBackend` copies into `GalaxyAttributeInfo` — confirm gw's `Attribute` proto carries:

- `mx_data_type` (int)
- `is_array` (bool)
- `array_dimension` (uint, optional)
- `security_classification` (int)
- `is_historized` (bool, from `HistorizedExtension` primitive)
- `is_alarm` (bool, from `AlarmExtension` primitive)

If any are missing, add them to the proto and the server-side query mapper. Without `IsAlarm` and `IsHistorized` the OPC UA server can't decide which nodes get HasHistoricalConfiguration / which become AlarmConditions.

### gw-2. Stable, documented event-stream resume semantics

**What's needed:** the OtOpcUa driver must survive a transient gw transport drop without losing subscription state or duplicating change events. gw's `StreamEventsAsync(afterWorkerSequence)` already exposes resumption. Document the per-session retention window (how long does the worker buffer events the gateway hasn't acked?) and the "events were dropped, you must re-subscribe" signal. If retention is bounded by count rather than time, expose the bound in `OpenSessionReply` so the client can size its own buffer.

### gw-3. Reconnectable sessions

Listed under "post-v1 revisit" in `gateway.md`.
Without it, every gw or OtOpcUa restart re-`Register`s, re-`AddItem`s, re-`Advise`s the entire address space — for a 50k-tag Galaxy that's a non-trivial cold-start. With reconnectable sessions, the driver presents its `SessionId` after a restart and the worker keeps its handles.

If full reconnection is too large, ship a **bulk replay** instead: a single RPC that takes the full subscription set and the worker performs the register/add/advise inside one round trip. We can drive it from a client-side cache rather than gw state. See gw-5 below.

### gw-4. Driver-shaped subscribe primitive

`MxGatewaySession` already has `SubscribeBulkAsync` (one RPC: `Register` implicit + `AddItem` + `Advise` for a list of tag addresses, returning per-tag `SubscribeResult`). That's exactly what `ISubscribable.SubscribeAsync` wants. Confirm it returns enough per-tag detail to surface a partial-failure list to OPC UA monitored items (good handle, status code, error text).

If not already, expose **`SubscribeBulk` with optional update-rate hint** forwarded to `SetBufferedUpdateInterval` so the OPC UA publishing interval becomes a single field on the subscribe call rather than a follow-up RPC.

### gw-5. Subscription replay snapshot

Provide an RPC `ReplaySubscriptionsAsync(SessionId, IEnumerable)` that re-establishes a list of subscriptions after a session reset and returns per-tag results. The client stores its tag list locally (the driver already has it from `Discover`), and the gw worker turns it into one register/add/advise sequence. This is the minimum surface we need; full "reattach to a previous session by id" (gw-3) is a richer version of the same thing.

### gw-6. Transport-health stream

The gw already exposes worker / session health on its dashboard. Add a small streaming RPC `StreamSessionHealth(SessionId) → stream SessionHealth` so the OtOpcUa driver can surface "MXAccess transport up/down" to its `IHostConnectivityProbe` without faking it via probe-tag subscriptions.
Today `MxAccessClient.ConnectionStateChanged` does this in-process; we want the same signal at the gw boundary.

### gw-7. Optional .NET 10 client polish

- Async-disposable session pattern is already there.
- Add a **typed `MxValue` ⇄ `object` adapter** for the seven Galaxy types OtOpcUa cares about (Boolean, Int32, Float, Double, String, DateTime, arrays of the same). Today every consumer writes its own `MxValue.From` helpers; this shaves boilerplate from the driver.
- Add a **`SubscribeWithCallback`** convenience wrapper that combines `OpenSession` + `SubscribeBulk` + `StreamEvents` and routes events through a delegate per tag. Keeps the OPC UA driver from re-implementing the fan-out / sequencer pattern.

### gw-8. Auth minimums

Document API-key scoping as it applies to OtOpcUa: the server identity needs `session`, `invoke`, `event`, and `metadata:read` scopes. Provide a CLI to mint a key bound to those scopes for an OtOpcUa instance.

### gw-9. Performance: bulk paths and value coalescing

- Confirm `SubscribeBulkAsync` is implemented as a single MXAccess `AddItem`+`Advise` loop on the worker, not N pipe round trips. If not, fix before we drive 50k-tag Galaxies through it.
- Expose `SetBufferedUpdateInterval` per session so OtOpcUa can request buffered updates at the OPC UA publishing interval and get one batched `OnBufferedDataChange` per tick rather than N `OnDataChange` events.

These can all ship in mxaccessgw independently and improve every consumer.

---

## OtOpcUa-side improvements to land in parallel

Some are forced by removing `Galaxy.Host`; others are quality-of-life.

### ot-1. Promote `IHistorianDataSource` to a server-level extension point

Today `IHistorianDataSource` is a Galaxy-internal abstraction in `Driver.Galaxy.Host`. Lift it to `OtOpcUa.Core.Abstractions` (or a similar home next to `IDriver`) and let the OPC UA HA service consume **any number of registered data sources** keyed by node namespace.
Drivers don't own historian access; the server mounts data sources alongside drivers.

This is the prerequisite that lets us move Wonderware Historian out of the Galaxy driver without losing the feature.

### ot-2. Generic alarm condition state machine in the server

Move the `.InAlarm`/`.Priority`/`.DescAttrName`/`.Acked` quartet handling out of `GalaxyAlarmTracker` into a server-level alarm subsystem keyed off the `IsAlarm=true` flag drivers set during discovery. The server subscribes to the four sub-attributes itself and runs the AlarmCondition state machine. Driver only:

- declares `IsAlarm=true` in `DriverAttributeInfo`,
- forwards plain attribute value changes (already done by `ISubscribable`).

This is also a precondition for future drivers (Modbus DL205 alarm bits, S7 alarm DBs) to emit alarms without each writing their own tracker.

### ot-3. Driver capabilities trim

After ot-1 and ot-2, `Driver.Galaxy` no longer needs to implement:

- `IHistoryProvider` (server's HA service handles it via Wonderware historian data source)
- `IAlarmHistorianWriter` (server's A&E historian, or kept generic — Galaxy shouldn't own the SQLite path)
- `IAlarmSource` ack route (server-level alarm subsystem writes back via the driver's `IWritable.WriteAsync`, which the gw already supports)

Keep:

- `IDriver`, `ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IRediscoverable`, `IHostConnectivityProbe`.

### ot-4. Treat `time_of_last_deploy` as `IRediscoverable`'s pump

Replace the Host-side change-detection poll with a managed `GalaxyRepositoryClient.WatchDeployEventsAsync` consumer in the driver. Each event raises `OnRediscoveryNeeded` with the new deploy time as the `scopeHint`. No polling code in this repo.

### ot-5. Connection pool at the server, not the driver

If the redundancy pair runs two OtOpcUa instances against one gw, both should share a single `GrpcChannel` per process (already gRPC default) but **different sessions** (one MXAccess client identity per OtOpcUa instance, not one shared session that fights over Wonderware client state). Encode the per-instance MXAccess client name in driver config — already partly there (`OTOPCUA_GALAXY_CLIENT_NAME`); make it explicit in the new driver's `appsettings.json` shape.

---

## Phased implementation

Each phase is a working, mergeable slice. Keep `Galaxy.Host` running alongside the new driver until phase 7 — gated by a config switch `Galaxy:Backend = legacy-host | mxgateway`.

### Phase 0 — pre-flight (mxaccessgw repo)

Ship gw-1, gw-2, gw-4, gw-9 (the parity, performance, and contract bits the plan immediately depends on). gw-3, gw-5, gw-6, gw-7 can come during or after phase 5.

**Exit:** local OtOpcUa dev box can `MxGatewayClient.Create` a client, open a session, `SubscribeBulkAsync` 100 tags, and observe `OnDataChange` events at the configured update rate.

### Phase 1 — server-level historian extension point (ot-1)

1. Extract `IHistorianDataSource` (and its DTOs `HistorianSample`, `HistorianAggregateSample`, `HistoricalEvent`) from `Driver.Galaxy.Host/Backend/Historian/` into `src/ZB.MOM.WW.OtOpcUa.Core/Abstractions/Historian/`.
2. Extend the OPC UA HA service to look up a registered `IHistorianDataSource` per namespace and call into it for `HistoryRead`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`. Drivers stop implementing `IHistoryProvider` directly; the server proxies.
3. Add a no-op default registration so drivers without history keep working.

**Exit:** all current Galaxy history reads route through an `IHistorianDataSource` registered by `Driver.Galaxy.Host` (still legacy) without behavior change. Other drivers untouched.

### Phase 2 — server-level alarm subsystem (ot-2)

1. Add an `IAlarmConditionDeclaration` API on the address-space builder so discovery can flag a node as alarm-bearing and supply the four sub-attribute references.
2. Add a hosted `AlarmConditionService` in the server that, on driver `Discover`, subscribes to the four sub-attributes via the driver's own `ISubscribable`, runs the state machine, and emits `IAlarmSource.OnAlarmEvent` itself. Acks route back through the driver's `IWritable.WriteAsync` to the `.AckMsg` attribute.
3. Add Galaxy-specific defaults (sub-attribute naming) as a small adapter so the same service can serve future drivers with different conventions.

**Exit:** Galaxy alarms still work end-to-end; the tracker code that runs inside `Galaxy.Host` is dead but kept for the legacy-host backend path.

### Phase 3 — Wonderware Historian sidecar (`Driver.Historian.Wonderware`)

1. New solution project: `Driver.Historian.Wonderware`, .NET 4.8 x86, console app + NSSM (mirrors today's Galaxy.Host packaging exactly, minus Galaxy responsibilities).
2. Hosts the existing `HistorianDataSource`, `HistorianClusterEndpointPicker`, `HistorianHealthSnapshot` code lifted from `Galaxy.Host/Backend/Historian/` and exposes them over a small named-pipe protocol (or local gRPC if .NET 4.8 cost is acceptable; named pipe is simpler).
3. Add `Driver.Historian.Wonderware.Client` — .NET 10 — implementing `IHistorianDataSource` against the sidecar.
4. Server registers it as a data source for the `Galaxy` namespace.

**Exit:** OPC UA history reads work via the sidecar with the legacy-host backend still in place. We've decoupled history from MXAccess.

### Phase 4 — new `Driver.Galaxy` against gw

This is the meat. New project: `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/`, .NET 10, in-process.

Capabilities (post ot-3): `IDriver`, `ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IRediscoverable`, `IHostConnectivityProbe`.


Shape:

```
Driver.Galaxy/
  GalaxyDriver.cs                      # IDriver root
  Browse/
    GalaxyDiscoverer.cs                # consumes GalaxyRepositoryClient.DiscoverHierarchyAsync
    DataTypeMap.cs                     # mx_data_type → DriverDataType
    SecurityMap.cs                     # security_classification → SecurityClassification
  Runtime/
    GalaxyMxSession.cs                 # owns one MxGatewaySession; Register + map per-driver client name
    SubscriptionRegistry.cs            # tag → server/item handles; persists to memory only
    EventPump.cs                       # consumes session.StreamEventsAsync, fans out to OnDataChange
    ReconnectSupervisor.cs             # gw transport drop / session-lost recovery
    DeployWatcher.cs                   # GalaxyRepositoryClient.WatchDeployEventsAsync → OnRediscoveryNeeded
  Health/
    HostConnectivityForwarder.cs       # gw-6 SessionHealth → IHostConnectivityProbe
  Config/
    GalaxyDriverOptions.cs             # endpoint, ApiKey, ClientName, TLS, retry, intervals
  GalaxyDriverFactoryExtensions.cs     # AddGalaxyDriver(IServiceCollection)
```

Key behaviors:

- **Discovery** calls `GalaxyRepositoryClient.DiscoverHierarchyAsync()` once at init and on every `WatchDeployEvents` event, then drives the address space builder. Same node naming as today (parent contained-name hierarchy + leaf attributes named `tag_name.AttributeName`).
- **Read**: a one-off `AddItem` + `Advise` + read-after-first-callback is overkill; instead, use **`Register` + per-call `AddItem`/`Read`** if gw exposes a synchronous read, otherwise a short-lived advise. *Action item:* confirm gw's read story; if absent, request a synchronous `ReadAsync` RPC on top of MXAccess `Read` (which exists in the COM API).
- **Write** maps `WriteRequest.Value` to `MxValue` via gw-7 helpers and calls `WriteAsync(serverHandle, itemHandle, value, userId=0)`. Routes `WriteSecured` (where `SecurityClassification == SecuredWrite/Verified`) to `WriteSecuredAsync` once exposed on `MxGatewaySession`.
- **Subscribe** calls `SubscribeBulkAsync` once per `ISubscribable.Subscribe` call. Stores `(tag → itemHandle, sid)` in `SubscriptionRegistry`.
  The single `EventPump` consumes one `StreamEventsAsync` per session and fans out per `sid`.
- **Unsubscribe** calls `UnsubscribeBulkAsync` and drops registry entries.
- **Reconnect** — when the gRPC channel drops or `StreamEvents` returns, `ReconnectSupervisor` reopens the session and replays subscriptions via gw-5 `ReplaySubscriptionsAsync`. The driver flags `DriverState.Degraded` during recovery; the server keeps publishing last-good values with `Uncertain` quality.
- **Host connectivity** — single synthesized host entry named after `OTOPCUA_GALAXY_CLIENT_NAME` driven by gw-6 `SessionHealth` updates (or, until gw-6 lands, by transport drops).

Wire into the server next to other Tier-A drivers in the `AddDrivers(...)` call site.

**Exit:** flipping `Galaxy:Backend` to `mxgateway` runs the OPC UA server end-to-end with no `Galaxy.Host` involvement. Live read, live write, live subscribe pass against the dev Galaxy. Historian + alarms still work via phases 1–3.

### Phase 5 — parity test matrix

Reuse the existing live-Galaxy integration tests; run each scenario twice: once with `Galaxy:Backend=legacy-host`, once with `mxgateway`. Compare:

- discovered hierarchy node count + names + datatypes,
- subscribed publish rates (allow ±10% tolerance vs. legacy),
- write success / status codes for each `SecurityClassification`,
- alarm condition transitions (Active / Acked / Inactive) — already routed through phase 2's server-level subsystem,
- history reads — phase 3 sidecar, identical results both backends,
- reconnect behavior under gw kill, worker kill, network drop, ZB drop.

Document the matrix; resolve every discrepancy or explicitly accept it.

**Exit:** parity matrix has zero unexplained deltas. Performance budget agreed: e.g. ≤ 2× per-call latency vs. named-pipe baseline at the 95th percentile, equal or better throughput in `SubscribeBulk` setup time.

### Phase 6 — perf + hardening

- Land gw-9 buffered-update intervals.
- Add OpenTelemetry traces from the driver around every gw call, correlated via `client_correlation_id`.
- Write soak test: 50k tags subscribed, 24h, count missed events, gw restarts, OtOpcUa restarts.
- Tune `MxGatewayClientOptions.MaxGrpcMessageBytes`, retry pipeline, call timeouts based on soak results.

**Exit:** production-acceptable perf numbers documented in `docs/drivers/Galaxy.md`.

### Phase 7 — retirement

1. Default `Galaxy:Backend = mxgateway` everywhere (sample configs, install scripts, e2e configs).
2. Delete `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared`, and matching tests.
3. Remove `OtOpcUaGalaxyHost` NSSM registration from `scripts/install/Install-Services.ps1`. Add a registration block for the Wonderware historian sidecar from phase 3.
4. Remove every x86 .NET 4.8 reference, build target, and CI step from this repo; remove `mxaccess_documentation.md`-driven dependencies that no longer apply.
5. Update `CLAUDE.md`, `docs/v2/dev-environment.md`, `docs/ServiceHosting.md`, `docs/Redundancy.md` to reflect the new topology.
6. Memory housekeeping: retire `project_galaxy_host_service.md` and `project_galaxy_host_installed.md`; add a short note about the gw dependency.

**Exit:** `git grep -i 'Galaxy\.Host'` returns nothing in source.

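
The parity matrix gates this retirement step. Its two numeric tolerances from Phase 5 (publish rate within ±10% of legacy, and p95 per-call latency at most 2× the named-pipe baseline) can be written down as a small check. What follows is a language-neutral sketch in Python rather than the project's C#; the function names, the nearest-rank percentile method, and the sample numbers are assumptions for illustration and are not taken from the actual test suite.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sequence (assumed method)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rate_within_tolerance(legacy_hz, candidate_hz, tolerance=0.10):
    """True when the candidate publish rate is within +/- tolerance of legacy."""
    return abs(candidate_hz - legacy_hz) <= tolerance * legacy_hz


def latency_within_budget(legacy_ms, candidate_ms, max_ratio=2.0):
    """True when candidate p95 latency is at most max_ratio times legacy p95."""
    return p95(candidate_ms) <= max_ratio * p95(legacy_ms)


if __name__ == "__main__":
    # Made-up measurements, purely illustrative.
    legacy_latency_ms = [4, 5, 5, 6, 7, 9, 12]
    gateway_latency_ms = [6, 7, 8, 9, 11, 14, 20]
    print("publish rate ok:", rate_within_tolerance(10.0, 9.4))
    print("p95 latency ok:", latency_within_budget(legacy_latency_ms, gateway_latency_ms))
```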

---

## Configuration shape (new driver)

```jsonc
"Drivers": {
  "Galaxy": {
    "Type": "Galaxy",
    "InstanceId": "galaxy-prod-1",
    "Gateway": {
      "Endpoint": "https://mxgw.aveva.local:5001",
      "ApiKeySecretRef": "galaxy:apiKey",   // resolved via existing secret store
      "UseTls": true,
      "CaCertificatePath": "C:\\publish\\mxgw\\ca.crt",
      "ConnectTimeoutSeconds": 10,
      "DefaultCallTimeoutSeconds": 5,
      "StreamTimeoutSeconds": 0             // unbounded
    },
    "MxAccess": {
      "ClientName": "OtOpcUa-A",            // unique per OtOpcUa instance
      "PublishingIntervalMs": 1000,         // hint for SetBufferedUpdateInterval
      "WriteUserId": 0
    },
    "Repository": {
      "DiscoverPageSize": 5000,
      "WatchDeployEvents": true
    },
    "Reconnect": {
      "InitialBackoffMs": 500,
      "MaxBackoffMs": 30000,
      "ReplayOnSessionLost": true
    }
  }
}
```

The OtOpcUa secret store already handles DPAPI-protected values for LDAP binds; reuse it for the gw API key. Never put the key in plaintext in the sample config.

---

## Risks and mitigations

| Risk | Mitigation |
|---|---|
| gw protocol regression breaks production | Pin gw NuGet to a contract version range; CI runs parity matrix on every gw bump; staged rollout via `Galaxy:Backend` flag. |
| Per-call latency regresses for chatty workloads | Land gw-9 (buffered updates) before phase 5; soak-test p95 latency in phase 6. |
| Reconnect storm after gw restart re-registers 50k tags | Land gw-3 or gw-5 before phase 6; client-side bulk replay throttled by `SubscribeBulkAsync` chunk size. |
| Alarm parity gap from moving tracker server-side | Phase 2 ships before phase 4; parity matrix gates phase 7. |
| Historian sidecar adds a second .NET 4.8 x86 service | Acceptable: it's a *driver-agnostic* component, and it ships only where Wonderware historian access is actually needed. |
| Two OtOpcUa instances both registering as same MXAccess client | `ClientName` is per-instance config (ot-5); install scripts lint that the redundancy pair has distinct names. |
| Cross-machine MXAccess writes traverse plaintext gRPC | Phase 0 enforces `UseTls=true` for any non-loopback `Endpoint`; CI lints the sample configs. |
| gw API key leaked in logs | gw and `MxGatewayClient` already redact `authorization` metadata; phase 6 audit. |
| Memory leak in `EventPump` under high event rate | Bounded channel between `StreamEventsAsync` and per-sub fan-out, drop-newest with a metric counter; soak test catches. |

---

## Cross-cutting deliverables

- **Docs:** `docs/drivers/Galaxy.md` (new), updates to `docs/v2/dev-environment.md`, `docs/ServiceHosting.md`, `docs/Redundancy.md`, `CLAUDE.md`.
- **Install scripts:** `scripts/install/Install-Services.ps1` removes `OtOpcUaGalaxyHost`, adds `OtOpcUaWonderwareHistorian`, no Galaxy service registration on the OtOpcUa node.
- **e2e:** `scripts/e2e/e2e-config.sample.json` — drop `OTOPCUA_GALAXY_*` pipe vars, add `Drivers:Galaxy:Gateway:Endpoint` etc.
- **Memory:** retire stale Galaxy.Host entries; add gw dependency entry, redundancy + client-name guidance.

---

## Order-of-work summary

```
Phase 0 (gw repo):   gw-1, gw-2, gw-4, gw-9
Phase 1 (this):      ot-1 — historian extension point
Phase 2 (this):      ot-2 — alarm subsystem
Phase 3 (this):      Driver.Historian.Wonderware sidecar
Phase 4 (this):      Driver.Galaxy (new) behind backend flag — depends on Phase 0, 1, 2
Phase 5 (this+gw):   parity matrix — drives gw-3 / gw-5 / gw-6 / gw-7 if gaps surface
Phase 6 (this):      perf + hardening
Phase 7 (this):      retire Galaxy.Host / Proxy / Shared
```

Phases 1–3 are independent of each other and can run in parallel. Phase 4 needs all three plus Phase 0. Phase 5 requires Phase 4. Phases 6 and 7 are sequential after Phase 5.
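
The ordering rules in the closing paragraph form a small dependency graph. As an illustration only, the sketch below encodes them in Python (chosen for brevity; the project itself is .NET) using the standard-library `graphlib` module, and groups the phases into waves that could run in parallel. The `DEPENDS_ON` table and the `parallel_waves` helper are hypothetical names. The table follows the prose (Phase 4 after Phases 0 to 3), which is slightly stricter than the "depends on Phase 0, 1, 2" note in the summary block.

```python
from graphlib import TopologicalSorter

# Phase number -> set of phases it depends on (assumed reading of the prose).
DEPENDS_ON = {
    0: set(),
    1: set(),
    2: set(),
    3: set(),
    4: {0, 1, 2, 3},
    5: {4},
    6: {5},
    7: {6},
}


def parallel_waves(depends_on):
    """Group phases into waves; a wave can start once all earlier waves finish."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on a circular dependency
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything currently unblocked
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave complete
    return waves


if __name__ == "__main__":
    for index, wave in enumerate(parallel_waves(DEPENDS_ON), start=1):
        print(f"wave {index}: phases {wave}")
```

Under these assumptions the helper yields five waves: Phases 0 to 3 together, then Phases 4, 5, 6, and 7 one at a time.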