# Galaxy / LMX Backend — Restructuring Options

## Context

Today the Galaxy driver is structured very differently from every other driver in this repo:

- **Galaxy.Proxy** (.NET 10, in-process): tiny shim that frames IPC to the host.
- **Galaxy.Host** (.NET Framework 4.8 **x86**, NSSM-wrapped Windows service): owns MXAccess COM, the STA pump, the ZB Galaxy Repository SQL queries, the Wonderware Historian SDK plugin, the per-platform `ScanState` probe manager, the alarm tracker (`.InAlarm`/`.Priority`/`.DescAttrName`/`.Acked` state machine + ack writer), recycle policy, and post-mortem MMF.

Other drivers (Modbus, S7, AB CIP, OpcUaClient, TwinCAT, FOCAS Tier-C) are **in-process Tier-A drivers** in the .NET 10 server. They do data + browse only; historian and alarming are driver-agnostic concerns at the server layer.

A sibling project, **mxaccessgw** (`C:\Users\dohertj2\Desktop\mxaccessgw`), already provides:

- A .NET 10 x64 gRPC gateway in front of per-session .NET 4.8 x86 worker processes that own MXAccess COM, the STA, and event sinks (`MxGateway.Server` + `MxGateway.Worker`).
- A full MXAccess command + event surface (`Register`, `AddItem`, `Advise`, `Write`, `WriteSecured`, `OnDataChange`, `OnWriteComplete`, etc.).
- A cached, deploy-gated, paged **Galaxy Repository browse** RPC (`galaxy_repository.v1`) reading the same ZB tables we read today, with the query bodies kept byte-identical to OtOpcUa.
- A .NET client library (`clients/dotnet/MxGateway.Client`).
- API-key auth, Blazor dashboard, structured logs, metrics, watchdog/recycle.

The proposal is to **strip Galaxy down to data + browse** — push historian and alarming out to server-level subsystems where they live for every other driver — and pick how the slimmed-down driver talks to MXAccess.

---

## What "push historian and alarming out" means

Both options below assume the same scope reduction; they only differ in how the driver reaches MXAccess.
| Concern | Today (Galaxy.Host) | After |
|---|---|---|
| Galaxy hierarchy browse | `GalaxyRepository` (SQL) inside Host | Driver (Option 1: via gw browse RPC; Option 2: own SQL or worker) |
| Live read / write / subscribe | `MxAccessClient` + STA pump in Host | gw (Option 1) or embedded worker (Option 2) |
| Wonderware Historian SDK | `HistorianDataSource` in Host (x86) | Separate Historian data source plugged into the server's HA service. Likely stays its own .NET 4.8 x86 sidecar because the SDK is x86-only; **independent of the Galaxy driver lifecycle**. |
| Alarm state machine (`.InAlarm`/`.Acked` quartet, transitions, ack writer) | `GalaxyAlarmTracker` in Host | Server-level A&E subsystem subscribes to alarm-bearing attributes the driver advertises and runs the AlarmCondition state machine generically. Driver only flags `IsAlarm=true` in node metadata. |
| `ScanState` per-platform probes | `GalaxyRuntimeProbeManager` in Host | Driver-side: ScanState is just another tag subscription; the driver re-advises one per discovered `$WinPlatform`/`$AppEngine` and reports `HostConnectivityStatus` from the value stream. No special host-side machinery. |

After the strip-down, the Galaxy driver looks like Modbus or OpcUaClient: it discovers nodes, reads/writes/subscribes, and reports per-host transport health. Everything else is the server's problem.

---

## Option 1 — Tier-A driver against the MxAccess Gateway

`Driver.Galaxy` becomes a regular **in-process .NET 10 driver** in the OtOpcUa server (no `.Host`, no `.Proxy` split, no x86). It talks to a separately deployed `MxGateway.Server` over gRPC using `MxGateway.Client`. Browse comes from `galaxy_repository.v1.DiscoverHierarchy`. Live data comes from `MxAccessGateway.OpenSession`/`AddItem`/`Advise`/`StreamEvents`.
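That live-data call order can be sketched as follows. This is a hedged illustration only: the real client is `MxGateway.Client` (C#) over gRPC, and `FakeGatewaySession`, `subscribe`, and the attribute names below are all hypothetical stand-ins invented here to show the `AddItem` → `Advise` → event-stream sequence, not the actual API.

```python
# Illustrative stand-in for the MxAccessGateway live-data surface.
# Everything here is a hypothetical in-memory stub; the real path is
# MxGateway.Client -> gRPC -> gw -> x86 worker -> MXAccess COM.

from dataclasses import dataclass


@dataclass
class DataChange:
    handle: int
    value: object


class FakeGatewaySession:
    """Stands in for one gw session (which maps to one x86 worker)."""

    def __init__(self):
        self._next_handle = 1
        self._queue = []

    def add_item(self, attribute_ref: str) -> int:
        # AddItem: register an attribute reference, get a handle back.
        handle = self._next_handle
        self._next_handle += 1
        return handle

    def advise(self, handle: int) -> None:
        # Advise: start change notifications. A real worker raises
        # OnDataChange asynchronously; we fake an initial value.
        self._queue.append(DataChange(handle, "initial"))

    def stream_events(self):
        # StreamEvents: yield queued data changes to the driver.
        while self._queue:
            yield self._queue.pop(0)


def subscribe(session: FakeGatewaySession, refs: list[str]) -> dict[int, str]:
    """Driver-side flow: AddItem then Advise for each discovered attribute."""
    handles = {}
    for ref in refs:
        h = session.add_item(ref)
        session.advise(h)
        handles[h] = ref
    return handles
```

The point of the sketch is the ordering contract — items are added and advised before the driver drains the event stream — which is the same shape regardless of which option owns the transport.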
```
OtOpcUa.Server (.NET 10 x64)
└── Driver.Galaxy (in-proc, .NET 10)
    └── gRPC ──► MxGateway.Server (.NET 10 x64)
        └── pipe ──► MxGateway.Worker (.NET 4.8 x86)
            └── MXAccess COM (STA)
```

### Pros

- **Architectural parity with other drivers.** No bespoke `Host` service, no x86 build target, no NSSM wrapper, no STA pump in this repo, no `PostMortemMmf`/`RecyclePolicy` we maintain ourselves.
- **OtOpcUa server stops needing AVEVA installed on its own host.** The gateway runs where MXAccess lives; the OPC UA server can live on a different box, in a container, or on a hardened jump host.
- **One canonical MXAccess surface across the org.** Any future tool — a diagnostic CLI, a Historian replacement, an integration harness — talks to the same gw with the same parity guarantees we rely on.
- **Multi-instance friendly.** Two OtOpcUa servers (warm/hot redundancy) share one gw and one MXAccess footprint instead of each running their own `Galaxy.Host` with duplicate Wonderware client identities.
- **Browse + cache for free.** `galaxy_repository.v1` already implements the hierarchy cache, deploy-time gating, paging, and `WatchDeployEvents` — we delete `GalaxyRepository.cs`, `GalaxyHierarchyRow.cs`, the change-detection poll loop, and the matching SQL plumbing.
- **Operability for free.** API-key auth, Blazor dashboard at `/dashboard`, metrics via `Meter`, structured logs with redaction. We currently have none of that in `Galaxy.Host`.
- **Future backend swap.** When AVEVA exposes managed NMX or another modern path, gw routes to it without OtOpcUa changes (gw's stated roadmap).
- **Tighter blast radius.** A hung COM event, a leaking COM object, a crashing worker — all owned by gw's session/worker isolation, not the OPC UA server process.
- **Simpler version story for OtOpcUa.** Driver is plain .NET 10; the bitness/runtime split lives entirely in mxaccessgw's repo.
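The paged browse the pros lean on follows the standard page-token pattern. A minimal sketch, assuming a response shape of `rows` plus `next_page_token` — those field names and the `discover` callable are placeholders for illustration, not the actual `galaxy_repository.v1` proto:

```python
# Hedged sketch of draining a paged browse RPC such as
# galaxy_repository.v1.DiscoverHierarchy. Field names are assumed.

def fetch_all_rows(discover, page_size=500):
    """Follow next_page_token until the server returns an empty token."""
    rows, token = [], ""
    while True:
        resp = discover(page_size=page_size, page_token=token)
        rows.extend(resp["rows"])
        token = resp["next_page_token"]
        if not token:
            return rows
```

Because the gateway gates pages on deploy state, the driver never sees a half-deployed hierarchy mid-drain — which is exactly the cache/gating logic we would otherwise keep re-implementing.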
### Cons

- **Extra deployment dependency.** mxaccessgw is now a service that has to be installed, monitored, and kept on a compatible protocol version. For a single-box install this is one more moving piece.
- **Two hops on every call** (driver→gw, gw→worker) instead of one (proxy→host). Today's hop is MessagePack over a named pipe; the new outer hop is gRPC over TCP. Per-call overhead is a few hundred microseconds — not a regression for OPC UA workloads, but measurable for very chatty bursts.
- **Auth/secret surface added.** OtOpcUa now holds an API key for gw and rotates it; gw's SQLite-backed key store has to be managed.
- **Failure model spans two processes we don't own** — gw + worker. Reconnect logic in our driver has to ride both: gw transport drop, gw session lease expiry, gw-detected worker crash, plus the worker's own MXAccess reconnect. All of it is exposed in the gRPC contract, but it's still surface area.
- **Cross-repo protocol coupling.** Bumping `mxaccessgw` major version (gRPC contract changes, session shape changes) ripples into OtOpcUa releases. Mitigated by versioned contracts; not free.
- **Galaxy redundancy still has to think about gw.** A redundancy fail-over of OtOpcUa is independent of the gw's session lifecycle. Need to decide whether the standby holds an open session or only opens it on takeover.
- **Sensitive writes (`WriteSecured`, `AuthenticateUser`) cross the network** if gw is remote. TLS + mTLS solves it but adds setup.

---

## Option 2 — Embed mxaccessgw worker, no gateway

`Driver.Galaxy` is still in-process .NET 10, but instead of speaking gRPC to a gateway service, it directly **launches and supervises one (or more) `MxGateway.Worker` processes** and talks to them over the same named-pipe worker protocol gw uses internally (`docs/WorkerFrameProtocol.md`, `docs/WorkerProcessLauncher.md`). Browse stays local — driver runs the SQL queries against ZB itself.
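The "launch and supervise" responsibility Option 2 takes on has a well-known shape: restart on crash with backoff, escalate after repeated failures. A minimal sketch, with the worker process faked as a callable returning an exit code — the real code would spawn `MxGateway.Worker` and speak its named-pipe frame protocol, none of which is shown here:

```python
# Illustrative supervision loop only; the worker is modeled as a
# callable returning an exit code instead of a real child process.

import time


def supervise(start_worker, max_restarts=3, base_backoff=0.0):
    """Run the worker; on a crash, back off exponentially and restart."""
    restarts = 0
    while True:
        exit_code = start_worker()     # blocks until the worker exits
        if exit_code == 0:
            return restarts            # clean shutdown: stop supervising
        restarts += 1
        if restarts > max_restarts:
            # Give up and surface the fault to the server layer.
            raise RuntimeError("worker keeps crashing")
        time.sleep(base_backoff * (2 ** (restarts - 1)))
```

Note that this loop — plus watchdog, heartbeat, and post-mortem capture around it — is precisely the code Option 2 keeps in our repo and Option 1 deletes.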
```
OtOpcUa.Server (.NET 10 x64)
└── Driver.Galaxy (in-proc, .NET 10)
    ├── ZB SQL (local, in-proc)
    └── pipe ──► MxGateway.Worker (.NET 4.8 x86, child process)
        └── MXAccess COM (STA)
```

### Pros

- **One hop, not two.** Driver → worker pipe is the same shape as today's Proxy → Host pipe. Latency is on par with the current implementation.
- **No new service to deploy.** Worker is launched as a child process the same way `Galaxy.Host` is launched today (just with mxaccessgw's worker binary). Single-machine install story stays simple.
- **Keeps the trust boundary local.** No API keys, no TLS, no exposed gRPC port on the OtOpcUa box.
- **Reuses mxaccessgw's parity-tested worker code** — STA pump, COM lifetime, event conversion, fault model — without inheriting gw's ASP.NET Core / Blazor / SQLite footprint.
- **Tighter ownership.** OtOpcUa owns the worker lifecycle; recycle, kill, restart, post-mortem all decided by the driver, not by an external service we don't control.
- **Easier to reason about during integration tests.** No second service to spin up in CI; just a child process per test fixture.

### Cons

- **OtOpcUa server box must still have AVEVA + MXAccess installed**, since the worker runs locally. The major deployment win of Option 1 (separating where MXAccess runs from where OtOpcUa runs) is lost.
- **OtOpcUa still ships an x86 .NET 4.8 binary alongside it.** Even if we vendor mxaccessgw's worker rather than write our own, installer complexity and bitness considerations remain.
- **We re-implement everything gw already gives.** Process supervision, watchdog, recycle policy, heartbeat, post-mortem — these are exactly what `Galaxy.Host` does today, and they'd live in our repo again, just calling a different worker binary.
- **No browse cache, no deploy gating, no `WatchDeployEvents`** — we keep running our own ZB queries and our own `time_of_last_deploy` poll, or we port gw's cache code into the driver. Either way it's duplicated logic.
- **No auth, no dashboard, no metrics.** Operability stays where it is today (i.e., minimal). Adding it ourselves is a separate project.
- **Multiple OtOpcUa instances multiply MXAccess sessions.** Redundancy pair → two MXAccess clients on the Galaxy from the same software, vs. Option 1 where one gw arbitrates.
- **Worker protocol coupling without the contract surface.** We depend on mxaccessgw's worker IPC frame format — a surface that mxaccessgw treats as *internal* to its own gw↔worker boundary. If they refactor it, we have to follow. The public gRPC contract (Option 1) is more stable by design.
- **Loses the "common MXAccess access point" benefit.** Other consumers (CLI, integration harnesses, future tools) can't share state with our embedded worker.

---

## Status quo (for comparison)

Keep `Galaxy.Host` as today, and rip out historian + alarming + probe manager in place. End state: the Host shrinks to `MxAccessClient` + `GalaxyRepository`, which is roughly what Option 2 ends up looking like — but with our hand-rolled COM bridge instead of mxaccessgw's worker. Not a serious option once mxaccessgw exists; we'd be maintaining a parallel implementation of the same thing.

---

## Recommendation (effort-agnostic)

**Go with Option 1 — Tier-A driver against the MxAccess Gateway.** The decisive arguments:

1. **It's the only option that aligns Galaxy with how every other driver in this repo is structured.** The stated goal — "keep lmx to data + browsing, similar to other drivers" — only fully resolves if there is no `.Host` and no x86 build artifact in this repo at all. Option 2 still has an x86 child process and supervisor code; it's `Galaxy.Host` with a different worker binary inside.
2. **It separates *where MXAccess runs* from *where OtOpcUa runs*.** That is a strategically larger win than a few hundred microseconds of per-call latency.
   The OPC UA server stops being chained to AVEVA install footprint, bitness, and Wonderware client identity — which removes a class of deployment, redundancy, and CI problems we hit today (e.g., the `DESKTOP-6JL3KKO` Hyper-V/Docker conflict, the `dohertj2`-only pipe ACL, the live-Galaxy smoke test prerequisites).
3. **It collapses scope.** A non-trivial fraction of `Galaxy.Host` (browse cache, deploy-event watch, worker supervision, COM bridge, post-mortem, recycle, ACL hardening) is reproduced *better* in mxaccessgw. Option 1 deletes our copy. Option 2 keeps it.
4. **It positions historian and alarming for the right home.** Once the Galaxy driver is "just another driver", historian becomes a server-level data source (one that can also feed Modbus/S7 history if we ever want it), and alarming becomes a server-level A&E subsystem. Option 2 nominally allows the same move, but the temptation to keep them in `Galaxy.Host` "while we're already there" is real.
5. **It future-proofs against AVEVA's roadmap.** Managed NMX, ASB, or any replacement that shows up over the next few years gets adopted in mxaccessgw without a release in this repo.

The case for Option 2 is real but narrow: it's the right call **only** if we commit to single-box deployments forever, refuse to take a gRPC dependency, and value local-trust simplicity over the consolidation/operability benefits gw provides. None of those constraints hold here.

### What flips the recommendation

- If the gw protocol proves unstable, or performance under our subscription patterns tests out worse than expected → revisit Option 2.
- If org policy forbids running an MXAccess gateway as its own service → Option 2.
- If Galaxy goes from one of several drivers to *the* primary driver and raw call-rate matters more than architectural fit → revisit.

Otherwise: Option 1.
---

## Out-of-scope follow-ups (don't decide here, but flag them)

- **Where does the Wonderware Historian SDK live?** Likely its own .NET 4.8 x86 sidecar exposing a small `IHistorianDataSource` over a pipe or gRPC, plugged into the OPC UA server's HA service alongside any future historian sources. Independent of which option above is chosen.
- **Alarm subsystem ownership.** Decide whether the server hosts a generic AlarmCondition state machine driven by driver-advertised alarm metadata, or whether each driver continues to emit pre-shaped alarm transitions. Galaxy's 4-attr quartet is a strong forcing function for the generic approach.
- **Redundancy + gw sessions.** Standby OtOpcUa holds an open gw session (warm) vs. opens on takeover (cold). Affects gw worker count and Galaxy client-identity collisions.
- **Auth between OtOpcUa and gw.** API key in DPAPI-protected secret file vs. Windows-auth gRPC. Both supported by gw; pick before rollout.
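To make the "generic approach" in the alarm-ownership bullet concrete: a minimal sketch of mapping the Galaxy quartet's `.InAlarm`/`.Acked` pair onto conventional A&E condition states (`.Priority` and `.DescAttrName` would only decorate the emitted event). The state names and transition rules below are illustrative assumptions, not the shipped `GalaxyAlarmTracker` logic:

```python
# Hedged sketch: a driver-agnostic condition-state mapping the server's
# A&E subsystem could run for any driver that flags IsAlarm=true nodes.
# State names are assumptions for illustration.

def next_state(in_alarm: bool, acked: bool) -> str:
    """Map the (.InAlarm, .Acked) attribute pair to a condition state."""
    if in_alarm and not acked:
        return "ACTIVE_UNACKED"
    if in_alarm and acked:
        return "ACTIVE_ACKED"
    if not in_alarm and not acked:
        return "CLEARED_UNACKED"   # returned to normal, awaiting ack
    return "NORMAL"
```

Because the whole machine reduces to a pure function of two subscribed attributes, it has no reason to live inside any one driver — which is the forcing-function argument the bullet makes.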