# Galaxy / LMX Backend — Restructuring Options

## Context

Today the Galaxy driver is structured very differently from every other driver in this repo:

- **Galaxy.Proxy** (.NET 10, in-process): tiny shim that frames IPC to the host.
- **Galaxy.Host** (.NET Framework 4.8 **x86**, NSSM-wrapped Windows service): owns MXAccess COM, the STA pump, the ZB Galaxy Repository SQL queries, the Wonderware Historian SDK plugin, the per-platform `ScanState` probe manager, the alarm tracker (`.InAlarm`/`.Priority`/`.DescAttrName`/`.Acked` state machine + ack writer), recycle policy, and post-mortem MMF.

Other drivers (Modbus, S7, AB CIP, OpcUaClient, TwinCAT, FOCAS Tier-C) are **in-process Tier-A drivers** in the .NET 10 server. They do data + browse only; historian and alarming are driver-agnostic concerns at the server layer.

A sibling project, **mxaccessgw** (`C:\Users\dohertj2\Desktop\mxaccessgw`), already provides:

- A .NET 10 x64 gRPC gateway in front of per-session .NET 4.8 x86 worker processes that own MXAccess COM, the STA, and event sinks (`MxGateway.Server` + `MxGateway.Worker`).
- A full MXAccess command + event surface (`Register`, `AddItem`, `Advise`, `Write`, `WriteSecured`, `OnDataChange`, `OnWriteComplete`, etc.).
- A cached, deploy-gated, paged **Galaxy Repository browse** RPC (`galaxy_repository.v1`) reading the same ZB tables we read today, with the query bodies kept byte-identical to OtOpcUa.
- A .NET client library (`clients/dotnet/MxGateway.Client`).
- API-key auth, Blazor dashboard, structured logs, metrics, watchdog/recycle.

The proposal is to **strip Galaxy down to data + browse** — push historian and alarming out to server-level subsystems where they live for every other driver — and pick how the slimmed-down driver talks to MXAccess.

---

## What "push historian and alarming out" means

Both options below assume the same scope reduction; they only differ in how the driver reaches MXAccess.
| Concern | Today (Galaxy.Host) | After |
|---|---|---|
| Galaxy hierarchy browse | `GalaxyRepository` (SQL) inside Host | Driver (Option 1: via gw browse RPC; Option 2: own SQL or worker) |
| Live read / write / subscribe | `MxAccessClient` + STA pump in Host | gw (Option 1) or embedded worker (Option 2) |
| Wonderware Historian SDK | `HistorianDataSource` in Host (x86) | Separate Historian data source plugged into the server's HA service. Likely stays its own .NET 4.8 x86 sidecar because the SDK is x86-only; **independent of the Galaxy driver lifecycle**. |
| Alarm state machine (`.InAlarm`/`.Acked` quartet, transitions, ack writer) | `GalaxyAlarmTracker` in Host | Server-level A&E subsystem subscribes to alarm-bearing attributes the driver advertises and runs the AlarmCondition state machine generically. Driver only flags `IsAlarm=true` in node metadata. |
| `ScanState` per-platform probes | `GalaxyRuntimeProbeManager` in Host | Driver-side: ScanState is just another tag subscription; the driver re-advises one per discovered `$WinPlatform`/`$AppEngine` and reports `HostConnectivityStatus` from the value stream. No special host-side machinery. |

After the strip-down, the Galaxy driver looks like Modbus or OpcUaClient: it discovers nodes, reads/writes/subscribes, and reports per-host transport health. Everything else is the server's problem.

---

## Option 1 — Tier-A driver against the MxAccess Gateway

`Driver.Galaxy` becomes a regular **in-process .NET 10 driver** in the OtOpcUa server (no `.Host`, no `.Proxy` split, no x86). It talks to a separately deployed `MxGateway.Server` over gRPC using `MxGateway.Client`. Browse comes from `galaxy_repository.v1.DiscoverHierarchy`. Live data comes from `MxAccessGateway.OpenSession`/`AddItem`/`Advise`/`StreamEvents`.
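That live-data call order can be sketched as follows. This is a hedged illustration only: the real client is `MxGateway.Client` (C#) over gRPC, and `FakeGatewaySession`, `subscribe`, and the attribute names below are all hypothetical stand-ins invented here to show the `AddItem` → `Advise` → event-stream sequence, not the actual API.

```python
# Illustrative stand-in for the MxAccessGateway live-data surface.
# Everything here is a hypothetical in-memory stub; the real path is
# MxGateway.Client -> gRPC -> gw -> x86 worker -> MXAccess COM.

from dataclasses import dataclass


@dataclass
class DataChange:
    handle: int
    value: object


class FakeGatewaySession:
    """Stands in for one gw session (which maps to one x86 worker)."""

    def __init__(self):
        self._next_handle = 1
        self._queue = []

    def add_item(self, attribute_ref: str) -> int:
        # AddItem: register an attribute reference, get a handle back.
        handle = self._next_handle
        self._next_handle += 1
        return handle

    def advise(self, handle: int) -> None:
        # Advise: start change notifications. A real worker raises
        # OnDataChange asynchronously; we fake an initial value.
        self._queue.append(DataChange(handle, "initial"))

    def stream_events(self):
        # StreamEvents: yield queued data changes to the driver.
        while self._queue:
            yield self._queue.pop(0)


def subscribe(session: FakeGatewaySession, refs: list[str]) -> dict[int, str]:
    """Driver-side flow: AddItem then Advise for each discovered attribute."""
    handles = {}
    for ref in refs:
        h = session.add_item(ref)
        session.advise(h)
        handles[h] = ref
    return handles
```

The point of the sketch is the ordering contract — items are added and advised before the driver drains the event stream — which is the same shape regardless of which option owns the transport.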
```
OtOpcUa.Server (.NET 10 x64)
└── Driver.Galaxy (in-proc, .NET 10)
    └── gRPC ──► MxGateway.Server (.NET 10 x64)
        └── pipe ──► MxGateway.Worker (.NET 4.8 x86)
            └── MXAccess COM (STA)
```

### Pros

- **Architectural parity with other drivers.** No bespoke `Host` service, no x86 build target, no NSSM wrapper, no STA pump in this repo, no `PostMortemMmf`/`RecyclePolicy` we maintain ourselves.
- **OtOpcUa server stops needing AVEVA installed on its own host.** The gateway runs where MXAccess lives; the OPC UA server can live on a different box, in a container, or on a hardened jump host.
- **One canonical MXAccess surface across the org.** Any future tool — a diagnostic CLI, a Historian replacement, an integration harness — talks to the same gw with the same parity guarantees we rely on.
- **Multi-instance friendly.** Two OtOpcUa servers (warm/hot redundancy) share one gw and one MXAccess footprint instead of each running their own `Galaxy.Host` with duplicate Wonderware client identities.
- **Browse + cache for free.** `galaxy_repository.v1` already implements the hierarchy cache, deploy-time gating, paging, and `WatchDeployEvents` — we delete `GalaxyRepository.cs`, `GalaxyHierarchyRow.cs`, the change-detection poll loop, and the matching SQL plumbing.
- **Operability for free.** API-key auth, Blazor dashboard at `/dashboard`, metrics via `Meter`, structured logs with redaction. We currently have none of that in `Galaxy.Host`.
- **Future backend swap.** When AVEVA exposes managed NMX or another modern path, gw routes to it without OtOpcUa changes (gw's stated roadmap).
- **Tighter blast radius.** A hung COM event, a leaking COM object, a crashing worker — all owned by gw's session/worker isolation, not the OPC UA server process.
- **Simpler version story for OtOpcUa.** Driver is plain .NET 10; the bitness/runtime split lives entirely in mxaccessgw's repo.
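The paged browse the pros lean on follows the standard page-token pattern. A minimal sketch, assuming a response shape of `rows` plus `next_page_token` — those field names and the `discover` callable are placeholders for illustration, not the actual `galaxy_repository.v1` proto:

```python
# Hedged sketch of draining a paged browse RPC such as
# galaxy_repository.v1.DiscoverHierarchy. Field names are assumed.

def fetch_all_rows(discover, page_size=500):
    """Follow next_page_token until the server returns an empty token."""
    rows, token = [], ""
    while True:
        resp = discover(page_size=page_size, page_token=token)
        rows.extend(resp["rows"])
        token = resp["next_page_token"]
        if not token:
            return rows
```

Because the gateway gates pages on deploy state, the driver never sees a half-deployed hierarchy mid-drain — which is exactly the cache/gating logic we would otherwise keep re-implementing.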
### Cons

- **Extra deployment dependency.** mxaccessgw is now a service that has to be installed, monitored, and kept on a compatible protocol version. For a single-box install this is one more moving piece.
- **Two hops on every call** (driver→gw, gw→worker) instead of one (proxy→host). Today's hop is MessagePack over a named pipe; the new outer hop is gRPC over TCP. Per-call overhead is a few hundred microseconds — not a regression for OPC UA workloads, but measurable for very chatty bursts.
- **Auth/secret surface added.** OtOpcUa now holds an API key for gw and rotates it; gw's SQLite-backed key store has to be managed.
- **Failure model spans two processes we don't own** — gw + worker. Reconnect logic in our driver has to ride both: gw transport drop, gw session lease expiry, gw-detected worker crash, plus the worker's own MXAccess reconnect. All of it is exposed in the gRPC contract, but it's still surface area.
- **Cross-repo protocol coupling.** Bumping `mxaccessgw` major version (gRPC contract changes, session shape changes) ripples into OtOpcUa releases. Mitigated by versioned contracts; not free.
- **Galaxy redundancy still has to think about gw.** A redundancy fail-over of OtOpcUa is independent of the gw's session lifecycle. Need to decide whether the standby holds an open session or only opens it on takeover.
- **Sensitive writes (`WriteSecured`, `AuthenticateUser`) cross the network** if gw is remote. TLS + mTLS solves it but adds setup.

---

## Option 2 — Embed mxaccessgw worker, no gateway

`Driver.Galaxy` is still in-process .NET 10, but instead of speaking gRPC to a gateway service, it directly **launches and supervises one (or more) `MxGateway.Worker` processes** and talks to them over the same named-pipe worker protocol gw uses internally (`docs/WorkerFrameProtocol.md`, `docs/WorkerProcessLauncher.md`). Browse stays local — driver runs the SQL queries against ZB itself.
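The "launch and supervise" responsibility Option 2 takes on has a well-known shape: restart on crash with backoff, escalate after repeated failures. A minimal sketch, with the worker process faked as a callable returning an exit code — the real code would spawn `MxGateway.Worker` and speak its named-pipe frame protocol, none of which is shown here:

```python
# Illustrative supervision loop only; the worker is modeled as a
# callable returning an exit code instead of a real child process.

import time


def supervise(start_worker, max_restarts=3, base_backoff=0.0):
    """Run the worker; on a crash, back off exponentially and restart."""
    restarts = 0
    while True:
        exit_code = start_worker()     # blocks until the worker exits
        if exit_code == 0:
            return restarts            # clean shutdown: stop supervising
        restarts += 1
        if restarts > max_restarts:
            # Give up and surface the fault to the server layer.
            raise RuntimeError("worker keeps crashing")
        time.sleep(base_backoff * (2 ** (restarts - 1)))
```

Note that this loop — plus watchdog, heartbeat, and post-mortem capture around it — is precisely the code Option 2 keeps in our repo and Option 1 deletes.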
```
OtOpcUa.Server (.NET 10 x64)
└── Driver.Galaxy (in-proc, .NET 10)
    ├── ZB SQL (local, in-proc)
    └── pipe ──► MxGateway.Worker (.NET 4.8 x86, child process)
        └── MXAccess COM (STA)
```

### Pros

- **One hop, not two.** Driver → worker pipe is the same shape as today's Proxy → Host pipe. Latency is on par with the current implementation.
- **No new service to deploy.** Worker is launched as a child process the same way `Galaxy.Host` is launched today (just with mxaccessgw's worker binary). Single-machine install story stays simple.
- **Keeps the trust boundary local.** No API keys, no TLS, no exposed gRPC port on the OtOpcUa box.
- **Reuses mxaccessgw's parity-tested worker code** — STA pump, COM lifetime, event conversion, fault model — without inheriting gw's ASP.NET Core / Blazor / SQLite footprint.
- **Tighter ownership.** OtOpcUa owns the worker lifecycle; recycle, kill, restart, post-mortem all decided by the driver, not by an external service we don't control.
- **Easier to reason about during integration tests.** No second service to spin up in CI; just a child process per test fixture.

### Cons

- **OtOpcUa server box must still have AVEVA + MXAccess installed**, since the worker runs locally. The major deployment win of Option 1 (separating where MXAccess runs from where OtOpcUa runs) is lost.
- **OtOpcUa still ships an x86 .NET 4.8 binary alongside it.** Even if we vendor mxaccessgw's worker rather than write our own, installer complexity and bitness considerations remain.
- **We re-implement everything gw already gives.** Process supervision, watchdog, recycle policy, heartbeat, post-mortem — these are exactly what `Galaxy.Host` does today, and they'd live in our repo again, just calling a different worker binary.
- **No browse cache, no deploy gating, no `WatchDeployEvents`** — we keep running our own ZB queries and our own `time_of_last_deploy` poll, or we port gw's cache code into the driver. Either way it's duplicated logic.
- **No auth, no dashboard, no metrics.** Operability stays where it is today (i.e., minimal). Adding it ourselves is a separate project.
- **Multiple OtOpcUa instances multiply MXAccess sessions.** Redundancy pair → two MXAccess clients on the Galaxy from the same software, vs. Option 1 where one gw arbitrates.
- **Worker protocol coupling without the contract surface.** We depend on mxaccessgw's worker IPC frame format — a surface that mxaccessgw treats as *internal* to its own gw↔worker boundary. If they refactor it, we have to follow. The public gRPC contract (Option 1) is more stable by design.
- **Loses the "common MXAccess access point" benefit.** Other consumers (CLI, integration harnesses, future tools) can't share state with our embedded worker.

---

## Status quo (for comparison)

Keep `Galaxy.Host` as today, and rip out historian + alarming + probe manager in place. End state: the Host shrinks to `MxAccessClient` + `GalaxyRepository`, which is roughly what Option 2 ends up looking like — but with our hand-rolled COM bridge instead of mxaccessgw's worker. Not a serious option once mxaccessgw exists; we'd be maintaining a parallel implementation of the same thing.

---

## Recommendation (effort-agnostic)

**Go with Option 1 — Tier-A driver against the MxAccess Gateway.** The decisive arguments:

1. **It's the only option that aligns Galaxy with how every other driver in this repo is structured.** The stated goal — "keep lmx to data + browsing, similar to other drivers" — only fully resolves if there is no `.Host` and no x86 build artifact in this repo at all. Option 2 still has an x86 child process and supervisor code; it's `Galaxy.Host` with a different worker binary inside.
2. **It separates *where MXAccess runs* from *where OtOpcUa runs*.** That is a strategically larger win than a few hundred microseconds of per-call latency.
   The OPC UA server stops being chained to AVEVA install footprint, bitness, and Wonderware client identity — which removes a class of deployment, redundancy, and CI problems we hit today (e.g., the `DESKTOP-6JL3KKO` Hyper-V/Docker conflict, the `dohertj2`-only pipe ACL, the live-Galaxy smoke test prerequisites).
3. **It collapses scope.** A non-trivial fraction of `Galaxy.Host` (browse cache, deploy-event watch, worker supervision, COM bridge, post-mortem, recycle, ACL hardening) is reproduced *better* in mxaccessgw. Option 1 deletes our copy. Option 2 keeps it.
4. **It positions historian and alarming for the right home.** Once the Galaxy driver is "just another driver", historian becomes a server-level data source (one that can also feed Modbus/S7 history if we ever want it), and alarming becomes a server-level A&E subsystem. Option 2 nominally allows the same move, but the temptation to keep them in `Galaxy.Host` "while we're already there" is real.
5. **It future-proofs against AVEVA's roadmap.** Managed NMX, ASB, or any replacement that shows up over the next few years gets adopted in mxaccessgw without a release in this repo.

The case for Option 2 is real but narrow: it's the right call **only** if we commit to single-box deployments forever, refuse to take a gRPC dependency, and value local-trust simplicity over the consolidation/operability benefits gw provides. None of those constraints hold here.

### What flips the recommendation

- If the gw protocol proves unstable, or performance under our subscription patterns tests out worse than expected → revisit Option 2.
- If org policy forbids running an MXAccess gateway as its own service → Option 2.
- If Galaxy goes from one of several drivers to *the* primary driver and raw call-rate matters more than architectural fit → revisit.

Otherwise: Option 1.
---

## Out-of-scope follow-ups (don't decide here, but flag them)

- **Where does the Wonderware Historian SDK live?** Likely its own .NET 4.8 x86 sidecar exposing a small `IHistorianDataSource` over a pipe or gRPC, plugged into the OPC UA server's HA service alongside any future historian sources. Independent of which option above is chosen.
- **Alarm subsystem ownership.** Decide whether the server hosts a generic AlarmCondition state machine driven by driver-advertised alarm metadata, or whether each driver continues to emit pre-shaped alarm transitions. Galaxy's 4-attr quartet is a strong forcing function for the generic approach.
- **Redundancy + gw sessions.** Standby OtOpcUa holds an open gw session (warm) vs. opens on takeover (cold). Affects gw worker count and Galaxy client-identity collisions.
- **Auth between OtOpcUa and gw.** API key in DPAPI-protected secret file vs. Windows-auth gRPC. Both supported by gw; pick before rollout.
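To make the "generic approach" in the alarm-ownership bullet concrete: a minimal sketch of mapping the Galaxy quartet's `.InAlarm`/`.Acked` pair onto conventional A&E condition states (`.Priority` and `.DescAttrName` would only decorate the emitted event). The state names and transition rules below are illustrative assumptions, not the shipped `GalaxyAlarmTracker` logic:

```python
# Hedged sketch: a driver-agnostic condition-state mapping the server's
# A&E subsystem could run for any driver that flags IsAlarm=true nodes.
# State names are assumptions for illustration.

def next_state(in_alarm: bool, acked: bool) -> str:
    """Map the (.InAlarm, .Acked) attribute pair to a condition state."""
    if in_alarm and not acked:
        return "ACTIVE_UNACKED"
    if in_alarm and acked:
        return "ACTIVE_ACKED"
    if not in_alarm and not acked:
        return "CLEARED_UNACKED"   # returned to normal, awaiting ack
    return "NORMAL"
```

Because the whole machine reduces to a pure function of two subscribed attributes, it has no reason to live inside any one driver — which is the forcing-function argument the bullet makes.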