lmxopcua/lmx_backend.md
Joseph Doherty ef22a61c39 v2 mxgw migration — Phase 1+2+3.1 wiring (7 PRs)
Foundational PRs from lmx_mxgw_impl.md, all green. Bodies only — DI/wiring
deferred to PR 1+2.W (combined wire-up) and PR 3.W.

PR 1.1 — IHistorianDataSource lifted to Core.Abstractions/Historian/
  Reuses existing DataValueSnapshot + HistoricalEvent shapes; sidecar (PR
  3.4) translates byte-quality → uint StatusCode internally.

PR 1.2 — IHistoryRouter + HistoryRouter on the server
  Longest-prefix-match resolution, case-insensitive, ObjectDisposed-guarded,
  swallow-on-shutdown disposal of misbehaving sources.

PR 1.3 — DriverNodeManager.HistoryRead* dispatch through IHistoryRouter
  Per-tag resolution with LegacyDriverHistoryAdapter wrapping
  `_driver as IHistoryProvider` so existing tests + drivers keep working
  until PR 7.2 retires the fallback.

PR 2.1 — AlarmConditionInfo extended with five sub-attribute refs
  InAlarmRef / PriorityRef / DescAttrNameRef / AckedRef / AckMsgWriteRef.
  Optional defaulted parameters preserve all existing 3-arg call sites.

PR 2.2 — AlarmConditionService state machine in Server/Alarms/
  Driver-agnostic port of GalaxyAlarmTracker. Sub-attribute refs come from
  AlarmConditionInfo, values arrive as DataValueSnapshot, ack writes route
  through IAlarmAcknowledger. State machine preserves Active/Acknowledged/
  Inactive transitions, Acked-on-active reset, post-disposal silence.

PR 2.3 — DriverNodeManager wires AlarmConditionService
  MarkAsAlarmCondition registers each alarm-bearing variable with the
  service; DriverWritableAcknowledger routes ack-message writes through
  the driver's IWritable + CapabilityInvoker. Service-raised transitions
  route via OnAlarmServiceTransition → matching ConditionSink. Legacy
  IAlarmSource path unchanged for null service.

PR 3.1 — Driver.Historian.Wonderware shell project (net48 x86)
  Console host shell + smoke test; SDK references + code lift come in
  PR 3.2.

Tests: 9 (PR 1.1) + 5 (PR 2.1) + 10 (PR 1.2) + 19 (PR 2.2) + 1 (PR 3.1)
all pass. Existing AlarmSubscribeIntegrationTests + HistoryReadIntegrationTests
unchanged.

Plan + audit docs (lmx_backend.md, lmx_mxgw.md, lmx_mxgw_impl.md)
included so parallel subagent worktrees can read them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:03:36 -04:00

Galaxy / LMX Backend — Restructuring Options

Context

Today the Galaxy driver is structured very differently from every other driver in this repo:

  • Galaxy.Proxy (.NET 10, in-process): tiny shim that frames IPC to the host.
  • Galaxy.Host (.NET Framework 4.8 x86, NSSM-wrapped Windows service): owns MXAccess COM, the STA pump, the ZB Galaxy Repository SQL queries, the Wonderware Historian SDK plugin, the per-platform ScanState probe manager, the alarm tracker (.InAlarm/.Priority/.DescAttrName/.Acked state machine + ack writer), recycle policy, and post-mortem MMF.

Other drivers (Modbus, S7, AB CIP, OpcUaClient, TwinCAT, FOCAS Tier-C) are in-process Tier-A drivers in the .NET 10 server. They do data + browse only; historian and alarming are driver-agnostic concerns at the server layer.

A sibling project, mxaccessgw (C:\Users\dohertj2\Desktop\mxaccessgw), already provides:

  • A .NET 10 x64 gRPC gateway in front of per-session .NET 4.8 x86 worker processes that own MXAccess COM, the STA, and event sinks (MxGateway.Server + MxGateway.Worker).
  • A full MXAccess command + event surface (Register, AddItem, Advise, Write, WriteSecured, OnDataChange, OnWriteComplete, etc.).
  • A cached, deploy-gated, paged Galaxy Repository browse RPC (galaxy_repository.v1) reading the same ZB tables we read today, with the query bodies kept byte-identical to OtOpcUa.
  • A .NET client library (clients/dotnet/MxGateway.Client).
  • API-key auth, Blazor dashboard, structured logs, metrics, watchdog/recycle.

The proposal is to strip Galaxy down to data + browse — push historian and alarming out to server-level subsystems where they live for every other driver — and pick how the slimmed-down driver talks to MXAccess.


What "push historian and alarming out" means

Both options below assume the same scope reduction; they only differ in how the driver reaches MXAccess.

| Concern | Today (Galaxy.Host) | After |
| --- | --- | --- |
| Galaxy hierarchy browse | GalaxyRepository (SQL) inside Host | Driver (Option 1: via gw browse RPC; Option 2: own SQL or worker) |
| Live read / write / subscribe | MxAccessClient + STA pump in Host | gw (Option 1) or embedded worker (Option 2) |
| Wonderware Historian SDK | HistorianDataSource in Host (x86) | Separate Historian data source plugged into the server's HA service. Likely stays its own .NET 4.8 x86 sidecar because the SDK is x86-only; independent of the Galaxy driver lifecycle. |
| Alarm state machine (.InAlarm/.Acked quartet, transitions, ack writer) | GalaxyAlarmTracker in Host | Server-level A&E subsystem subscribes to alarm-bearing attributes the driver advertises and runs the AlarmCondition state machine generically. Driver only flags IsAlarm=true in node metadata. |
| ScanState per-platform probes | GalaxyRuntimeProbeManager in Host | Driver-side: ScanState is just another tag subscription; the driver re-advises one per discovered $WinPlatform/$AppEngine and reports HostConnectivityStatus from the value stream. No special host-side machinery. |

After the strip-down, the Galaxy driver looks like Modbus or OpcUaClient: it discovers nodes, reads/writes/subscribes, and reports per-host transport health. Everything else is the server's problem.
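
To make the ScanState row concrete, here is a minimal sketch of folding a ScanState value stream into a per-host status. It is written in Python only for brevity and is not the driver's code. `HostState`, `ScanStateSample`, `fold_connectivity`, and the mapping rules (bad quality means unknown, a truthy value means running) are all assumptions made for illustration; only the idea that HostConnectivityStatus is derived from ordinary tag subscriptions comes from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    # Hypothetical status values; the real HostConnectivityStatus may differ.
    UNKNOWN = "Unknown"
    RUNNING = "Running"
    STOPPED = "Stopped"


@dataclass(frozen=True)
class ScanStateSample:
    host: str      # name of a discovered $WinPlatform / $AppEngine object
    value: object  # ScanState value delivered by the subscription
    good: bool     # whether the sample carried good quality


def fold_connectivity(samples):
    """Return the latest state per host, derived only from the value stream."""
    status = {}
    for sample in samples:
        if not sample.good:
            # Assumption: a bad-quality sample means we cannot tell.
            status[sample.host] = HostState.UNKNOWN
        elif sample.value:
            status[sample.host] = HostState.RUNNING
        else:
            status[sample.host] = HostState.STOPPED
    return status
```

The point of the sketch is that no probe manager is involved: the same subscription path that carries ordinary tags carries the health signal.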


Option 1 — Tier-A driver against the MxAccess Gateway

Driver.Galaxy becomes a regular in-process .NET 10 driver in the OtOpcUa server (no .Host, no .Proxy split, no x86). It talks to a separately deployed MxGateway.Server over gRPC using MxGateway.Client. Browse comes from galaxy_repository.v1.DiscoverHierarchy. Live data comes from MxAccessGateway.OpenSession/AddItem/Advise/StreamEvents.

OtOpcUa.Server (.NET 10 x64)
  └── Driver.Galaxy (in-proc, .NET 10)
        └── gRPC ──► MxGateway.Server (.NET 10 x64)
                       └── pipe ──► MxGateway.Worker (.NET 4.8 x86)
                                       └── MXAccess COM (STA)
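
The call order named above (AddItem, then Advise, then draining StreamEvents) can be sketched against a fake session. This is an illustration in Python, not the MxGateway.Client API: the method signatures, the handle scheme, and the `FakeSession` stand-in are assumptions, and only the operation names come from the text.

```python
from typing import Protocol


class GatewaySession(Protocol):
    # Hypothetical shape; only the operation names are taken from the doc.
    def add_item(self, reference: str) -> int: ...
    def advise(self, handle: int) -> None: ...
    def stream_events(self): ...


class FakeSession:
    """In-memory stand-in so the sketch runs without a gateway."""

    def __init__(self, values):
        self._values = values  # reference -> current value
        self._handles = {}     # handle -> reference
        self._advised = []     # handles with an active subscription

    def add_item(self, reference):
        handle = len(self._handles) + 1
        self._handles[handle] = reference
        return handle

    def advise(self, handle):
        self._advised.append(handle)

    def stream_events(self):
        # Emit one data-change event per advised item.
        for handle in self._advised:
            yield handle, self._values[self._handles[handle]]


def subscribe(session, references):
    """AddItem, then Advise, then map streamed events back to references."""
    by_handle = {}
    for reference in references:
        handle = session.add_item(reference)
        session.advise(handle)
        by_handle[handle] = reference
    return {by_handle[h]: value for h, value in session.stream_events()}
```

Keying events by handle and translating back to references at the driver edge keeps the rest of the driver unaware of gateway handles.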

Pros

  • Architectural parity with other drivers. No bespoke Host service, no x86 build target, no NSSM wrapper, no STA pump in this repo, no PostMortemMmf/RecyclePolicy we maintain ourselves.
  • OtOpcUa server stops needing AVEVA installed on its own host. The gateway runs where MXAccess lives; the OPC UA server can live on a different box, in a container, or on a hardened jump host.
  • One canonical MXAccess surface across the org. Any future tool — a diagnostic CLI, a Historian replacement, an integration harness — talks to the same gw with the same parity guarantees we get.
  • Multi-instance friendly. Two OtOpcUa servers (warm/hot redundancy) share one gw and one MXAccess footprint instead of each running their own Galaxy.Host with duplicate Wonderware client identities.
  • Browse + cache for free. galaxy_repository.v1 already implements the hierarchy cache, deploy-time gating, paging, and WatchDeployEvents — we delete GalaxyRepository.cs, GalaxyHierarchyRow.cs, the change-detection poll loop, and the matching SQL plumbing.
  • Operability for free. API-key auth, Blazor dashboard at /dashboard, metrics via Meter, structured logs with redaction. We currently have none of that in Galaxy.Host.
  • Future backend swap. When AVEVA exposes managed NMX or another modern path, gw routes to it without OtOpcUa changes (gw's stated roadmap).
  • Tighter blast radius. A hung COM event, a leaking COM object, a crashing worker — all owned by gw's session/worker isolation, not the OPC UA server process.
  • Simpler version story for OtOpcUa. Driver is plain .NET 10; the bitness/runtime split lives entirely in mxaccessgw's repo.
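
As a rough picture of what deploy-time gating buys, the sketch below reloads the hierarchy only when a cheap deploy timestamp changes. It is a generic illustration in Python and does not describe how galaxy_repository.v1 is implemented; `DeployGatedCache` and both callbacks are hypothetical names.

```python
class DeployGatedCache:
    """Rebuild the cached hierarchy only when the deploy stamp changes."""

    def __init__(self, read_deploy_time, load_hierarchy):
        self._read_deploy_time = read_deploy_time  # cheap probe
        self._load_hierarchy = load_hierarchy      # expensive full query
        self._stamp = None
        self._rows = None

    def get(self):
        stamp = self._read_deploy_time()
        if self._rows is None or stamp != self._stamp:
            # First call, or a deploy happened since the last load.
            self._rows = self._load_hierarchy()
            self._stamp = stamp
        return self._rows
```

With Option 1 this logic lives in the gateway; with Option 2 something equivalent would have to stay in, or be ported into, the driver.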

Cons

  • Extra deployment dependency. mxaccessgw is now a service that has to be installed, monitored, and kept on a compatible protocol version. For a single-box install this is one more moving piece.
  • Two hops on every call (driver→gw, gw→worker) instead of one (proxy→host). Today's hop is MessagePack over a named pipe; the new outer hop is gRPC over TCP. Per-call overhead is a few hundred microseconds, not a regression for OPC UA workloads but measurable for very chatty bursts.
  • Auth/secret surface added. OtOpcUa now holds an API key for gw and rotates it; gw's SQLite-backed key store has to be managed.
  • Failure model spans two processes we don't own — gw + worker. Reconnect logic in our driver has to ride both: gw transport drop, gw session lease expiry, gw-detected worker crash, plus the worker's own MXAccess reconnect. All of it is exposed in the gRPC contract, but it's still surface area.
  • Cross-repo protocol coupling. Bumping mxaccessgw major version (gRPC contract changes, session shape changes) ripples into OtOpcUa releases. Mitigated by versioned contracts; not free.
  • Galaxy redundancy still has to think about gw. A redundancy fail-over of OtOpcUa is independent of the gw's session lifecycle. Need to decide whether the standby holds an open session or only opens it on takeover.
  • Sensitive writes (WriteSecured, AuthenticateUser) cross the network if gw is remote. TLS + mTLS solves it but adds setup.
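
One way to keep the multi-process failure model manageable is a table-driven recovery plan, sketched below in Python. The four failure kinds come from the failure-model bullet above; the recovery steps assigned to each are assumptions made for illustration, and the real behavior depends on what the gateway's gRPC contract guarantees.

```python
from enum import Enum, auto


class Failure(Enum):
    TRANSPORT_DROP = auto()      # gRPC channel to gw lost
    LEASE_EXPIRED = auto()       # gw session lease ran out
    WORKER_CRASH = auto()        # gw reported its worker died
    MXACCESS_RECONNECT = auto()  # worker is re-establishing MXAccess


class Step(Enum):
    RECONNECT_CHANNEL = auto()
    OPEN_SESSION = auto()
    REPLAY_SUBSCRIPTIONS = auto()
    WAIT_FOR_WORKER = auto()


# Assumed mapping, not taken from the gateway documentation.
RECOVERY = {
    Failure.TRANSPORT_DROP: [
        Step.RECONNECT_CHANNEL, Step.OPEN_SESSION, Step.REPLAY_SUBSCRIPTIONS,
    ],
    Failure.LEASE_EXPIRED: [Step.OPEN_SESSION, Step.REPLAY_SUBSCRIPTIONS],
    Failure.WORKER_CRASH: [Step.OPEN_SESSION, Step.REPLAY_SUBSCRIPTIONS],
    Failure.MXACCESS_RECONNECT: [Step.WAIT_FOR_WORKER],
}


def plan(failure):
    """Return the ordered recovery steps for a classified failure."""
    return list(RECOVERY[failure])
```

Keeping the mapping in one table makes it easy to review against the contract and to test that every failure kind has a plan.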

Option 2 — Embed mxaccessgw worker, no gateway

Driver.Galaxy is still in-process .NET 10, but instead of speaking gRPC to a gateway service, it directly launches and supervises one (or more) MxGateway.Worker processes and talks to them over the same named-pipe worker protocol gw uses internally (docs/WorkerFrameProtocol.md, docs/WorkerProcessLauncher.md). Browse stays local — driver runs the SQL queries against ZB itself.

OtOpcUa.Server (.NET 10 x64)
  └── Driver.Galaxy (in-proc, .NET 10)
        ├── ZB SQL (local, in-proc)
        └── pipe ──► MxGateway.Worker (.NET 4.8 x86, child process)
                       └── MXAccess COM (STA)

Pros

  • One hop, not two. Driver → worker pipe is the same shape as today's Proxy → Host pipe. Latency is on par with the current implementation.
  • No new service to deploy. Worker is launched as a child process the same way Galaxy.Host is launched today (just with mxaccessgw's worker binary). Single-machine install story stays simple.
  • Keeps the trust boundary local. No API keys, no TLS, no exposed gRPC port on the OtOpcUa box.
  • Reuses mxaccessgw's parity-tested worker code — STA pump, COM lifetime, event conversion, fault model — without inheriting gw's ASP.NET Core / Blazor / SQLite footprint.
  • Tighter ownership. OtOpcUa owns the worker lifecycle; recycle, kill, restart, post-mortem all decided by the driver, not by an external service we don't control.
  • Easier to reason about during integration tests. No second service to spin up in CI; just a child process per test fixture.

Cons

  • OtOpcUa server box must still have AVEVA + MXAccess installed, since the worker runs locally. The major deployment win of Option 1 (separating where MXAccess runs from where OtOpcUa runs) is lost.
  • OtOpcUa still ships an x86 .NET 4.8 binary alongside it. Even if we vendor mxaccessgw's worker rather than write our own, installer complexity and bitness considerations remain.
  • We re-implement everything gw already gives. Process supervision, watchdog, recycle policy, heartbeat, post-mortem — these are exactly what Galaxy.Host does today, and they'd live in our repo again, just calling a different worker binary.
  • No browse cache, no deploy gating, no WatchDeployEvents — we keep running our own ZB queries and our own time_of_last_deploy poll, or we port gw's cache code into the driver. Either way it's duplicated logic.
  • No auth, no dashboard, no metrics. Operability stays where it is today (i.e., minimal). Adding it ourselves is a separate project.
  • Multiple OtOpcUa instances multiply MXAccess sessions. Redundancy pair → two MXAccess clients on the Galaxy from the same software, vs. Option 1 where one gw arbitrates.
  • Worker protocol coupling without the contract surface. We depend on mxaccessgw's worker IPC frame format — a surface that mxaccessgw treats as internal to its own gw↔worker boundary. If they refactor it, we have to follow. The public gRPC contract (Option 1) is more stable by design.
  • Loses the "common MXAccess access point" benefit. Other consumers (CLI, integration harnesses, future tools) can't share state with our embedded worker.
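
To give a sense of the supervision code Option 2 keeps in this repo, here is a minimal heartbeat watchdog with a recycle budget, sketched in Python. The class name, the string verdicts, and the policy (recycle on a missed heartbeat, give up after a fixed number of recycles) are assumptions and are not taken from Galaxy.Host or mxaccessgw.

```python
class WorkerWatchdog:
    """Decide whether a child worker is healthy, needs a recycle, or is lost."""

    def __init__(self, timeout_s, max_recycles, started_at):
        self.timeout_s = timeout_s
        self.max_recycles = max_recycles
        self.recycles = 0
        self._last_beat = started_at

    def heartbeat(self, now):
        # Called whenever the worker proves it is alive.
        self._last_beat = now

    def check(self, now):
        if now - self._last_beat <= self.timeout_s:
            return "ok"
        if self.recycles >= self.max_recycles:
            return "give_up"
        self.recycles += 1
        # Start a fresh grace period for the replacement worker.
        self._last_beat = now
        return "recycle"
```

Time is passed in rather than read from a clock so the policy can be unit-tested without sleeping. Even this small policy is code that Option 1 would leave to the gateway.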

Status quo (for comparison)

Keep Galaxy.Host as it is today and rip out the historian, alarming, and probe manager in place. End state: the Host shrinks to MxAccessClient + GalaxyRepository, which is roughly what Option 2 ends up looking like, but with our hand-rolled COM bridge instead of mxaccessgw's worker. Not a serious option once mxaccessgw exists; we'd be maintaining a parallel implementation of the same thing.


Recommendation (effort-agnostic)

Go with Option 1 — Tier-A driver against the MxAccess Gateway.

The decisive arguments:

  1. It's the only option that aligns Galaxy with how every other driver in this repo is structured. The user's stated goal — "keep lmx to data + browsing, similar to other drivers" — only fully resolves if there is no .Host and no x86 build artifact in this repo at all. Option 2 still has an x86 child process and supervisor code; it's Galaxy.Host with a different worker binary inside.

  2. It separates where MXAccess runs from where OtOpcUa runs. That is a strategically larger win than a few hundred microseconds of per-call latency. The OPC UA server stops being chained to AVEVA install footprint, bitness, and Wonderware client identity — which removes a class of deployment, redundancy, and CI problems we hit today (e.g., the DESKTOP-6JL3KKO Hyper-V/Docker conflict, the dohertj2-only pipe ACL, the live-Galaxy smoke test prerequisites).

  3. It collapses scope. A non-trivial fraction of Galaxy.Host (browse cache, deploy-event watch, worker supervision, COM bridge, post-mortem, recycle, ACL hardening) is reproduced better in mxaccessgw. Option 1 deletes our copy. Option 2 keeps it.

  4. It positions historian and alarming for the right home. Once the Galaxy driver is "just another driver", historian becomes a server-level data source (one that can also feed Modbus/S7 history if we ever want it), and alarming becomes a server-level A&E subsystem. Option 2 nominally allows the same move, but the temptation to keep them in Galaxy.Host "while we're already there" is real.

  5. It future-proofs against AVEVA's roadmap. Managed NMX, ASB, or any replacement that shows up over the next few years gets adopted in mxaccessgw without a release in this repo.

The case for Option 2 is real but narrow: it's the right call only if we commit to single-box deployments forever, refuse to take a gRPC dependency, and value local-trust simplicity over the consolidation/operability benefits gw provides. None of those constraints hold here.

What flips the recommendation

  • If the gw protocol proves unstable, or perf testing under our subscription patterns turns out worse than expected → revisit Option 2.
  • If org-policy forbids running an MXAccess gateway as its own service → Option 2.
  • If Galaxy goes from one of several drivers to the primary driver and raw call-rate matters more than architectural fit → revisit.

Otherwise: Option 1.


Out-of-scope follow-ups (don't decide here, but flag them)

  • Where does the Wonderware Historian SDK live? Likely its own .NET 4.8 x86 sidecar exposing a small IHistorianDataSource over a pipe or gRPC, plugged into the OPC UA server's HA service alongside any future historian sources. Independent of which option above is chosen.
  • Alarm subsystem ownership. Decide whether the server hosts a generic AlarmCondition state machine driven by driver-advertised alarm metadata, or whether each driver continues to emit pre-shaped alarm transitions. Galaxy's 4-attr quartet is a strong forcing function for the generic approach.
  • Redundancy + gw sessions. Standby OtOpcUa holds an open gw session (warm) vs. opens on takeover (cold). Affects gw worker count and Galaxy client-identity collisions.
  • Auth between OtOpcUa and gw. API key in DPAPI-protected secret file vs. Windows-auth gRPC. Both supported by gw; pick before rollout.
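
For the alarm-ownership question, the generic approach can be pictured as a small state machine driven only by attribute values. The Python sketch below is deliberately simplified: it uses just the InAlarm and Acked values, ignores priority, description, and ack writes, and its names (`AlarmState`, `next_state`, `transitions`) are hypothetical rather than taken from GalaxyAlarmTracker.

```python
from enum import Enum


class AlarmState(Enum):
    INACTIVE = "Inactive"
    ACTIVE = "Active"
    ACKNOWLEDGED = "Acknowledged"


def next_state(in_alarm, acked):
    """Map the current attribute values to a condition state."""
    if not in_alarm:
        return AlarmState.INACTIVE
    return AlarmState.ACKNOWLEDGED if acked else AlarmState.ACTIVE


def transitions(samples):
    """Return each state change seen in a stream of (in_alarm, acked) pairs."""
    state = AlarmState.INACTIVE
    changes = []
    for in_alarm, acked in samples:
        new_state = next_state(in_alarm, acked)
        if new_state != state:
            changes.append(new_state)
            state = new_state
    return changes
```

Nothing in the sketch is Galaxy-specific, which is the argument for hosting it at the server level: any driver that can advertise the relevant attributes could feed it.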