lmxopcua/lmx_backend.md
Joseph Doherty ef22a61c39 v2 mxgw migration — Phase 1+2+3.1 wiring (7 PRs)
Foundational PRs from lmx_mxgw_impl.md, all green. Bodies only — DI/wiring
deferred to PR 1+2.W (combined wire-up) and PR 3.W.

PR 1.1 — IHistorianDataSource lifted to Core.Abstractions/Historian/
  Reuses existing DataValueSnapshot + HistoricalEvent shapes; sidecar (PR
  3.4) translates byte-quality → uint StatusCode internally.

PR 1.2 — IHistoryRouter + HistoryRouter on the server
  Longest-prefix-match resolution, case-insensitive, ObjectDisposed-guarded,
  swallow-on-shutdown disposal of misbehaving sources.

PR 1.3 — DriverNodeManager.HistoryRead* dispatch through IHistoryRouter
  Per-tag resolution with LegacyDriverHistoryAdapter wrapping
  `_driver as IHistoryProvider` so existing tests + drivers keep working
  until PR 7.2 retires the fallback.

PR 2.1 — AlarmConditionInfo extended with five sub-attribute refs
  InAlarmRef / PriorityRef / DescAttrNameRef / AckedRef / AckMsgWriteRef.
  Optional defaulted parameters preserve all existing 3-arg call sites.

PR 2.2 — AlarmConditionService state machine in Server/Alarms/
  Driver-agnostic port of GalaxyAlarmTracker. Sub-attribute refs come from
  AlarmConditionInfo, values arrive as DataValueSnapshot, ack writes route
  through IAlarmAcknowledger. State machine preserves Active/Acknowledged/
  Inactive transitions, Acked-on-active reset, post-disposal silence.

PR 2.3 — DriverNodeManager wires AlarmConditionService
  MarkAsAlarmCondition registers each alarm-bearing variable with the
  service; DriverWritableAcknowledger routes ack-message writes through
  the driver's IWritable + CapabilityInvoker. Service-raised transitions
  route via OnAlarmServiceTransition → matching ConditionSink. Legacy
  IAlarmSource path unchanged for null service.

PR 3.1 — Driver.Historian.Wonderware shell project (net48 x86)
  Console host shell + smoke test; SDK references + code lift come in
  PR 3.2.

Tests: 9 (PR 1.1) + 5 (PR 2.1) + 10 (PR 1.2) + 19 (PR 2.2) + 1 (PR 3.1)
all pass. Existing AlarmSubscribeIntegrationTests + HistoryReadIntegrationTests
unchanged.

Plan + audit docs (lmx_backend.md, lmx_mxgw.md, lmx_mxgw_impl.md)
included so parallel subagent worktrees can read them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:03:36 -04:00


# Galaxy / LMX Backend — Restructuring Options
## Context
Today the Galaxy driver is structured very differently from every other driver
in this repo:
- **Galaxy.Proxy** (.NET 10, in-process): tiny shim that frames IPC to the host.
- **Galaxy.Host** (.NET Framework 4.8 **x86**, NSSM-wrapped Windows service):
owns MXAccess COM, the STA pump, the ZB Galaxy Repository SQL queries, the
Wonderware Historian SDK plugin, the per-platform `ScanState` probe manager,
the alarm tracker (`.InAlarm`/`.Priority`/`.DescAttrName`/`.Acked` state
machine + ack writer), recycle policy, and post-mortem MMF.

Other drivers (Modbus, S7, AB CIP, OpcUaClient, TwinCAT, FOCAS Tier-C) are
**in-process Tier-A drivers** in the .NET 10 server. They do data + browse
only; historian and alarming are driver-agnostic concerns at the server layer.

A sibling project, **mxaccessgw**
(`C:\Users\dohertj2\Desktop\mxaccessgw`), already provides:
- A .NET 10 x64 gRPC gateway in front of per-session .NET 4.8 x86 worker
processes that own MXAccess COM, the STA, and event sinks
(`MxGateway.Server` + `MxGateway.Worker`).
- A full MXAccess command + event surface (`Register`, `AddItem`, `Advise`,
`Write`, `WriteSecured`, `OnDataChange`, `OnWriteComplete`, etc.).
- A cached, deploy-gated, paged **Galaxy Repository browse** RPC
(`galaxy_repository.v1`) reading the same ZB tables we read today, with the
query bodies kept byte-identical to OtOpcUa.
- A .NET client library (`clients/dotnet/MxGateway.Client`).
- API-key auth, Blazor dashboard, structured logs, metrics, watchdog/recycle.

The proposal is to **strip Galaxy down to data + browse** — push historian and
alarming out to server-level subsystems where they live for every other driver
— and pick how the slimmed-down driver talks to MXAccess.

---

## What "push historian and alarming out" means

Both options below assume the same scope reduction; they only differ in how
the driver reaches MXAccess.

| Concern | Today (Galaxy.Host) | After |
|---|---|---|
| Galaxy hierarchy browse | `GalaxyRepository` (SQL) inside Host | Driver (Option 1: via gw browse RPC; Option 2: own SQL or worker) |
| Live read / write / subscribe | `MxAccessClient` + STA pump in Host | gw (Option 1) or embedded worker (Option 2) |
| Wonderware Historian SDK | `HistorianDataSource` in Host (x86) | Separate Historian data source plugged into the server's HA service. Likely stays its own .NET 4.8 x86 sidecar because the SDK is x86-only; **independent of the Galaxy driver lifecycle**. |
| Alarm state machine (`.InAlarm`/`.Acked` quartet, transitions, ack writer) | `GalaxyAlarmTracker` in Host | Server-level A&E subsystem subscribes to alarm-bearing attributes the driver advertises and runs the AlarmCondition state machine generically. Driver only flags `IsAlarm=true` in node metadata. |
| `ScanState` per-platform probes | `GalaxyRuntimeProbeManager` in Host | Driver-side: ScanState is just another tag subscription; the driver re-advises one per discovered `$WinPlatform`/`$AppEngine` and reports `HostConnectivityStatus` from the value stream. No special host-side machinery. |

After the strip-down, the Galaxy driver looks like Modbus or OpcUaClient: it
discovers nodes, reads/writes/subscribes, and reports per-host transport
health. Everything else is the server's problem.

---

## Option 1 — Tier-A driver against the MxAccess Gateway
`Driver.Galaxy` becomes a regular **in-process .NET 10 driver** in the OtOpcUa
server (no `.Host`, no `.Proxy` split, no x86). It talks to a separately
deployed `MxGateway.Server` over gRPC using `MxGateway.Client`. Browse comes
from `galaxy_repository.v1.DiscoverHierarchy`. Live data comes from
`MxAccessGateway.OpenSession`/`AddItem`/`Advise`/`StreamEvents`.
```
OtOpcUa.Server (.NET 10 x64)
  └── Driver.Galaxy (in-proc, .NET 10)
        └── gRPC ──► MxGateway.Server (.NET 10 x64)
              └── pipe ──► MxGateway.Worker (.NET 4.8 x86)
                    └── MXAccess COM (STA)
```
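The call sequence described above (browse, open a session, add and advise each item, then consume the event stream) can be sketched as follows. This is an illustration in Python under stated assumptions: `FakeGateway` and its method names are hypothetical stand-ins that only echo the RPC names mentioned in this section. They are not the real `MxGateway.Client` API, and the real calls are asynchronous gRPC operations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass
class FakeGateway:
    """In-memory stand-in for the gateway (hypothetical, not MxGateway.Client)."""
    hierarchy: List[str]
    values: Dict[str, object] = field(default_factory=dict)
    advised: List[str] = field(default_factory=list)
    session_open: bool = False

    def discover_hierarchy(self) -> List[str]:
        return list(self.hierarchy)

    def open_session(self) -> None:
        self.session_open = True

    def add_item(self, tag: str) -> None:
        if not self.session_open:
            raise RuntimeError("add_item called before open_session")
        self.values.setdefault(tag, None)  # unknown tags start with no value

    def advise(self, tag: str) -> None:
        if tag not in self.values:
            raise KeyError(f"advise called before add_item for {tag}")
        self.advised.append(tag)

    def stream_events(self) -> Iterator[Tuple[str, object]]:
        # The fake emits one data-change event per advised tag, then ends.
        for tag in self.advised:
            yield tag, self.values[tag]


def start_driver(gw: FakeGateway,
                 on_change: Callable[[str, object], None]) -> List[str]:
    """Browse, open a session, subscribe every tag, then drain events."""
    tags = gw.discover_hierarchy()
    gw.open_session()
    for tag in tags:
        gw.add_item(tag)
        gw.advise(tag)
    for tag, value in gw.stream_events():
        on_change(tag, value)
    return tags
```

The point of the sketch is the ordering: the driver holds no COM or STA state of its own, only a session handle and a callback into the OPC UA address space.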
### Pros
- **Architectural parity with other drivers.** No bespoke `Host` service, no
x86 build target, no NSSM wrapper, no STA pump in this repo, no
`PostMortemMmf`/`RecyclePolicy` we maintain ourselves.
- **OtOpcUa server stops needing AVEVA installed on its own host.** The
gateway runs where MXAccess lives; the OPC UA server can live on a different
box, in a container, or on a hardened jump host.
- **One canonical MXAccess surface across the org.** Any future tool — a
diagnostic CLI, a Historian replacement, an integration harness — talks to
the same gw with the same parity guarantees we get.
- **Multi-instance friendly.** Two OtOpcUa servers (warm/hot redundancy) share
one gw and one MXAccess footprint instead of each running their own
`Galaxy.Host` with duplicate Wonderware client identities.
- **Browse + cache for free.** `galaxy_repository.v1` already implements the
hierarchy cache, deploy-time gating, paging, and `WatchDeployEvents` — we
delete `GalaxyRepository.cs`, `GalaxyHierarchyRow.cs`, the change-detection
poll loop, and the matching SQL plumbing.
- **Operability for free.** API-key auth, Blazor dashboard at `/dashboard`,
metrics via `Meter`, structured logs with redaction. We currently have
none of that in `Galaxy.Host`.
- **Future backend swap.** When AVEVA exposes managed NMX or another modern
path, gw routes to it without OtOpcUa changes (gw's stated roadmap).
- **Tighter blast radius.** A hung COM event, a leaking COM object, a
crashing worker — all owned by gw's session/worker isolation, not the
OPC UA server process.
- **Simpler version story for OtOpcUa.** Driver is plain .NET 10; the
bitness/runtime split lives entirely in mxaccessgw's repo.
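The deploy-gated browse cache named in the list above can be pictured with a small sketch. It assumes the gating works by comparing a cheap deploy marker (for example a last-deploy timestamp) against the one seen at the previous load; that is an assumption for illustration, not a description of gw's actual code. `DeployGatedCache`, `read_deploy_marker`, and `load_hierarchy` are hypothetical names.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class DeployGatedCache(Generic[T]):
    """Reuses an expensive browse result until the deploy marker changes."""

    def __init__(self,
                 read_deploy_marker: Callable[[], str],
                 load_hierarchy: Callable[[], T]) -> None:
        self._read_deploy_marker = read_deploy_marker
        self._load_hierarchy = load_hierarchy
        self._loaded = False
        self._marker: Optional[str] = None
        self._cached: Optional[T] = None

    def get(self) -> T:
        marker = self._read_deploy_marker()  # cheap probe on every call
        if not self._loaded or marker != self._marker:
            # First call, or a deploy happened since the last load:
            # run the expensive hierarchy query again.
            self._cached = self._load_hierarchy()
            self._marker = marker
            self._loaded = True
        return self._cached  # type: ignore[return-value]
```

With Option 1 this logic lives in the gateway; the driver only consumes the result.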
### Cons
- **Extra deployment dependency.** mxaccessgw is now a service that has to be
installed, monitored, and kept on a compatible protocol version. For a
single-box install this is one more moving piece.
- **Two hops on every call** (driver→gw, gw→worker) instead of one
(proxy→host). Today's hop is MessagePack over a named pipe; the new outer
hop is gRPC over TCP. Per-call overhead is a few hundred microseconds: not a
regression for OPC UA workloads, but measurable for very chatty bursts.
- **Auth/secret surface added.** OtOpcUa now holds an API key for gw and
rotates it; gw's SQLite-backed key store has to be managed.
- **Failure model spans two processes we don't own** — gw + worker. Reconnect
logic in our driver has to ride both: gw transport drop, gw session lease
expiry, gw-detected worker crash, plus the worker's own MXAccess reconnect.
All of it is exposed in the gRPC contract, but it's still surface area.
- **Cross-repo protocol coupling.** Bumping `mxaccessgw` major version (gRPC
contract changes, session shape changes) ripples into OtOpcUa releases.
Mitigated by versioned contracts; not free.
- **Galaxy redundancy still has to think about gw.** A redundancy fail-over of
OtOpcUa is independent of the gw's session lifecycle. Need to decide whether
the standby holds an open session or only opens it on takeover.
- **Sensitive writes (`WriteSecured`, `AuthenticateUser`) cross the network**
if gw is remote. TLS + mTLS solves it but adds setup.
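To make the failure-model item above concrete, here is a hedged sketch of how the driver might map each failure named in that bullet to a recovery plan. The enum members and the chosen actions are assumptions for illustration only; the real gRPC contract decides what survives a transport drop or a worker crash.

```python
from enum import Enum, auto
from typing import Dict, List


class Failure(Enum):
    TRANSPORT_DROP = auto()         # gRPC channel to gw lost
    SESSION_LEASE_EXPIRED = auto()  # gw timed out our session
    WORKER_CRASHED = auto()         # gw reported its worker died
    MXACCESS_RECONNECTING = auto()  # worker is re-attaching to MXAccess


class Action(Enum):
    RECONNECT_CHANNEL = auto()
    REOPEN_SESSION = auto()
    READVISE = auto()
    WAIT = auto()


# Hypothetical policy table: each failure maps to ordered recovery steps.
_PLAN: Dict[Failure, List[Action]] = {
    Failure.TRANSPORT_DROP: [Action.RECONNECT_CHANNEL,
                             Action.REOPEN_SESSION,
                             Action.READVISE],
    Failure.SESSION_LEASE_EXPIRED: [Action.REOPEN_SESSION, Action.READVISE],
    Failure.WORKER_CRASHED: [Action.REOPEN_SESSION, Action.READVISE],
    Failure.MXACCESS_RECONNECTING: [Action.WAIT],
}


def recovery_plan(failure: Failure) -> List[Action]:
    """Return a copy of the ordered recovery steps for one failure kind."""
    return list(_PLAN[failure])
```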
---
## Option 2 — Embed mxaccessgw worker, no gateway
`Driver.Galaxy` is still in-process .NET 10, but instead of speaking gRPC to a
gateway service, it directly **launches and supervises one (or more)
`MxGateway.Worker` processes** and talks to them over the same named-pipe
worker protocol gw uses internally
(`docs/WorkerFrameProtocol.md`, `docs/WorkerProcessLauncher.md`). Browse stays
local — driver runs the SQL queries against ZB itself.
```
OtOpcUa.Server (.NET 10 x64)
  └── Driver.Galaxy (in-proc, .NET 10)
        ├── ZB SQL (local, in-proc)
        └── pipe ──► MxGateway.Worker (.NET 4.8 x86, child process)
              └── MXAccess COM (STA)
```
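"Launches and supervises" implies the driver owns a restart loop. A minimal sketch of such a loop with capped exponential backoff follows, in Python for brevity. The function names, the backoff policy, and the injectable `run` callable are all assumptions for illustration; this document does not specify a recycle policy, and a real worker is long-lived rather than run to completion.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def backoff_delays(base: float, cap: float, attempts: int) -> List[float]:
    """Capped exponential backoff: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


def _run_process(cmd: Sequence[str]) -> int:
    # Blocks until the child exits and returns its exit code.
    return subprocess.run(list(cmd)).returncode


def supervise(cmd: Sequence[str],
              max_restarts: int,
              base_delay: float = 0.01,
              max_delay: float = 1.0,
              run: Callable[[Sequence[str]], int] = _run_process,
              sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Run cmd, restarting after non-zero exits; return every exit code seen."""
    exit_codes: List[int] = []
    delays = backoff_delays(base_delay, max_delay, max_restarts)
    for attempt in range(max_restarts + 1):
        code = run(cmd)
        exit_codes.append(code)
        if code == 0 or attempt == max_restarts:
            break  # clean exit, or restart budget exhausted
        sleep(delays[attempt])
    return exit_codes
```

Everything in this loop is code that Option 1 leaves to the gateway.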
### Pros
- **One hop, not two.** Driver → worker pipe is the same shape as today's
Proxy → Host pipe. Latency is on par with the current implementation.
- **No new service to deploy.** Worker is launched as a child process the
same way `Galaxy.Host` is launched today (just with mxaccessgw's worker
binary). Single-machine install story stays simple.
- **Keeps the trust boundary local.** No API keys, no TLS, no exposed gRPC
port on the OtOpcUa box.
- **Reuses mxaccessgw's parity-tested worker code** — STA pump, COM lifetime,
event conversion, fault model — without inheriting gw's ASP.NET Core /
Blazor / SQLite footprint.
- **Tighter ownership.** OtOpcUa owns the worker lifecycle; recycle, kill,
restart, post-mortem all decided by the driver, not by an external service
we don't control.
- **Easier to reason about during integration tests.** No second service to
spin up in CI; just a child process per test fixture.
### Cons
- **OtOpcUa server box must still have AVEVA + MXAccess installed**, since
the worker runs locally. The major deployment win of Option 1
(separating where MXAccess runs from where OtOpcUa runs) is lost.
- **OtOpcUa still ships an x86 .NET 4.8 binary alongside it.** Even if we
vendor mxaccessgw's worker rather than write our own, installer complexity
and bitness considerations remain.
- **We re-implement everything gw already gives.** Process supervision,
watchdog, recycle policy, heartbeat, post-mortem — these are exactly what
`Galaxy.Host` does today, and they'd live in our repo again, just calling a
different worker binary.
- **No browse cache, no deploy gating, no `WatchDeployEvents`** — we keep
running our own ZB queries and our own `time_of_last_deploy` poll, or we
port gw's cache code into the driver. Either way it's duplicated logic.
- **No auth, no dashboard, no metrics.** Operability stays where it is today
(i.e., minimal). Adding it ourselves is a separate project.
- **Multiple OtOpcUa instances multiply MXAccess sessions.** Redundancy pair
→ two MXAccess clients on the Galaxy from the same software, vs. Option 1
where one gw arbitrates.
- **Worker protocol coupling without the contract surface.** We depend on
mxaccessgw's worker IPC frame format — a surface that mxaccessgw treats as
*internal* to its own gw↔worker boundary. If they refactor it, we have to
follow. The public gRPC contract (Option 1) is more stable by design.
- **Loses the "common MXAccess access point" benefit.** Other consumers
(CLI, integration harnesses, future tools) can't share state with our
embedded worker.
---
## Status quo (for comparison)
Keep `Galaxy.Host` as it is today and rip out historian, alarming, and the
probe manager in place. End state: the Host shrinks to `MxAccessClient` + `GalaxyRepository`,
which is roughly what Option 2 ends up looking like — but with our hand-rolled
COM bridge instead of mxaccessgw's worker. Not a serious option once
mxaccessgw exists; we'd be maintaining a parallel implementation of the same
thing.

---

## Recommendation (effort-agnostic)

**Go with Option 1 — Tier-A driver against the MxAccess Gateway.**

The decisive arguments:
1. **It's the only option that aligns Galaxy with how every other driver in
this repo is structured.** The user's stated goal — "keep lmx to data +
browsing, similar to other drivers" — only fully resolves if there is no
`.Host` and no x86 build artifact in this repo at all. Option 2 still has
an x86 child process and supervisor code; it's `Galaxy.Host` with a
different worker binary inside.
2. **It separates *where MXAccess runs* from *where OtOpcUa runs*.** That is
a strategically larger win than a few hundred microseconds of per-call
latency. The OPC UA server stops being chained to AVEVA install footprint,
bitness, and Wonderware client identity — which removes a class of
deployment, redundancy, and CI problems we hit today (e.g., the
`DESKTOP-6JL3KKO` Hyper-V/Docker conflict, the `dohertj2`-only pipe ACL,
the live-Galaxy smoke test prerequisites).
3. **It collapses scope.** A non-trivial fraction of `Galaxy.Host` (browse
cache, deploy-event watch, worker supervision, COM bridge, post-mortem,
recycle, ACL hardening) is reproduced *better* in mxaccessgw. Option 1
deletes our copy. Option 2 keeps it.
4. **It positions historian and alarming for the right home.** Once the
Galaxy driver is "just another driver", historian becomes a server-level
data source (one that can also feed Modbus/S7 history if we ever want it),
and alarming becomes a server-level A&E subsystem. Option 2 nominally
allows the same move, but the temptation to keep them in `Galaxy.Host`
"while we're already there" is real.
5. **It future-proofs against AVEVA's roadmap.** Managed NMX, ASB, or any
replacement that shows up over the next few years gets adopted in
mxaccessgw without a release in this repo.

The case for Option 2 is real but narrow: it's the right call **only** if we
commit to single-box deployments forever, refuse to take a gRPC dependency,
and value local-trust simplicity over the consolidation/operability benefits
gw provides. None of those constraints hold here.
### What flips the recommendation
- If the gw protocol proves unstable, or performance testing under our
subscription patterns turns out worse than expected → revisit Option 2.
- If org policy forbids running an MXAccess gateway as its own service →
Option 2.
- If Galaxy goes from one of several drivers to *the* primary driver and
raw call-rate matters more than architectural fit → revisit.

Otherwise: Option 1.

---

## Out-of-scope follow-ups (don't decide here, but flag them)
- **Where does the Wonderware Historian SDK live?** Likely its own
.NET 4.8 x86 sidecar exposing a small `IHistorianDataSource` over a pipe or
gRPC, plugged into the OPC UA server's HA service alongside any future
historian sources. Independent of which option above is chosen.
- **Alarm subsystem ownership.** Decide whether the server hosts a generic
AlarmCondition state machine driven by driver-advertised alarm metadata, or
whether each driver continues to emit pre-shaped alarm transitions. Galaxy's
4-attr quartet is a strong forcing function for the generic approach.
- **Redundancy + gw sessions.** Standby OtOpcUa holds an open gw session
(warm) vs. opens on takeover (cold). Affects gw worker count and Galaxy
client-identity collisions.
- **Auth between OtOpcUa and gw.** API key in DPAPI-protected secret file vs.
Windows-auth gRPC. Both supported by gw; pick before rollout.
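For the alarm-ownership question above, a generic server-side state machine could look roughly like the sketch below. It models only the `.InAlarm` and `.Acked` inputs and the Active / Acknowledged / Inactive transitions, with the acknowledged flag reset on each new activation. The class and method names are hypothetical, and priority, description, and ack-message writes are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmCondition:
    """Tracks one alarm from raw attribute values (illustrative only)."""
    in_alarm: bool = False
    acked: bool = True

    def on_in_alarm(self, value: bool) -> Optional[str]:
        """Feed a new .InAlarm value; return the transition name, if any."""
        if value == self.in_alarm:
            return None  # no edge, nothing to report
        self.in_alarm = value
        if value:
            # A new activation always needs a fresh acknowledgement.
            self.acked = False
            return "Active"
        return "Inactive"

    def on_acked(self, value: bool) -> Optional[str]:
        """Feed a new .Acked value; return the transition name, if any."""
        if value and not self.acked:
            self.acked = True
            return "Acknowledged"
        if not value:
            self.acked = False
        return None
```

A driver-agnostic subsystem would keep one such object per alarm-bearing node and feed it from ordinary tag subscriptions, which is why the driver only needs to flag which nodes are alarms.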