chore: drop root scratch + retired v2-mxgw plan docs

- Delete _p54.json / _p55.json (PR-body snapshots for the shipped S7
  + Mitsubishi research docs).
- Delete session.dat (38-byte CLI runtime cache, not produced by any
  current source code) and add it to .gitignore so it doesn't come
  back.
- Delete lmx_backend.md / lmx_mxgw.md / lmx_mxgw_impl.md. All three
  carried "✅ Completed 2026-04-30" historical-record banners. The
  v2-mxgw migration shipped + merged to master, so the design plans
  served their purpose. Drop the cross-refs from CLAUDE.md and
  docs/v1/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-30 09:47:52 -04:00
parent 33054c3275
commit ea045477ad
9 changed files with 4 additions and 1836 deletions

3
.gitignore vendored

@@ -37,3 +37,6 @@ src/ZB.MOM.WW.OtOpcUa.Server/config_cache.db
# E2E sidecar config — NodeIds are specific to each dev's local seed (see scripts/e2e/README.md)
scripts/e2e/e2e-config.json
config_cache*.db
# Client CLI/UI runtime scratch (last-connected endpoint cache)
session.dat


@@ -16,8 +16,7 @@ in this repo is .NET 10. PR 7.2 retired the legacy in-process
`Galaxy.Host` / `Galaxy.Proxy` / `Galaxy.Shared` projects + the
`OtOpcUaGalaxyHost` Windows service.
-See `lmx_mxgw.md` for the migration design and
-`docs/v2/Galaxy.Performance.md` for the runtime perf surface
+See `docs/v2/Galaxy.Performance.md` for the runtime perf surface
(tracing, metrics, soak harness).
## Architecture Overview


@@ -1 +0,0 @@
{"title":"Phase 3 PR 54 -- Siemens S7 Modbus TCP quirks research doc","body":"## Summary\n\nAdds `docs/v2/s7.md` (485 lines) covering Siemens SIMATIC S7 family Modbus TCP behavior. Mirrors the `docs/v2/dl205.md` template for future per-quirk implementation PRs.\n\n## Key findings for the implementation track\n\n- **No fixed memory map** — every S7 Modbus server is user-wired via `MB_SERVER`/`MODBUSCP`/`MODBUSPN` library blocks. Driver must accept per-site config, not assume a vendor layout.\n- **MB_SERVER requires non-optimized DBs** (STATUS `0x8383` if optimized). Most common field bug.\n- **Word order default = ABCD** (opposite of DL260). Driver's S7 profile default must be `ByteOrder.BigEndian`, not `WordSwap`.\n- **One port per MB_SERVER instance** — multi-client requires parallel FBs on 503/504/… Most clients assume port 502 multiplexes (wrong on S7).\n- **CP 343-1 Lean is server-only**, requires the `2XV9450-1MB00` license.\n- **FC20/21/22/23/43 all return Illegal Function** on every S7 variant — driver must not attempt FC23 bulk-read optimization for S7.\n- **STOP-mode behavior non-deterministic** across firmware bands — treat both read/write STOP-mode responses as unavailable.\n\nTwo items flagged as unconfirmed rumour (V2.0+ float byte-order claim, STOP-mode caching location).\n\nNo code, no tests — implementation lands in PRs 56+.\n\n## Test plan\n- [x] Doc renders as markdown\n- [x] 31 citations present\n- [x] Section structure matches dl205.md template","head":"phase-3-pr54-s7-research-doc","base":"v2"}


@@ -1 +0,0 @@
{"title":"Phase 3 PR 55 -- Mitsubishi MELSEC Modbus TCP quirks research doc","body":"## Summary\n\nAdds `docs/v2/mitsubishi.md` (451 lines) covering MELSEC Q/L/iQ-R/iQ-F/FX3U Modbus TCP behavior. Mirrors `docs/v2/dl205.md` template for per-quirk implementation PRs.\n\n## Key findings for the implementation track\n\n- **Module naming trap** — `QJ71MB91` is SERIAL RTU, not TCP. TCP module is `QJ71MT91`. Surface clearly in driver docs.\n- **No canonical mapping** — per-site 'Modbus Device Assignment Parameter' block (up to 16 entries). Treat mapping as runtime config.\n- **X/Y hex vs octal depends on family** — Q/L/iQ-R use HEX (X20 = decimal 32); FX/iQ-F use OCTAL (X20 = decimal 16). Helper must take a family selector.\n- **Word order CDAB default** across all MELSEC families (opposite of Siemens S7). Driver Mitsubishi profile default: `ByteOrder.WordSwap`.\n- **D-registers binary by default** (opposite of DL205's BCD default). Caller opts in to `Bcd16`/`Bcd32` when ladder uses BCD.\n- **FX5U needs firmware ≥ 1.060** for Modbus TCP server — older is client-only.\n- **FX3U-ENET vs FX3U-ENET-P502 vs FX3U-ENET-ADP** — only the middle one binds port 502; the last has no Modbus at all. Common operator mis-purchase.\n- **QJ71MT91 does NOT support FC22 / FC23** — iQ-R / iQ-F do. Bulk-read optimization must gate on capability.\n- **STOP-mode writes configurable** on Q/L/iQ-R/iQ-F (default accept), always rejected on FX3U-ENET.\n\nThree unconfirmed rumours flagged separately.\n\nNo code, no tests — implementation lands in PRs 58+.\n\n## Test plan\n- [x] Doc renders as markdown\n- [x] 17 citations present\n- [x] Per-model test naming matrix included (`Mitsubishi_QJ71MT91_*`, `Mitsubishi_FX5U_*`, `Mitsubishi_FX3U_ENET_*`, shared `Mitsubishi_Common_*`)","head":"phase-3-pr55-mitsubishi-research-doc","base":"v2"}


@@ -15,7 +15,6 @@ For current architecture see:
- `docs/drivers/Galaxy.md` — current Galaxy driver doc
- `docs/v2/Galaxy.ParityRig.md` — current testing setup
- `docs/v2/Galaxy.Performance.md` — observability + perf
- `lmx_mxgw.md` (in repo root) — design rationale for the migration
| File | What it covered |
|---|---|


@@ -1,282 +0,0 @@
> **✅ Completed 2026-04-30 — historical record of the v2-mxgw backend-options decision.**
>
> This document evaluated alternative backend topologies before the
> v2-mxgw migration. **Option 1 (in-process driver + gRPC gateway) was
> selected and implemented**; see `lmx_mxgw.md` for the design and
> `lmx_mxgw_impl.md` for the implementation plan. Both shipped at
> commit `ae7106d` (2026-04-30). Preserved here as the audit trail.
# Galaxy / LMX Backend — Restructuring Options
## Context
Today the Galaxy driver is structured very differently from every other driver
in this repo:
- **Galaxy.Proxy** (.NET 10, in-process): tiny shim that frames IPC to the host.
- **Galaxy.Host** (.NET Framework 4.8 **x86**, NSSM-wrapped Windows service):
owns MXAccess COM, the STA pump, the ZB Galaxy Repository SQL queries, the
Wonderware Historian SDK plugin, the per-platform `ScanState` probe manager,
the alarm tracker (`.InAlarm`/`.Priority`/`.DescAttrName`/`.Acked` state
machine + ack writer), recycle policy, and post-mortem MMF.
Other drivers (Modbus, S7, AB CIP, OpcUaClient, TwinCAT, FOCAS Tier-C) are
**in-process Tier-A drivers** in the .NET 10 server. They do data + browse
only; historian and alarming are driver-agnostic concerns at the server layer.
A sibling project, **mxaccessgw**
(`C:\Users\dohertj2\Desktop\mxaccessgw`), already provides:
- A .NET 10 x64 gRPC gateway in front of per-session .NET 4.8 x86 worker
processes that own MXAccess COM, the STA, and event sinks
(`MxGateway.Server` + `MxGateway.Worker`).
- A full MXAccess command + event surface (`Register`, `AddItem`, `Advise`,
`Write`, `WriteSecured`, `OnDataChange`, `OnWriteComplete`, etc.).
- A cached, deploy-gated, paged **Galaxy Repository browse** RPC
(`galaxy_repository.v1`) reading the same ZB tables we read today, with the
query bodies kept byte-identical to OtOpcUa.
- A .NET client library (`clients/dotnet/MxGateway.Client`).
- API-key auth, Blazor dashboard, structured logs, metrics, watchdog/recycle.
The proposal is to **strip Galaxy down to data + browse** — push historian and
alarming out to server-level subsystems where they live for every other driver
— and pick how the slimmed-down driver talks to MXAccess.
---
## What "push historian and alarming out" means
Both options below assume the same scope reduction; they only differ in how
the driver reaches MXAccess.
| Concern | Today (Galaxy.Host) | After |
|---|---|---|
| Galaxy hierarchy browse | `GalaxyRepository` (SQL) inside Host | Driver (Option 1: via gw browse RPC; Option 2: own SQL or worker) |
| Live read / write / subscribe | `MxAccessClient` + STA pump in Host | gw (Option 1) or embedded worker (Option 2) |
| Wonderware Historian SDK | `HistorianDataSource` in Host (x86) | Separate Historian data source plugged into the server's HA service. Likely stays its own .NET 4.8 x86 sidecar because the SDK is x86-only; **independent of the Galaxy driver lifecycle**. |
| Alarm state machine (`.InAlarm`/`.Acked` quartet, transitions, ack writer) | `GalaxyAlarmTracker` in Host | Server-level A&E subsystem subscribes to alarm-bearing attributes the driver advertises and runs the AlarmCondition state machine generically. Driver only flags `IsAlarm=true` in node metadata. |
| `ScanState` per-platform probes | `GalaxyRuntimeProbeManager` in Host | Driver-side: ScanState is just another tag subscription; the driver re-advises one per discovered `$WinPlatform`/`$AppEngine` and reports `HostConnectivityStatus` from the value stream. No special host-side machinery. |
After the strip-down, the Galaxy driver looks like Modbus or OpcUaClient: it
discovers nodes, reads/writes/subscribes, and reports per-host transport
health. Everything else is the server's problem.
---
## Option 1 — Tier-A driver against the MxAccess Gateway
`Driver.Galaxy` becomes a regular **in-process .NET 10 driver** in the OtOpcUa
server (no `.Host`, no `.Proxy` split, no x86). It talks to a separately
deployed `MxGateway.Server` over gRPC using `MxGateway.Client`. Browse comes
from `galaxy_repository.v1.DiscoverHierarchy`. Live data comes from
`MxAccessGateway.OpenSession`/`AddItem`/`Advise`/`StreamEvents`.
```
OtOpcUa.Server (.NET 10 x64)
└── Driver.Galaxy (in-proc, .NET 10)
└── gRPC ──► MxGateway.Server (.NET 10 x64)
└── pipe ──► MxGateway.Worker (.NET 4.8 x86)
└── MXAccess COM (STA)
```
### Pros
- **Architectural parity with other drivers.** No bespoke `Host` service, no
x86 build target, no NSSM wrapper, no STA pump in this repo, no
`PostMortemMmf`/`RecyclePolicy` we maintain ourselves.
- **OtOpcUa server stops needing AVEVA installed on its own host.** The
gateway runs where MXAccess lives; the OPC UA server can live on a different
box, in a container, or on a hardened jump host.
- **One canonical MXAccess surface across the org.** Any future tool — a
diagnostic CLI, a Historian replacement, an integration harness — talks to
the same gw with the same parity guarantees we get.
- **Multi-instance friendly.** Two OtOpcUa servers (warm/hot redundancy) share
one gw and one MXAccess footprint instead of each running their own
`Galaxy.Host` with duplicate Wonderware client identities.
- **Browse + cache for free.** `galaxy_repository.v1` already implements the
hierarchy cache, deploy-time gating, paging, and `WatchDeployEvents` — we
delete `GalaxyRepository.cs`, `GalaxyHierarchyRow.cs`, the change-detection
poll loop, and the matching SQL plumbing.
- **Operability for free.** API-key auth, Blazor dashboard at `/dashboard`,
metrics via `Meter`, structured logs with redaction. We currently have
none of that in `Galaxy.Host`.
- **Future backend swap.** When AVEVA exposes managed NMX or another modern
path, gw routes to it without OtOpcUa changes (gw's stated roadmap).
- **Tighter blast radius.** A hung COM event, a leaking COM object, a
crashing worker — all owned by gw's session/worker isolation, not the
OPC UA server process.
- **Simpler version story for OtOpcUa.** Driver is plain .NET 10; the
bitness/runtime split lives entirely in mxaccessgw's repo.
### Cons
- **Extra deployment dependency.** mxaccessgw is now a service that has to be
installed, monitored, and kept on a compatible protocol version. For a
single-box install this is one more moving piece.
- **Two hops on every call** (driver→gw, gw→worker) instead of one
(proxy→host). Today's hop is MessagePack over a named pipe; the new outer
hop is gRPC over TCP. Per-call overhead is a few hundred microseconds: not
a regression for OPC UA workloads, but measurable for very chatty bursts.
- **Auth/secret surface added.** OtOpcUa now holds an API key for gw and
rotates it; gw's SQLite-backed key store has to be managed.
- **Failure model spans two processes we don't own** — gw + worker. Reconnect
logic in our driver has to ride both: gw transport drop, gw session lease
expiry, gw-detected worker crash, plus the worker's own MXAccess reconnect.
All of it is exposed in the gRPC contract, but it's still surface area.
- **Cross-repo protocol coupling.** Bumping `mxaccessgw` major version (gRPC
contract changes, session shape changes) ripples into OtOpcUa releases.
Mitigated by versioned contracts; not free.
- **Galaxy redundancy still has to think about gw.** A redundancy fail-over of
OtOpcUa is independent of the gw's session lifecycle. Need to decide whether
the standby holds an open session or only opens it on takeover.
- **Sensitive writes (`WriteSecured`, `AuthenticateUser`) cross the network**
if gw is remote. TLS + mTLS solves it but adds setup.
---
## Option 2 — Embed mxaccessgw worker, no gateway
`Driver.Galaxy` is still in-process .NET 10, but instead of speaking gRPC to a
gateway service, it directly **launches and supervises one (or more)
`MxGateway.Worker` processes** and talks to them over the same named-pipe
worker protocol gw uses internally
(`docs/WorkerFrameProtocol.md`, `docs/WorkerProcessLauncher.md`). Browse stays
local — driver runs the SQL queries against ZB itself.
```
OtOpcUa.Server (.NET 10 x64)
└── Driver.Galaxy (in-proc, .NET 10)
├── ZB SQL (local, in-proc)
└── pipe ──► MxGateway.Worker (.NET 4.8 x86, child process)
└── MXAccess COM (STA)
```
### Pros
- **One hop, not two.** Driver → worker pipe is the same shape as today's
Proxy → Host pipe. Latency is on par with the current implementation.
- **No new service to deploy.** Worker is launched as a child process the
same way `Galaxy.Host` is launched today (just with mxaccessgw's worker
binary). Single-machine install story stays simple.
- **Keeps the trust boundary local.** No API keys, no TLS, no exposed gRPC
port on the OtOpcUa box.
- **Reuses mxaccessgw's parity-tested worker code** — STA pump, COM lifetime,
event conversion, fault model — without inheriting gw's ASP.NET Core /
Blazor / SQLite footprint.
- **Tighter ownership.** OtOpcUa owns the worker lifecycle; recycle, kill,
restart, post-mortem all decided by the driver, not by an external service
we don't control.
- **Easier to reason about during integration tests.** No second service to
spin up in CI; just a child process per test fixture.
### Cons
- **OtOpcUa server box must still have AVEVA + MXAccess installed**, since
the worker runs locally. The major deployment win of Option 1
(separating where MXAccess runs from where OtOpcUa runs) is lost.
- **OtOpcUa still ships an x86 .NET 4.8 binary alongside it.** Even if we
vendor mxaccessgw's worker rather than write our own, installer complexity
and bitness considerations remain.
- **We re-implement everything gw already gives.** Process supervision,
watchdog, recycle policy, heartbeat, post-mortem — these are exactly what
`Galaxy.Host` does today, and they'd live in our repo again, just calling a
different worker binary.
- **No browse cache, no deploy gating, no `WatchDeployEvents`** — we keep
running our own ZB queries and our own `time_of_last_deploy` poll, or we
port gw's cache code into the driver. Either way it's duplicated logic.
- **No auth, no dashboard, no metrics.** Operability stays where it is today
(i.e., minimal). Adding it ourselves is a separate project.
- **Multiple OtOpcUa instances multiply MXAccess sessions.** Redundancy pair
→ two MXAccess clients on the Galaxy from the same software, vs. Option 1
where one gw arbitrates.
- **Worker protocol coupling without the contract surface.** We depend on
mxaccessgw's worker IPC frame format — a surface that mxaccessgw treats as
*internal* to its own gw↔worker boundary. If they refactor it, we have to
follow. The public gRPC contract (Option 1) is more stable by design.
- **Loses the "common MXAccess access point" benefit.** Other consumers
(CLI, integration harnesses, future tools) can't share state with our
embedded worker.
---
## Status quo (for comparison)
Keep `Galaxy.Host` as it is today, but rip out historian + alarming + probe
manager in place. End state: the Host shrinks to `MxAccessClient` + `GalaxyRepository`,
which is roughly what Option 2 ends up looking like — but with our hand-rolled
COM bridge instead of mxaccessgw's worker. Not a serious option once
mxaccessgw exists; we'd be maintaining a parallel implementation of the same
thing.
---
## Recommendation (effort-agnostic)
**Go with Option 1 — Tier-A driver against the MxAccess Gateway.**
The decisive arguments:
1. **It's the only option that aligns Galaxy with how every other driver in
this repo is structured.** The user's stated goal — "keep lmx to data +
browsing, similar to other drivers" — only fully resolves if there is no
`.Host` and no x86 build artifact in this repo at all. Option 2 still has
an x86 child process and supervisor code; it's `Galaxy.Host` with a
different worker binary inside.
2. **It separates *where MXAccess runs* from *where OtOpcUa runs*.** That is
a strategically larger win than a few hundred microseconds of per-call
latency. The OPC UA server stops being chained to AVEVA install footprint,
bitness, and Wonderware client identity — which removes a class of
deployment, redundancy, and CI problems we hit today (e.g., the
`DESKTOP-6JL3KKO` Hyper-V/Docker conflict, the `dohertj2`-only pipe ACL,
the live-Galaxy smoke test prerequisites).
3. **It collapses scope.** A non-trivial fraction of `Galaxy.Host` (browse
cache, deploy-event watch, worker supervision, COM bridge, post-mortem,
recycle, ACL hardening) is reproduced *better* in mxaccessgw. Option 1
deletes our copy. Option 2 keeps it.
4. **It positions historian and alarming for the right home.** Once the
Galaxy driver is "just another driver", historian becomes a server-level
data source (one that can also feed Modbus/S7 history if we ever want it),
and alarming becomes a server-level A&E subsystem. Option 2 nominally
allows the same move, but the temptation to keep them in `Galaxy.Host`
"while we're already there" is real.
5. **It future-proofs against AVEVA's roadmap.** Managed NMX, ASB, or any
replacement that shows up over the next few years gets adopted in
mxaccessgw without a release in this repo.
The case for Option 2 is real but narrow: it's the right call **only** if we
commit to single-box deployments forever, refuse to take a gRPC dependency,
and value local-trust simplicity over the consolidation/operability benefits
gw provides. None of those constraints hold here.
### What flips the recommendation
- If the gw protocol proves unstable, or perf testing under our subscription
patterns turns out worse than expected → revisit Option 2.
- If org-policy forbids running an MXAccess gateway as its own service →
Option 2.
- If Galaxy goes from one of several drivers to *the* primary driver and
raw call-rate matters more than architectural fit → revisit.
Otherwise: Option 1.
---
## Out-of-scope follow-ups (don't decide here, but flag them)
- **Where does the Wonderware Historian SDK live?** Likely its own
.NET 4.8 x86 sidecar exposing a small `IHistorianDataSource` over a pipe or
gRPC, plugged into the OPC UA server's HA service alongside any future
historian sources. Independent of which option above is chosen.
- **Alarm subsystem ownership.** Decide whether the server hosts a generic
AlarmCondition state machine driven by driver-advertised alarm metadata, or
whether each driver continues to emit pre-shaped alarm transitions. Galaxy's
4-attr quartet is a strong forcing function for the generic approach.
- **Redundancy + gw sessions.** Standby OtOpcUa holds an open gw session
(warm) vs. opens on takeover (cold). Affects gw worker count and Galaxy
client-identity collisions.
- **Auth between OtOpcUa and gw.** API key in DPAPI-protected secret file vs.
Windows-auth gRPC. Both supported by gw; pick before rollout.


@@ -1,486 +0,0 @@
> **✅ Completed 2026-04-30 — historical record of the v2-mxgw migration design.**
>
> This document is the design doc that drove the migration from the
> legacy out-of-process Galaxy.Host topology to the in-process
> GalaxyDriver + mxaccessgw architecture. Option 1 (the in-process
> driver path) was selected and implemented across 39 PRs spanning
phases 0–7, merged to master at commit `ae7106d`. For current
> architecture see `CLAUDE.md`, `docs/drivers/Galaxy.md`, and
> `docs/v2/Galaxy.Performance.md`.
# Galaxy → MxAccessGateway Migration Plan
Implements **Option 1** from `lmx_backend.md`: replace the bespoke `Galaxy.Host`
+ `Galaxy.Proxy` IPC pair with an **in-process Tier-A** `Driver.Galaxy` running
in the .NET 10 OtOpcUa server, talking to a separately-deployed
`MxGateway.Server` (mxaccessgw repo) over gRPC for live MXAccess work and
Galaxy Repository browse.
## Outcome
After this work:
- `OtOpcUa.Server` is fully .NET 10 x64 — no x86 build artifacts in this repo.
- `Driver.Galaxy.Host` (Windows service, NSSM-wrapped, .NET 4.8 x86) is
retired. `Driver.Galaxy.Proxy` and `Driver.Galaxy.Shared` are deleted.
AVEVA platform is no longer required on the OtOpcUa box.
- A new in-process `Driver.Galaxy` lives next to `Driver.Modbus`,
`Driver.OpcUaClient`, etc. It implements the same `IDriver` capability set
the proxy implements today, but its body calls `MxGateway.Client`
(`MxGatewayClient`, `MxGatewaySession`, `GalaxyRepositoryClient`).
- Wonderware Historian SDK access moves out of the Galaxy driver into a
driver-agnostic historian data source (`Driver.Historian.Wonderware`,
separate sidecar, .NET 4.8 x86). The OPC UA HA service plugs into it the
same way it would plug into any future historian.
- Alarm condition tracking moves out of the driver into the OPC UA server's
generic A&E subsystem. The driver only flags `IsAlarm=true` on attribute
metadata and forwards live `.InAlarm`/`.Acked`/etc value changes; the
server runs the AlarmCondition state machine.
- Per-platform `ScanState` probes degrade to plain attribute subscriptions —
no special probe manager.
---
## Pre-flight: improvements to land in mxaccessgw first
These are **integration-quality changes** in the mxaccessgw repo that make
the OtOpcUa side dramatically simpler / faster / more robust. They aren't
strictly required to start, but ship enough of them before phase 3 that we're
not designing around gaps.
### gw-1. Galaxy attribute metadata parity
**What's there:** `galaxy_repository.v1.DiscoverHierarchy` returns
`GalaxyObject` with name, parent, category, and dynamic attributes.
**What's missing for OtOpcUa:** every field today's `MxAccessGalaxyBackend`
copies into `GalaxyAttributeInfo` — confirm gw's `Attribute` proto carries:
- `mx_data_type` (int)
- `is_array` (bool)
- `array_dimension` (uint, optional)
- `security_classification` (int)
- `is_historized` (bool, from `HistorizedExtension` primitive)
- `is_alarm` (bool, from `AlarmExtension` primitive)
If any are missing, add them to the proto and the server-side query mapper.
Without `IsAlarm` and `IsHistorized` the OPC UA server can't decide which
nodes get HasHistoricalConfiguration / which become AlarmConditions.
### gw-2. Stable, documented event-stream resume semantics
**What's needed:** the OtOpcUa driver must survive a transient gw transport
drop without losing subscription state or duplicating change events. gw's
`StreamEventsAsync(afterWorkerSequence)` already exposes resumption.
Document the per-session retention window (how long does the worker buffer
events the gateway hasn't acked?) and the "events were dropped, you must
re-subscribe" signal. If retention is bounded by count rather than time,
expose the bound in `OpenSessionReply` so the client can size its own buffer.
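
A minimal Python sketch of the client-side bookkeeping that resumption implies. It assumes worker sequence numbers are contiguous integers per session, which this plan does not state. `ResumeTracker` and its method names are hypothetical and are not part of `MxGateway.Client`; a real driver would pass `last_sequence` as the `afterWorkerSequence` argument when it reopens the stream.

```python
from dataclasses import dataclass, field


@dataclass
class ResumeTracker:
    """Tracks the last worker sequence seen so a stream can be resumed."""

    last_sequence: int = 0
    delivered: list = field(default_factory=list)

    def accept(self, sequence: int, payload: str) -> str:
        """Classify one event as 'duplicate', 'gap', or 'delivered'."""
        if sequence <= self.last_sequence:
            return "duplicate"  # already seen before the drop, so ignore it
        if sequence != self.last_sequence + 1:
            return "gap"  # events were lost, so the caller must re-subscribe
        self.last_sequence = sequence
        self.delivered.append(payload)
        return "delivered"


# Simulated stream: event 2 is replayed after a reconnect, event 4 is lost.
tracker = ResumeTracker()
results = [tracker.accept(seq, f"event-{seq}") for seq in (1, 2, 2, 3, 5)]
```

A "gap" result is the client-side counterpart of the "events were dropped, you must re-subscribe" signal; if gw sends that signal explicitly, the gap check becomes a safety net rather than the primary mechanism.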
### gw-3. Reconnectable sessions
Listed under "post-v1 revisit" in `gateway.md`. Without it, every gw or
OtOpcUa restart re-`Register`s, re-`AddItem`s, re-`Advise`s the entire
address space — for a 50k-tag Galaxy that's a non-trivial cold-start. With
reconnectable sessions, the driver presents its `SessionId` after a restart
and the worker keeps its handles.
If full reconnection is too large, ship a **bulk replay** instead: a single
RPC that takes the full subscription set and the worker performs the
register/add/advise inside one round trip. We can drive it from a
client-side cache rather than gw state. See gw-5 below.
### gw-4. Driver-shaped subscribe primitive
`MxGatewaySession` already has `SubscribeBulkAsync` (one RPC: `Register`
implicit + `AddItem` + `Advise` for a list of tag addresses, returning
per-tag `SubscribeResult`). That's exactly what `ISubscribable.SubscribeAsync`
wants. Confirm it returns enough per-tag detail to surface a partial-failure
list to OPC UA monitored items (good handle, status code, error text).
If it does not already exist, expose **`SubscribeBulk` with an optional update-rate hint**
forwarded to `SetBufferedUpdateInterval` so the OPC UA publishing interval
becomes a single field on the subscribe call rather than a follow-up RPC.
### gw-5. Subscription replay snapshot
Provide an RPC `ReplaySubscriptionsAsync(SessionId, IEnumerable<TagAddress>)`
that re-establishes a list of subscriptions after a session reset and returns
per-tag results. The client stores its tag list locally (the driver already
has it from `Discover`), and the gw worker turns it into one
register/add/advise sequence. This is the minimum surface we need; full
"reattach to a previous session by id" (gw-3) is a richer version of the
same thing.
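
The per-tag result shape can be sketched in Python as follows. This models only the idea that a replay reports one result per tag instead of failing on the first bad tag; `replay_subscriptions`, `SubscribeResult`, and `fake_subscribe` are hypothetical names, and the rejection rule in `fake_subscribe` is invented purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubscribeResult:
    tag: str
    ok: bool
    error: str = ""


def replay_subscriptions(tags, subscribe_one):
    """Re-establish every tag and report a result per tag, never failing fast."""
    results = []
    for tag in tags:
        try:
            subscribe_one(tag)  # stand-in for register/add/advise on the worker
            results.append(SubscribeResult(tag, True))
        except ValueError as exc:
            results.append(SubscribeResult(tag, False, str(exc)))
    return results


def fake_subscribe(tag):
    # Hypothetical worker behaviour: tags without an attribute part are rejected.
    if "." not in tag:
        raise ValueError(f"unknown tag: {tag}")


replayed = replay_subscriptions(["Tank1.Level", "Bogus", "Pump2.Speed"], fake_subscribe)
failed = [r.tag for r in replayed if not r.ok]
```

Returning the full list lets the driver keep the good subscriptions alive and surface only the failed tags to the affected OPC UA monitored items.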
### gw-6. Transport-health stream
The gw already exposes worker / session health on its dashboard. Add a small
streaming RPC `StreamSessionHealth(SessionId) → stream SessionHealth` so the
OtOpcUa driver can surface "MXAccess transport up/down" to its
`IHostConnectivityProbe` without faking it via probe-tag subscriptions.
Today `MxAccessClient.ConnectionStateChanged` does this in-process; we want
the same signal at the gw boundary.
### gw-7. Optional .NET 10 client polish
- Async-disposable session pattern is already there.
- Add a **typed `MxValue` ⇄ `object` adapter** for the seven Galaxy types
OtOpcUa cares about (Boolean, Int32, Float, Double, String, DateTime,
arrays of the same). Today every consumer writes its own `MxValue.From<T>`
helpers; this shaves boilerplate from the driver.
- Add a **`SubscribeWithCallback`** convenience wrapper that combines
`OpenSession` + `SubscribeBulk` + `StreamEvents` and routes events through
a delegate per tag. Keeps the OPC UA driver from re-implementing the
fan-out / sequencer pattern.
### gw-8. Auth minimums
Document API-key scoping as it applies to OtOpcUa: the server identity needs
`session`, `invoke`, `event`, and `metadata:read` scopes. Provide a CLI to
mint a key bound to those scopes for an OtOpcUa instance.
### gw-9. Performance: bulk paths and value coalescing
- Confirm `SubscribeBulkAsync` is implemented as a single MXAccess
`AddItem`+`Advise` loop on the worker, not N pipe round trips. If not, fix
before we drive 50k-tag Galaxies through it.
- Expose `SetBufferedUpdateInterval` per session so OtOpcUa can request
buffered updates at the OPC UA publishing interval and get one batched
`OnBufferedDataChange` per tick rather than N `OnDataChange` events.
These can all ship in mxaccessgw independently and improve every consumer.
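
To illustrate the batching effect described in gw-9, here is a small Python sketch that groups individual change events into one batch per update interval. It is only a model of the intended outcome, not of how MXAccess or the worker implements buffered updates; `batch_by_tick` and the event tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def batch_by_tick(events, interval_ms):
    """Group (timestamp_ms, tag, value) events into one batch per interval."""
    batches = defaultdict(list)
    for timestamp_ms, tag, value in events:
        tick = timestamp_ms // interval_ms  # index of the interval the event falls in
        batches[tick].append((tag, value))
    return [batches[tick] for tick in sorted(batches)]


# Four change events, delivered as two batches at a 1000 ms interval.
events = [(0, "A", 1), (400, "B", 2), (900, "A", 3), (1100, "A", 4)]
batches = batch_by_tick(events, interval_ms=1000)
```

With the interval set to the OPC UA publishing interval, the consumer handles one batch per publish cycle instead of one callback per change.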
---
## OtOpcUa-side improvements to land in parallel
Some are forced by removing `Galaxy.Host`; others are quality-of-life.
### ot-1. Promote `IHistorianDataSource` to a server-level extension point
Today `IHistorianDataSource` is a Galaxy-internal abstraction in
`Driver.Galaxy.Host`. Lift it to `OtOpcUa.Core.Abstractions` (or a similar
home next to `IDriver`) and let the OPC UA HA service consume **any number
of registered data sources** keyed by node namespace. Drivers don't own
historian access; the server mounts data sources alongside drivers. This is
the prerequisite that lets us move Wonderware Historian out of the Galaxy
driver without losing the feature.
### ot-2. Generic alarm condition state machine in the server
Move the `.InAlarm`/`.Priority`/`.DescAttrName`/`.Acked` quartet handling
out of `GalaxyAlarmTracker` into a server-level alarm subsystem keyed off the
`IsAlarm=true` flag drivers set during discovery. The server subscribes to
the four sub-attributes itself and runs the AlarmCondition state machine.
Driver only:
- declares `IsAlarm=true` in `DriverAttributeInfo`,
- forwards plain attribute value changes (already done by `ISubscribable`).
This is also a precondition for future drivers (Modbus DL205 alarm bits,
S7 alarm DBs) to emit alarms without each writing their own tracker.
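
A minimal Python sketch of such a driver-agnostic state machine follows. It models only the `.InAlarm` and `.Acked` sub-attributes and assumes that a new activation always requires a fresh acknowledgement; that rule, the `AlarmCondition` class, and the event names are illustrative assumptions, not the behaviour of `GalaxyAlarmTracker`.

```python
from dataclasses import dataclass


@dataclass
class AlarmCondition:
    """Tracks one alarm-bearing node from plain sub-attribute value changes."""

    in_alarm: bool = False
    acked: bool = True

    def on_value(self, attribute: str, value: bool):
        """Apply one sub-attribute change and return an event name, or None."""
        if attribute == "InAlarm" and value != self.in_alarm:
            self.in_alarm = value
            if value:
                self.acked = False  # assumption: each activation needs a new ack
                return "Active"
            return "Inactive"
        if attribute == "Acked" and value and not self.acked:
            self.acked = True
            return "Acknowledged"
        return None  # no state transition, so nothing to emit


condition = AlarmCondition()
changes = [("InAlarm", True), ("InAlarm", True), ("Acked", True),
           ("InAlarm", False), ("Acked", True)]
transitions = [condition.on_value(attr, val) for attr, val in changes]
```

Because the class sees only attribute names and values, the same logic could in principle serve a Modbus or S7 driver once those drivers supply their own attribute naming.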
### ot-3. Driver capabilities trim
After ot-1 and ot-2, `Driver.Galaxy` no longer needs to implement:
- `IHistoryProvider` (server's HA service handles it via Wonderware
historian data source)
- `IAlarmHistorianWriter` (server's A&E historian, or kept generic — Galaxy
shouldn't own the SQLite path)
- `IAlarmSource` ack route (server-level alarm subsystem writes back via the
driver's `IWritable.WriteAsync`, which the gw already supports)
Keep:
- `IDriver`, `ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`,
`IRediscoverable`, `IHostConnectivityProbe`.
### ot-4. Treat `time_of_last_deploy` as `IRediscoverable`'s pump
Replace the Host-side change-detection poll with a managed
`GalaxyRepositoryClient.WatchDeployEventsAsync` consumer in the driver.
Each event raises `OnRediscoveryNeeded` with the new deploy time as the
`scopeHint`. No polling code in this repo.
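
The event-driven shape can be sketched in Python as below. `DeployWatcher` here is a hypothetical stand-in, and skipping an event whose deploy time repeats the previous one is an added assumption; the plan only says that each event raises `OnRediscoveryNeeded` with the deploy time as the hint.

```python
class DeployWatcher:
    """Turns deploy events into rediscovery callbacks, skipping repeats."""

    def __init__(self, on_rediscovery_needed):
        self._on_rediscovery_needed = on_rediscovery_needed
        self._last_deploy_time = None

    def handle_event(self, deploy_time):
        """Raise the callback for a new deploy time; return whether it fired."""
        if deploy_time == self._last_deploy_time:
            return False  # same deploy seen again, so no rediscovery is needed
        self._last_deploy_time = deploy_time
        self._on_rediscovery_needed(deploy_time)  # deploy time acts as the scope hint
        return True


hints = []
watcher = DeployWatcher(hints.append)
handled = [
    watcher.handle_event(t)
    for t in ("2026-04-01T10:00:00Z", "2026-04-01T10:00:00Z", "2026-04-02T08:30:00Z")
]
```

The driver reacts to events pushed by the stream, which is what removes the need for a polling loop on its side.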
### ot-5. Connection pool at the server, not the driver
If the redundancy pair runs two OtOpcUa instances against one gw, the two
should each reuse a single `GrpcChannel` per process (already the gRPC default) but
**different sessions** (one MXAccess client identity per OtOpcUa instance,
not one shared session that fights over Wonderware client state). Encode
the per-instance MXAccess client name in driver config — already partly
there (`OTOPCUA_GALAXY_CLIENT_NAME`); make it explicit in the new driver's
`appsettings.json` shape.
---
## Phased implementation
Each phase is a working, mergeable slice. Keep `Galaxy.Host` running
alongside the new driver until phase 7 — gated by a config switch
`Galaxy:Backend = legacy-host | mxgateway`.
### Phase 0 — pre-flight (mxaccessgw repo)
Ship gw-1, gw-2, gw-4, gw-9 (the parity, performance, and contract bits the
plan immediately depends on). gw-3, gw-5, gw-6, gw-7 can come during or
after phase 5.
**Exit:** local OtOpcUa dev box can `MxGatewayClient.Create` a client, open a
session, `SubscribeBulkAsync` 100 tags, and observe `OnDataChange` events at
the configured update rate.
### Phase 1 — server-level historian extension point (ot-1)
1. Extract `IHistorianDataSource` (and its DTOs `HistorianSample`,
`HistorianAggregateSample`, `HistoricalEvent`) from
`Driver.Galaxy.Host/Backend/Historian/` into
`src/ZB.MOM.WW.OtOpcUa.Core/Abstractions/Historian/`.
2. Extend the OPC UA HA service to look up a registered
`IHistorianDataSource` per namespace and call into it for `HistoryRead`,
`HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`. Drivers
stop implementing `IHistoryProvider` directly; the server proxies.
3. Add a no-op default registration so drivers without history keep working.
**Exit:** all current Galaxy history reads route through an
`IHistorianDataSource` registered by `Driver.Galaxy.Host` (still legacy)
without behavior change. Other drivers untouched.
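
The per-namespace lookup with a no-op default (steps 2 and 3) might look like this in outline. This is a Python sketch under stated assumptions: `HistorianRegistry`, `NoOpHistorian`, `FakeGalaxyHistorian`, and `read_raw` are hypothetical names chosen for the example and do not reflect the actual `IHistorianDataSource` members.

```python
class NoOpHistorian:
    """Default data source for namespaces that have no history."""

    def read_raw(self, node_id, start, end):
        return []


class HistorianRegistry:
    """Maps a node namespace to the data source that serves its history."""

    def __init__(self):
        self._sources = {}
        self._default = NoOpHistorian()

    def register(self, namespace, source):
        self._sources[namespace] = source

    def resolve(self, namespace):
        # Fall back to the no-op source so drivers without history keep working.
        return self._sources.get(namespace, self._default)


class FakeGalaxyHistorian:
    """Test double that returns two fixed samples for any node."""

    def read_raw(self, node_id, start, end):
        return [(start, 1.5), (end, 2.5)]


registry = HistorianRegistry()
registry.register("Galaxy", FakeGalaxyHistorian())
galaxy_samples = registry.resolve("Galaxy").read_raw("Tank1.Level", 0, 60)
modbus_samples = registry.resolve("Modbus").read_raw("HR100", 0, 60)
```

Resolving an unregistered namespace yields an empty result rather than an error, which matches the goal of leaving other drivers untouched.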
### Phase 2 — server-level alarm subsystem (ot-2)
1. Add an `IAlarmConditionDeclaration` API on the address-space builder so
discovery can flag a node as alarm-bearing and supply the four
sub-attribute references.
2. Add a hosted `AlarmConditionService` in the server that, on driver
`Discover`, subscribes to the four sub-attributes via the driver's own
`ISubscribable`, runs the state machine, and emits
`IAlarmSource.OnAlarmEvent` itself. Acks route back through the driver's
`IWritable.WriteAsync` to the `.AckMsg` attribute.
3. Add Galaxy-specific defaults (sub-attribute naming) as a small adapter
so the same service can serve future drivers with different conventions.
**Exit:** Galaxy alarms still work end-to-end; the tracker code that runs
inside `Galaxy.Host` is dead but kept for the legacy-host backend path.
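
To make the state machine in step 2 concrete, here is a deliberately simplified Python sketch. It assumes only two of the sub-attributes matter for state (`in_alarm` and `acked`, both hypothetical names, since the actual Galaxy sub-attribute names are supplied by the adapter in step 3) and uses the three states the parity matrix later compares (Active / Acked / Inactive). The real service would also carry priority and description.

```python
class AlarmCondition:
    """Tracks one alarm-bearing node and emits an event on each state change."""

    def __init__(self, node_id: str, on_event) -> None:
        self.node_id = node_id
        self.state = "Inactive"
        self._in_alarm = False
        self._acked = False
        self._on_event = on_event  # callback(node_id, new_state)

    def on_sub_attribute_change(self, name: str, value) -> None:
        if name == "in_alarm":
            new_value = bool(value)
            if new_value and not self._in_alarm:
                # A fresh activation always starts unacknowledged.
                self._acked = False
            self._in_alarm = new_value
        elif name == "acked":
            self._acked = bool(value)

        if not self._in_alarm:
            new_state = "Inactive"
        elif self._acked:
            new_state = "Acked"
        else:
            new_state = "Active"

        if new_state != self.state:
            self.state = new_state
            self._on_event(self.node_id, new_state)
```

Emitting only on a change of state keeps repeated sub-attribute updates from producing duplicate alarm events.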
### Phase 3 — Wonderware Historian sidecar (`Driver.Historian.Wonderware`)
1. New solution project: `Driver.Historian.Wonderware`, .NET 4.8 x86,
console app + NSSM (mirrors today's Galaxy.Host packaging exactly,
minus Galaxy responsibilities).
2. Hosts the existing `HistorianDataSource`, `HistorianClusterEndpointPicker`,
`HistorianHealthSnapshot` code lifted from `Galaxy.Host/Backend/Historian/`
and exposes them over a small named-pipe protocol (or local gRPC if
.NET 4.8 cost is acceptable; named pipe is simpler).
3. Add `Driver.Historian.Wonderware.Client` — .NET 10 — implementing
`IHistorianDataSource` against the sidecar.
4. Server registers it as a data source for the `Galaxy` namespace.
**Exit:** OPC UA history reads work via the sidecar with the legacy-host
backend still in place. We've decoupled history from MXAccess.
### Phase 4 — new `Driver.Galaxy` against gw
This is the meat. New project: `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/`, .NET 10,
in-process. Capabilities (post ot-3): `IDriver`, `ITagDiscovery`, `IReadable`,
`IWritable`, `ISubscribable`, `IRediscoverable`, `IHostConnectivityProbe`.
Shape:
```
Driver.Galaxy/
GalaxyDriver.cs # IDriver root
Browse/
GalaxyDiscoverer.cs # consumes GalaxyRepositoryClient.DiscoverHierarchyAsync
DataTypeMap.cs # mx_data_type → DriverDataType
SecurityMap.cs # security_classification → SecurityClassification
Runtime/
GalaxyMxSession.cs # owns one MxGatewaySession; Register + map per-driver client name
SubscriptionRegistry.cs # tag → server/item handles; persists to memory only
EventPump.cs # consumes session.StreamEventsAsync, fans out to OnDataChange
ReconnectSupervisor.cs # gw transport drop / session-lost recovery
DeployWatcher.cs # GalaxyRepositoryClient.WatchDeployEventsAsync → OnRediscoveryNeeded
Health/
HostConnectivityForwarder.cs # gw-6 SessionHealth → IHostConnectivityProbe
Config/
GalaxyDriverOptions.cs # endpoint, ApiKey, ClientName, TLS, retry, intervals
GalaxyDriverFactoryExtensions.cs # AddGalaxyDriver(IServiceCollection)
```
Key behaviors:
- **Discovery** calls `GalaxyRepositoryClient.DiscoverHierarchyAsync()`
once at init and on every `WatchDeployEvents` event, then drives the
address space builder. Same node naming as today (parent contained-name
hierarchy + leaf attributes named `tag_name.AttributeName`).
- **Read**: a one-off `AddItem` + `Advise` + read-after-first-callback
sequence is overkill. Instead, use **`Register` + per-call `AddItem`/`Read`**
if gw exposes a synchronous read, otherwise a short-lived advise. *Action item:*
confirm gw's read story; if absent, request a synchronous `ReadAsync` RPC
on top of MXAccess `Read` (which exists in the COM API).
- **Write** maps `WriteRequest.Value` to `MxValue` via gw-7 helpers and
calls `WriteAsync(serverHandle, itemHandle, value, userId=0)`. Routes
`WriteSecured` (where `SecurityClassification == SecuredWrite/Verified`)
to `WriteSecuredAsync` once exposed on `MxGatewaySession`.
- **Subscribe** calls `SubscribeBulkAsync` once per `ISubscribable.Subscribe`
call. Stores `(tag → itemHandle, sid)` in `SubscriptionRegistry`. The
single `EventPump` consumes one `StreamEventsAsync` per session and fans
out per `sid`.
- **Unsubscribe** calls `UnsubscribeBulkAsync` and drops registry entries.
- **Reconnect** — when the gRPC channel drops or `StreamEvents` returns,
`ReconnectSupervisor` reopens the session and replays subscriptions via
gw-5 `ReplaySubscriptionsAsync`. The driver flags `DriverState.Degraded`
during recovery; the server keeps publishing last-good values with
`Uncertain` quality.
- **Host connectivity** — single synthesized host entry named after
`OTOPCUA_GALAXY_CLIENT_NAME` driven by gw-6 `SessionHealth` updates
(or, until gw-6 lands, by transport drops).
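
The Subscribe / Unsubscribe / event fan-out behavior above can be sketched as a registry keyed by item handle plus a single pump loop. This is a Python illustration under stated assumptions: events are modeled as `(item_handle, value)` pairs, each `(sid, tag)` gets its own item handle, and all class and function names are hypothetical rather than the driver's real types.

```python
class SubscriptionRegistry:
    """In-memory map of subscription id -> tags, and item handle -> owner."""

    def __init__(self) -> None:
        self._handles_by_sid: dict = {}   # sid -> {tag: item_handle}
        self._owner_by_handle: dict = {}  # item_handle -> (sid, tag)

    def add(self, sid: int, tag_handles: dict) -> None:
        self._handles_by_sid[sid] = dict(tag_handles)
        for tag, handle in tag_handles.items():
            self._owner_by_handle[handle] = (sid, tag)

    def remove(self, sid: int) -> None:
        for handle in self._handles_by_sid.pop(sid, {}).values():
            self._owner_by_handle.pop(handle, None)

    def lookup(self, item_handle: int):
        return self._owner_by_handle.get(item_handle)


def pump(events, registry: SubscriptionRegistry, on_data_change) -> int:
    """Consume one event stream and fan out per subscription id."""
    delivered = 0
    for item_handle, value in events:
        owner = registry.lookup(item_handle)
        if owner is None:
            continue  # late event for an already-unsubscribed item
        sid, tag = owner
        on_data_change(sid, tag, value)
        delivered += 1
    return delivered
```

Keeping the registry in memory only matches the plan: after a session loss the handles are stale anyway and are rebuilt by replay.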
Wire into the server next to other Tier-A drivers in the
`AddDrivers(...)` call site.
**Exit:** flipping `Galaxy:Backend` to `mxgateway` runs the OPC UA server
end-to-end with no `Galaxy.Host` involvement. Live read, live write, live
subscribe pass against the dev Galaxy. Historian + alarms still work via
phases 1-3.
### Phase 5 — parity test matrix
Reuse the existing live-Galaxy integration tests; run each scenario twice:
once with `Galaxy:Backend=legacy-host`, once with `mxgateway`. Compare:
- discovered hierarchy node count + names + datatypes,
- subscribed publish rates (allow ±10% tolerance vs. legacy),
- write success / status codes for each `SecurityClassification`,
- alarm condition transitions (Active / Acked / Inactive) — already
routed through phase 2's server-level subsystem,
- history reads — phase 3 sidecar, identical results both backends,
- reconnect behavior under gw kill, worker kill, network drop, ZB drop.
Document the matrix; resolve every discrepancy or explicitly accept it.
**Exit:** parity matrix has zero unexplained deltas. Performance budget
agreed: e.g. ≤ 2× per-call latency vs. the named-pipe baseline at the 95th
percentile, and equal or better `SubscribeBulk` setup time.
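
The two numeric gates in this phase (±10% publish-rate tolerance and a ≤ 2× latency budget at the 95th percentile) could be checked along these lines. A Python sketch only: the function names are hypothetical, and nearest-rank is one of several percentile definitions, chosen here for simplicity.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rate_within_tolerance(legacy_hz, gateway_hz, tolerance=0.10):
    """True when the gateway publish rate is within +/- tolerance of legacy."""
    return abs(gateway_hz - legacy_hz) <= tolerance * legacy_hz


def latency_within_budget(legacy_ms, gateway_ms, factor=2.0, pct=95):
    """True when gateway latency at pct is at most factor x the legacy value."""
    return percentile(gateway_ms, pct) <= factor * percentile(legacy_ms, pct)
```

Recording the raw samples alongside the pass/fail result makes it easier to explain or accept a delta later.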
### Phase 6 — perf + hardening
- Land gw-9 buffered-update intervals.
- Add OpenTelemetry traces from the driver around every gw call,
correlated via `client_correlation_id`.
- Write a soak test: 50k tags subscribed for 24h, counting missed events
across gw restarts and OtOpcUa restarts.
- Tune `MxGatewayClientOptions.MaxGrpcMessageBytes`, retry pipeline,
call timeouts based on soak results.
**Exit:** production-acceptable perf numbers documented in
`docs/drivers/Galaxy.md`.
### Phase 7 — retirement
1. Default `Galaxy:Backend = mxgateway` everywhere (sample configs,
install scripts, e2e configs).
2. Delete `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host`,
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy`,
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared`, and matching tests.
3. Remove `OtOpcUaGalaxyHost` NSSM registration from
`scripts/install/Install-Services.ps1`. Add a registration block for the
Wonderware historian sidecar from phase 3.
4. Remove every x86 .NET 4.8 reference, build target, and CI step from this
repo; remove `mxaccess_documentation.md`-driven dependencies that no
longer apply.
5. Update CLAUDE.md, `docs/v2/dev-environment.md`, `docs/ServiceHosting.md`,
`docs/Redundancy.md` to reflect the new topology.
6. Memory housekeeping: retire `project_galaxy_host_service.md` and
`project_galaxy_host_installed.md`; add a short note about the gw
dependency.
**Exit:** `git grep -i 'Galaxy\.Host'` returns nothing in source.
---
## Configuration shape (new driver)
```jsonc
"Drivers": {
"Galaxy": {
"Type": "Galaxy",
"InstanceId": "galaxy-prod-1",
"Gateway": {
"Endpoint": "https://mxgw.aveva.local:5001",
"ApiKeySecretRef": "galaxy:apiKey", // resolved via existing secret store
"UseTls": true,
"CaCertificatePath": "C:\\publish\\mxgw\\ca.crt",
"ConnectTimeoutSeconds": 10,
"DefaultCallTimeoutSeconds": 5,
"StreamTimeoutSeconds": 0 // unbounded
},
"MxAccess": {
"ClientName": "OtOpcUa-A", // unique per OtOpcUa instance
"PublishingIntervalMs": 1000, // hint for SetBufferedUpdateInterval
"WriteUserId": 0
},
"Repository": {
"DiscoverPageSize": 5000,
"WatchDeployEvents": true
},
"Reconnect": {
"InitialBackoffMs": 500,
"MaxBackoffMs": 30000,
"ReplayOnSessionLost": true
}
}
}
```
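
For the `Reconnect` block, one plausible reading of `InitialBackoffMs` / `MaxBackoffMs` is a capped exponential schedule. The sketch below is Python and rests on assumptions: the plan does not state the multiplier (doubling is assumed here) or whether jitter is applied (none is shown), and `backoff_delays_ms` is a hypothetical helper.

```python
def backoff_delays_ms(initial_ms=500, max_ms=30000, attempts=8):
    """Return the wait before each reconnect attempt, doubling up to a cap."""
    delays = []
    delay = initial_ms
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, max_ms)  # assumed multiplier of 2
    return delays


# With the sample values: 500, 1000, 2000, 4000, 8000, 16000, 30000, 30000
print(backoff_delays_ms())
```

Under these assumptions the cap is reached on the seventh attempt, roughly 31.5 seconds after the first failure.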
The OtOpcUa secret store already handles DPAPI-protected values for LDAP
binds; reuse it for the gw API key. Never put the key in plaintext in the
sample config.
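
A sample-config lint for the `Gateway` block might look like the following Python sketch. It checks two rules stated in this plan: no plaintext key in the config, and (from the risk table) `UseTls=true` for any non-loopback `Endpoint`. The `ApiKey` field name is a hypothetical example of a plaintext field, and treating every DNS name other than `localhost` as remote is an assumption.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback(host: str) -> bool:
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # other DNS names (or an empty host) count as remote


def lint_gateway(gateway: dict) -> list:
    """Return a list of problems found in one Gateway config block."""
    problems = []
    if "ApiKey" in gateway:  # hypothetical plaintext field
        problems.append("plaintext ApiKey present; use ApiKeySecretRef")
    host = urlparse(gateway.get("Endpoint", "")).hostname or ""
    if not is_loopback(host) and not gateway.get("UseTls", False):
        problems.append("UseTls must be true for non-loopback endpoints")
    return problems
```

Run against the sample above (`https://mxgw.aveva.local:5001`, `UseTls: true`, `ApiKeySecretRef` only), this returns an empty list.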
---
## Risks and mitigations
| Risk | Mitigation |
|---|---|
| gw protocol regression breaks production | Pin gw NuGet to a contract version range; CI runs parity matrix on every gw bump; staged rollout via `Galaxy:Backend` flag. |
| Per-call latency regresses for chatty workloads | Land gw-9 (buffered updates) before phase 5; measure 95th-percentile latency in the phase 6 soak test. |
| Reconnect storm after gw restart re-registers 50k tags | Land gw-3 or gw-5 before phase 6; client-side bulk replay throttled by `SubscribeBulkAsync` chunk size. |
| Alarm parity gap from moving tracker server-side | Phase 2 ships before phase 4; parity matrix gates phase 7. |
| Historian sidecar adds a second .NET 4.8 x86 service | Acceptable: it's a *driver-agnostic* component, and it ships only where Wonderware historian access is actually needed. |
| Two OtOpcUa instances both registering as same MXAccess client | `ClientName` is per-instance config (ot-5); install scripts lint that the redundancy pair has distinct names. |
| Cross-machine MXAccess writes traverse plaintext gRPC | Phase 0 enforces `UseTls=true` for any non-loopback `Endpoint`; CI lints the sample configs. |
| gw API key leaked in logs | gw and `MxGatewayClient` already redact `authorization` metadata; phase 6 audit. |
| Memory leak in `EventPump` under high event rate | Bounded channel between `StreamEventsAsync` and per-sub fan-out, drop-newest with a metric counter; the phase 6 soak test should catch any remaining leak. |
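
The drop-newest mitigation in the last row can be illustrated with a small bounded buffer. This is a single-threaded Python sketch with hypothetical names; the real driver would presumably use a bounded channel between the stream reader and the fan-out, with the counter exported as a metric.

```python
from collections import deque


class BoundedEventBuffer:
    """Fixed-capacity FIFO that rejects new events when full (drop-newest)."""

    def __init__(self, capacity: int) -> None:
        self._items = deque()
        self._capacity = capacity
        self.dropped = 0  # would be exported as a metric counter

    def offer(self, event) -> bool:
        if len(self._items) >= self._capacity:
            self.dropped += 1  # the incoming event is the one discarded
            return False
        self._items.append(event)
        return True

    def take(self):
        return self._items.popleft() if self._items else None
```

Because memory is bounded by `capacity`, a slow consumer shows up as a rising drop counter rather than as unbounded growth.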
---
## Cross-cutting deliverables
- **Docs:** `docs/drivers/Galaxy.md` (new), updates to
`docs/v2/dev-environment.md`, `docs/ServiceHosting.md`,
`docs/Redundancy.md`, `CLAUDE.md`.
- **Install scripts:** `scripts/install/Install-Services.ps1` removes
`OtOpcUaGalaxyHost`, adds `OtOpcUaWonderwareHistorian`, no Galaxy
service registration on the OtOpcUa node.
- **e2e:** `scripts/e2e/e2e-config.sample.json` — drop `OTOPCUA_GALAXY_*`
pipe vars, add `Drivers:Galaxy:Gateway:Endpoint` etc.
- **Memory:** retire stale Galaxy.Host entries; add gw dependency entry,
redundancy + client-name guidance.
---
## Order-of-work summary
```
Phase 0 (gw repo): gw-1, gw-2, gw-4, gw-9
Phase 1 (this): ot-1 — historian extension point
Phase 2 (this): ot-2 — alarm subsystem
Phase 3 (this): Driver.Historian.Wonderware sidecar
Phase 4 (this): Driver.Galaxy (new) behind backend flag
— depends on Phase 0, 1, 2
Phase 5 (this+gw): parity matrix
— drives gw-3 / gw-5 / gw-6 / gw-7 if gaps surface
Phase 6 (this): perf + hardening
Phase 7 (this): retire Galaxy.Host / Proxy / Shared
```
Phases 1-3 are independent of each other and can run in parallel. Phase 4
needs all three plus Phase 0. Phase 5 requires Phase 4. Phases 6 and 7 are
sequential after Phase 5.


@@ -1 +0,0 @@
opc.tcp://opcuademo.sterfive.com:26543