Audit (three parallel agent passes) found 43 markdown files carrying stale references to the deleted Galaxy.Host/Proxy/Shared projects after the v2-mxgw merge. This commit lands the prioritized fixes.

Track 1: high-traffic in-place rewrites (3 files, ~454 lines deleted)
- README.md (202 → 91 lines): drops .NET 4.8 / x86 / TopShelf install text; leads with the multi-driver .NET 10 server identity and points at scripts/install/Install-Services.ps1 and the parity rig.
- docs/v2/driver-specs.md §1 Galaxy (~289 → ~66 lines): replaces the Tier-C out-of-process spec with a Tier-A in-process description matching the current GalaxyDriver code, with the four-section GalaxyDriverOptions JSON shape pulled verbatim from Config/GalaxyDriverOptions.cs.
- docs/drivers/Galaxy.md (211 → 92 lines): full rewrite around the current Browse/Runtime/Health/Config sub-folders.

Track 2: historical banners (5 files)
- lmx_mxgw.md, lmx_mxgw_impl.md, lmx_backend.md, docs/v2/Galaxy.ParityMatrix.md, docs/v2/implementation/phase-2-galaxy-out-of-process.md each get a "✅ Completed 2026-04-30 — historical record" banner block. lmx_mxgw.md also fixes two dead links (`docs/Galaxy.Driver.md` and `docs/v2/Galaxy.Driver.md`) → `docs/drivers/Galaxy.md`.

Track 3: v1 archive sweep (10 git mv + 1 new index + 2 in-place scrubs)
- Moved 10 v1 docs under docs/v1/ preserving subpath structure: AlarmTracking, Configuration, DataTypeMapping, HistoricalDataAccess, Subscriptions (top-level); drivers/Galaxy-Repository, drivers/Galaxy-Test-Fixture; reqs/GalaxyRepositoryReqs, reqs/MxAccessClientReqs, reqs/ServiceHostReqs.
- New docs/v1/README.md is the shared archive banner + per-file table.
- docs/README.md repointed to the v1 paths and updated to reflect the v2 two-process deploy shape (Server + Admin + optional OtOpcUaWonderwareHistorian).
- docs/v2/Galaxy.ParityRig.md got a historical banner + four inline scrubs marking the OtOpcUaGalaxyHost service / Driver.Galaxy.Host EXE / Driver.Galaxy.ParityTests project as deleted-in-PR-7.2.

The repo's live-reading surface (README + CLAUDE.md + docs/v2/) now describes only the post-PR-7.2 architecture. v1 docs are preserved as a labelled archive under docs/v1/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
✅ Completed 2026-04-30 — historical record of the v2-mxgw migration design.
This document is the design doc that drove the migration from the legacy out-of-process Galaxy.Host topology to the in-process GalaxyDriver + mxaccessgw architecture. Option 1 (the in-process driver path) was selected and implemented across 39 PRs spanning phases 0–7, merged to master at commit ae7106d. For current architecture see CLAUDE.md, docs/drivers/Galaxy.md, and docs/v2/Galaxy.Performance.md.
Galaxy → MxAccessGateway Migration Plan
Implements Option 1 from lmx_backend.md: replace the bespoke Galaxy.Host + Galaxy.Proxy IPC pair with an in-process Tier-A Driver.Galaxy running in the .NET 10 OtOpcUa server, talking to a separately-deployed MxGateway.Server (mxaccessgw repo) over gRPC for live MXAccess work and Galaxy Repository browse.
Outcome
After this work:
- OtOpcUa.Server is fully .NET 10 x64: no x86 build artifacts in this repo. Driver.Galaxy.Host (Windows service, NSSM-wrapped, .NET 4.8 x86) is retired. Driver.Galaxy.Proxy and Driver.Galaxy.Shared are deleted. AVEVA platform is no longer required on the OtOpcUa box.
- A new in-process Driver.Galaxy lives next to Driver.Modbus, Driver.OpcUaClient, etc. It implements the same IDriver capability set the proxy implements today, but its body calls MxGateway.Client (MxGatewayClient, MxGatewaySession, GalaxyRepositoryClient).
- Wonderware Historian SDK access moves out of the Galaxy driver into a driver-agnostic historian data source (Driver.Historian.Wonderware, separate sidecar, .NET 4.8 x86). The OPC UA HA service plugs into it the same way it would plug into any future historian.
- Alarm condition tracking moves out of the driver into the OPC UA server's generic A&E subsystem. The driver only flags IsAlarm=true on attribute metadata and forwards live .InAlarm/.Acked/etc value changes; the server runs the AlarmCondition state machine.
- Per-platform ScanState probes degrade to plain attribute subscriptions: no special probe manager.
Pre-flight: improvements to land in mxaccessgw first
These are integration-quality changes in the mxaccessgw repo that make the OtOpcUa side dramatically simpler / faster / more robust. They aren't strictly required to start, but ship enough of them before phase 3 that we're not designing around gaps.
gw-1. Galaxy attribute metadata parity
What's there: galaxy_repository.v1.DiscoverHierarchy returns
GalaxyObject with name, parent, category, and dynamic attributes.
What's missing for OtOpcUa: every field today's MxAccessGalaxyBackend
copies into GalaxyAttributeInfo — confirm gw's Attribute proto carries:
- mx_data_type (int)
- is_array (bool)
- array_dimension (uint, optional)
- security_classification (int)
- is_historized (bool, from HistorizedExtension primitive)
- is_alarm (bool, from AlarmExtension primitive)
If any are missing, add them to the proto and the server-side query mapper.
Without IsAlarm and IsHistorized the OPC UA server can't decide which
nodes get HasHistoricalConfiguration / which become AlarmConditions.
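As a rough illustration of why both flags matter, the sketch below models the attribute fields listed above as a Python dataclass and derives the two server-side decisions from them. The class and function names are hypothetical stand-ins; the real types live in the gw proto and in the C# server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributeInfo:
    """Hypothetical mirror of the attribute fields listed in gw-1."""
    name: str
    mx_data_type: int
    is_array: bool
    security_classification: int
    is_historized: bool
    is_alarm: bool
    array_dimension: Optional[int] = None


def node_features(attr: AttributeInfo) -> set[str]:
    """Return the OPC UA features the server would attach to this node."""
    features = {"Variable"}
    if attr.is_historized:
        features.add("HasHistoricalConfiguration")
    if attr.is_alarm:
        features.add("AlarmCondition")
    return features
```

If either flag is missing from the proto, the corresponding branch can never fire, which is the gap gw-1 is meant to close.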
gw-2. Stable, documented event-stream resume semantics
What's needed: the OtOpcUa driver must survive a transient gw transport
drop without losing subscription state or duplicating change events. gw's
StreamEventsAsync(afterWorkerSequence) already exposes resumption.
Document the per-session retention window (how long does the worker buffer
events the gateway hasn't acked?) and the "events were dropped, you must
re-subscribe" signal. If retention is bounded by count rather than time,
expose the bound in OpenSessionReply so the client can size its own buffer.
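A minimal sketch of the client-side bookkeeping this implies, assuming worker sequences are contiguous integers starting at 1 (an assumption; the actual numbering contract is exactly what gw-2 asks to have documented). `ResumeCursor` is a hypothetical name, and a gap is treated as the "events were dropped" signal.

```python
class ResumeCursor:
    """Tracks the last worker sequence seen so a stream can be resumed."""

    def __init__(self) -> None:
        self.last_sequence = 0  # value to pass as afterWorkerSequence on resume
        self.duplicates = 0     # events replayed after a resume and discarded

    def accept(self, sequence: int) -> bool:
        """Return True if the event is new; raise if events were skipped."""
        if sequence <= self.last_sequence:
            self.duplicates += 1
            return False
        if sequence != self.last_sequence + 1:
            # Assumed gap policy: the caller must re-subscribe.
            raise LookupError(
                f"gap: expected {self.last_sequence + 1}, got {sequence}"
            )
        self.last_sequence = sequence
        return True
```

After a transport drop the driver would resume from `last_sequence`; duplicates delivered by the replay are counted and ignored rather than forwarded as change events.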
gw-3. Reconnectable sessions
Listed under "post-v1 revisit" in gateway.md. Without it, every gw or
OtOpcUa restart re-Registers, re-AddItems, re-Advises the entire
address space — for a 50k-tag Galaxy that's a non-trivial cold-start. With
reconnectable sessions, the driver presents its SessionId after a restart
and the worker keeps its handles.
If full reconnection is too large, ship a bulk replay instead: a single RPC that takes the full subscription set and the worker performs the register/add/advise inside one round trip. We can drive it from a client-side cache rather than gw state. See gw-5 below.
gw-4. Driver-shaped subscribe primitive
MxGatewaySession already has SubscribeBulkAsync (one RPC: Register
implicit + AddItem + Advise for a list of tag addresses, returning
per-tag SubscribeResult). That's exactly what ISubscribable.SubscribeAsync
wants. Confirm it returns enough per-tag detail to surface a partial-failure
list to OPC UA monitored items (good handle, status code, error text).
If not already, expose SubscribeBulk with optional update-rate hint
forwarded to SetBufferedUpdateInterval so the OPC UA publishing interval
becomes a single field on the subscribe call rather than a follow-up RPC.
gw-5. Subscription replay snapshot
Provide an RPC ReplaySubscriptionsAsync(SessionId, IEnumerable<TagAddress>)
that re-establishes a list of subscriptions after a session reset and returns
per-tag results. The client stores its tag list locally (the driver already
has it from Discover), and the gw worker turns it into one
register/add/advise sequence. This is the minimum surface we need; full
"reattach to a previous session by id" (gw-3) is a richer version of the
same thing.
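One way to picture the client side of a replay is a chunked loop over the locally cached tag list. This is only a sketch: `subscribe_bulk` is a hypothetical callable standing in for whatever RPC the gw ends up exposing, and the per-tag result is reduced to a boolean.

```python
from typing import Callable, Iterator


def chunked(tags: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` tags."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(tags), size):
        yield tags[start:start + size]


def replay_subscriptions(
    tags: list[str],
    chunk_size: int,
    subscribe_bulk: Callable[[list[str]], dict[str, bool]],
) -> dict[str, bool]:
    """Re-subscribe the cached tag list in chunks; collect per-tag results."""
    results: dict[str, bool] = {}
    for chunk in chunked(tags, chunk_size):
        results.update(subscribe_bulk(chunk))
    return results
```

Chunking keeps a single replay call bounded in size, and the merged result map gives the driver the partial-failure list it needs for monitored items.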
gw-6. Transport-health stream
The gw already exposes worker / session health on its dashboard. Add a small
streaming RPC StreamSessionHealth(SessionId) → stream SessionHealth so the
OtOpcUa driver can surface "MXAccess transport up/down" to its
IHostConnectivityProbe without faking it via probe-tag subscriptions.
Today MxAccessClient.ConnectionStateChanged does this in-process; we want
the same signal at the gw boundary.
gw-7. Optional .NET 10 client polish
- Async-disposable session pattern is already there.
- Add a typed MxValue ⇄ object adapter for the seven Galaxy types OtOpcUa cares about (Boolean, Int32, Float, Double, String, DateTime, arrays of the same). Today every consumer writes its own MxValue.From<T> helpers; this shaves boilerplate from the driver.
- Add a SubscribeWithCallback convenience wrapper that combines OpenSession + SubscribeBulk + StreamEvents and routes events through a delegate per tag. Keeps the OPC UA driver from re-implementing the fan-out / sequencer pattern.
gw-8. Auth minimums
Document API-key scoping as it applies to OtOpcUa: the server identity needs
session, invoke, event, and metadata:read scopes. Provide a CLI to
mint a key bound to those scopes for an OtOpcUa instance.
gw-9. Performance: bulk paths and value coalescing
- Confirm SubscribeBulkAsync is implemented as a single MXAccess AddItem + Advise loop on the worker, not N pipe round trips. If not, fix before we drive 50k-tag Galaxies through it.
- Expose SetBufferedUpdateInterval per session so OtOpcUa can request buffered updates at the OPC UA publishing interval and get one batched OnBufferedDataChange per tick rather than N OnDataChange events.
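To illustrate the effect being asked for, the sketch below groups individual change events into one batch per update interval. It models only the batching outcome under assumed inputs (millisecond timestamps, a hypothetical `batch_by_tick` helper); it says nothing about how MXAccess produces OnBufferedDataChange internally.

```python
from collections import defaultdict


def batch_by_tick(
    events: list[tuple[int, str, float]], interval_ms: int
) -> dict[int, list[tuple[str, float]]]:
    """Group (timestamp_ms, tag, value) events into one batch per interval."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    batches: dict[int, list[tuple[str, float]]] = defaultdict(list)
    for timestamp_ms, tag, value in events:
        tick = timestamp_ms // interval_ms  # index of the interval the event falls in
        batches[tick].append((tag, value))
    return dict(batches)
```

With a 1000 ms interval, four events at 0, 400, 999, and 1000 ms collapse into two batches, so the consumer handles two callbacks instead of four.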
These can all ship in mxaccessgw independently and improve every consumer.
OtOpcUa-side improvements to land in parallel
Some are forced by removing Galaxy.Host; others are quality-of-life.
ot-1. Promote IHistorianDataSource to a server-level extension point
Today IHistorianDataSource is a Galaxy-internal abstraction in
Driver.Galaxy.Host. Lift it to OtOpcUa.Core.Abstractions (or a similar
home next to IDriver) and let the OPC UA HA service consume any number
of registered data sources keyed by node namespace. Drivers don't own
historian access; the server mounts data sources alongside drivers. This is
the prerequisite that lets us move Wonderware Historian out of the Galaxy
driver without losing the feature.
ot-2. Generic alarm condition state machine in the server
Move the .InAlarm/.Priority/.DescAttrName/.Acked quartet handling
out of GalaxyAlarmTracker into a server-level alarm subsystem keyed off the
IsAlarm=true flag drivers set during discovery. The server subscribes to
the four sub-attributes itself and runs the AlarmCondition state machine.
Driver only:
- declares IsAlarm=true in DriverAttributeInfo,
- forwards plain attribute value changes (already done by ISubscribable).
This is also a precondition for future drivers (Modbus DL205 alarm bits, S7 alarm DBs) to emit alarms without each writing their own tracker.
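A deliberately simplified sketch of such a state machine follows, driven only by the InAlarm and Acked sub-attribute values. The three states mirror the Active / Acked / Inactive transitions named in the parity matrix; the class names are hypothetical and the real OPC UA AlarmCondition model has more states than this.

```python
from enum import Enum
from typing import Optional


class AlarmState(Enum):
    INACTIVE = "Inactive"
    ACTIVE = "Active"  # in alarm, not yet acknowledged
    ACKED = "Acked"    # in alarm and acknowledged


def derive_state(in_alarm: bool, acked: bool) -> AlarmState:
    """Map the two sub-attribute values onto a condition state."""
    if not in_alarm:
        return AlarmState.INACTIVE
    return AlarmState.ACKED if acked else AlarmState.ACTIVE


class AlarmCondition:
    """Server-side condition fed by plain attribute value changes."""

    def __init__(self) -> None:
        self.in_alarm = False
        self.acked = False
        self.state = AlarmState.INACTIVE

    def on_value(self, sub_attribute: str, value: bool) -> Optional[AlarmState]:
        """Apply one change; return the new state only if it transitioned."""
        if sub_attribute == "InAlarm":
            self.in_alarm = value
        elif sub_attribute == "Acked":
            self.acked = value
        else:
            return None  # Priority / DescAttrName do not change state here
        new_state = derive_state(self.in_alarm, self.acked)
        if new_state is self.state:
            return None
        self.state = new_state
        return new_state
```

Because the state is a pure function of forwarded attribute values, the driver needs no alarm logic of its own, which is the point of ot-2.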
ot-3. Driver capabilities trim
After ot-1 and ot-2, Driver.Galaxy no longer needs to implement:
- IHistoryProvider (server's HA service handles it via Wonderware historian data source)
- IAlarmHistorianWriter (server's A&E historian, or kept generic; Galaxy shouldn't own the SQLite path)
- IAlarmSource ack route (server-level alarm subsystem writes back via the driver's IWritable.WriteAsync, which the gw already supports)
Keep:
- IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable, IHostConnectivityProbe.
ot-4. Treat time_of_last_deploy as IRediscoverable's pump
Replace the Host-side change-detection poll with a managed
GalaxyRepositoryClient.WatchDeployEventsAsync consumer in the driver.
Each event raises OnRediscoveryNeeded with the new deploy time as the
scopeHint. No polling code in this repo.
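The event-driven shape can be sketched as a small consumer that forwards each new deploy time to a rediscovery callback. `DeployWatcher.handle_event` here is a hypothetical stand-in for the C# consumer, and the suppression of repeated identical deploy times is an added assumption, not something the plan specifies.

```python
from typing import Callable, Optional


class DeployWatcher:
    """Raises a rediscovery callback when the deploy time changes."""

    def __init__(self, on_rediscovery_needed: Callable[[str], None]) -> None:
        self._on_rediscovery_needed = on_rediscovery_needed
        self._last_deploy: Optional[str] = None

    def handle_event(self, time_of_last_deploy: str) -> bool:
        """Process one deploy event; return True if rediscovery was raised."""
        if time_of_last_deploy == self._last_deploy:
            return False  # assumed de-duplication of repeated events
        self._last_deploy = time_of_last_deploy
        # The new deploy time is passed along as the scope hint.
        self._on_rediscovery_needed(time_of_last_deploy)
        return True
```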
ot-5. Connection pool at the server, not the driver
If the redundancy pair runs two OtOpcUa instances against one gw, each
instance should use a single GrpcChannel per process (already gRPC default)
but its own session (one MXAccess client identity per OtOpcUa instance,
not one shared session that fights over Wonderware client state). Encode
the per-instance MXAccess client name in driver config. It is already partly
there (OTOPCUA_GALAXY_CLIENT_NAME); make it explicit in the new driver's
appsettings.json shape.
Phased implementation
Each phase is a working, mergeable slice. Keep Galaxy.Host running
alongside the new driver until phase 7 — gated by a config switch
Galaxy:Backend = legacy-host | mxgateway.
Phase 0 — pre-flight (mxaccessgw repo)
Ship gw-1, gw-2, gw-4, gw-9 (the parity, performance, and contract bits the plan immediately depends on). gw-3, gw-5, gw-6, gw-7 can come during or after phase 5.
Exit: local OtOpcUa dev box can MxGatewayClient.Create a client, open a
session, SubscribeBulkAsync 100 tags, and observe OnDataChange events at
the configured update rate.
Phase 1 — server-level historian extension point (ot-1)
- Extract IHistorianDataSource (and its DTOs HistorianSample, HistorianAggregateSample, HistoricalEvent) from Driver.Galaxy.Host/Backend/Historian/ into src/ZB.MOM.WW.OtOpcUa.Core/Abstractions/Historian/.
- Extend the OPC UA HA service to look up a registered IHistorianDataSource per namespace and call into it for HistoryRead, HistoryReadProcessed, HistoryReadAtTime, HistoryReadEvents. Drivers stop implementing IHistoryProvider directly; the server proxies.
- Add a no-op default registration so drivers without history keep working.
Exit: all current Galaxy history reads route through an
IHistorianDataSource registered by Driver.Galaxy.Host (still legacy)
without behavior change. Other drivers untouched.
Phase 2 — server-level alarm subsystem (ot-2)
- Add an IAlarmConditionDeclaration API on the address-space builder so discovery can flag a node as alarm-bearing and supply the four sub-attribute references.
- Add a hosted AlarmConditionService in the server that, on driver Discover, subscribes to the four sub-attributes via the driver's own ISubscribable, runs the state machine, and emits IAlarmSource.OnAlarmEvent itself. Acks route back through the driver's IWritable.WriteAsync to the .AckMsg attribute.
- Add Galaxy-specific defaults (sub-attribute naming) as a small adapter so the same service can serve future drivers with different conventions.
Exit: Galaxy alarms still work end-to-end; the tracker code that runs
inside Galaxy.Host is dead but kept for the legacy-host backend path.
Phase 3 — Wonderware Historian sidecar (Driver.Historian.Wonderware)
- New solution project: Driver.Historian.Wonderware, .NET 4.8 x86, console app + NSSM (mirrors today's Galaxy.Host packaging exactly, minus Galaxy responsibilities).
- Hosts the existing HistorianDataSource, HistorianClusterEndpointPicker, HistorianHealthSnapshot code lifted from Galaxy.Host/Backend/Historian/ and exposes them over a small named-pipe protocol (or local gRPC if .NET 4.8 cost is acceptable; named pipe is simpler).
- Add Driver.Historian.Wonderware.Client (.NET 10) implementing IHistorianDataSource against the sidecar.
- Server registers it as a data source for the Galaxy namespace.
Exit: OPC UA history reads work via the sidecar with the legacy-host backend still in place. We've decoupled history from MXAccess.
Phase 4 — new Driver.Galaxy against gw
This is the meat. New project: src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/, .NET 10,
in-process. Capabilities (post ot-3): IDriver, ITagDiscovery, IReadable,
IWritable, ISubscribable, IRediscoverable, IHostConnectivityProbe.
Shape:
Driver.Galaxy/
GalaxyDriver.cs # IDriver root
Browse/
GalaxyDiscoverer.cs # consumes GalaxyRepositoryClient.DiscoverHierarchyAsync
DataTypeMap.cs # mx_data_type → DriverDataType
SecurityMap.cs # security_classification → SecurityClassification
Runtime/
GalaxyMxSession.cs # owns one MxGatewaySession; Register + map per-driver client name
SubscriptionRegistry.cs # tag → server/item handles; persists to memory only
EventPump.cs # consumes session.StreamEventsAsync, fans out to OnDataChange
ReconnectSupervisor.cs # gw transport drop / session-lost recovery
DeployWatcher.cs # GalaxyRepositoryClient.WatchDeployEventsAsync → OnRediscoveryNeeded
Health/
HostConnectivityForwarder.cs # gw-6 SessionHealth → IHostConnectivityProbe
Config/
GalaxyDriverOptions.cs # endpoint, ApiKey, ClientName, TLS, retry, intervals
GalaxyDriverFactoryExtensions.cs # AddGalaxyDriver(IServiceCollection)
Key behaviors:
- Discovery calls GalaxyRepositoryClient.DiscoverHierarchyAsync() once at init and on every WatchDeployEvents event, then drives the address space builder. Same node naming as today (parent contained-name hierarchy + leaf attributes named tag_name.AttributeName).
- Read: a one-off AddItem + Advise + read-after-first-callback is overkill; instead, use Register + per-call AddItem/Read if gw exposes a synchronous read, otherwise a short-lived advise. Action item: confirm gw's read story; if absent, request a synchronous ReadAsync RPC on top of MXAccess Read (which exists in the COM API).
- Write maps WriteRequest.Value to MxValue via gw-7 helpers and calls WriteAsync(serverHandle, itemHandle, value, userId=0). Routes WriteSecured (where SecurityClassification == SecuredWrite/Verified) to WriteSecuredAsync once exposed on MxGatewaySession.
- Subscribe calls SubscribeBulkAsync once per ISubscribable.Subscribe call. Stores (tag → itemHandle, sid) in SubscriptionRegistry. The single EventPump consumes one StreamEventsAsync per session and fans out per sid.
- Unsubscribe calls UnsubscribeBulkAsync and drops registry entries.
- Reconnect: when the gRPC channel drops or StreamEvents returns, ReconnectSupervisor reopens the session and replays subscriptions via gw-5 ReplaySubscriptionsAsync. The driver flags DriverState.Degraded during recovery; the server keeps publishing last-good values with Uncertain quality.
- Host connectivity: single synthesized host entry named after OTOPCUA_GALAXY_CLIENT_NAME, driven by gw-6 SessionHealth updates (or, until gw-6 lands, by transport drops).
Wire into the server next to other Tier-A drivers in the
AddDrivers(...) call site.
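The subscribe, unsubscribe, and fan-out behaviors described for SubscriptionRegistry and EventPump can be sketched as a handle-to-subscription map with a single dispatch entry point. This is an illustrative model only: the method names, integer handles, and callback signature are assumptions, not the driver's actual API.

```python
from collections import defaultdict
from typing import Callable

Callback = Callable[[int, object], None]


class SubscriptionRegistry:
    """In-memory map of item handles to the subscriptions that watch them."""

    def __init__(self) -> None:
        self._sids_by_handle: dict[int, set[int]] = defaultdict(set)
        self._callbacks: dict[int, Callback] = {}

    def add(self, sid: int, item_handles: list[int], callback: Callback) -> None:
        """Record one subscription and the item handles it covers."""
        self._callbacks[sid] = callback
        for handle in item_handles:
            self._sids_by_handle[handle].add(sid)

    def remove(self, sid: int) -> None:
        """Drop a subscription and any handles no longer watched."""
        self._callbacks.pop(sid, None)
        for handle in list(self._sids_by_handle):
            self._sids_by_handle[handle].discard(sid)
            if not self._sids_by_handle[handle]:
                del self._sids_by_handle[handle]

    def dispatch(self, item_handle: int, value: object) -> int:
        """Fan one event out to every subscription on the handle."""
        sids = self._sids_by_handle.get(item_handle, ())
        for sid in sorted(sids):
            self._callbacks[sid](item_handle, value)
        return len(sids)
```

A single pump loop would call `dispatch` for each event read from the session stream, so two OPC UA subscriptions on the same tag share one gw advise.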
Exit: flipping Galaxy:Backend to mxgateway runs the OPC UA server
end-to-end with no Galaxy.Host involvement. Live read, live write, live
subscribe pass against the dev Galaxy. Historian + alarms still work via
phases 1–3.
Phase 5 — parity test matrix
Reuse the existing live-Galaxy integration tests; run each scenario twice:
once with Galaxy:Backend=legacy-host, once with mxgateway. Compare:
- discovered hierarchy node count + names + datatypes,
- subscribed publish rates (allow ±10% tolerance vs. legacy),
- write success / status codes for each SecurityClassification,
- alarm condition transitions (Active / Acked / Inactive), already routed through phase 2's server-level subsystem,
- history reads — phase 3 sidecar, identical results both backends,
- reconnect behavior under gw kill, worker kill, network drop, ZB drop.
Document the matrix; resolve every discrepancy or explicitly accept it.
Exit: parity matrix has zero unexplained deltas. Performance budget
agreed: e.g. ≤ 2× per-call latency vs. named-pipe baseline at the 95th
percentile, equal or better throughput in SubscribeBulk setup time.
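The two numeric gates mentioned for the matrix (±10% publish-rate tolerance and a 2× latency budget at the 95th percentile) could be checked along these lines. The function names are hypothetical, and the nearest-rank percentile is one assumed definition among several.

```python
def rate_within_tolerance(
    legacy_hz: float, candidate_hz: float, tolerance: float = 0.10
) -> bool:
    """True if the candidate publish rate is within tolerance of legacy."""
    return abs(candidate_hz - legacy_hz) <= tolerance * legacy_hz


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def latency_within_budget(
    baseline_ms: list[float], candidate_ms: list[float], factor: float = 2.0
) -> bool:
    """True if candidate p95 latency is at most factor x baseline p95."""
    return p95(candidate_ms) <= factor * p95(baseline_ms)
```

Running the same checks for both backends on every scenario makes "zero unexplained deltas" a mechanical pass/fail rather than a judgment call.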
Phase 6 — perf + hardening
- Land gw-9 buffered-update intervals.
- Add OpenTelemetry traces from the driver around every gw call, correlated via client_correlation_id.
- Write soak test: 50k tags subscribed, 24h, count missed events, gw restarts, OtOpcUa restarts.
- Tune MxGatewayClientOptions.MaxGrpcMessageBytes, retry pipeline, call timeouts based on soak results.
Exit: production-acceptable perf numbers documented in
docs/drivers/Galaxy.md.
Phase 7 — retirement
- Default Galaxy:Backend = mxgateway everywhere (sample configs, install scripts, e2e configs).
- Delete src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared, and matching tests.
- Remove OtOpcUaGalaxyHost NSSM registration from scripts/install/Install-Services.ps1. Add a registration block for the Wonderware historian sidecar from phase 3.
- Remove every x86 .NET 4.8 reference, build target, and CI step from this repo; remove mxaccess_documentation.md-driven dependencies that no longer apply.
- Update CLAUDE.md, docs/v2/dev-environment.md, docs/ServiceHosting.md, docs/Redundancy.md to reflect the new topology.
- Memory housekeeping: retire project_galaxy_host_service.md and project_galaxy_host_installed.md; add a short note about the gw dependency.
Exit: git grep -i 'Galaxy\.Host' returns nothing in source.
Configuration shape (new driver)
"Drivers": {
"Galaxy": {
"Type": "Galaxy",
"InstanceId": "galaxy-prod-1",
"Gateway": {
"Endpoint": "https://mxgw.aveva.local:5001",
"ApiKeySecretRef": "galaxy:apiKey", // resolved via existing secret store
"UseTls": true,
"CaCertificatePath": "C:\\publish\\mxgw\\ca.crt",
"ConnectTimeoutSeconds": 10,
"DefaultCallTimeoutSeconds": 5,
"StreamTimeoutSeconds": 0 // unbounded
},
"MxAccess": {
"ClientName": "OtOpcUa-A", // unique per OtOpcUa instance
"PublishingIntervalMs": 1000, // hint for SetBufferedUpdateInterval
"WriteUserId": 0
},
"Repository": {
"DiscoverPageSize": 5000,
"WatchDeployEvents": true
},
"Reconnect": {
"InitialBackoffMs": 500,
"MaxBackoffMs": 30000,
"ReplayOnSessionLost": true
}
}
}
The OtOpcUa secret store already handles DPAPI-protected values for LDAP binds; reuse it for the gw API key. Never put the key in plaintext in the sample config.
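For a feel of what the Reconnect values imply, here is a sketch of a capped backoff schedule using InitialBackoffMs and MaxBackoffMs. The doubling factor is an assumption (the config shape only fixes the first delay and the cap), and jitter is omitted for clarity.

```python
def backoff_schedule(initial_ms: int, max_ms: int, attempts: int) -> list[int]:
    """Delays before each reconnect attempt, doubling up to a cap."""
    if initial_ms <= 0 or max_ms < initial_ms:
        raise ValueError("need 0 < initial_ms <= max_ms")
    delays: list[int] = []
    delay = initial_ms
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, max_ms)  # assumed doubling, clamped to the cap
    return delays
```

With the sample values (500 ms initial, 30000 ms cap), doubling reaches the cap on the seventh attempt, so a gw outage of a few minutes costs only a handful of reconnect attempts.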
Risks and mitigations
| Risk | Mitigation |
|---|---|
| gw protocol regression breaks production | Pin gw NuGet to a contract version range; CI runs parity matrix on every gw bump; staged rollout via Galaxy:Backend flag. |
| Per-call latency regresses for chatty workloads | Land gw-9 (buffered updates) before phase 5; soak the 95p in phase 6. |
| Reconnect storm after gw restart re-registers 50k tags | Land gw-3 or gw-5 before phase 6; client-side bulk replay throttled by SubscribeBulkAsync chunk size. |
| Alarm parity gap from moving tracker server-side | Phase 2 ships before phase 4; parity matrix gates phase 7. |
| Historian sidecar adds a second .NET 4.8 x86 service | Acceptable: it's a driver-agnostic component, and it ships only where Wonderware historian access is actually needed. |
| Two OtOpcUa instances both registering as same MXAccess client | ClientName is per-instance config (ot-5); install scripts lint that the redundancy pair has distinct names. |
| Cross-machine MXAccess writes traverse plaintext gRPC | Phase 0 enforces UseTls=true for any non-loopback Endpoint; CI lints the sample configs. |
| gw API key leaked in logs | gw and MxGatewayClient already redact authorization metadata; phase 6 audit. |
| Memory leak in EventPump under high event rate | Bounded channel between StreamEventsAsync and per-sub fan-out, drop-newest with a metric counter; soak test catches. |
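The EventPump mitigation in the last row (a bounded channel with drop-newest and a counter) can be modeled in a few lines. This is a single-threaded sketch with a hypothetical class name; a real implementation in the driver would use a concurrent bounded channel and export the counter as a metric.

```python
from collections import deque
from typing import Optional


class DropNewestBuffer:
    """Bounded FIFO that rejects new items when full and counts the drops."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: deque = deque()
        self.dropped = 0  # would back the metric counter

    def offer(self, event: object) -> bool:
        """Enqueue the event, or drop it if the buffer is full."""
        if len(self._items) >= self._capacity:
            self.dropped += 1
            return False
        self._items.append(event)
        return True

    def take(self) -> Optional[object]:
        """Dequeue the oldest event, or None if the buffer is empty."""
        return self._items.popleft() if self._items else None
```

Dropping the newest event keeps memory bounded and preserves ordering of what is already queued; the `dropped` count is what the soak test would watch.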
Cross-cutting deliverables
- Docs: docs/drivers/Galaxy.md (new), updates to docs/v2/dev-environment.md, docs/ServiceHosting.md, docs/Redundancy.md, CLAUDE.md.
- Install scripts: scripts/install/Install-Services.ps1 removes OtOpcUaGalaxyHost, adds OtOpcUaWonderwareHistorian, no Galaxy service registration on the OtOpcUa node.
- e2e: in scripts/e2e/e2e-config.sample.json, drop OTOPCUA_GALAXY_* pipe vars and add Drivers:Galaxy:Gateway:Endpoint etc.
- Memory: retire stale Galaxy.Host entries; add gw dependency entry, redundancy + client-name guidance.
Order-of-work summary
Phase 0 (gw repo): gw-1, gw-2, gw-4, gw-9
Phase 1 (this): ot-1 — historian extension point
Phase 2 (this): ot-2 — alarm subsystem
Phase 3 (this): Driver.Historian.Wonderware sidecar
Phase 4 (this): Driver.Galaxy (new) behind backend flag
— depends on Phase 0, 1, 2
Phase 5 (this+gw): parity matrix
— drives gw-3 / gw-5 / gw-6 / gw-7 if gaps surface
Phase 6 (this): perf + hardening
Phase 7 (this): retire Galaxy.Host / Proxy / Shared
Phases 1–3 are independent of each other and can run in parallel. Phase 4 needs all three plus Phase 0. Phase 5 requires Phase 4. Phases 6 and 7 are sequential after Phase 5.