lmxopcua/lmx_mxgw.md
Joseph Doherty ef22a61c39 v2 mxgw migration — Phase 1+2+3.1 wiring (7 PRs)
Foundational PRs from lmx_mxgw_impl.md, all green. Bodies only — DI/wiring
deferred to PR 1+2.W (combined wire-up) and PR 3.W.

PR 1.1 — IHistorianDataSource lifted to Core.Abstractions/Historian/
  Reuses existing DataValueSnapshot + HistoricalEvent shapes; sidecar (PR
  3.4) translates byte-quality → uint StatusCode internally.

PR 1.2 — IHistoryRouter + HistoryRouter on the server
  Longest-prefix-match resolution, case-insensitive, ObjectDisposed-guarded,
  swallow-on-shutdown disposal of misbehaving sources.

PR 1.3 — DriverNodeManager.HistoryRead* dispatch through IHistoryRouter
  Per-tag resolution with LegacyDriverHistoryAdapter wrapping
  `_driver as IHistoryProvider` so existing tests + drivers keep working
  until PR 7.2 retires the fallback.

PR 2.1 — AlarmConditionInfo extended with five sub-attribute refs
  InAlarmRef / PriorityRef / DescAttrNameRef / AckedRef / AckMsgWriteRef.
  Optional defaulted parameters preserve all existing 3-arg call sites.

PR 2.2 — AlarmConditionService state machine in Server/Alarms/
  Driver-agnostic port of GalaxyAlarmTracker. Sub-attribute refs come from
  AlarmConditionInfo, values arrive as DataValueSnapshot, ack writes route
  through IAlarmAcknowledger. State machine preserves Active/Acknowledged/
  Inactive transitions, Acked-on-active reset, post-disposal silence.

PR 2.3 — DriverNodeManager wires AlarmConditionService
  MarkAsAlarmCondition registers each alarm-bearing variable with the
  service; DriverWritableAcknowledger routes ack-message writes through
  the driver's IWritable + CapabilityInvoker. Service-raised transitions
  route via OnAlarmServiceTransition → matching ConditionSink. Legacy
  IAlarmSource path unchanged for null service.

PR 3.1 — Driver.Historian.Wonderware shell project (net48 x86)
  Console host shell + smoke test; SDK references + code lift come in
  PR 3.2.

Tests: 9 (PR 1.1) + 5 (PR 2.1) + 10 (PR 1.2) + 19 (PR 2.2) + 1 (PR 3.1)
all pass. Existing AlarmSubscribeIntegrationTests + HistoryReadIntegrationTests
unchanged.

Plan + audit docs (lmx_backend.md, lmx_mxgw.md, lmx_mxgw_impl.md)
included so parallel subagent worktrees can read them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:03:36 -04:00


Galaxy → MxAccessGateway Migration Plan

Implements Option 1 from lmx_backend.md: replace the bespoke Galaxy.Host + Galaxy.Proxy IPC pair with an in-process Tier-A Driver.Galaxy running in the .NET 10 OtOpcUa server, talking to a separately-deployed MxGateway.Server (mxaccessgw repo) over gRPC for live MXAccess work and Galaxy Repository browse.

Outcome

After this work:

  • OtOpcUa.Server is fully .NET 10 x64; the only x86 build artifact left in this repo is the Wonderware historian sidecar described below.
  • Driver.Galaxy.Host (Windows service, NSSM-wrapped, .NET 4.8 x86) is retired. Driver.Galaxy.Proxy and Driver.Galaxy.Shared are deleted. AVEVA platform is no longer required on the OtOpcUa box.
  • A new in-process Driver.Galaxy lives next to Driver.Modbus, Driver.OpcUaClient, etc. It implements the same IDriver capability set the proxy implements today, but its body calls MxGateway.Client (MxGatewayClient, MxGatewaySession, GalaxyRepositoryClient).
  • Wonderware Historian SDK access moves out of the Galaxy driver into a driver-agnostic historian data source (Driver.Historian.Wonderware, separate sidecar, .NET 4.8 x86). The OPC UA HA service plugs into it the same way it would plug into any future historian.
  • Alarm condition tracking moves out of the driver into the OPC UA server's generic A&E subsystem. The driver only flags IsAlarm=true on attribute metadata and forwards live .InAlarm/.Acked/etc value changes; the server runs the AlarmCondition state machine.
  • Per-platform ScanState probes degrade to plain attribute subscriptions — no special probe manager.

Pre-flight: improvements to land in mxaccessgw first

These are integration-quality changes in the mxaccessgw repo that make the OtOpcUa side dramatically simpler / faster / more robust. They aren't strictly required to start, but ship enough of them before phase 3 that we're not designing around gaps.

gw-1. Galaxy attribute metadata parity

What's there: galaxy_repository.v1.DiscoverHierarchy returns GalaxyObject with name, parent, category, and dynamic attributes.

What's missing for OtOpcUa: every field today's MxAccessGalaxyBackend copies into GalaxyAttributeInfo — confirm gw's Attribute proto carries:

  • mx_data_type (int)
  • is_array (bool)
  • array_dimension (uint, optional)
  • security_classification (int)
  • is_historized (bool, from HistorizedExtension primitive)
  • is_alarm (bool, from AlarmExtension primitive)

If any are missing, add them to the proto and the server-side query mapper. Without IsAlarm and IsHistorized the OPC UA server can't decide which nodes get HasHistoricalConfiguration / which become AlarmConditions.
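A minimal sketch of how those two flags drive node decoration, written in Python for brevity (the driver itself is C#). `AttributeInfo` and `node_decorations` are hypothetical names, and the `mx_data_type` and `security_classification` values in the example are placeholders, not real MXAccess codes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AttributeInfo:
    """Hypothetical client-side mirror of the proto fields listed above."""
    name: str
    mx_data_type: int
    is_array: bool
    security_classification: int
    is_historized: bool
    is_alarm: bool
    array_dimension: Optional[int] = None


def node_decorations(attr: AttributeInfo) -> List[str]:
    """Decide which OPC UA extras a discovered attribute node receives."""
    decorations = []
    if attr.is_historized:
        decorations.append("HasHistoricalConfiguration")
    if attr.is_alarm:
        decorations.append("AlarmCondition")
    return decorations


# Placeholder values: 3 and 1 are illustrative, not real MXAccess codes.
level = AttributeInfo("Tank1.Level", mx_data_type=3, is_array=False,
                      security_classification=1,
                      is_historized=True, is_alarm=False)
print(node_decorations(level))
```

If either flag is absent from the proto, the corresponding branch can never fire, which is why parity on these two fields matters most.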

gw-2. Stable, documented event-stream resume semantics

What's needed: the OtOpcUa driver must survive a transient gw transport drop without losing subscription state or duplicating change events. gw's StreamEventsAsync(afterWorkerSequence) already exposes resumption. Document the per-session retention window (how long does the worker buffer events the gateway hasn't acked?) and the "events were dropped, you must re-subscribe" signal. If retention is bounded by count rather than time, expose the bound in OpenSessionReply so the client can size its own buffer.
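The client-side bookkeeping this implies can be sketched as follows, in Python for brevity. `ResumeCursor` is a hypothetical helper. The sketch assumes worker sequence numbers are contiguous within a session, so a gap means events fell out of the retention window; the real "events were dropped" signal is exactly what gw-2 asks to have documented.

```python
from dataclasses import dataclass


@dataclass
class ResumeCursor:
    """Tracks the last worker sequence number the client has processed."""
    last_seq: int = 0
    resubscribe_needed: bool = False

    def accept(self, seq: int) -> bool:
        """Return True if the event should be delivered to subscribers."""
        if seq <= self.last_seq:
            # Already seen: a resume re-sent an event we processed earlier.
            return False
        if seq > self.last_seq + 1:
            # Gap (assumption: sequences are contiguous), so events were lost.
            self.resubscribe_needed = True
        self.last_seq = seq
        return True


cursor = ResumeCursor()
delivered = [s for s in (1, 2, 2, 3, 7) if cursor.accept(s)]
print(delivered, cursor.resubscribe_needed)
```

After a transport drop the driver would pass `cursor.last_seq` as `afterWorkerSequence` when it reopens the stream.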

gw-3. Reconnectable sessions

Listed under "post-v1 revisit" in gateway.md. Without it, every gw or OtOpcUa restart re-Registers, re-AddItems, re-Advises the entire address space — for a 50k-tag Galaxy that's a non-trivial cold-start. With reconnectable sessions, the driver presents its SessionId after a restart and the worker keeps its handles.

If full reconnection is too large, ship a bulk replay instead: a single RPC that takes the full subscription set and the worker performs the register/add/advise inside one round trip. We can drive it from a client-side cache rather than gw state. See gw-5 below.

gw-4. Driver-shaped subscribe primitive

MxGatewaySession already has SubscribeBulkAsync (one RPC: Register implicit + AddItem + Advise for a list of tag addresses, returning per-tag SubscribeResult). That's exactly what ISubscribable.SubscribeAsync wants. Confirm it returns enough per-tag detail to surface a partial-failure list to OPC UA monitored items (good handle, status code, error text).

If it is not already exposed, add an optional update-rate hint to SubscribeBulk that is forwarded to SetBufferedUpdateInterval, so the OPC UA publishing interval becomes a single field on the subscribe call rather than a follow-up RPC.
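A sketch of the partial-failure split the driver needs from the per-tag results, in Python for brevity. The `SubscribeResult` shape here is hypothetical (the real gw reply may differ), and treating status code 0 as "good" is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubscribeResult:
    """Hypothetical per-tag reply; field names are illustrative only."""
    tag: str
    item_handle: Optional[int]
    status_code: int          # assumption: 0 means good
    error_text: str = ""


def split_results(results):
    """Partition per-tag results into a handle map and a failure list."""
    handles, failures = {}, []
    for r in results:
        if r.status_code == 0 and r.item_handle is not None:
            handles[r.tag] = r.item_handle
        else:
            failures.append((r.tag, r.status_code, r.error_text))
    return handles, failures


replies = [
    SubscribeResult("Tank1.Level", 11, 0),
    SubscribeResult("Tank1.Missing", None, 2, "unknown tag"),
]
handles, failures = split_results(replies)
print(handles, failures)
```

The failure list is what would be surfaced to OPC UA monitored items as per-item bad status.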

gw-5. Subscription replay snapshot

Provide an RPC ReplaySubscriptionsAsync(SessionId, IEnumerable<TagAddress>) that re-establishes a list of subscriptions after a session reset and returns per-tag results. The client stores its tag list locally (the driver already has it from Discover), and the gw worker turns it into one register/add/advise sequence. This is the minimum surface we need; full "reattach to a previous session by id" (gw-3) is a richer version of the same thing.

gw-6. Transport-health stream

The gw already exposes worker / session health on its dashboard. Add a small streaming RPC StreamSessionHealth(SessionId) → stream SessionHealth so the OtOpcUa driver can surface "MXAccess transport up/down" to its IHostConnectivityProbe without faking it via probe-tag subscriptions. Today MxAccessClient.ConnectionStateChanged does this in-process; we want the same signal at the gw boundary.

gw-7. Optional .NET 10 client polish

  • Async-disposable session pattern is already there.
  • Add a typed MxValue ↔ object adapter for the seven Galaxy types OtOpcUa cares about (Boolean, Int32, Float, Double, String, DateTime, arrays of the same). Today every consumer writes its own MxValue.From<T> helpers; this shaves boilerplate from the driver.
  • Add a SubscribeWithCallback convenience wrapper that combines OpenSession + SubscribeBulk + StreamEvents and routes events through a delegate per tag. Keeps the OPC UA driver from re-implementing the fan-out / sequencer pattern.

gw-8. Auth minimums

Document API-key scoping as it applies to OtOpcUa: the server identity needs session, invoke, event, and metadata:read scopes. Provide a CLI to mint a key bound to those scopes for an OtOpcUa instance.

gw-9. Performance: bulk paths and value coalescing

  • Confirm SubscribeBulkAsync is implemented as a single MXAccess AddItem+Advise loop on the worker, not N pipe round trips. If not, fix before we drive 50k-tag Galaxies through it.
  • Expose SetBufferedUpdateInterval per session so OtOpcUa can request buffered updates at the OPC UA publishing interval and get one batched OnBufferedDataChange per tick rather than N OnDataChange events.
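To illustrate the "one batch per tick" shape of buffered updates, here is a small Python sketch. `batch_by_tick` is a hypothetical helper; it keeps every value in arrival order, since the plan does not say whether the worker keeps all values or only the latest per tag.

```python
from collections import defaultdict


def batch_by_tick(changes, interval_ms):
    """Group (timestamp_ms, tag, value) changes into one batch per interval."""
    batches = defaultdict(list)
    for ts, tag, value in changes:
        # Integer division assigns each change to its publishing tick.
        batches[ts // interval_ms].append((tag, value))
    return [batches[tick] for tick in sorted(batches)]


changes = [(0, "A", 1), (400, "B", 2), (999, "A", 3), (1000, "A", 4)]
print(batch_by_tick(changes, 1000))
```

With a 1000 ms interval the four individual changes collapse into two deliveries, which is the saving the buffered path is after.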

These can all ship in mxaccessgw independently and improve every consumer.


OtOpcUa-side improvements to land in parallel

Some are forced by removing Galaxy.Host; others are quality-of-life.

ot-1. Promote IHistorianDataSource to a server-level extension point

Today IHistorianDataSource is a Galaxy-internal abstraction in Driver.Galaxy.Host. Lift it to OtOpcUa.Core.Abstractions (or a similar home next to IDriver) and let the OPC UA HA service consume any number of registered data sources keyed by node namespace. Drivers don't own historian access; the server mounts data sources alongside drivers. This is the prerequisite that lets us move Wonderware Historian out of the Galaxy driver without losing the feature.
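A rough sketch of the lookup such an extension point needs, in Python for brevity. `HistoryRouter` here is a hypothetical stand-in: it resolves a node reference to the data source registered under the longest matching prefix, case-insensitively. The prefix strings and source names are made up for the example.

```python
class HistoryRouter:
    """Maps node-reference prefixes to registered historian data sources."""

    def __init__(self):
        self._sources = {}

    def register(self, prefix, source):
        # Store prefixes lower-cased so lookups are case-insensitive.
        self._sources[prefix.lower()] = source

    def resolve(self, node_ref):
        """Return the source with the longest matching prefix, or None."""
        ref = node_ref.lower()
        matches = [p for p in self._sources if ref.startswith(p)]
        if not matches:
            return None
        return self._sources[max(matches, key=len)]


router = HistoryRouter()
router.register("galaxy", "wonderware-historian")   # illustrative names
router.register("galaxy.area2", "area2-historian")
print(router.resolve("Galaxy.Area2.Tank1.Level"))
```

Returning `None` for an unmatched node corresponds to the no-history case, so drivers without a historian keep working.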

ot-2. Generic alarm condition state machine in the server

Move the .InAlarm/.Priority/.DescAttrName/.Acked quartet handling out of GalaxyAlarmTracker into a server-level alarm subsystem keyed off the IsAlarm=true flag drivers set during discovery. The server subscribes to the four sub-attributes itself and runs the AlarmCondition state machine. Driver only:

  • declares IsAlarm=true in DriverAttributeInfo,
  • forwards plain attribute value changes (already done by ISubscribable).

This is also a precondition for future drivers (Modbus DL205 alarm bits, S7 alarm DBs) to emit alarms without each writing their own tracker.
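The core of such a driver-agnostic state machine can be sketched in a few lines of Python. `AlarmConditionTracker` is a hypothetical, simplified model: it derives one of three states from .InAlarm and .Acked only, and does not model details such as resetting Acked when an alarm re-activates.

```python
from enum import Enum


class AlarmState(Enum):
    INACTIVE = "Inactive"
    ACTIVE = "Active"
    ACKNOWLEDGED = "Acknowledged"


class AlarmConditionTracker:
    """Derives one condition state from .InAlarm and .Acked value changes."""

    def __init__(self):
        self.in_alarm = False
        self.acked = False
        self.state = AlarmState.INACTIVE

    def on_value(self, sub_attribute, value):
        """Apply one sub-attribute change; return the new state if it changed."""
        if sub_attribute == "InAlarm":
            self.in_alarm = bool(value)
        elif sub_attribute == "Acked":
            self.acked = bool(value)
        else:
            return None  # Priority / DescAttrName do not drive transitions here
        if not self.in_alarm:
            new_state = AlarmState.INACTIVE
        elif self.acked:
            new_state = AlarmState.ACKNOWLEDGED
        else:
            new_state = AlarmState.ACTIVE
        if new_state is self.state:
            return None
        self.state = new_state
        return new_state


tracker = AlarmConditionTracker()
print(tracker.on_value("InAlarm", True))
```

Because the tracker only consumes plain value changes, any driver that can subscribe to the sub-attributes could feed it.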

ot-3. Driver capabilities trim

After ot-1 and ot-2, Driver.Galaxy no longer needs to implement:

  • IHistoryProvider (server's HA service handles it via Wonderware historian data source)
  • IAlarmHistorianWriter (server's A&E historian, or kept generic — Galaxy shouldn't own the SQLite path)
  • IAlarmSource ack route (server-level alarm subsystem writes back via the driver's IWritable.WriteAsync, which the gw already supports)

Keep:

  • IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable, IHostConnectivityProbe.

ot-4. Treat time_of_last_deploy as IRediscoverable's pump

Replace the Host-side change-detection poll with a managed GalaxyRepositoryClient.WatchDeployEventsAsync consumer in the driver. Each event raises OnRediscoveryNeeded with the new deploy time as the scopeHint. No polling code in this repo.

ot-5. Connection pool at the server, not the driver

If the redundancy pair runs two OtOpcUa instances against one gw, each instance should use a single GrpcChannel per process (already gRPC default) but its own session: one MXAccess client identity per OtOpcUa instance, not one shared session that fights over Wonderware client state. Encode the per-instance MXAccess client name in driver config. This is already partly there (OTOPCUA_GALAXY_CLIENT_NAME); make it explicit in the new driver's appsettings.json shape.


Phased implementation

Each phase is a working, mergeable slice. Keep Galaxy.Host running alongside the new driver until phase 7 — gated by a config switch Galaxy:Backend = legacy-host | mxgateway.

Phase 0 — pre-flight (mxaccessgw repo)

Ship gw-1, gw-2, gw-4, gw-9 (the parity, performance, and contract bits the plan immediately depends on). gw-3, gw-5, gw-6, gw-7 can come during or after phase 5.

Exit: local OtOpcUa dev box can MxGatewayClient.Create a client, open a session, SubscribeBulkAsync 100 tags, and observe OnDataChange events at the configured update rate.

Phase 1 — server-level historian extension point (ot-1)

  1. Extract IHistorianDataSource (and its DTOs HistorianSample, HistorianAggregateSample, HistoricalEvent) from Driver.Galaxy.Host/Backend/Historian/ into src/ZB.MOM.WW.OtOpcUa.Core/Abstractions/Historian/.
  2. Extend the OPC UA HA service to look up a registered IHistorianDataSource per namespace and call into it for HistoryRead, HistoryReadProcessed, HistoryReadAtTime, HistoryReadEvents. Drivers stop implementing IHistoryProvider directly; the server proxies.
  3. Add a no-op default registration so drivers without history keep working.

Exit: all current Galaxy history reads route through an IHistorianDataSource registered by Driver.Galaxy.Host (still legacy) without behavior change. Other drivers untouched.

Phase 2 — server-level alarm subsystem (ot-2)

  1. Add an IAlarmConditionDeclaration API on the address-space builder so discovery can flag a node as alarm-bearing and supply the four sub-attribute references.
  2. Add a hosted AlarmConditionService in the server that, on driver Discover, subscribes to the four sub-attributes via the driver's own ISubscribable, runs the state machine, and emits IAlarmSource.OnAlarmEvent itself. Acks route back through the driver's IWritable.WriteAsync to the .AckMsg attribute.
  3. Add Galaxy-specific defaults (sub-attribute naming) as a small adapter so the same service can serve future drivers with different conventions.

Exit: Galaxy alarms still work end-to-end; the tracker code that runs inside Galaxy.Host is dead but kept for the legacy-host backend path.

Phase 3 — Wonderware Historian sidecar (Driver.Historian.Wonderware)

  1. New solution project: Driver.Historian.Wonderware, .NET 4.8 x86, console app + NSSM (mirrors today's Galaxy.Host packaging exactly, minus Galaxy responsibilities).
  2. Hosts the existing HistorianDataSource, HistorianClusterEndpointPicker, HistorianHealthSnapshot code lifted from Galaxy.Host/Backend/Historian/ and exposes them over a small named-pipe protocol (or local gRPC if .NET 4.8 cost is acceptable; named pipe is simpler).
  3. Add Driver.Historian.Wonderware.Client — .NET 10 — implementing IHistorianDataSource against the sidecar.
  4. Server registers it as a data source for the Galaxy namespace.

Exit: OPC UA history reads work via the sidecar with the legacy-host backend still in place. We've decoupled history from MXAccess.

Phase 4 — new Driver.Galaxy against gw

This is the meat. New project: src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/, .NET 10, in-process. Capabilities (post ot-3): IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable, IHostConnectivityProbe.

Shape:

Driver.Galaxy/
  GalaxyDriver.cs            # IDriver root
  Browse/
    GalaxyDiscoverer.cs      # consumes GalaxyRepositoryClient.DiscoverHierarchyAsync
    DataTypeMap.cs           # mx_data_type → DriverDataType
    SecurityMap.cs           # security_classification → SecurityClassification
  Runtime/
    GalaxyMxSession.cs       # owns one MxGatewaySession; Register + map per-driver client name
    SubscriptionRegistry.cs  # tag → server/item handles; persists to memory only
    EventPump.cs             # consumes session.StreamEventsAsync, fans out to OnDataChange
    ReconnectSupervisor.cs   # gw transport drop / session-lost recovery
    DeployWatcher.cs         # GalaxyRepositoryClient.WatchDeployEventsAsync → OnRediscoveryNeeded
  Health/
    HostConnectivityForwarder.cs  # gw-6 SessionHealth → IHostConnectivityProbe
  Config/
    GalaxyDriverOptions.cs   # endpoint, ApiKey, ClientName, TLS, retry, intervals
  GalaxyDriverFactoryExtensions.cs  # AddGalaxyDriver(IServiceCollection)

Key behaviors:

  • Discovery calls GalaxyRepositoryClient.DiscoverHierarchyAsync() once at init and on every WatchDeployEvents event, then drives the address space builder. Same node naming as today (parent contained-name hierarchy + leaf attributes named tag_name.AttributeName).
  • Read: a one-off AddItem + Advise + read-after-first-callback is overkill. Instead, use Register + per-call AddItem/Read if gw exposes a synchronous read; otherwise fall back to a short-lived advise. Action item: confirm gw's read story; if absent, request a synchronous ReadAsync RPC on top of MXAccess Read (which exists in the COM API).
  • Write maps WriteRequest.Value to MxValue via gw-7 helpers and calls WriteAsync(serverHandle, itemHandle, value, userId=0). Routes WriteSecured (where SecurityClassification == SecuredWrite/Verified) to WriteSecuredAsync once exposed on MxGatewaySession.
  • Subscribe calls SubscribeBulkAsync once per ISubscribable.Subscribe call. Stores (tag → itemHandle, sid) in SubscriptionRegistry. The single EventPump consumes one StreamEventsAsync per session and fans out per sid.
  • Unsubscribe calls UnsubscribeBulkAsync and drops registry entries.
  • Reconnect — when the gRPC channel drops or StreamEvents returns, ReconnectSupervisor reopens the session and replays subscriptions via gw-5 ReplaySubscriptionsAsync. The driver flags DriverState.Degraded during recovery; the server keeps publishing last-good values with Uncertain quality.
  • Host connectivity — single synthesized host entry named after OTOPCUA_GALAXY_CLIENT_NAME driven by gw-6 SessionHealth updates (or, until gw-6 lands, by transport drops).
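The Subscribe / Unsubscribe / EventPump interplay above can be sketched as follows, in Python for brevity. Class and function names are hypothetical, and the sketch assumes each item handle belongs to exactly one subscription id (sid), which the real registry may not guarantee.

```python
from collections import defaultdict


class SubscriptionRegistry:
    """In-memory map from gw item handle to (tag, sid)."""

    def __init__(self):
        self._by_handle = {}

    def add(self, sid, handles):
        """Record the tag -> item_handle map returned by a bulk subscribe."""
        for tag, item_handle in handles.items():
            self._by_handle[item_handle] = (tag, sid)

    def remove(self, sid):
        """Drop every entry that belongs to one subscription."""
        self._by_handle = {h: entry for h, entry in self._by_handle.items()
                           if entry[1] != sid}

    def lookup(self, item_handle):
        return self._by_handle.get(item_handle)


def fan_out(events, registry):
    """Route (item_handle, value) events to per-sid lists of (tag, value)."""
    routed = defaultdict(list)
    for item_handle, value in events:
        entry = registry.lookup(item_handle)
        if entry is None:
            continue  # unsubscribed while the event was in flight
        tag, sid = entry
        routed[sid].append((tag, value))
    return dict(routed)


registry = SubscriptionRegistry()
registry.add(1, {"Tank1.Level": 11})
registry.add(2, {"Pump1.Run": 12})
print(fan_out([(11, 4.2), (12, True), (99, 0)], registry))
```

Keeping the registry in memory only means a session reset must rebuild it from the replay results rather than from disk.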

Wire into the server next to other Tier-A drivers in the AddDrivers(...) call site.

Exit: flipping Galaxy:Backend to mxgateway runs the OPC UA server end-to-end with no Galaxy.Host involvement. Live read, live write, live subscribe pass against the dev Galaxy. Historian + alarms still work via phases 1-3.

Phase 5 — parity test matrix

Reuse the existing live-Galaxy integration tests; run each scenario twice: once with Galaxy:Backend=legacy-host, once with mxgateway. Compare:

  • discovered hierarchy node count + names + datatypes,
  • subscribed publish rates (allow ±10% tolerance vs. legacy),
  • write success / status codes for each SecurityClassification,
  • alarm condition transitions (Active / Acked / Inactive) — already routed through phase 2's server-level subsystem,
  • history reads — phase 3 sidecar, identical results both backends,
  • reconnect behavior under gw kill, worker kill, network drop, ZB drop.

Document the matrix; resolve every discrepancy or explicitly accept it.
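One way to automate the publish-rate row of the matrix is sketched below in Python. `within_tolerance` and `rate_deltas` are hypothetical helpers; the rates are assumed to be updates per second per tag, measured separately on each backend.

```python
def within_tolerance(legacy_rate, new_rate, tolerance=0.10):
    """True if new_rate is within the relative tolerance of legacy_rate."""
    if legacy_rate == 0:
        return new_rate == 0
    return abs(new_rate - legacy_rate) / legacy_rate <= tolerance


def rate_deltas(legacy, candidate, tolerance=0.10):
    """Return tags missing from the candidate run or outside the tolerance."""
    deltas = {}
    for tag, legacy_rate in legacy.items():
        if tag not in candidate:
            deltas[tag] = "missing"
        elif not within_tolerance(legacy_rate, candidate[tag], tolerance):
            deltas[tag] = (legacy_rate, candidate[tag])
    return deltas


legacy = {"Tank1.Level": 10.0, "Pump1.Run": 1.0, "Valve3.Pos": 5.0}
mxgateway = {"Tank1.Level": 10.9, "Pump1.Run": 1.5}
print(rate_deltas(legacy, mxgateway))
```

Every entry in the returned dictionary is a delta that has to be resolved or explicitly accepted before the exit criterion is met.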

Exit: parity matrix has zero unexplained deltas. Performance budget agreed: e.g. ≤ 2× per-call latency vs. named-pipe baseline at the 95th percentile, and equal or better SubscribeBulk setup time.

Phase 6 — perf + hardening

  • Land gw-9 buffered-update intervals.
  • Add OpenTelemetry traces from the driver around every gw call, correlated via client_correlation_id.
  • Write soak test: 50k tags subscribed, 24h, count missed events, gw restarts, OtOpcUa restarts.
  • Tune MxGatewayClientOptions.MaxGrpcMessageBytes, retry pipeline, call timeouts based on soak results.

Exit: production-acceptable perf numbers documented in docs/Galaxy.Driver.md.

Phase 7 — retirement

  1. Default Galaxy:Backend = mxgateway everywhere (sample configs, install scripts, e2e configs).
  2. Delete src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy, src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared, and matching tests.
  3. Remove OtOpcUaGalaxyHost NSSM registration from scripts/install/Install-Services.ps1. Add a registration block for the Wonderware historian sidecar from phase 3.
  4. Remove every x86 .NET 4.8 reference, build target, and CI step that served Galaxy.Host from this repo (the phase 3 Wonderware historian sidecar remains .NET 4.8 x86); remove mxaccess_documentation.md-driven dependencies that no longer apply.
  5. Update CLAUDE.md, docs/v2/dev-environment.md, docs/ServiceHosting.md, docs/Redundancy.md to reflect the new topology.
  6. Memory housekeeping: retire project_galaxy_host_service.md and project_galaxy_host_installed.md; add a short note about the gw dependency.

Exit: git grep -i 'Galaxy\.Host' returns nothing in source.


Configuration shape (new driver)

"Drivers": {
  "Galaxy": {
    "Type": "Galaxy",
    "InstanceId": "galaxy-prod-1",
    "Gateway": {
      "Endpoint": "https://mxgw.aveva.local:5001",
      "ApiKeySecretRef": "galaxy:apiKey",        // resolved via existing secret store
      "UseTls": true,
      "CaCertificatePath": "C:\\publish\\mxgw\\ca.crt",
      "ConnectTimeoutSeconds": 10,
      "DefaultCallTimeoutSeconds": 5,
      "StreamTimeoutSeconds": 0                   // unbounded
    },
    "MxAccess": {
      "ClientName": "OtOpcUa-A",                  // unique per OtOpcUa instance
      "PublishingIntervalMs": 1000,               // hint for SetBufferedUpdateInterval
      "WriteUserId": 0
    },
    "Repository": {
      "DiscoverPageSize": 5000,
      "WatchDeployEvents": true
    },
    "Reconnect": {
      "InitialBackoffMs": 500,
      "MaxBackoffMs": 30000,
      "ReplayOnSessionLost": true
    }
  }
}

The OtOpcUa secret store already handles DPAPI-protected values for LDAP binds; reuse it for the gw API key. Never put the key in plaintext in the sample config.
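As an illustration of how the Reconnect block above might be interpreted, the Python sketch below turns InitialBackoffMs and MaxBackoffMs into a delay schedule. Doubling per attempt is an assumption; the config only names the two bounds, and a real supervisor would likely add jitter.

```python
def backoff_schedule(initial_ms, max_ms, attempts):
    """Return the delay before each reconnect attempt, capped at max_ms."""
    delays = []
    delay = initial_ms
    for _ in range(attempts):
        delays.append(min(delay, max_ms))
        delay *= 2  # assumption: exponential doubling between attempts
    return delays


# Values taken from the sample config: 500 ms initial, 30000 ms cap.
print(backoff_schedule(500, 30000, 8))
```

With the sample values the cap is reached on the seventh attempt, so a long gw outage settles at one retry every 30 seconds.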


Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| gw protocol regression breaks production | Pin gw NuGet to a contract version range; CI runs parity matrix on every gw bump; staged rollout via Galaxy:Backend flag. |
| Per-call latency regresses for chatty workloads | Land gw-9 (buffered updates) before phase 5; soak the 95p in phase 6. |
| Reconnect storm after gw restart re-registers 50k tags | Land gw-3 or gw-5 before phase 6; client-side bulk replay throttled by SubscribeBulkAsync chunk size. |
| Alarm parity gap from moving tracker server-side | Phase 2 ships before phase 4; parity matrix gates phase 7. |
| Historian sidecar adds a second .NET 4.8 x86 service | Acceptable: it's a driver-agnostic component, and it ships only where Wonderware historian access is actually needed. |
| Two OtOpcUa instances both registering as same MXAccess client | ClientName is per-instance config (ot-5); install scripts lint that the redundancy pair has distinct names. |
| Cross-machine MXAccess writes traverse plaintext gRPC | Phase 0 enforces UseTls=true for any non-loopback Endpoint; CI lints the sample configs. |
| gw API key leaked in logs | gw and MxGatewayClient already redact authorization metadata; phase 6 audit. |
| Memory leak in EventPump under high event rate | Bounded channel between StreamEventsAsync and per-sub fan-out, drop-newest with a metric counter; soak test catches. |
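The EventPump mitigation (a bounded channel that drops the newest event and counts the drop) can be sketched as below, in Python for brevity. `BoundedEventBuffer` is a hypothetical stand-in for whatever bounded channel the driver ends up using.

```python
from collections import deque


class BoundedEventBuffer:
    """Fixed-capacity FIFO that rejects new events when full (drop-newest)."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self.dropped = 0  # would be exported as a metric counter

    def offer(self, event):
        """Enqueue an event; return False and count it if the buffer is full."""
        if len(self._items) >= self._capacity:
            self.dropped += 1
            return False
        self._items.append(event)
        return True

    def take(self):
        """Dequeue the oldest event, or None when the buffer is empty."""
        return self._items.popleft() if self._items else None


buffer = BoundedEventBuffer(capacity=2)
accepted = [buffer.offer(e) for e in ("e1", "e2", "e3")]
print(accepted, buffer.dropped)
```

Memory stays bounded by `capacity` regardless of event rate, and a non-zero `dropped` counter during the soak test points at a consumer that cannot keep up.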

Cross-cutting deliverables

  • Docs: docs/v2/Galaxy.Driver.md (new), updates to docs/v2/dev-environment.md, docs/ServiceHosting.md, docs/Redundancy.md, CLAUDE.md.
  • Install scripts: scripts/install/Install-Services.ps1 removes OtOpcUaGalaxyHost, adds OtOpcUaWonderwareHistorian, no Galaxy service registration on the OtOpcUa node.
  • e2e: scripts/e2e/e2e-config.sample.json — drop OTOPCUA_GALAXY_* pipe vars, add Drivers:Galaxy:Gateway:Endpoint etc.
  • Memory: retire stale Galaxy.Host entries; add gw dependency entry, redundancy + client-name guidance.

Order-of-work summary

Phase 0 (gw repo):  gw-1, gw-2, gw-4, gw-9
Phase 1 (this):     ot-1   — historian extension point
Phase 2 (this):     ot-2   — alarm subsystem
Phase 3 (this):     Driver.Historian.Wonderware sidecar
Phase 4 (this):     Driver.Galaxy (new) behind backend flag
                     — depends on Phase 0, 1, 2
Phase 5 (this+gw):  parity matrix
                     — drives gw-3 / gw-5 / gw-6 / gw-7 if gaps surface
Phase 6 (this):     perf + hardening
Phase 7 (this):     retire Galaxy.Host / Proxy / Shared

Phases 1-3 are independent of each other and can run in parallel. Phase 4 needs all three plus Phase 0. Phase 5 requires Phase 4. Phases 6 and 7 are sequential after Phase 5.
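The dependency statement above can be checked mechanically. The Python sketch below encodes it as a graph and groups the phases into waves that can run in parallel, using the standard-library `graphlib` module; the `phaseN` labels and the `waves` helper are illustrative.

```python
from graphlib import TopologicalSorter

# Each phase maps to the set of phases it depends on.
deps = {
    "phase0": set(),
    "phase1": set(),
    "phase2": set(),
    "phase3": set(),
    "phase4": {"phase0", "phase1", "phase2", "phase3"},
    "phase5": {"phase4"},
    "phase6": {"phase5"},
    "phase7": {"phase6"},
}


def waves(graph):
    """Group nodes into successive waves whose members have no open dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


for wave in waves(deps):
    print(wave)
```

The first wave contains phases 0 through 3, and every later wave contains a single phase, matching the sequential tail of the plan.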