Joseph Doherty a7126ba953 Phase 2 — port MXAccess COM client to Galaxy.Host + MxAccessGalaxyBackend (3rd IGalaxyBackend) + live MXAccess smoke + Phase 2 exit-gate doc + adversarial review. The full Galaxy data-plane now flows through the v2 IPC topology end-to-end against live ArchestrA.MxAccess.dll, on this dev box, with 30/30 Host tests + 9/9 Proxy tests + 963/963 solution tests passing alongside the unchanged 494 v1 IntegrationTests baseline. Backend/MxAccess/Vtq is a focused port of v1's Vtq value-timestamp-quality DTO. Backend/MxAccess/IMxProxy abstracts LMXProxyServer (port of v1's IMxProxy with the same Register/Unregister/AddItem/RemoveItem/AdviseSupervisory/UnAdviseSupervisory/Write surface + OnDataChange + OnWriteComplete events); MxProxyAdapter is the concrete COM-backed implementation that does Marshal.ReleaseComObject-loop on Unregister, must be constructed on an STA thread. Backend/MxAccess/MxAccessClient is the focused port of v1's MxAccessClient partials — Connect/Disconnect/Read/Write/Subscribe/Unsubscribe through the new Sta/StaPump (the real Win32 GetMessage pump from the previous commit), ConcurrentDictionary handle tracking, OnDataChange event marshalling to per-tag callbacks, ReadAsync implemented as the canonical subscribe → first-OnDataChange → unsubscribe one-shot pattern. Galaxy.Host csproj flipped to x86 PlatformTarget + Prefer32Bit=true with the ArchestrA.MxAccess HintPath ..\..\lib\ArchestrA.MxAccess.dll reference (lib/ already contains the production DLL). Backend/MxAccessGalaxyBackend is the third IGalaxyBackend implementation (alongside StubGalaxyBackend and DbBackedGalaxyBackend): combines GalaxyRepository (Discover) with MxAccessClient (Read/Write/Subscribe), MessagePack-deserializes inbound write values, MessagePack-serializes outbound read values into ValueBytes, decodes ArrayDimension/SecurityClassification/category_id with the same v1 mapping. 
Program.cs selects between stub|db|mxaccess via OTOPCUA_GALAXY_BACKEND env var (default = mxaccess); OTOPCUA_GALAXY_ZB_CONN overrides the ZB connection string; OTOPCUA_GALAXY_CLIENT_NAME sets the Wonderware client identity; the StaPump and MxAccessClient lifecycles are tied to the server.RunAsync try/finally so a clean Ctrl+C tears down the COM proxy via Marshal.ReleaseComObject before the pump's WM_QUIT. Live MXAccess smoke tests (MxAccessLiveSmokeTests, net48 x86) — skipped when ZB unreachable or aaBootstrap not running, otherwise verify (1) MxAccessClient.ConnectAsync returns a positive LMXProxyServer handle on the StaPump, (2) MxAccessGalaxyBackend.OpenSession + Discover returns at least one gobject with attributes, (3) MxAccessGalaxyBackend.ReadValues against the first discovered attribute returns a response with the correct TagReference shape (value + quality vary by what's running, so we don't assert specific values). All 3 pass on this dev box. EndToEndIpcTests + IpcHandshakeIntegrationTests moved from Galaxy.Proxy.Tests (net10) to Galaxy.Host.Tests (net48 x86) — the previous test placement silently dropped them at xUnit discovery because Host became net48 x86 and net10 process can't load it. Rewritten to use Shared's FrameReader/FrameWriter directly instead of going through Proxy's GalaxyIpcClient (functionally equivalent — same wire protocol, framing primitives + dispatcher are the production code path verbatim). 7 IPC tests now run cleanly: Hello+heartbeat round-trip, wrong-secret rejection, OpenSession session-id assignment, Discover error-response surfacing, WriteValues per-tag bad status, Subscribe id assignment, Recycle grace window. 
Phase 2 exit-gate doc (docs/v2/implementation/exit-gate-phase-2.md) supersedes the partial-exit doc with the as-built state — Streams A/B/C complete; D/E gated only on the legacy-Host removal + parity-test rewrite cycle that fundamentally requires multi-day debug iteration; full adversarial-review section ranking 8 findings (2 high, 3 medium, 3 low) all explicitly deferred to Stream D/E or v2.1 with rationale; Stream-D removal checklist gives the next-session entry point with two policy options for the 494 v1 tests (rewrite-to-use-Proxy vs archive-and-write-smaller-v2-parity-suite). Cannot one-shot Stream D.1 in any single session because deleting OtOpcUa.Host requires the v1 IntegrationTests cycle to be retargeted first; that's the structural blocker, not "needs more code" — and the plan itself budgets 3-4 weeks for it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:23:24 -04:00

Phase 2 Exit Gate Record (2026-04-18)

Supersedes phase-2-partial-exit-evidence.md. Captures the as-built state of Phase 2 after the MXAccess COM client port + DB-backed and MXAccess-backed Galaxy backends + adversarial review.

Status: Streams A, B, and C complete. Streams D and E gated only on legacy-Host removal and the parity-test rewrite.

The Phase 2 plan exit criterion ("v1 IntegrationTests pass against v2 Galaxy.Proxy + Galaxy.Host topology byte-for-byte") still cannot be auto-validated in a single session. The blocker is no longer "the Galaxy code lift" — that's done in this session — but the structural fact that the 494 v1 IntegrationTests instantiate v1 OtOpcUa.Host classes directly. They have to be rewritten to use the IPC-fronted Proxy topology before legacy OtOpcUa.Host can be deleted, and the plan budgets that work as a multi-day debug-cycle (Task E.1).

What changed today: the MXAccess COM client now exists in Galaxy.Host with a real ArchestrA.MxAccess.dll reference, runs end-to-end against live LMXProxyServer, and 3 live COM smoke tests pass on this dev box. MxAccessGalaxyBackend (the third IGalaxyBackend implementation, alongside StubGalaxyBackend and DbBackedGalaxyBackend) combines the ported GalaxyRepository with the ported MxAccessClient so Discover / Read / Write / Subscribe all flow through one production-shape backend. Program.cs selects between the three backends via the OTOPCUA_GALAXY_BACKEND env var (default = mxaccess).
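As a rough, language-neutral illustration of the env-driven selection described above (the real logic lives in the C# Program.cs, which is not shown here), the following Python sketch resolves a backend name from OTOPCUA_GALAXY_BACKEND. The function name, the case-insensitive matching, and the fail-fast behavior on unknown values are assumptions for the example, not confirmed behavior.

```python
import os

# Backend names and the default come from the text above.
KNOWN_BACKENDS = ("stub", "db", "mxaccess")
DEFAULT_BACKEND = "mxaccess"


def select_backend(env=None):
    """Return the backend name chosen by OTOPCUA_GALAXY_BACKEND."""
    env = os.environ if env is None else env
    raw = env.get("OTOPCUA_GALAXY_BACKEND", "")
    # Assumption: matching is case-insensitive and ignores surrounding spaces.
    name = raw.strip().lower() or DEFAULT_BACKEND
    if name not in KNOWN_BACKENDS:
        # Assumption: an unknown value fails fast instead of falling back.
        raise ValueError(f"unknown OTOPCUA_GALAXY_BACKEND: {raw!r}")
    return name
```

Passing a plain dict instead of the process environment keeps the selection testable without touching global state.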

Delivered in Phase 2 (full scope, not just scaffolds)

Stream A: Driver.Galaxy.Shared (complete)

  • 9 contract files: Hello/HelloAck (version negotiation), OpenSession/CloseSession/Heartbeat, Discover + GalaxyObjectInfo + GalaxyAttributeInfo, Read/Write + GalaxyDataValue, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle.
  • Length-prefixed framing (4-byte BE length + 1-byte kind + MessagePack body) with a 16 MiB cap.
  • Thread-safe FrameWriter (semaphore-gated) and single-consumer FrameReader.
  • 6 round-trip tests + reflection-scan that asserts contracts only reference BCL + MessagePack.
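
The framing layout listed above (4-byte big-endian length, 1-byte kind, body, 16 MiB cap) can be sketched as follows. This is a Python illustration, not the project's FrameWriter/FrameReader. It assumes the length field counts only the body bytes and that the cap applies to the body; neither detail is stated in this record. The body is treated as opaque bytes, so no MessagePack library is needed.

```python
import struct

MAX_FRAME_BYTES = 16 * 1024 * 1024  # 16 MiB cap from the text
HEADER = struct.Struct(">IB")       # 4-byte big-endian length + 1-byte kind


def encode_frame(kind: int, body: bytes) -> bytes:
    """Prefix an already-serialized body with length and kind."""
    if not 0 <= kind <= 0xFF:
        raise ValueError("kind must fit in one byte")
    if len(body) > MAX_FRAME_BYTES:
        raise ValueError("frame body exceeds the 16 MiB cap")
    # Assumption: the length field counts body bytes only.
    return HEADER.pack(len(body), kind) + body


def decode_frame(buf: bytes):
    """Return (kind, body, remaining_bytes) for the first frame in buf."""
    if len(buf) < HEADER.size:
        raise ValueError("incomplete header")
    length, kind = HEADER.unpack_from(buf, 0)
    if length > MAX_FRAME_BYTES:
        raise ValueError("declared length exceeds the 16 MiB cap")
    end = HEADER.size + length
    if len(buf) < end:
        raise ValueError("incomplete body")
    return kind, buf[HEADER.size:end], buf[end:]
```

Checking the declared length before reading the body is what lets a reader reject an oversized frame without allocating for it.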

Stream B: Driver.Galaxy.Host (complete, exceeded original scope)

  • Real Win32 message pump in StaPump: GetMessage/PostThreadMessage/PeekMessage/PostQuitMessage P/Invoke, dedicated STA thread, WM_APP=0x8000 work dispatch, WM_APP+1 graceful-drain → PostQuitMessage, 5s join-on-dispose, responsiveness probe.
  • Strict PipeAcl (allow configured server SID only, deny LocalSystem + Administrators), PipeServer with caller-SID verification + per-process shared-secret Hello handshake.
  • Galaxy-specific MemoryWatchdog (warn max(1.5×baseline, +200 MB), soft-recycle max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min window).
  • RecyclePolicy (1/hr cap + 03:00 daily scheduled), PostMortemMmf (1000-entry ring buffer, hard-crash survivable, cross-process readable), MxAccessHandle : SafeHandle.
  • IGalaxyBackend interface + 3 implementations:
    • StubGalaxyBackend — keeps IPC end-to-end testable without Galaxy.
    • DbBackedGalaxyBackend — real Discover via the ported GalaxyRepository against ZB.
    • MxAccessGalaxyBackend — Discover via DB + Read/Write/Subscribe via the ported MxAccessClient over the StaPump.
  • GalaxyRepository ported from v1 (HierarchySql + AttributesSql byte-for-byte identical).
  • MxAccessClient ported from v1 (Connect/Read/Write/Subscribe/Unsubscribe + ConcurrentDictionary handle tracking + OnDataChange / OnWriteComplete event marshalling). The reconnect loop + Historian plugin loader + extended-attribute query are explicit follow-ups.
  • MxProxyAdapter + IMxProxy for COM-isolation testability.
  • Program.cs env-driven backend selection (OTOPCUA_GALAXY_BACKEND=stub|db|mxaccess, OTOPCUA_GALAXY_ZB_CONN, OTOPCUA_GALAXY_CLIENT_NAME, plus the Phase 2 baseline OTOPCUA_GALAXY_PIPE / OTOPCUA_ALLOWED_SID / OTOPCUA_GALAXY_SECRET).
  • ArchestrA.MxAccess.dll referenced via HintPath at lib/ArchestrA.MxAccess.dll. Project flipped to x86 platform target (the COM interop requires it).
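
The MemoryWatchdog thresholds in the list above can be read as simple arithmetic. The sketch below is one interpretation in Python, not the shipped C# class. It assumes "+200 MB" means baseline + 200 MB, that "1.5 GB" is 1536 MiB, that thresholds are inclusive, and that the slope is measured between the first and last sample of the window. The function names are hypothetical.

```python
MB = 1024 * 1024
HARD_CEILING = 1536 * MB  # "1.5 GB", read here as 1536 MiB (assumption)


def thresholds(baseline):
    """Return (warn, soft_recycle) byte thresholds for a given baseline."""
    # Assumption: "+200 MB" is relative to the baseline.
    warn = max(baseline * 3 // 2, baseline + 200 * MB)
    soft = max(baseline * 2, baseline + 200 * MB)
    return warn, soft


def classify(baseline, current):
    """Map a memory reading to the most severe level it has reached."""
    warn, soft = thresholds(baseline)
    if current >= HARD_CEILING:
        return "hard"
    if current >= soft:
        return "soft-recycle"
    if current >= warn:
        return "warn"
    return "ok"


def slope_mb_per_min(samples):
    """Growth rate between the first and last (minute, bytes) sample."""
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    return (b1 - b0) / MB / (t1 - t0)
```

Under this reading, a baseline below 200 MB gives identical warn and soft-recycle thresholds, so the soft-recycle check is evaluated first.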

Stream C: Driver.Galaxy.Proxy (complete)

  • GalaxyProxyDriver implements all 9 capability interfaces — IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe — each forwarding through the matching IPC contract.
  • GalaxyIpcClient with CallAsync (request/response gated through a semaphore so concurrent callers don't interleave frames) + SendOneWayAsync for fire-and-forget calls (Unsubscribe / AlarmAck / CloseSession).
  • Backoff (5s → 15s → 60s, capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual escalation; sticky alert), HeartbeatMonitor (2s cadence, 3 misses = host dead).
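
To make the supervisor numbers above concrete, here is a small Python sketch of the backoff ladder (5s → 15s → 60s, capped) and the crash-count trigger (3 crashes per 5 min opens). The class names and the injected clock value are hypothetical. The 1h → 4h → manual escalation and the sticky alert are not modeled.

```python
from collections import deque

BACKOFF_STEPS = (5, 15, 60)  # seconds, from the text; last value is the cap


class Backoff:
    """Steps through the ladder and stays at the cap until reset."""

    def __init__(self):
        self._attempt = 0

    def next_delay(self):
        index = min(self._attempt, len(BACKOFF_STEPS) - 1)
        self._attempt += 1
        return BACKOFF_STEPS[index]

    def reset(self):
        # Called after a stable run (the reset-on-stable-run rule).
        self._attempt = 0


class CrashBreaker:
    """Opens when max_crashes fall inside a sliding window."""

    def __init__(self, max_crashes=3, window_s=300):
        self._max = max_crashes
        self._window = window_s
        self._crashes = deque()

    def record_crash(self, now):
        """Record a crash at time `now` (seconds); return True if open."""
        self._crashes.append(now)
        while now - self._crashes[0] > self._window:
            self._crashes.popleft()
        return len(self._crashes) >= self._max
```

Passing the timestamp in, rather than reading a clock inside the class, keeps the window logic deterministic under test.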

Tests

  • 963 pass / 1 pre-existing baseline across the full solution.
  • New in this session:
    • StaPumpTests — pump still passes 3/3 against the real Win32 implementation
    • EndToEndIpcTests (5) — every IPC operation through Pipe + dispatcher + StubBackend
    • IpcHandshakeIntegrationTests (2) — Hello + heartbeat + secret rejection
    • GalaxyRepositoryLiveSmokeTests (5) — live SQL against ZB, skip when ZB unreachable
    • MxAccessLiveSmokeTests (3) — live COM against running aaBootstrap + LMXProxyServer
    • All new tests target net48 x86 to match Galaxy.Host

Adversarial review findings

Independent pass over the Phase 2 deltas. Findings ranked by severity; all open items are explicitly deferred to Stream D/E or v2.1 with rationale.

Critical — none.

High

  1. MxAccess ReadAsync has a subscription-leak window on cancellation. The one-shot read uses subscribe → first-OnDataChange → unsubscribe. If the caller cancels between the SubscribeOnPumpAsync await and the tcs.Task await, the subscription stays installed. Mitigation: the StaPump's idempotent unsubscribe path drops orphan subs at disconnect, but a long-running session leaks them. Fix scoped to Phase 2 follow-up alongside the proper subscription registry that v1 had.

  2. No reconnect loop on the MXAccess COM connection. v1's MxAccessClient.Monitor polled a probe tag and triggered reconnect-with-replay on disconnection. The ported client's ConnectAsync is one-shot and there's no health monitor. Mitigation: the Tier C supervisor on the Proxy side (CircuitBreaker + HeartbeatMonitor) restarts the whole Host process on liveness failure, so connection loss surfaces as a process recycle rather than silent data loss. Reconnect-without-recycle is a v2.1 refinement per driver-stability.md.
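
High finding 1 concerns the subscribe → first-OnDataChange → unsubscribe one-shot read. One way to close that kind of leak window is to pair the subscribe with an unsubscribe in a finally block, so cancellation also tears the subscription down. The sketch below shows that shape in Python's asyncio rather than the project's C#. FakeClient, read_once, and the tag name are hypothetical stand-ins, and this is not the fix the project has scoped.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for the COM-backed client."""

    def __init__(self):
        self.active = {}  # tag -> callback for live subscriptions

    async def subscribe(self, tag, callback):
        await asyncio.sleep(0)  # yield, like a hop onto the pump thread
        self.active[tag] = callback

    async def unsubscribe(self, tag):
        await asyncio.sleep(0)
        self.active.pop(tag, None)  # idempotent

    def publish(self, tag, value):
        callback = self.active.get(tag)
        if callback is not None:
            callback(value)


async def read_once(client, tag):
    """Subscribe, wait for the first value, and always unsubscribe."""
    first = asyncio.get_running_loop().create_future()

    def on_change(value):
        if not first.done():
            first.set_result(value)

    await client.subscribe(tag, on_change)
    try:
        return await first
    finally:
        # Runs on success, on error, and when the caller cancels.
        await client.unsubscribe(tag)
```

A second cancellation arriving during the unsubscribe itself would still need separate handling; the sketch only covers the window between subscribe and first value.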

Medium

  1. MxAccessGalaxyBackend.SubscribeAsync doesn't push OnDataChange frames back to the Proxy. The wire frame MessageKind.OnDataChangeNotification is defined and GalaxyProxyDriver has the RaiseDataChange internal entry point, but the Host-side push pipeline isn't wired — the subscribe registers on the COM side but the value just gets discarded. Mitigation: the SubscribeAsync handle is still useful for the ack flow, and one-shot reads work. Push plumbing is the next-session item.

  2. WriteValuesAsync doesn't await the OnWriteComplete callback. v1's implementation awaited a TCS keyed on the item handle; the port fires the write and returns success without confirming the runtime accepted it. Impact: the StatusCode in the response will be 0 (Good) for a fire-and-forget write, which is a false positive if the runtime rejects the write in the callback. Fix needs the same TCS-by-handle pattern as v1; queued.

  3. MxAccessGalaxyBackend.Discover re-queries SQL on every call. v1 cached the tree and only refreshed on the deploy-watermark change. Mitigation: AttributesSql is the slow one (~30s for a large Galaxy); first-call latency is the symptom, not data loss. Caching + IRediscoverable push is a v2.1 follow-up.
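
Medium finding 2 refers to a completion source keyed on the item handle. A minimal sketch of that pattern, with asyncio futures standing in for .NET's TaskCompletionSource, might look like the following. WriteTracker, write_and_confirm, the 5-second default timeout, and the one-in-flight-write-per-handle simplification are all assumptions for illustration and are not taken from v1.

```python
import asyncio


class WriteTracker:
    """Pending-write table keyed by item handle (hypothetical helper)."""

    def __init__(self):
        self.pending = {}

    def begin(self, handle):
        # Assumption: at most one in-flight write per handle.
        future = asyncio.get_running_loop().create_future()
        self.pending[handle] = future
        return future

    def complete(self, handle, status):
        """Called from the write-complete event with the runtime's status."""
        future = self.pending.pop(handle, None)
        if future is not None and not future.done():
            future.set_result(status)

    def discard(self, handle):
        self.pending.pop(handle, None)


async def write_and_confirm(tracker, handle, send, timeout=5.0):
    """Fire the write, then wait for the runtime's completion status."""
    future = tracker.begin(handle)
    try:
        send(handle)
        return await asyncio.wait_for(future, timeout)
    finally:
        # Drop the entry on timeout or cancellation so it cannot linger.
        tracker.discard(handle)
```

Registering the future before sending avoids a race where the completion event arrives before anyone is waiting for it.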

Low

  1. Live MXAccess test Backend_ReadValues_against_discovered_attribute_returns_a_response_shape silently passes if no readable attribute is found. Documented; the test asserts the shape not the value because some Galaxy installs are configuration-only.

  2. FrameWriter allocates the length-prefix as a 4-byte heap array per call. Could be stackalloc. Microbenchmark not done — currently irrelevant.

  3. MxProxyAdapter.Unregister swallows exceptions during Unregister(handle). v1 did the same; documented as best-effort during teardown. Consider logging the swallow.

Out of scope (correctly deferred)

  • Stream D.1 — delete legacy OtOpcUa.Host. Cannot be done in any single session because the 494 v1 IntegrationTests reference Host classes directly. Requires the test rewrite cycle in Stream E.
  • Stream E.1 — run v1 IntegrationTests against v2 topology. Requires (a) test rewrite to use Proxy/Host instead of in-process Host classes, then (b) the parity-debug iteration that the plan budgets 3-4 weeks for.
  • Stream E.2 — Client.CLI walkthrough diff. Requires the v1 baseline capture.
  • Stream E.3 — four 2026-04-13 stability findings regression tests. Requires the parity test harness from Stream E.1.
  • Wonderware Historian SDK plugin loader (Task B.1.h). HistoryRead returns a recognisable error until the plugin loader is wired.
  • Alarm subsystem wire-up (MxAccessGalaxyBackend.SubscribeAlarmsAsync is a no-op today). v1's alarm tracking is its own subtree; queued as Phase 2 follow-up.

Stream-D removal checklist (next session)

  1. Decide policy on the 494 v1 tests:
    • Option A: rewrite to use Driver.Galaxy.Proxy + Driver.Galaxy.Host topology (multi-day; full parity validation as a side effect)
    • Option B: archive them as OtOpcUa.Tests.v1Archive and write a smaller v2 parity suite against the new topology (faster; less coverage initially)
  2. Execute the chosen option.
  3. Delete src/ZB.MOM.WW.OtOpcUa.Host/, remove from .slnx.
  4. Update Windows service installer to register two services (OtOpcUa + OtOpcUaGalaxyHost) with the correct service-account SIDs.
  5. Write a migration script for appsettings.json Galaxy sections → DriverInstance.DriverConfig JSON.
  6. Open the PR, run the adversarial review, and write exit-gate-phase-2-final.md.

What ships from this session

Seven commits on phase-1-configuration since the previous push:

  • 01fd90c Phase 1 finish + Phase 2 scaffold
  • 7a5b535 Admin UI core
  • 18f93d7 LDAP + SignalR
  • a1e9ed4 AVEVA-stack inventory doc
  • 32eeeb9 Phase 2 A+B+C feature-complete
  • 549cd36 GalaxyRepository ported + DbBackedBackend + live ZB smoke
  • (this commit) MXAccess COM port + MxAccessGalaxyBackend + live MXAccess smoke + adversarial review

494/494 v1 tests still pass. No regressions.