Phase 2 Exit Gate Record (2026-04-18)
Supersedes phase-2-partial-exit-evidence.md. Captures the as-built state of Phase 2 after the MXAccess COM client port + DB-backed and MXAccess-backed Galaxy backends + adversarial review.
Status: Streams A, B, C complete. Stream D + E gated only on legacy-Host removal + parity-test rewrite.
The Phase 2 plan exit criterion ("v1 IntegrationTests pass against v2 Galaxy.Proxy + Galaxy.Host
topology byte-for-byte") still cannot be auto-validated in a single session. The blocker is no
longer "the Galaxy code lift" — that's done in this session — but the structural fact that the
494 v1 IntegrationTests instantiate v1 OtOpcUa.Host classes directly. They have to be rewritten
to use the IPC-fronted Proxy topology before legacy OtOpcUa.Host can be deleted, and the plan
budgets that work as a multi-day debug-cycle (Task E.1).
What changed today: the MXAccess COM client now exists in Galaxy.Host with a real
ArchestrA.MxAccess.dll reference, runs end-to-end against live LMXProxyServer, and 3 live
COM smoke tests pass on this dev box. MxAccessGalaxyBackend (the third
IGalaxyBackend implementation, alongside StubGalaxyBackend and DbBackedGalaxyBackend)
combines the ported GalaxyRepository with the ported MxAccessClient so Discover / Read /
Write / Subscribe all flow through one production-shape backend. Program.cs selects between
the three backends via the OTOPCUA_GALAXY_BACKEND env var (default = mxaccess).
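The env-driven selection described above can be sketched as follows. This is a minimal illustration in Python rather than the C# of Program.cs; the three classes are empty stand-ins, and rejecting unknown values with an error is an assumption, not something this record confirms.

```python
import os

# Hypothetical stand-ins for the three IGalaxyBackend implementations.
class StubGalaxyBackend: ...
class DbBackedGalaxyBackend: ...
class MxAccessGalaxyBackend: ...

BACKENDS = {
    "stub": StubGalaxyBackend,
    "db": DbBackedGalaxyBackend,
    "mxaccess": MxAccessGalaxyBackend,
}

def select_backend(env=os.environ):
    # Default mirrors the record: mxaccess when the variable is unset.
    name = env.get("OTOPCUA_GALAXY_BACKEND", "mxaccess")
    try:
        return BACKENDS[name]()
    except KeyError:
        # Assumption: unknown names fail fast instead of falling back.
        raise ValueError(f"unknown OTOPCUA_GALAXY_BACKEND: {name!r}") from None
```

Calling select_backend({}) yields the MXAccess-backed stand-in, while {"OTOPCUA_GALAXY_BACKEND": "stub"} yields the stub that keeps IPC testable without Galaxy.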
Delivered in Phase 2 (full scope, not just scaffolds)
Stream A — Driver.Galaxy.Shared (✅ complete)
- 9 contract files: Hello/HelloAck (version negotiation), OpenSession/CloseSession/Heartbeat, Discover + GalaxyObjectInfo + GalaxyAttributeInfo, Read/Write + GalaxyDataValue, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle.
- Length-prefixed framing (4-byte BE length + 1-byte kind + MessagePack body) with a 16 MiB cap.
- Thread-safe FrameWriter (semaphore-gated) and single-consumer FrameReader.
- 6 round-trip tests + reflection-scan that asserts contracts only reference BCL + MessagePack.
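The frame layout listed above (4-byte big-endian length, 1-byte kind, body, 16 MiB cap) can be sketched as below. This is an illustrative Python sketch, not the FrameWriter/FrameReader code. It assumes the length field counts the body only (excluding the kind byte) and that the cap applies to the body; the record does not state either. The body is treated as opaque bytes where the real frames carry MessagePack.

```python
import struct

MAX_FRAME_BYTES = 16 * 1024 * 1024  # 16 MiB cap from the record
HEADER = struct.Struct(">IB")       # 4-byte BE length + 1-byte kind

def encode_frame(kind: int, body: bytes) -> bytes:
    if len(body) > MAX_FRAME_BYTES:
        raise ValueError("frame body exceeds 16 MiB cap")
    # Assumption: the length field covers the body only.
    return HEADER.pack(len(body), kind) + body

def decode_frame(buf: bytes):
    """Return (kind, body, remaining), or None if buf holds no full frame."""
    if len(buf) < HEADER.size:
        return None
    length, kind = HEADER.unpack_from(buf, 0)
    if length > MAX_FRAME_BYTES:
        # Reject before allocating or waiting for an oversized body.
        raise ValueError("declared length exceeds 16 MiB cap")
    end = HEADER.size + length
    if len(buf) < end:
        return None
    return kind, buf[HEADER.size:end], buf[end:]
```

Checking the declared length before reading the body is what makes the cap useful: a corrupt or hostile header is rejected without buffering up to 4 GiB.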
Stream B — Driver.Galaxy.Host (✅ complete, exceeded original scope)
- Real Win32 message pump in StaPump: GetMessage/PostThreadMessage/PeekMessage/PostQuitMessage P/Invoke, dedicated STA thread, WM_APP=0x8000 work dispatch, WM_APP+1 graceful-drain → PostQuitMessage, 5s join-on-dispose, responsiveness probe.
- Strict PipeAcl (allow configured server SID only, deny LocalSystem + Administrators), PipeServer with caller-SID verification + per-process shared-secret Hello handshake.
- Galaxy-specific MemoryWatchdog (warn max(1.5×baseline, +200 MB), soft-recycle max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min window).
- RecyclePolicy (1/hr cap + 03:00 daily scheduled), PostMortemMmf (1000-entry ring buffer, hard-crash survivable, cross-process readable), MxAccessHandle : SafeHandle.
- IGalaxyBackend interface + 3 implementations:
  - StubGalaxyBackend: keeps IPC end-to-end testable without Galaxy.
  - DbBackedGalaxyBackend: real Discover via the ported GalaxyRepository against ZB.
  - MxAccessGalaxyBackend: Discover via DB + Read/Write/Subscribe via the ported MxAccessClient over the StaPump.
- GalaxyRepository ported from v1 (HierarchySql + AttributesSql byte-for-byte identical).
- MxAccessClient ported from v1 (Connect/Read/Write/Subscribe/Unsubscribe + ConcurrentDict handle tracking + OnDataChange / OnWriteComplete event marshalling). The reconnect loop + Historian plugin loader + extended-attribute query are explicit follow-ups.
- MxProxyAdapter + IMxProxy for COM-isolation testability.
- Program.cs env-driven backend selection (OTOPCUA_GALAXY_BACKEND=stub|db|mxaccess, OTOPCUA_GALAXY_ZB_CONN, OTOPCUA_GALAXY_CLIENT_NAME, plus the Phase 2 baseline OTOPCUA_GALAXY_PIPE / OTOPCUA_ALLOWED_SID / OTOPCUA_GALAXY_SECRET).
- ArchestrA.MxAccess.dll referenced via HintPath at lib/ArchestrA.MxAccess.dll. Project flipped to x86 platform target (the COM interop requires it).
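One possible reading of the MemoryWatchdog thresholds listed above, sketched in Python. Several points are assumptions rather than facts from this record: "+200 MB" is read as baseline + 200 MB, 1.5 GB is taken as 1536 MiB, and a slope at or above 5 MB/min is mapped to a warning. The function name and return labels are hypothetical.

```python
MB = 1024 * 1024
HARD_CEILING = 1536 * MB  # assumption: 1.5 GB taken as 1536 MiB

def classify_memory(baseline: int, current: int, slope_mb_per_min: float = 0.0) -> str:
    """Classify a working-set sample against the watchdog thresholds."""
    # Assumption: "+200 MB" means baseline + 200 MB.
    warn_at = max(1.5 * baseline, baseline + 200 * MB)
    soft_at = max(2.0 * baseline, baseline + 200 * MB)
    if current >= HARD_CEILING:
        return "hard"
    if current >= soft_at:
        return "soft-recycle"
    # Assumption: sustained growth (30-min slope) only warns.
    if current >= warn_at or slope_mb_per_min >= 5.0:
        return "warn"
    return "ok"
```

Under this reading, a baseline below 200 MB gives identical warn and soft-recycle thresholds, and a baseline above roughly 768 MB reaches the hard ceiling before the soft-recycle threshold.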
Stream C — Driver.Galaxy.Proxy (✅ complete)
- GalaxyProxyDriver implements all 9 capability interfaces (IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe), each forwarding through the matching IPC contract.
- GalaxyIpcClient with CallAsync (request/response gated through a semaphore so concurrent callers don't interleave frames) + SendOneWayAsync for fire-and-forget calls (Unsubscribe / AlarmAck / CloseSession).
- Backoff (5s → 15s → 60s, capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual escalation; sticky alert), HeartbeatMonitor (2s cadence, 3 misses = host dead).
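The supervisor timings above can be illustrated with a small Python sketch. It is not the Proxy's C# code: it assumes a sliding 5-minute crash window and one escalation step per breaker trip, and the class and method names are hypothetical.

```python
from collections import deque

class Backoff:
    STEPS = (5, 15, 60)  # seconds; the last step is the cap

    def __init__(self):
        self._attempt = 0

    def next_delay(self) -> int:
        delay = self.STEPS[min(self._attempt, len(self.STEPS) - 1)]
        self._attempt += 1
        return delay

    def reset(self):
        # Called after a stable run.
        self._attempt = 0

class CircuitBreaker:
    WINDOW = 300                   # 5 minutes, in seconds
    THRESHOLD = 3                  # crashes within WINDOW that open the breaker
    COOLDOWNS = (3600, 4 * 3600)   # 1h, then 4h, then manual

    def __init__(self):
        self._crashes = deque()
        self._trips = 0

    def record_crash(self, now: float):
        """Return None while closed, a cooldown in seconds, or "manual"."""
        self._crashes.append(now)
        # Assumption: sliding window, so crashes older than WINDOW are dropped.
        while now - self._crashes[0] > self.WINDOW:
            self._crashes.popleft()
        if len(self._crashes) < self.THRESHOLD:
            return None
        self._crashes.clear()
        self._trips += 1
        if self._trips <= len(self.COOLDOWNS):
            return self.COOLDOWNS[self._trips - 1]
        return "manual"
```

Passing the timestamp in as an argument keeps the breaker deterministic under test; a production version would read a monotonic clock.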
Tests
- 963 pass / 1 pre-existing baseline across the full solution.
- New in this session:
  - StaPumpTests: pump still passes 3/3 against the real Win32 implementation
  - EndToEndIpcTests (5): every IPC operation through Pipe + dispatcher + StubBackend
  - IpcHandshakeIntegrationTests (2): Hello + heartbeat + secret rejection
  - GalaxyRepositoryLiveSmokeTests (5): live SQL against ZB, skip when ZB unreachable
  - MxAccessLiveSmokeTests (3): live COM against running aaBootstrap + LMXProxyServer
  - All net48 x86 to match Galaxy.Host
Adversarial review findings
Independent pass over the Phase 2 deltas. Findings ranked by severity; all open items are explicitly deferred to Stream D/E or v2.1 with rationale.
Critical — none.
High
- MxAccess ReadAsync has a subscription-leak window on cancellation. The one-shot read uses subscribe → first-OnDataChange → unsubscribe. If the caller cancels between the SubscribeOnPumpAsync await and the tcs.Task await, the subscription stays installed. Mitigation: the StaPump's idempotent unsubscribe path drops orphan subs at disconnect, but a long-running session leaks them. Fix scoped to Phase 2 follow-up alongside the proper subscription registry that v1 had.
- No reconnect loop on the MXAccess COM connection. v1's MxAccessClient.Monitor polled a probe tag and triggered reconnect-with-replay on disconnection. The ported client's ConnectAsync is one-shot and there's no health monitor. Mitigation: the Tier C supervisor on the Proxy side (CircuitBreaker + HeartbeatMonitor) restarts the whole Host process on liveness failure, so connection loss surfaces as a process recycle rather than silent data loss. Reconnect-without-recycle is a v2.1 refinement per driver-stability.md.
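The ReadAsync leak window in the first finding can be illustrated with an asyncio sketch of a one-shot read that always unsubscribes once the subscription has been installed. This is a Python analogue, not the planned C# fix; FakePump and read_once are hypothetical names, and the sketch only covers cancellation that arrives after subscribe has returned.

```python
import asyncio

class FakePump:
    """Hypothetical stand-in for the StaPump-backed MXAccess client."""
    def __init__(self):
        self.active = {}          # handle -> on_change callback
        self._next_handle = 0

    async def subscribe(self, tag, on_change):
        await asyncio.sleep(0)    # simulate the hop onto the pump thread
        self._next_handle += 1
        self.active[self._next_handle] = on_change
        return self._next_handle

    async def unsubscribe(self, handle):
        await asyncio.sleep(0)
        self.active.pop(handle, None)

async def read_once(pump, tag):
    first_value = asyncio.get_running_loop().create_future()

    def on_change(value):
        if not first_value.done():
            first_value.set_result(value)

    handle = await pump.subscribe(tag, on_change)
    try:
        return await first_value
    finally:
        # Runs on success, error, and cancellation alike; shield lets the
        # unsubscribe finish even if the caller cancels a second time.
        await asyncio.shield(pump.unsubscribe(handle))
```

The key point is that the unsubscribe is tied to the installed handle rather than to the happy path, so a cancelled caller no longer leaves an orphan subscription behind.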
Medium
- MxAccessGalaxyBackend.SubscribeAsync doesn't push OnDataChange frames back to the Proxy. The wire frame MessageKind.OnDataChangeNotification is defined and GalaxyProxyDriver has the RaiseDataChange internal entry point, but the Host-side push pipeline isn't wired: the subscribe registers on the COM side but the value just gets discarded. Mitigation: the SubscribeAsync handle is still useful for the ack flow, and one-shot reads work. Push plumbing is the next-session item.
- WriteValuesAsync doesn't await the OnWriteComplete callback. v1's implementation awaited a TCS keyed on the item handle; the port fires the write and returns success without confirming the runtime accepted it. Mitigation: the StatusCode in the response will be 0 (Good) for a fire-and-forget, which is a false positive if the runtime rejects post-callback. Fix needs the same TCS-by-handle pattern as v1; queued.
- MxAccessGalaxyBackend.Discover re-queries SQL on every call. v1 cached the tree and only refreshed on the deploy-watermark change. Mitigation: AttributesSql is the slow one (~30s for a large Galaxy); first-call latency is the symptom, not data loss. Caching + IRediscoverable push is a v2.1 follow-up.
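The TCS-by-handle pattern named in the WriteValuesAsync finding can be sketched with asyncio futures. This is a Python analogue of the idea, not v1's implementation. WriteTracker, write_and_confirm, the 5-second timeout, and the one-pending-write-per-handle rule are all assumptions made for illustration.

```python
import asyncio

class WriteTracker:
    """Pending write completions keyed by item handle (hypothetical helper)."""
    def __init__(self):
        self._pending = {}

    def begin(self, handle):
        # Assumption: at most one in-flight write per handle.
        fut = asyncio.get_running_loop().create_future()
        self._pending[handle] = fut
        return fut

    def on_write_complete(self, handle, status):
        # Invoked from the (marshalled) OnWriteComplete callback.
        fut = self._pending.pop(handle, None)
        if fut is not None and not fut.done():
            fut.set_result(status)

    def discard(self, handle):
        self._pending.pop(handle, None)

    def pending_count(self):
        return len(self._pending)

async def write_and_confirm(tracker, handle, send_write, timeout=5.0):
    fut = tracker.begin(handle)   # register before sending to avoid a race
    try:
        send_write(handle)
        # Return the status reported by the runtime, not an assumed Good.
        return await asyncio.wait_for(fut, timeout)
    finally:
        tracker.discard(handle)   # no stale entries after timeout or cancel
```

Registering the future before issuing the write matters: if the completion callback fires immediately, it still finds an entry to complete.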
Low
- Live MXAccess test Backend_ReadValues_against_discovered_attribute_returns_a_response_shape silently passes if no readable attribute is found. Documented; the test asserts the shape not the value because some Galaxy installs are configuration-only.
- FrameWriter allocates the length-prefix as a 4-byte heap array per call. Could be stackalloc. Microbenchmark not done; currently irrelevant.
- MxProxyAdapter.Unregister swallows exceptions during Unregister(handle). v1 did the same; documented as best-effort during teardown. Consider logging the swallow.
Out of scope (correctly deferred)
- Stream D.1: delete legacy OtOpcUa.Host. Cannot be done in any single session because the 494 v1 IntegrationTests reference Host classes directly. Requires the test rewrite cycle in Stream E.
- Stream E.1: run v1 IntegrationTests against v2 topology. Requires (a) test rewrite to use Proxy/Host instead of in-process Host classes, then (b) the parity-debug iteration that the plan budgets 3-4 weeks for.
- Stream E.2: Client.CLI walkthrough diff. Requires the v1 baseline capture.
- Stream E.3: four 2026-04-13 stability findings regression tests. Requires the parity test harness from Stream E.1.
- Wonderware Historian SDK plugin loader (Task B.1.h). HistoryRead returns a recognisable error until the plugin loader is wired.
- Alarm subsystem wire-up (MxAccessGalaxyBackend.SubscribeAlarmsAsync is a no-op today). v1's alarm tracking is its own subtree; queued as Phase 2 follow-up.
Stream-D removal checklist (next session)
- Decide policy on the 494 v1 tests:
  - Option A: rewrite to use Driver.Galaxy.Proxy + Driver.Galaxy.Host topology (multi-day; full parity validation as a side effect)
  - Option B: archive them as OtOpcUa.Tests.v1Archive and write a smaller v2 parity suite against the new topology (faster; less coverage initially)
- Execute the chosen option.
- Delete src/ZB.MOM.WW.OtOpcUa.Host/, remove from .slnx.
- Update Windows service installer to register two services (OtOpcUa + OtOpcUaGalaxyHost) with the correct service-account SIDs.
- Migration script for appsettings.json Galaxy sections → DriverInstance.DriverConfig JSON.
- PR + adversarial review + exit-gate-phase-2-final.md.
What ships from this session
Eight commits on phase-1-configuration since the previous push:
- 01fd90c Phase 1 finish + Phase 2 scaffold
- 7a5b535 Admin UI core
- 18f93d7 LDAP + SignalR
- a1e9ed4 AVEVA-stack inventory doc
- 32eeeb9 Phase 2 A+B+C feature-complete
- 549cd36 GalaxyRepository ported + DbBackedBackend + live ZB smoke
- (this commit) MXAccess COM port + MxAccessGalaxyBackend + live MXAccess smoke + adversarial review
494/494 v1 tests still pass. No regressions.