Phase 2 — Partial Exit Evidence (2026-04-17)
This records what Phase 2 of v2 completed in the current session and what was explicitly deferred. See
`phase-2-galaxy-out-of-process.md` for the full task plan; this is the as-built delta.
Status: Streams A + B + C scaffolded and test-green. Streams D + E deferred — runtime now in place.
The goal per the plan is "parity, not regression" — the phase exit gate requires v1
IntegrationTests to pass against the v2 Galaxy.Proxy + Galaxy.Host topology byte-for-byte.
Achieving that requires live MXAccess runtime plus the Galaxy code lift out of the legacy
OtOpcUa.Host. Without that cycle, deleting the legacy Host would break the 494 passing v1
tests that are the parity baseline.
Update 2026-04-17: runtime confirmed local. The dev box has the full AVEVA stack required for the LmxOpcUa breakout: 27 ArchestrA / Wonderware / AVEVA services running, including `aaBootstrap`, `aaGR` (Galaxy Repository), `aaLogger`, `aaUserValidator`, `aaPim`, `ArchestrADataStore`, `AsbServiceManager`; the full Historian set (`aahClientAccessPoint`, `aahGateway`, `aahInSight`, `aahSearchIndexer`, `InSQLStorage`, `InSQLConfiguration`, `InSQLEventSystem`, `InSQLIndexing`, `InSQLIOServer`, `HistorianSearch-x64`); SuiteLink (`slssvc`); MXAccess COM at `C:\Program Files (x86)\ArchestrA\Framework\bin\ArchestrA.MXAccess.dll`; and the OI-Gateway install at `C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\` (so the AppServer-via-OI-Gateway smoke test from decision #142 is also runnable here, not blocked on a dedicated AVEVA test box). The "needs a dev Galaxy" prerequisite is therefore satisfied. Stream D + E can start whenever the team is ready to take the parity-cycle hit on the 494 v1 tests; no environmental blocker remains.
What is done: all scaffolding, IPC contracts, supervisor logic, and stability protections needed to hang the real MXAccess code onto. Every piece has unit-level or IPC-level test coverage.
Delivered
Stream A — Driver.Galaxy.Shared (1 week estimate, complete)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` (.NET Standard 2.0, MessagePack-only dependency)
- Contracts: `Hello`/`HelloAck` (version negotiation per Task A.3), `OpenSessionRequest`/`OpenSessionResponse`/`CloseSessionRequest`, `Heartbeat`/`HeartbeatAck`, `ErrorResponse`, `DiscoverHierarchyRequest`/`Response` + `GalaxyObjectInfo` + `GalaxyAttributeInfo`, `ReadValuesRequest`/`Response`, `WriteValuesRequest`/`Response`, `SubscribeRequest`/`Response`/`UnsubscribeRequest`/`OnDataChangeNotification`, `AlarmSubscribeRequest`/`GalaxyAlarmEvent`/`AlarmAckRequest`, `HistoryReadRequest`/`Response` + `HistoryTagValues`, `HostConnectivityStatus` + `RuntimeStatusChangeNotification`, `RecycleHostRequest`/`RecycleStatusResponse`
- Framing: length-prefixed (decision #28) + 1-byte kind tag + MessagePack body. 16 MiB body cap. `FrameWriter`/`FrameReader` with thread-safe write gate.
- Tests (6): reflection-scan round-trip for every `[MessagePackObject]`, referenced-assemblies guard (only MessagePack allowed outside BCL), Hello version defaults, `FrameWriter`↔`FrameReader` interop, oversize-frame rejection.
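
The framing rule above (length prefix, 1-byte kind tag, body, 16 MiB cap) can be sketched in a few lines. This is an illustrative Python sketch, not the project's C# `FrameWriter`/`FrameReader`. It assumes a 4-byte big-endian length prefix that counts the body only (the actual width and byte order are set by decision #28 and are not stated here), and it treats the MessagePack body as opaque bytes. The function names are hypothetical.

```python
import struct

MAX_BODY_BYTES = 16 * 1024 * 1024  # 16 MiB body cap from the notes above

# Assumption: 4-byte big-endian body length, followed by the 1-byte kind tag.
_HEADER = struct.Struct(">IB")


def encode_frame(kind: int, body: bytes) -> bytes:
    """Build one frame: length prefix, kind tag, then the (opaque) body."""
    if not 0 <= kind <= 0xFF:
        raise ValueError("kind tag must fit in one byte")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("body exceeds the 16 MiB cap")
    return _HEADER.pack(len(body), kind) + body


def decode_frame(buf: bytes) -> tuple[int, bytes, bytes]:
    """Parse one frame from buf and return (kind, body, remaining bytes)."""
    if len(buf) < _HEADER.size:
        raise ValueError("truncated header")
    length, kind = _HEADER.unpack_from(buf)
    if length > MAX_BODY_BYTES:
        # Reject on the declared length, before touching the body.
        raise ValueError("oversize frame rejected")
    end = _HEADER.size + length
    if len(buf) < end:
        raise ValueError("truncated body")
    return kind, buf[_HEADER.size:end], buf[end:]
```

Checking the declared length before reading the body is what makes oversize-frame rejection cheap: a hostile or corrupt peer cannot force a large allocation just by sending a header.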
Stream B — Driver.Galaxy.Host (3–4 week estimate, scaffold complete; MXAccess lift deferred)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` (.NET Framework 4.8 AnyCPU; flips to x86 when the Galaxy code lift happens per Task B.1 scope)
- `Ipc/PipeAcl`: builds the strict `PipeSecurity`: allow configured server-principal SID, explicit deny on LocalSystem + Administrators, owner = allowed SID (decision #76).
- `Ipc/PipeServer`: named-pipe server that (1) enforces the ACL, (2) verifies caller SID via `pipe.RunAsClient` + `WindowsIdentity.GetCurrent`, (3) requires the per-process shared secret in the Hello frame before any other RPC, (4) rejects major-version mismatches.
- `Stability/MemoryWatchdog`: Galaxy thresholds: warn at `max(1.5×baseline, +200 MB)`, soft-recycle at `max(2×baseline, +200 MB)`, hard ceiling 1.5 GB, slope ≥5 MB/min over 30 min. Pluggable RSS source for unit testability.
- `Stability/RecyclePolicy`: 1-recycle/hr cap; 03:00 local daily scheduled recycle.
- `Stability/PostMortemMmf`: ring buffer of 1000 × 256-byte entries in `%ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf`. Single-writer / multi-reader. Survives hard crash; supervisor reads the MMF via a second process.
- `Sta/MxAccessHandle`: `SafeHandle` subclass. `ReleaseHandle` calls `Marshal.ReleaseComObject` in a loop until refcount = 0, then invokes the optional `unregister` callback. Finalizer-safe. Wraps any RCW via `object` so we can unit-test against a mock; the real wiring to `ArchestrA.MxAccess.LMXProxyServer` lands with the deferred code move.
- `Sta/StaPump`: dedicated STA thread with `BlockingCollection` work queue + `InvokeAsync` dispatch. Responsiveness probe (`IsResponsiveAsync`) returns false on wedge. The real Win32 `GetMessage`/`DispatchMessage` pump from v1 `LmxProxy.Host` slots in here with the same dispatch semantics.
- `IsExternalInit` shim: required for `init` setters on .NET 4.8.
- `Program.cs`: reads `OTOPCUA_GALAXY_PIPE`, `OTOPCUA_ALLOWED_SID`, `OTOPCUA_GALAXY_SECRET` from env (supervisor sets at spawn), runs the pipe server, logs via Serilog to `%ProgramData%\OtOpcUa\galaxy-host-YYYY-MM-DD.log`.
- `Ipc/StubFrameHandler`: placeholder that heartbeat-acks and returns `not-implemented` errors. Swapped for the real Galaxy-backed handler when the MXAccess code move completes.
- Tests (15): `MemoryWatchdog` thresholds + slope detection; `RecyclePolicy` cap + daily schedule; `PostMortemMmf` round-trip + ring-wrap + truncation-safety; `StaPump` apartment-state + responsiveness-probe wedge detection.
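
To make the `MemoryWatchdog` numbers concrete, here is a small Python sketch of the threshold arithmetic. It is not the C# implementation. It reads `+200 MB` as "baseline + 200 MB", which is an interpretation rather than something the notes state, and it reports the slope check as a separate flag because the notes do not say which action the slope triggers. `Verdict` and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass

MB = 1024 * 1024
HARD_CEILING = 1536 * MB           # 1.5 GB hard ceiling
SLOPE_LIMIT_MB_PER_MIN = 5.0       # slope threshold
SLOPE_WINDOW_MIN = 30.0            # measured over 30 minutes


@dataclass
class Verdict:
    level: str            # "ok", "warn", "soft-recycle" or "hard-ceiling"
    leak_suspected: bool  # True when growth over the window meets the slope limit


def evaluate(baseline: int, samples: list[tuple[float, int]]) -> Verdict:
    """samples: non-empty (minute, rss_bytes) pairs, oldest first."""
    now_min, rss = samples[-1]
    # Assumption: "+200 MB" means baseline + 200 MB.
    warn_at = max(1.5 * baseline, baseline + 200 * MB)
    soft_at = max(2.0 * baseline, baseline + 200 * MB)
    if rss >= HARD_CEILING:
        level = "hard-ceiling"
    elif rss >= soft_at:
        level = "soft-recycle"
    elif rss >= warn_at:
        level = "warn"
    else:
        level = "ok"
    # Slope: compare the newest sample with the oldest one inside the window.
    window = [s for s in samples if now_min - s[0] <= SLOPE_WINDOW_MIN]
    t0, rss0 = window[0]
    elapsed = now_min - t0
    leak = (elapsed >= SLOPE_WINDOW_MIN
            and (rss - rss0) / MB / elapsed >= SLOPE_LIMIT_MB_PER_MIN)
    return Verdict(level, leak)
```

Under this reading, a small baseline (below 200 MB) makes the warn and soft-recycle thresholds coincide at baseline + 200 MB, so the absolute floor dominates until the process is large enough for the multipliers to matter.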
Stream C — Driver.Galaxy.Proxy (1.5 week estimate, complete as IPC-forwarder)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` (.NET 10)
- `Ipc/GalaxyIpcClient`: Hello handshake + shared-secret authentication + single-call request/response over the data-plane pipe. Serializes concurrent callers via `SemaphoreSlim`. Lifts `ErrorResponse` to `GalaxyIpcException` with the error code.
- `GalaxyProxyDriver`: implements `IDriver` + `ITagDiscovery`. Forwards lifecycle and discovery over IPC; maps Galaxy MX data types → `DriverDataType` and security classifications → `SecurityClassification`. Stream C-plan capability interfaces for `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`, `IRediscoverable` are structured identically; wire them in when the Host's MXAccess backend exists so the round-trips can actually serve data.
- `Supervisor/Backoff`: 5s → 15s → 60s capped; `RecordStableRun` resets after 2-min successful run.
- `Supervisor/CircuitBreaker`: 3 crashes per 5 min opens; cooldown escalates 1h → 4h → manual (`TimeSpan.MaxValue`). Sticky alert doesn't auto-clear when cooldown elapses; `ManualReset` only.
- `Supervisor/HeartbeatMonitor`: 2s cadence, 3 consecutive misses = host dead.
- Tests (11): `Backoff` sequence + reset; `CircuitBreaker` full 1h/4h/manual escalation path; `HeartbeatMonitor` miss-count + ack-reset; full IPC handshake round-trip (Host + Proxy over a real named pipe, heartbeat ack verified; shared-secret mismatch rejected with `UnauthorizedAccessException`).
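
The supervisor policies above are easiest to follow as a small state machine. The Python sketch below mirrors the stated numbers (5s → 15s → 60s backoff, 2-minute stable-run reset, 3 crashes in 5 minutes, 1h → 4h → manual cooldown, sticky alert). It is not the C# `Supervisor/*` code: the method names are hypothetical, time is passed in explicitly as seconds, and resetting the escalation level on manual reset is an assumption.

```python
BACKOFF_STEPS_S = (5, 15, 60)   # capped at the last step
STABLE_RUN_S = 120              # a 2-minute successful run resets the backoff


class Backoff:
    def __init__(self) -> None:
        self._attempt = 0

    def next_delay(self) -> int:
        delay = BACKOFF_STEPS_S[min(self._attempt, len(BACKOFF_STEPS_S) - 1)]
        self._attempt += 1
        return delay

    def record_stable_run(self, uptime_s: float) -> None:
        if uptime_s >= STABLE_RUN_S:
            self._attempt = 0


class CircuitBreaker:
    WINDOW_S = 300                        # 5-minute crash window
    MAX_CRASHES = 3
    COOLDOWNS_S = (3600, 4 * 3600, None)  # None = manual reset only

    def __init__(self) -> None:
        self._crashes: list[float] = []
        self._trips = 0
        self._open = False
        self._reopen_at: float | None = None
        self.alert = False                # sticky: cleared only by manual_reset

    def record_crash(self, now_s: float) -> None:
        recent = [t for t in self._crashes if now_s - t < self.WINDOW_S]
        self._crashes = recent + [now_s]
        if len(self._crashes) >= self.MAX_CRASHES:
            cooldown = self.COOLDOWNS_S[min(self._trips, len(self.COOLDOWNS_S) - 1)]
            self._trips += 1
            self._crashes.clear()
            self._open = True
            self.alert = True
            self._reopen_at = None if cooldown is None else now_s + cooldown

    def may_start(self, now_s: float) -> bool:
        if not self._open:
            return True
        if self._reopen_at is not None and now_s >= self._reopen_at:
            self._open = False            # cooldown elapsed; the alert stays set
            return True
        return False

    def manual_reset(self) -> None:
        # Assumption: a manual reset also resets the escalation level.
        self._open = False
        self._reopen_at = None
        self._trips = 0
        self.alert = False
```

Keeping the alert separate from the open/closed state is what lets the breaker close again after a cooldown while still leaving evidence for an operator that it tripped.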
Deferred (explicitly noted as TODO)
Stream D — Retire legacy OtOpcUa.Host
Not executable until Stream E parity passes. Deleting the legacy project now would break the 494 v1 IntegrationTests that are the parity baseline. Recovery requires:
- Host MXAccess code lift (Task B.1 "move Galaxy code") from `OtOpcUa.Host/` into `OtOpcUa.Driver.Galaxy.Host/`: STA pump wiring, `MxAccessHandle` backing the real `LMXProxyServer`, `GalaxyRepository` and its SQL queries, `GalaxyRuntimeProbeManager`, Historian loader, the Ipc stub handler replaced with a real `IFrameHandler` that invokes the handle.
- Address-space build via `IAddressSpaceBuilder` produces byte-equivalent OPC UA browse output to v1 (Task C.4).
- Windows service installer registers two services (`OtOpcUa` + `OtOpcUaGalaxyHost`) with the correct service-account SIDs and per-process secret provisioning. Galaxy.Host starts before OtOpcUa.
- `appsettings.json` Galaxy config (MxAccess / Galaxy / Historian sections) migrated into `DriverInstance.DriverConfig` JSON in the Configuration DB via an idempotent migration script. Post-migration, the local `appsettings.json` keeps only `Cluster.NodeId`, `ClusterId`, and the DB conn string per decision #18.
Stream E — Parity validation
Requires live MXAccess + Galaxy runtime and the above lift complete. Work items:
- Run v1 IntegrationTests against the v2 Galaxy.Proxy + Galaxy.Host topology. Pass count = v1 baseline; failures = 0. Per-test duration regression report flags any test >2× baseline.
- Scripted Client.CLI walkthrough recorded at Phase 2 entry gate against v1, replayed against v2; diff must show only timestamp/latency differences.
- Regression tests for the four 2026-04-13 stability findings (phantom probe, cross-host quality clear, sync-over-async guard, fire-and-forget alarm drain).
- `/codex:adversarial-review --base v2` on the merged Phase 2 diff; findings closed or deferred with rationale.
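
The first Stream E work item combines two checks: an exact pass-count gate and a per-test duration report. A minimal Python sketch of both follows, assuming test results are available as name → duration (seconds) mappings. The function names and the input shape are hypothetical; the real report would be built from whatever the test runner emits.

```python
def parity_ok(baseline_pass_count: int, v2_pass_count: int, v2_fail_count: int) -> bool:
    """Gate from the work item: pass count equals the v1 baseline and no failures."""
    return v2_fail_count == 0 and v2_pass_count == baseline_pass_count


def duration_regressions(
    baseline_s: dict[str, float],
    current_s: dict[str, float],
    factor: float = 2.0,
) -> list[tuple[str, float, float]]:
    """Return (test, baseline, current) for tests slower than factor x baseline."""
    flagged = []
    for name, base in sorted(baseline_s.items()):
        cur = current_s.get(name)
        # Tests missing from the v2 run are a pass-count problem, not a timing one.
        if cur is not None and base > 0 and cur > factor * base:
            flagged.append((name, base, cur))
    return flagged
```

For example, with a baseline of `{"ReadTag": 0.5}` and a v2 run of `{"ReadTag": 1.2}`, the report flags `ReadTag` because 1.2 s exceeds 2 × 0.5 s. A run of exactly 1.0 s would not be flagged, since the rule is strictly greater than 2× baseline.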
Also deferred from Stream B
- Task B.10 FaultShim (test-only `ArchestrA.MxAccess` substitute for fault injection). Needs the production `ArchestrA.MxAccess` reference in place first; flagged as part of the plan's "mid-gate review" fallback (Risk row 7).
- Task B.8 WM_QUIT hard-exit escalation: wired in when the real Win32 pump replaces the `BlockingCollection` dispatcher. The `StaPump.IsResponsiveAsync` probe already exists; the supervisor escalation to `Environment.Exit(2)` belongs to the Program main loop after the pump integration.
Cross-session impact on the build
- Full solution: 926 tests pass, 1 fails (pre-existing Phase 0 baseline `Client.CLI.Tests.SubscribeCommandTests.Execute_PrintsSubscriptionMessage`: not a Phase 2 regression; it was red before Phase 1 and stays red through Phase 2).
- New projects added to `.slnx`: `Driver.Galaxy.Shared`, `Driver.Galaxy.Host`, `Driver.Galaxy.Proxy`, plus the three matching test projects.
- No existing tests broke. The 494 v1 `OtOpcUa.Tests` (net48) and 6 `IntegrationTests` (net48) still pass because the legacy `OtOpcUa.Host` is untouched.
Next-session checklist for Stream D + E
- Verify the local AVEVA stack is still green (`Get-Service aaGR, aaBootstrap, slssvc` → Running) and the Galaxy `ZB` repository is reachable from `sqlcmd -S localhost -d ZB -E`. The runtime is already on this machine; no install step needed.
- Capture Client.CLI walkthrough baseline against v1 (the parity reference).
- Move Galaxy-specific files from `OtOpcUa.Host` into `Driver.Galaxy.Host`, renaming namespaces. Replace `StubFrameHandler` with the real one.
- Wire up the real Win32 pump inside `StaPump` (lift from scadalink-design's `LmxProxy.Host` reference per CLAUDE.md).
- Run v1 IntegrationTests against the v2 topology; iterate on parity defects until green.
- Run Client.CLI walkthrough and diff.
- Regression tests for the four 2026-04-13 stability findings.
- Delete legacy `OtOpcUa.Host`; update `.slnx`; update installer scripts.
- Optional but valuable now that the runtime is local: AppServer-via-OI-Gateway smoke test (decision #142 / Phase 1 Task E.10). The OI-Gateway install at `C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\` is in place; the test was deferred for "needs live AVEVA runtime" reasons that no longer apply on this dev box.
- Adversarial review; `exit-gate-phase-2.md` recorded; PR merged.
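
For the "run Client.CLI walkthrough and diff" step, the parity rule is that only timestamp and latency differences are acceptable. One way to check that mechanically is to mask those fields before diffing, as in the Python sketch below. The transcript format is an assumption: it presumes ISO-8601-style timestamps and latencies written as a number followed by `ms`, and the sample lines in the usage note are invented for illustration. The real Client.CLI output may need different patterns.

```python
import difflib
import re

# Assumed token shapes; adjust to the actual Client.CLI transcript format.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
_LATENCY = re.compile(r"\b\d+(?:\.\d+)?\s?ms\b")


def normalize(line: str) -> str:
    """Mask the fields that are allowed to differ between v1 and v2."""
    line = _TIMESTAMP.sub("<ts>", line)
    return _LATENCY.sub("<latency>", line)


def substantive_diff(v1_lines: list[str], v2_lines: list[str]) -> list[str]:
    """Unified diff of the masked transcripts; an empty list means parity."""
    a = [normalize(line) for line in v1_lines]
    b = [normalize(line) for line in v2_lines]
    return list(difflib.unified_diff(a, b, "v1", "v2", lineterm=""))
```

With hypothetical transcript lines such as `2026-04-17T10:00:00Z read Tank1.Level = 42 (12 ms)` for v1 and `2026-04-17T11:30:05Z read Tank1.Level = 42 (31 ms)` for v2, the masked lines are identical and the diff is empty. A changed value (say `43`) survives the masking and shows up as a parity defect.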