Phase 2 — Partial Exit Evidence (2026-04-17)
This records what Phase 2 of v2 completed in the current session and what was explicitly deferred. See
phase-2-galaxy-out-of-process.md for the full task plan; this is the as-built delta.
Status: Streams A + B + C scaffolded and test-green. Streams D + E deferred.
The goal per the plan is "parity, not regression" — the phase exit gate requires v1
IntegrationTests to pass against the v2 Galaxy.Proxy + Galaxy.Host topology byte-for-byte.
Achieving that requires live MXAccess runtime plus the Galaxy code lift out of the legacy
OtOpcUa.Host. Both are operations that need a dev Galaxy up and a parity test cycle to verify.
Without that cycle, deleting the legacy Host would break the 494 passing v1 tests that are the
parity baseline.
What is done: all the scaffolding, IPC contracts, supervisor logic, and stability protections that the real MXAccess code will attach to. Every piece has unit-level or IPC-level test coverage.
Delivered
Stream A — Driver.Galaxy.Shared (1 week estimate, complete)
- src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ (.NET Standard 2.0, MessagePack-only dependency)
- Contracts: Hello/HelloAck (version negotiation per Task A.3), OpenSessionRequest/OpenSessionResponse/CloseSessionRequest, Heartbeat/HeartbeatAck, ErrorResponse, DiscoverHierarchyRequest/Response + GalaxyObjectInfo + GalaxyAttributeInfo, ReadValuesRequest/Response, WriteValuesRequest/Response, SubscribeRequest/Response/UnsubscribeRequest/OnDataChangeNotification, AlarmSubscribeRequest/GalaxyAlarmEvent/AlarmAckRequest, HistoryReadRequest/Response + HistoryTagValues, HostConnectivityStatus + RuntimeStatusChangeNotification, RecycleHostRequest/RecycleStatusResponse
- Framing: length-prefixed (decision #28) + 1-byte kind tag + MessagePack body. 16 MiB body cap. FrameWriter/FrameReader with thread-safe write gate.
- Tests (6): reflection-scan round-trip for every [MessagePackObject], referenced-assemblies guard (only MessagePack allowed outside BCL), Hello version defaults, FrameWriter↔FrameReader interop, oversize-frame rejection.
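
The framing described above (length prefix, 1-byte kind tag, body, 16 MiB cap) can be sketched as follows. This is an illustration in Python, not the project's C# FrameWriter/FrameReader. The 4-byte big-endian prefix, the choice that the length counts only the body, and the names encode_frame, decode_frame, and _read_exact are all assumptions made for the example; the body is treated as opaque bytes standing in for a MessagePack payload.

```python
import io
import struct

MAX_BODY_BYTES = 16 * 1024 * 1024  # 16 MiB body cap from the notes above


def encode_frame(kind: int, body: bytes) -> bytes:
    """Build one frame: length prefix, 1-byte kind tag, then the body."""
    if not 0 <= kind <= 0xFF:
        raise ValueError("kind tag must fit in one byte")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("body exceeds the 16 MiB cap")
    # Assumed layout: 4-byte big-endian body length, then the kind tag.
    return struct.pack(">IB", len(body), kind) + body


def _read_exact(stream, count: int) -> bytes:
    """Read exactly `count` bytes, raising if the stream ends early."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream closed mid-frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def decode_frame(stream) -> tuple[int, bytes]:
    """Read one frame, rejecting an oversize length before reading the body."""
    length, kind = struct.unpack(">IB", _read_exact(stream, 5))
    if length > MAX_BODY_BYTES:
        raise ValueError("oversize frame rejected")
    return kind, _read_exact(stream, length)
```

Checking the declared length before reading the body is what lets a reader reject an oversize frame without allocating for it, which is the behaviour the oversize-frame rejection test exercises.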
Stream B — Driver.Galaxy.Host (3–4 week estimate, scaffold complete; MXAccess lift deferred)
- src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ (.NET Framework 4.8 AnyCPU; flips to x86 when the Galaxy code lift happens per Task B.1 scope)
- Ipc/PipeAcl: builds the strict PipeSecurity: allow configured server-principal SID, explicit deny on LocalSystem + Administrators, owner = allowed SID (decision #76).
- Ipc/PipeServer: named-pipe server that (1) enforces the ACL, (2) verifies caller SID via pipe.RunAsClient + WindowsIdentity.GetCurrent, (3) requires the per-process shared secret in the Hello frame before any other RPC, (4) rejects major-version mismatches.
- Stability/MemoryWatchdog: Galaxy thresholds: warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30 min. Pluggable RSS source for unit testability.
- Stability/RecyclePolicy: 1-recycle/hr cap; 03:00 local daily scheduled recycle.
- Stability/PostMortemMmf: ring buffer of 1000 × 256-byte entries in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf. Single-writer / multi-reader. Survives hard crash; supervisor reads the MMF via a second process.
- Sta/MxAccessHandle: SafeHandle subclass. ReleaseHandle calls Marshal.ReleaseComObject in a loop until refcount = 0, then invokes the optional unregister callback. Finalizer-safe. Wraps any RCW via object so we can unit-test against a mock; the real wiring to ArchestrA.MxAccess.LMXProxyServer lands with the deferred code move.
- Sta/StaPump: dedicated STA thread with BlockingCollection work queue + InvokeAsync dispatch. Responsiveness probe (IsResponsiveAsync) returns false on wedge. The real Win32 GetMessage/DispatchMessage pump from v1 LmxProxy.Host slots in here with the same dispatch semantics.
- IsExternalInit shim: required for init setters on .NET 4.8.
- Program.cs: reads OTOPCUA_GALAXY_PIPE, OTOPCUA_ALLOWED_SID, OTOPCUA_GALAXY_SECRET from env (supervisor sets at spawn), runs the pipe server, logs via Serilog to %ProgramData%\OtOpcUa\galaxy-host-YYYY-MM-DD.log.
- Ipc/StubFrameHandler: placeholder that heartbeat-acks and returns not-implemented errors. Swapped for the real Galaxy-backed handler when the MXAccess code move completes.
- Tests (15): MemoryWatchdog thresholds + slope detection; RecyclePolicy cap + daily schedule; PostMortemMmf round-trip + ring-wrap + truncation-safety; StaPump apartment-state + responsiveness-probe wedge detection.
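
The MemoryWatchdog thresholds listed above can be made concrete with a small sketch. It is written in Python for brevity and is not the Host's implementation. Several points are assumptions: "+200 MB" is read as baseline + 200 MB, thresholds are treated as inclusive, the slope is an average between the newest sample and the newest sample at least 30 minutes older, and the names classify_rss and slope_exceeded are hypothetical.

```python
MB = 1024 * 1024
HARD_CEILING = 1536 * MB  # 1.5 GB hard ceiling


def classify_rss(rss: int, baseline: int) -> str:
    """Map one resident-set-size sample (bytes) to a watchdog level."""
    # Assumption: "+200 MB" in the notes means baseline + 200 MB.
    warn_at = max(1.5 * baseline, baseline + 200 * MB)
    soft_at = max(2.0 * baseline, baseline + 200 * MB)
    if rss >= HARD_CEILING:
        return "hard"
    if rss >= soft_at:
        return "soft-recycle"
    if rss >= warn_at:
        return "warn"
    return "ok"


def slope_exceeded(samples, window_min=30.0, limit_mb_per_min=5.0) -> bool:
    """samples: list of (minutes, rss_bytes) tuples, oldest first.

    Returns True when there is at least `window_min` of history and the
    average growth over that span meets the limit.
    """
    if len(samples) < 2:
        return False
    t_end, rss_end = samples[-1]
    # Newest sample that is at least one full window older than the latest.
    older = [(t, r) for t, r in samples if t <= t_end - window_min]
    if not older:
        return False  # not enough history yet
    t_start, rss_start = older[-1]
    growth_mb = (rss_end - rss_start) / MB
    return growth_mb / (t_end - t_start) >= limit_mb_per_min
```

Under this reading, the warn and soft-recycle thresholds coincide for baselines of 200 MB or less, because baseline + 200 MB dominates both max() terms; the sketch resolves that tie in favour of soft-recycle.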
Stream C — Driver.Galaxy.Proxy (1.5 week estimate, complete as IPC-forwarder)
- src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ (.NET 10)
- Ipc/GalaxyIpcClient: Hello handshake + shared-secret authentication + single-call request/response over the data-plane pipe. Serializes concurrent callers via SemaphoreSlim. Lifts ErrorResponse to GalaxyIpcException with the error code.
- GalaxyProxyDriver: implements IDriver + ITagDiscovery. Forwards lifecycle and discovery over IPC; maps Galaxy MX data types → DriverDataType and security classifications → SecurityClassification. Stream C-plan capability interfaces for IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe, IRediscoverable are structured identically; wire them in when the Host's MXAccess backend exists so the round-trips can actually serve data.
- Supervisor/Backoff: 5s → 15s → 60s capped; RecordStableRun resets after 2-min successful run.
- Supervisor/CircuitBreaker: 3 crashes per 5 min opens; cooldown escalates 1h → 4h → manual (TimeSpan.MaxValue). Sticky alert doesn't auto-clear when cooldown elapses; ManualReset only.
- Supervisor/HeartbeatMonitor: 2s cadence, 3 consecutive misses = host dead.
- Tests (11): Backoff sequence + reset; CircuitBreaker full 1h/4h/manual escalation path; HeartbeatMonitor miss-count + ack-reset; full IPC handshake round-trip (Host + Proxy over a real named pipe, heartbeat ack verified; shared-secret mismatch rejected with UnauthorizedAccessException).
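
The supervisor's restart policy described above can be sketched like this. It is a Python illustration, not the Proxy's C# code: the class and method shapes are hypothetical, time is passed in as plain seconds, and two behaviours are assumptions, namely that a trip clears the crash window and that a manual reset also resets the escalation level.

```python
import math
from collections import deque


class Backoff:
    """Restart delays of 5 s, 15 s, then 60 s, staying at the last step."""

    STEPS = (5.0, 15.0, 60.0)
    STABLE_RUN_SECONDS = 120.0

    def __init__(self):
        self._index = 0

    def next_delay(self) -> float:
        delay = self.STEPS[min(self._index, len(self.STEPS) - 1)]
        self._index += 1
        return delay

    def record_stable_run(self, run_seconds: float) -> None:
        # A run of at least 2 minutes counts as stable and resets the sequence.
        if run_seconds >= self.STABLE_RUN_SECONDS:
            self._index = 0


class CircuitBreaker:
    """Opens after 3 crashes within 5 minutes; cooldown escalates per trip."""

    COOLDOWNS = (3600.0, 4 * 3600.0, math.inf)  # 1 h, 4 h, manual reset only
    WINDOW_SECONDS = 300.0
    CRASH_LIMIT = 3

    def __init__(self):
        self._crashes = deque()
        self._trips = 0
        self._open_until = None
        self.sticky_alert = False

    def record_crash(self, now: float) -> None:
        self._crashes.append(now)
        # Drop crashes that fell out of the 5-minute window.
        while now - self._crashes[0] > self.WINDOW_SECONDS:
            self._crashes.popleft()
        if len(self._crashes) >= self.CRASH_LIMIT:
            level = min(self._trips, len(self.COOLDOWNS) - 1)
            self._open_until = now + self.COOLDOWNS[level]
            self._trips += 1
            self.sticky_alert = True
            self._crashes.clear()  # assumption: each trip starts a fresh window

    def allows_start(self, now: float) -> bool:
        # The cooldown may elapse, but sticky_alert stays set regardless.
        return self._open_until is None or now >= self._open_until

    def manual_reset(self) -> None:
        # Assumption: a manual reset also returns escalation to the first level.
        self._open_until = None
        self._trips = 0
        self.sticky_alert = False
        self._crashes.clear()
```

Using an infinite cooldown for the third level mirrors the TimeSpan.MaxValue convention in the notes: no elapsed time can reopen the breaker, so only the manual reset path does.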
Deferred (explicitly noted as TODO)
Stream D — Retire legacy OtOpcUa.Host
Not executable until Stream E parity passes. Deleting the legacy project now would break the 494 passing v1 tests that are the parity baseline. Recovery requires:
- Host MXAccess code lift (Task B.1 "move Galaxy code") from OtOpcUa.Host/ into OtOpcUa.Driver.Galaxy.Host/: STA pump wiring, MxAccessHandle backing the real LMXProxyServer, GalaxyRepository and its SQL queries, GalaxyRuntimeProbeManager, Historian loader, the Ipc stub handler replaced with a real IFrameHandler that invokes the handle.
- Address-space build via IAddressSpaceBuilder produces byte-equivalent OPC UA browse output to v1 (Task C.4).
- Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost) with the correct service-account SIDs and per-process secret provisioning. Galaxy.Host starts before OtOpcUa.
- appsettings.json Galaxy config (MxAccess / Galaxy / Historian sections) migrated into DriverInstance.DriverConfig JSON in the Configuration DB via an idempotent migration script. Post-migration, the local appsettings.json keeps only Cluster.NodeId, ClusterId, and the DB conn string per decision #18.
Stream E — Parity validation
Requires live MXAccess + Galaxy runtime and the above lift complete. Work items:
- Run v1 IntegrationTests against the v2 Galaxy.Proxy + Galaxy.Host topology. Pass count = v1 baseline; failures = 0. Per-test duration regression report flags any test >2× baseline.
- Scripted Client.CLI walkthrough recorded at Phase 2 entry gate against v1, replayed against v2; diff must show only timestamp/latency differences.
- Regression tests for the four 2026-04-13 stability findings (phantom probe, cross-host quality clear, sync-over-async guard, fire-and-forget alarm drain).
- /codex:adversarial-review --base v2 on the merged Phase 2 diff; findings closed or deferred with rationale.
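
The first work item's pass criteria can be expressed as a small check. This is a hedged sketch in Python, not part of the test harness: the input shape (test name mapped to a pass flag and a duration in seconds) and the function name parity_report are hypothetical, and it assumes that tests slower than 2× baseline are reported for review rather than failing the gate.

```python
def parity_report(baseline: dict, candidate: dict, slow_factor: float = 2.0) -> dict:
    """Compare per-test results from v1 (baseline) and v2 (candidate).

    Each dict maps test name -> (passed: bool, duration_seconds: float).
    """
    failures = sorted(name for name, (ok, _) in candidate.items() if not ok)
    missing = sorted(set(baseline) - set(candidate))
    baseline_passes = sum(1 for ok, _ in baseline.values() if ok)
    candidate_passes = sum(1 for ok, _ in candidate.values() if ok)
    # Tests present in both runs whose duration more than doubled.
    slow = sorted(
        name
        for name, (_, seconds) in candidate.items()
        if name in baseline and seconds > slow_factor * baseline[name][1]
    )
    return {
        # Gate: zero failures, nothing skipped, pass count equals v1 baseline.
        "gate_passed": not failures
        and not missing
        and candidate_passes == baseline_passes,
        "failures": failures,
        "missing": missing,
        "slow": slow,  # assumption: flagged for review, not gating
    }
```

For example, a v2 run in which every test passes but one takes 2.5× its v1 duration would pass the gate while listing that test under "slow".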
Also deferred from Stream B
- Task B.10 FaultShim (test-only ArchestrA.MxAccess substitute for fault injection). Needs the production ArchestrA.MxAccess reference in place first; flagged as part of the plan's "mid-gate review" fallback (Risk row 7).
- Task B.8 WM_QUIT hard-exit escalation: wired in when the real Win32 pump replaces the BlockingCollection dispatcher. The StaPump.IsResponsiveAsync probe already exists; the supervisor escalation-to-Environment.Exit(2) belongs to the Program main loop after the pump integration.
Cross-session impact on the build
- Full solution: 926 tests pass, 1 fails (pre-existing Phase 0 baseline failure Client.CLI.Tests.SubscribeCommandTests.Execute_PrintsSubscriptionMessage; not a Phase 2 regression: it was red before Phase 1 and stays red through Phase 2).
- New projects added to .slnx: Driver.Galaxy.Shared, Driver.Galaxy.Host, Driver.Galaxy.Proxy, plus the three matching test projects.
- No existing tests broke. The 494 v1 OtOpcUa.Tests (net48) and 6 IntegrationTests (net48) still pass because the legacy OtOpcUa.Host is untouched.
Next-session checklist for Stream D + E
- Stand up dev Galaxy; capture Client.CLI walkthrough baseline against v1.
- Move Galaxy-specific files from OtOpcUa.Host into Driver.Galaxy.Host, renaming namespaces. Replace StubFrameHandler with the real one.
- Wire up the real Win32 pump inside StaPump (lift from scadalink-design's LmxProxy.Host reference per CLAUDE.md).
- Run v1 IntegrationTests against the v2 topology; iterate on parity defects until green.
- Run Client.CLI walkthrough and diff.
- Regression tests for the four stability findings.
- Delete legacy OtOpcUa.Host; update .slnx; update installer scripts.
- Adversarial review; exit-gate-phase-2.md recorded; PR merged.