lmxopcua/docs/v2/implementation/phase-2-partial-exit-evidence.md
Joseph Doherty 32eeeb9e04 Phase 2 Streams A+B+C feature-complete — real Win32 pump, all 9 IDriver capabilities, end-to-end IPC dispatch. Streams D+E remain (Galaxy MXAccess code lift + parity-debug cycle, plan-budgeted 3-4 weeks). The 494 v1 IntegrationTests still pass — legacy OtOpcUa.Host untouched. StaPump replaces the BlockingCollection placeholder with a real Win32 message pump lifted from v1 StaComThread per CLAUDE.md "Reference Implementation": dedicated STA Thread with SetApartmentState(STA), GetMessage/PostThreadMessage/PeekMessage/TranslateMessage/DispatchMessage/PostQuitMessage P/Invoke, WM_APP=0x8000 for work-item dispatch, WM_APP+1 for graceful-drain → PostQuitMessage, peek-pm-noremove on entry to force the system to create the thread message queue before signalling Started, IsResponsiveAsync probe still no-op-round-trips through PostThreadMessage so the wedge detection works against the real pump. Concurrent ConcurrentQueue<WorkItem> drains on every WM_APP; fault path on dispose drains-and-faults all pending work-item TCSes with InvalidOperationException("STA pump has exited"). All three StaPumpTests pass against the real pump (apartment state STA, healthy probe true, wedged probe false). GalaxyProxyDriver now implements every Phase 2 Stream C capability — IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe — each forwarding through the matching IPC contract. 
ReadAsync preserves request order even when the Host returns out-of-order values; WriteAsync MessagePack-serializes the value into ValueBytes; SubscribeAsync wraps SubscriptionId in a GalaxySubscriptionHandle record; UnsubscribeAsync uses the new SendOneWayAsync helper on GalaxyIpcClient (fire-and-forget but still gated through the call-semaphore so it doesn't interleave with CallAsync); AlarmSubscribe is one-way and the Host pushes events back via OnAlarmEvent; ReadProcessedAsync short-circuits to NotSupportedException (Galaxy historian only does raw); IRediscoverable's OnRediscoveryNeeded fires when the Host pushes a deploy-watermark notification; IHostConnectivityProbe.GetHostStatuses() snapshots and OnHostStatusChanged fires on Running↔Stopped/Faulted transitions, with IpcHostConnectivityStatus aliased to disambiguate from the Core.Abstractions namespace's same-named type. Internal RaiseDataChange/RaiseAlarmEvent/RaiseRediscoveryNeeded/OnHostConnectivityUpdate methods are the entry points the IPC client will invoke when push frames arrive. 
Host side: new Backend/IGalaxyBackend interface defines the seam between IPC dispatch and the live MXAccess code (so the dispatcher is unit-testable against an in-memory mock without needing live Galaxy); Backend/StubGalaxyBackend returns success for OpenSession/CloseSession/Subscribe/Unsubscribe/AlarmSubscribe/AlarmAck/Recycle and a recognizable "stub: MXAccess code lift pending (Phase 2 Task B.1)"-tagged error for Discover/ReadValues/WriteValues/HistoryRead — keeps the IPC end-to-end testable today and gives the parity team a clear seam to slot the real implementation into; Ipc/GalaxyFrameHandler is the new real dispatcher (replaces StubFrameHandler in Program.cs) — switch on MessageKind, deserialize the matching contract, await backend method, write the response (one-way for Unsubscribe/AlarmSubscribe/AlarmAck/CloseSession), heartbeat handled inline so liveness still works if the backend is sick, exceptions caught and surfaced as ErrorResponse with code "handler-exception" so the Proxy raises GalaxyIpcException instead of disconnecting. End-to-end IPC integration test (EndToEndIpcTests) drives every operation through the full stack — Initialize → Read → Write → Subscribe → Unsubscribe → SubscribeAlarms → AlarmAck → ReadRaw → ReadProcessed (short-circuit) — proving the wire protocol, dispatcher, capability forwarding, and one-way semantics agree end-to-end. Skipped on Windows administrator shells per the same PipeAcl-denies-Administrators reasoning the IpcHandshakeIntegrationTests use. Full solution 952 pass / 1 pre-existing Phase 0 baseline. Phase 2 evidence doc updated: status header now reads "Streams A+B+C complete... 
Streams D+E remain — gated only on the iterative Galaxy code lift + parity-debug cycle"; new Update 2026-04-17 (later) callout enumerates the upgrade with explicit "what's left for the Phase 2 exit gate" — replace StubGalaxyBackend with a MxAccessClient-backed implementation calling on the StaPump, then run the v1 IntegrationTests against the v2 topology and iterate on parity defects until green, then delete legacy OtOpcUa.Host.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 23:02:00 -04:00

Phase 2 — Partial Exit Evidence (2026-04-17)

This records what Phase 2 of v2 completed in the current session and what was explicitly deferred. See phase-2-galaxy-out-of-process.md for the full task plan; this is the as-built delta.

Status: Streams A + B + C complete (real Win32 pump, all 9 capability interfaces, end-to-end IPC dispatch). Streams D + E remain — gated only on the iterative Galaxy code lift + parity-debug cycle.

The goal per the plan is "parity, not regression": the phase exit gate requires the v1 IntegrationTests to pass against the v2 Galaxy.Proxy + Galaxy.Host topology byte-for-byte. Achieving that requires a live MXAccess runtime plus the Galaxy code lift out of the legacy OtOpcUa.Host. Until that cycle is done, deleting the legacy Host would break the 494 passing v1 tests that are the parity baseline.

Update 2026-04-17 (later) — Streams A/B/C now feature-complete, not just scaffolds. The Win32 message pump in StaPump was upgraded from a BlockingCollection placeholder to a real GetMessage/PostThreadMessage/PeekMessage loop lifted from v1 StaComThread (P/Invoke declarations included; WM_APP=0x8000 for work-item dispatch, WM_APP+1 for graceful drain → PostQuitMessage, 5s join-on-dispose). GalaxyProxyDriver now implements every capability interface declared in Phase 2 Stream C — IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe — each forwarding through the matching IPC contract. GalaxyIpcClient gained SendOneWayAsync for the fire-and-forget calls (unsubscribe / alarm-ack / close-session) while still serializing through the call-gate so writes don't interleave with CallAsync round-trips. Host side: IGalaxyBackend interface defines the seam between IPC dispatch and the live MXAccess code, GalaxyFrameHandler routes every MessageKind into it (heartbeat handled inline so liveness works regardless of backend health), and StubGalaxyBackend returns success for lifecycle/subscribe/recycle and recognizable not-implemented-coded errors for data-plane calls. End-to-end integration tests exercise every capability through the full stack (handshake → open session → read / write / subscribe / alarm / history / recycle) and the v1 test baseline stays green (494 pass, no regressions).
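The pump behaviour described above (a wake message per work item, a drain message that ends the loop, faulting of anything still pending, and a no-op round-trip as the responsiveness probe) can be sketched without Win32. This is an illustrative Python model, not the project's C# StaPump: `PumpSketch`, its method names, and the use of `RuntimeError` in place of `InvalidOperationException` are assumptions, a `queue.Queue` stands in for the thread message queue, and there is no apartment state here.

```python
import collections
import queue
import threading
from concurrent.futures import Future, TimeoutError as FutureTimeout

WM_APP = 0x8000
WM_WORK = WM_APP        # "a work item was queued"
WM_DRAIN = WM_APP + 1   # graceful drain, then leave the loop


class PumpSketch:
    """Single worker thread that runs queued work items in FIFO order."""

    def __init__(self):
        self._messages = queue.Queue()      # stand-in for the thread message queue
        self._work = collections.deque()    # stand-in for ConcurrentQueue<WorkItem>
        self._lock = threading.Lock()
        self._exited = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def invoke(self, fn):
        """Queue fn for the pump thread; the Future completes with its result."""
        future = Future()
        with self._lock:
            if self._exited:
                future.set_exception(RuntimeError("STA pump has exited"))
                return future
            self._work.append((fn, future))
        self._messages.put(WM_WORK)
        return future

    def is_responsive(self, timeout_s):
        """No-op round-trip; False if the pump is wedged or has exited."""
        try:
            self.invoke(lambda: None).result(timeout=timeout_s)
            return True
        except (FutureTimeout, RuntimeError):
            return False

    def dispose(self, join_timeout_s=5.0):
        self._messages.put(WM_DRAIN)
        self._thread.join(join_timeout_s)

    def _run(self):
        while self._messages.get() != WM_DRAIN:   # blocks like GetMessage
            self._run_pending()
        # Fault path: anything still queued never runs and is faulted instead.
        with self._lock:
            self._exited = True
            pending = list(self._work)
            self._work.clear()
        for _, future in pending:
            future.set_exception(RuntimeError("STA pump has exited"))

    def _run_pending(self):
        while True:
            with self._lock:
                if not self._work:
                    return
                fn, future = self._work.popleft()
            try:
                future.set_result(fn())
            except Exception as exc:   # surface failures to the caller
                future.set_exception(exc)
```

Because the probe is queued behind real work, a work item that never returns makes `is_responsive` time out, which is the wedge signal the supervisor relies on.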

What's left for the Phase 2 exit gate: the actual Galaxy code lift (Task B.1) — replace StubGalaxyBackend with a MxAccessClient-backed implementation that calls MxAccessClient on the StaPump, plus the parity-cycle debugging against live Galaxy that the plan budgets 3-4 weeks for. Removing the legacy OtOpcUa.Host (Task D.1) follows once the parity tests are green against the v2 topology.

Update 2026-04-17 — runtime confirmed local. The dev box has the full AVEVA stack required for the LmxOpcUa breakout: 27 ArchestrA / Wonderware / AVEVA services running including aaBootstrap, aaGR (Galaxy Repository), aaLogger, aaUserValidator, aaPim, ArchestrADataStore, AsbServiceManager; the full Historian set (aahClientAccessPoint, aahGateway, aahInSight, aahSearchIndexer, InSQLStorage, InSQLConfiguration, InSQLEventSystem, InSQLIndexing, InSQLIOServer, HistorianSearch-x64); SuiteLink (slssvc); MXAccess COM at C:\Program Files (x86)\ArchestrA\Framework\bin\ArchestrA.MXAccess.dll; and the OI-Gateway install at C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\ (so the AppServer-via-OI-Gateway smoke test from decision #142 is also runnable here, not blocked on a dedicated AVEVA test box).

The "needs a dev Galaxy" prerequisite is therefore satisfied. Streams D + E can start whenever the team is ready to take the parity-cycle hit on the 494 v1 tests; no environmental blocker remains.

What is done: all the scaffolding, IPC contracts, supervisor logic, and stability protections that the real MXAccess code will attach to. Every piece has unit-level or IPC-level test coverage.

Delivered

Stream A — Driver.Galaxy.Shared (1 week estimate, complete)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ (.NET Standard 2.0, MessagePack-only dependency)
  • Contracts: Hello/HelloAck (version negotiation per Task A.3), OpenSessionRequest/OpenSessionResponse/CloseSessionRequest, Heartbeat/HeartbeatAck, ErrorResponse, DiscoverHierarchyRequest/Response + GalaxyObjectInfo + GalaxyAttributeInfo, ReadValuesRequest/Response, WriteValuesRequest/Response, SubscribeRequest/Response/UnsubscribeRequest/OnDataChangeNotification, AlarmSubscribeRequest/GalaxyAlarmEvent/AlarmAckRequest, HistoryReadRequest/Response + HistoryTagValues, HostConnectivityStatus + RuntimeStatusChangeNotification, RecycleHostRequest/RecycleStatusResponse
  • Framing: length-prefixed (decision #28) + 1-byte kind tag + MessagePack body. 16 MiB body cap. FrameWriter/FrameReader with thread-safe write gate.
  • Tests (6): reflection-scan round-trip for every [MessagePackObject], referenced-assemblies guard (only MessagePack allowed outside BCL), Hello version defaults, FrameWriter/FrameReader interop, oversize-frame rejection.
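The framing rule above (length prefix, 1-byte kind tag, body, 16 MiB body cap) can be illustrated with a short Python sketch. The prefix width and byte order are not stated here, so the 4-byte little-endian prefix that counts only the body is an assumption, as are the function names; the body is treated as opaque bytes rather than real MessagePack.

```python
import struct

MAX_BODY_BYTES = 16 * 1024 * 1024   # 16 MiB body cap

# Assumption: 4-byte little-endian body length, then the 1-byte kind tag.
_HEADER = struct.Struct("<IB")


def encode_frame(kind, body):
    """Return length prefix + kind tag + body as one byte string."""
    if not 0 <= kind <= 0xFF:
        raise ValueError("kind tag must fit in one byte")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("body exceeds the 16 MiB cap")
    return _HEADER.pack(len(body), kind) + body


def decode_frame(buffer):
    """Split one frame off the front of buffer; return (kind, body, rest)."""
    if len(buffer) < _HEADER.size:
        raise ValueError("incomplete header")
    length, kind = _HEADER.unpack_from(buffer)
    if length > MAX_BODY_BYTES:
        # Reject on the declared length, before reading any of the body.
        raise ValueError("declared body length exceeds the 16 MiB cap")
    end = _HEADER.size + length
    if len(buffer) < end:
        raise ValueError("incomplete body")
    return kind, buffer[_HEADER.size:end], buffer[end:]
```

Checking the declared length before touching the body is what lets a reader refuse an oversize frame without allocating for it, which is the case the oversize-frame test targets.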

Stream B — Driver.Galaxy.Host (3-4 week estimate, scaffold complete; MXAccess lift deferred)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ (.NET Framework 4.8 AnyCPU — flips to x86 when the Galaxy code lift happens per Task B.1 scope)
  • Ipc/PipeAcl: builds the strict PipeSecurity — allow configured server-principal SID, explicit deny on LocalSystem + Administrators, owner = allowed SID (decision #76).
  • Ipc/PipeServer: named-pipe server that (1) enforces the ACL, (2) verifies caller SID via pipe.RunAsClient + WindowsIdentity.GetCurrent, (3) requires the per-process shared secret in the Hello frame before any other RPC, (4) rejects major-version mismatches.
  • Stability/MemoryWatchdog: Galaxy thresholds — warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30 min. Pluggable RSS source for unit testability.
  • Stability/RecyclePolicy: 1-recycle/hr cap; 03:00 local daily scheduled recycle.
  • Stability/PostMortemMmf: ring buffer of 1000 × 256-byte entries in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf. Single-writer / multi-reader. Survives hard crash; supervisor reads the MMF via a second process.
  • Sta/MxAccessHandle: SafeHandle subclass — ReleaseHandle calls Marshal.ReleaseComObject in a loop until refcount = 0 then invokes the optional unregister callback. Finalizer-safe. Wraps any RCW via object so we can unit-test against a mock; the real wiring to ArchestrA.MxAccess.LMXProxyServer lands with the deferred code move.
  • Sta/StaPump: dedicated STA thread with BlockingCollection work queue + InvokeAsync dispatch. Responsiveness probe (IsResponsiveAsync) returns false on wedge. The real Win32 GetMessage/DispatchMessage pump from v1 LmxProxy.Host slots in here with the same dispatch semantics.
  • IsExternalInit shim: required for init setters on .NET 4.8.
  • Program.cs: reads OTOPCUA_GALAXY_PIPE, OTOPCUA_ALLOWED_SID, OTOPCUA_GALAXY_SECRET from env (supervisor sets at spawn), runs the pipe server, logs via Serilog to %ProgramData%\OtOpcUa\galaxy-host-YYYY-MM-DD.log.
  • Ipc/StubFrameHandler: placeholder that heartbeat-acks and returns not-implemented errors. Swapped for the real Galaxy-backed handler when the MXAccess code move completes.
  • Tests (15): MemoryWatchdog thresholds + slope detection; RecyclePolicy cap + daily schedule; PostMortemMmf round-trip + ring-wrap + truncation-safety; StaPump apartment-state + responsiveness-probe wedge detection.
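A worked example of the MemoryWatchdog thresholds listed above, as a Python sketch rather than the project's C# class. Several points are assumptions: "+200 MB" is read as baseline + 200 MB, "1.5 GB" as 1536 MiB, thresholds are treated as inclusive, the slope is computed from the first and last sample only, and all names are hypothetical.

```python
from dataclasses import dataclass

MB = 1024 * 1024
HARD_CEILING_BYTES = 1536 * MB    # "1.5 GB", read here as 1.5 GiB (assumption)
SLOPE_LIMIT_MB_PER_MIN = 5.0
SLOPE_WINDOW_MIN = 30.0


@dataclass(frozen=True)
class Thresholds:
    warn: int
    soft_recycle: int
    hard_ceiling: int


def thresholds_for(baseline):
    """Derive byte thresholds from the baseline RSS (bytes)."""
    floor = baseline + 200 * MB   # assumption: "+200 MB" is relative to baseline
    return Thresholds(
        warn=max(baseline * 3 // 2, floor),
        soft_recycle=max(2 * baseline, floor),
        hard_ceiling=HARD_CEILING_BYTES,
    )


def classify(rss, baseline):
    """Return the most severe threshold the current RSS has reached."""
    t = thresholds_for(baseline)
    if rss >= t.hard_ceiling:
        return "hard-ceiling"
    if rss >= t.soft_recycle:
        return "soft-recycle"
    if rss >= t.warn:
        return "warn"
    return "ok"


def slope_exceeded(samples):
    """samples: (minute, rss_bytes) pairs, oldest first.

    True when growth across a window of at least 30 minutes averages
    5 MB/min or more.
    """
    if len(samples) < 2:
        return False
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed < SLOPE_WINDOW_MIN:
        return False
    return (r1 - r0) / MB / elapsed >= SLOPE_LIMIT_MB_PER_MIN
```

With a 300 MB baseline this gives warn at 500 MB (the +200 MB floor wins over 1.5× = 450 MB) and soft-recycle at 600 MB. Under the same reading, a baseline above roughly 768 MB puts the 2× soft-recycle threshold past the hard ceiling, so the ceiling would trip first.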

Stream C — Driver.Galaxy.Proxy (1.5 week estimate, complete as IPC-forwarder)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ (.NET 10)
  • Ipc/GalaxyIpcClient: Hello handshake + shared-secret authentication + single-call request/response over the data-plane pipe. Serializes concurrent callers via SemaphoreSlim. Lifts ErrorResponse to GalaxyIpcException with the error code.
  • GalaxyProxyDriver: implements IDriver + ITagDiscovery. Forwards lifecycle and discovery over IPC; maps Galaxy MX data types → DriverDataType and security classifications → SecurityClassification. Stream C-plan capability interfaces for IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe, IRediscoverable are structured identically — wire them in when the Host's MXAccess backend exists so the round-trips can actually serve data.
  • Supervisor/Backoff: 5s → 15s → 60s capped; RecordStableRun resets after 2-min successful run.
  • Supervisor/CircuitBreaker: 3 crashes per 5 min opens; cooldown escalates 1h → 4h → manual (TimeSpan.MaxValue). Sticky alert doesn't auto-clear when cooldown elapses; ManualReset only.
  • Supervisor/HeartbeatMonitor: 2s cadence, 3 consecutive misses = host dead.
  • Tests (11): Backoff sequence + reset; CircuitBreaker full 1h/4h/manual escalation path; HeartbeatMonitor miss-count + ack-reset; full IPC handshake round-trip (Host + Proxy over a real named pipe, heartbeat ack verified; shared-secret mismatch rejected with UnauthorizedAccessException).
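The supervisor timing rules above can be modelled in a few lines. This Python sketch is illustrative only, not the C# Supervisor classes: the class and method names are hypothetical, time is passed in as plain seconds, `math.inf` stands in for TimeSpan.MaxValue, and the choice that a manual reset also resets the escalation level is an assumption.

```python
import math

BACKOFF_STEPS_S = (5, 15, 60)     # 5s -> 15s -> 60s, then capped
STABLE_RUN_S = 120                # a 2-minute run resets the sequence


class Backoff:
    def __init__(self):
        self._index = 0

    def next_delay(self):
        delay = BACKOFF_STEPS_S[min(self._index, len(BACKOFF_STEPS_S) - 1)]
        self._index += 1
        return delay

    def record_stable_run(self, uptime_s):
        if uptime_s >= STABLE_RUN_S:
            self._index = 0


COOLDOWNS_S = (3600, 4 * 3600, math.inf)   # 1h -> 4h -> manual reset only
CRASH_LIMIT = 3
CRASH_WINDOW_S = 300                       # 3 crashes per 5 minutes opens


class CircuitBreaker:
    def __init__(self):
        self._crashes = []
        self._opens = 0
        self.open_until = None     # None means the breaker is closed
        self.sticky_alert = False

    def record_crash(self, now):
        # Keep only crashes inside the sliding 5-minute window.
        self._crashes = [t for t in self._crashes if now - t < CRASH_WINDOW_S]
        self._crashes.append(now)
        if len(self._crashes) >= CRASH_LIMIT:
            cooldown = COOLDOWNS_S[min(self._opens, len(COOLDOWNS_S) - 1)]
            self._opens += 1
            self.open_until = now + cooldown
            self.sticky_alert = True
            self._crashes.clear()

    def allows_start(self, now):
        # The alert stays set even after the cooldown has elapsed.
        return self.open_until is None or now >= self.open_until

    def manual_reset(self):
        self._crashes.clear()
        self._opens = 0            # assumption: reset also clears escalation
        self.open_until = None
        self.sticky_alert = False
```

Keeping `sticky_alert` separate from `allows_start` mirrors the rule that an elapsed cooldown permits a restart but does not clear the alert.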

Deferred (explicitly noted as TODO)

Stream D — Retire legacy OtOpcUa.Host

Not executable until Stream E parity passes. Deleting the legacy project now would break the 494 v1 IntegrationTests that are the parity baseline. Retiring it safely requires:

  1. Host MXAccess code lift (Task B.1 "move Galaxy code") from OtOpcUa.Host/ into OtOpcUa.Driver.Galaxy.Host/ — STA pump wiring, MxAccessHandle backing the real LMXProxyServer, GalaxyRepository and its SQL queries, GalaxyRuntimeProbeManager, Historian loader, the Ipc stub handler replaced with a real IFrameHandler that invokes the handle.
  2. Address-space build via IAddressSpaceBuilder produces byte-equivalent OPC UA browse output to v1 (Task C.4).
  3. Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost) with the correct service-account SIDs and per-process secret provisioning. Galaxy.Host starts before OtOpcUa.
  4. appsettings.json Galaxy config (MxAccess / Galaxy / Historian sections) migrated into DriverInstance.DriverConfig JSON in the Configuration DB via an idempotent migration script. Post-migration, the local appsettings.json keeps only Cluster.NodeId, ClusterId, and the DB conn string per decision #18.

Stream E — Parity validation

Requires live MXAccess + Galaxy runtime and the above lift complete. Work items:

  • Run v1 IntegrationTests against the v2 Galaxy.Proxy + Galaxy.Host topology. Pass count = v1 baseline; failures = 0. Per-test duration regression report flags any test >2× baseline.
  • Scripted Client.CLI walkthrough recorded at Phase 2 entry gate against v1, replayed against v2; diff must show only timestamp/latency differences.
  • Regression tests for the four 2026-04-13 stability findings (phantom probe, cross-host quality clear, sync-over-async guard, fire-and-forget alarm drain).
  • /codex:adversarial-review --base v2 on the merged Phase 2 diff — findings closed or deferred with rationale.
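The first work item above combines two checks: the v2 pass count must equal the v1 baseline with zero failures, and any test slower than 2× its baseline duration gets flagged. A minimal Python sketch of both, assuming results arrive as plain dictionaries keyed by test name; the function names and the input shape are hypothetical, not the project's report format.

```python
def parity_gate(v1_results, v2_results):
    """True when v2's pass count equals the v1 baseline and v2 has no failures.

    Both arguments map test name -> bool (passed).
    """
    v1_passed = sum(1 for ok in v1_results.values() if ok)
    v2_passed = sum(1 for ok in v2_results.values() if ok)
    v2_failed = len(v2_results) - v2_passed
    return v2_passed == v1_passed and v2_failed == 0


def duration_regressions(baseline_ms, candidate_ms, factor=2.0):
    """Return {test: (baseline_ms, candidate_ms)} for tests slower than factor x baseline.

    Tests missing from the candidate run are skipped here; the pass-count
    check above is what catches them.
    """
    flagged = {}
    for name, base in sorted(baseline_ms.items()):
        current = candidate_ms.get(name)
        if current is not None and current > factor * base:
            flagged[name] = (base, current)
    return flagged
```

For example, a test that took 100 ms on v1 and 250 ms on v2 is flagged, while one that took exactly 200 ms is not, since the rule is strictly greater than 2×.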

Also deferred from Stream B

  • Task B.10 FaultShim (test-only ArchestrA.MxAccess substitute for fault injection). Needs the production ArchestrA.MxAccess reference in place first; flagged as part of the plan's "mid-gate review" fallback (Risk row 7).
  • Task B.8 WM_QUIT hard-exit escalation — wired in when the real Win32 pump replaces the BlockingCollection dispatcher. The StaPump.IsResponsiveAsync probe already exists; the supervisor escalation-to-Environment.Exit(2) belongs to the Program main loop after the pump integration.

Cross-session impact on the build

  • Full solution: 926 tests pass, 1 fails (pre-existing Phase 0 baseline Client.CLI.Tests.SubscribeCommandTests.Execute_PrintsSubscriptionMessage — not a Phase 2 regression; was red before Phase 1 and stays red through Phase 2).
  • New projects added to .slnx: Driver.Galaxy.Shared, Driver.Galaxy.Host, Driver.Galaxy.Proxy, plus the three matching test projects.
  • No existing tests broke. The 494 v1 OtOpcUa.Tests (net48) and 6 IntegrationTests (net48) still pass because the legacy OtOpcUa.Host is untouched.

Next-session checklist for Stream D + E

  1. Verify the local AVEVA stack is still green (Get-Service aaGR, aaBootstrap, slssvc → Running) and the Galaxy ZB repository is reachable from sqlcmd -S localhost -d ZB -E. The runtime is already on this machine — no install step needed.
  2. Capture Client.CLI walkthrough baseline against v1 (the parity reference).
  3. Move Galaxy-specific files from OtOpcUa.Host into Driver.Galaxy.Host, renaming namespaces. Replace StubFrameHandler with the real one.
  4. Wire up the real Win32 pump inside StaPump (lift from scadalink-design's LmxProxy.Host reference per CLAUDE.md).
  5. Run v1 IntegrationTests against the v2 topology — iterate on parity defects until green.
  6. Run Client.CLI walkthrough and diff.
  7. Regression tests for the four 2026-04-13 stability findings.
  8. Delete legacy OtOpcUa.Host; update .slnx; update installer scripts.
  9. Optional but valuable now that the runtime is local: AppServer-via-OI-Gateway smoke test (decision #142 / Phase 1 Task E.10) — the OI-Gateway install at C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\ is in place; the test was deferred for "needs live AVEVA runtime" reasons that no longer apply on this dev box.
  10. Adversarial review; exit-gate-phase-2.md recorded; PR merged.