lmxopcua/docs/v2/implementation/phase-2-partial-exit-evidence.md
Joseph Doherty a1e9ed40fb Doc — record that this dev box (DESKTOP-6JL3KKO) hosts the full AVEVA stack required for the LmxOpcUa Phase 2 breakout, removing the "needs live MXAccess runtime" environmental blocker that the partial-exit evidence cited as gating Streams D + E. Inventory verified via Get-Service: 27 ArchestrA / Wonderware / AVEVA services running including aaBootstrap, aaGR (Galaxy Repository), aaLogger, aaUserValidator, aaPim, ArchestrADataStore, AsbServiceManager, AutoBuild_Service; the full Historian set (aahClientAccessPoint, aahGateway, aahInSight, aahSearchIndexer, aahSupervisor, InSQLStorage, InSQLConfiguration, InSQLEventSystem, InSQLIndexing, InSQLIOServer, InSQLManualStorage, InSQLSystemDriver, HistorianSearch-x64); slssvc (Wonderware SuiteLink); MXAccess COM DLL at C:\Program Files (x86)\ArchestrA\Framework\bin\ArchestrA.MXAccess.dll plus the matching .tlb files; OI-Gateway install at C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\ — which means the Phase 1 Task E.10 AppServer-via-OI-Gateway smoke test (decision #142) is *also* runnable on the same box, not blocked on a separate AVEVA test machine as the original deferral assumed. dev-environment.md inventory row for "Dev Galaxy" now lists every service and file path; status flips to "Fully available — Phase 2 lift unblocked"; the GLAuth row also fills out v2.4.0 actual install details (direct-bind cn={user},dc=lmxopcua,dc=local; users readonly/writeop/writetune/writeconfig/alarmack/admin/serviceaccount; running under NSSM service GLAuth; current GroupToRole mapping ReadOnly→ConfigViewer / WriteOperate→ConfigEditor / AlarmAck→FleetAdmin) and notes the v2-rebrand to dc=otopcua,dc=local is a future cosmetic change. 
phase-2-partial-exit-evidence.md status header gains "runtime now in place"; an Update 2026-04-17 callout enumerates the same service inventory and concludes "no environmental blocker remains"; the next-session checklist's first step changes from "stand up dev Galaxy" to "verify the local AVEVA stack is still green (Get-Service aaGR, aaBootstrap, slssvc → Running) and the Galaxy ZB repository is reachable" with a new step 9 calling out that the AppServer-via-OI-Gateway smoke test should now be folded in opportunistically. plan.md §"4. Galaxy/MXAccess as Out-of-Process Driver" gains a "Dev environment for the LmxOpcUa breakout" paragraph documenting which physical machine has the runtime so the planning doc no longer reads as if AVEVA capability were a future logistical concern. No source / test changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:42:15 -04:00


Phase 2 — Partial Exit Evidence (2026-04-17)

This records what Phase 2 of v2 completed in the current session and what was explicitly deferred. See phase-2-galaxy-out-of-process.md for the full task plan; this is the as-built delta.

Status: Streams A + B + C scaffolded and test-green. Streams D + E deferred — runtime now in place.

The goal per the plan is "parity, not regression": the phase exit gate requires v1 IntegrationTests to pass against the v2 Galaxy.Proxy + Galaxy.Host topology byte-for-byte. Achieving that requires a live MXAccess runtime plus the Galaxy code lift out of the legacy OtOpcUa.Host. Without that cycle, deleting the legacy Host would break the 494 passing v1 tests that are the parity baseline.

Update 2026-04-17 — runtime confirmed local. The dev box has the full AVEVA stack required for the LmxOpcUa breakout: 27 ArchestrA / Wonderware / AVEVA services running including aaBootstrap, aaGR (Galaxy Repository), aaLogger, aaUserValidator, aaPim, ArchestrADataStore, AsbServiceManager; the full Historian set (aahClientAccessPoint, aahGateway, aahInSight, aahSearchIndexer, InSQLStorage, InSQLConfiguration, InSQLEventSystem, InSQLIndexing, InSQLIOServer, HistorianSearch-x64); SuiteLink (slssvc); MXAccess COM at C:\Program Files (x86)\ArchestrA\Framework\bin\ArchestrA.MXAccess.dll; and the OI-Gateway install at C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\ (so the AppServer-via-OI-Gateway smoke test from decision #142 is also runnable here, not blocked on a dedicated AVEVA test box).

The "needs a dev Galaxy" prerequisite is therefore satisfied. Streams D + E can start whenever the team is ready to take the parity-cycle hit on the 494 v1 tests; no environmental blocker remains.

What is done: all the scaffolding, IPC contracts, supervisor logic, and stability protections that the real MXAccess code will attach to. Every piece has unit-level or IPC-level test coverage.

Delivered

Stream A — Driver.Galaxy.Shared (1 week estimate, complete)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ (.NET Standard 2.0, MessagePack-only dependency)
  • Contracts: Hello/HelloAck (version negotiation per Task A.3), OpenSessionRequest/OpenSessionResponse/CloseSessionRequest, Heartbeat/HeartbeatAck, ErrorResponse, DiscoverHierarchyRequest/Response + GalaxyObjectInfo + GalaxyAttributeInfo, ReadValuesRequest/Response, WriteValuesRequest/Response, SubscribeRequest/Response/UnsubscribeRequest/OnDataChangeNotification, AlarmSubscribeRequest/GalaxyAlarmEvent/AlarmAckRequest, HistoryReadRequest/Response + HistoryTagValues, HostConnectivityStatus + RuntimeStatusChangeNotification, RecycleHostRequest/RecycleStatusResponse
  • Framing: length-prefixed (decision #28) + 1-byte kind tag + MessagePack body. 16 MiB body cap. FrameWriter/FrameReader with thread-safe write gate.
  • Tests (6): reflection-scan round-trip for every [MessagePackObject], referenced-assemblies guard (only MessagePack allowed outside BCL), Hello version defaults, FrameWriter/FrameReader interop, oversize-frame rejection.
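
The framing rule above (length prefix, 1-byte kind tag, body, 16 MiB cap) can be illustrated with a short language-neutral sketch. It is written in Python purely for illustration; the real FrameWriter/FrameReader are .NET types and may differ. The 4-byte big-endian prefix, the choice that the prefix counts body bytes only, and the function names are all assumptions, and the body is treated as opaque bytes rather than MessagePack.

```python
import io
import struct

MAX_BODY_BYTES = 16 * 1024 * 1024  # 16 MiB body cap from the Stream A notes


def write_frame(stream, kind: int, body: bytes) -> None:
    """Write one frame: length prefix, 1-byte kind tag, then the body."""
    if not 0 <= kind <= 0xFF:
        raise ValueError("kind tag must fit in one byte")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("body exceeds the 16 MiB cap")
    # Assumption: 4-byte big-endian prefix that counts body bytes only.
    stream.write(struct.pack(">IB", len(body), kind))
    stream.write(body)


def _read_exact(stream, count: int) -> bytes:
    """Read exactly `count` bytes, looping over short reads."""
    data = bytearray()
    while len(data) < count:
        piece = stream.read(count - len(data))
        if not piece:
            raise EOFError("stream ended mid-frame")
        data.extend(piece)
    return bytes(data)


def read_frame(stream):
    """Read one frame and return (kind, body)."""
    length, kind = struct.unpack(">IB", _read_exact(stream, 5))
    if length > MAX_BODY_BYTES:
        # Reject on the declared length, before reading any body bytes.
        raise ValueError("declared body length exceeds the 16 MiB cap")
    return kind, _read_exact(stream, length)
```

Checking the declared length before reading the body is what makes oversize-frame rejection cheap: a hostile or corrupt peer cannot force a large allocation just by sending a big prefix.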

Stream B — Driver.Galaxy.Host (3–4 week estimate, scaffold complete; MXAccess lift deferred)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ (.NET Framework 4.8 AnyCPU — flips to x86 when the Galaxy code lift happens per Task B.1 scope)
  • Ipc/PipeAcl: builds the strict PipeSecurity — allow configured server-principal SID, explicit deny on LocalSystem + Administrators, owner = allowed SID (decision #76).
  • Ipc/PipeServer: named-pipe server that (1) enforces the ACL, (2) verifies caller SID via pipe.RunAsClient + WindowsIdentity.GetCurrent, (3) requires the per-process shared secret in the Hello frame before any other RPC, (4) rejects major-version mismatches.
  • Stability/MemoryWatchdog: Galaxy thresholds — warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30 min. Pluggable RSS source for unit testability.
  • Stability/RecyclePolicy: 1-recycle/hr cap; 03:00 local daily scheduled recycle.
  • Stability/PostMortemMmf: ring buffer of 1000 × 256-byte entries in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf. Single-writer / multi-reader. Survives hard crash; supervisor reads the MMF via a second process.
  • Sta/MxAccessHandle: SafeHandle subclass — ReleaseHandle calls Marshal.ReleaseComObject in a loop until refcount = 0 then invokes the optional unregister callback. Finalizer-safe. Wraps any RCW via object so we can unit-test against a mock; the real wiring to ArchestrA.MxAccess.LMXProxyServer lands with the deferred code move.
  • Sta/StaPump: dedicated STA thread with BlockingCollection work queue + InvokeAsync dispatch. Responsiveness probe (IsResponsiveAsync) returns false on wedge. The real Win32 GetMessage/DispatchMessage pump from v1 LmxProxy.Host slots in here with the same dispatch semantics.
  • IsExternalInit shim: required for init setters on .NET 4.8.
  • Program.cs: reads OTOPCUA_GALAXY_PIPE, OTOPCUA_ALLOWED_SID, OTOPCUA_GALAXY_SECRET from env (supervisor sets at spawn), runs the pipe server, logs via Serilog to %ProgramData%\OtOpcUa\galaxy-host-YYYY-MM-DD.log.
  • Ipc/StubFrameHandler: placeholder that heartbeat-acks and returns not-implemented errors. Swapped for the real Galaxy-backed handler when the MXAccess code move completes.
  • Tests (15): MemoryWatchdog thresholds + slope detection; RecyclePolicy cap + daily schedule; PostMortemMmf round-trip + ring-wrap + truncation-safety; StaPump apartment-state + responsiveness-probe wedge detection.
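
To make the MemoryWatchdog thresholds above concrete, here is a minimal Python sketch (illustration only; the actual watchdog is a .NET class). It assumes that "+200 MB" means baseline + 200 MB, that comparisons are inclusive, and that the slope is measured between the first and last sample of the window. None of these details are confirmed by this document, and the function names are hypothetical.

```python
MB = 1024 * 1024
HARD_CEILING_BYTES = 1536 * MB   # 1.5 GB hard ceiling
SLOPE_LIMIT_MB_PER_MIN = 5.0     # sustained growth that counts as a leak
SLOPE_WINDOW_MIN = 30.0          # minimum window the slope is measured over


def classify(baseline: int, rss: int) -> str:
    """Map an RSS sample (bytes) to 'ok', 'warn', 'soft_recycle' or 'hard'."""
    # Assumption: "+200 MB" is read as baseline + 200 MB.
    warn_at = max(1.5 * baseline, baseline + 200 * MB)
    soft_at = max(2.0 * baseline, baseline + 200 * MB)
    if rss >= HARD_CEILING_BYTES:
        return "hard"
    if rss >= soft_at:
        return "soft_recycle"
    if rss >= warn_at:
        return "warn"
    return "ok"


def slope_exceeded(samples) -> bool:
    """samples: list of (minutes, rss_bytes) pairs, oldest first."""
    if len(samples) < 2:
        return False
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed < SLOPE_WINDOW_MIN:
        return False  # not enough history to judge a 30-minute trend
    return (r1 - r0) / MB / elapsed >= SLOPE_LIMIT_MB_PER_MIN
```

Under this reading, a baseline below 400 MB makes the warn and soft-recycle thresholds coincide at baseline + 200 MB, so the separate warn band only appears for larger baselines. Passing the samples in as plain data mirrors the pluggable RSS source that keeps the real class unit-testable.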

Stream C — Driver.Galaxy.Proxy (1.5 week estimate, complete as IPC-forwarder)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ (.NET 10)
  • Ipc/GalaxyIpcClient: Hello handshake + shared-secret authentication + single-call request/response over the data-plane pipe. Serializes concurrent callers via SemaphoreSlim. Lifts ErrorResponse to GalaxyIpcException with the error code.
  • GalaxyProxyDriver: implements IDriver + ITagDiscovery. Forwards lifecycle and discovery over IPC; maps Galaxy MX data types → DriverDataType and security classifications → SecurityClassification. The remaining capability interfaces in the Stream C plan (IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe, IRediscoverable) are structured identically and get wired in once the Host's MXAccess backend exists, so that the round-trips can actually serve data.
  • Supervisor/Backoff: 5s → 15s → 60s capped; RecordStableRun resets after 2-min successful run.
  • Supervisor/CircuitBreaker: 3 crashes per 5 min opens; cooldown escalates 1h → 4h → manual (TimeSpan.MaxValue). Sticky alert doesn't auto-clear when cooldown elapses; ManualReset only.
  • Supervisor/HeartbeatMonitor: 2s cadence, 3 consecutive misses = host dead.
  • Tests (11): Backoff sequence + reset; CircuitBreaker full 1h/4h/manual escalation path; HeartbeatMonitor miss-count + ack-reset; full IPC handshake round-trip (Host + Proxy over a real named pipe, heartbeat ack verified; shared-secret mismatch rejected with UnauthorizedAccessException).
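
The supervisor policies listed above are small state machines, sketched below in Python for illustration only (the shipped Supervisor/Backoff and Supervisor/CircuitBreaker are .NET classes). Method names and signatures are hypothetical. Two behaviours are assumptions not stated in this document: the crash window is cleared when the breaker opens, and a manual reset also resets the escalation level.

```python
import math


class Backoff:
    """Restart delays of 5 s, 15 s, then 60 s for every later attempt."""

    STEPS = (5.0, 15.0, 60.0)
    STABLE_RUN_SECONDS = 120.0  # a 2-minute successful run resets the sequence

    def __init__(self):
        self._attempt = 0

    def next_delay(self) -> float:
        delay = self.STEPS[min(self._attempt, len(self.STEPS) - 1)]
        self._attempt += 1
        return delay

    def record_stable_run(self, run_seconds: float) -> None:
        if run_seconds >= self.STABLE_RUN_SECONDS:
            self._attempt = 0


class CircuitBreaker:
    """Opens after 3 crashes in 5 minutes; cooldown escalates 1 h, 4 h, manual."""

    CRASH_LIMIT = 3
    WINDOW_SECONDS = 300.0
    COOLDOWNS = (3600.0, 4 * 3600.0, math.inf)  # inf stands in for "manual"

    def __init__(self):
        self._crashes = []
        self._opens = 0
        self.sticky_alert = False

    def record_crash(self, now: float):
        """Return None while closed, else the cooldown in seconds."""
        # Keep only crashes that are still inside the 5-minute window.
        self._crashes = [t for t in self._crashes if now - t < self.WINDOW_SECONDS]
        self._crashes.append(now)
        if len(self._crashes) < self.CRASH_LIMIT:
            return None
        cooldown = self.COOLDOWNS[min(self._opens, len(self.COOLDOWNS) - 1)]
        self._opens += 1
        self._crashes.clear()      # assumption: window restarts after opening
        self.sticky_alert = True   # never cleared by elapsed time
        return cooldown

    def manual_reset(self) -> None:
        # Assumption: a manual reset also resets the escalation level.
        self._opens = 0
        self._crashes.clear()
        self.sticky_alert = False
```

Taking the timestamp as a parameter rather than reading a clock keeps the escalation path testable without waiting hours, which is presumably how the 1h/4h/manual test above is feasible at unit-test speed.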

Deferred (explicitly noted as TODO)

Stream D — Retire legacy OtOpcUa.Host

Not executable until Stream E parity passes. Deleting the legacy project now would break the 494 passing v1 tests that are the parity baseline. Reaching the point where deletion is safe requires:

  1. Host MXAccess code lift (Task B.1 "move Galaxy code") from OtOpcUa.Host/ into OtOpcUa.Driver.Galaxy.Host/ — STA pump wiring, MxAccessHandle backing the real LMXProxyServer, GalaxyRepository and its SQL queries, GalaxyRuntimeProbeManager, Historian loader, the Ipc stub handler replaced with a real IFrameHandler that invokes the handle.
  2. Address-space build via IAddressSpaceBuilder produces byte-equivalent OPC UA browse output to v1 (Task C.4).
  3. Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost) with the correct service-account SIDs and per-process secret provisioning. Galaxy.Host starts before OtOpcUa.
  4. appsettings.json Galaxy config (MxAccess / Galaxy / Historian sections) migrated into DriverInstance.DriverConfig JSON in the Configuration DB via an idempotent migration script. Post-migration, the local appsettings.json keeps only Cluster.NodeId, ClusterId, and the DB conn string per decision #18.

Stream E — Parity validation

Requires live MXAccess + Galaxy runtime and the above lift complete. Work items:

  • Run v1 IntegrationTests against the v2 Galaxy.Proxy + Galaxy.Host topology. Pass count = v1 baseline; failures = 0. Per-test duration regression report flags any test >2× baseline.
  • Scripted Client.CLI walkthrough recorded at Phase 2 entry gate against v1, replayed against v2; diff must show only timestamp/latency differences.
  • Regression tests for the four 2026-04-13 stability findings (phantom probe, cross-host quality clear, sync-over-async guard, fire-and-forget alarm drain).
  • /codex:adversarial-review --base v2 on the merged Phase 2 diff — findings closed or deferred with rationale.
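
The first work item's gate (pass count equals the v1 baseline, zero failures, and a flag on any test slower than 2× baseline) can be sketched as a small comparison function. This Python sketch is illustrative only; the input shape (test name mapped to a passed flag and a duration in seconds) and the function name are assumptions, as is the choice that slow tests are reported without failing the gate.

```python
def parity_report(baseline: dict, candidate: dict, slow_factor: float = 2.0) -> dict:
    """Compare per-test results; each dict maps test name -> (passed, seconds)."""
    # Tests that ran against v1 but are absent from the v2 run.
    missing = sorted(set(baseline) - set(candidate))
    failed = sorted(name for name, (ok, _) in candidate.items() if not ok)
    # Flag tests whose v2 duration exceeds slow_factor x the v1 duration.
    slow = sorted(
        name
        for name, (_, secs) in candidate.items()
        if name in baseline and secs > slow_factor * baseline[name][1]
    )
    baseline_passes = sum(1 for ok, _ in baseline.values() if ok)
    candidate_passes = sum(1 for ok, _ in candidate.values() if ok)
    # Assumption: slow tests are reported for review but do not fail the gate.
    gate_ok = not missing and not failed and candidate_passes == baseline_passes
    return {"gate_ok": gate_ok, "missing": missing, "failed": failed, "slow": slow}
```

For example, with a v1 baseline of `{"Read": (True, 1.0), "Write": (True, 0.5)}` and a v2 run where `Read` takes 2.5 s, the gate still passes but `Read` appears in the `slow` list.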

Also deferred from Stream B

  • Task B.10 FaultShim (test-only ArchestrA.MxAccess substitute for fault injection). Needs the production ArchestrA.MxAccess reference in place first; flagged as part of the plan's "mid-gate review" fallback (Risk row 7).
  • Task B.8 WM_QUIT hard-exit escalation — wired in when the real Win32 pump replaces the BlockingCollection dispatcher. The StaPump.IsResponsiveAsync probe already exists; the supervisor escalation-to-Environment.Exit(2) belongs to the Program main loop after the pump integration.

Cross-session impact on the build

  • Full solution: 926 tests pass, 1 fails (pre-existing Phase 0 baseline Client.CLI.Tests.SubscribeCommandTests.Execute_PrintsSubscriptionMessage — not a Phase 2 regression; was red before Phase 1 and stays red through Phase 2).
  • New projects added to .slnx: Driver.Galaxy.Shared, Driver.Galaxy.Host, Driver.Galaxy.Proxy, plus the three matching test projects.
  • No existing tests broke. The 494 v1 OtOpcUa.Tests (net48) and 6 IntegrationTests (net48) still pass because the legacy OtOpcUa.Host is untouched.

Next-session checklist for Stream D + E

  1. Verify the local AVEVA stack is still green (Get-Service aaGR, aaBootstrap, slssvc → Running) and the Galaxy ZB repository is reachable from sqlcmd -S localhost -d ZB -E. The runtime is already on this machine — no install step needed.
  2. Capture Client.CLI walkthrough baseline against v1 (the parity reference).
  3. Move Galaxy-specific files from OtOpcUa.Host into Driver.Galaxy.Host, renaming namespaces. Replace StubFrameHandler with the real one.
  4. Wire up the real Win32 pump inside StaPump (lift from scadalink-design's LmxProxy.Host reference per CLAUDE.md).
  5. Run v1 IntegrationTests against the v2 topology — iterate on parity defects until green.
  6. Run Client.CLI walkthrough and diff.
  7. Regression tests for the four 2026-04-13 stability findings.
  8. Delete legacy OtOpcUa.Host; update .slnx; update installer scripts.
  9. Optional but valuable now that the runtime is local: AppServer-via-OI-Gateway smoke test (decision #142 / Phase 1 Task E.10) — the OI-Gateway install at C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\ is in place; the test was deferred for "needs live AVEVA runtime" reasons that no longer apply on this dev box.
  10. Adversarial review; exit-gate-phase-2.md recorded; PR merged.
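
For step 6, the requirement that the walkthrough diff show only timestamp/latency differences suggests masking those fields before diffing. The Python sketch below illustrates the idea under loud assumptions: the actual Client.CLI output format is not shown in this document, so both regular expressions (ISO-style timestamps, latencies written as a number followed by "ms") and the function names are hypothetical and would need adjusting to the real transcript.

```python
import difflib
import re

# Hypothetical patterns; the real Client.CLI output format may differ.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
LATENCY = re.compile(r"\d+(?:\.\d+)?\s*ms\b")


def normalize(line: str) -> str:
    """Replace volatile fields with fixed placeholders."""
    line = TIMESTAMP.sub("<ts>", line)
    return LATENCY.sub("<latency>", line)


def meaningful_diff(v1_text: str, v2_text: str) -> list:
    """Return unified-diff lines that remain after masking volatile fields."""
    v1_lines = [normalize(line) for line in v1_text.splitlines()]
    v2_lines = [normalize(line) for line in v2_text.splitlines()]
    return list(
        difflib.unified_diff(v1_lines, v2_lines, "v1", "v2", lineterm="", n=0)
    )
```

An empty result means the two recordings differ only in the masked fields; any remaining line is a candidate parity defect to triage in step 5's iteration loop.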