# Phase 2 — Galaxy Out-of-Process Refactor (Tier C)

> **Status**: DRAFT — implementation plan for Phase 2 of the v2 build (`plan.md` §6, `driver-stability.md` §"Galaxy — Deep Dive").
>
> **Branch**: `v2/phase-2-galaxy`
> **Estimated duration**: 6–8 weeks (largest refactor phase; Tier C protections + IPC are the bulk)
> **Predecessor**: Phase 1 (`phase-1-configuration-and-admin-scaffold.md`)
> **Successor**: Phase 3 (Modbus TCP driver)

## Phase Objective

Move Galaxy / MXAccess from the legacy in-process `OtOpcUa.Host` project into the **Tier C out-of-process** topology specified in `driver-stability.md`:

1. **`Driver.Galaxy.Shared`** — .NET Standard 2.0 IPC message contracts (MessagePack DTOs)
2. **`Driver.Galaxy.Host`** — .NET 4.8 x86 separate Windows Service that owns `MxAccessBridge`, `GalaxyRepository`, alarm tracking, `GalaxyRuntimeProbeManager`, the Wonderware Historian SDK, the STA thread + Win32 message pump, and all Tier C cross-cutting protections (memory watchdog, scheduled recycle, post-mortem MMF, IPC ACL + caller SID verification, per-process shared secret)
3. **`Driver.Galaxy.Proxy`** — .NET 10 in-process driver implementing every capability interface (`IDriver`, `ITagDiscovery`, `IRediscoverable`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`), forwarding each call over named-pipe IPC and owning the supervisor (heartbeat, host liveness, respawn with backoff, crash-loop circuit breaker, fan-out of Bad quality on host death)
4. **Retire the legacy `OtOpcUa.Host` project** — its responsibilities now live in `OtOpcUa.Server` (built in Phase 1) for OPC UA hosting and `OtOpcUa.Driver.Galaxy.Host` for Galaxy-specific runtime

**Parity, not regression.** The phase exit gate is: the v1 `IntegrationTests` suite passes byte-for-byte against the v2 Galaxy.Proxy + Galaxy.Host topology, and a scripted Client.CLI walkthrough produces equivalent output to v1 (decision #56).
Anything different — quality codes, browse paths, alarm shapes, history responses — is a parity defect.

This phase also closes the four 2026-04-13 stability findings (commits `c76ab8f` and `7310925`) by adding regression tests to the parity suite per `driver-specs.md` Galaxy "Operational Stability Notes".

## Scope — What Changes

| Concern | Change |
|---------|--------|
| Project layout | 3 new projects: `Driver.Galaxy.Shared` (.NET Standard 2.0), `Driver.Galaxy.Host` (.NET 4.8 x86), `Driver.Galaxy.Proxy` (.NET 10) |
| `OtOpcUa.Host` (legacy in-process) | **Retired**. Galaxy-specific code moves to `Driver.Galaxy.Host`; the small remainder (TopShelf wrapper, `Program.cs`) was already replaced by `OtOpcUa.Server` in Phase 1 |
| MXAccess COM access | Now lives only in `Driver.Galaxy.Host` (.NET 4.8 x86, STA thread + Win32 message pump). Main server (`OtOpcUa.Server`, .NET 10 x64) never references `ArchestrA.MxAccess` |
| Wonderware Historian SDK | Same — only in `Driver.Galaxy.Host` |
| Galaxy DB queries | `GalaxyRepository` moves to `Driver.Galaxy.Host`; the SQL connection string lives in the Galaxy `DriverConfig` JSON |
| OPC UA address space build for Galaxy | Driven by `Driver.Galaxy.Proxy` calls into `IAddressSpaceBuilder` (Phase 1 API) — Proxy fetches the hierarchy via IPC, streams nodes to the builder |
| Subscriptions, reads, writes, alarms, history | All forwarded over named-pipe IPC via MessagePack contracts in `Driver.Galaxy.Shared` |
| Tier C cross-cutting protections | All wired up per `driver-stability.md` §"Cross-Cutting Protections" → "Isolated host only (Tier C)" + the Galaxy deep dive |
| Windows service installer | Two services per Galaxy-using cluster node: `OtOpcUa` (the main server) + `OtOpcUaGalaxyHost` (the Galaxy host). Installer scripts updated. |
| `appsettings.json` (legacy Galaxy config sections) | Migrated into the central config DB under `DriverInstance.DriverConfig` JSON for the Galaxy driver instance. Local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18) |

## Scope — What Does NOT Change

| Item | Reason |
|------|--------|
| OPC UA wire behavior visible to clients | Parity is the gate. Clients see the same browse paths, quality codes, alarm shapes, and history responses as v1 |
| Galaxy hierarchy mapping (gobject parents → OPC UA folders) | Galaxy uses the SystemPlatform-kind namespace; UNS rules don't apply (decision #108). `Tag.FolderPath` mirrors v1 LmxOpcUa exactly |
| Galaxy `EquipmentClassRef` integration | Galaxy is SystemPlatform-namespace; no `Equipment` rows are created for Galaxy tags. Equipment-namespace work is for the native-protocol drivers in Phase 3+ |
| Any non-Galaxy driver | Phase 3+ |
| `OtOpcUa.Server` lifecycle / configuration substrate / Admin UI | Built in Phase 1; Phase 2 only adds the Galaxy.Proxy as a `DriverInstance` |
| Wonderware Historian dependency | Stays optional, loaded only when `Historian.Enabled = true` in the Galaxy `DriverConfig` |

## Entry Gate Checklist

- [ ] Phase 1 exit gate cleared (Configuration + Admin + Server + Core.Abstractions all green; Galaxy still in-process via legacy Host)
- [ ] `v2` branch is clean
- [ ] Phase 1 PR merged
- [ ] Dev Galaxy reachable for parity testing — same Galaxy that v1 tests against
- [ ] v1 IntegrationTests baseline pass count + duration recorded (this is the parity bar)
- [ ] Client.CLI walkthrough script captured against v1 and saved as reference output
- [ ] All Phase 2-relevant docs reviewed: `plan.md` §3–4, §5a (LmxNodeManager reusability), `driver-stability.md` §"Out-of-Process Driver Pattern (Generalized)" + §"Galaxy — Deep Dive (Tier C)", `driver-specs.md` §1 (Galaxy)
- [ ] Decisions cited or implemented by Phase 2 read at least skim-level: #11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102 (the Tier C IPC ACL + recycle decisions are all relevant)
- [ ] Confirmation that the four 2026-04-13 stability
findings (`c76ab8f`, `7310925`) have existing v1 tests that will be the regression net for the v2 split

**Evidence file**: `docs/v2/implementation/entry-gate-phase-2.md`.

## Task Breakdown

Five work streams (A–E). Stream A is the foundation; B and C run partly in parallel after A; D depends on B + C; E is the parity gate at the end.

### Stream A — Driver.Galaxy.Shared (1 week)

#### Task A.1 — Create the project

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` (.NET Standard 2.0 — must be consumable by both .NET 10 Proxy and .NET 4.8 Host per decision #25). Single dependency: `MessagePack` NuGet (decision #32).

#### Task A.2 — IPC message contracts

Define the MessagePack DTOs covering every Galaxy operation the Proxy will forward:

- **Lifecycle**: `OpenSessionRequest`, `OpenSessionResponse`, `CloseSessionRequest`, `Heartbeat` (separate channel per `driver-stability.md` §"Heartbeat between proxy and host")
- **Discovery**: `DiscoverGalaxyHierarchyRequest`, `GalaxyObjectInfo`, `GalaxyAttributeInfo` (these are not the v1 Domain types — they're the IPC-shape with MessagePack attributes; the Proxy maps to/from `DriverAttributeInfo` from `Core.Abstractions`)
- **Read / Write**: `ReadValuesRequest`, `ReadValuesResponse`, `WriteValuesRequest`, `WriteValuesResponse` (carries `DataValue` shape per decision #13: value + StatusCode + timestamps)
- **Subscriptions**: `SubscribeRequest`, `UnsubscribeRequest`, `OnDataChangeNotification` (server-pushed)
- **Alarms**: `AlarmSubscribeRequest`, `AlarmEvent`, `AlarmAcknowledgeRequest`
- **History**: `HistoryReadRequest`, `HistoryReadResponse`
- **Probe**: `HostConnectivityStatus`, `RuntimeStatusChangeNotification`
- **Recycle / control**: `RecycleHostRequest`, `RecycleStatusResponse`

Length-prefixed framing per decision #28; MessagePack body inside each frame.
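The framing rule above can be sketched as follows. This is a minimal illustration in Python rather than the project's own code; the 4-byte big-endian length prefix and the 16 MiB cap are assumptions (the text of decision #28 is not quoted in this plan), and the body is treated as opaque bytes standing in for a MessagePack-encoded DTO.

```python
import io
import struct

MAX_FRAME = 16 * 1024 * 1024  # assumed cap, not taken from the plan


def write_frame(stream, body: bytes) -> None:
    """Write one frame: 4-byte big-endian length (assumed width), then the body."""
    if len(body) > MAX_FRAME:
        raise ValueError("frame too large")
    stream.write(struct.pack(">I", len(body)))
    stream.write(body)


def _read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes, or raise EOFError if the stream ends first."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream closed mid-frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(stream) -> bytes:
    """Read one frame and return its body; reject oversized declared lengths."""
    (length,) = struct.unpack(">I", _read_exact(stream, 4))
    if length > MAX_FRAME:
        raise ValueError("declared length exceeds cap")
    return _read_exact(stream, length)


# Round-trip two frames through an in-memory stream.
buf = io.BytesIO()
write_frame(buf, b"hello")
write_frame(buf, b"")
buf.seek(0)
frames = [read_frame(buf), read_frame(buf)]
```

Checking the declared length before reading the body keeps a corrupt or hostile peer from forcing a large allocation, which fits the defense-in-depth posture the plan takes for the pipe.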

**Acceptance**:
- All contracts compile against .NET Standard 2.0
- Unit test project asserts each contract round-trips through MessagePack serialize → deserialize byte-for-byte
- Reflection test asserts no contract references `System.Text.Json` or anything not in BCL/MessagePack

#### Task A.3 — Versioning + capability negotiation

Add a top-of-stream `Hello` message exchanged on connection: protocol version, supported features. Future-proofs for adding new operations without breaking older Hosts.

**Acceptance**:
- Proxy refuses to talk to a Host advertising a major version it doesn't understand; logs the mismatch
- Host refuses to accept a Proxy from an unknown major version

### Stream B — Driver.Galaxy.Host (3–4 weeks)

#### Task B.1 — Create the project + move Galaxy code

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` (.NET 4.8, **x86 platform** target — required for MXAccess COM per decision #23).

Move from legacy `OtOpcUa.Host`:
- `MxAccessBridge.cs` and supporting types
- `GalaxyRepository.cs` and SQL queries
- Alarm tracking infrastructure
- `GalaxyRuntimeProbeManager.cs`
- `MxDataTypeMapper.cs`, `SecurityClassificationMapper.cs`
- Historian plugin loader and `IHistorianDataSource` (only loaded when `Historian.Enabled = true`)
- Configuration types (`MxAccessConfiguration`, `GalaxyRepositoryConfiguration`, `HistorianConfiguration`, `GalaxyScope`) — these now read from the JSON `DriverConfig` rather than `appsettings.json`

`Driver.Galaxy.Host` does **not** reference `Core.Abstractions` (decision §5 dependency graph) — it's a closed unit, IPC-fronted.

**Acceptance**:
- Project builds against .NET 4.8 x86
- All moved files have their namespace updated to `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.*`
- v1 unit tests for these classes (still in `OtOpcUa.Host.Tests`) move to a new `OtOpcUa.Driver.Galaxy.Host.Tests` project and pass

#### Task B.2 — STA thread + Win32 message pump

Per `driver-stability.md` Galaxy deep dive:
- Single STA thread per Host process owns all `LMXProxyServer` instances
- Work item dispatch via `PostThreadMessage(WM_APP)`
- `WM_QUIT` shutdown only after all outstanding work items complete
- Pump health probe: no-op work item every 10s, timeout = wedged-pump signal that triggers recycle

This is essentially v1's `StaComThread` lifted from `LmxProxy.Host` reference (per CLAUDE.md "Reference Implementation" section).

**Acceptance**:
- Pump starts, dispatches work items, exits cleanly on `WM_QUIT`
- Pump-wedged simulation (work item that infinite-loops) triggers the 10s timeout and posts a recycle event
- COM call from non-STA thread fails fast with a recognizable error (regression net for cross-apartment bugs)

#### Task B.3 — `MxAccessHandle : SafeHandle` for COM lifetime

Wrap each `LMXProxyServer` connection in a `SafeHandle` subclass (decision #65 + Galaxy deep dive):
- `ReleaseHandle()` calls `Marshal.ReleaseComObject` until refcount = 0, then `UnregisterProxy`
- Subscription handles wrapped per item; `RemoveAdvise` → `RemoveItem` ordering enforced
- `CriticalFinalizerObject` for finalizer ordering during AppDomain unload
- Pre-shutdown drain: cancel all subscriptions cleanly via the STA pump, in order, before pump exit

**Acceptance**:
- Unit test asserts a leaked handle (no `Dispose`) is released by the finalizer
- Shutdown test asserts no orphan COM refs after Host exits cleanly
- Stress test: 1000 subscribe/unsubscribe cycles → handle table empty at the end

#### Task B.4 — Subscription registry + reconnect

Per `driver-stability.md` Galaxy deep dive §"Subscription State and Reconnect":
- In-memory registry of `(Item, AdviseId, OwningHost)` for every subscription
- Reconnect order: register proxy → re-add items → re-advise
- Cross-host quality clear gated on host-status check (closes 2026-04-13 finding)

**Acceptance**:
- Disconnect simulation: kill TCP to MXAccess; subscriptions go Bad; reconnect; subscriptions restore in correct order
- Multi-host test: stop AppEngine A while AppEngine B is running; verify A's subscriptions go Bad but B's stay Good (closes the cross-host quality clear regression)

#### Task B.5 — Connection health probe (`GalaxyRuntimeProbeManager` rebuild)

Lift the existing `GalaxyRuntimeProbeManager` into the new project. Behaviors per `driver-stability.md`:
- Subscribe to per-host runtime-status synthetic attribute
- Bad-quality fan-out scoped to the host's subtree (not Galaxy-wide)
- Failed probe subscription does **not** leave a phantom entry that Tick() flips to Stopped (closes 2026-04-13 finding)

**Acceptance**:
- Probe failure simulation → no phantom entry; Tick() does not flip arbitrary subscriptions to Stopped (regression test for the finding)
- Probe transitions Stopped → Running → Stopped → Running over 5 minutes; quality fan-out happens correctly each transition

#### Task B.6 — Named-pipe IPC server with mandatory ACL

Per decision #76 + `driver-stability.md` §"IPC Security":
- Pipe ACL on creation: `ReadWrite | Synchronize` granted only to the OtOpcUa server's service principal SID; LocalSystem and Administrators **explicitly denied**
- Caller identity verification on each new connection: `GetImpersonationUserName()` cross-checked against configured server service SID; mismatches dropped before any RPC frame is read
- Per-process shared secret: passed by the supervisor at spawn time, required on first frame of every connection
- Heartbeat pipe: separate from data-plane pipe, same ACL

**Acceptance**:
- Unit test: pipe ACL enumeration shows only the configured SID + Synchronize/ReadWrite
- Integration test: connection
from a non-server-SID local process is dropped with audit log entry
- Integration test: connection without correct shared secret on first frame is dropped
- Defense-in-depth test: even if ACL is misconfigured (manually overridden), shared-secret check catches the wrong client

#### Task B.7 — Memory watchdog with Galaxy-specific thresholds

Per `driver-stability.md` Galaxy deep dive §"Memory Watchdog Thresholds":
- Sample RSS every 30s
- Warning: `1.5× baseline OR baseline + 200 MB` (whichever larger)
- Soft recycle: `2× baseline OR baseline + 200 MB` (whichever larger)
- Hard ceiling: 1.5 GB → force-kill
- Slope: > 5 MB/min sustained 30 min → soft recycle

**Acceptance**:
- Unit test against a mock RSS source: each threshold triggers the correct action
- Integration test with the FaultShim (Stream B.10): leak simulation crosses the soft-recycle threshold and triggers soft recycle path

#### Task B.8 — Recycle policy with WM_QUIT escalation

Per `driver-stability.md` Galaxy deep dive §"Recycle Policy (COM-specific)":
- 15s grace for in-flight COM calls (longer than FOCAS because legitimate MXAccess bulk reads take seconds)
- Per-handle: `RemoveAdvise` → `RemoveItem` → `ReleaseComObject` → `UnregisterProxy`, on the STA thread
- `WM_QUIT` posted only after all of the above complete
- If STA pump doesn't exit within 5s of `WM_QUIT` → `Environment.Exit(2)` (hard exit)
- Soft recycle scheduled daily at 03:00 local; recycle frequency cap 1/hour

**Acceptance**:
- Soft recycle test: in-flight call returns within grace → clean exit (`Exit(0)`)
- Soft recycle test: in-flight call exceeds grace → hard exit (`Exit(2)`); supervisor records as unclean recycle
- Wedged-pump test: pump doesn't drain after `WM_QUIT` → `Exit(2)` within 5s
- Frequency cap test: trigger 2 soft recycles within an hour → second is blocked, alert raised

#### Task B.9 — Post-mortem MMF writer

Per `driver-stability.md` Galaxy deep dive §"Post-Mortem Log Contents":
- Ring buffer of last 1000 IPC operations
- Plus Galaxy-specific snapshots: STA pump state (thread ID, last dispatched timestamp, queue depth), active subscription count by host, `MxAccessHandle` refcount snapshot, last 100 probe results, last redeploy event, Galaxy DB connection state, Historian connection state if HDA enabled
- Memory-mapped file at `%ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf`
- On graceful shutdown: flush ring + snapshots to a rotating log
- On hard crash: supervisor reads the MMF after the corpse is gone

**Acceptance**:
- Round-trip test: write 1000 operations → read back → assert order + content
- Hard-crash test: kill the process mid-operation → supervisor reads the MMF → ring tail shows the operation that was in flight

#### Task B.10 — Driver.Galaxy.FaultShim (test-only)

Per `driver-stability.md` §"Test Coverage for Galaxy Stability" — analogous to FOCAS FaultShim:
- Test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding
- Injects: COM exception at chosen call site, subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning unexpected refcount, STA pump deadlock simulation
- Production builds load the real `ArchestrA.MxAccess` from GAC

**Acceptance**:
- FaultShim binds successfully under test configuration
- Each fault scenario triggers the expected protection (memory watchdog → recycle, supervisor → respawn, etc.)

### Stream C — Driver.Galaxy.Proxy (1.5 weeks, can parallel with B after A done)

#### Task C.1 — Create the project + capability interface implementation

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` (.NET 10). Dependencies: `Core.Abstractions` (Phase 1) + `Driver.Galaxy.Shared` (Stream A) + `MessagePack`.

Implement every interface listed in Phase Objective above.
Each method:
- Marshals arguments into the matching IPC contract
- Sends over the data-plane pipe
- Awaits the response (with timeout per Polly per decision #34)
- Maps the response into the `Core.Abstractions` shape (`DataValue`, `DriverAttributeInfo`, etc.)
- Surfaces failures as the appropriate StatusCode

**Acceptance**:
- Each interface method has a unit test against a mock IPC channel: happy path + IPC timeout path + IPC error path
- `IRediscoverable` opt-in works: when Galaxy.Host signals a redeploy, Proxy invokes the Core's rediscovery flow (not full restart)

#### Task C.2 — Heartbeat sender + host liveness

Per `driver-stability.md` §"Heartbeat between proxy and host":
- 2s cadence (decision #72) on the dedicated heartbeat pipe
- 3 consecutive missed responses = host declared dead (6s detection)
- On host-dead: fan out Bad quality on all Galaxy-namespace nodes; ask supervisor to respawn

**Acceptance**:
- Heartbeat round-trip test against a mock host
- Missed-heartbeat test: stop the mock host's heartbeat responder → 3 misses → supervisor respawn requested
- GC pause test: simulate a 700ms GC pause on the proxy side → no false positive (single missed beat absorbed by 3-miss tolerance)

#### Task C.3 — Supervisor with respawn-with-backoff + crash-loop circuit breaker

Per `driver-stability.md` §"Crash-loop circuit breaker" + Galaxy §"Recovery Sequence After Crash":
- Backoff: 5s → 15s → 60s (capped)
- Crash-loop: 3 crashes / 5 min → escalating cooldown (1h → 4h → 24h manual)
- Sticky alert that doesn't auto-clear when cooldown elapses
- On respawn after recycle: reuse cached `time_of_last_deploy` watermark to skip full DB rediscovery if unchanged

**Acceptance**:
- Respawn test: kill host process → supervisor respawns within 5s → host re-establishes
- Crash-loop test: force 3 crashes within 5 minutes → 4th respawn blocked, alert raised, manual reset clears alert
- Cooldown escalation test: trip → 1h auto-reset → re-trip within 10 min → 4h cooldown → re-trip → 24h
manual

#### Task C.4 — Address space build via `IAddressSpaceBuilder`

When the Proxy is asked to discover its tags, it issues `DiscoverGalaxyHierarchyRequest` to the Host, receives the gobject tree + attributes, and streams them to `IAddressSpaceBuilder` (Phase 1 API per decision #52). Galaxy uses the SystemPlatform-kind namespace; tags use `FolderPath` (v1-style) — no `Equipment` rows are created.

**Acceptance**:
- Build a Galaxy address space via the Proxy → byte-equivalent OPC UA browse output to v1
- Memory test: large Galaxy (4000+ attributes) → Proxy peak RAM stays under 200 MB during build

### Stream D — Retire legacy OtOpcUa.Host (1 week, depends on B + C)

#### Task D.1 — Delete legacy Host project

Once Galaxy.Host + Galaxy.Proxy are functional, the legacy `OtOpcUa.Host` project's responsibilities are split:
- Galaxy-specific code → `Driver.Galaxy.Host` (already moved in Stream B)
- TopShelf wrapper, `Program.cs`, generic OPC UA hosting → already replaced by `OtOpcUa.Server` in Phase 1
- Anything else (configuration types, generic helpers) → moved to `OtOpcUa.Server` or `OtOpcUa.Configuration` as appropriate

Delete the project from the solution. Update `.slnx` and any references.
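One way to confirm the `.slnx` update is a small check that lists any remaining entries for the legacy project. The sketch below is illustrative only: it assumes the `.slnx` file uses `<Project Path="...">` elements, and the project name `ZB.MOM.WW.OtOpcUa.Host` is a guess based on the naming pattern of the other `src/` directories, not something this plan states.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Assumed legacy project name, inferred from the src/ naming pattern.
LEGACY_PROJECT = "ZB.MOM.WW.OtOpcUa.Host"


def legacy_references(slnx_text: str, project_name: str = LEGACY_PROJECT) -> list[str]:
    """Return Project paths whose project file name (without extension) equals project_name."""
    root = ET.fromstring(slnx_text)
    found = []
    for project in root.iter("Project"):
        path = project.get("Path", "").replace("\\", "/")
        # Compare the file stem exactly so Driver.Galaxy.Host is not a false positive.
        if PurePosixPath(path).stem == project_name:
            found.append(path)
    return found


# Hypothetical solution content used only to exercise the check.
sample = """<Solution>
  <Folder Name="/src/">
    <Project Path="src/ZB.MOM.WW.OtOpcUa.Host/ZB.MOM.WW.OtOpcUa.Host.csproj" />
    <Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.csproj" />
  </Folder>
</Solution>"""

remaining = legacy_references(sample)
```

An empty result after the deletion would be consistent with the acceptance criteria that follow; a non-empty one points at the entries still to remove.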

**Acceptance**:
- `ls src/` shows `OtOpcUa.Host` is gone
- `dotnet build OtOpcUa.slnx` succeeds with `OtOpcUa.Host` no longer in the build graph
- All previously-`OtOpcUa.Host.Tests` tests are either moved to the appropriate new test project or deleted as obsolete

#### Task D.2 — Update Windows service installer scripts

Two services per cluster node when Galaxy is configured:
- `OtOpcUa` (the main `OtOpcUa.Server`) — already installable per Phase 1
- `OtOpcUaGalaxyHost` (the `Driver.Galaxy.Host`) — new service registration

Installer must:
- Install both services with the correct service-account SIDs (Galaxy.Host's pipe ACL must grant the OtOpcUa service principal)
- Set the supervisor's per-process secret in the registry or a protected file before first start
- Honor service dependency: Galaxy.Host should be configured to start before OtOpcUa, or OtOpcUa retries until Galaxy.Host is up

**Acceptance**:
- Install both services on a test box → both start successfully
- Uninstall both → no leftover registry / file system state
- Service-restart cycle: stop OtOpcUa.Server → Galaxy.Host stays up → start OtOpcUa.Server → reconnects to Galaxy.Host pipe

#### Task D.3 — Migrate Galaxy `appsettings.json` config to central config DB

Galaxy-specific config sections (`MxAccess`, `Galaxy`, `Historian`) move into the `DriverInstance.DriverConfig` JSON for the Galaxy driver instance in the Configuration DB. The local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18).

Migration script: for each existing v1 `appsettings.json`, generate the equivalent `DriverConfig` JSON and either insert via Admin UI or via a one-shot SQL script.
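The core of such a migration script might look like the sketch below. It is a rough illustration, not the actual tool: the section names `MxAccess`, `Galaxy`, and `Historian` come from this task, while the function name, the choice to keep each section as a top-level key of the blob, and the sample keys (`ClientName`, `Logging`) are all hypothetical. The real `DriverConfig` schema is not shown in this plan.

```python
import json

# Section names taken from Task D.3.
GALAXY_SECTIONS = ("MxAccess", "Galaxy", "Historian")


def build_driver_config(appsettings: dict) -> tuple[str, dict]:
    """Split a v1 appsettings document into (DriverConfig JSON text, remaining settings)."""
    driver_config = {k: appsettings[k] for k in GALAXY_SECTIONS if k in appsettings}
    remainder = {k: v for k, v in appsettings.items() if k not in GALAXY_SECTIONS}
    # sort_keys gives deterministic output, so re-running the migration yields the same blob.
    return json.dumps(driver_config, sort_keys=True, indent=2), remainder


# Hypothetical v1 settings; the inner keys are placeholders.
v1 = {
    "MxAccess": {"ClientName": "demo"},
    "Historian": {"Enabled": False},
    "Logging": {"Level": "Info"},
}
blob, rest = build_driver_config(v1)
```

Returning the remainder separately makes it easy to review what would stay in the local file before trimming it down to the node and DB settings listed above.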

**Acceptance**:
- Migration script runs against a v1 dev `appsettings.json` → produces a JSON blob that loads into the Galaxy `DriverConfig` field
- The Galaxy driver instance starts with the migrated config and serves the same address space as v1

### Stream E — Parity validation (1 week, gate)

#### Task E.1 — Run v1 IntegrationTests against v2 Galaxy topology

Per decision #56:
- The same v1 IntegrationTests suite runs against the v2 build with Galaxy.Proxy + Galaxy.Host instead of in-process Galaxy
- All tests must pass
- Pass count = v1 baseline; failure count = 0; skip count = v1 baseline
- Test duration may increase (IPC round-trip latency); document the deviation

**Acceptance**:
- Test report shows pass/fail/skip counts identical to v1 baseline
- Per-test duration regression report: any test that takes >2× v1 baseline is flagged for review (may be an IPC bottleneck)

#### Task E.2 — Scripted Client.CLI walkthrough parity

Per decision #56:
- Execute the captured Client.CLI script (recorded at Phase 2 entry gate against v1) against the v2 Galaxy topology
- Diff the output against v1 reference
- Differences allowed only in: timestamps, latency-measurement output.
  Any value, quality, browse path, or alarm shape difference = parity defect

**Acceptance**:
- Walkthrough completes without errors
- Output diff vs v1: only timestamp / latency lines differ

#### Task E.3 — Regression tests for the four 2026-04-13 stability findings

Per `driver-specs.md` Galaxy "Operational Stability Notes": each of the four findings closed in commits `c76ab8f` and `7310925` should have a regression test in the Phase 2 parity suite:
- Phantom probe subscription flipping Tick() to Stopped (covered by Task B.5)
- Cross-host quality clear wiping sibling state during recovery (covered by Task B.4)
- Sync-over-async on the OPC UA stack thread → guard against new instances in `GenericDriverNodeManager`
- Fire-and-forget alarm tasks racing shutdown → guard via the pre-shutdown drain ordering in Task B.3

**Acceptance**:
- Each of the four scenarios has a named test in the parity suite
- Each test fails on a hand-introduced regression (revert the v1 fix, see test fail)

#### Task E.4 — Adversarial review of the Phase 2 diff

Per `implementation/overview.md` exit gate:
- Run `/codex:adversarial-review --base v2` on the merged Phase 2 diff
- Findings closed or explicitly deferred with rationale and ticket link

## Compliance Checks (run at exit gate)

`phase-2-compliance.ps1`:

### Schema compliance

N/A for Phase 2 — no schema changes (Configuration DB schema is unchanged from Phase 1).

### Decision compliance

For each decision number Phase 2 implements (#11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102, plus the Galaxy-specific #62), verify at least one citation exists in source, tests, or migrations:

```powershell
# The range is parenthesized and joined with '+': the comma operator binds tighter
# than '..', so writing '62, 63..69, 76' inside @() does not expand as intended.
$decisions = @(11, 24, 25, 28, 29, 32, 34, 44, 46, 47, 55, 56, 62) + (63..69) + @(76, 102, 122, 123, 124)
foreach ($d in $decisions) {
    # -w matches whole words only, so "decision #11" does not also match "decision #110"
    $hits = git grep -w "decision #$d" -- 'src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/' 'tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/'
    if (-not $hits) { Write-Error "Decision #$d has no citation"; exit 1 }
}
```

### Visual compliance

N/A — no Admin UI changes in Phase 2 (Galaxy is just another `DriverInstance` in the Drivers tab).

### Behavioral compliance — parity smoke test

The parity suite (Stream E) is the smoke test:

1. v1 IntegrationTests pass count = baseline, fail count = 0
2. Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
3. Four regression tests for 2026-04-13 findings pass

### Stability compliance

For Phase 2 (introduces the first Tier C driver in production form):
- Galaxy.Host implements every Tier C cross-cutting protection from `driver-stability.md`:
  - SafeHandle wrappers for COM (Task B.3) ✓
  - Memory watchdog with Galaxy thresholds (Task B.7) ✓
  - Bounded operation queues per device (already in Core, Phase 1) ✓
  - Heartbeat between proxy and host on separate channel (Tasks A.2, B.6, C.2) ✓
  - Scheduled recycling with `WM_QUIT` escalation to hard exit (Task B.8) ✓
  - Crash-loop circuit breaker (Task C.3) ✓
  - Post-mortem MMF readable after hard crash (Task B.9) ✓
  - IPC ACL + caller SID verification + per-process shared secret (Task B.6) ✓

Each protection has at least one regression test.
The compliance script enumerates and verifies presence:

```powershell
$protections = @(
    @{Name="SafeHandle for COM"; Test="MxAccessHandleFinalizerReleasesCom"},
    @{Name="Memory watchdog"; Test="WatchdogTriggersRecycleAtThreshold"},
    @{Name="Heartbeat detection"; Test="ThreeMissedHeartbeatsDeclaresHostDead"},
    @{Name="WM_QUIT escalation"; Test="WedgedPumpEscalatesToHardExit"},
    @{Name="Crash-loop breaker"; Test="ThreeCrashesInFiveMinutesOpensCircuit"},
    @{Name="Post-mortem MMF"; Test="MmfSurvivesHardCrashAndIsReadable"},
    @{Name="Pipe ACL enforcement"; Test="NonServerSidConnectionRejected"},
    @{Name="Shared secret"; Test="ConnectionWithoutSecretRejected"}
)
foreach ($p in $protections) {
    # Keep the output: dotnet test can exit 0 when the filter matches no tests, so a
    # missing test is also detected from the runner's message (message text assumed).
    $output = dotnet test --filter "FullyQualifiedName~$($p.Test)" --no-build --logger "console;verbosity=quiet"
    if ($LASTEXITCODE -ne 0 -or ($output -match 'No test matches')) {
        Write-Error "Stability protection '$($p.Name)' has no passing test '$($p.Test)'"; exit 1
    }
}
```

### Documentation compliance

- Any deviation from the Galaxy deep dive in `driver-stability.md` reflected back; new decisions added with `supersedes` notes if needed
- `driver-specs.md` §1 (Galaxy) updated to reflect the actual implementation if the IPC contract or recycle behavior differs from the design doc

## Completion Checklist

### Stream A — Driver.Galaxy.Shared
- [ ] Project created (.NET Standard 2.0, MessagePack-only dependency)
- [ ] All IPC contracts defined and round-trip tested
- [ ] Hello-message version negotiation implemented
- [ ] Reflection test confirms no .NET 10-only types leaked in

### Stream B — Driver.Galaxy.Host
- [ ] Project created (.NET 4.8 x86)
- [ ] All Galaxy-specific code moved from legacy Host
- [ ] STA thread + Win32 pump implemented; pump health probe wired up
- [ ] `MxAccessHandle : SafeHandle` for COM lifetime
- [ ] Subscription registry + reconnect with cross-host quality scoping
- [ ] `GalaxyRuntimeProbeManager` rebuilt; phantom-probe regression test passes
- [ ] Named-pipe IPC server with mandatory ACL + caller SID
verification + per-process secret
- [ ] Memory watchdog with Galaxy-specific thresholds
- [ ] Recycle policy with 15s grace + WM_QUIT escalation to hard exit
- [ ] Post-mortem MMF writer + supervisor reader
- [ ] FaultShim test-only assembly for fault injection

### Stream C — Driver.Galaxy.Proxy
- [ ] Project created (.NET 10, depends on Core.Abstractions + Galaxy.Shared)
- [ ] All capability interfaces implemented (IDriver, ITagDiscovery, IRediscoverable, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe)
- [ ] Heartbeat sender on dedicated channel; missed-heartbeat detection
- [ ] Supervisor with respawn-with-backoff + crash-loop circuit breaker (escalating cooldown 1h/4h/24h)
- [ ] Address space build via `IAddressSpaceBuilder` produces byte-equivalent v1 output

### Stream D — Retire legacy OtOpcUa.Host
- [ ] Legacy `OtOpcUa.Host` project deleted from solution
- [ ] Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost)
- [ ] Galaxy `appsettings.json` config migrated into central DB `DriverConfig`
- [ ] Migration script tested against v1 dev config

### Stream E — Parity validation
- [ ] v1 IntegrationTests pass with count = baseline, failures = 0
- [ ] Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
- [ ] All four 2026-04-13 stability findings have passing regression tests
- [ ] Per-test duration regression report: no test >2× v1 baseline (or flagged for review)

### Cross-cutting
- [ ] `phase-2-compliance.ps1` runs and exits 0
- [ ] All 8 Tier C stability protections have named, passing tests
- [ ] Adversarial review of the phase diff — findings closed or deferred with rationale
- [ ] PR opened against `v2`, includes: link to this doc, link to exit-gate record, compliance script output, parity test report, adversarial review output
- [ ] Reviewer signoff (one reviewer beyond the implementation lead)
- [ ] `exit-gate-phase-2.md` recorded

## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| IPC round-trip latency makes parity tests fail on timing assumptions | High | Medium | Per-test duration regression report identifies hot tests; tune timeouts in test config rather than in production code |
| MessagePack contract drift between Proxy and Host during development | Medium | High | Hello-message version negotiation rejects mismatched majors loudly; CI builds both projects in the same job |
| STA pump health probe is itself flaky and triggers spurious recycles | Medium | High | Probe interval tunable; default 10s gives 1000ms+ slack on a healthy pump; monitor via post-mortem MMF for false positives |
| Pipe ACL misconfiguration on installer leaves the IPC accessible to local users | Low | Critical | Defense-in-depth shared secret catches the case; ACL enumeration test in installer integration test |
| Galaxy.Host process recycle thrash if Galaxy or DB is intermittently unavailable | Medium | Medium | Crash-loop circuit breaker with escalating cooldown caps the thrash; Polly retry on the data path inside Host (not via supervisor restart) handles transient errors |
| Migration of `appsettings.json` Galaxy config to DB blob breaks existing deployments | Medium | Medium | Migration script is idempotent and dry-run-able; deploy script asserts central DB has the migrated config before stopping legacy Host |
| Phase 2 takes longer than 8 weeks | High | Medium | Mid-gate review at 4 weeks — if Stream B isn't past Task B.6 (IPC + ACL), defer Stream B.10 (FaultShim) to Phase 2.5 follow-up |
| Wonderware Historian SDK incompatibility with .NET 4.8 x86 in the new project layout | Low | High | Move and validate Historian loader as part of Task B.1 — early signal if SDK has any project-shape sensitivity |
| Hard-exit on wedged pump leaks COM resources | Accepted | Low | Documented intent: hard exit is the only safe response; OS process exit reclaims fds and the OS COM cleanup is best-effort. CNC equivalent in FOCAS deep dive accepts the same trade-off |

## Out of Scope (do not do in Phase 2)

- Any non-Galaxy driver (Phase 3+)
- UNS / Equipment-namespace work for Galaxy (Galaxy is SystemPlatform-namespace; no Equipment rows for Galaxy tags per decision #108)
- Equipment-class template integration with the schemas repo (Galaxy doesn't use `EquipmentClassRef`)
- Push-from-DB notification (decision #96 — v2.1)
- Any change to OPC UA wire behavior visible to clients (parity is the gate)
- Consumer cutover (ScadaBridge, Ignition, System Platform IO) — out of v2 scope, separate integration-team track per `implementation/overview.md`
- Removing the v1 deployment from production (a v2 release decision, not Phase 2)