lmxopcua/docs/v2/implementation/phase-2-galaxy-out-of-process.md
Joseph Doherty 4903a19ec9 Add data-path ACL design (acl-design.md, closes corrections B1) + dev-environment inventory and setup plan (dev-environment.md), and remove consumer cutover from OtOpcUa v2 scope.
ACL design defines NodePermissions bitmask flags covering Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall plus common bundles (ReadOnly / Operator / Engineer / Admin); 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with default-deny + additive grants and Browse-implication on ancestors; per-LDAP-group grants in a new generation-versioned NodeAcl table edited via the same draft → diff → publish → rollback boundary as every other content table; per-session permission-trie evaluator with O(depth × group-count) cost cached for the lifetime of the session and rebuilt on generation-apply or LDAP group cache expiry; cluster-create workflow seeds a default ACL set matching the v1 LmxOpcUa LDAP-role-to-permission map for v1 → v2 consumer migration parity; Admin UI ACL tab with two views (by LDAP group, by scope), bulk-grant flow, and permission simulator that lets operators preview "as user X" effective permissions across the cluster's UNS tree before publishing; explicit Deny deferred to v2.1 since verbose grants suffice at v2.0 fleet sizes; only denied OPC UA operations are audit-logged (not allowed ones — would dwarf the audit log). Schema doc gains the NodeAcl table with cross-cluster invariant enforcement and same-generation FK validation; admin-ui.md gains the ACLs tab; phase-1 doc gains Task E.9 wiring this through Stream E plus a NodeAcl entry in Task B.1's DbContext list.

Dev-environment doc inventories every external resource the v2 build needs across two tiers per decision #99 — inner-loop (in-process simulators on developer machines: SQL Server local or container, GLAuth at C:\publish\glauth\, local dev Galaxy) and integration (one dedicated Windows host with Docker Desktop on WSL2 backend so TwinCAT XAR VM can run in Hyper-V alongside containerized oitc/modbus-server, plus WSL2-hosted Snap7 and ab_server, plus OPC Foundation reference server, plus FOCAS TestStub and FaultShim) — with concrete container images, ports, default dev credentials (clearly marked dev-only since production uses Integrated Security / gMSA per decision #46), bootstrap order for both tiers, network topology diagram, test data seed locations, and operational risks (TwinCAT trial expiry automation, Docker pricing, integration host SPOF mitigation, per-developer GLAuth config sync, Aveva license scoping that keeps Galaxy tests on developer machines and off the shared host).

Removes consumer cutover (ScadaBridge / Ignition / System Platform IO) from OtOpcUa v2 scope per decision #136 — owned by a separate integration / operations team, tracked in 3-year-plan handoff §"Rollout Posture" and corrections §C5; OtOpcUa team's scope ends at Phase 5. Updates implementation/overview.md phase index to drop the "6+" row and add an explicit "OUT of v2 scope" callout; updates phase-1 and phase-2 docs to reframe cutover as integration-team-owned rather than future-phase numbered.

Decisions #129–137 added: ACL model (#129), NodeAcl generation-versioned (#130), v1-compatibility seed (#131), denied-only audit logging (#132), two-tier dev environment (#133), Docker WSL2 backend for TwinCAT VM coexistence (#134), TwinCAT VM centrally managed / Galaxy on dev machines only (#135), cutover out of v2 scope (#136), dev credentials documented openly (#137).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:58:33 -04:00

# Phase 2 — Galaxy Out-of-Process Refactor (Tier C)
> **Status**: DRAFT — implementation plan for Phase 2 of the v2 build (`plan.md` §6, `driver-stability.md` §"Galaxy — Deep Dive").
>
> **Branch**: `v2/phase-2-galaxy`
> **Estimated duration**: 6–8 weeks (largest refactor phase; Tier C protections + IPC are the bulk)
> **Predecessor**: Phase 1 (`phase-1-configuration-and-admin-scaffold.md`)
> **Successor**: Phase 3 (Modbus TCP driver)
## Phase Objective
Move Galaxy / MXAccess from the legacy in-process `OtOpcUa.Host` project into the **Tier C out-of-process** topology specified in `driver-stability.md`:
1. **`Driver.Galaxy.Shared`** — .NET Standard 2.0 IPC message contracts (MessagePack DTOs)
2. **`Driver.Galaxy.Host`** — .NET 4.8 x86 separate Windows Service that owns `MxAccessBridge`, `GalaxyRepository`, alarm tracking, `GalaxyRuntimeProbeManager`, the Wonderware Historian SDK, the STA thread + Win32 message pump, and all Tier C cross-cutting protections (memory watchdog, scheduled recycle, post-mortem MMF, IPC ACL + caller SID verification, per-process shared secret)
3. **`Driver.Galaxy.Proxy`** — .NET 10 in-process driver implementing every capability interface (`IDriver`, `ITagDiscovery`, `IRediscoverable`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`), forwarding each call over named-pipe IPC and owning the supervisor (heartbeat, host liveness, respawn with backoff, crash-loop circuit breaker, fan-out of Bad quality on host death)
4. **Retire the legacy `OtOpcUa.Host` project** — its responsibilities now live in `OtOpcUa.Server` (built in Phase 1) for OPC UA hosting and `OtOpcUa.Driver.Galaxy.Host` for Galaxy-specific runtime
**Parity, not regression.** The phase exit gate is: the v1 `IntegrationTests` suite passes byte-for-byte against the v2 Galaxy.Proxy + Galaxy.Host topology, and a scripted Client.CLI walkthrough produces equivalent output to v1 (decision #56). Anything different — quality codes, browse paths, alarm shapes, history responses — is a parity defect.
This phase also closes the four 2026-04-13 stability findings (commits `c76ab8f` and `7310925`) by adding regression tests to the parity suite per `driver-specs.md` Galaxy "Operational Stability Notes".
## Scope — What Changes
| Concern | Change |
|---------|--------|
| Project layout | 3 new projects: `Driver.Galaxy.Shared` (.NET Standard 2.0), `Driver.Galaxy.Host` (.NET 4.8 x86), `Driver.Galaxy.Proxy` (.NET 10) |
| `OtOpcUa.Host` (legacy in-process) | **Retired**. Galaxy-specific code moves to `Driver.Galaxy.Host`; the small remainder (TopShelf wrapper, `Program.cs`) was already replaced by `OtOpcUa.Server` in Phase 1 |
| MXAccess COM access | Now lives only in `Driver.Galaxy.Host` (.NET 4.8 x86, STA thread + Win32 message pump). Main server (`OtOpcUa.Server`, .NET 10 x64) never references `ArchestrA.MxAccess` |
| Wonderware Historian SDK | Same — only in `Driver.Galaxy.Host` |
| Galaxy DB queries | `GalaxyRepository` moves to `Driver.Galaxy.Host`; the SQL connection string lives in the Galaxy `DriverConfig` JSON |
| OPC UA address space build for Galaxy | Driven by `Driver.Galaxy.Proxy` calls into `IAddressSpaceBuilder` (Phase 1 API) — Proxy fetches the hierarchy via IPC, streams nodes to the builder |
| Subscriptions, reads, writes, alarms, history | All forwarded over named-pipe IPC via MessagePack contracts in `Driver.Galaxy.Shared` |
| Tier C cross-cutting protections | All wired up per `driver-stability.md` §"Cross-Cutting Protections" → "Isolated host only (Tier C)" + the Galaxy deep dive |
| Windows service installer | Two services per Galaxy-using cluster node: `OtOpcUa` (the main server) + `OtOpcUaGalaxyHost` (the Galaxy host). Installer scripts updated. |
| `appsettings.json` (legacy Galaxy config sections) | Migrated into the central config DB under `DriverInstance.DriverConfig` JSON for the Galaxy driver instance. Local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18) |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA wire behavior visible to clients | Parity is the gate. Clients see the same browse paths, quality codes, alarm shapes, and history responses as v1 |
| Galaxy hierarchy mapping (gobject parents → OPC UA folders) | Galaxy uses the SystemPlatform-kind namespace; UNS rules don't apply (decision #108). `Tag.FolderPath` mirrors v1 LmxOpcUa exactly |
| Galaxy `EquipmentClassRef` integration | Galaxy is SystemPlatform-namespace; no `Equipment` rows are created for Galaxy tags. Equipment-namespace work is for the native-protocol drivers in Phase 3+ |
| Any non-Galaxy driver | Phase 3+ |
| `OtOpcUa.Server` lifecycle / configuration substrate / Admin UI | Built in Phase 1; Phase 2 only adds the Galaxy.Proxy as a `DriverInstance` |
| Wonderware Historian dependency | Stays optional, loaded only when `Historian.Enabled = true` in the Galaxy `DriverConfig` |
## Entry Gate Checklist
- [ ] Phase 1 exit gate cleared (Configuration + Admin + Server + Core.Abstractions all green; Galaxy still in-process via legacy Host)
- [ ] `v2` branch is clean
- [ ] Phase 1 PR merged
- [ ] Dev Galaxy reachable for parity testing — same Galaxy that v1 tests against
- [ ] v1 IntegrationTests baseline pass count + duration recorded (this is the parity bar)
- [ ] Client.CLI walkthrough script captured against v1 and saved as reference output
- [ ] All Phase 2-relevant docs reviewed: `plan.md` §3–4, §5a (LmxNodeManager reusability), `driver-stability.md` §"Out-of-Process Driver Pattern (Generalized)" + §"Galaxy — Deep Dive (Tier C)", `driver-specs.md` §1 (Galaxy)
- [ ] Decisions cited or implemented by Phase 2 read at least at skim level: #11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102 (the Tier C IPC ACL + recycle decisions are all relevant)
- [ ] Confirmation that the four 2026-04-13 stability findings (`c76ab8f`, `7310925`) have existing v1 tests that will be the regression net for the v2 split
**Evidence file**: `docs/v2/implementation/entry-gate-phase-2.md`.
## Task Breakdown
Five work streams (A–E). Stream A is the foundation; B and C run partly in parallel after A; D depends on B + C; E is the parity gate at the end.
### Stream A — Driver.Galaxy.Shared (1 week)
#### Task A.1 — Create the project
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` (.NET Standard 2.0 — must be consumable by both .NET 10 Proxy and .NET 4.8 Host per decision #25). Single dependency: `MessagePack` NuGet (decision #32).
#### Task A.2 — IPC message contracts
Define the MessagePack DTOs covering every Galaxy operation the Proxy will forward:
- **Lifecycle**: `OpenSessionRequest`, `OpenSessionResponse`, `CloseSessionRequest`, `Heartbeat` (separate channel per `driver-stability.md` §"Heartbeat between proxy and host")
- **Discovery**: `DiscoverGalaxyHierarchyRequest`, `GalaxyObjectInfo`, `GalaxyAttributeInfo` (these are not the v1 Domain types — they're the IPC-shape with MessagePack attributes; the Proxy maps to/from `DriverAttributeInfo` from `Core.Abstractions`)
- **Read / Write**: `ReadValuesRequest`, `ReadValuesResponse`, `WriteValuesRequest`, `WriteValuesResponse` (carries `DataValue` shape per decision #13: value + StatusCode + timestamps)
- **Subscriptions**: `SubscribeRequest`, `UnsubscribeRequest`, `OnDataChangeNotification` (server-pushed)
- **Alarms**: `AlarmSubscribeRequest`, `AlarmEvent`, `AlarmAcknowledgeRequest`
- **History**: `HistoryReadRequest`, `HistoryReadResponse`
- **Probe**: `HostConnectivityStatus`, `RuntimeStatusChangeNotification`
- **Recycle / control**: `RecycleHostRequest`, `RecycleStatusResponse`
Length-prefixed framing per decision #28; MessagePack body inside each frame.
**Acceptance**:
- All contracts compile against .NET Standard 2.0
- Unit test project asserts each contract round-trips through MessagePack serialize → deserialize byte-for-byte
- Reflection test asserts no contract references `System.Text.Json` or anything not in BCL/MessagePack
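
The length-prefixed framing mentioned above can be sketched as follows. This is a minimal illustration in Python (used only to keep the sketch short; the production contracts in this plan are .NET). The 4-byte big-endian prefix and the `MAX_FRAME` cap are assumptions, since this document does not state the prefix width or any size limit, and the body is treated as opaque bytes that would be the MessagePack payload.

```python
import struct

HEADER = struct.Struct(">I")      # assumption: 4-byte big-endian length prefix
MAX_FRAME = 16 * 1024 * 1024      # assumption: sanity cap, not taken from the plan


def encode_frame(body: bytes) -> bytes:
    """Prefix an already-serialized body with its length."""
    if len(body) > MAX_FRAME:
        raise ValueError("frame too large")
    return HEADER.pack(len(body)) + body


def decode_frames(buffer: bytearray) -> list[bytes]:
    """Pop every complete frame from the front of buffer; keep partial data."""
    frames = []
    while len(buffer) >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, 0)
        if length > MAX_FRAME:
            raise ValueError("frame too large")
        end = HEADER.size + length
        if len(buffer) < end:
            break  # body not fully received yet
        frames.append(bytes(buffer[HEADER.size:end]))
        del buffer[:end]
    return frames
```

The decoder leaves an incomplete trailing frame in the buffer, so the caller can keep appending pipe reads and call `decode_frames` again.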
#### Task A.3 — Versioning + capability negotiation
Add a top-of-stream `Hello` message exchanged on connection: protocol version, supported features. Future-proofs for adding new operations without breaking older Hosts.
**Acceptance**:
- Proxy refuses to talk to a Host advertising a major version it doesn't understand; logs the mismatch
- Host refuses to accept a Proxy from an unknown major version
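
A minimal sketch of the `Hello` check, in Python for brevity. The field names (`major`, `minor`, `features`) and the rule that only the major version must match are assumptions for illustration; the plan only states that an unknown major version is refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hello:
    major: int            # hypothetical field: breaking-change counter
    minor: int            # hypothetical field: additive-change counter
    features: frozenset   # hypothetical field: optional operations supported


def negotiate(local: Hello, remote: Hello) -> frozenset:
    """Return the feature set both sides support, or refuse on major mismatch."""
    if local.major != remote.major:
        raise ConnectionRefusedError(
            f"protocol major mismatch: local {local.major}, remote {remote.major}"
        )
    return local.features & remote.features
```

Both Proxy and Host would run the same check against the peer's `Hello`, which is how each side can refuse the other independently.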
### Stream B — Driver.Galaxy.Host (3–4 weeks)
#### Task B.1 — Create the project + move Galaxy code
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` (.NET 4.8, **x86 platform** target — required for MXAccess COM per decision #23).
Move from legacy `OtOpcUa.Host`:
- `MxAccessBridge.cs` and supporting types
- `GalaxyRepository.cs` and SQL queries
- Alarm tracking infrastructure
- `GalaxyRuntimeProbeManager.cs`
- `MxDataTypeMapper.cs`, `SecurityClassificationMapper.cs`
- Historian plugin loader and `IHistorianDataSource` (only loaded when `Historian.Enabled = true`)
- Configuration types (`MxAccessConfiguration`, `GalaxyRepositoryConfiguration`, `HistorianConfiguration`, `GalaxyScope`) — these now read from the JSON `DriverConfig` rather than `appsettings.json`
`Driver.Galaxy.Host` does **not** reference `Core.Abstractions` (decision §5 dependency graph) — it's a closed unit, IPC-fronted.
**Acceptance**:
- Project builds against .NET 4.8 x86
- All moved files have their namespace updated to `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.*`
- v1 unit tests for these classes (still in `OtOpcUa.Host.Tests`) move to a new `OtOpcUa.Driver.Galaxy.Host.Tests` project and pass
#### Task B.2 — STA thread + Win32 message pump
Per `driver-stability.md` Galaxy deep dive:
- Single STA thread per Host process owns all `LMXProxyServer` instances
- Work item dispatch via `PostThreadMessage(WM_APP)`
- `WM_QUIT` shutdown only after all outstanding work items complete
- Pump health probe: no-op work item every 10s, timeout = wedged-pump signal that triggers recycle
This is essentially v1's `StaComThread`, lifted from the `LmxProxy.Host` reference implementation (per CLAUDE.md "Reference Implementation" section).
**Acceptance**:
- Pump starts, dispatches work items, exits cleanly on `WM_QUIT`
- Pump-wedged simulation (work item that infinite-loops) triggers the 10s timeout and posts a recycle event
- COM call from non-STA thread fails fast with a recognizable error (regression net for cross-apartment bugs)
#### Task B.3 — `MxAccessHandle : SafeHandle` for COM lifetime
Wrap each `LMXProxyServer` connection in a `SafeHandle` subclass (decision #65 + Galaxy deep dive):
- `ReleaseHandle()` calls `Marshal.ReleaseComObject` until refcount = 0, then `UnregisterProxy`
- Subscription handles wrapped per item; `RemoveAdvise` → `RemoveItem` ordering enforced
- `CriticalFinalizerObject` for finalizer ordering during AppDomain unload
- Pre-shutdown drain: cancel all subscriptions cleanly via the STA pump, in order, before pump exit
**Acceptance**:
- Unit test asserts a leaked handle (no `Dispose`) is released by the finalizer
- Shutdown test asserts no orphan COM refs after Host exits cleanly
- Stress test: 1000 subscribe/unsubscribe cycles → handle table empty at the end
#### Task B.4 — Subscription registry + reconnect
Per `driver-stability.md` Galaxy deep dive §"Subscription State and Reconnect":
- In-memory registry of `(Item, AdviseId, OwningHost)` for every subscription
- Reconnect order: register proxy → re-add items → re-advise
- Cross-host quality clear gated on host-status check (closes 2026-04-13 finding)
**Acceptance**:
- Disconnect simulation: kill TCP to MXAccess; subscriptions go Bad; reconnect; subscriptions restore in correct order
- Multi-host test: stop AppEngine A while AppEngine B is running; verify A's subscriptions go Bad but B's stay Good (closes the cross-host quality clear regression)
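
The host-scoped quality behavior in the multi-host test can be sketched like this. The registry tuple `(Item, AdviseId, OwningHost)` comes from the task description; the `quality` field, the string values, and the function name are hypothetical, and Python is used only for brevity.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    item: str
    advise_id: int
    owning_host: str
    quality: str = "Good"   # hypothetical stand-in for the OPC UA quality


def mark_host_stopped(registry: list, host: str) -> list:
    """Flip only the stopped host's subscriptions to Bad; return affected items."""
    affected = []
    for sub in registry:
        if sub.owning_host == host:
            sub.quality = "Bad"
            affected.append(sub.item)
    return affected
```

Keying the fan-out on `owning_host` is what keeps AppEngine B's subscriptions Good while AppEngine A is down.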
#### Task B.5 — Connection health probe (`GalaxyRuntimeProbeManager` rebuild)
Lift the existing `GalaxyRuntimeProbeManager` into the new project. Behaviors per `driver-stability.md`:
- Subscribe to per-host runtime-status synthetic attribute
- Bad-quality fan-out scoped to the host's subtree (not Galaxy-wide)
- Failed probe subscription does **not** leave a phantom entry that Tick() flips to Stopped (closes 2026-04-13 finding)
**Acceptance**:
- Probe failure simulation → no phantom entry; Tick() does not flip arbitrary subscriptions to Stopped (regression test for the finding)
- Probe transitions Stopped → Running → Stopped → Running over 5 minutes; quality fan-out happens correctly each transition
#### Task B.6 — Named-pipe IPC server with mandatory ACL
Per decision #76 + `driver-stability.md` §"IPC Security":
- Pipe ACL on creation: `ReadWrite | Synchronize` granted only to the OtOpcUa server's service principal SID; LocalSystem and Administrators **explicitly denied**
- Caller identity verification on each new connection: `GetImpersonationUserName()` cross-checked against configured server service SID; mismatches dropped before any RPC frame is read
- Per-process shared secret: passed by the supervisor at spawn time, required on first frame of every connection
- Heartbeat pipe: separate from data-plane pipe, same ACL
**Acceptance**:
- Unit test: pipe ACL enumeration shows only the configured SID + Synchronize/ReadWrite
- Integration test: connection from a non-server-SID local process is dropped with audit log entry
- Integration test: connection without correct shared secret on first frame is dropped
- Defense-in-depth test: even if ACL is misconfigured (manually overridden), shared-secret check catches the wrong client
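
The layering of the caller-identity check and the shared-secret check might look roughly like the sketch below (Python for brevity). The function and parameter names are hypothetical, and the SID comparison is a plain string match standing in for the real Windows identity lookup. The first frame is read lazily so that an identity mismatch is dropped before any frame is read, as the task requires.

```python
import hmac
from typing import Callable


def accept_connection(caller_sid: str, allowed_sid: str, shared_secret: bytes,
                      read_first_frame: Callable[[], bytes]) -> bool:
    """Return True only if both the identity and the secret check pass."""
    if caller_sid != allowed_sid:
        return False  # dropped before any frame is read
    # compare_digest avoids leaking how many leading bytes matched
    return hmac.compare_digest(read_first_frame(), shared_secret)
```

Because the secret is checked independently of the SID, a misconfigured pipe ACL alone does not let a wrong client through, which is the defense-in-depth case in the acceptance list.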
#### Task B.7 — Memory watchdog with Galaxy-specific thresholds
Per `driver-stability.md` Galaxy deep dive §"Memory Watchdog Thresholds":
- Sample RSS every 30s
- Warning: `1.5× baseline OR baseline + 200 MB` (whichever larger)
- Soft recycle: `2× baseline OR baseline + 200 MB` (whichever larger)
- Hard ceiling: 1.5 GB → force-kill
- Slope: > 5 MB/min sustained 30 min → soft recycle
**Acceptance**:
- Unit test against a mock RSS source: each threshold triggers the correct action
- Integration test with the FaultShim (Task B.10): leak simulation crosses the soft-recycle threshold and triggers the soft recycle path
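
A worked sketch of the threshold arithmetic, in Python for brevity. The numbers are the ones listed in this task; the inclusive comparisons (`>=`), the action names, and the reading of "sustained" as average growth over the whole 30-minute window are assumptions.

```python
MB = 1024 * 1024
HARD_CEILING = 1536 * MB  # 1.5 GB


def classify_rss(rss: int, baseline: int) -> str:
    """Map one RSS sample to the watchdog action for the given baseline."""
    warn = max(1.5 * baseline, baseline + 200 * MB)
    soft = max(2 * baseline, baseline + 200 * MB)
    if rss >= HARD_CEILING:
        return "force-kill"
    if rss >= soft:
        return "soft-recycle"
    if rss >= warn:
        return "warning"
    return "ok"


def slope_triggers(samples: list, interval_s: int = 30, window_min: int = 30) -> bool:
    """True if average growth over the last window exceeds 5 MB/min."""
    needed = window_min * 60 // interval_s + 1  # 61 samples span 30 minutes
    if len(samples) < needed:
        return False
    window = samples[-needed:]
    return (window[-1] - window[0]) / window_min > 5 * MB
```

With a 500 MB baseline, the warning threshold is 750 MB and soft recycle is 1000 MB. With the values as written, both floors are `baseline + 200 MB`, so for baselines of 200 MB or less the two thresholds coincide and the sketch reports soft recycle directly.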
#### Task B.8 — Recycle policy with WM_QUIT escalation
Per `driver-stability.md` Galaxy deep dive §"Recycle Policy (COM-specific)":
- 15s grace for in-flight COM calls (longer than FOCAS because legitimate MXAccess bulk reads take seconds)
- Per-handle: `RemoveAdvise` → `RemoveItem` → `ReleaseComObject` → `UnregisterProxy`, on the STA thread
- `WM_QUIT` posted only after all of the above complete
- If STA pump doesn't exit within 5s of `WM_QUIT` → `Environment.Exit(2)` (hard exit)
- Soft recycle scheduled daily at 03:00 local; recycle frequency cap 1/hour
**Acceptance**:
- Soft recycle test: in-flight call returns within grace → clean exit (`Exit(0)`)
- Soft recycle test: in-flight call exceeds grace → hard exit (`Exit(2)`); supervisor records as unclean recycle
- Wedged-pump test: pump doesn't drain after `WM_QUIT` → `Exit(2)` within 5s
- Frequency cap test: trigger 2 soft recycles within an hour → second is blocked, alert raised
#### Task B.9 — Post-mortem MMF writer
Per `driver-stability.md` Galaxy deep dive §"Post-Mortem Log Contents":
- Ring buffer of last 1000 IPC operations
- Plus Galaxy-specific snapshots: STA pump state (thread ID, last dispatched timestamp, queue depth), active subscription count by host, `MxAccessHandle` refcount snapshot, last 100 probe results, last redeploy event, Galaxy DB connection state, Historian connection state if HDA enabled
- Memory-mapped file at `%ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf`
- On graceful shutdown: flush ring + snapshots to a rotating log
- On hard crash: supervisor reads the MMF after the corpse is gone
**Acceptance**:
- Round-trip test: write 1000 operations → read back → assert order + content
- Hard-crash test: kill the process mid-operation → supervisor reads the MMF → ring tail shows the operation that was in flight
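
One way to picture the ring buffer over a memory-mapped file is the sketch below (Python for brevity). The fixed 64-byte slot, the 8-byte write counter at offset 0, and the class name are assumptions for illustration; the plan specifies only the capacity of 1000 operations and the use of an MMF. Galaxy-specific snapshots are omitted.

```python
import mmap
import os
import struct

SLOT_SIZE = 64                 # assumption: fixed-size record slots
CAPACITY = 1000                # last 1000 IPC operations, per the plan
COUNTER = struct.Struct("<Q")  # assumption: total records ever written


class PostMortemRing:
    def __init__(self, path: str):
        size = COUNTER.size + SLOT_SIZE * CAPACITY
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT)
        os.ftruncate(self._fd, size)  # same size on reopen, so contents survive
        self._map = mmap.mmap(self._fd, size)

    def append(self, record: bytes) -> None:
        (count,) = COUNTER.unpack_from(self._map, 0)
        offset = COUNTER.size + (count % CAPACITY) * SLOT_SIZE
        self._map[offset:offset + SLOT_SIZE] = record[:SLOT_SIZE].ljust(SLOT_SIZE, b"\0")
        COUNTER.pack_into(self._map, 0, count + 1)  # bump counter after the slot is written

    def read_all(self) -> list:
        """Return surviving records, oldest first."""
        (count,) = COUNTER.unpack_from(self._map, 0)
        out = []
        for i in range(max(0, count - CAPACITY), count):
            offset = COUNTER.size + (i % CAPACITY) * SLOT_SIZE
            out.append(bytes(self._map[offset:offset + SLOT_SIZE]).rstrip(b"\0"))
        return out

    def close(self) -> None:
        self._map.close()
        os.close(self._fd)
```

A reader that opens the same path after the writer is gone sees the last 1000 records in order, which mirrors the supervisor reading the MMF after a hard crash.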
#### Task B.10 — Driver.Galaxy.FaultShim (test-only)
Per `driver-stability.md` §"Test Coverage for Galaxy Stability" — analogous to FOCAS FaultShim:
- Test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding
- Injects: COM exception at chosen call site, subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning unexpected refcount, STA pump deadlock simulation
- Production builds load the real `ArchestrA.MxAccess` from GAC
**Acceptance**:
- FaultShim binds successfully under test configuration
- Each fault scenario triggers the expected protection (memory watchdog → recycle, supervisor → respawn, etc.)
### Stream C — Driver.Galaxy.Proxy (1.5 weeks, can parallel with B after A done)
#### Task C.1 — Create the project + capability interface implementation
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` (.NET 10). Dependencies: `Core.Abstractions` (Phase 1) + `Driver.Galaxy.Shared` (Stream A) + `MessagePack`.
Implement every interface listed in Phase Objective above. Each method:
- Marshals arguments into the matching IPC contract
- Sends over the data-plane pipe
- Awaits the response (with timeout per Polly per decision #34)
- Maps the response into the `Core.Abstractions` shape (`DataValue`, `DriverAttributeInfo`, etc.)
- Surfaces failures as the appropriate StatusCode
**Acceptance**:
- Each interface method has a unit test against a mock IPC channel: happy path + IPC timeout path + IPC error path
- `IRediscoverable` opt-in works: when Galaxy.Host signals a redeploy, Proxy invokes the Core's rediscovery flow (not full restart)
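
The per-method forwarding pattern (send, await with timeout, map failures to a StatusCode) can be sketched as below. Python's `asyncio.wait_for` stands in for the Polly timeout policy, `send` is a hypothetical stand-in for the data-plane pipe, and the choice of `BadTimeout` and `BadCommunicationError` as the surfaced codes is an assumption; the plan says only "the appropriate StatusCode".

```python
import asyncio

# OPC UA status code values
GOOD = 0x00000000
BAD_COMMUNICATION_ERROR = 0x80050000
BAD_TIMEOUT = 0x800A0000


async def forward(send, request, timeout_s: float):
    """Send one request over the IPC channel and map failures to a status code."""
    try:
        response = await asyncio.wait_for(send(request), timeout_s)
    except asyncio.TimeoutError:
        return None, BAD_TIMEOUT
    except OSError:  # includes broken-pipe and connection-reset errors
        return None, BAD_COMMUNICATION_ERROR
    return response, GOOD
```

The three branches line up with the acceptance criteria for each interface method: happy path, IPC timeout path, and IPC error path.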
#### Task C.2 — Heartbeat sender + host liveness
Per `driver-stability.md` §"Heartbeat between proxy and host":
- 2s cadence (decision #72) on the dedicated heartbeat pipe
- 3 consecutive missed responses = host declared dead (6s detection)
- On host-dead: fan out Bad quality on all Galaxy-namespace nodes; ask supervisor to respawn
**Acceptance**:
- Heartbeat round-trip test against a mock host
- Missed-heartbeat test: stop the mock host's heartbeat responder → 3 misses → supervisor respawn requested
- GC pause test: simulate a 700ms GC pause on the proxy side → no false positive (single missed beat absorbed by 3-miss tolerance)
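
The 3-miss rule reduces to a small counter, sketched here in Python for brevity. The class and method names are hypothetical; `on_tick` is assumed to be called once per 2s heartbeat interval with whether a response arrived in that interval.

```python
class HeartbeatMonitor:
    def __init__(self, miss_limit: int = 3):
        self.miss_limit = miss_limit
        self.consecutive_misses = 0
        self.host_dead = False

    def on_tick(self, response_received: bool) -> bool:
        """Return True on the tick where the host is first declared dead."""
        if self.host_dead:
            return False  # already reported; supervisor owns recovery from here
        if response_received:
            self.consecutive_misses = 0  # any response clears the streak
            return False
        self.consecutive_misses += 1
        if self.consecutive_misses >= self.miss_limit:
            self.host_dead = True
            return True
        return False
```

Resetting the counter on any response is what absorbs an isolated missed beat, such as the GC pause case, without a false positive.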
#### Task C.3 — Supervisor with respawn-with-backoff + crash-loop circuit breaker
Per `driver-stability.md` §"Crash-loop circuit breaker" + Galaxy §"Recovery Sequence After Crash":
- Backoff: 5s → 15s → 60s (capped)
- Crash-loop: 3 crashes / 5 min → escalating cooldown (1h → 4h → 24h manual)
- Sticky alert that doesn't auto-clear when cooldown elapses
- On respawn after recycle: reuse cached `time_of_last_deploy` watermark to skip full DB rediscovery if unchanged
**Acceptance**:
- Respawn test: kill host process → supervisor respawns within 5s → host re-establishes
- Crash-loop test: force 3 crashes within 5 minutes → 4th respawn blocked, alert raised, manual reset clears alert
- Cooldown escalation test: trip → 1h auto-reset → re-trip within 10 min → 4h cooldown → re-trip → 24h manual
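
The backoff schedule and the escalating breaker can be sketched as follows (Python for brevity). Several points are assumptions because the plan leaves them open: the breaker trips on the third crash inside the window, the trip count only resets on manual reset, and the third level is treated as manual-only rather than a timed 24h cooldown.

```python
import math

BACKOFF_S = [5, 15, 60]                 # capped at the last entry
COOLDOWNS_S = [3600, 4 * 3600, None]    # None = manual reset only (assumption)


def backoff_delay(attempt: int) -> int:
    return BACKOFF_S[min(attempt, len(BACKOFF_S) - 1)]


class CrashLoopBreaker:
    def __init__(self, max_crashes: int = 3, window_s: float = 300.0):
        self.max_crashes = max_crashes
        self.window_s = window_s
        self.crash_times = []
        self.trips = 0
        self.open_until = None   # timestamp, or math.inf for manual-only
        self.alert = False       # sticky: cleared only by manual_reset

    def record_crash(self, now: float) -> None:
        self.crash_times = [t for t in self.crash_times if now - t <= self.window_s]
        self.crash_times.append(now)
        if len(self.crash_times) >= self.max_crashes:
            cooldown = COOLDOWNS_S[min(self.trips, len(COOLDOWNS_S) - 1)]
            self.open_until = math.inf if cooldown is None else now + cooldown
            self.trips += 1
            self.alert = True
            self.crash_times.clear()

    def may_respawn(self, now: float) -> bool:
        return self.open_until is None or now >= self.open_until

    def manual_reset(self) -> None:
        self.crash_times.clear()
        self.trips = 0
        self.open_until = None
        self.alert = False
```

Note that `alert` stays set after a cooldown elapses, matching the sticky-alert requirement, while `may_respawn` becomes true again.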
#### Task C.4 — Address space build via `IAddressSpaceBuilder`
When the Proxy is asked to discover its tags, it issues `DiscoverGalaxyHierarchyRequest` to the Host, receives the gobject tree + attributes, and streams them to `IAddressSpaceBuilder` (Phase 1 API per decision #52). Galaxy uses the SystemPlatform-kind namespace; tags use `FolderPath` (v1-style) — no `Equipment` rows are created.
**Acceptance**:
- Build a Galaxy address space via the Proxy → byte-equivalent OPC UA browse output to v1
- Memory test: large Galaxy (4000+ attributes) → Proxy peak RAM stays under 200 MB during build
### Stream D — Retire legacy OtOpcUa.Host (1 week, depends on B + C)
#### Task D.1 — Delete legacy Host project
Once Galaxy.Host + Galaxy.Proxy are functional, the legacy `OtOpcUa.Host` project's responsibilities are split:
- Galaxy-specific code → `Driver.Galaxy.Host` (already moved in Stream B)
- TopShelf wrapper, `Program.cs`, generic OPC UA hosting → already replaced by `OtOpcUa.Server` in Phase 1
- Anything else (configuration types, generic helpers) → moved to `OtOpcUa.Server` or `OtOpcUa.Configuration` as appropriate
Delete the project from the solution. Update `.slnx` and any references.
**Acceptance**:
- `ls src/` shows `OtOpcUa.Host` is gone
- `dotnet build OtOpcUa.slnx` succeeds with `OtOpcUa.Host` no longer in the build graph
- All previously-`OtOpcUa.Host.Tests` tests are either moved to the appropriate new test project or deleted as obsolete
#### Task D.2 — Update Windows service installer scripts
Two services per cluster node when Galaxy is configured:
- `OtOpcUa` (the main `OtOpcUa.Server`) — already installable per Phase 1
- `OtOpcUaGalaxyHost` (the `Driver.Galaxy.Host`) — new service registration
Installer must:
- Install both services with the correct service-account SIDs (Galaxy.Host's pipe ACL must grant the OtOpcUa service principal)
- Set the supervisor's per-process secret in the registry or a protected file before first start
- Honor service dependency: Galaxy.Host should be configured to start before OtOpcUa, or OtOpcUa retries until Galaxy.Host is up
**Acceptance**:
- Install both services on a test box → both start successfully
- Uninstall both → no leftover registry / file system state
- Service-restart cycle: stop OtOpcUa.Server → Galaxy.Host stays up → start OtOpcUa.Server → reconnects to Galaxy.Host pipe
#### Task D.3 — Migrate Galaxy `appsettings.json` config to central config DB
Galaxy-specific config sections (`MxAccess`, `Galaxy`, `Historian`) move into the `DriverInstance.DriverConfig` JSON for the Galaxy driver instance in the Configuration DB. The local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18).
Migration script: for each existing v1 `appsettings.json`, generate the equivalent `DriverConfig` JSON and either insert via Admin UI or via a one-shot SQL script.
**Acceptance**:
- Migration script runs against a v1 dev `appsettings.json` → produces a JSON blob that loads into the Galaxy `DriverConfig` field
- The Galaxy driver instance starts with the migrated config and serves the same address space as v1
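
The core of the migration script is a split of the v1 settings into the Galaxy sections and everything else. A minimal sketch in Python, assuming the `DriverConfig` JSON simply carries the three sections under their v1 names; the actual `DriverConfig` shape is not defined in this document, and the keys inside each section in the usage example are hypothetical.

```python
import json

GALAXY_SECTIONS = ("MxAccess", "Galaxy", "Historian")  # section names from the plan


def split_appsettings(appsettings: dict):
    """Return (DriverConfig JSON text, remaining local settings)."""
    driver_config = {k: appsettings[k] for k in GALAXY_SECTIONS if k in appsettings}
    remaining = {k: v for k, v in appsettings.items() if k not in GALAXY_SECTIONS}
    return json.dumps(driver_config, indent=2, sort_keys=True), remaining
```

For example, `split_appsettings({"MxAccess": {"ClientName": "demo"}, "Logging": {}})` yields a JSON blob containing only the `MxAccess` section. The remaining dictionary would still need trimming to the node identity and DB connection entries that the local file is allowed to keep.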
### Stream E — Parity validation (1 week, gate)
#### Task E.1 — Run v1 IntegrationTests against v2 Galaxy topology
Per decision #56:
- The same v1 IntegrationTests suite runs against the v2 build with Galaxy.Proxy + Galaxy.Host instead of in-process Galaxy
- All tests must pass
- Pass count = v1 baseline; failure count = 0; skip count = v1 baseline
- Test duration may increase (IPC round-trip latency); document the deviation
**Acceptance**:
- Test report shows pass/fail/skip counts identical to v1 baseline
- Per-test duration regression report: any test that takes >2× v1 baseline is flagged for review (may be an IPC bottleneck)
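
The >2× duration check is simple enough to sketch directly (Python for brevity). The input shape, a mapping from test name to duration in seconds, is an assumption about how the baseline and current runs are recorded.

```python
def flag_slow_tests(baseline: dict, current: dict, factor: float = 2.0) -> list:
    """Return (name, v1_seconds, v2_seconds) for tests slower than factor x baseline."""
    flagged = []
    for name, v1 in baseline.items():
        v2 = current.get(name)
        if v2 is not None and v1 > 0 and v2 > factor * v1:
            flagged.append((name, v1, v2))
    # worst relative regression first
    return sorted(flagged, key=lambda r: r[2] / r[1], reverse=True)
```

Tests missing from the current run are ignored here; the pass/skip count comparison is what catches those.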
#### Task E.2 — Scripted Client.CLI walkthrough parity
Per decision #56:
- Execute the captured Client.CLI script (recorded at Phase 2 entry gate against v1) against the v2 Galaxy topology
- Diff the output against v1 reference
- Differences allowed only in: timestamps, latency-measurement output. Any value, quality, browse path, or alarm shape difference = parity defect
**Acceptance**:
- Walkthrough completes without errors
- Output diff vs v1: only timestamp / latency lines differ
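
Diffing "modulo timestamps and latency" means normalizing those fields before comparing. A sketch in Python, where the two patterns (ISO-8601-style timestamps and `N ms` latency figures) are assumptions about the Client.CLI output format, which this document does not show.

```python
import difflib
import re

# Assumed output formats; adjust to the real Client.CLI output.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
LATENCY = re.compile(r"\d+(?:\.\d+)?\s*ms\b")


def normalize(line: str) -> str:
    """Mask the fields that are allowed to differ between v1 and v2."""
    return LATENCY.sub("<latency>", TIMESTAMP.sub("<ts>", line))


def parity_diff(v1_output: str, v2_output: str) -> list:
    """Return unified-diff lines; an empty list means parity."""
    a = [normalize(line) for line in v1_output.splitlines()]
    b = [normalize(line) for line in v2_output.splitlines()]
    return list(difflib.unified_diff(a, b, "v1", "v2", lineterm=""))
```

Any non-empty result would then be a value, quality, browse path, or alarm shape difference, which the task defines as a parity defect.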
#### Task E.3 — Regression tests for the four 2026-04-13 stability findings
Per `driver-specs.md` Galaxy "Operational Stability Notes": each of the four findings closed in commits `c76ab8f` and `7310925` should have a regression test in the Phase 2 parity suite:
- Phantom probe subscription flipping Tick() to Stopped (covered by Task B.5)
- Cross-host quality clear wiping sibling state during recovery (covered by Task B.4)
- Sync-over-async on the OPC UA stack thread → guard against new instances in `GenericDriverNodeManager`
- Fire-and-forget alarm tasks racing shutdown → guard via the pre-shutdown drain ordering in Task B.3
**Acceptance**:
- Each of the four scenarios has a named test in the parity suite
- Each test fails on a hand-introduced regression (revert the v1 fix, see test fail)
#### Task E.4 — Adversarial review of the Phase 2 diff
Per `implementation/overview.md` exit gate:
- Run `/codex:adversarial-review --base v2` on the merged Phase 2 diff
- Findings closed or explicitly deferred with rationale and ticket link
## Compliance Checks (run at exit gate)
`phase-2-compliance.ps1`:
### Schema compliance
N/A for Phase 2 — no schema changes (Configuration DB schema is unchanged from Phase 1).
### Decision compliance
For each decision number Phase 2 implements (#11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102, plus the Galaxy-specific #62), verify at least one citation exists in source, tests, or migrations:
```powershell
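# The range is concatenated separately: inside a comma list, "," binds tighter than "..".
$decisions = @(11, 24, 25, 28, 29, 32, 34, 44, 46, 47, 55, 56, 62) + (63..69) + @(76, 102, 122, 123, 124)
foreach ($d in $decisions) {
    # -w matches whole words only, so "decision #11" does not also match "decision #110"
    $hits = git grep -w "decision #$d" -- 'src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/' 'tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/'
    if (-not $hits) { Write-Error "Decision #$d has no citation"; exit 1 }
}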
```
### Visual compliance
N/A — no Admin UI changes in Phase 2 (Galaxy is just another `DriverInstance` in the Drivers tab).
### Behavioral compliance — parity smoke test
The parity suite (Stream E) is the smoke test:
1. v1 IntegrationTests pass count = baseline, fail count = 0
2. Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
3. Four regression tests for 2026-04-13 findings pass
### Stability compliance
For Phase 2 (introduces the first Tier C driver in production form):
- Galaxy.Host implements every Tier C cross-cutting protection from `driver-stability.md`:
- SafeHandle wrappers for COM (Task B.3) ✓
- Memory watchdog with Galaxy thresholds (Task B.7) ✓
- Bounded operation queues per device (already in Core, Phase 1) ✓
- Heartbeat between proxy and host on separate channel (Tasks A.2, B.6, C.2) ✓
- Scheduled recycling with `WM_QUIT` escalation to hard exit (Task B.8) ✓
- Crash-loop circuit breaker (Task C.3) ✓
- Post-mortem MMF readable after hard crash (Task B.9) ✓
- IPC ACL + caller SID verification + per-process shared secret (Task B.6) ✓
Each protection has at least one regression test. The compliance script enumerates and verifies presence:
```powershell
$protections = @(
@{Name="SafeHandle for COM"; Test="MxAccessHandleFinalizerReleasesCom"},
@{Name="Memory watchdog"; Test="WatchdogTriggersRecycleAtThreshold"},
@{Name="Heartbeat detection"; Test="ThreeMissedHeartbeatsDeclaresHostDead"},
@{Name="WM_QUIT escalation"; Test="WedgedPumpEscalatesToHardExit"},
@{Name="Crash-loop breaker"; Test="ThreeCrashesInFiveMinutesOpensCircuit"},
@{Name="Post-mortem MMF"; Test="MmfSurvivesHardCrashAndIsReadable"},
@{Name="Pipe ACL enforcement"; Test="NonServerSidConnectionRejected"},
@{Name="Shared secret"; Test="ConnectionWithoutSecretRejected"}
)
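foreach ($p in $protections) {
    # Caveat: under VSTest a filter that matches no tests can still exit 0,
    # so the exit code alone does not prove the named test exists.
    $hits = dotnet test --filter "FullyQualifiedName~$($p.Test)" --no-build --logger "console;verbosity=quiet"
    if ($LASTEXITCODE -ne 0) { Write-Error "Stability protection '$($p.Name)' has no passing test '$($p.Test)'"; exit 1 }
}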
```
### Documentation compliance
- Any deviation from the Galaxy deep dive in `driver-stability.md` is reflected back into that document; new decisions are added with `supersedes` notes if needed
- `driver-specs.md` §1 (Galaxy) updated to reflect the actual implementation if the IPC contract or recycle behavior differs from the design doc
## Completion Checklist
### Stream A — Driver.Galaxy.Shared
- [ ] Project created (.NET Standard 2.0, MessagePack-only dependency)
- [ ] All IPC contracts defined and round-trip tested
- [ ] Hello-message version negotiation implemented
- [ ] Reflection test confirms no .NET 10-only types leaked in
### Stream B — Driver.Galaxy.Host
- [ ] Project created (.NET 4.8 x86)
- [ ] All Galaxy-specific code moved from legacy Host
- [ ] STA thread + Win32 pump implemented; pump health probe wired up
- [ ] `MxAccessHandle : SafeHandle` for COM lifetime
- [ ] Subscription registry + reconnect with cross-host quality scoping
- [ ] `GalaxyRuntimeProbeManager` rebuilt; phantom-probe regression test passes
- [ ] Named-pipe IPC server with mandatory ACL + caller SID verification + per-process secret
- [ ] Memory watchdog with Galaxy-specific thresholds
- [ ] Recycle policy with 15s grace + WM_QUIT escalation to hard exit
- [ ] Post-mortem MMF writer + supervisor reader
- [ ] FaultShim test-only assembly for fault injection
### Stream C — Driver.Galaxy.Proxy
- [ ] Project created (.NET 10, depends on Core.Abstractions + Galaxy.Shared)
- [ ] All capability interfaces implemented (IDriver, ITagDiscovery, IRediscoverable, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe)
- [ ] Heartbeat sender on dedicated channel; missed-heartbeat detection
- [ ] Supervisor with respawn-with-backoff + crash-loop circuit breaker (escalating cooldown 1h/4h/24h)
- [ ] Address space build via `IAddressSpaceBuilder` produces byte-equivalent v1 output
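
The crash-loop circuit breaker with escalating cooldown can be sketched as below. This is a hedged, language-neutral illustration in Python. The 1h/4h/24h cooldowns come from the checklist, and the three-crashes-in-five-minutes trip condition is inferred from the regression test name `ThreeCrashesInFiveMinutesOpensCircuit`. The class name, method names, and the choice to clear the crash window after each trip are assumptions.

```python
from collections import deque

# Escalating cooldowns in seconds: 1h, 4h, 24h (the last one repeats).
COOLDOWNS_S = (3600, 4 * 3600, 24 * 3600)


class CrashLoopBreaker:
    """Hypothetical supervisor-side breaker; times are seconds on any monotonic clock."""

    def __init__(self, max_crashes: int = 3, window_s: float = 300.0):
        self.max_crashes = max_crashes
        self.window_s = window_s
        self._crashes = deque()   # timestamps of recent crashes
        self._trips = 0           # how many times the circuit has opened
        self._open_until = 0.0    # respawn is blocked before this time

    def record_crash(self, now: float) -> None:
        self._crashes.append(now)
        # Drop crashes that have fallen out of the sliding window.
        while self._crashes and now - self._crashes[0] > self.window_s:
            self._crashes.popleft()
        if len(self._crashes) >= self.max_crashes:
            # Open the circuit; each successive trip uses a longer cooldown.
            index = min(self._trips, len(COOLDOWNS_S) - 1)
            self._open_until = now + COOLDOWNS_S[index]
            self._trips += 1
            self._crashes.clear()

    def may_respawn(self, now: float) -> bool:
        return now >= self._open_until
```

In this sketch the supervisor would call `record_crash` on every unexpected Host exit and consult `may_respawn` before starting a new process. When the escalation level resets after a long healthy run is left out here, since this section does not specify it.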
### Stream D — Retire legacy OtOpcUa.Host
- [ ] Legacy `OtOpcUa.Host` project deleted from solution
- [ ] Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost)
- [ ] Galaxy `appsettings.json` config migrated into central DB `DriverConfig`
- [ ] Migration script tested against v1 dev config
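
An idempotent, dry-run-able migration of the Galaxy settings could take roughly the following shape. This is a sketch in Python under stated assumptions: the `"Galaxy"` section name, the function name, and the use of a plain dictionary as a stand-in for the central DB `DriverConfig` table are all hypothetical. The real migration script and schema are defined elsewhere in the v2 docs.

```python
import json


def migrate_galaxy_config(appsettings: dict, driver_config: dict, dry_run: bool = True) -> list:
    """Return the planned changes; apply them only when dry_run is False.

    `driver_config` stands in for the DriverConfig table (key -> JSON blob).
    """
    galaxy = appsettings.get("Galaxy", {})  # hypothetical section name
    # Serialize with sorted keys so the same input always yields the same blob,
    # which makes the comparison below stable across runs.
    blob = json.dumps(galaxy, sort_keys=True)
    changes = []
    if driver_config.get("Galaxy") != blob:
        changes.append(("upsert", "Galaxy", blob))
        if not dry_run:
            driver_config["Galaxy"] = blob
    return changes
```

A second run against already-migrated data returns an empty change list, which is the property a deploy script can check before stopping the legacy Host.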
### Stream E — Parity validation
- [ ] v1 IntegrationTests pass with count = baseline, failures = 0
- [ ] Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
- [ ] All four 2026-04-13 stability findings have passing regression tests
- [ ] Per-test duration regression report: no test >2× v1 baseline (or flagged for review)
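
The per-test duration check reduces to a ratio comparison against the v1 baseline. The sketch below, in Python, assumes durations are available as simple name-to-milliseconds mappings. The function name and input shape are hypothetical; only the 2× threshold comes from the checklist.

```python
def duration_regressions(baseline_ms: dict, current_ms: dict, factor: float = 2.0) -> list:
    """Return (test, baseline, current, ratio) for tests slower than factor x baseline."""
    flagged = []
    for name, base in baseline_ms.items():
        cur = current_ms.get(name)
        # Skip tests missing from the current run or with an unusable baseline;
        # count mismatches are covered by the separate baseline-count check.
        if cur is None or base <= 0:
            continue
        if cur > factor * base:
            flagged.append((name, base, cur, cur / base))
    # Worst offenders first, so the report leads with the hottest tests.
    return sorted(flagged, key=lambda row: row[3], reverse=True)
```

Tests that appear in the result are either tuned in test configuration or flagged for review, as the checklist item allows.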
### Cross-cutting
- [ ] `phase-2-compliance.ps1` runs and exits 0
- [ ] All 8 Tier C stability protections have named, passing tests
- [ ] Adversarial review of the phase diff — findings closed or deferred with rationale
- [ ] PR opened against `v2`, includes: link to this doc, link to exit-gate record, compliance script output, parity test report, adversarial review output
- [ ] Reviewer signoff (one reviewer beyond the implementation lead)
- [ ] `exit-gate-phase-2.md` recorded
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| IPC round-trip latency makes parity tests fail on timing assumptions | High | Medium | Per-test duration regression report identifies hot tests; tune timeouts in test config rather than in production code |
| MessagePack contract drift between Proxy and Host during development | Medium | High | Hello-message version negotiation rejects mismatched majors loudly; CI builds both projects in the same job |
| STA pump health probe is itself flaky and triggers spurious recycles | Medium | High | Probe interval tunable; default 10s gives 1000ms+ slack on a healthy pump; monitor via post-mortem MMF for false positives |
| Pipe ACL misconfiguration on installer leaves the IPC accessible to local users | Low | Critical | Defense-in-depth shared secret catches the case; ACL enumeration test in installer integration test |
| Galaxy.Host process recycle thrash if Galaxy or DB is intermittently unavailable | Medium | Medium | Crash-loop circuit breaker with escalating cooldown caps the thrash; Polly retry on the data path inside Host (not via supervisor restart) handles transient errors |
| Migration of `appsettings.json` Galaxy config to DB blob breaks existing deployments | Medium | Medium | Migration script is idempotent and dry-run-able; deploy script asserts central DB has the migrated config before stopping legacy Host |
| Phase 2 takes longer than 8 weeks | High | Medium | Mid-gate review at 4 weeks — if Stream B isn't past Task B.6 (IPC + ACL), defer Task B.10 (FaultShim) to a Phase 2.5 follow-up |
| Wonderware Historian SDK incompatibility with .NET 4.8 x86 in the new project layout | Low | High | Move and validate Historian loader as part of Task B.1 — early signal if SDK has any project-shape sensitivity |
| Hard-exit on wedged pump leaks COM resources | Accepted | Low | Documented intent: hard exit is the only safe response; process exit lets the OS reclaim handles, and COM cleanup is best-effort. The CNC equivalent in the FOCAS deep dive accepts the same trade-off |
## Out of Scope (do not do in Phase 2)
- Any non-Galaxy driver (Phase 3+)
- UNS / Equipment-namespace work for Galaxy (Galaxy is SystemPlatform-namespace; no Equipment rows for Galaxy tags per decision #108)
- Equipment-class template integration with the schemas repo (Galaxy doesn't use `EquipmentClassRef`)
- Push-from-DB notification (decision #96 — v2.1)
- Any change to OPC UA wire behavior visible to clients (parity is the gate)
- Consumer cutover (ScadaBridge, Ignition, System Platform IO) — out of v2 scope, separate integration-team track per `implementation/overview.md`
- Removing the v1 deployment from production (a v2 release decision, not Phase 2)