ACL design defines NodePermissions bitmask flags covering Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall plus common bundles (ReadOnly / Operator / Engineer / Admin); 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with default-deny + additive grants and Browse-implication on ancestors; per-LDAP-group grants in a new generation-versioned NodeAcl table edited via the same draft → diff → publish → rollback boundary as every other content table; per-session permission-trie evaluator with O(depth × group-count) cost cached for the lifetime of the session and rebuilt on generation-apply or LDAP group cache expiry; cluster-create workflow seeds a default ACL set matching the v1 LmxOpcUa LDAP-role-to-permission map for v1 → v2 consumer migration parity; Admin UI ACL tab with two views (by LDAP group, by scope), bulk-grant flow, and permission simulator that lets operators preview "as user X" effective permissions across the cluster's UNS tree before publishing; explicit Deny deferred to v2.1 since verbose grants suffice at v2.0 fleet sizes; only denied OPC UA operations are audit-logged (not allowed ones — would dwarf the audit log). Schema doc gains the NodeAcl table with cross-cluster invariant enforcement and same-generation FK validation; admin-ui.md gains the ACLs tab; phase-1 doc gains Task E.9 wiring this through Stream E plus a NodeAcl entry in Task B.1's DbContext list. 
Dev-environment doc inventories every external resource the v2 build needs across two tiers per decision #99 — inner-loop (in-process simulators on developer machines: SQL Server local or container, GLAuth at C:\publish\glauth\, local dev Galaxy) and integration (one dedicated Windows host with Docker Desktop on WSL2 backend so TwinCAT XAR VM can run in Hyper-V alongside containerized oitc/modbus-server, plus WSL2-hosted Snap7 and ab_server, plus OPC Foundation reference server, plus FOCAS TestStub and FaultShim) — with concrete container images, ports, default dev credentials (clearly marked dev-only since production uses Integrated Security / gMSA per decision #46), bootstrap order for both tiers, network topology diagram, test data seed locations, and operational risks (TwinCAT trial expiry automation, Docker pricing, integration host SPOF mitigation, per-developer GLAuth config sync, Aveva license scoping that keeps Galaxy tests on developer machines and off the shared host). Removes consumer cutover (ScadaBridge / Ignition / System Platform IO) from OtOpcUa v2 scope per decision #136 — owned by a separate integration / operations team, tracked in 3-year-plan handoff §"Rollout Posture" and corrections §C5; OtOpcUa team's scope ends at Phase 5. Updates implementation/overview.md phase index to drop the "6+" row and add an explicit "OUT of v2 scope" callout; updates phase-1 and phase-2 docs to reframe cutover as integration-team-owned rather than future-phase numbered. Decisions #129–137 added: ACL model (#129), NodeAcl generation-versioned (#130), v1-compatibility seed (#131), denied-only audit logging (#132), two-tier dev environment (#133), Docker WSL2 backend for TwinCAT VM coexistence (#134), TwinCAT VM centrally managed / Galaxy on dev machines only (#135), cutover out of v2 scope (#136), dev credentials documented openly (#137). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
506 lines
33 KiB
Markdown
# Phase 2 — Galaxy Out-of-Process Refactor (Tier C)

> **Status**: DRAFT — implementation plan for Phase 2 of the v2 build (`plan.md` §6, `driver-stability.md` §"Galaxy — Deep Dive").
>
> **Branch**: `v2/phase-2-galaxy`
> **Estimated duration**: 6–8 weeks (largest refactor phase; Tier C protections + IPC are the bulk)
> **Predecessor**: Phase 1 (`phase-1-configuration-and-admin-scaffold.md`)
> **Successor**: Phase 3 (Modbus TCP driver)

## Phase Objective

Move Galaxy / MXAccess from the legacy in-process `OtOpcUa.Host` project into the **Tier C out-of-process** topology specified in `driver-stability.md`:

1. **`Driver.Galaxy.Shared`** — .NET Standard 2.0 IPC message contracts (MessagePack DTOs)
2. **`Driver.Galaxy.Host`** — .NET 4.8 x86 separate Windows Service that owns `MxAccessBridge`, `GalaxyRepository`, alarm tracking, `GalaxyRuntimeProbeManager`, the Wonderware Historian SDK, the STA thread + Win32 message pump, and all Tier C cross-cutting protections (memory watchdog, scheduled recycle, post-mortem MMF, IPC ACL + caller SID verification, per-process shared secret)
3. **`Driver.Galaxy.Proxy`** — .NET 10 in-process driver implementing every capability interface (`IDriver`, `ITagDiscovery`, `IRediscoverable`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`), forwarding each call over named-pipe IPC and owning the supervisor (heartbeat, host liveness, respawn with backoff, crash-loop circuit breaker, fan-out of Bad quality on host death)
4. **Retire the legacy `OtOpcUa.Host` project** — its responsibilities now live in `OtOpcUa.Server` (built in Phase 1) for OPC UA hosting and `OtOpcUa.Driver.Galaxy.Host` for Galaxy-specific runtime

**Parity, not regression.** The phase exit gate is: the v1 `IntegrationTests` suite passes byte-for-byte against the v2 Galaxy.Proxy + Galaxy.Host topology, and a scripted Client.CLI walkthrough produces equivalent output to v1 (decision #56). Anything different — quality codes, browse paths, alarm shapes, history responses — is a parity defect.

This phase also closes the four 2026-04-13 stability findings (commits `c76ab8f` and `7310925`) by adding regression tests to the parity suite per `driver-specs.md` Galaxy "Operational Stability Notes".

## Scope — What Changes

| Concern | Change |
|---------|--------|
| Project layout | 3 new projects: `Driver.Galaxy.Shared` (.NET Standard 2.0), `Driver.Galaxy.Host` (.NET 4.8 x86), `Driver.Galaxy.Proxy` (.NET 10) |
| `OtOpcUa.Host` (legacy in-process) | **Retired**. Galaxy-specific code moves to `Driver.Galaxy.Host`; the small remainder (TopShelf wrapper, `Program.cs`) was already replaced by `OtOpcUa.Server` in Phase 1 |
| MXAccess COM access | Now lives only in `Driver.Galaxy.Host` (.NET 4.8 x86, STA thread + Win32 message pump). Main server (`OtOpcUa.Server`, .NET 10 x64) never references `ArchestrA.MxAccess` |
| Wonderware Historian SDK | Same — only in `Driver.Galaxy.Host` |
| Galaxy DB queries | `GalaxyRepository` moves to `Driver.Galaxy.Host`; the SQL connection string lives in the Galaxy `DriverConfig` JSON |
| OPC UA address space build for Galaxy | Driven by `Driver.Galaxy.Proxy` calls into `IAddressSpaceBuilder` (Phase 1 API) — Proxy fetches the hierarchy via IPC, streams nodes to the builder |
| Subscriptions, reads, writes, alarms, history | All forwarded over named-pipe IPC via MessagePack contracts in `Driver.Galaxy.Shared` |
| Tier C cross-cutting protections | All wired up per `driver-stability.md` §"Cross-Cutting Protections" → "Isolated host only (Tier C)" + the Galaxy deep dive |
| Windows service installer | Two services per Galaxy-using cluster node: `OtOpcUa` (the main server) + `OtOpcUaGalaxyHost` (the Galaxy host). Installer scripts updated. |
| `appsettings.json` (legacy Galaxy config sections) | Migrated into the central config DB under `DriverInstance.DriverConfig` JSON for the Galaxy driver instance. Local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18) |

## Scope — What Does NOT Change

| Item | Reason |
|------|--------|
| OPC UA wire behavior visible to clients | Parity is the gate. Clients see the same browse paths, quality codes, alarm shapes, and history responses as v1 |
| Galaxy hierarchy mapping (gobject parents → OPC UA folders) | Galaxy uses the SystemPlatform-kind namespace; UNS rules don't apply (decision #108). `Tag.FolderPath` mirrors v1 LmxOpcUa exactly |
| Galaxy `EquipmentClassRef` integration | Galaxy is SystemPlatform-namespace; no `Equipment` rows are created for Galaxy tags. Equipment-namespace work is for the native-protocol drivers in Phase 3+ |
| Any non-Galaxy driver | Phase 3+ |
| `OtOpcUa.Server` lifecycle / configuration substrate / Admin UI | Built in Phase 1; Phase 2 only adds the Galaxy.Proxy as a `DriverInstance` |
| Wonderware Historian dependency | Stays optional, loaded only when `Historian.Enabled = true` in the Galaxy `DriverConfig` |

## Entry Gate Checklist

- [ ] Phase 1 exit gate cleared (Configuration + Admin + Server + Core.Abstractions all green; Galaxy still in-process via legacy Host)
- [ ] `v2` branch is clean
- [ ] Phase 1 PR merged
- [ ] Dev Galaxy reachable for parity testing — same Galaxy that v1 tests against
- [ ] v1 IntegrationTests baseline pass count + duration recorded (this is the parity bar)
- [ ] Client.CLI walkthrough script captured against v1 and saved as reference output
- [ ] All Phase 2-relevant docs reviewed: `plan.md` §3–4, §5a (LmxNodeManager reusability), `driver-stability.md` §"Out-of-Process Driver Pattern (Generalized)" + §"Galaxy — Deep Dive (Tier C)", `driver-specs.md` §1 (Galaxy)
- [ ] Decisions cited or implemented by Phase 2 read at least at skim level: #11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102 (the Tier C IPC ACL + recycle decisions are all relevant)
- [ ] Confirmation that the four 2026-04-13 stability findings (`c76ab8f`, `7310925`) have existing v1 tests that will be the regression net for the v2 split

**Evidence file**: `docs/v2/implementation/entry-gate-phase-2.md`.

## Task Breakdown

Five work streams (A–E). Stream A is the foundation; B and C run partly in parallel after A; D depends on B + C; E is the parity gate at the end.

### Stream A — Driver.Galaxy.Shared (1 week)

#### Task A.1 — Create the project

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` (.NET Standard 2.0 — must be consumable by both .NET 10 Proxy and .NET 4.8 Host per decision #25). Single dependency: `MessagePack` NuGet (decision #32).

#### Task A.2 — IPC message contracts

Define the MessagePack DTOs covering every Galaxy operation the Proxy will forward:
- **Lifecycle**: `OpenSessionRequest`, `OpenSessionResponse`, `CloseSessionRequest`, `Heartbeat` (separate channel per `driver-stability.md` §"Heartbeat between proxy and host")
- **Discovery**: `DiscoverGalaxyHierarchyRequest`, `GalaxyObjectInfo`, `GalaxyAttributeInfo` (these are not the v1 Domain types — they're the IPC-shape with MessagePack attributes; the Proxy maps to/from `DriverAttributeInfo` from `Core.Abstractions`)
- **Read / Write**: `ReadValuesRequest`, `ReadValuesResponse`, `WriteValuesRequest`, `WriteValuesResponse` (carries `DataValue` shape per decision #13: value + StatusCode + timestamps)
- **Subscriptions**: `SubscribeRequest`, `UnsubscribeRequest`, `OnDataChangeNotification` (server-pushed)
- **Alarms**: `AlarmSubscribeRequest`, `AlarmEvent`, `AlarmAcknowledgeRequest`
- **History**: `HistoryReadRequest`, `HistoryReadResponse`
- **Probe**: `HostConnectivityStatus`, `RuntimeStatusChangeNotification`
- **Recycle / control**: `RecycleHostRequest`, `RecycleStatusResponse`

Length-prefixed framing per decision #28; MessagePack body inside each frame.

**Acceptance**:
- All contracts compile against .NET Standard 2.0
- Unit test project asserts each contract round-trips through MessagePack serialize → deserialize byte-for-byte
- Reflection test asserts no contract references `System.Text.Json` or anything not in BCL/MessagePack

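
The length-prefixed framing in Task A.2 can be illustrated with a small language-neutral sketch in Python (the real contracts are C# MessagePack DTOs). The 4-byte little-endian prefix, the `MAX_FRAME` cap, and the function names are assumptions for illustration; the plan only states that frames are length-prefixed with a MessagePack body.

```python
import io
import struct

# Assumption: 4-byte little-endian unsigned length prefix.
_LEN = struct.Struct("<I")
# Hypothetical cap so a corrupt or hostile length cannot force a huge allocation.
MAX_FRAME = 16 * 1024 * 1024


def write_frame(stream, body: bytes) -> None:
    """Write one frame: length prefix followed by the serialized body."""
    if len(body) > MAX_FRAME:
        raise ValueError("frame too large")
    stream.write(_LEN.pack(len(body)))
    stream.write(body)


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes; pipes and sockets may return short reads."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("stream closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def read_frame(stream) -> bytes:
    """Read one frame and return its body (still serialized)."""
    (length,) = _LEN.unpack(read_exact(stream, _LEN.size))
    if length > MAX_FRAME:
        raise ValueError("declared frame length exceeds cap")
    return read_exact(stream, length)
```

The body stays opaque at this layer, so the framing code does not need to change when contracts are added.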
#### Task A.3 — Versioning + capability negotiation

Add a top-of-stream `Hello` message exchanged on connection: protocol version, supported features. Future-proofs for adding new operations without breaking older Hosts.

**Acceptance**:
- Proxy refuses to talk to a Host advertising a major version it doesn't understand; logs the mismatch
- Host refuses to accept a Proxy from an unknown major version

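
One way to read the `Hello` exchange in Task A.3 is sketched below in Python as a language-neutral illustration. The field names (`major`, `minor`, `features`), the `ProtocolMismatch` exception, and the rule that minor versions may differ are assumptions; the plan only requires that an unknown major version is refused and that supported features are advertised.

```python
from dataclasses import dataclass


class ProtocolMismatch(Exception):
    """Raised when the peer advertises a major version we do not speak."""


@dataclass(frozen=True)
class Hello:
    major: int            # breaking-change counter
    minor: int            # additive changes only (assumption)
    features: frozenset   # names of optional operations the side supports


def negotiate(local: Hello, remote: Hello) -> frozenset:
    """Refuse on major mismatch; otherwise return the shared feature set."""
    if remote.major != local.major:
        raise ProtocolMismatch(
            f"peer speaks major {remote.major}, local speaks {local.major}"
        )
    # Only operations both sides advertise may be used on this connection.
    return local.features & remote.features
```

Because both sides run the same check, a newer Proxy can talk to an older Host and simply avoid the operations the Host did not advertise.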
### Stream B — Driver.Galaxy.Host (3–4 weeks)

#### Task B.1 — Create the project + move Galaxy code

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` (.NET 4.8, **x86 platform** target — required for MXAccess COM per decision #23).

Move from legacy `OtOpcUa.Host`:
- `MxAccessBridge.cs` and supporting types
- `GalaxyRepository.cs` and SQL queries
- Alarm tracking infrastructure
- `GalaxyRuntimeProbeManager.cs`
- `MxDataTypeMapper.cs`, `SecurityClassificationMapper.cs`
- Historian plugin loader and `IHistorianDataSource` (only loaded when `Historian.Enabled = true`)
- Configuration types (`MxAccessConfiguration`, `GalaxyRepositoryConfiguration`, `HistorianConfiguration`, `GalaxyScope`) — these now read from the JSON `DriverConfig` rather than `appsettings.json`

`Driver.Galaxy.Host` does **not** reference `Core.Abstractions` (decision §5 dependency graph) — it's a closed unit, IPC-fronted.

**Acceptance**:
- Project builds against .NET 4.8 x86
- All moved files have their namespace updated to `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.*`
- v1 unit tests for these classes (still in `OtOpcUa.Host.Tests`) move to a new `OtOpcUa.Driver.Galaxy.Host.Tests` project and pass

#### Task B.2 — STA thread + Win32 message pump

Per `driver-stability.md` Galaxy deep dive:
- Single STA thread per Host process owns all `LMXProxyServer` instances
- Work item dispatch via `PostThreadMessage(WM_APP)`
- `WM_QUIT` shutdown only after all outstanding work items complete
- Pump health probe: no-op work item every 10s, timeout = wedged-pump signal that triggers recycle

This is essentially v1's `StaComThread`, lifted from the `LmxProxy.Host` reference implementation (per the CLAUDE.md "Reference Implementation" section).

**Acceptance**:
- Pump starts, dispatches work items, exits cleanly on `WM_QUIT`
- Pump-wedged simulation (work item that infinite-loops) triggers the 10s timeout and posts a recycle event
- COM call from non-STA thread fails fast with a recognizable error (regression net for cross-apartment bugs)

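
The pump health probe in Task B.2 follows a general pattern: a single owner thread drains a queue, and a watchdog posts a no-op and waits for it to run. The Python sketch below is only an analogy for that pattern. It uses a plain thread and `queue.Queue` rather than an STA thread and `PostThreadMessage`, and the class and method names are hypothetical.

```python
import queue
import threading


class Pump:
    """Single worker thread that owns all work items (analogy for the STA pump)."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            work = self._queue.get()
            if work is None:      # stand-in for WM_QUIT
                break
            work()

    def post(self, work) -> None:
        """Queue a callable to run on the pump thread."""
        self._queue.put(work)

    def probe(self, timeout_s: float) -> bool:
        """Post a no-op; False means the pump did not reach it in time."""
        done = threading.Event()
        self.post(done.set)
        return done.wait(timeout_s)

    def quit(self, timeout_s: float = 5.0) -> bool:
        """FIFO order means the quit marker runs only after earlier work."""
        self._queue.put(None)
        self._thread.join(timeout_s)
        return not self._thread.is_alive()
```

In the plan the probe interval and timeout are 10s; the sketch leaves the timeout as a parameter so a test can use a much shorter value.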
#### Task B.3 — `MxAccessHandle : SafeHandle` for COM lifetime

Wrap each `LMXProxyServer` connection in a `SafeHandle` subclass (decision #65 + Galaxy deep dive):
- `ReleaseHandle()` calls `Marshal.ReleaseComObject` until refcount = 0, then `UnregisterProxy`
- Subscription handles wrapped per item; `RemoveAdvise` → `RemoveItem` ordering enforced
- `CriticalFinalizerObject` for finalizer ordering during AppDomain unload
- Pre-shutdown drain: cancel all subscriptions cleanly via the STA pump, in order, before pump exit

**Acceptance**:
- Unit test asserts a leaked handle (no `Dispose`) is released by the finalizer
- Shutdown test asserts no orphan COM refs after Host exits cleanly
- Stress test: 1000 subscribe/unsubscribe cycles → handle table empty at the end

#### Task B.4 — Subscription registry + reconnect

Per `driver-stability.md` Galaxy deep dive §"Subscription State and Reconnect":
- In-memory registry of `(Item, AdviseId, OwningHost)` for every subscription
- Reconnect order: register proxy → re-add items → re-advise
- Cross-host quality clear gated on host-status check (closes 2026-04-13 finding)

**Acceptance**:
- Disconnect simulation: kill TCP to MXAccess; subscriptions go Bad; reconnect; subscriptions restore in correct order
- Multi-host test: stop AppEngine A while AppEngine B is running; verify A's subscriptions go Bad but B's stay Good (closes the cross-host quality clear regression)

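
The registry and reconnect order from Task B.4 can be sketched in a language-neutral way as follows. The `client` object and its `register_proxy` / `add_item` / `advise` methods are hypothetical stand-ins for the MXAccess calls, and re-adding all items before re-advising any of them is one reading of "re-add items → re-advise", not something the plan spells out.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    item: str
    owning_host: str
    advise_id: int = -1


class SubscriptionRegistry:
    """In-memory record of (Item, AdviseId, OwningHost) per subscription."""

    def __init__(self):
        self._subs = {}   # item name -> Subscription, insertion-ordered

    def add(self, item: str, owning_host: str, advise_id: int) -> None:
        self._subs[item] = Subscription(item, owning_host, advise_id)

    def remove(self, item: str) -> None:
        self._subs.pop(item, None)

    def items_for_host(self, host: str) -> list:
        """Items whose quality should change when this host stops."""
        return sorted(s.item for s in self._subs.values() if s.owning_host == host)

    def replay(self, client) -> None:
        """Reconnect order: register proxy, re-add items, then re-advise."""
        client.register_proxy()
        subs = list(self._subs.values())
        for sub in subs:
            client.add_item(sub.item)
        for sub in subs:
            sub.advise_id = client.advise(sub.item)   # ids change across reconnects
```

Scoping the Bad-quality fan-out through `items_for_host` is what keeps a stopped AppEngine A from touching AppEngine B's subscriptions.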
#### Task B.5 — Connection health probe (`GalaxyRuntimeProbeManager` rebuild)

Lift the existing `GalaxyRuntimeProbeManager` into the new project. Behaviors per `driver-stability.md`:
- Subscribe to per-host runtime-status synthetic attribute
- Bad-quality fan-out scoped to the host's subtree (not Galaxy-wide)
- Failed probe subscription does **not** leave a phantom entry that Tick() flips to Stopped (closes 2026-04-13 finding)

**Acceptance**:
- Probe failure simulation → no phantom entry; Tick() does not flip arbitrary subscriptions to Stopped (regression test for the finding)
- Probe transitions Stopped → Running → Stopped → Running over 5 minutes; quality fan-out happens correctly each transition

#### Task B.6 — Named-pipe IPC server with mandatory ACL

Per decision #76 + `driver-stability.md` §"IPC Security":
- Pipe ACL on creation: `ReadWrite | Synchronize` granted only to the OtOpcUa server's service principal SID; LocalSystem and Administrators **explicitly denied**
- Caller identity verification on each new connection: `GetImpersonationUserName()` cross-checked against configured server service SID; mismatches dropped before any RPC frame is read
- Per-process shared secret: passed by the supervisor at spawn time, required on first frame of every connection
- Heartbeat pipe: separate from data-plane pipe, same ACL

**Acceptance**:
- Unit test: pipe ACL enumeration shows only the configured SID + Synchronize/ReadWrite
- Integration test: connection from a non-server-SID local process is dropped with audit log entry
- Integration test: connection without correct shared secret on first frame is dropped
- Defense-in-depth test: even if ACL is misconfigured (manually overridden), shared-secret check catches the wrong client

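
The admission order in Task B.6 (identity check first, shared secret second) can be sketched as below. This is a language-neutral Python illustration: the function names, the 32-byte secret length, and the idea that the secret arrives as the raw first frame are assumptions, and the pipe ACL itself is an OS-level control that this sketch does not model.

```python
import hmac
import secrets


def new_spawn_secret() -> bytes:
    """Per-process secret the supervisor would hand to the host at spawn time."""
    return secrets.token_bytes(32)   # length is an assumption


def admit(caller_sid: str, allowed_sid: str, read_first_frame, expected_secret: bytes) -> bool:
    """Return True only if both the caller identity and the secret check out."""
    if caller_sid != allowed_sid:
        # Dropped before any frame is read from the connection.
        return False
    presented = read_first_frame()
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(presented, expected_secret)
```

Keeping the secret check independent of the identity check is what makes the defense-in-depth acceptance test meaningful: a misconfigured ACL still leaves the second gate in place.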
#### Task B.7 — Memory watchdog with Galaxy-specific thresholds

Per `driver-stability.md` Galaxy deep dive §"Memory Watchdog Thresholds":
- Sample RSS every 30s
- Warning: `1.5× baseline OR baseline + 200 MB` (whichever larger)
- Soft recycle: `2× baseline OR baseline + 200 MB` (whichever larger)
- Hard ceiling: 1.5 GB → force-kill
- Slope: > 5 MB/min sustained 30 min → soft recycle

**Acceptance**:
- Unit test against a mock RSS source: each threshold triggers the correct action
- Integration test with the FaultShim (Task B.10): leak simulation crosses the soft-recycle threshold and triggers soft recycle path

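
The Task B.7 thresholds reduce to a small amount of arithmetic, sketched here in Python as a language-neutral illustration. The action names, the use of `>=` at each boundary, and the reading of "sustained 30 min" as the average slope over a window of at least 30 minutes are assumptions.

```python
MB = 1024 * 1024
HARD_CEILING = 1536 * MB   # 1.5 GB


def classify(rss: int, baseline: int) -> str:
    """Map one RSS sample to an action using the thresholds as written."""
    warn_at = max(baseline * 3 // 2, baseline + 200 * MB)
    soft_at = max(baseline * 2, baseline + 200 * MB)
    if rss >= HARD_CEILING:
        return "hard_kill"
    if rss >= soft_at:
        return "soft_recycle"
    if rss >= warn_at:
        return "warn"
    return "ok"


def slope_exceeded(samples, limit_mb_per_min: float = 5.0, window_min: float = 30.0) -> bool:
    """samples: list of (seconds, rss_bytes), oldest first."""
    if len(samples) < 2:
        return False
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    minutes = (t1 - t0) / 60.0
    if minutes < window_min:
        return False          # not enough history to call it sustained
    return (r1 - r0) / MB / minutes > limit_mb_per_min
```

With the numbers as written, the warning and soft-recycle thresholds coincide at `baseline + 200 MB` whenever the baseline is 200 MB or less, so the warning band only exists for larger baselines.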
#### Task B.8 — Recycle policy with WM_QUIT escalation

Per `driver-stability.md` Galaxy deep dive §"Recycle Policy (COM-specific)":
- 15s grace for in-flight COM calls (longer than FOCAS because legitimate MXAccess bulk reads take seconds)
- Per-handle: `RemoveAdvise` → `RemoveItem` → `ReleaseComObject` → `UnregisterProxy`, on the STA thread
- `WM_QUIT` posted only after all of the above complete
- If STA pump doesn't exit within 5s of `WM_QUIT` → `Environment.Exit(2)` (hard exit)
- Soft recycle scheduled daily at 03:00 local; recycle frequency cap 1/hour

**Acceptance**:
- Soft recycle test: in-flight call returns within grace → clean exit (`Exit(0)`)
- Soft recycle test: in-flight call exceeds grace → hard exit (`Exit(2)`); supervisor records as unclean recycle
- Wedged-pump test: pump doesn't drain after `WM_QUIT` → `Exit(2)` within 5s
- Frequency cap test: trigger 2 soft recycles within an hour → second is blocked, alert raised

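
The escalation rules and the frequency cap in Task B.8 can be expressed as two small decisions, sketched here in Python as a language-neutral illustration. The function and class names are hypothetical, and treating the cap as a minimum gap of one hour between recycle starts is an assumed reading of "1/hour".

```python
GRACE_S = 15          # grace for in-flight COM calls
QUIT_TIMEOUT_S = 5    # time allowed for the pump to exit after WM_QUIT
MIN_INTERVAL_S = 3600 # frequency cap: at most one recycle per hour


def recycle_exit_code(drain_s: float, pump_exit_s: float) -> int:
    """0 = clean exit, 2 = hard exit, given how long each stage took."""
    if drain_s > GRACE_S:
        return 2      # in-flight call outlived the grace period
    if pump_exit_s > QUIT_TIMEOUT_S:
        return 2      # pump wedged after WM_QUIT
    return 0


class RecycleLimiter:
    """Blocks a soft recycle that starts too soon after the previous one."""

    def __init__(self):
        self._last_start = None

    def try_begin(self, now_s: float) -> bool:
        if self._last_start is not None and now_s - self._last_start < MIN_INTERVAL_S:
            return False   # caller is expected to raise the alert
        self._last_start = now_s
        return True
```

A blocked attempt does not update the last-start time, so repeated triggers inside the hour cannot push the next allowed recycle further out.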
#### Task B.9 — Post-mortem MMF writer

Per `driver-stability.md` Galaxy deep dive §"Post-Mortem Log Contents":
- Ring buffer of last 1000 IPC operations
- Plus Galaxy-specific snapshots: STA pump state (thread ID, last dispatched timestamp, queue depth), active subscription count by host, `MxAccessHandle` refcount snapshot, last 100 probe results, last redeploy event, Galaxy DB connection state, Historian connection state if HDA enabled
- Memory-mapped file at `%ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf`
- On graceful shutdown: flush ring + snapshots to a rotating log
- On hard crash: supervisor reads the MMF after the corpse is gone

**Acceptance**:
- Round-trip test: write 1000 operations → read back → assert order + content
- Hard-crash test: kill the process mid-operation → supervisor reads the MMF → ring tail shows the operation that was in flight

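
A fixed-slot ring inside a memory-mapped file is one simple layout that satisfies the "last 1000 IPC operations" requirement in Task B.9. The Python sketch below is a language-neutral illustration; the 64-byte slot size, the 8-byte write counter header, and the text record format are all assumptions, and the Galaxy-specific snapshots are left out.

```python
import mmap
import os
import struct
import tempfile

SLOTS = 1000
SLOT_SIZE = 64                 # hypothetical fixed record size in bytes
HEADER = struct.Struct("<Q")   # total number of records ever written


class PostMortemRing:
    """Writer side: overwrites the oldest slot once the ring is full."""

    def __init__(self, path: str):
        size = HEADER.size + SLOTS * SLOT_SIZE
        with open(path, "wb") as f:
            f.write(b"\0" * size)          # zero-filled backing file
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), size)

    def record(self, text: str) -> None:
        (count,) = HEADER.unpack_from(self._map, 0)
        data = text.encode("utf-8")[:SLOT_SIZE].ljust(SLOT_SIZE, b"\0")
        offset = HEADER.size + (count % SLOTS) * SLOT_SIZE
        self._map[offset:offset + SLOT_SIZE] = data
        HEADER.pack_into(self._map, 0, count + 1)   # publish after the slot is written

    def close(self) -> None:
        self._map.flush()
        self._map.close()
        self._file.close()


def read_ring(path: str) -> list:
    """Reader side (the supervisor): return surviving records, oldest first."""
    with open(path, "rb") as f:
        raw = f.read()
    (count,) = HEADER.unpack_from(raw, 0)
    kept = min(count, SLOTS)
    records = []
    for i in range(count - kept, count):
        offset = HEADER.size + (i % SLOTS) * SLOT_SIZE
        slot = raw[offset:offset + SLOT_SIZE].rstrip(b"\0")
        records.append(slot.decode("utf-8", "replace"))
    return records
```

The reader needs nothing from the writer process, which is the property the hard-crash acceptance test depends on.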
#### Task B.10 — Driver.Galaxy.FaultShim (test-only)

Per `driver-stability.md` §"Test Coverage for Galaxy Stability" — analogous to FOCAS FaultShim:
- Test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding
- Injects: COM exception at chosen call site, subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning unexpected refcount, STA pump deadlock simulation
- Production builds load the real `ArchestrA.MxAccess` from GAC

**Acceptance**:
- FaultShim binds successfully under test configuration
- Each fault scenario triggers the expected protection (memory watchdog → recycle, supervisor → respawn, etc.)

### Stream C — Driver.Galaxy.Proxy (1.5 weeks, can run in parallel with B after A is done)

#### Task C.1 — Create the project + capability interface implementation

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` (.NET 10). Dependencies: `Core.Abstractions` (Phase 1) + `Driver.Galaxy.Shared` (Stream A) + `MessagePack`.

Implement every interface listed in Phase Objective above. Each method:
- Marshals arguments into the matching IPC contract
- Sends over the data-plane pipe
- Awaits the response (with a Polly-enforced timeout per decision #34)
- Maps the response into the `Core.Abstractions` shape (`DataValue`, `DriverAttributeInfo`, etc.)
- Surfaces failures as the appropriate StatusCode

**Acceptance**:
- Each interface method has a unit test against a mock IPC channel: happy path + IPC timeout path + IPC error path
- `IRediscoverable` opt-in works: when Galaxy.Host signals a redeploy, Proxy invokes the Core's rediscovery flow (not full restart)

#### Task C.2 — Heartbeat sender + host liveness

Per `driver-stability.md` §"Heartbeat between proxy and host":
- 2s cadence (decision #72) on the dedicated heartbeat pipe
- 3 consecutive missed responses = host declared dead (6s detection)
- On host-dead: fan out Bad quality on all Galaxy-namespace nodes; ask supervisor to respawn

**Acceptance**:
- Heartbeat round-trip test against a mock host
- Missed-heartbeat test: stop the mock host's heartbeat responder → 3 misses → supervisor respawn requested
- GC pause test: simulate a 700ms GC pause on the proxy side → no false positive (single missed beat absorbed by 3-miss tolerance)

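
The liveness rule in Task C.2 is a consecutive-miss counter: any response resets it, and the third miss in a row declares the host dead. A language-neutral Python sketch follows; the class and method names are hypothetical, and the 2s cadence is left to whatever timer drives `on_tick`.

```python
class HeartbeatMonitor:
    """Declares the host dead after max_misses consecutive missed responses."""

    def __init__(self, max_misses: int = 3):
        self.max_misses = max_misses
        self.misses = 0
        self.dead = False

    def on_tick(self, got_response: bool) -> bool:
        """Call once per heartbeat interval; returns True once the host is dead."""
        if got_response:
            self.misses = 0          # a single late beat is absorbed here
        else:
            self.misses += 1
        if self.misses >= self.max_misses:
            self.dead = True         # stays set until the supervisor respawns the host
        return self.dead
```

At a 2s cadence, three consecutive misses give the 6s detection time stated above.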
#### Task C.3 — Supervisor with respawn-with-backoff + crash-loop circuit breaker

Per `driver-stability.md` §"Crash-loop circuit breaker" + Galaxy §"Recovery Sequence After Crash":
- Backoff: 5s → 15s → 60s (capped)
- Crash-loop: 3 crashes / 5 min → escalating cooldown (1h → 4h → 24h manual)
- Sticky alert that doesn't auto-clear when cooldown elapses
- On respawn after recycle: reuse cached `time_of_last_deploy` watermark to skip full DB rediscovery if unchanged

**Acceptance**:
- Respawn test: kill host process → supervisor respawns within 5s → host re-establishes
- Crash-loop test: force 3 crashes within 5 minutes → 4th respawn blocked, alert raised, manual reset clears alert
- Cooldown escalation test: trip → 1h auto-reset → re-trip within 10 min → 4h cooldown → re-trip → 24h manual

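
The backoff schedule and crash-loop breaker in Task C.3 can be sketched as a small state machine. This is a language-neutral Python illustration with hypothetical names. Two points are assumptions rather than statements from the plan: the third trip is modeled as "no auto-reset, manual reset only", and the escalation level is cleared only by a manual reset.

```python
from collections import deque

BACKOFF_S = (5, 15, 60)                   # respawn delays, capped at the last value
COOLDOWNS_S = (3600, 4 * 3600, None)      # None = manual reset required (assumption)


def backoff_delay(attempt: int) -> int:
    """Delay before respawn number `attempt` (0-based)."""
    return BACKOFF_S[min(attempt, len(BACKOFF_S) - 1)]


class CrashLoopBreaker:
    def __init__(self, max_crashes: int = 3, window_s: int = 300):
        self.max_crashes = max_crashes
        self.window_s = window_s
        self.trips = 0
        self.alert = False                # sticky: survives cooldown expiry
        self._crashes = deque()
        self._open = False
        self._open_until = None

    def record_crash(self, now_s: float) -> None:
        self._crashes.append(now_s)
        while now_s - self._crashes[0] > self.window_s:
            self._crashes.popleft()       # forget crashes outside the window
        if len(self._crashes) >= self.max_crashes:
            cooldown = COOLDOWNS_S[min(self.trips, len(COOLDOWNS_S) - 1)]
            self.trips += 1
            self.alert = True
            self._open = True
            self._open_until = None if cooldown is None else now_s + cooldown
            self._crashes.clear()

    def may_respawn(self, now_s: float) -> bool:
        if not self._open:
            return True
        if self._open_until is not None and now_s >= self._open_until:
            self._open = False            # auto-reset; the alert stays raised
            return True
        return False

    def manual_reset(self) -> None:
        self._open = False
        self._open_until = None
        self.alert = False
        self.trips = 0
```

Keeping `alert` separate from the open/closed state is what lets the alert stay visible after the cooldown has elapsed and respawns have resumed.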
#### Task C.4 — Address space build via `IAddressSpaceBuilder`

When the Proxy is asked to discover its tags, it issues `DiscoverGalaxyHierarchyRequest` to the Host, receives the gobject tree + attributes, and streams them to `IAddressSpaceBuilder` (Phase 1 API per decision #52). Galaxy uses the SystemPlatform-kind namespace; tags use `FolderPath` (v1-style) — no `Equipment` rows are created.

**Acceptance**:
- Build a Galaxy address space via the Proxy → byte-equivalent OPC UA browse output to v1
- Memory test: large Galaxy (4000+ attributes) → Proxy peak RAM stays under 200 MB during build

### Stream D — Retire legacy OtOpcUa.Host (1 week, depends on B + C)

#### Task D.1 — Delete legacy Host project

Once Galaxy.Host + Galaxy.Proxy are functional, the legacy `OtOpcUa.Host` project's responsibilities are split:
- Galaxy-specific code → `Driver.Galaxy.Host` (already moved in Stream B)
- TopShelf wrapper, `Program.cs`, generic OPC UA hosting → already replaced by `OtOpcUa.Server` in Phase 1
- Anything else (configuration types, generic helpers) → moved to `OtOpcUa.Server` or `OtOpcUa.Configuration` as appropriate

Delete the project from the solution. Update `.slnx` and any references.

**Acceptance**:
- `ls src/` shows `OtOpcUa.Host` is gone
- `dotnet build OtOpcUa.slnx` succeeds with `OtOpcUa.Host` no longer in the build graph
- All tests previously in `OtOpcUa.Host.Tests` are either moved to the appropriate new test project or deleted as obsolete

#### Task D.2 — Update Windows service installer scripts

Two services per cluster node when Galaxy is configured:
- `OtOpcUa` (the main `OtOpcUa.Server`) — already installable per Phase 1
- `OtOpcUaGalaxyHost` (the `Driver.Galaxy.Host`) — new service registration

Installer must:
- Install both services with the correct service-account SIDs (Galaxy.Host's pipe ACL must grant the OtOpcUa service principal)
- Set the supervisor's per-process secret in the registry or a protected file before first start
- Honor service dependency: Galaxy.Host should be configured to start before OtOpcUa, or OtOpcUa retries until Galaxy.Host is up

**Acceptance**:
- Install both services on a test box → both start successfully
- Uninstall both → no leftover registry / file system state
- Service-restart cycle: stop OtOpcUa.Server → Galaxy.Host stays up → start OtOpcUa.Server → reconnects to Galaxy.Host pipe

#### Task D.3 — Migrate Galaxy `appsettings.json` config to central config DB

Galaxy-specific config sections (`MxAccess`, `Galaxy`, `Historian`) move into the `DriverInstance.DriverConfig` JSON for the Galaxy driver instance in the Configuration DB. The local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18).

Migration script: for each existing v1 `appsettings.json`, generate the equivalent `DriverConfig` JSON and either insert via Admin UI or via a one-shot SQL script.

**Acceptance**:
- Migration script runs against a v1 dev `appsettings.json` → produces a JSON blob that loads into the Galaxy `DriverConfig` field
- The Galaxy driver instance starts with the migrated config and serves the same address space as v1

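
The core of the Task D.3 migration is splitting one JSON document into the Galaxy sections and everything else. A minimal Python sketch follows, assuming the `DriverConfig` blob simply carries the three sections under their original keys; the actual target shape is not specified here, and keys such as `ClientName` in the example are hypothetical.

```python
import json

# Section names taken from the task description above.
GALAXY_SECTIONS = ("MxAccess", "Galaxy", "Historian")


def split_appsettings(appsettings: dict):
    """Return (driver_config_json, remaining_local_settings)."""
    driver_config = {
        key: appsettings[key] for key in GALAXY_SECTIONS if key in appsettings
    }
    local = {
        key: value for key, value in appsettings.items()
        if key not in GALAXY_SECTIONS
    }
    # sort_keys gives a stable blob, which keeps migration diffs readable.
    return json.dumps(driver_config, sort_keys=True), local


if __name__ == "__main__":
    v1 = {
        "Cluster": {"NodeId": "node-a"},
        "MxAccess": {"ClientName": "LmxOpcUa"},   # hypothetical key
        "Historian": {"Enabled": False},
    }
    blob, local = split_appsettings(v1)
    print(blob)
    print(local)
```

Sections that are absent from a given v1 file are simply omitted from the blob rather than filled with defaults.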
### Stream E — Parity validation (1 week, gate)

#### Task E.1 — Run v1 IntegrationTests against v2 Galaxy topology

Per decision #56:
- The same v1 IntegrationTests suite runs against the v2 build with Galaxy.Proxy + Galaxy.Host instead of in-process Galaxy
- All tests must pass
- Pass count = v1 baseline; failure count = 0; skip count = v1 baseline
- Test duration may increase (IPC round-trip latency); document the deviation

**Acceptance**:
- Test report shows pass/fail/skip counts identical to v1 baseline
- Per-test duration regression report: any test that takes >2× v1 baseline is flagged for review (may be an IPC bottleneck)

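
The two Task E.1 checks (identical outcome counts, and a flag for tests slower than 2× baseline) can be sketched as follows. This is a language-neutral Python illustration; the input shape, a mapping from test name to `(outcome, seconds)`, is an assumption about how the test reports would be parsed.

```python
from collections import Counter


def parity_report(v1: dict, v2: dict, slow_factor: float = 2.0):
    """Each dict maps test name -> (outcome, seconds).

    outcome is 'pass', 'fail' or 'skip'. Returns (gate_ok, slow_tests).
    """
    counts_v1 = Counter(outcome for outcome, _ in v1.values())
    counts_v2 = Counter(outcome for outcome, _ in v2.values())
    slow_tests = sorted(
        name for name, (_, seconds) in v2.items()
        if name in v1 and seconds > slow_factor * v1[name][1]
    )
    # Slow tests are flagged for review only; they do not fail the gate.
    gate_ok = counts_v1 == counts_v2 and counts_v2["fail"] == 0
    return gate_ok, slow_tests
```

Returning the slow tests separately from the gate result mirrors the text above, where a duration increase is documented and reviewed rather than treated as a parity defect.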
#### Task E.2 — Scripted Client.CLI walkthrough parity

Per decision #56:
- Execute the captured Client.CLI script (recorded at Phase 2 entry gate against v1) against the v2 Galaxy topology
- Diff the output against v1 reference
- Differences allowed only in: timestamps, latency-measurement output. Any value, quality, browse path, or alarm shape difference = parity defect

**Acceptance**:
- Walkthrough completes without errors
- Output diff vs v1: only timestamp / latency lines differ

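
A diff that tolerates only timestamps and latency figures, as Task E.2 requires, can be built by masking those fields before comparing lines. The Python sketch below is an illustration under stated assumptions: the ISO-8601-style timestamp pattern and the `<number> ms` latency pattern are guesses at the Client.CLI output format and would need adjusting to the real output.

```python
import re
from itertools import zip_longest

# Assumed formats; the real Client.CLI output may differ.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
_LATENCY = re.compile(r"\b\d+(?:\.\d+)?\s*ms\b")


def normalize(line: str) -> str:
    """Replace the fields that are allowed to differ with fixed placeholders."""
    return _LATENCY.sub("<latency>", _TIMESTAMP.sub("<ts>", line))


def parity_defects(v1_lines, v2_lines):
    """Return (line_number, v1_line, v2_line) for every disallowed difference."""
    defects = []
    pairs = zip_longest(v1_lines, v2_lines)   # missing lines show up as None
    for number, (old, new) in enumerate(pairs, start=1):
        if old is None or new is None or normalize(old) != normalize(new):
            defects.append((number, old, new))
    return defects
```

An empty result means only timestamp or latency fields differed; anything else, including an extra or missing line, is reported as a defect.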
#### Task E.3 — Regression tests for the four 2026-04-13 stability findings

Per `driver-specs.md` Galaxy "Operational Stability Notes": each of the four findings closed in commits `c76ab8f` and `7310925` should have a regression test in the Phase 2 parity suite:
- Phantom probe subscription flipping Tick() to Stopped (covered by Task B.5)
- Cross-host quality clear wiping sibling state during recovery (covered by Task B.4)
- Sync-over-async on the OPC UA stack thread → guard against new instances in `GenericDriverNodeManager`
- Fire-and-forget alarm tasks racing shutdown → guard via the pre-shutdown drain ordering in Task B.3

**Acceptance**:
- Each of the four scenarios has a named test in the parity suite
- Each test fails on a hand-introduced regression (revert the v1 fix, see test fail)

#### Task E.4 — Adversarial review of the Phase 2 diff

Per `implementation/overview.md` exit gate:
- Run `/codex:adversarial-review --base v2` on the merged Phase 2 diff
- Findings closed or explicitly deferred with rationale and ticket link

## Compliance Checks (run at exit gate)

`phase-2-compliance.ps1`:

### Schema compliance
N/A for Phase 2 — no schema changes (Configuration DB schema is unchanged from Phase 1).

### Decision compliance
For each decision number Phase 2 implements (#11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102, plus the Galaxy-specific #62), verify at least one citation exists in source, tests, or migrations:

```powershell
$decisions = @(11, 24, 25, 28, 29, 32, 34, 44, 46, 47, 55, 56, 62, 63..69, 76, 102, 122, 123, 124)
foreach ($d in $decisions) {
    # -w matches on word boundaries, so "decision #11" does not also match "decision #110".
    $hits = git grep -w "decision #$d" -- 'src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/' 'tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/'
    if (-not $hits) { Write-Error "Decision #$d has no citation"; exit 1 }
}
```

### Visual compliance
N/A — no Admin UI changes in Phase 2 (Galaxy is just another `DriverInstance` in the Drivers tab).

### Behavioral compliance — parity smoke test
The parity suite (Stream E) is the smoke test:
1. v1 IntegrationTests pass count = baseline, fail count = 0
2. Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
3. Four regression tests for 2026-04-13 findings pass

### Stability compliance
|
||
For Phase 2 (introduces the first Tier C driver in production form):
|
||
- Galaxy.Host implements every Tier C cross-cutting protection from `driver-stability.md`:
|
||
- SafeHandle wrappers for COM (Task B.3) ✓
|
||
- Memory watchdog with Galaxy thresholds (Task B.7) ✓
|
||
- Bounded operation queues per device (already in Core, Phase 1) ✓
|
||
- Heartbeat between proxy and host on separate channel (Tasks A.2, B.6, C.2) ✓
|
||
- Scheduled recycling with `WM_QUIT` escalation to hard exit (Task B.8) ✓
|
||
- Crash-loop circuit breaker (Task C.3) ✓
|
||
- Post-mortem MMF readable after hard crash (Task B.9) ✓
|
||
- IPC ACL + caller SID verification + per-process shared secret (Task B.6) ✓
|
||
|
||
Each protection has at least one regression test. The compliance script enumerates and verifies presence:
|
||
|
||
```powershell
# Each Tier C protection is paired with the regression test that proves it.
$protections = @(
    @{Name="SafeHandle for COM"; Test="MxAccessHandleFinalizerReleasesCom"},
    @{Name="Memory watchdog"; Test="WatchdogTriggersRecycleAtThreshold"},
    @{Name="Heartbeat detection"; Test="ThreeMissedHeartbeatsDeclaresHostDead"},
    @{Name="WM_QUIT escalation"; Test="WedgedPumpEscalatesToHardExit"},
    @{Name="Crash-loop breaker"; Test="ThreeCrashesInFiveMinutesOpensCircuit"},
    @{Name="Post-mortem MMF"; Test="MmfSurvivesHardCrashAndIsReadable"},
    @{Name="Pipe ACL enforcement"; Test="NonServerSidConnectionRejected"},
    @{Name="Shared secret"; Test="ConnectionWithoutSecretRejected"}
)
foreach ($p in $protections) {
    # Caveat: depending on the test runner, a filter that matches no tests may still
    # exit 0, so a renamed or deleted test could pass this check vacuously.
    $hits = dotnet test --filter "FullyQualifiedName~$($p.Test)" --no-build --logger "console;verbosity=quiet"
    if ($LASTEXITCODE -ne 0) { Write-Error "Stability protection '$($p.Name)' has no passing test '$($p.Test)'"; exit 1 }
}
```

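As an illustration of what a test such as `ThreeMissedHeartbeatsDeclaresHostDead` asserts, the rule can be sketched as a pure function. This is a sketch under assumptions: `Test-HostDead` is a hypothetical name, the two-second interval is invented for the example, and only the threshold of three comes from the test name above.

```powershell
# Hypothetical rule: the host is declared dead once the time since the last
# heartbeat spans at least $MissedThreshold whole heartbeat intervals.
function Test-HostDead {
    param(
        [datetime]$LastHeartbeat,
        [datetime]$Now,
        [timespan]$Interval,        # assumed heartbeat period; not specified in this doc
        [int]$MissedThreshold = 3   # taken from the test name
    )
    $elapsed = ($Now - $LastHeartbeat).TotalSeconds
    $missed = [Math]::Floor($elapsed / $Interval.TotalSeconds)
    $missed -ge $MissedThreshold
}

$now = [datetime]'2026-04-13T12:00:00'
$interval = New-TimeSpan -Seconds 2   # illustrative value only

Test-HostDead -LastHeartbeat $now.AddSeconds(-5) -Now $now -Interval $interval   # -> False (2 missed)
Test-HostDead -LastHeartbeat $now.AddSeconds(-7) -Now $now -Interval $interval   # -> True  (3 missed)
```
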
### Documentation compliance

- Any deviation from the Galaxy deep dive in `driver-stability.md` is reflected back into that document; new decisions are added with `supersedes` notes if needed
- `driver-specs.md` §1 (Galaxy) is updated to reflect the actual implementation if the IPC contract or recycle behavior differs from the design doc

## Completion Checklist

### Stream A — Driver.Galaxy.Shared

- [ ] Project created (.NET Standard 2.0, MessagePack-only dependency)
- [ ] All IPC contracts defined and round-trip tested
- [ ] Hello-message version negotiation implemented
- [ ] Reflection test confirms no .NET 10-only types leaked in

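The Hello-message negotiation is described elsewhere in this doc as rejecting mismatched majors loudly. A minimal sketch of that rule follows; `Test-HelloCompatible` is a hypothetical name, the real Hello message shape is not shown here, and tolerating a differing minor version is an assumption.

```powershell
# Hypothetical check: peers are compatible only when their major versions match.
# Tolerating a differing minor version is an assumption, not a documented rule.
function Test-HelloCompatible {
    param(
        [version]$ProxyVersion,
        [version]$HostVersion
    )
    if ($ProxyVersion.Major -ne $HostVersion.Major) {
        throw "IPC contract mismatch: proxy $ProxyVersion vs host $HostVersion"
    }
    $true
}

Test-HelloCompatible -ProxyVersion '2.1' -HostVersion '2.3'   # same major: compatible

try {
    Test-HelloCompatible -ProxyVersion '2.1' -HostVersion '3.0'
}
catch {
    # A major mismatch fails loudly instead of degrading silently.
    $rejected = $_.Exception.Message
}
$rejected
```
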
### Stream B — Driver.Galaxy.Host

- [ ] Project created (.NET 4.8 x86)
- [ ] All Galaxy-specific code moved from legacy Host
- [ ] STA thread + Win32 pump implemented; pump health probe wired up
- [ ] `MxAccessHandle : SafeHandle` for COM lifetime
- [ ] Subscription registry + reconnect with cross-host quality scoping
- [ ] `GalaxyRuntimeProbeManager` rebuilt; phantom-probe regression test passes
- [ ] Named-pipe IPC server with mandatory ACL + caller SID verification + per-process secret
- [ ] Memory watchdog with Galaxy-specific thresholds
- [ ] Recycle policy with 15s grace + `WM_QUIT` escalation to hard exit
- [ ] Post-mortem MMF writer + supervisor reader
- [ ] FaultShim test-only assembly for fault injection

### Stream C — Driver.Galaxy.Proxy

- [ ] Project created (.NET 10, depends on Core.Abstractions + Galaxy.Shared)
- [ ] All capability interfaces implemented (IDriver, ITagDiscovery, IRediscoverable, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe)
- [ ] Heartbeat sender on dedicated channel; missed-heartbeat detection
- [ ] Supervisor with respawn-with-backoff + crash-loop circuit breaker (escalating cooldown 1h/4h/24h)
- [ ] Address space build via `IAddressSpaceBuilder` produces byte-equivalent v1 output

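The crash-loop breaker's two rules can be sketched as pure functions. The thresholds are taken from this doc (three crashes in five minutes per the test name `ThreeCrashesInFiveMinutesOpensCircuit`, cooldown escalating 1h/4h/24h); the function names are hypothetical, and staying at 24h after the third trip is an assumption.

```powershell
# Hypothetical sketch of the breaker rules; not the supervisor's actual code.
$cooldowns = @(
    (New-TimeSpan -Hours 1),
    (New-TimeSpan -Hours 4),
    (New-TimeSpan -Hours 24)
)

# True when at least three crashes fall inside the trailing five-minute window.
function Test-CrashLoop {
    param([datetime[]]$CrashTimes, [datetime]$Now)
    $windowStart = $Now.AddMinutes(-5)
    $recent = @($CrashTimes | Where-Object { $_ -gt $windowStart -and $_ -le $Now })
    $recent.Count -ge 3
}

# Cooldown for the Nth time the circuit opens (1-based).
# Assumption: trips beyond the third stay at the longest cooldown.
function Get-Cooldown {
    param([int]$TripCount)
    if ($TripCount -lt 1) { throw 'TripCount must be 1 or greater' }
    $index = [Math]::Min($TripCount, $cooldowns.Count) - 1
    $cooldowns[$index]
}

$now = [datetime]'2026-04-13T12:00:00'
$crashes = @($now.AddMinutes(-4), $now.AddMinutes(-2), $now.AddSeconds(-10))

Test-CrashLoop -CrashTimes $crashes -Now $now   # -> True
(Get-Cooldown -TripCount 2).TotalHours          # -> 4
```
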
### Stream D — Retire legacy OtOpcUa.Host

- [ ] Legacy `OtOpcUa.Host` project deleted from solution
- [ ] Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost)
- [ ] Galaxy `appsettings.json` config migrated into central DB `DriverConfig`
- [ ] Migration script tested against v1 dev config

### Stream E — Parity validation

- [ ] v1 IntegrationTests pass with count = baseline, failures = 0
- [ ] Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
- [ ] All four 2026-04-13 stability findings have passing regression tests
- [ ] Per-test duration regression report: no test >2× v1 baseline (or flagged for review)

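The per-test duration rule (no test slower than 2× its v1 baseline) amounts to a ratio check per test name. A minimal sketch, assuming durations are already available as name-to-milliseconds maps; `Get-DurationRegression` and the sample numbers are hypothetical:

```powershell
# Hypothetical report: emits one row per test whose current duration exceeds
# $MaxRatio times its v1 baseline.
function Get-DurationRegression {
    param(
        [hashtable]$BaselineMs,     # test name -> v1 duration in ms
        [hashtable]$CurrentMs,      # test name -> v2 duration in ms
        [double]$MaxRatio = 2.0
    )
    foreach ($name in ($CurrentMs.Keys | Sort-Object)) {
        # Skip tests with no usable baseline (new in v2, or a zero duration).
        if (-not $BaselineMs.ContainsKey($name) -or $BaselineMs[$name] -le 0) { continue }
        $ratio = $CurrentMs[$name] / $BaselineMs[$name]
        if ($ratio -gt $MaxRatio) {
            [pscustomobject]@{ Test = $name; Ratio = [Math]::Round($ratio, 2) }
        }
    }
}

# Made-up durations for illustration.
$baseline = @{ ReadTag = 40; WriteTag = 55; BrowseRoot = 120 }
$current  = @{ ReadTag = 100; WriteTag = 60; BrowseRoot = 240 }

$flagged = @(Get-DurationRegression -BaselineMs $baseline -CurrentMs $current)
$flagged | Format-Table
```

Here only `ReadTag` (2.5×) is flagged; `BrowseRoot` sits exactly at 2× and passes, since the rule is strictly "greater than".
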
### Cross-cutting

- [ ] `phase-2-compliance.ps1` runs and exits 0
- [ ] All 8 Tier C stability protections have named, passing tests
- [ ] Adversarial review of the phase diff; findings closed or deferred with rationale
- [ ] PR opened against `v2`, includes: link to this doc, link to exit-gate record, compliance script output, parity test report, adversarial review output
- [ ] Reviewer signoff (one reviewer beyond the implementation lead)
- [ ] `exit-gate-phase-2.md` recorded

## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| IPC round-trip latency makes parity tests fail on timing assumptions | High | Medium | Per-test duration regression report identifies hot tests; tune timeouts in test config rather than in production code |
| MessagePack contract drift between Proxy and Host during development | Medium | High | Hello-message version negotiation rejects mismatched majors loudly; CI builds both projects in the same job |
| STA pump health probe is itself flaky and triggers spurious recycles | Medium | High | Probe interval tunable; default 10s gives 1000ms+ slack on a healthy pump; monitor via post-mortem MMF for false positives |
| Pipe ACL misconfiguration on installer leaves the IPC accessible to local users | Low | Critical | Defense-in-depth shared secret catches the case; ACL enumeration test in installer integration test |
| Galaxy.Host process recycle thrash if Galaxy or DB is intermittently unavailable | Medium | Medium | Crash-loop circuit breaker with escalating cooldown caps the thrash; Polly retry on the data path inside Host (not via supervisor restart) handles transient errors |
| Migration of `appsettings.json` Galaxy config to DB blob breaks existing deployments | Medium | Medium | Migration script is idempotent and dry-run-able; deploy script asserts central DB has the migrated config before stopping legacy Host |
| Phase 2 takes longer than 8 weeks | High | Medium | Mid-gate review at 4 weeks: if Stream B isn't past Task B.6 (IPC + ACL), defer Task B.10 (FaultShim) to a Phase 2.5 follow-up |
| Wonderware Historian SDK incompatibility with .NET 4.8 x86 in the new project layout | Low | High | Move and validate Historian loader as part of Task B.1 for an early signal if the SDK has any project-shape sensitivity |
| Hard-exit on wedged pump leaks COM resources | Accepted | Low | Documented intent: hard exit is the only safe response; OS process exit reclaims handles, and OS-level COM cleanup is best-effort. The CNC equivalent in the FOCAS deep dive accepts the same trade-off |

## Out of Scope (do not do in Phase 2)

- Any non-Galaxy driver (Phase 3+)
- UNS / Equipment-namespace work for Galaxy (Galaxy is SystemPlatform-namespace; no Equipment rows for Galaxy tags per decision #108)
- Equipment-class template integration with the schemas repo (Galaxy doesn't use `EquipmentClassRef`)
- Push-from-DB notification (decision #96, v2.1)
- Any change to OPC UA wire behavior visible to clients (parity is the gate)
- Consumer cutover (ScadaBridge, Ignition, System Platform IO): out of v2 scope, separate integration-team track per `implementation/overview.md`
- Removing the v1 deployment from production (a v2 release decision, not Phase 2)