Confirm the v2 driver list as fixed (decision #128) and remove the Equipment Protocol Survey from the v2 prerequisites — the seven committed drivers (Modbus TCP including DL205, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client) plus Galaxy/MXAccess are confirmed by direct knowledge of the equipment estate (TwinCAT and AB Legacy specifically called out by the OtOpcUa team based on known Beckhoff installations and SLC/MicroLogix legacy equipment); the survey may still inform long-tail driver scoping and per-site capacity planning but adding/removing drivers from the v2 implementation list is now out of scope. Phase-1 implementation doc loses the survey row from its Out-of-Scope table.
Add Phase 2 detailed implementation plan (docs/v2/implementation/phase-2-galaxy-out-of-process.md) covering the largest refactor phase — moving Galaxy from the legacy in-process OtOpcUa.Host project into the Tier C out-of-process topology specified in driver-stability.md. Five work streams:
A. Driver.Galaxy.Shared (.NET Standard 2.0 IPC contracts using MessagePack with hello-message version negotiation).
B. Driver.Galaxy.Host (.NET 4.8 x86 separate Windows service that owns MxAccessBridge / GalaxyRepository / alarm tracking / GalaxyRuntimeProbeManager / Wonderware Historian SDK / STA thread + Win32 message pump with health probe / MxAccessHandle SafeHandle for COM lifetime / subscription registry with cross-host quality scoping / named-pipe IPC server with mandatory ACL + caller SID verification + per-process shared secret / memory watchdog with Galaxy-specific 1.5x baseline + 200MB floor + 1.5GB ceiling / recycle policy with 15s grace + WM_QUIT escalation to hard-exit / post-mortem MMF writer / Driver.Galaxy.FaultShim test-only assembly).
C. Driver.Galaxy.Proxy (.NET 10 in-process driver implementing every capability interface, heartbeat sender on dedicated channel with 2s/3-miss tolerance, supervisor with respawn-with-backoff and crash-loop circuit breaker with escalating cooldown 1h/4h/24h, address space build via IAddressSpaceBuilder producing byte-equivalent v1 output).
D. Retire legacy OtOpcUa.Host (delete from solution, two-service Windows installer, migrate appsettings.json Galaxy sections to central DB DriverConfig blob).
E. Parity validation (v1 IntegrationTests pass count = baseline and failures = 0, scripted Client.CLI walkthrough output diff vs v1 differs only in timestamps/latency, four named regression tests for the 2026-04-13 stability findings).
Compliance script verifies all eight Tier C cross-cutting protections have named passing tests.
Decision #128 captures the survey-removal; cross-references added to plan.md Reference Documents and overview.md phase index.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -173,7 +173,7 @@ The implementation **deviates from the plan** when any of those conditions fails
 |-------|-----|--------|
 | 0 | [`phase-0-rename-and-net10.md`](phase-0-rename-and-net10.md) | DRAFT |
 | 1 | [`phase-1-configuration-and-admin-scaffold.md`](phase-1-configuration-and-admin-scaffold.md) | DRAFT |
-| 2 | (Phase 2: Galaxy parity refactor — TBD) | NOT STARTED |
+| 2 | [`phase-2-galaxy-out-of-process.md`](phase-2-galaxy-out-of-process.md) | DRAFT |
 | 3 | (Phase 3: Modbus TCP driver — TBD) | NOT STARTED |
 | 4 | (Phase 4: PLC drivers AB CIP / AB Legacy / S7 / TwinCAT — TBD) | NOT STARTED |
 | 5 | (Phase 5: Specialty drivers FOCAS / OPC UA Client — TBD) | NOT STARTED |
@@ -41,7 +41,6 @@ Stand up the **central configuration substrate** for the v2 fleet:
 | Equipment-class template integration with future schemas repo | `EquipmentClassRef` is a nullable hook column; no validation yet (decisions #112, #115) |
 | Per-driver custom config editors in Admin | Generic JSON editor only in v2.0 (decision #27); driver-specific editors land in their respective phases |
 | Consumer cutover (ScadaBridge / Ignition / SystemPlatform IO) | Phases 6–8 |
-| Equipment Protocol Survey | External prerequisite — ideally runs in parallel with Phase 1 (handoff §"Equipment Protocol Survey") |

 ## Entry Gate Checklist

505
docs/v2/implementation/phase-2-galaxy-out-of-process.md
Normal file
@@ -0,0 +1,505 @@
# Phase 2 — Galaxy Out-of-Process Refactor (Tier C)

> **Status**: DRAFT — implementation plan for Phase 2 of the v2 build (`plan.md` §6, `driver-stability.md` §"Galaxy — Deep Dive").
>
> **Branch**: `v2/phase-2-galaxy`
> **Estimated duration**: 6–8 weeks (largest refactor phase; Tier C protections + IPC are the bulk)
> **Predecessor**: Phase 1 (`phase-1-configuration-and-admin-scaffold.md`)
> **Successor**: Phase 3 (Modbus TCP driver)

## Phase Objective

Move Galaxy / MXAccess from the legacy in-process `OtOpcUa.Host` project into the **Tier C out-of-process** topology specified in `driver-stability.md`:

1. **`Driver.Galaxy.Shared`** — .NET Standard 2.0 IPC message contracts (MessagePack DTOs)
2. **`Driver.Galaxy.Host`** — .NET 4.8 x86 separate Windows Service that owns `MxAccessBridge`, `GalaxyRepository`, alarm tracking, `GalaxyRuntimeProbeManager`, the Wonderware Historian SDK, the STA thread + Win32 message pump, and all Tier C cross-cutting protections (memory watchdog, scheduled recycle, post-mortem MMF, IPC ACL + caller SID verification, per-process shared secret)
3. **`Driver.Galaxy.Proxy`** — .NET 10 in-process driver implementing every capability interface (`IDriver`, `ITagDiscovery`, `IRediscoverable`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`), forwarding each call over named-pipe IPC and owning the supervisor (heartbeat, host liveness, respawn with backoff, crash-loop circuit breaker, fan-out of Bad quality on host death)
4. **Retire the legacy `OtOpcUa.Host` project** — its responsibilities now live in `OtOpcUa.Server` (built in Phase 1) for OPC UA hosting and `OtOpcUa.Driver.Galaxy.Host` for Galaxy-specific runtime

**Parity, not regression.** The phase exit gate is: the v1 `IntegrationTests` suite passes byte-for-byte against the v2 Galaxy.Proxy + Galaxy.Host topology, and a scripted Client.CLI walkthrough produces equivalent output to v1 (decision #56). Anything different — quality codes, browse paths, alarm shapes, history responses — is a parity defect.

This phase also closes the four 2026-04-13 stability findings (commits `c76ab8f` and `7310925`) by adding regression tests to the parity suite per `driver-specs.md` Galaxy "Operational Stability Notes".

## Scope — What Changes

| Concern | Change |
|---------|--------|
| Project layout | 3 new projects: `Driver.Galaxy.Shared` (.NET Standard 2.0), `Driver.Galaxy.Host` (.NET 4.8 x86), `Driver.Galaxy.Proxy` (.NET 10) |
| `OtOpcUa.Host` (legacy in-process) | **Retired**. Galaxy-specific code moves to `Driver.Galaxy.Host`; the small remainder (TopShelf wrapper, `Program.cs`) was already replaced by `OtOpcUa.Server` in Phase 1 |
| MXAccess COM access | Now lives only in `Driver.Galaxy.Host` (.NET 4.8 x86, STA thread + Win32 message pump). Main server (`OtOpcUa.Server`, .NET 10 x64) never references `ArchestrA.MxAccess` |
| Wonderware Historian SDK | Same — only in `Driver.Galaxy.Host` |
| Galaxy DB queries | `GalaxyRepository` moves to `Driver.Galaxy.Host`; the SQL connection string lives in the Galaxy `DriverConfig` JSON |
| OPC UA address space build for Galaxy | Driven by `Driver.Galaxy.Proxy` calls into `IAddressSpaceBuilder` (Phase 1 API) — Proxy fetches the hierarchy via IPC, streams nodes to the builder |
| Subscriptions, reads, writes, alarms, history | All forwarded over named-pipe IPC via MessagePack contracts in `Driver.Galaxy.Shared` |
| Tier C cross-cutting protections | All wired up per `driver-stability.md` §"Cross-Cutting Protections" → "Isolated host only (Tier C)" + the Galaxy deep dive |
| Windows service installer | Two services per Galaxy-using cluster node: `OtOpcUa` (the main server) + `OtOpcUaGalaxyHost` (the Galaxy host). Installer scripts updated. |
| `appsettings.json` (legacy Galaxy config sections) | Migrated into the central config DB under `DriverInstance.DriverConfig` JSON for the Galaxy driver instance. Local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18) |

## Scope — What Does NOT Change

| Item | Reason |
|------|--------|
| OPC UA wire behavior visible to clients | Parity is the gate. Clients see the same browse paths, quality codes, alarm shapes, and history responses as v1 |
| Galaxy hierarchy mapping (gobject parents → OPC UA folders) | Galaxy uses the SystemPlatform-kind namespace; UNS rules don't apply (decision #108). `Tag.FolderPath` mirrors v1 LmxOpcUa exactly |
| Galaxy `EquipmentClassRef` integration | Galaxy is SystemPlatform-namespace; no `Equipment` rows are created for Galaxy tags. Equipment-namespace work is for the native-protocol drivers in Phase 3+ |
| Any non-Galaxy driver | Phase 3+ |
| `OtOpcUa.Server` lifecycle / configuration substrate / Admin UI | Built in Phase 1; Phase 2 only adds the Galaxy.Proxy as a `DriverInstance` |
| Wonderware Historian dependency | Stays optional, loaded only when `Historian.Enabled = true` in the Galaxy `DriverConfig` |

## Entry Gate Checklist

- [ ] Phase 1 exit gate cleared (Configuration + Admin + Server + Core.Abstractions all green; Galaxy still in-process via legacy Host)
- [ ] `v2` branch is clean
- [ ] Phase 1 PR merged
- [ ] Dev Galaxy reachable for parity testing — same Galaxy that v1 tests against
- [ ] v1 IntegrationTests baseline pass count + duration recorded (this is the parity bar)
- [ ] Client.CLI walkthrough script captured against v1 and saved as reference output
- [ ] All Phase 2-relevant docs reviewed: `plan.md` §3–4, §5a (LmxNodeManager reusability), `driver-stability.md` §"Out-of-Process Driver Pattern (Generalized)" + §"Galaxy — Deep Dive (Tier C)", `driver-specs.md` §1 (Galaxy)
- [ ] Decisions cited or implemented by Phase 2 read at least at skim level: #11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102 (the Tier C IPC ACL + recycle decisions are all relevant)
- [ ] Confirmation that the four 2026-04-13 stability findings (`c76ab8f`, `7310925`) have existing v1 tests that will be the regression net for the v2 split

**Evidence file**: `docs/v2/implementation/entry-gate-phase-2.md`.

## Task Breakdown

Five work streams (A–E). Stream A is the foundation; B and C run partly in parallel after A; D depends on B + C; E is the parity gate at the end.

### Stream A — Driver.Galaxy.Shared (1 week)

#### Task A.1 — Create the project

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` (.NET Standard 2.0 — must be consumable by both .NET 10 Proxy and .NET 4.8 Host per decision #25). Single dependency: `MessagePack` NuGet (decision #32).

#### Task A.2 — IPC message contracts

Define the MessagePack DTOs covering every Galaxy operation the Proxy will forward:
- **Lifecycle**: `OpenSessionRequest`, `OpenSessionResponse`, `CloseSessionRequest`, `Heartbeat` (separate channel per `driver-stability.md` §"Heartbeat between proxy and host")
- **Discovery**: `DiscoverGalaxyHierarchyRequest`, `GalaxyObjectInfo`, `GalaxyAttributeInfo` (these are not the v1 Domain types — they're the IPC-shape with MessagePack attributes; the Proxy maps to/from `DriverAttributeInfo` from `Core.Abstractions`)
- **Read / Write**: `ReadValuesRequest`, `ReadValuesResponse`, `WriteValuesRequest`, `WriteValuesResponse` (carries `DataValue` shape per decision #13: value + StatusCode + timestamps)
- **Subscriptions**: `SubscribeRequest`, `UnsubscribeRequest`, `OnDataChangeNotification` (server-pushed)
- **Alarms**: `AlarmSubscribeRequest`, `AlarmEvent`, `AlarmAcknowledgeRequest`
- **History**: `HistoryReadRequest`, `HistoryReadResponse`
- **Probe**: `HostConnectivityStatus`, `RuntimeStatusChangeNotification`
- **Recycle / control**: `RecycleHostRequest`, `RecycleStatusResponse`

Length-prefixed framing per decision #28; MessagePack body inside each frame.

**Acceptance**:
- All contracts compile against .NET Standard 2.0
- Unit test project asserts each contract round-trips through MessagePack serialize → deserialize byte-for-byte
- Reflection test asserts no contract references `System.Text.Json` or anything not in BCL/MessagePack
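
The length-prefixed framing named in Task A.2 can be sketched as follows, in Python for brevity rather than the .NET the plan targets. The 4-byte big-endian prefix, the `MAX_FRAME_BYTES` cap, and the function names are assumptions for illustration; decision #28 as quoted here only says "length-prefixed". The body is treated as opaque bytes where the real design carries a MessagePack payload.

```python
import io
import struct

# Assumption: 4-byte big-endian unsigned length prefix. The plan does not
# state the prefix width or byte order.
_PREFIX = struct.Struct(">I")
MAX_FRAME_BYTES = 16 * 1024 * 1024  # hypothetical sanity cap, not from the plan


def write_frame(stream, body: bytes) -> None:
    """Write one frame: length prefix followed by the opaque body bytes."""
    if len(body) > MAX_FRAME_BYTES:
        raise ValueError("frame body too large")
    stream.write(_PREFIX.pack(len(body)))
    stream.write(body)


def _read_exact(stream, count: int) -> bytes:
    """Read exactly `count` bytes, raising EOFError on a short read."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream closed mid-frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(stream) -> bytes:
    """Read one frame and return its body."""
    (length,) = _PREFIX.unpack(_read_exact(stream, _PREFIX.size))
    if length > MAX_FRAME_BYTES:
        raise ValueError("declared frame length exceeds cap")
    return _read_exact(stream, length)
```

Checking the declared length before allocating keeps a corrupt or hostile prefix from forcing a huge read, which matters on a pipe that also carries the shared-secret handshake.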

#### Task A.3 — Versioning + capability negotiation

Add a top-of-stream `Hello` message exchanged on connection: protocol version, supported features. Future-proofs for adding new operations without breaking older Hosts.

**Acceptance**:
- Proxy refuses to talk to a Host advertising a major version it doesn't understand; logs the mismatch
- Host refuses to accept a Proxy from an unknown major version
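
One way the `Hello` exchange in Task A.3 could behave is sketched below in Python. The field names, the rule that major versions must be equal, and the use of a feature-set intersection are all assumptions; the plan only states that the message carries a protocol version and supported features and that unknown major versions are refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hello:
    """Hypothetical shape of the Hello message; field names are assumptions."""
    major: int
    minor: int
    features: frozenset


def negotiate(local: Hello, remote: Hello) -> frozenset:
    """Return the shared feature set, or raise if the major versions differ."""
    if local.major != remote.major:
        # Assumption: each side understands exactly one major version.
        raise ConnectionRefusedError(
            f"protocol major mismatch: local={local.major} remote={remote.major}"
        )
    # Minor versions may differ; only features both sides advertise are used.
    return local.features & remote.features
```

Because both sides run the same check, the Proxy-refuses and Host-refuses acceptance items reduce to one function exercised from each end.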

### Stream B — Driver.Galaxy.Host (3–4 weeks)

#### Task B.1 — Create the project + move Galaxy code

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` (.NET 4.8, **x86 platform** target — required for MXAccess COM per decision #23).

Move from legacy `OtOpcUa.Host`:
- `MxAccessBridge.cs` and supporting types
- `GalaxyRepository.cs` and SQL queries
- Alarm tracking infrastructure
- `GalaxyRuntimeProbeManager.cs`
- `MxDataTypeMapper.cs`, `SecurityClassificationMapper.cs`
- Historian plugin loader and `IHistorianDataSource` (only loaded when `Historian.Enabled = true`)
- Configuration types (`MxAccessConfiguration`, `GalaxyRepositoryConfiguration`, `HistorianConfiguration`, `GalaxyScope`) — these now read from the JSON `DriverConfig` rather than `appsettings.json`

`Driver.Galaxy.Host` does **not** reference `Core.Abstractions` (decision §5 dependency graph) — it's a closed unit, IPC-fronted.

**Acceptance**:
- Project builds against .NET 4.8 x86
- All moved files have their namespace updated to `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.*`
- v1 unit tests for these classes (still in `OtOpcUa.Host.Tests`) move to a new `OtOpcUa.Driver.Galaxy.Host.Tests` project and pass

#### Task B.2 — STA thread + Win32 message pump

Per `driver-stability.md` Galaxy deep dive:
- Single STA thread per Host process owns all `LMXProxyServer` instances
- Work item dispatch via `PostThreadMessage(WM_APP)`
- `WM_QUIT` shutdown only after all outstanding work items complete
- Pump health probe: no-op work item every 10s, timeout = wedged-pump signal that triggers recycle

This is essentially v1's `StaComThread`, lifted from the `LmxProxy.Host` reference implementation (per CLAUDE.md "Reference Implementation" section).

**Acceptance**:
- Pump starts, dispatches work items, exits cleanly on `WM_QUIT`
- Pump-wedged simulation (work item that infinite-loops) triggers the 10s timeout and posts a recycle event
- COM call from non-STA thread fails fast with a recognizable error (regression net for cross-apartment bugs)

#### Task B.3 — `MxAccessHandle : SafeHandle` for COM lifetime

Wrap each `LMXProxyServer` connection in a `SafeHandle` subclass (decision #65 + Galaxy deep dive):
- `ReleaseHandle()` calls `Marshal.ReleaseComObject` until refcount = 0, then `UnregisterProxy`
- Subscription handles wrapped per item; `RemoveAdvise` → `RemoveItem` ordering enforced
- `CriticalFinalizerObject` for finalizer ordering during AppDomain unload
- Pre-shutdown drain: cancel all subscriptions cleanly via the STA pump, in order, before pump exit

**Acceptance**:
- Unit test asserts a leaked handle (no `Dispose`) is released by the finalizer
- Shutdown test asserts no orphan COM refs after Host exits cleanly
- Stress test: 1000 subscribe/unsubscribe cycles → handle table empty at the end

#### Task B.4 — Subscription registry + reconnect

Per `driver-stability.md` Galaxy deep dive §"Subscription State and Reconnect":
- In-memory registry of `(Item, AdviseId, OwningHost)` for every subscription
- Reconnect order: register proxy → re-add items → re-advise
- Cross-host quality clear gated on host-status check (closes 2026-04-13 finding)

**Acceptance**:
- Disconnect simulation: kill TCP to MXAccess; subscriptions go Bad; reconnect; subscriptions restore in correct order
- Multi-host test: stop AppEngine A while AppEngine B is running; verify A's subscriptions go Bad but B's stay Good (closes the cross-host quality clear regression)
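
The host-scoped quality behavior in Task B.4 can be illustrated with a small in-memory registry, sketched in Python. The class and method names are hypothetical, and quality is reduced to the strings `"Good"` and `"Bad"`; the point is only that a status change for one host touches that host's subscriptions and never a sibling's.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    """One registry row: (Item, AdviseId, OwningHost) plus current quality."""
    item: str
    advise_id: int
    owning_host: str
    quality: str = "Good"


class SubscriptionRegistry:
    """Hypothetical registry keyed by item name."""

    def __init__(self):
        self._subs = {}

    def add(self, item: str, advise_id: int, owning_host: str) -> None:
        self._subs[item] = Subscription(item, advise_id, owning_host)

    def set_host_running(self, host: str, running: bool) -> None:
        """Update quality only for subscriptions owned by `host`."""
        for sub in self._subs.values():
            if sub.owning_host == host:
                sub.quality = "Good" if running else "Bad"

    def quality_of(self, item: str) -> str:
        return self._subs[item].quality
```

In this sketch a recovery of AppEngine A cannot clear Bad quality on AppEngine B's items, which is the shape of the cross-host regression the multi-host acceptance test guards against.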

#### Task B.5 — Connection health probe (`GalaxyRuntimeProbeManager` rebuild)

Lift the existing `GalaxyRuntimeProbeManager` into the new project. Behaviors per `driver-stability.md`:
- Subscribe to per-host runtime-status synthetic attribute
- Bad-quality fan-out scoped to the host's subtree (not Galaxy-wide)
- Failed probe subscription does **not** leave a phantom entry that Tick() flips to Stopped (closes 2026-04-13 finding)

**Acceptance**:
- Probe failure simulation → no phantom entry; Tick() does not flip arbitrary subscriptions to Stopped (regression test for the finding)
- Probe transitions Stopped → Running → Stopped → Running over 5 minutes; quality fan-out happens correctly each transition

#### Task B.6 — Named-pipe IPC server with mandatory ACL

Per decision #76 + `driver-stability.md` §"IPC Security":
- Pipe ACL on creation: `ReadWrite | Synchronize` granted only to the OtOpcUa server's service principal SID; LocalSystem and Administrators **explicitly denied**
- Caller identity verification on each new connection: `GetImpersonationUserName()` cross-checked against configured server service SID; mismatches dropped before any RPC frame is read
- Per-process shared secret: passed by the supervisor at spawn time, required on first frame of every connection
- Heartbeat pipe: separate from data-plane pipe, same ACL

**Acceptance**:
- Unit test: pipe ACL enumeration shows only the configured SID + Synchronize/ReadWrite
- Integration test: connection from a non-server-SID local process is dropped with audit log entry
- Integration test: connection without correct shared secret on first frame is dropped
- Defense-in-depth test: even if ACL is misconfigured (manually overridden), shared-secret check catches the wrong client
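
The per-process shared-secret check in Task B.6 is independent of the pipe ACL, so it can be sketched on its own. The Python below assumes a 32-byte random secret and assumes the first frame body is the raw secret; neither detail is specified in the plan. It uses only standard-library calls: `secrets.token_bytes` to generate the secret and `hmac.compare_digest` for a constant-time comparison.

```python
import hmac
import secrets

SECRET_BYTES = 32  # assumption: the plan does not state a secret length


def new_shared_secret() -> bytes:
    """Generate the secret the supervisor would pass to the Host at spawn time."""
    return secrets.token_bytes(SECRET_BYTES)


def first_frame_is_authorized(first_frame: bytes, expected_secret: bytes) -> bool:
    """Compare the first frame against the secret without leaking timing."""
    return hmac.compare_digest(first_frame, expected_secret)
```

A connection whose first frame fails this check would be dropped before any RPC frame is processed, which is what lets the check still catch a wrong client when the ACL has been misconfigured.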

#### Task B.7 — Memory watchdog with Galaxy-specific thresholds

Per `driver-stability.md` Galaxy deep dive §"Memory Watchdog Thresholds":
- Sample RSS every 30s
- Warning: `1.5× baseline OR baseline + 200 MB` (whichever larger)
- Soft recycle: `2× baseline OR baseline + 200 MB` (whichever larger)
- Hard ceiling: 1.5 GB → force-kill
- Slope: > 5 MB/min sustained 30 min → soft recycle

**Acceptance**:
- Unit test against a mock RSS source: each threshold triggers the correct action
- Integration test with the FaultShim (Stream B.10): leak simulation crosses the soft-recycle threshold and triggers soft recycle path
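
The absolute thresholds in Task B.7 reduce to a few lines of arithmetic, sketched here in Python. The function names and action strings are hypothetical, 1.5 GB is assumed to mean 1536 MiB, and the slope rule (> 5 MB/min sustained 30 min) is left out because it needs a sample history rather than a single reading.

```python
MB = 1024 * 1024
HARD_CEILING = 1536 * MB  # assumption: "1.5 GB" read as 1536 MiB


def thresholds(baseline: int) -> dict:
    """Compute the warning and soft-recycle thresholds for a baseline RSS in bytes."""
    floor = baseline + 200 * MB
    return {
        "warning": max(baseline * 3 // 2, floor),  # 1.5x baseline or +200 MB
        "soft_recycle": max(baseline * 2, floor),  # 2x baseline or +200 MB
        "force_kill": HARD_CEILING,
    }


def classify(rss: int, baseline: int) -> str:
    """Return the most severe action whose threshold the sample has reached."""
    limits = thresholds(baseline)
    for action in ("force_kill", "soft_recycle", "warning"):
        if rss >= limits[action]:
            return action
    return "ok"
```

For a 300 MB baseline this gives a warning at 500 MB (the +200 MB floor beats 450 MB) and a soft recycle at 600 MB. For baselines of 200 MB or less both rules resolve to baseline + 200 MB, so the warning and soft-recycle thresholds coincide and the sketch reports the soft recycle.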

#### Task B.8 — Recycle policy with WM_QUIT escalation

Per `driver-stability.md` Galaxy deep dive §"Recycle Policy (COM-specific)":
- 15s grace for in-flight COM calls (longer than FOCAS because legitimate MXAccess bulk reads take seconds)
- Per-handle: `RemoveAdvise` → `RemoveItem` → `ReleaseComObject` → `UnregisterProxy`, on the STA thread
- `WM_QUIT` posted only after all of the above complete
- If STA pump doesn't exit within 5s of `WM_QUIT` → `Environment.Exit(2)` (hard exit)
- Soft recycle scheduled daily at 03:00 local; recycle frequency cap 1/hour

**Acceptance**:
- Soft recycle test: in-flight call returns within grace → clean exit (`Exit(0)`)
- Soft recycle test: in-flight call exceeds grace → hard exit (`Exit(2)`); supervisor records as unclean recycle
- Wedged-pump test: pump doesn't drain after `WM_QUIT` → `Exit(2)` within 5s
- Frequency cap test: trigger 2 soft recycles within an hour → second is blocked, alert raised
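
The exit-code and frequency-cap rules in Task B.8 can be written as a small decision function plus a limiter, sketched in Python. The names are hypothetical and time is passed in as plain seconds so the logic can be tested without a clock; the 15s grace, 5s pump-exit limit, and 1/hour cap come from the list above.

```python
GRACE_SECONDS = 15.0            # grace for in-flight COM calls
PUMP_EXIT_SECONDS = 5.0         # time allowed for the STA pump to exit after WM_QUIT
MIN_RECYCLE_INTERVAL = 3600.0   # recycle frequency cap: one per hour


def recycle_exit_code(inflight_drain_seconds: float, pump_exit_seconds: float) -> int:
    """Return 0 for a clean recycle, 2 when either limit is exceeded (hard exit)."""
    if inflight_drain_seconds > GRACE_SECONDS:
        return 2
    if pump_exit_seconds > PUMP_EXIT_SECONDS:
        return 2
    return 0


class RecycleLimiter:
    """Allows at most one soft recycle per MIN_RECYCLE_INTERVAL."""

    def __init__(self):
        self._last_recycle = None

    def try_begin(self, now: float) -> bool:
        """Return True and record the recycle if the cap allows it."""
        if self._last_recycle is not None and now - self._last_recycle < MIN_RECYCLE_INTERVAL:
            return False  # blocked; the caller would raise the alert
        self._last_recycle = now
        return True
```

Keeping the decision separate from the COM teardown means the four acceptance cases can be unit-tested without an STA thread.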

#### Task B.9 — Post-mortem MMF writer

Per `driver-stability.md` Galaxy deep dive §"Post-Mortem Log Contents":
- Ring buffer of last 1000 IPC operations
- Plus Galaxy-specific snapshots: STA pump state (thread ID, last dispatched timestamp, queue depth), active subscription count by host, `MxAccessHandle` refcount snapshot, last 100 probe results, last redeploy event, Galaxy DB connection state, Historian connection state if HDA enabled
- Memory-mapped file at `%ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf`
- On graceful shutdown: flush ring + snapshots to a rotating log
- On hard crash: supervisor reads the MMF after the corpse is gone

**Acceptance**:
- Round-trip test: write 1000 operations → read back → assert order + content
- Hard-crash test: kill the process mid-operation → supervisor reads the MMF → ring tail shows the operation that was in flight
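
The eviction behavior of the "last 1000 IPC operations" ring in Task B.9 is sketched below in Python using an in-memory `collections.deque`. This shows only the ordering and capacity semantics; the real writer persists into the memory-mapped file so the supervisor can read it after a crash, and the class name here is hypothetical.

```python
from collections import deque


class OperationRing:
    """Keeps the most recent `capacity` operations, oldest first."""

    def __init__(self, capacity: int = 1000):
        # deque with maxlen drops the oldest entry automatically when full.
        self._ring = deque(maxlen=capacity)

    def record(self, operation: str) -> None:
        self._ring.append(operation)

    def snapshot(self) -> list:
        """Return the retained operations in the order they were recorded."""
        return list(self._ring)
```

With this shape, the tail of the snapshot is the operation that was most recently in flight, which is what the hard-crash acceptance test inspects.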

#### Task B.10 — Driver.Galaxy.FaultShim (test-only)

Per `driver-stability.md` §"Test Coverage for Galaxy Stability" — analogous to FOCAS FaultShim:
- Test-only managed assembly substituted for `ArchestrA.MxAccess.dll` via assembly binding
- Injects: COM exception at chosen call site, subscription that never fires `OnDataChange`, `Marshal.ReleaseComObject` returning unexpected refcount, STA pump deadlock simulation
- Production builds load the real `ArchestrA.MxAccess` from GAC

**Acceptance**:
- FaultShim binds successfully under test configuration
- Each fault scenario triggers the expected protection (memory watchdog → recycle, supervisor → respawn, etc.)

### Stream C — Driver.Galaxy.Proxy (1.5 weeks, can run in parallel with B once A is done)

#### Task C.1 — Create the project + capability interface implementation

`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` (.NET 10). Dependencies: `Core.Abstractions` (Phase 1) + `Driver.Galaxy.Shared` (Stream A) + `MessagePack`.

Implement every interface listed in Phase Objective above. Each method:
- Marshals arguments into the matching IPC contract
- Sends over the data-plane pipe
- Awaits the response (with a Polly timeout per decision #34)
- Maps the response into the `Core.Abstractions` shape (`DataValue`, `DriverAttributeInfo`, etc.)
- Surfaces failures as the appropriate StatusCode

**Acceptance**:
- Each interface method has a unit test against a mock IPC channel: happy path + IPC timeout path + IPC error path
- `IRediscoverable` opt-in works: when Galaxy.Host signals a redeploy, Proxy invokes the Core's rediscovery flow (not full restart)

#### Task C.2 — Heartbeat sender + host liveness

Per `driver-stability.md` §"Heartbeat between proxy and host":
- 2s cadence (decision #72) on the dedicated heartbeat pipe
- 3 consecutive missed responses = host declared dead (6s detection)
- On host-dead: fan out Bad quality on all Galaxy-namespace nodes; ask supervisor to respawn

**Acceptance**:
- Heartbeat round-trip test against a mock host
- Missed-heartbeat test: stop the mock host's heartbeat responder → 3 misses → supervisor respawn requested
- GC pause test: simulate a 700ms GC pause on the proxy side → no false positive (single missed beat absorbed by 3-miss tolerance)
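
The 3-miss rule in Task C.2 amounts to a counter that resets on any successful response, sketched here in Python. The class name is hypothetical, and the sketch assumes one `on_tick` call per 2s heartbeat interval and that the dead state latches until the supervisor replaces the monitor.

```python
class HeartbeatMonitor:
    """Declares the host dead after `miss_limit` consecutive missed responses."""

    def __init__(self, miss_limit: int = 3):
        self._miss_limit = miss_limit
        self._consecutive_misses = 0
        self.host_dead = False

    def on_tick(self, response_received: bool) -> bool:
        """Record one heartbeat interval and return whether the host is dead."""
        if response_received:
            self._consecutive_misses = 0  # any response clears the streak
        else:
            self._consecutive_misses += 1
            if self._consecutive_misses >= self._miss_limit:
                self.host_dead = True  # latched; a respawn creates a new monitor
        return self.host_dead
```

At a 2s cadence three consecutive misses take 6s to accumulate, matching the stated detection time, while an isolated missed beat is absorbed because the next response resets the streak.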

#### Task C.3 — Supervisor with respawn-with-backoff + crash-loop circuit breaker

Per `driver-stability.md` §"Crash-loop circuit breaker" + Galaxy §"Recovery Sequence After Crash":
- Backoff: 5s → 15s → 60s (capped)
- Crash-loop: 3 crashes / 5 min → escalating cooldown (1h → 4h → 24h manual)
- Sticky alert that doesn't auto-clear when cooldown elapses
- On respawn after recycle: reuse cached `time_of_last_deploy` watermark to skip full DB rediscovery if unchanged

**Acceptance**:
- Respawn test: kill host process → supervisor respawns within 5s → host re-establishes
- Crash-loop test: force 3 crashes within 5 minutes → 4th respawn blocked, alert raised, manual reset clears alert
- Cooldown escalation test: trip → 1h auto-reset → re-trip within 10 min → 4h cooldown → re-trip → 24h manual
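
The backoff schedule and crash-loop breaker in Task C.3 can be sketched as below, in Python. The names are hypothetical, timestamps are plain seconds, and two simplifications are assumptions: the trip count never decays, and the 24h step is returned as a duration even though the plan marks it as requiring a manual reset.

```python
BACKOFF_SECONDS = (5, 15, 60)            # capped at the last value
CRASH_LIMIT = 3                          # crashes that trip the breaker...
CRASH_WINDOW_SECONDS = 300               # ...within five minutes
COOLDOWN_SECONDS = (3600, 4 * 3600, 24 * 3600)


def respawn_delay(attempt: int) -> int:
    """Delay before respawn attempt `attempt` (1-based)."""
    index = min(attempt, len(BACKOFF_SECONDS)) - 1
    return BACKOFF_SECONDS[index]


class CrashLoopBreaker:
    def __init__(self):
        self._crash_times = []
        self._trips = 0

    def record_crash(self, now: float):
        """Return the cooldown in seconds if this crash trips the breaker, else None."""
        # Keep only crashes that are still inside the sliding window.
        self._crash_times = [t for t in self._crash_times if now - t < CRASH_WINDOW_SECONDS]
        self._crash_times.append(now)
        if len(self._crash_times) < CRASH_LIMIT:
            return None
        self._crash_times.clear()
        cooldown = COOLDOWN_SECONDS[min(self._trips, len(COOLDOWN_SECONDS) - 1)]
        self._trips += 1
        return cooldown
```

The sticky alert is deliberately not modeled here; in the plan it outlives the cooldown and clears only on manual reset.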

#### Task C.4 — Address space build via `IAddressSpaceBuilder`

When the Proxy is asked to discover its tags, it issues `DiscoverGalaxyHierarchyRequest` to the Host, receives the gobject tree + attributes, and streams them to `IAddressSpaceBuilder` (Phase 1 API per decision #52). Galaxy uses the SystemPlatform-kind namespace; tags use `FolderPath` (v1-style) — no `Equipment` rows are created.

**Acceptance**:
- Build a Galaxy address space via the Proxy → byte-equivalent OPC UA browse output to v1
- Memory test: large Galaxy (4000+ attributes) → Proxy peak RAM stays under 200 MB during build

### Stream D — Retire legacy OtOpcUa.Host (1 week, depends on B + C)

#### Task D.1 — Delete legacy Host project

Once Galaxy.Host + Galaxy.Proxy are functional, the legacy `OtOpcUa.Host` project's responsibilities are split:
- Galaxy-specific code → `Driver.Galaxy.Host` (already moved in Stream B)
- TopShelf wrapper, `Program.cs`, generic OPC UA hosting → already replaced by `OtOpcUa.Server` in Phase 1
- Anything else (configuration types, generic helpers) → moved to `OtOpcUa.Server` or `OtOpcUa.Configuration` as appropriate

Delete the project from the solution. Update `.slnx` and any references.

**Acceptance**:
- `ls src/` shows `OtOpcUa.Host` is gone
- `dotnet build OtOpcUa.slnx` succeeds with `OtOpcUa.Host` no longer in the build graph
- All previously-`OtOpcUa.Host.Tests` tests are either moved to the appropriate new test project or deleted as obsolete

#### Task D.2 — Update Windows service installer scripts

Two services per cluster node when Galaxy is configured:
- `OtOpcUa` (the main `OtOpcUa.Server`) — already installable per Phase 1
- `OtOpcUaGalaxyHost` (the `Driver.Galaxy.Host`) — new service registration

Installer must:
- Install both services with the correct service-account SIDs (Galaxy.Host's pipe ACL must grant the OtOpcUa service principal)
- Set the supervisor's per-process secret in the registry or a protected file before first start
- Honor service dependency: Galaxy.Host should be configured to start before OtOpcUa, or OtOpcUa retries until Galaxy.Host is up

**Acceptance**:
- Install both services on a test box → both start successfully
- Uninstall both → no leftover registry / file system state
- Service-restart cycle: stop OtOpcUa.Server → Galaxy.Host stays up → start OtOpcUa.Server → reconnects to Galaxy.Host pipe

#### Task D.3 — Migrate Galaxy `appsettings.json` config to central config DB

Galaxy-specific config sections (`MxAccess`, `Galaxy`, `Historian`) move into the `DriverInstance.DriverConfig` JSON for the Galaxy driver instance in the Configuration DB. The local `appsettings.json` keeps only `Cluster.NodeId` + `ClusterId` + DB conn (per decision #18).

Migration script: for each existing v1 `appsettings.json`, generate the equivalent `DriverConfig` JSON and either insert via Admin UI or via a one-shot SQL script.

**Acceptance**:
- Migration script runs against a v1 dev `appsettings.json` → produces a JSON blob that loads into the Galaxy `DriverConfig` field
- The Galaxy driver instance starts with the migrated config and serves the same address space as v1

### Stream E — Parity validation (1 week, gate)

#### Task E.1 — Run v1 IntegrationTests against v2 Galaxy topology

Per decision #56:
- The same v1 IntegrationTests suite runs against the v2 build with Galaxy.Proxy + Galaxy.Host instead of in-process Galaxy
- All tests must pass
- Pass count = v1 baseline; failure count = 0; skip count = v1 baseline
- Test duration may increase (IPC round-trip latency); document the deviation

**Acceptance**:
- Test report shows pass/fail/skip counts identical to v1 baseline
- Per-test duration regression report: any test that takes >2× v1 baseline is flagged for review (may be an IPC bottleneck)
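
The per-test duration check in Task E.1 could be computed as sketched below, in Python. The function name and the dictionary inputs (test name to seconds) are assumptions about how the two test reports would be loaded; the 2× factor comes from the acceptance item above.

```python
def flag_slow_tests(baseline: dict, current: dict, factor: float = 2.0) -> list:
    """Return the names of tests whose current duration exceeds factor x baseline."""
    flagged = []
    for name, baseline_seconds in baseline.items():
        current_seconds = current.get(name)
        # Tests missing from the current run are a pass-count problem, not a
        # duration problem, so they are skipped here.
        if current_seconds is not None and current_seconds > factor * baseline_seconds:
            flagged.append(name)
    return sorted(flagged)
```

A test that runs at exactly twice its baseline is not flagged, since the criterion is strictly "more than 2×".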

#### Task E.2 — Scripted Client.CLI walkthrough parity

Per decision #56:
- Execute the captured Client.CLI script (recorded at Phase 2 entry gate against v1) against the v2 Galaxy topology
- Diff the output against v1 reference
- Differences allowed only in: timestamps, latency-measurement output. Any value, quality, browse path, or alarm shape difference = parity defect

**Acceptance**:
- Walkthrough completes without errors
- Output diff vs v1: only timestamp / latency lines differ

#### Task E.3 — Regression tests for the four 2026-04-13 stability findings

Per `driver-specs.md` Galaxy "Operational Stability Notes": each of the four findings closed in commits `c76ab8f` and `7310925` should have a regression test in the Phase 2 parity suite:
- Phantom probe subscription flipping Tick() to Stopped (covered by Task B.5)
- Cross-host quality clear wiping sibling state during recovery (covered by Task B.4)
- Sync-over-async on the OPC UA stack thread → guard against new instances in `GenericDriverNodeManager`
- Fire-and-forget alarm tasks racing shutdown → guard via the pre-shutdown drain ordering in Task B.3

**Acceptance**:
- Each of the four scenarios has a named test in the parity suite
- Each test fails on a hand-introduced regression (revert the v1 fix, see test fail)

#### Task E.4 — Adversarial review of the Phase 2 diff

Per `implementation/overview.md` exit gate:
- Run `/codex:adversarial-review --base v2` on the merged Phase 2 diff
- Findings closed or explicitly deferred with rationale and ticket link

## Compliance Checks (run at exit gate)

`phase-2-compliance.ps1`:

### Schema compliance

N/A for Phase 2 — no schema changes (Configuration DB schema is unchanged from Phase 1).

### Decision compliance

For each decision number Phase 2 implements (#11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102, plus the Galaxy-specific #62), verify at least one citation exists in source, tests, or migrations:

```powershell
# The range is concatenated separately: inside a comma list, ',' binds
# tighter than '..', so "62, 63..69, 76" would fail to parse as intended.
$decisions = @(11, 24, 25, 28, 29, 32, 34, 44, 46, 47, 55, 56, 62) + (63..69) + @(76, 102, 122, 123, 124)
foreach ($d in $decisions) {
    # -w matches on word boundaries, so "decision #11" does not also match "decision #110".
    $hits = git grep -w "decision #$d" -- 'src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/' 'tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/'
    if (-not $hits) { Write-Error "Decision #$d has no citation"; exit 1 }
}
```
|
||||
|
||||
### Visual compliance
N/A — no Admin UI changes in Phase 2 (Galaxy is just another `DriverInstance` in the Drivers tab).

### Behavioral compliance — parity smoke test
The parity suite (Stream E) is the smoke test:
1. v1 IntegrationTests pass count = baseline, fail count = 0
2. Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
3. Four regression tests for 2026-04-13 findings pass

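
Check 1 could be automated against a TRX results file. The sketch below assumes the v1 IntegrationTests are run with `dotnet test --logger trx` and that the baseline pass count has been recorded separately; `Test-ParityCounters` is a hypothetical helper name, not an existing part of `phase-2-compliance.ps1`, and the inline XML is a stripped-down stand-in for a real results file.

```powershell
# Hypothetical helper: returns $true only when the pass count equals the recorded
# baseline and nothing failed. Reads the summary counters of a TRX document.
function Test-ParityCounters {
    param(
        [Parameter(Mandatory)] [xml]$Trx,
        [Parameter(Mandatory)] [int]$BaselinePassCount
    )
    $counters = $Trx.TestRun.ResultSummary.Counters
    $passed = [int]$counters.passed
    $failed = [int]$counters.failed
    ($passed -eq $BaselinePassCount) -and ($failed -eq 0)
}

# Inline sample standing in for a real results file (assumed shape, heavily trimmed)
$sample = [xml]'<TestRun><ResultSummary><Counters passed="3" failed="0" /></ResultSummary></TestRun>'
Test-ParityCounters -Trx $sample -BaselinePassCount 3   # True
```

Comparing against a fixed baseline rather than only checking for zero failures also catches tests that silently disappeared from the run.
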
### Stability compliance
For Phase 2 (introduces the first Tier C driver in production form):
- Galaxy.Host implements every Tier C cross-cutting protection from `driver-stability.md`:
  - SafeHandle wrappers for COM (Task B.3) ✓
  - Memory watchdog with Galaxy thresholds (Task B.7) ✓
  - Bounded operation queues per device (already in Core, Phase 1) ✓
  - Heartbeat between proxy and host on separate channel (Tasks A.2, B.6, C.2) ✓
  - Scheduled recycling with `WM_QUIT` escalation to hard exit (Task B.8) ✓
  - Crash-loop circuit breaker (Task C.3) ✓
  - Post-mortem MMF readable after hard crash (Task B.9) ✓
  - IPC ACL + caller SID verification + per-process shared secret (Task B.6) ✓

Each protection has at least one regression test. The compliance script enumerates and verifies presence:

```powershell
$protections = @(
    @{Name="SafeHandle for COM"; Test="MxAccessHandleFinalizerReleasesCom"},
    @{Name="Memory watchdog"; Test="WatchdogTriggersRecycleAtThreshold"},
    @{Name="Heartbeat detection"; Test="ThreeMissedHeartbeatsDeclaresHostDead"},
    @{Name="WM_QUIT escalation"; Test="WedgedPumpEscalatesToHardExit"},
    @{Name="Crash-loop breaker"; Test="ThreeCrashesInFiveMinutesOpensCircuit"},
    @{Name="Post-mortem MMF"; Test="MmfSurvivesHardCrashAndIsReadable"},
    @{Name="Pipe ACL enforcement"; Test="NonServerSidConnectionRejected"},
    @{Name="Shared secret"; Test="ConnectionWithoutSecretRejected"}
)
foreach ($p in $protections) {
    # dotnet test can exit 0 when the filter matches no tests at all,
    # so the output is also scanned for the "No test matches" warning.
    $output = dotnet test --filter "FullyQualifiedName~$($p.Test)" --no-build --logger "console;verbosity=quiet" | Out-String
    if ($LASTEXITCODE -ne 0 -or $output -match 'No test matches') { Write-Error "Stability protection '$($p.Name)' has no passing test '$($p.Test)'"; exit 1 }
}
```

### Documentation compliance
- Any deviation from the Galaxy deep dive in `driver-stability.md` reflected back; new decisions added with `supersedes` notes if needed
- `driver-specs.md` §1 (Galaxy) updated to reflect the actual implementation if the IPC contract or recycle behavior differs from the design doc

## Completion Checklist

### Stream A — Driver.Galaxy.Shared
- [ ] Project created (.NET Standard 2.0, MessagePack-only dependency)
- [ ] All IPC contracts defined and round-trip tested
- [ ] Hello-message version negotiation implemented
- [ ] Reflection test confirms no .NET 10-only types leaked in

### Stream B — Driver.Galaxy.Host
- [ ] Project created (.NET 4.8 x86)
- [ ] All Galaxy-specific code moved from legacy Host
- [ ] STA thread + Win32 pump implemented; pump health probe wired up
- [ ] `MxAccessHandle : SafeHandle` for COM lifetime
- [ ] Subscription registry + reconnect with cross-host quality scoping
- [ ] `GalaxyRuntimeProbeManager` rebuilt; phantom-probe regression test passes
- [ ] Named-pipe IPC server with mandatory ACL + caller SID verification + per-process secret
- [ ] Memory watchdog with Galaxy-specific thresholds
- [ ] Recycle policy with 15s grace + WM_QUIT escalation to hard exit
- [ ] Post-mortem MMF writer + supervisor reader
- [ ] FaultShim test-only assembly for fault injection

### Stream C — Driver.Galaxy.Proxy
- [ ] Project created (.NET 10, depends on Core.Abstractions + Galaxy.Shared)
- [ ] All capability interfaces implemented (IDriver, ITagDiscovery, IRediscoverable, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe)
- [ ] Heartbeat sender on dedicated channel; missed-heartbeat detection
- [ ] Supervisor with respawn-with-backoff + crash-loop circuit breaker (escalating cooldown 1h/4h/24h)
- [ ] Address space build via `IAddressSpaceBuilder` produces byte-equivalent v1 output

### Stream D — Retire legacy OtOpcUa.Host
- [ ] Legacy `OtOpcUa.Host` project deleted from solution
- [ ] Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost)
- [ ] Galaxy `appsettings.json` config migrated into central DB `DriverConfig`
- [ ] Migration script tested against v1 dev config

### Stream E — Parity validation
- [ ] v1 IntegrationTests pass with count = baseline, failures = 0
- [ ] Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
- [ ] All four 2026-04-13 stability findings have passing regression tests
- [ ] Per-test duration regression report: no test >2× v1 baseline (or flagged for review)

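
The per-test duration check in the last item can be sketched as a comparison of two name-to-seconds maps. This is an illustration only: `Get-SlowTest` is a hypothetical name, the test names and durations in the usage lines are made up, and how the durations are collected from the v1 and v2 runs is left open.

```powershell
# Hypothetical helper: lists tests whose current duration exceeds Factor x the v1 baseline.
function Get-SlowTest {
    param(
        [Parameter(Mandatory)] [hashtable]$BaselineSeconds,
        [Parameter(Mandatory)] [hashtable]$CurrentSeconds,
        [double]$Factor = 2.0
    )
    foreach ($name in $CurrentSeconds.Keys) {
        # Tests with no v1 baseline cannot regress against it; skip them here
        if (-not $BaselineSeconds.ContainsKey($name)) { continue }
        # Multiply instead of dividing so a zero-second baseline cannot cause a divide error
        if ($CurrentSeconds[$name] -gt $Factor * $BaselineSeconds[$name]) {
            [pscustomobject]@{
                Test     = $name
                Baseline = $BaselineSeconds[$name]
                Current  = $CurrentSeconds[$name]
            }
        }
    }
}

# Made-up durations in seconds
$v1 = @{ 'ReadTank' = 1.0; 'WriteTank' = 0.5 }
$v2 = @{ 'ReadTank' = 1.4; 'WriteTank' = 1.2 }
Get-SlowTest -BaselineSeconds $v1 -CurrentSeconds $v2   # flags WriteTank only
```

Anything the helper emits would either be fixed or recorded as "flagged for review" in the report.
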
### Cross-cutting
- [ ] `phase-2-compliance.ps1` runs and exits 0
- [ ] All 8 Tier C stability protections have named, passing tests
- [ ] Adversarial review of the phase diff — findings closed or deferred with rationale
- [ ] PR opened against `v2`, includes: link to this doc, link to exit-gate record, compliance script output, parity test report, adversarial review output
- [ ] Reviewer signoff (one reviewer beyond the implementation lead)
- [ ] `exit-gate-phase-2.md` recorded

## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| IPC round-trip latency makes parity tests fail on timing assumptions | High | Medium | Per-test duration regression report identifies hot tests; tune timeouts in test config rather than in production code |
| MessagePack contract drift between Proxy and Host during development | Medium | High | Hello-message version negotiation rejects mismatched majors loudly; CI builds both projects in the same job |
| STA pump health probe is itself flaky and triggers spurious recycles | Medium | High | Probe interval tunable; default 10s gives 1000ms+ slack on a healthy pump; monitor via post-mortem MMF for false positives |
| Pipe ACL misconfiguration on installer leaves the IPC accessible to local users | Low | Critical | Defense-in-depth shared secret catches the case; ACL enumeration test in installer integration test |
| Galaxy.Host process recycle thrash if Galaxy or DB is intermittently unavailable | Medium | Medium | Crash-loop circuit breaker with escalating cooldown caps the thrash; Polly retry on the data path inside Host (not via supervisor restart) handles transient errors |
| Migration of `appsettings.json` Galaxy config to DB blob breaks existing deployments | Medium | Medium | Migration script is idempotent and dry-run-able; deploy script asserts central DB has the migrated config before stopping legacy Host |
| Phase 2 takes longer than 8 weeks | High | Medium | Mid-gate review at 4 weeks — if Stream B isn't past Task B.6 (IPC + ACL), defer Stream B.10 (FaultShim) to Phase 2.5 follow-up |
| Wonderware Historian SDK incompatibility with .NET 4.8 x86 in the new project layout | Low | High | Move and validate Historian loader as part of Task B.1 — early signal if SDK has any project-shape sensitivity |
| Hard-exit on wedged pump leaks COM resources | Accepted | Low | Documented intent: hard exit is the only safe response; OS process exit reclaims the process's handles, and OS-side COM cleanup is best-effort. CNC equivalent in FOCAS deep dive accepts the same trade-off |

## Out of Scope (do not do in Phase 2)

- Any non-Galaxy driver (Phase 3+)
- UNS / Equipment-namespace work for Galaxy (Galaxy is SystemPlatform-namespace; no Equipment rows for Galaxy tags per decision #108)
- Equipment-class template integration with the schemas repo (Galaxy doesn't use `EquipmentClassRef`)
- Push-from-DB notification (decision #96 — v2.1)
- Any change to OPC UA wire behavior visible to clients (parity is the gate)
- ScadaBridge cutover (Phase 6 — separate planning track)
- Removing the v1 deployment from production (a v2 release decision, not Phase 2)