lmxopcua/docs/v2/implementation/phase-2-galaxy-out-of-process.md
Joseph Doherty 4903a19ec9 Add data-path ACL design (acl-design.md, closes corrections B1) + dev-environment inventory and setup plan (dev-environment.md), and remove consumer cutover from OtOpcUa v2 scope.
ACL design defines NodePermissions bitmask flags covering Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall plus common bundles (ReadOnly / Operator / Engineer / Admin); 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with default-deny + additive grants and Browse-implication on ancestors; per-LDAP-group grants in a new generation-versioned NodeAcl table edited via the same draft → diff → publish → rollback boundary as every other content table; per-session permission-trie evaluator with O(depth × group-count) cost cached for the lifetime of the session and rebuilt on generation-apply or LDAP group cache expiry; cluster-create workflow seeds a default ACL set matching the v1 LmxOpcUa LDAP-role-to-permission map for v1 → v2 consumer migration parity; Admin UI ACL tab with two views (by LDAP group, by scope), bulk-grant flow, and permission simulator that lets operators preview "as user X" effective permissions across the cluster's UNS tree before publishing; explicit Deny deferred to v2.1 since verbose grants suffice at v2.0 fleet sizes; only denied OPC UA operations are audit-logged (not allowed ones — would dwarf the audit log). Schema doc gains the NodeAcl table with cross-cluster invariant enforcement and same-generation FK validation; admin-ui.md gains the ACLs tab; phase-1 doc gains Task E.9 wiring this through Stream E plus a NodeAcl entry in Task B.1's DbContext list.

Dev-environment doc inventories every external resource the v2 build needs across two tiers per decision #99 — inner-loop (in-process simulators on developer machines: SQL Server local or container, GLAuth at C:\publish\glauth\, local dev Galaxy) and integration (one dedicated Windows host with Docker Desktop on WSL2 backend so TwinCAT XAR VM can run in Hyper-V alongside containerized oitc/modbus-server, plus WSL2-hosted Snap7 and ab_server, plus OPC Foundation reference server, plus FOCAS TestStub and FaultShim) — with concrete container images, ports, default dev credentials (clearly marked dev-only since production uses Integrated Security / gMSA per decision #46), bootstrap order for both tiers, network topology diagram, test data seed locations, and operational risks (TwinCAT trial expiry automation, Docker pricing, integration host SPOF mitigation, per-developer GLAuth config sync, Aveva license scoping that keeps Galaxy tests on developer machines and off the shared host).

Removes consumer cutover (ScadaBridge / Ignition / System Platform IO) from OtOpcUa v2 scope per decision #136 — owned by a separate integration / operations team, tracked in 3-year-plan handoff §"Rollout Posture" and corrections §C5; OtOpcUa team's scope ends at Phase 5. Updates implementation/overview.md phase index to drop the "6+" row and add an explicit "OUT of v2 scope" callout; updates phase-1 and phase-2 docs to reframe cutover as integration-team-owned rather than future-phase numbered.

Decisions #129–137 added: ACL model (#129), NodeAcl generation-versioned (#130), v1-compatibility seed (#131), denied-only audit logging (#132), two-tier dev environment (#133), Docker WSL2 backend for TwinCAT VM coexistence (#134), TwinCAT VM centrally managed / Galaxy on dev machines only (#135), cutover out of v2 scope (#136), dev credentials documented openly (#137).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:58:33 -04:00


Phase 2 — Galaxy Out-of-Process Refactor (Tier C)

Status: DRAFT — implementation plan for Phase 2 of the v2 build (plan.md §6, driver-stability.md §"Galaxy — Deep Dive").

Branch: v2/phase-2-galaxy
Estimated duration: 6–8 weeks (largest refactor phase; Tier C protections + IPC are the bulk)
Predecessor: Phase 1 (phase-1-configuration-and-admin-scaffold.md)
Successor: Phase 3 (Modbus TCP driver)

Phase Objective

Move Galaxy / MXAccess from the legacy in-process OtOpcUa.Host project into the Tier C out-of-process topology specified in driver-stability.md:

  1. Driver.Galaxy.Shared — .NET Standard 2.0 IPC message contracts (MessagePack DTOs)
  2. Driver.Galaxy.Host — .NET 4.8 x86 separate Windows Service that owns MxAccessBridge, GalaxyRepository, alarm tracking, GalaxyRuntimeProbeManager, the Wonderware Historian SDK, the STA thread + Win32 message pump, and all Tier C cross-cutting protections (memory watchdog, scheduled recycle, post-mortem MMF, IPC ACL + caller SID verification, per-process shared secret)
  3. Driver.Galaxy.Proxy — .NET 10 in-process driver implementing every capability interface (IDriver, ITagDiscovery, IRediscoverable, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe), forwarding each call over named-pipe IPC and owning the supervisor (heartbeat, host liveness, respawn with backoff, crash-loop circuit breaker, fan-out of Bad quality on host death)
  4. Retire the legacy OtOpcUa.Host project — its responsibilities now live in OtOpcUa.Server (built in Phase 1) for OPC UA hosting and OtOpcUa.Driver.Galaxy.Host for Galaxy-specific runtime

Parity, not regression. The phase exit gate is: the v1 IntegrationTests suite passes byte-for-byte against the v2 Galaxy.Proxy + Galaxy.Host topology, and a scripted Client.CLI walkthrough produces equivalent output to v1 (decision #56). Anything different — quality codes, browse paths, alarm shapes, history responses — is a parity defect.

This phase also closes the four 2026-04-13 stability findings (commits c76ab8f and 7310925) by adding regression tests to the parity suite per driver-specs.md Galaxy "Operational Stability Notes".

Scope — What Changes

| Concern | Change |
|---|---|
| Project layout | 3 new projects: Driver.Galaxy.Shared (.NET Standard 2.0), Driver.Galaxy.Host (.NET 4.8 x86), Driver.Galaxy.Proxy (.NET 10) |
| OtOpcUa.Host (legacy in-process) | Retired. Galaxy-specific code moves to Driver.Galaxy.Host; the small remainder (TopShelf wrapper, Program.cs) was already replaced by OtOpcUa.Server in Phase 1 |
| MXAccess COM access | Now lives only in Driver.Galaxy.Host (.NET 4.8 x86, STA thread + Win32 message pump). Main server (OtOpcUa.Server, .NET 10 x64) never references ArchestrA.MxAccess |
| Wonderware Historian SDK | Same: only in Driver.Galaxy.Host |
| Galaxy DB queries | GalaxyRepository moves to Driver.Galaxy.Host; the SQL connection string lives in the Galaxy DriverConfig JSON |
| OPC UA address space build for Galaxy | Driven by Driver.Galaxy.Proxy calls into IAddressSpaceBuilder (Phase 1 API); the Proxy fetches the hierarchy via IPC and streams nodes to the builder |
| Subscriptions, reads, writes, alarms, history | All forwarded over named-pipe IPC via MessagePack contracts in Driver.Galaxy.Shared |
| Tier C cross-cutting protections | All wired up per driver-stability.md §"Cross-Cutting Protections" → "Isolated host only (Tier C)" + the Galaxy deep dive |
| Windows service installer | Two services per Galaxy-using cluster node: OtOpcUa (the main server) + OtOpcUaGalaxyHost (the Galaxy host). Installer scripts updated. |
| appsettings.json (legacy Galaxy config sections) | Migrated into the central config DB under DriverInstance.DriverConfig JSON for the Galaxy driver instance. Local appsettings.json keeps only Cluster.NodeId + ClusterId + DB conn (per decision #18) |

Scope — What Does NOT Change

| Item | Reason |
|---|---|
| OPC UA wire behavior visible to clients | Parity is the gate. Clients see the same browse paths, quality codes, alarm shapes, and history responses as v1 |
| Galaxy hierarchy mapping (gobject parents → OPC UA folders) | Galaxy uses the SystemPlatform-kind namespace; UNS rules don't apply (decision #108). Tag.FolderPath mirrors v1 LmxOpcUa exactly |
| Galaxy EquipmentClassRef integration | Galaxy is SystemPlatform-namespace; no Equipment rows are created for Galaxy tags. Equipment-namespace work is for the native-protocol drivers in Phase 3+ |
| Any non-Galaxy driver | Phase 3+ |
| OtOpcUa.Server lifecycle / configuration substrate / Admin UI | Built in Phase 1; Phase 2 only adds the Galaxy.Proxy as a DriverInstance |
| Wonderware Historian dependency | Stays optional, loaded only when Historian.Enabled = true in the Galaxy DriverConfig |

Entry Gate Checklist

  • Phase 1 exit gate cleared (Configuration + Admin + Server + Core.Abstractions all green; Galaxy still in-process via legacy Host)
  • v2 branch is clean
  • Phase 1 PR merged
  • Dev Galaxy reachable for parity testing — same Galaxy that v1 tests against
  • v1 IntegrationTests baseline pass count + duration recorded (this is the parity bar)
  • Client.CLI walkthrough script captured against v1 and saved as reference output
  • All Phase 2-relevant docs reviewed: plan.md §3–4, §5a (LmxNodeManager reusability), driver-stability.md §"Out-of-Process Driver Pattern (Generalized)" + §"Galaxy — Deep Dive (Tier C)", driver-specs.md §1 (Galaxy)
  • Decisions cited or implemented by Phase 2 read at least at skim level: #11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62, #63–69, #76, #102 (the Tier C IPC ACL + recycle decisions are all relevant)
  • Confirmation that the four 2026-04-13 stability findings (c76ab8f, 7310925) have existing v1 tests that will be the regression net for the v2 split

Evidence file: docs/v2/implementation/entry-gate-phase-2.md.

Task Breakdown

Five work streams (A–E). Stream A is the foundation; B and C run partly in parallel after A; D depends on B + C; E is the parity gate at the end.

Stream A — Driver.Galaxy.Shared (1 week)

Task A.1 — Create the project

src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ (.NET Standard 2.0 — must be consumable by both .NET 10 Proxy and .NET 4.8 Host per decision #25). Single dependency: MessagePack NuGet (decision #32).

Task A.2 — IPC message contracts

Define the MessagePack DTOs covering every Galaxy operation the Proxy will forward:

  • Lifecycle: OpenSessionRequest, OpenSessionResponse, CloseSessionRequest, Heartbeat (separate channel per driver-stability.md §"Heartbeat between proxy and host")
  • Discovery: DiscoverGalaxyHierarchyRequest, GalaxyObjectInfo, GalaxyAttributeInfo (these are not the v1 Domain types — they're the IPC-shape with MessagePack attributes; the Proxy maps to/from DriverAttributeInfo from Core.Abstractions)
  • Read / Write: ReadValuesRequest, ReadValuesResponse, WriteValuesRequest, WriteValuesResponse (carries DataValue shape per decision #13: value + StatusCode + timestamps)
  • Subscriptions: SubscribeRequest, UnsubscribeRequest, OnDataChangeNotification (server-pushed)
  • Alarms: AlarmSubscribeRequest, AlarmEvent, AlarmAcknowledgeRequest
  • History: HistoryReadRequest, HistoryReadResponse
  • Probe: HostConnectivityStatus, RuntimeStatusChangeNotification
  • Recycle / control: RecycleHostRequest, RecycleStatusResponse

Length-prefixed framing per decision #28; MessagePack body inside each frame.
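
The framing rule above can be sketched as follows. This is a minimal illustration, not the contract itself: the 4-byte little-endian prefix, the `FrameCodec` name, and the `MaxFrameBytes` cap are all assumptions, since decision #28 is not quoted here and may fix a different width or byte order. The body is treated as an opaque byte array that would hold the MessagePack payload.

```csharp
using System;
using System.IO;

// Sketch only: prefix width, byte order, and the size cap are assumptions.
public static class FrameCodec
{
    public const int MaxFrameBytes = 16 * 1024 * 1024; // hypothetical sanity cap

    // Writes one frame: 4-byte little-endian length, then the body bytes.
    public static void WriteFrame(Stream stream, byte[] body)
    {
        if (body.Length > MaxFrameBytes) throw new InvalidDataException("frame too large");
        byte[] header = new byte[4];
        header[0] = (byte)body.Length;
        header[1] = (byte)(body.Length >> 8);
        header[2] = (byte)(body.Length >> 16);
        header[3] = (byte)(body.Length >> 24);
        stream.Write(header, 0, 4);
        stream.Write(body, 0, body.Length);
    }

    // Reads one frame and returns its body; rejects implausible lengths
    // before allocating, so a corrupt prefix cannot exhaust memory.
    public static byte[] ReadFrame(Stream stream)
    {
        byte[] header = ReadExactly(stream, 4);
        int length = header[0] | (header[1] << 8) | (header[2] << 16) | (header[3] << 24);
        if (length < 0 || length > MaxFrameBytes) throw new InvalidDataException("bad frame length");
        return ReadExactly(stream, length);
    }

    // Stream.Read may return fewer bytes than requested, so loop until filled.
    private static byte[] ReadExactly(Stream stream, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException("stream closed mid-frame");
            offset += read;
        }
        return buffer;
    }
}
```

Because the codec only depends on `Stream`, the same code can be exercised against a `MemoryStream` in unit tests and against the named pipe at runtime.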

Acceptance:

  • All contracts compile against .NET Standard 2.0
  • Unit test project asserts each contract round-trips through MessagePack serialize → deserialize byte-for-byte
  • Reflection test asserts no contract references System.Text.Json or anything not in BCL/MessagePack

Task A.3 — Versioning + capability negotiation

Add a top-of-stream Hello message exchanged on connection, carrying the protocol version and the supported features. This future-proofs the protocol: new operations can be added later without breaking older Hosts.

Acceptance:

  • Proxy refuses to talk to a Host advertising a major version it doesn't understand; logs the mismatch
  • Host refuses to accept a Proxy from an unknown major version
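
A minimal sketch of the version check described above, under stated assumptions: the `Hello` field names, the `HelloNegotiator` helper, and the idea of intersecting feature lists are illustrative and not taken from the design docs. The real contract would carry MessagePack attributes and live in Driver.Galaxy.Shared.

```csharp
using System;

// Hypothetical shape of the Hello payload (field names are assumptions).
public sealed class Hello
{
    public int Major;
    public int Minor;
    public string[] Features = new string[0];
}

public static class HelloNegotiator
{
    // True when the peer speaks our major version. Minor versions may differ.
    // On mismatch, 'reason' holds a message suitable for the log.
    public static bool IsCompatible(Hello local, Hello peer, out string reason)
    {
        if (peer.Major != local.Major)
        {
            reason = "protocol major mismatch: local=" + local.Major + ", peer=" + peer.Major;
            return false;
        }
        reason = null;
        return true;
    }

    // Features advertised by both sides; anything else stays unused on this connection.
    public static string[] SharedFeatures(Hello local, Hello peer)
    {
        return Array.FindAll(local.Features, f => Array.IndexOf(peer.Features, f) >= 0);
    }
}
```

Both Proxy and Host could run the same check on the first frame, which is what lets either side refuse the connection symmetrically.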

Stream B — Driver.Galaxy.Host (3–4 weeks)

Task B.1 — Create the project + move Galaxy code

src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ (.NET 4.8, x86 platform target — required for MXAccess COM per decision #23).

Move from legacy OtOpcUa.Host:

  • MxAccessBridge.cs and supporting types
  • GalaxyRepository.cs and SQL queries
  • Alarm tracking infrastructure
  • GalaxyRuntimeProbeManager.cs
  • MxDataTypeMapper.cs, SecurityClassificationMapper.cs
  • Historian plugin loader and IHistorianDataSource (only loaded when Historian.Enabled = true)
  • Configuration types (MxAccessConfiguration, GalaxyRepositoryConfiguration, HistorianConfiguration, GalaxyScope) — these now read from the JSON DriverConfig rather than appsettings.json

Driver.Galaxy.Host does not reference Core.Abstractions (decision §5 dependency graph) — it's a closed unit, IPC-fronted.

Acceptance:

  • Project builds against .NET 4.8 x86
  • All moved files have their namespace updated to ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.*
  • v1 unit tests for these classes (still in OtOpcUa.Host.Tests) move to a new OtOpcUa.Driver.Galaxy.Host.Tests project and pass

Task B.2 — STA thread + Win32 message pump

Per driver-stability.md Galaxy deep dive:

  • Single STA thread per Host process owns all LMXProxyServer instances
  • Work item dispatch via PostThreadMessage(WM_APP)
  • WM_QUIT shutdown only after all outstanding work items complete
  • Pump health probe: no-op work item every 10s, timeout = wedged-pump signal that triggers recycle

This is essentially v1's StaComThread, lifted from the LmxProxy.Host reference implementation (per the CLAUDE.md "Reference Implementation" section).

Acceptance:

  • Pump starts, dispatches work items, exits cleanly on WM_QUIT
  • Pump-wedged simulation (work item that infinite-loops) triggers the 10s timeout and posts a recycle event
  • COM call from non-STA thread fails fast with a recognizable error (regression net for cross-apartment bugs)

Task B.3 — MxAccessHandle : SafeHandle for COM lifetime

Wrap each LMXProxyServer connection in a SafeHandle subclass (decision #65 + Galaxy deep dive):

  • ReleaseHandle() calls Marshal.ReleaseComObject until refcount = 0, then UnregisterProxy
  • Subscription handles wrapped per item; RemoveAdvise → RemoveItem ordering enforced
  • CriticalFinalizerObject semantics (SafeHandle already derives from it) for finalizer ordering during AppDomain unload
  • Pre-shutdown drain: cancel all subscriptions cleanly via the STA pump, in order, before pump exit

Acceptance:

  • Unit test asserts a leaked handle (no Dispose) is released by the finalizer
  • Shutdown test asserts no orphan COM refs after Host exits cleanly
  • Stress test: 1000 subscribe/unsubscribe cycles → handle table empty at the end

Task B.4 — Subscription registry + reconnect

Per driver-stability.md Galaxy deep dive §"Subscription State and Reconnect":

  • In-memory registry of (Item, AdviseId, OwningHost) for every subscription
  • Reconnect order: register proxy → re-add items → re-advise
  • Cross-host quality clear gated on host-status check (closes 2026-04-13 finding)

Acceptance:

  • Disconnect simulation: kill TCP to MXAccess; subscriptions go Bad; reconnect; subscriptions restore in correct order
  • Multi-host test: stop AppEngine A while AppEngine B is running; verify A's subscriptions go Bad but B's stay Good (closes the cross-host quality clear regression)

Task B.5 — Connection health probe (GalaxyRuntimeProbeManager rebuild)

Lift the existing GalaxyRuntimeProbeManager into the new project. Behaviors per driver-stability.md:

  • Subscribe to per-host runtime-status synthetic attribute
  • Bad-quality fan-out scoped to the host's subtree (not Galaxy-wide)
  • Failed probe subscription does not leave a phantom entry that Tick() flips to Stopped (closes 2026-04-13 finding)

Acceptance:

  • Probe failure simulation → no phantom entry; Tick() does not flip arbitrary subscriptions to Stopped (regression test for the finding)
  • Probe transitions Stopped → Running → Stopped → Running over 5 minutes; quality fan-out happens correctly each transition

Task B.6 — Named-pipe IPC server with mandatory ACL

Per decision #76 + driver-stability.md §"IPC Security":

  • Pipe ACL on creation: ReadWrite | Synchronize granted only to the OtOpcUa server's service principal SID; LocalSystem and Administrators explicitly denied
  • Caller identity verification on each new connection: GetImpersonationUserName() cross-checked against configured server service SID; mismatches dropped before any RPC frame is read
  • Per-process shared secret: passed by the supervisor at spawn time, required on first frame of every connection
  • Heartbeat pipe: separate from data-plane pipe, same ACL

Acceptance:

  • Unit test: pipe ACL enumeration shows only the configured SID + Synchronize/ReadWrite
  • Integration test: connection from a non-server-SID local process is dropped with audit log entry
  • Integration test: connection without correct shared secret on first frame is dropped
  • Defense-in-depth test: even if ACL is misconfigured (manually overridden), shared-secret check catches the wrong client

Task B.7 — Memory watchdog with Galaxy-specific thresholds

Per driver-stability.md Galaxy deep dive §"Memory Watchdog Thresholds":

  • Sample RSS every 30s
  • Warning: 1.5× baseline OR baseline + 200 MB (whichever larger)
  • Soft recycle: 2× baseline OR baseline + 200 MB (whichever larger)
  • Hard ceiling: 1.5 GB → force-kill
  • Slope: > 5 MB/min sustained 30 min → soft recycle

Acceptance:

  • Unit test against a mock RSS source: each threshold triggers the correct action
  • Integration test with the FaultShim (Task B.10): leak simulation crosses the soft-recycle threshold and triggers the soft recycle path
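
The thresholds listed for Task B.7 can be expressed as a small pure function, which is also what makes the mock-RSS unit test straightforward. This is a sketch under assumptions: the type names are invented, 1.5 GB is read as 1536 MiB, thresholds are treated as inclusive, and "sustained 30 min" is approximated as the average slope across the last 61 samples (30 s apart).

```csharp
using System;

public enum WatchdogAction { None, Warn, SoftRecycle, HardKill }

// Sketch only: names, inclusive comparisons, and binary units are assumptions.
public static class GalaxyMemoryWatchdog
{
    private const long MB = 1024L * 1024L;
    public const long HardCeilingBytes = 1536L * MB; // 1.5 GB

    // Evaluates one RSS sample against the baseline-relative thresholds.
    public static WatchdogAction Evaluate(long baselineBytes, long rssBytes)
    {
        long floor = baselineBytes + 200 * MB;
        long warn = Math.Max(baselineBytes * 3 / 2, floor); // 1.5x or +200 MB, whichever larger
        long soft = Math.Max(baselineBytes * 2, floor);     // 2x or +200 MB, whichever larger

        if (rssBytes >= HardCeilingBytes) return WatchdogAction.HardKill;
        if (rssBytes >= soft) return WatchdogAction.SoftRecycle;
        if (rssBytes >= warn) return WatchdogAction.Warn;
        return WatchdogAction.None;
    }

    // Slope rule: growth above 5 MB/min averaged over the last 30 minutes.
    // 'samples' is oldest-first, one reading every 30 s (61 readings span 30 min).
    public static bool SlopeExceeded(long[] samples)
    {
        const int window = 61;
        if (samples.Length < window) return false;
        long first = samples[samples.Length - window];
        long last = samples[samples.Length - 1];
        return (last - first) > 5 * MB * 30;
    }
}
```

One consequence of the thresholds as written: for a baseline at or below 200 MB, both the warning and soft-recycle levels resolve to baseline + 200 MB, so the warning stage never fires on its own. That may be intended, but it is worth confirming against driver-stability.md.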

Task B.8 — Recycle policy with WM_QUIT escalation

Per driver-stability.md Galaxy deep dive §"Recycle Policy (COM-specific)":

  • 15s grace for in-flight COM calls (longer than FOCAS because legitimate MXAccess bulk reads take seconds)
  • Per-handle: RemoveAdvise → RemoveItem → ReleaseComObject → UnregisterProxy, on the STA thread
  • WM_QUIT posted only after all of the above complete
  • If STA pump doesn't exit within 5s of WM_QUIT → Environment.Exit(2) (hard exit)
  • Soft recycle scheduled daily at 03:00 local; recycle frequency cap 1/hour

Acceptance:

  • Soft recycle test: in-flight call returns within grace → clean exit (Exit(0))
  • Soft recycle test: in-flight call exceeds grace → hard exit (Exit(2)); supervisor records as unclean recycle
  • Wedged-pump test: pump doesn't drain after WM_QUIT → Exit(2) within 5s
  • Frequency cap test: trigger 2 soft recycles within an hour → second is blocked, alert raised
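
The exit-code and frequency-cap rules above reduce to two small decisions, sketched here so the acceptance tests have something concrete to assert against. The class names are hypothetical, and whether the one-per-hour cap also applies to scheduled 03:00 recycles is an assumption this sketch does not settle.

```csharp
using System;

// Sketch only: maps the two recycle stages onto the exit codes named above.
public static class RecycleExit
{
    public const int Clean = 0; // Exit(0): in-flight calls drained, pump exited
    public const int Hard = 2;  // Environment.Exit(2): grace or pump timeout exceeded

    // Both stages must finish inside their windows:
    // 15 s grace for in-flight COM calls, then 5 s for the pump after WM_QUIT.
    public static int CodeFor(TimeSpan inFlightDrain, TimeSpan pumpExitAfterQuit)
    {
        bool drained = inFlightDrain <= TimeSpan.FromSeconds(15);
        bool pumpExited = pumpExitAfterQuit <= TimeSpan.FromSeconds(5);
        return drained && pumpExited ? Clean : Hard;
    }
}

// Allows at most one soft recycle per hour.
public sealed class RecycleFrequencyCap
{
    private static readonly TimeSpan MinInterval = TimeSpan.FromHours(1);
    private DateTime? _lastRecycleUtc;

    // Returns false when a recycle already ran within the last hour;
    // the caller is expected to raise the alert in that case.
    public bool TryBegin(DateTime nowUtc)
    {
        if (_lastRecycleUtc.HasValue && nowUtc - _lastRecycleUtc.Value < MinInterval) return false;
        _lastRecycleUtc = nowUtc;
        return true;
    }
}
```

Passing the clock in as a parameter keeps the frequency-cap test deterministic instead of waiting an hour.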

Task B.9 — Post-mortem MMF writer

Per driver-stability.md Galaxy deep dive §"Post-Mortem Log Contents":

  • Ring buffer of last 1000 IPC operations
  • Plus Galaxy-specific snapshots: STA pump state (thread ID, last dispatched timestamp, queue depth), active subscription count by host, MxAccessHandle refcount snapshot, last 100 probe results, last redeploy event, Galaxy DB connection state, Historian connection state if HDA enabled
  • Memory-mapped file at %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf
  • On graceful shutdown: flush ring + snapshots to a rotating log
  • On hard crash: supervisor reads the MMF after the corpse is gone

Acceptance:

  • Round-trip test: write 1000 operations → read back → assert order + content
  • Hard-crash test: kill the process mid-operation → supervisor reads the MMF → ring tail shows the operation that was in flight
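
The "last 1000 IPC operations" ring can be illustrated with an in-memory stand-in. This is a sketch, not the MMF writer: a real implementation would lay fixed-size slots out in the memory-mapped file so the supervisor can read them after a crash, and would need to handle concurrent writers. The `OperationRing` name and the use of strings as entries are assumptions.

```csharp
using System;

// In-memory stand-in for the post-mortem ring buffer (not thread-safe).
public sealed class OperationRing
{
    private readonly string[] _slots;
    private long _written; // total entries ever recorded

    public OperationRing(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException("capacity");
        _slots = new string[capacity];
    }

    // Overwrites the oldest entry once the ring is full.
    public void Record(string operation)
    {
        _slots[_written % _slots.Length] = operation;
        _written++;
    }

    // Returns the retained entries, oldest first.
    public string[] Snapshot()
    {
        int count = (int)Math.Min(_written, _slots.Length);
        string[] result = new string[count];
        long start = _written - count;
        for (int i = 0; i < count; i++)
            result[i] = _slots[(start + i) % _slots.Length];
        return result;
    }
}
```

The round-trip acceptance test above maps directly onto `Record` followed by `Snapshot`: the last entry in the snapshot is the operation that was in flight.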

Task B.10 — Driver.Galaxy.FaultShim (test-only)

Per driver-stability.md §"Test Coverage for Galaxy Stability" — analogous to FOCAS FaultShim:

  • Test-only managed assembly substituted for ArchestrA.MxAccess.dll via assembly binding
  • Injects: COM exception at chosen call site, subscription that never fires OnDataChange, Marshal.ReleaseComObject returning unexpected refcount, STA pump deadlock simulation
  • Production builds load the real ArchestrA.MxAccess from GAC

Acceptance:

  • FaultShim binds successfully under test configuration
  • Each fault scenario triggers the expected protection (memory watchdog → recycle, supervisor → respawn, etc.)

Stream C — Driver.Galaxy.Proxy (1.5 weeks, can parallel with B after A done)

Task C.1 — Create the project + capability interface implementation

src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ (.NET 10). Dependencies: Core.Abstractions (Phase 1) + Driver.Galaxy.Shared (Stream A) + MessagePack.

Implement every interface listed in Phase Objective above. Each method:

  • Marshals arguments into the matching IPC contract
  • Sends over the data-plane pipe
  • Awaits the response (with a Polly timeout, per decision #34)
  • Maps the response into the Core.Abstractions shape (DataValue, DriverAttributeInfo, etc.)
  • Surfaces failures as the appropriate StatusCode

Acceptance:

  • Each interface method has a unit test against a mock IPC channel: happy path + IPC timeout path + IPC error path
  • IRediscoverable opt-in works: when Galaxy.Host signals a redeploy, Proxy invokes the Core's rediscovery flow (not full restart)

Task C.2 — Heartbeat sender + host liveness

Per driver-stability.md §"Heartbeat between proxy and host":

  • 2s cadence (decision #72) on the dedicated heartbeat pipe
  • 3 consecutive missed responses = host declared dead (6s detection)
  • On host-dead: fan out Bad quality on all Galaxy-namespace nodes; ask supervisor to respawn

Acceptance:

  • Heartbeat round-trip test against a mock host
  • Missed-heartbeat test: stop the mock host's heartbeat responder → 3 misses → supervisor respawn requested
  • GC pause test: simulate a 700ms GC pause on the proxy side → no false positive (single missed beat absorbed by 3-miss tolerance)
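
The three-miss rule above is a consecutive-miss counter. A minimal sketch, assuming one call per 2 s tick and a hypothetical `HeartbeatMonitor` type; the pipe I/O and the respawn request sit outside it.

```csharp
// Sketch only: tracks consecutive missed heartbeat responses.
public sealed class HeartbeatMonitor
{
    public const int MissesUntilDead = 3;
    private int _consecutiveMisses;

    // Call once per heartbeat tick with whether the host answered in time.
    // Returns true when the host should be declared dead.
    public bool OnTick(bool responded)
    {
        _consecutiveMisses = responded ? 0 : _consecutiveMisses + 1;
        return _consecutiveMisses >= MissesUntilDead;
    }
}
```

Any single response resets the counter, which is why an isolated late beat (the GC pause case) does not trip detection.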

Task C.3 — Supervisor with respawn-with-backoff + crash-loop circuit breaker

Per driver-stability.md §"Crash-loop circuit breaker" + Galaxy §"Recovery Sequence After Crash":

  • Backoff: 5s → 15s → 60s (capped)
  • Crash-loop: 3 crashes / 5 min → escalating cooldown (1h → 4h → 24h manual)
  • Sticky alert that doesn't auto-clear when cooldown elapses
  • On respawn after recycle: reuse cached time_of_last_deploy watermark to skip full DB rediscovery if unchanged

Acceptance:

  • Respawn test: kill host process → supervisor respawns within 5s → host re-establishes
  • Crash-loop test: force 3 crashes within 5 minutes → 4th respawn blocked, alert raised, manual reset clears alert
  • Cooldown escalation test: trip → 1h auto-reset → re-trip within 10 min → 4h cooldown → re-trip → 24h manual
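
The backoff ladder and crash-loop trip condition can be sketched as below. Assumptions are flagged in the comments: the `RespawnPolicy` name, resetting the backoff step only when the host reports healthy, and counting crashes in a sliding 5-minute window. The 1h → 4h → 24h cooldown escalation and the sticky alert are deliberately left out to keep the example short.

```csharp
using System;
using System.Collections.Generic;

// Sketch only: respawn delay selection plus the 3-crashes-in-5-minutes trip.
public sealed class RespawnPolicy
{
    private static readonly TimeSpan[] Backoff =
    {
        TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(60)
    };
    private static readonly TimeSpan CrashWindow = TimeSpan.FromMinutes(5);
    private const int CrashLimit = 3;

    private readonly Queue<DateTime> _crashes = new Queue<DateTime>();
    private int _attempt;

    public bool CircuitOpen { get; private set; }

    // Records a crash. Returns the delay before the next respawn,
    // or null when the breaker has opened and respawn must be blocked.
    public TimeSpan? OnCrash(DateTime nowUtc)
    {
        _crashes.Enqueue(nowUtc);
        // Drop crashes that fell out of the sliding window (assumed semantics).
        while (_crashes.Count > 0 && nowUtc - _crashes.Peek() > CrashWindow)
            _crashes.Dequeue();

        if (_crashes.Count >= CrashLimit)
        {
            CircuitOpen = true;
            return null;
        }

        TimeSpan delay = Backoff[Math.Min(_attempt, Backoff.Length - 1)];
        _attempt++;
        return delay;
    }

    // Assumption: a host that stays healthy resets the backoff step.
    public void OnHealthy()
    {
        _attempt = 0;
    }
}
```

With three crashes inside the window, the third call returns null, which corresponds to the "4th respawn blocked" acceptance case if the initial spawn is counted as the first.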

Task C.4 — Address space build via IAddressSpaceBuilder

When the Proxy is asked to discover its tags, it issues DiscoverGalaxyHierarchyRequest to the Host, receives the gobject tree + attributes, and streams them to IAddressSpaceBuilder (Phase 1 API per decision #52). Galaxy uses the SystemPlatform-kind namespace; tags use FolderPath (v1-style) — no Equipment rows are created.

Acceptance:

  • Build a Galaxy address space via the Proxy → byte-equivalent OPC UA browse output to v1
  • Memory test: large Galaxy (4000+ attributes) → Proxy peak RAM stays under 200 MB during build

Stream D — Retire legacy OtOpcUa.Host (1 week, depends on B + C)

Task D.1 — Delete legacy Host project

Once Galaxy.Host + Galaxy.Proxy are functional, the legacy OtOpcUa.Host project's responsibilities are split:

  • Galaxy-specific code → Driver.Galaxy.Host (already moved in Stream B)
  • TopShelf wrapper, Program.cs, generic OPC UA hosting → already replaced by OtOpcUa.Server in Phase 1
  • Anything else (configuration types, generic helpers) → moved to OtOpcUa.Server or OtOpcUa.Configuration as appropriate

Delete the project from the solution. Update .slnx and any references.

Acceptance:

  • ls src/ shows OtOpcUa.Host is gone
  • dotnet build OtOpcUa.slnx succeeds with OtOpcUa.Host no longer in the build graph
  • All previously-OtOpcUa.Host.Tests tests are either moved to the appropriate new test project or deleted as obsolete

Task D.2 — Update Windows service installer scripts

Two services per cluster node when Galaxy is configured:

  • OtOpcUa (the main OtOpcUa.Server) — already installable per Phase 1
  • OtOpcUaGalaxyHost (the Driver.Galaxy.Host) — new service registration

Installer must:

  • Install both services with the correct service-account SIDs (Galaxy.Host's pipe ACL must grant the OtOpcUa service principal)
  • Set the supervisor's per-process secret in the registry or a protected file before first start
  • Honor service dependency: Galaxy.Host should be configured to start before OtOpcUa, or OtOpcUa retries until Galaxy.Host is up

Acceptance:

  • Install both services on a test box → both start successfully
  • Uninstall both → no leftover registry / file system state
  • Service-restart cycle: stop OtOpcUa.Server → Galaxy.Host stays up → start OtOpcUa.Server → reconnects to Galaxy.Host pipe

Task D.3 — Migrate Galaxy appsettings.json config to central config DB

Galaxy-specific config sections (MxAccess, Galaxy, Historian) move into the DriverInstance.DriverConfig JSON for the Galaxy driver instance in the Configuration DB. The local appsettings.json keeps only Cluster.NodeId + ClusterId + DB conn (per decision #18).

Migration script: for each existing v1 appsettings.json, generate the equivalent DriverConfig JSON and either insert via Admin UI or via a one-shot SQL script.

Acceptance:

  • Migration script runs against a v1 dev appsettings.json → produces a JSON blob that loads into the Galaxy DriverConfig field
  • The Galaxy driver instance starts with the migrated config and serves the same address space as v1

Stream E — Parity validation (1 week, gate)

Task E.1 — Run v1 IntegrationTests against v2 Galaxy topology

Per decision #56:

  • The same v1 IntegrationTests suite runs against the v2 build with Galaxy.Proxy + Galaxy.Host instead of in-process Galaxy
  • All tests must pass
  • Pass count = v1 baseline; failure count = 0; skip count = v1 baseline
  • Test duration may increase (IPC round-trip latency); document the deviation

Acceptance:

  • Test report shows pass/fail/skip counts identical to v1 baseline
  • Per-test duration regression report: any test that takes >2× v1 baseline is flagged for review (may be an IPC bottleneck)
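
The 2× duration check is simple enough to state as code. A minimal sketch, assuming per-test durations have already been parsed into dictionaries keyed by test name; the `DurationRegression` helper and its signature are hypothetical, and tests with no v1 baseline are skipped rather than flagged.

```csharp
using System;
using System.Collections.Generic;

// Sketch only: flags tests whose v2 duration exceeds factor x the v1 baseline.
public static class DurationRegression
{
    public static List<string> Flag(
        IDictionary<string, TimeSpan> v1,
        IDictionary<string, TimeSpan> v2,
        double factor)
    {
        var flagged = new List<string>();
        foreach (KeyValuePair<string, TimeSpan> pair in v2)
        {
            TimeSpan baseline;
            // Tests without a v1 baseline cannot regress and are ignored here.
            if (v1.TryGetValue(pair.Key, out baseline) && pair.Value.Ticks > baseline.Ticks * factor)
                flagged.Add(pair.Key);
        }
        flagged.Sort(StringComparer.Ordinal); // stable report ordering
        return flagged;
    }
}
```

Calling `Flag(v1, v2, 2.0)` yields the list for the regression report; the list being non-empty does not fail the gate by itself, it only marks tests for review.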

Task E.2 — Scripted Client.CLI walkthrough parity

Per decision #56:

  • Execute the captured Client.CLI script (recorded at Phase 2 entry gate against v1) against the v2 Galaxy topology
  • Diff the output against v1 reference
  • Differences allowed only in: timestamps, latency-measurement output. Any value, quality, browse path, or alarm shape difference = parity defect

Acceptance:

  • Walkthrough completes without errors
  • Output diff vs v1: only timestamp / latency lines differ

Task E.3 — Regression tests for the four 2026-04-13 stability findings

Per driver-specs.md Galaxy "Operational Stability Notes": each of the four findings closed in commits c76ab8f and 7310925 should have a regression test in the Phase 2 parity suite:

  • Phantom probe subscription flipping Tick() to Stopped (covered by Task B.5)
  • Cross-host quality clear wiping sibling state during recovery (covered by Task B.4)
  • Sync-over-async on the OPC UA stack thread → guard against new instances in GenericDriverNodeManager
  • Fire-and-forget alarm tasks racing shutdown → guard via the pre-shutdown drain ordering in Task B.3

Acceptance:

  • Each of the four scenarios has a named test in the parity suite
  • Each test fails on a hand-introduced regression (revert the v1 fix, see test fail)

Task E.4 — Adversarial review of the Phase 2 diff

Per implementation/overview.md exit gate:

  • Run /codex:adversarial-review --base v2 on the merged Phase 2 diff
  • Findings closed or explicitly deferred with rationale and ticket link

Compliance Checks (run at exit gate)

phase-2-compliance.ps1:

Schema compliance

N/A for Phase 2 — no schema changes (Configuration DB schema is unchanged from Phase 1).

Decision compliance

For each decision number Phase 2 implements (#11, #24, #25, #28, #29, #32, #34, #44, #46–47, #55–56, #62 (Galaxy-specific), #63–69, #76, #102), verify at least one citation exists in source, tests, or migrations. The script's list also includes #122, #123, and #124:

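# The range is concatenated separately: the comma operator binds tighter than '..',
# so '62, 63..69, 76' inside a single @() does not expand to 63 through 69.
$decisions = @(11, 24, 25, 28, 29, 32, 34, 44, 46, 47, 55, 56, 62) + (63..69) + @(76, 102, 122, 123, 124)
foreach ($d in $decisions) {
    # -w keeps "decision #11" from also matching "decision #110"
    $hits = git grep -w "decision #$d" -- 'src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/' 'tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.*/'
    if (-not $hits) { Write-Error "Decision #$d has no citation"; exit 1 }
}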

Visual compliance

N/A — no Admin UI changes in Phase 2 (Galaxy is just another DriverInstance in the Drivers tab).

Behavioral compliance — parity smoke test

The parity suite (Stream E) is the smoke test:

  1. v1 IntegrationTests pass count = baseline, fail count = 0
  2. Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
  3. Four regression tests for 2026-04-13 findings pass

Stability compliance

For Phase 2 (introduces the first Tier C driver in production form):

  • Galaxy.Host implements every Tier C cross-cutting protection from driver-stability.md:
    • SafeHandle wrappers for COM (Task B.3) ✓
    • Memory watchdog with Galaxy thresholds (Task B.7) ✓
    • Bounded operation queues per device (already in Core, Phase 1) ✓
    • Heartbeat between proxy and host on separate channel (Tasks A.2, B.6, C.2) ✓
    • Scheduled recycling with WM_QUIT escalation to hard exit (Task B.8) ✓
    • Crash-loop circuit breaker (Task C.3) ✓
    • Post-mortem MMF readable after hard crash (Task B.9) ✓
    • IPC ACL + caller SID verification + per-process shared secret (Task B.6) ✓

Each protection has at least one regression test. The compliance script enumerates and verifies presence:

$protections = @(
    @{Name="SafeHandle for COM"; Test="MxAccessHandleFinalizerReleasesCom"},
    @{Name="Memory watchdog"; Test="WatchdogTriggersRecycleAtThreshold"},
    @{Name="Heartbeat detection"; Test="ThreeMissedHeartbeatsDeclaresHostDead"},
    @{Name="WM_QUIT escalation"; Test="WedgedPumpEscalatesToHardExit"},
    @{Name="Crash-loop breaker"; Test="ThreeCrashesInFiveMinutesOpensCircuit"},
    @{Name="Post-mortem MMF"; Test="MmfSurvivesHardCrashAndIsReadable"},
    @{Name="Pipe ACL enforcement"; Test="NonServerSidConnectionRejected"},
    @{Name="Shared secret"; Test="ConnectionWithoutSecretRejected"}
)
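foreach ($p in $protections) {
    $hits = dotnet test --filter "FullyQualifiedName~$($p.Test)" --no-build --logger "console;verbosity=quiet"
    # A filter that matches zero tests can still exit 0 under VSTest, so also treat its "No test matches" warning as a failure.
    if ($LASTEXITCODE -ne 0 -or ($hits -match 'No test matches')) { Write-Error "Stability protection '$($p.Name)' has no passing test '$($p.Test)'"; exit 1 }
}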

Documentation compliance

  • Any deviation from the Galaxy deep dive in driver-stability.md reflected back; new decisions added with supersedes notes if needed
  • driver-specs.md §1 (Galaxy) updated to reflect the actual implementation if the IPC contract or recycle behavior differs from the design doc

Completion Checklist

Stream A — Driver.Galaxy.Shared

  • Project created (.NET Standard 2.0, MessagePack-only dependency)
  • All IPC contracts defined and round-trip tested
  • Hello-message version negotiation implemented
  • Reflection test confirms no .NET 10-only types leaked in

Stream B — Driver.Galaxy.Host

  • Project created (.NET 4.8 x86)
  • All Galaxy-specific code moved from legacy Host
  • STA thread + Win32 pump implemented; pump health probe wired up
  • MxAccessHandle : SafeHandle for COM lifetime
  • Subscription registry + reconnect with cross-host quality scoping
  • GalaxyRuntimeProbeManager rebuilt; phantom-probe regression test passes
  • Named-pipe IPC server with mandatory ACL + caller SID verification + per-process secret
  • Memory watchdog with Galaxy-specific thresholds
  • Recycle policy with 15s grace + WM_QUIT escalation to hard exit
  • Post-mortem MMF writer + supervisor reader
  • FaultShim test-only assembly for fault injection

Stream C — Driver.Galaxy.Proxy

  • Project created (.NET 10, depends on Core.Abstractions + Galaxy.Shared)
  • All capability interfaces implemented (IDriver, ITagDiscovery, IRediscoverable, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe)
  • Heartbeat sender on dedicated channel; missed-heartbeat detection
  • Supervisor with respawn-with-backoff + crash-loop circuit breaker (escalating cooldown 1h/4h/24h)
  • Address space build via IAddressSpaceBuilder produces byte-equivalent v1 output
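
The crash-loop circuit breaker in the supervisor item might look something like the sketch below. Only the 1h/4h/24h escalation comes from this plan. The class name CrashLoopBreaker, the trip condition (3 crashes within 5 minutes), and the choice never to reset the trip count are assumptions made for illustration. Respawn backoff between individual crashes is not modeled.

```csharp
using System;
using System.Collections.Generic;

// Demo with assumed thresholds: 3 crashes inside a 5-minute window trip the breaker.
var breaker = new CrashLoopBreaker(maxCrashes: 3, window: TimeSpan.FromMinutes(5));
var start = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);

for (var i = 0; i < 3; i++)
{
    // Crashes arrive one minute apart; the third one trips the breaker.
    TimeSpan? cooldown = breaker.RecordCrash(start.AddMinutes(i));
    Console.WriteLine(cooldown?.ToString() ?? "respawn");
}

public sealed class CrashLoopBreaker
{
    // Escalating cooldowns from the plan: 1h, then 4h, then 24h for every later trip.
    private static readonly TimeSpan[] Cooldowns =
    {
        TimeSpan.FromHours(1), TimeSpan.FromHours(4), TimeSpan.FromHours(24),
    };

    private readonly int _maxCrashes;
    private readonly TimeSpan _window;
    private readonly Queue<DateTime> _recentCrashes = new Queue<DateTime>();
    private int _trips;

    public CrashLoopBreaker(int maxCrashes, TimeSpan window)
    {
        _maxCrashes = maxCrashes;
        _window = window;
    }

    // Returns null when the supervisor may respawn, or the cooldown to wait when tripped.
    public TimeSpan? RecordCrash(DateTime utcNow)
    {
        _recentCrashes.Enqueue(utcNow);

        // Forget crashes that fell out of the sliding window.
        while (utcNow - _recentCrashes.Peek() > _window)
        {
            _recentCrashes.Dequeue();
        }

        if (_recentCrashes.Count < _maxCrashes)
        {
            return null;
        }

        // Tripped: start counting afresh and escalate the cooldown.
        _recentCrashes.Clear();
        var cooldown = Cooldowns[Math.Min(_trips, Cooldowns.Length - 1)];
        _trips++;
        return cooldown;
    }
}
```

Passing the clock in as a parameter keeps the breaker deterministic under test; a real supervisor would also need a rule for when the trip count resets.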

Stream D — Retire legacy OtOpcUa.Host

  • Legacy OtOpcUa.Host project deleted from solution
  • Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost)
  • Galaxy appsettings.json config migrated into central DB DriverConfig
  • Migration script tested against v1 dev config

Stream E — Parity validation

  • v1 IntegrationTests pass with count = baseline, failures = 0
  • Client.CLI walkthrough output matches v1 (modulo timestamps/latency)
  • All four 2026-04-13 stability findings have passing regression tests
  • Per-test duration regression report: no test >2× v1 baseline (or flagged for review)
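
One possible shape for the per-test duration regression report is sketched below. The 2× threshold is from the checklist. The type names (DurationReport, Regression), the dictionary input format, and the sample durations are hypothetical. How durations are collected from the v1 and v2 test runs is left open here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: test name -> duration in milliseconds.
var v1 = new Dictionary<string, double> { ["ReadTag"] = 40, ["BrowseRoot"] = 120 };
var v2 = new Dictionary<string, double> { ["ReadTag"] = 100, ["BrowseRoot"] = 130 };

foreach (var r in DurationReport.FindRegressions(v1, v2, maxRatio: 2.0))
{
    Console.WriteLine($"{r.Test}: {r.BaselineMs} ms -> {r.CurrentMs} ms ({r.Ratio:F2}x)");
}

public sealed record Regression(string Test, double BaselineMs, double CurrentMs)
{
    public double Ratio => CurrentMs / BaselineMs;
}

public static class DurationReport
{
    // Flags tests whose current duration exceeds maxRatio times the v1 baseline.
    // Tests without a positive baseline are skipped, since no ratio can be computed.
    public static IReadOnlyList<Regression> FindRegressions(
        IReadOnlyDictionary<string, double> baselineMs,
        IReadOnlyDictionary<string, double> currentMs,
        double maxRatio)
    {
        return currentMs
            .Where(kv => baselineMs.TryGetValue(kv.Key, out var b) && b > 0 && kv.Value > maxRatio * b)
            .Select(kv => new Regression(kv.Key, baselineMs[kv.Key], kv.Value))
            .OrderByDescending(r => r.Ratio)
            .ToList();
    }
}
```

With the sample data only ReadTag is flagged (100 ms against a 40 ms baseline); BrowseRoot stays under the threshold. Sorting by ratio puts the worst offenders first for review.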

Cross-cutting

  • phase-2-compliance.ps1 runs and exits 0
  • All 8 Tier C stability protections have named, passing tests
  • Adversarial review of the phase diff — findings closed or deferred with rationale
  • PR opened against v2, includes: link to this doc, link to exit-gate record, compliance script output, parity test report, adversarial review output
  • Reviewer signoff (one reviewer beyond the implementation lead)
  • exit-gate-phase-2.md recorded

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| IPC round-trip latency makes parity tests fail on timing assumptions | High | Medium | Per-test duration regression report identifies hot tests; tune timeouts in test config rather than in production code |
| MessagePack contract drift between Proxy and Host during development | Medium | High | Hello-message version negotiation rejects mismatched majors loudly; CI builds both projects in the same job |
| STA pump health probe is itself flaky and triggers spurious recycles | Medium | High | Probe interval tunable; default 10s gives 1000ms+ slack on a healthy pump; monitor via post-mortem MMF for false positives |
| Pipe ACL misconfiguration on installer leaves the IPC accessible to local users | Low | Critical | Defense-in-depth shared secret catches the case; ACL enumeration test in installer integration test |
| Galaxy.Host process recycle thrash if Galaxy or DB is intermittently unavailable | Medium | Medium | Crash-loop circuit breaker with escalating cooldown caps the thrash; Polly retry on the data path inside Host (not via supervisor restart) handles transient errors |
| Migration of appsettings.json Galaxy config to DB blob breaks existing deployments | Medium | Medium | Migration script is idempotent and dry-run-able; deploy script asserts central DB has the migrated config before stopping legacy Host |
| Phase 2 takes longer than 8 weeks | High | Medium | Mid-gate review at 4 weeks: if Stream B isn't past Task B.6 (IPC + ACL), defer Stream B.10 (FaultShim) to Phase 2.5 follow-up |
| Wonderware Historian SDK incompatibility with .NET 4.8 x86 in the new project layout | Low | High | Move and validate Historian loader as part of Task B.1 for an early signal if SDK has any project-shape sensitivity |
| Hard-exit on wedged pump leaks COM resources | Accepted | Low | Documented intent: hard exit is the only safe response; OS process exit reclaims fds and the OS COM cleanup is best-effort. CNC equivalent in FOCAS deep dive accepts the same trade-off |

Out of Scope (do not do in Phase 2)

  • Any non-Galaxy driver (Phase 3+)
  • UNS / Equipment-namespace work for Galaxy (Galaxy is SystemPlatform-namespace; no Equipment rows for Galaxy tags per decision #108)
  • Equipment-class template integration with the schemas repo (Galaxy doesn't use EquipmentClassRef)
  • Push-from-DB notification (decision #96 — v2.1)
  • Any change to OPC UA wire behavior visible to clients (parity is the gate)
  • Consumer cutover (ScadaBridge, Ignition, System Platform IO) — out of v2 scope, separate integration-team track per implementation/overview.md
  • Removing the v1 deployment from production (a v2 release decision, not Phase 2)