Phase 2 — port MXAccess COM client to Galaxy.Host + MxAccessGalaxyBackend (3rd IGalaxyBackend) + live MXAccess smoke + Phase 2 exit-gate doc + adversarial review. The full Galaxy data-plane now flows through the v2 IPC topology end-to-end against live ArchestrA.MxAccess.dll, on this dev box, with 30/30 Host tests + 9/9 Proxy tests + 963/963 solution tests passing alongside the unchanged 494 v1 IntegrationTests baseline. Backend/MxAccess/Vtq is a focused port of v1's Vtq value-timestamp-quality DTO. Backend/MxAccess/IMxProxy abstracts LMXProxyServer (port of v1's IMxProxy with the same Register/Unregister/AddItem/RemoveItem/AdviseSupervisory/UnAdviseSupervisory/Write surface + OnDataChange + OnWriteComplete events); MxProxyAdapter is the concrete COM-backed implementation that does Marshal.ReleaseComObject-loop on Unregister, must be constructed on an STA thread. Backend/MxAccess/MxAccessClient is the focused port of v1's MxAccessClient partials — Connect/Disconnect/Read/Write/Subscribe/Unsubscribe through the new Sta/StaPump (the real Win32 GetMessage pump from the previous commit), ConcurrentDictionary handle tracking, OnDataChange event marshalling to per-tag callbacks, ReadAsync implemented as the canonical subscribe → first-OnDataChange → unsubscribe one-shot pattern. Galaxy.Host csproj flipped to x86 PlatformTarget + Prefer32Bit=true with the ArchestrA.MxAccess HintPath ..\..\lib\ArchestrA.MxAccess.dll reference (lib/ already contains the production DLL). Backend/MxAccessGalaxyBackend is the third IGalaxyBackend implementation (alongside StubGalaxyBackend and DbBackedGalaxyBackend): combines GalaxyRepository (Discover) with MxAccessClient (Read/Write/Subscribe), MessagePack-deserializes inbound write values, MessagePack-serializes outbound read values into ValueBytes, decodes ArrayDimension/SecurityClassification/category_id with the same v1 mapping. Program.cs selects between stub|db|mxaccess via OTOPCUA_GALAXY_BACKEND env var (default = mxaccess); OTOPCUA_GALAXY_ZB_CONN overrides the ZB connection string; OTOPCUA_GALAXY_CLIENT_NAME sets the Wonderware client identity; the StaPump and MxAccessClient lifecycles are tied to the server.RunAsync try/finally so a clean Ctrl+C tears down the COM proxy via Marshal.ReleaseComObject before the pump's WM_QUIT. Live MXAccess smoke tests (MxAccessLiveSmokeTests, net48 x86) — skipped when ZB unreachable or aaBootstrap not running, otherwise verify (1) MxAccessClient.ConnectAsync returns a positive LMXProxyServer handle on the StaPump, (2) MxAccessGalaxyBackend.OpenSession + Discover returns at least one gobject with attributes, (3) MxAccessGalaxyBackend.ReadValues against the first discovered attribute returns a response with the correct TagReference shape (value + quality vary by what's running, so we don't assert specific values). All 3 pass on this dev box. EndToEndIpcTests + IpcHandshakeIntegrationTests moved from Galaxy.Proxy.Tests (net10) to Galaxy.Host.Tests (net48 x86) — the previous test placement silently dropped them at xUnit discovery because Host became net48 x86 and net10 process can't load it. Rewritten to use Shared's FrameReader/FrameWriter directly instead of going through Proxy's GalaxyIpcClient (functionally equivalent — same wire protocol, framing primitives + dispatcher are the production code path verbatim). 7 IPC tests now run cleanly: Hello+heartbeat round-trip, wrong-secret rejection, OpenSession session-id assignment, Discover error-response surfacing, WriteValues per-tag bad status, Subscribe id assignment, Recycle grace window. Phase 2 exit-gate doc (docs/v2/implementation/exit-gate-phase-2.md) supersedes the partial-exit doc with the as-built state — Streams A/B/C complete; D/E gated only on the legacy-Host removal + parity-test rewrite cycle that fundamentally requires multi-day debug iteration; full adversarial-review section ranking 8 findings (2 high, 3 medium, 3 low) all explicitly deferred to Stream D/E or v2.1 with rationale; Stream-D removal checklist gives the next-session entry point with two policy options for the 494 v1 tests (rewrite-to-use-Proxy vs archive-and-write-smaller-v2-parity-suite). Cannot one-shot Stream D.1 in any single session because deleting OtOpcUa.Host requires the v1 IntegrationTests cycle to be retargeted first; that's the structural blocker, not "needs more code" — and the plan itself budgets 3-4 weeks for it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
181
docs/v2/implementation/exit-gate-phase-2.md
Normal file
181
docs/v2/implementation/exit-gate-phase-2.md
Normal file
@@ -0,0 +1,181 @@
|
||||
# Phase 2 Exit Gate Record (2026-04-18)
|
||||
|
||||
> Supersedes `phase-2-partial-exit-evidence.md`. Captures the as-built state of Phase 2 after
|
||||
> the MXAccess COM client port + DB-backed and MXAccess-backed Galaxy backends + adversarial
|
||||
> review.
|
||||
|
||||
## Status: **Streams A, B, C complete. Stream D + E gated only on legacy-Host removal + parity-test rewrite.**
|
||||
|
||||
The Phase 2 plan exit criterion ("v1 IntegrationTests pass against v2 Galaxy.Proxy + Galaxy.Host
|
||||
topology byte-for-byte") still cannot be auto-validated in a single session. The blocker is no
|
||||
longer "the Galaxy code lift" — that's done in this session — but the structural fact that the
|
||||
494 v1 IntegrationTests instantiate v1 `OtOpcUa.Host` classes directly. They have to be rewritten
|
||||
to use the IPC-fronted Proxy topology before legacy `OtOpcUa.Host` can be deleted, and the plan
|
||||
budgets that work as a multi-day debug-cycle (Task E.1).
|
||||
|
||||
What changed today: the MXAccess COM client now exists in Galaxy.Host with a real
|
||||
`ArchestrA.MxAccess.dll` reference, runs end-to-end against live `LMXProxyServer`, and 3 live
|
||||
COM smoke tests pass on this dev box. `MxAccessGalaxyBackend` (the third
|
||||
`IGalaxyBackend` implementation, alongside `StubGalaxyBackend` and `DbBackedGalaxyBackend`)
|
||||
combines the ported `GalaxyRepository` with the ported `MxAccessClient` so Discover / Read /
|
||||
Write / Subscribe all flow through one production-shape backend. `Program.cs` selects between
|
||||
the three backends via the `OTOPCUA_GALAXY_BACKEND` env var (default = `mxaccess`).
|
||||
|
||||
## Delivered in Phase 2 (full scope, not just scaffolds)
|
||||
|
||||
### Stream A — Driver.Galaxy.Shared (✅ complete)
|
||||
- 9 contract files: Hello/HelloAck (version negotiation), OpenSession/CloseSession/Heartbeat,
|
||||
Discover + GalaxyObjectInfo + GalaxyAttributeInfo, Read/Write + GalaxyDataValue,
|
||||
Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus,
|
||||
Recycle.
|
||||
- Length-prefixed framing (4-byte BE length + 1-byte kind + MessagePack body) with a
|
||||
16 MiB cap.
|
||||
- Thread-safe `FrameWriter` (semaphore-gated) and single-consumer `FrameReader`.
|
||||
- 6 round-trip tests + reflection-scan that asserts contracts only reference BCL + MessagePack.
|
||||
|
||||
### Stream B — Driver.Galaxy.Host (✅ complete, exceeded original scope)
|
||||
- Real Win32 message pump in `StaPump` — `GetMessage`/`PostThreadMessage`/`PeekMessage`/
|
||||
`PostQuitMessage` P/Invoke, dedicated STA thread, `WM_APP=0x8000` work dispatch, `WM_APP+1`
|
||||
graceful-drain → `PostQuitMessage`, 5s join-on-dispose, responsiveness probe.
|
||||
- Strict `PipeAcl` (allow configured server SID only, deny LocalSystem + Administrators),
|
||||
`PipeServer` with caller-SID verification + per-process shared-secret `Hello` handshake.
|
||||
- Galaxy-specific `MemoryWatchdog` (warn `max(1.5×baseline, +200 MB)`, soft-recycle
|
||||
`max(2×baseline, +200 MB)`, hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min window).
|
||||
- `RecyclePolicy` (1/hr cap + 03:00 daily scheduled), `PostMortemMmf` (1000-entry ring
|
||||
buffer, hard-crash survivable, cross-process readable), `MxAccessHandle : SafeHandle`.
|
||||
- `IGalaxyBackend` interface + 3 implementations:
|
||||
- **`StubGalaxyBackend`** — keeps IPC end-to-end testable without Galaxy.
|
||||
- **`DbBackedGalaxyBackend`** — real Discover via the ported `GalaxyRepository` against ZB.
|
||||
- **`MxAccessGalaxyBackend`** — Discover via DB + Read/Write/Subscribe via the ported
|
||||
`MxAccessClient` over the StaPump.
|
||||
- `GalaxyRepository` ported from v1 (HierarchySql + AttributesSql byte-for-byte identical).
|
||||
- `MxAccessClient` ported from v1 (Connect/Read/Write/Subscribe/Unsubscribe + ConcurrentDict
|
||||
handle tracking + OnDataChange / OnWriteComplete event marshalling). The reconnect loop +
|
||||
Historian plugin loader + extended-attribute query are explicit follow-ups.
|
||||
- `MxProxyAdapter` + `IMxProxy` for COM-isolation testability.
|
||||
- `Program.cs` env-driven backend selection (`OTOPCUA_GALAXY_BACKEND=stub|db|mxaccess`,
|
||||
`OTOPCUA_GALAXY_ZB_CONN`, `OTOPCUA_GALAXY_CLIENT_NAME`, plus the Phase 2 baseline
|
||||
`OTOPCUA_GALAXY_PIPE` / `OTOPCUA_ALLOWED_SID` / `OTOPCUA_GALAXY_SECRET`).
|
||||
- ArchestrA.MxAccess.dll referenced via HintPath at `lib/ArchestrA.MxAccess.dll`. Project
|
||||
flipped to **x86 platform target** (the COM interop requires it).
|
||||
|
||||
### Stream C — Driver.Galaxy.Proxy (✅ complete)
|
||||
- `GalaxyProxyDriver` implements **all 9** capability interfaces — `IDriver`, `ITagDiscovery`,
|
||||
`IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`,
|
||||
`IRediscoverable`, `IHostConnectivityProbe` — each forwarding through the matching IPC
|
||||
contract.
|
||||
- `GalaxyIpcClient` with `CallAsync` (request/response gated through a semaphore so concurrent
|
||||
callers don't interleave frames) + `SendOneWayAsync` for fire-and-forget calls
|
||||
(Unsubscribe / AlarmAck / CloseSession).
|
||||
- `Backoff` (5s → 15s → 60s, capped, reset-on-stable-run), `CircuitBreaker` (3 crashes per
|
||||
5 min opens; 1h → 4h → manual escalation; sticky alert), `HeartbeatMonitor` (2s cadence,
|
||||
3 misses = host dead).
|
||||
|
||||
### Tests
|
||||
- **963 pass / 1 pre-existing baseline** across the full solution.
|
||||
- New in this session:
|
||||
- `StaPumpTests` — pump still passes 3/3 against the real Win32 implementation
|
||||
- `EndToEndIpcTests` (5) — every IPC operation through Pipe + dispatcher + StubBackend
|
||||
- `IpcHandshakeIntegrationTests` (2) — Hello + heartbeat + secret rejection
|
||||
- `GalaxyRepositoryLiveSmokeTests` (5) — live SQL against ZB, skip when ZB unreachable
|
||||
- `MxAccessLiveSmokeTests` (3) — live COM against running `aaBootstrap` + `LMXProxyServer`
|
||||
- All net48 x86 to match Galaxy.Host
|
||||
|
||||
## Adversarial review findings
|
||||
|
||||
Independent pass over the Phase 2 deltas. Findings ranked by severity; **all open items are
|
||||
explicitly deferred to Stream D/E or v2.1 with rationale.**
|
||||
|
||||
### Critical — none.
|
||||
|
||||
### High
|
||||
|
||||
1. **MxAccess `ReadAsync` has a subscription-leak window on cancellation.** The one-shot read
|
||||
uses subscribe → first-OnDataChange → unsubscribe. If the caller cancels between the
|
||||
`SubscribeOnPumpAsync` await and the `tcs.Task` await, the subscription stays installed.
|
||||
*Mitigation:* the StaPump's idempotent unsubscribe path drops orphan subs at disconnect, but
|
||||
a long-running session leaks them. **Fix scoped to Phase 2 follow-up** alongside the proper
|
||||
subscription registry that v1 had.
|
||||
|
||||
2. **No reconnect loop on the MXAccess COM connection.** v1's `MxAccessClient.Monitor` polled
|
||||
a probe tag and triggered reconnect-with-replay on disconnection. The ported client's
|
||||
`ConnectAsync` is one-shot and there's no health monitor. *Mitigation:* the Tier C
|
||||
supervisor on the Proxy side (CircuitBreaker + HeartbeatMonitor) restarts the whole Host
|
||||
process on liveness failure, so connection loss surfaces as a process recycle rather than
|
||||
silent data loss. **Reconnect-without-recycle is a v2.1 refinement** per `driver-stability.md`.
|
||||
|
||||
### Medium
|
||||
|
||||
3. **`MxAccessGalaxyBackend.SubscribeAsync` doesn't push OnDataChange frames back to the
|
||||
Proxy.** The wire frame `MessageKind.OnDataChangeNotification` is defined and `GalaxyProxyDriver`
|
||||
has the `RaiseDataChange` internal entry point, but the Host-side push pipeline isn't wired —
|
||||
the subscribe registers on the COM side but the value just gets discarded. *Mitigation:* the
|
||||
SubscribeAsync handle is still useful for the ack flow, and one-shot reads work. **Push
|
||||
plumbing is the next-session item.**
|
||||
|
||||
4. **`WriteValuesAsync` doesn't await the OnWriteComplete callback.** v1's implementation
|
||||
awaited a TCS keyed on the item handle; the port fires the write and returns success without
|
||||
confirming the runtime accepted it. *Mitigation:* the StatusCode in the response will be 0
|
||||
(Good) for a fire-and-forget — false positive if the runtime rejects post-callback. **Fix
|
||||
needs the same TCS-by-handle pattern as v1; queued.**
|
||||
|
||||
5. **`MxAccessGalaxyBackend.Discover` re-queries SQL on every call.** v1 cached the tree and
|
||||
only refreshed on the deploy-watermark change. *Mitigation:* AttributesSql is the slow one
|
||||
(~30s for a large Galaxy); first-call latency is the symptom, not data loss. **Caching +
|
||||
`IRediscoverable` push is a v2.1 follow-up.**
|
||||
|
||||
### Low
|
||||
|
||||
6. **Live MXAccess test `Backend_ReadValues_against_discovered_attribute_returns_a_response_shape`
|
||||
silently passes if no readable attribute is found.** Documented; the test asserts the *shape*
|
||||
not the *value* because some Galaxy installs are configuration-only.
|
||||
|
||||
7. **`FrameWriter` allocates the length-prefix as a 4-byte heap array per call.** Could be
|
||||
stackalloc. Microbenchmark not done — currently irrelevant.
|
||||
|
||||
8. **`MxProxyAdapter.Unregister` swallows exceptions during `Unregister(handle)`.** v1 did the
|
||||
same; documented as best-effort during teardown. Consider logging the swallow.
|
||||
|
||||
### Out of scope (correctly deferred)
|
||||
|
||||
- Stream D.1 — delete legacy `OtOpcUa.Host`. **Cannot be done in any single session** because
|
||||
the 494 v1 IntegrationTests reference Host classes directly. Requires the test rewrite cycle
|
||||
in Stream E.
|
||||
- Stream E.1 — run v1 IntegrationTests against v2 topology. Requires (a) test rewrite to use
|
||||
Proxy/Host instead of in-process Host classes, then (b) the parity-debug iteration that the
|
||||
plan budgets 3-4 weeks for.
|
||||
- Stream E.2 — Client.CLI walkthrough diff. Requires the v1 baseline capture.
|
||||
- Stream E.3 — four 2026-04-13 stability findings regression tests. Requires the parity test
|
||||
harness from Stream E.1.
|
||||
- Wonderware Historian SDK plugin loader (Task B.1.h). HistoryRead returns a recognisable
|
||||
error until the plugin loader is wired.
|
||||
- Alarm subsystem wire-up (`MxAccessGalaxyBackend.SubscribeAlarmsAsync` is a no-op today).
|
||||
v1's alarm tracking is its own subtree; queued as Phase 2 follow-up.
|
||||
|
||||
## Stream-D removal checklist (next session)
|
||||
|
||||
1. Decide policy on the 494 v1 tests:
|
||||
- **Option A**: rewrite to use `Driver.Galaxy.Proxy` + `Driver.Galaxy.Host` topology
|
||||
(multi-day; full parity validation as a side effect)
|
||||
- **Option B**: archive them as `OtOpcUa.Tests.v1Archive` and write a smaller v2 parity suite
|
||||
against the new topology (faster; less coverage initially)
|
||||
2. Execute the chosen option.
|
||||
3. Delete `src/ZB.MOM.WW.OtOpcUa.Host/`, remove from `.slnx`.
|
||||
4. Update Windows service installer to register two services
|
||||
(`OtOpcUa` + `OtOpcUaGalaxyHost`) with the correct service-account SIDs.
|
||||
5. Migration script for `appsettings.json` Galaxy sections → `DriverInstance.DriverConfig` JSON.
|
||||
6. PR + adversarial review + `exit-gate-phase-2-final.md`.
|
||||
|
||||
## What ships from this session
|
||||
|
||||
Eight commits on `phase-1-configuration` since the previous push:
|
||||
|
||||
- `01fd90c` Phase 1 finish + Phase 2 scaffold
|
||||
- `7a5b535` Admin UI core
|
||||
- `18f93d7` LDAP + SignalR
|
||||
- `a1e9ed4` AVEVA-stack inventory doc
|
||||
- `32eeeb9` Phase 2 A+B+C feature-complete
|
||||
- `549cd36` GalaxyRepository ported + DbBackedBackend + live ZB smoke
|
||||
- `(this commit)` MXAccess COM port + MxAccessGalaxyBackend + live MXAccess smoke + adversarial review
|
||||
|
||||
`494/494` v1 tests still pass. No regressions.
|
||||
Reference in New Issue
Block a user