lmxopcua/docs/v2/implementation/phase-2-partial-exit-evidence.md
Joseph Doherty 01fd90c178 Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. 
Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). 
Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:35:25 -04:00

# Phase 2 — Partial Exit Evidence (2026-04-17)
> This records what Phase 2 of v2 completed in the current session and what was explicitly
> deferred. See `phase-2-galaxy-out-of-process.md` for the full task plan; this is the as-built
> delta.
## Status: **Streams A + B + C scaffolded and test-green. Streams D + E deferred.**
The goal per the plan is "parity, not regression" — the phase exit gate requires v1
IntegrationTests to pass against the v2 Galaxy.Proxy + Galaxy.Host topology byte-for-byte.
Achieving that requires a live MXAccess runtime plus the Galaxy code lift out of the legacy
`OtOpcUa.Host`. Both need a running dev Galaxy and a parity test cycle to verify.
Without that cycle, deleting the legacy Host would break the 494 passing v1 tests that are the
parity baseline.
What *is* done: all the scaffolding, IPC contracts, supervisor logic, and stability protections
that the real MXAccess code will attach to. Every piece has unit-level or IPC-level test
coverage.
## Delivered
### Stream A — `Driver.Galaxy.Shared` (1 week estimate, **complete**)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` (.NET Standard 2.0, MessagePack-only
dependency)
- **Contracts**: `Hello`/`HelloAck` (version negotiation per Task A.3), `OpenSessionRequest`/
`OpenSessionResponse`/`CloseSessionRequest`, `Heartbeat`/`HeartbeatAck`, `ErrorResponse`,
`DiscoverHierarchyRequest`/`Response` + `GalaxyObjectInfo` + `GalaxyAttributeInfo`,
`ReadValuesRequest`/`Response`, `WriteValuesRequest`/`Response`, `SubscribeRequest`/
`Response`/`UnsubscribeRequest`/`OnDataChangeNotification`, `AlarmSubscribeRequest`/
`GalaxyAlarmEvent`/`AlarmAckRequest`, `HistoryReadRequest`/`Response`+`HistoryTagValues`,
`HostConnectivityStatus`+`RuntimeStatusChangeNotification`, `RecycleHostRequest`/
`RecycleStatusResponse`
- **Framing**: length-prefixed (decision #28) + 1-byte kind tag + MessagePack body. 16 MiB
body cap. `FrameWriter`/`FrameReader` with thread-safe write gate.
- **Tests (6)**: reflection-scan round-trip for every `[MessagePackObject]`, referenced-
assemblies guard (only MessagePack allowed outside BCL), Hello version defaults,
`FrameWriter`↔`FrameReader` interop, oversize-frame rejection.
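
The framing described above can be sketched roughly as follows. This is a minimal illustration, not the project's `FrameWriter`/`FrameReader`: the 4-byte big-endian length prefix and the `FrameSketch` name are assumptions (the text only says "length-prefixed"), and the thread-safe write gate is omitted.

```csharp
using System;
using System.IO;

// Hypothetical sketch of length-prefixed framing with a 1-byte kind tag.
// Assumed layout: [4-byte big-endian body length][1-byte kind][body].
public static class FrameSketch
{
    public const int MaxBodyBytes = 16 * 1024 * 1024; // 16 MiB body cap from the text

    public static void Write(Stream stream, byte kind, byte[] body)
    {
        if (body.Length > MaxBodyBytes)
            throw new InvalidDataException("Frame body exceeds the 16 MiB cap.");

        var header = new byte[5];
        header[0] = (byte)(body.Length >> 24);
        header[1] = (byte)(body.Length >> 16);
        header[2] = (byte)(body.Length >> 8);
        header[3] = (byte)body.Length;
        header[4] = kind;
        stream.Write(header, 0, header.Length);
        stream.Write(body, 0, body.Length);
    }

    public static (byte Kind, byte[] Body) Read(Stream stream)
    {
        var header = ReadExactly(stream, 5);
        int length = (header[0] << 24) | (header[1] << 16) | (header[2] << 8) | header[3];

        // Reject the frame before allocating a buffer for an oversized or negative length.
        if (length < 0 || length > MaxBodyBytes)
            throw new InvalidDataException("Declared frame length is out of range.");

        return (header[4], ReadExactly(stream, length));
    }

    // Stream.Read may return fewer bytes than requested, so loop until the count is met.
    private static byte[] ReadExactly(Stream stream, int count)
    {
        var buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException("Stream ended mid-frame.");
            offset += read;
        }
        return buffer;
    }
}
```

Checking the declared length before reading the body is what makes the cap useful as a protection: a hostile or corrupt peer cannot force a large allocation just by sending a big length prefix.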
### Stream B — `Driver.Galaxy.Host` (3–4 week estimate, **scaffold complete; MXAccess lift deferred**)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` (.NET Framework 4.8 AnyCPU — flips to x86 when
the Galaxy code lift happens per Task B.1 scope)
- **`Ipc/PipeAcl`**: builds the strict `PipeSecurity` — allow configured server-principal SID,
explicit deny on LocalSystem + Administrators, owner = allowed SID (decision #76).
- **`Ipc/PipeServer`**: named-pipe server that (1) enforces the ACL, (2) verifies caller SID
via `pipe.RunAsClient` + `WindowsIdentity.GetCurrent`, (3) requires the per-process shared
secret in the Hello frame before any other RPC, (4) rejects major-version mismatches.
- **`Stability/MemoryWatchdog`**: Galaxy thresholds — warn at `max(1.5×baseline, +200 MB)`,
soft-recycle at `max(2×baseline, +200 MB)`, hard ceiling 1.5 GB, slope ≥5 MB/min over 30 min.
Pluggable RSS source for unit testability.
- **`Stability/RecyclePolicy`**: 1-recycle/hr cap; 03:00 local daily scheduled recycle.
- **`Stability/PostMortemMmf`**: ring buffer of 1000 × 256-byte entries in
`%ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf`. Single-writer / multi-reader. Survives hard crash;
supervisor reads the MMF via a second process.
- **`Sta/MxAccessHandle`**: `SafeHandle` subclass — `ReleaseHandle` calls `Marshal.ReleaseComObject`
in a loop until refcount = 0 then invokes the optional `unregister` callback. Finalizer-safe.
Wraps any RCW via `object` so we can unit-test against a mock; the real wiring to
`ArchestrA.MxAccess.LMXProxyServer` lands with the deferred code move.
- **`Sta/StaPump`**: dedicated STA thread with `BlockingCollection` work queue + `InvokeAsync`
dispatch. Responsiveness probe (`IsResponsiveAsync`) returns false on wedge. The real
Win32 `GetMessage/DispatchMessage` pump from v1 `LmxProxy.Host` slots in here with the same
dispatch semantics.
- **`IsExternalInit` shim**: required for `init` setters on .NET 4.8.
- **`Program.cs`**: reads `OTOPCUA_GALAXY_PIPE`, `OTOPCUA_ALLOWED_SID`, `OTOPCUA_GALAXY_SECRET`
from env (supervisor sets at spawn), runs the pipe server, logs via Serilog to
`%ProgramData%\OtOpcUa\galaxy-host-YYYY-MM-DD.log`.
- **`Ipc/StubFrameHandler`**: placeholder that heartbeat-acks and returns `not-implemented`
errors. Swapped for the real Galaxy-backed handler when the MXAccess code move completes.
- **Tests (15)**: `MemoryWatchdog` thresholds + slope detection; `RecyclePolicy` cap + daily
schedule; `PostMortemMmf` round-trip + ring-wrap + truncation-safety; `StaPump`
apartment-state + responsiveness-probe wedge detection.
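
As an illustration of the `MemoryWatchdog` thresholds listed above, here is a minimal sketch. It assumes that "+200 MB" means baseline plus 200 MB, that thresholds are inclusive, and that the slope is computed from the oldest and newest samples in the window. The `MemoryWatchdogSketch` and `WatchdogAction` names are hypothetical and not taken from the project.

```csharp
using System;

// Hypothetical action names; the source only names warn, soft-recycle, and hard ceiling.
public enum WatchdogAction { None, Warn, SoftRecycle, HardCeiling }

public static class MemoryWatchdogSketch
{
    private const long MB = 1024L * 1024L;
    public const long HardCeilingBytes = 1536L * MB; // 1.5 GB hard ceiling

    // Classifies the current RSS against the baseline-relative thresholds.
    public static WatchdogAction Evaluate(long baselineBytes, long currentBytes)
    {
        // Assumption: "+200 MB" is read as baseline + 200 MB.
        long warnAt = Math.Max(baselineBytes * 3 / 2, baselineBytes + 200 * MB);
        long softAt = Math.Max(baselineBytes * 2, baselineBytes + 200 * MB);

        if (currentBytes >= HardCeilingBytes) return WatchdogAction.HardCeiling;
        if (currentBytes >= softAt) return WatchdogAction.SoftRecycle;
        if (currentBytes >= warnAt) return WatchdogAction.Warn;
        return WatchdogAction.None;
    }

    // True when growth across the window averages at least 5 MB per minute.
    public static bool SlopeExceeded(long oldestBytes, long newestBytes, TimeSpan window)
    {
        if (window <= TimeSpan.Zero) return false;
        double mbPerMinute = (newestBytes - oldestBytes) / (double)MB / window.TotalMinutes;
        return mbPerMinute >= 5.0;
    }
}
```

Under this reading, baselines below 200 MB give the same value for the warn and soft-recycle thresholds, so the sketch reports `SoftRecycle` directly in that case. Whether the real watchdog behaves that way is not stated in the source.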
### Stream C — `Driver.Galaxy.Proxy` (1.5 week estimate, **complete as IPC-forwarder**)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` (.NET 10)
- **`Ipc/GalaxyIpcClient`**: Hello handshake + shared-secret authentication + single-call
request/response over the data-plane pipe. Serializes concurrent callers via
`SemaphoreSlim`. Lifts `ErrorResponse` to `GalaxyIpcException` with the error code.
- **`GalaxyProxyDriver`**: implements `IDriver` + `ITagDiscovery`. Forwards lifecycle and
discovery over IPC; maps Galaxy MX data types → `DriverDataType` and security classifications
→ `SecurityClassification`. The remaining capability interfaces in the Stream C plan
(`IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`,
`IHostConnectivityProbe`, `IRediscoverable`) are structured identically; they get wired in once
the Host's MXAccess backend exists so the round-trips can actually serve data.
- **`Supervisor/Backoff`**: 5s → 15s → 60s capped; `RecordStableRun` resets after 2-min
successful run.
- **`Supervisor/CircuitBreaker`**: 3 crashes per 5 min opens; cooldown escalates
1h → 4h → manual (`TimeSpan.MaxValue`). Sticky alert doesn't auto-clear when cooldown
elapses; `ManualReset` only.
- **`Supervisor/HeartbeatMonitor`**: 2s cadence, 3 consecutive misses = host dead.
- **Tests (11)**: `Backoff` sequence + reset; `CircuitBreaker` full 1h/4h/manual escalation
path; `HeartbeatMonitor` miss-count + ack-reset; full IPC handshake round-trip
(Host + Proxy over a real named pipe, heartbeat ack verified; shared-secret mismatch
rejected with `UnauthorizedAccessException`).
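
The `CircuitBreaker` rules above (3 crashes per 5 min, 1h → 4h → manual escalation, sticky alert) can be sketched as below. This is an illustration under assumptions, not the project's class: the `CircuitBreakerSketch` name and its members other than `ManualReset` are hypothetical, and it assumes `ManualReset` also clears the escalation level, which the source does not state.

```csharp
using System;
using System.Collections.Generic;

public sealed class CircuitBreakerSketch
{
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(5);
    private static readonly TimeSpan[] Cooldowns =
        { TimeSpan.FromHours(1), TimeSpan.FromHours(4), TimeSpan.MaxValue };

    private readonly Queue<DateTime> _crashes = new Queue<DateTime>();
    private int _trips;             // number of times the breaker has opened
    private DateTime? _openedAtUtc; // when it last opened

    // Stays true after the cooldown elapses; only ManualReset clears it.
    public bool StickyAlert { get; private set; }

    public TimeSpan CurrentCooldown =>
        _trips == 0 ? TimeSpan.Zero : Cooldowns[Math.Min(_trips, Cooldowns.Length) - 1];

    public void RecordCrash(DateTime nowUtc)
    {
        _crashes.Enqueue(nowUtc);
        // Drop crashes that fell out of the 5-minute window.
        while (_crashes.Count > 0 && nowUtc - _crashes.Peek() > Window)
            _crashes.Dequeue();

        if (_crashes.Count >= 3)
        {
            _trips++;
            _openedAtUtc = nowUtc;
            StickyAlert = true;
            _crashes.Clear();
        }
    }

    // Whether the supervisor may respawn the host at the given time.
    public bool MayRestart(DateTime nowUtc)
    {
        if (_openedAtUtc is null) return true;
        TimeSpan cooldown = CurrentCooldown;
        if (cooldown == TimeSpan.MaxValue) return false; // manual intervention only
        return nowUtc - _openedAtUtc.Value >= cooldown;
    }

    // Assumption: a manual reset clears the alert and the escalation level.
    public void ManualReset()
    {
        _trips = 0;
        _openedAtUtc = null;
        StickyAlert = false;
        _crashes.Clear();
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic under test, which mirrors the pluggable-source approach the document describes for `MemoryWatchdog`.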
## Deferred (explicitly noted as TODO)
### Stream D — Retire legacy `OtOpcUa.Host`
**Not executable until Stream E parity passes.** Deleting the legacy project now would break
the 494 v1 IntegrationTests that are the parity baseline. Unblocking it requires:
1. Host MXAccess code lift (Task B.1 "move Galaxy code") from `OtOpcUa.Host/` into
`OtOpcUa.Driver.Galaxy.Host/` — STA pump wiring, `MxAccessHandle` backing the real
`LMXProxyServer`, `GalaxyRepository` and its SQL queries, `GalaxyRuntimeProbeManager`,
Historian loader, and the IPC stub handler replaced by a real `IFrameHandler` that invokes
the handle.
2. Address-space build via `IAddressSpaceBuilder` produces byte-equivalent OPC UA browse
output to v1 (Task C.4).
3. Windows service installer registers two services (`OtOpcUa` + `OtOpcUaGalaxyHost`) with
the correct service-account SIDs and per-process secret provisioning. Galaxy.Host starts
before OtOpcUa.
4. `appsettings.json` Galaxy config (MxAccess / Galaxy / Historian sections) migrated into
`DriverInstance.DriverConfig` JSON in the Configuration DB via an idempotent migration
script. Post-migration, the local `appsettings.json` keeps only `Cluster.NodeId`,
`ClusterId`, and the DB conn string per decision #18.
### Stream E — Parity validation
Requires live MXAccess + Galaxy runtime and the above lift complete. Work items:
- Run v1 IntegrationTests against the v2 Galaxy.Proxy + Galaxy.Host topology. Pass count =
v1 baseline; failures = 0. Per-test duration regression report flags any test >2× baseline.
- Scripted Client.CLI walkthrough recorded at Phase 2 entry gate against v1, replayed
against v2; diff must show only timestamp/latency differences.
- Regression tests for the four 2026-04-13 stability findings (phantom probe, cross-host
quality clear, sync-over-async guard, fire-and-forget alarm drain).
- `/codex:adversarial-review --base v2` on the merged Phase 2 diff — findings closed or
deferred with rationale.
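
The per-test duration check in the first work item could look something like the sketch below. It is only an illustration: the `DurationRegressionSketch` name, the dictionary inputs, and the choice to ignore tests that have no baseline entry are assumptions, since the source does not say how durations are collected or reported.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DurationRegressionSketch
{
    // Returns the names of tests whose current duration is strictly more than
    // `factor` times their baseline duration, sorted by name.
    // Tests that have no baseline entry are skipped (an assumption of this sketch).
    public static IReadOnlyList<string> FlagSlowTests(
        IReadOnlyDictionary<string, TimeSpan> baseline,
        IReadOnlyDictionary<string, TimeSpan> current,
        double factor = 2.0)
    {
        return current
            .Where(kv => baseline.TryGetValue(kv.Key, out var b)
                         && kv.Value.Ticks > b.Ticks * factor)
            .Select(kv => kv.Key)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }
}
```

A test that runs at exactly 2× its baseline is not flagged here, matching the ">2× baseline" wording.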
## Also deferred from Stream B
- **Task B.10 FaultShim** (test-only `ArchestrA.MxAccess` substitute for fault injection).
Needs the production `ArchestrA.MxAccess` reference in place first; flagged as part of the
plan's "mid-gate review" fallback (Risk row 7).
- **Task B.8 WM_QUIT hard-exit escalation** — wired in when the real Win32 pump replaces the
`BlockingCollection` dispatcher. The `StaPump.IsResponsiveAsync` probe already exists; the
supervisor escalation-to-`Environment.Exit(2)` belongs to the Program main loop after the
pump integration.
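
For context, the `BlockingCollection` dispatcher and responsiveness probe that this escalation builds on can be sketched as follows. This is a simplified illustration and not the project's `StaPump`: the `StaPumpSketch` name, the constructor shape, and the timeout parameter on `IsResponsiveAsync` are assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

public sealed class StaPumpSketch : IDisposable
{
    private readonly BlockingCollection<Action> _queue = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public StaPumpSketch()
    {
        // Every work item runs on this one dedicated thread, in submission order.
        _thread = new Thread(() =>
        {
            foreach (var work in _queue.GetConsumingEnumerable()) work();
        })
        { IsBackground = true };

        // STA apartments only exist on Windows, so guard the call for other platforms.
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            _thread.SetApartmentState(ApartmentState.STA);
        _thread.Start();
    }

    public Task<T> InvokeAsync<T>(Func<T> func)
    {
        // RunContinuationsAsynchronously keeps awaiting callers off the pump thread.
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        _queue.Add(() =>
        {
            try { tcs.SetResult(func()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        return tcs.Task;
    }

    // Queues a no-op and reports false if it does not complete within the timeout,
    // which indicates the pump thread is wedged on an earlier work item.
    public async Task<bool> IsResponsiveAsync(TimeSpan timeout)
    {
        Task<bool> probe = InvokeAsync(() => true);
        Task winner = await Task.WhenAny(probe, Task.Delay(timeout)).ConfigureAwait(false);
        return winner == probe;
    }

    // Stops accepting work; the thread exits once the queue drains.
    public void Dispose() => _queue.CompleteAdding();
}
```

In a design like this, the supervisor-side escalation described above would sit outside the pump: a main loop could poll `IsResponsiveAsync` and call `Environment.Exit(2)` when it returns false.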
## Cross-session impact on the build
- **Full solution**: 926 tests pass, 1 fails (pre-existing Phase 0 baseline
`Client.CLI.Tests.SubscribeCommandTests.Execute_PrintsSubscriptionMessage` — not a Phase 2
regression; was red before Phase 1 and stays red through Phase 2).
- **New projects added to `.slnx`**: `Driver.Galaxy.Shared`, `Driver.Galaxy.Host`,
`Driver.Galaxy.Proxy`, plus the three matching test projects.
- **No existing tests broke.** The 494 v1 `OtOpcUa.Tests` (net48) and 6 `IntegrationTests`
(net48) still pass because the legacy `OtOpcUa.Host` is untouched.
## Next-session checklist for Stream D + E
1. Stand up dev Galaxy; capture Client.CLI walkthrough baseline against v1.
2. Move Galaxy-specific files from `OtOpcUa.Host` into `Driver.Galaxy.Host`, renaming
namespaces. Replace `StubFrameHandler` with the real one.
3. Wire up the real Win32 pump inside `StaPump` (lift from scadalink-design's
`LmxProxy.Host` reference per CLAUDE.md).
4. Run v1 IntegrationTests against the v2 topology — iterate on parity defects until green.
5. Run Client.CLI walkthrough and diff.
6. Regression tests for the four stability findings.
7. Delete legacy `OtOpcUa.Host`; update `.slnx`; update installer scripts.
8. Adversarial review; `exit-gate-phase-2.md` recorded; PR merged.