lmxopcua/docs/v2/implementation/phase-2-partial-exit-evidence.md
Joseph Doherty 01fd90c178 Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. 
Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). 
Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:35:25 -04:00


Phase 2 — Partial Exit Evidence (2026-04-17)

This records what Phase 2 of v2 completed in the current session and what was explicitly deferred. See phase-2-galaxy-out-of-process.md for the full task plan; this is the as-built delta.

Status: Streams A + B + C scaffolded and test-green. Streams D + E deferred.

The goal per the plan is "parity, not regression" — the phase exit gate requires v1 IntegrationTests to pass against the v2 Galaxy.Proxy + Galaxy.Host topology byte-for-byte. Achieving that requires live MXAccess runtime plus the Galaxy code lift out of the legacy OtOpcUa.Host. Both are operations that need a dev Galaxy up and a parity test cycle to verify. Without that cycle, deleting the legacy Host would break the 494 passing v1 tests that are the parity baseline.

What is done: all the scaffolding, IPC contracts, supervisor logic, and stability protections that the real MXAccess code will attach to. Every piece has unit-level or IPC-level test coverage.

Delivered

Stream A — Driver.Galaxy.Shared (1 week estimate, complete)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/ (.NET Standard 2.0, MessagePack-only dependency)
  • Contracts: Hello/HelloAck (version negotiation per Task A.3), OpenSessionRequest/OpenSessionResponse/CloseSessionRequest, Heartbeat/HeartbeatAck, ErrorResponse, DiscoverHierarchyRequest/Response + GalaxyObjectInfo + GalaxyAttributeInfo, ReadValuesRequest/Response, WriteValuesRequest/Response, SubscribeRequest/Response/UnsubscribeRequest/OnDataChangeNotification, AlarmSubscribeRequest/GalaxyAlarmEvent/AlarmAckRequest, HistoryReadRequest/Response + HistoryTagValues, HostConnectivityStatus + RuntimeStatusChangeNotification, RecycleHostRequest/RecycleStatusResponse
  • Framing: length-prefixed (decision #28) + 1-byte kind tag + MessagePack body. 16 MiB body cap. FrameWriter/FrameReader with thread-safe write gate.
  • Tests (6): reflection-scan round-trip for every [MessagePackObject], referenced-assemblies guard (only MessagePack allowed outside BCL), Hello version defaults, FrameWriter/FrameReader interop, oversize-frame rejection.
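
The framing bullet above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's C# FrameWriter/FrameReader. The 4-byte big-endian length prefix, the choice to count only the body in that length, and the field order (length, then kind tag, then body) are assumptions made for illustration; the source only states "length-prefixed + 1-byte kind tag + MessagePack body" with a 16 MiB body cap and a thread-safe write gate.

```python
import io
import struct
import threading

MAX_BODY_BYTES = 16 * 1024 * 1024  # 16 MiB body cap stated in the framing bullet


class FrameWriter:
    """Writes frames one at a time; the lock plays the role of the write gate."""

    def __init__(self, stream):
        self._stream = stream
        self._gate = threading.Lock()

    def write(self, kind: int, body: bytes) -> None:
        if not 0 <= kind <= 0xFF:
            raise ValueError("kind tag must fit in one byte")
        if len(body) > MAX_BODY_BYTES:
            raise ValueError("frame body exceeds the 16 MiB cap")
        # ASSUMPTION: 4-byte big-endian body length, then the 1-byte kind tag.
        header = struct.pack(">IB", len(body), kind)
        with self._gate:  # header and body are never interleaved across threads
            self._stream.write(header + body)


def _read_exact(stream, count: int) -> bytes:
    """Reads exactly `count` bytes or raises EOFError if the stream ends early."""
    buf = bytearray()
    while len(buf) < count:
        chunk = stream.read(count - len(buf))
        if not chunk:
            raise EOFError("stream closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def read_frame(stream):
    """Returns (kind, body); rejects oversize frames before reading the body."""
    length, kind = struct.unpack(">IB", _read_exact(stream, 5))
    if length > MAX_BODY_BYTES:
        raise ValueError("oversize frame rejected")
    return kind, _read_exact(stream, length)
```

Checking the declared length before allocating or reading the body is what makes the oversize-frame rejection cheap: a hostile or corrupt peer cannot force a large read just by sending a big length field.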

Stream B — Driver.Galaxy.Host (3-4 week estimate, scaffold complete; MXAccess lift deferred)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ (.NET Framework 4.8 AnyCPU — flips to x86 when the Galaxy code lift happens per Task B.1 scope)
  • Ipc/PipeAcl: builds the strict PipeSecurity — allow configured server-principal SID, explicit deny on LocalSystem + Administrators, owner = allowed SID (decision #76).
  • Ipc/PipeServer: named-pipe server that (1) enforces the ACL, (2) verifies caller SID via pipe.RunAsClient + WindowsIdentity.GetCurrent, (3) requires the per-process shared secret in the Hello frame before any other RPC, (4) rejects major-version mismatches.
  • Stability/MemoryWatchdog: Galaxy thresholds — warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30 min. Pluggable RSS source for unit testability.
  • Stability/RecyclePolicy: 1-recycle/hr cap; 03:00 local daily scheduled recycle.
  • Stability/PostMortemMmf: ring buffer of 1000 × 256-byte entries in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf. Single-writer / multi-reader. Survives hard crash; supervisor reads the MMF via a second process.
  • Sta/MxAccessHandle: SafeHandle subclass — ReleaseHandle calls Marshal.ReleaseComObject in a loop until refcount = 0 then invokes the optional unregister callback. Finalizer-safe. Wraps any RCW via object so we can unit-test against a mock; the real wiring to ArchestrA.MxAccess.LMXProxyServer lands with the deferred code move.
  • Sta/StaPump: dedicated STA thread with BlockingCollection work queue + InvokeAsync dispatch. Responsiveness probe (IsResponsiveAsync) returns false on wedge. The real Win32 GetMessage/DispatchMessage pump from v1 LmxProxy.Host slots in here with the same dispatch semantics.
  • IsExternalInit shim: required for init setters on .NET 4.8.
  • Program.cs: reads OTOPCUA_GALAXY_PIPE, OTOPCUA_ALLOWED_SID, OTOPCUA_GALAXY_SECRET from env (supervisor sets at spawn), runs the pipe server, logs via Serilog to %ProgramData%\OtOpcUa\galaxy-host-YYYY-MM-DD.log.
  • Ipc/StubFrameHandler: placeholder that heartbeat-acks and returns not-implemented errors. Swapped for the real Galaxy-backed handler when the MXAccess code move completes.
  • Tests (15): MemoryWatchdog thresholds + slope detection; RecyclePolicy cap + daily schedule; PostMortemMmf round-trip + ring-wrap + truncation-safety; StaPump apartment-state + responsiveness-probe wedge detection.
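
The MemoryWatchdog thresholds listed above can be read as the following decision logic. This is a hedged Python sketch, not the C# class. It assumes that "+200 MB" means baseline + 200 MB, that "1.5 GB" means 1536 MiB, and that the slope is measured between the oldest and newest samples inside the 30-minute window. The function names and the sample format are hypothetical, and the source does not say which action a slope breach triggers, so the sketch reports it separately.

```python
MB = 1024 * 1024
HARD_CEILING = 1536 * MB  # ASSUMPTION: "1.5 GB" interpreted as 1536 MiB


def classify(baseline: int, rss: int) -> str:
    """Maps a resident-set size (bytes) to a watchdog level for a given baseline."""
    floor = baseline + 200 * MB  # ASSUMPTION: "+200 MB" is relative to baseline
    warn_at = max(baseline * 3 // 2, floor)  # max(1.5 x baseline, +200 MB)
    soft_at = max(baseline * 2, floor)       # max(2 x baseline, +200 MB)
    if rss >= HARD_CEILING:
        return "hard-ceiling"
    if rss >= soft_at:
        return "soft-recycle"
    if rss >= warn_at:
        return "warn"
    return "ok"


def slope_exceeded(samples, window_min: float = 30.0, limit_mb_per_min: float = 5.0) -> bool:
    """samples: (minute, rss_bytes) pairs, oldest first (hypothetical format)."""
    if len(samples) < 2:
        return False
    newest = samples[-1][0]
    window = [s for s in samples if newest - s[0] <= window_min]
    (t0, r0), (t1, r1) = window[0], window[-1]
    if t1 <= t0:
        return False
    return (r1 - r0) / MB / (t1 - t0) >= limit_mb_per_min
```

Under this reading, a small baseline (below 200 MB) gives the same value for the warn and soft-recycle thresholds, so the warn level is effectively skipped; whether the real implementation behaves that way is not stated in the source.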

Stream C — Driver.Galaxy.Proxy (1.5 week estimate, complete as IPC-forwarder)

  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ (.NET 10)
  • Ipc/GalaxyIpcClient: Hello handshake + shared-secret authentication + single-call request/response over the data-plane pipe. Serializes concurrent callers via SemaphoreSlim. Lifts ErrorResponse to GalaxyIpcException with the error code.
  • GalaxyProxyDriver: implements IDriver + ITagDiscovery. Forwards lifecycle and discovery over IPC; maps Galaxy MX data types → DriverDataType and security classifications → SecurityClassification. The remaining capability interfaces in the Stream C plan (IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe, IRediscoverable) follow the same forwarding structure; they get wired in once the Host's MXAccess backend exists, so that the round-trips can actually serve data.
  • Supervisor/Backoff: 5s → 15s → 60s capped; RecordStableRun resets after 2-min successful run.
  • Supervisor/CircuitBreaker: 3 crashes per 5 min opens; cooldown escalates 1h → 4h → manual (TimeSpan.MaxValue). Sticky alert doesn't auto-clear when cooldown elapses; ManualReset only.
  • Supervisor/HeartbeatMonitor: 2s cadence, 3 consecutive misses = host dead.
  • Tests (11): Backoff sequence + reset; CircuitBreaker full 1h/4h/manual escalation path; HeartbeatMonitor miss-count + ack-reset; full IPC handshake round-trip (Host + Proxy over a real named pipe, heartbeat ack verified; shared-secret mismatch rejected with UnauthorizedAccessException).
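
The Backoff and HeartbeatMonitor rules above are small state machines, sketched here in Python under stated assumptions. The class and method names mirror the bullets, but the signatures are hypothetical; in particular, passing the run duration into `record_stable_run` is an illustrative choice, not the C# API.

```python
class Backoff:
    """Restart delays of 5s, 15s, then 60s (capped); reset after a stable run."""

    STEPS_SECONDS = (5, 15, 60)
    STABLE_RUN_SECONDS = 120  # "2-min successful run" from the bullet above

    def __init__(self):
        self._index = 0

    def next_delay(self) -> int:
        delay = self.STEPS_SECONDS[self._index]
        # Advance, but stay on the last step once the cap is reached.
        self._index = min(self._index + 1, len(self.STEPS_SECONDS) - 1)
        return delay

    def record_stable_run(self, run_seconds: float) -> None:
        # ASSUMPTION: the caller reports how long the host stayed up.
        if run_seconds >= self.STABLE_RUN_SECONDS:
            self._index = 0


class HeartbeatMonitor:
    """Declares the host dead after 3 consecutive missed heartbeats."""

    CADENCE_SECONDS = 2
    MISS_LIMIT = 3

    def __init__(self):
        self.consecutive_misses = 0

    def record_ack(self) -> None:
        self.consecutive_misses = 0  # any ack clears the miss streak

    def record_miss(self) -> bool:
        """Returns True once the host should be treated as dead."""
        self.consecutive_misses += 1
        return self.consecutive_misses >= self.MISS_LIMIT
```

With a 2s cadence and a limit of three consecutive misses, a dead host is detected roughly six seconds after its last ack, and a single late ack resets the count.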

Deferred (explicitly noted as TODO)

Stream D — Retire legacy OtOpcUa.Host

Not executable until Stream E parity passes. Deleting the legacy project now would break the 494 v1 IntegrationTests that are the parity baseline. Retiring it requires:

  1. Host MXAccess code lift (Task B.1 "move Galaxy code") from OtOpcUa.Host/ into OtOpcUa.Driver.Galaxy.Host/ — STA pump wiring, MxAccessHandle backing the real LMXProxyServer, GalaxyRepository and its SQL queries, GalaxyRuntimeProbeManager, Historian loader, the Ipc stub handler replaced with a real IFrameHandler that invokes the handle.
  2. Address-space build via IAddressSpaceBuilder produces byte-equivalent OPC UA browse output to v1 (Task C.4).
  3. Windows service installer registers two services (OtOpcUa + OtOpcUaGalaxyHost) with the correct service-account SIDs and per-process secret provisioning. Galaxy.Host starts before OtOpcUa.
  4. appsettings.json Galaxy config (MxAccess / Galaxy / Historian sections) migrated into DriverInstance.DriverConfig JSON in the Configuration DB via an idempotent migration script. Post-migration, the local appsettings.json keeps only Cluster.NodeId, ClusterId, and the DB conn string per decision #18.

Stream E — Parity validation

Requires live MXAccess + Galaxy runtime and the above lift complete. Work items:

  • Run v1 IntegrationTests against the v2 Galaxy.Proxy + Galaxy.Host topology. Pass count = v1 baseline; failures = 0. Per-test duration regression report flags any test >2× baseline.
  • Scripted Client.CLI walkthrough recorded at Phase 2 entry gate against v1, replayed against v2; diff must show only timestamp/latency differences.
  • Regression tests for the four 2026-04-13 stability findings (phantom probe, cross-host quality clear, sync-over-async guard, fire-and-forget alarm drain).
  • /codex:adversarial-review --base v2 on the merged Phase 2 diff — findings closed or deferred with rationale.
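
The first work item above (pass count equal to the v1 baseline, zero failures, and a flag on any test slower than 2× baseline) could be checked with a small report like this Python sketch. The input shape, a mapping from test name to a (passed, seconds) pair, and the function name are hypothetical; the real test runner output format is not described in the source.

```python
def parity_report(baseline: dict, candidate: dict, slow_factor: float = 2.0) -> dict:
    """Compares a v2 run against the v1 baseline.

    Both arguments map test name -> (passed: bool, seconds: float).
    """
    failures = sorted(name for name, (ok, _) in candidate.items() if not ok)
    missing = sorted(set(baseline) - set(candidate))
    # Duration regressions are flagged for review; they do not fail the gate here.
    slow = sorted(
        name
        for name in baseline.keys() & candidate.keys()
        if candidate[name][1] > slow_factor * baseline[name][1]
    )
    baseline_passes = sum(1 for ok, _ in baseline.values() if ok)
    candidate_passes = sum(1 for ok, _ in candidate.values() if ok)
    return {
        "gate_ok": not failures and not missing and candidate_passes == baseline_passes,
        "failures": failures,
        "missing": missing,
        "slow": slow,
    }
```

Treating a missing test as a gate failure is a design choice in this sketch: a test that silently stops running would otherwise look like parity.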

Also deferred from Stream B

  • Task B.10 FaultShim (test-only ArchestrA.MxAccess substitute for fault injection). Needs the production ArchestrA.MxAccess reference in place first; flagged as part of the plan's "mid-gate review" fallback (Risk row 7).
  • Task B.8 WM_QUIT hard-exit escalation — wired in when the real Win32 pump replaces the BlockingCollection dispatcher. The StaPump.IsResponsiveAsync probe already exists; the supervisor escalation-to-Environment.Exit(2) belongs to the Program main loop after the pump integration.

Cross-session impact on the build

  • Full solution: 926 tests pass, 1 fails (pre-existing Phase 0 baseline Client.CLI.Tests.SubscribeCommandTests.Execute_PrintsSubscriptionMessage — not a Phase 2 regression; was red before Phase 1 and stays red through Phase 2).
  • New projects added to .slnx: Driver.Galaxy.Shared, Driver.Galaxy.Host, Driver.Galaxy.Proxy, plus the three matching test projects.
  • No existing tests broke. The 494 v1 OtOpcUa.Tests (net48) and 6 IntegrationTests (net48) still pass because the legacy OtOpcUa.Host is untouched.

Next-session checklist for Stream D + E

  1. Stand up dev Galaxy; capture Client.CLI walkthrough baseline against v1.
  2. Move Galaxy-specific files from OtOpcUa.Host into Driver.Galaxy.Host, renaming namespaces. Replace StubFrameHandler with the real one.
  3. Wire up the real Win32 pump inside StaPump (lift from scadalink-design's LmxProxy.Host reference per CLAUDE.md).
  4. Run v1 IntegrationTests against the v2 topology — iterate on parity defects until green.
  5. Run Client.CLI walkthrough and diff.
  6. Regression tests for the four stability findings.
  7. Delete legacy OtOpcUa.Host; update .slnx; update installer scripts.
  8. Adversarial review; exit-gate-phase-2.md recorded; PR merged.
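
For step 5, one way to make the walkthrough diff show only meaningful changes is to mask timestamps and latencies before comparing. The Python sketch below assumes a hypothetical transcript format (ISO-style timestamps and latencies such as "12 ms"); the actual Client.CLI output format is not described here, so the patterns would need adjusting to the real recordings.

```python
import difflib
import re

# ASSUMPTION: timestamps look like 2026-04-17T21:35:25Z or 2026-04-17 21:35:25.123
TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)
# ASSUMPTION: latencies look like "12 ms", "0.8s", or "450us"
LATENCY = re.compile(r"\d+(?:\.\d+)?\s?(?:ms|us|s)\b")


def normalize(line: str) -> str:
    """Replaces volatile fields with placeholders so they never show up in a diff."""
    line = TIMESTAMP.sub("<ts>", line)
    return LATENCY.sub("<latency>", line)


def walkthrough_diff(v1_lines, v2_lines):
    """Returns unified-diff lines; an empty list means only time fields differed."""
    a = [normalize(line) for line in v1_lines]
    b = [normalize(line) for line in v2_lines]
    return list(difflib.unified_diff(a, b, "v1", "v2", lineterm=""))
```

An empty result from `walkthrough_diff` corresponds to the exit condition "diff must show only timestamp/latency differences"; anything left in the output is a candidate parity defect.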