Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-17 21:35:25 -04:00
parent fc0ce36308
commit 01fd90c178
128 changed files with 12352 additions and 4 deletions

View File

@@ -58,7 +58,7 @@ Running record of every v2 dev service stood up on this developer machine. Updat
| Service | Container / Process | Version | Host:Port | Credentials (dev-only) | Data location | Status |
|---------|---------------------|---------|-----------|------------------------|---------------|--------|
| **Central config DB** | Docker container `otopcua-mssql` (image `mcr.microsoft.com/mssql/server:2022-latest`) | 16.0.4250.1 (RTM-CU24-GDR, KB5083252) | `localhost:1433` | User `sa` / Password `OtOpcUaDev_2026!` | Docker named volume `otopcua-mssql-data` (mounted at `/var/opt/mssql` inside container) | ✅ Running |
| **Central config DB** | Docker container `otopcua-mssql` (image `mcr.microsoft.com/mssql/server:2022-latest`) | 16.0.4250.1 (RTM-CU24-GDR, KB5083252) | `localhost:14330` (host) → `1433` (container) — remapped from 1433 to avoid collision with the native MSSQL14 instance that hosts the Galaxy `ZB` DB (both bind 0.0.0.0:1433; whichever wins the race gets connections) | User `sa` / Password `OtOpcUaDev_2026!` | Docker named volume `otopcua-mssql-data` (mounted at `/var/opt/mssql` inside container) | ✅ Running `InitialSchema` migration applied, 16 entity tables live |
| Dev Galaxy (AVEVA System Platform) | Local install on this dev box | v1 baseline | Local COM via MXAccess | Windows Auth | Galaxy repository DB `ZB` on local SQL Server (separate instance from `otopcua-mssql` — legacy v1 Galaxy DB, not related to v2 config DB) | ✅ Available (per CLAUDE.md) |
| GLAuth (LDAP) | Local install at `C:\publish\glauth\` | v1 baseline | `localhost:3893` (LDAP) / `3894` (LDAPS) | Bind DN `cn=admin,dc=otopcua,dc=local` / password in `glauth-otopcua.cfg` | `C:\publish\glauth\` | Pending — v2 test users + groups config not yet seeded (Phase 1 Stream E task) |
| OPC Foundation reference server | Not yet built | — | `localhost:62541` (target) | `user1` / `password1` (reference-server defaults) | — | Pending (needed for Phase 5 OPC UA Client driver testing) |
@@ -75,7 +75,7 @@ Copy-paste-ready. **Never commit these to the repo** — they go in `appsettings
```jsonc
{
"ConfigDatabase": {
"ConnectionString": "Server=localhost,1433;Database=OtOpcUaConfig_Dev;User Id=sa;Password=OtOpcUaDev_2026!;TrustServerCertificate=true;Encrypt=false;"
"ConnectionString": "Server=localhost,14330;Database=OtOpcUaConfig_Dev;User Id=sa;Password=OtOpcUaDev_2026!;TrustServerCertificate=true;Encrypt=false;"
},
"Authentication": {
"Ldap": {
@@ -135,7 +135,7 @@ Dev credentials in this inventory are convenience defaults, not secrets. Change
| Resource | Purpose | Type | Default port | Default credentials | Owner |
|----------|---------|------|--------------|---------------------|-------|
| **SQL Server 2022 dev edition** | Central config DB; integration tests against `Configuration` project | Local install OR Docker container `mcr.microsoft.com/mssql/server:2022-latest` | 1433 | `sa` / `OtOpcUaDev_2026!` (dev only — production uses Integrated Security or gMSA per decision #46) | Developer (per machine) |
| **SQL Server 2022 dev edition** | Central config DB; integration tests against `Configuration` project | Local install OR Docker container `mcr.microsoft.com/mssql/server:2022-latest` | 1433 default, or 14330 when a native MSSQL instance (e.g. the Galaxy `ZB` host) already occupies 1433 | `sa` / `OtOpcUaDev_2026!` (dev only — production uses Integrated Security or gMSA per decision #46) | Developer (per machine) |
| **GLAuth (LDAP server)** | Admin UI authentication tests; data-path ACL evaluation tests | Local binary at `C:\publish\glauth\` per existing CLAUDE.md | 3893 (LDAP) / 3894 (LDAPS) | Service principal: `cn=admin,dc=otopcua,dc=local` / `OtOpcUaDev_2026!`; test users defined in GLAuth config | Developer (per machine) |
| **Local dev Galaxy** (Aveva System Platform) | Galaxy driver tests; v1 IntegrationTests parity | Existing on dev box per CLAUDE.md | n/a (local COM) | Windows Auth | Developer (already present per project setup) |
@@ -270,11 +270,13 @@ Order matters because some installs have prerequisites and several need admin el
docker run --name otopcua-mssql `
-e "ACCEPT_EULA=Y" `
-e "MSSQL_SA_PASSWORD=OtOpcUaDev_2026!" `
-p 1433:1433 `
-p 14330:1433 `
-v otopcua-mssql-data:/var/opt/mssql `
-d mcr.microsoft.com/mssql/server:2022-latest
```
The host port is **14330**, not 1433, to coexist with the native MSSQL14 instance that hosts the Galaxy `ZB` DB on port 1433. Both the native instance and Docker's port-proxy will happily bind `0.0.0.0:1433`, but only one of them catches any given connection — which is effectively non-deterministic and produces confusing "Login failed for user 'sa'" errors when the native instance wins. Using 14330 eliminates the race entirely.
The `-v otopcua-mssql-data:/var/opt/mssql` named volume preserves database files across container restarts and `docker rm` — drop it only if you want a strictly throwaway instance.
Verify:

View File

@@ -0,0 +1,163 @@
# Phase 2 — Partial Exit Evidence (2026-04-17)
> This records what Phase 2 of v2 completed in the current session and what was explicitly
> deferred. See `phase-2-galaxy-out-of-process.md` for the full task plan; this is the as-built
> delta.
## Status: **Streams A + B + C scaffolded and test-green. Streams D + E deferred.**
The goal per the plan is "parity, not regression" — the phase exit gate requires v1
IntegrationTests to pass against the v2 Galaxy.Proxy + Galaxy.Host topology byte-for-byte.
Achieving that requires live MXAccess runtime plus the Galaxy code lift out of the legacy
`OtOpcUa.Host`. Both are operations that need a dev Galaxy up and a parity test cycle to verify.
Without that cycle, deleting the legacy Host would break the 494 passing v1 tests that are the
parity baseline.
What *is* done: all scaffolding, IPC contracts, supervisor logic, and stability protections
needed to hang the real MXAccess code onto. Every piece has unit-level or IPC-level test
coverage.
## Delivered
### Stream A — `Driver.Galaxy.Shared` (1 week estimate, **complete**)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` (.NET Standard 2.0, MessagePack-only
dependency)
- **Contracts**: `Hello`/`HelloAck` (version negotiation per Task A.3), `OpenSessionRequest`/
`OpenSessionResponse`/`CloseSessionRequest`, `Heartbeat`/`HeartbeatAck`, `ErrorResponse`,
`DiscoverHierarchyRequest`/`Response` + `GalaxyObjectInfo` + `GalaxyAttributeInfo`,
`ReadValuesRequest`/`Response`, `WriteValuesRequest`/`Response`, `SubscribeRequest`/
`Response`/`UnsubscribeRequest`/`OnDataChangeNotification`, `AlarmSubscribeRequest`/
`GalaxyAlarmEvent`/`AlarmAckRequest`, `HistoryReadRequest`/`Response`+`HistoryTagValues`,
`HostConnectivityStatus`+`RuntimeStatusChangeNotification`, `RecycleHostRequest`/
`RecycleStatusResponse`
- **Framing**: length-prefixed (decision #28) + 1-byte kind tag + MessagePack body. 16 MiB
body cap. `FrameWriter`/`FrameReader` with thread-safe write gate.
- **Tests (6)**: reflection-scan round-trip for every `[MessagePackObject]`, referenced-
assemblies guard (only MessagePack allowed outside BCL), Hello version defaults,
`FrameWriter``FrameReader` interop, oversize-frame rejection.
### Stream B — `Driver.Galaxy.Host` (34 week estimate, **scaffold complete; MXAccess lift deferred**)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` (.NET Framework 4.8 AnyCPU — flips to x86 when
the Galaxy code lift happens per Task B.1 scope)
- **`Ipc/PipeAcl`**: builds the strict `PipeSecurity` — allow configured server-principal SID,
explicit deny on LocalSystem + Administrators, owner = allowed SID (decision #76).
- **`Ipc/PipeServer`**: named-pipe server that (1) enforces the ACL, (2) verifies caller SID
via `pipe.RunAsClient` + `WindowsIdentity.GetCurrent`, (3) requires the per-process shared
secret in the Hello frame before any other RPC, (4) rejects major-version mismatches.
- **`Stability/MemoryWatchdog`**: Galaxy thresholds — warn at `max(1.5×baseline, +200 MB)`,
soft-recycle at `max(2×baseline, +200 MB)`, hard ceiling 1.5 GB, slope ≥5 MB/min over 30 min.
Pluggable RSS source for unit testability.
- **`Stability/RecyclePolicy`**: 1-recycle/hr cap; 03:00 local daily scheduled recycle.
- **`Stability/PostMortemMmf`**: ring buffer of 1000 × 256-byte entries in `%ProgramData%\
OtOpcUa\driver-postmortem\galaxy.mmf`. Single-writer / multi-reader. Survives hard crash;
supervisor reads the MMF via a second process.
- **`Sta/MxAccessHandle`**: `SafeHandle` subclass — `ReleaseHandle` calls `Marshal.ReleaseComObject`
in a loop until refcount = 0 then invokes the optional `unregister` callback. Finalizer-safe.
Wraps any RCW via `object` so we can unit-test against a mock; the real wiring to
`ArchestrA.MxAccess.LMXProxyServer` lands with the deferred code move.
- **`Sta/StaPump`**: dedicated STA thread with `BlockingCollection` work queue + `InvokeAsync`
dispatch. Responsiveness probe (`IsResponsiveAsync`) returns false on wedge. The real
Win32 `GetMessage/DispatchMessage` pump from v1 `LmxProxy.Host` slots in here with the same
dispatch semantics.
- **`IsExternalInit` shim**: required for `init` setters on .NET 4.8.
- **`Program.cs`**: reads `OTOPCUA_GALAXY_PIPE`, `OTOPCUA_ALLOWED_SID`, `OTOPCUA_GALAXY_SECRET`
from env (supervisor sets at spawn), runs the pipe server, logs via Serilog to
`%ProgramData%\OtOpcUa\galaxy-host-YYYY-MM-DD.log`.
- **`Ipc/StubFrameHandler`**: placeholder that heartbeat-acks and returns `not-implemented`
errors. Swapped for the real Galaxy-backed handler when the MXAccess code move completes.
- **Tests (15)**: `MemoryWatchdog` thresholds + slope detection; `RecyclePolicy` cap + daily
schedule; `PostMortemMmf` round-trip + ring-wrap + truncation-safety; `StaPump`
apartment-state + responsiveness-probe wedge detection.
### Stream C — `Driver.Galaxy.Proxy` (1.5 week estimate, **complete as IPC-forwarder**)
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` (.NET 10)
- **`Ipc/GalaxyIpcClient`**: Hello handshake + shared-secret authentication + single-call
request/response over the data-plane pipe. Serializes concurrent callers via
`SemaphoreSlim`. Lifts `ErrorResponse` to `GalaxyIpcException` with the error code.
- **`GalaxyProxyDriver`**: implements `IDriver` + `ITagDiscovery`. Forwards lifecycle and
discovery over IPC; maps Galaxy MX data types → `DriverDataType` and security classifications
→ `SecurityClassification`. Stream C-plan capability interfaces for `IReadable`, `IWritable`,
`ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`,
`IRediscoverable` are structured identically — wire them in when the Host's MXAccess backend
exists so the round-trips can actually serve data.
- **`Supervisor/Backoff`**: 5s → 15s → 60s capped; `RecordStableRun` resets after 2-min
successful run.
- **`Supervisor/CircuitBreaker`**: 3 crashes per 5 min opens; cooldown escalates
1h → 4h → manual (`TimeSpan.MaxValue`). Sticky alert doesn't auto-clear when cooldown
elapses; `ManualReset` only.
- **`Supervisor/HeartbeatMonitor`**: 2s cadence, 3 consecutive misses = host dead.
- **Tests (11)**: `Backoff` sequence + reset; `CircuitBreaker` full 1h/4h/manual escalation
path; `HeartbeatMonitor` miss-count + ack-reset; full IPC handshake round-trip
(Host + Proxy over a real named pipe, heartbeat ack verified; shared-secret mismatch
rejected with `UnauthorizedAccessException`).
## Deferred (explicitly noted as TODO)
### Stream D — Retire legacy `OtOpcUa.Host`
**Not executable until Stream E parity passes.** Deleting the legacy project now would break
the 494 v1 IntegrationTests that are the parity baseline. Recovery requires:
1. Host MXAccess code lift (Task B.1 "move Galaxy code") from `OtOpcUa.Host/` into
`OtOpcUa.Driver.Galaxy.Host/` — STA pump wiring, `MxAccessHandle` backing the real
`LMXProxyServer`, `GalaxyRepository` and its SQL queries, `GalaxyRuntimeProbeManager`,
Historian loader, the Ipc stub handler replaced with a real `IFrameHandler` that invokes
the handle.
2. Address-space build via `IAddressSpaceBuilder` produces byte-equivalent OPC UA browse
output to v1 (Task C.4).
3. Windows service installer registers two services (`OtOpcUa` + `OtOpcUaGalaxyHost`) with
the correct service-account SIDs and per-process secret provisioning. Galaxy.Host starts
before OtOpcUa.
4. `appsettings.json` Galaxy config (MxAccess / Galaxy / Historian sections) migrated into
`DriverInstance.DriverConfig` JSON in the Configuration DB via an idempotent migration
script. Post-migration, the local `appsettings.json` keeps only `Cluster.NodeId`,
`ClusterId`, and the DB conn string per decision #18.
### Stream E — Parity validation
Requires live MXAccess + Galaxy runtime and the above lift complete. Work items:
- Run v1 IntegrationTests against the v2 Galaxy.Proxy + Galaxy.Host topology. Pass count =
v1 baseline; failures = 0. Per-test duration regression report flags any test >2× baseline.
- Scripted Client.CLI walkthrough recorded at Phase 2 entry gate against v1, replayed
against v2; diff must show only timestamp/latency differences.
- Regression tests for the four 2026-04-13 stability findings (phantom probe, cross-host
quality clear, sync-over-async guard, fire-and-forget alarm drain).
- `/codex:adversarial-review --base v2` on the merged Phase 2 diff — findings closed or
deferred with rationale.
## Also deferred from Stream B
- **Task B.10 FaultShim** (test-only `ArchestrA.MxAccess` substitute for fault injection).
Needs the production `ArchestrA.MxAccess` reference in place first; flagged as part of the
plan's "mid-gate review" fallback (Risk row 7).
- **Task B.8 WM_QUIT hard-exit escalation** — wired in when the real Win32 pump replaces the
`BlockingCollection` dispatcher. The `StaPump.IsResponsiveAsync` probe already exists; the
supervisor escalation-to-`Environment.Exit(2)` belongs to the Program main loop after the
pump integration.
## Cross-session impact on the build
- **Full solution**: 926 tests pass, 1 fails (pre-existing Phase 0 baseline
`Client.CLI.Tests.SubscribeCommandTests.Execute_PrintsSubscriptionMessage` — not a Phase 2
regression; was red before Phase 1 and stays red through Phase 2).
- **New projects added to `.slnx`**: `Driver.Galaxy.Shared`, `Driver.Galaxy.Host`,
`Driver.Galaxy.Proxy`, plus the three matching test projects.
- **No existing tests broke.** The 494 v1 `OtOpcUa.Tests` (net48) and 6 `IntegrationTests`
(net48) still pass because the legacy `OtOpcUa.Host` is untouched.
## Next-session checklist for Stream D + E
1. Stand up dev Galaxy; capture Client.CLI walkthrough baseline against v1.
2. Move Galaxy-specific files from `OtOpcUa.Host` into `Driver.Galaxy.Host`, renaming
namespaces. Replace `StubFrameHandler` with the real one.
3. Wire up the real Win32 pump inside `StaPump` (lift from scadalink-design's
`LmxProxy.Host` reference per CLAUDE.md).
4. Run v1 IntegrationTests against the v2 topology — iterate on parity defects until green.
5. Run Client.CLI walkthrough and diff.
6. Regression tests for the four stability findings.
7. Delete legacy `OtOpcUa.Host`; update `.slnx`; update installer scripts.
8. Adversarial review; `exit-gate-phase-2.md` recorded; PR merged.