From c55da145ec2aa1f5f8684f111ca76cbf8e31ba9f Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Wed, 29 Apr 2026 22:08:43 -0400
Subject: [PATCH] docs: add Galaxy parity rig runbook
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Walks through standing up both Galaxy backends side-by-side against a
single live Galaxy:

- Conceptual layout (two MxAccess sessions on distinct ClientNames so
  they don't evict each other)
- What's already on the dev box (AVEVA + OtOpcUaGalaxyHost service)
- mxaccessgw build + run + config (API key, ClientName)
- The three OTOPCUA_PARITY_* env vars the harness reads
- HarnessShapeTests as the two-line truth-teller for "did both halves
  resolve"
- Galaxy-shape coverage matrix mapping each scenario to what's needed
  for it to assert (rather than skip)
- Soak run recipes, including the compressed-tag fallback when the dev
  Galaxy doesn't have 50k attributes
- Troubleshooting for the four common SkipReasons
- Three further gates before PR 7.2 lands (matrix green, soak data,
  pilot flip)

Explicitly drops the stale "use a non-elevated shell" precondition —
the legacy Galaxy.Host pipe ACL accepts elevated and non-elevated
dohertj2 alike (resolved 2026-04-24).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/v2/Galaxy.ParityRig.md | 224 ++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)
 create mode 100644 docs/v2/Galaxy.ParityRig.md

diff --git a/docs/v2/Galaxy.ParityRig.md b/docs/v2/Galaxy.ParityRig.md
new file mode 100644
index 0000000..b8ffc0a
--- /dev/null
+++ b/docs/v2/Galaxy.ParityRig.md
@@ -0,0 +1,224 @@
+# Galaxy parity rig — runbook
+
+Brings up both Galaxy backends side-by-side against a single live Galaxy
+so the parity matrix in `docs/v2/Galaxy.ParityMatrix.md` and the soak
+scenario in `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/SoakScenarioTests.cs`
+can run for real. Closing the parity matrix is the gate for PR 7.2
+(retire legacy Galaxy projects).
+
+## Conceptual layout
+
+```
+Galaxy ZB SQL ──┬── OtOpcUaGalaxyHost (NSSM service, net48 x86)
+                │     └── MxAccess COM, ClientName "OtOpcUa-Galaxy.Host"
+                │           └── named pipe "OtOpcUaGalaxy"
+                │                 ▲
+                │                 │ pipe IPC
+                │                 │
+                │           GalaxyProxyDriver ◄── parity test (legacy half)
+                │
+                └── mxaccessgw service
+                      └── MxAccess COM, ClientName "OtOpcUa-Parity"
+                            └── gRPC on http://localhost:5120
+                                  ▲
+                                  │ gRPC
+                                  │
+                            GalaxyDriver (in-process) ◄── parity test (mxgw half)
+```
+
+Both halves talk to the **same Galaxy** through **two distinct MxAccess
+sessions** (different ClientNames so they don't evict each other).
+
+## What's already on this dev box
+
+Per `~/.claude/projects/.../memory/`:
+
+- **AVEVA System Platform + Galaxy + MXAccess runtime** — `project_aveva_platform_installed.md`.
+- **`OtOpcUaGalaxyHost`** Windows service running as `dohertj2`, NSSM-wrapped,
+  binary at `C:\publish\OtOpcUaGalaxyHost\OtOpcUa.Driver.Galaxy.Host.exe`,
+  shared secret at `.local/galaxy-host-secret.txt`, ZB SQL on `localhost:1433`
+  — `project_galaxy_host_installed.md`.
+- **Parity test project** (`Driver.Galaxy.ParityTests`) committed and
+  skip-clean — runs as soon as the mxgw half resolves.
+
+## Setup steps (one-time)
+
+### 1. Build + run mxaccessgw
+
+The gateway source is at `c:\Users\dohertj2\Desktop\mxaccessgw\`. From
+that repo:
+
+```powershell
+cd C:\Users\dohertj2\Desktop\mxaccessgw
+dotnet publish src\MxGateway.Server -c Release -o C:\publish\MxAccessGw
+```
+
+Configure:
+
+- An API key. Pick anything stable (e.g. `parity-suite-key`) and put it in
+  whichever config file `MxGateway.Server` reads — see `mxaccessgw/gateway.md`
+  for the current shape.
+- ClientName for the worker's MxAccess registration — set to `OtOpcUa-Parity`
+  so it doesn't collide with `OtOpcUa-Galaxy.Host`.
+- Bind to `http://localhost:5120` (default in `launchSettings.json`).
+
+Run it as a console app for the first session — easier to inspect logs.
+NSSM-wrap it later if the rig becomes long-lived:
+
+```powershell
+C:\publish\MxAccessGw\MxGateway.Server.exe
+```
+
+The worker should log a successful `Register` against MxAccess after a
+few seconds. If it loops on `Register` failures, that's an MxAccess-side
+problem — the legacy `OtOpcUaGalaxyHost` going through the same COM
+stack is a known-good reference point.
+
+### 2. Set the parity env vars
+
+In the test-runner shell:
+
+```powershell
+$env:OTOPCUA_PARITY_GW_ENDPOINT = "http://localhost:5120"
+$env:OTOPCUA_PARITY_GW_API_KEY = "parity-suite-key" # match the gw config
+$env:OTOPCUA_PARITY_CLIENT_NAME = "OtOpcUa-Parity"
+```
+
+Elevation status doesn't matter — the legacy Galaxy.Host pipe ACL accepts
+elevated and non-elevated `dohertj2` shells alike (the Administrators deny
+ACE was removed 2026-04-24; see `project_galaxy_host_installed.md`).
+
+### 3. Verify both halves resolve
+
+```powershell
+cd C:\Users\dohertj2\Desktop\lmxopcua
+dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
+  --filter "FullyQualifiedName~HarnessShapeTests"
+```
+
+`Harness_records_a_skip_reason_for_each_unavailable_backend` is the
+two-line truth-teller:
+
+- Both `LegacyDriver` non-null + both `MxGatewayDriver` non-null → rig is up.
+- One side null → read its `LegacySkipReason` / `MxGatewaySkipReason` and fix.
+
+## Running the matrix
+
+Once both halves resolve:
+
+```powershell
+dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
+  --filter "Category=ParityE2E"
+```
+
+This runs all 17 scenario tests across the seven scenario classes
+(BrowseAndRead / Subscribe / Write / Alarm / History / Reconnect /
+ScanState). Each scenario class is independent — failures in one don't
+block the rest.
+
+Track the result against `docs/v2/Galaxy.ParityMatrix.md`. Update each
+row to:
+
+- **green** if the scenario passes
+- **yellow** if it skipped because the dev Galaxy doesn't have the right
+  shape (see coverage matrix below)
+- **red** if it asserted a real delta — those are the deltas that block
+  PR 7.2; chase each before retiring the legacy backend
+
+## Galaxy shape needed for full coverage
+
+Skip-on-empty-shape scenarios fail-soft today. To turn a skip into a
+real result, the dev Galaxy needs the shape in the right column:
+
+| Scenario | Needs |
+|---|---|
+| `BrowseAndReadParityTests` (3 tests) | Any deployed objects with attributes |
+| `SubscribeAndEventRateParityTests` event-rate | ≥5 attributes whose values *change* in 3s |
+| `WriteByClassificationParityTests` (FreeAccess/Operate) | A FreeAccess/Operate numeric attribute |
+| `WriteByClassificationParityTests` (Configure/Tune) | A Configure/Tune attribute |
+| `AlarmTransitionParityTests` (2 tests) | Attributes with the `$Alarm*` extension |
+| `HistoryReadParityTests` (historized set) | Attributes with the History extension |
+| `ScanStateProbeParityTests` (2 tests) | Multiple `$WinPlatform` / `$AppEngine` objects |
+
+The dev Galaxy from the existing E2E smoke (`gr/seed-phase-7-smoke.sql`)
+covers most of these; the multi-platform scenario probably needs
+hand-deploying a second `$WinPlatform` instance.
+
+## Soak run
+
+The 24h × 50k soak gates the production confidence half of PR 7.2.
+
+```powershell
+$env:OTOPCUA_SOAK_RUN = "1"
+$env:OTOPCUA_SOAK_TAGS = ""
+$env:OTOPCUA_SOAK_MINUTES = "1440" # default 24h; compress for first runs
+$env:OTOPCUA_SOAK_DROP_PCT = "0.5"
+
+dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
+  --filter "Category=Soak"
+```
+
+The test logs a per-minute CSV-style line to stdout:
+
+```
+soak,1.0,received=51234,dispatched=51234,dropped=0,ws_mb=412
+soak,2.0,received=102468,dispatched=102468,dropped=0,ws_mb=415
+...
+```
+
+Capture stdout to a file for post-run analysis. The three guards
+(`received` growing, `dropped/received` ratio, working-set delta) all
+fire mid-run rather than at end-of-test, so a failure surfaces within
+the first few minutes if the architecture is wrong.
+
+## Compressed-tag soak (when Galaxy isn't 50k tags)
+
+A first-pass validation is fine with the override:
+
+```powershell
+$env:OTOPCUA_SOAK_RUN = "1"
+$env:OTOPCUA_SOAK_TAGS = "500" # whatever the dev Galaxy has
+$env:OTOPCUA_SOAK_MINUTES = "60" # one hour is enough to surface plumbing bugs
+$env:OTOPCUA_SOAK_DROP_PCT = "1.0"
+```
+
+This validates the *plumbing* (bounded channel, pump invariants, leak
+guard) but doesn't pin the 50k-tag scaling assertion. Defer the full
+50k validation to a customer rig with that scale, or build a synthetic
+Galaxy with a script that imports 50k attributes onto a generated UDO
+(~2 hours of one-off work).
+
+## Troubleshooting
+
+- **`MxGatewaySkipReason` says "mxaccessgw not reachable"** — the gw
+  isn't listening, or it's on a different port. `Test-NetConnection
+  localhost -Port 5120` is the quick check.
+- **`MxGatewaySkipReason` says "mxgateway backend boot failed:
+  RpcException: Unauthenticated"** — API key mismatch. Verify the
+  `OTOPCUA_PARITY_GW_API_KEY` env var matches the gw's configured key.
+- **`LegacySkipReason` says "Galaxy ZB SQL not reachable on
+  localhost:1433"** — SQL Server isn't running, or its TCP listener is
+  off. Check `services.msc` for the SQL Server (default) instance.
+- **`LegacySkipReason` says "Galaxy.Host EXE not built"** — the parity
+  harness looks under `src/.../bin/Debug/net48/`. Build it once:
+  `dotnet build src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host`. Note the
+  separately-published copy at `C:\publish\OtOpcUaGalaxyHost\` is for
+  the Windows service; the parity harness spawns its own subprocess.
+- **Both halves resolve but parity scenarios assert deltas** — that's
+  the expected outcome the rig exists to surface. Review each delta
+  against `docs/v2/Galaxy.ParityMatrix.md`'s "Accepted deltas" section
+  to decide whether it's a real bug or a pre-accepted divergence.
+
+## After the rig is green
+
+Three further steps before PR 7.2 lands:
+
+1. Promote any newly-discovered accepted-delta to the matrix doc with
+   the why.
+2. Run the full 24h × 50k soak (or compressed tag count if Galaxy isn't
+   that big) and link the stdout log in the PR description.
+3. Pilot the default flip (PR 7.1's `Galaxy.DefaultBackend = "GalaxyMxGateway"`)
+   on a single production node for ~2 weeks. Watch the OTel/metrics
+   surface (`docs/v2/Galaxy.Performance.md`) for regressions.
+
+Then 7.2 has the rollback-risk evidence its precondition asks for.
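
Reviewer note, outside the patch body: the runbook asks for the soak stdout to be captured and linked in the PR description. A minimal sketch of a post-run check over that capture could look like the following. It is not part of the test project. The names `SoakSample`, `parse_soak_line`, and `summarize` are hypothetical, the counters are assumed to be cumulative (as the sample lines suggest), and `drop_pct_limit` is assumed to be a percentage in the same sense as `OTOPCUA_SOAK_DROP_PCT`. It only summarizes the log; it does not claim to reproduce the guards inside `SoakScenarioTests`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SoakSample:
    minute: float
    received: int
    dispatched: int
    dropped: int
    ws_mb: int


def parse_soak_line(line: str) -> Optional[SoakSample]:
    """Parse one 'soak,<minute>,key=value,...' line; return None for anything else."""
    parts = line.strip().split(",")
    if len(parts) < 3 or parts[0] != "soak":
        return None
    try:
        # Everything after the minute column is expected to be key=value.
        fields = dict(p.split("=", 1) for p in parts[2:])
        return SoakSample(
            minute=float(parts[1]),
            received=int(fields["received"]),
            dispatched=int(fields["dispatched"]),
            dropped=int(fields["dropped"]),
            ws_mb=int(fields["ws_mb"]),
        )
    except (KeyError, ValueError):
        # Malformed or partial line: skip it rather than abort the summary.
        return None


def summarize(lines: Iterable[str], drop_pct_limit: float = 0.5) -> dict:
    """Summarize a captured soak log (assumes cumulative counters)."""
    samples = [s for s in map(parse_soak_line, lines) if s is not None]
    if not samples:
        raise ValueError("no soak samples found")
    first, last = samples[0], samples[-1]
    drop_pct = 100.0 * last.dropped / last.received if last.received else 0.0
    growing = all(b.received > a.received for a, b in zip(samples, samples[1:]))
    return {
        "samples": len(samples),
        "received_growing": growing,
        "drop_pct": drop_pct,
        "drop_ok": drop_pct <= drop_pct_limit,
        "ws_delta_mb": last.ws_mb - first.ws_mb,
    }


if __name__ == "__main__":
    # The two sample lines from the runbook, plus the literal "..." filler.
    demo = [
        "soak,1.0,received=51234,dispatched=51234,dropped=0,ws_mb=412",
        "soak,2.0,received=102468,dispatched=102468,dropped=0,ws_mb=415",
        "...",
    ]
    print(summarize(demo))
```

For a real run, pass the lines of the captured file instead of the demo list, for example `summarize(open(path, encoding="utf-8"))`, where `path` is wherever the stdout was redirected.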