docs: add Galaxy parity rig runbook

Walks through standing up both Galaxy backends side-by-side against a
single live Galaxy:

- Conceptual layout (two MxAccess sessions on distinct ClientNames so
  they don't evict each other)
- What's already on the dev box (AVEVA + OtOpcUaGalaxyHost service)
- mxaccessgw build + run + config (API key, ClientName)
- The three OTOPCUA_PARITY_* env vars the harness reads
- HarnessShapeTests as the two-line truth-teller for "did both halves
  resolve"
- Galaxy-shape coverage matrix mapping each scenario to what's needed
  for it to assert (rather than skip)
- Soak run recipes, including the compressed-tag fallback when the dev
  Galaxy doesn't have 50k attributes
- Troubleshooting for the four common SkipReasons
- Three further gates before PR 7.2 lands (matrix green, soak data,
  pilot flip)

Explicitly drops the stale "use a non-elevated shell" precondition —
the legacy Galaxy.Host pipe ACL accepts elevated and non-elevated
dohertj2 alike (resolved 2026-04-24).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-29 22:08:43 -04:00
parent 42f41fbe50
commit c55da145ec

224
docs/v2/Galaxy.ParityRig.md Normal file
View File

@@ -0,0 +1,224 @@
# Galaxy parity rig — runbook
Brings up both Galaxy backends side-by-side against a single live Galaxy
so the parity matrix in `docs/v2/Galaxy.ParityMatrix.md` and the soak
scenario in `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/SoakScenarioTests.cs`
can run for real. Closing the parity matrix is the gate for PR 7.2
(retire legacy Galaxy projects).
## Conceptual layout
```
Galaxy ZB SQL ──┬── OtOpcUaGalaxyHost (NSSM service, net48 x86)
│ └── MxAccess COM, ClientName "OtOpcUa-Galaxy.Host"
│ └── named pipe "OtOpcUaGalaxy"
│ ▲
│ │ pipe IPC
│ │
│ GalaxyProxyDriver ◄── parity test (legacy half)
└── mxaccessgw service
└── MxAccess COM, ClientName "OtOpcUa-Parity"
└── gRPC on http://localhost:5120
│ gRPC
GalaxyDriver (in-process) ◄── parity test (mxgw half)
```
Both halves talk to the **same Galaxy** through **two distinct MxAccess
sessions** (different ClientNames so they don't evict each other).
## What's already on this dev box
Per `~/.claude/projects/.../memory/`:
- **AVEVA System Platform + Galaxy + MXAccess runtime** — `project_aveva_platform_installed.md`.
- **`OtOpcUaGalaxyHost`** Windows service running as `dohertj2`, NSSM-wrapped,
binary at `C:\publish\OtOpcUaGalaxyHost\OtOpcUa.Driver.Galaxy.Host.exe`,
shared secret at `.local/galaxy-host-secret.txt`, ZB SQL on `localhost:1433`
`project_galaxy_host_installed.md`.
- **Parity test project** (`Driver.Galaxy.ParityTests`) committed and
skip-clean — runs as soon as the mxgw half resolves.
## Setup steps (one-time)
### 1. Build + run mxaccessgw
The gateway source is at `c:\Users\dohertj2\Desktop\mxaccessgw\`. From
that repo:
```powershell
cd C:\Users\dohertj2\Desktop\mxaccessgw
dotnet publish src\MxGateway.Server -c Release -o C:\publish\MxAccessGw
```
Configure:
- An API key. Pick anything stable (e.g. `parity-suite-key`) and put it in
whichever config file `MxGateway.Server` reads — see `mxaccessgw/gateway.md`
for the current shape.
- ClientName for the worker's MxAccess registration — set to `OtOpcUa-Parity`
so it doesn't collide with `OtOpcUa-Galaxy.Host`.
- Bind to `http://localhost:5120` (default in `launchSettings.json`).
Run it as a console app for the first session — easier to inspect logs.
NSSM-wrap it later if the rig becomes long-lived:
```powershell
C:\publish\MxAccessGw\MxGateway.Server.exe
```
The worker should log a successful `Register` against MxAccess after a
few seconds. If it loops on `Register` failures, that's an MxAccess-side
problem — the legacy `OtOpcUaGalaxyHost` going through the same COM
stack is a known-good reference point.
### 2. Set the parity env vars
In the test-runner shell:
```powershell
$env:OTOPCUA_PARITY_GW_ENDPOINT = "http://localhost:5120"
$env:OTOPCUA_PARITY_GW_API_KEY = "parity-suite-key" # match the gw config
$env:OTOPCUA_PARITY_CLIENT_NAME = "OtOpcUa-Parity"
```
Elevation status doesn't matter — the legacy Galaxy.Host pipe ACL accepts
elevated and non-elevated `dohertj2` shells alike (the Administrators deny
ACE was removed 2026-04-24; see `project_galaxy_host_installed.md`).
### 3. Verify both halves resolve
```powershell
cd C:\Users\dohertj2\Desktop\lmxopcua
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
--filter "FullyQualifiedName~HarnessShapeTests"
```
`Harness_records_a_skip_reason_for_each_unavailable_backend` is the
two-line truth-teller:
- Both `LegacyDriver` non-null + both `MxGatewayDriver` non-null → rig is up.
- One side null → read its `LegacySkipReason` / `MxGatewaySkipReason` and fix.
## Running the matrix
Once both halves resolve:
```powershell
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
--filter "Category=ParityE2E"
```
This runs all 17 scenario tests across the seven scenario classes
(BrowseAndRead / Subscribe / Write / Alarm / History / Reconnect /
ScanState). Each scenario class is independent — failures in one don't
block the rest.
Track the result against `docs/v2/Galaxy.ParityMatrix.md`. Update each
row to:
- **green** if the scenario passes
- **yellow** if it skipped because the dev Galaxy doesn't have the right
shape (see coverage matrix below)
- **red** if it asserted a real delta — those are the deltas that block
PR 7.2; chase each before retiring the legacy backend
## Galaxy shape needed for full coverage
Skip-on-empty-shape scenarios fail-soft today. To turn a skip into a
real result, the dev Galaxy needs the shape in the right column:
| Scenario | Needs |
|---|---|
| `BrowseAndReadParityTests` (3 tests) | Any deployed objects with attributes |
| `SubscribeAndEventRateParityTests` event-rate | ≥5 attributes whose values *change* in 3s |
| `WriteByClassificationParityTests` (FreeAccess/Operate) | A FreeAccess/Operate numeric attribute |
| `WriteByClassificationParityTests` (Configure/Tune) | A Configure/Tune attribute |
| `AlarmTransitionParityTests` (2 tests) | Attributes with the `$Alarm*` extension |
| `HistoryReadParityTests` (historized set) | Attributes with the History extension |
| `ScanStateProbeParityTests` (2 tests) | Multiple `$WinPlatform` / `$AppEngine` objects |
The dev Galaxy from the existing E2E smoke (`gr/seed-phase-7-smoke.sql`)
covers most of these; the multi-platform scenario probably needs
hand-deploying a second `$WinPlatform` instance.
## Soak run
The 24h × 50k soak gates the production confidence half of PR 7.2.
```powershell
$env:OTOPCUA_SOAK_RUN = "1"
$env:OTOPCUA_SOAK_TAGS = "<actual tag count if Galaxy < 50k>"
$env:OTOPCUA_SOAK_MINUTES = "1440" # default 24h; compress for first runs
$env:OTOPCUA_SOAK_DROP_PCT = "0.5"
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
--filter "Category=Soak"
```
The test logs a per-minute CSV-style line to stdout:
```
soak,1.0,received=51234,dispatched=51234,dropped=0,ws_mb=412
soak,2.0,received=102468,dispatched=102468,dropped=0,ws_mb=415
...
```
Capture stdout to a file for post-run analysis. The three guards
(`received` growing, `dropped/received` ratio, working-set delta) all
fire mid-run rather than at end-of-test, so a failure surfaces within
the first few minutes if the architecture is wrong.
## Compressed-tag soak (when Galaxy isn't 50k tags)
A first-pass validation is fine with the override:
```powershell
$env:OTOPCUA_SOAK_RUN = "1"
$env:OTOPCUA_SOAK_TAGS = "500" # whatever the dev Galaxy has
$env:OTOPCUA_SOAK_MINUTES = "60" # one hour is enough to surface plumbing bugs
$env:OTOPCUA_SOAK_DROP_PCT = "1.0"
```
This validates the *plumbing* (bounded channel, pump invariants, leak
guard) but doesn't pin the 50k-tag scaling assertion. Defer the full
50k validation to a customer rig with that scale, or build a synthetic
Galaxy with a script that imports 50k attributes onto a generated UDO
(~2 hours of one-off work).
## Troubleshooting
- **`MxGatewaySkipReason` says "mxaccessgw not reachable"** — the gw
isn't listening, or it's on a different port. `Test-NetConnection
localhost -Port 5120` is the quick check.
- **`MxGatewaySkipReason` says "mxgateway backend boot failed:
RpcException: Unauthenticated"** — API key mismatch. Verify the
`OTOPCUA_PARITY_GW_API_KEY` env var matches the gw's configured key.
- **`LegacySkipReason` says "Galaxy ZB SQL not reachable on
localhost:1433"** — SQL Server isn't running, or its TCP listener is
off. Check `services.msc` for the SQL Server (default) instance.
- **`LegacySkipReason` says "Galaxy.Host EXE not built"** — the parity
harness looks under `src/.../bin/Debug/net48/`. Build it once:
`dotnet build src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host`. Note the
separately-published copy at `C:\publish\OtOpcUaGalaxyHost\` is for
the Windows service; the parity harness spawns its own subprocess.
- **Both halves resolve but parity scenarios assert deltas** — that's
the expected outcome the rig exists to surface. Review each delta
against `docs/v2/Galaxy.ParityMatrix.md`'s "Accepted deltas" section
to decide whether it's a real bug or a pre-accepted divergence.
## After the rig is green
Three further steps before PR 7.2 lands:
1. Promote any newly-discovered accepted-delta to the matrix doc with
the why.
2. Run the full 24h × 50k soak (or compressed tag count if Galaxy isn't
that big) and link the stdout log in the PR description.
3. Pilot the default flip (PR 7.1's `Galaxy.DefaultBackend = "GalaxyMxGateway"`)
on a single production node for ~2 weeks. Watch the OTel/metrics
surface (`docs/v2/Galaxy.Performance.md`) for regressions.
Then 7.2 has the rollback-risk evidence its precondition asks for.