Files
lmxopcua/docs/v2/Galaxy.ParityRig.md
Joseph Doherty c55da145ec docs: add Galaxy parity rig runbook
Walks through standing up both Galaxy backends side-by-side against a
single live Galaxy:

- Conceptual layout (two MxAccess sessions on distinct ClientNames so
  they don't evict each other)
- What's already on the dev box (AVEVA + OtOpcUaGalaxyHost service)
- mxaccessgw build + run + config (API key, ClientName)
- The three OTOPCUA_PARITY_* env vars the harness reads
- HarnessShapeTests as the two-line truth-teller for "did both halves
  resolve"
- Galaxy-shape coverage matrix mapping each scenario to what's needed
  for it to assert (rather than skip)
- Soak run recipes, including the compressed-tag fallback when the dev
  Galaxy doesn't have 50k attributes
- Troubleshooting for the four common SkipReasons
- Three further gates before PR 7.2 lands (matrix green, soak data,
  pilot flip)

Explicitly drops the stale "use a non-elevated shell" precondition —
the legacy Galaxy.Host pipe ACL accepts elevated and non-elevated
dohertj2 alike (resolved 2026-04-24).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:08:43 -04:00

225 lines
8.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Galaxy parity rig — runbook
Brings up both Galaxy backends side-by-side against a single live Galaxy
so the parity matrix in `docs/v2/Galaxy.ParityMatrix.md` and the soak
scenario in `tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/SoakScenarioTests.cs`
can run for real. Closing the parity matrix is the gate for PR 7.2
(retire legacy Galaxy projects).
## Conceptual layout
```
Galaxy ZB SQL ──┬── OtOpcUaGalaxyHost (NSSM service, net48 x86)
│ └── MxAccess COM, ClientName "OtOpcUa-Galaxy.Host"
│ └── named pipe "OtOpcUaGalaxy"
│ ▲
│ │ pipe IPC
│ │
│ GalaxyProxyDriver ◄── parity test (legacy half)
└── mxaccessgw service
└── MxAccess COM, ClientName "OtOpcUa-Parity"
└── gRPC on http://localhost:5120
│ gRPC
GalaxyDriver (in-process) ◄── parity test (mxgw half)
```
Both halves talk to the **same Galaxy** through **two distinct MxAccess
sessions** (different ClientNames so they don't evict each other).
## What's already on this dev box
Per `~/.claude/projects/.../memory/`:
- **AVEVA System Platform + Galaxy + MXAccess runtime** — `project_aveva_platform_installed.md`.
- **`OtOpcUaGalaxyHost`** Windows service running as `dohertj2`, NSSM-wrapped,
binary at `C:\publish\OtOpcUaGalaxyHost\OtOpcUa.Driver.Galaxy.Host.exe`,
shared secret at `.local/galaxy-host-secret.txt`, ZB SQL on `localhost:1433`
`project_galaxy_host_installed.md`.
- **Parity test project** (`Driver.Galaxy.ParityTests`) committed and
skip-clean — runs as soon as the mxgw half resolves.
## Setup steps (one-time)
### 1. Build + run mxaccessgw
The gateway source is at `c:\Users\dohertj2\Desktop\mxaccessgw\`. From
that repo:
```powershell
cd C:\Users\dohertj2\Desktop\mxaccessgw
dotnet publish src\MxGateway.Server -c Release -o C:\publish\MxAccessGw
```
Configure:
- An API key. Pick anything stable (e.g. `parity-suite-key`) and put it in
whichever config file `MxGateway.Server` reads — see `mxaccessgw/gateway.md`
for the current shape.
- ClientName for the worker's MxAccess registration — set to `OtOpcUa-Parity`
so it doesn't collide with `OtOpcUa-Galaxy.Host`.
- Bind to `http://localhost:5120` (default in `launchSettings.json`).
Run it as a console app for the first session — easier to inspect logs.
NSSM-wrap it later if the rig becomes long-lived:
```powershell
C:\publish\MxAccessGw\MxGateway.Server.exe
```
The worker should log a successful `Register` against MxAccess after a
few seconds. If it loops on `Register` failures, that's an MxAccess-side
problem — the legacy `OtOpcUaGalaxyHost` going through the same COM
stack is a known-good reference point.
### 2. Set the parity env vars
In the test-runner shell:
```powershell
$env:OTOPCUA_PARITY_GW_ENDPOINT = "http://localhost:5120"
$env:OTOPCUA_PARITY_GW_API_KEY = "parity-suite-key" # match the gw config
$env:OTOPCUA_PARITY_CLIENT_NAME = "OtOpcUa-Parity"
```
Elevation status doesn't matter — the legacy Galaxy.Host pipe ACL accepts
elevated and non-elevated `dohertj2` shells alike (the Administrators deny
ACE was removed 2026-04-24; see `project_galaxy_host_installed.md`).
### 3. Verify both halves resolve
```powershell
cd C:\Users\dohertj2\Desktop\lmxopcua
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
--filter "FullyQualifiedName~HarnessShapeTests"
```
`Harness_records_a_skip_reason_for_each_unavailable_backend` is the
two-line truth-teller:
- Both `LegacyDriver` non-null + both `MxGatewayDriver` non-null → rig is up.
- One side null → read its `LegacySkipReason` / `MxGatewaySkipReason` and fix.
## Running the matrix
Once both halves resolve:
```powershell
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
--filter "Category=ParityE2E"
```
This runs all 17 scenario tests across the seven scenario classes
(BrowseAndRead / Subscribe / Write / Alarm / History / Reconnect /
ScanState). Each scenario class is independent — failures in one don't
block the rest.
Track the result against `docs/v2/Galaxy.ParityMatrix.md`. Update each
row to:
- **green** if the scenario passes
- **yellow** if it skipped because the dev Galaxy doesn't have the right
shape (see coverage matrix below)
- **red** if it asserted a real delta — those are the deltas that block
PR 7.2; chase each before retiring the legacy backend
## Galaxy shape needed for full coverage
Skip-on-empty-shape scenarios fail-soft today. To turn a skip into a
real result, the dev Galaxy needs the shape in the right column:
| Scenario | Needs |
|---|---|
| `BrowseAndReadParityTests` (3 tests) | Any deployed objects with attributes |
| `SubscribeAndEventRateParityTests` event-rate | ≥5 attributes whose values *change* in 3s |
| `WriteByClassificationParityTests` (FreeAccess/Operate) | A FreeAccess/Operate numeric attribute |
| `WriteByClassificationParityTests` (Configure/Tune) | A Configure/Tune attribute |
| `AlarmTransitionParityTests` (2 tests) | Attributes with the `$Alarm*` extension |
| `HistoryReadParityTests` (historized set) | Attributes with the History extension |
| `ScanStateProbeParityTests` (2 tests) | Multiple `$WinPlatform` / `$AppEngine` objects |
The dev Galaxy from the existing E2E smoke (`gr/seed-phase-7-smoke.sql`)
covers most of these; the multi-platform scenario probably needs
hand-deploying a second `$WinPlatform` instance.
## Soak run
The 24h × 50k soak gates the production confidence half of PR 7.2.
```powershell
$env:OTOPCUA_SOAK_RUN = "1"
$env:OTOPCUA_SOAK_TAGS = "<actual tag count if Galaxy < 50k>"
$env:OTOPCUA_SOAK_MINUTES = "1440" # default 24h; compress for first runs
$env:OTOPCUA_SOAK_DROP_PCT = "0.5"
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
--filter "Category=Soak"
```
The test logs a per-minute CSV-style line to stdout:
```
soak,1.0,received=51234,dispatched=51234,dropped=0,ws_mb=412
soak,2.0,received=102468,dispatched=102468,dropped=0,ws_mb=415
...
```
Capture stdout to a file for post-run analysis. The three guards
(`received` growing, `dropped/received` ratio, working-set delta) all
fire mid-run rather than at end-of-test, so a failure surfaces within
the first few minutes if the architecture is wrong.
## Compressed-tag soak (when Galaxy isn't 50k tags)
A first-pass validation is fine with the override:
```powershell
$env:OTOPCUA_SOAK_RUN = "1"
$env:OTOPCUA_SOAK_TAGS = "500" # whatever the dev Galaxy has
$env:OTOPCUA_SOAK_MINUTES = "60" # one hour is enough to surface plumbing bugs
$env:OTOPCUA_SOAK_DROP_PCT = "1.0"
```
This validates the *plumbing* (bounded channel, pump invariants, leak
guard) but doesn't pin the 50k-tag scaling assertion. Defer the full
50k validation to a customer rig with that scale, or build a synthetic
Galaxy with a script that imports 50k attributes onto a generated UDO
(~2 hours of one-off work).
## Troubleshooting
- **`MxGatewaySkipReason` says "mxaccessgw not reachable"** — the gw
isn't listening, or it's on a different port. `Test-NetConnection
localhost -Port 5120` is the quick check.
- **`MxGatewaySkipReason` says "mxgateway backend boot failed:
RpcException: Unauthenticated"** — API key mismatch. Verify the
`OTOPCUA_PARITY_GW_API_KEY` env var matches the gw's configured key.
- **`LegacySkipReason` says "Galaxy ZB SQL not reachable on
localhost:1433"** — SQL Server isn't running, or its TCP listener is
off. Check `services.msc` for the SQL Server (default) instance.
- **`LegacySkipReason` says "Galaxy.Host EXE not built"** — the parity
harness looks under `src/.../bin/Debug/net48/`. Build it once:
`dotnet build src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host`. Note the
separately-published copy at `C:\publish\OtOpcUaGalaxyHost\` is for
the Windows service; the parity harness spawns its own subprocess.
- **Both halves resolve but parity scenarios assert deltas** — that's
the expected outcome the rig exists to surface. Review each delta
against `docs/v2/Galaxy.ParityMatrix.md`'s "Accepted deltas" section
to decide whether it's a real bug or a pre-accepted divergence.
## After the rig is green
Three further steps before PR 7.2 lands:
1. Promote any newly-discovered accepted-delta to the matrix doc with
the why.
2. Run the full 24h × 50k soak (or compressed tag count if Galaxy isn't
that big) and link the stdout log in the PR description.
3. Pilot the default flip (PR 7.1's `Galaxy.DefaultBackend = "GalaxyMxGateway"`)
on a single production node for ~2 weeks. Watch the OTel/metrics
surface (`docs/v2/Galaxy.Performance.md`) for regressions.
Then 7.2 has the rollback-risk evidence its precondition asks for.