lmxopcua/docs/v2/Galaxy.ParityRig.md
Joseph Doherty c55da145ec docs: add Galaxy parity rig runbook
Walks through standing up both Galaxy backends side-by-side against a
single live Galaxy:

- Conceptual layout (two MxAccess sessions on distinct ClientNames so
  they don't evict each other)
- What's already on the dev box (AVEVA + OtOpcUaGalaxyHost service)
- mxaccessgw build + run + config (API key, ClientName)
- The three OTOPCUA_PARITY_* env vars the harness reads
- HarnessShapeTests as the two-line truth-teller for "did both halves
  resolve"
- Galaxy-shape coverage matrix mapping each scenario to what's needed
  for it to assert (rather than skip)
- Soak run recipes, including the compressed-tag fallback when the dev
  Galaxy doesn't have 50k attributes
- Troubleshooting for the four common SkipReasons
- Three further gates before PR 7.2 lands (matrix green, soak data,
  pilot flip)

Explicitly drops the stale "use a non-elevated shell" precondition —
the legacy Galaxy.Host pipe ACL accepts elevated and non-elevated
dohertj2 alike (resolved 2026-04-24).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:08:43 -04:00


Galaxy parity rig — runbook

Brings up both Galaxy backends side-by-side against a single live Galaxy so the parity matrix in docs/v2/Galaxy.ParityMatrix.md and the soak scenario in tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/SoakScenarioTests.cs can run for real. Closing the parity matrix is the gate for PR 7.2 (retire legacy Galaxy projects).

Conceptual layout

Galaxy ZB SQL  ──┬──  OtOpcUaGalaxyHost (NSSM service, net48 x86)
                 │       └── MxAccess COM, ClientName "OtOpcUa-Galaxy.Host"
                 │       └── named pipe "OtOpcUaGalaxy"
                 │              ▲
                 │              │ pipe IPC
                 │              │
                 │       GalaxyProxyDriver  ◄── parity test (legacy half)
                 │
                 └──  mxaccessgw service
                         └── MxAccess COM, ClientName "OtOpcUa-Parity"
                         └── gRPC on http://localhost:5120
                                ▲
                                │ gRPC
                                │
                         GalaxyDriver (in-process)  ◄── parity test (mxgw half)

Both halves talk to the same Galaxy through two distinct MxAccess sessions (different ClientNames so they don't evict each other).

What's already on this dev box

Per ~/.claude/projects/.../memory/:

  • AVEVA System Platform + Galaxy + MXAccess runtime (see project_aveva_platform_installed.md).
  • OtOpcUaGalaxyHost Windows service running as dohertj2, NSSM-wrapped, binary at C:\publish\OtOpcUaGalaxyHost\OtOpcUa.Driver.Galaxy.Host.exe, shared secret at .local/galaxy-host-secret.txt, ZB SQL on localhost:1433 (see project_galaxy_host_installed.md).
  • Parity test project (Driver.Galaxy.ParityTests) committed and skip-clean — runs as soon as the mxgw half resolves.

Setup steps (one-time)

1. Build + run mxaccessgw

The gateway source is at c:\Users\dohertj2\Desktop\mxaccessgw\. From that repo:

cd C:\Users\dohertj2\Desktop\mxaccessgw
dotnet publish src\MxGateway.Server -c Release -o C:\publish\MxAccessGw

Configure:

  • An API key. Pick anything stable (e.g. parity-suite-key) and put it in whichever config file MxGateway.Server reads — see mxaccessgw/gateway.md for the current shape.
  • ClientName for the worker's MxAccess registration — set to OtOpcUa-Parity so it doesn't collide with OtOpcUa-Galaxy.Host.
  • Bind to http://localhost:5120 (default in launchSettings.json).

Run it as a console app for the first session — easier to inspect logs. NSSM-wrap it later if the rig becomes long-lived:

C:\publish\MxAccessGw\MxGateway.Server.exe

The worker should log a successful Register against MxAccess after a few seconds. If it loops on Register failures, that's an MxAccess-side problem — the legacy OtOpcUaGalaxyHost going through the same COM stack is a known-good reference point.

2. Set the parity env vars

In the test-runner shell:

$env:OTOPCUA_PARITY_GW_ENDPOINT  = "http://localhost:5120"
$env:OTOPCUA_PARITY_GW_API_KEY   = "parity-suite-key"   # match the gw config
$env:OTOPCUA_PARITY_CLIENT_NAME  = "OtOpcUa-Parity"

Elevation status doesn't matter — the legacy Galaxy.Host pipe ACL accepts elevated and non-elevated dohertj2 shells alike (the Administrators deny ACE was removed 2026-04-24; see project_galaxy_host_installed.md).
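
Before launching the tests, it can help to confirm that the shell actually carries all three variables. The following is a minimal Python sketch of such a preflight check; the variable names come from the step above, while the helper name `missing_parity_vars` is hypothetical and not part of the repo.

```python
import os

# Variable names are taken from this runbook; the helper itself is hypothetical.
REQUIRED_VARS = (
    "OTOPCUA_PARITY_GW_ENDPOINT",
    "OTOPCUA_PARITY_GW_API_KEY",
    "OTOPCUA_PARITY_CLIENT_NAME",
)


def missing_parity_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_parity_vars()` from the test-runner shell returns an empty list when the environment is complete. It only checks presence; it cannot tell whether the API key matches the gateway's configured key.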

3. Verify both halves resolve

cd C:\Users\dohertj2\Desktop\lmxopcua
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
  --filter "FullyQualifiedName~HarnessShapeTests"

Harness_records_a_skip_reason_for_each_unavailable_backend is the two-line truth-teller:

  • LegacyDriver and MxGatewayDriver both non-null → rig is up.
  • One side null → read its LegacySkipReason / MxGatewaySkipReason and fix.
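
The decision rule in the two bullets above can be written out as a small function. This is an illustrative Python sketch only: the real harness is C#, and the function name `rig_verdict` and its argument shapes are assumptions made for the example. The four property names mirror the ones mentioned above.

```python
def rig_verdict(legacy_driver, mxgw_driver,
                legacy_skip_reason=None, mxgw_skip_reason=None):
    """Return (is_up, problems) following the two-bullet rule above."""
    problems = []
    if legacy_driver is None:
        # Surface the recorded reason so the operator knows what to fix.
        problems.append("LegacySkipReason: "
                        + (legacy_skip_reason or "none recorded"))
    if mxgw_driver is None:
        problems.append("MxGatewaySkipReason: "
                        + (mxgw_skip_reason or "none recorded"))
    return (not problems, problems)
```

The rig counts as up only when the problem list is empty; any other result points at the skip reason to read first.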

Running the matrix

Once both halves resolve:

dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
  --filter "Category=ParityE2E"

This runs all 17 scenario tests across the seven scenario classes (BrowseAndRead / Subscribe / Write / Alarm / History / Reconnect / ScanState). Each scenario class is independent — failures in one don't block the rest.

Track the result against docs/v2/Galaxy.ParityMatrix.md. Update each row to:

  • green if the scenario passes
  • yellow if it skipped because the dev Galaxy doesn't have the right shape (see coverage matrix below)
  • red if it asserted a real delta — those are the deltas that block PR 7.2; chase each before retiring the legacy backend
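
As a sketch of the colour rule above, the mapping from a test outcome to a matrix row status could look like this in Python. The outcome strings follow the Passed / Failed / Skipped names that `dotnet test` reports; the helper `matrix_status` and the `skipped_for_galaxy_shape` flag are hypothetical and would have to be filled in by whoever reads the skip message.

```python
def matrix_status(outcome, skipped_for_galaxy_shape=False):
    """Map one scenario outcome to a matrix row colour."""
    outcome = outcome.strip().lower()
    if outcome == "passed":
        return "green"
    if outcome == "skipped" and skipped_for_galaxy_shape:
        # Only shape-related skips count as yellow.
        return "yellow"
    if outcome == "failed":
        return "red"
    # For example, a skip caused by an unavailable backend: fix the rig first.
    return "unclassified"
```

A skip that is not caused by Galaxy shape is deliberately left unclassified here, since it says something about the rig rather than about parity.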

Galaxy shape needed for full coverage

Skip-on-empty-shape scenarios fail-soft today. To turn a skip into a real result, the dev Galaxy needs the shape in the right column:

| Scenario | Needs |
| --- | --- |
| BrowseAndReadParityTests (3 tests) | Any deployed objects with attributes |
| SubscribeAndEventRateParityTests event-rate | ≥5 attributes whose values change in 3s |
| WriteByClassificationParityTests (FreeAccess/Operate) | A FreeAccess/Operate numeric attribute |
| WriteByClassificationParityTests (Configure/Tune) | A Configure/Tune attribute |
| AlarmTransitionParityTests (2 tests) | Attributes with the $Alarm* extension |
| HistoryReadParityTests (historized set) | Attributes with the History extension |
| ScanStateProbeParityTests (2 tests) | Multiple $WinPlatform / $AppEngine objects |

The dev Galaxy from the existing E2E smoke (gr/seed-phase-7-smoke.sql) covers most of these; the multi-platform scenario probably needs hand-deploying a second $WinPlatform instance.

Soak run

The 24h × 50k-tag soak gates the production-confidence half of PR 7.2.

$env:OTOPCUA_SOAK_RUN      = "1"
$env:OTOPCUA_SOAK_TAGS     = "<actual tag count if Galaxy < 50k>"
$env:OTOPCUA_SOAK_MINUTES  = "1440"   # default 24h; compress for first runs
$env:OTOPCUA_SOAK_DROP_PCT = "0.5"

dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
  --filter "Category=Soak"

The test logs a per-minute CSV-style line to stdout:

soak,1.0,received=51234,dispatched=51234,dropped=0,ws_mb=412
soak,2.0,received=102468,dispatched=102468,dropped=0,ws_mb=415
...

Capture stdout to a file for post-run analysis. The three guards (received growing, dropped/received ratio, working-set delta) all fire mid-run rather than at end-of-test, so a failure surfaces within the first few minutes if the architecture is wrong.
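
For the post-run analysis, a captured log can be summarised with a short script. The Python sketch below parses lines in the format shown above and recomputes rough equivalents of the three guards. The field layout is taken from the sample lines; the names `parse_soak_line` and `summarize` are hypothetical, and treating the drop limit as a percentage of received events is an assumption, since the test's actual thresholds are not stated here.

```python
from dataclasses import dataclass


@dataclass
class SoakSample:
    minute: float
    received: int
    dispatched: int
    dropped: int
    ws_mb: int


def parse_soak_line(line):
    """Parse one 'soak,<minute>,key=value,...' line; return None otherwise."""
    parts = line.strip().split(",")
    if len(parts) != 6 or parts[0] != "soak":
        return None
    try:
        fields = dict(p.split("=", 1) for p in parts[2:])
        return SoakSample(float(parts[1]), int(fields["received"]),
                          int(fields["dispatched"]), int(fields["dropped"]),
                          int(fields["ws_mb"]))
    except (KeyError, ValueError):
        return None


def summarize(lines, drop_pct_limit=0.5):
    """Recompute the three guard quantities from captured stdout lines."""
    samples = [s for s in map(parse_soak_line, lines) if s is not None]
    if not samples:
        return None
    first, last = samples[0], samples[-1]
    # Minutes at which the cumulative received count failed to grow.
    stalled = [b.minute for a, b in zip(samples, samples[1:])
               if b.received <= a.received]
    drop_pct = 100.0 * last.dropped / last.received if last.received else 0.0
    return {
        "samples": len(samples),
        "stalled_minutes": stalled,
        "drop_pct": drop_pct,
        "drop_ok": drop_pct <= drop_pct_limit,  # assumed: limit is a percentage
        "ws_delta_mb": last.ws_mb - first.ws_mb,
    }
```

Feeding the file's lines to `summarize` skips anything that is not a soak line, so ordinary test-runner output mixed into the capture does not need to be stripped first.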

Compressed-tag soak (when Galaxy isn't 50k tags)

A first-pass validation is fine with the override:

$env:OTOPCUA_SOAK_RUN      = "1"
$env:OTOPCUA_SOAK_TAGS     = "500"      # whatever the dev Galaxy has
$env:OTOPCUA_SOAK_MINUTES  = "60"       # one hour is enough to surface plumbing bugs
$env:OTOPCUA_SOAK_DROP_PCT = "1.0"

This validates the plumbing (bounded channel, pump invariants, leak guard) but doesn't pin the 50k-tag scaling assertion. Defer the full 50k validation to a customer rig with that scale, or build a synthetic Galaxy with a script that imports 50k attributes onto a generated UDO (~2 hours of one-off work).

Troubleshooting

  • MxGatewaySkipReason says "mxaccessgw not reachable" — the gw isn't listening, or it's on a different port. Test-NetConnection localhost -Port 5120 is the quick check.
  • MxGatewaySkipReason says "mxgateway backend boot failed: RpcException: Unauthenticated" — API key mismatch. Verify the OTOPCUA_PARITY_GW_API_KEY env var matches the gw's configured key.
  • LegacySkipReason says "Galaxy ZB SQL not reachable on localhost:1433" — SQL Server isn't running, or its TCP listener is off. Check services.msc for the SQL Server (default) instance.
  • LegacySkipReason says "Galaxy.Host EXE not built" — the parity harness looks under src/.../bin/Debug/net48/. Build it once: dotnet build src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host. Note the separately-published copy at C:\publish\OtOpcUaGalaxyHost\ is for the Windows service; the parity harness spawns its own subprocess.
  • Both halves resolve but parity scenarios assert deltas — that's the expected outcome the rig exists to surface. Review each delta against docs/v2/Galaxy.ParityMatrix.md's "Accepted deltas" section to decide whether it's a real bug or a pre-accepted divergence.
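
The two "not reachable" cases above both come down to a TCP listener check. As a cross-check alongside Test-NetConnection, here is a minimal Python sketch that probes both endpoints named in this runbook. The helper name `tcp_reachable` and the `RIG_ENDPOINTS` table are hypothetical; the hosts and ports are the ones used above.

```python
import socket

# Hosts and ports as used in this runbook; the labels are for display only.
RIG_ENDPOINTS = {
    "mxaccessgw gRPC": ("localhost", 5120),
    "Galaxy ZB SQL": ("localhost", 1433),
}


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_rig(endpoints=RIG_ENDPOINTS):
    """Return {label: reachable} for every endpoint in the table."""
    return {label: tcp_reachable(host, port)
            for label, (host, port) in endpoints.items()}
```

A successful connect only shows that something is listening on the port. It does not validate the API key or the SQL login, so an Unauthenticated failure can still follow a clean probe.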

After the rig is green

Three further steps before PR 7.2 lands:

  1. Promote any newly discovered accepted delta to the matrix doc, with the reason it is accepted.
  2. Run the full 24h × 50k soak (or compressed tag count if Galaxy isn't that big) and link the stdout log in the PR description.
  3. Pilot the default flip (PR 7.1's Galaxy.DefaultBackend = "GalaxyMxGateway") on a single production node for ~2 weeks. Watch the OTel/metrics surface (docs/v2/Galaxy.Performance.md) for regressions.

Then 7.2 has the rollback-risk evidence its precondition asks for.