lmxopcua/docs/v2/Galaxy.ParityRig.md
Joseph Doherty 6bf147a113 docs: drop soak + 2-week-pilot as PR 7.2 preconditions
The parity matrix gate is the precondition for retiring the legacy
Galaxy projects. The 24h × 50k soak run and 2-week production pilot
were sketched in early planning as additional safety nets but aren't
operationally applicable for this deployment — there's no separate
production fleet to pilot against, and the soak harness's value is as
ongoing diagnostic infrastructure (still shipped in PR 6.4) rather
than a one-shot release gate.

PR 7.2's only remaining precondition is the matrix being fully green
or carrying documented accepted-deltas — verified 2026-04-30 on the
dev rig: 14 passed / 1 skipped / 0 failed.

Affected:
- docs/v2/Galaxy.ParityMatrix.md "Outstanding deltas" — flips to
  "PR 7.2 is unblocked"
- docs/v2/Galaxy.ParityRig.md "After the rig is green" — drops the
  three-step soak+pilot flow, keeps only the matrix-doc bookkeeping
  follow-up
- lmx_mxgw_impl.md PR 7.2 "Depends on" — replaces "fully soaked"
  with the matrix-green precondition + the verification date

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:51:39 -04:00


Galaxy parity rig — runbook

Brings up both Galaxy backends side-by-side against a single live Galaxy so the parity matrix in docs/v2/Galaxy.ParityMatrix.md and the soak scenario in tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/SoakScenarioTests.cs can run for real. Closing the parity matrix is the gate for PR 7.2 (retire legacy Galaxy projects).

Conceptual layout

Galaxy ZB SQL  ──┬──  OtOpcUaGalaxyHost (NSSM service, net48 x86)
                 │       └── MxAccess COM, ClientName "OtOpcUa-Galaxy.Host"
                 │       └── named pipe "OtOpcUaGalaxy"
                 │              ▲
                 │              │ pipe IPC
                 │              │
                 │       GalaxyProxyDriver  ◄── parity test (legacy half)
                 │
                 └──  mxaccessgw service
                         └── MxAccess COM, ClientName "OtOpcUa-Parity"
                         └── gRPC on http://localhost:5120
                                ▲
                                │ gRPC
                                │
                         GalaxyDriver (in-process)  ◄── parity test (mxgw half)

Both halves talk to the same Galaxy through two distinct MxAccess sessions (different ClientNames so they don't evict each other).

What's already on this dev box

Per ~/.claude/projects/.../memory/:

  • AVEVA System Platform + Galaxy + MXAccess runtime (see project_aveva_platform_installed.md).
  • OtOpcUaGalaxyHost Windows service running as dohertj2, NSSM-wrapped, binary at C:\publish\OtOpcUaGalaxyHost\OtOpcUa.Driver.Galaxy.Host.exe, shared secret at .local/galaxy-host-secret.txt, ZB SQL on localhost:1433 (see project_galaxy_host_installed.md).
  • Parity test project (Driver.Galaxy.ParityTests) committed and skip-clean — runs as soon as the mxgw half resolves.

Setup steps (one-time)

1. Build + run mxaccessgw

The gateway source is at C:\Users\dohertj2\Desktop\mxaccessgw\. Build both halves: the worker has to be x86 net48 (MxAccess COM bitness), while the server is .NET 10:

cd C:\Users\dohertj2\Desktop\mxaccessgw
dotnet build src\MxGateway.Worker -c Release   # produces bin\x86\Release\net48\MxGateway.Worker.exe
dotnet build src\MxGateway.Server -c Release   # produces bin\Release\net10.0\MxGateway.Server.dll

Initialize the auth database and mint an API key. The server binary switches to CLI mode when its first argument is apikey:

$env:MxGateway__ApiKeyPepper = "parity-rig-dev-pepper"   # any stable string for dev
$srv = "C:\Users\dohertj2\Desktop\mxaccessgw\src\MxGateway.Server\bin\Release\net10.0\MxGateway.Server.dll"

dotnet $srv apikey init-db                                # → "init-db: initialized"

dotnet $srv apikey create-key `
  --key-id parity-rig `
  --display-name "OtOpcUa-Parity" `
  --scopes "session:open,session:close,invoke:read,invoke:write,invoke:secure,events:read,metadata:read"
# → "API key: mxgw_parity-rig_<base64suffix>"   ← capture this; you can't list secrets later

Save that exact key string for OTOPCUA_PARITY_GW_API_KEY in step 2.
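Because the secret can't be listed later, a quick shape check before moving on may catch a truncated or placeholder paste. The sketch below assumes the key always looks like mxgw_<key-id>_<suffix>, which is inferred only from the sample output above; the function name looks_like_gateway_key is hypothetical and not part of the gateway.

```python
def looks_like_gateway_key(key: str, key_id: str = "parity-rig") -> bool:
    """Return True if key has the assumed mxgw_<key-id>_<suffix> shape."""
    prefix = f"mxgw_{key_id}_"
    if not key.startswith(prefix):
        return False
    suffix = key[len(prefix):]
    # Reject an empty suffix, stray whitespace, and the unreplaced doc placeholder.
    return bool(suffix) and not any(ch.isspace() for ch in key) and "<" not in suffix


print(looks_like_gateway_key("mxgw_parity-rig_AbC123xyz"))  # True
print(looks_like_gateway_key("parity-suite-key"))           # False
```

This only checks the shape. Whether the key actually authenticates still depends on the pepper matching at server start.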

Run the server with the same pepper plus three env-var overrides, because the defaults don't quite match what gRPC and the parity test need:

$env:MxGateway__ApiKeyPepper = "parity-rig-dev-pepper"   # MUST match the create-key invocation
$env:Kestrel__Endpoints__Http__Url = "http://localhost:5120"
$env:Kestrel__Endpoints__Http__Protocols = "Http2"        # gRPC needs h2c on plain HTTP
$env:MxGateway__Worker__ExecutablePath = `
  "C:\Users\dohertj2\Desktop\mxaccessgw\src\MxGateway.Worker\bin\x86\Release\net48\MxGateway.Worker.exe"
  # appsettings.json's relative path is missing the \net48 segment; absolute path sidesteps that

dotnet $srv
# → "Now listening on: http://localhost:5120"

The worker spawns lazily on the first OpenSession RPC — there's no worker process visible in Task Manager until the first session. If the worker can't spawn, the server returns Failed to open session session-… with a WorkerProcessLaunchException in the server log.

NSSM-wrap it later if the rig becomes long-lived; for first-pass provisioning a console window is easier to inspect.

2. Set the parity env vars

In the test-runner shell:

$env:OTOPCUA_PARITY_GW_ENDPOINT  = "http://localhost:5120"
$env:OTOPCUA_PARITY_GW_API_KEY   = "mxgw_parity-rig_<base64suffix>"   # the exact key string captured in step 1
$env:OTOPCUA_PARITY_CLIENT_NAME  = "OtOpcUa-Parity"

Elevation status doesn't matter — the legacy Galaxy.Host pipe ACL accepts elevated and non-elevated dohertj2 shells alike (the Administrators deny ACE was removed 2026-04-24; see project_galaxy_host_installed.md).
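A small preflight can confirm the shell is ready before running dotnet test. This is a sketch, not part of the harness: the variable names come from the block above, while the preflight function and its rules (an http(s) endpoint, a client name distinct from the legacy host's "OtOpcUa-Galaxy.Host") are assumptions drawn from this runbook.

```python
import os
from urllib.parse import urlsplit

REQUIRED = (
    "OTOPCUA_PARITY_GW_ENDPOINT",
    "OTOPCUA_PARITY_GW_API_KEY",
    "OTOPCUA_PARITY_CLIENT_NAME",
)


def preflight(env=None):
    """Return a list of human-readable problems; an empty list means the shell looks ready."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    endpoint = env.get("OTOPCUA_PARITY_GW_ENDPOINT", "")
    if endpoint:
        parts = urlsplit(endpoint)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            problems.append(f"endpoint {endpoint!r} is not an http(s) URL")
    # The legacy half uses ClientName "OtOpcUa-Galaxy.Host"; sharing it risks session eviction.
    if env.get("OTOPCUA_PARITY_CLIENT_NAME") == "OtOpcUa-Galaxy.Host":
        problems.append("client name collides with the legacy Galaxy.Host session")
    return problems


example = {
    "OTOPCUA_PARITY_GW_ENDPOINT": "http://localhost:5120",
    "OTOPCUA_PARITY_GW_API_KEY": "mxgw_parity-rig_example",
    "OTOPCUA_PARITY_CLIENT_NAME": "OtOpcUa-Parity",
}
print(preflight(example))  # []
```

Calling preflight() with no argument inspects the current process environment instead of the example dictionary.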

3. Verify both halves resolve

cd C:\Users\dohertj2\Desktop\lmxopcua
dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
  --filter "FullyQualifiedName~HarnessShapeTests"

Harness_records_a_skip_reason_for_each_unavailable_backend is the two-line truth-teller:

  • LegacyDriver and MxGatewayDriver both non-null → rig is up.
  • One side null → read its LegacySkipReason / MxGatewaySkipReason and fix.

Running the matrix

Once both halves resolve:

dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
  --filter "Category=ParityE2E"

This runs all 17 scenario tests across the seven scenario classes (BrowseAndRead / Subscribe / Write / Alarm / History / Reconnect / ScanState). Each scenario class is independent — failures in one don't block the rest.

Track the result against docs/v2/Galaxy.ParityMatrix.md. Update each row to:

  • green if the scenario passes
  • yellow if it skipped because the dev Galaxy doesn't have the right shape (see coverage matrix below)
  • red if it asserted a real delta — those are the deltas that block PR 7.2; chase each before retiring the legacy backend
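The color rule above can be sketched as a lookup over test outcomes. The outcome strings passed / skipped / failed mirror the dotnet test summary wording; the scenario names in the example and the color_rows helper are illustrative assumptions, not identifiers from the matrix doc.

```python
from collections import Counter

COLOR = {"passed": "green", "skipped": "yellow", "failed": "red"}


def color_rows(outcomes):
    """Map {scenario: outcome} to {scenario: color}; an unknown outcome raises KeyError."""
    return {scenario: COLOR[outcome] for scenario, outcome in outcomes.items()}


rows = color_rows({"BrowseAndRead": "passed", "ScanState": "skipped", "Write": "passed"})
print(rows["ScanState"])          # yellow
print(Counter(rows.values()))     # tally of colors for the matrix summary
```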

Galaxy shape needed for full coverage

Skip-on-empty-shape scenarios fail soft today. To turn a skip into a real result, the dev Galaxy needs the shape listed in the "Needs" column:

| Scenario | Needs | Local rig |
|---|---|---|
| BrowseAndReadParityTests (3 tests) | Any deployed objects with attributes | existing seed |
| SubscribeAndEventRateParityTests event-rate | ≥5 attributes whose values change in 3s | ⚙ scriptable via graccess-cli |
| WriteByClassificationParityTests (FreeAccess/Operate) | A FreeAccess/Operate numeric attribute | ⚙ scriptable via graccess-cli |
| WriteByClassificationParityTests (Configure/Tune) | A Configure/Tune attribute | ⚙ scriptable via graccess-cli |
| AlarmTransitionParityTests (2 tests) | Attributes with the $Alarm* extension | ⚙ scriptable via graccess-cli |
| HistoryReadParityTests (historized set) | Attributes with the History extension | ⚙ scriptable via graccess-cli |
| ScanStateProbeParityTests (2 tests) | Multiple $WinPlatform / $AppEngine objects | deferred to customer rig; this dev box is provisioned for one platform only |

The single-platform constraint

The dev box at DESKTOP-6JL3KKO is licensed / configured for a single deployed $WinPlatform. Adding a second platform isn't feasible here, so ScanStateProbeParityTests will skip in a "no overlap" branch on this rig. Both of its scenarios already handle that case gracefully (Assert.Skip("no overlapping platform hosts between backends — likely the transport names differ but no $WinPlatform was discovered")), so the matrix reports them as n/a (deferred) rather than red.

Plan: defer the two ScanState scenarios to a customer rig with multiple platforms. The PR 7.2 gate accepts "n/a, deferred" on these rows provided the legacy GalaxyRuntimeProbeManager and the in-process PerPlatformProbeWatcher have matching unit-test coverage of the state-decoder + member-tracking logic — which they do (PR 4.7's tests). Treat the runtime parity check as a customer-rig acceptance gate before that customer goes live, not a precondition for retiring the legacy projects on this dev box.
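The "no overlap" branch amounts to a set intersection between the platform hosts each backend discovers. The sketch below shows that decision under stated assumptions: scan_state_row is a hypothetical helper, and the case-insensitive comparison is a guess, since the real tests may compare host names exactly.

```python
def scan_state_row(legacy_hosts, mxgw_hosts):
    """Return (matrix status, overlapping hosts) for a ScanState row."""
    # Case-insensitive comparison is an assumption; adjust if the backends report exact names.
    overlap = {h.lower() for h in legacy_hosts} & {h.lower() for h in mxgw_hosts}
    if not overlap:
        return "n/a (deferred)", []
    return "compare", sorted(overlap)


print(scan_state_row([], ["PlatformA"]))             # ('n/a (deferred)', [])
print(scan_state_row(["PlatformA"], ["platforma"]))  # ('compare', ['platforma'])
```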

Provisioning the rest via graccess-cli

C:\Users\dohertj2\Desktop\graccess\graccess_cli\ is a .NET Framework 4.8 console app over the ArchestrA GRAccess COM API. It can configure templates, instances, attributes, UDAs, extensions, and attribute security — i.e. every row above marked ⚙ scriptable. Full surface in graccess/graccess_cli/docs/usage.md and per-area workflow guides (attribute-editing.md, template-editing.md, template-instance-editing.md).

Reserve a sandbox UDO (e.g. OtOpcUaParityTest) to avoid mutating attributes on plant-relevant objects. Concrete commands per requirement:

A FreeAccess/Operate numeric attribute (covers WriteByClassification FreeAccess/Operate scenario):

graccess object uda add `
  --galaxy ZB --name OtOpcUaParityTest --type template `
  --uda OperateValue --data-type MxFloat `
  --category MxCategoryWriteable_C --security MxSecurityOperate `
  --confirm --confirm-target OtOpcUaParityTest

A Configure / Tune attribute (covers WriteByClassification Configure/Tune scenario):

# Tune
graccess object uda add `
  --galaxy ZB --name OtOpcUaParityTest --type template `
  --uda TuneValue --data-type MxFloat `
  --category MxCategoryWriteable_T --security MxSecurityTune `
  --confirm --confirm-target OtOpcUaParityTest

# Configure
graccess object uda add `
  --galaxy ZB --name OtOpcUaParityTest --type template `
  --uda ConfigValue --data-type MxFloat `
  --category MxCategoryWriteable_C --security MxSecurityConfigure `
  --confirm --confirm-target OtOpcUaParityTest

A changing-value attribute (covers Subscribe event-rate scenario). Two ways:

  1. On-scan increment — bind a script extension that bumps a counter each scan. Simplest to author with object extension add against ScriptExtension plus object attribute set for the script body (see attribute-editing.md §"Edit Extensions" for the pattern).
  2. External writer loop — leave the attribute as plain Float and run a one-liner that writes incrementing values from the parity-test shell. Uses the legacy backend path so it's available before the mxgw subscriber is up. This keeps the Galaxy template clean.

For first-pass validation pick #2 — no template surgery needed, and the write loop runs only during dotnet test.
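Option 2 can be sketched as a loop that pushes incrementing values at a fixed interval. The write callable is a stand-in: this runbook doesn't show the legacy-path write API, so the example records values in a list, and write_incrementing plus its parameters are hypothetical.

```python
import time


def write_incrementing(write, count, interval_s=0.5, start=0.0, sleep=time.sleep):
    """Call write(value) count times with start, start+1, ...; pause between writes."""
    for i in range(count):
        write(start + i)
        if i < count - 1:
            sleep(interval_s)


# Stand-in for the real backend write; swap in the legacy-path write call.
written = []
write_incrementing(written.append, count=5, interval_s=0.0)
print(written)  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

With the default 0.5 s interval each attribute changes several times inside the 3 s window the event-rate scenario looks at. To cover the ≥5 attributes it needs, run one writer per attribute.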

Attributes with the $Alarm* extension (covers AlarmTransition scenario). Per attribute-editing.md §"Edit Alarm Settings", the attribute names vary by extension type (Limit, RateOfChange, etc.). Add the extension via:

graccess object extension add `
  --galaxy ZB --name OtOpcUaParityTest --type template `
  --extension-type AnalogLimitAlarm --primitive AlarmInput `
  --object-extension `
  --confirm --confirm-target OtOpcUaParityTest

Inspect the template first via object attributes to see which names the extension introduces (they differ across AVEVA versions), then set the HiHi/Hi/Lo/LoLo limit values and priority on those attributes via object attribute set.

Attributes with the History extension (covers HistoryRead routing scenario). History settings are usually exposed either as attributes or as extension attributes; attribute-editing.md §"Edit History Settings" covers the discovery flow. Quick start:

graccess object extension add `
  --galaxy ZB --name OtOpcUaParityTest --type template `
  --extension-type HistoryExtension --primitive HistoryRecord `
  --object-extension `
  --confirm --confirm-target OtOpcUaParityTest

# Then enable history on whichever attribute the extension points at
graccess object attribute set `
  --galaxy ZB --name OtOpcUaParityTest --type template `
  --attribute HistoryEnabled --value true --data-type bool `
  --confirm --confirm-target OtOpcUaParityTest

Deploy + restart Galaxy.Host after any of the above so MxAccess sees the change:

graccess object deploy --galaxy ZB --name OtOpcUaParityTest_001 `
  --confirm --confirm-target OtOpcUaParityTest_001
Restart-Service OtOpcUaGalaxyHost

Then re-run the parity matrix. The previously-skipped scenarios should now find a sandbox attribute matching their selector and assert.

Soak run

The 24h × 50k soak is not a PR 7.2 gate. The harness ships in PR 6.4 as ongoing diagnostic infrastructure, and a run is configured like this:

$env:OTOPCUA_SOAK_RUN      = "1"
$env:OTOPCUA_SOAK_TAGS     = "<actual tag count if Galaxy < 50k>"
$env:OTOPCUA_SOAK_MINUTES  = "1440"   # default 24h; compress for first runs
$env:OTOPCUA_SOAK_DROP_PCT = "0.5"

dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests\ `
  --filter "Category=Soak"

The test logs a per-minute CSV-style line to stdout:

soak,1.0,received=51234,dispatched=51234,dropped=0,ws_mb=412
soak,2.0,received=102468,dispatched=102468,dropped=0,ws_mb=415
...

Capture stdout to a file for post-run analysis. The three guards (received growing, dropped/received ratio, working-set delta) all fire mid-run rather than at end-of-test, so a failure surfaces within the first few minutes if the architecture is wrong.
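For post-run analysis, the captured lines can be parsed and re-checked offline. The sketch below assumes the line format shown above and re-implements the three guards loosely: treating OTOPCUA_SOAK_DROP_PCT as a percentage and capping working-set growth at 100 MB are assumptions, not the test's actual thresholds, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SoakSample:
    minute: float
    received: int
    dispatched: int
    dropped: int
    ws_mb: int


def parse_soak_line(line: str) -> SoakSample:
    """Parse one 'soak,<minute>,key=value,...' line into a SoakSample."""
    tag, minute, *pairs = line.strip().split(",")
    if tag != "soak":
        raise ValueError(f"not a soak line: {line!r}")
    fields = {k: int(v) for k, v in (p.split("=", 1) for p in pairs)}
    return SoakSample(float(minute), fields["received"], fields["dispatched"],
                      fields["dropped"], fields["ws_mb"])


def first_violation(samples, drop_pct=0.5, max_ws_growth_mb=100):
    """Return a description of the first guard violation, or None if every sample passes."""
    for i, s in enumerate(samples):
        if i > 0 and s.received <= samples[i - 1].received:
            return f"minute {s.minute}: received is not growing"
        if s.received > 0 and 100.0 * s.dropped / s.received > drop_pct:
            return f"minute {s.minute}: dropped/received above {drop_pct}%"
        # Working-set growth is measured against the first sample (assumed baseline).
        if s.ws_mb - samples[0].ws_mb > max_ws_growth_mb:
            return f"minute {s.minute}: working set grew more than {max_ws_growth_mb} MB"
    return None


log = [
    "soak,1.0,received=51234,dispatched=51234,dropped=0,ws_mb=412",
    "soak,2.0,received=102468,dispatched=102468,dropped=0,ws_mb=415",
]
print(first_violation([parse_soak_line(line) for line in log]))  # None
```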

Compressed-tag soak (when Galaxy isn't 50k tags)

A first-pass validation is fine with the override:

$env:OTOPCUA_SOAK_RUN      = "1"
$env:OTOPCUA_SOAK_TAGS     = "500"      # whatever the dev Galaxy has
$env:OTOPCUA_SOAK_MINUTES  = "60"       # one hour is enough to surface plumbing bugs
$env:OTOPCUA_SOAK_DROP_PCT = "1.0"

This validates the plumbing (bounded channel, pump invariants, leak guard) but doesn't pin the 50k-tag scaling assertion. Defer the full 50k validation to a customer rig with that scale, or build a synthetic Galaxy with a script that imports 50k attributes onto a generated UDO (~2 hours of one-off work).

Troubleshooting

  • MxGatewaySkipReason says "mxaccessgw not reachable" — the gw isn't listening, or it's on a different port. Test-NetConnection localhost -Port 5120 is the quick check.
  • MxGatewaySkipReason says "mxgateway backend boot failed: RpcException: Unauthenticated" — API key mismatch. Verify the OTOPCUA_PARITY_GW_API_KEY env var matches the gw's configured key.
  • LegacySkipReason says "Galaxy ZB SQL not reachable on localhost:1433" — SQL Server isn't running, or its TCP listener is off. Check services.msc for the SQL Server (default) instance.
  • LegacySkipReason says "Galaxy.Host EXE not built" — the parity harness looks under src/.../bin/Debug/net48/. Build it once: dotnet build src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host. Note the separately-published copy at C:\publish\OtOpcUaGalaxyHost\ is for the Windows service; the parity harness spawns its own subprocess.
  • Both halves resolve but parity scenarios assert deltas — that's the expected outcome the rig exists to surface. Review each delta against docs/v2/Galaxy.ParityMatrix.md's "Accepted deltas" section to decide whether it's a real bug or a pre-accepted divergence.
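The two reachability checks above (gateway on 5120, ZB SQL on 1433) can also be scripted from the test-runner shell. This is a minimal TCP-connect sketch, roughly what Test-NetConnection reports. It says nothing about authentication or whether the listener speaks gRPC or TDS.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, port in (("mxaccessgw gRPC", 5120), ("Galaxy ZB SQL", 1433)):
    state = "open" if port_open("localhost", port) else "closed"
    print(f"{name} on localhost:{port}: {state}")
```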

After the rig is green

When the matrix is fully green or carries documented accepted-deltas, PR 7.2 (legacy project deletion) is unblocked. The only follow-up is to promote any newly-discovered accepted-delta to the matrix doc with the why so the matrix history stays auditable.
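One reading of that gate, sketched below, is that every non-green row must be covered by a documented accepted-delta or deferral. That interpretation, the pr72_gate helper, and the row names are assumptions for illustration; the matrix doc remains the authority.

```python
def pr72_gate(rows, documented=frozenset()):
    """rows maps scenario name to 'green', 'yellow', or 'red'.

    Returns (unblocked, blocking): any non-green row must be listed in documented.
    """
    blocking = sorted(name for name, color in rows.items()
                      if color != "green" and name not in documented)
    return (not blocking, blocking)


rows = {"BrowseAndRead": "green", "Write": "green", "ScanState": "yellow"}
print(pr72_gate(rows))                            # (False, ['ScanState'])
print(pr72_gate(rows, documented={"ScanState"}))  # (True, [])
```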