Files
lmxopcua/docs/v2/Galaxy.ParityMatrix.md
Joseph Doherty 6bf147a113 docs: drop soak + 2-week-pilot as PR 7.2 preconditions
The parity matrix gate is the precondition for retiring the legacy
Galaxy projects. The 24h × 50k soak run and 2-week production pilot
were sketched in early planning as additional safety nets but aren't
operationally applicable for this deployment — there's no separate
production fleet to pilot against, and the soak harness's value is as
ongoing diagnostic infrastructure (still shipped in PR 6.4) rather
than a one-shot release gate.

PR 7.2's only remaining precondition is the matrix being fully green
or carrying documented accepted-deltas — verified 2026-04-30 on the
dev rig: 14 passed / 1 skipped / 0 failed.

Affected:
- docs/v2/Galaxy.ParityMatrix.md "Outstanding deltas" — flips to
  "PR 7.2 is unblocked"
- docs/v2/Galaxy.ParityRig.md "After the rig is green" — drops the
  three-step soak+pilot flow, keeps only the matrix-doc bookkeeping
  follow-up
- lmx_mxgw_impl.md PR 7.2 "Depends on" — replaces "fully soaked"
  with the matrix-green precondition + the verification date

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:51:39 -04:00

150 lines
8.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Galaxy backend parity matrix
This document tracks the scenario × result matrix that the
`Driver.Galaxy.ParityTests` suite drives against both Galaxy backends —
the legacy out-of-process **Galaxy.Host** (.NET 4.8 x86 + MXAccess COM,
fronted by `GalaxyProxyDriver`) and the new in-process **mxgateway**
backend (`GalaxyDriver`, .NET 10 + gRPC against `mxaccessgw`).
Maintained alongside Phase 5 (PR 5.W). The Phase 7 default flip
(PR 7.1) consumes this matrix as its go/no-go gate — every row must be
either green or carry an explicit *accepted-delta* justification.
## Reading the matrix
- **Status: green** — the scenario asserts strict parity and passes
(or skips cleanly when the rig isn't up).
- **Status: yellow** — soft pin only (count or shape parity, not value
parity) — acceptable when the underlying COM/gRPC stacks have known
divergences in raw payloads but the surface presented to the
DriverNodeManager is equivalent.
- **Status: red** — divergence detected. Row carries a fix or a
follow-up task ID.
## Scenarios
Last verified end-to-end on the dev parity rig: **2026-04-30**
(legacy `OtOpcUaGalaxyHost` mxaccess backend; mxaccessgw v1.x at
`http://localhost:5120`; sandbox `OtOpcUaParityTest_001` deployed in
the `ZB` galaxy; 13 passed / 1 skipped / 0 failed in 19 minutes).
| PR | Test class | Scenario | Status | Notes |
|----|-----------|----------|--------|-------|
| 5.2 | `BrowseAndReadParityTests` | Same variable set | green | symmetric set diff on full-reference set, after `[]` array-suffix workaround in `GalaxyDiscoverer` |
| 5.2 | `BrowseAndReadParityTests` | Same DataType / SecurityClass / IsHistorized | green | per-attribute meta triple parity |
| 5.2 | `BrowseAndReadParityTests` | Same StatusCode-class on a sampled read | yellow | pins status class (Bad/Uncertain/Good); CLR type intentionally not asserted — see "Accepted deltas" #6 |
| 5.3 | `SubscribeAndEventRateParityTests` | Subscribe returns a handle on each backend | green | symmetric Unsubscribe cleanup |
| 5.3 | `SubscribeAndEventRateParityTests` | Event rate within ±50% over 3s | yellow | both backends fed by the same upstream MXAccess subscriptions; tolerance absorbs scheduler jitter |
| 5.4 | `WriteByClassificationParityTests` | FreeAccess / Operate write status-class parity | yellow | pins status class only; legacy flat-maps every failure to BadInternalError, mxgw distinguishes (BadCommunicationError, BadDeviceFailure, etc.) — see "Accepted deltas" #7 |
| 5.4 | `WriteByClassificationParityTests` | Configure / Tune routes via secured-write | yellow | same status-class pin |
| 5.5 | `AlarmTransitionParityTests` | Same alarm-condition source-node-id set | green | one-way invariant on sub-attribute refs (legacy populated → mxgw matches; legacy null → mxgw free to populate per AlarmRefBuilder) |
| 5.5 | `AlarmTransitionParityTests` | IsAlarm-marked variable count parity | green | soft pin — count must match, doesn't have to be non-zero |
| 5.6 | `HistoryReadParityTests` | Same historized attribute set | green | what HistoryRouter consumes when routing to the Wonderware sidecar |
| 5.6 | `HistoryReadParityTests` | New mxgw GalaxyDriver does not implement `IHistoryProvider` | green | architectural pin from Phase 1 (PR 1.3) on the *new* path; legacy `GalaxyProxyDriver` keeps the interface for back-compat until PR 7.2 — see "Accepted deltas" #8 |
| 5.7 | `ReconnectParityTests` | Reinitialize → both Healthy + reads succeed | green | recovery latency is *not* pinned (legacy: pipe + COM client; mxgw: re-Register gw session) |
| 5.7 | `ReconnectParityTests` | Health diverges only when one side recovers | yellow | soft pin until a toxiproxy-style fault injector lands |
| 5.8 | `ScanStateProbeParityTests` | Same per-platform host set | n/a — deferred | dev rig is licensed for one `$WinPlatform` only; multi-platform parity deferred to a customer rig (PR 4.7's unit tests pin the state-decoder + member-tracking logic) |
| 5.8 | `ScanStateProbeParityTests` | Same `HostState` per overlapping platform | n/a — deferred | same single-platform constraint |
## Accepted deltas
These are intentional differences between the two backends — the parity
suite skips or tolerates them by design.
1. **Transport-entry host name.** The legacy backend's
`IHostConnectivityProbe` surface includes a host entry named after
the Galaxy.Host process identity; the mxgw backend uses the
configured `MxAccess.ClientName`. The names differ, but both are
correct for their respective sessions — the parity test compares
only the platform-host subset.
2. **Reconnect latency cadence.** Legacy reconnect roundtrips an OS
named pipe + an MxAccess COM client + a Galaxy.Host process restart
if the host died. The mxgw reconnect re-Registers the gateway session
over an existing gRPC channel. Sub-second vs multi-second recoveries
are both correct for their own paths; only the eventual `Healthy`
convergence is pinned.
3. **Read-value drift.** A read sampled twice on a live Galaxy can
return different values legitimately. We pin `StatusCode`-class
parity (Bad/Uncertain/Good); value equality is not pinned.
4. **Event-rate variance.** Both backends consume the same upstream
MXAccess publish events but route them through different deserializers
(LMXProxyServer COM events vs gRPC `MxEvent` protos). Scheduler
jitter on either side can shift counts within a 3s window; we pin a
±50% ratio, not strict equality.
5. **`IHistoryProvider` on the new path only.** Phase 1 (PR 1.3) lifted
history off the per-driver path onto the server-owned
`HistoryRouter` for the *new* in-process `GalaxyDriver`. The legacy
`GalaxyProxyDriver` still surfaces `IHistoryProvider` for back-compat
with the legacy server bootstrap path — it's an accepted delta
retired in PR 7.2 alongside the rest of the legacy projects. The
pin we want to enforce is "the new path doesn't regress to per-driver
history."
6. **Read value-CLR-type.** Legacy returns the raw VARIANT (e.g.
`Byte[]`) for an attribute that hasn't received its first value
cycle from MxAccess yet, while mxgw returns the typed value
(`Single`, `Int32`, etc.). Once a real value is written or scanned,
both converge. Pinning CLR-type equality across the uninitialized
window adds noise without a real parity invariant — the
`StatusCode`-class assertion already covers the
"did the read succeed" question.
7. **Write-failure StatusCode mapping.** Legacy
`MxAccessGalaxyBackend.WriteValuesAsync` flat-maps every failure to
`BadInternalError` (`0x80020000`); mxgw
`GatewayGalaxyDataWriter.TranslateReply` uses
`MxStatusProxy.RawDetectedBy` to distinguish gw-layer faults
(`BadCommunicationError`, `0x80050000`) from MxAccess HRESULT
faults (`BadDeviceFailure`, `BadNotConnected`, etc.). Both yield
Bad-status — the parity invariant is the *status class*, not the
exact code. Tighter mapping parity isn't worth investing in: the
legacy mapping retires alongside `GalaxyProxyDriver` in PR 7.2.
8. **Single-platform scope on the dev rig.** Two
`ScanStateProbeParityTests` scenarios are deferred to a customer
rig with multiple deployed `$WinPlatform` instances; this dev box
is licensed for one. PR 4.7's unit tests (`PerPlatformProbeWatcherTests`)
pin the state-decoder + member-tracking logic at the seam level,
so the runtime parity check becomes a customer-rig acceptance gate
before that customer goes live, not a precondition for retiring
the legacy projects on this dev box.
9. **Workaround for the gw `[]` array-suffix bug.**
`mxaccessgw/src/MxGateway.Server/Galaxy/GalaxyRepository.cs:173-175`
appends `[]` to the `full_tag_reference` of array-typed attributes,
which `MxAccess COM IInstance.AddItem` doesn't accept. The lmxopcua
discoverer (`GalaxyDiscoverer.StripArraySuffix`) defensively strips
the suffix. Tracked in `mxaccessgw/requirements-array-suffix-fix.md`;
the workaround is removed when that gw fix lands.
## Outstanding deltas
None as of 2026-04-30. Phase 7 (PR 7.1) flipped the default to
`mxgw`; PR 7.2 (legacy project deletion) is unblocked — the matrix
gate is satisfied and no further soak/pilot precondition applies.
## Running the matrix
```bash
# Both backends must be reachable for any row to run; rows skip
# cleanly when their backend is unavailable.
dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```
Environment overrides for the mxgw backend:
| Variable | Default | Purpose |
|----------|---------|---------|
| `OTOPCUA_PARITY_GW_ENDPOINT` | `http://localhost:5120` | mxaccessgw gRPC endpoint |
| `OTOPCUA_PARITY_GW_API_KEY` | `parity-suite-key` | API key handed to `MxGatewayClient` |
| `OTOPCUA_PARITY_CLIENT_NAME` | `OtOpcUa-Parity` | `MxAccess.ClientName` for the session |
The legacy backend reads ZB SQL on `localhost:1433` and spawns
`OtOpcUa.Driver.Galaxy.Host.exe` from
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/bin/Debug/net48/` — both
must exist for the legacy half to resolve.