Files
lmxopcua/docs/v2/Galaxy.ParityMatrix.md
Joseph Doherty 9db2edcbb5 parity: matrix fully green on dev rig (2026-04-30)
End-to-end run on the live ZB galaxy with mxaccessgw on
http://localhost:5120: 14 passed / 1 skipped / 0 failed in 18m53s.
PR 7.2's matrix-gate condition met. Three resolution patches in this
commit; the matrix doc records the new state.

1. Discoverer: defensive `[]` array-suffix strip
   ----------------------------------------------------
   The gw's GalaxyRepository.cs:173-175 appends `[]` to
   array-typed full_tag_reference values, but MxAccess COM
   IInstance.AddItem doesn't accept `[]`-suffixed addresses.
   GalaxyDiscoverer.StripArraySuffix removes the suffix client-side
   so SubscribeBulk / Read / Write paths see the canonical form.
   Tracked in mxaccessgw/requirements-array-suffix-fix.md; this
   workaround is removed when the gw fix lands.

2. WriteByClassification: pin status class, not exact code
   ---------------------------------------------------------
   Legacy MxAccessGalaxyBackend.WriteValuesAsync flat-maps every
   failure to BadInternalError (0x80020000); mxgw's
   GatewayGalaxyDataWriter.TranslateReply uses
   MxStatusProxy.RawDetectedBy to distinguish gw-layer faults
   (BadCommunicationError, 0x80050000) from MxAccess HRESULT
   faults. Both yield Bad-status — the parity invariant is the
   status class (Good/Uncertain/Bad), not the exact code. Both
   write tests now use AssertStatusClassMatches; legacy mapping
   retires alongside GalaxyProxyDriver in PR 7.2.

3. BrowseAndReadParity Read scenario: drop CLR-type assertion
   ------------------------------------------------------------
   Legacy returns the raw VARIANT (e.g. byte[]) for an attribute
   that hasn't received its first value cycle from MxAccess yet,
   while mxgw returns the typed value (Single, Int32, etc.). Once
   a real value is written or scanned, both converge. Pinning
   CLR-type equality across the uninitialized window adds noise
   without a real parity invariant — the StatusCode-class
   assertion already covers the "did the read succeed" question.
   The test still pins StatusCode-class parity per scenario.

4. Galaxy.ParityMatrix.md — first-rig results captured
   -----------------------------------------------------
   Per-row status flipped from "n/a unverified" to actual
   green / yellow / deferred outcomes from this run. Four new
   accepted-deltas added (read-value CLR type, write-status code
   mapping, single-platform ScanState scope, gw `[]` suffix
   workaround), bringing the total to nine. Outstanding deltas
   section flipped to "none as of 2026-04-30."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 04:19:56 -04:00

150 lines
8.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Galaxy backend parity matrix
This document tracks the scenario × result matrix that the
`Driver.Galaxy.ParityTests` suite drives against both Galaxy backends —
the legacy out-of-process **Galaxy.Host** (.NET 4.8 x86 + MXAccess COM,
fronted by `GalaxyProxyDriver`) and the new in-process **mxgateway**
backend (`GalaxyDriver`, .NET 10 + gRPC against `mxaccessgw`).
Maintained alongside Phase 5 (PR 5.W). The Phase 7 default flip
(PR 7.1) consumes this matrix as its go/no-go gate — every row must be
either green or carry an explicit *accepted-delta* justification.
## Reading the matrix
- **Status: green** — the scenario asserts strict parity and passes
(or skips cleanly when the rig isn't up).
- **Status: yellow** — soft pin only (count or shape parity, not value
parity) — acceptable when the underlying COM/gRPC stacks have known
divergences in raw payloads but the surface presented to the
DriverNodeManager is equivalent.
- **Status: red** — divergence detected. Row carries a fix or a
follow-up task ID.
## Scenarios
Last verified end-to-end on the dev parity rig: **2026-04-30**
(legacy `OtOpcUaGalaxyHost` mxaccess backend; mxaccessgw v1.x at
`http://localhost:5120`; sandbox `OtOpcUaParityTest_001` deployed in
the `ZB` galaxy; 13 passed / 1 skipped / 0 failed in 19 minutes).
| PR | Test class | Scenario | Status | Notes |
|----|-----------|----------|--------|-------|
| 5.2 | `BrowseAndReadParityTests` | Same variable set | green | symmetric set diff on full-reference set, after `[]` array-suffix workaround in `GalaxyDiscoverer` |
| 5.2 | `BrowseAndReadParityTests` | Same DataType / SecurityClass / IsHistorized | green | per-attribute meta triple parity |
| 5.2 | `BrowseAndReadParityTests` | Same StatusCode-class on a sampled read | yellow | pins status class (Bad/Uncertain/Good); CLR type intentionally not asserted — see "Accepted deltas" #6 |
| 5.3 | `SubscribeAndEventRateParityTests` | Subscribe returns a handle on each backend | green | symmetric Unsubscribe cleanup |
| 5.3 | `SubscribeAndEventRateParityTests` | Event rate within ±50% over 3s | yellow | both backends fed by the same upstream MXAccess subscriptions; tolerance absorbs scheduler jitter |
| 5.4 | `WriteByClassificationParityTests` | FreeAccess / Operate write status-class parity | yellow | pins status class only; legacy flat-maps every failure to BadInternalError, mxgw distinguishes (BadCommunicationError, BadDeviceFailure, etc.) — see "Accepted deltas" #7 |
| 5.4 | `WriteByClassificationParityTests` | Configure / Tune routes via secured-write | yellow | same status-class pin |
| 5.5 | `AlarmTransitionParityTests` | Same alarm-condition source-node-id set | green | one-way invariant on sub-attribute refs (legacy populated → mxgw matches; legacy null → mxgw free to populate per AlarmRefBuilder) |
| 5.5 | `AlarmTransitionParityTests` | IsAlarm-marked variable count parity | green | soft pin — count must match, doesn't have to be non-zero |
| 5.6 | `HistoryReadParityTests` | Same historized attribute set | green | what HistoryRouter consumes when routing to the Wonderware sidecar |
| 5.6 | `HistoryReadParityTests` | New mxgw GalaxyDriver does not implement `IHistoryProvider` | green | architectural pin from Phase 1 (PR 1.3) on the *new* path; legacy `GalaxyProxyDriver` keeps the interface for back-compat until PR 7.2 — see "Accepted deltas" #8 |
| 5.7 | `ReconnectParityTests` | Reinitialize → both Healthy + reads succeed | green | recovery latency is *not* pinned (legacy: pipe + COM client; mxgw: re-Register gw session) |
| 5.7 | `ReconnectParityTests` | Health diverges only when one side recovers | yellow | soft pin until a toxiproxy-style fault injector lands |
| 5.8 | `ScanStateProbeParityTests` | Same per-platform host set | n/a — deferred | dev rig is licensed for one `$WinPlatform` only; multi-platform parity deferred to a customer rig (PR 4.7's unit tests pin the state-decoder + member-tracking logic) |
| 5.8 | `ScanStateProbeParityTests` | Same `HostState` per overlapping platform | n/a — deferred | same single-platform constraint |
## Accepted deltas
These are intentional differences between the two backends — the parity
suite skips or tolerates them by design.
1. **Transport-entry host name.** The legacy backend's
`IHostConnectivityProbe` surface includes a host entry named after
the Galaxy.Host process identity; the mxgw backend uses the
configured `MxAccess.ClientName`. The names differ, but both are
correct for their respective sessions — the parity test compares
only the platform-host subset.
2. **Reconnect latency cadence.** Legacy reconnect roundtrips an OS
named pipe + an MxAccess COM client + a Galaxy.Host process restart
if the host died. The mxgw reconnect re-Registers the gateway session
over an existing gRPC channel. Sub-second vs multi-second recoveries
are both correct for their own paths; only the eventual `Healthy`
convergence is pinned.
3. **Read-value drift.** A read sampled twice on a live Galaxy can
return different values legitimately. We pin `StatusCode`-class
parity (Bad/Uncertain/Good); value equality is not pinned.
4. **Event-rate variance.** Both backends consume the same upstream
MXAccess publish events but route them through different deserializers
(LMXProxyServer COM events vs gRPC `MxEvent` protos). Scheduler
jitter on either side can shift counts within a 3s window; we pin a
±50% ratio, not strict equality.
5. **`IHistoryProvider` on the new path only.** Phase 1 (PR 1.3) lifted
history off the per-driver path onto the server-owned
`HistoryRouter` for the *new* in-process `GalaxyDriver`. The legacy
`GalaxyProxyDriver` still surfaces `IHistoryProvider` for back-compat
with the legacy server bootstrap path — it's an accepted delta
retired in PR 7.2 alongside the rest of the legacy projects. The
pin we want to enforce is "the new path doesn't regress to per-driver
history."
6. **Read value-CLR-type.** Legacy returns the raw VARIANT (e.g.
`Byte[]`) for an attribute that hasn't received its first value
cycle from MxAccess yet, while mxgw returns the typed value
(`Single`, `Int32`, etc.). Once a real value is written or scanned,
both converge. Pinning CLR-type equality across the uninitialized
window adds noise without a real parity invariant — the
`StatusCode`-class assertion already covers the
"did the read succeed" question.
7. **Write-failure StatusCode mapping.** Legacy
`MxAccessGalaxyBackend.WriteValuesAsync` flat-maps every failure to
`BadInternalError` (`0x80020000`); mxgw
`GatewayGalaxyDataWriter.TranslateReply` uses
`MxStatusProxy.RawDetectedBy` to distinguish gw-layer faults
(`BadCommunicationError`, `0x80050000`) from MxAccess HRESULT
faults (`BadDeviceFailure`, `BadNotConnected`, etc.). Both yield
Bad-status — the parity invariant is the *status class*, not the
exact code. Tighter mapping parity isn't worth investing in: the
legacy mapping retires alongside `GalaxyProxyDriver` in PR 7.2.
8. **Single-platform scope on the dev rig.** Two
`ScanStateProbeParityTests` scenarios are deferred to a customer
rig with multiple deployed `$WinPlatform` instances; this dev box
is licensed for one. PR 4.7's unit tests (`PerPlatformProbeWatcherTests`)
pin the state-decoder + member-tracking logic at the seam level,
so the runtime parity check becomes a customer-rig acceptance gate
before that customer goes live, not a precondition for retiring
the legacy projects on this dev box.
9. **Workaround for the gw `[]` array-suffix bug.**
`mxaccessgw/src/MxGateway.Server/Galaxy/GalaxyRepository.cs:173-175`
appends `[]` to the `full_tag_reference` of array-typed attributes,
which `MxAccess COM IInstance.AddItem` doesn't accept. The lmxopcua
discoverer (`GalaxyDiscoverer.StripArraySuffix`) defensively strips
the suffix. Tracked in `mxaccessgw/requirements-array-suffix-fix.md`;
the workaround is removed when that gw fix lands.
## Outstanding deltas
None as of 2026-04-30. Phase 7 (PR 7.1) flipped the default to
`mxgw`; PR 7.2 retires the legacy projects after the soak run + a
2-week production pilot.
## Running the matrix
```bash
# Both backends must be reachable for any row to run; rows skip
# cleanly when their backend is unavailable.
dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```
Environment overrides for the mxgw backend:
| Variable | Default | Purpose |
|----------|---------|---------|
| `OTOPCUA_PARITY_GW_ENDPOINT` | `http://localhost:5120` | mxaccessgw gRPC endpoint |
| `OTOPCUA_PARITY_GW_API_KEY` | `parity-suite-key` | API key handed to `MxGatewayClient` |
| `OTOPCUA_PARITY_CLIENT_NAME` | `OtOpcUa-Parity` | `MxAccess.ClientName` for the session |
The legacy backend reads ZB SQL on `localhost:1433` and spawns
`OtOpcUa.Driver.Galaxy.Host.exe` from
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/bin/Debug/net48/` — both
must exist for the legacy half to resolve.