mxaccessgw/docs/GatewayTesting.md
Joseph Doherty a0203503a7 Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules
2026-05-20 09:46:47 -04:00


# Gateway Testing
Gateway tests run without installed MXAccess by using fake workers, fake
transports, and in-process gRPC service fakes. Live MXAccess verification belongs
in opt-in integration tests because it depends on installed COM components and
provider state.
## Fake Worker Harness
`FakeWorkerHarness` in `src/MxGateway.Tests/Gateway/Workers/Fakes/` provides an
in-process worker side for named-pipe IPC tests. It uses the same
`WorkerFrameReader`, `WorkerFrameWriter`, and `WorkerEnvelope` contract as the
gateway so tests exercise real frame validation and worker-client state changes.

Use the harness when a gateway or session test needs worker behavior without
starting `MxGateway.Worker.exe` or loading MXAccess COM. The harness scripts:
- `WorkerHello` and `WorkerReady` startup,
- command replies with matching correlation ids,
- ordered `WorkerEvent` frames,
- `WorkerHeartbeat` frames,
- `WorkerFault` frames,
- shutdown acknowledgements,
- malformed protobuf payloads and oversized frame headers,
- slow or hung workers by withholding a reply.

Session-level tests can connect the harness to the pipe created by
`SessionWorkerClientFactory` with `ConnectToGatewayPipeAsync`. Lower-level
`WorkerClient` tests can use `CreateConnectedPairAsync` to create both pipe ends
inside the test.
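
The last script in the list, a worker that withholds its reply, can be pictured with a small stand-in. The sketch below is not the `FakeWorkerHarness` API: `ScriptedWorker`, `SendAsync`, `Reply`, and `HungWorkerDemo` are hypothetical names, and it assumes .NET 6 or later for `Task.WaitAsync`. It only illustrates the general idea that a reply keyed by correlation id either arrives or is deliberately never sent, and that the caller observes the latter as a timeout.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in for a scripted worker; not the real FakeWorkerHarness.
public sealed class ScriptedWorker
{
    private readonly ConcurrentDictionary<string, TaskCompletionSource<string>> _pending = new();

    // "Gateway" side: registers a pending command and returns the task for its reply.
    public Task<string> SendAsync(string correlationId)
    {
        var tcs = new TaskCompletionSource<string>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[correlationId] = tcs;
        return tcs.Task;
    }

    // Test script side: completes the reply with the matching correlation id.
    // Never calling this for a given id simulates a hung worker.
    public bool Reply(string correlationId, string payload) =>
        _pending.TryRemove(correlationId, out var tcs) && tcs.TrySetResult(payload);
}

public static class HungWorkerDemo
{
    // Returns true when the reply arrived within the timeout, false when it did not.
    public static async Task<bool> ReplyArrivedAsync(Task<string> reply, TimeSpan timeout)
    {
        try
        {
            await reply.WaitAsync(timeout); // throws TimeoutException on expiry (.NET 6+)
            return true;
        }
        catch (TimeoutException)
        {
            return false;
        }
    }
}
```

A test would call `Reply` for the commands it wants answered and simply skip it for the command that should look hung, then assert that the caller's timeout path runs.
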
`GatewayEndToEndFakeWorkerSmokeTests` composes the real gRPC service,
`SessionManager`, `SessionWorkerClientFactory`, `WorkerClient`, and
`EventStreamService` with a scripted fake worker launcher. The smoke test covers
`OpenSession`, `Register`, `AddItem`, `Advise`, one streamed `OnDataChange`
event, and `CloseSession` without loading MXAccess COM.
## Live MXAccess Smoke
`WorkerLiveMxAccessSmokeTests` in `src/MxGateway.IntegrationTests/` composes the
real gRPC service, `SessionManager`, `SessionWorkerClientFactory`,
`WorkerClient`, `WorkerProcessLauncher`, and `MxGateway.Worker.exe`. It is
skipped unless `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1` is set because it creates
the installed MXAccess COM object and depends on live provider state.

The live smoke opens a gateway session, launches the x86 worker, runs
`Register`, `AddItem`, and `Advise`, waits a bounded time for the first
`OnDataChange` event (skipping any earlier bootstrap/registration-state event),
and closes the session in a `finally` block so the worker gets a graceful
shutdown request even when a command or event assertion fails. Cleanup failures
in that `finally` block are logged rather than thrown, so a real assertion
failure is never masked by a shutdown timeout.
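
The cleanup rule above can be sketched as a nested `try`/`catch` inside `finally`. `GuardedCleanup` and its parameters are hypothetical and not taken from `WorkerLiveMxAccessSmokeTests`; the `List<string>` stands in for whatever logger or test-output helper the real test uses.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class GuardedCleanup
{
    // Runs `body`, then always runs `cleanup`. A cleanup failure is appended to
    // `log` instead of being thrown, so an exception from `body` propagates unchanged.
    public static async Task RunAsync(Func<Task> body, Func<Task> cleanup, List<string> log)
    {
        try
        {
            await body();
        }
        finally
        {
            try
            {
                await cleanup();
            }
            catch (Exception ex)
            {
                // Logged, not rethrown: a shutdown timeout must not replace the
                // original assertion failure.
                log.Add($"cleanup failed: {ex.GetType().Name}: {ex.Message}");
            }
        }
    }
}
```

Without the inner `catch`, an exception thrown from the `finally` block would replace the in-flight exception from `body`, and the test report would show the shutdown error rather than the failed assertion.
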
`WorkerLiveMxAccessSmokeTests` additionally covers five MXAccess parity paths the
fake-worker tests cannot validate:
- a `Write` round-trip against an advised item, asserting both that the reply is
`Ok` / `MxCommandKind.Write` *and* that the worker emits a matching
`OnWriteComplete` event for the targeted (server, item) handle pair — the
same round-trip proof used by `scripts/run-client-e2e-tests.ps1`,
- an `AddItem` against an invalid server handle, asserting the MXAccess failure
surfaces in the command reply without faulting the gateway transport,
- the `UnAdvise` → `RemoveItem` → `Unregister` teardown chain, asserting each
step replies `Ok` with the matching `MxCommandKind`, that no further
`OnDataChange` events arrive for the un-advised pair, and that a second
`RemoveItem` against the freed handle relays a non-`Ok` MXAccess failure,
- a `WriteSecured` round-trip after `AuthenticateUser`, asserting the reply
carries `MxCommandKind.WriteSecured` and the credential password never
appears in the diagnostic message (parity for both the secured-write
ordering rule and the "do not log secrets" contract), and
- an abnormal worker exit (the worker process is killed mid-session) where the
gateway must transition the session to `SessionState.Faulted` with a
non-empty fault description carrying a known worker-client classification
(pipe disconnected / worker faulted / end-of-stream / heartbeat expired).

All six tests are gated by the same `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`
opt-in variable.

Build the worker before running the smoke:
```bash
dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86
```
Run the smoke explicitly:
```powershell
$env:MXGATEWAY_RUN_LIVE_MXACCESS_TESTS = "1"
dotnet test src/MxGateway.IntegrationTests/MxGateway.IntegrationTests.csproj --filter FullyQualifiedName~WorkerLiveMxAccessSmokeTests
```
Optional live smoke variables:
| Variable | Default | Description |
|----------|---------|-------------|
| `MXGATEWAY_LIVE_MXACCESS_WORKER_EXE` | First existing `MxGateway.Worker.exe` under `src/MxGateway.Worker/bin/...` | Worker executable path. Set this when running against a packaged worker or a non-default build output. |
| `MXGATEWAY_LIVE_MXACCESS_ITEM` | `TestChildObject.TestInt` | MXAccess item reference used by `AddItem`. |
| `MXGATEWAY_LIVE_MXACCESS_CLIENT_NAME` | `MxGateway.IntegrationTests` | Client name passed to `Register`. |
| `MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS` | `15` | Maximum wait for the first `OnDataChange` (also used for the `OnWriteComplete` round-trip and the abnormal-exit fault transition). |
| `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER` | `admin` | ArchestrA user name passed to `AuthenticateUser` before the `WriteSecured` parity step. |
| `MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_PASSWORD` | `admin123` | Password paired with the user above. Never logged; the test asserts the value does not appear in the WriteSecured diagnostic message. |

The test output includes session id, worker process id, command status,
HRESULT/status diagnostics, event sequence and handles, close status, and worker
stdout/stderr lines emitted during the run.
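
For readers writing a similar opt-in test, the gate and the optional variables in the table above could be read roughly as follows. This is a minimal sketch, not the integration project's code: `LiveTestSettings` and its methods are hypothetical, and treating only the exact value `1` as enabled and falling back on invalid timeout values are assumptions.

```csharp
using System;
using System.Globalization;

public static class LiveTestSettings
{
    // True only when the opt-in variable is exactly "1" (assumed comparison).
    public static bool IsEnabled(string variable) =>
        string.Equals(Environment.GetEnvironmentVariable(variable), "1", StringComparison.Ordinal);

    // Returns the variable's value, or the fallback when it is unset or blank.
    public static string GetOrDefault(string variable, string fallback)
    {
        var value = Environment.GetEnvironmentVariable(variable);
        return string.IsNullOrWhiteSpace(value) ? fallback : value;
    }

    // Parses a positive whole number of seconds; falls back when unset or invalid.
    public static TimeSpan GetTimeout(string variable, int fallbackSeconds)
    {
        var raw = Environment.GetEnvironmentVariable(variable);
        return int.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out var seconds) && seconds > 0
            ? TimeSpan.FromSeconds(seconds)
            : TimeSpan.FromSeconds(fallbackSeconds);
    }
}
```

A test would check `LiveTestSettings.IsEnabled("MXGATEWAY_RUN_LIVE_MXACCESS_TESTS")` first and report a skip, not a pass, when it returns `false`.
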
## Live Galaxy Repository
`GalaxyRepositoryLiveTests` in `src/MxGateway.IntegrationTests/Galaxy/` exercises
`GalaxyRepository` directly against the `ZB` Galaxy Repository SQL database. It is
skipped unless `MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1` is set because it depends on a
reachable SQL Server instance and deployed Galaxy state — fake-worker tests cannot
cover the SQL browse RPCs.

The suite covers `TestConnectionAsync`, `GetLastDeployTimeAsync`,
`GetHierarchyAsync`, and `GetAttributesAsync`. `GetHierarchyAsync` and
`GetAttributesAsync` assert a non-empty result, so the connected `ZB` database
must contain a deployed Galaxy, not just an empty schema.

Run the Galaxy live tests explicitly:
```powershell
$env:MXGATEWAY_RUN_LIVE_GALAXY_TESTS = "1"
dotnet test src/MxGateway.IntegrationTests/MxGateway.IntegrationTests.csproj --filter FullyQualifiedName~GalaxyRepositoryLiveTests
```
Optional live Galaxy variables:
| Variable | Default | Description |
|----------|---------|-------------|
| `MXGATEWAY_LIVE_GALAXY_CONN` | `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` | Galaxy Repository connection string. Set this when the `ZB` database is on a non-default instance or needs SQL authentication. |

The default connection string targets `ZB` on `localhost` with Windows
authentication, which matches the Galaxy Repository conventions in CLAUDE.md.
## Galaxy Filter Safety
`GalaxyFilterInputSafetyTests` in `src/MxGateway.Tests/Galaxy/` covers adversarial
input handling for the Galaxy Repository browse filter layer. It runs in the
unit-test project (no live SQL needed) and complements the live SQL coverage in
`GalaxyRepositoryLiveTests`.

The test class re-frames the original "Galaxy SQL injection" concern (Tests-002 in
`code-reviews/Tests/findings.md`). `GalaxyRepository` issues only four *constant*
SQL statements (`HierarchySql`, `AttributesSql`, `SELECT 1`,
`SELECT time_of_last_deploy FROM galaxy`) — no `DiscoverHierarchyRequest` field
is ever concatenated into a SQL string, so there is no dynamic SQL surface and no
`LIKE`-escaping helper to test. All filters (`TagNameGlob`, `RootTagName`,
template-chain, category, contained-path) are applied **in memory** by
`GalaxyHierarchyProjector` / `GalaxyGlobMatcher` against the cached snapshot.

The adversarial-input matrix (`'`, `' OR '1'='1`, `'; DROP TABLE gobject;--`,
`%`, `_`, `100%_off`, `[abc]`, `Pump'001`) pins the following invariants:
- SQL metacharacters (`'`, `;`) and `LIKE`-wildcards (`%`, `_`) are treated as
opaque literals by `GalaxyGlobMatcher` — they never act as wildcards, never
spuriously match unrelated text.
- Only `*` and `?` are glob wildcards.
- `GalaxyGlobMatcher` applies a 100 ms regex timeout so a pathological glob
(e.g. 5 000 `a` characters plus a literal `!`) completes promptly rather than
catastrophically backtracking.
- `GalaxyHierarchyProjector` returns zero matches (rather than the whole
hierarchy) for an adversarial `TagNameGlob` or `TemplateChainContains`, and
surfaces `NotFound` for an adversarial `RootTagName`.
- The `DiscoverHierarchy` RPC end-to-end returns zero matches for adversarial
`TagNameGlob` rather than faulting.

These invariants are the real security surface of the Galaxy browse path; the
SQL-injection framing does not apply to a constant-query layer.
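
To make the first three invariants concrete, here is a minimal sketch of a glob matcher with the same properties. It is an illustration under stated assumptions, not the real `GalaxyGlobMatcher`: the class name `GlobSketch`, the case-sensitive comparison, and the choice to treat a timeout as "no match" are all hypothetical. Only the 100 ms timeout and the `*` / `?` wildcard set come from the text above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Illustrative only: the real GalaxyGlobMatcher may be implemented differently.
public static class GlobSketch
{
    private static readonly TimeSpan MatchTimeout = TimeSpan.FromMilliseconds(100);

    public static bool IsMatch(string input, string glob)
    {
        // Anchor the whole input; translate only '*' and '?', escape everything else.
        var pattern = new StringBuilder(@"\A");
        foreach (var ch in glob)
        {
            pattern.Append(ch switch
            {
                '*' => ".*",
                '?' => ".",
                _ => Regex.Escape(ch.ToString()),
            });
        }
        pattern.Append(@"\z");

        try
        {
            var regex = new Regex(
                pattern.ToString(),
                RegexOptions.CultureInvariant | RegexOptions.Singleline,
                MatchTimeout);
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // Assumed policy: a glob that cannot be evaluated in time matches nothing.
            return false;
        }
    }
}
```

Because every character other than `*` and `?` is escaped, inputs such as `%`, `_`, `[abc]`, or `' OR '1'='1` can only match themselves literally.
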
## Live LDAP
`DashboardLdapLiveTests` in `src/MxGateway.IntegrationTests/` exercises
`DashboardAuthenticator` against the live GLAuth directory. It is skipped unless
`MXGATEWAY_RUN_LIVE_LDAP_TESTS=1` is set because it binds against the GLAuth
service described in `glauth.md`.

The suite builds the authenticator with a default `GatewayOptions`, so
`LdapOptions.RequiredGroup` keeps its `GwAdmin` default. `GwAdmin` is the
gateway-specific dashboard-admin role and is **not** part of the five baseline
GLAuth role groups — it must be provisioned before the LDAP live tests pass.
`AuthenticateAsync_AdminInGwAdminGroup_Succeeds` fails (rather than skips) when
GLAuth has only the baseline groups, so this is a hard prerequisite beyond "LDAP
is up." See the "Adding a gw-specific group" section of `glauth.md` for the
provisioning step that adds `GwAdmin` and grants it to `admin`.

The suite covers both the success path and the `DashboardAuthenticator` failure
branches: `admin` in `GwAdmin` succeeds; `readonly` is denied for missing group;
`admin` with a wrong password is rejected by the candidate bind without leaking
the password into `FailureMessage`; an unknown username yields no candidate; and
an unreachable LDAP server is absorbed into a failed result rather than throwing.

Run the LDAP live tests explicitly:
```powershell
$env:MXGATEWAY_RUN_LIVE_LDAP_TESTS = "1"
dotnet test src/MxGateway.IntegrationTests/MxGateway.IntegrationTests.csproj --filter FullyQualifiedName~DashboardLdapLiveTests
```
## Client E2E Scripts
`scripts/discover-testmachine-tags.ps1` queries the ZB Galaxy Repository for the
deployed runtime references used by the live client e2e scripts. It reads
`TestMachine_001` through `TestMachine_020` and the expected attributes:
- `ProtectedValue`
- `TestChangingInt`
- `TestBoolArray`
- `TestIntArray`
- `TestDateTimeArray`
- `TestStringArray`

The discovery output includes the exact `fullTagReference`, data type, array
dimension, and security classification. The array attributes are expected to be
dimension 50. `ProtectedValue` has security classification 2 and requires
secured write semantics; the current client CLI e2e runner subscribes to it but
does not attempt a normal `Write`.

Run discovery directly when validating the Galaxy Repository inputs:
```powershell
powershell -ExecutionPolicy Bypass -File scripts/discover-testmachine-tags.ps1 -Json
```
`scripts/run-client-e2e-tests.ps1` drives the .NET, Go, Rust, Python, and Java
client CLIs through a live gateway session. The gateway and worker are assumed
to be already running at `-Endpoint`; the script does not start or stop them.
For each client it runs these phases, then closes the session in a `finally`
path and writes a JSON report under `artifacts/e2e/`:
1. **Session + register** — opens one session and registers.
2. **Bulk** — verifies `SubscribeBulk` / `UnsubscribeBulk` on a bounded tag
subset (skip with `-SkipBulk`).
3. **Add-item / advise** — adds and advises every discovered test tag.
4. **Stream** — asserts a bounded event stream delivers at least one event
(skip with `-SkipStream`).
5. **Parity** — asserts MXAccess error paths are rejected rather than silently
succeeding: an invalid item handle and an unknown session id (skip with
`-SkipParity`).
6. **Auth rejection** — asserts `open-session` is rejected when the API key is
missing, and (when `-RejectScopeApiKeyEnv` names an insufficient-scope key)
when the key lacks the required scope. Skip with `-SkipAuth`.
7. **Write round-trip**: *opt-in (`-VerifyWrite`).* Runs right after
`register`: adds and advises a configurable writable attribute
(`-WriteAttribute`, default `TestChangingInt`), writes a per-client
sentinel value, then streams events and asserts an `OnWriteComplete` event
for that item is observed — proof the write round-tripped through the
gateway, worker, and MXAccess provider. The written value being echoed back
in an `OnDataChange` is recorded best-effort (`echoObserved`): a
provider-driven attribute such as `TestChangingInt` accepts the write but
immediately overwrites it, so no data-change carries the value back. The
Rust `stream-events` CLI emits full per-event JSON (`family`, `itemHandle`,
`value`) so all five clients apply the same checks.
It is opt-in because it mutates live tag state. The phase fails fast if the
write command is rejected — e.g. against a gateway whose worker predates
write support (`MxAccessCommandExecutor` returning `InvalidRequest` for
`Write`/`Write2`/`WriteSecured`/`WriteSecured2`).

Build the gateway and worker, start the gateway, and provide a valid API key
before running the client e2e script:
```powershell
$env:MXGATEWAY_API_KEY = "<api-key>"
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1
```
Useful runner options:
```powershell
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -Clients dotnet,python -MachineStart 1 -MachineEnd 2
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -BulkTagCount 10
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -SkipStream
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -SkipBulk
# Write round-trip (opt-in): point at a writable scalar attribute and its
# value type.
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -VerifyWrite -WriteAttribute TestChangingInt -WriteType int32
# Auth rejection: also assert an insufficient-scope key is denied.
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -RejectScopeApiKeyEnv MXGATEWAY_READONLY_API_KEY
# Run all five clients concurrently as isolated child processes.
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -Parallel
# Validate the flow offline (prints commands, contacts no gateway).
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -DryRun
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -Endpoint localhost:5000 -ApiKeyEnv MXGATEWAY_API_KEY
```
When `-VerifyWrite` is enabled, the write round-trip fails loudly if the write
command is rejected, if `-WriteAttribute` does not name a writable scalar
attribute, or if no `OnWriteComplete` event is observed for the written item
within `-WriteEchoMaxEvents` (default 200) streamed events. Raise
`-WriteEchoMaxEvents` if the gateway's per-session event backlog is large
enough to push `OnWriteComplete` past that bound.
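
The bounded check described above can be sketched as a single pass over the streamed events. Everything here is an assumption made for illustration: the runner itself is a PowerShell script, and `GatewayEvent`, `WriteCheck`, `WriteRoundTrip.Scan`, and the family strings are hypothetical names loosely modeled on the `family`, `itemHandle`, and `value` fields mentioned earlier. The point is only that `OnWriteComplete` is required within the event bound while the echoed value is recorded best-effort.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical event shape; not the gateway's actual event contract.
public sealed record GatewayEvent(string Family, int ItemHandle, string? Value);

public sealed record WriteCheck(bool WriteCompleteObserved, bool EchoObserved, int EventsScanned);

public static class WriteRoundTrip
{
    // Scans at most `maxEvents` events for the written item.
    // WriteCompleteObserved is the pass/fail signal; EchoObserved is informational.
    public static WriteCheck Scan(
        IEnumerable<GatewayEvent> events, int itemHandle, string sentinel, int maxEvents = 200)
    {
        var complete = false;
        var echo = false;
        var scanned = 0;

        foreach (var e in events)
        {
            if (scanned >= maxEvents) break; // bound reached: stop consuming the stream
            scanned++;

            if (e.ItemHandle != itemHandle) continue;
            if (e.Family == "OnWriteComplete") complete = true;
            if (e.Family == "OnDataChange" && e.Value == sentinel) echo = true;
            if (complete && echo) break; // nothing more to learn
        }

        return new WriteCheck(complete, echo, scanned);
    }
}
```

With this shape, a provider-driven attribute that overwrites the written value still passes, because only `WriteCompleteObserved` decides the outcome.
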
## Focused Commands
Run the cross-language smoke matrix tests after changing the documented client
smoke command list:
```bash
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~CrossLanguageSmokeMatrixTests
```
Run the parity fixture matrix tests after changing the integration parity
scenario list:
```bash
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~ParityFixtureMatrixTests
```
Run the fake worker tests after changing gateway worker IPC, session startup, or
event streaming behavior:
```bash
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~FakeWorkerHarnessTests
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~SessionWorkerClientFactoryFakeWorkerTests
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~GatewayEndToEndFakeWorkerSmokeTests
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~WorkerClientTests
dotnet test src/MxGateway.Worker.Tests/MxGateway.Worker.Tests.csproj -p:Platform=x86 --filter FullyQualifiedName~WorkerPipeSessionTests
```
Run the gateway test project after shared gateway test infrastructure changes:
```bash
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj
```
## Related Documentation
- [Cross-Language Smoke Matrix](./CrossLanguageSmokeMatrix.md)
- [Parity Fixture Matrix](./ParityFixtureMatrix.md)
- [Gateway Process Design](./GatewayProcessDesign.md)
- [Worker Frame Protocol](./WorkerFrameProtocol.md)
- [MXAccess Worker Instance Detailed Design](./MxAccessWorkerInstanceDesign.md)