Joseph Doherty a0203503a7 Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules
2026-05-20 09:46:47 -04:00

Gateway Testing

Gateway tests run without installed MXAccess by using fake workers, fake transports, and in-process gRPC service fakes. Live MXAccess verification belongs in opt-in integration tests because it depends on installed COM components and provider state.

Fake Worker Harness

FakeWorkerHarness in src/MxGateway.Tests/Gateway/Workers/Fakes/ provides an in-process worker side for named-pipe IPC tests. It uses the same WorkerFrameReader, WorkerFrameWriter, and WorkerEnvelope contract as the gateway so tests exercise real frame validation and worker-client state changes.

Use the harness when a gateway or session test needs worker behavior without starting MxGateway.Worker.exe or loading MXAccess COM. The harness scripts:

  • WorkerHello and WorkerReady startup,
  • command replies with matching correlation ids,
  • ordered WorkerEvent frames,
  • WorkerHeartbeat frames,
  • WorkerFault frames,
  • shutdown acknowledgements,
  • malformed protobuf payloads and oversized frame headers,
  • slow or hung workers by withholding a reply.
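The correlation-id and withheld-reply behaviors in the list above can be pictured with a small Python sketch. It is illustrative only: `Envelope`, `ScriptedWorker`, and `send_with_timeout` are hypothetical names invented here, not the FakeWorkerHarness API or the real WorkerEnvelope contract, and the reply-kind naming is an assumption.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical stand-in for a worker envelope (not the real WorkerEnvelope)."""
    kind: str
    correlation_id: str = ""


class ScriptedWorker:
    """Replies with the caller's correlation id, or hangs for selected command kinds."""

    def __init__(self, hang_on=frozenset()):
        self._hang_on = hang_on

    async def handle(self, command):
        if command.kind in self._hang_on:
            # Simulate a hung worker by never producing a reply.
            await asyncio.Event().wait()
        # The "<kind>Reply" naming is an assumption made for this sketch.
        return Envelope(kind=command.kind + "Reply",
                        correlation_id=command.correlation_id)


async def send_with_timeout(worker, command, timeout):
    """Returns the reply, or None when the worker withholds it past the timeout."""
    try:
        return await asyncio.wait_for(worker.handle(command), timeout)
    except asyncio.TimeoutError:
        return None
```

A test written against such a fake can assert both the happy path (the reply carries the same correlation id as the command) and the timeout path without any real process or pipe.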

Session-level tests can connect the harness to the pipe created by SessionWorkerClientFactory with ConnectToGatewayPipeAsync. Lower-level WorkerClient tests can use CreateConnectedPairAsync to create both pipe ends inside the test.

GatewayEndToEndFakeWorkerSmokeTests composes the real gRPC service, SessionManager, SessionWorkerClientFactory, WorkerClient, and EventStreamService with a scripted fake worker launcher. The smoke test covers OpenSession, Register, AddItem, Advise, one streamed OnDataChange event, and CloseSession without loading MXAccess COM.

Live MXAccess Smoke

WorkerLiveMxAccessSmokeTests in src/MxGateway.IntegrationTests/ composes the real gRPC service, SessionManager, SessionWorkerClientFactory, WorkerClient, WorkerProcessLauncher, and MxGateway.Worker.exe. It is skipped unless MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1 is set because it creates the installed MXAccess COM object and depends on live provider state.

The live smoke opens a gateway session, launches the x86 worker, runs Register, AddItem, and Advise, waits a bounded time for the first OnDataChange event (skipping any earlier bootstrap/registration-state event), and closes the session in a finally block so the worker gets a graceful shutdown request even when a command or event assertion fails. Cleanup failures in that finally block are logged rather than thrown, so a real assertion failure is never masked by a shutdown timeout.
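The "log, do not throw" cleanup rule can be sketched as follows. This is a minimal Python illustration of the pattern, not the test's actual code; `run_with_logged_cleanup` and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("live_smoke_sketch")


def run_with_logged_cleanup(body, cleanup):
    """Runs body, then cleanup; a cleanup failure is logged, never raised."""
    try:
        return body()
    finally:
        try:
            cleanup()
        except Exception:
            # Raising here would replace any exception already propagating
            # from body(), so the cleanup failure is only logged.
            log.exception("cleanup failed")
```

If `body` raises an assertion error and `cleanup` then times out, the caller still sees the assertion error; the timeout only appears in the log.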

WorkerLiveMxAccessSmokeTests additionally covers five MXAccess parity paths the fake-worker tests cannot validate:

  • a Write round-trip against an advised item, asserting both that the reply is Ok / MxCommandKind.Write and that the worker emits a matching OnWriteComplete event for the targeted (server, item) handle pair — the same round-trip proof used by scripts/run-client-e2e-tests.ps1,
  • an AddItem against an invalid server handle, asserting the MXAccess failure surfaces in the command reply without faulting the gateway transport,
  • the UnAdvise → RemoveItem → Unregister teardown chain, asserting each step replies Ok with the matching MxCommandKind, that no further OnDataChange events arrive for the un-advised pair, and that a second RemoveItem against the freed handle relays a non-Ok MXAccess failure,
  • a WriteSecured round-trip after AuthenticateUser, asserting the reply carries MxCommandKind.WriteSecured and the credential password never appears in the diagnostic message (parity for both the secured-write ordering rule and the "do not log secrets" contract), and
  • an abnormal worker exit (the worker process is killed mid-session) where the gateway must transition the session to SessionState.Faulted with a non-empty fault description carrying a known worker-client classification (pipe disconnected / worker faulted / end-of-stream / heartbeat expired).

All six tests (the original smoke plus the five parity paths above) are gated by the same MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1 opt-in variable.
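The opt-in gate is meant to report a skip, not a silent pass, when the variable is absent. A rough Python analogue using the standard `unittest` skip decorator is shown below; the class and function names are hypothetical, and treating only the exact value "1" as enabled is an assumption of this sketch.

```python
import os
import unittest


def live_tests_enabled(env=os.environ):
    """True only when the opt-in variable is exactly "1" (assumed semantics)."""
    return env.get("MXGATEWAY_RUN_LIVE_MXACCESS_TESTS") == "1"


@unittest.skipUnless(live_tests_enabled(),
                     "set MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1 to run")
class LiveSmokeSketch(unittest.TestCase):
    def test_runs_only_when_opted_in(self):
        # Reached only when the gate is open; otherwise the runner reports a skip.
        self.assertTrue(live_tests_enabled())


def run_sketch():
    """Runs the sketch suite and returns the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LiveSmokeSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Reporting a skip keeps the run summary honest: a machine without MXAccess shows skipped tests instead of green results that exercised nothing.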

Build the worker before running the smoke:

dotnet build src/MxGateway.Worker/MxGateway.Worker.csproj -p:Platform=x86

Run the smoke explicitly:

$env:MXGATEWAY_RUN_LIVE_MXACCESS_TESTS = "1"
dotnet test src/MxGateway.IntegrationTests/MxGateway.IntegrationTests.csproj --filter FullyQualifiedName~WorkerLiveMxAccessSmokeTests

Optional live smoke variables:

| Variable | Default | Description |
| --- | --- | --- |
| MXGATEWAY_LIVE_MXACCESS_WORKER_EXE | First existing MxGateway.Worker.exe under src/MxGateway.Worker/bin/... | Worker executable path. Set this when running against a packaged worker or a non-default build output. |
| MXGATEWAY_LIVE_MXACCESS_ITEM | TestChildObject.TestInt | MXAccess item reference used by AddItem. |
| MXGATEWAY_LIVE_MXACCESS_CLIENT_NAME | MxGateway.IntegrationTests | Client name passed to Register. |
| MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS | 15 | Maximum wait for the first OnDataChange (also used for the OnWriteComplete round-trip and the abnormal-exit fault transition). |
| MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_USER | admin | ArchestrA user name passed to AuthenticateUser before the WriteSecured parity step. |
| MXGATEWAY_LIVE_MXACCESS_WRITE_SECURED_PASSWORD | admin123 | Password paired with the user above. Never logged; the test asserts the value does not appear in the WriteSecured diagnostic message. |
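
As a rough illustration of how optional variables with documented defaults are typically resolved, the Python sketch below reads three of the variables listed above. `LiveSmokeSettings` and `load_settings` are hypothetical names; the real test is .NET and may treat empty values differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveSmokeSettings:
    item: str
    client_name: str
    event_timeout_seconds: int


def load_settings(env):
    """Reads the optional variables, falling back to the documented defaults."""
    return LiveSmokeSettings(
        item=env.get("MXGATEWAY_LIVE_MXACCESS_ITEM", "TestChildObject.TestInt"),
        client_name=env.get("MXGATEWAY_LIVE_MXACCESS_CLIENT_NAME",
                            "MxGateway.IntegrationTests"),
        # Parsed as an integer number of seconds (assumed format).
        event_timeout_seconds=int(
            env.get("MXGATEWAY_LIVE_MXACCESS_EVENT_TIMEOUT_SECONDS", "15")),
    )
```

Passing the environment in as a plain mapping keeps the loader testable without mutating the process environment.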

The test output includes session id, worker process id, command status, HRESULT/status diagnostics, event sequence and handles, close status, and worker stdout/stderr lines emitted during the run.

Live Galaxy Repository

GalaxyRepositoryLiveTests in src/MxGateway.IntegrationTests/Galaxy/ exercises GalaxyRepository directly against the ZB Galaxy Repository SQL database. It is skipped unless MXGATEWAY_RUN_LIVE_GALAXY_TESTS=1 is set because it depends on a reachable SQL Server instance and deployed Galaxy state — fake-worker tests cannot cover the SQL browse RPCs.

The suite covers TestConnectionAsync, GetLastDeployTimeAsync, GetHierarchyAsync, and GetAttributesAsync. GetHierarchyAsync and GetAttributesAsync assert a non-empty result, so the connected ZB database must contain a deployed Galaxy, not just an empty schema.

Run the Galaxy live tests explicitly:

$env:MXGATEWAY_RUN_LIVE_GALAXY_TESTS = "1"
dotnet test src/MxGateway.IntegrationTests/MxGateway.IntegrationTests.csproj --filter FullyQualifiedName~GalaxyRepositoryLiveTests

Optional live Galaxy variables:

| Variable | Default | Description |
| --- | --- | --- |
| MXGATEWAY_LIVE_GALAXY_CONN | Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False; | Galaxy Repository connection string. Set this when the ZB database is on a non-default instance or needs SQL authentication. |

The default connection string targets ZB on localhost with Windows authentication, which matches the Galaxy Repository conventions in CLAUDE.md.

Galaxy Filter Safety

GalaxyFilterInputSafetyTests in src/MxGateway.Tests/Galaxy/ covers adversarial input handling for the Galaxy Repository browse filter layer. It runs in the unit-test project (no live SQL needed) and complements the live SQL coverage in GalaxyRepositoryLiveTests.

The test class re-frames the original "Galaxy SQL injection" concern (Tests-002 in code-reviews/Tests/findings.md). GalaxyRepository issues only four constant SQL statements (HierarchySql, AttributesSql, SELECT 1, SELECT time_of_last_deploy FROM galaxy) — no DiscoverHierarchyRequest field is ever concatenated into a SQL string, so there is no dynamic SQL surface and no LIKE-escaping helper to test. All filters (TagNameGlob, RootTagName, template-chain, category, contained-path) are applied in memory by GalaxyHierarchyProjector / GalaxyGlobMatcher against the cached snapshot.

The adversarial-input matrix (', ' OR '1'='1, '; DROP TABLE gobject;--, %, _, 100%_off, [abc], Pump'001) pins the following invariants:

  • SQL metacharacters (', ;) and LIKE-wildcards (%, _) are treated as opaque literals by GalaxyGlobMatcher: they never act as wildcards and never spuriously match unrelated text.
  • Only * and ? are glob wildcards.
  • GalaxyGlobMatcher applies a 100 ms regex timeout so a pathological glob (e.g. 5 000 a characters plus a literal !) completes promptly rather than catastrophically backtracking.
  • GalaxyHierarchyProjector returns zero matches (rather than the whole hierarchy) for an adversarial TagNameGlob or TemplateChainContains, and surfaces NotFound for an adversarial RootTagName.
  • The DiscoverHierarchy RPC end-to-end returns zero matches for adversarial TagNameGlob rather than faulting.

These invariants are the real security surface of the Galaxy browse path; the SQL-injection framing does not apply to a constant-query layer.
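
The "only * and ? are wildcards" invariant can be illustrated with a short Python sketch of a glob-to-regex translation. This is not GalaxyGlobMatcher's implementation; `glob_to_regex` and `glob_match` are hypothetical, case-sensitive matching is an assumption, and the 100 ms timeout is not reproduced because Python's `re` module has no timeout option.

```python
import re


def glob_to_regex(glob):
    """Translates a glob where only * and ? are wildcards into a regex."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")   # any run of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            # Everything else, including % _ [ ] ' and ;, becomes a literal.
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def glob_match(glob, text):
    """True when the whole text matches the glob (case-sensitive in this sketch)."""
    return glob_to_regex(glob).fullmatch(text) is not None
```

Under this translation `100%_off` matches only the literal text `100%_off`, whereas SQL LIKE semantics would also accept `100XXoff`; `[abc]` matches only the five-character string `[abc]`, not the single letter `a`.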

Live LDAP

DashboardLdapLiveTests in src/MxGateway.IntegrationTests/ exercises DashboardAuthenticator against the live GLAuth directory. It is skipped unless MXGATEWAY_RUN_LIVE_LDAP_TESTS=1 is set because it binds against the GLAuth service described in glauth.md.

The suite builds the authenticator with a default GatewayOptions, so LdapOptions.RequiredGroup keeps its GwAdmin default. GwAdmin is the gateway-specific dashboard-admin role and is not part of the five baseline GLAuth role groups — it must be provisioned before the LDAP live tests pass. AuthenticateAsync_AdminInGwAdminGroup_Succeeds fails (rather than skips) when GLAuth has only the baseline groups, so this is a hard prerequisite beyond "LDAP is up." See the "Adding a gw-specific group" section of glauth.md for the provisioning step that adds GwAdmin and grants it to admin.

The suite covers both the success path and the DashboardAuthenticator failure branches: admin in GwAdmin succeeds; readonly is denied for missing group; admin with a wrong password is rejected by the candidate bind without leaking the password into FailureMessage; an unknown username yields no candidate; and an unreachable LDAP server is absorbed into a failed result rather than throwing.
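
Two of the failure branches above (a rejected bind that does not leak the password, and an unreachable server absorbed into a failed result) follow a common pattern, sketched here in Python. `AuthResult`, `authenticate`, and the `bind` callable are hypothetical stand-ins, and the exception types and message texts are assumptions, not DashboardAuthenticator's behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthResult:
    succeeded: bool
    failure_message: str = ""


def authenticate(bind, username, password):
    """Calls a hypothetical bind(username, password) and never raises on failure."""
    try:
        bind(username, password)
    except ConnectionError:
        # Unreachable directory: report failure instead of propagating.
        return AuthResult(False, "directory unreachable")
    except PermissionError:
        # The message names the user but deliberately omits the password.
        return AuthResult(False, "bind rejected for user '%s'" % username)
    return AuthResult(True)
```

A test can then assert on the result object and check that the password string does not occur anywhere in `failure_message`.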

Run the LDAP live tests explicitly:

$env:MXGATEWAY_RUN_LIVE_LDAP_TESTS = "1"
dotnet test src/MxGateway.IntegrationTests/MxGateway.IntegrationTests.csproj --filter FullyQualifiedName~DashboardLdapLiveTests

Client E2E Scripts

scripts/discover-testmachine-tags.ps1 queries the ZB Galaxy Repository for the deployed runtime references used by the live client e2e scripts. It reads TestMachine_001 through TestMachine_020 and the expected attributes:

  • ProtectedValue
  • TestChangingInt
  • TestBoolArray
  • TestIntArray
  • TestDateTimeArray
  • TestStringArray

The discovery output includes the exact fullTagReference, data type, array dimension, and security classification. The array attributes are expected to be dimension 50. ProtectedValue has security classification 2 and requires secured write semantics; the current client CLI e2e runner subscribes to it but does not attempt a normal Write.

Run discovery directly when validating the Galaxy Repository inputs:

powershell -ExecutionPolicy Bypass -File scripts/discover-testmachine-tags.ps1 -Json

scripts/run-client-e2e-tests.ps1 drives the .NET, Go, Rust, Python, and Java client CLIs through a live gateway session. The gateway and worker are assumed to be already running at -Endpoint; the script does not start or stop them. For each client it runs these phases, then closes the session in a finally path and writes a JSON report under artifacts/e2e/:

  1. Session + register — opens one session and registers.

  2. Bulk — verifies SubscribeBulk / UnsubscribeBulk on a bounded tag subset (skip with -SkipBulk).

  3. Add-item / advise — adds and advises every discovered test tag.

  4. Stream — asserts a bounded event stream delivers at least one event (skip with -SkipStream).

  5. Parity — asserts MXAccess error paths are rejected rather than silently succeeding: an invalid item handle and an unknown session id (skip with -SkipParity).

  6. Auth rejection — asserts open-session is rejected when the API key is missing, and (when -RejectScopeApiKeyEnv names an insufficient-scope key) when the key lacks the required scope. Skip with -SkipAuth.

  7. Write round-trip (opt-in, -VerifyWrite). Runs right after register: adds and advises a configurable writable attribute (-WriteAttribute, default TestChangingInt), writes a per-client sentinel value, then streams events and asserts that an OnWriteComplete event for that item is observed. That event is the proof that the write round-tripped through the gateway, worker, and MXAccess provider. The written value being echoed back in an OnDataChange is recorded best-effort (echoObserved): a provider-driven attribute such as TestChangingInt accepts the write but immediately overwrites it, so no data-change carries the value back. The Rust stream-events CLI emits full per-event JSON (family, itemHandle, value) so all five clients apply the same checks.

    It is opt-in because it mutates live tag state. The phase fails fast if the write command is rejected — e.g. against a gateway whose worker predates write support (MxAccessCommandExecutor returning InvalidRequest for Write/Write2/WriteSecured/WriteSecured2).

Build the gateway and worker, start the gateway, and provide a valid API key before running the client e2e script:

$env:MXGATEWAY_API_KEY = "<api-key>"
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1

Useful runner options:

powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -Clients dotnet,python -MachineStart 1 -MachineEnd 2
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -BulkTagCount 10
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -SkipStream
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -SkipBulk
# Write round-trip (opt-in): point at a writable scalar attribute and its
# value type.
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -VerifyWrite -WriteAttribute TestChangingInt -WriteType int32
# Auth rejection: also assert an insufficient-scope key is denied.
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -RejectScopeApiKeyEnv MXGATEWAY_READONLY_API_KEY
# Run all five clients concurrently as isolated child processes.
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -Parallel
# Validate the flow offline (prints commands, contacts no gateway).
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -DryRun
powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 -Endpoint localhost:5000 -ApiKeyEnv MXGATEWAY_API_KEY

When -VerifyWrite is enabled, the write round-trip fails loudly if the write command is rejected, if -WriteAttribute does not name a writable scalar attribute, or if no OnWriteComplete event is observed for the written item within -WriteEchoMaxEvents (default 200) streamed events. Raise -WriteEchoMaxEvents if the gateway's per-session event backlog is large enough to push OnWriteComplete past that bound.
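
The bounded search described above can be sketched in Python as a scan over at most a fixed number of streamed events. `find_write_complete` is a hypothetical helper, not the runner's code; the `family` and `itemHandle` keys come from the per-event JSON mentioned earlier, while the exact string "OnWriteComplete" as the family value is an assumption.

```python
from itertools import islice


def find_write_complete(events, item_handle, max_events=200):
    """Scans at most max_events events for an OnWriteComplete on item_handle.

    Returns the 1-based position of the match, or None when the bound is
    reached or the stream ends first.
    """
    for position, event in enumerate(islice(events, max_events), start=1):
        if (event.get("family") == "OnWriteComplete"
                and event.get("itemHandle") == item_handle):
            return position
    return None
```

A large backlog of unrelated events ahead of the completion pushes the match past the bound, which is why raising the limit is the suggested remedy in that case.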

Focused Commands

Run the cross-language smoke matrix tests after changing the documented client smoke command list:

dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~CrossLanguageSmokeMatrixTests

Run the parity fixture matrix tests after changing the integration parity scenario list:

dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~ParityFixtureMatrixTests

Run the fake worker tests after changing gateway worker IPC, session startup, or event streaming behavior:

dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~FakeWorkerHarnessTests
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~SessionWorkerClientFactoryFakeWorkerTests
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~GatewayEndToEndFakeWorkerSmokeTests
dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~WorkerClientTests
dotnet test src/MxGateway.Worker.Tests/MxGateway.Worker.Tests.csproj -p:Platform=x86 --filter FullyQualifiedName~WorkerPipeSessionTests

Run the gateway test project after shared gateway test infrastructure changes:

dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj