Files
mxaccessgw/code-reviews/Tests/findings.md
T
Joseph Doherty 5ade3f4f48 Resolve Tests-003, -004, -005, -006 code-review findings
Tests-003: temp auth-DB directories leaked under %TEMP%. Added the
TempDatabaseDirectory IDisposable helper (clears the Sqlite connection pool,
then recursively deletes); SqliteAuthStoreTests and ApiKeyAdminCliRunnerTests
now dispose every directory they create.

Tests-004: added end-to-end coverage composing the real authorization
interceptor in front of the real MxAccessGatewayService, plus scope-resolver
tests confirming an unmapped request type fails closed to the admin scope.

Tests-005: added coverage for a worker faulting mid-command — a pipe
disconnect and a worker fault while an InvokeAsync is in flight both fail the
pending invoke. No product change needed.

Tests-006 (re-triaged): the flaky ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess
is a test race, not a product bug — the kill runs synchronously inside
SetFaulted. Rewrote it to await FakeWorkerProcess exit deterministically, and
replaced fixed Task.Delay timing in the late-reply and heartbeat tests with
FIFO ordering and an injected ManualTimeProvider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:44:55 -04:00

212 lines
19 KiB
Markdown

# Code Review — Tests
| Field | Value |
|---|---|
| Module | `src/MxGateway.Tests` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `6c64030` |
| Status | Reviewed |
| Open findings | 6 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found: Tests-001 (`FakeSessionManager.TryGetSession` always returns true), Tests-011 (unobserved worker task). |
| 2 | mxaccessgw conventions | FakeWorkerHarness used per docs; no real secrets; minor style drift in three alarm-test files (Tests-008). |
| 3 | Concurrency & thread safety | Issues found: Tests-006 (`Task.Delay`-based timing), Tests-012 (no parallelism guard for `WebApplication` tests). |
| 4 | Error handling & resilience | Strong — timeouts, faults, overflow, kill paths, protocol violations all exercised. No issues found. |
| 5 | Security | Issues found: Tests-002 (no SQL-injection coverage of Galaxy RPCs), Tests-010 (anonymous-localhost negative cases untested). |
| 6 | Performance & resource management | Issue found: Tests-003 (temp DB/worker directories never cleaned up). |
| 7 | Design-document adherence | Tests match `docs/GatewayTesting.md`; no drift found. No issues found. |
| 8 | Code organization & conventions | Issue found: Tests-007 (`TestServerCallContext` copy-pasted into 4+ files). |
| 9 | Testing coverage | Issues found: Tests-001, Tests-004 (no end-to-end interceptor+service test), Tests-005 (no worker-crash-mid-command coverage), Tests-002. |
| 10 | Documentation & comments | Issue found: Tests-009 (stale/mismatched XML `<summary>` comments). |
## Findings
### Tests-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:483-489` |
| Status | Resolved |
**Description:** `FakeSessionManager.TryGetSession` unconditionally returns `true` and synthesizes a session for any id. As a result, `Invoke_WhenSessionMissing_ThrowsNotFound` (line 52) only passes because `InvokeException` is pre-seeded — it does not verify that the gateway service maps a genuinely missing session to `NotFound`. No test exercises the real gateway path where `TryGetSession` returns `false` (for `StreamEvents`, `CloseSession`, alarm RPCs). A regression dropping the missing-session check would not be caught.
**Recommendation:** Make `FakeSessionManager.TryGetSession` return `false` for unknown ids (return only seeded sessions), then assert `NotFound`/`InvalidArgument` is produced by the service's own lookup logic rather than an injected exception.
**Resolution:** Resolved 2026-05-18: confirmed root cause — added `ResolveOnlySeededSessions`/`SeedSession` to `FakeSessionManager` so `TryGetSession` returns `false` for unseeded ids, rewrote `Invoke_WhenSessionMissing_ThrowsNotFound` to drop the injected `InvokeException` and exercise the service's own `ResolveSession` lookup (asserts `InvokeCount == 0`), and added `Invoke_WhenSessionSeeded_ResolvesAndInvokes`, `AcknowledgeAlarm_WhenSessionMissing_ThrowsNotFound`, and `QueryActiveAlarms_WhenSessionMissing_ThrowsNotFound`.
### Tests-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/MxGateway.Tests/Gateway/Grpc/GalaxyRepositoryGrpcServiceTests.cs:198-210` |
| Status | Resolved |
**Description:** The Galaxy Repository RPCs browse a SQL Server database (`ZB`). Every test injects a `StubGalaxyHierarchyCache`, so actual SQL query construction, parameterization, and filter/glob translation are never exercised. No test demonstrates that `TagNameGlob`, `RootTagName`, `AlarmFilterPrefix`, etc. are passed as parameters rather than concatenated into SQL. SQL-injection resistance of the Galaxy layer has zero coverage.
**Recommendation:** Add tests for the `GalaxyRepository` query-building layer (against SQLite or an in-memory abstraction, or by asserting parameter objects), covering glob/prefix inputs containing `'`, `%`, `_`, and `;`. At minimum add a unit test over the SQL `LIKE`-pattern escaping helper.
**Re-triage note:** The finding's premise is partly misframed. `GalaxyRepository` issues only four *constant* SQL statements (`HierarchySql`, `AttributesSql`, `SELECT 1`, `SELECT time_of_last_deploy FROM galaxy`) — no `DiscoverHierarchyRequest` field is ever concatenated into SQL, so there is no dynamic SQL-injection surface and no `LIKE`-escaping helper to test. `AlarmFilterPrefix` belongs to the worker alarm path, not the Galaxy SQL layer. All filters (`TagNameGlob`, `RootTagName`, template-chain, category, contained-path) are applied **in memory** by `GalaxyHierarchyProjector`/`GalaxyGlobMatcher` against the cached snapshot. The genuine, testable concern — that adversarial filter strings are treated as opaque literals (no wildcard behaviour, no ReDoS, no exceptions) — remains valid and was previously uncovered. Severity left at High: an unsafe in-memory filter would still be a real security gap.
**Resolution:** Resolved 2026-05-18: added `src/MxGateway.Tests/Galaxy/GalaxyFilterInputSafetyTests.cs` (10 test methods, mostly `[Theory]` over adversarial inputs `'`, `' OR '1'='1`, `'; DROP TABLE gobject;--`, `%`, `_`, `100%_off`, `[abc]`, `Pump'001`) covering `GalaxyGlobMatcher` literal-treatment / `LIKE`-wildcard / pathological-input (ReDoS) behaviour and `GalaxyHierarchyProjector` + `DiscoverHierarchy` RPC handling of adversarial `TagNameGlob`, `RootTagName`, and `TemplateChainContains`. No product bug found — the in-memory filter layer treats all metacharacters as literals; the passing tests resolve the coverage gap.
### Tests-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `src/MxGateway.Tests/Security/Authentication/SqliteAuthStoreTests.cs:170-176`, `src/MxGateway.Tests/Security/Authentication/ApiKeyAdminCliRunnerTests.cs:252-258` |
| Status | Resolved |
**Description:** `CreateTempDatabasePath` creates a fresh directory under `%TEMP%\mxgateway-auth-tests\<guid>` (and `...-cli-tests`) for every test but nothing ever deletes it. `WorkerProcessLauncherTests.TestDirectory` correctly implements `IDisposable` and cleans up; these two do not. SQLite connection pooling can also keep the `.db` handle open after the test. Over many CI runs this leaks temp files and open handles.
**Recommendation:** Wrap the temp directory in an `IDisposable`/`IAsyncDisposable` helper (as `WorkerProcessLauncherTests` does) and call `SqliteConnection.ClearAllPools()` before deletion, or use `Microsoft.Data.Sqlite` in-memory mode where a real file is not needed.
**Resolution:** Resolved 2026-05-18: confirmed root cause — both `CreateTempDatabasePath` helpers created `%TEMP%` directories with no cleanup, and `Microsoft.Data.Sqlite` pools connections by default so the `.db` handle outlives the test. Added a shared `TempDatabaseDirectory` (`src/MxGateway.Tests/Security/Authentication/TempDatabaseDirectory.cs`) `IDisposable` helper that calls `SqliteConnection.ClearAllPools()` and recursively deletes its directory. `SqliteAuthStoreTests` and `ApiKeyAdminCliRunnerTests` now implement `IDisposable`, track every directory created via `CreateTempDatabasePath`, and dispose them after each test. All affected tests still pass.
### Tests-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Security/Authorization/GatewayGrpcAuthorizationInterceptorTests.cs` |
| Status | Resolved |
**Description:** The authorization interceptor and `MxAccessGatewayService` are each tested in isolation, but no test composes the interceptor in front of the real service to confirm scope enforcement gates real RPCs end-to-end. A wiring mistake — interceptor not registered, or a new RPC added without a scope mapping in `GatewayGrpcScopeResolver` — would pass every existing test. `GatewayGrpcScopeResolverTests` also only checks an enumerated allow-list; it never asserts an unmapped request type fails closed.
**Recommendation:** Add an end-to-end test that runs `OpenSession`/`Invoke` through the interceptor+service composition with insufficient scope and asserts `PermissionDenied`; add a `GatewayGrpcScopeResolver` test asserting an unknown/unmapped request type throws or denies rather than returning a permissive default.
**Resolution:** Resolved 2026-05-18: confirmed the coverage gap. Added three interceptor+service composition tests to `GatewayGrpcAuthorizationInterceptorTests` that run the real `GatewayGrpcAuthorizationInterceptor` continuation into a real `MxAccessGatewayService`: `InterceptorComposedWithService_OpenSessionMissingScope_DeniesBeforeServiceRuns` (asserts `PermissionDenied` and `OpenSessionCount == 0`), `InterceptorComposedWithService_OpenSessionWithScope_RunsServiceWithIdentity` (service runs and observes the interceptor-pushed identity), and `InterceptorComposedWithService_InvokeWriteCommandWithReadScope_DeniesBeforeServiceRuns` (a `Write` command with only `invoke:read` is denied). Added two `GatewayGrpcScopeResolverTests`: `ResolveRequiredScope_UnmappedRequestType_FailsClosedToAdminScope` confirms an unmapped request type resolves to the most-restrictive `Admin` scope (the resolver's `_ => GatewayScopes.Admin` default already fails closed — no product bug), and `ResolveRequiredScope_UnknownInvokeCommandKind_ReturnsInvokeReadScope` confirms an unknown command kind does not silently grant write/admin access.
### Tests-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs:239-261`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` |
| Status | Resolved |
**Description:** Worker-crash handling is only tested as a clean terminal exception from `ReadEventsAsync` or a pre-set `ShutdownException`. There is no test for a worker that faults mid-command — an `InvokeAsync` in flight when the pipe/worker dies — which is a core fault-handling path of the two-process design. `WorkerClientTests` covers pipe-disconnect faulting the read loop, but not the interaction where a pending `InvokeAsync` task observes the fault and surfaces a meaningful error code.
**Recommendation:** Add a `WorkerClient`/`SessionManager` test that disposes the worker pipe (or emits a `WorkerFault`) while an `InvokeAsync` is pending, and assert the invoke task fails with a `WorkerClientException`/`SessionManagerException` carrying the worker-faulted error code.
**Resolution:** Resolved 2026-05-18: confirmed the coverage gap and confirmed the product path already handles it correctly (`WorkerClient.ReadLoopAsync``SetFaulted``CompletePendingCommands(fault)` fails every pending command with the fault exception). Added two `WorkerClientTests`: `InvokeAsync_WhenPipeDisconnectsMidCommand_FailsPendingInvokeWithPipeDisconnected` (worker reads the command then disposes its pipe side; the pending invoke task fails with `WorkerClientErrorCode.PipeDisconnected`) and `InvokeAsync_WhenWorkerFaultsMidCommand_FailsPendingInvokeWithWorkerFaulted` (worker emits a `WorkerFault` envelope while the invoke is pending; the task fails with `WorkerClientErrorCode.WorkerFaulted`). Both also assert the client transitions to `Faulted`. No product change needed.
### Tests-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:76`, `src/MxGateway.Tests/Gateway/Workers/FakeWorkerHarnessTests.cs:122` |
| Status | Resolved |
**Description:** Several tests rely on fixed `Task.Delay` values: `WorkerClientTests.InvokeAsync_WithLateReply…` waits a hard-coded 50 ms after writing a late reply before issuing the second command, and the heartbeat tests use a 20 ms delay to make timestamps strictly increase. On a slow CI agent the 50 ms delay can be insufficient, and `DateTimeOffset.UtcNow` resolution can make the 20 ms heartbeat-advance assertion flaky.
**Recommendation:** Replace fixed delays with the existing `WaitUntilAsync` condition polling, and inject a controllable `TimeProvider` for heartbeat-timestamp comparisons instead of relying on wall-clock advance.
**Re-triage note:** The brief flagged `ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess` as "a real `WorkerClient` fault→kill bug". On inspection it is **not a product bug** — it is a test race. `WorkerClient.SetFaulted` publishes the `Faulted` state under lock *before* calling `KillOwnedProcess`, so the old test's `WaitUntilAsync(() => client.State == Faulted)` could return between those two statements and observe `process.KillCount == 0`. The kill itself always runs synchronously inside `SetFaulted`, and `ShutdownAsync`/`DisposeAsync` re-issue an idempotent kill, so no real consumer relies on "state==Faulted implies process dead". The fix is therefore a test-quality fix (correctly Medium / Concurrency), not a product fix.
**Resolution:** Resolved 2026-05-18: (1) Made `ReadLoop_WhenClientFaults_KillsOwnedWorkerProcess` deterministic — it now `await`s `FakeWorkerProcess.WaitForExitAsync` (the `TaskCompletionSource` completed inside `Kill()`), which completes exactly when the kill runs, eliminating the state-polling race; verified by running it five times in isolation (5/5 pass). (2) Removed the fixed 50 ms `Task.Delay` from `InvokeAsync_WithLateReply_IgnoresLateReplyAndKeepsClientReady` — the stale reply and the second reply are now sent in pipe (FIFO) order, so the read loop discards the stale reply before the second reply with no timing window. (3) Replaced the 20 ms `Task.Delay` heartbeat-advance hacks in `WorkerClientTests.ReadLoop_WhenHeartbeatArrives_UpdatesLastHeartbeatAndWorkerProcess` and `FakeWorkerHarnessTests.SendHeartbeatAsync_UpdatesClientHeartbeatState` with an injected `ManualTimeProvider` advanced by a fixed `TimeSpan`; both tests now assert the exact post-advance timestamp instead of `>` against wall-clock drift.
### Tests-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:682`, `src/MxGateway.Tests/Gateway/Grpc/GalaxyRepositoryGrpcServiceTests.cs:324`, `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:460`, `src/MxGateway.Tests/Security/Authorization/GatewayGrpcAuthorizationInterceptorTests.cs:233` |
| Status | Open |
**Description:** A near-identical `TestServerCallContext` implementation is copy-pasted into at least four test files (and `AllowAllConstraintEnforcer` / `TestServerStreamWriter` / `RecordingStreamWriter` into several). Duplication risks the copies drifting and bloats each file.
**Recommendation:** Extract a shared `TestServerCallContext`, `RecordingServerStreamWriter<T>`, and `AllowAllConstraintEnforcer` into a common test-support folder/namespace.
**Resolution:** _(open)_
### Tests-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | mxaccessgw conventions |
| Location | `src/MxGateway.Tests/Gateway/Sessions/WorkerAlarmRpcDispatcherTests.cs:1-9`, `src/MxGateway.Tests/Gateway/Sessions/NotWiredAlarmRpcDispatcherTests.cs:1-3`, `src/MxGateway.Tests/Gateway/Sessions/SessionManagerAlarmAutoSubscribeTests.cs:1` |
| Status | Open |
**Description:** The alarm test files diverge from the project's C# style and the rest of the suite: snake_case test method names instead of the PascalCase `Method_Condition_Result` pattern; redundant explicit `using System;`/`System.Threading;` imports despite implicit global usings; and explicit-type `new` instead of target-typed `new()` used elsewhere. There is also a typo in fixture data (`"wnwrap subscribe failed"`).
**Recommendation:** Rename the alarm tests to the house `Method_Condition_Result` convention, drop redundant `System.*` usings, align `new` usage, and fix the `wnwrap` typo.
**Resolution:** _(open)_
### Tests-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:36-37,99,365` |
| Status | Open |
**Description:** Several XML `<summary>` comments are copy-paste mismatches: the comment above `OpenSessionAsync_SetsInitialDefaultLease` describes correlation-ID generation; the comment above `GatewaySessionSubscribeBulkAsync_ForwardsOneBulkCommand…` describes lease refresh; the comment above `CloseExpiredLeasesAsync_DoesNotCloseActiveEventSubscriber` describes shutdown closing all sessions. Misleading test docs hinder triage.
**Recommendation:** Correct the `<summary>` text to match each test's actual behavior, or remove the redundant comments since the test names already describe the behavior.
**Resolution:** _(open)_
### Tests-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `src/MxGateway.Tests/Gateway/Dashboard/DashboardAuthorizationHandlerTests.cs:26-36` |
| Status | Open |
**Description:** The anonymous-localhost bypass is tested only for the success case (`allowAnonymousLocalhost: true` + loopback succeeds) and the remote-unauthenticated denial. There is no test for the security-critical negatives: anonymous + loopback when `AllowAnonymousLocalhost` is `false` must be denied, and anonymous + non-loopback when the flag is `true` must still be denied (the bypass is scoped strictly to loopback). Those are the misconfiguration cases that would expose the dashboard.
**Recommendation:** Add tests: anonymous + loopback + `allowAnonymousLocalhost: false` → not succeeded; anonymous + non-loopback + `allowAnonymousLocalhost: true` → not succeeded.
**Resolution:** _(open)_
### Tests-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs:233-301` |
| Status | Open |
**Description:** `GatewayEndToEndFakeWorkerSmokeTests` correctly stores and awaits `launcher.WorkerTask`, but `SessionWorkerClientFactoryFakeWorkerTests` uses `_ = RunWorkerAsync(...)` with no stored task (lines 152, 184, 220). An unhandled exception in the scripted worker becomes an unobserved `TaskException` that can surface as a process-level failure in an unrelated later test rather than failing the owning test.
**Recommendation:** Store the worker task and either await it during disposal or attach a continuation that fails the test on fault, mirroring `GatewayEndToEndFakeWorkerSmokeTests`.
**Resolution:** _(open)_
### Tests-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/MxGateway.Tests/Gateway/Workers/Fakes/FakeWorkerHarness.cs:62`, `src/MxGateway.Tests/Gateway/Workers/WorkerClientTests.cs:472` |
| Status | Open |
**Description:** Pipe names are uniquified per test with a GUID (good), but xUnit runs test classes in parallel by default and there is no `xunit.runner.json` or collection configuration. Tests that build a full `WebApplication` bind ephemeral ports (`--urls=http://127.0.0.1:0`, fine) but spin up DI containers and hosted services concurrently. Currently safe, but a future test binding a fixed port would silently collide.
**Recommendation:** Add an `xunit.runner.json` or a collection grouping the `WebApplication`-building tests, and keep the `:0` ephemeral-port convention explicit so future tests do not introduce a fixed-port collision.
**Resolution:** _(open)_