Resolve Tests-027..031: flake root cause + coverage gaps

Tests-027 GatewayMetrics exposes its internal Meter; the StreamEvents_WhenEventIsWritten_RecordsSendDuration listener now filters by ReferenceEquals(instrument.Meter, metrics.Meter) instead of Meter.Name, so parallel tests with their own GatewayMetrics no longer cross-contaminate the families list. Tests-028 FakeWorkerClient.Kill now captures LastKillReason; SessionManager.KillWorkerAsync tests pin the reason propagation end-to-end and cover the blank/null guard. The DashboardSessionAdminService kill test pins the literal dashboard-admin-kill reason. Tests-029 Added CloseSessionAsync_BlankSessionId_ReturnsFailure to mirror the existing KillWorkerAsync blank-id coverage. Tests-030 DeleteAsync_WhenStoreRefuses_ReportsFriendlyError renamed and extended to assert the dashboard-delete-key audit row with Details = not-found-or-active. Added DeleteAsync_BlankKeyId_ReturnsFailure. Tests-031 DashboardSnapshotPublisher reconnect test now measures the gap from the first throw inside the fake (firstThrowAt) to secondSubscribeAt, isolating Task.Delay from StartAsync / scheduling overhead. All resolved at 2026-05-24; 512/512 gateway tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 09:28:54 -04:00
parent 430187c28b
commit 6bae5ea3a3
7 changed files with 255 additions and 18 deletions
@@ -7,7 +7,7 @@
 | Review date | 2026-05-24 |
 | Commit reviewed | `42b0037` |
 | Status | Re-reviewed |
-| Open findings | 5 |
+| Open findings | 0 |

 ## Checklist coverage

@@ -488,12 +488,14 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
 | Severity | Medium |
 | Category | Concurrency & thread safety |
 | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:199-240`, `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:8,73,246-251` |
-| Status | Open |
+| Status | Resolved |

 **Description:** The review brief explicitly flagged `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` as a known flake that "passed solo on rerun". The root cause is the `MeterListener` subscribes by `instrument.Meter.Name == GatewayMetrics.MeterName` (a *process-shared* constant `"MxGateway.Server"`), not by the specific `GatewayMetrics` instance constructed in the test. Tests-012 made the xUnit parallelism policy explicit (`parallelizeTestCollections: true`, `maxParallelThreads: -1`), and every other test that builds its own `GatewayMetrics()` and exercises `MxAccessGatewayService.StreamEvents` or `EventStreamService.StreamEventsAsync` (e.g. the new `StreamEventsAsync_*` family added by Tests-026 and Server-041, plus the pre-existing `StreamEventsAsync_YieldsEventsInWorkerOrder` etc.) routes through `GatewayMetrics.RecordEventStreamSend` → the same histogram name `mxgateway.events.stream_send.duration`. When two such tests run concurrently in the same xUnit process, the `MeterListener` in this test sees measurements from *both* meters and `families.Count` grows to >1, breaking `Assert.Equal([MxEventFamily.OnDataChange.ToString()], families)`. Solo reruns pass because no other producer is alive. This is exactly the cross-test mutable-state pattern Tests-012 set the guardrail comment against.

 **Recommendation:** Either (a) filter the `MeterListener` callback by the specific `Meter` instance — capture `metrics._meter` (or expose `GatewayMetrics.Meter`) and compare with `ReferenceEquals(instrument.Meter, expectedMeter)` instead of comparing `Meter.Name`; or (b) place this test in a single-threaded `[Collection("GatewayMetrics-Listener")]` so no other `RecordEventStreamSend` producer runs concurrently. Option (a) is preferred because it removes the cross-talk vector permanently and lets the test stay parallelisable.

+**Resolution:** 2026-05-24 — Applied option (a). Added an `internal Meter Meter => _meter;` accessor on `GatewayMetrics` (visible to the Tests project via the existing `InternalsVisibleTo`) and changed both the `InstrumentPublished` filter and the `SetMeasurementEventCallback<double>` filter in `StreamEvents_WhenEventIsWritten_RecordsSendDuration` from `instrument.Meter.Name == GatewayMetrics.MeterName` to `ReferenceEquals(instrument.Meter, metrics.Meter)`. Added a companion regression `StreamEvents_RecordSendDurationListener_IgnoresMeasurementsFromOtherMetersWithSameName` that constructs a second `GatewayMetrics`, records an `OnWriteComplete` measurement on it before the test-under-test publishes, and asserts the listener captures only the test-under-test's `OnDataChange` family. Confirmed the regression catches the original `Meter.Name`-only filter (got `["OnWriteComplete", "OnDataChange"]` for `["OnDataChange"]`) by temporarily reverting the filter shape; restored ReferenceEquals after. Suite green 3/3 (512/512); the two Tests-027 tests pass 5/5 solo. The cross-talk vector is permanently closed without giving up parallelism.
+
 ### Tests-028

 | Field | Value |
@@ -501,12 +503,14 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
 | Severity | Low |
 | Category | Testing coverage |
 | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:466-496,802-807`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-253` |
-| Status | Open |
+| Status | Resolved |

 **Description:** The new `KillWorkerAsync_KillsWorkerAndRemovesSession` (line 466) and `KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound` (line 486) pin the new kill-path entry, but they do not pin the `reason` argument propagating through the chain. `SessionManager.KillWorkerAsync(sessionId, reason, ct)` validates `reason` with `ArgumentException.ThrowIfNullOrWhiteSpace(reason)` (line 221), calls `session.KillWorker(reason)` (line 229), and logs `reason={Reason}` (line 251); but the `FakeWorkerClient.Kill(string reason)` discards the argument (line 803-807) and the assertion is only `Assert.Equal(1, workerClient.KillCount)`. A regression that (a) hard-coded an internal `"unspecified"` reason between `SessionManager` and `GatewaySession`, (b) swapped to a different overload that dropped the reason, or (c) deleted the `ThrowIfNullOrWhiteSpace` guard would all pass the current tests. The dashboard caller (`DashboardSessionAdminService.KillWorkerAsync`) passes a hard-coded `"dashboard-admin-kill"` reason and the only test that observes it (`KillWorkerAsync_AdminKillsWorker`) asserts `!string.IsNullOrWhiteSpace(LastKillReason)` rather than pinning the value — so the same-class drift is also untested.

 **Recommendation:** (1) Capture `LastKillReason` on `FakeWorkerClient.Kill` and assert `KillWorkerAsync_KillsWorkerAndRemovesSession` propagates the test-supplied `"test-kill"` string end-to-end. (2) Add `KillWorkerAsync_WithBlankReason_ThrowsArgumentException` (parameterised over `null`, `""`, `"   "`) to pin the `ArgumentException.ThrowIfNullOrWhiteSpace` guard. (3) Tighten `DashboardSessionAdminServiceTests.KillWorkerAsync_AdminKillsWorker` to `Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason)` so a future reason-string change is a deliberate test update.

+**Resolution:** 2026-05-24 — Added `LastKillReason` to `FakeWorkerClient` in `SessionManagerTests.cs` and set it inside `Kill(string reason)`. Tightened `KillWorkerAsync_KillsWorkerAndRemovesSession` to assert `workerClient.LastKillReason == "test-kill"`, pinning the end-to-end propagation from `SessionManager.KillWorkerAsync` → `session.KillWorker(reason)` → `IWorkerClient.Kill(reason)`. Added `KillWorkerAsync_WithBlankReason_ThrowsArgumentException` as a `[Theory]` over `""`, `" "`, `"\t"` plus a separate `KillWorkerAsync_WithNullReason_ThrowsArgumentNullException` fact (xUnit `InlineData` cannot carry `null` for a non-nullable string, and `ArgumentException.ThrowIfNullOrWhiteSpace` throws `ArgumentNullException` for `null`). Both new tests confirm `KillCount == 0` and the session remains registered, proving the guard fires before any lookup or worker call. Tightened `DashboardSessionAdminServiceTests.KillWorkerAsync_AdminKillsWorker` to `Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason)`. All affected tests pass; suite green.
+
 ### Tests-029

 | Field | Value |
@@ -514,12 +518,16 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
 | Severity | Low |
 | Category | Error handling & resilience |
 | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs:61-106,139-222`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:77-125` |
-| Status | Open |
+| Status | Resolved |

 **Description:** The new `DashboardSessionAdminServiceTests` covers the happy path and the viewer-denial path for both `CloseSessionAsync` and `KillWorkerAsync`, plus `CloseSessionAsync_WhenSessionMissing_ReportsFriendlyError` for the close-side `SessionNotFound` catch — but the kill-side error branches are not tested. The product code's `KillWorkerAsync` (lines 111-114) has the same `SessionNotFound` catch returning `"Session {id} was not found."` and (lines 115-124) a generic `SessionManagerException` catch returning `"Kill failed: {message}"`; neither is exercised. The fake's `KillWorkerAsync` (lines 200-209) only succeeds — there is no `KillThrowsNotFound` / `KillThrowsGeneric` configuration option matching the existing `CloseThrowsNotFound`. Symmetrically, `CloseSessionAsync` has the same `IsNullOrWhiteSpace(sessionId)` guard (line 37-40) but no blank-id test even though `KillWorkerAsync_BlankSessionId_ReturnsFailure` exists for the parallel kill guard — a guard-removal regression on close would slip through.

 **Recommendation:** Mirror the existing close-side fixtures onto the kill side: add `KillThrowsNotFound` / `KillThrowsGeneric` init-flags to the `FakeSessionManager`, then `KillWorkerAsync_WhenSessionMissing_ReportsFriendlyError`, `KillWorkerAsync_WhenSessionManagerThrows_ReportsKillFailedMessage`, and `CloseSessionAsync_BlankSessionId_ReturnsFailure`. These are mechanical copies of the existing patterns and bring close/kill coverage into symmetry.

+**Re-triage note:** The Server batch already added `CloseSessionAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` and `KillWorkerAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (the Server-050 regressions visible at HEAD lines 125-162 of the test file), so the kill-side `SessionManagerException` general-catch branch and the close-side parallel are both covered there in a generic-exception shape. The only remaining asymmetry was the blank-session-id guard, per the prompt scope.
+
+**Resolution:** 2026-05-24 — Added `CloseSessionAsync_BlankSessionId_ReturnsFailure` to `DashboardSessionAdminServiceTests`. The new test invokes `service.CloseSessionAsync(adminUser, "   ", ct)` and asserts `Succeeded == false` and `sessionManager.CloseCount == 0`, pinning the `string.IsNullOrWhiteSpace(sessionId)` guard at `DashboardSessionAdminService.cs:52-55`. This brings close/kill blank-id coverage into symmetry with the existing `KillWorkerAsync_BlankSessionId_ReturnsFailure`. The `KillThrowsNotFound` / `KillThrowsGeneric` extensions from the original recommendation are not needed because the unexpected-throw branches are already covered by the Server-050 regressions noted above. All tests pass; suite green.
+
 ### Tests-030

 | Field | Value |
@@ -527,12 +535,14 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
 | Severity | Low |
 | Category | Testing coverage |
 | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs:115-163`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:146-177` |
-| Status | Open |
+| Status | Resolved |

 **Description:** The three new `DeleteAsync_*` fixtures cover unauthorised user, success path with audit, and store-refuses-with-friendly-error. They do not exercise two production behaviours: (1) `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` (line 151-163) does not construct or inject a `FakeApiKeyAuditStore`, so it never observes that the product code still emits an audit entry with `EventType = "dashboard-delete-key"` and `Details = "not-found-or-active"` on the failure branch (`AppendAuditAsync` runs unconditionally at line 167-172). A regression that placed the `AppendAuditAsync` call inside the `if (deleted)` branch would silently drop the audit trail for refused deletes — a real audit-completeness gap. (2) There is no `DeleteAsync_BlankKeyId_ReturnsFailure` or `DeleteAsync_InvalidKeyId_ReturnsFailure` test, even though `ValidateKeyId(keyId)` (line 156-160) guards on the same conditions as Create/Revoke/Rotate. The `Revoke`/`Rotate` paths have equivalent fixtures (the file's earlier tests cover them); only Delete is missing them.

 **Recommendation:** (1) Add a `FakeApiKeyAuditStore` to `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` and assert it contains exactly one entry with `EventType == "dashboard-delete-key"` and `Details == "not-found-or-active"`. (2) Add `DeleteAsync_BlankKeyId_ReturnsFailure` (parameterised over `null`, `""`, `"   "`) and `DeleteAsync_InvalidKeyId_ReturnsFailure` (a keyId with characters the `ValidateKeyId` rules reject) to pin the validation branch end-to-end.

+**Resolution:** 2026-05-24 — Renamed `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` to `DeleteAsync_WhenStoreRefuses_ReportsFriendlyErrorAndAudits` and extended it to inject a `FakeApiKeyAuditStore`; the test now asserts the single audit entry has `EventType == "dashboard-delete-key"`, `KeyId == "operator01"`, and `Details == "not-found-or-active"`. This pins the unconditional-audit invariant at `DashboardApiKeyManagementService.cs:167-172` — a regression moving the `AppendAuditAsync` call inside `if (deleted)` would now fail the test. Added `DeleteAsync_BlankKeyId_ReturnsFailure` as a `[Theory]` over `""`, `" "`, `"\t"` that asserts `Succeeded == false`, `adminStore.DeleteCount == 0`, AND `auditStore.Entries` is empty — pinning that the `ValidateKeyId` guard at line 156-160 fires before any store or audit work. All tests pass; suite green.
+
 ### Tests-031

 | Field | Value |
@@ -540,8 +550,10 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
 | Severity | Low |
 | Category | Concurrency & thread safety |
 | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs:22-61` |
-| Status | Open |
+| Status | Resolved |

 **Description:** `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` records `startedAt = DateTimeOffset.UtcNow` *before* calling `publisher.StartAsync(...)`, then asserts `secondSubscribeAt - startedAt >= reconnectDelay - 10ms` (line 59). The measured gap is *not* the reconnect delay in isolation — it is `(StartAsync scheduling) + (first WatchSnapshotsAsync setup + Task.Yield) + (throw) + reconnect delay + (second WatchSnapshotsAsync setup)`. On a slow/contended CI agent the first three terms easily dominate (favouring the assertion); but on a fast machine, Windows `Task.Delay(50ms)` rounds up to the next ~15.6 ms tick boundary and may return at ~46-50 ms relative to schedule, while the first three terms can be sub-millisecond — so the gap measurement can land within 1-2 ms of the lower bound, and the 10 ms slack may not absorb a single missed quantum. This is a latent flake of the same flavour as Tests-006 (heartbeat timing) but on a wall-clock dependency the test cannot inject around because `DashboardSnapshotPublisher` uses `Task.Delay(_reconnectDelay)` directly. Tests-006 / Tests-017 moved heartbeat tests onto `ManualTimeProvider`; this test cannot do that without a product change to use a `TimeProvider`-aware delay.

 **Recommendation:** (a) The cheap fix: have `ThrowOnceThenYieldSnapshotService` record `_firstThrowAt = DateTimeOffset.UtcNow` immediately before the `throw`, and change the assertion to `secondSubscribeAt - firstThrowAt >= reconnectDelay - 10ms` — the gap then measures only the reconnect delay, eliminating the variable scheduling baseline. (b) The deeper fix: extend `DashboardSnapshotPublisher` to accept an `ITimeProvider`-style delay seam (or a virtual `DelayAsync` hook) so a `ManualTimeProvider` could advance time deterministically. (a) is preferred for now; (b) belongs as a follow-up if more reconnect-loop tests are added.
+
+**Resolution:** 2026-05-24 — Applied option (a). Added `FirstThrowAt` to `ThrowOnceThenYieldSnapshotService` and set it via `FirstThrowAt = DateTimeOffset.UtcNow;` immediately before the first-call `throw`. Removed the pre-`StartAsync` `startedAt` baseline; the assertion now reads `gap = secondSubscribeAt - firstThrowAt` (both timestamps captured inside the fake), and the 10 ms slack absorbs the Windows `Task.Delay` quantum without the variable `StartAsync` / scheduling overhead in the baseline. This is the same flake-isolation pattern Tests-006 / Tests-017 used (measuring only the production delay, not test-side setup). Suite green; the test passes deterministically across repeated runs.