From 6bae5ea3a36fddb345bed87b5d1542148a14a53d Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Sun, 24 May 2026 09:28:54 -0400 Subject: [PATCH] Resolve Tests-027..031: flake root cause + coverage gaps Tests-027 GatewayMetrics exposes its internal Meter; the StreamEvents_WhenEventIsWritten_RecordsSendDuration listener now filters by ReferenceEquals(instrument.Meter, metrics.Meter) instead of Meter.Name, so parallel tests with their own GatewayMetrics no longer cross-contaminate the families list. Tests-028 FakeWorkerClient.Kill now captures LastKillReason; SessionManager.KillWorkerAsync tests pin the reason propagation end-to-end and cover the blank/null guard. The DashboardSessionAdminService kill test pins the literal dashboard-admin-kill reason. Tests-029 Added CloseSessionAsync_BlankSessionId_ReturnsFailure to mirror the existing KillWorkerAsync blank-id coverage. Tests-030 DeleteAsync_WhenStoreRefuses_ReportsFriendlyError renamed and extended to assert the dashboard-delete-key audit row with Details = not-found-or-active. Added DeleteAsync_BlankKeyId_ReturnsFailure. Tests-031 DashboardSnapshotPublisher reconnect test now measures the gap from the first throw inside the fake (firstThrowAt) to secondSubscribeAt, isolating Task.Delay from StartAsync / scheduling overhead. All resolved at 2026-05-24; 512/512 gateway tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) --- code-reviews/Tests/findings.md | 24 ++++-- .../Metrics/GatewayMetrics.cs | 9 +++ .../DashboardApiKeyManagementServiceTests.cs | 43 +++++++++- .../DashboardSessionAdminServiceTests.cs | 27 ++++++- .../DashboardSnapshotPublisherTests.cs | 31 +++++-- .../Grpc/MxAccessGatewayServiceTests.cs | 80 ++++++++++++++++++- .../Gateway/Sessions/SessionManagerTests.cs | 59 +++++++++++++- 7 files changed, 255 insertions(+), 18 deletions(-) diff --git a/code-reviews/Tests/findings.md b/code-reviews/Tests/findings.md index ca95532..e448eb2 100644 --- a/code-reviews/Tests/findings.md +++ b/code-reviews/Tests/findings.md @@ -7,7 +7,7 @@ | Review date | 2026-05-24 | | Commit reviewed | `42b0037` | | Status | Re-reviewed | -| Open findings | 5 | +| Open findings | 0 | ## Checklist coverage @@ -488,12 +488,14 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t | Severity | Medium | | Category | Concurrency & thread safety | | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:199-240`, `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:8,73,246-251` | -| Status | Open | +| Status | Resolved | **Description:** The review brief explicitly flagged `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` as a known flake that "passed solo on rerun". The root cause is the `MeterListener` subscribes by `instrument.Meter.Name == GatewayMetrics.MeterName` (a *process-shared* constant `"MxGateway.Server"`), not by the specific `GatewayMetrics` instance constructed in the test. Tests-012 made the xUnit parallelism policy explicit (`parallelizeTestCollections: true`, `maxParallelThreads: -1`), and every other test that builds its own `GatewayMetrics()` and exercises `MxAccessGatewayService.StreamEvents` or `EventStreamService.StreamEventsAsync` (e.g. the new `StreamEventsAsync_*` family added by Tests-026 and Server-041, plus the pre-existing `StreamEventsAsync_YieldsEventsInWorkerOrder` etc.) routes through `GatewayMetrics.RecordEventStreamSend` → the same histogram name `mxgateway.events.stream_send.duration`. When two such tests run concurrently in the same xUnit process, the `MeterListener` in this test sees measurements from *both* meters and `families.Count` grows to >1, breaking `Assert.Equal([MxEventFamily.OnDataChange.ToString()], families)`. Solo reruns pass because no other producer is alive. This is exactly the cross-test mutable-state pattern Tests-012 set the guardrail comment against. **Recommendation:** Either (a) filter the `MeterListener` callback by the specific `Meter` instance — capture `metrics._meter` (or expose `GatewayMetrics.Meter`) and compare with `ReferenceEquals(instrument.Meter, expectedMeter)` instead of comparing `Meter.Name`; or (b) place this test in a single-threaded `[Collection("GatewayMetrics-Listener")]` so no other `RecordEventStreamSend` producer runs concurrently. Option (a) is preferred because it removes the cross-talk vector permanently and lets the test stay parallelisable. +**Resolution:** 2026-05-24 — Applied option (a). Added an `internal Meter Meter => _meter;` accessor on `GatewayMetrics` (visible to the Tests project via the existing `InternalsVisibleTo`) and changed both the `InstrumentPublished` filter and the `SetMeasurementEventCallback` filter in `StreamEvents_WhenEventIsWritten_RecordsSendDuration` from `instrument.Meter.Name == GatewayMetrics.MeterName` to `ReferenceEquals(instrument.Meter, metrics.Meter)`. Added a companion regression `StreamEvents_RecordSendDurationListener_IgnoresMeasurementsFromOtherMetersWithSameName` that constructs a second `GatewayMetrics`, records an `OnWriteComplete` measurement on it before the test-under-test publishes, and asserts the listener captures only the test-under-test's `OnDataChange` family. Confirmed the regression catches the original `Meter.Name`-only filter (got `["OnWriteComplete", "OnDataChange"]` for `["OnDataChange"]`) by temporarily reverting the filter shape; restored ReferenceEquals after. Suite green 3/3 (512/512); the two Tests-027 tests pass 5/5 solo. The cross-talk vector is permanently closed without giving up parallelism. + ### Tests-028 | Field | Value | @@ -501,12 +503,14 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t | Severity | Low | | Category | Testing coverage | | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:466-496,802-807`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-253` | -| Status | Open | +| Status | Resolved | **Description:** The new `KillWorkerAsync_KillsWorkerAndRemovesSession` (line 466) and `KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound` (line 486) pin the new kill-path entry, but they do not pin the `reason` argument propagating through the chain. `SessionManager.KillWorkerAsync(sessionId, reason, ct)` validates `reason` with `ArgumentException.ThrowIfNullOrWhiteSpace(reason)` (line 221), calls `session.KillWorker(reason)` (line 229), and logs `reason={Reason}` (line 251); but the `FakeWorkerClient.Kill(string reason)` discards the argument (line 803-807) and the assertion is only `Assert.Equal(1, workerClient.KillCount)`. A regression that (a) hard-coded an internal `"unspecified"` reason between `SessionManager` and `GatewaySession`, (b) swapped to a different overload that dropped the reason, or (c) deleted the `ThrowIfNullOrWhiteSpace` guard would all pass the current tests. The dashboard caller (`DashboardSessionAdminService.KillWorkerAsync`) passes a hard-coded `"dashboard-admin-kill"` reason and the only test that observes it (`KillWorkerAsync_AdminKillsWorker`) asserts `!string.IsNullOrWhiteSpace(LastKillReason)` rather than pinning the value — so the same-class drift is also untested. **Recommendation:** (1) Capture `LastKillReason` on `FakeWorkerClient.Kill` and assert `KillWorkerAsync_KillsWorkerAndRemovesSession` propagates the test-supplied `"test-kill"` string end-to-end. (2) Add `KillWorkerAsync_WithBlankReason_ThrowsArgumentException` (parameterised over `null`, `""`, `" "`) to pin the `ArgumentException.ThrowIfNullOrWhiteSpace` guard. (3) Tighten `DashboardSessionAdminServiceTests.KillWorkerAsync_AdminKillsWorker` to `Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason)` so a future reason-string change is a deliberate test update. +**Resolution:** 2026-05-24 — Added `LastKillReason` to `FakeWorkerClient` in `SessionManagerTests.cs` and set it inside `Kill(string reason)`. Tightened `KillWorkerAsync_KillsWorkerAndRemovesSession` to assert `workerClient.LastKillReason == "test-kill"`, pinning the end-to-end propagation from `SessionManager.KillWorkerAsync` → `session.KillWorker(reason)` → `IWorkerClient.Kill(reason)`. Added `KillWorkerAsync_WithBlankReason_ThrowsArgumentException` as a `[Theory]` over `""`, `" "`, `"\t"` plus a separate `KillWorkerAsync_WithNullReason_ThrowsArgumentNullException` fact (xUnit `InlineData` cannot carry `null` for a non-nullable string, and `ArgumentException.ThrowIfNullOrWhiteSpace` throws `ArgumentNullException` for `null`). Both new tests confirm `KillCount == 0` and the session remains registered, proving the guard fires before any lookup or worker call. Tightened `DashboardSessionAdminServiceTests.KillWorkerAsync_AdminKillsWorker` to `Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason)`. All affected tests pass; suite green. + ### Tests-029 | Field | Value | @@ -514,12 +518,16 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t | Severity | Low | | Category | Error handling & resilience | | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs:61-106,139-222`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:77-125` | -| Status | Open | +| Status | Resolved | **Description:** The new `DashboardSessionAdminServiceTests` covers the happy path and the viewer-denial path for both `CloseSessionAsync` and `KillWorkerAsync`, plus `CloseSessionAsync_WhenSessionMissing_ReportsFriendlyError` for the close-side `SessionNotFound` catch — but the kill-side error branches are not tested. The product code's `KillWorkerAsync` (lines 111-114) has the same `SessionNotFound` catch returning `"Session {id} was not found."` and (lines 115-124) a generic `SessionManagerException` catch returning `"Kill failed: {message}"`; neither is exercised. The fake's `KillWorkerAsync` (lines 200-209) only succeeds — there is no `KillThrowsNotFound` / `KillThrowsGeneric` configuration option matching the existing `CloseThrowsNotFound`. Symmetrically, `CloseSessionAsync` has the same `IsNullOrWhiteSpace(sessionId)` guard (line 37-40) but no blank-id test even though `KillWorkerAsync_BlankSessionId_ReturnsFailure` exists for the parallel kill guard — a guard-removal regression on close would slip through. **Recommendation:** Mirror the existing close-side fixtures onto the kill side: add `KillThrowsNotFound` / `KillThrowsGeneric` init-flags to the `FakeSessionManager`, then `KillWorkerAsync_WhenSessionMissing_ReportsFriendlyError`, `KillWorkerAsync_WhenSessionManagerThrows_ReportsKillFailedMessage`, and `CloseSessionAsync_BlankSessionId_ReturnsFailure`. These are mechanical copies of the existing patterns and bring close/kill coverage into symmetry. +**Re-triage note:** The Server batch already added `CloseSessionAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` and `KillWorkerAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (the Server-050 regressions visible at HEAD lines 125-162 of the test file), so the kill-side `SessionManagerException` general-catch branch and the close-side parallel are both covered there in a generic-exception shape. The only remaining asymmetry was the blank-session-id guard, per the prompt scope. + +**Resolution:** 2026-05-24 — Added `CloseSessionAsync_BlankSessionId_ReturnsFailure` to `DashboardSessionAdminServiceTests`. The new test invokes `service.CloseSessionAsync(adminUser, " ", ct)` and asserts `Succeeded == false` and `sessionManager.CloseCount == 0`, pinning the `string.IsNullOrWhiteSpace(sessionId)` guard at `DashboardSessionAdminService.cs:52-55`. This brings close/kill blank-id coverage into symmetry with the existing `KillWorkerAsync_BlankSessionId_ReturnsFailure`. The `KillThrowsNotFound` / `KillThrowsGeneric` extensions from the original recommendation are not needed because the unexpected-throw branches are already covered by the Server-050 regressions noted above. All tests pass; suite green. + ### Tests-030 | Field | Value | @@ -527,12 +535,14 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t | Severity | Low | | Category | Testing coverage | | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs:115-163`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:146-177` | -| Status | Open | +| Status | Resolved | **Description:** The three new `DeleteAsync_*` fixtures cover unauthorised user, success path with audit, and store-refuses-with-friendly-error. They do not exercise two production behaviours: (1) `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` (line 151-163) does not construct or inject a `FakeApiKeyAuditStore`, so it never observes that the product code still emits an audit entry with `EventType = "dashboard-delete-key"` and `Details = "not-found-or-active"` on the failure branch (`AppendAuditAsync` runs unconditionally at line 167-172). A regression that placed the `AppendAuditAsync` call inside the `if (deleted)` branch would silently drop the audit trail for refused deletes — a real audit-completeness gap. (2) There is no `DeleteAsync_BlankKeyId_ReturnsFailure` or `DeleteAsync_InvalidKeyId_ReturnsFailure` test, even though `ValidateKeyId(keyId)` (line 156-160) guards on the same conditions as Create/Revoke/Rotate. The `Revoke`/`Rotate` paths have equivalent fixtures (the file's earlier tests cover them); only Delete is missing them. **Recommendation:** (1) Add a `FakeApiKeyAuditStore` to `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` and assert it contains exactly one entry with `EventType == "dashboard-delete-key"` and `Details == "not-found-or-active"`. (2) Add `DeleteAsync_BlankKeyId_ReturnsFailure` (parameterised over `null`, `""`, `" "`) and `DeleteAsync_InvalidKeyId_ReturnsFailure` (a keyId with characters the `ValidateKeyId` rules reject) to pin the validation branch end-to-end. +**Resolution:** 2026-05-24 — Renamed `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` to `DeleteAsync_WhenStoreRefuses_ReportsFriendlyErrorAndAudits` and extended it to inject a `FakeApiKeyAuditStore`; the test now asserts the single audit entry has `EventType == "dashboard-delete-key"`, `KeyId == "operator01"`, and `Details == "not-found-or-active"`. This pins the unconditional-audit invariant at `DashboardApiKeyManagementService.cs:167-172` — a regression moving the `AppendAuditAsync` call inside `if (deleted)` would now fail the test. Added `DeleteAsync_BlankKeyId_ReturnsFailure` as a `[Theory]` over `""`, `" "`, `"\t"` that asserts `Succeeded == false`, `adminStore.DeleteCount == 0`, AND `auditStore.Entries` is empty — pinning that the `ValidateKeyId` guard at line 156-160 fires before any store or audit work. All tests pass; suite green. + ### Tests-031 | Field | Value | @@ -540,8 +550,10 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t | Severity | Low | | Category | Concurrency & thread safety | | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs:22-61` | -| Status | Open | +| Status | Resolved | **Description:** `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` records `startedAt = DateTimeOffset.UtcNow` *before* calling `publisher.StartAsync(...)`, then asserts `secondSubscribeAt - startedAt >= reconnectDelay - 10ms` (line 59). The measured gap is *not* the reconnect delay in isolation — it is `(StartAsync scheduling) + (first WatchSnapshotsAsync setup + Task.Yield) + (throw) + reconnect delay + (second WatchSnapshotsAsync setup)`. On a slow/contended CI agent the first three terms easily dominate (favouring the assertion); but on a fast machine, Windows `Task.Delay(50ms)` rounds up to the next ~15.6 ms tick boundary and may return at ~46-50 ms relative to schedule, while the first three terms can be sub-millisecond — so the gap measurement can land within 1-2 ms of the lower bound, and the 10 ms slack may not absorb a single missed quantum. This is a latent flake of the same flavour as Tests-006 (heartbeat timing) but on a wall-clock dependency the test cannot inject around because `DashboardSnapshotPublisher` uses `Task.Delay(_reconnectDelay)` directly. Tests-006 / Tests-017 moved heartbeat tests onto `ManualTimeProvider`; this test cannot do that without a product change to use a `TimeProvider`-aware delay. **Recommendation:** (a) The cheap fix: have `ThrowOnceThenYieldSnapshotService` record `_firstThrowAt = DateTimeOffset.UtcNow` immediately before the `throw`, and change the assertion to `secondSubscribeAt - firstThrowAt >= reconnectDelay - 10ms` — the gap then measures only the reconnect delay, eliminating the variable scheduling baseline. (b) The deeper fix: extend `DashboardSnapshotPublisher` to accept an `ITimeProvider`-style delay seam (or a virtual `DelayAsync` hook) so a `ManualTimeProvider` could advance time deterministically. (a) is preferred for now; (b) belongs as a follow-up if more reconnect-loop tests are added. + +**Resolution:** 2026-05-24 — Applied option (a). Added `FirstThrowAt` to `ThrowOnceThenYieldSnapshotService` and set it via `FirstThrowAt = DateTimeOffset.UtcNow;` immediately before the first-call `throw`. Removed the pre-`StartAsync` `startedAt` baseline; the assertion now reads `gap = secondSubscribeAt - firstThrowAt` (both timestamps captured inside the fake), and the 10 ms slack absorbs the Windows `Task.Delay` quantum without the variable `StartAsync` / scheduling overhead in the baseline. This is the same flake-isolation pattern Tests-006 / Tests-017 used (measuring only the production delay, not test-side setup). Suite green; the test passes deterministically across repeated runs. diff --git a/src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs b/src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs index eadd0ce..65d0c16 100644 --- a/src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs +++ b/src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs @@ -78,6 +78,15 @@ public sealed class GatewayMetrics : IDisposable _meter.CreateObservableGauge("mxgateway.events.grpc_stream_queue.depth", GetGrpcEventStreamQueueDepth); } + /// + /// Gets the underlying instance backing this metrics object. Exposed to tests + /// (via InternalsVisibleTo) so a can filter measurements by + /// against this specific instance rather than by the + /// process-shared , which would cross-talk between parallel tests that + /// each build their own (Tests-027). + /// + internal Meter Meter => _meter; + /// /// Records that a session has been opened. /// diff --git a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs index cef0222..47f470d 100644 --- a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs @@ -147,11 +147,20 @@ public sealed class DashboardApiKeyManagementServiceTests && entry.Details == "deleted"); } + /// + /// Tests-030: when the admin store refuses the delete (returns false), the service + /// still emits a dashboard-delete-key audit entry with Details = "not-found-or-active" + /// because AppendAuditAsync runs unconditionally after the store call. A regression that + /// moved the audit-append call inside the if (deleted) branch would silently drop the + /// audit trail for refused deletes — a real audit-completeness gap. This test pins both the + /// friendly-error response AND the unconditional audit entry. + /// [Fact] - public async Task DeleteAsync_WhenStoreRefuses_ReportsFriendlyError() + public async Task DeleteAsync_WhenStoreRefuses_ReportsFriendlyErrorAndAudits() { FakeApiKeyAdminStore adminStore = new() { DeleteResult = false }; - DashboardApiKeyManagementService service = CreateService(adminStore); + FakeApiKeyAuditStore auditStore = new(); + DashboardApiKeyManagementService service = CreateService(adminStore, auditStore); DashboardApiKeyManagementResult result = await service.DeleteAsync( CreateAuthorizedUser(), @@ -160,6 +169,36 @@ public sealed class DashboardApiKeyManagementServiceTests Assert.False(result.Succeeded); Assert.Contains("Revoke", result.Message, StringComparison.Ordinal); + + ApiKeyAuditEntry auditEntry = Assert.Single(auditStore.Entries); + Assert.Equal("dashboard-delete-key", auditEntry.EventType); + Assert.Equal("operator01", auditEntry.KeyId); + Assert.Equal("not-found-or-active", auditEntry.Details); + } + + /// + /// Tests-030: calls + /// ValidateKeyId after the authorisation check. A blank key id must fail with the + /// shared "API key id is required." message before any store or audit call runs. + /// + [Theory] + [InlineData("")] + [InlineData(" ")] + [InlineData("\t")] + public async Task DeleteAsync_BlankKeyId_ReturnsFailure(string blankKeyId) + { + FakeApiKeyAdminStore adminStore = new(); + FakeApiKeyAuditStore auditStore = new(); + DashboardApiKeyManagementService service = CreateService(adminStore, auditStore); + + DashboardApiKeyManagementResult result = await service.DeleteAsync( + CreateAuthorizedUser(), + blankKeyId, + CancellationToken.None); + + Assert.False(result.Succeeded); + Assert.Equal(0, adminStore.DeleteCount); + Assert.Empty(auditStore.Entries); } /// diff --git a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs index d7fda0b..4b08618 100644 --- a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs @@ -87,7 +87,12 @@ public sealed class DashboardSessionAdminServiceTests Assert.True(result.Succeeded); Assert.Equal(1, sessionManager.KillCount); Assert.Equal("session-1", sessionManager.LastKilledSessionId); - Assert.False(string.IsNullOrWhiteSpace(sessionManager.LastKillReason)); + + // Tests-028: pin the literal reason string so a future caller-side change is a deliberate + // test update rather than a silent drift. DashboardSessionAdminService passes a hard-coded + // "dashboard-admin-kill" so the worker-exit metric (mxgateway.workers.killed) carries a + // stable, machine-greppable reason tag. + Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason); } [Fact] @@ -105,6 +110,26 @@ public sealed class DashboardSessionAdminServiceTests Assert.Equal(0, sessionManager.KillCount); } + /// + /// Tests-029: CloseSessionAsync has the same blank-session-id guard as + /// KillWorkerAsync but previously had no parallel test. Coverage was asymmetric. + /// A guard-removal regression on the close path would slip through. + /// + [Fact] + public async Task CloseSessionAsync_BlankSessionId_ReturnsFailure() + { + FakeSessionManager sessionManager = new(); + DashboardSessionAdminService service = CreateService(sessionManager); + + DashboardSessionAdminResult result = await service.CloseSessionAsync( + CreateUser(DashboardRoles.Admin), + " ", + CancellationToken.None); + + Assert.False(result.Succeeded); + Assert.Equal(0, sessionManager.CloseCount); + } + [Fact] public void CanManage_RejectsUnauthenticatedAndViewer() { diff --git a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs index 393be5c..46bdd60 100644 --- a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs @@ -17,6 +17,13 @@ public sealed class DashboardSnapshotPublisherTests /// reconnect delay and then re-open the subscription. Before the fix, /// the publisher exited on the first non-cancellation exception and /// the dashboard's snapshot stream went silent until process restart. + /// + /// Tests-031: the reconnect-gap measurement is bounded between the + /// moment the first subscribe actually throws and the moment the + /// second subscribe begins. Measuring from startedAt (pre-StartAsync) + /// baselined scheduling overhead into the budget and made the lower bound + /// flaky on slow CI; recording firstThrowAt inside the fake removes + /// that baseline so only the Task.Delay(reconnectDelay) contributes. /// [Fact] public async Task ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay() @@ -31,7 +38,6 @@ public sealed class DashboardSnapshotPublisherTests reconnectDelay); using CancellationTokenSource cts = new(); - DateTimeOffset startedAt = DateTimeOffset.UtcNow; Task execute = publisher.StartAsync(cts.Token); await execute.WaitAsync(TestTimeout); @@ -42,6 +48,8 @@ public sealed class DashboardSnapshotPublisherTests await WaitUntilAsync(() => snapshotService.SubscribeCount >= 2); await WaitUntilAsync(() => hubContext.SendCount >= 1); + DateTimeOffset firstThrowAt = snapshotService.FirstThrowAt + ?? throw new InvalidOperationException("First subscribe did not record a throw timestamp."); DateTimeOffset secondSubscribeAt = snapshotService.SecondSubscribeAt ?? throw new InvalidOperationException("Second subscribe did not record a timestamp."); @@ -52,10 +60,13 @@ public sealed class DashboardSnapshotPublisherTests $"Expected at least 2 subscribe calls, got {snapshotService.SubscribeCount}."); Assert.True(hubContext.SendCount >= 1); - // The gap between the throw (first subscribe) and the reconnect - // (second subscribe) is bounded below by the reconnect delay. We - // give a small slack (10ms) for scheduling jitter on slow CI VMs. - TimeSpan gap = secondSubscribeAt - startedAt; + // Tests-031: the gap is measured from the moment the first subscribe + // actually threw (inside the fake) to the moment the second subscribe + // began (also inside the fake). This isolates the publisher's + // Task.Delay(reconnectDelay) — no StartAsync / scheduling overhead in + // the baseline. The 10ms slack absorbs Task.Delay's coarse Windows + // timer quantum (~15ms) when the underlying scheduler wakes early. + TimeSpan gap = secondSubscribeAt - firstThrowAt; Assert.True(gap >= reconnectDelay - TimeSpan.FromMilliseconds(10), $"Expected reconnect gap >= {reconnectDelay.TotalMilliseconds}ms; got {gap.TotalMilliseconds}ms."); } @@ -100,6 +111,15 @@ public sealed class DashboardSnapshotPublisherTests private sealed class ThrowOnceThenYieldSnapshotService : IDashboardSnapshotService { public int SubscribeCount { get; private set; } + + /// + /// Tests-031: the wall-clock instant the first WatchSnapshotsAsync throws. + /// The reconnect-gap assertion is measured against this timestamp (NOT the + /// pre-StartAsync wall clock) so scheduling overhead is not baselined + /// into the lower bound. + /// + public DateTimeOffset? FirstThrowAt { get; private set; } + public DateTimeOffset? SecondSubscribeAt { get; private set; } public DashboardSnapshot GetSnapshot() @@ -118,6 +138,7 @@ public sealed class DashboardSnapshotPublisherTests // First call: throw after a brief yield so the publisher // observes us as a live producer that failed. await Task.Yield(); + FirstThrowAt = DateTimeOffset.UtcNow; throw new InvalidOperationException("simulated transient snapshot failure"); } diff --git a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs index 3b4bcd5..cacda2e 100644 --- a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs @@ -194,7 +194,17 @@ public sealed class MxAccessGatewayServiceTests Assert.Equal("session-1", sessionManager.LastReadEventsSessionId); } - /// Verifies that StreamEvents records send duration metrics when an event is written. + /// + /// Verifies that StreamEvents records the send-duration histogram per event. + /// + /// Tests-027 (concurrency flake): the listener must filter by the specific + /// instance owned by this test, not by the process-shared + /// . Otherwise a parallel test that constructs its own + /// and records mxgateway.events.stream_send.duration would + /// cross-contaminate families and break the equality assertion below. See the companion + /// + /// regression for the cross-talk reproduction. + /// [Fact] public async Task StreamEvents_WhenEventIsWritten_RecordsSendDuration() { @@ -203,7 +213,7 @@ public sealed class MxAccessGatewayServiceTests List families = []; listener.InstrumentPublished = (instrument, meterListener) => { - if (instrument.Meter.Name == GatewayMetrics.MeterName + if (ReferenceEquals(instrument.Meter, metrics.Meter) && instrument.Name == "mxgateway.events.stream_send.duration") { meterListener.EnableMeasurementEvents(instrument); @@ -212,7 +222,8 @@ public sealed class MxAccessGatewayServiceTests listener.SetMeasurementEventCallback( (instrument, measurement, tags, _) => { - if (instrument.Name != "mxgateway.events.stream_send.duration") + if (!ReferenceEquals(instrument.Meter, metrics.Meter) + || instrument.Name != "mxgateway.events.stream_send.duration") { return; } @@ -239,6 +250,69 @@ public sealed class MxAccessGatewayServiceTests Assert.Equal([MxEventFamily.OnDataChange.ToString()], families); } + /// + /// Tests-027 regression: a that filters by the specific + /// instance (via ) + /// must NOT observe measurements recorded on a different that shares + /// the same . This is the cross-talk vector that previously + /// caused StreamEvents_WhenEventIsWritten_RecordsSendDuration to fail intermittently when + /// run in parallel with another test recording the same histogram. + /// + [Fact] + public async Task StreamEvents_RecordSendDurationListener_IgnoresMeasurementsFromOtherMetersWithSameName() + { + using GatewayMetrics metricsUnderTest = new(); + using GatewayMetrics otherMetrics = new(); + using MeterListener listener = new(); + List families = []; + listener.InstrumentPublished = (instrument, meterListener) => + { + // Subscribe to the stream_send histogram on BOTH meters so the listener + // would observe a cross-talk measurement if the callback did not filter. + if (instrument.Name == "mxgateway.events.stream_send.duration") + { + meterListener.EnableMeasurementEvents(instrument); + } + }; + listener.SetMeasurementEventCallback( + (instrument, measurement, tags, _) => + { + if (!ReferenceEquals(instrument.Meter, metricsUnderTest.Meter) + || instrument.Name != "mxgateway.events.stream_send.duration") + { + return; + } + + foreach (KeyValuePair tag in tags) + { + if (tag.Key == "family" && tag.Value is string family) + { + families.Add(family); + } + } + }); + listener.Start(); + + // Simulate the cross-talk: another test's GatewayMetrics records a value + // before the test-under-test does its single event publish. The listener + // must filter this out by Meter reference. + otherMetrics.RecordEventStreamSend(MxEventFamily.OnWriteComplete.ToString(), TimeSpan.FromMilliseconds(123)); + + FakeSessionManager sessionManager = new(); + sessionManager.Events.Add(CreateWorkerEvent("session-1", workerSequence: 2)); + MxAccessGatewayService service = CreateService(sessionManager, metrics: metricsUnderTest); + RecordingServerStreamWriter writer = new(); + + await service.StreamEvents( + new StreamEventsRequest { SessionId = "session-1" }, + writer, + new TestServerCallContext()); + + // Only the test-under-test's OnDataChange recording should be observed — + // the OnAlarm recording on the sibling meter must NOT leak through. + Assert.Equal([MxEventFamily.OnDataChange.ToString()], families); + } + /// Verifies that CloseSession throws InvalidArgument when session ID is blank. [Fact] public async Task CloseSession_WithBlankSessionId_ThrowsInvalidArgument() diff --git a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs index fd4deab..f379ab8 100644 --- a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs @@ -466,7 +466,12 @@ public sealed class SessionManagerTests Assert.Equal(0, metrics.GetSnapshot().OpenSessions); } - /// Verifies that killing a worker removes the session from the registry without calling shutdown. + /// + /// Verifies that killing a worker removes the session from the registry without calling shutdown. + /// Tests-028: also pins the reason argument propagating through + /// SessionManager.KillWorkerAsyncsession.KillWorker(reason) → + /// IWorkerClient.Kill(reason). + /// [Fact] public async Task KillWorkerAsync_KillsWorkerAndRemovesSession() { @@ -480,12 +485,54 @@ public sealed class SessionManagerTests Assert.False(result.AlreadyClosed); Assert.Equal(SessionState.Closed, result.FinalState); Assert.Equal(1, workerClient.KillCount); + Assert.Equal("test-kill", workerClient.LastKillReason); Assert.Equal(0, workerClient.ShutdownCount); Assert.False(manager.TryGetSession(session.SessionId, out _)); Assert.Equal(1, metrics.GetSnapshot().SessionsClosed); Assert.Equal(0, metrics.GetSnapshot().OpenSessions); } + /// + /// Tests-028: guards its reason argument with + /// . A blank or whitespace reason must throw + /// before any session lookup or worker call runs. + /// + [Theory] + [InlineData("")] + [InlineData(" ")] + [InlineData("\t")] + public async Task KillWorkerAsync_WithBlankReason_ThrowsArgumentException(string blankReason) + { + FakeWorkerClient workerClient = new(); + SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient)); + GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None); + + await Assert.ThrowsAsync( + async () => await manager.KillWorkerAsync(session.SessionId, blankReason, CancellationToken.None)); + + Assert.Equal(0, workerClient.KillCount); + Assert.True(manager.TryGetSession(session.SessionId, out _)); + } + + /// + /// Tests-028: also rejects null. + /// with cannot carry null for a + /// non-nullable string parameter on .NET 10, so the null case is its own fact. + /// + [Fact] + public async Task KillWorkerAsync_WithNullReason_ThrowsArgumentNullException() + { + FakeWorkerClient workerClient = new(); + SessionManager manager = CreateManager(new FakeSessionWorkerClientFactory(workerClient)); + GatewaySession session = await manager.OpenSessionAsync(CreateOpenRequest(), "client-1", CancellationToken.None); + + await Assert.ThrowsAsync( + async () => await manager.KillWorkerAsync(session.SessionId, null!, CancellationToken.None)); + + Assert.Equal(0, workerClient.KillCount); + Assert.True(manager.TryGetSession(session.SessionId, out _)); + } + /// Verifies that killing the worker for an unknown session raises SessionNotFound. [Fact] public async Task KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound() @@ -827,6 +874,15 @@ public sealed class SessionManagerTests /// Gets the number of times kill was called on the fake worker client. public int KillCount { get; private set; } + /// + /// Gets the last reason argument observed by . Tests-028: + /// pins the reason-string propagation through + /// SessionManager.KillWorkerAsyncsession.KillWorker(reason) → + /// IWorkerClient.Kill(reason). Without this, the chain could silently + /// drop or substitute the reason argument and existing tests would still pass. + /// + public string? LastKillReason { get; private set; } + /// Gets the number of times dispose was called on the fake worker client. public int DisposeCount { get; private set; } @@ -913,6 +969,7 @@ public sealed class SessionManagerTests public void Kill(string reason) { KillCount++; + LastKillReason = reason; if (KillException is not null) { throw KillException;