code-reviews: re-review Tests at 42b0037

Append 5 new findings (Tests-027..031) covering the StreamEvents_WhenEventIsWritten_RecordsSendDuration flake root cause (shared MeterListener by meter name), missing kill-path coverage (reason propagation + concurrent-kill double-count), asymmetric guard coverage between Close and Kill, missing audit-failure-path coverage for ApiKey Delete, and the DashboardSnapshotPublisher reconnect-window timer sensitivity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:15 -04:00
parent 056f0d8808
commit 2b92be02b9
1 changed files with 97 additions and 3 deletions
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.MxGateway.Tests` |
 | Reviewer | Claude Code |
 | Review date | 2026-05-24 |
-| Commit reviewed | `d692232` |
-| Status | Reviewed |
-| Open findings | 0 |
+| Commit reviewed | `42b0037` |
+| Status | Re-reviewed |
+| Open findings | 5 |

 ## Checklist coverage

@@ -26,6 +26,35 @@ This pass (commit `a020350`) re-reviews the module after the Tests-013–019 bat
 | 9 | Testing coverage | Issues found: Tests-020 (`MxAccessGatewayServiceConstraintTests` covers only 2 of 4 `WriteBulkConstraintPlan` switch arms — `Write2Bulk`/`WriteSecured2Bulk` `GetPayload`/`SetPayload` would silently break with no failing test), Tests-022 (the eleven `SessionManagerBulkTests.*_PropagatesCancellation` tests pre-cancel the token, so the fake's first-line `ThrowIfCancellationRequested` handles it before `InvokeBulkInternalAsync` even runs — they do not exercise mid-flight cancellation), Tests-024 (`BulkConstraintPlan.MergeDeniedInto` silently drops or under-fills if the worker reply count diverges from the allowed-count — no test pins this protocol-mismatch edge case). |
 | 10 | Documentation & comments | No new issues. Tests-019's `docs/GatewayTesting.md` addition is in place; new test files (`SessionManagerBulkTests`, `MxAccessGatewayServiceConstraintTests`, `PredicateConstraintEnforcer`) all have orienting class-level summaries. |

+### 2026-05-24 re-review (commit 42b0037)
+
+Re-review pass at `42b0037` after the dashboard admin-actions wave: the new
+`DashboardSessionAdminServiceTests` covers Close/Kill viewer/admin gating, the
+extended `SessionManagerTests.KillWorkerAsync_*` tests cover the
+SessionManager-side kill path, `DashboardApiKeyManagementServiceTests` gains
+DeleteAsync coverage, and `EventStreamServiceTests` adds a recording
+broadcaster fixture and a throwing-broadcaster fixture. Four FakeSessionManager
+impls were extended with `KillWorkerAsync` stubs to keep the interface compiling.
+Also new: `DashboardSnapshotPublisherTests` (Server-042 reconnect loop),
+`HubTokenServiceTests` (Server-039 null-identity rejection), and a large
+`ProtobufContractRoundTripTests` expansion for the new bulk write/read
+payload oneof cases.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | No issues found in this diff. |
+| 2 | mxaccessgw conventions | No issues found in this diff. |
+| 3 | Concurrency & thread safety | Issue found: Tests-027 (the known-flake `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` cross-talks via the shared `MxGateway.Server` Meter when parallel tests record `mxgateway.events.stream_send.duration`). |
+| 4 | Error handling & resilience | Issue found: Tests-029 (`DashboardSessionAdminService.KillWorkerAsync` has a `SessionNotFound` and a general `SessionManagerException` catch branch; neither is tested. Also `CloseSessionAsync` has no blank-id test even though Kill does — symmetric guard, asymmetric coverage). |
+| 5 | Security | No issues found in this diff. |
+| 6 | Performance & resource management | No issues found in this diff. |
+| 7 | Design-document adherence | No issues found in this diff. |
+| 8 | Code organization & conventions | No issues found in this diff — the `NullDashboardEventBroadcaster` consolidation (Tests-025) is in place. |
+| 9 | Testing coverage | Issues found: Tests-028 (`KillWorkerAsync_KillsWorkerAndRemovesSession` does not pin the `reason` argument propagating through `SessionManager.KillWorkerAsync` → `session.KillWorker(reason)` → `IWorkerClient.Kill(reason)`; the fake's `Kill(string reason)` discards the argument and there is no `KillWorkerAsync_WithBlankReason_ThrowsArgument` test), Tests-030 (`DashboardApiKeyManagementService.DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` does not inject an audit store or assert the `"not-found-or-active"` audit entry the product code emits on the failure path, and there is no `DeleteAsync_BlankKeyId_ReturnsFailure` for the `ValidateKeyId` guard). |
+| 10 | Documentation & comments | No issues found in this diff. |
+
+Flake-risk note: Tests-031 (`DashboardSnapshotPublisherTests.ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` measures the reconnect gap from `startedAt` to `secondSubscribeAt` with only 10 ms slack against a 50 ms `reconnectDelay`; Windows `Task.Delay` timer granularity is ~15 ms and the gap baseline includes scheduling overhead, so the lower-bound assertion can spuriously fail on a slow CI agent).
+
 ### 2026-05-24 review (commit d692232)

 Re-review pass at `d692232` scoped to the test-side fixture churn from
@@ -451,3 +480,68 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
 **Recommendation:** Add a fixture to `EventStreamServiceTests` named e.g. `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster`: inject a recording fake that captures `(sessionId, mxEvent)` calls, push two events through the fake session, and assert the broadcaster received both with the correct session id and matching `WorkerSequence`. This pins the broadcast hook and proves the dashboard event mirror is not a no-op.

 **Resolution:** 2026-05-24 — Added `src/ZB.MOM.WW.MxGateway.Tests/TestSupport/RecordingDashboardEventBroadcaster.cs` — a thread-safe `IDashboardEventBroadcaster` test double backed by a `ConcurrentQueue<DashboardEventCapture>` that captures every `(sessionId, mxEvent)` invocation. Added `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster` to `EventStreamServiceTests`: pushes two events (`WorkerSequence` 7 / `OnDataChange` and 8 / `OnWriteComplete`) through the fake session, drains the stream, and asserts the recording broadcaster captured exactly two publishes with the matching `SessionId`, `WorkerSequence`, and `Family` for each. TDD red/green confirmed: a deliberate "expected 3 captures" failed (`Expected: 3, Actual: 2`) before flipping to the correct count; the green run passes deterministically. The fixture would have caught a regression that drops or wraps `dashboardEventBroadcaster.Publish` at `EventStreamService.cs:133`. Build clean (0 warnings); full Tests suite green (486/486).
+
+### Tests-027
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:199-240`, `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:8,73,246-251` |
+| Status | Open |
+
+**Description:** The review brief explicitly flagged `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` as a known flake that "passed solo on rerun". The root cause is the `MeterListener` subscribes by `instrument.Meter.Name == GatewayMetrics.MeterName` (a *process-shared* constant `"MxGateway.Server"`), not by the specific `GatewayMetrics` instance constructed in the test. Tests-012 made the xUnit parallelism policy explicit (`parallelizeTestCollections: true`, `maxParallelThreads: -1`), and every other test that builds its own `GatewayMetrics()` and exercises `MxAccessGatewayService.StreamEvents` or `EventStreamService.StreamEventsAsync` (e.g. the new `StreamEventsAsync_*` family added by Tests-026 and Server-041, plus the pre-existing `StreamEventsAsync_YieldsEventsInWorkerOrder` etc.) routes through `GatewayMetrics.RecordEventStreamSend` → the same histogram name `mxgateway.events.stream_send.duration`. When two such tests run concurrently in the same xUnit process, the `MeterListener` in this test sees measurements from *both* meters and `families.Count` grows to >1, breaking `Assert.Equal([MxEventFamily.OnDataChange.ToString()], families)`. Solo reruns pass because no other producer is alive. This is exactly the cross-test mutable-state pattern Tests-012 set the guardrail comment against.
+
+**Recommendation:** Either (a) filter the `MeterListener` callback by the specific `Meter` instance — capture `metrics._meter` (or expose `GatewayMetrics.Meter`) and compare with `ReferenceEquals(instrument.Meter, expectedMeter)` instead of comparing `Meter.Name`; or (b) place this test in a single-threaded `[Collection("GatewayMetrics-Listener")]` so no other `RecordEventStreamSend` producer runs concurrently. Option (a) is preferred because it removes the cross-talk vector permanently and lets the test stay parallelisable.
+
+### Tests-028
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Testing coverage |
+| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:466-496,802-807`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-253` |
+| Status | Open |
+
+**Description:** The new `KillWorkerAsync_KillsWorkerAndRemovesSession` (line 466) and `KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound` (line 486) pin the new kill-path entry, but they do not pin the `reason` argument propagating through the chain. `SessionManager.KillWorkerAsync(sessionId, reason, ct)` validates `reason` with `ArgumentException.ThrowIfNullOrWhiteSpace(reason)` (line 221), calls `session.KillWorker(reason)` (line 229), and logs `reason={Reason}` (line 251); but the `FakeWorkerClient.Kill(string reason)` discards the argument (line 803-807) and the assertion is only `Assert.Equal(1, workerClient.KillCount)`. A regression that (a) hard-coded an internal `"unspecified"` reason between `SessionManager` and `GatewaySession`, (b) swapped to a different overload that dropped the reason, or (c) deleted the `ThrowIfNullOrWhiteSpace` guard would all pass the current tests. The dashboard caller (`DashboardSessionAdminService.KillWorkerAsync`) passes a hard-coded `"dashboard-admin-kill"` reason and the only test that observes it (`KillWorkerAsync_AdminKillsWorker`) asserts `!string.IsNullOrWhiteSpace(LastKillReason)` rather than pinning the value — so the same-class drift is also untested.
+
+**Recommendation:** (1) Capture `LastKillReason` on `FakeWorkerClient.Kill` and assert `KillWorkerAsync_KillsWorkerAndRemovesSession` propagates the test-supplied `"test-kill"` string end-to-end. (2) Add `KillWorkerAsync_WithBlankReason_ThrowsArgumentException` (parameterised over `null`, `""`, `"   "`) to pin the `ArgumentException.ThrowIfNullOrWhiteSpace` guard. (3) Tighten `DashboardSessionAdminServiceTests.KillWorkerAsync_AdminKillsWorker` to `Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason)` so a future reason-string change is a deliberate test update.
+
+### Tests-029
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs:61-106,139-222`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:77-125` |
+| Status | Open |
+
+**Description:** The new `DashboardSessionAdminServiceTests` covers the happy path and the viewer-denial path for both `CloseSessionAsync` and `KillWorkerAsync`, plus `CloseSessionAsync_WhenSessionMissing_ReportsFriendlyError` for the close-side `SessionNotFound` catch — but the kill-side error branches are not tested. The product code's `KillWorkerAsync` (lines 111-114) has the same `SessionNotFound` catch returning `"Session {id} was not found."` and (lines 115-124) a generic `SessionManagerException` catch returning `"Kill failed: {message}"`; neither is exercised. The fake's `KillWorkerAsync` (lines 200-209) only succeeds — there is no `KillThrowsNotFound` / `KillThrowsGeneric` configuration option matching the existing `CloseThrowsNotFound`. Symmetrically, `CloseSessionAsync` has the same `IsNullOrWhiteSpace(sessionId)` guard (line 37-40) but no blank-id test even though `KillWorkerAsync_BlankSessionId_ReturnsFailure` exists for the parallel kill guard — a guard-removal regression on close would slip through.
+
+**Recommendation:** Mirror the existing close-side fixtures onto the kill side: add `KillThrowsNotFound` / `KillThrowsGeneric` init-flags to the `FakeSessionManager`, then `KillWorkerAsync_WhenSessionMissing_ReportsFriendlyError`, `KillWorkerAsync_WhenSessionManagerThrows_ReportsKillFailedMessage`, and `CloseSessionAsync_BlankSessionId_ReturnsFailure`. These are mechanical copies of the existing patterns and bring close/kill coverage into symmetry.
+
+### Tests-030
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Testing coverage |
+| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs:115-163`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:146-177` |
+| Status | Open |
+
+**Description:** The three new `DeleteAsync_*` fixtures cover unauthorised user, success path with audit, and store-refuses-with-friendly-error. They do not exercise two production behaviours: (1) `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` (line 151-163) does not construct or inject a `FakeApiKeyAuditStore`, so it never observes that the product code still emits an audit entry with `EventType = "dashboard-delete-key"` and `Details = "not-found-or-active"` on the failure branch (`AppendAuditAsync` runs unconditionally at line 167-172). A regression that placed the `AppendAuditAsync` call inside the `if (deleted)` branch would silently drop the audit trail for refused deletes — a real audit-completeness gap. (2) There is no `DeleteAsync_BlankKeyId_ReturnsFailure` or `DeleteAsync_InvalidKeyId_ReturnsFailure` test, even though `ValidateKeyId(keyId)` (line 156-160) guards on the same conditions as Create/Revoke/Rotate. The `Revoke`/`Rotate` paths have equivalent fixtures (the file's earlier tests cover them); only Delete is missing them.
+
+**Recommendation:** (1) Add a `FakeApiKeyAuditStore` to `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` and assert it contains exactly one entry with `EventType == "dashboard-delete-key"` and `Details == "not-found-or-active"`. (2) Add `DeleteAsync_BlankKeyId_ReturnsFailure` (parameterised over `null`, `""`, `"   "`) and `DeleteAsync_InvalidKeyId_ReturnsFailure` (a keyId with characters the `ValidateKeyId` rules reject) to pin the validation branch end-to-end.
+
+### Tests-031
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs:22-61` |
+| Status | Open |
+
+**Description:** `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` records `startedAt = DateTimeOffset.UtcNow` *before* calling `publisher.StartAsync(...)`, then asserts `secondSubscribeAt - startedAt >= reconnectDelay - 10ms` (line 59). The measured gap is *not* the reconnect delay in isolation — it is `(StartAsync scheduling) + (first WatchSnapshotsAsync setup + Task.Yield) + (throw) + reconnect delay + (second WatchSnapshotsAsync setup)`. On a slow/contended CI agent the first three terms easily dominate (favouring the assertion); but on a fast machine, Windows `Task.Delay(50ms)` rounds up to the next ~15.6 ms tick boundary and may return at ~46-50 ms relative to schedule, while the first three terms can be sub-millisecond — so the gap measurement can land within 1-2 ms of the lower bound, and the 10 ms slack may not absorb a single missed quantum. This is a latent flake of the same flavour as Tests-006 (heartbeat timing) but on a wall-clock dependency the test cannot inject around because `DashboardSnapshotPublisher` uses `Task.Delay(_reconnectDelay)` directly. Tests-006 / Tests-017 moved heartbeat tests onto `ManualTimeProvider`; this test cannot do that without a product change to use a `TimeProvider`-aware delay.
+
+**Recommendation:** (a) The cheap fix: have `ThrowOnceThenYieldSnapshotService` record `_firstThrowAt = DateTimeOffset.UtcNow` immediately before the `throw`, and change the assertion to `secondSubscribeAt - firstThrowAt >= reconnectDelay - 10ms` — the gap then measures only the reconnect delay, eliminating the variable scheduling baseline. (b) The deeper fix: extend `DashboardSnapshotPublisher` to accept an `ITimeProvider`-style delay seam (or a virtual `DelayAsync` hook) so a `ManualTimeProvider` could advance time deterministically. (a) is preferred for now; (b) belongs as a follow-up if more reconnect-loop tests are added.