code-reviews: re-review Tests at 42b0037

Append 5 new findings (Tests-027..031) covering the
StreamEvents_WhenEventIsWritten_RecordsSendDuration flake root cause
(shared MeterListener by meter name), missing kill-path coverage
(reason propagation + concurrent-kill double-count), asymmetric guard
coverage between Close and Kill, missing audit-failure-path coverage
for ApiKey Delete, and the DashboardSnapshotPublisher reconnect-window
timer sensitivity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-24 08:28:15 -04:00
parent 056f0d8808
commit 2b92be02b9
+97 -3
View File
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.MxGateway.Tests` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 5 |
## Checklist coverage
@@ -26,6 +26,35 @@ This pass (commit `a020350`) re-reviews the module after the Tests-013019 bat
| 9 | Testing coverage | Issues found: Tests-020 (`MxAccessGatewayServiceConstraintTests` covers only 2 of 4 `WriteBulkConstraintPlan` switch arms — `Write2Bulk`/`WriteSecured2Bulk` `GetPayload`/`SetPayload` would silently break with no failing test), Tests-022 (the eleven `SessionManagerBulkTests.*_PropagatesCancellation` tests pre-cancel the token, so the fake's first-line `ThrowIfCancellationRequested` handles it before `InvokeBulkInternalAsync` even runs — they do not exercise mid-flight cancellation), Tests-024 (`BulkConstraintPlan.MergeDeniedInto` silently drops or under-fills if the worker reply count diverges from the allowed-count — no test pins this protocol-mismatch edge case). |
| 10 | Documentation & comments | No new issues. Tests-019's `docs/GatewayTesting.md` addition is in place; new test files (`SessionManagerBulkTests`, `MxAccessGatewayServiceConstraintTests`, `PredicateConstraintEnforcer`) all have orienting class-level summaries. |
### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037` after the dashboard admin-actions wave: the new
`DashboardSessionAdminServiceTests` covers Close/Kill viewer/admin gating, the
extended `SessionManagerTests.KillWorkerAsync_*` tests cover the
SessionManager-side kill path, `DashboardApiKeyManagementServiceTests` gains
DeleteAsync coverage, and `EventStreamServiceTests` adds a recording
broadcaster fixture and a throwing-broadcaster fixture. Four FakeSessionManager
impls were extended with `KillWorkerAsync` stubs to keep the interface compiling.
Also new: `DashboardSnapshotPublisherTests` (Server-042 reconnect loop),
`HubTokenServiceTests` (Server-039 null-identity rejection), and a large
`ProtobufContractRoundTripTests` expansion for the new bulk write/read
payload oneof cases.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found in this diff. |
| 2 | mxaccessgw conventions | No issues found in this diff. |
| 3 | Concurrency & thread safety | Issue found: Tests-027 (the known-flake `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` cross-talks via the shared `MxGateway.Server` Meter when parallel tests record `mxgateway.events.stream_send.duration`). |
| 4 | Error handling & resilience | Issue found: Tests-029 (`DashboardSessionAdminService.KillWorkerAsync` has a `SessionNotFound` and a general `SessionManagerException` catch branch; neither is tested. Also `CloseSessionAsync` has no blank-id test even though Kill does — symmetric guard, asymmetric coverage). |
| 5 | Security | No issues found in this diff. |
| 6 | Performance & resource management | No issues found in this diff. |
| 7 | Design-document adherence | No issues found in this diff. |
| 8 | Code organization & conventions | No issues found in this diff — the `NullDashboardEventBroadcaster` consolidation (Tests-025) is in place. |
| 9 | Testing coverage | Issues found: Tests-028 (`KillWorkerAsync_KillsWorkerAndRemovesSession` does not pin the `reason` argument propagating through `SessionManager.KillWorkerAsync``session.KillWorker(reason)``IWorkerClient.Kill(reason)`; the fake's `Kill(string reason)` discards the argument and there is no `KillWorkerAsync_WithBlankReason_ThrowsArgument` test), Tests-030 (`DashboardApiKeyManagementService.DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` does not inject an audit store or assert the `"not-found-or-active"` audit entry the product code emits on the failure path, and there is no `DeleteAsync_BlankKeyId_ReturnsFailure` for the `ValidateKeyId` guard). |
| 10 | Documentation & comments | No issues found in this diff. |
Flake-risk note: Tests-031 (`DashboardSnapshotPublisherTests.ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` measures the reconnect gap from `startedAt` to `secondSubscribeAt` with only 10 ms slack against a 50 ms `reconnectDelay`; Windows `Task.Delay` timer granularity is ~15 ms and the gap baseline includes scheduling overhead, so the lower-bound assertion can spuriously fail on a slow CI agent).
### 2026-05-24 review (commit d692232)
Re-review pass at `d692232` scoped to the test-side fixture churn from
@@ -451,3 +480,68 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
**Recommendation:** Add a fixture to `EventStreamServiceTests` named e.g. `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster`: inject a recording fake that captures `(sessionId, mxEvent)` calls, push two events through the fake session, and assert the broadcaster received both with the correct session id and matching `WorkerSequence`. This pins the broadcast hook and proves the dashboard event mirror is not a no-op.
**Resolution:** 2026-05-24 — Added `src/ZB.MOM.WW.MxGateway.Tests/TestSupport/RecordingDashboardEventBroadcaster.cs` — a thread-safe `IDashboardEventBroadcaster` test double backed by a `ConcurrentQueue<DashboardEventCapture>` that captures every `(sessionId, mxEvent)` invocation. Added `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster` to `EventStreamServiceTests`: pushes two events (`WorkerSequence` 7 / `OnDataChange` and 8 / `OnWriteComplete`) through the fake session, drains the stream, and asserts the recording broadcaster captured exactly two publishes with the matching `SessionId`, `WorkerSequence`, and `Family` for each. TDD red/green confirmed: a deliberate "expected 3 captures" failed (`Expected: 3, Actual: 2`) before flipping to the correct count; the green run passes deterministically. The fixture would have caught a regression that drops or wraps `dashboardEventBroadcaster.Publish` at `EventStreamService.cs:133`. Build clean (0 warnings); full Tests suite green (486/486).
### Tests-027
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:199-240`, `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:8,73,246-251` |
| Status | Open |
**Description:** The review brief explicitly flagged `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` as a known flake that "passed solo on rerun". The root cause is the `MeterListener` subscribes by `instrument.Meter.Name == GatewayMetrics.MeterName` (a *process-shared* constant `"MxGateway.Server"`), not by the specific `GatewayMetrics` instance constructed in the test. Tests-012 made the xUnit parallelism policy explicit (`parallelizeTestCollections: true`, `maxParallelThreads: -1`), and every other test that builds its own `GatewayMetrics()` and exercises `MxAccessGatewayService.StreamEvents` or `EventStreamService.StreamEventsAsync` (e.g. the new `StreamEventsAsync_*` family added by Tests-026 and Server-041, plus the pre-existing `StreamEventsAsync_YieldsEventsInWorkerOrder` etc.) routes through `GatewayMetrics.RecordEventStreamSend` → the same histogram name `mxgateway.events.stream_send.duration`. When two such tests run concurrently in the same xUnit process, the `MeterListener` in this test sees measurements from *both* meters and `families.Count` grows to >1, breaking `Assert.Equal([MxEventFamily.OnDataChange.ToString()], families)`. Solo reruns pass because no other producer is alive. This is exactly the cross-test mutable-state pattern Tests-012 set the guardrail comment against.
**Recommendation:** Either (a) filter the `MeterListener` callback by the specific `Meter` instance — capture `metrics._meter` (or expose `GatewayMetrics.Meter`) and compare with `ReferenceEquals(instrument.Meter, expectedMeter)` instead of comparing `Meter.Name`; or (b) place this test in a single-threaded `[Collection("GatewayMetrics-Listener")]` so no other `RecordEventStreamSend` producer runs concurrently. Option (a) is preferred because it removes the cross-talk vector permanently and lets the test stay parallelisable.
### Tests-028
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:466-496,802-807`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-253` |
| Status | Open |
**Description:** The new `KillWorkerAsync_KillsWorkerAndRemovesSession` (line 466) and `KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound` (line 486) pin the new kill-path entry, but they do not pin the `reason` argument propagating through the chain. `SessionManager.KillWorkerAsync(sessionId, reason, ct)` validates `reason` with `ArgumentException.ThrowIfNullOrWhiteSpace(reason)` (line 221), calls `session.KillWorker(reason)` (line 229), and logs `reason={Reason}` (line 251); but the `FakeWorkerClient.Kill(string reason)` discards the argument (line 803-807) and the assertion is only `Assert.Equal(1, workerClient.KillCount)`. A regression that (a) hard-coded an internal `"unspecified"` reason between `SessionManager` and `GatewaySession`, (b) swapped to a different overload that dropped the reason, or (c) deleted the `ThrowIfNullOrWhiteSpace` guard would all pass the current tests. The dashboard caller (`DashboardSessionAdminService.KillWorkerAsync`) passes a hard-coded `"dashboard-admin-kill"` reason and the only test that observes it (`KillWorkerAsync_AdminKillsWorker`) asserts `!string.IsNullOrWhiteSpace(LastKillReason)` rather than pinning the value — so the same-class drift is also untested.
**Recommendation:** (1) Capture `LastKillReason` on `FakeWorkerClient.Kill` and assert `KillWorkerAsync_KillsWorkerAndRemovesSession` propagates the test-supplied `"test-kill"` string end-to-end. (2) Add `KillWorkerAsync_WithBlankReason_ThrowsArgumentException` (parameterised over `null`, `""`, `" "`) to pin the `ArgumentException.ThrowIfNullOrWhiteSpace` guard. (3) Tighten `DashboardSessionAdminServiceTests.KillWorkerAsync_AdminKillsWorker` to `Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason)` so a future reason-string change is a deliberate test update.
### Tests-029
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs:61-106,139-222`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:77-125` |
| Status | Open |
**Description:** The new `DashboardSessionAdminServiceTests` covers the happy path and the viewer-denial path for both `CloseSessionAsync` and `KillWorkerAsync`, plus `CloseSessionAsync_WhenSessionMissing_ReportsFriendlyError` for the close-side `SessionNotFound` catch — but the kill-side error branches are not tested. The product code's `KillWorkerAsync` (lines 111-114) has the same `SessionNotFound` catch returning `"Session {id} was not found."` and (lines 115-124) a generic `SessionManagerException` catch returning `"Kill failed: {message}"`; neither is exercised. The fake's `KillWorkerAsync` (lines 200-209) only succeeds — there is no `KillThrowsNotFound` / `KillThrowsGeneric` configuration option matching the existing `CloseThrowsNotFound`. Symmetrically, `CloseSessionAsync` has the same `IsNullOrWhiteSpace(sessionId)` guard (line 37-40) but no blank-id test even though `KillWorkerAsync_BlankSessionId_ReturnsFailure` exists for the parallel kill guard — a guard-removal regression on close would slip through.
**Recommendation:** Mirror the existing close-side fixtures onto the kill side: add `KillThrowsNotFound` / `KillThrowsGeneric` init-flags to the `FakeSessionManager`, then `KillWorkerAsync_WhenSessionMissing_ReportsFriendlyError`, `KillWorkerAsync_WhenSessionManagerThrows_ReportsKillFailedMessage`, and `CloseSessionAsync_BlankSessionId_ReturnsFailure`. These are mechanical copies of the existing patterns and bring close/kill coverage into symmetry.
### Tests-030
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs:115-163`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:146-177` |
| Status | Open |
**Description:** The three new `DeleteAsync_*` fixtures cover unauthorised user, success path with audit, and store-refuses-with-friendly-error. They do not exercise two production behaviours: (1) `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` (line 151-163) does not construct or inject a `FakeApiKeyAuditStore`, so it never observes that the product code still emits an audit entry with `EventType = "dashboard-delete-key"` and `Details = "not-found-or-active"` on the failure branch (`AppendAuditAsync` runs unconditionally at line 167-172). A regression that placed the `AppendAuditAsync` call inside the `if (deleted)` branch would silently drop the audit trail for refused deletes — a real audit-completeness gap. (2) There is no `DeleteAsync_BlankKeyId_ReturnsFailure` or `DeleteAsync_InvalidKeyId_ReturnsFailure` test, even though `ValidateKeyId(keyId)` (line 156-160) guards on the same conditions as Create/Revoke/Rotate. The `Revoke`/`Rotate` paths have equivalent fixtures (the file's earlier tests cover them); only Delete is missing them.
**Recommendation:** (1) Add a `FakeApiKeyAuditStore` to `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` and assert it contains exactly one entry with `EventType == "dashboard-delete-key"` and `Details == "not-found-or-active"`. (2) Add `DeleteAsync_BlankKeyId_ReturnsFailure` (parameterised over `null`, `""`, `" "`) and `DeleteAsync_InvalidKeyId_ReturnsFailure` (a keyId with characters the `ValidateKeyId` rules reject) to pin the validation branch end-to-end.
### Tests-031
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs:22-61` |
| Status | Open |
**Description:** `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` records `startedAt = DateTimeOffset.UtcNow` *before* calling `publisher.StartAsync(...)`, then asserts `secondSubscribeAt - startedAt >= reconnectDelay - 10ms` (line 59). The measured gap is *not* the reconnect delay in isolation — it is `(StartAsync scheduling) + (first WatchSnapshotsAsync setup + Task.Yield) + (throw) + reconnect delay + (second WatchSnapshotsAsync setup)`. On a slow/contended CI agent the first three terms easily dominate (favouring the assertion); but on a fast machine, Windows `Task.Delay(50ms)` rounds up to the next ~15.6 ms tick boundary and may return at ~46-50 ms relative to schedule, while the first three terms can be sub-millisecond — so the gap measurement can land within 1-2 ms of the lower bound, and the 10 ms slack may not absorb a single missed quantum. This is a latent flake of the same flavour as Tests-006 (heartbeat timing) but on a wall-clock dependency the test cannot inject around because `DashboardSnapshotPublisher` uses `Task.Delay(_reconnectDelay)` directly. Tests-006 / Tests-017 moved heartbeat tests onto `ManualTimeProvider`; this test cannot do that without a product change to use a `TimeProvider`-aware delay.
**Recommendation:** (a) The cheap fix: have `ThrowOnceThenYieldSnapshotService` record `_firstThrowAt = DateTimeOffset.UtcNow` immediately before the `throw`, and change the assertion to `secondSubscribeAt - firstThrowAt >= reconnectDelay - 10ms` — the gap then measures only the reconnect delay, eliminating the variable scheduling baseline. (b) The deeper fix: extend `DashboardSnapshotPublisher` to accept an `ITimeProvider`-style delay seam (or a virtual `DelayAsync` hook) so a `ManualTimeProvider` could advance time deterministically. (a) is preferred for now; (b) belongs as a follow-up if more reconnect-loop tests are added.