diff --git a/code-reviews/Tests/findings.md b/code-reviews/Tests/findings.md index ef5c141..ca95532 100644 --- a/code-reviews/Tests/findings.md +++ b/code-reviews/Tests/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.MxGateway.Tests` | | Reviewer | Claude Code | | Review date | 2026-05-24 | -| Commit reviewed | `d692232` | -| Status | Reviewed | -| Open findings | 0 | +| Commit reviewed | `42b0037` | +| Status | Re-reviewed | +| Open findings | 5 | ## Checklist coverage @@ -26,6 +26,35 @@ This pass (commit `a020350`) re-reviews the module after the Tests-013–019 bat | 9 | Testing coverage | Issues found: Tests-020 (`MxAccessGatewayServiceConstraintTests` covers only 2 of 4 `WriteBulkConstraintPlan` switch arms — `Write2Bulk`/`WriteSecured2Bulk` `GetPayload`/`SetPayload` would silently break with no failing test), Tests-022 (the eleven `SessionManagerBulkTests.*_PropagatesCancellation` tests pre-cancel the token, so the fake's first-line `ThrowIfCancellationRequested` handles it before `InvokeBulkInternalAsync` even runs — they do not exercise mid-flight cancellation), Tests-024 (`BulkConstraintPlan.MergeDeniedInto` silently drops or under-fills if the worker reply count diverges from the allowed-count — no test pins this protocol-mismatch edge case). | | 10 | Documentation & comments | No new issues. Tests-019's `docs/GatewayTesting.md` addition is in place; new test files (`SessionManagerBulkTests`, `MxAccessGatewayServiceConstraintTests`, `PredicateConstraintEnforcer`) all have orienting class-level summaries. | +### 2026-05-24 re-review (commit 42b0037) + +Re-review pass at `42b0037` after the dashboard admin-actions wave: the new +`DashboardSessionAdminServiceTests` covers Close/Kill viewer/admin gating, the +extended `SessionManagerTests.KillWorkerAsync_*` tests cover the +SessionManager-side kill path, `DashboardApiKeyManagementServiceTests` gains +DeleteAsync coverage, and `EventStreamServiceTests` adds a recording +broadcaster fixture and a throwing-broadcaster fixture. Four FakeSessionManager +impls were extended with `KillWorkerAsync` stubs to keep the interface compiling. +Also new: `DashboardSnapshotPublisherTests` (Server-042 reconnect loop), +`HubTokenServiceTests` (Server-039 null-identity rejection), and a large +`ProtobufContractRoundTripTests` expansion for the new bulk write/read +payload oneof cases. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found in this diff. | +| 2 | mxaccessgw conventions | No issues found in this diff. | +| 3 | Concurrency & thread safety | Issue found: Tests-027 (the known-flake `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` cross-talks via the shared `MxGateway.Server` Meter when parallel tests record `mxgateway.events.stream_send.duration`). | +| 4 | Error handling & resilience | Issue found: Tests-029 (`DashboardSessionAdminService.KillWorkerAsync` has a `SessionNotFound` and a general `SessionManagerException` catch branch; neither is tested. Also `CloseSessionAsync` has no blank-id test even though Kill does — symmetric guard, asymmetric coverage). | +| 5 | Security | No issues found in this diff. | +| 6 | Performance & resource management | No issues found in this diff. | +| 7 | Design-document adherence | No issues found in this diff. | +| 8 | Code organization & conventions | No issues found in this diff — the `NullDashboardEventBroadcaster` consolidation (Tests-025) is in place. | +| 9 | Testing coverage | Issues found: Tests-028 (`KillWorkerAsync_KillsWorkerAndRemovesSession` does not pin the `reason` argument propagating through `SessionManager.KillWorkerAsync` → `session.KillWorker(reason)` → `IWorkerClient.Kill(reason)`; the fake's `Kill(string reason)` discards the argument and there is no `KillWorkerAsync_WithBlankReason_ThrowsArgument` test), Tests-030 (`DashboardApiKeyManagementService.DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` does not inject an audit store or assert the `"not-found-or-active"` audit entry the product code emits on the failure path, and there is no `DeleteAsync_BlankKeyId_ReturnsFailure` for the `ValidateKeyId` guard). | +| 10 | Documentation & comments | No issues found in this diff. | + +Flake-risk note: Tests-031 (`DashboardSnapshotPublisherTests.ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` measures the reconnect gap from `startedAt` to `secondSubscribeAt` with only 10 ms slack against a 50 ms `reconnectDelay`; Windows `Task.Delay` timer granularity is ~15 ms and the gap baseline includes scheduling overhead, so the lower-bound assertion can spuriously fail on a slow CI agent). + ### 2026-05-24 review (commit d692232) Re-review pass at `d692232` scoped to the test-side fixture churn from @@ -451,3 +480,68 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t **Recommendation:** Add a fixture to `EventStreamServiceTests` named e.g. `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster`: inject a recording fake that captures `(sessionId, mxEvent)` calls, push two events through the fake session, and assert the broadcaster received both with the correct session id and matching `WorkerSequence`. This pins the broadcast hook and proves the dashboard event mirror is not a no-op. **Resolution:** 2026-05-24 — Added `src/ZB.MOM.WW.MxGateway.Tests/TestSupport/RecordingDashboardEventBroadcaster.cs` — a thread-safe `IDashboardEventBroadcaster` test double backed by a `ConcurrentQueue` that captures every `(sessionId, mxEvent)` invocation. Added `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster` to `EventStreamServiceTests`: pushes two events (`WorkerSequence` 7 / `OnDataChange` and 8 / `OnWriteComplete`) through the fake session, drains the stream, and asserts the recording broadcaster captured exactly two publishes with the matching `SessionId`, `WorkerSequence`, and `Family` for each. TDD red/green confirmed: a deliberate "expected 3 captures" failed (`Expected: 3, Actual: 2`) before flipping to the correct count; the green run passes deterministically. The fixture would have caught a regression that drops or wraps `dashboardEventBroadcaster.Publish` at `EventStreamService.cs:133`. Build clean (0 warnings); full Tests suite green (486/486). + +### Tests-027 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:199-240`, `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:8,73,246-251` | +| Status | Open | + +**Description:** The review brief explicitly flagged `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` as a known flake that "passed solo on rerun". The root cause is the `MeterListener` subscribes by `instrument.Meter.Name == GatewayMetrics.MeterName` (a *process-shared* constant `"MxGateway.Server"`), not by the specific `GatewayMetrics` instance constructed in the test. Tests-012 made the xUnit parallelism policy explicit (`parallelizeTestCollections: true`, `maxParallelThreads: -1`), and every other test that builds its own `GatewayMetrics()` and exercises `MxAccessGatewayService.StreamEvents` or `EventStreamService.StreamEventsAsync` (e.g. the new `StreamEventsAsync_*` family added by Tests-026 and Server-041, plus the pre-existing `StreamEventsAsync_YieldsEventsInWorkerOrder` etc.) routes through `GatewayMetrics.RecordEventStreamSend` → the same histogram name `mxgateway.events.stream_send.duration`. When two such tests run concurrently in the same xUnit process, the `MeterListener` in this test sees measurements from *both* meters and `families.Count` grows to >1, breaking `Assert.Equal([MxEventFamily.OnDataChange.ToString()], families)`. Solo reruns pass because no other producer is alive. This is exactly the cross-test mutable-state pattern Tests-012 set the guardrail comment against. + +**Recommendation:** Either (a) filter the `MeterListener` callback by the specific `Meter` instance — capture `metrics._meter` (or expose `GatewayMetrics.Meter`) and compare with `ReferenceEquals(instrument.Meter, expectedMeter)` instead of comparing `Meter.Name`; or (b) place this test in a single-threaded `[Collection("GatewayMetrics-Listener")]` so no other `RecordEventStreamSend` producer runs concurrently. Option (a) is preferred because it removes the cross-talk vector permanently and lets the test stay parallelisable. + +### Tests-028 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:466-496,802-807`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-253` | +| Status | Open | + +**Description:** The new `KillWorkerAsync_KillsWorkerAndRemovesSession` (line 466) and `KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound` (line 486) pin the new kill-path entry, but they do not pin the `reason` argument propagating through the chain. `SessionManager.KillWorkerAsync(sessionId, reason, ct)` validates `reason` with `ArgumentException.ThrowIfNullOrWhiteSpace(reason)` (line 221), calls `session.KillWorker(reason)` (line 229), and logs `reason={Reason}` (line 251); but the `FakeWorkerClient.Kill(string reason)` discards the argument (line 803-807) and the assertion is only `Assert.Equal(1, workerClient.KillCount)`. A regression that (a) hard-coded an internal `"unspecified"` reason between `SessionManager` and `GatewaySession`, (b) swapped to a different overload that dropped the reason, or (c) deleted the `ThrowIfNullOrWhiteSpace` guard would all pass the current tests. The dashboard caller (`DashboardSessionAdminService.KillWorkerAsync`) passes a hard-coded `"dashboard-admin-kill"` reason and the only test that observes it (`KillWorkerAsync_AdminKillsWorker`) asserts `!string.IsNullOrWhiteSpace(LastKillReason)` rather than pinning the value — so the same-class drift is also untested. + +**Recommendation:** (1) Capture `LastKillReason` on `FakeWorkerClient.Kill` and assert `KillWorkerAsync_KillsWorkerAndRemovesSession` propagates the test-supplied `"test-kill"` string end-to-end. (2) Add `KillWorkerAsync_WithBlankReason_ThrowsArgumentException` (parameterised over `null`, `""`, `" "`) to pin the `ArgumentException.ThrowIfNullOrWhiteSpace` guard. (3) Tighten `DashboardSessionAdminServiceTests.KillWorkerAsync_AdminKillsWorker` to `Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason)` so a future reason-string change is a deliberate test update. + +### Tests-029 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs:61-106,139-222`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:77-125` | +| Status | Open | + +**Description:** The new `DashboardSessionAdminServiceTests` covers the happy path and the viewer-denial path for both `CloseSessionAsync` and `KillWorkerAsync`, plus `CloseSessionAsync_WhenSessionMissing_ReportsFriendlyError` for the close-side `SessionNotFound` catch — but the kill-side error branches are not tested. The product code's `KillWorkerAsync` (lines 111-114) has the same `SessionNotFound` catch returning `"Session {id} was not found."` and (lines 115-124) a generic `SessionManagerException` catch returning `"Kill failed: {message}"`; neither is exercised. The fake's `KillWorkerAsync` (lines 200-209) only succeeds — there is no `KillThrowsNotFound` / `KillThrowsGeneric` configuration option matching the existing `CloseThrowsNotFound`. Symmetrically, `CloseSessionAsync` has the same `IsNullOrWhiteSpace(sessionId)` guard (line 37-40) but no blank-id test even though `KillWorkerAsync_BlankSessionId_ReturnsFailure` exists for the parallel kill guard — a guard-removal regression on close would slip through. + +**Recommendation:** Mirror the existing close-side fixtures onto the kill side: add `KillThrowsNotFound` / `KillThrowsGeneric` init-flags to the `FakeSessionManager`, then `KillWorkerAsync_WhenSessionMissing_ReportsFriendlyError`, `KillWorkerAsync_WhenSessionManagerThrows_ReportsKillFailedMessage`, and `CloseSessionAsync_BlankSessionId_ReturnsFailure`. These are mechanical copies of the existing patterns and bring close/kill coverage into symmetry. + +### Tests-030 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs:115-163`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:146-177` | +| Status | Open | + +**Description:** The three new `DeleteAsync_*` fixtures cover unauthorised user, success path with audit, and store-refuses-with-friendly-error. They do not exercise two production behaviours: (1) `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` (line 151-163) does not construct or inject a `FakeApiKeyAuditStore`, so it never observes that the product code still emits an audit entry with `EventType = "dashboard-delete-key"` and `Details = "not-found-or-active"` on the failure branch (`AppendAuditAsync` runs unconditionally at line 167-172). A regression that placed the `AppendAuditAsync` call inside the `if (deleted)` branch would silently drop the audit trail for refused deletes — a real audit-completeness gap. (2) There is no `DeleteAsync_BlankKeyId_ReturnsFailure` or `DeleteAsync_InvalidKeyId_ReturnsFailure` test, even though `ValidateKeyId(keyId)` (line 156-160) guards on the same conditions as Create/Revoke/Rotate. The `Revoke`/`Rotate` paths have equivalent fixtures (the file's earlier tests cover them); only Delete is missing them. + +**Recommendation:** (1) Add a `FakeApiKeyAuditStore` to `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` and assert it contains exactly one entry with `EventType == "dashboard-delete-key"` and `Details == "not-found-or-active"`. (2) Add `DeleteAsync_BlankKeyId_ReturnsFailure` (parameterised over `null`, `""`, `" "`) and `DeleteAsync_InvalidKeyId_ReturnsFailure` (a keyId with characters the `ValidateKeyId` rules reject) to pin the validation branch end-to-end. + +### Tests-031 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs:22-61` | +| Status | Open | + +**Description:** `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` records `startedAt = DateTimeOffset.UtcNow` *before* calling `publisher.StartAsync(...)`, then asserts `secondSubscribeAt - startedAt >= reconnectDelay - 10ms` (line 59). The measured gap is *not* the reconnect delay in isolation — it is `(StartAsync scheduling) + (first WatchSnapshotsAsync setup + Task.Yield) + (throw) + reconnect delay + (second WatchSnapshotsAsync setup)`. On a slow/contended CI agent the first three terms easily dominate (favouring the assertion); but on a fast machine, Windows `Task.Delay(50ms)` rounds up to the next ~15.6 ms tick boundary and may return at ~46-50 ms relative to schedule, while the first three terms can be sub-millisecond — so the gap measurement can land within 1-2 ms of the lower bound, and the 10 ms slack may not absorb a single missed quantum. This is a latent flake of the same flavour as Tests-006 (heartbeat timing) but on a wall-clock dependency the test cannot inject around because `DashboardSnapshotPublisher` uses `Task.Delay(_reconnectDelay)` directly. Tests-006 / Tests-017 moved heartbeat tests onto `ManualTimeProvider`; this test cannot do that without a product change to use a `TimeProvider`-aware delay. + +**Recommendation:** (a) The cheap fix: have `ThrowOnceThenYieldSnapshotService` record `_firstThrowAt = DateTimeOffset.UtcNow` immediately before the `throw`, and change the assertion to `secondSubscribeAt - firstThrowAt >= reconnectDelay - 10ms` — the gap then measures only the reconnect delay, eliminating the variable scheduling baseline. (b) The deeper fix: extend `DashboardSnapshotPublisher` to accept an `ITimeProvider`-style delay seam (or a virtual `DelayAsync` hook) so a `ManualTimeProvider` could advance time deterministically. (a) is preferred for now; (b) belongs as a follow-up if more reconnect-loop tests are added.