diff --git a/code-reviews/Server/findings.md b/code-reviews/Server/findings.md index 52c6b52..2686488 100644 --- a/code-reviews/Server/findings.md +++ b/code-reviews/Server/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.MxGateway.Server` | | Reviewer | Claude Code | | Review date | 2026-05-24 | -| Commit reviewed | `d692232` | -| Status | Reviewed | -| Open findings | 0 | +| Commit reviewed | `42b0037` | +| Status | Re-reviewed | +| Open findings | 7 | ## Checklist coverage @@ -69,6 +69,34 @@ findings (Server-001 through Server-032) are unchanged by this pass. | 9 | Testing coverage | Issues found: Server-037 (no test for the corrupt-snapshot restore path or for `PersistSnapshot = false` at the cache level). | | 10 | Documentation & comments | No issues found — XML docs match behavior; the `GalaxyRepository.md` "On-disk snapshot" section documents the Stale-on-restore lifecycle. | +### 2026-05-24 re-review (commit 42b0037) + +Re-review pass at `42b0037` scoped to the dashboard destructive-action wave on +top of `d692232`. Seven commits in range: `c5e7479`/`0e56b5b` add the admin-only +Close/Kill flow (new `ISessionManager.KillWorkerAsync`, new +`IDashboardSessionAdminService`, the shared `ConfirmDialog.razor`); `24cc5fd` +adds `IApiKeyAdminStore.DeleteAsync` + the dashboard Delete action on revoked +keys; `c5153d6` chains `base.OnInitializedAsync()` on `ApiKeysPage`; `de7639a` +removes the legacy `MapGet("/", ...)` redirect that was colliding with the +Blazor `@page "/"` (a real 500); `42b0037` is the cosmetic `@using` switch on +`GalaxyPage`/`SessionDetailsPage`. Tests added: `DashboardSessionAdminServiceTests`, +`DashboardSnapshotPublisherTests`, `HubTokenServiceTests`, two new +`SessionManagerTests.KillWorkerAsync_*` cases, plus the API-key-management +delete-path tests. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issues found: Server-044 (`SessionManager.KillWorkerAsync` catch-path leaks the `mxgateway.sessions.open` gauge — mirror of Server-006 but for the kill path). | +| 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, `sealed` by default, `Async` suffix, primary constructors used by new types (`DashboardSessionAdminService`); no Blazor UI component libraries pulled in (`ConfirmDialog.razor` uses local Bootstrap classes only); no secrets logged. | +| 3 | Concurrency & thread safety | Issues found: Server-045 (`KillWorkerAsync` reads `session.State` without synchronization; two concurrent kills can both observe `wasClosed=false` and double-increment the `sessions.closed` counter). | +| 4 | Error handling & resilience | Issues found: Server-046 (`SessionManager.ShutdownAsync`'s `KillWorker` fallback doesn't call `_metrics.SessionClosed()` — gauge leaks for every session whose graceful close throws), Server-050 (`DashboardSessionAdminService` only catches `SessionManagerException`; any other exception from `RemoveSessionAsync`/`session.DisposeAsync` propagates raw into Blazor). | +| 5 | Security | No issues found — `DashboardSessionAdminService.CanManage` requires `DashboardRoles.Admin` and the Razor pages gate the Close/Kill buttons on it; audit-log events `dashboard-close-session` / `dashboard-kill-worker` / `dashboard-delete-key` write through the existing `IApiKeyAuditStore` pipeline; `IApiKeyAdminStore.DeleteAsync` SQL guards on `revoked_utc IS NOT NULL`. | +| 6 | Performance & resource management | No issues found in the changed code. | +| 7 | Design-document adherence | No issues found in the changed code. | +| 8 | Code organization & conventions | Issues found: Server-047 (`ApiKeysPage.razor` `ConfirmPendingAsync` clears `PendingAction` before awaiting the action while `SessionsPage`/`SessionDetailsPage` clear it in the `finally`; the three consumers of the new `ConfirmDialog` differ on a small but visible UX point). | +| 9 | Testing coverage | Issues found: Server-048 (no test for `KillWorkerAsync` catch-path metric leak, `wasClosed=true` short-circuit, or concurrent-kill double-count). | +| 10 | Documentation & comments | Issues found: Server-049 (`IDashboardSessionAdminService` interface has no XML docs on any of its three members; `DashboardSessionAdminService` has none either — the `IDashboardApiKeyManagementService` peer carries the same gap but the convention is to document new public surfaces per CLAUDE.md "update docs in the same change as the source"). | + ### 2026-05-24 review (commit d692232) Re-review pass at `d692232` scoped to the dashboard refactor wave: the @@ -778,3 +806,112 @@ Add a regression test that advises N items without an active `StreamEvents` cons **Recommendation:** Add a `` block to `HubTokenService` noting "Registered as a singleton in `AddGatewayDashboard`; the underlying `ITimeLimitedDataProtector` is thread-safe and shared across hub-token issuance and validation." Optionally add a comment near the DI registration explaining the lifetime contract. **Resolution:** 2026-05-24 — Added a `` block to `HubTokenService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs`) documenting that the service is registered as a singleton in `DashboardServiceCollectionExtensions.AddGatewayDashboard` and is shared by two consumer scopes — `DashboardHubConnectionFactory` (scoped, per-circuit; calls `Issue` from the cookie-authenticated dashboard) and `HubTokenAuthenticationHandler` (transient, per-request; calls `Validate` from the SignalR negotiate / connection path). Notes that the underlying `ITimeLimitedDataProtector` is thread-safe so concurrent mint/validate from any number of callers is safe, and explicitly asks future maintainers to preserve the singleton lifetime to keep the protector instance stable. Pure documentation change; no test. + +## Re-review 2026-05-24 (commit 42b0037) + +### Server-044 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-254` | +| Status | Open | + +**Description:** `KillWorkerAsync` is the mirror of `CloseSessionCoreAsync` for the new admin-only Kill flow, but its catch path leaks the `mxgateway.sessions.open` gauge — the exact bug that Server-006 closed for `OpenSessionAsync`. The happy path increments `_metrics.SessionClosed()` once after `session.KillWorker(reason)` returns (line 244), which decrements `_openSessions`. The catch path, however, records `_metrics.Fault(...)`, calls `session.MarkFaulted(...)`, and then awaits `RemoveSessionAsync(session)` — but never calls `_metrics.SessionClosed()` (nor `SessionRemoved()`), so a kill that throws from `session.KillWorker` leaves the open-session gauge permanently incremented. `RemoveSessionAsync` only calls `_metrics.RemoveSessionEvents(...)` and `ReleaseSessionSlot()`; neither touches `_openSessions`. Server-006's fix pattern (track whether the open-counter was recorded, and decrement on the failing path) was applied to `OpenSessionAsync` but not propagated to this new write path. + +In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_workerClient?.Kill(reason)` and `TransitionTo(SessionState.Closed)`; the worker client's `Kill` method on a faulted client or a worker process that has already exited could throw, and the catch path then leaks the gauge. Sustained operator use of the dashboard Kill action on misbehaving workers would gradually inflate `mxgateway.sessions.open` and corrupt the metric exposed by `/health/metrics` and any Grafana panel keying off it. + +**Recommendation:** Mirror Server-006's fix: track whether the session was counted as opened (it always is in `KillWorkerAsync` — `GetRequiredSession` only succeeds for sessions in the registry, all of which had `SessionOpened()` called), and decrement on the failing path. Concretely, add `_metrics.SessionClosed()` (or `_metrics.SessionRemoved()` if the kill is being treated as an unclean removal) inside the catch block before `RemoveSessionAsync(session)`. The cleanest form is to record `SessionClosed()` once at the top of the method (under a flag), then only re-record if the happy path actually transitions; or to add `_metrics.SessionClosed()` in the catch right after `MarkFaulted`. Add a `SessionManagerTests.KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` regression test that uses a `FakeWorkerClient.KillThrows = true` to exercise the catch. + +### Server-045 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:225,242-245`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:837-841` | +| Status | Open | + +**Description:** `KillWorkerAsync` reads `session.State` once into a local `bool wasClosed` (line 225) before calling `session.KillWorker(reason)`. The read is unsynchronized — `State` is a getter that takes `_syncRoot` internally so the read itself is safe, but there is no lock spanning "read state, call KillWorker, conditionally record metric." Two concurrent `KillWorkerAsync` calls on the same session (e.g. one operator clicking Kill on the Sessions page and another clicking Kill on the Session Details page within the same render tick) can both observe `wasClosed = false`, then both call `session.KillWorker(...)` (the second is effectively a no-op because `TransitionTo` refuses to overwrite `Closed`), and both call `_metrics.SessionClosed()` at line 244. The `_openSessions` gauge is bounded at 0 by `GatewayMetrics.SessionClosed`'s `if (_openSessions > 0)` guard, but the `_sessionsClosed` counter (and the `mxgateway.sessions.closed` counter exported by the meter) is double-incremented; `_metrics.Fault` is not used here, so the only mitigation is the SessionsRegistry race — the second call's `GetRequiredSession` could miss if the first already removed the session via `RemoveSessionAsync`, but only if the second arrives after the first's removal completes. The window is small but exists, and the same race exists for "Kill from one tab while the lease-expired sweep is closing the session." `CloseSessionCoreAsync` has the same shape, so this isn't a regression specifically from the kill change — but the new path widens the surface where the issue can fire. + +**Recommendation:** Either (a) gate `KillWorkerAsync` on a per-session lock — extending the `_closeLock` pattern that `GatewaySession.CloseAsync` already uses, or introducing a new `_killLock` and accepting that close + kill don't serialize against each other — or (b) accept the metric double-count as harmless and document it on `KillWorkerAsync`'s XML doc. Option (a) is the more defensible long-term fix; option (b) is acceptable for v1 if the metric is purely informational. Adding a test that issues concurrent kills against the same session id and asserts `_sessionsClosed == 1` would pin the chosen behavior either way. + +### Server-046 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:286-307` | +| Status | Open | + +**Description:** `ShutdownAsync` was updated to fall back to `KillWorker` when `CloseSessionCoreAsync` throws (lines 294-305) — a useful resilience improvement on its own. But the fallback's bookkeeping is wrong: `session.KillWorker(GatewayShutdownReason)` is called and `RemoveSessionAsync(session)` is awaited, but `_metrics.SessionClosed()` is never invoked, so for every session whose graceful close throws, the `mxgateway.sessions.open` gauge stays incremented after shutdown completes. Worse, `CloseSessionCoreAsync`'s `SessionCloseStartedException` catch (line 330) already records `_metrics.SessionRemoved()` (line 334-336) before re-throwing — so for that specific exception type, the gauge is decremented inside the inner catch, then the outer fallback runs and does not double-decrement (good), but `_metrics.SessionClosed()` is never called, so the `_sessionsClosed` counter under-counts by one. For any other exception (the more common case), neither inner catch records anything, so both `_sessionsClosed` and `_openSessions` end up wrong: gauge is left high, counter is left low. + +**Recommendation:** Inside the `ShutdownAsync` fallback (after the `KillWorker` call but before/inside the `RemoveSessionAsync`), call `_metrics.SessionClosed()` unless the inner catch already recorded the close. The simplest shape is to propagate a `wasClosed` flag out of `CloseSessionCoreAsync` (or replace the fallback's manual choreography with a single call into `KillWorkerAsync(...)`, which has the right metric path once Server-044 is fixed). The latter is the cleanest — `ShutdownAsync` becomes "try graceful, fall back to `KillWorkerAsync`," and there's exactly one accounting path for each session. Add a `SessionManagerTests.ShutdownAsync_WhenCloseThrows_StillDecrementsOpenSessionGauge` test using a session whose `CloseAsync` throws (e.g. a `BlockingShutdownWorkerClient` configured to throw on `ShutdownAsync`). + +### Server-047 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:324-334`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionsPage.razor:171-195`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionDetailsPage.razor:231-255` | +| Status | Open | + +**Description:** The shared `ConfirmDialog.razor` (added in `0e56b5b` / `24cc5fd`) is wired by three pages, but the pages handle `PendingAction` cleanup inconsistently: + +- `ApiKeysPage.ConfirmPendingAsync` captures the action, sets `PendingAction = null` synchronously, **then** awaits the action via `RunManagementActionAsync`. The dialog disappears the moment Confirm is clicked, and the user sees no busy indication on the dialog itself (the busy state lives on `RunManagementActionAsync`'s `IsBusy = true`). +- `SessionsPage.ConfirmPendingAsync` and `SessionDetailsPage.ConfirmPendingAsync` keep `PendingAction` set, set `IsBusy = true`, **await** the action, then clear `PendingAction = null` in the `finally`. The dialog stays open during the call and visibly disables its buttons via `IsBusy`. + +The user-visible difference: rotating/revoking/deleting a key vs closing/killing a session uses two different dialog-lifecycle patterns. Neither is broken, but `ConfirmDialog` is the shared component and its `IsBusy` parameter exists precisely to render the in-flight state — `ApiKeysPage` discards that signal by closing the dialog before the action runs. + +**Recommendation:** Align `ApiKeysPage.ConfirmPendingAsync` with the sessions pages: hold `PendingAction`, set `IsBusy = true`, run the action, then clear `PendingAction` in the `finally`. The current ApiKeysPage shape was inherited from before the dialog existed (when the confirmation was a `confirm()` JS call); the dialog component change can flatten the difference now. As a smaller alternative, document the divergence on the component's XML doc — but the shared component should ideally be used consistently. + +### Server-048 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:463-498` | +| Status | Open | + +**Description:** The two new `KillWorkerAsync_*` tests cover the happy path (`KillWorkerAsync_KillsWorkerAndRemovesSession`) and the missing-session error (`KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound`). Three behaviorally distinct cases are missing: + +1. **Catch path** — `session.KillWorker` throws. Today no test exercises the failure branch, so the Server-044 gauge leak ships without coverage. A `FakeWorkerClient.ThrowOnKill = true` (or equivalent) would let `WorkerClient.Kill` throw; the test would assert the session is removed, `_metrics.Fault` is recorded, and the open-session gauge is decremented. +2. **`wasClosed = true` path** — kill on an already-`Closed` session must not re-increment `mxgateway.sessions.closed`. No assertion pins this. +3. **Concurrent kill** — two `KillWorkerAsync` calls for the same session id; Server-045's double-increment lives or dies on whether this is tested. + +**Recommendation:** Add the three tests above. The fakes in `MxGateway.Tests/TestSupport/` already cover most of the moving parts; `FakeWorkerClient` needs a single `ThrowOnKill` flag (or the existing `KillThrowing` if any). + +### Server-049 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Documentation & comments | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs:5-18`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:8-25` | +| Status | Open | + +**Description:** `IDashboardSessionAdminService` declares three members — `CanManage`, `CloseSessionAsync`, `KillWorkerAsync` — none of which carry XML documentation. `DashboardSessionAdminService.CanManage` and the two operation methods are also undocumented (only the constructor parameters are named). The C# style guide requires public-surface XML docs and CLAUDE.md mandates that "docs change with the code." The peer `IDashboardApiKeyManagementService` is also undocumented, so this isn't unique — but the new interface is a fresh public surface being landed in `c5e7479`, and the contract subtleties (CanManage returns false for non-Admin; missing-session paths surface as `Succeeded = false` not as a thrown exception; `KillReason` is fixed at `"dashboard-admin-kill"` and that value reaches the audit log) are exactly what XML docs are for. + +**Recommendation:** Add `` blocks to `IDashboardSessionAdminService.CanManage` (states the Admin-role gate), `CloseSessionAsync` and `KillWorkerAsync` (state that missing sessions return `DashboardSessionAdminResult.Fail(...)` rather than throwing, and that the audit log captures actor + remote IP). Add `` and `` for the request/response shape. The same sweep can pick up the longstanding gap on `IDashboardApiKeyManagementService` if the team wants — but the new file is the load-bearing one. + +### Server-050 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:42-75,92-125` | +| Status | Open | + +**Description:** `CloseSessionAsync` and `KillWorkerAsync` catch only `SessionManagerException` (the `SessionNotFound` filter, then a general `SessionManagerException` catch). Anything else propagates raw to Blazor's error boundary. The propagation paths exist: + +- `SessionManager.CloseSessionAsync` → `CloseSessionCoreAsync` catches `OperationCanceledException` and `SessionCloseStartedException`; any other exception (e.g. an `IOException` from worker pipe teardown surfacing through `session.CloseAsync` → `_workerClient.ShutdownAsync`) propagates raw. +- `SessionManager.KillWorkerAsync` wraps only the `session.KillWorker(reason)` call in a try/catch. Exceptions from `RemoveSessionAsync` → `session.DisposeAsync` (the new `_closeLock.WaitAsync` / dispose choreography from Server-016) — particularly a `TaskCanceledException` if the caller's CancellationToken fires mid-dispose, or an aggregate exception from concurrent disposal — also propagate raw. + +Today neither call site has a Blazor error boundary, so an unhandled exception lands as a generic Blazor circuit error page. The friendlier-error contract that Server-044's commit message advertises ("audit-logs, friendly errors") is incomplete: only `SessionManagerException` gets a friendly error. + +**Recommendation:** Add a general `catch (Exception exception)` after the `SessionManagerException` catch in both `CloseSessionAsync` and `KillWorkerAsync`, log a warning (matching the SessionManagerException pattern), and return `DashboardSessionAdminResult.Fail($"{operation} failed unexpectedly. See the gateway log for details.")`. This makes the result type truly the only output the page sees. Add a regression test using a `ThrowingSessionManager` that throws e.g. `InvalidOperationException` from `KillWorkerAsync` and asserts the service returns a failing result rather than propagating.