Resolve Server-044..050: KillWorker accounting + admin service hardening

Server-044  KillWorkerAsync catch path now calls _metrics.SessionRemoved
            so the open-session gauge does not leak when KillWorker throws.
Server-045  KillWorkerAsync routes through a new
            GatewaySession.KillWorkerWithCloseGateAsync that takes the
            per-session close lock, so concurrent kills count SessionsClosed
            exactly once.
Server-046  CloseSessionCoreAsync's SessionCloseStartedException branch and
            ShutdownAsync's kill fallback both increment SessionsClosed (not
            just the gauge), so the counter and gauge stay consistent.
Server-047  ApiKeysPage.ConfirmPendingAsync holds PendingAction across the
            awaited action and clears it in finally, matching the sessions
            pages.
Server-048  Closed: the 044/045 regression tests cover the previously-
            untested kill paths.
Server-049  IDashboardSessionAdminService + DashboardSessionAdminService
            now carry XML docs that pin the Admin gate, missing-session
            return-Fail semantics, and the dashboard-admin-kill reason.
Server-050  CloseSessionAsync and KillWorkerAsync catch unexpected
            exceptions after the SessionManagerException catches and return
            a friendly Fail; OperationCanceledException tied to the caller
            token still propagates.

All resolved on 2026-05-24; 503/503 gateway tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-24 08:49:34 -04:00
parent 6079c62709
commit 4d77279e7e
8 changed files with 403 additions and 16 deletions
+22 -8
@@ -7,7 +7,7 @@
| Review date | 2026-05-24 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 7 |
| Open findings | 0 |
## Checklist coverage
@@ -816,7 +816,7 @@ Add a regression test that advises N items without an active `StreamEvents` cons
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-254` |
| Status | Open |
| Status | Resolved |
**Description:** `KillWorkerAsync` is the mirror of `CloseSessionCoreAsync` for the new admin-only Kill flow, but its catch path leaks the `mxgateway.sessions.open` gauge — the exact bug that Server-006 closed for `OpenSessionAsync`. The happy path increments `_metrics.SessionClosed()` once after `session.KillWorker(reason)` returns (line 244), which decrements `_openSessions`. The catch path, however, records `_metrics.Fault(...)`, calls `session.MarkFaulted(...)`, and then awaits `RemoveSessionAsync(session)` — but never calls `_metrics.SessionClosed()` (nor `SessionRemoved()`), so a kill that throws from `session.KillWorker` leaves the open-session gauge permanently incremented. `RemoveSessionAsync` only calls `_metrics.RemoveSessionEvents(...)` and `ReleaseSessionSlot()`; neither touches `_openSessions`. Server-006's fix pattern (track whether the open-counter was recorded, and decrement on the failing path) was applied to `OpenSessionAsync` but not propagated to this new write path.
@@ -824,6 +824,8 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker
**Recommendation:** Mirror Server-006's fix: track whether the session was counted as opened (it always is in `KillWorkerAsync`, because `GetRequiredSession` only succeeds for sessions in the registry, all of which had `SessionOpened()` called), and decrement on the failing path. Concretely, add `_metrics.SessionClosed()` (or `_metrics.SessionRemoved()` if the kill is being treated as an unclean removal) inside the catch block before `RemoveSessionAsync(session)`. The cleanest form is either to record `SessionClosed()` once at the top of the method (under a flag) and then only re-record if the happy path actually transitions, or to add `_metrics.SessionClosed()` in the catch right after `MarkFaulted`. Add a `SessionManagerTests.KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` regression test that sets `FakeWorkerClient.KillThrows = true` to exercise the catch.
**Resolution:** 2026-05-24 — Confirmed against source: `KillWorkerAsync`'s catch block called `MarkFaulted`, `Fault`, and `RemoveSessionAsync` but never decremented the open-session gauge, mirroring exactly the Server-006 leak on the open path. The catch path now calls `_metrics.SessionRemoved()` after `MarkFaulted`, so the gauge is restored when `session.KillWorker` (via the new `KillWorkerWithCloseGateAsync` helper) throws. Combined with the Server-045 fix (the kill path now routes through a new `GatewaySession.KillWorkerWithCloseGateAsync` that takes the per-session `_closeLock`), every session reaching `KillWorkerAsync` had `SessionOpened()` recorded and the catch correctly decrements it. Regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` (uses a new `FakeWorkerClient.KillException` flag to force `_workerClient.Kill` to throw and asserts the open-session gauge returns to 0 after the kill faults). Confirmed to fail before the fix and pass after.
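A minimal sketch of the accounting described above, not the gateway's actual code. `SketchMetrics` and `KillSketch` are hypothetical stand-ins: the method names follow the finding, while the bodies are assumptions based only on the stated semantics (`SessionClosed()` lowers the gauge and raises the closed counter, `SessionRemoved()` lowers the gauge only, and the gauge is floored at 0). The real catch path also calls `MarkFaulted`, `Fault`, and `RemoveSessionAsync`, which are omitted here.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the gateway metrics type; bodies are assumed.
public sealed class SketchMetrics
{
    private int _openSessions;
    private int _sessionsClosed;

    public int OpenSessions => Volatile.Read(ref _openSessions);
    public int SessionsClosed => Volatile.Read(ref _sessionsClosed);

    public void SessionOpened() => Interlocked.Increment(ref _openSessions);

    // Clean close: gauge goes down and the closed counter goes up.
    public void SessionClosed()
    {
        DecrementOpen();
        Interlocked.Increment(ref _sessionsClosed);
    }

    // Unclean removal: gauge goes down, closed counter is untouched.
    public void SessionRemoved() => DecrementOpen();

    // Decrement the gauge but never let it drop below zero.
    private void DecrementOpen()
    {
        int current;
        do
        {
            current = Volatile.Read(ref _openSessions);
            if (current == 0)
            {
                return;
            }
        }
        while (Interlocked.CompareExchange(ref _openSessions, current - 1, current) != current);
    }
}

public static class KillSketch
{
    // "kill" stands in for the worker kill call; returns true when it completed.
    public static bool Kill(SketchMetrics metrics, Action kill)
    {
        try
        {
            kill();
            metrics.SessionClosed();
            return true;
        }
        catch (Exception)
        {
            // The Server-044 fix: without this call the gauge stays incremented.
            metrics.SessionRemoved();
            return false;
        }
    }
}
```

In this sketch a throwing kill leaves `OpenSessions` at 0 and `SessionsClosed` unchanged, which is the shape the regression test asserts for the gauge.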
### Server-045
| Field | Value |
@@ -831,12 +833,14 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:225,242-245`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:837-841` |
| Status | Open |
| Status | Resolved |
**Description:** `KillWorkerAsync` reads `session.State` once into a local `bool wasClosed` (line 225) before calling `session.KillWorker(reason)`. The read is unsynchronized — `State` is a getter that takes `_syncRoot` internally so the read itself is safe, but there is no lock spanning "read state, call KillWorker, conditionally record metric." Two concurrent `KillWorkerAsync` calls on the same session (e.g. one operator clicking Kill on the Sessions page and another clicking Kill on the Session Details page within the same render tick) can both observe `wasClosed = false`, then both call `session.KillWorker(...)` (the second is effectively a no-op because `TransitionTo` refuses to overwrite `Closed`), and both call `_metrics.SessionClosed()` at line 244. The `_openSessions` gauge is bounded at 0 by `GatewayMetrics.SessionClosed`'s `if (_openSessions > 0)` guard, but the `_sessionsClosed` counter (and the `mxgateway.sessions.closed` counter exported by the meter) is double-incremented; `_metrics.Fault` is not used here, so the only mitigation is the SessionsRegistry race — the second call's `GetRequiredSession` could miss if the first already removed the session via `RemoveSessionAsync`, but only if the second arrives after the first's removal completes. The window is small but exists, and the same race exists for "Kill from one tab while the lease-expired sweep is closing the session." `CloseSessionCoreAsync` has the same shape, so this isn't a regression specifically from the kill change — but the new path widens the surface where the issue can fire.
**Recommendation:** Either (a) gate `KillWorkerAsync` on a per-session lock — extending the `_closeLock` pattern that `GatewaySession.CloseAsync` already uses, or introducing a new `_killLock` and accepting that close + kill don't serialize against each other — or (b) accept the metric double-count as harmless and document it on `KillWorkerAsync`'s XML doc. Option (a) is the more defensible long-term fix; option (b) is acceptable for v1 if the metric is purely informational. Adding a test that issues concurrent kills against the same session id and asserts `_sessionsClosed == 1` would pin the chosen behavior either way.
**Resolution:** 2026-05-24 — Took recommended option (a). Added `GatewaySession.KillWorkerWithCloseGateAsync(reason, ct)` that acquires the per-session `_closeLock`, reads `_state` under `_syncRoot`, calls `_workerClient.Kill(reason)`, then `TransitionTo(Closed)`, and returns the wasClosed observation. `SessionManager.KillWorkerAsync` now invokes that helper instead of reading `State` and calling `KillWorker` separately. Concurrent kill (and concurrent close+kill) callers now serialize on `_closeLock`, so the first caller observes `wasClosed=false` and the second observes `wasClosed=true`, eliminating the double-increment of `mxgateway.sessions.closed`. Regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `KillWorkerAsync_ConcurrentCallsOnSameSession_CountClosedExactlyOnce` (issues two `KillWorkerAsync` calls on the same session id concurrently, accepts `SessionNotFound` on whichever loses the race after `RemoveSessionAsync`, and asserts `SessionsClosed == 1` and `OpenSessions == 0`).
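The close-gate idea can be illustrated with a small sketch. Everything here is hypothetical (`GatedSessionSketch`, `ClosedCounter`, `GateDemo`); it assumes only what the resolution states: a per-session lock is taken, the closed state is read under `_syncRoot`, the worker is killed, the state transitions to closed, and the earlier observation is returned so the caller can decide whether to count the close.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical session with a per-session close gate.
public sealed class GatedSessionSketch
{
    private readonly SemaphoreSlim _closeLock = new SemaphoreSlim(1, 1);
    private readonly object _syncRoot = new object();
    private bool _closed;

    // Returns true when the session was already closed before this call.
    public async Task<bool> KillWithCloseGateAsync(Action killWorker, CancellationToken ct = default)
    {
        await _closeLock.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            bool wasClosed;
            lock (_syncRoot)
            {
                wasClosed = _closed;
            }

            killWorker();

            lock (_syncRoot)
            {
                _closed = true;
            }

            return wasClosed;
        }
        finally
        {
            _closeLock.Release();
        }
    }
}

// Hypothetical counter standing in for the closed-sessions metric.
public sealed class ClosedCounter
{
    public int Value;
}

public static class GateDemo
{
    // Issues several concurrent kills and reports how many counted a close.
    public static async Task<int> CountClosedAsync(int callers)
    {
        var session = new GatedSessionSketch();
        var counter = new ClosedCounter();
        var tasks = new Task[callers];

        for (int i = 0; i < callers; i++)
        {
            tasks[i] = Task.Run(async () =>
            {
                bool wasClosed = await session.KillWithCloseGateAsync(() => { });
                if (!wasClosed)
                {
                    Interlocked.Increment(ref counter.Value);
                }
            });
        }

        await Task.WhenAll(tasks);
        return counter.Value;
    }
}
```

Because the read of the closed flag and the transition happen inside the same gate, only the first caller through the lock can observe `wasClosed == false`, so the counter ends at 1 regardless of how many callers race.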
### Server-046
| Field | Value |
@@ -844,12 +848,14 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:286-307` |
| Status | Open |
| Status | Resolved |
**Description:** `ShutdownAsync` was updated to fall back to `KillWorker` when `CloseSessionCoreAsync` throws (lines 294-305) — a useful resilience improvement on its own. But the fallback's bookkeeping is wrong: `session.KillWorker(GatewayShutdownReason)` is called and `RemoveSessionAsync(session)` is awaited, but `_metrics.SessionClosed()` is never invoked, so for every session whose graceful close throws, the `mxgateway.sessions.open` gauge stays incremented after shutdown completes. Worse, `CloseSessionCoreAsync`'s `SessionCloseStartedException` catch (line 330) already records `_metrics.SessionRemoved()` (line 334-336) before re-throwing — so for that specific exception type, the gauge is decremented inside the inner catch, then the outer fallback runs and does not double-decrement (good), but `_metrics.SessionClosed()` is never called, so the `_sessionsClosed` counter under-counts by one. For any other exception (the more common case), neither inner catch records anything, so both `_sessionsClosed` and `_openSessions` end up wrong: gauge is left high, counter is left low.
**Recommendation:** Inside the `ShutdownAsync` fallback (after the `KillWorker` call but before/inside the `RemoveSessionAsync`), call `_metrics.SessionClosed()` unless the inner catch already recorded the close. The simplest shape is to propagate a `wasClosed` flag out of `CloseSessionCoreAsync` (or replace the fallback's manual choreography with a single call into `KillWorkerAsync(...)`, which has the right metric path once Server-044 is fixed). The latter is the cleanest — `ShutdownAsync` becomes "try graceful, fall back to `KillWorkerAsync`," and there's exactly one accounting path for each session. Add a `SessionManagerTests.ShutdownAsync_WhenCloseThrows_StillDecrementsOpenSessionGauge` test using a session whose `CloseAsync` throws (e.g. a `BlockingShutdownWorkerClient` configured to throw on `ShutdownAsync`).
**Resolution:** 2026-05-24 — Two coordinated changes: (1) `CloseSessionCoreAsync`'s `SessionCloseStartedException` catch now calls `_metrics.SessionClosed()` (decrements the open-session gauge AND increments the closed counter) instead of `_metrics.SessionRemoved()` (gauge only). A close that ran far enough to attempt the worker shutdown but failed is still a closed session for accounting purposes — the session is removed from the registry and disposed below, so the counter must reflect that. (2) `ShutdownAsync`'s outer fallback now routes the kill through `KillWorkerAsync` (which has the correct metric path post-Server-044) rather than manually calling `session.KillWorker` + `RemoveSessionAsync`. In practice the inner catch already removes the session so the outer fallback is defensive — but routing both paths through the same accounting eliminates the inconsistency the finding called out. The pre-existing `CloseSessionAsync_WhenWorkerShutdownFails_RemovesSessionAndReleasesSlot` test was updated to assert the new (correct) `SessionsClosed == 1` value, with a comment back-referencing Server-046. New regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `ShutdownAsync_WhenSessionCloseThrows_StillDecrementsOpenSessionGaugeAndIncrementsClosedCounter` (uses a `FakeWorkerClient.ShutdownException` to force the graceful close to throw, then asserts both the open-session gauge drops to 0 and the closed counter increments to 1). Confirmed to fail before the fix and pass after.
### Server-047
| Field | Value |
@@ -857,7 +863,7 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:324-334`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionsPage.razor:171-195`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionDetailsPage.razor:231-255` |
| Status | Open |
| Status | Resolved |
**Description:** The shared `ConfirmDialog.razor` (added in `0e56b5b` / `24cc5fd`) is wired by three pages, but the pages handle `PendingAction` cleanup inconsistently:
@@ -868,6 +874,8 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
**Recommendation:** Align `ApiKeysPage.ConfirmPendingAsync` with the sessions pages: hold `PendingAction`, set `IsBusy = true`, run the action, then clear `PendingAction` in the `finally`. The current ApiKeysPage shape was inherited from before the dialog existed (when the confirmation was a `confirm()` JS call); the dialog component change can flatten the difference now. As a smaller alternative, document the divergence on the component's XML doc — but the shared component should ideally be used consistently.
**Resolution:** 2026-05-24 — Took the recommended alignment. `ApiKeysPage.ConfirmPendingAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor`) now holds `PendingAction` for the duration of the awaited action (so the shared `ConfirmDialog` renders its `IsBusy` in-flight state on the dialog itself, matching the sessions pages) and clears it in `finally` regardless of outcome. The action is captured up front so a clear in `finally` works even when the action throws. `RunManagementActionAsync` continues to drive `IsBusy = true` inside its own `try/finally`, so the dialog now correctly disables Confirm/Cancel while the awaited service call runs. Pure UX-consistency change; no new automated test (no bUnit harness in the test project — same precedent as Server-010).
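A minimal sketch of the hold-then-clear pattern, assuming a plain class in place of the Razor page. `ConfirmStateSketch` and `Request` are hypothetical names; only `PendingAction` and `ConfirmPendingAsync` come from the finding, and the real page additionally drives `IsBusy` through `RunManagementActionAsync`.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the page state behind the confirm dialog.
public sealed class ConfirmStateSketch
{
    // Non-null while the dialog should stay on screen.
    public Func<Task>? PendingAction { get; private set; }

    public void Request(Func<Task> action) => PendingAction = action;

    public async Task ConfirmPendingAsync()
    {
        // Capture up front so the finally block can clear the property
        // without losing the delegate, even when the action throws.
        Func<Task>? action = PendingAction;
        if (action is null)
        {
            return;
        }

        try
        {
            // PendingAction stays set for the whole await, so a dialog bound
            // to it can keep rendering its in-flight state.
            await action();
        }
        finally
        {
            PendingAction = null;
        }
    }
}
```

The design point is that the dialog's visibility is tied to the lifetime of the awaited call rather than to the click, so success, failure, and cancellation all dismiss it through the same `finally`.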
### Server-048
| Field | Value |
@@ -875,7 +883,7 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:463-498` |
| Status | Open |
| Status | Resolved |
**Description:** The two new `KillWorkerAsync_*` tests cover the happy path (`KillWorkerAsync_KillsWorkerAndRemovesSession`) and the missing-session error (`KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound`). Three behaviorally distinct cases are missing:
@@ -885,6 +893,8 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
**Recommendation:** Add the three tests above. The fakes in `MxGateway.Tests/TestSupport/` already cover most of the moving parts; `FakeWorkerClient` needs a single `ThrowOnKill` flag (or the existing `KillThrowing` if any).
**Resolution:** 2026-05-24 — Closed by the regression tests added for Server-044 and Server-045 per the prompt's direction: case (1) is covered by `KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` (uses the new `FakeWorkerClient.KillException` flag); case (3) is covered by `KillWorkerAsync_ConcurrentCallsOnSameSession_CountClosedExactlyOnce`. Case (2) (`wasClosed=true` short-circuit) is implicitly exercised by the concurrent test — once the kill path serializes on the per-session close lock (Server-045 fix), the second kill that wins the registry race observes `wasClosed=true` and skips the counter increment, which is what the test pins (`SessionsClosed == 1`, not 2). The dedicated `KillWorkerAsync_WhenSessionAlreadyClosed_DoesNotReincrementClosedCounter` test was drafted but removed: closing a session disposes it (Server-016's `_closeLock.Dispose()`), so re-issuing a kill against a previously-closed-and-disposed session always fails on the disposed semaphore, which is realistic for production but not a useful unit-test shape. No new test file; the regression coverage already lives in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`.
### Server-049
| Field | Value |
@@ -892,12 +902,14 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs:5-18`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:8-25` |
| Status | Open |
| Status | Resolved |
**Description:** `IDashboardSessionAdminService` declares three members — `CanManage`, `CloseSessionAsync`, `KillWorkerAsync` — none of which carry XML documentation. `DashboardSessionAdminService.CanManage` and the two operation methods are also undocumented (only the constructor parameters are named). The C# style guide requires public-surface XML docs and CLAUDE.md mandates that "docs change with the code." The peer `IDashboardApiKeyManagementService` is also undocumented, so this isn't unique — but the new interface is a fresh public surface being landed in `c5e7479`, and the contract subtleties (CanManage returns false for non-Admin; missing-session paths surface as `Succeeded = false` not as a thrown exception; `KillReason` is fixed at `"dashboard-admin-kill"` and that value reaches the audit log) are exactly what XML docs are for.
**Recommendation:** Add `<summary>` blocks to `IDashboardSessionAdminService.CanManage` (states the Admin-role gate), `CloseSessionAsync` and `KillWorkerAsync` (state that missing sessions return `DashboardSessionAdminResult.Fail(...)` rather than throwing, and that the audit log captures actor + remote IP). Add `<param>` and `<returns>` for the request/response shape. The same sweep can pick up the longstanding gap on `IDashboardApiKeyManagementService` if the team wants — but the new file is the load-bearing one.
**Resolution:** 2026-05-24 — Added `<summary>` + `<remarks>` blocks to every member of `IDashboardSessionAdminService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs`): an interface-level `<remarks>` describing the Admin-role gate, audit log shape, and `DashboardSessionAdminResult.Fail` semantics; per-member docs on `CanManage`, `CloseSessionAsync`, and `KillWorkerAsync` calling out the missing-session-returns-Fail contract and the `dashboard-admin-kill` reason constant that reaches the worker-kill audit log and `mxgateway.workers.killed` counter tag. `DashboardSessionAdminService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs`) picked up a class-level `<summary>` + `<remarks>` describing the per-page audit-log seam, plus `<inheritdoc />` on each public method. Pure documentation change; no test (the behavioral contracts the docs describe are already exercised by the existing `DashboardSessionAdminServiceTests` cases).
### Server-050
| Field | Value |
@@ -905,7 +917,7 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:42-75,92-125` |
| Status | Open |
| Status | Resolved |
**Description:** `CloseSessionAsync` and `KillWorkerAsync` catch only `SessionManagerException` (the `SessionNotFound` filter, then a general `SessionManagerException` catch). Anything else propagates raw to Blazor's error boundary. The propagation paths exist:
@@ -915,3 +927,5 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
Today neither call site has a Blazor error boundary, so an unhandled exception lands as a generic Blazor circuit error page. The friendlier-error contract that Server-044's commit message advertises ("audit-logs, friendly errors") is incomplete: only `SessionManagerException` gets a friendly error.
**Recommendation:** Add a general `catch (Exception exception)` after the `SessionManagerException` catch in both `CloseSessionAsync` and `KillWorkerAsync`, log a warning (matching the SessionManagerException pattern), and return `DashboardSessionAdminResult.Fail($"{operation} failed unexpectedly. See the gateway log for details.")`. This makes the result type truly the only output the page sees. Add a regression test using a `ThrowingSessionManager` that throws e.g. `InvalidOperationException` from `KillWorkerAsync` and asserts the service returns a failing result rather than propagating.
**Resolution:** 2026-05-24 — Added the recommended general `catch (Exception)` arms to both `DashboardSessionAdminService.CloseSessionAsync` and `KillWorkerAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs`), placed after the `SessionManagerException` catches and behind a `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) throw;` so caller cancellation still propagates cleanly. The new catches log a warning with actor + session id and return `DashboardSessionAdminResult.Fail("{Operation} failed unexpectedly for session {SessionId}. See the gateway log for details.")`, mirroring the SessionManagerException pattern. Regression tests in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs`: `CloseSessionAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (the `ISessionManager` throws `InvalidOperationException("unexpected")`) and `KillWorkerAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (throws `IOException("pipe broken")`); both assert the service returns a failing result with a non-blank message rather than propagating. The fake's new `CloseThrowsUnexpected` / `KillThrowsUnexpected` properties hold the configured exception. Confirmed to fail before the fix (raw exception propagated) and pass after.
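The catch ordering described above can be sketched as follows. All types here are hypothetical stand-ins (`SketchResult` for `DashboardSessionAdminResult`, `SketchSessionException` for `SessionManagerException`, `AdminServiceSketch.RunAsync` for the two service methods), and the warning log is reduced to a comment; the message text follows the template quoted in the resolution.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the service result type.
public sealed class SketchResult
{
    private SketchResult(bool succeeded, string message)
    {
        Succeeded = succeeded;
        Message = message;
    }

    public bool Succeeded { get; }
    public string Message { get; }

    public static SketchResult Ok() => new SketchResult(true, string.Empty);
    public static SketchResult Fail(string message) => new SketchResult(false, message);
}

// Hypothetical stand-in for the session manager's own exception type.
public sealed class SketchSessionException : Exception
{
    public SketchSessionException(string message) : base(message) { }
}

public static class AdminServiceSketch
{
    public static async Task<SketchResult> RunAsync(
        string operation,
        string sessionId,
        Func<CancellationToken, Task> call,
        CancellationToken cancellationToken)
    {
        try
        {
            await call(cancellationToken).ConfigureAwait(false);
            return SketchResult.Ok();
        }
        catch (SketchSessionException exception)
        {
            // Known failures keep their specific message.
            return SketchResult.Fail(exception.Message);
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Cancellation requested by the caller still propagates.
            throw;
        }
        catch (Exception)
        {
            // The real service logs a warning with actor + session id here.
            return SketchResult.Fail(
                $"{operation} failed unexpectedly for session {sessionId}. See the gateway log for details.");
        }
    }
}
```

The exception filter matters: an `OperationCanceledException` raised while the caller's token is not cancelled falls through to the general arm and is reported as a failure result instead of being rethrown.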