Resolve Server-044..050: KillWorker accounting + admin service hardening

Server-044  KillWorkerAsync catch path now calls _metrics.SessionRemoved
            so the open-session gauge does not leak when KillWorker throws.
Server-045  KillWorkerAsync routes through a new
            GatewaySession.KillWorkerWithCloseGateAsync that takes the
            per-session close lock, so concurrent kills count SessionsClosed
            exactly once.
Server-046  CloseSessionCoreAsync's SessionCloseStartedException branch and
            ShutdownAsync's kill fallback both increment SessionsClosed (not
            just the gauge), so the counter and gauge stay consistent.
Server-047  ApiKeysPage.ConfirmPendingAsync holds PendingAction across the
            awaited action and clears it in finally, matching the sessions
            pages.
Server-048  Closed: the 044/045 regression tests cover the previously-
            untested kill paths.
Server-049  IDashboardSessionAdminService + DashboardSessionAdminService
            now carry XML docs that pin the Admin gate, missing-session
            return-Fail semantics, and the dashboard-admin-kill reason.
Server-050  CloseSessionAsync and KillWorkerAsync catch unexpected
            exceptions after the SessionManagerException catches and return
            a friendly Fail; OperationCanceledException tied to the caller
            token still propagates.

All resolved at 2026-05-24; 503/503 gateway tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-24 08:49:34 -04:00
parent 6079c62709
commit 4d77279e7e
8 changed files with 403 additions and 16 deletions
+22 -8
View File
@@ -7,7 +7,7 @@
| Review date | 2026-05-24 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 7 |
| Open findings | 0 |
## Checklist coverage
@@ -816,7 +816,7 @@ Add a regression test that advises N items without an active `StreamEvents` cons
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-254` |
| Status | Open |
| Status | Resolved |
**Description:** `KillWorkerAsync` is the mirror of `CloseSessionCoreAsync` for the new admin-only Kill flow, but its catch path leaks the `mxgateway.sessions.open` gauge — the exact bug that Server-006 closed for `OpenSessionAsync`. The happy path increments `_metrics.SessionClosed()` once after `session.KillWorker(reason)` returns (line 244), which decrements `_openSessions`. The catch path, however, records `_metrics.Fault(...)`, calls `session.MarkFaulted(...)`, and then awaits `RemoveSessionAsync(session)` — but never calls `_metrics.SessionClosed()` (nor `SessionRemoved()`), so a kill that throws from `session.KillWorker` leaves the open-session gauge permanently incremented. `RemoveSessionAsync` only calls `_metrics.RemoveSessionEvents(...)` and `ReleaseSessionSlot()`; neither touches `_openSessions`. Server-006's fix pattern (track whether the open-counter was recorded, and decrement on the failing path) was applied to `OpenSessionAsync` but not propagated to this new write path.
@@ -824,6 +824,8 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker
**Recommendation:** Mirror Server-006's fix: track whether the session was counted as opened (it always is in `KillWorkerAsync``GetRequiredSession` only succeeds for sessions in the registry, all of which had `SessionOpened()` called), and decrement on the failing path. Concretely, add `_metrics.SessionClosed()` (or `_metrics.SessionRemoved()` if the kill is being treated as an unclean removal) inside the catch block before `RemoveSessionAsync(session)`. The cleanest form is to record `SessionClosed()` once at the top of the method (under a flag), then only re-record if the happy path actually transitions; or to add `_metrics.SessionClosed()` in the catch right after `MarkFaulted`. Add a `SessionManagerTests.KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` regression test that uses a `FakeWorkerClient.KillThrows = true` to exercise the catch.
**Resolution:** 2026-05-24 — Confirmed against source: `KillWorkerAsync`'s catch block called `MarkFaulted`, `Fault`, and `RemoveSessionAsync` but never decremented the open-session gauge, mirroring exactly the Server-006 leak on the open path. The catch path now calls `_metrics.SessionRemoved()` after `MarkFaulted`, so the gauge is restored when `session.KillWorker` (via the new `KillWorkerWithCloseGateAsync` helper) throws. Combined with the Server-045 fix (the kill path now routes through a new `GatewaySession.KillWorkerWithCloseGateAsync` that takes the per-session `_closeLock`), every session reaching `KillWorkerAsync` had `SessionOpened()` recorded and the catch correctly decrements it. Regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` (uses a new `FakeWorkerClient.KillException` flag to force `_workerClient.Kill` to throw and asserts the open-session gauge returns to 0 after the kill faults). Confirmed to fail before the fix and pass after.
### Server-045
| Field | Value |
@@ -831,12 +833,14 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:225,242-245`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:837-841` |
| Status | Open |
| Status | Resolved |
**Description:** `KillWorkerAsync` reads `session.State` once into a local `bool wasClosed` (line 225) before calling `session.KillWorker(reason)`. The read is unsynchronized — `State` is a getter that takes `_syncRoot` internally so the read itself is safe, but there is no lock spanning "read state, call KillWorker, conditionally record metric." Two concurrent `KillWorkerAsync` calls on the same session (e.g. one operator clicking Kill on the Sessions page and another clicking Kill on the Session Details page within the same render tick) can both observe `wasClosed = false`, then both call `session.KillWorker(...)` (the second is effectively a no-op because `TransitionTo` refuses to overwrite `Closed`), and both call `_metrics.SessionClosed()` at line 244. The `_openSessions` gauge is bounded at 0 by `GatewayMetrics.SessionClosed`'s `if (_openSessions > 0)` guard, but the `_sessionsClosed` counter (and the `mxgateway.sessions.closed` counter exported by the meter) is double-incremented; `_metrics.Fault` is not used here, so the only mitigation is the SessionsRegistry race — the second call's `GetRequiredSession` could miss if the first already removed the session via `RemoveSessionAsync`, but only if the second arrives after the first's removal completes. The window is small but exists, and the same race exists for "Kill from one tab while the lease-expired sweep is closing the session." `CloseSessionCoreAsync` has the same shape, so this isn't a regression specifically from the kill change — but the new path widens the surface where the issue can fire.
**Recommendation:** Either (a) gate `KillWorkerAsync` on a per-session lock — extending the `_closeLock` pattern that `GatewaySession.CloseAsync` already uses, or introducing a new `_killLock` and accepting that close + kill don't serialize against each other — or (b) accept the metric double-count as harmless and document it on `KillWorkerAsync`'s XML doc. Option (a) is the more defensible long-term fix; option (b) is acceptable for v1 if the metric is purely informational. Adding a test that issues concurrent kills against the same session id and asserts `_sessionsClosed == 1` would pin the chosen behavior either way.
**Resolution:** 2026-05-24 — Took recommended option (a). Added `GatewaySession.KillWorkerWithCloseGateAsync(reason, ct)` that acquires the per-session `_closeLock`, reads `_state` under `_syncRoot`, calls `_workerClient.Kill(reason)`, then `TransitionTo(Closed)`, and returns the wasClosed observation. `SessionManager.KillWorkerAsync` now invokes that helper instead of reading `State` and calling `KillWorker` separately. Concurrent kill (and concurrent close+kill) callers now serialize on `_closeLock`, so the first caller observes `wasClosed=false` and the second observes `wasClosed=true`, eliminating the double-increment of `mxgateway.sessions.closed`. Regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `KillWorkerAsync_ConcurrentCallsOnSameSession_CountClosedExactlyOnce` (issues two `KillWorkerAsync` calls on the same session id concurrently, accepts `SessionNotFound` on whichever loses the race after `RemoveSessionAsync`, and asserts `SessionsClosed == 1` and `OpenSessions == 0`).
### Server-046
| Field | Value |
@@ -844,12 +848,14 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:286-307` |
| Status | Open |
| Status | Resolved |
**Description:** `ShutdownAsync` was updated to fall back to `KillWorker` when `CloseSessionCoreAsync` throws (lines 294-305) — a useful resilience improvement on its own. But the fallback's bookkeeping is wrong: `session.KillWorker(GatewayShutdownReason)` is called and `RemoveSessionAsync(session)` is awaited, but `_metrics.SessionClosed()` is never invoked, so for every session whose graceful close throws, the `mxgateway.sessions.open` gauge stays incremented after shutdown completes. Worse, `CloseSessionCoreAsync`'s `SessionCloseStartedException` catch (line 330) already records `_metrics.SessionRemoved()` (line 334-336) before re-throwing — so for that specific exception type, the gauge is decremented inside the inner catch, then the outer fallback runs and does not double-decrement (good), but `_metrics.SessionClosed()` is never called, so the `_sessionsClosed` counter under-counts by one. For any other exception (the more common case), neither inner catch records anything, so both `_sessionsClosed` and `_openSessions` end up wrong: gauge is left high, counter is left low.
**Recommendation:** Inside the `ShutdownAsync` fallback (after the `KillWorker` call but before/inside the `RemoveSessionAsync`), call `_metrics.SessionClosed()` unless the inner catch already recorded the close. The simplest shape is to propagate a `wasClosed` flag out of `CloseSessionCoreAsync` (or replace the fallback's manual choreography with a single call into `KillWorkerAsync(...)`, which has the right metric path once Server-044 is fixed). The latter is the cleanest — `ShutdownAsync` becomes "try graceful, fall back to `KillWorkerAsync`," and there's exactly one accounting path for each session. Add a `SessionManagerTests.ShutdownAsync_WhenCloseThrows_StillDecrementsOpenSessionGauge` test using a session whose `CloseAsync` throws (e.g. a `BlockingShutdownWorkerClient` configured to throw on `ShutdownAsync`).
**Resolution:** 2026-05-24 — Two coordinated changes: (1) `CloseSessionCoreAsync`'s `SessionCloseStartedException` catch now calls `_metrics.SessionClosed()` (decrements the open-session gauge AND increments the closed counter) instead of `_metrics.SessionRemoved()` (gauge only). A close that ran far enough to attempt the worker shutdown but failed is still a closed session for accounting purposes — the session is removed from the registry and disposed below, so the counter must reflect that. (2) `ShutdownAsync`'s outer fallback now routes the kill through `KillWorkerAsync` (which has the correct metric path post-Server-044) rather than manually calling `session.KillWorker` + `RemoveSessionAsync`. In practice the inner catch already removes the session so the outer fallback is defensive — but routing both paths through the same accounting eliminates the inconsistency the finding called out. The pre-existing `CloseSessionAsync_WhenWorkerShutdownFails_RemovesSessionAndReleasesSlot` test was updated to assert the new (correct) `SessionsClosed == 1` value, with a comment back-referencing Server-046. New regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `ShutdownAsync_WhenSessionCloseThrows_StillDecrementsOpenSessionGaugeAndIncrementsClosedCounter` (uses a `FakeWorkerClient.ShutdownException` to force the graceful close to throw, then asserts both the open-session gauge drops to 0 and the closed counter increments to 1). Confirmed to fail before the fix and pass after.
### Server-047
| Field | Value |
@@ -857,7 +863,7 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:324-334`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionsPage.razor:171-195`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionDetailsPage.razor:231-255` |
| Status | Open |
| Status | Resolved |
**Description:** The shared `ConfirmDialog.razor` (added in `0e56b5b` / `24cc5fd`) is wired by three pages, but the pages handle `PendingAction` cleanup inconsistently:
@@ -868,6 +874,8 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
**Recommendation:** Align `ApiKeysPage.ConfirmPendingAsync` with the sessions pages: hold `PendingAction`, set `IsBusy = true`, run the action, then clear `PendingAction` in the `finally`. The current ApiKeysPage shape was inherited from before the dialog existed (when the confirmation was a `confirm()` JS call); the dialog component change can flatten the difference now. As a smaller alternative, document the divergence on the component's XML doc — but the shared component should ideally be used consistently.
**Resolution:** 2026-05-24 — Took the recommended alignment. `ApiKeysPage.ConfirmPendingAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor`) now holds `PendingAction` for the duration of the awaited action (so the shared `ConfirmDialog` renders its `IsBusy` in-flight state on the dialog itself, matching the sessions pages) and clears it in `finally` regardless of outcome. The action is captured up front so a clear in `finally` works even when the action throws. `RunManagementActionAsync` continues to drive `IsBusy = true` inside its own `try/finally`, so the dialog now correctly disables Confirm/Cancel while the awaited service call runs. Pure UX-consistency change; no new automated test (no bUnit harness in the test project — same precedent as Server-010).
### Server-048
| Field | Value |
@@ -875,7 +883,7 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:463-498` |
| Status | Open |
| Status | Resolved |
**Description:** The two new `KillWorkerAsync_*` tests cover the happy path (`KillWorkerAsync_KillsWorkerAndRemovesSession`) and the missing-session error (`KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound`). Three behaviorally distinct cases are missing:
@@ -885,6 +893,8 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
**Recommendation:** Add the three tests above. The fakes in `MxGateway.Tests/TestSupport/` already cover most of the moving parts; `FakeWorkerClient` needs a single `ThrowOnKill` flag (or the existing `KillThrowing` if any).
**Resolution:** 2026-05-24 — Closed by the regression tests added for Server-044 and Server-045 per the prompt's direction: case (1) is covered by `KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` (uses the new `FakeWorkerClient.KillException` flag); case (3) is covered by `KillWorkerAsync_ConcurrentCallsOnSameSession_CountClosedExactlyOnce`. Case (2) (`wasClosed=true` short-circuit) is implicitly exercised by the concurrent test — once the kill path serializes on the per-session close lock (Server-045 fix), the second kill that wins the registry race observes `wasClosed=true` and skips the counter increment, which is what the test pins (`SessionsClosed == 1`, not 2). The dedicated `KillWorkerAsync_WhenSessionAlreadyClosed_DoesNotReincrementClosedCounter` test was drafted but removed: closing a session disposes it (Server-016's `_closeLock.Dispose()`), so re-issuing a kill against a previously-closed-and-disposed session always fails on the disposed semaphore, which is realistic for production but not a useful unit-test shape. No new test file; the regression coverage already lives in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`.
### Server-049
| Field | Value |
@@ -892,12 +902,14 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs:5-18`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:8-25` |
| Status | Open |
| Status | Resolved |
**Description:** `IDashboardSessionAdminService` declares three members — `CanManage`, `CloseSessionAsync`, `KillWorkerAsync` — none of which carry XML documentation. `DashboardSessionAdminService.CanManage` and the two operation methods are also undocumented (only the constructor parameters are named). The C# style guide requires public-surface XML docs and CLAUDE.md mandates that "docs change with the code." The peer `IDashboardApiKeyManagementService` is also undocumented, so this isn't unique — but the new interface is a fresh public surface being landed in `c5e7479`, and the contract subtleties (CanManage returns false for non-Admin; missing-session paths surface as `Succeeded = false` not as a thrown exception; `KillReason` is fixed at `"dashboard-admin-kill"` and that value reaches the audit log) are exactly what XML docs are for.
**Recommendation:** Add `<summary>` blocks to `IDashboardSessionAdminService.CanManage` (states the Admin-role gate), `CloseSessionAsync` and `KillWorkerAsync` (state that missing sessions return `DashboardSessionAdminResult.Fail(...)` rather than throwing, and that the audit log captures actor + remote IP). Add `<param>` and `<returns>` for the request/response shape. The same sweep can pick up the longstanding gap on `IDashboardApiKeyManagementService` if the team wants — but the new file is the load-bearing one.
**Resolution:** 2026-05-24 — Added `<summary>` + `<remarks>` blocks to every member of `IDashboardSessionAdminService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs`): an interface-level `<remarks>` describing the Admin-role gate, audit log shape, and `DashboardSessionAdminResult.Fail` semantics; per-member docs on `CanManage`, `CloseSessionAsync`, and `KillWorkerAsync` calling out the missing-session-returns-Fail contract and the `dashboard-admin-kill` reason constant that reaches the worker-kill audit log and `mxgateway.workers.killed` counter tag. `DashboardSessionAdminService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs`) picked up a class-level `<summary>` + `<remarks>` describing the per-page audit-log seam, plus `<inheritdoc />` on each public method. Pure documentation change; no test (the behavioral contracts the docs describe are already exercised by the existing `DashboardSessionAdminServiceTests` cases).
### Server-050
| Field | Value |
@@ -905,7 +917,7 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:42-75,92-125` |
| Status | Open |
| Status | Resolved |
**Description:** `CloseSessionAsync` and `KillWorkerAsync` catch only `SessionManagerException` (the `SessionNotFound` filter, then a general `SessionManagerException` catch). Anything else propagates raw to Blazor's error boundary. The propagation paths exist:
@@ -915,3 +927,5 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing
Today neither call site has a Blazor error boundary, so an unhandled exception lands as a generic Blazor circuit error page. The friendlier-error contract that Server-044's commit message advertises ("audit-logs, friendly errors") is incomplete: only `SessionManagerException` gets a friendly error.
**Recommendation:** Add a general `catch (Exception exception)` after the `SessionManagerException` catch in both `CloseSessionAsync` and `KillWorkerAsync`, log a warning (matching the SessionManagerException pattern), and return `DashboardSessionAdminResult.Fail($"{operation} failed unexpectedly. See the gateway log for details.")`. This makes the result type truly the only output the page sees. Add a regression test using a `ThrowingSessionManager` that throws e.g. `InvalidOperationException` from `KillWorkerAsync` and asserts the service returns a failing result rather than propagating.
**Resolution:** 2026-05-24 — Added the recommended general `catch (Exception)` arms to both `DashboardSessionAdminService.CloseSessionAsync` and `KillWorkerAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs`), placed after the `SessionManagerException` catches and behind a `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) throw;` so caller cancellation still propagates cleanly. The new catches log a warning with actor + session id and return `DashboardSessionAdminResult.Fail("{Operation} failed unexpectedly for session {SessionId}. See the gateway log for details.")`, mirroring the SessionManagerException pattern. Regression tests in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs`: `CloseSessionAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (the `ISessionManager` throws `InvalidOperationException("unexpected")`) and `KillWorkerAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (throws `IOException("pipe broken")`); both assert the service returns a failing result with a non-blank message rather than propagating. The fake's new `CloseThrowsUnexpected` / `KillThrowsUnexpected` properties hold the configured exception. Confirmed to fail before the fix (raw exception propagated) and pass after.
@@ -328,9 +328,18 @@ else
return;
}
// Server-047: align the pending-action lifecycle with SessionsPage / SessionDetailsPage —
// hold PendingAction while the awaited action runs so the shared ConfirmDialog can render
// its in-flight (IsBusy) state, then clear in finally regardless of outcome.
Func<System.Security.Claims.ClaimsPrincipal, Task<DashboardApiKeyManagementResult>> action = PendingAction.Action;
PendingAction = null;
await RunManagementActionAsync(action).ConfigureAwait(false);
try
{
await RunManagementActionAsync(action).ConfigureAwait(false);
}
finally
{
PendingAction = null;
}
}
private sealed record PendingConfirm(
@@ -5,6 +5,19 @@ using ZB.MOM.WW.MxGateway.Server.Sessions;
namespace ZB.MOM.WW.MxGateway.Server.Dashboard;
/// <summary>
/// Default implementation of <see cref="IDashboardSessionAdminService"/>: gates
/// destructive session actions on the <see cref="DashboardRoles.Admin"/> role,
/// audit-logs successful operations, and converts <see cref="SessionManagerException"/>
/// (and any other unexpected exceptions) into <see cref="DashboardSessionAdminResult.Fail(string)"/>
/// so the Blazor pages never see a raw exception.
/// </summary>
/// <remarks>
/// The constant <c>dashboard-admin-kill</c> is the reason passed to
/// <see cref="ISessionManager.KillWorkerAsync"/> and forwarded as the
/// <c>reason</c> tag on the <c>mxgateway.workers.killed</c> counter and in
/// the worker-kill audit log entries.
/// </remarks>
public sealed class DashboardSessionAdminService(
ISessionManager sessionManager,
IHttpContextAccessor httpContextAccessor,
@@ -16,6 +29,7 @@ public sealed class DashboardSessionAdminService(
private readonly ILogger<DashboardSessionAdminService> _logger =
logger ?? NullLogger<DashboardSessionAdminService>.Instance;
/// <inheritdoc />
public bool CanManage(ClaimsPrincipal user)
{
ArgumentNullException.ThrowIfNull(user);
@@ -24,6 +38,7 @@ public sealed class DashboardSessionAdminService(
&& user.IsInRole(DashboardRoles.Admin);
}
/// <inheritdoc />
public async Task<DashboardSessionAdminResult> CloseSessionAsync(
ClaimsPrincipal user,
string sessionId,
@@ -72,8 +87,27 @@ public sealed class DashboardSessionAdminService(
return DashboardSessionAdminResult.Fail(
$"Close failed: {exception.Message}");
}
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
{
throw;
}
catch (Exception exception)
{
// Server-050: any non-SessionManagerException (e.g. an IOException or
// InvalidOperationException from the session DisposeAsync / pipe teardown
// path) used to propagate raw into Blazor's error boundary. Convert it to
// a friendly failure so the Razor pages see only DashboardSessionAdminResult.
_logger.LogWarning(
exception,
"Dashboard admin {Actor} close failed unexpectedly for session {SessionId}.",
actor,
sessionId);
return DashboardSessionAdminResult.Fail(
$"Close failed unexpectedly for session {sessionId}. See the gateway log for details.");
}
}
/// <inheritdoc />
public async Task<DashboardSessionAdminResult> KillWorkerAsync(
ClaimsPrincipal user,
string sessionId,
@@ -122,6 +156,26 @@ public sealed class DashboardSessionAdminService(
return DashboardSessionAdminResult.Fail(
$"Kill failed: {exception.Message}");
}
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
{
throw;
}
catch (Exception exception)
{
// Server-050: any non-SessionManagerException (e.g. an IOException from
// worker pipe teardown surfacing through session.DisposeAsync, or an
// InvalidOperationException from a corrupted worker handle) used to
// propagate raw into Blazor's error boundary. Convert it to a friendly
// failure so the page renders the ResultMessage rather than the circuit
// error page.
_logger.LogWarning(
exception,
"Dashboard admin {Actor} kill failed unexpectedly for session {SessionId}.",
actor,
sessionId);
return DashboardSessionAdminResult.Fail(
$"Kill failed unexpectedly for session {sessionId}. See the gateway log for details.");
}
}
private static string ResolveActor(ClaimsPrincipal user)
@@ -2,15 +2,82 @@ using System.Security.Claims;
namespace ZB.MOM.WW.MxGateway.Server.Dashboard;
/// <summary>
/// Dashboard surface for the destructive session-management actions —
/// Close (graceful shutdown) and Kill (force-terminate) — exposed by the
/// Admin role.
/// </summary>
/// <remarks>
/// The dashboard binds the destructive Close/Kill UI to this service so
/// the underlying <see cref="Sessions.ISessionManager"/> calls flow through
/// a single audited and role-gated entry point. All operations are gated
/// by <see cref="CanManage"/>; non-Admin callers are rejected with a
/// <c>Succeeded = false</c> result rather than throwing. Missing sessions
/// also surface as <see cref="DashboardSessionAdminResult.Fail(string)"/> so
/// Razor pages can render the message without an error boundary. Each
/// successful call is logged at Information including the acting user
/// (from <see cref="ClaimsPrincipal.Identity"/>'s name) and the remote
/// address resolved from <see cref="IHttpContextAccessor"/>.
/// </remarks>
public interface IDashboardSessionAdminService
{
/// <summary>
/// Returns whether the given principal is authorized to perform
/// destructive session-management actions.
/// </summary>
/// <remarks>
/// Requires <see cref="System.Security.Principal.IIdentity.IsAuthenticated"/>
/// to be true and membership in the
/// <see cref="DashboardRoles.Admin"/> role. Pages typically gate the
/// Close/Kill buttons on this value at render time so non-Admin
/// viewers never see them.
/// </remarks>
/// <param name="user">Caller principal.</param>
/// <returns><c>true</c> when the caller may close or kill sessions; otherwise <c>false</c>.</returns>
bool CanManage(ClaimsPrincipal user);
/// <summary>
/// Closes the given session gracefully (worker is given the configured
/// shutdown grace period before being terminated).
/// </summary>
/// <remarks>
/// Returns <see cref="DashboardSessionAdminResult.Fail(string)"/>
/// when the caller is not Admin, when the session id is blank, or when
/// <see cref="Sessions.ISessionManager.CloseSessionAsync"/> raises a
/// <see cref="Sessions.SessionManagerException"/> (including the
/// <see cref="Sessions.SessionManagerErrorCode.SessionNotFound"/>
/// case). Successful closes are audit-logged with the caller name,
/// session id, and remote address.
/// </remarks>
/// <param name="user">Caller principal.</param>
/// <param name="sessionId">Session identifier to close.</param>
/// <param name="cancellationToken">Cancellation token.</param>
/// <returns>Result describing success/failure and a user-facing message.</returns>
Task<DashboardSessionAdminResult> CloseSessionAsync(
ClaimsPrincipal user,
string sessionId,
CancellationToken cancellationToken);
/// <summary>
/// Force-terminates the worker process backing the given session without
/// attempting a graceful shutdown.
/// </summary>
/// <remarks>
/// Invoked from the dashboard Kill button. Uses the
/// <c>dashboard-admin-kill</c> reason constant — that string reaches
/// the audit log and the <c>mxgateway.workers.killed</c> metric tag.
/// Returns <see cref="DashboardSessionAdminResult.Fail(string)"/> for
/// non-Admin callers, blank session ids, or any
/// <see cref="Sessions.SessionManagerException"/> from the underlying
/// manager (the <see cref="Sessions.SessionManagerErrorCode.SessionNotFound"/>
/// case is recognized and reported as "not found"). Successful kills
/// are audit-logged with the caller name, session id, and remote
/// address.
/// </remarks>
/// <param name="user">Caller principal.</param>
/// <param name="sessionId">Session identifier to kill.</param>
/// <param name="cancellationToken">Cancellation token.</param>
/// <returns>Result describing success/failure and a user-facing message.</returns>
Task<DashboardSessionAdminResult> KillWorkerAsync(
ClaimsPrincipal user,
string sessionId,
@@ -840,6 +840,45 @@ public sealed class GatewaySession
TransitionTo(SessionState.Closed);
}
/// <summary>
/// Terminates the worker process immediately while holding the per-session
/// close lock so concurrent close/kill callers serialize. Returns the
/// session state observed at the start of the call so the caller can
/// dedup metric accounting (e.g. only record <c>SessionClosed</c> when
/// the session was not already closed).
/// </summary>
/// <remarks>
/// Mirrors <see cref="CloseAsync"/>'s use of <c>_closeLock</c> so that
/// a Close in flight from one caller and a Kill from another do not
/// race on the "was the session already closed" observation that
/// drives metric increments (Server-045).
/// </remarks>
/// <param name="reason">Reason for killing the worker.</param>
/// <param name="cancellationToken">Cancellation token.</param>
/// <returns><c>true</c> if the session was already <see cref="SessionState.Closed"/> when the lock was acquired; otherwise <c>false</c>.</returns>
public async ValueTask<bool> KillWorkerWithCloseGateAsync(
string reason,
CancellationToken cancellationToken)
{
await _closeLock.WaitAsync(cancellationToken).ConfigureAwait(false);
try
{
bool wasClosed;
lock (_syncRoot)
{
wasClosed = _state == SessionState.Closed;
}
_workerClient?.Kill(reason);
TransitionTo(SessionState.Closed);
return wasClosed;
}
finally
{
_closeLock.Release();
}
}
/// <summary>
/// Disposes the session and frees associated resources.
/// </summary>
@@ -222,16 +222,26 @@ public sealed class SessionManager : ISessionManager
cancellationToken.ThrowIfCancellationRequested();
GatewaySession session = GetRequiredSession(sessionId);
bool wasClosed = session.State == SessionState.Closed;
// Serialize concurrent kill/close attempts on this session by routing through the
// per-session close lock (Server-045). Returns whether the session was already in
// Closed state when the lock was acquired so the metric counter is incremented at
// most once across concurrent callers.
bool wasClosed;
try
{
session.KillWorker(reason);
wasClosed = await session.KillWorkerWithCloseGateAsync(reason, cancellationToken).ConfigureAwait(false);
}
catch (Exception exception)
{
session.MarkFaulted(exception.Message);
_metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString());
// Server-044: the open-session gauge was incremented in OpenSessionAsync;
// every session reaching KillWorkerAsync had SessionOpened recorded. If the
// kill path throws, decrement the gauge here so mxgateway.sessions.open
// does not leak — mirroring the Server-006 fix on OpenSessionAsync.
_metrics.SessionRemoved();
await RemoveSessionAsync(session).ConfigureAwait(false);
throw new SessionManagerException(
SessionManagerErrorCode.CloseFailed,
@@ -297,10 +307,24 @@ public sealed class SessionManager : ISessionManager
exception,
"Graceful shutdown failed for session {SessionId}; killing worker.",
session.SessionId);
// Defensive fallback: CloseSessionCoreAsync's inner SessionCloseStartedException
// catch normally removes the session and accounts the close (Server-046). The
// outer fallback only fires for sessions still in the registry — route through
// KillWorkerAsync so the bookkeeping is identical to the dashboard kill path.
if (_registry.TryGet(session.SessionId, out _))
{
session.KillWorker(GatewayShutdownReason);
await RemoveSessionAsync(session).ConfigureAwait(false);
try
{
await KillWorkerAsync(session.SessionId, GatewayShutdownReason, cancellationToken).ConfigureAwait(false);
}
catch (SessionManagerException killException)
{
_logger.LogWarning(
killException,
"Worker kill fallback failed for session {SessionId}.",
session.SessionId);
}
}
}
}
@@ -332,7 +356,12 @@ public sealed class SessionManager : ISessionManager
session.MarkFaulted(exception.Message);
if (!wasClosed)
{
_metrics.SessionRemoved();
// Server-046: account the close as a SessionClosed (decrements the open-session
// gauge AND increments the sessions.closed counter), not just SessionRemoved.
// The session is being removed from the registry below; treating this as a
// half-finished close that only decremented the gauge under-counted the closed
// counter.
_metrics.SessionClosed();
}
_metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString());
@@ -115,6 +115,52 @@ public sealed class DashboardSessionAdminServiceTests
Assert.True(service.CanManage(CreateUser(DashboardRoles.Admin)));
}
/// <summary>
/// Regression for Server-050: an unexpected (non-<see cref="SessionManagerException"/>)
/// exception from <c>CloseSessionAsync</c> — e.g. an <see cref="InvalidOperationException"/>
/// or <see cref="IOException"/> surfaced from <c>RemoveSessionAsync</c>/<c>DisposeAsync</c> —
/// must be converted to a friendly <see cref="DashboardSessionAdminResult.Fail(string)"/>
/// rather than propagating raw into Blazor's error boundary.
/// </summary>
[Fact]
public async Task CloseSessionAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail()
{
FakeSessionManager sessionManager = new()
{
CloseThrowsUnexpected = new InvalidOperationException("unexpected"),
};
DashboardSessionAdminService service = CreateService(sessionManager);
DashboardSessionAdminResult result = await service.CloseSessionAsync(
CreateUser(DashboardRoles.Admin),
"session-1",
CancellationToken.None);
Assert.False(result.Succeeded);
Assert.False(string.IsNullOrWhiteSpace(result.Message));
}
/// <summary>
/// Regression for Server-050: same friendly-fail contract for the Kill path.
/// </summary>
[Fact]
public async Task KillWorkerAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail()
{
FakeSessionManager sessionManager = new()
{
KillThrowsUnexpected = new IOException("pipe broken"),
};
DashboardSessionAdminService service = CreateService(sessionManager);
DashboardSessionAdminResult result = await service.KillWorkerAsync(
CreateUser(DashboardRoles.Admin),
"session-1",
CancellationToken.None);
Assert.False(result.Succeeded);
Assert.False(string.IsNullOrWhiteSpace(result.Message));
}
private static DashboardSessionAdminService CreateService(ISessionManager sessionManager)
{
DefaultHttpContext httpContext = new();
@@ -150,6 +196,10 @@ public sealed class DashboardSessionAdminServiceTests
public bool CloseThrowsNotFound { get; init; }
public Exception? CloseThrowsUnexpected { get; init; }
public Exception? KillThrowsUnexpected { get; init; }
public Task<GatewaySession> OpenSessionAsync(
SessionOpenRequest request,
string? clientIdentity,
@@ -194,6 +244,11 @@ public sealed class DashboardSessionAdminServiceTests
$"Session {sessionId} was not found.");
}
if (CloseThrowsUnexpected is not null)
{
throw CloseThrowsUnexpected;
}
return Task.FromResult(new SessionCloseResult(sessionId, SessionState.Closed, AlreadyClosed: false));
}
@@ -205,6 +260,11 @@ public sealed class DashboardSessionAdminServiceTests
KillCount++;
LastKilledSessionId = sessionId;
LastKillReason = reason;
if (KillThrowsUnexpected is not null)
{
throw KillThrowsUnexpected;
}
return Task.FromResult(new SessionCloseResult(sessionId, SessionState.Closed, AlreadyClosed: false));
}
@@ -410,7 +410,10 @@ public sealed class SessionManagerTests
Assert.Equal(1, failingWorkerClient.KillCount);
Assert.Equal(1, failingWorkerClient.DisposeCount);
GatewayMetricsSnapshot snapshot = metrics.GetSnapshot();
Assert.Equal(0, snapshot.SessionsClosed);
// Server-046: a close-that-failed now accounts as SessionClosed (counter += 1) rather
// than SessionRemoved (gauge -= 1, counter unchanged). The session is being removed
// from the registry on this path, so it must show up in the closed count.
Assert.Equal(1, snapshot.SessionsClosed);
Assert.False(snapshot.EventsBySession.ContainsKey(firstSession.SessionId));
Assert.Equal(1, snapshot.OpenSessions);
}
@@ -495,6 +498,110 @@ public sealed class SessionManagerTests
Assert.Equal(SessionManagerErrorCode.SessionNotFound, exception.ErrorCode);
}
/// <summary>
/// Regression for Server-044: when <c>session.KillWorker</c> throws, the catch path must still
/// decrement <c>mxgateway.sessions.open</c> (parity with the Server-006 fix in
/// <c>OpenSessionAsync</c>). Without the fix the gauge leaks one open session per failed kill.
/// </summary>
[Fact]
public async Task KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge()
{
FakeWorkerClient workerClient = new()
{
KillException = new InvalidOperationException("worker kill failed"),
};
using GatewayMetrics metrics = new();
SessionManager manager = CreateManager(
new FakeSessionWorkerClientFactory(workerClient),
metrics: metrics);
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-1",
CancellationToken.None);
Assert.Equal(1, metrics.GetSnapshot().OpenSessions);
SessionManagerException exception = await Assert.ThrowsAsync<SessionManagerException>(
async () => await manager.KillWorkerAsync(session.SessionId, "test-kill", CancellationToken.None));
Assert.Equal(SessionManagerErrorCode.CloseFailed, exception.ErrorCode);
Assert.False(manager.TryGetSession(session.SessionId, out _));
Assert.Equal(0, metrics.GetSnapshot().OpenSessions);
Assert.True(metrics.GetSnapshot().Faults > 0);
}
/// <summary>
/// Regression for Server-045 / Server-048: concurrent kills on the same session must not
/// double-increment <c>mxgateway.sessions.closed</c>. The first kill wins, the second
/// observes <c>wasClosed == true</c> (or a missing session after removal) and short-circuits.
/// </summary>
[Fact]
public async Task KillWorkerAsync_ConcurrentCallsOnSameSession_CountClosedExactlyOnce()
{
FakeWorkerClient workerClient = new();
using GatewayMetrics metrics = new();
SessionManager manager = CreateManager(
new FakeSessionWorkerClientFactory(workerClient),
metrics: metrics);
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-1",
CancellationToken.None);
Task<SessionCloseResult> first = manager.KillWorkerAsync(session.SessionId, "kill-a", CancellationToken.None);
Task<SessionCloseResult> second = Task.Run(async () =>
{
try
{
return await manager.KillWorkerAsync(session.SessionId, "kill-b", CancellationToken.None);
}
catch (SessionManagerException missing) when (missing.ErrorCode == SessionManagerErrorCode.SessionNotFound)
{
return new SessionCloseResult(session.SessionId, SessionState.Closed, AlreadyClosed: true);
}
});
await Task.WhenAll(first, second);
Assert.Equal(1, metrics.GetSnapshot().SessionsClosed);
Assert.Equal(0, metrics.GetSnapshot().OpenSessions);
Assert.False(manager.TryGetSession(session.SessionId, out _));
}
/// <summary>
/// Regression for Server-046: <c>ShutdownAsync</c>'s graceful-close fallback (which calls
/// <c>KillWorker</c> + <c>RemoveSessionAsync</c> when <c>CloseSessionCoreAsync</c> throws)
/// must still account a successful close: both the open-session gauge must drop to zero AND
/// the <c>mxgateway.sessions.closed</c> counter must increment. Without the fix, the
/// graceful-close failure path under-counts the closed counter.
/// </summary>
[Fact]
public async Task ShutdownAsync_WhenSessionCloseThrows_StillDecrementsOpenSessionGaugeAndIncrementsClosedCounter()
{
FakeWorkerClient throwingClient = new()
{
ShutdownException = new InvalidOperationException("worker shutdown failed"),
};
using GatewayMetrics metrics = new();
SessionManager manager = CreateManager(
new FakeSessionWorkerClientFactory(throwingClient),
metrics: metrics);
GatewaySession session = await manager.OpenSessionAsync(
CreateOpenRequest(),
"client-1",
CancellationToken.None);
Assert.Equal(1, metrics.GetSnapshot().OpenSessions);
await manager.ShutdownAsync(CancellationToken.None);
// After shutdown, regardless of whether the graceful close path or the kill fallback ran,
// the open-session gauge must be zero and the closed counter must be incremented.
Assert.Equal(0, metrics.GetSnapshot().OpenSessions);
Assert.Equal(1, metrics.GetSnapshot().SessionsClosed);
Assert.False(manager.TryGetSession(session.SessionId, out _));
}
/// <summary>Verifies that when worker creation fails, the session is removed from the registry.</summary>
[Fact]
public async Task OpenSessionAsync_WhenWorkerCreationFails_RemovesSessionFromRegistry()
@@ -726,6 +833,9 @@ public sealed class SessionManagerTests
/// <summary>Gets the exception to throw when shutdown is called, if any.</summary>
public Exception? ShutdownException { get; init; }
/// <summary>Gets the exception to throw when kill is called, if any.</summary>
public Exception? KillException { get; init; }
/// <summary>Gets a value indicating whether to block shutdown on the fake worker client.</summary>
public bool BlockShutdown { get; init; }
@@ -803,6 +913,11 @@ public sealed class SessionManagerTests
public void Kill(string reason)
{
KillCount++;
if (KillException is not null)
{
throw KillException;
}
State = WorkerClientState.Faulted;
}