diff --git a/code-reviews/Server/findings.md b/code-reviews/Server/findings.md index 2686488..db1c4a4 100644 --- a/code-reviews/Server/findings.md +++ b/code-reviews/Server/findings.md @@ -7,7 +7,7 @@ | Review date | 2026-05-24 | | Commit reviewed | `42b0037` | | Status | Re-reviewed | -| Open findings | 7 | +| Open findings | 0 | ## Checklist coverage @@ -816,7 +816,7 @@ Add a regression test that advises N items without an active `StreamEvents` cons | Severity | Medium | | Category | Correctness & logic bugs | | Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-254` | -| Status | Open | +| Status | Resolved | **Description:** `KillWorkerAsync` is the mirror of `CloseSessionCoreAsync` for the new admin-only Kill flow, but its catch path leaks the `mxgateway.sessions.open` gauge — the exact bug that Server-006 closed for `OpenSessionAsync`. The happy path increments `_metrics.SessionClosed()` once after `session.KillWorker(reason)` returns (line 244), which decrements `_openSessions`. The catch path, however, records `_metrics.Fault(...)`, calls `session.MarkFaulted(...)`, and then awaits `RemoveSessionAsync(session)` — but never calls `_metrics.SessionClosed()` (nor `SessionRemoved()`), so a kill that throws from `session.KillWorker` leaves the open-session gauge permanently incremented. `RemoveSessionAsync` only calls `_metrics.RemoveSessionEvents(...)` and `ReleaseSessionSlot()`; neither touches `_openSessions`. Server-006's fix pattern (track whether the open-counter was recorded, and decrement on the failing path) was applied to `OpenSessionAsync` but not propagated to this new write path. @@ -824,6 +824,8 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker **Recommendation:** Mirror Server-006's fix: track whether the session was counted as opened (it always is in `KillWorkerAsync` — `GetRequiredSession` only succeeds for sessions in the registry, all of which had `SessionOpened()` called), and decrement on the failing path. Concretely, add `_metrics.SessionClosed()` (or `_metrics.SessionRemoved()` if the kill is being treated as an unclean removal) inside the catch block before `RemoveSessionAsync(session)`. The cleanest form is to record `SessionClosed()` once at the top of the method (under a flag), then only re-record if the happy path actually transitions; or to add `_metrics.SessionClosed()` in the catch right after `MarkFaulted`. Add a `SessionManagerTests.KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` regression test that uses a `FakeWorkerClient.KillThrows = true` to exercise the catch. +**Resolution:** 2026-05-24 — Confirmed against source: `KillWorkerAsync`'s catch block called `MarkFaulted`, `Fault`, and `RemoveSessionAsync` but never decremented the open-session gauge, mirroring exactly the Server-006 leak on the open path. The catch path now calls `_metrics.SessionRemoved()` after `MarkFaulted`, so the gauge is restored when `session.KillWorker` (via the new `KillWorkerWithCloseGateAsync` helper) throws. Combined with the Server-045 fix (the kill path now routes through a new `GatewaySession.KillWorkerWithCloseGateAsync` that takes the per-session `_closeLock`), every session reaching `KillWorkerAsync` had `SessionOpened()` recorded and the catch correctly decrements it. Regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` (uses a new `FakeWorkerClient.KillException` flag to force `_workerClient.Kill` to throw and asserts the open-session gauge returns to 0 after the kill faults). Confirmed to fail before the fix and pass after. + ### Server-045 | Field | Value | @@ -831,12 +833,14 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker | Severity | Low | | Category | Concurrency & thread safety | | Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:225,242-245`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:837-841` | -| Status | Open | +| Status | Resolved | **Description:** `KillWorkerAsync` reads `session.State` once into a local `bool wasClosed` (line 225) before calling `session.KillWorker(reason)`. The read is unsynchronized — `State` is a getter that takes `_syncRoot` internally so the read itself is safe, but there is no lock spanning "read state, call KillWorker, conditionally record metric." Two concurrent `KillWorkerAsync` calls on the same session (e.g. one operator clicking Kill on the Sessions page and another clicking Kill on the Session Details page within the same render tick) can both observe `wasClosed = false`, then both call `session.KillWorker(...)` (the second is effectively a no-op because `TransitionTo` refuses to overwrite `Closed`), and both call `_metrics.SessionClosed()` at line 244. The `_openSessions` gauge is bounded at 0 by `GatewayMetrics.SessionClosed`'s `if (_openSessions > 0)` guard, but the `_sessionsClosed` counter (and the `mxgateway.sessions.closed` counter exported by the meter) is double-incremented; `_metrics.Fault` is not used here, so the only mitigation is the SessionsRegistry race — the second call's `GetRequiredSession` could miss if the first already removed the session via `RemoveSessionAsync`, but only if the second arrives after the first's removal completes. The window is small but exists, and the same race exists for "Kill from one tab while the lease-expired sweep is closing the session." `CloseSessionCoreAsync` has the same shape, so this isn't a regression specifically from the kill change — but the new path widens the surface where the issue can fire. **Recommendation:** Either (a) gate `KillWorkerAsync` on a per-session lock — extending the `_closeLock` pattern that `GatewaySession.CloseAsync` already uses, or introducing a new `_killLock` and accepting that close + kill don't serialize against each other — or (b) accept the metric double-count as harmless and document it on `KillWorkerAsync`'s XML doc. Option (a) is the more defensible long-term fix; option (b) is acceptable for v1 if the metric is purely informational. Adding a test that issues concurrent kills against the same session id and asserts `_sessionsClosed == 1` would pin the chosen behavior either way. +**Resolution:** 2026-05-24 — Took recommended option (a). Added `GatewaySession.KillWorkerWithCloseGateAsync(reason, ct)` that acquires the per-session `_closeLock`, reads `_state` under `_syncRoot`, calls `_workerClient.Kill(reason)`, then `TransitionTo(Closed)`, and returns the wasClosed observation. `SessionManager.KillWorkerAsync` now invokes that helper instead of reading `State` and calling `KillWorker` separately. Concurrent kill (and concurrent close+kill) callers now serialize on `_closeLock`, so the first caller observes `wasClosed=false` and the second observes `wasClosed=true`, eliminating the double-increment of `mxgateway.sessions.closed`. Regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `KillWorkerAsync_ConcurrentCallsOnSameSession_CountClosedExactlyOnce` (issues two `KillWorkerAsync` calls on the same session id concurrently, accepts `SessionNotFound` on whichever loses the race after `RemoveSessionAsync`, and asserts `SessionsClosed == 1` and `OpenSessions == 0`). + ### Server-046 | Field | Value | @@ -844,12 +848,14 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker | Severity | Low | | Category | Error handling & resilience | | Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:286-307` | -| Status | Open | +| Status | Resolved | **Description:** `ShutdownAsync` was updated to fall back to `KillWorker` when `CloseSessionCoreAsync` throws (lines 294-305) — a useful resilience improvement on its own. But the fallback's bookkeeping is wrong: `session.KillWorker(GatewayShutdownReason)` is called and `RemoveSessionAsync(session)` is awaited, but `_metrics.SessionClosed()` is never invoked, so for every session whose graceful close throws, the `mxgateway.sessions.open` gauge stays incremented after shutdown completes. Worse, `CloseSessionCoreAsync`'s `SessionCloseStartedException` catch (line 330) already records `_metrics.SessionRemoved()` (line 334-336) before re-throwing — so for that specific exception type, the gauge is decremented inside the inner catch, then the outer fallback runs and does not double-decrement (good), but `_metrics.SessionClosed()` is never called, so the `_sessionsClosed` counter under-counts by one. For any other exception (the more common case), neither inner catch records anything, so both `_sessionsClosed` and `_openSessions` end up wrong: gauge is left high, counter is left low. **Recommendation:** Inside the `ShutdownAsync` fallback (after the `KillWorker` call but before/inside the `RemoveSessionAsync`), call `_metrics.SessionClosed()` unless the inner catch already recorded the close. The simplest shape is to propagate a `wasClosed` flag out of `CloseSessionCoreAsync` (or replace the fallback's manual choreography with a single call into `KillWorkerAsync(...)`, which has the right metric path once Server-044 is fixed). The latter is the cleanest — `ShutdownAsync` becomes "try graceful, fall back to `KillWorkerAsync`," and there's exactly one accounting path for each session. Add a `SessionManagerTests.ShutdownAsync_WhenCloseThrows_StillDecrementsOpenSessionGauge` test using a session whose `CloseAsync` throws (e.g. a `BlockingShutdownWorkerClient` configured to throw on `ShutdownAsync`). +**Resolution:** 2026-05-24 — Two coordinated changes: (1) `CloseSessionCoreAsync`'s `SessionCloseStartedException` catch now calls `_metrics.SessionClosed()` (decrements the open-session gauge AND increments the closed counter) instead of `_metrics.SessionRemoved()` (gauge only). A close that ran far enough to attempt the worker shutdown but failed is still a closed session for accounting purposes — the session is removed from the registry and disposed below, so the counter must reflect that. (2) `ShutdownAsync`'s outer fallback now routes the kill through `KillWorkerAsync` (which has the correct metric path post-Server-044) rather than manually calling `session.KillWorker` + `RemoveSessionAsync`. In practice the inner catch already removes the session so the outer fallback is defensive — but routing both paths through the same accounting eliminates the inconsistency the finding called out. The pre-existing `CloseSessionAsync_WhenWorkerShutdownFails_RemovesSessionAndReleasesSlot` test was updated to assert the new (correct) `SessionsClosed == 1` value, with a comment back-referencing Server-046. New regression test in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`: `ShutdownAsync_WhenSessionCloseThrows_StillDecrementsOpenSessionGaugeAndIncrementsClosedCounter` (uses a `FakeWorkerClient.ShutdownException` to force the graceful close to throw, then asserts both the open-session gauge drops to 0 and the closed counter increments to 1). Confirmed to fail before the fix and pass after. + ### Server-047 | Field | Value | @@ -857,7 +863,7 @@ In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_worker | Severity | Low | | Category | Code organization & conventions | | Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:324-334`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionsPage.razor:171-195`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionDetailsPage.razor:231-255` | -| Status | Open | +| Status | Resolved | **Description:** The shared `ConfirmDialog.razor` (added in `0e56b5b` / `24cc5fd`) is wired by three pages, but the pages handle `PendingAction` cleanup inconsistently: @@ -868,6 +874,8 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing **Recommendation:** Align `ApiKeysPage.ConfirmPendingAsync` with the sessions pages: hold `PendingAction`, set `IsBusy = true`, run the action, then clear `PendingAction` in the `finally`. The current ApiKeysPage shape was inherited from before the dialog existed (when the confirmation was a `confirm()` JS call); the dialog component change can flatten the difference now. As a smaller alternative, document the divergence on the component's XML doc — but the shared component should ideally be used consistently. +**Resolution:** 2026-05-24 — Took the recommended alignment. `ApiKeysPage.ConfirmPendingAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor`) now holds `PendingAction` for the duration of the awaited action (so the shared `ConfirmDialog` renders its `IsBusy` in-flight state on the dialog itself, matching the sessions pages) and clears it in `finally` regardless of outcome. The action is captured up front so a clear in `finally` works even when the action throws. `RunManagementActionAsync` continues to drive `IsBusy = true` inside its own `try/finally`, so the dialog now correctly disables Confirm/Cancel while the awaited service call runs. Pure UX-consistency change; no new automated test (no bUnit harness in the test project — same precedent as Server-010). + ### Server-048 | Field | Value | @@ -875,7 +883,7 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing | Severity | Low | | Category | Testing coverage | | Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:463-498` | -| Status | Open | +| Status | Resolved | **Description:** The two new `KillWorkerAsync_*` tests cover the happy path (`KillWorkerAsync_KillsWorkerAndRemovesSession`) and the missing-session error (`KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound`). Three behaviorally distinct cases are missing: @@ -885,6 +893,8 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing **Recommendation:** Add the three tests above. The fakes in `MxGateway.Tests/TestSupport/` already cover most of the moving parts; `FakeWorkerClient` needs a single `ThrowOnKill` flag (or the existing `KillThrowing` if any). +**Resolution:** 2026-05-24 — Closed by the regression tests added for Server-044 and Server-045 per the prompt's direction: case (1) is covered by `KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` (uses the new `FakeWorkerClient.KillException` flag); case (3) is covered by `KillWorkerAsync_ConcurrentCallsOnSameSession_CountClosedExactlyOnce`. Case (2) (`wasClosed=true` short-circuit) is implicitly exercised by the concurrent test — once the kill path serializes on the per-session close lock (Server-045 fix), the second kill that wins the registry race observes `wasClosed=true` and skips the counter increment, which is what the test pins (`SessionsClosed == 1`, not 2). The dedicated `KillWorkerAsync_WhenSessionAlreadyClosed_DoesNotReincrementClosedCounter` test was drafted but removed: closing a session disposes it (Server-016's `_closeLock.Dispose()`), so re-issuing a kill against a previously-closed-and-disposed session always fails on the disposed semaphore, which is realistic for production but not a useful unit-test shape. No new test file; the regression coverage already lives in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`. + ### Server-049 | Field | Value | @@ -892,12 +902,14 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing | Severity | Low | | Category | Documentation & comments | | Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs:5-18`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:8-25` | -| Status | Open | +| Status | Resolved | **Description:** `IDashboardSessionAdminService` declares three members — `CanManage`, `CloseSessionAsync`, `KillWorkerAsync` — none of which carry XML documentation. `DashboardSessionAdminService.CanManage` and the two operation methods are also undocumented (only the constructor parameters are named). The C# style guide requires public-surface XML docs and CLAUDE.md mandates that "docs change with the code." The peer `IDashboardApiKeyManagementService` is also undocumented, so this isn't unique — but the new interface is a fresh public surface being landed in `c5e7479`, and the contract subtleties (CanManage returns false for non-Admin; missing-session paths surface as `Succeeded = false` not as a thrown exception; `KillReason` is fixed at `"dashboard-admin-kill"` and that value reaches the audit log) are exactly what XML docs are for. **Recommendation:** Add `` blocks to `IDashboardSessionAdminService.CanManage` (states the Admin-role gate), `CloseSessionAsync` and `KillWorkerAsync` (state that missing sessions return `DashboardSessionAdminResult.Fail(...)` rather than throwing, and that the audit log captures actor + remote IP). Add `` and `` for the request/response shape. The same sweep can pick up the longstanding gap on `IDashboardApiKeyManagementService` if the team wants — but the new file is the load-bearing one. +**Resolution:** 2026-05-24 — Added `` + `` blocks to every member of `IDashboardSessionAdminService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs`): an interface-level `` describing the Admin-role gate, audit log shape, and `DashboardSessionAdminResult.Fail` semantics; per-member docs on `CanManage`, `CloseSessionAsync`, and `KillWorkerAsync` calling out the missing-session-returns-Fail contract and the `dashboard-admin-kill` reason constant that reaches the worker-kill audit log and `mxgateway.workers.killed` counter tag. `DashboardSessionAdminService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs`) picked up a class-level `` + `` describing the per-page audit-log seam, plus `` on each public method. Pure documentation change; no test (the behavioral contracts the docs describe are already exercised by the existing `DashboardSessionAdminServiceTests` cases). + ### Server-050 | Field | Value | @@ -905,7 +917,7 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing | Severity | Low | | Category | Error handling & resilience | | Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:42-75,92-125` | -| Status | Open | +| Status | Resolved | **Description:** `CloseSessionAsync` and `KillWorkerAsync` catch only `SessionManagerException` (the `SessionNotFound` filter, then a general `SessionManagerException` catch). Anything else propagates raw to Blazor's error boundary. The propagation paths exist: @@ -915,3 +927,5 @@ The user-visible difference: rotating/revoking/deleting a key vs closing/killing Today neither call site has a Blazor error boundary, so an unhandled exception lands as a generic Blazor circuit error page. The friendlier-error contract that Server-044's commit message advertises ("audit-logs, friendly errors") is incomplete: only `SessionManagerException` gets a friendly error. **Recommendation:** Add a general `catch (Exception exception)` after the `SessionManagerException` catch in both `CloseSessionAsync` and `KillWorkerAsync`, log a warning (matching the SessionManagerException pattern), and return `DashboardSessionAdminResult.Fail($"{operation} failed unexpectedly. See the gateway log for details.")`. This makes the result type truly the only output the page sees. Add a regression test using a `ThrowingSessionManager` that throws e.g. `InvalidOperationException` from `KillWorkerAsync` and asserts the service returns a failing result rather than propagating. + +**Resolution:** 2026-05-24 — Added the recommended general `catch (Exception)` arms to both `DashboardSessionAdminService.CloseSessionAsync` and `KillWorkerAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs`), placed after the `SessionManagerException` catches and behind a `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) throw;` so caller cancellation still propagates cleanly. The new catches log a warning with actor + session id and return `DashboardSessionAdminResult.Fail("{Operation} failed unexpectedly for session {SessionId}. See the gateway log for details.")`, mirroring the SessionManagerException pattern. Regression tests in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs`: `CloseSessionAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (the `ISessionManager` throws `InvalidOperationException("unexpected")`) and `KillWorkerAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (throws `IOException("pipe broken")`); both assert the service returns a failing result with a non-blank message rather than propagating. The fake's new `CloseThrowsUnexpected` / `KillThrowsUnexpected` properties hold the configured exception. Confirmed to fail before the fix (raw exception propagated) and pass after. diff --git a/src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor b/src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor index b23eff2..901b563 100644 --- a/src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor +++ b/src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor @@ -328,9 +328,18 @@ else return; } + // Server-047: align the pending-action lifecycle with SessionsPage / SessionDetailsPage — + // hold PendingAction while the awaited action runs so the shared ConfirmDialog can render + // its in-flight (IsBusy) state, then clear in finally regardless of outcome. Func> action = PendingAction.Action; - PendingAction = null; - await RunManagementActionAsync(action).ConfigureAwait(false); + try + { + await RunManagementActionAsync(action).ConfigureAwait(false); + } + finally + { + PendingAction = null; + } } private sealed record PendingConfirm( diff --git a/src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs b/src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs index 7fc6c80..8f3bb3d 100644 --- a/src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs +++ b/src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs @@ -5,6 +5,19 @@ using ZB.MOM.WW.MxGateway.Server.Sessions; namespace ZB.MOM.WW.MxGateway.Server.Dashboard; +/// +/// Default implementation of : gates +/// destructive session actions on the role, +/// audit-logs successful operations, and converts +/// (and any other unexpected exceptions) into +/// so the Blazor pages never see a raw exception. +/// +/// +/// The constant dashboard-admin-kill is the reason passed to +/// and forwarded as the +/// reason tag on the mxgateway.workers.killed counter and in +/// the worker-kill audit log entries. +/// public sealed class DashboardSessionAdminService( ISessionManager sessionManager, IHttpContextAccessor httpContextAccessor, @@ -16,6 +29,7 @@ public sealed class DashboardSessionAdminService( private readonly ILogger _logger = logger ?? NullLogger.Instance; + /// public bool CanManage(ClaimsPrincipal user) { ArgumentNullException.ThrowIfNull(user); @@ -24,6 +38,7 @@ public sealed class DashboardSessionAdminService( && user.IsInRole(DashboardRoles.Admin); } + /// public async Task CloseSessionAsync( ClaimsPrincipal user, string sessionId, @@ -72,8 +87,27 @@ public sealed class DashboardSessionAdminService( return DashboardSessionAdminResult.Fail( $"Close failed: {exception.Message}"); } + catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) + { + throw; + } + catch (Exception exception) + { + // Server-050: any non-SessionManagerException (e.g. an IOException or + // InvalidOperationException from the session DisposeAsync / pipe teardown + // path) used to propagate raw into Blazor's error boundary. Convert it to + // a friendly failure so the Razor pages see only DashboardSessionAdminResult. + _logger.LogWarning( + exception, + "Dashboard admin {Actor} close failed unexpectedly for session {SessionId}.", + actor, + sessionId); + return DashboardSessionAdminResult.Fail( + $"Close failed unexpectedly for session {sessionId}. See the gateway log for details."); + } } + /// public async Task KillWorkerAsync( ClaimsPrincipal user, string sessionId, @@ -122,6 +156,26 @@ public sealed class DashboardSessionAdminService( return DashboardSessionAdminResult.Fail( $"Kill failed: {exception.Message}"); } + catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) + { + throw; + } + catch (Exception exception) + { + // Server-050: any non-SessionManagerException (e.g. an IOException from + // worker pipe teardown surfacing through session.DisposeAsync, or an + // InvalidOperationException from a corrupted worker handle) used to + // propagate raw into Blazor's error boundary. Convert it to a friendly + // failure so the page renders the ResultMessage rather than the circuit + // error page. + _logger.LogWarning( + exception, + "Dashboard admin {Actor} kill failed unexpectedly for session {SessionId}.", + actor, + sessionId); + return DashboardSessionAdminResult.Fail( + $"Kill failed unexpectedly for session {sessionId}. See the gateway log for details."); + } } private static string ResolveActor(ClaimsPrincipal user) diff --git a/src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs b/src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs index d313609..2f3a898 100644 --- a/src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs +++ b/src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs @@ -2,15 +2,82 @@ using System.Security.Claims; namespace ZB.MOM.WW.MxGateway.Server.Dashboard; +/// +/// Dashboard surface for the destructive session-management actions — +/// Close (graceful shutdown) and Kill (force-terminate) — exposed by the +/// Admin role. +/// +/// +/// The dashboard binds the destructive Close/Kill UI to this service so +/// the underlying calls flow through +/// a single audited and role-gated entry point. All operations are gated +/// by ; non-Admin callers are rejected with a +/// Succeeded = false result rather than throwing. Missing sessions +/// also surface as so +/// Razor pages can render the message without an error boundary. Each +/// successful call is logged at Information including the acting user +/// (from 's name) and the remote +/// address resolved from . +/// public interface IDashboardSessionAdminService { + /// + /// Returns whether the given principal is authorized to perform + /// destructive session-management actions. + /// + /// + /// Requires + /// to be true and membership in the + /// role. Pages typically gate the + /// Close/Kill buttons on this value at render time so non-Admin + /// viewers never see them. + /// + /// Caller principal. + /// true when the caller may close or kill sessions; otherwise false. bool CanManage(ClaimsPrincipal user); + /// + /// Closes the given session gracefully (worker is given the configured + /// shutdown grace period before being terminated). + /// + /// + /// Returns + /// when the caller is not Admin, when the session id is blank, or when + /// raises a + /// (including the + /// + /// case). Successful closes are audit-logged with the caller name, + /// session id, and remote address. + /// + /// Caller principal. + /// Session identifier to close. + /// Cancellation token. + /// Result describing success/failure and a user-facing message. Task CloseSessionAsync( ClaimsPrincipal user, string sessionId, CancellationToken cancellationToken); + /// + /// Force-terminates the worker process backing the given session without + /// attempting a graceful shutdown. + /// + /// + /// Invoked from the dashboard Kill button. Uses the + /// dashboard-admin-kill reason constant — that string reaches + /// the audit log and the mxgateway.workers.killed metric tag. + /// Returns for + /// non-Admin callers, blank session ids, or any + /// from the underlying + /// manager (the + /// case is recognized and reported as "not found"). Successful kills + /// are audit-logged with the caller name, session id, and remote + /// address. + /// + /// Caller principal. + /// Session identifier to kill. + /// Cancellation token. + /// Result describing success/failure and a user-facing message. Task KillWorkerAsync( ClaimsPrincipal user, string sessionId, diff --git a/src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs b/src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs index 521705c..29a47bd 100644 --- a/src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs +++ b/src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs @@ -840,6 +840,45 @@ public sealed class GatewaySession TransitionTo(SessionState.Closed); } + /// + /// Terminates the worker process immediately while holding the per-session + /// close lock so concurrent close/kill callers serialize. Returns the + /// session state observed at the start of the call so the caller can + /// dedup metric accounting (e.g. only record SessionClosed when + /// the session was not already closed). + /// + /// + /// Mirrors 's use of _closeLock so that + /// a Close in flight from one caller and a Kill from another do not + /// race on the "was the session already closed" observation that + /// drives metric increments (Server-045). + /// + /// Reason for killing the worker. + /// Cancellation token. + /// true if the session was already when the lock was acquired; otherwise false. + public async ValueTask KillWorkerWithCloseGateAsync( + string reason, + CancellationToken cancellationToken) + { + await _closeLock.WaitAsync(cancellationToken).ConfigureAwait(false); + try + { + bool wasClosed; + lock (_syncRoot) + { + wasClosed = _state == SessionState.Closed; + } + + _workerClient?.Kill(reason); + TransitionTo(SessionState.Closed); + return wasClosed; + } + finally + { + _closeLock.Release(); + } + } + /// /// Disposes the session and frees associated resources. /// diff --git a/src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs b/src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs index 68807f2..e7a07bf 100644 --- a/src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs +++ b/src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs @@ -222,16 +222,26 @@ public sealed class SessionManager : ISessionManager cancellationToken.ThrowIfCancellationRequested(); GatewaySession session = GetRequiredSession(sessionId); - bool wasClosed = session.State == SessionState.Closed; + // Serialize concurrent kill/close attempts on this session by routing through the + // per-session close lock (Server-045). Returns whether the session was already in + // Closed state when the lock was acquired so the metric counter is incremented at + // most once across concurrent callers. + bool wasClosed; try { - session.KillWorker(reason); + wasClosed = await session.KillWorkerWithCloseGateAsync(reason, cancellationToken).ConfigureAwait(false); } catch (Exception exception) { session.MarkFaulted(exception.Message); _metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString()); + + // Server-044: the open-session gauge was incremented in OpenSessionAsync; + // every session reaching KillWorkerAsync had SessionOpened recorded. If the + // kill path throws, decrement the gauge here so mxgateway.sessions.open + // does not leak — mirroring the Server-006 fix on OpenSessionAsync. + _metrics.SessionRemoved(); await RemoveSessionAsync(session).ConfigureAwait(false); throw new SessionManagerException( SessionManagerErrorCode.CloseFailed, @@ -297,10 +307,24 @@ public sealed class SessionManager : ISessionManager exception, "Graceful shutdown failed for session {SessionId}; killing worker.", session.SessionId); + + // Defensive fallback: CloseSessionCoreAsync's inner SessionCloseStartedException + // catch normally removes the session and accounts the close (Server-046). The + // outer fallback only fires for sessions still in the registry — route through + // KillWorkerAsync so the bookkeeping is identical to the dashboard kill path. if (_registry.TryGet(session.SessionId, out _)) { - session.KillWorker(GatewayShutdownReason); - await RemoveSessionAsync(session).ConfigureAwait(false); + try + { + await KillWorkerAsync(session.SessionId, GatewayShutdownReason, cancellationToken).ConfigureAwait(false); + } + catch (SessionManagerException killException) + { + _logger.LogWarning( + killException, + "Worker kill fallback failed for session {SessionId}.", + session.SessionId); + } } } } @@ -332,7 +356,12 @@ public sealed class SessionManager : ISessionManager session.MarkFaulted(exception.Message); if (!wasClosed) { - _metrics.SessionRemoved(); + // Server-046: account the close as a SessionClosed (decrements the open-session + // gauge AND increments the sessions.closed counter), not just SessionRemoved. + // The session is being removed from the registry below; treating this as a + // half-finished close that only decremented the gauge under-counted the closed + // counter. + _metrics.SessionClosed(); } _metrics.Fault(SessionManagerErrorCode.CloseFailed.ToString()); diff --git a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs index 74dc162..d7fda0b 100644 --- a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs @@ -115,6 +115,52 @@ public sealed class DashboardSessionAdminServiceTests Assert.True(service.CanManage(CreateUser(DashboardRoles.Admin))); } + /// + /// Regression for Server-050: an unexpected (non-) + /// exception from CloseSessionAsync — e.g. an + /// or surfaced from RemoveSessionAsync/DisposeAsync — + /// must be converted to a friendly + /// rather than propagating raw into Blazor's error boundary. + /// + [Fact] + public async Task CloseSessionAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail() + { + FakeSessionManager sessionManager = new() + { + CloseThrowsUnexpected = new InvalidOperationException("unexpected"), + }; + DashboardSessionAdminService service = CreateService(sessionManager); + + DashboardSessionAdminResult result = await service.CloseSessionAsync( + CreateUser(DashboardRoles.Admin), + "session-1", + CancellationToken.None); + + Assert.False(result.Succeeded); + Assert.False(string.IsNullOrWhiteSpace(result.Message)); + } + + /// + /// Regression for Server-050: same friendly-fail contract for the Kill path. + /// + [Fact] + public async Task KillWorkerAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail() + { + FakeSessionManager sessionManager = new() + { + KillThrowsUnexpected = new IOException("pipe broken"), + }; + DashboardSessionAdminService service = CreateService(sessionManager); + + DashboardSessionAdminResult result = await service.KillWorkerAsync( + CreateUser(DashboardRoles.Admin), + "session-1", + CancellationToken.None); + + Assert.False(result.Succeeded); + Assert.False(string.IsNullOrWhiteSpace(result.Message)); + } + private static DashboardSessionAdminService CreateService(ISessionManager sessionManager) { DefaultHttpContext httpContext = new(); @@ -150,6 +196,10 @@ public sealed class DashboardSessionAdminServiceTests public bool CloseThrowsNotFound { get; init; } + public Exception? CloseThrowsUnexpected { get; init; } + + public Exception? KillThrowsUnexpected { get; init; } + public Task OpenSessionAsync( SessionOpenRequest request, string? clientIdentity, @@ -194,6 +244,11 @@ public sealed class DashboardSessionAdminServiceTests $"Session {sessionId} was not found."); } + if (CloseThrowsUnexpected is not null) + { + throw CloseThrowsUnexpected; + } + return Task.FromResult(new SessionCloseResult(sessionId, SessionState.Closed, AlreadyClosed: false)); } @@ -205,6 +260,11 @@ public sealed class DashboardSessionAdminServiceTests KillCount++; LastKilledSessionId = sessionId; LastKillReason = reason; + if (KillThrowsUnexpected is not null) + { + throw KillThrowsUnexpected; + } + return Task.FromResult(new SessionCloseResult(sessionId, SessionState.Closed, AlreadyClosed: false)); } diff --git a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs index 7fea357..fd4deab 100644 --- a/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs @@ -410,7 +410,10 @@ public sealed class SessionManagerTests Assert.Equal(1, failingWorkerClient.KillCount); Assert.Equal(1, failingWorkerClient.DisposeCount); GatewayMetricsSnapshot snapshot = metrics.GetSnapshot(); - Assert.Equal(0, snapshot.SessionsClosed); + // Server-046: a close-that-failed now accounts as SessionClosed (counter += 1) rather + // than SessionRemoved (gauge -= 1, counter unchanged). The session is being removed + // from the registry on this path, so it must show up in the closed count. + Assert.Equal(1, snapshot.SessionsClosed); Assert.False(snapshot.EventsBySession.ContainsKey(firstSession.SessionId)); Assert.Equal(1, snapshot.OpenSessions); } @@ -495,6 +498,110 @@ public sealed class SessionManagerTests Assert.Equal(SessionManagerErrorCode.SessionNotFound, exception.ErrorCode); } + /// + /// Regression for Server-044: when session.KillWorker throws, the catch path must still + /// decrement mxgateway.sessions.open (parity with the Server-006 fix in + /// OpenSessionAsync). Without the fix the gauge leaks one open session per failed kill. + /// + [Fact] + public async Task KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge() + { + FakeWorkerClient workerClient = new() + { + KillException = new InvalidOperationException("worker kill failed"), + }; + using GatewayMetrics metrics = new(); + SessionManager manager = CreateManager( + new FakeSessionWorkerClientFactory(workerClient), + metrics: metrics); + GatewaySession session = await manager.OpenSessionAsync( + CreateOpenRequest(), + "client-1", + CancellationToken.None); + + Assert.Equal(1, metrics.GetSnapshot().OpenSessions); + + SessionManagerException exception = await Assert.ThrowsAsync( + async () => await manager.KillWorkerAsync(session.SessionId, "test-kill", CancellationToken.None)); + + Assert.Equal(SessionManagerErrorCode.CloseFailed, exception.ErrorCode); + Assert.False(manager.TryGetSession(session.SessionId, out _)); + Assert.Equal(0, metrics.GetSnapshot().OpenSessions); + Assert.True(metrics.GetSnapshot().Faults > 0); + } + + /// + /// Regression for Server-045 / Server-048: concurrent kills on the same session must not + /// double-increment mxgateway.sessions.closed. The first kill wins, the second + /// observes wasClosed == true (or a missing session after removal) and short-circuits. + /// + [Fact] + public async Task KillWorkerAsync_ConcurrentCallsOnSameSession_CountClosedExactlyOnce() + { + FakeWorkerClient workerClient = new(); + using GatewayMetrics metrics = new(); + SessionManager manager = CreateManager( + new FakeSessionWorkerClientFactory(workerClient), + metrics: metrics); + GatewaySession session = await manager.OpenSessionAsync( + CreateOpenRequest(), + "client-1", + CancellationToken.None); + + Task first = manager.KillWorkerAsync(session.SessionId, "kill-a", CancellationToken.None); + Task second = Task.Run(async () => + { + try + { + return await manager.KillWorkerAsync(session.SessionId, "kill-b", CancellationToken.None); + } + catch (SessionManagerException missing) when (missing.ErrorCode == SessionManagerErrorCode.SessionNotFound) + { + return new SessionCloseResult(session.SessionId, SessionState.Closed, AlreadyClosed: true); + } + }); + + await Task.WhenAll(first, second); + + Assert.Equal(1, metrics.GetSnapshot().SessionsClosed); + Assert.Equal(0, metrics.GetSnapshot().OpenSessions); + Assert.False(manager.TryGetSession(session.SessionId, out _)); + } + + /// + /// Regression for Server-046: ShutdownAsync's graceful-close fallback (which calls + /// KillWorker + RemoveSessionAsync when CloseSessionCoreAsync throws) + /// must still account a successful close: both the open-session gauge must drop to zero AND + /// the mxgateway.sessions.closed counter must increment. Without the fix, the + /// graceful-close failure path under-counts the closed counter. + /// + [Fact] + public async Task ShutdownAsync_WhenSessionCloseThrows_StillDecrementsOpenSessionGaugeAndIncrementsClosedCounter() + { + FakeWorkerClient throwingClient = new() + { + ShutdownException = new InvalidOperationException("worker shutdown failed"), + }; + using GatewayMetrics metrics = new(); + SessionManager manager = CreateManager( + new FakeSessionWorkerClientFactory(throwingClient), + metrics: metrics); + GatewaySession session = await manager.OpenSessionAsync( + CreateOpenRequest(), + "client-1", + CancellationToken.None); + + Assert.Equal(1, metrics.GetSnapshot().OpenSessions); + + await manager.ShutdownAsync(CancellationToken.None); + + // After shutdown, regardless of whether the graceful close path or the kill fallback ran, + // the open-session gauge must be zero and the closed counter must be incremented. + Assert.Equal(0, metrics.GetSnapshot().OpenSessions); + Assert.Equal(1, metrics.GetSnapshot().SessionsClosed); + Assert.False(manager.TryGetSession(session.SessionId, out _)); + } + /// Verifies that when worker creation fails, the session is removed from the registry. [Fact] public async Task OpenSessionAsync_WhenWorkerCreationFails_RemovesSessionFromRegistry() @@ -726,6 +833,9 @@ public sealed class SessionManagerTests /// Gets the exception to throw when shutdown is called, if any. public Exception? ShutdownException { get; init; } + /// Gets the exception to throw when kill is called, if any. + public Exception? KillException { get; init; } + /// Gets a value indicating whether to block shutdown on the fake worker client. public bool BlockShutdown { get; init; } @@ -803,6 +913,11 @@ public sealed class SessionManagerTests public void Kill(string reason) { KillCount++; + if (KillException is not null) + { + throw KillException; + } + State = WorkerClientState.Faulted; }