File and fix Server-030 and Client.Dotnet-017 from e2e surfacing
Both findings surfaced when running the cross-language e2e matrix (scripts/run-client-e2e-tests.ps1) against the redeployed gateway at commit 84d36b7. Filed in code-reviews/Server/findings.md and code-reviews/Client.Dotnet/findings.md and fixed in the same change.

Server-030 (Medium / Error handling): GatewaySession.GetReadyWorkerClient gated on `_state == Ready && _workerClient.State == Ready` but only formatted `_state` into the SessionManagerException message. Under load the gateway-driven `_state` and the worker-driven `WorkerClient.State` can diverge, producing a self-contradictory diagnostic ("Session ... is not ready. Current state is Ready."). The Java e2e client hit this on the 56th item after 55 successful add-items. Rewrote the message to include both states ("Session state is X; worker state is Y"), added an XML doc explaining the two-state contract and that this branch is the fail-fast for a divergence race, and added regression test SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates that pins that both states appear in the message. The deeper race (should the gateway briefly wait for worker-Ready before failing?) remains open as a follow-up.

Client.Dotnet-017 (Low / Error handling): stream-events CLI threw OperationCanceledException as an unhandled exception when the user's --timeout expired before --max-events was reached. Exit code -532462766, no aggregate JSON. The other client CLIs (Go, Rust, Python, Java) exit 0 in this case. Wrapped the `await foreach` in a `try` with `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)` so the supplied token's cancellation (--timeout, Ctrl+C, or parent CTS) becomes graceful completion; the aggregate `{ "events": [...] }` JSON is still emitted after the catch.
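A minimal sketch of the Client.Dotnet-017 pattern, not the actual CLI code: the exception filter turns cancellation of the caller's own token into normal completion, and the aggregate JSON is written afterwards either way. `StreamEventsSketch`, `HangAfterEvents`, and the string event type are hypothetical stand-ins; the real CLI and FakeCliClient.StreamHangAfterEvents are assumed to differ in shape.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class StreamEventsSketch
{
    // Hypothetical event source: yields `count` events, then waits
    // until the cancellation token fires (like a stream that goes quiet).
    public static async IAsyncEnumerable<string> HangAfterEvents(
        int count,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        for (var i = 0; i < count; i++)
        {
            yield return $"event-{i}";
        }

        // Throws TaskCanceledException (an OperationCanceledException) on cancel.
        await Task.Delay(Timeout.Infinite, cancellationToken);
    }

    // Collects up to maxEvents events and always returns the aggregate JSON.
    public static async Task<(int ExitCode, string Json)> RunAsync(
        IAsyncEnumerable<string> source,
        int maxEvents,
        CancellationToken cancellationToken)
    {
        var events = new List<string>();
        try
        {
            await foreach (var item in source.WithCancellation(cancellationToken))
            {
                events.Add(item);
                if (events.Count >= maxEvents)
                {
                    break;
                }
            }
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // The supplied token was cancelled (timeout, Ctrl+C, parent CTS):
            // treat it as graceful completion and fall through to the output.
        }

        // Runs whether the loop ended by max-events or by cancellation.
        var json = JsonSerializer.Serialize(new { events });
        return (0, json);
    }
}
```

The `when` filter matters here: an OperationCanceledException raised while the supplied token is not cancelled (for example, from an unrelated internal token) does not match the filter and still propagates as an error.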
Added regression test RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero backed by a new FakeCliClient.StreamHangAfterEvents hook that yields the configured events then parks on the cancellation token.

Side cleanup: the GatewayApplicationTests test added under Server-020 was asserting an invariant (`/dashboard/dashboard/X` doesn't exist) that I broke by reverting Server-020 in 84d36b7. The doubled endpoint shapes do exist now (MapGroup("/dashboard") prefixing an already "/dashboard/X" @page directive) but they're harmless because no client requests `/dashboard/dashboard/X`. Replaced the test with a positive assertion (`/dashboard/X` routes ARE registered) and rewrote the XML doc to record the actual contract.

Verified: dotnet test src/MxGateway.Tests passes 480/480, dotnet test clients/dotnet/MxGateway.Client.Tests passes 77/77, gateway redeployed at this commit and GET http://localhost:5130/dashboard returns 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -489,3 +489,18 @@ Re-review pass at `a020350` — the cross-module sweep that resolved Server-015
**Recommendation:** Either (a) extend the advertised list with `bulk-read-commands` and `bulk-write-commands` (`WriteBulk` / `Write2Bulk` / `WriteSecuredBulk` / `WriteSecured2Bulk` collectively), or (b) document in `gateway.md` and `docs/Contracts.md` that `Capabilities` is informational only and not the contract version. Option (a) is the simplest forward-compatible fix and keeps the capability token shape clients are already familiar with.

**Resolution:** 2026-05-20 - Extended the `OpenSession` capabilities list with `bulk-read-commands` and `bulk-write-commands` alongside the existing `bulk-subscribe-commands` token, so clients that gate on capability strings have an explicit signal for the bulk-read and bulk-write families.
### Server-030

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/MxGateway.Server/Sessions/GatewaySession.cs:952-980` |
| Status | Resolved |

**Description:** Surfaced during the 2026-05-20 cross-language e2e run against a redeployed gateway (`a020350`). The Java client got 55 of 120 `AddItem` calls in, then `Advise` returned `Session session-de7728a290bd41028ad6fec81e233144 is not ready. Current state is Ready.`, a self-contradictory diagnostic. The check in `GetReadyWorkerClient` (`GatewaySession.cs:956`) is `_state != SessionState.Ready || _workerClient?.State != WorkerClientState.Ready`, but the formatted message only includes `_state`. When the gateway-side session state is `Ready` but the worker client's own `WorkerClientState` has transitioned (heartbeat watchdog firing, pipe disconnect detected by the read loop, etc.) before the session-level reaction observes it, the in-flight RPC fails fast here, and the operator sees a message that doesn't say which side of the gate failed. The two-state gap itself is a real race (the worker-side state can shift independently of the gateway-driven session state), but a clear diagnostic is the prerequisite for diagnosing it; without one, a future investigation will start from "it says Ready but it's not Ready" instead of "the worker is Handshaking / Closing / Faulted while the session is still Ready".

**Recommendation:** Format both states into the exception message: `Session {SessionId} is not ready. Session state is {_state}; worker state is {workerClientState}.` (or `"<no worker>"` when `_workerClient` is null). Document on the method that the two states can diverge under load and that this branch is the fail-fast for that case. Add a regression test that flips `FakeWorkerClient.State` to a non-Ready value (e.g. `Handshaking`) while the session is `Ready` and asserts that both pieces of state appear in the thrown `SessionManagerException.Message`. The deeper race investigation (should the gateway briefly wait for worker-Ready before failing? when does `WorkerClient.State` legitimately shift while the session is still `Ready`?) is out of scope for this finding but is worth a follow-up.

**Resolution:** 2026-05-20 - Rewrote `GetReadyWorkerClient` so the `SessionManagerException` message includes both `_state` and `_workerClient.State` (or `"<no worker>"` for the null case): `"Session {SessionId} is not ready. Session state is {_state}; worker state is {workerState}."`. Added XML doc on the method explaining the two-state contract and that this branch is the fail-fast for a state-divergence race. Added regression test `SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates` that sets `FakeWorkerClient.State = WorkerClientState.Handshaking` while the session is `Ready` and asserts that both `"Session state is Ready"` and `"worker state is Handshaking"` appear in the message; the test also pins `InvokeCount == 0` so the worker isn't called. The deeper race (should `GetReadyWorkerClient` retry briefly when state has just diverged?) remains open for follow-up.
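The two-state gate and its diagnostic can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `GatewaySession.cs`: `ReadyGateSketch.EnsureReady` is a hypothetical helper, the enum member lists are guesses beyond the values named above, and the real method returns the worker client rather than `void`.

```csharp
using System;

// Assumed enum shapes; only Ready, Handshaking, Closing and Faulted
// are named in the finding, the rest is illustrative.
public enum SessionState { Creating, Ready, Closing, Faulted }
public enum WorkerClientState { Handshaking, Ready, Closing, Faulted }

public sealed class SessionManagerException : Exception
{
    public SessionManagerException(string message) : base(message) { }
}

public static class ReadyGateSketch
{
    // Fails fast unless both the session and the worker report Ready.
    // A null workerState stands for "no worker client attached".
    public static void EnsureReady(
        string sessionId,
        SessionState sessionState,
        WorkerClientState? workerState)
    {
        if (sessionState == SessionState.Ready && workerState == WorkerClientState.Ready)
        {
            return;
        }

        // Report both sides of the gate so a divergence is visible in the message.
        var workerText = workerState?.ToString() ?? "<no worker>";
        throw new SessionManagerException(
            $"Session {sessionId} is not ready. " +
            $"Session state is {sessionState}; worker state is {workerText}.");
    }
}
```

With this shape, a session that is `Ready` while its worker is `Handshaking` produces a message naming both states, which is the property the regression test pins.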