Harden worker-client heartbeat watchdog and event backpressure

Server-031: HeartbeatLoopAsync now skips the HeartbeatExpired fault while a command is in flight on the gateway-worker pipe, up to WorkerClientOptions.HeartbeatStuckCeiling (75s default) — a heartbeat gap caused by a slow STA command or an event-drain write burst no longer faults a healthy worker. Mirrors the worker-side Worker-023 guard. A command older than the ceiling still faults so a genuinely stuck COM call cannot hide the worker indefinitely. Server-032: EnqueueWorkerEventAsync now honors the configured EventChannelFullModeTimeout by awaiting WriteAsync against the wait-mode channel, instead of faulting on the first missed slot with the non-blocking TryWrite. A transient consumer hiccup is absorbed up to the timeout; the overflow diagnostic names the channel depth, capacity, and the actionable fix. Adds the Server-031 and Server-032 findings entries and WorkerClient regression tests covering both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:20:09 -04:00
parent 099d4783b0
commit cec84bf572
4 changed files with 327 additions and 11 deletions
@@ -504,3 +504,67 @@ Re-review pass at `a020350` — the cross-module sweep that resolved Server-015
 **Recommendation:** Format both states into the exception message — `Session {SessionId} is not ready. Session state is {_state}; worker state is {workerClientState}.` (or `"<no worker>"` when `_workerClient` is null). Document on the method that the two states can diverge under load and that this branch is the fail-fast for that case. Add a regression test that flips `FakeWorkerClient.State` to a non-Ready value (e.g. `Handshaking`) while the session is `Ready` and asserts both pieces of state appear in the thrown `SessionManagerException.Message`. The deeper race investigation (should the gateway briefly wait for worker-Ready before failing? when does `WorkerClient.State` legitimately shift while the session is still `Ready`?) is out of scope for this finding but is worth a follow-up.

 **Resolution:** 2026-05-20 — Rewrote `GetReadyWorkerClient` so the `SessionManagerException` message includes both `_state` and `_workerClient.State` (or `"<no worker>"` for the null case): `"Session {SessionId} is not ready. Session state is {_state}; worker state is {workerState}."`. Added XML doc on the method explaining the two-state contract and that this branch is the fail-fast for a state-divergence race. Added regression test `SessionManagerTests.InvokeAsync_WhenWorkerNotReadyButSessionReady_DiagnosticIncludesBothStates` that sets `FakeWorkerClient.State = WorkerClientState.Handshaking` while the session is `Ready` and asserts both `"Session state is Ready"` and `"worker state is Handshaking"` appear in the message; the test also pins `InvokeCount == 0` so the worker isn't called. The deeper race (should `GetReadyWorkerClient` retry briefly when state has just diverged?) remains open for follow-up.
+
+### Server-031
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Location | `src/MxGateway.Server/Workers/WorkerClient.cs:392-422` (gateway-side heartbeat watchdog); `src/MxGateway.Worker/Ipc/WorkerPipeSession.cs:588-617` (worker-side heartbeat loop); `src/MxGateway.Worker/Ipc/WorkerFrameWriter.cs:14,67-76` (shared `_writeLock`) |
+| Status | Open |
+
+**Description:** Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The .NET phase succeeded through `open-session`/`register`/`bulk-subscribe`/`bulk-read`/`bulk-unsubscribe`/`stream-events`/`write` but then failed on its third `advise` call with the Server-030 diagnostic `Session ... is not ready. Session state is Ready; worker state is Faulted.` The gateway stdout log records the underlying cause: **`Worker client faulted for session session-01a1a07fa59c489983a719821fa46e72: Worker heartbeat expired. Last heartbeat was at 2026-05-20T17:20:39.+00:00.`** — a real 15s+ gap with no `WorkerHeartbeat` envelope arriving from the worker.
+
+Investigation paths:
+
+1. **Shared `_writeLock` on the worker side.** `WorkerFrameWriter` serializes every pipe write (heartbeats, command replies, events, faults) through a single `SemaphoreSlim _writeLock` (`WorkerFrameWriter.cs:14`, `:67-76`). `RunEventDrainLoopAsync` (`WorkerPipeSession.cs:336-372`) writes events one at a time inside a `foreach`, each call to `_writer.WriteAsync` re-acquiring `_writeLock`. If the gateway-side read drains slowly and the OS-level named-pipe buffer fills, `_stream.WriteAsync` (`WorkerFrameWriter.cs:70`) blocks. The event-drain loop blocks holding the lock. `RunHeartbeatLoopAsync` (`WorkerPipeSession.cs:611-613`) then can't acquire `_writeLock` to send its 5s heartbeat. Heartbeats stall past the gateway's `HeartbeatGrace` (15s default) and `WorkerClient.HeartbeatLoopAsync` faults the session.
+
+2. **No prioritization between heartbeats and events.** Even without OS-level back-pressure, a backlog of events in the worker's `MxAccessEventQueue` (drained in batches of `EventDrainBatchSize`) can keep the writer lock held for many milliseconds at a time. Heartbeats can be delayed (though normally not past `HeartbeatGrace` unless something else is wrong).
+
+3. **Gateway-side heartbeat watchdog ignores in-flight commands.** `WorkerClient.HeartbeatLoopAsync` (`WorkerClient.cs:392-422`) checks only `_state == Ready` and `now - lastHeartbeatAt > HeartbeatGrace`. It does not check whether a command is in flight on the gateway↔worker pipe. The mirror of Worker-017's fix (worker-side watchdog skips `StaHung` while a command is in flight) does not exist on the gateway side.
+
+The .NET test pattern stresses the issue uniquely because each `dotnet run --project` rebuild between subcommands introduces multi-second client-side gaps; the worker's heartbeat path should still be alive (heartbeats are emitted by `RunHeartbeatLoopAsync` independently of gateway activity), but if the gateway is also blocked draining events from the channel into a non-existent `StreamEvents` consumer, the back-pressure-into-heartbeat chain bites first.
+
+**Recommendation:** Two changes worth landing together:
+
+1. **Decouple heartbeat writes from the event/reply lock.** Either (a) give heartbeats their own pipe `Stream` (likely impractical — one pipe per session), (b) introduce a priority queue in front of `WorkerFrameWriter` so heartbeats hop the line, or (c) interleave heartbeat checks inside `RunEventDrainLoopAsync` (e.g., after each event-batch write, post a heartbeat envelope if one is due). Option (c) is the smallest change.
+
+2. **Mirror Worker-017's "skip-while-command-in-flight" guard on the gateway side.** In `WorkerClient.HeartbeatLoopAsync`, when `_pendingCommands.Count > 0` and the oldest pending command is younger than some ceiling (e.g., 5× `HeartbeatGrace`), skip the fault. The worker may be busy executing a slow STA command and the heartbeat write may be queued behind a long event burst — neither indicates the worker is actually hung.
+
+Add a regression test that floods the worker's outbound event channel (e.g., via a high-rate STA fixture or a mock event source emitting at > 1000 events/s for several seconds) and asserts the worker is not faulted while the gateway has no `StreamEvents` consumer attached.
+
+**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_
+
+### Server-032
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Location | `src/MxGateway.Server/Workers/WorkerClient.cs:70-77,463-484` (gateway-side `_events` channel); `src/MxGateway.Server/Configuration/EventOptions.cs:8` (default capacity 10,000); `src/MxGateway.Server/Grpc/EventStreamService.cs` (consumer) |
+| Status | Open |
+
+**Description:** Surfaced during the 2026-05-20 cross-language e2e re-run against gateway `b794c46`. The Java phase advised ~55 items (`item-handle 63`) before failing on the next `advise` call with the Server-030 diagnostic `Session ... is not ready. Session state is Ready; worker state is Faulted.`. The gateway stdout log records: **`Worker client faulted for session session-adfcc808da974808947e87db060c2b03: Worker event channel rejected an event.`** — the gateway-side per-session bounded event channel filled up and `Channel.Writer.TryWrite` returned `false`, triggering the fail-fast path in `EnqueueWorkerEventAsync` (`WorkerClient.cs:467-484`).
+
+The channel is configured as `Channel.CreateBounded<WorkerEvent>(new BoundedChannelOptions(EventChannelCapacity) { ... FullMode = BoundedChannelFullMode.Wait ... })` (capacity defaults to `EventOptions.QueueCapacity = 10_000`). But `EnqueueWorkerEventAsync` uses **`TryWrite`** (non-blocking), so the configured `Wait` mode is moot — the writer always fails fast when full. This is consistent with `docs/DesignDecisions.md`'s "fail-fast event backpressure" policy (one subscriber per session, no producer-side queuing beyond the channel), but two facts make it sharp in practice:
+
+1. The e2e flow (and any realistic client) `advise`s many items BEFORE opening a long-running `StreamEvents` consumer. With no consumer, events accumulate at the in-rate (driven by the SCADA tags' change frequency). For `TestMachine_*.TestChangingInt` × ~55 advised items, the rig can fill 10,000 in well under a minute.
+
+2. The fail-fast threshold is "exactly at capacity." There is no overflow grace window. A momentary lull on the consumer side that lasts long enough for one extra event to arrive after the channel is full results in worker fault and session teardown.
+
+This is design-as-intended in the v1 sense, but it surfaces a behavioral contract that is **not currently documented**: clients must open `StreamEvents` BEFORE issuing `advise` against high-rate tags, or pace their `advise` calls below the (non-published) accumulation budget. None of the current docs (`gateway.md`, `docs/DesignDecisions.md`, the client READMEs) enforce or surface this requirement, and four of the five client CLIs (`go`, `python`, `rust`, `java`) hit it gracelessly in `scripts/run-client-e2e-tests.ps1`.
+
+The diagnostic `"Worker event channel rejected an event."` also does not name the actual channel (it says "Worker event channel" but the channel is gateway-owned), the current depth, or the capacity — only that it overflowed. Operators can't tell whether the threshold needs lifting or whether the consumer is genuinely missing.
+
+**Recommendation:** Three escalating options, pick at least the first and consider one of the others:
+
+1. **Document the contract.** In `gateway.md` and `docs/DesignDecisions.md`, state explicitly that `advise` produces events into the gateway-side per-session channel and that a `StreamEvents` consumer must be attached to drain it. Add the bound (`MxGateway:Events:QueueCapacity`, default 10,000) and the fault behavior (the worker is faulted; the session ends). Update `clients/*/README.md` to call out the requirement in the "advise" / "subscribe" sections.
+
+2. **Improve the diagnostic.** Format the channel depth and capacity into the fault message: `"Worker event channel rejected an event after {capacity} unconsumed events accumulated. Attach a StreamEvents consumer or increase MxGateway:Events:QueueCapacity."`
+
+3. **Add an overflow grace window.** Instead of fail-fast on the first `TryWrite == false`, count overflow events and only fault if N consecutive overflows happen within T ms (or, equivalently, switch to `WriteAsync` with a short timeout). This trades a tiny memory bump for resilience to consumer hiccups. Out of scope if v1 explicitly chose fail-fast for parity reasons — but worth raising for v2.
+
+Add a regression test that advises N items without an active `StreamEvents` consumer, lets the channel fill, and asserts the produced fault message contains the channel-depth diagnostic (#2) — gated so that #3 is not required.
+
+**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_