From 7a435957ee65e3e7d068be9e6ee3087d10b02f4b Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Thu, 14 May 2026 06:52:33 -0400 Subject: [PATCH] =?UTF-8?q?mbproxy:=20Wave=204=20=E2=80=94=20fix=20issues?= =?UTF-8?q?=20introduced=20by=20the=20Wave-1/2=20fixes?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the new findings from the post-remediation re-review (codereviews/2026-05-14/ReReviewAfterRemediation.md): NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was meant to replace. Supervisor.StopAsync nulls the per-mux counter provider before the drain loop runs, so CountInFlight always returns 0 and the drain budget is never spent on actual draining. Fix: snapshot the in-flight count BEFORE supervisor stop, drop the theatrical post-stop loop, and report InFlightAtCancel as the snapshot count (= the number of in-flight requests dropped by the stop). The supervisor stop IS the drain — there is nothing to drain that wouldn't be killed by the stop itself. NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable Without a token, a long Polly-wrapped EnsureBackendConnectedAsync against an unreachable host could hold the gate for the full BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound the wait with a 2 s teardown deadline; on timeout proceed best-effort without the gate. Worst-case consequence is one orphaned in-flight cycle on the dying backend, surfaced to upstream as exception 0x0B by the watchdog. NM2 — ReplaceContext non-atomic ctx + provider swap Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If `_ctx` was swapped first, a snapshot taken in the gap would still hold the OLD adapter wrapping the OLD cache — which the supervisor disposes immediately after we return. Fix: set the provider FIRST, then swap `_ctx`. Snapshots in the swap window now read either (old, old) or (new, new), never (old-after-disposed). NM5 — Self-cascade ObjectDisposedException after dispose Writer/reader fault catches fired `_ = TearDownBackendAsync(...)` unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as an unobserved Task exception. Fix: skip self-cascade when `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit TearDown anyway. Nm1 — Saturation cleanup uses await SendResponseAsync W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync, which would serialise on a wedged late-attacher's full bounded channel and stall delivery to its peers — contradicting the W1.3 doctrine that the fan-out path must never await per-pipe writes. Fix: use TrySendResponse and increment ResponseDropForFullUpstream on drop. T2 — WatchdogVsResponse_Race seeded Random fragility Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms; Random's algorithm is implementation-defined across .NET major versions (legacy → Xoshiro128 in .NET 6) so a runtime upgrade could land all samples on one side of the deadline and break the "both branches must fire" assertion. Fix: deterministic counter-based alternation (15 fast + 15 slow across 30 iterations) — guaranteed by construction. Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS single-shot semantics) are unfixed: no caller actually re-Starts a supervisor today; both become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Documented in the re-review. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green (T2 alternation is deterministic). Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-05-14/ReReviewAfterRemediation.md | 126 ++++++++++++++++++ .../Proxy/Multiplexing/PlcMultiplexer.cs | 90 ++++++++++--- mbproxy/src/Mbproxy/Proxy/ProxyWorker.cs | 70 +++++----- .../Proxy/Multiplexing/PlcMultiplexerTests.cs | 15 ++- 4 files changed, 240 insertions(+), 61 deletions(-) create mode 100644 mbproxy/codereviews/2026-05-14/ReReviewAfterRemediation.md diff --git a/mbproxy/codereviews/2026-05-14/ReReviewAfterRemediation.md b/mbproxy/codereviews/2026-05-14/ReReviewAfterRemediation.md new file mode 100644 index 0000000..c2817d0 --- /dev/null +++ b/mbproxy/codereviews/2026-05-14/ReReviewAfterRemediation.md @@ -0,0 +1,126 @@ +# Re-Review After Remediation — 2026-05-14 + +Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → `ce32c5c`, Wave 2 → `e66b17f`, Wave 3 → `7ead358`, the easy 5 → `2545237`, the race-hard 5 → `53f842a`). Conducted via three parallel area-focused passes. **Eyes on what the fixes themselves introduced**, not what the original review already found. + +**Scope:** every src/ and tests/ change in `53a7111..HEAD` (37 files, ~+2000/−700 lines). + +## Headline + +The remediation is structurally sound. **One new critical finding** — the W1.5 drain loop is structurally always-zero — inherited the very bug it was meant to replace. **Five new major findings** cluster in two areas: + +- **W1.4 cascade gating** introduced two disposal-deadlock and exception-swallowing paths around the new `_connectGate.WaitAsync()` call inside `TearDownBackendAsync`. +- **W1.1 + W1.5 lifecycle ordering** introduced a non-atomic context swap (snapshot reads can see a disposed cache during reseat) and silently single-shot supervisor semantics from the W2.15 + W2.16 combination. + +Plus a handful of test-discipline gaps from the new race-hard tests (reflection on private field names, RNG seed fragility across runtime versions). + +## New critical findings (1) + +### NC1. `ProxyWorker.StopAsync` drain loop is structurally always-zero — inherited the original `ShutdownCoordinator` bug it replaced +**File:** `src/Mbproxy/Proxy/ProxyWorker.cs:276-310` (introduced by W1.5) + +**Path of execution:** +1. `await Task.WhenAll(stopTasks)` — each `PlcListenerSupervisor.StopAsync` cancels its inner CTS, the listener's `RunAsync` exits, the OperationCanceledException catch calls `await listener.DisposeAsync()`, which calls `_multiplexer.DisposeAsync()`, which calls `_ctx.Counters.SetMultiplexProvider(null)` (`PlcMultiplexer.cs:224`). +2. *Then* the drain loop begins. `CountInFlight` reads `supervisor.CurrentCounters.Snapshot().InFlightCount`, and `ProxyCounters.Snapshot()` (`ProxyCounters.cs:403-404`) returns `provider?.InFlightCount ?? 0`. The provider is now `null`. So `total == 0` on the first iteration and the loop exits immediately. + +The drain budget (`gracefulMs`) is never spent on actual draining. `inFlightAtCancel` will always log as `0`. The `mbproxy.shutdown.complete` event under load will report `InFlightAtCancel=0` even when in-flight requests were forcibly cancelled. + +**This is the same defect the original review (`AdminAndDiagnostics.md:C1`) called out in the deleted `ShutdownCoordinator`.** The W1.5 relocation faithfully copied the broken sequencing. + +**Fix options:** +- (a) Snapshot in-flight counts BEFORE calling `supervisors.StopAsync`, then move drain-with-real-multiplexers into a phase BEFORE supervisor stop (i.e. stop accepting new connections, drain in-flight, then stop supervisors). +- (b) Accept that supervisor stop *is* the drain (matches today's behavior) and remove the theatrical loop + `InFlightAtCancel` field from the log event. + +## New major findings (5) + +### NM1. `TearDownBackendAsync` acquires `_connectGate` with no token — disposal can be blocked indefinitely +**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:333` (introduced by W1.4) + +`await _connectGate.WaitAsync().ConfigureAwait(false)` does not honour `_disposeCts.Token`. The dispose-driven teardown calls this after cancelling `_disposeCts`, but if a long Polly-wrapped `EnsureBackendConnectedAsync` is mid-connect against an unreachable host (waiting up to `BackendConnectTimeoutMs`), DisposeAsync blocks here for that duration. This blocks `ProxyWorker.StopAsync` past its drain budget. + +**Fix:** pass `_disposeCts.Token` (or a CTS that cancels on dispose) to `WaitAsync`, accept the `OperationCanceledException`, and skip teardown work if disposal already completed. + +### NM2. `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can see a disposed cache +**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:173-184` (introduced by W1.1) + +Snapshot path (`ProxyCounters.Snapshot()` called from the admin endpoint) reads `_cacheStatsProvider` independently of `_ctx`. Sequence: +1. Thread A enters `ReplaceContext`, writes `_ctx = newContext` (line 177). +2. Thread B (admin) snapshots, reads `_cacheStatsProvider` (still the OLD adapter wrapping the OLD cache). +3. Thread A runs `SetCacheStatsProvider(newCacheAdapter)`. +4. Supervisor's `ReplaceContextAsync` proceeds to `oldCache.Clear()` then `oldCache.Dispose()`. + +A snapshot during the swap reports `cacheEntryCount`/`cacheBytes` from a cache the live multiplexer no longer references. Worse, if the snapshot races past step 4, the stats adapter holds a disposed `ResponseCache` and `Count`/`ApproximateBytes` may throw. **Same shape of race as the original review's "snapshot inconsistency" pattern**, re-opened on a new path by the W1.1 fix. + +**Fix:** swap the order (provider first, then `_ctx`) inside a short lock, OR have `ResponseCache.Count`/`ApproximateBytes` no-op on a disposed instance. + +### NM3. `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard +**File:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:140, 437-446` + +W2.16's guard prevents re-Start while busy, but it does not address the leaking previous CTS the comment claims to fix. `StopAsync` sets `_state = Stopped` first, then awaits `_supervisorTask`. The previous `_supervisorCts` is disposed only in `DisposeAsync`, not in `StopAsync`. A re-Start pattern (Stop → Start without Dispose) leaks one CTS per cycle. + +Today no caller actually re-Starts a supervisor (ConfigReconciler's remove path uses `DisposeAsync`), so the leak is **latent**. Worth fixing now or documenting. + +### NM4. W2.15 TCS is never re-armed; W2.16 + TCS together make the supervisor effectively single-shot +**File:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:74-75, 132-147` + +`_firstAttemptCompleted` is `readonly` and set in the field initialiser. Once `TrySetResult` fires (anywhere in `RunSupervisorAsync` including the `finally`), it stays completed forever. A restarted supervisor would have `WaitForInitialBindAttemptAsync` return immediately on the *previous* run's signal, regardless of the new run's bind status. Combined with the comment claiming "the supervisor's state machine has exactly one Start", the intent appears to be single-shot — but neither the guard nor the TCS enforces it cleanly. + +Today no caller re-Starts so this is **latent**. Either re-arm the TCS in `RunSupervisorAsync` or have W2.16 refuse re-Start unconditionally. + +### NM5. Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal +**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:451, 648` (introduced by W1.4 + W2.5) + +After `DisposeAsync` runs `_connectGate.Dispose()`, any in-flight self-cascade (`_ = TearDownBackendAsync(...)` from writer/reader fault paths) reaches `WaitAsync()` and throws `ObjectDisposedException`. The `try/finally` will then run `Release()` on a disposed semaphore in the `finally`, also throwing. Both throws happen on a fire-and-forget Task — they become unobserved exceptions on TPL's UnobservedTaskException event. Noisy but not fatal. + +**Fix:** wrap the body in `try { … } catch (ObjectDisposedException) { }` or short-circuit on `_disposed` before re-entering teardown. + +## New minor findings (8) + +| # | File:line | Finding | +|---|-----------|---------| +| Nm1 | `PlcMultiplexer.cs:842` (W1.2 cleanup) | Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe. A single wedged late-attacher stalls saturation delivery to all others — contradicts the W1.3 doctrine. Use `TrySendResponse` here too. | +| Nm2 | `PlcMultiplexer.cs:810` | W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04. Counter shows hits that produced no real coalesced response. Document or decrement. | +| Nm3 | `ProxyWorker.cs:269-310` | Both supervisor-stop and (theatrical) drain share `gracefulMs`. Worst-case shutdown is `2 * gracefulMs + 2s admin` — operators expect single budget. | +| Nm4 | `PlcListenerSupervisor.cs:433-446` | `finally` writes `_state = Stopped` directly, preserving `_lastBindError` from a prior fault. After clean cancellation operators see "last bind error" that was never the cause of stopping. Doc the field as "last fault while running" or clear on clean exit. | +| Nm5 | `EventLogBridge.cs:35-45` | W2.23 cached `_sourceExists` is correct, but no startup log confirms armed state. If `install.ps1` registers source after service starts, every Error+ event is silently dropped until restart. Add a one-line startup log. | +| Nm6 | `ProxyWorker.cs:224-235` | W1.5 lazy `_admin = _services.GetService()` returns `null` silently if `AddMbproxyAdmin()` is forgotten. Previous `IHostedService` registration would have errored loudly during host build. Use `GetRequiredService<>` or log a warning. | +| Nm7 | `AdminEndpointHost.cs:200-210` | `DisposeAsync` lacks a `_disposed` guard; `ProxyWorker.StopAsync` calls `StopAsync` then DI disposes again. Operations are idempotent today but symmetry with `PlcMultiplexer` would be cleaner. | +| Nm8 | `PlcMultiplexer.cs:415` | `TearDownBackendAsync` log `BackendDisconnected` event fires per cascade; queued cascades on the gate each log their own event with mostly-zero counts. Cosmetic noise. | + +## Test-discipline findings (4) + +| # | File:line | Finding | +|---|-----------|---------| +| T1 | `PlcMultiplexerTests.cs:1152-1165`, `SupervisorTests.cs:1779-1789` | `DrainAllocator` and the runtime-fault test reflect on private field names (`_allocator`, `_currentListener`, `_listener`). A rename refactor breaks them at run-time, not compile-time. Recommend `[InternalsVisibleTo]` test seam OR comments on each reflected field warning that tests depend on the name. | +| T2 | `PlcMultiplexerTests.cs:1379-1380` (W3 #8) | `WatchdogVsResponse_Race` uses `new Random(12345)` over `[350, 450)` ms with watchdog at 400 ms; asserts BOTH branches hit. `Random` with a seed is implementation-defined across major .NET versions (legacy → Xoshiro128 in .NET 6). On a runtime upgrade the seed could land all 30 samples on one side and the test breaks. Replace with deterministic alternation: `int delay = (i & 1) == 0 ? 350 : 450;`. | +| T3 | `Mbproxy.Tests.csproj:38-40` (W2.21) | `RemoveInheritedAppsettings` Target uses `AfterTargets="Build"` and deletes from `$(OutputPath)`. Won't fire on `dotnet publish`. Test projects rarely publish but worth `AfterTargets="Build;Publish"` + a second Delete against `$(PublishDir)appsettings.json`. | +| T4 | `InFlightByKeyMapTests.cs:46, 75, 112` | After W3 removed the `bool` return, test method names still read `TryAttachOrCreate_..._ReturnsTrue_WasNewTrue`. Decorative but misleading for grep. | + +## Verified clean (sampled, not exhaustive) + +- **W2.3 ConcurrentDictionary migration** on `_supervisors` — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name). +- **W2.1 coalescingAccessor propagation** — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor). +- **W2.13 OOR check** — multiplication is bounded by the guard; even worst-case `9999 * 10_000 + 9999 = 99_989_999` fits in int32 without overflow. +- **W2.14 byteCount validation** — strict `<` check passes a perfectly-sized PDU; trailing-byte case correct. +- **W2.10 resolved-TTL re-check** — `BcdTagMapBuilder.Build` is called exactly once per PLC at validation; no duplicate work. +- **W2.18 ConnectionOptions validation** — both `MbproxyOptionsValidator` and `ReloadValidator` reject `<= 0`; no bypass path. +- **W3 `HasBadNibble` dedupe** — clean; the codec's internal helper is the single source of truth. +- **W2.15 TCS signalled in every exit path** of `RunSupervisorAsync` (bind-success, bind-failure-into-recovery, run-failure-into-recovery, finally) — no hang on `WaitForInitialBindAttemptAsync` for the first run. +- **W2.17 TransitionTo lock contract** — both writers use it; `Snapshot` reads under the same lock; no torn triples. +- **`TxIdAllocator.Release` double-call is benign** (`TxIdAllocator.cs:121-129` checks `if (_inUse[id])`); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe. +- **W1.1 in-PDU snapshot consistency** — `OnUpstreamFrameAsync` reads `_ctx.Cache` and `_ctx.TagMap.ResolveCacheTtlMs` non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream `WithCurrentRequest` snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent. +- **W2.7 cache-FC byte sourced from post-rewriter buffer** — correct; the rewriter never touches the FC byte but the source must remain `frame[…]` to capture the exception bit. + +## Recommended next actions + +If you want to keep the issue tracker tidy, these are the items that should land as a small follow-up commit: + +1. **NC1** — fix or remove the theatrical drain loop. Smallest change: delete the loop and the `InFlightAtCancel` field from the log. +2. **NM1** — pass `_disposeCts.Token` to `_connectGate.WaitAsync` in `TearDownBackendAsync`. +3. **NM2** — wrap the `_ctx` write + provider re-registration in a single short lock, or null-guard the cache adapter against disposed-cache reads. +4. **NM5** — wrap the self-cascade body in `try { … } catch (ObjectDisposedException) { }`. +5. **Nm1** — use `TrySendResponse` in the W1.2 saturation cleanup loop to avoid wedge-by-wedged-attacher. +6. **T2** — replace the seeded `Random` in `WatchdogVsResponse_Race` with deterministic alternation. + +NM3 and NM4 are latent (no current caller exercises supervisor re-Start). They become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Defer or document. + +Everything else is documentation / cosmetic / minor cleanup. diff --git a/mbproxy/src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs b/mbproxy/src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs index 370d489..687094c 100644 --- a/mbproxy/src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs +++ b/mbproxy/src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs @@ -174,13 +174,23 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi { if (_disposed) return; - _ctx = newContext; - - // Re-register the cache stats provider on the (preserved) counters so the status - // page sees the new cache's count/bytes immediately. Pass null when the new context - // opted out of caching to clear any stale provider from the previous context. + // Phase 12 (W4 / NM2) — provider FIRST, then _ctx. The status page's snapshot + // path reads `_cacheStatsProvider` independently of `_ctx`. If we swapped `_ctx` + // first, a snapshot taken in the gap between the two writes would still hold the + // OLD adapter wrapping the OLD cache — which the supervisor is about to dispose + // (`PlcListenerSupervisor.ReplaceContextAsync` runs `oldCache.Dispose()` after we + // return). Setting the provider first means snapshots in the swap window read + // either (old provider, old ctx) or (new provider, new ctx) — both coherent — + // never (old provider after old cache disposed). + // + // In the typical reseat case `oldContext.Counters == newContext.Counters` (the + // reconciler preserves counters across reseat), so this updates the same instance + // both paths share. The order still matters because the snapshot reads the + // provider field, not the per-context counters reference. newContext.Counters.SetCacheStatsProvider( newContext.Cache is not null ? new CacheStatsAdapter(newContext.Cache) : null); + + _ctx = newContext; } /// @@ -330,7 +340,33 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi // this, a fresh EnsureBackendConnectedAsync racing with the channel drain below // could see stranded frames sent on its new socket with old (already-released) TxIds, // producing orphaned responses that hang upstream peers via the watchdog. - await _connectGate.WaitAsync().ConfigureAwait(false); + // + // Phase 12 (W4 / NM1) — bound the wait. Without a timeout, a long Polly-wrapped + // EnsureBackendConnectedAsync against an unreachable host can hold the gate for + // the full BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync (and + // therefore ProxyWorker.StopAsync) for that duration. A 2 s teardown deadline + // bounds disposal latency; if the gate is unavailable we proceed best-effort + // without it (the worst-case consequence is one orphaned in-flight cycle on the + // dying backend, which the upstream watchdog will surface as exception 0x0B). + bool gateHeld = false; + try + { + using var teardownCts = new CancellationTokenSource(TimeSpan.FromSeconds(2)); + await _connectGate.WaitAsync(teardownCts.Token).ConfigureAwait(false); + gateHeld = true; + } + catch (OperationCanceledException) + { + // Best-effort: proceed without the gate. Concurrent connect attempts will + // observe _disposed (or the now-null _backendSocket) and short-circuit. + } + catch (ObjectDisposedException) + { + // _connectGate already disposed — TearDown is racing past DisposeAsync. + // Skip the body entirely; there's nothing useful to do at this point. + return; + } + try { Socket? oldSocket; @@ -416,7 +452,12 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi } finally { - _connectGate.Release(); + // Only release if we acquired (W4 / NM1) — best-effort path may have skipped. + if (gateHeld) + { + try { _connectGate.Release(); } + catch (ObjectDisposedException) { /* dispose race — harmless */ } + } } } @@ -446,8 +487,12 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi } catch (Exception ex) { - // Backend failure — cascade. - _ = TearDownBackendAsync($"writer fault: {ex.Message}", cascadeUpstreams: true); + // Backend failure — cascade. Phase 12 (W4 / NM5) — skip if disposal is + // already in progress; DisposeAsync runs an explicit TearDown and the + // fire-and-forget here would race against it, hitting a disposed + // _connectGate and producing an unobserved-task exception. + if (!_disposeCts.IsCancellationRequested) + _ = TearDownBackendAsync($"writer fault: {ex.Message}", cascadeUpstreams: true); } } @@ -637,7 +682,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi } // Reader exited cleanly — backend closed by remote. Cascade. - _ = TearDownBackendAsync("backend reader EOF", cascadeUpstreams: true); + // Phase 12 (W4 / NM5) — skip if dispose is already in progress (see writer-side + // comment above for rationale). + if (!_disposeCts.IsCancellationRequested) + _ = TearDownBackendAsync("backend reader EOF", cascadeUpstreams: true); } catch (OperationCanceledException) { @@ -645,7 +693,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi } catch (Exception ex) { - _ = TearDownBackendAsync($"reader fault: {ex.Message}", cascadeUpstreams: true); + if (!_disposeCts.IsCancellationRequested) + _ = TearDownBackendAsync($"reader fault: {ex.Message}", cascadeUpstreams: true); } } @@ -834,18 +883,16 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi if (_inFlightByKey.TryRemove(key, out var stub)) { + // Phase 12 (W4 / Nm1) — non-blocking delivery via TrySendResponse. + // Previously this loop awaited SendResponseAsync per party, which would + // serialise on a wedged late-attacher's full bounded channel and stall + // delivery to its peers. Same doctrine as the W1.3 backend-reader fix: + // the per-PLC fan-out path must never await per-pipe writes. foreach (var party in stub.InterestedParties) { byte[] excFrame = BuildExceptionFrame(party.OriginalTxId, unitId, fcByte, exceptionCode: 4); - try - { - await party.Pipe.SendResponseAsync(excFrame, ct).ConfigureAwait(false); - } - catch - { - // Best-effort delivery. A dead pipe will be collected by its own - // socket close path; nothing more we can do here. - } + if (!party.Pipe.TrySendResponse(excFrame)) + _ctx.Counters.IncrementResponseDropForFullUpstream(); } } else @@ -853,7 +900,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi // The stub was already removed by another path (extremely unlikely, but // defensive). Surface the exception to the original requester. byte[] excFrame = BuildExceptionFrame(originalTxId, unitId, fcByte, exceptionCode: 4); - await pipe.SendResponseAsync(excFrame, ct).ConfigureAwait(false); + if (!pipe.TrySendResponse(excFrame)) + _ctx.Counters.IncrementResponseDropForFullUpstream(); } return; } diff --git a/mbproxy/src/Mbproxy/Proxy/ProxyWorker.cs b/mbproxy/src/Mbproxy/Proxy/ProxyWorker.cs index 9bab154..7d90d3f 100644 --- a/mbproxy/src/Mbproxy/Proxy/ProxyWorker.cs +++ b/mbproxy/src/Mbproxy/Proxy/ProxyWorker.cs @@ -244,18 +244,26 @@ internal sealed partial class ProxyWorker : BackgroundService /// ShutdownCoordinator): /// /// Cancel via base.StopAsync. - /// Stop all supervisors with a 5 s hard deadline (no new connections; existing - /// pipes are cascaded by teardown). - /// Wait for in-flight PDUs to drain via the live - /// (read fresh from - /// so a hot-reloaded value is - /// honoured at stop time). - /// Stop the admin endpoint LAST so the status page survives the drain phase - /// and an operator polling it sees the in-flight count fall to zero. + /// Snapshot per-PLC in-flight counts BEFORE stopping supervisors — + /// this is the only honest reading of "how many requests were in flight when + /// we decided to stop." Once supervisors stop, their multiplexers are torn + /// down and the per-mux counter providers are nulled, so any later read + /// returns 0 regardless of what was actually dropped. + /// Stop all supervisors with the configured graceful timeout. Supervisor + /// stop is the actual drain — it cancels the listener, which exits its + /// accept loop, which disposes the multiplexer, which cascades all attached + /// pipes. There is no separate "drain in-flight" phase because there is + /// nothing to drain that wouldn't be killed by the supervisor stop itself + /// (the original Phase-08 ShutdownCoordinator's drain loop had this same + /// shape and was structurally always-zero — call out from + /// codereviews/2026-05-14/ReReviewAfterRemediation.md NC1). + /// Stop the admin endpoint LAST so the status page survives the supervisor + /// stop phase and operators can observe the live state right up to shutdown. /// Dispose every supervisor to release sockets, channels, and watchdog timers. /// - /// Logs mbproxy.shutdown.complete on the way out with the in-flight count at - /// drain-deadline (zero on a clean shutdown, positive when forced cancel). + /// Logs mbproxy.shutdown.complete with InFlightAtCancel equal to the + /// snapshot count from step 2 (= the number of in-flight requests dropped by the + /// stop) and ElapsedMs for the whole sequence. /// public override async Task StopAsync(CancellationToken cancellationToken) { @@ -263,12 +271,21 @@ internal sealed partial class ProxyWorker : BackgroundService await base.StopAsync(cancellationToken).ConfigureAwait(false); var sw = Stopwatch.StartNew(); + + // Phase 12 (W4 / NC1) — snapshot in-flight count BEFORE supervisor stop. After + // supervisor.StopAsync, multiplexers are disposed and CountInFlight returns 0 + // unconditionally; reading after the stop produced a meaningless always-zero log + // (the original ShutdownCoordinator had the same defect — see + // codereviews/2026-05-14/ReReviewAfterRemediation.md NC1). + int inFlightAtCancel = CountInFlight(); + // Phase 12 (W2.20) — supervisor stop deadline read from the live config so a - // hot-reloaded GracefulShutdownTimeoutMs is honoured. Previously hard-coded 5 s. - // The supervisor stop budget is bounded by the same total-shutdown budget. + // hot-reloaded GracefulShutdownTimeoutMs is honoured. Supervisor stop is the + // drain: cancelling the supervisor cancels the listener, which exits accept, which + // disposes the multiplexer, which cascades all attached pipes. int gracefulMs = _options.CurrentValue.Connection.GracefulShutdownTimeoutMs; - // ── 1. Stop accepting new connections ───────────────────────────────────────── + // ── 1. Stop accepting new connections + drain (one combined phase) ──────────── using var stopCts = new CancellationTokenSource(TimeSpan.FromMilliseconds(gracefulMs)); using var linked = CancellationTokenSource.CreateLinkedTokenSource( stopCts.Token, cancellationToken); @@ -286,30 +303,7 @@ internal sealed partial class ProxyWorker : BackgroundService // Best effort — don't let individual supervisor failures block shutdown. } - // ── 2. Drain in-flight PDUs ─────────────────────────────────────────────────── - // Same `gracefulMs` budget the supervisor-stop step used. - int drainDeadlineMs = gracefulMs; - int inFlightAtCancel = 0; - - if (drainDeadlineMs > 0) - { - using var drainCts = new CancellationTokenSource(TimeSpan.FromMilliseconds(drainDeadlineMs)); - try - { - while (!drainCts.Token.IsCancellationRequested) - { - int total = CountInFlight(); - if (total == 0) break; - await Task.Delay(10, drainCts.Token).ConfigureAwait(false); - } - } - catch (OperationCanceledException) - { - inFlightAtCancel = CountInFlight(); - } - } - - // ── 3. Stop admin endpoint LAST ─────────────────────────────────────────────── + // ── 2. Stop admin endpoint LAST ─────────────────────────────────────────────── if (_admin is not null) { try @@ -323,7 +317,7 @@ internal sealed partial class ProxyWorker : BackgroundService } } - // ── 4. Dispose supervisors (releases sockets, channels, watchdog timers) ───── + // ── 3. Dispose supervisors (releases sockets, channels, watchdog timers) ───── foreach (var supervisor in _supervisors.Values) await supervisor.DisposeAsync().ConfigureAwait(false); diff --git a/mbproxy/tests/Mbproxy.Tests/Proxy/Multiplexing/PlcMultiplexerTests.cs b/mbproxy/tests/Mbproxy.Tests/Proxy/Multiplexing/PlcMultiplexerTests.cs index c90ba80..552fcda 100644 --- a/mbproxy/tests/Mbproxy.Tests/Proxy/Multiplexing/PlcMultiplexerTests.cs +++ b/mbproxy/tests/Mbproxy.Tests/Proxy/Multiplexing/PlcMultiplexerTests.cs @@ -992,8 +992,19 @@ public sealed class PlcMultiplexerTests // = 400 the tick is 100 ms. Configure the backend to delay 350-450 ms for each // request so some land before, some after the timeout. int backendPort = PickFreePort(); - var rng = new Random(12345); - var slowBackend = new SlowResponseBackend(backendPort, () => rng.Next(350, 450)); + // Phase 12 (W4 / T2) — deterministic alternation rather than seeded Random. Random + // with a fixed seed is not stable across .NET major versions (Microsoft has changed + // the implementation, e.g. legacy → Xoshiro128 in .NET 6), so a runtime upgrade + // could land all samples on one side of the watchdog deadline and break the + // "both branches must fire" assertion below. Counter-based alternation guarantees + // 15 fast (350 ms, beats watchdog) and 15 slow (450 ms, loses to watchdog) responses + // across 30 iterations, regardless of runtime. + int reqCount = 0; + var slowBackend = new SlowResponseBackend(backendPort, () => + { + int n = Interlocked.Increment(ref reqCount); + return (n & 1) == 1 ? 350 : 450; + }); await using var _ = slowBackend; var ctx = MakeContext("PLC1");