mbproxy: Wave 5 — fixes from third re-review pass

Closes findings from the third focused re-review pass on the post-W4-followup state (recorded in codereviews/2026-05-14/ReReviewAfterRemediation.md).

W5/M1 — AdminEndpointHost OnChange callback can resurrect Kestrel after StopAsync

The hot-reload OnChange handler at AdminEndpointHost.StartAsync did fire-and-forget `_ = Task.Run(...)` with no _disposed check. If AdminPort was hot-reloaded during shutdown, the queued Task could land between StopAsync's registration-dispose and DisposeAsync's _lock-dispose, take the lock, and bind a fresh Kestrel WebApplication on the new port — resurrecting admin AFTER the host considered it shut down. Worse, if DisposeAsync had already run _lock.Dispose, the queued Task throws ObjectDisposedException as an unobserved Task exception.

Fix: _disposed guard at the top of the OnChange lambda AND inside the queued Task.Run, plus try/catch (ObjectDisposedException) around _lock.WaitAsync and _lock.Release.

W5/m2 — inFlightAtCancel computed AFTER base.StopAsync

The W4/NC1 fix correctly snapshotted inFlight BEFORE supervisor.StopAsync (so the multiplexers' counter providers were still wired), but it computed the snapshot AFTER base.StopAsync(cancellationToken). Between those two lines, in-flight requests whose responses arrive get removed from _correlation, and the watchdog can clear stale entries. The reported count therefore drifted downward from "in-flight at signal time" to "in-flight at compute time."

Fix: snapshot at the very top of StopAsync before any cancellation is propagated.

W5/m1 — Cascade gate-not-held path race (accepted as documented best-effort)

When TearDownBackendAsync's _connectGate.WaitAsync(2s) times out, the body runs unprotected. A concurrent EnsureBackendConnectedAsync that DOES hold the gate may TryAllocate a TxId that collides (after wraparound in the allocator's forward scan) with one being released by the channel drain. The double-release would mark the new request's slot as free even though it's legitimately in-flight, allowing the next allocation to reuse the same slot and CorrelationMap.TryAdd to fail (silent request drop). Probability is very low (gate timeout AND new accept landing AND TxId collision in 65,536-slot space); the only consequence is one dropped request the client retries. Documented inline at PlcMultiplexer.cs near the gateHeld declaration as accepted best-effort behaviour.

W5/m3 — CountInFlight allocates a CounterSnapshot record per supervisor

Trivial (~5 KB on a 54-PLC fleet, called once per shutdown). Skipped per re-review verdict.

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:13:47 -04:00
parent 9251c564c1
commit 59d0b5deb9
4 changed files with 69 additions and 12 deletions
@@ -348,6 +348,18 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
        // bounds disposal latency; if the gate is unavailable we proceed best-effort
        // without it (the worst-case consequence is one orphaned in-flight cycle on the
        // dying backend, which the upstream watchdog will surface as exception 0x0B).
+        //
+        // Phase 12 (W5 / m1) — KNOWN RACE on the gate-not-held path: a concurrent
+        // EnsureBackendConnectedAsync that DOES hold the gate may TryAllocate a TxId
+        // that collides (after wraparound in the allocator's forward scan) with a TxId
+        // we're about to release from the channel-drain step below. The double-release
+        // would mark the new request's slot as free even though it's legitimately
+        // in-flight, allowing the next allocation to reuse the same slot and
+        // CorrelationMap.TryAdd to fail (silent request drop). Probability is very low
+        // (requires gate timeout + new accept landing during cascade + TxId collision in
+        // a 65,536-slot space); the only consequence is one dropped request that the
+        // client retries. Documented as accepted best-effort behaviour in
+        // codereviews/2026-05-14/ReReviewAfterRemediation.md (m1).
        bool gateHeld = false;
        try
        {
@@ -277,18 +277,23 @@ internal sealed partial class ProxyWorker : BackgroundService
    /// </summary>
    public override async Task StopAsync(CancellationToken cancellationToken)
    {
+        // Phase 12 (W5 / m2) — snapshot in-flight BEFORE base.StopAsync so the field
+        // matches its name: "the count at the moment the host signalled stop", not "the
+        // count at the moment we got around to computing it." `base.StopAsync` cancels the
+        // ExecuteAsync stoppingToken; in the milliseconds before it returns, in-flight
+        // requests whose responses arrive will be removed from _correlation and the
+        // watchdog can clear stale entries — the count would otherwise drift downward.
+        //
+        // Phase 12 (W4 / NC1) — must run BEFORE supervisor stop too: after
+        // supervisor.StopAsync, multiplexers are disposed and CountInFlight returns 0
+        // unconditionally (the original ShutdownCoordinator had the same defect).
+        int inFlightAtCancel = CountInFlight();
+
        // Cancel ExecuteAsync first.
        await base.StopAsync(cancellationToken).ConfigureAwait(false);

        var sw = Stopwatch.StartNew();

-        // Phase 12 (W4 / NC1) — snapshot in-flight count BEFORE supervisor stop. After
-        // supervisor.StopAsync, multiplexers are disposed and CountInFlight returns 0
-        // unconditionally; reading after the stop produced a meaningless always-zero log
-        // (the original ShutdownCoordinator had the same defect — see
-        // codereviews/2026-05-14/ReReviewAfterRemediation.md NC1).
-        int inFlightAtCancel = CountInFlight();
-
        // Phase 12 (W2.20) — supervisor stop deadline read from the live config so a
        // hot-reloaded GracefulShutdownTimeoutMs is honoured. Supervisor stop is the
        // drain: cancelling the supervisor cancels the listener, which exits accept, which