mbproxy: Wave 5 — fixes from third re-review pass

Closes findings from the third focused re-review pass on the post-W4-followup
state (recorded in codereviews/2026-05-14/ReReviewAfterRemediation.md).

W5/M1 — AdminEndpointHost OnChange callback can resurrect Kestrel after StopAsync
  The hot-reload OnChange handler at AdminEndpointHost.StartAsync did
  fire-and-forget `_ = Task.Run(...)` with no _disposed check. If AdminPort
  was hot-reloaded during shutdown, the queued Task could land between
  StopAsync's registration-dispose and DisposeAsync's _lock-dispose, take
  the lock, and bind a fresh Kestrel WebApplication on the new port —
  resurrecting admin AFTER the host considered it shut down. Worse, if
  DisposeAsync had already run _lock.Dispose, the queued Task throws
  ObjectDisposedException as an unobserved Task exception. Fix: _disposed
  guard at the top of the OnChange lambda AND inside the queued Task.Run,
  plus try/catch (ObjectDisposedException) around _lock.WaitAsync and
  _lock.Release.

W5/m2 — inFlightAtCancel computed AFTER base.StopAsync
  The W4/NC1 fix correctly snapshotted inFlight BEFORE supervisor.StopAsync
  (so the multiplexers' counter providers were still wired), but it computed
  the snapshot AFTER base.StopAsync(cancellationToken). Between that call
  and the snapshot, responses arriving for in-flight requests remove their
  entries from _correlation, and the watchdog can clear stale entries. The reported
  count therefore drifted downward from "in-flight at signal time" to
  "in-flight at compute time." Fix: snapshot at the very top of StopAsync
  before any cancellation is propagated.

W5/m1 — Cascade gate-not-held path race (accepted as documented best-effort)
  When TearDownBackendAsync's _connectGate.WaitAsync(2s) times out, the
  body runs unprotected. A concurrent EnsureBackendConnectedAsync that
  DOES hold the gate may TryAllocate a TxId that collides (after wraparound
  in the allocator's forward scan) with one being released by the channel
  drain. The double-release would mark the new request's slot as free even
  though it's legitimately in-flight, allowing the next allocation to reuse
  the same slot and CorrelationMap.TryAdd to fail (silent request drop).
  Probability is very low (gate timeout AND new accept landing AND TxId
  collision in 65,536-slot space); the only consequence is one dropped
  request the client retries. Documented inline at PlcMultiplexer.cs near
  the gateHeld declaration as accepted best-effort behaviour.

W5/m3 — CountInFlight allocates a CounterSnapshot record per supervisor
  Trivial (~5 KB on a 54-PLC fleet, called once per shutdown). Skipped per
  re-review verdict.

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-14 07:13:47 -04:00
parent 9251c564c1
commit 59d0b5deb9
4 changed files with 69 additions and 12 deletions
@@ -6,9 +6,9 @@ Re-review of the codebase after the six-commit remediation of the original 2026-
## Status
> **All actionable findings resolved.** Wave 4 (`7a43595`) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (this commit) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. Remaining items are **Accepted** with rationale (explained below) — they're deliberate trade-offs or doc-only nuances that don't warrant code changes.
> **All actionable findings resolved across two re-review passes.** Wave 4 (`7a43595`) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (`9251c56`) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. A third focused pass surfaced one more major (W5/M1), two minors (W5/m1, W5/m2), and one cosmetic (W5/m3); Wave 5 (this commit) resolved M1 + m2, documented m1 as accepted best-effort, and skipped m3.
>
> **Final test count:** 387 pass / 0 fail. Race tests stable across 3 isolated runs.
> **Final test count:** 387 pass / 0 fail.
## Headline
@@ -43,6 +43,25 @@ The remediation was structurally sound. The re-review found:
**Resolved: 13/18. Accepted: 5/18.**
## Third pass — final findings (Wave 5)
A third focused review pass on the post-W4-followup state turned up these additional items:
| ID | Severity | Finding | Status | Commit |
|----|----------|---------|--------|--------|
| **W5/M1** | Major | `AdminEndpointHost` `OnChange` callback can resurrect a Kestrel app after `StopAsync` returned (no `_disposed` check inside the fire-and-forget Task.Run lambda) | ✅ **Resolved** | (W5) |
| **W5/m1** | Minor | `TearDownBackendAsync` gate-not-held path: a concurrent freshly-allocated TxId can collide with one being released by the channel drain → silent request drop. Probability very low (gate timeout AND new accept AND TxId collision in 65,536-slot space). | ⚪ **Accepted** | (W5 — inline doc comment in `PlcMultiplexer.cs`) |
| **W5/m2** | Minor | `inFlightAtCancel` was computed AFTER `base.StopAsync`, so it reported in-flight at compute time rather than at signal time and could undercount what the field name promises | ✅ **Resolved** | (W5) |
| **W5/m3** | Cosmetic | `CountInFlight` allocates a 35-field `CounterSnapshot` record per supervisor on shutdown | ⚪ **Accepted** (skip) | — |
**W5/M1 fix detail.** Added `if (_disposed) return;` at the top of the `OnChange` lambda AND inside the queued `Task.Run`, plus `try/catch (ObjectDisposedException)` around `_lock.WaitAsync` and `_lock.Release()` so a hot-reload of `AdminPort` during shutdown can no longer resurrect a fresh Kestrel WebApplication on the new port after the host considered admin shut down.
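The guard pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual `AdminEndpointHost` code: the class name `ReloadGuardSketch`, the `OnPortChanged` method, and the `RebindCount` counter are hypothetical stand-ins, and the real handler discards the task (`_ = Task.Run(...)`) rather than returning it. The task is returned here only so the behaviour can be observed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for AdminEndpointHost; only the guard pattern is shown.
public sealed class ReloadGuardSketch : IAsyncDisposable
{
    private readonly SemaphoreSlim _lock = new(1, 1);
    private volatile bool _disposed;

    // Counts how many times the simulated "rebind Kestrel" step actually ran.
    public int RebindCount;

    // Simulates the hot-reload OnChange callback for AdminPort.
    public Task OnPortChanged(int newPort)
    {
        // Guard 1: do not even queue work once disposal has started.
        if (_disposed) return Task.CompletedTask;

        return Task.Run(async () =>
        {
            // Guard 2: disposal may have started after the work was queued.
            if (_disposed) return;

            try
            {
                await _lock.WaitAsync();
            }
            catch (ObjectDisposedException)
            {
                // The lock was disposed between guard 2 and the wait.
                return;
            }

            try
            {
                // The real code would rebind the admin endpoint on newPort here.
                Interlocked.Increment(ref RebindCount);
            }
            finally
            {
                try { _lock.Release(); }
                catch (ObjectDisposedException) { /* disposed while held */ }
            }
        });
    }

    public ValueTask DisposeAsync()
    {
        _disposed = true;
        _lock.Dispose();
        return ValueTask.CompletedTask;
    }
}
```

In this sketch, a call to `OnPortChanged` before `DisposeAsync` increments `RebindCount`, while a call after it returns without touching the lock or rebinding.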
**W5/m2 fix detail.** Moved `int inFlightAtCancel = CountInFlight();` to BEFORE `await base.StopAsync(cancellationToken)`. Now the count actually reflects "in-flight at the moment the host signalled stop" — not "in-flight at the moment we got around to computing it after the cancel propagated."
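The ordering change can be illustrated with a small sketch. The types below (`BaseServiceSketch`, `ProxyServiceSketch`) are hypothetical and only mimic the shape of the fix; the simulated base `StopAsync` drains the in-flight count to stand in for responses arriving while cancellation propagates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical base type standing in for the hosted-service base class.
public class BaseServiceSketch
{
    protected int InFlight = 3;

    public virtual Task StopAsync(CancellationToken cancellationToken)
    {
        // Simulate in-flight requests completing while the stop propagates.
        InFlight = 0;
        return Task.CompletedTask;
    }
}

public sealed class ProxyServiceSketch : BaseServiceSketch
{
    // Value that would be reported in the shutdown log line.
    public int InFlightAtCancel { get; private set; }

    private int CountInFlight() => InFlight;

    public override async Task StopAsync(CancellationToken cancellationToken)
    {
        // Snapshot FIRST, before any cancellation is propagated.
        int inFlightAtCancel = CountInFlight();

        await base.StopAsync(cancellationToken);

        InFlightAtCancel = inFlightAtCancel;
    }
}
```

With the snapshot taken first, `InFlightAtCancel` holds the count from signal time (3 in this sketch); taking it after `base.StopAsync` would report the drained value instead.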
**W5/m1 acceptance.** Documented inline at `PlcMultiplexer.cs:TearDownBackendAsync` near the `gateHeld` flag declaration. The race requires three coincidences (gate-timeout + new accept landing during cascade + TxId collision); the only consequence is one dropped request that the client retries on its next attempt.
**W5/m3 skip.** Trivial per-PLC allocation (~5 KB on a 54-PLC fleet, called once per shutdown). Optimising it would require exposing a single-field accessor on `ProxyCounters`; not worth the surface change.
---
## Resolved findings — what landed
@@ -126,4 +145,4 @@ The original re-review listed the following as verified clean by inspection duri
## Closed
The 2026-05-14 review series — original review → 4 remediation waves → re-review → wave-4 → wave-4-followup — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.
The 2026-05-14 review series — original review → 4 remediation waves → first re-review → wave 4 + followup → second re-review → wave 5 — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.