Files
wwtools/mbproxy/codereviews/2026-05-14/ReReviewAfterRemediation.md
T
Joseph Doherty 7a435957ee mbproxy: Wave 4 — fix issues introduced by the Wave-1/2 fixes
Closes the new findings from the post-remediation re-review
(codereviews/2026-05-14/ReReviewAfterRemediation.md):

NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero
  Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was
  meant to replace. Supervisor.StopAsync nulls the per-mux counter
  provider before the drain loop runs, so CountInFlight always returns 0
  and the drain budget is never spent on actual draining. Fix: snapshot
  the in-flight count BEFORE supervisor stop, drop the theatrical
  post-stop loop, and report InFlightAtCancel as the snapshot count
  (= the number of in-flight requests dropped by the stop). The
  supervisor stop IS the drain — there is nothing to drain that
  wouldn't be killed by the stop itself.

NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable
  Without a token, a long Polly-wrapped EnsureBackendConnectedAsync
  against an unreachable host could hold the gate for the full
  BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync
  (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound
  the wait with a 2 s teardown deadline; on timeout proceed
  best-effort without the gate. Worst-case consequence is one orphaned
  in-flight cycle on the dying backend, surfaced to upstream as
  exception 0x0B by the watchdog.

NM2 — ReplaceContext non-atomic ctx + provider swap
  Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If
  `_ctx` was swapped first, a snapshot taken in the gap would still hold
  the OLD adapter wrapping the OLD cache — which the supervisor disposes
  immediately after we return. Fix: set the provider FIRST, then swap
  `_ctx`. Snapshots in the swap window now read either (old, old) or
  (new, new), never (old-after-disposed).

NM5 — Self-cascade ObjectDisposedException after dispose
  Writer/reader fault catches fired `_ = TearDownBackendAsync(...)`
  unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the
  fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as
  an unobserved Task exception. Fix: skip self-cascade when
  `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit
  TearDown anyway.

Nm1 — Saturation cleanup uses await SendResponseAsync
  W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync,
  which would serialise on a wedged late-attacher's full bounded channel
  and stall delivery to its peers — contradicting the W1.3 doctrine that
  the fan-out path must never await per-pipe writes. Fix: use
  TrySendResponse and increment ResponseDropForFullUpstream on drop.

T2 — WatchdogVsResponse_Race seeded Random fragility
  Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms;
  Random's algorithm is implementation-defined across .NET major versions
  (legacy → Xoshiro128 in .NET 6) so a runtime upgrade could land all
  samples on one side of the deadline and break the "both branches must
  fire" assertion. Fix: deterministic counter-based alternation (15 fast
  + 15 slow across 30 iterations) — guaranteed by construction.

Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS
single-shot semantics) are unfixed: no caller actually re-Starts a
supervisor today; both become real only if the reconciler ever changes
to re-Start instead of dispose-and-rebuild. Documented in the re-review.

Tests: 387 pass / 0 fail. Three back-to-back race-test runs in
isolation all green (T2 alternation is deterministic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:52:33 -04:00

127 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Re-Review After Remediation — 2026-05-14
Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → `ce32c5c`, Wave 2 → `e66b17f`, Wave 3 → `7ead358`, the easy 5 → `2545237`, the race-hard 5 → `53f842a`). Conducted via three parallel area-focused passes. **Eyes on what the fixes themselves introduced**, not what the original review already found.
**Scope:** every src/ and tests/ change in `53a7111..HEAD` (37 files, ~+2000/700 lines).
## Headline
The remediation is structurally sound. **One new critical finding** — the W1.5 drain loop is structurally always-zero — inherited the very bug it was meant to replace. **Five new major findings** cluster in two areas:
- **W1.4 cascade gating** introduced two disposal-deadlock and exception-swallowing paths around the new `_connectGate.WaitAsync()` call inside `TearDownBackendAsync`.
- **W1.1 + W1.5 lifecycle ordering** introduced a non-atomic context swap (snapshot reads can see a disposed cache during reseat) and silently single-shot supervisor semantics from the W2.15 + W2.16 combination.
Plus a handful of test-discipline gaps from the new race-hard tests (reflection on private field names, RNG seed fragility across runtime versions).
## New critical findings (1)
### NC1. `ProxyWorker.StopAsync` drain loop is structurally always-zero — inherited the original `ShutdownCoordinator` bug it replaced
**File:** `src/Mbproxy/Proxy/ProxyWorker.cs:276-310` (introduced by W1.5)
**Path of execution:**
1. `await Task.WhenAll(stopTasks)` — each `PlcListenerSupervisor.StopAsync` cancels its inner CTS, the listener's `RunAsync` exits, the OperationCanceledException catch calls `await listener.DisposeAsync()`, which calls `_multiplexer.DisposeAsync()`, which calls `_ctx.Counters.SetMultiplexProvider(null)` (`PlcMultiplexer.cs:224`).
2. *Then* the drain loop begins. `CountInFlight` reads `supervisor.CurrentCounters.Snapshot().InFlightCount`, and `ProxyCounters.Snapshot()` (`ProxyCounters.cs:403-404`) returns `provider?.InFlightCount ?? 0`. The provider is now `null`. So `total == 0` on the first iteration and the loop exits immediately.
The drain budget (`gracefulMs`) is never spent on actual draining. `inFlightAtCancel` will always log as `0`. The `mbproxy.shutdown.complete` event under load will report `InFlightAtCancel=0` even when in-flight requests were forcibly cancelled.
**This is the same defect the original review (`AdminAndDiagnostics.md:C1`) called out in the deleted `ShutdownCoordinator`.** The W1.5 relocation faithfully copied the broken sequencing.
**Fix options:**
- (a) Snapshot in-flight counts BEFORE calling `supervisors.StopAsync`, then move drain-with-real-multiplexers into a phase BEFORE supervisor stop (i.e. stop accepting new connections, drain in-flight, then stop supervisors).
- (b) Accept that supervisor stop *is* the drain (matches today's behavior) and remove the theatrical loop + `InFlightAtCancel` field from the log event.
## New major findings (5)
### NM1. `TearDownBackendAsync` acquires `_connectGate` with no token — disposal can be blocked indefinitely
**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:333` (introduced by W1.4)
`await _connectGate.WaitAsync().ConfigureAwait(false)` does not honour `_disposeCts.Token`. The dispose-driven teardown calls this after cancelling `_disposeCts`, but if a long Polly-wrapped `EnsureBackendConnectedAsync` is mid-connect against an unreachable host (waiting up to `BackendConnectTimeoutMs`), DisposeAsync blocks here for that duration. This blocks `ProxyWorker.StopAsync` past its drain budget.
**Fix:** pass `_disposeCts.Token` (or a CTS that cancels on dispose) to `WaitAsync`, accept the `OperationCanceledException`, and skip teardown work if disposal already completed.
### NM2. `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can see a disposed cache
**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:173-184` (introduced by W1.1)
Snapshot path (`ProxyCounters.Snapshot()` called from the admin endpoint) reads `_cacheStatsProvider` independently of `_ctx`. Sequence:
1. Thread A enters `ReplaceContext`, writes `_ctx = newContext` (line 177).
2. Thread B (admin) snapshots, reads `_cacheStatsProvider` (still the OLD adapter wrapping the OLD cache).
3. Thread A runs `SetCacheStatsProvider(newCacheAdapter)`.
4. Supervisor's `ReplaceContextAsync` proceeds to `oldCache.Clear()` then `oldCache.Dispose()`.
A snapshot during the swap reports `cacheEntryCount`/`cacheBytes` from a cache the live multiplexer no longer references. Worse, if the snapshot races past step 4, the stats adapter holds a disposed `ResponseCache` and `Count`/`ApproximateBytes` may throw. **Same shape of race as the original review's "snapshot inconsistency" pattern**, re-opened on a new path by the W1.1 fix.
**Fix:** swap the order (provider first, then `_ctx`) inside a short lock, OR have `ResponseCache.Count`/`ApproximateBytes` no-op on a disposed instance.
### NM3. `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard
**File:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:140, 437-446`
W2.16's guard prevents re-Start while busy, but it does not address the leaking previous CTS the comment claims to fix. `StopAsync` sets `_state = Stopped` first, then awaits `_supervisorTask`. The previous `_supervisorCts` is disposed only in `DisposeAsync`, not in `StopAsync`. A re-Start pattern (Stop → Start without Dispose) leaks one CTS per cycle.
Today no caller actually re-Starts a supervisor (ConfigReconciler's remove path uses `DisposeAsync`), so the leak is **latent**. Worth fixing now or documenting.
### NM4. W2.15 TCS is never re-armed; W2.16 + TCS together make the supervisor effectively single-shot
**File:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:74-75, 132-147`
`_firstAttemptCompleted` is `readonly` and set in the field initialiser. Once `TrySetResult` fires (anywhere in `RunSupervisorAsync` including the `finally`), it stays completed forever. A restarted supervisor would have `WaitForInitialBindAttemptAsync` return immediately on the *previous* run's signal, regardless of the new run's bind status. Combined with the comment claiming "the supervisor's state machine has exactly one Start", the intent appears to be single-shot — but neither the guard nor the TCS enforces it cleanly.
Today no caller re-Starts so this is **latent**. Either re-arm the TCS in `RunSupervisorAsync` or have W2.16 refuse re-Start unconditionally.
### NM5. Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal
**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:451, 648` (introduced by W1.4 + W2.5)
After `DisposeAsync` runs `_connectGate.Dispose()`, any in-flight self-cascade (`_ = TearDownBackendAsync(...)` from writer/reader fault paths) reaches `WaitAsync()` and throws `ObjectDisposedException`. The `try/finally` will then run `Release()` on a disposed semaphore in the `finally`, also throwing. Both throws happen on a fire-and-forget Task — they become unobserved exceptions on TPL's UnobservedTaskException event. Noisy but not fatal.
**Fix:** wrap the body in `try { … } catch (ObjectDisposedException) { }` or short-circuit on `_disposed` before re-entering teardown.
## New minor findings (8)
| # | File:line | Finding |
|---|-----------|---------|
| Nm1 | `PlcMultiplexer.cs:842` (W1.2 cleanup) | Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe. A single wedged late-attacher stalls saturation delivery to all others — contradicts the W1.3 doctrine. Use `TrySendResponse` here too. |
| Nm2 | `PlcMultiplexer.cs:810` | W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04. Counter shows hits that produced no real coalesced response. Document or decrement. |
| Nm3 | `ProxyWorker.cs:269-310` | Both supervisor-stop and (theatrical) drain share `gracefulMs`. Worst-case shutdown is `2 * gracefulMs + 2s admin` — operators expect single budget. |
| Nm4 | `PlcListenerSupervisor.cs:433-446` | `finally` writes `_state = Stopped` directly, preserving `_lastBindError` from a prior fault. After clean cancellation operators see "last bind error" that was never the cause of stopping. Doc the field as "last fault while running" or clear on clean exit. |
| Nm5 | `EventLogBridge.cs:35-45` | W2.23 cached `_sourceExists` is correct, but no startup log confirms armed state. If `install.ps1` registers source after service starts, every Error+ event is silently dropped until restart. Add a one-line startup log. |
| Nm6 | `ProxyWorker.cs:224-235` | W1.5 lazy `_admin = _services.GetService<AdminEndpointHost>()` returns `null` silently if `AddMbproxyAdmin()` is forgotten. Previous `IHostedService` registration would have errored loudly during host build. Use `GetRequiredService<>` or log a warning. |
| Nm7 | `AdminEndpointHost.cs:200-210` | `DisposeAsync` lacks a `_disposed` guard; `ProxyWorker.StopAsync` calls `StopAsync` then DI disposes again. Operations are idempotent today but symmetry with `PlcMultiplexer` would be cleaner. |
| Nm8 | `PlcMultiplexer.cs:415` | `TearDownBackendAsync` log `BackendDisconnected` event fires per cascade; queued cascades on the gate each log their own event with mostly-zero counts. Cosmetic noise. |
## Test-discipline findings (4)
| # | File:line | Finding |
|---|-----------|---------|
| T1 | `PlcMultiplexerTests.cs:1152-1165`, `SupervisorTests.cs:1779-1789` | `DrainAllocator` and the runtime-fault test reflect on private field names (`_allocator`, `_currentListener`, `_listener`). A rename refactor breaks them at run-time, not compile-time. Recommend `[InternalsVisibleTo]` test seam OR comments on each reflected field warning that tests depend on the name. |
| T2 | `PlcMultiplexerTests.cs:1379-1380` (W3 #8) | `WatchdogVsResponse_Race` uses `new Random(12345)` over `[350, 450)` ms with watchdog at 400 ms; asserts BOTH branches hit. `Random` with a seed is implementation-defined across major .NET versions (legacy → Xoshiro128 in .NET 6). On a runtime upgrade the seed could land all 30 samples on one side and the test breaks. Replace with deterministic alternation: `int delay = (i & 1) == 0 ? 350 : 450;`. |
| T3 | `Mbproxy.Tests.csproj:38-40` (W2.21) | `RemoveInheritedAppsettings` Target uses `AfterTargets="Build"` and deletes from `$(OutputPath)`. Won't fire on `dotnet publish`. Test projects rarely publish but worth `AfterTargets="Build;Publish"` + a second Delete against `$(PublishDir)appsettings.json`. |
| T4 | `InFlightByKeyMapTests.cs:46, 75, 112` | After W3 removed the `bool` return, test method names still read `TryAttachOrCreate_..._ReturnsTrue_WasNewTrue`. Decorative but misleading for grep. |
## Verified clean (sampled, not exhaustive)
- **W2.3 ConcurrentDictionary migration** on `_supervisors` — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name).
- **W2.1 coalescingAccessor propagation** — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
- **W2.13 OOR check** — multiplication is bounded by the guard; even worst-case `9999 * 10_000 + 9999 = 99_989_999` fits in int32 without overflow.
- **W2.14 byteCount validation** — strict `<` check passes a perfectly-sized PDU; trailing-byte case correct.
- **W2.10 resolved-TTL re-check** — `BcdTagMapBuilder.Build` is called exactly once per PLC at validation; no duplicate work.
- **W2.18 ConnectionOptions validation** — both `MbproxyOptionsValidator` and `ReloadValidator` reject `<= 0`; no bypass path.
- **W3 `HasBadNibble` dedupe** — clean; the codec's internal helper is the single source of truth.
- **W2.15 TCS signalled in every exit path** of `RunSupervisorAsync` (bind-success, bind-failure-into-recovery, run-failure-into-recovery, finally) — no hang on `WaitForInitialBindAttemptAsync` for the first run.
- **W2.17 TransitionTo lock contract** — both writers use it; `Snapshot` reads under the same lock; no torn triples.
- **`TxIdAllocator.Release` double-call is benign** (`TxIdAllocator.cs:121-129` checks `if (_inUse[id])`); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.
- **W1.1 in-PDU snapshot consistency** — `OnUpstreamFrameAsync` reads `_ctx.Cache` and `_ctx.TagMap.ResolveCacheTtlMs` non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream `WithCurrentRequest` snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent.
- **W2.7 cache-FC byte sourced from post-rewriter buffer** — correct; the rewriter never touches the FC byte but the source must remain `frame[…]` to capture the exception bit.
## Recommended next actions
If you want to keep the issue tracker tidy, these are the items that should land as a small follow-up commit:
1. **NC1** — fix or remove the theatrical drain loop. Smallest change: delete the loop and the `InFlightAtCancel` field from the log.
2. **NM1** — pass `_disposeCts.Token` to `_connectGate.WaitAsync` in `TearDownBackendAsync`.
3. **NM2** — wrap the `_ctx` write + provider re-registration in a single short lock, or null-guard the cache adapter against disposed-cache reads.
4. **NM5** — wrap the self-cascade body in `try { … } catch (ObjectDisposedException) { }`.
5. **Nm1** — use `TrySendResponse` in the W1.2 saturation cleanup loop to avoid wedge-by-wedged-attacher.
6. **T2** — replace the seeded `Random` in `WatchdogVsResponse_Race` with deterministic alternation.
NM3 and NM4 are latent (no current caller exercises supervisor re-Start). They become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Defer or document.
Everything else is documentation / cosmetic / minor cleanup.