Files
wwtools/mbproxy/codereviews/2026-05-14/ReReviewAfterRemediation.md
T
Joseph Doherty 7a435957ee mbproxy: Wave 4 — fix issues introduced by the Wave-1/2 fixes
Closes the new findings from the post-remediation re-review
(codereviews/2026-05-14/ReReviewAfterRemediation.md):

NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero
  Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was
  meant to replace. Supervisor.StopAsync nulls the per-mux counter
  provider before the drain loop runs, so CountInFlight always returns 0
  and the drain budget is never spent on actual draining. Fix: snapshot
  the in-flight count BEFORE supervisor stop, drop the theatrical
  post-stop loop, and report InFlightAtCancel as the snapshot count
  (= the number of in-flight requests dropped by the stop). The
  supervisor stop IS the drain — there is nothing to drain that
  wouldn't be killed by the stop itself.

NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable
  Without a token, a long Polly-wrapped EnsureBackendConnectedAsync
  against an unreachable host could hold the gate for the full
  BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync
  (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound
  the wait with a 2 s teardown deadline; on timeout proceed
  best-effort without the gate. Worst-case consequence is one orphaned
  in-flight cycle on the dying backend, surfaced to upstream as
  exception 0x0B by the watchdog.

NM2 — ReplaceContext non-atomic ctx + provider swap
  Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If
  `_ctx` was swapped first, a snapshot taken in the gap would still hold
  the OLD adapter wrapping the OLD cache — which the supervisor disposes
  immediately after we return. Fix: set the provider FIRST, then swap
  `_ctx`. Snapshots in the swap window now read either (old, old) or
  (new, new), never (old-after-disposed).

NM5 — Self-cascade ObjectDisposedException after dispose
  Writer/reader fault catches fired `_ = TearDownBackendAsync(...)`
  unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the
  fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as
  an unobserved Task exception. Fix: skip self-cascade when
  `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit
  TearDown anyway.

Nm1 — Saturation cleanup uses await SendResponseAsync
  W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync,
  which would serialise on a wedged late-attacher's full bounded channel
  and stall delivery to its peers — contradicting the W1.3 doctrine that
  the fan-out path must never await per-pipe writes. Fix: use
  TrySendResponse and increment ResponseDropForFullUpstream on drop.

T2 — WatchdogVsResponse_Race seeded Random fragility
  Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms;
  Random's algorithm is implementation-defined across .NET major versions
  (legacy → Xoshiro128 in .NET 6) so a runtime upgrade could land all
  samples on one side of the deadline and break the "both branches must
  fire" assertion. Fix: deterministic counter-based alternation (15 fast
  + 15 slow across 30 iterations) — guaranteed by construction.

Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS
single-shot semantics) are unfixed: no caller actually re-Starts a
supervisor today; both become real only if the reconciler ever changes
to re-Start instead of dispose-and-rebuild. Documented in the re-review.

Tests: 387 pass / 0 fail. Three back-to-back race-test runs in
isolation all green (T2 alternation is deterministic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:52:33 -04:00

14 KiB
Raw Blame History

Re-Review After Remediation — 2026-05-14

Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → ce32c5c, Wave 2 → e66b17f, Wave 3 → 7ead358, the easy 5 → 2545237, the race-hard 5 → 53f842a). Conducted via three parallel area-focused passes. Eyes on what the fixes themselves introduced, not what the original review already found.

Scope: every src/ and tests/ change in 53a7111..HEAD (37 files, ~+2000/700 lines).

Headline

The remediation is structurally sound. One new critical finding — the W1.5 drain loop is structurally always-zero — inherited the very bug it was meant to replace. Five new major findings cluster in two areas:

  • W1.4 cascade gating introduced two disposal-deadlock and exception-swallowing paths around the new _connectGate.WaitAsync() call inside TearDownBackendAsync.
  • W1.1 + W1.5 lifecycle ordering introduced a non-atomic context swap (snapshot reads can see a disposed cache during reseat) and silently single-shot supervisor semantics from the W2.15 + W2.16 combination.

Plus a handful of test-discipline gaps from the new race-hard tests (reflection on private field names, RNG seed fragility across runtime versions).

New critical findings (1)

NC1. ProxyWorker.StopAsync drain loop is structurally always-zero — inherited the original ShutdownCoordinator bug it replaced

File: src/Mbproxy/Proxy/ProxyWorker.cs:276-310 (introduced by W1.5)

Path of execution:

  1. await Task.WhenAll(stopTasks) — each PlcListenerSupervisor.StopAsync cancels its inner CTS, the listener's RunAsync exits, the OperationCanceledException catch calls await listener.DisposeAsync(), which calls _multiplexer.DisposeAsync(), which calls _ctx.Counters.SetMultiplexProvider(null) (PlcMultiplexer.cs:224).
  2. Then the drain loop begins. CountInFlight reads supervisor.CurrentCounters.Snapshot().InFlightCount, and ProxyCounters.Snapshot() (ProxyCounters.cs:403-404) returns provider?.InFlightCount ?? 0. The provider is now null. So total == 0 on the first iteration and the loop exits immediately.

The drain budget (gracefulMs) is never spent on actual draining. inFlightAtCancel will always log as 0. The mbproxy.shutdown.complete event under load will report InFlightAtCancel=0 even when in-flight requests were forcibly cancelled.

This is the same defect the original review (AdminAndDiagnostics.md:C1) called out in the deleted ShutdownCoordinator. The W1.5 relocation faithfully copied the broken sequencing.

Fix options:

  • (a) Snapshot in-flight counts BEFORE calling supervisors.StopAsync, then move drain-with-real-multiplexers into a phase BEFORE supervisor stop (i.e. stop accepting new connections, drain in-flight, then stop supervisors).
  • (b) Accept that supervisor stop is the drain (matches today's behavior) and remove the theatrical loop + InFlightAtCancel field from the log event.

New major findings (5)

NM1. TearDownBackendAsync acquires _connectGate with no token — disposal can be blocked indefinitely

File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:333 (introduced by W1.4)

await _connectGate.WaitAsync().ConfigureAwait(false) does not honour _disposeCts.Token. The dispose-driven teardown calls this after cancelling _disposeCts, but if a long Polly-wrapped EnsureBackendConnectedAsync is mid-connect against an unreachable host (waiting up to BackendConnectTimeoutMs), DisposeAsync blocks here for that duration. This blocks ProxyWorker.StopAsync past its drain budget.

Fix: pass _disposeCts.Token (or a CTS that cancels on dispose) to WaitAsync, accept the OperationCanceledException, and skip teardown work if disposal already completed.

NM2. ReplaceContext writes _ctx and re-registers cache stats provider non-atomically — snapshot can see a disposed cache

File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:173-184 (introduced by W1.1)

Snapshot path (ProxyCounters.Snapshot() called from the admin endpoint) reads _cacheStatsProvider independently of _ctx. Sequence:

  1. Thread A enters ReplaceContext, writes _ctx = newContext (line 177).
  2. Thread B (admin) snapshots, reads _cacheStatsProvider (still the OLD adapter wrapping the OLD cache).
  3. Thread A runs SetCacheStatsProvider(newCacheAdapter).
  4. Supervisor's ReplaceContextAsync proceeds to oldCache.Clear() then oldCache.Dispose().

A snapshot during the swap reports cacheEntryCount/cacheBytes from a cache the live multiplexer no longer references. Worse, if the snapshot races past step 4, the stats adapter holds a disposed ResponseCache and Count/ApproximateBytes may throw. Same shape of race as the original review's "snapshot inconsistency" pattern, re-opened on a new path by the W1.1 fix.

Fix: swap the order (provider first, then _ctx) inside a short lock, OR have ResponseCache.Count/ApproximateBytes no-op on a disposed instance.

NM3. _supervisorCts leaks across StartAsync re-entry despite W2.16 guard

File: src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:140, 437-446

W2.16's guard prevents re-Start while busy, but it does not address the leaking previous CTS the comment claims to fix. StopAsync sets _state = Stopped first, then awaits _supervisorTask. The previous _supervisorCts is disposed only in DisposeAsync, not in StopAsync. A re-Start pattern (Stop → Start without Dispose) leaks one CTS per cycle.

Today no caller actually re-Starts a supervisor (ConfigReconciler's remove path uses DisposeAsync), so the leak is latent. Worth fixing now or documenting.

NM4. W2.15 TCS is never re-armed; W2.16 + TCS together make the supervisor effectively single-shot

File: src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:74-75, 132-147

_firstAttemptCompleted is readonly and set in the field initialiser. Once TrySetResult fires (anywhere in RunSupervisorAsync including the finally), it stays completed forever. A restarted supervisor would have WaitForInitialBindAttemptAsync return immediately on the previous run's signal, regardless of the new run's bind status. Combined with the comment claiming "the supervisor's state machine has exactly one Start", the intent appears to be single-shot — but neither the guard nor the TCS enforces it cleanly.

Today no caller re-Starts so this is latent. Either re-arm the TCS in RunSupervisorAsync or have W2.16 refuse re-Start unconditionally.

NM5. Self-cascade swallows ObjectDisposedException from _connectGate after disposal

File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:451, 648 (introduced by W1.4 + W2.5)

After DisposeAsync runs _connectGate.Dispose(), any in-flight self-cascade (_ = TearDownBackendAsync(...) from writer/reader fault paths) reaches WaitAsync() and throws ObjectDisposedException. The try/finally will then run Release() on a disposed semaphore in the finally, also throwing. Both throws happen on a fire-and-forget Task — they become unobserved exceptions on TPL's UnobservedTaskException event. Noisy but not fatal.

Fix: wrap the body in try { … } catch (ObjectDisposedException) { } or short-circuit on _disposed before re-entering teardown.

New minor findings (8)

# File:line Finding
Nm1 PlcMultiplexer.cs:842 (W1.2 cleanup) Saturation cleanup uses await SendResponseAsync (blocking) per attached pipe. A single wedged late-attacher stalls saturation delivery to all others — contradicts the W1.3 doctrine. Use TrySendResponse here too.
Nm2 PlcMultiplexer.cs:810 W1.2 increments CoalescedHit for late attachers that ultimately receive exception 04. Counter shows hits that produced no real coalesced response. Document or decrement.
Nm3 ProxyWorker.cs:269-310 Both supervisor-stop and (theatrical) drain share gracefulMs. Worst-case shutdown is 2 * gracefulMs + 2s admin — operators expect single budget.
Nm4 PlcListenerSupervisor.cs:433-446 finally writes _state = Stopped directly, preserving _lastBindError from a prior fault. After clean cancellation operators see "last bind error" that was never the cause of stopping. Doc the field as "last fault while running" or clear on clean exit.
Nm5 EventLogBridge.cs:35-45 W2.23 cached _sourceExists is correct, but no startup log confirms armed state. If install.ps1 registers source after service starts, every Error+ event is silently dropped until restart. Add a one-line startup log.
Nm6 ProxyWorker.cs:224-235 W1.5 lazy _admin = _services.GetService<AdminEndpointHost>() returns null silently if AddMbproxyAdmin() is forgotten. Previous IHostedService registration would have errored loudly during host build. Use GetRequiredService<> or log a warning.
Nm7 AdminEndpointHost.cs:200-210 DisposeAsync lacks a _disposed guard; ProxyWorker.StopAsync calls StopAsync then DI disposes again. Operations are idempotent today but symmetry with PlcMultiplexer would be cleaner.
Nm8 PlcMultiplexer.cs:415 TearDownBackendAsync log BackendDisconnected event fires per cascade; queued cascades on the gate each log their own event with mostly-zero counts. Cosmetic noise.

Test-discipline findings (4)

# File:line Finding
T1 PlcMultiplexerTests.cs:1152-1165, SupervisorTests.cs:1779-1789 DrainAllocator and the runtime-fault test reflect on private field names (_allocator, _currentListener, _listener). A rename refactor breaks them at run-time, not compile-time. Recommend [InternalsVisibleTo] test seam OR comments on each reflected field warning that tests depend on the name.
T2 PlcMultiplexerTests.cs:1379-1380 (W3 #8) WatchdogVsResponse_Race uses new Random(12345) over [350, 450) ms with watchdog at 400 ms; asserts BOTH branches hit. Random with a seed is implementation-defined across major .NET versions (legacy → Xoshiro128 in .NET 6). On a runtime upgrade the seed could land all 30 samples on one side and the test breaks. Replace with deterministic alternation: int delay = (i & 1) == 0 ? 350 : 450;.
T3 Mbproxy.Tests.csproj:38-40 (W2.21) RemoveInheritedAppsettings Target uses AfterTargets="Build" and deletes from $(OutputPath). Won't fire on dotnet publish. Test projects rarely publish but worth AfterTargets="Build;Publish" + a second Delete against $(PublishDir)appsettings.json.
T4 InFlightByKeyMapTests.cs:46, 75, 112 After W3 removed the bool return, test method names still read TryAttachOrCreate_..._ReturnsTrue_WasNewTrue. Decorative but misleading for grep.

Verified clean (sampled, not exhaustive)

  • W2.3 ConcurrentDictionary migration on _supervisors — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name).
  • W2.1 coalescingAccessor propagation — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
  • W2.13 OOR check — multiplication is bounded by the guard; even worst-case 9999 * 10_000 + 9999 = 99_989_999 fits in int32 without overflow.
  • W2.14 byteCount validation — strict < check passes a perfectly-sized PDU; trailing-byte case correct.
  • W2.10 resolved-TTL re-checkBcdTagMapBuilder.Build is called exactly once per PLC at validation; no duplicate work.
  • W2.18 ConnectionOptions validation — both MbproxyOptionsValidator and ReloadValidator reject <= 0; no bypass path.
  • W3 HasBadNibble dedupe — clean; the codec's internal helper is the single source of truth.
  • W2.15 TCS signalled in every exit path of RunSupervisorAsync (bind-success, bind-failure-into-recovery, run-failure-into-recovery, finally) — no hang on WaitForInitialBindAttemptAsync for the first run.
  • W2.17 TransitionTo lock contract — both writers use it; Snapshot reads under the same lock; no torn triples.
  • TxIdAllocator.Release double-call is benign (TxIdAllocator.cs:121-129 checks if (_inUse[id])); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.
  • W1.1 in-PDU snapshot consistencyOnUpstreamFrameAsync reads _ctx.Cache and _ctx.TagMap.ResolveCacheTtlMs non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream WithCurrentRequest snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent.
  • W2.7 cache-FC byte sourced from post-rewriter buffer — correct; the rewriter never touches the FC byte but the source must remain frame[…] to capture the exception bit.

If you want to keep the issue tracker tidy, these are the items that should land as a small follow-up commit:

  1. NC1 — fix or remove the theatrical drain loop. Smallest change: delete the loop and the InFlightAtCancel field from the log.
  2. NM1 — pass _disposeCts.Token to _connectGate.WaitAsync in TearDownBackendAsync.
  3. NM2 — wrap the _ctx write + provider re-registration in a single short lock, or null-guard the cache adapter against disposed-cache reads.
  4. NM5 — wrap the self-cascade body in try { … } catch (ObjectDisposedException) { }.
  5. Nm1 — use TrySendResponse in the W1.2 saturation cleanup loop to avoid wedge-by-wedged-attacher.
  6. T2 — replace the seeded Random in WatchdogVsResponse_Race with deterministic alternation.

NM3 and NM4 are latent (no current caller exercises supervisor re-Start). They become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Defer or document.

Everything else is documentation / cosmetic / minor cleanup.