Closes the new findings from the post-remediation re-review (codereviews/2026-05-14/ReReviewAfterRemediation.md): NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was meant to replace. Supervisor.StopAsync nulls the per-mux counter provider before the drain loop runs, so CountInFlight always returns 0 and the drain budget is never spent on actual draining. Fix: snapshot the in-flight count BEFORE supervisor stop, drop the theatrical post-stop loop, and report InFlightAtCancel as the snapshot count (= the number of in-flight requests dropped by the stop). The supervisor stop IS the drain — there is nothing to drain that wouldn't be killed by the stop itself. NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable Without a token, a long Polly-wrapped EnsureBackendConnectedAsync against an unreachable host could hold the gate for the full BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound the wait with a 2 s teardown deadline; on timeout proceed best-effort without the gate. Worst-case consequence is one orphaned in-flight cycle on the dying backend, surfaced to upstream as exception 0x0B by the watchdog. NM2 — ReplaceContext non-atomic ctx + provider swap Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If `_ctx` was swapped first, a snapshot taken in the gap would still hold the OLD adapter wrapping the OLD cache — which the supervisor disposes immediately after we return. Fix: set the provider FIRST, then swap `_ctx`. Snapshots in the swap window now read either (old, old) or (new, new), never (old-after-disposed). NM5 — Self-cascade ObjectDisposedException after dispose Writer/reader fault catches fired `_ = TearDownBackendAsync(...)` unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as an unobserved Task exception. Fix: skip self-cascade when `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit TearDown anyway. Nm1 — Saturation cleanup uses await SendResponseAsync W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync, which would serialise on a wedged late-attacher's full bounded channel and stall delivery to its peers — contradicting the W1.3 doctrine that the fan-out path must never await per-pipe writes. Fix: use TrySendResponse and increment ResponseDropForFullUpstream on drop. T2 — WatchdogVsResponse_Race seeded Random fragility Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms; Random's algorithm is implementation-defined across .NET major versions (legacy → Xoshiro128 in .NET 6) so a runtime upgrade could land all samples on one side of the deadline and break the "both branches must fire" assertion. Fix: deterministic counter-based alternation (15 fast + 15 slow across 30 iterations) — guaranteed by construction. Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS single-shot semantics) are unfixed: no caller actually re-Starts a supervisor today; both become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Documented in the re-review. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green (T2 alternation is deterministic). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 KiB
Re-Review After Remediation — 2026-05-14
Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → ce32c5c, Wave 2 → e66b17f, Wave 3 → 7ead358, the easy 5 → 2545237, the race-hard 5 → 53f842a). Conducted via three parallel area-focused passes. Eyes on what the fixes themselves introduced, not what the original review already found.
Scope: every src/ and tests/ change in 53a7111..HEAD (37 files, ~+2000/−700 lines).
Headline
The remediation is structurally sound. One new critical finding — the W1.5 drain loop is structurally always-zero — inherited the very bug it was meant to replace. Five new major findings cluster in two areas:
- W1.4 cascade gating introduced two disposal-deadlock and exception-swallowing paths around the new
_connectGate.WaitAsync()call insideTearDownBackendAsync. - W1.1 + W1.5 lifecycle ordering introduced a non-atomic context swap (snapshot reads can see a disposed cache during reseat) and silently single-shot supervisor semantics from the W2.15 + W2.16 combination.
Plus a handful of test-discipline gaps from the new race-hard tests (reflection on private field names, RNG seed fragility across runtime versions).
New critical findings (1)
NC1. ProxyWorker.StopAsync drain loop is structurally always-zero — inherited the original ShutdownCoordinator bug it replaced
File: src/Mbproxy/Proxy/ProxyWorker.cs:276-310 (introduced by W1.5)
Path of execution:
await Task.WhenAll(stopTasks)— eachPlcListenerSupervisor.StopAsynccancels its inner CTS, the listener'sRunAsyncexits, the OperationCanceledException catch callsawait listener.DisposeAsync(), which calls_multiplexer.DisposeAsync(), which calls_ctx.Counters.SetMultiplexProvider(null)(PlcMultiplexer.cs:224).- Then the drain loop begins.
CountInFlightreadssupervisor.CurrentCounters.Snapshot().InFlightCount, andProxyCounters.Snapshot()(ProxyCounters.cs:403-404) returnsprovider?.InFlightCount ?? 0. The provider is nownull. Sototal == 0on the first iteration and the loop exits immediately.
The drain budget (gracefulMs) is never spent on actual draining. inFlightAtCancel will always log as 0. The mbproxy.shutdown.complete event under load will report InFlightAtCancel=0 even when in-flight requests were forcibly cancelled.
This is the same defect the original review (AdminAndDiagnostics.md:C1) called out in the deleted ShutdownCoordinator. The W1.5 relocation faithfully copied the broken sequencing.
Fix options:
- (a) Snapshot in-flight counts BEFORE calling
supervisors.StopAsync, then move drain-with-real-multiplexers into a phase BEFORE supervisor stop (i.e. stop accepting new connections, drain in-flight, then stop supervisors). - (b) Accept that supervisor stop is the drain (matches today's behavior) and remove the theatrical loop +
InFlightAtCancelfield from the log event.
New major findings (5)
NM1. TearDownBackendAsync acquires _connectGate with no token — disposal can be blocked indefinitely
File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:333 (introduced by W1.4)
await _connectGate.WaitAsync().ConfigureAwait(false) does not honour _disposeCts.Token. The dispose-driven teardown calls this after cancelling _disposeCts, but if a long Polly-wrapped EnsureBackendConnectedAsync is mid-connect against an unreachable host (waiting up to BackendConnectTimeoutMs), DisposeAsync blocks here for that duration. This blocks ProxyWorker.StopAsync past its drain budget.
Fix: pass _disposeCts.Token (or a CTS that cancels on dispose) to WaitAsync, accept the OperationCanceledException, and skip teardown work if disposal already completed.
NM2. ReplaceContext writes _ctx and re-registers cache stats provider non-atomically — snapshot can see a disposed cache
File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:173-184 (introduced by W1.1)
Snapshot path (ProxyCounters.Snapshot() called from the admin endpoint) reads _cacheStatsProvider independently of _ctx. Sequence:
- Thread A enters
ReplaceContext, writes_ctx = newContext(line 177). - Thread B (admin) snapshots, reads
_cacheStatsProvider(still the OLD adapter wrapping the OLD cache). - Thread A runs
SetCacheStatsProvider(newCacheAdapter). - Supervisor's
ReplaceContextAsyncproceeds tooldCache.Clear()thenoldCache.Dispose().
A snapshot during the swap reports cacheEntryCount/cacheBytes from a cache the live multiplexer no longer references. Worse, if the snapshot races past step 4, the stats adapter holds a disposed ResponseCache and Count/ApproximateBytes may throw. Same shape of race as the original review's "snapshot inconsistency" pattern, re-opened on a new path by the W1.1 fix.
Fix: swap the order (provider first, then _ctx) inside a short lock, OR have ResponseCache.Count/ApproximateBytes no-op on a disposed instance.
NM3. _supervisorCts leaks across StartAsync re-entry despite W2.16 guard
File: src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:140, 437-446
W2.16's guard prevents re-Start while busy, but it does not address the leaking previous CTS the comment claims to fix. StopAsync sets _state = Stopped first, then awaits _supervisorTask. The previous _supervisorCts is disposed only in DisposeAsync, not in StopAsync. A re-Start pattern (Stop → Start without Dispose) leaks one CTS per cycle.
Today no caller actually re-Starts a supervisor (ConfigReconciler's remove path uses DisposeAsync), so the leak is latent. Worth fixing now or documenting.
NM4. W2.15 TCS is never re-armed; W2.16 + TCS together make the supervisor effectively single-shot
File: src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:74-75, 132-147
_firstAttemptCompleted is readonly and set in the field initialiser. Once TrySetResult fires (anywhere in RunSupervisorAsync including the finally), it stays completed forever. A restarted supervisor would have WaitForInitialBindAttemptAsync return immediately on the previous run's signal, regardless of the new run's bind status. Combined with the comment claiming "the supervisor's state machine has exactly one Start", the intent appears to be single-shot — but neither the guard nor the TCS enforces it cleanly.
Today no caller re-Starts so this is latent. Either re-arm the TCS in RunSupervisorAsync or have W2.16 refuse re-Start unconditionally.
NM5. Self-cascade swallows ObjectDisposedException from _connectGate after disposal
File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:451, 648 (introduced by W1.4 + W2.5)
After DisposeAsync runs _connectGate.Dispose(), any in-flight self-cascade (_ = TearDownBackendAsync(...) from writer/reader fault paths) reaches WaitAsync() and throws ObjectDisposedException. The try/finally will then run Release() on a disposed semaphore in the finally, also throwing. Both throws happen on a fire-and-forget Task — they become unobserved exceptions on TPL's UnobservedTaskException event. Noisy but not fatal.
Fix: wrap the body in try { … } catch (ObjectDisposedException) { } or short-circuit on _disposed before re-entering teardown.
New minor findings (8)
| # | File:line | Finding |
|---|---|---|
| Nm1 | PlcMultiplexer.cs:842 (W1.2 cleanup) |
Saturation cleanup uses await SendResponseAsync (blocking) per attached pipe. A single wedged late-attacher stalls saturation delivery to all others — contradicts the W1.3 doctrine. Use TrySendResponse here too. |
| Nm2 | PlcMultiplexer.cs:810 |
W1.2 increments CoalescedHit for late attachers that ultimately receive exception 04. Counter shows hits that produced no real coalesced response. Document or decrement. |
| Nm3 | ProxyWorker.cs:269-310 |
Both supervisor-stop and (theatrical) drain share gracefulMs. Worst-case shutdown is 2 * gracefulMs + 2s admin — operators expect single budget. |
| Nm4 | PlcListenerSupervisor.cs:433-446 |
finally writes _state = Stopped directly, preserving _lastBindError from a prior fault. After clean cancellation operators see "last bind error" that was never the cause of stopping. Doc the field as "last fault while running" or clear on clean exit. |
| Nm5 | EventLogBridge.cs:35-45 |
W2.23 cached _sourceExists is correct, but no startup log confirms armed state. If install.ps1 registers source after service starts, every Error+ event is silently dropped until restart. Add a one-line startup log. |
| Nm6 | ProxyWorker.cs:224-235 |
W1.5 lazy _admin = _services.GetService<AdminEndpointHost>() returns null silently if AddMbproxyAdmin() is forgotten. Previous IHostedService registration would have errored loudly during host build. Use GetRequiredService<> or log a warning. |
| Nm7 | AdminEndpointHost.cs:200-210 |
DisposeAsync lacks a _disposed guard; ProxyWorker.StopAsync calls StopAsync then DI disposes again. Operations are idempotent today but symmetry with PlcMultiplexer would be cleaner. |
| Nm8 | PlcMultiplexer.cs:415 |
TearDownBackendAsync log BackendDisconnected event fires per cascade; queued cascades on the gate each log their own event with mostly-zero counts. Cosmetic noise. |
Test-discipline findings (4)
| # | File:line | Finding |
|---|---|---|
| T1 | PlcMultiplexerTests.cs:1152-1165, SupervisorTests.cs:1779-1789 |
DrainAllocator and the runtime-fault test reflect on private field names (_allocator, _currentListener, _listener). A rename refactor breaks them at run-time, not compile-time. Recommend [InternalsVisibleTo] test seam OR comments on each reflected field warning that tests depend on the name. |
| T2 | PlcMultiplexerTests.cs:1379-1380 (W3 #8) |
WatchdogVsResponse_Race uses new Random(12345) over [350, 450) ms with watchdog at 400 ms; asserts BOTH branches hit. Random with a seed is implementation-defined across major .NET versions (legacy → Xoshiro128 in .NET 6). On a runtime upgrade the seed could land all 30 samples on one side and the test breaks. Replace with deterministic alternation: int delay = (i & 1) == 0 ? 350 : 450;. |
| T3 | Mbproxy.Tests.csproj:38-40 (W2.21) |
RemoveInheritedAppsettings Target uses AfterTargets="Build" and deletes from $(OutputPath). Won't fire on dotnet publish. Test projects rarely publish but worth AfterTargets="Build;Publish" + a second Delete against $(PublishDir)appsettings.json. |
| T4 | InFlightByKeyMapTests.cs:46, 75, 112 |
After W3 removed the bool return, test method names still read TryAttachOrCreate_..._ReturnsTrue_WasNewTrue. Decorative but misleading for grep. |
Verified clean (sampled, not exhaustive)
- W2.3 ConcurrentDictionary migration on
_supervisors— all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name). - W2.1 coalescingAccessor propagation — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
- W2.13 OOR check — multiplication is bounded by the guard; even worst-case
9999 * 10_000 + 9999 = 99_989_999fits in int32 without overflow. - W2.14 byteCount validation — strict
<check passes a perfectly-sized PDU; trailing-byte case correct. - W2.10 resolved-TTL re-check —
BcdTagMapBuilder.Buildis called exactly once per PLC at validation; no duplicate work. - W2.18 ConnectionOptions validation — both
MbproxyOptionsValidatorandReloadValidatorreject<= 0; no bypass path. - W3
HasBadNibblededupe — clean; the codec's internal helper is the single source of truth. - W2.15 TCS signalled in every exit path of
RunSupervisorAsync(bind-success, bind-failure-into-recovery, run-failure-into-recovery, finally) — no hang onWaitForInitialBindAttemptAsyncfor the first run. - W2.17 TransitionTo lock contract — both writers use it;
Snapshotreads under the same lock; no torn triples. TxIdAllocator.Releasedouble-call is benign (TxIdAllocator.cs:121-129checksif (_inUse[id])); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.- W1.1 in-PDU snapshot consistency —
OnUpstreamFrameAsyncreads_ctx.Cacheand_ctx.TagMap.ResolveCacheTtlMsnon-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. DownstreamWithCurrentRequestsnapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent. - W2.7 cache-FC byte sourced from post-rewriter buffer — correct; the rewriter never touches the FC byte but the source must remain
frame[…]to capture the exception bit.
Recommended next actions
If you want to keep the issue tracker tidy, these are the items that should land as a small follow-up commit:
- NC1 — fix or remove the theatrical drain loop. Smallest change: delete the loop and the
InFlightAtCancelfield from the log. - NM1 — pass
_disposeCts.Tokento_connectGate.WaitAsyncinTearDownBackendAsync. - NM2 — wrap the
_ctxwrite + provider re-registration in a single short lock, or null-guard the cache adapter against disposed-cache reads. - NM5 — wrap the self-cascade body in
try { … } catch (ObjectDisposedException) { }. - Nm1 — use
TrySendResponsein the W1.2 saturation cleanup loop to avoid wedge-by-wedged-attacher. - T2 — replace the seeded
RandominWatchdogVsResponse_Racewith deterministic alternation.
NM3 and NM4 are latent (no current caller exercises supervisor re-Start). They become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Defer or document.
Everything else is documentation / cosmetic / minor cleanup.