Files

T

Joseph Doherty 7a435957ee mbproxy: Wave 4 — fix issues introduced by the Wave-1/2 fixes

Closes the new findings from the post-remediation re-review
(codereviews/2026-05-14/ReReviewAfterRemediation.md):

NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero
  Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was
  meant to replace. Supervisor.StopAsync nulls the per-mux counter
  provider before the drain loop runs, so CountInFlight always returns 0
  and the drain budget is never spent on actual draining. Fix: snapshot
  the in-flight count BEFORE supervisor stop, drop the theatrical
  post-stop loop, and report InFlightAtCancel as the snapshot count
  (= the number of in-flight requests dropped by the stop). The
  supervisor stop IS the drain — there is nothing to drain that
  wouldn't be killed by the stop itself.

NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable
  Without a token, a long Polly-wrapped EnsureBackendConnectedAsync
  against an unreachable host could hold the gate for the full
  BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync
  (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound
  the wait with a 2 s teardown deadline; on timeout proceed
  best-effort without the gate. Worst-case consequence is one orphaned
  in-flight cycle on the dying backend, surfaced to upstream as
  exception 0x0B by the watchdog.

NM2 — ReplaceContext non-atomic ctx + provider swap
  Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If
  `_ctx` was swapped first, a snapshot taken in the gap would still hold
  the OLD adapter wrapping the OLD cache — which the supervisor disposes
  immediately after we return. Fix: set the provider FIRST, then swap
  `_ctx`. Snapshots in the swap window now read either (old, old) or
  (new, new), never (old-after-disposed).

NM5 — Self-cascade ObjectDisposedException after dispose
  Writer/reader fault catches fired `_ = TearDownBackendAsync(...)`
  unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the
  fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as
  an unobserved Task exception. Fix: skip self-cascade when
  `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit
  TearDown anyway.

Nm1 — Saturation cleanup uses await SendResponseAsync
  W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync,
  which would serialise on a wedged late-attacher's full bounded channel
  and stall delivery to its peers — contradicting the W1.3 doctrine that
  the fan-out path must never await per-pipe writes. Fix: use
  TrySendResponse and increment ResponseDropForFullUpstream on drop.

T2 — WatchdogVsResponse_Race seeded Random fragility
  Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms;
  Random's algorithm is implementation-defined across .NET major versions
  (legacy → Xoshiro128 in .NET 6) so a runtime upgrade could land all
  samples on one side of the deadline and break the "both branches must
  fire" assertion. Fix: deterministic counter-based alternation (15 fast
  + 15 slow across 30 iterations) — guaranteed by construction.

Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS
single-shot semantics) are unfixed: no caller actually re-Starts a
supervisor today; both become real only if the reconciler ever changes
to re-Start instead of dispose-and-rebuild. Documented in the re-review.

Tests: 387 pass / 0 fail. Three back-to-back race-test runs in
isolation all green (T2 alternation is deterministic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 06:52:33 -04:00

14 KiB

Raw Blame History

Re-Review After Remediation — 2026-05-14

Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → ce32c5c, Wave 2 → e66b17f, Wave 3 → 7ead358, the easy 5 → 2545237, the race-hard 5 → 53f842a). Conducted via three parallel area-focused passes. Eyes on what the fixes themselves introduced, not what the original review already found.

Scope: every src/ and tests/ change in 53a7111..HEAD (37 files, ~+2000/−700 lines).

Headline

The remediation is structurally sound. One new critical finding — the W1.5 drain loop is structurally always-zero — inherited the very bug it was meant to replace. Five new major findings cluster in two areas:

W1.4 cascade gating introduced two disposal-deadlock and exception-swallowing paths around the new _connectGate.WaitAsync() call inside TearDownBackendAsync.
W1.1 + W1.5 lifecycle ordering introduced a non-atomic context swap (snapshot reads can see a disposed cache during reseat) and silently single-shot supervisor semantics from the W2.15 + W2.16 combination.

Plus a handful of test-discipline gaps from the new race-hard tests (reflection on private field names, RNG seed fragility across runtime versions).

New critical findings (1)

NC1. `ProxyWorker.StopAsync` drain loop is structurally always-zero — inherited the original `ShutdownCoordinator` bug it replaced

File: src/Mbproxy/Proxy/ProxyWorker.cs:276-310 (introduced by W1.5)

Path of execution:

await Task.WhenAll(stopTasks) — each PlcListenerSupervisor.StopAsync cancels its inner CTS, the listener's RunAsync exits, the OperationCanceledException catch calls await listener.DisposeAsync(), which calls _multiplexer.DisposeAsync(), which calls _ctx.Counters.SetMultiplexProvider(null) (PlcMultiplexer.cs:224).
Then the drain loop begins. CountInFlight reads supervisor.CurrentCounters.Snapshot().InFlightCount, and ProxyCounters.Snapshot() (ProxyCounters.cs:403-404) returns provider?.InFlightCount ?? 0. The provider is now null. So total == 0 on the first iteration and the loop exits immediately.

The drain budget (gracefulMs) is never spent on actual draining. inFlightAtCancel will always log as 0. The mbproxy.shutdown.complete event under load will report InFlightAtCancel=0 even when in-flight requests were forcibly cancelled.

This is the same defect the original review (AdminAndDiagnostics.md:C1) called out in the deleted ShutdownCoordinator. The W1.5 relocation faithfully copied the broken sequencing.

Fix options:

(a) Snapshot in-flight counts BEFORE calling supervisors.StopAsync, then move drain-with-real-multiplexers into a phase BEFORE supervisor stop (i.e. stop accepting new connections, drain in-flight, then stop supervisors).
(b) Accept that supervisor stop is the drain (matches today's behavior) and remove the theatrical loop + InFlightAtCancel field from the log event.

New major findings (5)

NM1. `TearDownBackendAsync` acquires `_connectGate` with no token — disposal can be blocked indefinitely

File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:333 (introduced by W1.4)

await _connectGate.WaitAsync().ConfigureAwait(false) does not honour _disposeCts.Token. The dispose-driven teardown calls this after cancelling _disposeCts, but if a long Polly-wrapped EnsureBackendConnectedAsync is mid-connect against an unreachable host (waiting up to BackendConnectTimeoutMs), DisposeAsync blocks here for that duration. This blocks ProxyWorker.StopAsync past its drain budget.

Fix: pass _disposeCts.Token (or a CTS that cancels on dispose) to WaitAsync, accept the OperationCanceledException, and skip teardown work if disposal already completed.

NM2. `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can see a disposed cache

File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:173-184 (introduced by W1.1)

Snapshot path (ProxyCounters.Snapshot() called from the admin endpoint) reads _cacheStatsProvider independently of _ctx. Sequence:

Thread A enters ReplaceContext, writes _ctx = newContext (line 177).
Thread B (admin) snapshots, reads _cacheStatsProvider (still the OLD adapter wrapping the OLD cache).
Thread A runs SetCacheStatsProvider(newCacheAdapter).
Supervisor's ReplaceContextAsync proceeds to oldCache.Clear() then oldCache.Dispose().

A snapshot during the swap reports cacheEntryCount/cacheBytes from a cache the live multiplexer no longer references. Worse, if the snapshot races past step 4, the stats adapter holds a disposed ResponseCache and Count/ApproximateBytes may throw. Same shape of race as the original review's "snapshot inconsistency" pattern, re-opened on a new path by the W1.1 fix.

Fix: swap the order (provider first, then _ctx) inside a short lock, OR have ResponseCache.Count/ApproximateBytes no-op on a disposed instance.

NM3. `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard

File: src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:140, 437-446

W2.16's guard prevents re-Start while busy, but it does not address the leaking previous CTS the comment claims to fix. StopAsync sets _state = Stopped first, then awaits _supervisorTask. The previous _supervisorCts is disposed only in DisposeAsync, not in StopAsync. A re-Start pattern (Stop → Start without Dispose) leaks one CTS per cycle.

Today no caller actually re-Starts a supervisor (ConfigReconciler's remove path uses DisposeAsync), so the leak is latent. Worth fixing now or documenting.

NM4. W2.15 TCS is never re-armed; W2.16 + TCS together make the supervisor effectively single-shot

File: src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:74-75, 132-147

_firstAttemptCompleted is readonly and set in the field initialiser. Once TrySetResult fires (anywhere in RunSupervisorAsync including the finally), it stays completed forever. A restarted supervisor would have WaitForInitialBindAttemptAsync return immediately on the previous run's signal, regardless of the new run's bind status. Combined with the comment claiming "the supervisor's state machine has exactly one Start", the intent appears to be single-shot — but neither the guard nor the TCS enforces it cleanly.

Today no caller re-Starts so this is latent. Either re-arm the TCS in RunSupervisorAsync or have W2.16 refuse re-Start unconditionally.

NM5. Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal

File: src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:451, 648 (introduced by W1.4 + W2.5)

After DisposeAsync runs _connectGate.Dispose(), any in-flight self-cascade (_ = TearDownBackendAsync(...) from writer/reader fault paths) reaches WaitAsync() and throws ObjectDisposedException. The try/finally will then run Release() on a disposed semaphore in the finally, also throwing. Both throws happen on a fire-and-forget Task — they become unobserved exceptions on TPL's UnobservedTaskException event. Noisy but not fatal.

Fix: wrap the body in try { … } catch (ObjectDisposedException) { } or short-circuit on _disposed before re-entering teardown.

New minor findings (8)

#	File:line	Finding
Nm1	`PlcMultiplexer.cs:842` (W1.2 cleanup)	Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe. A single wedged late-attacher stalls saturation delivery to all others — contradicts the W1.3 doctrine. Use `TrySendResponse` here too.
Nm2	`PlcMultiplexer.cs:810`	W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04. Counter shows hits that produced no real coalesced response. Document or decrement.
Nm3	`ProxyWorker.cs:269-310`	Both supervisor-stop and (theatrical) drain share `gracefulMs`. Worst-case shutdown is `2 * gracefulMs + 2s admin` — operators expect single budget.
Nm4	`PlcListenerSupervisor.cs:433-446`	`finally` writes `_state = Stopped` directly, preserving `_lastBindError` from a prior fault. After clean cancellation operators see "last bind error" that was never the cause of stopping. Doc the field as "last fault while running" or clear on clean exit.
Nm5	`EventLogBridge.cs:35-45`	W2.23 cached `_sourceExists` is correct, but no startup log confirms armed state. If `install.ps1` registers source after service starts, every Error+ event is silently dropped until restart. Add a one-line startup log.
Nm6	`ProxyWorker.cs:224-235`	W1.5 lazy `_admin = _services.GetService<AdminEndpointHost>()` returns `null` silently if `AddMbproxyAdmin()` is forgotten. Previous `IHostedService` registration would have errored loudly during host build. Use `GetRequiredService<>` or log a warning.
Nm7	`AdminEndpointHost.cs:200-210`	`DisposeAsync` lacks a `_disposed` guard; `ProxyWorker.StopAsync` calls `StopAsync` then DI disposes again. Operations are idempotent today but symmetry with `PlcMultiplexer` would be cleaner.
Nm8	`PlcMultiplexer.cs:415`	`TearDownBackendAsync` log `BackendDisconnected` event fires per cascade; queued cascades on the gate each log their own event with mostly-zero counts. Cosmetic noise.

Test-discipline findings (4)

#	File:line	Finding
T1	`PlcMultiplexerTests.cs:1152-1165`, `SupervisorTests.cs:1779-1789`	`DrainAllocator` and the runtime-fault test reflect on private field names (`_allocator`, `_currentListener`, `_listener`). A rename refactor breaks them at run-time, not compile-time. Recommend `[InternalsVisibleTo]` test seam OR comments on each reflected field warning that tests depend on the name.
T2	`PlcMultiplexerTests.cs:1379-1380` (W3 #8)	`WatchdogVsResponse_Race` uses `new Random(12345)` over `[350, 450)` ms with watchdog at 400 ms; asserts BOTH branches hit. `Random` with a seed is implementation-defined across major .NET versions (legacy → Xoshiro128 in .NET 6). On a runtime upgrade the seed could land all 30 samples on one side and the test breaks. Replace with deterministic alternation: `int delay = (i & 1) == 0 ? 350 : 450;`.
T3	`Mbproxy.Tests.csproj:38-40` (W2.21)	`RemoveInheritedAppsettings` Target uses `AfterTargets="Build"` and deletes from `$(OutputPath)`. Won't fire on `dotnet publish`. Test projects rarely publish but worth `AfterTargets="Build;Publish"` + a second Delete against `$(PublishDir)appsettings.json`.
T4	`InFlightByKeyMapTests.cs:46, 75, 112`	After W3 removed the `bool` return, test method names still read `TryAttachOrCreate_..._ReturnsTrue_WasNewTrue`. Decorative but misleading for grep.

Verified clean (sampled, not exhaustive)

W2.3 ConcurrentDictionary migration on _supervisors — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name).
W2.1 coalescingAccessor propagation — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
W2.13 OOR check — multiplication is bounded by the guard; even worst-case 9999 * 10_000 + 9999 = 99_989_999 fits in int32 without overflow.
W2.14 byteCount validation — strict < check passes a perfectly-sized PDU; trailing-byte case correct.
W2.10 resolved-TTL re-check — BcdTagMapBuilder.Build is called exactly once per PLC at validation; no duplicate work.
W2.18 ConnectionOptions validation — both MbproxyOptionsValidator and ReloadValidator reject <= 0; no bypass path.
W3 HasBadNibble dedupe — clean; the codec's internal helper is the single source of truth.
W2.15 TCS signalled in every exit path of RunSupervisorAsync (bind-success, bind-failure-into-recovery, run-failure-into-recovery, finally) — no hang on WaitForInitialBindAttemptAsync for the first run.
W2.17 TransitionTo lock contract — both writers use it; Snapshot reads under the same lock; no torn triples.
TxIdAllocator.Release double-call is benign (TxIdAllocator.cs:121-129 checks if (_inUse[id])); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.
W1.1 in-PDU snapshot consistency — OnUpstreamFrameAsync reads _ctx.Cache and _ctx.TagMap.ResolveCacheTtlMs non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream WithCurrentRequest snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent.
W2.7 cache-FC byte sourced from post-rewriter buffer — correct; the rewriter never touches the FC byte but the source must remain frame[…] to capture the exception bit.

Recommended next actions

If you want to keep the issue tracker tidy, these are the items that should land as a small follow-up commit:

NC1 — fix or remove the theatrical drain loop. Smallest change: delete the loop and the InFlightAtCancel field from the log.
NM1 — pass _disposeCts.Token to _connectGate.WaitAsync in TearDownBackendAsync.
NM2 — wrap the _ctx write + provider re-registration in a single short lock, or null-guard the cache adapter against disposed-cache reads.
NM5 — wrap the self-cascade body in try { … } catch (ObjectDisposedException) { }.
Nm1 — use TrySendResponse in the W1.2 saturation cleanup loop to avoid wedge-by-wedged-attacher.
T2 — replace the seeded Random in WatchdogVsResponse_Race with deterministic alternation.

NM3 and NM4 are latent (no current caller exercises supervisor re-Start). They become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Defer or document.

Everything else is documentation / cosmetic / minor cleanup.

14 KiB Raw Blame History Unescape Escape