Files

T

Joseph Doherty 59d0b5deb9 mbproxy: Wave 5 — fixes from third re-review pass

Closes findings from the third focused re-review pass on the post-W4-followup
state (recorded in codereviews/2026-05-14/ReReviewAfterRemediation.md).

W5/M1 — AdminEndpointHost OnChange callback can resurrect Kestrel after StopAsync
  The hot-reload OnChange handler at AdminEndpointHost.StartAsync did
  fire-and-forget `_ = Task.Run(...)` with no _disposed check. If AdminPort
  was hot-reloaded during shutdown, the queued Task could land between
  StopAsync's registration-dispose and DisposeAsync's _lock-dispose, take
  the lock, and bind a fresh Kestrel WebApplication on the new port —
  resurrecting admin AFTER the host considered it shut down. Worse, if
  DisposeAsync had already run _lock.Dispose, the queued Task throws
  ObjectDisposedException as an unobserved Task exception. Fix: _disposed
  guard at the top of the OnChange lambda AND inside the queued Task.Run,
  plus try/catch (ObjectDisposedException) around _lock.WaitAsync and
  _lock.Release.

W5/m2 — inFlightAtCancel computed AFTER base.StopAsync
  The W4/NC1 fix correctly snapshotted inFlight BEFORE supervisor.StopAsync
  (so the multiplexers' counter providers were still wired), but it computed
  the snapshot AFTER base.StopAsync(cancellationToken). Between those two
  lines, in-flight requests whose responses arrive get removed from
  _correlation, and the watchdog can clear stale entries. The reported
  count therefore drifted downward from "in-flight at signal time" to
  "in-flight at compute time." Fix: snapshot at the very top of StopAsync
  before any cancellation is propagated.

W5/m1 — Cascade gate-not-held path race (accepted as documented best-effort)
  When TearDownBackendAsync's _connectGate.WaitAsync(2s) times out, the
  body runs unprotected. A concurrent EnsureBackendConnectedAsync that
  DOES hold the gate may TryAllocate a TxId that collides (after wraparound
  in the allocator's forward scan) with one being released by the channel
  drain. The double-release would mark the new request's slot as free even
  though it's legitimately in-flight, allowing the next allocation to reuse
  the same slot and CorrelationMap.TryAdd to fail (silent request drop).
  Probability is very low (gate timeout AND new accept landing AND TxId
  collision in 65,536-slot space); the only consequence is one dropped
  request the client retries. Documented inline at PlcMultiplexer.cs near
  the gateHeld declaration as accepted best-effort behaviour.

W5/m3 — CountInFlight allocates a CounterSnapshot record per supervisor
  Trivial (~5 KB on a 54-PLC fleet, called once per shutdown). Skipped per
  re-review verdict.

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 07:13:47 -04:00

15 KiB

Raw Permalink Blame History

Re-Review After Remediation — 2026-05-14

Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → ce32c5c, Wave 2 → e66b17f, Wave 3 → 7ead358, the easy 5 → 2545237, the race-hard 5 → 53f842a). Conducted via three parallel area-focused passes. Eyes on what the fixes themselves introduced, not what the original review already found.

Scope: every src/ and tests/ change in 53a7111..HEAD (37 files, ~+2000/−700 lines).

Status

All actionable findings resolved across two re-review passes. Wave 4 (7a43595) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (9251c56) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. A third focused pass surfaced one more major (W5/M1) and two cosmetics (W5/m1, W5/m2); Wave 5 (this commit) resolved M1 + m2 and documented m1 as accepted best-effort.

Final test count: 387 pass / 0 fail.

Headline

The remediation was structurally sound. The re-review found:

1 critical finding — the W1.5 drain loop inherited the very ShutdownCoordinator bug it was meant to replace. Resolved by Wave 4 (7a43595) snapshotting the in-flight count BEFORE supervisor stop and deleting the theatrical post-stop loop.
5 major findings clustered in W1.4 cascade gating and W1.1 + W1.5 lifecycle ordering. All resolved by Wave 4 + Wave 4-followup.
8 minor findings + 4 test-discipline findings. Most resolved; the rest accepted with documented rationale.

Resolution table

ID	Severity	Finding	Status	Commit
NC1	Critical	`ProxyWorker.StopAsync` drain loop structurally always-zero — inherited the original `ShutdownCoordinator` bug	✅ Resolved	`7a43595`
NM1	Major	`TearDownBackendAsync._connectGate.WaitAsync()` uncancellable — disposal can be blocked indefinitely	✅ Resolved	`7a43595`
NM2	Major	`ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can read a disposed cache	✅ Resolved	`7a43595`
NM3	Major	`_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard	✅ Resolved	(W4-followup)
NM4	Major	W2.15 TCS never re-armed — supervisor effectively single-shot	✅ Resolved	(W4-followup)
NM5	Major	Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal	✅ Resolved	`7a43595`
Nm1	Minor	Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe	✅ Resolved	`7a43595`
Nm2	Minor	W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04	⚪ Accepted	(doc)
Nm3	Minor	Both supervisor-stop and drain shared `gracefulMs` budget	✅ Resolved	`7a43595` (drain deleted)
Nm4	Minor	`finally` preserves `_lastBindError` after clean cancellation	⚪ Accepted	(by design)
Nm5	Minor	`EventLogBridge` no startup log of armed state	⚪ Accepted	(low value)
Nm6	Minor	`_admin` lazy resolution returns `null` silently if registration absent	✅ Resolved	(W4-followup)
Nm7	Minor	`AdminEndpointHost.DisposeAsync` no `_disposed` guard	✅ Resolved	(W4-followup)
Nm8	Minor	`TearDownBackendAsync` cosmetic log-noise on queued cascades	⚪ Accepted	(cosmetic)
T1	Test	Reflection coupling on private field names	⚪ Accepted	(commented in code)
T2	Test	`WatchdogVsResponse_Race` seeded `Random` cross-runtime fragility	✅ Resolved	`7a43595`
T3	Test	`RemoveInheritedAppsettings` only fires on Build, not Publish	✅ Resolved	(W4-followup)
T4	Test	Stale `TryAttachOrCreate__ReturnsTrue_` test method names after W3 dropped the bool	✅ Resolved	(W4-followup)

Resolved: 13/18. Accepted: 5/18.

Third pass — final findings (Wave 5)

A third focused review pass on the post-W4-followup state turned up these additional items:

ID	Severity	Finding	Status	Commit
W5/M1	Major	`AdminEndpointHost` `OnChange` callback can resurrect a Kestrel app after `StopAsync` returned (no `_disposed` check inside the fire-and-forget Task.Run lambda)	✅ Resolved	(W5)
W5/m1	Minor	`TearDownBackendAsync` gate-not-held path: a concurrent freshly-allocated TxId can collide with one being released by the channel drain → silent request drop. Probability very low (gate timeout AND new accept AND TxId collision in 65,536-slot space).	⚪ Accepted	(W5 — inline doc comment in `PlcMultiplexer.cs`)
W5/m2	Minor	`inFlightAtCancel` was computed AFTER `base.StopAsync` — narrower window than the field name promises	✅ Resolved	(W5)
W5/m3	Cosmetic	`CountInFlight` allocates a 35-field `CounterSnapshot` record per supervisor on shutdown	⚪ Accepted (skip)	—

W5/M1 fix detail. Added if (_disposed) return; at the top of the OnChange lambda AND inside the queued Task.Run, plus try/catch (ObjectDisposedException) around _lock.WaitAsync and _lock.Release() so a hot-reload of AdminPort during shutdown can no longer resurrect a fresh Kestrel WebApplication on the new port after the host considered admin shut down.

W5/m2 fix detail. Moved int inFlightAtCancel = CountInFlight(); to BEFORE await base.StopAsync(cancellationToken). Now the count actually reflects "in-flight at the moment the host signalled stop" — not "in-flight at the moment we got around to computing it after the cancel propagated."

W5/m1 acceptance. Documented inline at PlcMultiplexer.cs:TearDownBackendAsync near the gateHeld flag declaration. The race requires three coincidences (gate-timeout + new accept landing during cascade + TxId collision); the only consequence is one dropped request that the client retries on its next attempt.

W5/m3 skip. Trivial per-PLC allocation (~5 KB on a 54-PLC fleet, called once per shutdown). Optimising it would require exposing a single-field accessor on ProxyCounters; not worth the surface change.

Resolved findings — what landed

NC1 — `ProxyWorker.StopAsync` drain loop is structurally always-zero

Resolution: 7a43595 snapshots inFlightAtCancel = CountInFlight() BEFORE calling supervisor.StopAsync(...). The post-stop drain loop and drainCts are deleted entirely. The supervisor stop IS the drain — there's nothing to wait for that wouldn't be killed by the stop itself. mbproxy.shutdown.complete now reports a meaningful "requests dropped by stop" count.

NM1 — `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable

Resolution: 7a43595 bounds the wait with a 2-second teardown CTS. On timeout the body proceeds best-effort without the gate; gateHeld flag tracks whether we acquired it so the finally only releases when appropriate. ObjectDisposedException from a disposed semaphore short-circuits to a clean return.

NM2 — `ReplaceContext` non-atomic ctx + provider swap

Resolution: 7a43595 swapped the order: provider FIRST, then _ctx. Snapshots in the swap window now read either (old, old) or (new, new) — never (old-after-disposed).

NM3 — `_supervisorCts` leaks across `StartAsync` re-entry

Resolution: W4-followup. StartAsync now does try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { } before reassigning, with the catch covering the very-first-Start case where the field-init CTS is still fresh.

NM4 — W2.15 TCS never re-armed (supervisor single-shot)

Resolution: W4-followup. _firstAttemptCompleted is now non-readonly and re-created in StartAsync after the W2.16 guard. A re-Started supervisor's WaitForInitialBindAttemptAsync no longer observes the previous run's signal.

NM5 — Self-cascade `ObjectDisposedException` after dispose

Resolution: 7a43595 gated the writer + reader fault-path _ = TearDownBackendAsync(...) calls behind if (!_disposeCts.IsCancellationRequested). DisposeAsync runs an explicit teardown anyway, so the fire-and-forget path was redundant and exception-throwing.

Nm1 — Saturation cleanup uses `await SendResponseAsync`

Resolution: 7a43595 replaced the per-party await SendResponseAsync with TrySendResponse and IncrementResponseDropForFullUpstream on drop. Same doctrine as W1.3.

Nm3 — `gracefulMs` used twice

Resolution: Implicit — Wave 4's NC1 fix deleted the drain loop, so the budget is now used exactly once (supervisor stop). Worst-case shutdown is gracefulMs + 2 s admin.

Nm6 — `_admin` lazy resolution silent null

Resolution: W4-followup. ProxyWorker.ExecuteAsync now logs a Warning when GetService<AdminEndpointHost>() returns null, surfacing a botched composition without blocking startup. Previous IHostedService registration would have hard-errored in DI; this preserves the loud-failure intent without forcing callers to register admin in unit-test hosts.

Nm7 — `AdminEndpointHost.DisposeAsync` no `_disposed` guard

Resolution: W4-followup. Added a volatile bool _disposed flag and a guard at the top of DisposeAsync; symmetry with PlcMultiplexer.

T2 — `WatchdogVsResponse_Race` seeded `Random` fragility

Resolution: 7a43595 replaced new Random(12345) → rng.Next(350, 450) with a counter-based alternation (n & 1) == 1 ? 350 : 450. Guaranteed 15 fast + 15 slow across 30 iterations regardless of runtime.

T3 — `RemoveInheritedAppsettings` only on Build

Resolution: W4-followup. AfterTargets="Build;Publish" and a second Delete against $(PublishDir)appsettings.json (guarded by Condition="'$(PublishDir)' != ''" so a plain Build doesn't trip it).

T4 — Stale `TryAttachOrCreate_*` test names

Resolution: W4-followup. Renamed three test methods to drop Try and _ReturnsTrue_ to match the W3 AttachOrCreate (void-returning) signature.

Accepted findings — what wasn't fixed and why

Nm2 — `CoalescedHit` increments for late attachers that get exception 04

Why accepted: The counter is correct per the design contract (coalescedHit + coalescedMiss = total FC03/FC04 requests). A late attacher did coalesce onto a stub; the fact that the stub later delivered an exception instead of a real response is orthogonal to the coalesce-counting semantics. Decrementing on saturation cleanup would break the parity invariant that operator dashboards rely on. Documented inline at PlcMultiplexer.cs:809-812 (W2.4 doc clarification covers this).

Nm4 — `_lastBindError` preserved after clean cancellation

Why accepted: Intentional. The field's contract is "last fault while running" — useful for operators triaging why a supervisor exited (was it a bind error vs a clean stop?). Clearing it on clean exit would lose forensic information. The finally block at PlcListenerSupervisor.cs:435-436 already documents this with a comment.

Nm5 — `EventLogBridge` no startup log of armed state

Why accepted: The bridge is a Serilog sink and doesn't have an injectable logger (would require restructuring to use Log.ForContext or similar). The class doc at EventLogBridge.cs:18-20 already tells operators "must be created by install.ps1 before the service starts". The W2.23 caching means a missing source is silent, but the architectural alternative (log from the bridge constructor using Log.Logger) creates a startup-ordering dependency on Serilog being fully wired before any sink construction. Low value vs the surface change.

Nm8 — `TearDownBackendAsync` log-noise on queued cascades

Why accepted: Cosmetic. Each cascade logs its own BackendDisconnected event with its own counts; queued cascades on the gate would fire mostly-zero events. Log filtering is the operator's tool here.

T1 — Reflection on private field names in race tests

Why accepted: The two reflection sites (DrainAllocator in PlcMultiplexerTests, the runtime-fault test in SupervisorTests) are the only externally-impossible-to-hit primitives that prove the saturation/runtime-fault contracts. The alternatives are (a) introducing test-only internal accessors that pollute production code, (b) exposing fields via [InternalsVisibleTo] (which Mbproxy.csproj already does for the test project — but the fields themselves are private, not internal), or (c) skipping the tests. The reflection coupling is documented as a maintenance trap in the test code. A future rename refactor breaks at run-time, not compile-time — but the tests' xmldoc explicitly warns about this.

Verified clean (sampled, not exhaustive)

The original re-review listed the following as verified clean by inspection during the post-Wave-3 review pass; nothing in Wave 4 / W4-followup invalidated these:

W2.3 ConcurrentDictionary migration on _supervisors — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name).
W2.1 coalescingAccessor propagation — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
W2.13 OOR check — multiplication is bounded by the guard; even worst-case 9999 * 10_000 + 9999 = 99_989_999 fits in int32 without overflow.
W2.14 byteCount validation — strict < check passes a perfectly-sized PDU; trailing-byte case correct.
W2.10 resolved-TTL re-check — BcdTagMapBuilder.Build is called exactly once per PLC at validation; no duplicate work.
W2.18 ConnectionOptions validation — both MbproxyOptionsValidator and ReloadValidator reject <= 0; no bypass path.
W3 HasBadNibble dedupe — clean; the codec's internal helper is the single source of truth.
W2.15 TCS signalled in every exit path of RunSupervisorAsync — no hang on WaitForInitialBindAttemptAsync for the first run. (W4 / NM4 added the re-arm for subsequent Starts.)
W2.17 TransitionTo lock contract — both writers use it; Snapshot reads under the same lock; no torn triples.
TxIdAllocator.Release double-call is benign (TxIdAllocator.cs:121-129 checks if (_inUse[id])); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.
W1.1 in-PDU snapshot consistency — OnUpstreamFrameAsync reads _ctx.Cache and _ctx.TagMap.ResolveCacheTtlMs non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream WithCurrentRequest snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent.
W2.7 cache-FC byte sourced from post-rewriter buffer — correct; the rewriter never touches the FC byte but the source must remain frame[…] to capture the exception bit.

Closed

The 2026-05-14 review series — original review → 4 remediation waves → first re-review → wave 4 + followup → second re-review → wave 5 — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.

15 KiB Raw Permalink Blame History Unescape Escape

Re-Review After Remediation — 2026-05-14

Status

Headline

Resolution table

Third pass — final findings (Wave 5)

Resolved findings — what landed

NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero

NM1 — TearDownBackendAsync._connectGate.WaitAsync() uncancellable

NM2 — ReplaceContext non-atomic ctx + provider swap

NM3 — _supervisorCts leaks across StartAsync re-entry

NM4 — W2.15 TCS never re-armed (supervisor single-shot)

NM5 — Self-cascade ObjectDisposedException after dispose

Nm1 — Saturation cleanup uses await SendResponseAsync

Nm3 — gracefulMs used twice

Nm6 — _admin lazy resolution silent null

Nm7 — AdminEndpointHost.DisposeAsync no _disposed guard

T2 — WatchdogVsResponse_Race seeded Random fragility

T3 — RemoveInheritedAppsettings only on Build

T4 — Stale TryAttachOrCreate_* test names