Closes findings from the third focused re-review pass on the post-W4-followup state (recorded in codereviews/2026-05-14/ReReviewAfterRemediation.md). W5/M1 — AdminEndpointHost OnChange callback can resurrect Kestrel after StopAsync The hot-reload OnChange handler at AdminEndpointHost.StartAsync did fire-and-forget `_ = Task.Run(...)` with no _disposed check. If AdminPort was hot-reloaded during shutdown, the queued Task could land between StopAsync's registration-dispose and DisposeAsync's _lock-dispose, take the lock, and bind a fresh Kestrel WebApplication on the new port — resurrecting admin AFTER the host considered it shut down. Worse, if DisposeAsync had already run _lock.Dispose, the queued Task throws ObjectDisposedException as an unobserved Task exception. Fix: _disposed guard at the top of the OnChange lambda AND inside the queued Task.Run, plus try/catch (ObjectDisposedException) around _lock.WaitAsync and _lock.Release. W5/m2 — inFlightAtCancel computed AFTER base.StopAsync The W4/NC1 fix correctly snapshotted inFlight BEFORE supervisor.StopAsync (so the multiplexers' counter providers were still wired), but it computed the snapshot AFTER base.StopAsync(cancellationToken). Between those two lines, in-flight requests whose responses arrive get removed from _correlation, and the watchdog can clear stale entries. The reported count therefore drifted downward from "in-flight at signal time" to "in-flight at compute time." Fix: snapshot at the very top of StopAsync before any cancellation is propagated. W5/m1 — Cascade gate-not-held path race (accepted as documented best-effort) When TearDownBackendAsync's _connectGate.WaitAsync(2s) times out, the body runs unprotected. A concurrent EnsureBackendConnectedAsync that DOES hold the gate may TryAllocate a TxId that collides (after wraparound in the allocator's forward scan) with one being released by the channel drain. The double-release would mark the new request's slot as free even though it's legitimately in-flight, allowing the next allocation to reuse the same slot and CorrelationMap.TryAdd to fail (silent request drop). Probability is very low (gate timeout AND new accept landing AND TxId collision in 65,536-slot space); the only consequence is one dropped request the client retries. Documented inline at PlcMultiplexer.cs near the gateHeld declaration as accepted best-effort behaviour. W5/m3 — CountInFlight allocates a CounterSnapshot record per supervisor Trivial (~5 KB on a 54-PLC fleet, called once per shutdown). Skipped per re-review verdict. Tests: 387 pass / 0 fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
15 KiB
Re-Review After Remediation — 2026-05-14
Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → ce32c5c, Wave 2 → e66b17f, Wave 3 → 7ead358, the easy 5 → 2545237, the race-hard 5 → 53f842a). Conducted via three parallel area-focused passes. Eyes on what the fixes themselves introduced, not what the original review already found.
Scope: every src/ and tests/ change in 53a7111..HEAD (37 files, ~+2000/−700 lines).
Status
All actionable findings resolved across two re-review passes. Wave 4 (
7a43595) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (9251c56) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. A third focused pass surfaced one more major (W5/M1) and two cosmetics (W5/m1, W5/m2); Wave 5 (this commit) resolved M1 + m2 and documented m1 as accepted best-effort.Final test count: 387 pass / 0 fail.
Headline
The remediation was structurally sound. The re-review found:
- 1 critical finding — the W1.5 drain loop inherited the very
ShutdownCoordinatorbug it was meant to replace. Resolved by Wave 4 (7a43595) snapshotting the in-flight count BEFORE supervisor stop and deleting the theatrical post-stop loop. - 5 major findings clustered in W1.4 cascade gating and W1.1 + W1.5 lifecycle ordering. All resolved by Wave 4 + Wave 4-followup.
- 8 minor findings + 4 test-discipline findings. Most resolved; the rest accepted with documented rationale.
Resolution table
| ID | Severity | Finding | Status | Commit |
|---|---|---|---|---|
| NC1 | Critical | ProxyWorker.StopAsync drain loop structurally always-zero — inherited the original ShutdownCoordinator bug |
✅ Resolved | 7a43595 |
| NM1 | Major | TearDownBackendAsync._connectGate.WaitAsync() uncancellable — disposal can be blocked indefinitely |
✅ Resolved | 7a43595 |
| NM2 | Major | ReplaceContext writes _ctx and re-registers cache stats provider non-atomically — snapshot can read a disposed cache |
✅ Resolved | 7a43595 |
| NM3 | Major | _supervisorCts leaks across StartAsync re-entry despite W2.16 guard |
✅ Resolved | (W4-followup) |
| NM4 | Major | W2.15 TCS never re-armed — supervisor effectively single-shot | ✅ Resolved | (W4-followup) |
| NM5 | Major | Self-cascade swallows ObjectDisposedException from _connectGate after disposal |
✅ Resolved | 7a43595 |
| Nm1 | Minor | Saturation cleanup uses await SendResponseAsync (blocking) per attached pipe |
✅ Resolved | 7a43595 |
| Nm2 | Minor | W1.2 increments CoalescedHit for late attachers that ultimately receive exception 04 |
⚪ Accepted | (doc) |
| Nm3 | Minor | Both supervisor-stop and drain shared gracefulMs budget |
✅ Resolved | 7a43595 (drain deleted) |
| Nm4 | Minor | finally preserves _lastBindError after clean cancellation |
⚪ Accepted | (by design) |
| Nm5 | Minor | EventLogBridge no startup log of armed state |
⚪ Accepted | (low value) |
| Nm6 | Minor | _admin lazy resolution returns null silently if registration absent |
✅ Resolved | (W4-followup) |
| Nm7 | Minor | AdminEndpointHost.DisposeAsync no _disposed guard |
✅ Resolved | (W4-followup) |
| Nm8 | Minor | TearDownBackendAsync cosmetic log-noise on queued cascades |
⚪ Accepted | (cosmetic) |
| T1 | Test | Reflection coupling on private field names | ⚪ Accepted | (commented in code) |
| T2 | Test | WatchdogVsResponse_Race seeded Random cross-runtime fragility |
✅ Resolved | 7a43595 |
| T3 | Test | RemoveInheritedAppsettings only fires on Build, not Publish |
✅ Resolved | (W4-followup) |
| T4 | Test | Stale TryAttachOrCreate_*_ReturnsTrue_* test method names after W3 dropped the bool |
✅ Resolved | (W4-followup) |
Resolved: 13/18. Accepted: 5/18.
Third pass — final findings (Wave 5)
A third focused review pass on the post-W4-followup state turned up these additional items:
| ID | Severity | Finding | Status | Commit |
|---|---|---|---|---|
| W5/M1 | Major | AdminEndpointHost OnChange callback can resurrect a Kestrel app after StopAsync returned (no _disposed check inside the fire-and-forget Task.Run lambda) |
✅ Resolved | (W5) |
| W5/m1 | Minor | TearDownBackendAsync gate-not-held path: a concurrent freshly-allocated TxId can collide with one being released by the channel drain → silent request drop. Probability very low (gate timeout AND new accept AND TxId collision in 65,536-slot space). |
⚪ Accepted | (W5 — inline doc comment in PlcMultiplexer.cs) |
| W5/m2 | Minor | inFlightAtCancel was computed AFTER base.StopAsync — narrower window than the field name promises |
✅ Resolved | (W5) |
| W5/m3 | Cosmetic | CountInFlight allocates a 35-field CounterSnapshot record per supervisor on shutdown |
⚪ Accepted (skip) | — |
W5/M1 fix detail. Added if (_disposed) return; at the top of the OnChange lambda AND inside the queued Task.Run, plus try/catch (ObjectDisposedException) around _lock.WaitAsync and _lock.Release() so a hot-reload of AdminPort during shutdown can no longer resurrect a fresh Kestrel WebApplication on the new port after the host considered admin shut down.
W5/m2 fix detail. Moved int inFlightAtCancel = CountInFlight(); to BEFORE await base.StopAsync(cancellationToken). Now the count actually reflects "in-flight at the moment the host signalled stop" — not "in-flight at the moment we got around to computing it after the cancel propagated."
W5/m1 acceptance. Documented inline at PlcMultiplexer.cs:TearDownBackendAsync near the gateHeld flag declaration. The race requires three coincidences (gate-timeout + new accept landing during cascade + TxId collision); the only consequence is one dropped request that the client retries on its next attempt.
W5/m3 skip. Trivial per-PLC allocation (~5 KB on a 54-PLC fleet, called once per shutdown). Optimising it would require exposing a single-field accessor on ProxyCounters; not worth the surface change.
Resolved findings — what landed
NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero
Resolution: 7a43595 snapshots inFlightAtCancel = CountInFlight() BEFORE calling supervisor.StopAsync(...). The post-stop drain loop and drainCts are deleted entirely. The supervisor stop IS the drain — there's nothing to wait for that wouldn't be killed by the stop itself. mbproxy.shutdown.complete now reports a meaningful "requests dropped by stop" count.
NM1 — TearDownBackendAsync._connectGate.WaitAsync() uncancellable
Resolution: 7a43595 bounds the wait with a 2-second teardown CTS. On timeout the body proceeds best-effort without the gate; gateHeld flag tracks whether we acquired it so the finally only releases when appropriate. ObjectDisposedException from a disposed semaphore short-circuits to a clean return.
NM2 — ReplaceContext non-atomic ctx + provider swap
Resolution: 7a43595 swapped the order: provider FIRST, then _ctx. Snapshots in the swap window now read either (old, old) or (new, new) — never (old-after-disposed).
NM3 — _supervisorCts leaks across StartAsync re-entry
Resolution: W4-followup. StartAsync now does try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { } before reassigning, with the catch covering the very-first-Start case where the field-init CTS is still fresh.
NM4 — W2.15 TCS never re-armed (supervisor single-shot)
Resolution: W4-followup. _firstAttemptCompleted is now non-readonly and re-created in StartAsync after the W2.16 guard. A re-Started supervisor's WaitForInitialBindAttemptAsync no longer observes the previous run's signal.
NM5 — Self-cascade ObjectDisposedException after dispose
Resolution: 7a43595 gated the writer + reader fault-path _ = TearDownBackendAsync(...) calls behind if (!_disposeCts.IsCancellationRequested). DisposeAsync runs an explicit teardown anyway, so the fire-and-forget path was redundant and exception-throwing.
Nm1 — Saturation cleanup uses await SendResponseAsync
Resolution: 7a43595 replaced the per-party await SendResponseAsync with TrySendResponse and IncrementResponseDropForFullUpstream on drop. Same doctrine as W1.3.
Nm3 — gracefulMs used twice
Resolution: Implicit — Wave 4's NC1 fix deleted the drain loop, so the budget is now used exactly once (supervisor stop). Worst-case shutdown is gracefulMs + 2 s admin.
Nm6 — _admin lazy resolution silent null
Resolution: W4-followup. ProxyWorker.ExecuteAsync now logs a Warning when GetService<AdminEndpointHost>() returns null, surfacing a botched composition without blocking startup. Previous IHostedService registration would have hard-errored in DI; this preserves the loud-failure intent without forcing callers to register admin in unit-test hosts.
Nm7 — AdminEndpointHost.DisposeAsync no _disposed guard
Resolution: W4-followup. Added a volatile bool _disposed flag and a guard at the top of DisposeAsync; symmetry with PlcMultiplexer.
T2 — WatchdogVsResponse_Race seeded Random fragility
Resolution: 7a43595 replaced new Random(12345) → rng.Next(350, 450) with a counter-based alternation (n & 1) == 1 ? 350 : 450. Guaranteed 15 fast + 15 slow across 30 iterations regardless of runtime.
T3 — RemoveInheritedAppsettings only on Build
Resolution: W4-followup. AfterTargets="Build;Publish" and a second Delete against $(PublishDir)appsettings.json (guarded by Condition="'$(PublishDir)' != ''" so a plain Build doesn't trip it).
T4 — Stale TryAttachOrCreate_* test names
Resolution: W4-followup. Renamed three test methods to drop Try and _ReturnsTrue_ to match the W3 AttachOrCreate (void-returning) signature.
Accepted findings — what wasn't fixed and why
Nm2 — CoalescedHit increments for late attachers that get exception 04
Why accepted: The counter is correct per the design contract (coalescedHit + coalescedMiss = total FC03/FC04 requests). A late attacher did coalesce onto a stub; the fact that the stub later delivered an exception instead of a real response is orthogonal to the coalesce-counting semantics. Decrementing on saturation cleanup would break the parity invariant that operator dashboards rely on. Documented inline at PlcMultiplexer.cs:809-812 (W2.4 doc clarification covers this).
Nm4 — _lastBindError preserved after clean cancellation
Why accepted: Intentional. The field's contract is "last fault while running" — useful for operators triaging why a supervisor exited (was it a bind error vs a clean stop?). Clearing it on clean exit would lose forensic information. The finally block at PlcListenerSupervisor.cs:435-436 already documents this with a comment.
Nm5 — EventLogBridge no startup log of armed state
Why accepted: The bridge is a Serilog sink and doesn't have an injectable logger (would require restructuring to use Log.ForContext or similar). The class doc at EventLogBridge.cs:18-20 already tells operators "must be created by install.ps1 before the service starts". The W2.23 caching means a missing source is silent, but the architectural alternative (log from the bridge constructor using Log.Logger) creates a startup-ordering dependency on Serilog being fully wired before any sink construction. Low value vs the surface change.
Nm8 — TearDownBackendAsync log-noise on queued cascades
Why accepted: Cosmetic. Each cascade logs its own BackendDisconnected event with its own counts; queued cascades on the gate would fire mostly-zero events. Log filtering is the operator's tool here.
T1 — Reflection on private field names in race tests
Why accepted: The two reflection sites (DrainAllocator in PlcMultiplexerTests, the runtime-fault test in SupervisorTests) are the only externally-impossible-to-hit primitives that prove the saturation/runtime-fault contracts. The alternatives are (a) introducing test-only internal accessors that pollute production code, (b) exposing fields via [InternalsVisibleTo] (which Mbproxy.csproj already does for the test project — but the fields themselves are private, not internal), or (c) skipping the tests. The reflection coupling is documented as a maintenance trap in the test code. A future rename refactor breaks at run-time, not compile-time — but the tests' xmldoc explicitly warns about this.
Verified clean (sampled, not exhaustive)
The original re-review listed the following as verified clean by inspection during the post-Wave-3 review pass; nothing in Wave 4 / W4-followup invalidated these:
- W2.3 ConcurrentDictionary migration on
_supervisors— all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name). - W2.1 coalescingAccessor propagation — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
- W2.13 OOR check — multiplication is bounded by the guard; even worst-case
9999 * 10_000 + 9999 = 99_989_999fits in int32 without overflow. - W2.14 byteCount validation — strict
<check passes a perfectly-sized PDU; trailing-byte case correct. - W2.10 resolved-TTL re-check —
BcdTagMapBuilder.Buildis called exactly once per PLC at validation; no duplicate work. - W2.18 ConnectionOptions validation — both
MbproxyOptionsValidatorandReloadValidatorreject<= 0; no bypass path. - W3
HasBadNibblededupe — clean; the codec's internal helper is the single source of truth. - W2.15 TCS signalled in every exit path of
RunSupervisorAsync— no hang onWaitForInitialBindAttemptAsyncfor the first run. (W4 / NM4 added the re-arm for subsequent Starts.) - W2.17
TransitionTolock contract — both writers use it;Snapshotreads under the same lock; no torn triples. TxIdAllocator.Releasedouble-call is benign (TxIdAllocator.cs:121-129checksif (_inUse[id])); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.- W1.1 in-PDU snapshot consistency —
OnUpstreamFrameAsyncreads_ctx.Cacheand_ctx.TagMap.ResolveCacheTtlMsnon-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. DownstreamWithCurrentRequestsnapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent. - W2.7 cache-FC byte sourced from post-rewriter buffer — correct; the rewriter never touches the FC byte but the source must remain
frame[…]to capture the exception bit.
Closed
The 2026-05-14 review series — original review → 4 remediation waves → first re-review → wave 4 + followup → second re-review → wave 5 — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.