mbproxy: resolve remaining items from ReReviewAfterRemediation.md

Closes the latent + minor + test-discipline items left after Wave 4. Updates the re-review doc with a final resolution table — every actionable finding now marked Resolved or Accepted with rationale. NM3 — _supervisorCts leaks on re-Start StartAsync now disposes the previous CTS before reassigning. Idempotent: a try/catch (ObjectDisposedException) covers the very-first-Start case where the field-init CTS is still fresh. NM4 — W2.15 TCS is single-shot _firstAttemptCompleted is no longer readonly; StartAsync re-creates it after the W2.16 guard so a re-Started supervisor's WaitForInitialBindAttemptAsync doesn't observe the previous run's signal. Nm6 — _admin GetService<> returns null silently ProxyWorker.ExecuteAsync now logs a Warning when admin isn't registered. Preserves the loud-failure intent from the original IHostedService registration without forcing test hosts to wire admin. Nm7 — AdminEndpointHost.DisposeAsync no double-dispose guard Added a volatile bool _disposed flag with an early-return at the top of DisposeAsync. Symmetry with PlcMultiplexer; protects against ProxyWorker.StopAsync explicitly disposing then DI disposing the singleton again on host shutdown. T3 — RemoveInheritedAppsettings only fires on Build AfterTargets="Build;Publish" + a second Delete against $(PublishDir) so a `dotnet publish` against the test csproj doesn't ship the example PLCs from the linked install template. T4 — Stale TryAttachOrCreate_*_ReturnsTrue_* test method names Renamed to AttachOrCreate_*_WasNew{True,False} after W3 dropped the bool return. Accepted (with rationale documented in ReReviewAfterRemediation.md): Nm2 — CoalescedHit semantic is per-design Nm4 — _lastBindError preservation on clean exit is intentional forensics Nm5 — EventLogBridge has no injectable logger Nm8 — Cosmetic log noise T1 — Reflection on private fields documented as maintenance trap Tests: 387 pass / 0 fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:02:21 -04:00
parent 7a435957ee
commit 9251c564c1
6 changed files with 134 additions and 87 deletions
@@ -4,123 +4,126 @@ Re-review of the codebase after the six-commit remediation of the original 2026-

 **Scope:** every src/ and tests/ change in `53a7111..HEAD` (37 files, ~+2000/−700 lines).

+## Status
+
+> **All actionable findings resolved.** Wave 4 (`7a43595`) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (this commit) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. Remaining items are **Accepted** with rationale (explained below) — they're deliberate trade-offs or doc-only nuances that don't warrant code changes.
+>
+> **Final test count:** 387 pass / 0 fail. Race tests stable across 3 isolated runs.
+
 ## Headline

-The remediation is structurally sound. **One new critical finding** — the W1.5 drain loop is structurally always-zero — inherited the very bug it was meant to replace. **Five new major findings** cluster in two areas:
+The remediation was structurally sound. The re-review found:

- **W1.4 cascade gating** introduced two disposal-deadlock and exception-swallowing paths around the new `_connectGate.WaitAsync()` call inside `TearDownBackendAsync`.
- **W1.1 + W1.5 lifecycle ordering** introduced a non-atomic context swap (snapshot reads can see a disposed cache during reseat) and silently single-shot supervisor semantics from the W2.15 + W2.16 combination.
+- **1 critical finding** — the W1.5 drain loop inherited the very `ShutdownCoordinator` bug it was meant to replace. **Resolved** by Wave 4 (`7a43595`) snapshotting the in-flight count BEFORE supervisor stop and deleting the theatrical post-stop loop.
+- **5 major findings** clustered in W1.4 cascade gating and W1.1 + W1.5 lifecycle ordering. **All resolved** by Wave 4 + Wave 4-followup.
+- **8 minor findings** + **4 test-discipline findings**. Most resolved; the rest accepted with documented rationale.

-Plus a handful of test-discipline gaps from the new race-hard tests (reflection on private field names, RNG seed fragility across runtime versions).
+## Resolution table

-## New critical findings (1)
+| ID | Severity | Finding | Status | Commit |
+|----|----------|---------|--------|--------|
+| **NC1** | Critical | `ProxyWorker.StopAsync` drain loop structurally always-zero — inherited the original `ShutdownCoordinator` bug | ✅ **Resolved** | `7a43595` |
+| **NM1** | Major | `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable — disposal can be blocked indefinitely | ✅ **Resolved** | `7a43595` |
+| **NM2** | Major | `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can read a disposed cache | ✅ **Resolved** | `7a43595` |
+| **NM3** | Major | `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard | ✅ **Resolved** | (W4-followup) |
+| **NM4** | Major | W2.15 TCS never re-armed — supervisor effectively single-shot | ✅ **Resolved** | (W4-followup) |
+| **NM5** | Major | Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal | ✅ **Resolved** | `7a43595` |
+| **Nm1** | Minor | Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe | ✅ **Resolved** | `7a43595` |
+| **Nm2** | Minor | W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04 | ⚪ **Accepted** | (doc) |
+| **Nm3** | Minor | Both supervisor-stop and drain shared `gracefulMs` budget | ✅ **Resolved** | `7a43595` (drain deleted) |
+| **Nm4** | Minor | `finally` preserves `_lastBindError` after clean cancellation | ⚪ **Accepted** | (by design) |
+| **Nm5** | Minor | `EventLogBridge` no startup log of armed state | ⚪ **Accepted** | (low value) |
+| **Nm6** | Minor | `_admin` lazy resolution returns `null` silently if registration absent | ✅ **Resolved** | (W4-followup) |
+| **Nm7** | Minor | `AdminEndpointHost.DisposeAsync` no `_disposed` guard | ✅ **Resolved** | (W4-followup) |
+| **Nm8** | Minor | `TearDownBackendAsync` cosmetic log-noise on queued cascades | ⚪ **Accepted** | (cosmetic) |
+| **T1** | Test | Reflection coupling on private field names | ⚪ **Accepted** | (commented in code) |
+| **T2** | Test | `WatchdogVsResponse_Race` seeded `Random` cross-runtime fragility | ✅ **Resolved** | `7a43595` |
+| **T3** | Test | `RemoveInheritedAppsettings` only fires on Build, not Publish | ✅ **Resolved** | (W4-followup) |
+| **T4** | Test | Stale `TryAttachOrCreate_*_ReturnsTrue_*` test method names after W3 dropped the bool | ✅ **Resolved** | (W4-followup) |

-### NC1. `ProxyWorker.StopAsync` drain loop is structurally always-zero — inherited the original `ShutdownCoordinator` bug it replaced
-**File:** `src/Mbproxy/Proxy/ProxyWorker.cs:276-310` (introduced by W1.5)
+**Resolved: 13/18. Accepted: 5/18.**

-**Path of execution:**
-1. `await Task.WhenAll(stopTasks)` — each `PlcListenerSupervisor.StopAsync` cancels its inner CTS, the listener's `RunAsync` exits, the OperationCanceledException catch calls `await listener.DisposeAsync()`, which calls `_multiplexer.DisposeAsync()`, which calls `_ctx.Counters.SetMultiplexProvider(null)` (`PlcMultiplexer.cs:224`).
-2. *Then* the drain loop begins. `CountInFlight` reads `supervisor.CurrentCounters.Snapshot().InFlightCount`, and `ProxyCounters.Snapshot()` (`ProxyCounters.cs:403-404`) returns `provider?.InFlightCount ?? 0`. The provider is now `null`. So `total == 0` on the first iteration and the loop exits immediately.
+---

-The drain budget (`gracefulMs`) is never spent on actual draining. `inFlightAtCancel` will always log as `0`. The `mbproxy.shutdown.complete` event under load will report `InFlightAtCancel=0` even when in-flight requests were forcibly cancelled.
+## Resolved findings — what landed

-**This is the same defect the original review (`AdminAndDiagnostics.md:C1`) called out in the deleted `ShutdownCoordinator`.** The W1.5 relocation faithfully copied the broken sequencing.
+### NC1 — `ProxyWorker.StopAsync` drain loop is structurally always-zero
+**Resolution:** `7a43595` snapshots `inFlightAtCancel = CountInFlight()` BEFORE calling `supervisor.StopAsync(...)`. The post-stop drain loop and `drainCts` are deleted entirely. The supervisor stop IS the drain — there's nothing to wait for that wouldn't be killed by the stop itself. `mbproxy.shutdown.complete` now reports a meaningful "requests dropped by stop" count.

-**Fix options:**
- (a) Snapshot in-flight counts BEFORE calling `supervisors.StopAsync`, then move drain-with-real-multiplexers into a phase BEFORE supervisor stop (i.e. stop accepting new connections, drain in-flight, then stop supervisors).
- (b) Accept that supervisor stop *is* the drain (matches today's behavior) and remove the theatrical loop + `InFlightAtCancel` field from the log event.
+### NM1 — `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable
+**Resolution:** `7a43595` bounds the wait with a 2-second teardown CTS. On timeout the body proceeds best-effort without the gate; `gateHeld` flag tracks whether we acquired it so the `finally` only releases when appropriate. `ObjectDisposedException` from a disposed semaphore short-circuits to a clean return.

-## New major findings (5)
+### NM2 — `ReplaceContext` non-atomic ctx + provider swap
+**Resolution:** `7a43595` swapped the order: provider FIRST, then `_ctx`. Snapshots in the swap window now read either (old, old) or (new, new) — never (old-after-disposed).

-### NM1. `TearDownBackendAsync` acquires `_connectGate` with no token — disposal can be blocked indefinitely
-**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:333` (introduced by W1.4)
+### NM3 — `_supervisorCts` leaks across `StartAsync` re-entry
+**Resolution:** W4-followup. `StartAsync` now does `try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { }` before reassigning, with the catch covering the very-first-Start case where the field-init CTS is still fresh.

-`await _connectGate.WaitAsync().ConfigureAwait(false)` does not honour `_disposeCts.Token`. The dispose-driven teardown calls this after cancelling `_disposeCts`, but if a long Polly-wrapped `EnsureBackendConnectedAsync` is mid-connect against an unreachable host (waiting up to `BackendConnectTimeoutMs`), DisposeAsync blocks here for that duration. This blocks `ProxyWorker.StopAsync` past its drain budget.
+### NM4 — W2.15 TCS never re-armed (supervisor single-shot)
+**Resolution:** W4-followup. `_firstAttemptCompleted` is now non-readonly and re-created in `StartAsync` after the W2.16 guard. A re-Started supervisor's `WaitForInitialBindAttemptAsync` no longer observes the previous run's signal.

-**Fix:** pass `_disposeCts.Token` (or a CTS that cancels on dispose) to `WaitAsync`, accept the `OperationCanceledException`, and skip teardown work if disposal already completed.
+### NM5 — Self-cascade `ObjectDisposedException` after dispose
+**Resolution:** `7a43595` gated the writer + reader fault-path `_ = TearDownBackendAsync(...)` calls behind `if (!_disposeCts.IsCancellationRequested)`. DisposeAsync runs an explicit teardown anyway, so the fire-and-forget path was redundant *and* exception-throwing.

-### NM2. `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can see a disposed cache
-**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:173-184` (introduced by W1.1)
+### Nm1 — Saturation cleanup uses `await SendResponseAsync`
+**Resolution:** `7a43595` replaced the per-party `await SendResponseAsync` with `TrySendResponse` and `IncrementResponseDropForFullUpstream` on drop. Same doctrine as W1.3.

-Snapshot path (`ProxyCounters.Snapshot()` called from the admin endpoint) reads `_cacheStatsProvider` independently of `_ctx`. Sequence:
-1. Thread A enters `ReplaceContext`, writes `_ctx = newContext` (line 177).
-2. Thread B (admin) snapshots, reads `_cacheStatsProvider` (still the OLD adapter wrapping the OLD cache).
-3. Thread A runs `SetCacheStatsProvider(newCacheAdapter)`.
-4. Supervisor's `ReplaceContextAsync` proceeds to `oldCache.Clear()` then `oldCache.Dispose()`.
+### Nm3 — `gracefulMs` used twice
+**Resolution:** Implicit — Wave 4's NC1 fix deleted the drain loop, so the budget is now used exactly once (supervisor stop). Worst-case shutdown is `gracefulMs + 2 s admin`.

-A snapshot during the swap reports `cacheEntryCount`/`cacheBytes` from a cache the live multiplexer no longer references. Worse, if the snapshot races past step 4, the stats adapter holds a disposed `ResponseCache` and `Count`/`ApproximateBytes` may throw. **Same shape of race as the original review's "snapshot inconsistency" pattern**, re-opened on a new path by the W1.1 fix.
+### Nm6 — `_admin` lazy resolution silent null
+**Resolution:** W4-followup. `ProxyWorker.ExecuteAsync` now logs a Warning when `GetService<AdminEndpointHost>()` returns null, surfacing a botched composition without blocking startup. Previous IHostedService registration would have hard-errored in DI; this preserves the loud-failure intent without forcing callers to register admin in unit-test hosts.

-**Fix:** swap the order (provider first, then `_ctx`) inside a short lock, OR have `ResponseCache.Count`/`ApproximateBytes` no-op on a disposed instance.
+### Nm7 — `AdminEndpointHost.DisposeAsync` no `_disposed` guard
+**Resolution:** W4-followup. Added a `volatile bool _disposed` flag and a guard at the top of `DisposeAsync`; symmetry with `PlcMultiplexer`.

-### NM3. `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard
-**File:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:140, 437-446`
+### T2 — `WatchdogVsResponse_Race` seeded `Random` fragility
+**Resolution:** `7a43595` replaced `new Random(12345) → rng.Next(350, 450)` with a counter-based alternation `(n & 1) == 1 ? 350 : 450`. Guaranteed 15 fast + 15 slow across 30 iterations regardless of runtime.

-W2.16's guard prevents re-Start while busy, but it does not address the leaking previous CTS the comment claims to fix. `StopAsync` sets `_state = Stopped` first, then awaits `_supervisorTask`. The previous `_supervisorCts` is disposed only in `DisposeAsync`, not in `StopAsync`. A re-Start pattern (Stop → Start without Dispose) leaks one CTS per cycle.
+### T3 — `RemoveInheritedAppsettings` only on Build
+**Resolution:** W4-followup. `AfterTargets="Build;Publish"` and a second `Delete` against `$(PublishDir)appsettings.json` (guarded by `Condition="'$(PublishDir)' != ''"` so a plain Build doesn't trip it).

-Today no caller actually re-Starts a supervisor (ConfigReconciler's remove path uses `DisposeAsync`), so the leak is **latent**. Worth fixing now or documenting.
+### T4 — Stale `TryAttachOrCreate_*` test names
+**Resolution:** W4-followup. Renamed three test methods to drop `Try` and `_ReturnsTrue_` to match the W3 `AttachOrCreate` (void-returning) signature.

-### NM4. W2.15 TCS is never re-armed; W2.16 + TCS together make the supervisor effectively single-shot
-**File:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:74-75, 132-147`
+---

-`_firstAttemptCompleted` is `readonly` and set in the field initialiser. Once `TrySetResult` fires (anywhere in `RunSupervisorAsync` including the `finally`), it stays completed forever. A restarted supervisor would have `WaitForInitialBindAttemptAsync` return immediately on the *previous* run's signal, regardless of the new run's bind status. Combined with the comment claiming "the supervisor's state machine has exactly one Start", the intent appears to be single-shot — but neither the guard nor the TCS enforces it cleanly.
+## Accepted findings — what wasn't fixed and why

-Today no caller re-Starts so this is **latent**. Either re-arm the TCS in `RunSupervisorAsync` or have W2.16 refuse re-Start unconditionally.
+### Nm2 — `CoalescedHit` increments for late attachers that get exception 04
+**Why accepted:** The counter is correct per the design contract (`coalescedHit + coalescedMiss = total FC03/FC04 requests`). A late attacher *did* coalesce onto a stub; the fact that the stub later delivered an exception instead of a real response is orthogonal to the coalesce-counting semantics. Decrementing on saturation cleanup would break the parity invariant that operator dashboards rely on. Documented inline at `PlcMultiplexer.cs:809-812` (W2.4 doc clarification covers this).

-### NM5. Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal
-**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:451, 648` (introduced by W1.4 + W2.5)
+### Nm4 — `_lastBindError` preserved after clean cancellation
+**Why accepted:** Intentional. The field's contract is "last fault while running" — useful for operators triaging *why* a supervisor exited (was it a bind error vs a clean stop?). Clearing it on clean exit would lose forensic information. The `finally` block at `PlcListenerSupervisor.cs:435-436` already documents this with a comment.

-After `DisposeAsync` runs `_connectGate.Dispose()`, any in-flight self-cascade (`_ = TearDownBackendAsync(...)` from writer/reader fault paths) reaches `WaitAsync()` and throws `ObjectDisposedException`. The `try/finally` will then run `Release()` on a disposed semaphore in the `finally`, also throwing. Both throws happen on a fire-and-forget Task — they become unobserved exceptions on TPL's UnobservedTaskException event. Noisy but not fatal.
+### Nm5 — `EventLogBridge` no startup log of armed state
+**Why accepted:** The bridge is a Serilog sink and doesn't have an injectable logger (would require restructuring to use `Log.ForContext` or similar). The class doc at `EventLogBridge.cs:18-20` already tells operators "must be created by `install.ps1` before the service starts". The W2.23 caching means a missing source is silent, but the architectural alternative (log from the bridge constructor using `Log.Logger`) creates a startup-ordering dependency on Serilog being fully wired before any sink construction. Low value vs the surface change.

-**Fix:** wrap the body in `try { … } catch (ObjectDisposedException) { }` or short-circuit on `_disposed` before re-entering teardown.
+### Nm8 — `TearDownBackendAsync` log-noise on queued cascades
+**Why accepted:** Cosmetic. Each cascade logs its own `BackendDisconnected` event with its own counts; queued cascades on the gate would fire mostly-zero events. Log filtering is the operator's tool here.

-## New minor findings (8)
+### T1 — Reflection on private field names in race tests
+**Why accepted:** The two reflection sites (`DrainAllocator` in `PlcMultiplexerTests`, the runtime-fault test in `SupervisorTests`) are the only externally-impossible-to-hit primitives that prove the saturation/runtime-fault contracts. The alternatives are (a) introducing test-only `internal` accessors that pollute production code, (b) exposing fields via `[InternalsVisibleTo]` (which Mbproxy.csproj already does for the test project — but the *fields* themselves are private, not internal), or (c) skipping the tests. The reflection coupling is documented as a maintenance trap in the test code. A future rename refactor breaks at run-time, not compile-time — but the tests' xmldoc explicitly warns about this.

-| # | File:line | Finding |
-|---|-----------|---------|
-| Nm1 | `PlcMultiplexer.cs:842` (W1.2 cleanup) | Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe. A single wedged late-attacher stalls saturation delivery to all others — contradicts the W1.3 doctrine. Use `TrySendResponse` here too. |
-| Nm2 | `PlcMultiplexer.cs:810` | W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04. Counter shows hits that produced no real coalesced response. Document or decrement. |
-| Nm3 | `ProxyWorker.cs:269-310` | Both supervisor-stop and (theatrical) drain share `gracefulMs`. Worst-case shutdown is `2 * gracefulMs + 2s admin` — operators expect single budget. |
-| Nm4 | `PlcListenerSupervisor.cs:433-446` | `finally` writes `_state = Stopped` directly, preserving `_lastBindError` from a prior fault. After clean cancellation operators see "last bind error" that was never the cause of stopping. Doc the field as "last fault while running" or clear on clean exit. |
-| Nm5 | `EventLogBridge.cs:35-45` | W2.23 cached `_sourceExists` is correct, but no startup log confirms armed state. If `install.ps1` registers source after service starts, every Error+ event is silently dropped until restart. Add a one-line startup log. |
-| Nm6 | `ProxyWorker.cs:224-235` | W1.5 lazy `_admin = _services.GetService<AdminEndpointHost>()` returns `null` silently if `AddMbproxyAdmin()` is forgotten. Previous `IHostedService` registration would have errored loudly during host build. Use `GetRequiredService<>` or log a warning. |
-| Nm7 | `AdminEndpointHost.cs:200-210` | `DisposeAsync` lacks a `_disposed` guard; `ProxyWorker.StopAsync` calls `StopAsync` then DI disposes again. Operations are idempotent today but symmetry with `PlcMultiplexer` would be cleaner. |
-| Nm8 | `PlcMultiplexer.cs:415` | `TearDownBackendAsync` log `BackendDisconnected` event fires per cascade; queued cascades on the gate each log their own event with mostly-zero counts. Cosmetic noise. |
-
-## Test-discipline findings (4)
-
-| # | File:line | Finding |
-|---|-----------|---------|
-| T1 | `PlcMultiplexerTests.cs:1152-1165`, `SupervisorTests.cs:1779-1789` | `DrainAllocator` and the runtime-fault test reflect on private field names (`_allocator`, `_currentListener`, `_listener`). A rename refactor breaks them at run-time, not compile-time. Recommend `[InternalsVisibleTo]` test seam OR comments on each reflected field warning that tests depend on the name. |
-| T2 | `PlcMultiplexerTests.cs:1379-1380` (W3 #8) | `WatchdogVsResponse_Race` uses `new Random(12345)` over `[350, 450)` ms with watchdog at 400 ms; asserts BOTH branches hit. `Random` with a seed is implementation-defined across major .NET versions (legacy → Xoshiro128 in .NET 6). On a runtime upgrade the seed could land all 30 samples on one side and the test breaks. Replace with deterministic alternation: `int delay = (i & 1) == 0 ? 350 : 450;`. |
-| T3 | `Mbproxy.Tests.csproj:38-40` (W2.21) | `RemoveInheritedAppsettings` Target uses `AfterTargets="Build"` and deletes from `$(OutputPath)`. Won't fire on `dotnet publish`. Test projects rarely publish but worth `AfterTargets="Build;Publish"` + a second Delete against `$(PublishDir)appsettings.json`. |
-| T4 | `InFlightByKeyMapTests.cs:46, 75, 112` | After W3 removed the `bool` return, test method names still read `TryAttachOrCreate_..._ReturnsTrue_WasNewTrue`. Decorative but misleading for grep. |
+---

 ## Verified clean (sampled, not exhaustive)

- **W2.3 ConcurrentDictionary migration** on `_supervisors` — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name). 
+The original re-review listed the following as verified clean by inspection during the post-Wave-3 review pass; nothing in Wave 4 / W4-followup invalidated these:
+
+- **W2.3 ConcurrentDictionary migration** on `_supervisors` — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name).
 - **W2.1 coalescingAccessor propagation** — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
 - **W2.13 OOR check** — multiplication is bounded by the guard; even worst-case `9999 * 10_000 + 9999 = 99_989_999` fits in int32 without overflow.
 - **W2.14 byteCount validation** — strict `<` check passes a perfectly-sized PDU; trailing-byte case correct.
 - **W2.10 resolved-TTL re-check** — `BcdTagMapBuilder.Build` is called exactly once per PLC at validation; no duplicate work.
 - **W2.18 ConnectionOptions validation** — both `MbproxyOptionsValidator` and `ReloadValidator` reject `<= 0`; no bypass path.
 - **W3 `HasBadNibble` dedupe** — clean; the codec's internal helper is the single source of truth.
- **W2.15 TCS signalled in every exit path** of `RunSupervisorAsync` (bind-success, bind-failure-into-recovery, run-failure-into-recovery, finally) — no hang on `WaitForInitialBindAttemptAsync` for the first run.
- **W2.17 TransitionTo lock contract** — both writers use it; `Snapshot` reads under the same lock; no torn triples.
+- **W2.15 TCS signalled in every exit path** of `RunSupervisorAsync` — no hang on `WaitForInitialBindAttemptAsync` for the first run. (W4 / NM4 added the re-arm for subsequent Starts.)
+- **W2.17 `TransitionTo` lock contract** — both writers use it; `Snapshot` reads under the same lock; no torn triples.
 - **`TxIdAllocator.Release` double-call is benign** (`TxIdAllocator.cs:121-129` checks `if (_inUse[id])`); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.
 - **W1.1 in-PDU snapshot consistency** — `OnUpstreamFrameAsync` reads `_ctx.Cache` and `_ctx.TagMap.ResolveCacheTtlMs` non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream `WithCurrentRequest` snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent.
 - **W2.7 cache-FC byte sourced from post-rewriter buffer** — correct; the rewriter never touches the FC byte but the source must remain `frame[…]` to capture the exception bit.

-## Recommended next actions
+## Closed

-If you want to keep the issue tracker tidy, these are the items that should land as a small follow-up commit:
-
-1. **NC1** — fix or remove the theatrical drain loop. Smallest change: delete the loop and the `InFlightAtCancel` field from the log.
-2. **NM1** — pass `_disposeCts.Token` to `_connectGate.WaitAsync` in `TearDownBackendAsync`.
-3. **NM2** — wrap the `_ctx` write + provider re-registration in a single short lock, or null-guard the cache adapter against disposed-cache reads.
-4. **NM5** — wrap the self-cascade body in `try { … } catch (ObjectDisposedException) { }`.
-5. **Nm1** — use `TrySendResponse` in the W1.2 saturation cleanup loop to avoid wedge-by-wedged-attacher.
-6. **T2** — replace the seeded `Random` in `WatchdogVsResponse_Race` with deterministic alternation.
-
-NM3 and NM4 are latent (no current caller exercises supervisor re-Start). They become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Defer or document.
-
-Everything else is documentation / cosmetic / minor cleanup.
+The 2026-05-14 review series — original review → 4 remediation waves → re-review → wave-4 → wave-4-followup — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.
@@ -44,6 +44,13 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
    // Protects concurrent Start/Stop calls (hot-reload + StopAsync racing).
    private readonly SemaphoreSlim _lock = new(1, 1);

+    // Phase 12 (W4 / Nm7) — idempotency flag for DisposeAsync. ProxyWorker.StopAsync
+    // calls our StopAsync explicitly; the DI container then disposes the singleton on
+    // host shutdown. Without this flag the second pass would Dispose `_lock` twice and
+    // re-dispose the change registration (both currently safe but symmetry with
+    // PlcMultiplexer prevents future regression).
+    private volatile bool _disposed;
+
    // Current configured port — used to detect changes on hot-reload.
    private int _currentPort;

@@ -199,14 +206,19 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable

    public async ValueTask DisposeAsync()
    {
+        if (_disposed) return;
+        _disposed = true;
+
        _optionsChangeRegistration?.Dispose();
-        _lock.Dispose();
+        _optionsChangeRegistration = null;

        if (_app is { } app)
        {
            _app = null;
            await app.DisposeAsync().ConfigureAwait(false);
        }
+
+        try { _lock.Dispose(); } catch (ObjectDisposedException) { /* race-safe */ }
    }

    // ── Logging ──────────────────────────────────────────────────────────────
@@ -233,6 +233,16 @@ internal sealed partial class ProxyWorker : BackgroundService
                _logger.LogError(ex, "Admin endpoint failed to start: {Message}", ex.Message);
            }
        }
+        else
+        {
+            // Phase 12 (W4 / Nm6) — surface the absence. The previous IHostedService
+            // registration would have hard-errored in DI if AddMbproxyAdmin() was missing
+            // from Program.cs; the W1.5 lazy lookup returns null silently. A single warning
+            // makes a botched composition observable without blocking startup.
+            _logger.LogWarning(
+                "Admin endpoint not registered (AddMbproxyAdmin() missing from composition). " +
+                "Status page will be unavailable; service continues without it.");
+        }

        // ── 6. Keep the worker alive until the host signals stop ─────────────────────
        // Supervisors run their own background loops; ExecuteAsync just waits.
@@ -71,7 +71,13 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
    // for the first time (reached Bound or Recovering). Replaces the previous busy-poll
    // implementation in WaitForInitialBindAttemptAsync, which raced fast Stopped→Bound→
    // Stopped transitions and never exited if the supervisor task threw inside Polly.
-    private readonly TaskCompletionSource _firstAttemptCompleted = new(
+    //
+    // Phase 12 (W4 / NM4) — non-readonly so StartAsync can re-arm it for a re-Started
+    // supervisor. Without re-arming, a restart-after-stop scenario would have
+    // WaitForInitialBindAttemptAsync return immediately on the previous run's signal,
+    // never observing the new run's bind status. No production caller currently re-Starts,
+    // but the supervisor's state machine should be consistent.
+    private TaskCompletionSource _firstAttemptCompleted = new(
        TaskCreationOptions.RunContinuationsAsynchronously);

    // ── Public surface ────────────────────────────────────────────────────────────────────
@@ -132,16 +138,28 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
    public Task StartAsync(CancellationToken ct)
    {
        // Phase 12 (W2.16) — refuse to re-Start an already-running or already-disposed
-        // supervisor. The original code reassigned _supervisorCts unconditionally, which
-        // leaked the previous CTS and could leave a zombie task running against an
-        // unobserved token. The supervisor's state machine has exactly one Start.
+        // supervisor. After Stop the state machine returns to Stopped and StartAsync
+        // can re-arm; W4/NM3+NM4 below ensure the per-Start state (CTS, TCS) is fresh
+        // each time so no leak or stale signal carries across cycles.
        if (_disposed)
            throw new ObjectDisposedException(nameof(PlcListenerSupervisor));
        if (_state != SupervisorState.Stopped || !_supervisorTask.IsCompleted)
            throw new InvalidOperationException(
                $"Supervisor for Plc='{_plc.Name}' has already been started.");

+        // Phase 12 (W4 / NM3) — dispose the previous CTS before reassigning. The original
+        // code overwrote _supervisorCts unconditionally, leaking the prior CTS on every
+        // re-Start cycle (and any registrations linked to it). Idempotent: ObjectDisposed
+        // catch covers the very-first-Start case where the field-init CTS is still fresh.
+        try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { /* fresh */ }
        _supervisorCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
+
+        // Phase 12 (W4 / NM4) — re-arm the first-attempt TCS so a re-Started supervisor
+        // doesn't immediately observe the previous run's signal in
+        // WaitForInitialBindAttemptAsync.
+        _firstAttemptCompleted = new TaskCompletionSource(
+            TaskCreationOptions.RunContinuationsAsynchronously);
+
        _supervisorTask = Task.Run(() => RunSupervisorAsync(_supervisorCts.Token), CancellationToken.None);
        return Task.CompletedTask;
    }
@@ -34,9 +34,13 @@
  <!-- Phase 12 (W2.21) — the linked appsettings.json from Mbproxy.csproj propagates to
       the test bin via the project reference. Tests build their own in-memory
       configurations and must not pick up the shipped template's example PLCs. The
-       Target deletes the inherited file from the test bin after build. -->
-  <Target Name="RemoveInheritedAppsettings" AfterTargets="Build">
+       Target deletes the inherited file from both the build output AND the publish
+       payload (W4 / T3 — adds Publish so a `dotnet publish` against this csproj for a
+       packaged self-test would not leak the template's example PLCs into the published
+       bundle). -->
+  <Target Name="RemoveInheritedAppsettings" AfterTargets="Build;Publish">
    <Delete Files="$(OutputPath)appsettings.json" Condition="Exists('$(OutputPath)appsettings.json')" />
+    <Delete Files="$(PublishDir)appsettings.json" Condition="'$(PublishDir)' != '' AND Exists('$(PublishDir)appsettings.json')" />
  </Target>

 </Project>
@@ -43,7 +43,7 @@ public sealed class InFlightByKeyMapTests
    }

    [Fact]
-    public async Task TryAttachOrCreate_NewKey_CallsFactory_ReturnsTrue_WasNewTrue()
+    public async Task AttachOrCreate_NewKey_CallsFactory_WasNewTrue()
    {
        var pipe = MakePipe();
        try
@@ -72,7 +72,7 @@ public sealed class InFlightByKeyMapTests
    }

    [Fact]
-    public async Task TryAttachOrCreate_ExistingKey_AppendsParty_ReturnsTrue_WasNewFalse()
+    public async Task AttachOrCreate_ExistingKey_AppendsParty_WasNewFalse()
    {
        var pipeA = MakePipe();
        var pipeB = MakePipe();
@@ -109,7 +109,7 @@ public sealed class InFlightByKeyMapTests
    }

    [Fact]
-    public async Task TryAttachOrCreate_ExistingKey_AtMaxParties_CreatesFreshEntry_NotAppend()
+    public async Task AttachOrCreate_ExistingKey_AtMaxParties_CreatesFreshEntry_NotAppend()
    {
        var pipeA = MakePipe();
        var pipeB = MakePipe();