# Re-Review After Remediation — 2026-05-14 Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → `ce32c5c`, Wave 2 → `e66b17f`, Wave 3 → `7ead358`, the easy 5 → `2545237`, the race-hard 5 → `53f842a`). Conducted via three parallel area-focused passes. **Eyes on what the fixes themselves introduced**, not what the original review already found. **Scope:** every src/ and tests/ change in `53a7111..HEAD` (37 files, ~+2000/−700 lines). ## Status > **All actionable findings resolved across two re-review passes.** Wave 4 (`7a43595`) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (`9251c56`) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. A third focused pass surfaced one more major (W5/M1) and two cosmetics (W5/m1, W5/m2); Wave 5 (this commit) resolved M1 + m2 and documented m1 as accepted best-effort. > > **Final test count:** 387 pass / 0 fail. ## Headline The remediation was structurally sound. The re-review found: - **1 critical finding** — the W1.5 drain loop inherited the very `ShutdownCoordinator` bug it was meant to replace. **Resolved** by Wave 4 (`7a43595`) snapshotting the in-flight count BEFORE supervisor stop and deleting the theatrical post-stop loop. - **5 major findings** clustered in W1.4 cascade gating and W1.1 + W1.5 lifecycle ordering. **All resolved** by Wave 4 + Wave 4-followup. - **8 minor findings** + **4 test-discipline findings**. Most resolved; the rest accepted with documented rationale. ## Resolution table | ID | Severity | Finding | Status | Commit | |----|----------|---------|--------|--------| | **NC1** | Critical | `ProxyWorker.StopAsync` drain loop structurally always-zero — inherited the original `ShutdownCoordinator` bug | ✅ **Resolved** | `7a43595` | | **NM1** | Major | `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable — disposal can be blocked indefinitely | ✅ **Resolved** | `7a43595` | | **NM2** | Major | `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can read a disposed cache | ✅ **Resolved** | `7a43595` | | **NM3** | Major | `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard | ✅ **Resolved** | (W4-followup) | | **NM4** | Major | W2.15 TCS never re-armed — supervisor effectively single-shot | ✅ **Resolved** | (W4-followup) | | **NM5** | Major | Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal | ✅ **Resolved** | `7a43595` | | **Nm1** | Minor | Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe | ✅ **Resolved** | `7a43595` | | **Nm2** | Minor | W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04 | ⚪ **Accepted** | (doc) | | **Nm3** | Minor | Both supervisor-stop and drain shared `gracefulMs` budget | ✅ **Resolved** | `7a43595` (drain deleted) | | **Nm4** | Minor | `finally` preserves `_lastBindError` after clean cancellation | ⚪ **Accepted** | (by design) | | **Nm5** | Minor | `EventLogBridge` no startup log of armed state | ⚪ **Accepted** | (low value) | | **Nm6** | Minor | `_admin` lazy resolution returns `null` silently if registration absent | ✅ **Resolved** | (W4-followup) | | **Nm7** | Minor | `AdminEndpointHost.DisposeAsync` no `_disposed` guard | ✅ **Resolved** | (W4-followup) | | **Nm8** | Minor | `TearDownBackendAsync` cosmetic log-noise on queued cascades | ⚪ **Accepted** | (cosmetic) | | **T1** | Test | Reflection coupling on private field names | ⚪ **Accepted** | (commented in code) | | **T2** | Test | `WatchdogVsResponse_Race` seeded `Random` cross-runtime fragility | ✅ **Resolved** | `7a43595` | | **T3** | Test | `RemoveInheritedAppsettings` only fires on Build, not Publish | ✅ **Resolved** | (W4-followup) | | **T4** | Test | Stale `TryAttachOrCreate_*_ReturnsTrue_*` test method names after W3 dropped the bool | ✅ **Resolved** | (W4-followup) | **Resolved: 13/18. Accepted: 5/18.** ## Third pass — final findings (Wave 5) A third focused review pass on the post-W4-followup state turned up these additional items: | ID | Severity | Finding | Status | Commit | |----|----------|---------|--------|--------| | **W5/M1** | Major | `AdminEndpointHost` `OnChange` callback can resurrect a Kestrel app after `StopAsync` returned (no `_disposed` check inside the fire-and-forget Task.Run lambda) | ✅ **Resolved** | (W5) | | **W5/m1** | Minor | `TearDownBackendAsync` gate-not-held path: a concurrent freshly-allocated TxId can collide with one being released by the channel drain → silent request drop. Probability very low (gate timeout AND new accept AND TxId collision in 65,536-slot space). | ⚪ **Accepted** | (W5 — inline doc comment in `PlcMultiplexer.cs`) | | **W5/m2** | Minor | `inFlightAtCancel` was computed AFTER `base.StopAsync` — narrower window than the field name promises | ✅ **Resolved** | (W5) | | **W5/m3** | Cosmetic | `CountInFlight` allocates a 35-field `CounterSnapshot` record per supervisor on shutdown | ⚪ **Accepted** (skip) | — | **W5/M1 fix detail.** Added `if (_disposed) return;` at the top of the `OnChange` lambda AND inside the queued `Task.Run`, plus `try/catch (ObjectDisposedException)` around `_lock.WaitAsync` and `_lock.Release()` so a hot-reload of `AdminPort` during shutdown can no longer resurrect a fresh Kestrel WebApplication on the new port after the host considered admin shut down. **W5/m2 fix detail.** Moved `int inFlightAtCancel = CountInFlight();` to BEFORE `await base.StopAsync(cancellationToken)`. Now the count actually reflects "in-flight at the moment the host signalled stop" — not "in-flight at the moment we got around to computing it after the cancel propagated." **W5/m1 acceptance.** Documented inline at `PlcMultiplexer.cs:TearDownBackendAsync` near the `gateHeld` flag declaration. The race requires three coincidences (gate-timeout + new accept landing during cascade + TxId collision); the only consequence is one dropped request that the client retries on its next attempt. **W5/m3 skip.** Trivial per-PLC allocation (~5 KB on a 54-PLC fleet, called once per shutdown). Optimising it would require exposing a single-field accessor on `ProxyCounters`; not worth the surface change. --- ## Resolved findings — what landed ### NC1 — `ProxyWorker.StopAsync` drain loop is structurally always-zero **Resolution:** `7a43595` snapshots `inFlightAtCancel = CountInFlight()` BEFORE calling `supervisor.StopAsync(...)`. The post-stop drain loop and `drainCts` are deleted entirely. The supervisor stop IS the drain — there's nothing to wait for that wouldn't be killed by the stop itself. `mbproxy.shutdown.complete` now reports a meaningful "requests dropped by stop" count. ### NM1 — `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable **Resolution:** `7a43595` bounds the wait with a 2-second teardown CTS. On timeout the body proceeds best-effort without the gate; `gateHeld` flag tracks whether we acquired it so the `finally` only releases when appropriate. `ObjectDisposedException` from a disposed semaphore short-circuits to a clean return. ### NM2 — `ReplaceContext` non-atomic ctx + provider swap **Resolution:** `7a43595` swapped the order: provider FIRST, then `_ctx`. Snapshots in the swap window now read either (old, old) or (new, new) — never (old-after-disposed). ### NM3 — `_supervisorCts` leaks across `StartAsync` re-entry **Resolution:** W4-followup. `StartAsync` now does `try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { }` before reassigning, with the catch covering the very-first-Start case where the field-init CTS is still fresh. ### NM4 — W2.15 TCS never re-armed (supervisor single-shot) **Resolution:** W4-followup. `_firstAttemptCompleted` is now non-readonly and re-created in `StartAsync` after the W2.16 guard. A re-Started supervisor's `WaitForInitialBindAttemptAsync` no longer observes the previous run's signal. ### NM5 — Self-cascade `ObjectDisposedException` after dispose **Resolution:** `7a43595` gated the writer + reader fault-path `_ = TearDownBackendAsync(...)` calls behind `if (!_disposeCts.IsCancellationRequested)`. DisposeAsync runs an explicit teardown anyway, so the fire-and-forget path was redundant *and* exception-throwing. ### Nm1 — Saturation cleanup uses `await SendResponseAsync` **Resolution:** `7a43595` replaced the per-party `await SendResponseAsync` with `TrySendResponse` and `IncrementResponseDropForFullUpstream` on drop. Same doctrine as W1.3. ### Nm3 — `gracefulMs` used twice **Resolution:** Implicit — Wave 4's NC1 fix deleted the drain loop, so the budget is now used exactly once (supervisor stop). Worst-case shutdown is `gracefulMs + 2 s admin`. ### Nm6 — `_admin` lazy resolution silent null **Resolution:** W4-followup. `ProxyWorker.ExecuteAsync` now logs a Warning when `GetService()` returns null, surfacing a botched composition without blocking startup. Previous IHostedService registration would have hard-errored in DI; this preserves the loud-failure intent without forcing callers to register admin in unit-test hosts. ### Nm7 — `AdminEndpointHost.DisposeAsync` no `_disposed` guard **Resolution:** W4-followup. Added a `volatile bool _disposed` flag and a guard at the top of `DisposeAsync`; symmetry with `PlcMultiplexer`. ### T2 — `WatchdogVsResponse_Race` seeded `Random` fragility **Resolution:** `7a43595` replaced `new Random(12345) → rng.Next(350, 450)` with a counter-based alternation `(n & 1) == 1 ? 350 : 450`. Guaranteed 15 fast + 15 slow across 30 iterations regardless of runtime. ### T3 — `RemoveInheritedAppsettings` only on Build **Resolution:** W4-followup. `AfterTargets="Build;Publish"` and a second `Delete` against `$(PublishDir)appsettings.json` (guarded by `Condition="'$(PublishDir)' != ''"` so a plain Build doesn't trip it). ### T4 — Stale `TryAttachOrCreate_*` test names **Resolution:** W4-followup. Renamed three test methods to drop `Try` and `_ReturnsTrue_` to match the W3 `AttachOrCreate` (void-returning) signature. --- ## Accepted findings — what wasn't fixed and why ### Nm2 — `CoalescedHit` increments for late attachers that get exception 04 **Why accepted:** The counter is correct per the design contract (`coalescedHit + coalescedMiss = total FC03/FC04 requests`). A late attacher *did* coalesce onto a stub; the fact that the stub later delivered an exception instead of a real response is orthogonal to the coalesce-counting semantics. Decrementing on saturation cleanup would break the parity invariant that operator dashboards rely on. Documented inline at `PlcMultiplexer.cs:809-812` (W2.4 doc clarification covers this). ### Nm4 — `_lastBindError` preserved after clean cancellation **Why accepted:** Intentional. The field's contract is "last fault while running" — useful for operators triaging *why* a supervisor exited (was it a bind error vs a clean stop?). Clearing it on clean exit would lose forensic information. The `finally` block at `PlcListenerSupervisor.cs:435-436` already documents this with a comment. ### Nm5 — `EventLogBridge` no startup log of armed state **Why accepted:** The bridge is a Serilog sink and doesn't have an injectable logger (would require restructuring to use `Log.ForContext` or similar). The class doc at `EventLogBridge.cs:18-20` already tells operators "must be created by `install.ps1` before the service starts". The W2.23 caching means a missing source is silent, but the architectural alternative (log from the bridge constructor using `Log.Logger`) creates a startup-ordering dependency on Serilog being fully wired before any sink construction. Low value vs the surface change. ### Nm8 — `TearDownBackendAsync` log-noise on queued cascades **Why accepted:** Cosmetic. Each cascade logs its own `BackendDisconnected` event with its own counts; queued cascades on the gate would fire mostly-zero events. Log filtering is the operator's tool here. ### T1 — Reflection on private field names in race tests **Why accepted:** The two reflection sites (`DrainAllocator` in `PlcMultiplexerTests`, the runtime-fault test in `SupervisorTests`) are the only externally-impossible-to-hit primitives that prove the saturation/runtime-fault contracts. The alternatives are (a) introducing test-only `internal` accessors that pollute production code, (b) exposing fields via `[InternalsVisibleTo]` (which Mbproxy.csproj already does for the test project — but the *fields* themselves are private, not internal), or (c) skipping the tests. The reflection coupling is documented as a maintenance trap in the test code. A future rename refactor breaks at run-time, not compile-time — but the tests' xmldoc explicitly warns about this. --- ## Verified clean (sampled, not exhaustive) The original re-review listed the following as verified clean by inspection during the post-Wave-3 review pass; nothing in Wave 4 / W4-followup invalidated these: - **W2.3 ConcurrentDictionary migration** on `_supervisors` — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name). - **W2.1 coalescingAccessor propagation** — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor). - **W2.13 OOR check** — multiplication is bounded by the guard; even worst-case `9999 * 10_000 + 9999 = 99_989_999` fits in int32 without overflow. - **W2.14 byteCount validation** — strict `<` check passes a perfectly-sized PDU; trailing-byte case correct. - **W2.10 resolved-TTL re-check** — `BcdTagMapBuilder.Build` is called exactly once per PLC at validation; no duplicate work. - **W2.18 ConnectionOptions validation** — both `MbproxyOptionsValidator` and `ReloadValidator` reject `<= 0`; no bypass path. - **W3 `HasBadNibble` dedupe** — clean; the codec's internal helper is the single source of truth. - **W2.15 TCS signalled in every exit path** of `RunSupervisorAsync` — no hang on `WaitForInitialBindAttemptAsync` for the first run. (W4 / NM4 added the re-arm for subsequent Starts.) - **W2.17 `TransitionTo` lock contract** — both writers use it; `Snapshot` reads under the same lock; no torn triples. - **`TxIdAllocator.Release` double-call is benign** (`TxIdAllocator.cs:121-129` checks `if (_inUse[id])`); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe. - **W1.1 in-PDU snapshot consistency** — `OnUpstreamFrameAsync` reads `_ctx.Cache` and `_ctx.TagMap.ResolveCacheTtlMs` non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream `WithCurrentRequest` snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent. - **W2.7 cache-FC byte sourced from post-rewriter buffer** — correct; the rewriter never touches the FC byte but the source must remain `frame[…]` to capture the exception bit. ## Closed The 2026-05-14 review series — original review → 4 remediation waves → first re-review → wave 4 + followup → second re-review → wave 5 — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.