Closes the latent + minor + test-discipline items left after Wave 4. Updates
the re-review doc with a final resolution table — every actionable finding
now marked Resolved or Accepted with rationale.
NM3 — _supervisorCts leaks on re-Start
StartAsync now disposes the previous CTS before reassigning. Idempotent:
a try/catch (ObjectDisposedException) covers the very-first-Start case
where the field-init CTS is still fresh.
NM4 — W2.15 TCS is single-shot
_firstAttemptCompleted is no longer readonly; StartAsync re-creates it
after the W2.16 guard so a re-Started supervisor's
WaitForInitialBindAttemptAsync doesn't observe the previous run's signal.
Nm6 — _admin GetService<> returns null silently
ProxyWorker.ExecuteAsync now logs a Warning when admin isn't registered.
Preserves the loud-failure intent from the original IHostedService
registration without forcing test hosts to wire admin.
Nm7 — AdminEndpointHost.DisposeAsync no double-dispose guard
Added a volatile bool _disposed flag with an early-return at the top of
DisposeAsync. Symmetry with PlcMultiplexer; protects against
ProxyWorker.StopAsync explicitly disposing then DI disposing the singleton
again on host shutdown.
T3 — RemoveInheritedAppsettings only fires on Build
AfterTargets="Build;Publish" + a second Delete against $(PublishDir)
so a `dotnet publish` against the test csproj doesn't ship the example
PLCs from the linked install template.
T4 — Stale TryAttachOrCreate_*_ReturnsTrue_* test method names
Renamed to AttachOrCreate_*_WasNew{True,False} after W3 dropped the bool
return.
Accepted (with rationale documented in ReReviewAfterRemediation.md):
Nm2 — CoalescedHit semantic is per-design
Nm4 — _lastBindError preservation on clean exit is intentional forensics
Nm5 — EventLogBridge has no injectable logger
Nm8 — Cosmetic log noise
T1 — Reflection on private fields documented as maintenance trap
Tests: 387 pass / 0 fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
13 KiB
Re-Review After Remediation — 2026-05-14
Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → ce32c5c, Wave 2 → e66b17f, Wave 3 → 7ead358, the easy 5 → 2545237, the race-hard 5 → 53f842a). Conducted via three parallel area-focused passes. Eyes on what the fixes themselves introduced, not what the original review already found.
Scope: every src/ and tests/ change in 53a7111..HEAD (37 files, ~+2000/−700 lines).
Status
All actionable findings resolved. Wave 4 (
7a43595) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (this commit) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. Remaining items are Accepted with rationale (explained below) — they're deliberate trade-offs or doc-only nuances that don't warrant code changes.Final test count: 387 pass / 0 fail. Race tests stable across 3 isolated runs.
Headline
The remediation was structurally sound. The re-review found:
- 1 critical finding — the W1.5 drain loop inherited the very
ShutdownCoordinatorbug it was meant to replace. Resolved by Wave 4 (7a43595) snapshotting the in-flight count BEFORE supervisor stop and deleting the theatrical post-stop loop. - 5 major findings clustered in W1.4 cascade gating and W1.1 + W1.5 lifecycle ordering. All resolved by Wave 4 + Wave 4-followup.
- 8 minor findings + 4 test-discipline findings. Most resolved; the rest accepted with documented rationale.
Resolution table
| ID | Severity | Finding | Status | Commit |
|---|---|---|---|---|
| NC1 | Critical | ProxyWorker.StopAsync drain loop structurally always-zero — inherited the original ShutdownCoordinator bug |
✅ Resolved | 7a43595 |
| NM1 | Major | TearDownBackendAsync._connectGate.WaitAsync() uncancellable — disposal can be blocked indefinitely |
✅ Resolved | 7a43595 |
| NM2 | Major | ReplaceContext writes _ctx and re-registers cache stats provider non-atomically — snapshot can read a disposed cache |
✅ Resolved | 7a43595 |
| NM3 | Major | _supervisorCts leaks across StartAsync re-entry despite W2.16 guard |
✅ Resolved | (W4-followup) |
| NM4 | Major | W2.15 TCS never re-armed — supervisor effectively single-shot | ✅ Resolved | (W4-followup) |
| NM5 | Major | Self-cascade swallows ObjectDisposedException from _connectGate after disposal |
✅ Resolved | 7a43595 |
| Nm1 | Minor | Saturation cleanup uses await SendResponseAsync (blocking) per attached pipe |
✅ Resolved | 7a43595 |
| Nm2 | Minor | W1.2 increments CoalescedHit for late attachers that ultimately receive exception 04 |
⚪ Accepted | (doc) |
| Nm3 | Minor | Both supervisor-stop and drain shared gracefulMs budget |
✅ Resolved | 7a43595 (drain deleted) |
| Nm4 | Minor | finally preserves _lastBindError after clean cancellation |
⚪ Accepted | (by design) |
| Nm5 | Minor | EventLogBridge no startup log of armed state |
⚪ Accepted | (low value) |
| Nm6 | Minor | _admin lazy resolution returns null silently if registration absent |
✅ Resolved | (W4-followup) |
| Nm7 | Minor | AdminEndpointHost.DisposeAsync no _disposed guard |
✅ Resolved | (W4-followup) |
| Nm8 | Minor | TearDownBackendAsync cosmetic log-noise on queued cascades |
⚪ Accepted | (cosmetic) |
| T1 | Test | Reflection coupling on private field names | ⚪ Accepted | (commented in code) |
| T2 | Test | WatchdogVsResponse_Race seeded Random cross-runtime fragility |
✅ Resolved | 7a43595 |
| T3 | Test | RemoveInheritedAppsettings only fires on Build, not Publish |
✅ Resolved | (W4-followup) |
| T4 | Test | Stale TryAttachOrCreate_*_ReturnsTrue_* test method names after W3 dropped the bool |
✅ Resolved | (W4-followup) |
Resolved: 13/18. Accepted: 5/18.
Resolved findings — what landed
NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero
Resolution: 7a43595 snapshots inFlightAtCancel = CountInFlight() BEFORE calling supervisor.StopAsync(...). The post-stop drain loop and drainCts are deleted entirely. The supervisor stop IS the drain — there's nothing to wait for that wouldn't be killed by the stop itself. mbproxy.shutdown.complete now reports a meaningful "requests dropped by stop" count.
NM1 — TearDownBackendAsync._connectGate.WaitAsync() uncancellable
Resolution: 7a43595 bounds the wait with a 2-second teardown CTS. On timeout the body proceeds best-effort without the gate; gateHeld flag tracks whether we acquired it so the finally only releases when appropriate. ObjectDisposedException from a disposed semaphore short-circuits to a clean return.
NM2 — ReplaceContext non-atomic ctx + provider swap
Resolution: 7a43595 swapped the order: provider FIRST, then _ctx. Snapshots in the swap window now read either (old, old) or (new, new) — never (old-after-disposed).
NM3 — _supervisorCts leaks across StartAsync re-entry
Resolution: W4-followup. StartAsync now does try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { } before reassigning, with the catch covering the very-first-Start case where the field-init CTS is still fresh.
NM4 — W2.15 TCS never re-armed (supervisor single-shot)
Resolution: W4-followup. _firstAttemptCompleted is now non-readonly and re-created in StartAsync after the W2.16 guard. A re-Started supervisor's WaitForInitialBindAttemptAsync no longer observes the previous run's signal.
NM5 — Self-cascade ObjectDisposedException after dispose
Resolution: 7a43595 gated the writer + reader fault-path _ = TearDownBackendAsync(...) calls behind if (!_disposeCts.IsCancellationRequested). DisposeAsync runs an explicit teardown anyway, so the fire-and-forget path was redundant and exception-throwing.
Nm1 — Saturation cleanup uses await SendResponseAsync
Resolution: 7a43595 replaced the per-party await SendResponseAsync with TrySendResponse and IncrementResponseDropForFullUpstream on drop. Same doctrine as W1.3.
Nm3 — gracefulMs used twice
Resolution: Implicit — Wave 4's NC1 fix deleted the drain loop, so the budget is now used exactly once (supervisor stop). Worst-case shutdown is gracefulMs + 2 s admin.
Nm6 — _admin lazy resolution silent null
Resolution: W4-followup. ProxyWorker.ExecuteAsync now logs a Warning when GetService<AdminEndpointHost>() returns null, surfacing a botched composition without blocking startup. Previous IHostedService registration would have hard-errored in DI; this preserves the loud-failure intent without forcing callers to register admin in unit-test hosts.
Nm7 — AdminEndpointHost.DisposeAsync no _disposed guard
Resolution: W4-followup. Added a volatile bool _disposed flag and a guard at the top of DisposeAsync; symmetry with PlcMultiplexer.
T2 — WatchdogVsResponse_Race seeded Random fragility
Resolution: 7a43595 replaced new Random(12345) → rng.Next(350, 450) with a counter-based alternation (n & 1) == 1 ? 350 : 450. Guaranteed 15 fast + 15 slow across 30 iterations regardless of runtime.
T3 — RemoveInheritedAppsettings only on Build
Resolution: W4-followup. AfterTargets="Build;Publish" and a second Delete against $(PublishDir)appsettings.json (guarded by Condition="'$(PublishDir)' != ''" so a plain Build doesn't trip it).
T4 — Stale TryAttachOrCreate_* test names
Resolution: W4-followup. Renamed three test methods to drop Try and _ReturnsTrue_ to match the W3 AttachOrCreate (void-returning) signature.
Accepted findings — what wasn't fixed and why
Nm2 — CoalescedHit increments for late attachers that get exception 04
Why accepted: The counter is correct per the design contract (coalescedHit + coalescedMiss = total FC03/FC04 requests). A late attacher did coalesce onto a stub; the fact that the stub later delivered an exception instead of a real response is orthogonal to the coalesce-counting semantics. Decrementing on saturation cleanup would break the parity invariant that operator dashboards rely on. Documented inline at PlcMultiplexer.cs:809-812 (W2.4 doc clarification covers this).
Nm4 — _lastBindError preserved after clean cancellation
Why accepted: Intentional. The field's contract is "last fault while running" — useful for operators triaging why a supervisor exited (was it a bind error vs a clean stop?). Clearing it on clean exit would lose forensic information. The finally block at PlcListenerSupervisor.cs:435-436 already documents this with a comment.
Nm5 — EventLogBridge no startup log of armed state
Why accepted: The bridge is a Serilog sink and doesn't have an injectable logger (would require restructuring to use Log.ForContext or similar). The class doc at EventLogBridge.cs:18-20 already tells operators "must be created by install.ps1 before the service starts". The W2.23 caching means a missing source is silent, but the architectural alternative (log from the bridge constructor using Log.Logger) creates a startup-ordering dependency on Serilog being fully wired before any sink construction. Low value vs the surface change.
Nm8 — TearDownBackendAsync log-noise on queued cascades
Why accepted: Cosmetic. Each cascade logs its own BackendDisconnected event with its own counts; queued cascades on the gate would fire mostly-zero events. Log filtering is the operator's tool here.
T1 — Reflection on private field names in race tests
Why accepted: The two reflection sites (DrainAllocator in PlcMultiplexerTests, the runtime-fault test in SupervisorTests) are the only externally-impossible-to-hit primitives that prove the saturation/runtime-fault contracts. The alternatives are (a) introducing test-only internal accessors that pollute production code, (b) exposing fields via [InternalsVisibleTo] (which Mbproxy.csproj already does for the test project — but the fields themselves are private, not internal), or (c) skipping the tests. The reflection coupling is documented as a maintenance trap in the test code. A future rename refactor breaks at run-time, not compile-time — but the tests' xmldoc explicitly warns about this.
Verified clean (sampled, not exhaustive)
The original re-review listed the following as verified clean by inspection during the post-Wave-3 review pass; nothing in Wave 4 / W4-followup invalidated these:
- W2.3 ConcurrentDictionary migration on
_supervisors— all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name). - W2.1 coalescingAccessor propagation — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
- W2.13 OOR check — multiplication is bounded by the guard; even worst-case
9999 * 10_000 + 9999 = 99_989_999fits in int32 without overflow. - W2.14 byteCount validation — strict
<check passes a perfectly-sized PDU; trailing-byte case correct. - W2.10 resolved-TTL re-check —
BcdTagMapBuilder.Buildis called exactly once per PLC at validation; no duplicate work. - W2.18 ConnectionOptions validation — both
MbproxyOptionsValidatorandReloadValidatorreject<= 0; no bypass path. - W3
HasBadNibblededupe — clean; the codec's internal helper is the single source of truth. - W2.15 TCS signalled in every exit path of
RunSupervisorAsync— no hang onWaitForInitialBindAttemptAsyncfor the first run. (W4 / NM4 added the re-arm for subsequent Starts.) - W2.17
TransitionTolock contract — both writers use it;Snapshotreads under the same lock; no torn triples. TxIdAllocator.Releasedouble-call is benign (TxIdAllocator.cs:121-129checksif (_inUse[id])); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.- W1.1 in-PDU snapshot consistency —
OnUpstreamFrameAsyncreads_ctx.Cacheand_ctx.TagMap.ResolveCacheTtlMsnon-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. DownstreamWithCurrentRequestsnapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent. - W2.7 cache-FC byte sourced from post-rewriter buffer — correct; the rewriter never touches the FC byte but the source must remain
frame[…]to capture the exception bit.
Closed
The 2026-05-14 review series — original review → 4 remediation waves → re-review → wave-4 → wave-4-followup — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.