Files
wwtools/mbproxy/codereviews/2026-05-14/ReReviewAfterRemediation.md
T
Joseph Doherty 9251c564c1 mbproxy: resolve remaining items from ReReviewAfterRemediation.md
Closes the latent + minor + test-discipline items left after Wave 4. Updates
the re-review doc with a final resolution table — every actionable finding
now marked Resolved or Accepted with rationale.

NM3 — _supervisorCts leaks on re-Start
  StartAsync now disposes the previous CTS before reassigning. Idempotent:
  a try/catch (ObjectDisposedException) covers the very-first-Start case
  where the field-init CTS is still fresh.

NM4 — W2.15 TCS is single-shot
  _firstAttemptCompleted is no longer readonly; StartAsync re-creates it
  after the W2.16 guard so a re-Started supervisor's
  WaitForInitialBindAttemptAsync doesn't observe the previous run's signal.

Nm6 — _admin GetService<> returns null silently
  ProxyWorker.ExecuteAsync now logs a Warning when admin isn't registered.
  Preserves the loud-failure intent from the original IHostedService
  registration without forcing test hosts to wire admin.

Nm7 — AdminEndpointHost.DisposeAsync no double-dispose guard
  Added a volatile bool _disposed flag with an early-return at the top of
  DisposeAsync. Symmetry with PlcMultiplexer; protects against
  ProxyWorker.StopAsync explicitly disposing then DI disposing the singleton
  again on host shutdown.

T3 — RemoveInheritedAppsettings only fires on Build
  AfterTargets="Build;Publish" + a second Delete against $(PublishDir)
  so a `dotnet publish` against the test csproj doesn't ship the example
  PLCs from the linked install template.

T4 — Stale TryAttachOrCreate_*_ReturnsTrue_* test method names
  Renamed to AttachOrCreate_*_WasNew{True,False} after W3 dropped the bool
  return.

Accepted (with rationale documented in ReReviewAfterRemediation.md):
  Nm2 — CoalescedHit semantic is per-design
  Nm4 — _lastBindError preservation on clean exit is intentional forensics
  Nm5 — EventLogBridge has no injectable logger
  Nm8 — Cosmetic log noise
  T1  — Reflection on private fields documented as maintenance trap

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:02:21 -04:00

13 KiB
Raw Blame History

Re-Review After Remediation — 2026-05-14

Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → ce32c5c, Wave 2 → e66b17f, Wave 3 → 7ead358, the easy 5 → 2545237, the race-hard 5 → 53f842a). Conducted via three parallel area-focused passes. Eyes on what the fixes themselves introduced, not what the original review already found.

Scope: every src/ and tests/ change in 53a7111..HEAD (37 files, ~+2000/700 lines).

Status

All actionable findings resolved. Wave 4 (7a43595) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (this commit) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. Remaining items are Accepted with rationale (explained below) — they're deliberate trade-offs or doc-only nuances that don't warrant code changes.

Final test count: 387 pass / 0 fail. Race tests stable across 3 isolated runs.

Headline

The remediation was structurally sound. The re-review found:

  • 1 critical finding — the W1.5 drain loop inherited the very ShutdownCoordinator bug it was meant to replace. Resolved by Wave 4 (7a43595) snapshotting the in-flight count BEFORE supervisor stop and deleting the theatrical post-stop loop.
  • 5 major findings clustered in W1.4 cascade gating and W1.1 + W1.5 lifecycle ordering. All resolved by Wave 4 + Wave 4-followup.
  • 8 minor findings + 4 test-discipline findings. Most resolved; the rest accepted with documented rationale.

Resolution table

ID Severity Finding Status Commit
NC1 Critical ProxyWorker.StopAsync drain loop structurally always-zero — inherited the original ShutdownCoordinator bug Resolved 7a43595
NM1 Major TearDownBackendAsync._connectGate.WaitAsync() uncancellable — disposal can be blocked indefinitely Resolved 7a43595
NM2 Major ReplaceContext writes _ctx and re-registers cache stats provider non-atomically — snapshot can read a disposed cache Resolved 7a43595
NM3 Major _supervisorCts leaks across StartAsync re-entry despite W2.16 guard Resolved (W4-followup)
NM4 Major W2.15 TCS never re-armed — supervisor effectively single-shot Resolved (W4-followup)
NM5 Major Self-cascade swallows ObjectDisposedException from _connectGate after disposal Resolved 7a43595
Nm1 Minor Saturation cleanup uses await SendResponseAsync (blocking) per attached pipe Resolved 7a43595
Nm2 Minor W1.2 increments CoalescedHit for late attachers that ultimately receive exception 04 Accepted (doc)
Nm3 Minor Both supervisor-stop and drain shared gracefulMs budget Resolved 7a43595 (drain deleted)
Nm4 Minor finally preserves _lastBindError after clean cancellation Accepted (by design)
Nm5 Minor EventLogBridge no startup log of armed state Accepted (low value)
Nm6 Minor _admin lazy resolution returns null silently if registration absent Resolved (W4-followup)
Nm7 Minor AdminEndpointHost.DisposeAsync no _disposed guard Resolved (W4-followup)
Nm8 Minor TearDownBackendAsync cosmetic log-noise on queued cascades Accepted (cosmetic)
T1 Test Reflection coupling on private field names Accepted (commented in code)
T2 Test WatchdogVsResponse_Race seeded Random cross-runtime fragility Resolved 7a43595
T3 Test RemoveInheritedAppsettings only fires on Build, not Publish Resolved (W4-followup)
T4 Test Stale TryAttachOrCreate_*_ReturnsTrue_* test method names after W3 dropped the bool Resolved (W4-followup)

Resolved: 13/18. Accepted: 5/18.


Resolved findings — what landed

NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero

Resolution: 7a43595 snapshots inFlightAtCancel = CountInFlight() BEFORE calling supervisor.StopAsync(...). The post-stop drain loop and drainCts are deleted entirely. The supervisor stop IS the drain — there's nothing to wait for that wouldn't be killed by the stop itself. mbproxy.shutdown.complete now reports a meaningful "requests dropped by stop" count.

NM1 — TearDownBackendAsync._connectGate.WaitAsync() uncancellable

Resolution: 7a43595 bounds the wait with a 2-second teardown CTS. On timeout the body proceeds best-effort without the gate; gateHeld flag tracks whether we acquired it so the finally only releases when appropriate. ObjectDisposedException from a disposed semaphore short-circuits to a clean return.

NM2 — ReplaceContext non-atomic ctx + provider swap

Resolution: 7a43595 swapped the order: provider FIRST, then _ctx. Snapshots in the swap window now read either (old, old) or (new, new) — never (old-after-disposed).

NM3 — _supervisorCts leaks across StartAsync re-entry

Resolution: W4-followup. StartAsync now does try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { } before reassigning, with the catch covering the very-first-Start case where the field-init CTS is still fresh.

NM4 — W2.15 TCS never re-armed (supervisor single-shot)

Resolution: W4-followup. _firstAttemptCompleted is now non-readonly and re-created in StartAsync after the W2.16 guard. A re-Started supervisor's WaitForInitialBindAttemptAsync no longer observes the previous run's signal.

NM5 — Self-cascade ObjectDisposedException after dispose

Resolution: 7a43595 gated the writer + reader fault-path _ = TearDownBackendAsync(...) calls behind if (!_disposeCts.IsCancellationRequested). DisposeAsync runs an explicit teardown anyway, so the fire-and-forget path was redundant and exception-throwing.

Nm1 — Saturation cleanup uses await SendResponseAsync

Resolution: 7a43595 replaced the per-party await SendResponseAsync with TrySendResponse and IncrementResponseDropForFullUpstream on drop. Same doctrine as W1.3.

Nm3 — gracefulMs used twice

Resolution: Implicit — Wave 4's NC1 fix deleted the drain loop, so the budget is now used exactly once (supervisor stop). Worst-case shutdown is gracefulMs + 2 s admin.

Nm6 — _admin lazy resolution silent null

Resolution: W4-followup. ProxyWorker.ExecuteAsync now logs a Warning when GetService<AdminEndpointHost>() returns null, surfacing a botched composition without blocking startup. Previous IHostedService registration would have hard-errored in DI; this preserves the loud-failure intent without forcing callers to register admin in unit-test hosts.

Nm7 — AdminEndpointHost.DisposeAsync no _disposed guard

Resolution: W4-followup. Added a volatile bool _disposed flag and a guard at the top of DisposeAsync; symmetry with PlcMultiplexer.

T2 — WatchdogVsResponse_Race seeded Random fragility

Resolution: 7a43595 replaced new Random(12345) → rng.Next(350, 450) with a counter-based alternation (n & 1) == 1 ? 350 : 450. Guaranteed 15 fast + 15 slow across 30 iterations regardless of runtime.

T3 — RemoveInheritedAppsettings only on Build

Resolution: W4-followup. AfterTargets="Build;Publish" and a second Delete against $(PublishDir)appsettings.json (guarded by Condition="'$(PublishDir)' != ''" so a plain Build doesn't trip it).

T4 — Stale TryAttachOrCreate_* test names

Resolution: W4-followup. Renamed three test methods to drop Try and _ReturnsTrue_ to match the W3 AttachOrCreate (void-returning) signature.


Accepted findings — what wasn't fixed and why

Nm2 — CoalescedHit increments for late attachers that get exception 04

Why accepted: The counter is correct per the design contract (coalescedHit + coalescedMiss = total FC03/FC04 requests). A late attacher did coalesce onto a stub; the fact that the stub later delivered an exception instead of a real response is orthogonal to the coalesce-counting semantics. Decrementing on saturation cleanup would break the parity invariant that operator dashboards rely on. Documented inline at PlcMultiplexer.cs:809-812 (W2.4 doc clarification covers this).

Nm4 — _lastBindError preserved after clean cancellation

Why accepted: Intentional. The field's contract is "last fault while running" — useful for operators triaging why a supervisor exited (was it a bind error vs a clean stop?). Clearing it on clean exit would lose forensic information. The finally block at PlcListenerSupervisor.cs:435-436 already documents this with a comment.

Nm5 — EventLogBridge no startup log of armed state

Why accepted: The bridge is a Serilog sink and doesn't have an injectable logger (would require restructuring to use Log.ForContext or similar). The class doc at EventLogBridge.cs:18-20 already tells operators "must be created by install.ps1 before the service starts". The W2.23 caching means a missing source is silent, but the architectural alternative (log from the bridge constructor using Log.Logger) creates a startup-ordering dependency on Serilog being fully wired before any sink construction. Low value vs the surface change.

Nm8 — TearDownBackendAsync log-noise on queued cascades

Why accepted: Cosmetic. Each cascade logs its own BackendDisconnected event with its own counts; queued cascades on the gate would fire mostly-zero events. Log filtering is the operator's tool here.

T1 — Reflection on private field names in race tests

Why accepted: The two reflection sites (DrainAllocator in PlcMultiplexerTests, the runtime-fault test in SupervisorTests) are the only externally-impossible-to-hit primitives that prove the saturation/runtime-fault contracts. The alternatives are (a) introducing test-only internal accessors that pollute production code, (b) exposing fields via [InternalsVisibleTo] (which Mbproxy.csproj already does for the test project — but the fields themselves are private, not internal), or (c) skipping the tests. The reflection coupling is documented as a maintenance trap in the test code. A future rename refactor breaks at run-time, not compile-time — but the tests' xmldoc explicitly warns about this.


Verified clean (sampled, not exhaustive)

The original re-review listed the following as verified clean by inspection during the post-Wave-3 review pass; nothing in Wave 4 / W4-followup invalidated these:

  • W2.3 ConcurrentDictionary migration on _supervisors — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name).
  • W2.1 coalescingAccessor propagation — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
  • W2.13 OOR check — multiplication is bounded by the guard; even worst-case 9999 * 10_000 + 9999 = 99_989_999 fits in int32 without overflow.
  • W2.14 byteCount validation — strict < check passes a perfectly-sized PDU; trailing-byte case correct.
  • W2.10 resolved-TTL re-checkBcdTagMapBuilder.Build is called exactly once per PLC at validation; no duplicate work.
  • W2.18 ConnectionOptions validation — both MbproxyOptionsValidator and ReloadValidator reject <= 0; no bypass path.
  • W3 HasBadNibble dedupe — clean; the codec's internal helper is the single source of truth.
  • W2.15 TCS signalled in every exit path of RunSupervisorAsync — no hang on WaitForInitialBindAttemptAsync for the first run. (W4 / NM4 added the re-arm for subsequent Starts.)
  • W2.17 TransitionTo lock contract — both writers use it; Snapshot reads under the same lock; no torn triples.
  • TxIdAllocator.Release double-call is benign (TxIdAllocator.cs:121-129 checks if (_inUse[id])); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.
  • W1.1 in-PDU snapshot consistencyOnUpstreamFrameAsync reads _ctx.Cache and _ctx.TagMap.ResolveCacheTtlMs non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream WithCurrentRequest snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent.
  • W2.7 cache-FC byte sourced from post-rewriter buffer — correct; the rewriter never touches the FC byte but the source must remain frame[…] to capture the exception bit.

Closed

The 2026-05-14 review series — original review → 4 remediation waves → re-review → wave-4 → wave-4-followup — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.