Files

T

Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan

Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 05:15:34 -04:00

14 KiB

Raw Permalink Blame History

Listener Supervisor and Hot-Reload Subsystem — Code Review

Scope: src/Mbproxy/Proxy/Supervision/*, src/Mbproxy/Configuration/*, Options/MbproxyOptions.cs and friends, Proxy/ProxyWorker.cs. Cross-checked against docs/design.md (Listener auto-recovery, Configuration hot-reload, Startup posture, Failure modes), docs/Architecture/ConnectionModel.md, docs/Features/HotReload.md, docs/Operations/Configuration.md, docs/plan/05-listener-supervisor.md, docs/plan/06-hot-reload.md. Tests under tests/Mbproxy.Tests/Proxy/Supervision/* and tests/Mbproxy.Tests/Configuration/*.

Summary

The supervisor + reconciler are structurally sound: the Polly profiles match the design exactly (3-attempt backend connect at 100/500/2000 ms; 5-step recovery at 1/2/5/15/30 s + steady-state 30 s indefinitely), debounce + serialise are well-implemented, and the validator rejects every documented bad input.
One documented contract is unimplemented: "skip cache invalidations when the listener is recovering" (docs/design.md:203, docs/Features/HotReload.md cache section). Nothing in PlcMultiplexer consults supervisor state — there is no plumbing from the supervisor to the cache invalidator.
Several race / leak windows exist around ReplaceContextAsync (counter pointer not threaded into running multiplexer; old listener keeps running with stale cache after dispose) and WaitForInitialBindAttemptAsync (race between fast bind+stop and the wait method).
Test coverage of the supervisor is thin in places — no unit test exercises the "RunAsync returned without cancellation" recovery branch, and no test verifies lastBindError/recoveryAttempts after a runtime fault.

Critical findings

C1 — Reseat does NOT swap the multiplexer's context, so cache + tag-map changes do not actually take effect on existing connections (PlcListenerSupervisor.cs:199-216, PlcMultiplexer.cs:50 / :101 / :109). ReplaceContextAsync only writes _currentContext. Per its own comment (lines 188-191), "the very next PlcListener constructed inside the Polly loop … picks up newCtx. Existing in-flight upstream pipes served by the current multiplexer keep their reference." But PlcMultiplexer captures the context once at construction (_ctx = perPlcContext, line 101) and uses that captured reference for the response cache (_ctx.Cache in invalidation paths around lines 499-506) and the BCD rewriter map. The currently-running listener is _currentListener, which is never recreated by a reseat. So in steady state (no listener fault between reseats):

The cache-disposal logic at PlcListenerSupervisor.cs:205-213 disposes the old cache while the running multiplexer is still using it. Subsequent FC03/FC04 hits / invalidations on the running mux dereference a disposed cache (eviction timer cancelled, Dictionary still works but state is stale).
The "any tag-list reload flushes the affected PLC's whole cache" doctrine (CLAUDE.md:46, design.md:155) is only realised the next time the listener faults and Polly reconstructs a PlcListener. Until then, the running multiplexer keeps serving stale tag-map decisions and a soon-to-be-disposed cache.

This is the highest-impact gap. Either PlcMultiplexer needs a hot-swap path (volatile ctx ref it re-reads), or the reseat path must restart the multiplexer (not just swap a field on the supervisor that nobody reads in the hot path).

C2 — Counters are reset on Restart even when documentation says they are preserved on Reseat only (ConfigReconciler.cs:284). The Restart branch builds Counters = new Proxy.ProxyCounters(). That is intentional per HotReload.md:117 ("A restart … constructs a brand-new ProxyCounters because the supervisor itself is brand new."). However, service.config.reloadCount increments via RecordReloadApplied even when only validation succeeded but every restart task threw and was logged at Error (ConfigReconciler.cs:309-313). A reload where every restart fails still bumps reloadCount and clears the operator's history of those PLCs without warning. Consider either failing the whole apply, surfacing a partial-failure counter, or at minimum a warning-level event for partial-apply.

Major findings

M1 — "Skip cache invalidation while recovering" is unimplemented (design.md:203; verified absent in PlcMultiplexer.cs). Grep for recovering, SupervisorState.Recovering, or any equivalent in the multiplexer returns no matches. The design clearly states "Invalidations during a recovering listener state are skipped (the write never reached the backend, the cached read remains valid)". Today, a write that fails the listener but somehow still makes it to the cache invalidation path could clear good entries. In practice, when a listener is recovering the upstream socket is closed and no requests flow, so the bug surface is small — but it's still a documented contract violation.

M2 — WaitForInitialBindAttemptAsync has a race and busy-polls (PlcListenerSupervisor.cs:137-144). The 10 ms polling loop will never exit if the supervisor task throws (e.g., synchronous bind exception captured by Polly that immediately retries before the test grabs the snapshot). It also masks transitions: a supervisor that does Stopped → Bound → Stopped between polls (fast bind followed by StopAsync) will exit the loop on _supervisorTask.IsCompleted rather than ever reflecting Bound. Replace with a TaskCompletionSource set when the first state transition out of Stopped happens.

M3 — _supervisorCts reassignment on StartAsync leaks the previous CTS (PlcListenerSupervisor.cs:126). If StartAsync is called twice (or after StopAsync), the previously-disposed CTS is replaced without disposing the linked-token registration. Add either a guard (throw if not Stopped) or dispose the previous CTS before reassigning.

M4 — Snapshot() is non-atomic across three fields (PlcListenerSupervisor.cs:177-180). _state, _lastBindError, _recoveryAttempts are read independently; the supervisor task can transition mid-snapshot so the status page can report State=Bound with LastBindError from a prior attempt and RecoveryAttempts > 0 simultaneously. The fix is small (read into locals under a lock or under a single Interlocked.MemoryBarrier), and the design.md says "transitions atomic" implicitly via the bound/recovering/stopped machine.

M5 — ConfigReconciler.ApplyUnderLockAsync mutates _supervisors from multiple Task.Run continuations concurrently (ConfigReconciler.cs:236-238, :269-271, :306, :390). The Remove and Restart blocks each do _supervisors.Remove(name) / _supervisors[name] = newSupervisor from inside Task.WhenAll(... select async ...). Dictionary<TKey,TValue> is not thread-safe; the comment at line 50 says "All mutations happen inside ApplyAsync which is serialised by the semaphore" — but the semaphore serialises outer Apply calls, not the inner concurrent tasks. Two PLCs being added in parallel can corrupt the dictionary. Replace with ConcurrentDictionary or move dict mutations to a single thread (await Task.WhenAll(...) first, then mutate).

M6 — Per-restart task captures shared backendPipeline but does not honour ReadCoalescingOptions accessor (ConfigReconciler.cs:294-304, :378-388). The constructor argument list passed to new PlcListenerSupervisor(...) from the reconciler omits the coalescingAccessor that ProxyWorker.cs:129-130 constructs. Result: PLCs added or restarted via hot-reload always get the default ReadCoalescingOptions (Enabled=true, MaxParties=32) instead of the live monitor value, and a hot-reload of ReadCoalescing.Enabled does not propagate to those supervisors. This contradicts CLAUDE.md and design.md ("Hot-reload via Mbproxy.Resilience.ReadCoalescing.Enabled").

M7 — MbproxyOptions has no ConnectionOptions validation (MbproxyOptions.cs, ReloadValidator.cs). Neither validator rejects BackendConnectTimeoutMs <= 0, BackendRequestTimeoutMs <= 0, or GracefulShutdownTimeoutMs <= 0. A misconfigured 0 / negative value will produce a CancelAfter(0) that immediately fires and breaks every backend connect.

M8 — MbproxyOptionsValidator does not enforce cross-PLC checks but IValidateOptions runs at startup; reload uses ReloadValidator — this dual-path is fine, but ReloadValidator is missing the duplicate-address-in-resolved-tag-list test for the per-PLC Add list against itself when no global list collision occurs. BcdTagMapBuilder.Build does catch this (DuplicateAddress error), so the gap is plugged transitively. Worth a comment in ReloadValidator.cs:78-83 to make explicit that per-PLC duplicate detection is delegated.

M9 — ProxyWorker.StopAsync does NOT respect Connection.GracefulShutdownTimeoutMs (ProxyWorker.cs:202-217). Hard-coded 5-second deadline. The graceful-drain timeout lives in a separate ShutdownCoordinator (Phase 08); but the worker itself drops in-flight requests after 5s, before the coordinator's 10s default. Fine if both are wired together (see ShutdownCoordinator), but a reader of just ProxyWorker would miss the contract. Add a comment pointing at ShutdownCoordinator.

Minor findings

Min-1. RunSupervisorAsync truncates exception messages to 256 chars in two places with copy-pasted code (PlcListenerSupervisor.cs:268, :323, :346). Extract a helper Truncate(string, int).
Min-2. LogBindFailed uses EventId 41 and EventName mbproxy.startup.bind.failed, but PlcListener.cs:189 (Phase-pre-supervisor leftover) defines another LogListenerFaulted with EventId 22 and the same name mbproxy.listener.faulted as PlcListenerSupervisor.cs:413 (EventId 43). Two LoggerMessage event names should be unique per-name; duplicate names confuse log scrapers.
Min-3. PolicyFactory.cs:54 only retries ConnectionRefused, TimedOut, HostUnreachable, NetworkUnreachable. WSAEHOSTDOWN/HostDown is a common transient on Windows; consider adding it.
Min-4. PolicyFactory.cs:69 logs at Debug only — design.md:211 says retries log at Debug, but operators triaging would likely want the last retry attempt of a backend connect at Information. Consider using MaxAttempts - 1 == args.AttemptNumber to elevate the final attempt.
Min-5. ConfigReconciler.ComputeGlobalTagDelta (:428) recomputes a delta for the log line, duplicating logic that ReloadPlan.Compute already has internally via TagMapsEqual. Accept the slight cost or expose a delta count from ReloadPlan.
Test gap. SupervisorTests.cs (6 tests) is missing the "RunAsync returned cleanly without cancellation" branch (PlcListenerSupervisor.cs:340-350). No test exercises the "listener accept loop ended unexpectedly" recovery transition.
Test gap. ConfigReconcilerTests.cs:235 "concurrent reloads serialised" only verifies they all succeed; it does not actually assert serialisation (the comment admits the measurement isn't perfect). Drive a real concurrent test using a fake supervisor that asserts no overlap.
Test gap. No test covers the cache-disposal-while-multiplexer-still-uses-it scenario (C1) or the cross-thread _supervisors mutation (M5).

What looks good

Polly pipelines exactly match the design. Backend connect: BackendConnect = { MaxAttempts = 3, BackoffMs = [100, 500, 2000] } (ResilienceOptions.cs:5); listener recovery: [1000, 2000, 5000, 15000, 30000] then 30_000 indefinite (ResilienceOptions.cs:7-9). MaxRetryAttempts = int.MaxValue for listener recovery is correct (PolicyFactory.cs:101).
Cancellation honoured promptly on shutdown. _supervisorCts.CancelAsync() cancels both Polly's delay and the inner accept loop simultaneously (PlcListenerSupervisor.cs:159). StopAsync doc comment that it "completes within ~1 s regardless of backoff window size" is accurate because Polly v8's ExecuteAsync(ct) propagates the token through the delay generator.
Bind lifecycle is idempotent. Each Polly attempt creates a fresh PlcListener and disposes the failed one before re-entering the recovery delay (PlcListenerSupervisor.cs:264, :311, :320, :337). Only one _currentListener exists at any time. The half-bound port window is closed by await listener.DisposeAsync() before re-entering the retry.
Single code path for "bring up a listener." Both startup (ProxyWorker.cs:115-146) and reload-add (ConfigReconciler.cs:359-401) construct a PlcListenerSupervisor and call StartAsync — no second listener-init code path. Remove and Restart both go through supervisor.StopAsync + DisposeAsync. Matches design.md:213.
Validator returns all errors, not just the first. ReloadValidator.cs:32-117 runs every check, accumulates into errs, returns false only after every rule has been checked — operator sees every problem on a single save.
Debounce loop is correct. The drain pattern (ConfigReconciler.cs:154-171) handles editor rename-replace storms exactly as documented (250 ms quiescent window, last-write-wins via BoundedChannelFullMode.DropOldest).
TagMapsEqual includes CacheTtlMs (ReloadPlan.cs:113), so per-tag TTL changes correctly route to ToReseat — matches HotReload.md:70.
Graceful-shutdown integration via ShutdownCoordinator (separate file) handles drain-timeout properly even though ProxyWorker.StopAsync uses a hard 5 s.

Open questions

C1 (cache + tag-map not actually swapped on reseat) is the most consequential finding — is PerPlcContext actually intended to be hot-swapped on the running PlcMultiplexer, or is the current "swap is visible only on next listener fault" the intended (and undocumented) behaviour?
M6: Should the reconciler's add/restart paths receive the same coalescingAccessor ProxyWorker constructs? It looks like an oversight when Phase 10 (ReadCoalescing) was added after Phase 6 (hot-reload).
The status page surface for "supervisor is recovering" is documented (listener.state = recovering, listener.lastBindError, listener.recoveryAttempts) but the cache-skip-during-recovery clause needs a contract owner (M1). Was the design intent to thread the supervisor's snapshot into the multiplexer, or to gate via a different signal (e.g., backend-disconnect cascade flag)?

14 KiB Raw Permalink Blame History