wwtools/mbproxy/codereviews/2026-05-14/SupervisorAndHotReload.md

# Listener Supervisor and Hot-Reload Subsystem — Code Review

Scope: `src/Mbproxy/Proxy/Supervision/*`, `src/Mbproxy/Configuration/*`, `Options/MbproxyOptions.cs` and friends, `Proxy/ProxyWorker.cs`. Cross-checked against `docs/design.md` (Listener auto-recovery, Configuration hot-reload, Startup posture, Failure modes), `docs/Architecture/ConnectionModel.md`, `docs/Features/HotReload.md`, `docs/Operations/Configuration.md`, `docs/plan/05-listener-supervisor.md`, `docs/plan/06-hot-reload.md`. Tests under `tests/Mbproxy.Tests/Proxy/Supervision/*` and `tests/Mbproxy.Tests/Configuration/*`.

## Summary

- The supervisor + reconciler are structurally sound: the Polly profiles match the design exactly (3-attempt backend connect at 100/500/2000 ms; 5-step recovery at 1/2/5/15/30 s + steady-state 30 s indefinitely), debounce + serialise are well-implemented, and the validator rejects every documented bad input.
- One **documented contract is unimplemented**: "skip cache invalidations when the listener is `recovering`" (`docs/design.md:203`, `docs/Features/HotReload.md` cache section). Nothing in `PlcMultiplexer` consults supervisor state — there is no plumbing from the supervisor to the cache invalidator.
- Several **race / leak windows** exist around `ReplaceContextAsync` (counter pointer not threaded into running multiplexer; old listener keeps running with stale cache after dispose) and `WaitForInitialBindAttemptAsync` (race between fast bind+stop and the wait method).
- Test coverage of the supervisor is thin in places — no unit test exercises the "RunAsync returned without cancellation" recovery branch, and no test verifies `lastBindError`/`recoveryAttempts` after a runtime fault.

## Critical findings

**C1 — Reseat does NOT swap the multiplexer's context, so cache + tag-map changes do not actually take effect on existing connections** (`PlcListenerSupervisor.cs:199-216`, `PlcMultiplexer.cs:50` / `:101` / `:109`).
`ReplaceContextAsync` only writes `_currentContext`. Per its own comment (lines 188-191), "the very next `PlcListener` constructed inside the Polly loop … picks up `newCtx`. Existing in-flight upstream pipes served by the current multiplexer keep their reference." But `PlcMultiplexer` captures the context once at construction (`_ctx = perPlcContext`, line 101) and uses that captured reference for **the response cache** (`_ctx.Cache` in invalidation paths around lines 499-506) and **the BCD rewriter map**. The currently-running listener is `_currentListener`, which is never recreated by a reseat. So in steady state (no listener fault between reseats):
  - The cache-disposal logic at `PlcListenerSupervisor.cs:205-213` disposes the old cache while the running multiplexer is still **using** it. Subsequent FC03/FC04 hits / invalidations on the running mux dereference a disposed cache (eviction timer cancelled, `Dictionary` still works but state is stale).
  - The "any tag-list reload flushes the affected PLC's whole cache" doctrine (CLAUDE.md:46, design.md:155) is only realised the next time the listener faults and Polly reconstructs a `PlcListener`. Until then, the running multiplexer keeps serving stale tag-map decisions and a soon-to-be-disposed cache.

This is the highest-impact gap. Either `PlcMultiplexer` needs a hot-swap path (volatile ctx ref it re-reads), or the reseat path must restart the multiplexer (not just swap a field on the supervisor that nobody reads in the hot path).

**C2 — Counters are reset on Restart even when documentation says they are preserved on Reseat only** (`ConfigReconciler.cs:284`).
The Restart branch builds `Counters = new Proxy.ProxyCounters()`. That is intentional per `HotReload.md:117` ("A restart … constructs a brand-new `ProxyCounters` because the supervisor itself is brand new."). However, `service.config.reloadCount` increments via `RecordReloadApplied` even when only validation succeeded but every restart task **threw and was logged at Error** (`ConfigReconciler.cs:309-313`). A reload where every restart fails still bumps `reloadCount` and clears the operator's history of those PLCs without warning. Consider either failing the whole apply, surfacing a partial-failure counter, or at minimum a warning-level event for partial-apply.

## Major findings

**M1 — "Skip cache invalidation while recovering" is unimplemented** (design.md:203; verified absent in `PlcMultiplexer.cs`). Grep for `recovering`, `SupervisorState.Recovering`, or any equivalent in the multiplexer returns no matches. The design clearly states "Invalidations during a `recovering` listener state are skipped (the write never reached the backend, the cached read remains valid)". Today, a write that fails the listener but somehow still makes it to the cache invalidation path could clear good entries. In practice, when a listener is `recovering` the upstream socket is closed and no requests flow, so the bug surface is small — but it's still a documented contract violation.

**M2 — `WaitForInitialBindAttemptAsync` has a race and busy-polls** (`PlcListenerSupervisor.cs:137-144`). The 10 ms polling loop will never exit if the supervisor task throws (e.g., synchronous bind exception captured by Polly that immediately retries before the test grabs the snapshot). It also masks transitions: a supervisor that does `Stopped → Bound → Stopped` between polls (fast bind followed by `StopAsync`) will exit the loop on `_supervisorTask.IsCompleted` rather than ever reflecting `Bound`. Replace with a `TaskCompletionSource` set when the first state transition out of `Stopped` happens.

**M3 — `_supervisorCts` reassignment on `StartAsync` leaks the previous CTS** (`PlcListenerSupervisor.cs:126`). If `StartAsync` is called twice (or after `StopAsync`), the previously-disposed CTS is replaced without disposing the linked-token registration. Add either a guard (`throw if not Stopped`) or dispose the previous CTS before reassigning.

**M4 — `Snapshot()` is non-atomic across three fields** (`PlcListenerSupervisor.cs:177-180`). `_state`, `_lastBindError`, `_recoveryAttempts` are read independently; the supervisor task can transition mid-snapshot so the status page can report `State=Bound` with `LastBindError` from a prior attempt and `RecoveryAttempts > 0` simultaneously. The fix is small (read into locals under a lock or under a single `Interlocked.MemoryBarrier`), and the design.md says "transitions atomic" implicitly via the `bound`/`recovering`/`stopped` machine.

**M5 — `ConfigReconciler.ApplyUnderLockAsync` mutates `_supervisors` from multiple Task.Run continuations concurrently** (`ConfigReconciler.cs:236-238`, `:269-271`, `:306`, `:390`). The Remove and Restart blocks each do `_supervisors.Remove(name)` / `_supervisors[name] = newSupervisor` from inside `Task.WhenAll(... select async ...)`. `Dictionary<TKey,TValue>` is not thread-safe; the comment at line 50 says "All mutations happen inside ApplyAsync which is serialised by the semaphore" — but the semaphore serialises *outer* Apply calls, not the inner concurrent tasks. Two PLCs being added in parallel can corrupt the dictionary. Replace with `ConcurrentDictionary` or move dict mutations to a single thread (`await Task.WhenAll(...)` first, then mutate).

**M6 — Per-restart task captures shared `backendPipeline` but does not honour `ReadCoalescingOptions` accessor** (`ConfigReconciler.cs:294-304`, `:378-388`). The constructor argument list passed to `new PlcListenerSupervisor(...)` from the reconciler omits the `coalescingAccessor` that `ProxyWorker.cs:129-130` constructs. Result: PLCs added or restarted via hot-reload always get the **default** `ReadCoalescingOptions` (Enabled=true, MaxParties=32) instead of the live monitor value, **and** a hot-reload of `ReadCoalescing.Enabled` does not propagate to those supervisors. This contradicts CLAUDE.md and design.md ("Hot-reload via `Mbproxy.Resilience.ReadCoalescing.Enabled`").

**M7 — `MbproxyOptions` has no `ConnectionOptions` validation** (`MbproxyOptions.cs`, `ReloadValidator.cs`). Neither validator rejects `BackendConnectTimeoutMs <= 0`, `BackendRequestTimeoutMs <= 0`, or `GracefulShutdownTimeoutMs <= 0`. A misconfigured 0 / negative value will produce a `CancelAfter(0)` that immediately fires and breaks every backend connect.

**M8 — `MbproxyOptionsValidator` does not enforce cross-PLC checks but `IValidateOptions` runs at startup; reload uses `ReloadValidator`** — this dual-path is fine, but `ReloadValidator` is missing **the duplicate-address-in-resolved-tag-list** test for the per-PLC `Add` list against itself when no global list collision occurs. `BcdTagMapBuilder.Build` does catch this (DuplicateAddress error), so the gap is plugged transitively. Worth a comment in `ReloadValidator.cs:78-83` to make explicit that per-PLC duplicate detection is delegated.

**M9 — `ProxyWorker.StopAsync` does NOT respect `Connection.GracefulShutdownTimeoutMs`** (`ProxyWorker.cs:202-217`). Hard-coded 5-second deadline. The graceful-drain timeout lives in a separate `ShutdownCoordinator` (Phase 08); but the worker itself drops in-flight requests after 5s, before the coordinator's 10s default. Fine if both are wired together (see ShutdownCoordinator), but a reader of just `ProxyWorker` would miss the contract. Add a comment pointing at `ShutdownCoordinator`.

## Minor findings

- **Min-1.** `RunSupervisorAsync` truncates exception messages to 256 chars in two places with copy-pasted code (`PlcListenerSupervisor.cs:268`, `:323`, `:346`). Extract a helper `Truncate(string, int)`.
- **Min-2.** `LogBindFailed` uses `EventId 41` and `EventName mbproxy.startup.bind.failed`, but `PlcListener.cs:189` (Phase-pre-supervisor leftover) defines another `LogListenerFaulted` with `EventId 22` and the **same** name `mbproxy.listener.faulted` as `PlcListenerSupervisor.cs:413` (`EventId 43`). Two `LoggerMessage` event names should be unique per-name; duplicate names confuse log scrapers.
- **Min-3.** `PolicyFactory.cs:54` only retries `ConnectionRefused`, `TimedOut`, `HostUnreachable`, `NetworkUnreachable`. `WSAEHOSTDOWN`/`HostDown` is a common transient on Windows; consider adding it.
- **Min-4.** `PolicyFactory.cs:69` logs at Debug only — design.md:211 says retries log at Debug, but operators triaging would likely want the **last** retry attempt of a backend connect at Information. Consider using `MaxAttempts - 1 == args.AttemptNumber` to elevate the final attempt.
- **Min-5.** `ConfigReconciler.ComputeGlobalTagDelta` (`:428`) recomputes a delta for the log line, duplicating logic that `ReloadPlan.Compute` already has internally via `TagMapsEqual`. Accept the slight cost or expose a delta count from `ReloadPlan`.
- **Test gap.** `SupervisorTests.cs` (6 tests) is missing the "RunAsync returned cleanly without cancellation" branch (`PlcListenerSupervisor.cs:340-350`). No test exercises the "listener accept loop ended unexpectedly" recovery transition.
- **Test gap.** `ConfigReconcilerTests.cs:235` "concurrent reloads serialised" only verifies they all succeed; it does not actually assert serialisation (the comment admits the measurement isn't perfect). Drive a real concurrent test using a fake supervisor that asserts no overlap.
- **Test gap.** No test covers the cache-disposal-while-multiplexer-still-uses-it scenario (C1) or the cross-thread `_supervisors` mutation (M5).

## What looks good

- **Polly pipelines exactly match the design.** Backend connect: `BackendConnect = { MaxAttempts = 3, BackoffMs = [100, 500, 2000] }` (`ResilienceOptions.cs:5`); listener recovery: `[1000, 2000, 5000, 15000, 30000]` then `30_000` indefinite (`ResilienceOptions.cs:7-9`). `MaxRetryAttempts = int.MaxValue` for listener recovery is correct (`PolicyFactory.cs:101`).
- **Cancellation honoured promptly on shutdown.** `_supervisorCts.CancelAsync()` cancels both Polly's delay and the inner accept loop simultaneously (`PlcListenerSupervisor.cs:159`). `StopAsync` doc comment that it "completes within ~1 s regardless of backoff window size" is accurate because Polly v8's `ExecuteAsync(ct)` propagates the token through the delay generator.
- **Bind lifecycle is idempotent.** Each Polly attempt creates a fresh `PlcListener` and disposes the failed one before re-entering the recovery delay (`PlcListenerSupervisor.cs:264`, `:311`, `:320`, `:337`). Only one `_currentListener` exists at any time. The half-bound port window is closed by `await listener.DisposeAsync()` before re-entering the retry.
- **Single code path for "bring up a listener."** Both startup (`ProxyWorker.cs:115-146`) and reload-add (`ConfigReconciler.cs:359-401`) construct a `PlcListenerSupervisor` and call `StartAsync` — no second listener-init code path. Remove and Restart both go through `supervisor.StopAsync` + `DisposeAsync`. Matches design.md:213.
- **Validator returns all errors, not just the first.** `ReloadValidator.cs:32-117` runs every check, accumulates into `errs`, returns `false` only after every rule has been checked — operator sees every problem on a single save.
- **Debounce loop is correct.** The drain pattern (`ConfigReconciler.cs:154-171`) handles editor rename-replace storms exactly as documented (250 ms quiescent window, last-write-wins via `BoundedChannelFullMode.DropOldest`).
- **`TagMapsEqual` includes `CacheTtlMs`** (`ReloadPlan.cs:113`), so per-tag TTL changes correctly route to ToReseat — matches HotReload.md:70.
- **Graceful-shutdown integration via `ShutdownCoordinator`** (separate file) handles drain-timeout properly even though `ProxyWorker.StopAsync` uses a hard 5 s.

## Open questions

1. C1 (cache + tag-map not actually swapped on reseat) is the most consequential finding — is `PerPlcContext` actually intended to be hot-swapped on the running `PlcMultiplexer`, or is the current "swap is visible only on next listener fault" the intended (and undocumented) behaviour?
2. M6: Should the reconciler's add/restart paths receive the same `coalescingAccessor` `ProxyWorker` constructs? It looks like an oversight when Phase 10 (`ReadCoalescing`) was added after Phase 6 (hot-reload).
3. The status page surface for "supervisor is recovering" is documented (`listener.state = recovering`, `listener.lastBindError`, `listener.recoveryAttempts`) but the cache-skip-during-recovery clause needs a contract owner (M1). Was the design intent to thread the supervisor's snapshot into the multiplexer, or to gate via a different signal (e.g., backend-disconnect cascade flag)?