Files
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

26 KiB
Raw Permalink Blame History

Remediation Plan — 2026-05-14 Code Review Findings

Plan for resolving every finding raised in the 2026-05-14 review. Items are grouped into three waves by impact, with each item naming the file:line, the problem, the suggested change, and the regression test that should lock the fix in place.

Conventions:

  • "Wave 1" = ship before next deploy. Production-observable correctness defects.
  • "Wave 2" = next sprint. Contract gaps, concurrency primitives, telemetry accuracy.
  • "Wave 3" = opportunistic. Cleanup, dead code, doc, missing tests.
  • Effort estimates assume a developer already familiar with the multiplexer's lifecycle. ⚙ = code change, 📝 = doc change, = test addition.

Dependency map (Wave 1 items have load-bearing relationships):

W1.1 (context swap) ──┬──► W1.2 (cache.flushed emission)
                      ├──► W2.1 (ConfigReconciler coalescingAccessor)
                      ├──► W2.7 (skip-invalidation-while-recovering)
                      └──► Tests ✅ (cache flush on tag-list reload)

W1.5 (ShutdownCoordinator ownership) ──► W2.4 (IOptionsMonitor switch)
                                       └► W2.6 (drop hardcoded 5s deadline)

W2.3 (ConfigReconciler thread safety) ──► W3 test (concurrent reload assertion)

Other items are independent and can be parallelised.


Wave 1 — Critical Correctness (4 items)

W1.1 — Plumb context updates into the running PlcMultiplexer

File: Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109 and call sites at :497-509. Driven by Proxy/Supervision/PlcListenerSupervisor.cs:199-216.

Problem: ReplaceContextAsync swaps _currentContext on the supervisor, but PlcMultiplexer captured _ctx = perPlcContext at construction and never re-reads. So tag-list reloads, the cache-flush-on-tag-list contract, and any other context-derived state are visible only on the next listener fault. Worse, the old cache is disposed while the running mux still uses it.

Change: Make _ctx a volatile field on PlcMultiplexer (or wrap it in a reference type the supervisor can swap atomically). Add a public ReplaceContext(PerPlcContext newContext) method on the multiplexer; have ReplaceContextAsync call it before disposing the old context. Every place inside the multiplexer that reads _ctx.TagMap / _ctx.Cache / _ctx.CacheInvalidator must read the field fresh, not snapshot it into a local.

Tests:

  • PlcMultiplexerTests.ReplaceContext_NewTagMap_VisibleOnNextPdu — set up a mux, send one FC03, swap the context with a different tag map, send a second FC03, assert different rewriting on the second response.
  • PlcMultiplexerTests.ReplaceContext_FlushesOldCache — populate the old cache, swap context, assert next read goes to backend (cache miss, not a hit on the disposed cache).
  • HotReloadE2ETests.E2E_GlobalTagListReload_FlushesPlcCache — pre-seed cache via FC03, save modified appsettings.json, assert cacheEntryCount for that PLC drops to 0 and the next read goes through.

Effort: ~1 day. The mechanical change is small; the risk is making sure every read-site of _ctx.* is migrated and no field is hoisted into a constructor capture.


W1.2 — Fix coalescing factory leak ⚙

File: Proxy/Multiplexing/PlcMultiplexer.cs:692-733, Proxy/Multiplexing/InFlightByKeyMap.cs:62-83.

Problem: When _correlation.TryAdd or _allocator.TryAllocate fails inside the _inFlightByKey.TryAttachOrCreate factory, the code stores the stub InFlightRequest under the key. Late attachers join the stub's InterestedParties list. The post-lock cleanup at :724 only sends the saturation exception to the leader; late attachers wait forever.

Change: Either (a) inside the factory, do not return / store on failure — let TryAttachOrCreate propagate the error so no other thread sees the stub; or (b) on the cleanup path, walk inFlight.InterestedParties and deliver the exception 0x04 frame to every attached party before removing the entry. Option (b) is the smaller diff. Also remove the dead bool return from TryAttachOrCreate (InFlightByKeyMap.cs:62) — it always returns true.

Tests:

  • PlcMultiplexerTests.CoalescingFactoryFails_AllAttachersGetException — stub the allocator to return false, fire 5 concurrent identical FC03 reads, assert all 5 receive exception 0x04 with their original TxIds.
  • InFlightByKeyMapTests.AttachOrCreate_FactoryThrows_DoesNotStoreStub — drive the factory to throw and assert no entry remains in the map.

Effort: ~half day.


W1.3 — Decouple backend reader from per-upstream backpressure ⚙

File: Proxy/Multiplexing/PlcMultiplexer.cs:545.

Problem: await party.Pipe.SendResponseAsync(outFrame, ct) runs synchronously inside the single backend reader task. If one upstream's bounded response channel (capacity 16) is full, the reader stalls — every other client on that PLC stops receiving responses.

Change: Replace the synchronous await with a non-blocking TryWrite (drop-on-full, with a counter responseDropForFullUpstream), or fan out per-party with _ = Task.Run(() => party.Pipe.SendResponseAsync(...)). Drop-on-full is the cleaner semantics: a wedged upstream loses responses but the rest of the fleet keeps moving; the upstream's watchdog or socket closure will collect it. Add a counter for the dropped responses so it's observable on the status page.

Tests:

  • PlcMultiplexerTests.SlowUpstream_DoesNotStallPeerResponses — wedge one upstream's channel reader, fire reads from a second upstream, assert the second upstream receives responses on time and the wedged upstream's drop counter increments.

Effort: ~half day. Decide drop-vs-fanout first; the rest is mechanical.


W1.4 — Drain or replace _outboundChannel on cascade ⚙

File: Proxy/Multiplexing/PlcMultiplexer.cs:362-385,423-428.

Problem: When the backend writer faults at :380 and triggers a cascade, frames already enqueued in _outboundChannel are stranded. After reconnect, a new writer task drains them into the fresh backend with old proxy TxIds whose correlation entries were dropped by TearDownBackendAsync. The PLC responds; the reader silently drops the response (:423-428); upstream peers wait for the watchdog to time them out — observable as a multi-second pause for every client whose request was in flight at the cascade moment.

Change: On cascade, complete _outboundChannel.Writer.Complete() before tearing down the writer task. In EnsureBackendConnectedAsync, recreate the channel as part of bringing up a new backend connection. Alternative: drain-and-discard (while (TryRead) { releaseTxId; }) before re-running the writer.

Tests:

  • PlcMultiplexerTests.Cascade_StrandedOutboundFrames_AreDiscarded — fill the outbound channel, force a cascade mid-drain, reconnect, assert the next request gets a normal response and the stranded frames did not produce orphaned TxIds (counter responseToUnknownTxId stays at 0 for the new connection).

Effort: ~1 day. Care needed to not accidentally race the channel recreation against fast reconnects.


W1.5 — Resolve ShutdownCoordinator double-stop ⚙

File: Program.cs:60-66, Diagnostics/ShutdownCoordinator.cs:147, Proxy/ProxyWorker.cs:196-217, Admin/AdminEndpointHost.cs.

Problem: ProxyWorker and AdminEndpointHost are both IHostedService. Their StopAsync runs unconditionally during host shutdown. The coordinator on ApplicationStopping also calls StopAsync on both — so each is stopped twice. The drain logic in the coordinator therefore runs against a connection set that is already empty, and mbproxy.shutdown.complete always reports InFlightAtCancel = 0. The existing ShutdownE2ETests.cs:113 doesn't actually assert the coordinator ran.

Change: Pick one of:

  • (A) Make ProxyWorker and AdminEndpointHost not implement IHostedService — register them as singletons only — and let the coordinator drive their entire lifecycle (start in ApplicationStarted, stop in ApplicationStopping).
  • (B) Move the drain logic into ProxyWorker.StopAsync and delete the coordinator entirely. Simpler if the only thing the coordinator does is order + drain + admin-stop.

(B) is the smaller change and matches how worker services are typically structured.

Tests:

  • Update ShutdownE2ETests to assert mbproxy.shutdown.complete fires with a positive InFlightAtCancel when an in-flight request is dropped at deadline.
  • Assert mbproxy.shutdown.complete fires exactly once.

Effort: ~1 day if (B); ~2 days if (A) because every IHostedService gets unwound.


Wave 2 — Major Fixes (20 items)

Grouped by file/area for batched PRs.

Multiplexer / Concurrency

# File:line Problem Change Test
W2.1 Configuration/ConfigReconciler.cs:294-304,378-388Proxy/ProxyWorker.cs:129-130 Add/restart paths construct PlcListenerSupervisor without the coalescingAccessor ProxyWorker builds at startup. Hot-reload of ReadCoalescing.Enabled doesn't propagate to PLCs added or restarted via reload. Pass the same coalescingAccessor into ConfigReconciler.Attach and use it in the add/restart constructors. ConfigReconcilerTests.AddedPlc_HonoursCurrentReadCoalescingFlag — reload with Enabled=false, add a PLC, assert that PLC's reads do not coalesce.
W2.2 Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566; UpstreamPipe.cs:52,77 _disposed is non-volatile but read on the hot path. On weakly-ordered platforms (ARM) stale-false reads after disposal are permitted. Mark both fields volatile, or replace with Interlocked.CompareExchange(ref _disposedFlag, 1, 0) pattern. The whole project ships win-x64 today so this is correctness defense for future portability. (existing dispose tests are sufficient; verify via inspection.)
W2.3 Configuration/ConfigReconciler.cs:236-238,269-271,306,390 _supervisors Dictionary<,> is mutated from concurrent Task.Run continuations inside Task.WhenAll. Outer apply is serialised by a semaphore but the inner tasks are not. Either ConcurrentDictionary<,>, or collect results from Task.WhenAll first then mutate the dict on a single thread. Replace existing Concurrent_Reloads_Serialised with one that drives genuinely-concurrent add/remove via stub supervisors and asserts no KeyNotFoundException / corruption.
W2.4 Proxy/Multiplexing/PlcMultiplexer.cs:721-733 (counter parity); :626 (cache miss double-count) Counter semantics drift from the docs: saturation increments coalescedMissCount but produces no backend round-trip; cacheMiss is incremented even when the request later coalesces onto a peer's in-flight read. Two options: (a) restructure increments to fire only on the actual outcome (move IncrementCoalescedMiss after the successful send; demote cache-miss to "cache-eligible request" and add a separate cacheElevatedToBackend counter), or (b) leave behaviour unchanged and update docs/Reference/LogEvents.md + docs/design.md to match. (b) is the lower-effort fix. Either way, add PlcMultiplexerTests.CounterSemantics_DocumentedContract that asserts the chosen rule.
W2.5 Proxy/Multiplexing/PlcMultiplexer.cs:209,350 _disposeCts disposed while watchdog/cancel paths may still race against it. oldCts?.Dispose() at :350 races with reader's ct.IsCancellationRequested check. Wrap _disposeCts.Dispose() in a try { ... } catch (ObjectDisposedException) { } at :209; gate oldCts?.Dispose() at :350 behind the join of the reader task. (existing dispose tests cover the happy path; consider a stress test that interleaves Dispose with active requests.)
W2.6 Proxy/Multiplexing/PlcMultiplexer.cs:283 _connectGate SemaphoreSlim never disposed. Add to Dispose. (no test needed.)
W2.7 Proxy/Multiplexing/PlcMultiplexer.cs:468 Reads the FC byte from the post-rewriter buffer. Correct today (FC byte never rewritten) but fragile. Use inFlight.Fc from the request side. (existing tests cover.)

Cache / Hot-reload

# File:line Problem Change Test
W2.8 Proxy/Cache/CacheLogEvents.cs:67-76, supervisor reseat path mbproxy.cache.flushed event defined and documented but never emitted. After W1.1 the reseat path can call cache.Clear() instead of disposing — that's the natural emission point. After W1.1, replace cache.Dispose() in ReplaceContextAsync with cache.Clear(reason: "tag-list-reload") and have Clear emit the log event. ResponseCacheTests.Clear_EmitsFlushedEvent_WithReason; HotReloadE2ETests add an assertion that mbproxy.cache.flushed fires on the tag-list reload.
W2.9 Proxy/Multiplexing/PlcMultiplexer.cs:497-509 (currently no gate) "Skip cache invalidations when listener is recovering" (design.md:203) is unimplemented. Implicit gating holds today (no backend → no FC06/FC16 response → no invalidation), but is undocumented and will break if any new code path produces a write response off the live backend. Either (a) add an explicit check on SupervisorState.Recovering before calling _invalidator.Invalidate(...), or (b) accept the implicit gating and add a comment in PlcMultiplexer.cs near the invalidation call referencing design.md:203. (b) is simpler and matches reality. Doc-only fix → no test. (a) → fault-injection test that synthesises a response during recovery and asserts no invalidation.
W2.10 Configuration/ReloadValidator.cs:80 ReloadValidator runs BcdTagMapBuilder.Build and discards the result; the resolved per-tag CacheTtlMs after the per-PLC default fold is never re-checked against Cache.AllowLongTtl. After Build, walk result.Map.Values and re-run the long-TTL gate against resolved values. Add to ReloadValidatorTests: tag with explicit null TTL inheriting a per-PLC default of 60_001 ms is rejected when AllowLongTtl=false.

BCD Rewriter

# File:line Problem Change Test
W2.11 Bcd/BcdTagMapBuilder.cs:84-103 DuplicateAddress validation is unreachable — input collapses through a Dictionary first. Design contract "reject duplicate addresses within a single PLC's resolved list" is silently violated. Detect duplicates before the dictionary collapse: walk the Add list with a HashSet<ushort>, collect duplicates, return them as DuplicateAddress errors. Same for the merged Global Add Remove resolution where Add overrides Global at the same address (intentional width-override path stays valid; only collisions within Add itself are errors). BcdTagMapBuilderTests.Build_DuplicateAddressInAddList_ReportsError (currently the test acknowledges the gap).
W2.12 Bcd/BcdTagMapBuilder.cs:113-127 OverlappingHighRegister reports each pair twice (once from each direction). Track reported pairs in HashSet<(ushort,ushort)>, dedupe symmetric reports. BcdTagMapBuilderTests.Build_TwoOverlappingPairs_ReportsExactlyOneErrorPerPair.
W2.13 Proxy/BcdPduPipeline.cs:206-211 FC16 32-bit write silently mutates values when clientLow > 9999 — multiplication math accepts the OOR value and re-encodes a different number. Validate clientLow < 10_000 && clientHigh < 10_000 before the multiplication; on failure, increment invalidBcdWarnings and pass through raw. BcdPduPipelineTests.FC16_32Bit_ClientHighOrLowOOR_PassesThroughRaw_WithInvalidBcdWarning.
W2.14 Proxy/BcdPduPipeline.cs:159 FC16 byteCount field at pdu[5] is parsed but discarded. Malformed claims of qty=10 with 4 bytes of register data silently process half the request. Add if (pdu.Length < 6 + qty*2) return; at the top of ProcessFc16Request (and increment a counter or log). BcdPduPipelineTests.FC16_TruncatedRegisterData_PassesThroughRawWithoutHalfRewrite.

Supervisor

# File:line Problem Change Test
W2.15 Proxy/Supervision/PlcListenerSupervisor.cs:137-144 WaitForInitialBindAttemptAsync busy-polls every 10 ms, races a fast Stopped→Bound→Stopped, never exits if the supervisor task throws. Replace polling loop with a TaskCompletionSource<bool> set on the first state transition out of Stopped. SupervisorTests.WaitForInitialBindAttemptAsync_ResolvesEvenWhenBindThrows — drive a synthetic bind exception and assert the wait returns within ~50 ms.
W2.16 Proxy/Supervision/PlcListenerSupervisor.cs:126 _supervisorCts reassigned on StartAsync without disposing the previous instance. Add a guard (throw if not Stopped) or dispose the previous CTS before reassigning. (covered by inspection.)
W2.17 Proxy/Supervision/PlcListenerSupervisor.cs:177-180 Snapshot() reads three fields independently → status page can show inconsistent state/lastBindError/recoveryAttempts triples. Read into locals under a single lock, or use a single volatile struct that's swapped atomically. SupervisorTests.Snapshot_DuringTransition_IsInternallyConsistent — fire many transitions in parallel and assert no inconsistent triple ever observed.
W2.18 Options/ConnectionOptions.cs and MbproxyOptions.cs validators Connection.{BackendConnectTimeoutMs,BackendRequestTimeoutMs,GracefulShutdownTimeoutMs} are not validated for > 0. A misconfigured 0 produces immediate CancelAfter(0) failures. Add > 0 checks to MbproxyOptionsValidator and ReloadValidator. MbproxyOptionsBindingTests.Validation_RejectsZeroOrNegativeTimeouts.

Hosting / Diagnostics

# File:line Problem Change Test
W2.19 Diagnostics/ShutdownCoordinator.cs:93,102,118; Program.cs:42 Coordinator binds IOptions<MbproxyOptions> (singleton snapshot). A hot-reloaded GracefulShutdownTimeoutMs is silently ignored. Switch to IOptionsMonitor<MbproxyOptions>; read .CurrentValue.Connection.GracefulShutdownTimeoutMs inside ShutdownAsync. After W1.5 verify the coordinator still exists; if it was removed, this fix migrates to wherever drain logic lives. ShutdownE2ETests.E2E_HotReloadedShutdownTimeout_IsHonoured.
W2.20 Diagnostics/ShutdownCoordinator.cs:147 Hardcoded TimeSpan.FromSeconds(5) for supervisor stop deadline regardless of GracefulShutdownTimeoutMs. Use the configured value. (covered by W2.19's test if extended.)
W2.21 src/Mbproxy/appsettings.jsoninstall/mbproxy.config.template.json Shipped appsettings.json is an empty stub. Fresh Mbproxy.exe from publish folder is a no-op service. Either mirror the install template into src/Mbproxy/appsettings.json, or set <None Update="appsettings.json"><CopyToOutputDirectory>Never</CopyToOutputDirectory></None> in the csproj so the publish folder doesn't ship a stub. HostSmokeTests.Host_StartsWithCheckedInAppsettings_AndAtLeastOneListenerBinds.
W2.22 Admin/StatusDto.cs:102, Admin/StatusSnapshotBuilder.cs:127-131 pdus.invalidBcdWarnings and backendExceptions.codeOther are tracked in CounterSnapshot but never surfaced via the JSON DTO or HTML renderer. Dead increment paths. Add the two fields to PlcPdusStatus / ExceptionCounts and wire them through StatusSnapshotBuilder and StatusHtmlRenderer. AdminEndpointTests.StatusJson_IncludesInvalidBcdWarnings_AfterPipelineWarning; analogous for codeOther.
W2.23 Diagnostics/EventLogBridge.cs:46 EventLog.SourceExists registry hit on every Error+ log line. Cache the result in a bool _sourceExists field set in the constructor (one-shot). Document that source registration changes after process start require a service restart. (covered by inspection; optional perf-sensitive E2E.)
W2.24 Proxy/ProxyWorker.cs:202-217 Hard-coded 5-second deadline in StopAsync. Coexists with the coordinator's deadline; reader of just ProxyWorker would miss the contract. If W1.5 keeps the coordinator: add a comment pointing at it. If W1.5 deletes the coordinator: replace the 5s with IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs. (covered by W2.19's test.)

Wave 3 — Cleanup, Doc-only, Test Gaps (low priority)

Cleanup (small ⚙)

  • Proxy/Multiplexing/PlcMultiplexer.cs:445-450 — dead elapsedMs calculation, never used. Remove.
  • Proxy/Multiplexing/UpstreamPipe.cs:262-263 — meaningless conditional firstRead && remaining == count ? false : false. Replace with return false.
  • Proxy/Multiplexing/InFlightByKeyMap.cs:62TryAttachOrCreate always returns true; drop the dead bool. (Resolved as part of W1.2.)
  • Bcd/BcdCodec.cs:103-110 and Proxy/BcdPduPipeline.cs:455-459 — duplicated HasBadNibble. Promote codec's helper to internal and call it from the pipeline.
  • Proxy/Supervision/PlcListenerSupervisor.cs:268,323,346 — copy-pasted exception-message truncation. Extract Truncate(string, int) helper.
  • Proxy/Multiplexing/PlcMultiplexer.cs:844-846 — comment says "1-second floor"; code uses 100 ms. Fix comment.
  • Configuration/ConfigReconciler.cs:428 — duplicated tag-delta computation. Either expose a delta count from ReloadPlan.Compute or drop the log enrichment.
  • Mbproxy.csproj:42 — comment describes Polly as "deferred" but it's actively used. Remove stale comment.
  • Admin/StatusSnapshotBuilder.cs:69 — unreachable ?.Address.ToString() fallback. Simplify to p.RemoteEp?.ToString() ?? "?".

Doc-only (📝)

  • README.md (operator-facing) — add a callout describing the 32-bit BCD wire format as "two base-10000 digits in CDAB", not standard binary CDAB Int32. Standard Modbus clients (NModbus, FluentModbus, Wonderware DAServer) will silently corrupt values > 9999 if they assume normal CDAB-int32 semantics. (BcdRewriter M1.)
  • docs/design.md and docs/Reference/LogEvents.md — clarify counter semantics for cache-miss-during-coalescing per the choice made in W2.4. (Mux M9.)
  • docs/Operations/Configuration.md — note that the Event Log sink is a custom EventLogBridge, not Serilog.Sinks.EventLog, so reviewers don't add a duplicate sink. (Hosting minor.)
  • Proxy/ResponseCache.md (or docs/Architecture/ResponseCache.md) — document the implicit invalidation-skip-during-recovery gating. (Cache M1 / Supervisor M1, paired with W2.9 option (b).)
  • docs/plan/README.md — add a phase entry for "Phase 12: Review remediation 2026-05-14" pointing at this file, so the phase history stays complete.

Test gaps ()

The following design contracts are unverified by any current test. Add them in priority order:

  1. ReloadValidatorTestsCache.AllowLongTtl=false rejects CacheTtlMs > 60_000 (per-tag and per-PLC-default forms). Paired with W2.10. (Cache test gap; design.md.)
  2. ReloadValidatorTestsCache.AllowLongTtl=true happy-path acceptance.
  3. HotReloadE2ETests — tag-list reload flushes the affected PLC's cache (paired with W1.1 + W2.8).
  4. SupervisorTests — replace placeholder Supervisor_RuntimeFault_TriggersRecovery with a test that genuinely faults the accept loop and asserts state goes Bound → Recovering → Bound. (Supervisor test gap.)
  5. PlcMultiplexerTests — TxId allocator saturation propagates a clean Modbus exception (0x06 or equivalent) to upstream, doesn't hang. (Mux test gap.)
  6. PlcMultiplexerTests — coalescing factory leak under saturation with multiple attachers. (Paired with W1.2.)
  7. PlcMultiplexerTests — backend-reader head-of-line block by a slow upstream. (Paired with W1.3.)
  8. PlcMultiplexerTests — watchdog↔response race (response arriving in the same tick as SnapshotOlderThan).
  9. PlcMultiplexerTests — cascade racing with a brand-new accept (does the new pipe survive or land in a zombie multiplexer mid-teardown?).
  10. HotReloadE2ETests — flip ReadCoalescing.Enabled at runtime (not start-up) and assert subsequent reads stop coalescing. (Paired with W2.1.)
  11. BcdCodecTestsEncode16(int.MinValue) doesn't blow up.
  12. BcdPduPipelineTests — write range ending exactly mid-32-bit-pair on the high address (inverse of existing low-side test).
  13. BcdPduPipelineTests — mixed 16-bit + 32-bit + non-BCD slots in a single FC03 read.
  14. BcdPduPipelineTests — FC16 response handling (currently only the request is tested).
  15. ProxyForwardingTests or similar — qty > 128 FC03 read passes through unchanged; PLC's exception 03 surfaces to the client.
  16. ConfigReconcilerTests — replace the soft "concurrent reloads serialised" test with one that drives genuinely-concurrent applies and asserts no overlap (using a fake supervisor with a "claim count" assertion).
  17. AdminEndpointTests — POST/PUT/DELETE/HEAD against / and /status.json return 405 (verifies no unintended methods are bound).

Out of Scope

Findings that this plan deliberately does not address:

  • Mux C2 / C3 (cache-hit-during-cascade visibility model). Per the design these are acceptable behaviours; the Multiplexer review downgraded them.
  • Any finding that recommended adding a new feature (e.g. an Mbproxy.AdminBindAddress config knob) or new defense-in-depth security hardening — those were trimmed from the reviews per direction on 2026-05-14.

Suggested Sequencing

A focused remediation pass might split as:

Sprint Items Outcome
Sprint 1 (~1 week) W1.1, W1.2, W1.3, W1.4 The four critical multiplexer/hot-reload defects are gone. Hot-reload finally works as documented.
Sprint 2 (~1 week) W1.5, W2.19, W2.20, W2.24 + the related test (ShutdownE2ETests) Shutdown path is single-owner and respects the configured timeout.
Sprint 3 (~1 week) W2.1, W2.3, W2.8, W2.9, W2.10, W2.21 + cache/supervisor tests All the major hot-reload + cache contract gaps closed.
Sprint 4 (~1 week) All remaining W2 items (BCD, supervisor, multiplexer minor) Counter accuracy, dispose hygiene, validator coverage.
Sprint 5 (~3 days) Wave 3 cleanup + doc + tests Suite + docs caught up.

After Sprint 1 the production risk profile drops sharply. After Sprint 3 every documented design contract is enforced. Wave 3 is opportunistic and can be folded into ongoing work.