Files
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

237 lines
26 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Remediation Plan — 2026-05-14 Code Review Findings
Plan for resolving every finding raised in the 2026-05-14 review. Items are grouped into three waves by impact, with each item naming the file:line, the problem, the suggested change, and the regression test that should lock the fix in place.
**Conventions:**
- "Wave 1" = ship before next deploy. Production-observable correctness defects.
- "Wave 2" = next sprint. Contract gaps, concurrency primitives, telemetry accuracy.
- "Wave 3" = opportunistic. Cleanup, dead code, doc, missing tests.
- Effort estimates assume a developer already familiar with the multiplexer's lifecycle. ⚙ = code change, 📝 = doc change, ✅ = test addition.
**Dependency map** (Wave 1 items have load-bearing relationships):
```
W1.1 (context swap) ──┬──► W1.2 (cache.flushed emission)
├──► W2.1 (ConfigReconciler coalescingAccessor)
├──► W2.7 (skip-invalidation-while-recovering)
└──► Tests ✅ (cache flush on tag-list reload)
W1.5 (ShutdownCoordinator ownership) ──► W2.4 (IOptionsMonitor switch)
└► W2.6 (drop hardcoded 5s deadline)
W2.3 (ConfigReconciler thread safety) ──► W3 test (concurrent reload assertion)
```
Other items are independent and can be parallelised.
---
## Wave 1 — Critical Correctness (4 items)
### W1.1 — Plumb context updates into the running `PlcMultiplexer` ⚙ ✅
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109` and call sites at `:497-509`. Driven by `Proxy/Supervision/PlcListenerSupervisor.cs:199-216`.
**Problem:** `ReplaceContextAsync` swaps `_currentContext` on the supervisor, but `PlcMultiplexer` captured `_ctx = perPlcContext` at construction and never re-reads. So tag-list reloads, the cache-flush-on-tag-list contract, and any other context-derived state are visible only on the next listener fault. Worse, the old cache is disposed while the running mux still uses it.
**Change:** Make `_ctx` a `volatile` field on `PlcMultiplexer` (or wrap it in a reference type the supervisor can swap atomically). Add a public `ReplaceContext(PerPlcContext newContext)` method on the multiplexer; have `ReplaceContextAsync` call it before disposing the old context. Every place inside the multiplexer that reads `_ctx.TagMap` / `_ctx.Cache` / `_ctx.CacheInvalidator` must read the field fresh, not snapshot it into a local.
**Tests:**
-`PlcMultiplexerTests.ReplaceContext_NewTagMap_VisibleOnNextPdu` — set up a mux, send one FC03, swap the context with a different tag map, send a second FC03, assert different rewriting on the second response.
-`PlcMultiplexerTests.ReplaceContext_FlushesOldCache` — populate the old cache, swap context, assert next read goes to backend (cache miss, not a hit on the disposed cache).
-`HotReloadE2ETests.E2E_GlobalTagListReload_FlushesPlcCache` — pre-seed cache via FC03, save modified `appsettings.json`, assert `cacheEntryCount` for that PLC drops to 0 and the next read goes through.
**Effort:** ~1 day. The mechanical change is small; the risk is making sure every read-site of `_ctx.*` is migrated and no field is hoisted into a constructor capture.
---
### W1.2 — Fix coalescing factory leak ⚙ ✅
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:692-733`, `Proxy/Multiplexing/InFlightByKeyMap.cs:62-83`.
**Problem:** When `_correlation.TryAdd` or `_allocator.TryAllocate` fails inside the `_inFlightByKey.TryAttachOrCreate` factory, the code stores the stub `InFlightRequest` under the key. Late attachers join the stub's `InterestedParties` list. The post-lock cleanup at `:724` only sends the saturation exception to the leader; late attachers wait forever.
**Change:** Either (a) inside the factory, do not return / store on failure — let `TryAttachOrCreate` propagate the error so no other thread sees the stub; or (b) on the cleanup path, walk `inFlight.InterestedParties` and deliver the exception 0x04 frame to every attached party before removing the entry. Option (b) is the smaller diff. Also remove the dead `bool` return from `TryAttachOrCreate` (`InFlightByKeyMap.cs:62`) — it always returns `true`.
**Tests:**
-`PlcMultiplexerTests.CoalescingFactoryFails_AllAttachersGetException` — stub the allocator to return false, fire 5 concurrent identical FC03 reads, assert all 5 receive exception 0x04 with their original TxIds.
-`InFlightByKeyMapTests.AttachOrCreate_FactoryThrows_DoesNotStoreStub` — drive the factory to throw and assert no entry remains in the map.
**Effort:** ~half day.
---
### W1.3 — Decouple backend reader from per-upstream backpressure ⚙ ✅
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:545`.
**Problem:** `await party.Pipe.SendResponseAsync(outFrame, ct)` runs synchronously inside the single backend reader task. If one upstream's bounded response channel (capacity 16) is full, the reader stalls — every other client on that PLC stops receiving responses.
**Change:** Replace the synchronous `await` with a non-blocking `TryWrite` (drop-on-full, with a counter `responseDropForFullUpstream`), or fan out per-party with `_ = Task.Run(() => party.Pipe.SendResponseAsync(...))`. Drop-on-full is the cleaner semantics: a wedged upstream loses responses but the rest of the fleet keeps moving; the upstream's watchdog or socket closure will collect it. Add a counter for the dropped responses so it's observable on the status page.
**Tests:**
-`PlcMultiplexerTests.SlowUpstream_DoesNotStallPeerResponses` — wedge one upstream's channel reader, fire reads from a second upstream, assert the second upstream receives responses on time and the wedged upstream's drop counter increments.
**Effort:** ~half day. Decide drop-vs-fanout first; the rest is mechanical.
---
### W1.4 — Drain or replace `_outboundChannel` on cascade ⚙ ✅
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:362-385,423-428`.
**Problem:** When the backend writer faults at `:380` and triggers a cascade, frames already enqueued in `_outboundChannel` are stranded. After reconnect, a new writer task drains them into the *fresh* backend with old proxy TxIds whose correlation entries were dropped by `TearDownBackendAsync`. The PLC responds; the reader silently drops the response (`:423-428`); upstream peers wait for the watchdog to time them out — observable as a multi-second pause for every client whose request was in flight at the cascade moment.
**Change:** On cascade, complete `_outboundChannel.Writer.Complete()` before tearing down the writer task. In `EnsureBackendConnectedAsync`, recreate the channel as part of bringing up a new backend connection. Alternative: drain-and-discard (`while (TryRead) { releaseTxId; }`) before re-running the writer.
**Tests:**
-`PlcMultiplexerTests.Cascade_StrandedOutboundFrames_AreDiscarded` — fill the outbound channel, force a cascade mid-drain, reconnect, assert the next request gets a normal response and the stranded frames did not produce orphaned TxIds (counter `responseToUnknownTxId` stays at 0 for the new connection).
**Effort:** ~1 day. Care needed to not accidentally race the channel recreation against fast reconnects.
---
### W1.5 — Resolve `ShutdownCoordinator` double-stop ⚙ ✅
**File:** `Program.cs:60-66`, `Diagnostics/ShutdownCoordinator.cs:147`, `Proxy/ProxyWorker.cs:196-217`, `Admin/AdminEndpointHost.cs`.
**Problem:** `ProxyWorker` and `AdminEndpointHost` are both `IHostedService`. Their `StopAsync` runs unconditionally during host shutdown. The coordinator on `ApplicationStopping` *also* calls `StopAsync` on both — so each is stopped twice. The drain logic in the coordinator therefore runs against a connection set that is already empty, and `mbproxy.shutdown.complete` always reports `InFlightAtCancel = 0`. The existing `ShutdownE2ETests.cs:113` doesn't actually assert the coordinator ran.
**Change:** Pick one of:
- **(A)** Make `ProxyWorker` and `AdminEndpointHost` *not* implement `IHostedService` — register them as singletons only — and let the coordinator drive their entire lifecycle (start in `ApplicationStarted`, stop in `ApplicationStopping`).
- **(B)** Move the drain logic into `ProxyWorker.StopAsync` and delete the coordinator entirely. Simpler if the only thing the coordinator does is order + drain + admin-stop.
(B) is the smaller change and matches how worker services are typically structured.
**Tests:**
- ✅ Update `ShutdownE2ETests` to assert `mbproxy.shutdown.complete` fires with a positive `InFlightAtCancel` when an in-flight request is dropped at deadline.
- ✅ Assert `mbproxy.shutdown.complete` fires exactly once.
**Effort:** ~1 day if (B); ~2 days if (A) because every IHostedService gets unwound.
---
## Wave 2 — Major Fixes (20 items)
Grouped by file/area for batched PRs.
### Multiplexer / Concurrency
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.1 | `Configuration/ConfigReconciler.cs:294-304,378-388``Proxy/ProxyWorker.cs:129-130` | Add/restart paths construct `PlcListenerSupervisor` without the `coalescingAccessor` `ProxyWorker` builds at startup. Hot-reload of `ReadCoalescing.Enabled` doesn't propagate to PLCs added or restarted via reload. | Pass the same `coalescingAccessor` into `ConfigReconciler.Attach` and use it in the add/restart constructors. | ✅ `ConfigReconcilerTests.AddedPlc_HonoursCurrentReadCoalescingFlag` — reload with `Enabled=false`, add a PLC, assert that PLC's reads do not coalesce. |
| W2.2 | `Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566`; `UpstreamPipe.cs:52,77` | `_disposed` is non-volatile but read on the hot path. On weakly-ordered platforms (ARM) stale-false reads after disposal are permitted. | Mark both fields `volatile`, or replace with `Interlocked.CompareExchange(ref _disposedFlag, 1, 0)` pattern. The whole project ships win-x64 today so this is correctness defense for future portability. | (existing dispose tests are sufficient; verify via inspection.) |
| W2.3 | `Configuration/ConfigReconciler.cs:236-238,269-271,306,390` | `_supervisors` `Dictionary<,>` is mutated from concurrent `Task.Run` continuations inside `Task.WhenAll`. Outer apply is serialised by a semaphore but the inner tasks are not. | Either `ConcurrentDictionary<,>`, or collect results from `Task.WhenAll` first then mutate the dict on a single thread. | ✅ Replace existing `Concurrent_Reloads_Serialised` with one that drives genuinely-concurrent add/remove via stub supervisors and asserts no `KeyNotFoundException` / corruption. |
| W2.4 | `Proxy/Multiplexing/PlcMultiplexer.cs:721-733` (counter parity); `:626` (cache miss double-count) | Counter semantics drift from the docs: saturation increments `coalescedMissCount` but produces no backend round-trip; `cacheMiss` is incremented even when the request later coalesces onto a peer's in-flight read. | Two options: (a) restructure increments to fire only on the actual outcome (move `IncrementCoalescedMiss` after the successful send; demote cache-miss to "cache-eligible request" and add a separate `cacheElevatedToBackend` counter), or (b) leave behaviour unchanged and update `docs/Reference/LogEvents.md` + `docs/design.md` to match. (b) is the lower-effort fix. | ✅ Either way, add `PlcMultiplexerTests.CounterSemantics_DocumentedContract` that asserts the chosen rule. |
| W2.5 | `Proxy/Multiplexing/PlcMultiplexer.cs:209,350` | `_disposeCts` disposed while watchdog/cancel paths may still race against it. `oldCts?.Dispose()` at `:350` races with reader's `ct.IsCancellationRequested` check. | Wrap `_disposeCts.Dispose()` in a `try { ... } catch (ObjectDisposedException) { }` at `:209`; gate `oldCts?.Dispose()` at `:350` behind the join of the reader task. | (existing dispose tests cover the happy path; consider a stress test that interleaves Dispose with active requests.) |
| W2.6 | `Proxy/Multiplexing/PlcMultiplexer.cs:283` | `_connectGate` `SemaphoreSlim` never disposed. | Add to `Dispose`. | (no test needed.) |
| W2.7 | `Proxy/Multiplexing/PlcMultiplexer.cs:468` | Reads the FC byte from the post-rewriter buffer. Correct today (FC byte never rewritten) but fragile. | Use `inFlight.Fc` from the request side. | (existing tests cover.) |
### Cache / Hot-reload
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.8 | `Proxy/Cache/CacheLogEvents.cs:67-76`, supervisor reseat path | `mbproxy.cache.flushed` event defined and documented but never emitted. After W1.1 the reseat path can call `cache.Clear()` instead of disposing — that's the natural emission point. | After W1.1, replace `cache.Dispose()` in `ReplaceContextAsync` with `cache.Clear(reason: "tag-list-reload")` and have `Clear` emit the log event. | ✅ `ResponseCacheTests.Clear_EmitsFlushedEvent_WithReason`; ✅ `HotReloadE2ETests` add an assertion that `mbproxy.cache.flushed` fires on the tag-list reload. |
| W2.9 | `Proxy/Multiplexing/PlcMultiplexer.cs:497-509` (currently no gate) | "Skip cache invalidations when listener is `recovering`" (`design.md:203`) is unimplemented. Implicit gating holds today (no backend → no FC06/FC16 response → no invalidation), but is undocumented and will break if any new code path produces a write response off the live backend. | Either (a) add an explicit check on `SupervisorState.Recovering` before calling `_invalidator.Invalidate(...)`, or (b) accept the implicit gating and add a comment in `PlcMultiplexer.cs` near the invalidation call referencing `design.md:203`. (b) is simpler and matches reality. | ✅ Doc-only fix → no test. (a) → fault-injection test that synthesises a response during recovery and asserts no invalidation. |
| W2.10 | `Configuration/ReloadValidator.cs:80` | `ReloadValidator` runs `BcdTagMapBuilder.Build` and discards the result; the resolved per-tag `CacheTtlMs` after the per-PLC default fold is never re-checked against `Cache.AllowLongTtl`. | After Build, walk `result.Map.Values` and re-run the long-TTL gate against resolved values. | ✅ Add to `ReloadValidatorTests`: tag with explicit null TTL inheriting a per-PLC default of 60_001 ms is rejected when `AllowLongTtl=false`. |
### BCD Rewriter
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.11 | `Bcd/BcdTagMapBuilder.cs:84-103` | `DuplicateAddress` validation is unreachable — input collapses through a `Dictionary` first. Design contract "reject duplicate addresses within a single PLC's resolved list" is silently violated. | Detect duplicates *before* the dictionary collapse: walk the `Add` list with a `HashSet<ushort>`, collect duplicates, return them as `DuplicateAddress` errors. Same for the merged `Global Add Remove` resolution where Add overrides Global at the same address (intentional width-override path stays valid; only collisions within Add itself are errors). | ✅ `BcdTagMapBuilderTests.Build_DuplicateAddressInAddList_ReportsError` (currently the test acknowledges the gap). |
| W2.12 | `Bcd/BcdTagMapBuilder.cs:113-127` | `OverlappingHighRegister` reports each pair twice (once from each direction). | Track reported pairs in `HashSet<(ushort,ushort)>`, dedupe symmetric reports. | ✅ `BcdTagMapBuilderTests.Build_TwoOverlappingPairs_ReportsExactlyOneErrorPerPair`. |
| W2.13 | `Proxy/BcdPduPipeline.cs:206-211` | FC16 32-bit write silently mutates values when `clientLow > 9999` — multiplication math accepts the OOR value and re-encodes a different number. | Validate `clientLow < 10_000 && clientHigh < 10_000` before the multiplication; on failure, increment `invalidBcdWarnings` and pass through raw. | ✅ `BcdPduPipelineTests.FC16_32Bit_ClientHighOrLowOOR_PassesThroughRaw_WithInvalidBcdWarning`. |
| W2.14 | `Proxy/BcdPduPipeline.cs:159` | FC16 `byteCount` field at `pdu[5]` is parsed but discarded. Malformed claims of `qty=10` with 4 bytes of register data silently process half the request. | Add `if (pdu.Length < 6 + qty*2) return;` at the top of `ProcessFc16Request` (and increment a counter or log). | ✅ `BcdPduPipelineTests.FC16_TruncatedRegisterData_PassesThroughRawWithoutHalfRewrite`. |
### Supervisor
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.15 | `Proxy/Supervision/PlcListenerSupervisor.cs:137-144` | `WaitForInitialBindAttemptAsync` busy-polls every 10 ms, races a fast `Stopped→Bound→Stopped`, never exits if the supervisor task throws. | Replace polling loop with a `TaskCompletionSource<bool>` set on the first state transition out of `Stopped`. | ✅ `SupervisorTests.WaitForInitialBindAttemptAsync_ResolvesEvenWhenBindThrows` — drive a synthetic bind exception and assert the wait returns within ~50 ms. |
| W2.16 | `Proxy/Supervision/PlcListenerSupervisor.cs:126` | `_supervisorCts` reassigned on `StartAsync` without disposing the previous instance. | Add a guard (`throw if not Stopped`) or dispose the previous CTS before reassigning. | (covered by inspection.) |
| W2.17 | `Proxy/Supervision/PlcListenerSupervisor.cs:177-180` | `Snapshot()` reads three fields independently → status page can show inconsistent state/lastBindError/recoveryAttempts triples. | Read into locals under a single lock, or use a single `volatile` struct that's swapped atomically. | ✅ `SupervisorTests.Snapshot_DuringTransition_IsInternallyConsistent` — fire many transitions in parallel and assert no inconsistent triple ever observed. |
| W2.18 | `Options/ConnectionOptions.cs` and `MbproxyOptions.cs` validators | `Connection.{BackendConnectTimeoutMs,BackendRequestTimeoutMs,GracefulShutdownTimeoutMs}` are not validated for `> 0`. A misconfigured 0 produces immediate `CancelAfter(0)` failures. | Add `> 0` checks to `MbproxyOptionsValidator` and `ReloadValidator`. | ✅ `MbproxyOptionsBindingTests.Validation_RejectsZeroOrNegativeTimeouts`. |
### Hosting / Diagnostics
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.19 | `Diagnostics/ShutdownCoordinator.cs:93,102,118`; `Program.cs:42` | Coordinator binds `IOptions<MbproxyOptions>` (singleton snapshot). A hot-reloaded `GracefulShutdownTimeoutMs` is silently ignored. | Switch to `IOptionsMonitor<MbproxyOptions>`; read `.CurrentValue.Connection.GracefulShutdownTimeoutMs` inside `ShutdownAsync`. After W1.5 verify the coordinator still exists; if it was removed, this fix migrates to wherever drain logic lives. | ✅ `ShutdownE2ETests.E2E_HotReloadedShutdownTimeout_IsHonoured`. |
| W2.20 | `Diagnostics/ShutdownCoordinator.cs:147` | Hardcoded `TimeSpan.FromSeconds(5)` for supervisor stop deadline regardless of `GracefulShutdownTimeoutMs`. | Use the configured value. | (covered by W2.19's test if extended.) |
| W2.21 | `src/Mbproxy/appsettings.json``install/mbproxy.config.template.json` | Shipped `appsettings.json` is an empty stub. Fresh `Mbproxy.exe` from publish folder is a no-op service. | Either mirror the install template into `src/Mbproxy/appsettings.json`, or set `<None Update="appsettings.json"><CopyToOutputDirectory>Never</CopyToOutputDirectory></None>` in the csproj so the publish folder doesn't ship a stub. | ✅ `HostSmokeTests.Host_StartsWithCheckedInAppsettings_AndAtLeastOneListenerBinds`. |
| W2.22 | `Admin/StatusDto.cs:102`, `Admin/StatusSnapshotBuilder.cs:127-131` | `pdus.invalidBcdWarnings` and `backendExceptions.codeOther` are tracked in `CounterSnapshot` but never surfaced via the JSON DTO or HTML renderer. Dead increment paths. | Add the two fields to `PlcPdusStatus` / `ExceptionCounts` and wire them through `StatusSnapshotBuilder` and `StatusHtmlRenderer`. | ✅ `AdminEndpointTests.StatusJson_IncludesInvalidBcdWarnings_AfterPipelineWarning`; ✅ analogous for `codeOther`. |
| W2.23 | `Diagnostics/EventLogBridge.cs:46` | `EventLog.SourceExists` registry hit on every Error+ log line. | Cache the result in a `bool _sourceExists` field set in the constructor (one-shot). Document that source registration changes after process start require a service restart. | (covered by inspection; optional perf-sensitive E2E.) |
| W2.24 | `Proxy/ProxyWorker.cs:202-217` | Hard-coded 5-second deadline in `StopAsync`. Coexists with the coordinator's deadline; reader of just `ProxyWorker` would miss the contract. | If W1.5 keeps the coordinator: add a comment pointing at it. If W1.5 deletes the coordinator: replace the 5s with `IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs`. | (covered by W2.19's test.) |
---
## Wave 3 — Cleanup, Doc-only, Test Gaps (low priority)
### Cleanup (small ⚙)
- `Proxy/Multiplexing/PlcMultiplexer.cs:445-450` — dead `elapsedMs` calculation, never used. Remove.
- `Proxy/Multiplexing/UpstreamPipe.cs:262-263` — meaningless conditional `firstRead && remaining == count ? false : false`. Replace with `return false`.
- `Proxy/Multiplexing/InFlightByKeyMap.cs:62``TryAttachOrCreate` always returns `true`; drop the dead `bool`. (Resolved as part of W1.2.)
- `Bcd/BcdCodec.cs:103-110` and `Proxy/BcdPduPipeline.cs:455-459` — duplicated `HasBadNibble`. Promote codec's helper to `internal` and call it from the pipeline.
- `Proxy/Supervision/PlcListenerSupervisor.cs:268,323,346` — copy-pasted exception-message truncation. Extract `Truncate(string, int)` helper.
- `Proxy/Multiplexing/PlcMultiplexer.cs:844-846` — comment says "1-second floor"; code uses 100 ms. Fix comment.
- `Configuration/ConfigReconciler.cs:428` — duplicated tag-delta computation. Either expose a delta count from `ReloadPlan.Compute` or drop the log enrichment.
- `Mbproxy.csproj:42` — comment describes Polly as "deferred" but it's actively used. Remove stale comment.
- `Admin/StatusSnapshotBuilder.cs:69` — unreachable `?.Address.ToString()` fallback. Simplify to `p.RemoteEp?.ToString() ?? "?"`.
### Doc-only (📝)
- `README.md` (operator-facing) — add a callout describing the 32-bit BCD wire format as "two base-10000 digits in CDAB", not standard binary CDAB Int32. Standard Modbus clients (NModbus, FluentModbus, Wonderware DAServer) will silently corrupt values > 9999 if they assume normal CDAB-int32 semantics. *(BcdRewriter M1.)*
- `docs/design.md` and `docs/Reference/LogEvents.md` — clarify counter semantics for cache-miss-during-coalescing per the choice made in W2.4. *(Mux M9.)*
- `docs/Operations/Configuration.md` — note that the Event Log sink is a custom `EventLogBridge`, not `Serilog.Sinks.EventLog`, so reviewers don't add a duplicate sink. *(Hosting minor.)*
- `Proxy/ResponseCache.md` (or `docs/Architecture/ResponseCache.md`) — document the implicit invalidation-skip-during-recovery gating. *(Cache M1 / Supervisor M1, paired with W2.9 option (b).)*
- `docs/plan/README.md` — add a phase entry for "Phase 12: Review remediation 2026-05-14" pointing at this file, so the phase history stays complete.
### Test gaps (✅)
The following design contracts are unverified by any current test. Add them in priority order:
1.`ReloadValidatorTests``Cache.AllowLongTtl=false` rejects `CacheTtlMs > 60_000` (per-tag and per-PLC-default forms). Paired with W2.10. *(Cache test gap; design.md.)*
2.`ReloadValidatorTests``Cache.AllowLongTtl=true` happy-path acceptance.
3.`HotReloadE2ETests` — tag-list reload flushes the affected PLC's cache (paired with W1.1 + W2.8).
4.`SupervisorTests` — replace placeholder `Supervisor_RuntimeFault_TriggersRecovery` with a test that genuinely faults the accept loop and asserts state goes `Bound → Recovering → Bound`. *(Supervisor test gap.)*
5.`PlcMultiplexerTests` — TxId allocator saturation propagates a clean Modbus exception (0x06 or equivalent) to upstream, doesn't hang. *(Mux test gap.)*
6.`PlcMultiplexerTests` — coalescing factory leak under saturation with multiple attachers. (Paired with W1.2.)
7.`PlcMultiplexerTests` — backend-reader head-of-line block by a slow upstream. (Paired with W1.3.)
8.`PlcMultiplexerTests` — watchdog↔response race (response arriving in the same tick as `SnapshotOlderThan`).
9.`PlcMultiplexerTests` — cascade racing with a brand-new accept (does the new pipe survive or land in a zombie multiplexer mid-teardown?).
10.`HotReloadE2ETests` — flip `ReadCoalescing.Enabled` at runtime (not start-up) and assert subsequent reads stop coalescing. (Paired with W2.1.)
11.`BcdCodecTests``Encode16(int.MinValue)` doesn't blow up.
12.`BcdPduPipelineTests` — write range ending exactly mid-32-bit-pair on the high address (inverse of existing low-side test).
13.`BcdPduPipelineTests` — mixed 16-bit + 32-bit + non-BCD slots in a single FC03 read.
14.`BcdPduPipelineTests` — FC16 *response* handling (currently only the request is tested).
15.`ProxyForwardingTests` or similar — `qty > 128` FC03 read passes through unchanged; PLC's exception 03 surfaces to the client.
16.`ConfigReconcilerTests` — replace the soft "concurrent reloads serialised" test with one that drives genuinely-concurrent applies and asserts no overlap (using a fake supervisor with a "claim count" assertion).
17.`AdminEndpointTests` — POST/PUT/DELETE/HEAD against `/` and `/status.json` return 405 (verifies no unintended methods are bound).
---
## Out of Scope
Findings that this plan deliberately does **not** address:
- **Mux C2 / C3** (cache-hit-during-cascade visibility model). Per the design these are acceptable behaviours; the Multiplexer review downgraded them.
- Any finding that recommended adding a new feature (e.g. an `Mbproxy.AdminBindAddress` config knob) or new defense-in-depth security hardening — those were trimmed from the reviews per direction on 2026-05-14.
---
## Suggested Sequencing
A focused remediation pass might split as:
| Sprint | Items | Outcome |
|--------|-------|---------|
| Sprint 1 (~1 week) | W1.1, W1.2, W1.3, W1.4 | The four critical multiplexer/hot-reload defects are gone. Hot-reload finally works as documented. |
| Sprint 2 (~1 week) | W1.5, W2.19, W2.20, W2.24 + the related test (ShutdownE2ETests) | Shutdown path is single-owner and respects the configured timeout. |
| Sprint 3 (~1 week) | W2.1, W2.3, W2.8, W2.9, W2.10, W2.21 + cache/supervisor tests | All the major hot-reload + cache contract gaps closed. |
| Sprint 4 (~1 week) | All remaining W2 items (BCD, supervisor, multiplexer minor) | Counter accuracy, dispose hygiene, validator coverage. |
| Sprint 5 (~3 days) | Wave 3 cleanup + doc + tests | Suite + docs caught up. |
After Sprint 1 the production risk profile drops sharply. After Sprint 3 every documented design contract is enforced. Wave 3 is opportunistic and can be folded into ongoing work.