Files

T

Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan

Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 05:15:34 -04:00

26 KiB

Raw Permalink Blame History

Remediation Plan — 2026-05-14 Code Review Findings

Plan for resolving every finding raised in the 2026-05-14 review. Items are grouped into three waves by impact, with each item naming the file:line, the problem, the suggested change, and the regression test that should lock the fix in place.

Conventions:

"Wave 1" = ship before next deploy. Production-observable correctness defects.
"Wave 2" = next sprint. Contract gaps, concurrency primitives, telemetry accuracy.
"Wave 3" = opportunistic. Cleanup, dead code, doc, missing tests.
Effort estimates assume a developer already familiar with the multiplexer's lifecycle. ⚙ = code change, 📝 = doc change, ✅ = test addition.

Dependency map (Wave 1 items have load-bearing relationships):

W1.1 (context swap) ──┬──► W1.2 (cache.flushed emission)
                      ├──► W2.1 (ConfigReconciler coalescingAccessor)
                      ├──► W2.7 (skip-invalidation-while-recovering)
                      └──► Tests ✅ (cache flush on tag-list reload)

W1.5 (ShutdownCoordinator ownership) ──► W2.4 (IOptionsMonitor switch)
                                       └► W2.6 (drop hardcoded 5s deadline)

W2.3 (ConfigReconciler thread safety) ──► W3 test (concurrent reload assertion)

Other items are independent and can be parallelised.

Wave 1 — Critical Correctness (4 items)

W1.1 — Plumb context updates into the running `PlcMultiplexer` ⚙ ✅

File: Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109 and call sites at :497-509. Driven by Proxy/Supervision/PlcListenerSupervisor.cs:199-216.

Problem: ReplaceContextAsync swaps _currentContext on the supervisor, but PlcMultiplexer captured _ctx = perPlcContext at construction and never re-reads. So tag-list reloads, the cache-flush-on-tag-list contract, and any other context-derived state are visible only on the next listener fault. Worse, the old cache is disposed while the running mux still uses it.

Change: Make _ctx a volatile field on PlcMultiplexer (or wrap it in a reference type the supervisor can swap atomically). Add a public ReplaceContext(PerPlcContext newContext) method on the multiplexer; have ReplaceContextAsync call it before disposing the old context. Every place inside the multiplexer that reads _ctx.TagMap / _ctx.Cache / _ctx.CacheInvalidator must read the field fresh, not snapshot it into a local.

Tests:

✅ PlcMultiplexerTests.ReplaceContext_NewTagMap_VisibleOnNextPdu — set up a mux, send one FC03, swap the context with a different tag map, send a second FC03, assert different rewriting on the second response.
✅ PlcMultiplexerTests.ReplaceContext_FlushesOldCache — populate the old cache, swap context, assert next read goes to backend (cache miss, not a hit on the disposed cache).
✅ HotReloadE2ETests.E2E_GlobalTagListReload_FlushesPlcCache — pre-seed cache via FC03, save modified appsettings.json, assert cacheEntryCount for that PLC drops to 0 and the next read goes through.

Effort: ~1 day. The mechanical change is small; the risk is making sure every read-site of _ctx.* is migrated and no field is hoisted into a constructor capture.

W1.2 — Fix coalescing factory leak ⚙ ✅

File: Proxy/Multiplexing/PlcMultiplexer.cs:692-733, Proxy/Multiplexing/InFlightByKeyMap.cs:62-83.

Problem: When _correlation.TryAdd or _allocator.TryAllocate fails inside the _inFlightByKey.TryAttachOrCreate factory, the code stores the stub InFlightRequest under the key. Late attachers join the stub's InterestedParties list. The post-lock cleanup at :724 only sends the saturation exception to the leader; late attachers wait forever.

Change: Either (a) inside the factory, do not return / store on failure — let TryAttachOrCreate propagate the error so no other thread sees the stub; or (b) on the cleanup path, walk inFlight.InterestedParties and deliver the exception 0x04 frame to every attached party before removing the entry. Option (b) is the smaller diff. Also remove the dead bool return from TryAttachOrCreate (InFlightByKeyMap.cs:62) — it always returns true.

Tests:

✅ PlcMultiplexerTests.CoalescingFactoryFails_AllAttachersGetException — stub the allocator to return false, fire 5 concurrent identical FC03 reads, assert all 5 receive exception 0x04 with their original TxIds.
✅ InFlightByKeyMapTests.AttachOrCreate_FactoryThrows_DoesNotStoreStub — drive the factory to throw and assert no entry remains in the map.

Effort: ~half day.

W1.3 — Decouple backend reader from per-upstream backpressure ⚙ ✅

File: Proxy/Multiplexing/PlcMultiplexer.cs:545.

Problem: await party.Pipe.SendResponseAsync(outFrame, ct) runs synchronously inside the single backend reader task. If one upstream's bounded response channel (capacity 16) is full, the reader stalls — every other client on that PLC stops receiving responses.

Change: Replace the synchronous await with a non-blocking TryWrite (drop-on-full, with a counter responseDropForFullUpstream), or fan out per-party with _ = Task.Run(() => party.Pipe.SendResponseAsync(...)). Drop-on-full is the cleaner semantics: a wedged upstream loses responses but the rest of the fleet keeps moving; the upstream's watchdog or socket closure will collect it. Add a counter for the dropped responses so it's observable on the status page.

Tests:

✅ PlcMultiplexerTests.SlowUpstream_DoesNotStallPeerResponses — wedge one upstream's channel reader, fire reads from a second upstream, assert the second upstream receives responses on time and the wedged upstream's drop counter increments.

Effort: ~half day. Decide drop-vs-fanout first; the rest is mechanical.

W1.4 — Drain or replace `_outboundChannel` on cascade ⚙ ✅

File: Proxy/Multiplexing/PlcMultiplexer.cs:362-385,423-428.

Problem: When the backend writer faults at :380 and triggers a cascade, frames already enqueued in _outboundChannel are stranded. After reconnect, a new writer task drains them into the fresh backend with old proxy TxIds whose correlation entries were dropped by TearDownBackendAsync. The PLC responds; the reader silently drops the response (:423-428); upstream peers wait for the watchdog to time them out — observable as a multi-second pause for every client whose request was in flight at the cascade moment.

Change: On cascade, complete _outboundChannel.Writer.Complete() before tearing down the writer task. In EnsureBackendConnectedAsync, recreate the channel as part of bringing up a new backend connection. Alternative: drain-and-discard (while (TryRead) { releaseTxId; }) before re-running the writer.

Tests:

✅ PlcMultiplexerTests.Cascade_StrandedOutboundFrames_AreDiscarded — fill the outbound channel, force a cascade mid-drain, reconnect, assert the next request gets a normal response and the stranded frames did not produce orphaned TxIds (counter responseToUnknownTxId stays at 0 for the new connection).

Effort: ~1 day. Care needed to not accidentally race the channel recreation against fast reconnects.

W1.5 — Resolve `ShutdownCoordinator` double-stop ⚙ ✅

File: Program.cs:60-66, Diagnostics/ShutdownCoordinator.cs:147, Proxy/ProxyWorker.cs:196-217, Admin/AdminEndpointHost.cs.

Problem: ProxyWorker and AdminEndpointHost are both IHostedService. Their StopAsync runs unconditionally during host shutdown. The coordinator on ApplicationStopping also calls StopAsync on both — so each is stopped twice. The drain logic in the coordinator therefore runs against a connection set that is already empty, and mbproxy.shutdown.complete always reports InFlightAtCancel = 0. The existing ShutdownE2ETests.cs:113 doesn't actually assert the coordinator ran.

Change: Pick one of:

(A) Make ProxyWorker and AdminEndpointHost not implement IHostedService — register them as singletons only — and let the coordinator drive their entire lifecycle (start in ApplicationStarted, stop in ApplicationStopping).
(B) Move the drain logic into ProxyWorker.StopAsync and delete the coordinator entirely. Simpler if the only thing the coordinator does is order + drain + admin-stop.

(B) is the smaller change and matches how worker services are typically structured.

Tests:

✅ Update ShutdownE2ETests to assert mbproxy.shutdown.complete fires with a positive InFlightAtCancel when an in-flight request is dropped at deadline.
✅ Assert mbproxy.shutdown.complete fires exactly once.

Effort: ~1 day if (B); ~2 days if (A) because every IHostedService gets unwound.

Wave 2 — Major Fixes (20 items)

Grouped by file/area for batched PRs.

Multiplexer / Concurrency

#	File:line	Problem	Change	Test
W2.1	`Configuration/ConfigReconciler.cs:294-304,378-388` ↔ `Proxy/ProxyWorker.cs:129-130`	Add/restart paths construct `PlcListenerSupervisor` without the `coalescingAccessor` `ProxyWorker` builds at startup. Hot-reload of `ReadCoalescing.Enabled` doesn't propagate to PLCs added or restarted via reload.	Pass the same `coalescingAccessor` into `ConfigReconciler.Attach` and use it in the add/restart constructors.	✅ `ConfigReconcilerTests.AddedPlc_HonoursCurrentReadCoalescingFlag` — reload with `Enabled=false`, add a PLC, assert that PLC's reads do not coalesce.
W2.2	`Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566`; `UpstreamPipe.cs:52,77`	`_disposed` is non-volatile but read on the hot path. On weakly-ordered platforms (ARM) stale-false reads after disposal are permitted.	Mark both fields `volatile`, or replace with `Interlocked.CompareExchange(ref _disposedFlag, 1, 0)` pattern. The whole project ships win-x64 today so this is correctness defense for future portability.	(existing dispose tests are sufficient; verify via inspection.)
W2.3	`Configuration/ConfigReconciler.cs:236-238,269-271,306,390`	`_supervisors` `Dictionary<,>` is mutated from concurrent `Task.Run` continuations inside `Task.WhenAll`. Outer apply is serialised by a semaphore but the inner tasks are not.	Either `ConcurrentDictionary<,>`, or collect results from `Task.WhenAll` first then mutate the dict on a single thread.	✅ Replace existing `Concurrent_Reloads_Serialised` with one that drives genuinely-concurrent add/remove via stub supervisors and asserts no `KeyNotFoundException` / corruption.
W2.4	`Proxy/Multiplexing/PlcMultiplexer.cs:721-733` (counter parity); `:626` (cache miss double-count)	Counter semantics drift from the docs: saturation increments `coalescedMissCount` but produces no backend round-trip; `cacheMiss` is incremented even when the request later coalesces onto a peer's in-flight read.	Two options: (a) restructure increments to fire only on the actual outcome (move `IncrementCoalescedMiss` after the successful send; demote cache-miss to "cache-eligible request" and add a separate `cacheElevatedToBackend` counter), or (b) leave behaviour unchanged and update `docs/Reference/LogEvents.md` + `docs/design.md` to match. (b) is the lower-effort fix.	✅ Either way, add `PlcMultiplexerTests.CounterSemantics_DocumentedContract` that asserts the chosen rule.
W2.5	`Proxy/Multiplexing/PlcMultiplexer.cs:209,350`	`_disposeCts` disposed while watchdog/cancel paths may still race against it. `oldCts?.Dispose()` at `:350` races with reader's `ct.IsCancellationRequested` check.	Wrap `_disposeCts.Dispose()` in a `try { ... } catch (ObjectDisposedException) { }` at `:209`; gate `oldCts?.Dispose()` at `:350` behind the join of the reader task.	(existing dispose tests cover the happy path; consider a stress test that interleaves Dispose with active requests.)
W2.6	`Proxy/Multiplexing/PlcMultiplexer.cs:283`	`_connectGate` `SemaphoreSlim` never disposed.	Add to `Dispose`.	(no test needed.)
W2.7	`Proxy/Multiplexing/PlcMultiplexer.cs:468`	Reads the FC byte from the post-rewriter buffer. Correct today (FC byte never rewritten) but fragile.	Use `inFlight.Fc` from the request side.	(existing tests cover.)

Cache / Hot-reload

#	File:line	Problem	Change	Test
W2.8	`Proxy/Cache/CacheLogEvents.cs:67-76`, supervisor reseat path	`mbproxy.cache.flushed` event defined and documented but never emitted. After W1.1 the reseat path can call `cache.Clear()` instead of disposing — that's the natural emission point.	After W1.1, replace `cache.Dispose()` in `ReplaceContextAsync` with `cache.Clear(reason: "tag-list-reload")` and have `Clear` emit the log event.	✅ `ResponseCacheTests.Clear_EmitsFlushedEvent_WithReason`; ✅ `HotReloadE2ETests` add an assertion that `mbproxy.cache.flushed` fires on the tag-list reload.
W2.9	`Proxy/Multiplexing/PlcMultiplexer.cs:497-509` (currently no gate)	"Skip cache invalidations when listener is `recovering`" (`design.md:203`) is unimplemented. Implicit gating holds today (no backend → no FC06/FC16 response → no invalidation), but is undocumented and will break if any new code path produces a write response off the live backend.	Either (a) add an explicit check on `SupervisorState.Recovering` before calling `_invalidator.Invalidate(...)`, or (b) accept the implicit gating and add a comment in `PlcMultiplexer.cs` near the invalidation call referencing `design.md:203`. (b) is simpler and matches reality.	✅ Doc-only fix → no test. (a) → fault-injection test that synthesises a response during recovery and asserts no invalidation.
W2.10	`Configuration/ReloadValidator.cs:80`	`ReloadValidator` runs `BcdTagMapBuilder.Build` and discards the result; the resolved per-tag `CacheTtlMs` after the per-PLC default fold is never re-checked against `Cache.AllowLongTtl`.	After Build, walk `result.Map.Values` and re-run the long-TTL gate against resolved values.	✅ Add to `ReloadValidatorTests`: tag with explicit null TTL inheriting a per-PLC default of 60_001 ms is rejected when `AllowLongTtl=false`.

BCD Rewriter

#	File:line	Problem	Change	Test
W2.11	`Bcd/BcdTagMapBuilder.cs:84-103`	`DuplicateAddress` validation is unreachable — input collapses through a `Dictionary` first. Design contract "reject duplicate addresses within a single PLC's resolved list" is silently violated.	Detect duplicates before the dictionary collapse: walk the `Add` list with a `HashSet<ushort>`, collect duplicates, return them as `DuplicateAddress` errors. Same for the merged `Global ∪ Add − Remove` resolution where Add overrides Global at the same address (intentional width-override path stays valid; only collisions within Add itself are errors).	✅ `BcdTagMapBuilderTests.Build_DuplicateAddressInAddList_ReportsError` (currently the test acknowledges the gap).
W2.12	`Bcd/BcdTagMapBuilder.cs:113-127`	`OverlappingHighRegister` reports each pair twice (once from each direction).	Track reported pairs in `HashSet<(ushort,ushort)>`, dedupe symmetric reports.	✅ `BcdTagMapBuilderTests.Build_TwoOverlappingPairs_ReportsExactlyOneErrorPerPair`.
W2.13	`Proxy/BcdPduPipeline.cs:206-211`	FC16 32-bit write silently mutates values when `clientLow > 9999` — multiplication math accepts the OOR value and re-encodes a different number.	Validate `clientLow < 10_000 && clientHigh < 10_000` before the multiplication; on failure, increment `invalidBcdWarnings` and pass through raw.	✅ `BcdPduPipelineTests.FC16_32Bit_ClientHighOrLowOOR_PassesThroughRaw_WithInvalidBcdWarning`.
W2.14	`Proxy/BcdPduPipeline.cs:159`	FC16 `byteCount` field at `pdu[5]` is parsed but discarded. Malformed claims of `qty=10` with 4 bytes of register data silently process half the request.	Add `if (pdu.Length < 6 + qty*2) return;` at the top of `ProcessFc16Request` (and increment a counter or log).	✅ `BcdPduPipelineTests.FC16_TruncatedRegisterData_PassesThroughRawWithoutHalfRewrite`.

Supervisor

#	File:line	Problem	Change	Test
W2.15	`Proxy/Supervision/PlcListenerSupervisor.cs:137-144`	`WaitForInitialBindAttemptAsync` busy-polls every 10 ms, races a fast `Stopped→Bound→Stopped`, never exits if the supervisor task throws.	Replace polling loop with a `TaskCompletionSource<bool>` set on the first state transition out of `Stopped`.	✅ `SupervisorTests.WaitForInitialBindAttemptAsync_ResolvesEvenWhenBindThrows` — drive a synthetic bind exception and assert the wait returns within ~50 ms.
W2.16	`Proxy/Supervision/PlcListenerSupervisor.cs:126`	`_supervisorCts` reassigned on `StartAsync` without disposing the previous instance.	Add a guard (`throw if not Stopped`) or dispose the previous CTS before reassigning.	(covered by inspection.)
W2.17	`Proxy/Supervision/PlcListenerSupervisor.cs:177-180`	`Snapshot()` reads three fields independently → status page can show inconsistent state/lastBindError/recoveryAttempts triples.	Read into locals under a single lock, or use a single `volatile` struct that's swapped atomically.	✅ `SupervisorTests.Snapshot_DuringTransition_IsInternallyConsistent` — fire many transitions in parallel and assert no inconsistent triple ever observed.
W2.18	`Options/ConnectionOptions.cs` and `MbproxyOptions.cs` validators	`Connection.{BackendConnectTimeoutMs,BackendRequestTimeoutMs,GracefulShutdownTimeoutMs}` are not validated for `> 0`. A misconfigured 0 produces immediate `CancelAfter(0)` failures.	Add `> 0` checks to `MbproxyOptionsValidator` and `ReloadValidator`.	✅ `MbproxyOptionsBindingTests.Validation_RejectsZeroOrNegativeTimeouts`.

Hosting / Diagnostics

#	File:line	Problem	Change	Test
W2.19	`Diagnostics/ShutdownCoordinator.cs:93,102,118`; `Program.cs:42`	Coordinator binds `IOptions<MbproxyOptions>` (singleton snapshot). A hot-reloaded `GracefulShutdownTimeoutMs` is silently ignored.	Switch to `IOptionsMonitor<MbproxyOptions>`; read `.CurrentValue.Connection.GracefulShutdownTimeoutMs` inside `ShutdownAsync`. After W1.5 verify the coordinator still exists; if it was removed, this fix migrates to wherever drain logic lives.	✅ `ShutdownE2ETests.E2E_HotReloadedShutdownTimeout_IsHonoured`.
W2.20	`Diagnostics/ShutdownCoordinator.cs:147`	Hardcoded `TimeSpan.FromSeconds(5)` for supervisor stop deadline regardless of `GracefulShutdownTimeoutMs`.	Use the configured value.	(covered by W2.19's test if extended.)
W2.21	`src/Mbproxy/appsettings.json` ↔ `install/mbproxy.config.template.json`	Shipped `appsettings.json` is an empty stub. Fresh `Mbproxy.exe` from publish folder is a no-op service.	Either mirror the install template into `src/Mbproxy/appsettings.json`, or set `<None Update="appsettings.json"><CopyToOutputDirectory>Never</CopyToOutputDirectory></None>` in the csproj so the publish folder doesn't ship a stub.	✅ `HostSmokeTests.Host_StartsWithCheckedInAppsettings_AndAtLeastOneListenerBinds`.
W2.22	`Admin/StatusDto.cs:102`, `Admin/StatusSnapshotBuilder.cs:127-131`	`pdus.invalidBcdWarnings` and `backendExceptions.codeOther` are tracked in `CounterSnapshot` but never surfaced via the JSON DTO or HTML renderer. Dead increment paths.	Add the two fields to `PlcPdusStatus` / `ExceptionCounts` and wire them through `StatusSnapshotBuilder` and `StatusHtmlRenderer`.	✅ `AdminEndpointTests.StatusJson_IncludesInvalidBcdWarnings_AfterPipelineWarning`; ✅ analogous for `codeOther`.
W2.23	`Diagnostics/EventLogBridge.cs:46`	`EventLog.SourceExists` registry hit on every Error+ log line.	Cache the result in a `bool _sourceExists` field set in the constructor (one-shot). Document that source registration changes after process start require a service restart.	(covered by inspection; optional perf-sensitive E2E.)
W2.24	`Proxy/ProxyWorker.cs:202-217`	Hard-coded 5-second deadline in `StopAsync`. Coexists with the coordinator's deadline; reader of just `ProxyWorker` would miss the contract.	If W1.5 keeps the coordinator: add a comment pointing at it. If W1.5 deletes the coordinator: replace the 5s with `IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs`.	(covered by W2.19's test.)

Wave 3 — Cleanup, Doc-only, Test Gaps (low priority)

Cleanup (small ⚙)

Proxy/Multiplexing/PlcMultiplexer.cs:445-450 — dead elapsedMs calculation, never used. Remove.
Proxy/Multiplexing/UpstreamPipe.cs:262-263 — meaningless conditional firstRead && remaining == count ? false : false. Replace with return false.
Proxy/Multiplexing/InFlightByKeyMap.cs:62 — TryAttachOrCreate always returns true; drop the dead bool. (Resolved as part of W1.2.)
Bcd/BcdCodec.cs:103-110 and Proxy/BcdPduPipeline.cs:455-459 — duplicated HasBadNibble. Promote codec's helper to internal and call it from the pipeline.
Proxy/Supervision/PlcListenerSupervisor.cs:268,323,346 — copy-pasted exception-message truncation. Extract Truncate(string, int) helper.
Proxy/Multiplexing/PlcMultiplexer.cs:844-846 — comment says "1-second floor"; code uses 100 ms. Fix comment.
Configuration/ConfigReconciler.cs:428 — duplicated tag-delta computation. Either expose a delta count from ReloadPlan.Compute or drop the log enrichment.
Mbproxy.csproj:42 — comment describes Polly as "deferred" but it's actively used. Remove stale comment.
Admin/StatusSnapshotBuilder.cs:69 — unreachable ?.Address.ToString() fallback. Simplify to p.RemoteEp?.ToString() ?? "?".

Doc-only (📝)

README.md (operator-facing) — add a callout describing the 32-bit BCD wire format as "two base-10000 digits in CDAB", not standard binary CDAB Int32. Standard Modbus clients (NModbus, FluentModbus, Wonderware DAServer) will silently corrupt values > 9999 if they assume normal CDAB-int32 semantics. (BcdRewriter M1.)
docs/design.md and docs/Reference/LogEvents.md — clarify counter semantics for cache-miss-during-coalescing per the choice made in W2.4. (Mux M9.)
docs/Operations/Configuration.md — note that the Event Log sink is a custom EventLogBridge, not Serilog.Sinks.EventLog, so reviewers don't add a duplicate sink. (Hosting minor.)
Proxy/ResponseCache.md (or docs/Architecture/ResponseCache.md) — document the implicit invalidation-skip-during-recovery gating. (Cache M1 / Supervisor M1, paired with W2.9 option (b).)
docs/plan/README.md — add a phase entry for "Phase 12: Review remediation 2026-05-14" pointing at this file, so the phase history stays complete.

Test gaps (✅)

The following design contracts are unverified by any current test. Add them in priority order:

✅ ReloadValidatorTests — Cache.AllowLongTtl=false rejects CacheTtlMs > 60_000 (per-tag and per-PLC-default forms). Paired with W2.10. (Cache test gap; design.md.)
✅ ReloadValidatorTests — Cache.AllowLongTtl=true happy-path acceptance.
✅ HotReloadE2ETests — tag-list reload flushes the affected PLC's cache (paired with W1.1 + W2.8).
✅ SupervisorTests — replace placeholder Supervisor_RuntimeFault_TriggersRecovery with a test that genuinely faults the accept loop and asserts state goes Bound → Recovering → Bound. (Supervisor test gap.)
✅ PlcMultiplexerTests — TxId allocator saturation propagates a clean Modbus exception (0x06 or equivalent) to upstream, doesn't hang. (Mux test gap.)
✅ PlcMultiplexerTests — coalescing factory leak under saturation with multiple attachers. (Paired with W1.2.)
✅ PlcMultiplexerTests — backend-reader head-of-line block by a slow upstream. (Paired with W1.3.)
✅ PlcMultiplexerTests — watchdog↔response race (response arriving in the same tick as SnapshotOlderThan).
✅ PlcMultiplexerTests — cascade racing with a brand-new accept (does the new pipe survive or land in a zombie multiplexer mid-teardown?).
✅ HotReloadE2ETests — flip ReadCoalescing.Enabled at runtime (not start-up) and assert subsequent reads stop coalescing. (Paired with W2.1.)
✅ BcdCodecTests — Encode16(int.MinValue) doesn't blow up.
✅ BcdPduPipelineTests — write range ending exactly mid-32-bit-pair on the high address (inverse of existing low-side test).
✅ BcdPduPipelineTests — mixed 16-bit + 32-bit + non-BCD slots in a single FC03 read.
✅ BcdPduPipelineTests — FC16 response handling (currently only the request is tested).
✅ ProxyForwardingTests or similar — qty > 128 FC03 read passes through unchanged; PLC's exception 03 surfaces to the client.
✅ ConfigReconcilerTests — replace the soft "concurrent reloads serialised" test with one that drives genuinely-concurrent applies and asserts no overlap (using a fake supervisor with a "claim count" assertion).
✅ AdminEndpointTests — POST/PUT/DELETE/HEAD against / and /status.json return 405 (verifies no unintended methods are bound).

Out of Scope

Findings that this plan deliberately does not address:

Mux C2 / C3 (cache-hit-during-cascade visibility model). Per the design these are acceptable behaviours; the Multiplexer review downgraded them.
Any finding that recommended adding a new feature (e.g. an Mbproxy.AdminBindAddress config knob) or new defense-in-depth security hardening — those were trimmed from the reviews per direction on 2026-05-14.

Suggested Sequencing

A focused remediation pass might split as:

Sprint	Items	Outcome
Sprint 1 (~1 week)	W1.1, W1.2, W1.3, W1.4	The four critical multiplexer/hot-reload defects are gone. Hot-reload finally works as documented.
Sprint 2 (~1 week)	W1.5, W2.19, W2.20, W2.24 + the related test (ShutdownE2ETests)	Shutdown path is single-owner and respects the configured timeout.
Sprint 3 (~1 week)	W2.1, W2.3, W2.8, W2.9, W2.10, W2.21 + cache/supervisor tests	All the major hot-reload + cache contract gaps closed.
Sprint 4 (~1 week)	All remaining W2 items (BCD, supervisor, multiplexer minor)	Counter accuracy, dispose hygiene, validator coverage.
Sprint 5 (~3 days)	Wave 3 cleanup + doc + tests	Suite + docs caught up.

After Sprint 1 the production risk profile drops sharply. After Sprint 3 every documented design contract is enforced. Wave 3 is opportunistic and can be folded into ongoing work.

26 KiB Raw Permalink Blame History Unescape Escape

Remediation Plan — 2026-05-14 Code Review Findings

Wave 1 — Critical Correctness (4 items)

W1.1 — Plumb context updates into the running PlcMultiplexer ⚙ ✅

W1.2 — Fix coalescing factory leak ⚙ ✅

W1.3 — Decouple backend reader from per-upstream backpressure ⚙ ✅

W1.4 — Drain or replace _outboundChannel on cascade ⚙ ✅

W1.5 — Resolve ShutdownCoordinator double-stop ⚙ ✅

Wave 2 — Major Fixes (20 items)

Multiplexer / Concurrency

Cache / Hot-reload

BCD Rewriter

Supervisor

Hosting / Diagnostics

Wave 3 — Cleanup, Doc-only, Test Gaps (low priority)

Cleanup (small ⚙)

Doc-only (📝)

Test gaps (✅)

Out of Scope

Suggested Sequencing

26 KiB

Raw Permalink Blame History

W1.1 — Plumb context updates into the running `PlcMultiplexer` ⚙ ✅

W1.4 — Drain or replace `_outboundChannel` on cascade ⚙ ✅

W1.5 — Resolve `ShutdownCoordinator` double-stop ⚙ ✅