mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan

Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-14 05:15:34 -04:00
parent 53a7111120
commit f2c6669444
9 changed files with 781 additions and 0 deletions
@@ -0,0 +1,52 @@
# Admin Endpoint & Diagnostics Subsystem — Code Review
Scope: `src/Mbproxy/Admin/*`, `src/Mbproxy/Diagnostics/*`, `src/Mbproxy/{ServiceCounters,Proxy/ProxyCounters}.cs`. Cross-checked against `docs/design.md` (Status page, Failure modes), `docs/Architecture/Overview.md`, `docs/Operations/StatusPage.md`, `docs/Operations/Troubleshooting.md`, `docs/Reference/LogEvents.md`. Tests under `tests/Mbproxy.Tests/Admin/*` + `tests/Mbproxy.Tests/Diagnostics/*`.
## Summary
- Endpoint surface is correctly minimal (two read-only routes), HTML escaping is sound, JSON shape matches the design spec, and the overall code quality is high. The `Interlocked`-only counter design is correct for x64.
- The `ShutdownCoordinator` ordering does **not** match its own doc — `ProxyWorker.StopAsync` already stops/disposes supervisors during host shutdown **before** the `ApplicationStopping` callback runs the coordinator, so by the time the coordinator polls `InFlightCount` everything is already 0.
- A subtle 32-bit `double`/`DateTimeOffset` torn-read issue exists in `ProxyCounters.UpdateRoundTripEwma` and `ServiceCounters.LastReloadUtc` that is benign on x64 but worth noting.
## Critical findings
**C1. ShutdownCoordinator never actually drains anything in production.** `Program.cs:6066` registers the coordinator on `ApplicationStopping`, but `ProxyWorker.StopAsync` (`src/Mbproxy/Proxy/ProxyWorker.cs:196217`) is `IHostedService` — the host runs `StopAsync` on every hosted service **before** firing `ApplicationStopped`. `ApplicationStopping` fires *first*, so the coordinator runs first — but the coordinator's "Step 1" calls `supervisor.StopAsync` again, which `ProxyWorker.StopAsync` will then call a *second* time. More importantly, the coordinator calls `_adminEndpoint.StopAsync` inside the lifetime callback, then the host immediately calls `AdminEndpointHost.StopAsync` again. Both supervisor and admin host get a redundant double-stop. The existing `ShutdownE2ETests.cs:41` only asserts upstream socket closure, not that the drain actually occurred — so the gap is invisible from tests. Recommended: either (a) make `ProxyWorker` and `AdminEndpointHost` *not* implement `IHostedService` and let the coordinator own their lifecycle, or (b) move the drain logic into `ProxyWorker.StopAsync` and delete the coordinator.
## Major findings
**M1. Coordinator step 1's 5-second deadline is independent of `GracefulShutdownTimeoutMs`.** `src/Mbproxy/Diagnostics/ShutdownCoordinator.cs:147` hardcodes `TimeSpan.FromSeconds(5)` for the supervisor stop deadline. If `GracefulShutdownTimeoutMs` is set to e.g. 30s, supervisor stop will still time-bound at 5s. The doc comment at lines 7684 implies the configured timeout governs the whole sequence.
**M2. EWMA double field has tearing potential on 32-bit hosts.** `src/Mbproxy/Proxy/ProxyCounters.cs:410` reads `Interlocked.Read(ref _lastRoundTripUsEwma)` correctly, but the value is divided by `1000.0` to produce a `double` returned in the snapshot. The underlying long is read atomically; that part is fine on any host. However, `LastReloadUtc` in `ServiceCounters.cs:31` constructs a new `DateTimeOffset` from `Interlocked.Read`'d ticks — also fine. **Conclusion: no actual tearing risk on x64**; the project is `RuntimeIdentifier=win-x64` (csproj line 25). Worth a one-line comment in `ProxyCounters.cs:192` noting "x64 only — long fields are atomic per ECMA-335 only at 8-byte alignment, and we ship win-x64."
**M3. `EventLogBridge.Emit` calls `EventLog.SourceExists` on every write.** `src/Mbproxy/Diagnostics/EventLogBridge.cs:46` — this is a registry hit per Error+ event. Cache the result in a field after the first call (one-shot at construction) to avoid the registry round-trip on every error-level log line under fault storms.
**M4. JSON DTO shape — `pdus.invalidBcdWarnings` and `backendExceptionOther` are tracked in `CounterSnapshot` but never surfaced.** `ProxyCounters.cs:218` increments `_invalidBcdWarnings`; `Snapshot()` returns it (line 399) but `StatusSnapshotBuilder.cs:127131` builds `PlcPdusStatus` without it. Same for `BackendExceptionOther` (`ProxyCounters.cs:232`, surfaced in snapshot at line 405, but not in `ExceptionCounts` at `StatusDto.cs:102`). These are the only two counters in code that don't appear on the wire. Either add `invalidBcdWarnings` to `PlcPdusStatus` and `codeOther` to `ExceptionCounts`, or remove the dead increment paths.
**M5. JSON field naming policy may not apply to nested `record` parameter names under source-gen.** `StatusDto.cs:117` uses `JsonKnownNamingPolicy.CamelCase` on the source-gen context. The E2E test (`AdminEndpointTests.cs:66, 91, 93`) confirms `uptimeSeconds`, `byFc`, `lastRoundTripMs` are emitted in camelCase, so this is verified working. No action needed; flagged because it's a known source-gen gotcha worth keeping a test for.
## Minor findings
**N1. `StatusSnapshotBuilder.cs:69` falls back to `p.RemoteEp?.Address.ToString()` after the `?.ToString()` — unreachable.** If `p.RemoteEp` is null the first `?.` returns null; the second `?.ToString()` also returns null → final `"?"`. The middle expression `p.RemoteEp?.Address.ToString()` can never be reached because if `p.RemoteEp` is non-null, the first expression already produced a string. Simplify to `p.RemoteEp?.ToString() ?? "?"`.
**N2. Auto-refresh of 5 s × 50 simultaneous viewers = 10 req/s.** Cheap (snapshot is in-memory, no I/O), but worth noting that the renderer allocates a fresh `StringBuilder(4096)` and `List<PlcStatus>(opts.Plcs.Count)` per request. For 54 PLCs and a sustained dashboard wall, that's ~10 ms of allocation per second — fine; flagging only as a future optimisation if memory profiling shows it.
**N3. `EventLogBridge.cs:57` truncation math is approximate**`message.Length * 2 > MaxMessageBytes` assumes 2 bytes per char (UTF-16), which is correct for the underlying string but the `EventLog.WriteEntry` API actually has a ~32 KB *character* limit, not byte. The current truncation is conservative (errs short) which is fine.
**N4. `AdminEndpointHost.cs:179` 2-second stop deadline is hardcoded; not configurable.** Acceptable but undocumented.
## What looks good
- **Bind-failure isolation.** `AdminEndpointHost.cs:159164` correctly catches non-cancellation exceptions from `app.StartAsync`, logs `mbproxy.admin.bind.failed` (event ID 71), and continues with `_app = null`. The proxy listeners are completely independent, exactly as designed. The E2E test at `AdminEndpointTests.cs:200246` validates this.
- **Hot-reload of `AdminPort`.** `AdminEndpointHost.cs:6389` uses a semaphore + double-check pattern that correctly serializes concurrent `OnChange` callbacks, with E2E coverage at `AdminEndpointTests.cs:251311`.
- **HTML escaping is applied at every dynamic-content insertion site:** `StatusHtmlRenderer.cs:51, 59, 96, 97, 108, 112, 128`. The XSS attack surface (PLC `Name`, `Host`, bind error messages, client `Remote`) is covered.
- **JSON serialization uses source generation** (`StatusJsonContext.Default.StatusResponse` at `AdminEndpointHost.cs:150`) — AOT-friendly, no reflection.
- **Counter design is well thought through.** `ProxyCounters.ObserveInFlight` (line 307) uses CAS for the high-water mark; `UpdateRoundTripEwma` uses fixed-point microseconds-as-long for atomic CAS on the EWMA (line 352). `ICacheStatsProvider` and `IMultiplexCountersProvider` are clean abstractions that decouple the snapshot path from the multiplexer's lifecycle.
- **`AssemblyVersionAccessor.cs:1923`** correctly uses `GetCustomAttribute` (works under single-file publish) with a sensible `"0.0.0"` fallback. `Mbproxy.csproj:12` sets `<InformationalVersion>1.0.0</InformationalVersion>` per CLAUDE.md.
- **Coordinator testability.** `ShutdownCoordinator.cs:115` second constructor accepts `ISupervisorHandle` and `IAdminEndpointHandle` abstractions, allowing the unit tests at `ShutdownCoordinatorTests.cs` to verify ordering, drain, timeout, and admin-stop without touching real Kestrel/listeners.
- **`EventLogBridge.cs:46`** correctly never tries to *create* the source — `install.ps1:1` registers it via `New-EventLog -Source 'mbproxy' -LogName 'Application'`, and the bridge silently no-ops if the source is missing (line 46). Cannot crash the service.
## Open questions
1. **C1 verification:** does the existing `ShutdownE2ETests.E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout` actually exercise the coordinator's drain path, or does it only verify host stop timing? The test at `ShutdownE2ETests.cs:113` doesn't read `mbproxy.shutdown.complete` to confirm the coordinator actually ran. Worth adding an assertion.
2. `StatusSnapshotBuilder.cs:67` materializes `clientSnapshots.ToList()` per snapshot. With 54 PLCs × ~4 clients each × 5-second refresh = 216 list allocations every 5s. Acceptable; flagging for memory-pressure dashboards.
3. The `ForwardingLoggerProvider` (`AdminEndpointHost.cs:218224`) creates a new logger per category but `Dispose()` is a no-op — confirm this doesn't leak loggers across hot-reload re-binds (each re-bind builds a new `WebApplication` and a new provider instance).
@@ -0,0 +1,68 @@
# BCD Rewriter Code Review
Scope: `src/Mbproxy/Bcd/*`, `src/Mbproxy/Proxy/{BcdPduPipeline,IPduPipeline,NoopPduPipeline,MbapFrame,RewriterLogEvents}.cs`, the corresponding `BcdTagOptions`/`BcdTagListOptions`, and the matching tests under `tests/Mbproxy.Tests/Bcd/*` and `tests/Mbproxy.Tests/Proxy/Bcd*Tests.cs`. Cross-checked against `docs/design.md` (BCD tag shape, Rewriter scope), `docs/Features/BcdRewriting.md`, and `DL260/dl205.md` (BCD encoding + CDAB word order).
## Summary
- The 16-bit and 32-bit BCD codec is mathematically correct and well-tested; bounds checks use the `(uint)x > Max` cast trick which correctly rejects negatives. Round-trip and CDAB ordering match the design contract.
- The 32-bit FC16 write/response handling on the wire is **NOT** standard CDAB binary int32 — it is a "two base-10000 digits in CDAB order" format (`value = high*10000 + low`). The codec and pipeline are internally consistent with the design contract, but this contract is unconventional and silently mis-translates any client that sends/expects a true binary int32 split CDAB. Worth a doc callout.
- `BcdTagMapBuilder` cannot detect duplicate addresses (the input collapses through a `Dictionary` first), so the `DuplicateAddress` validation path is effectively dead code; the test file even acknowledges this. The `OverlappingHighRegister` check, however, reports a single overlap as **two** errors (both directions of the pair).
- `MbapFrame` is intentionally a parser only — no bounds enforcement. Sound, but the proxy as a whole has no defence against a `length` field exceeding `MaxPduBodySize`; this needs to be enforced by the caller (out of scope for this review but worth flagging).
## Critical findings
None. No data-corruption or memory-safety bugs. No fence-post / off-by-one in the partial-overlap detection. The MBAP length invariant is preserved.
## Major findings
**M1. 32-bit FC16/FC03/FC04 wire format is not standard CDAB int32 (`BcdPduPipeline.cs:202-206, 373-379`).** The reconstruction `int binaryValue = clientHigh * 10_000 + clientLow;` and emission `decodedLow = decoded % 10_000; decodedHigh = decoded / 10_000;` together define a "two-base-10000-digits CDAB" format, **not** binary CDAB. This is **internally consistent** with `design.md:123` (`Decoded decimal = high * 10000 + low`), but it diverges from how every standard Modbus client (NModbus, FluentModbus, Wonderware DAServer) interprets a CDAB Int32. A client writing binary 12_345_678 with normal CDAB-int32 semantics sends `(low=0xBC4E=48206, high=0x00BC=188)`; the proxy then computes `188*10000 + 48206 = 2_128_206` and silently encodes that as BCD. This is a silent-corruption hazard that depends entirely on operator awareness of the contract. Recommend either documenting prominently in the *operator-facing* README (not just `design.md`/`BcdRewriting.md`) or reconsidering whether the wire format should be standard CDAB binary.
**M2. `BcdValidationError.DuplicateAddress` is unreachable (`BcdTagMapBuilder.cs:84-103`).** The working set is built by iterating into a `Dictionary<ushort, BcdTagOptions>` (lines 58-79), which deduplicates by key on the way in. The `seenAddresses.Add(addr)` call at line 98 then iterates *that* dictionary, where every key is already unique by construction. The test at `BcdTagMapBuilderTests.cs:101-137` documents this gap explicitly. The design's stated requirement ("reject duplicate addresses within a single PLC's resolved list" — `design.md:94`) is therefore not enforced; duplicate `Add` entries silently last-write-wins. Either drop the dead check, or detect duplicates before the dictionary collapse (iterate the lists, not the working dict).
**M3. `OverlappingHighRegister` double-reports each overlap (`BcdTagMapBuilder.cs:113-127`).** The collision check iterates every 32-bit tag and looks up the address-of-its-high-register. If two tags `(A=100, W=32)` and `(A=101, W=32)` collide on register 101, the loop reports the collision twice (once when processing the first 32-bit tag, once when processing the second). Worse, when a 32-bit pair `(A=100, W=32)` collides with a 16-bit `(A=101, W=16)`, the lookup hits the 16-bit tag and reports correctly — but if the *16-bit* tag's address happens to also be a 32-bit pair high register, it isn't separately checked from the 16-bit side. Recommend tracking reported pairs in a `HashSet<(ushort,ushort)>` to dedupe.
**M4. FC16 reconstruction can silently truncate/mutate values when `clientLow > 9999` (`BcdPduPipeline.cs:206-211`).** A client that writes `(clientLow=9999, clientHigh=9999)` (intending value 0x9999_9999 in binary CDAB) produces `binaryValue = 99_989_999`, which `Encode32` *accepts* and re-encodes as BCD `(low=0x9999, high=0x9998)`. The high word silently changed. The codec validates the value range, but the pipeline never validates that `clientLow < 10_000` and `clientHigh < 10_000` *before* running the formula. Recommend adding `if (clientLow > 9999 || clientHigh > 9999) → invalid_bcd, passthrough` before the multiplication.
**M5. FC16 `byteCount` field is not validated (`BcdPduPipeline.cs:159`).** `pdu[5]` is parsed but discarded. A malformed client claiming `qty=10` but sending only 4 bytes of register data is caught by the per-slot bounds check (line 191 / 234) and that slot is silently skipped. This is *safe* (no buffer overrun) but means non-BCD slots upstream of the bad slot still get processed and counted. Reasonable defensiveness, but a single explicit `if (pdu.Length < 6 + qty*2) return;` at the top would be cleaner and would prevent the misleading "successfully rewrote half a multi-write" outcome.
## Minor findings
**N1. `HasBadNibble` is duplicated** in `BcdCodec.cs:103-110` (private) and `BcdPduPipeline.cs:455-459`. They're identical. The pipeline's copy exists only to support the "which word was bad" reporting at `BcdPduPipeline.cs:364-365`. Either expose the codec's helper as `internal` or accept the duplication as a perf/locality choice and add a comment.
**N2. FC16 32-bit write `IncrementInvalidBcd` reports `clientLow` as the raw value (`BcdPduPipeline.cs:215-216`)**, not the reconstructed `binaryValue`. Operator looking at the log sees a 16-bit value when the actual problem is the 32-bit reconstruction. Use `binaryValue` (or both) in the message.
**N3. `BcdTagMap.s_emptyHits` is a singleton `Array.Empty<RangeHit>()` but `TryGetForRange` allocates a `List<RangeHit>(4)` on every hit (`BcdTagMap.cs:127`).** For a fleet doing 1000s of FC03/s on PLCs with even one BCD tag, that's a Gen-0 allocation per request. Consider stack-allocating a small `Span<RangeHit>` or a per-thread reusable list. *Probably* fine for current load; flag if memory profiling shows it.
**N4. `BcdTagMap.TryGetForRange` does a linear scan over `_map.Values` (`BcdTagMap.cs:107`)** — O(N) over all tags, not O(hits). For a PLC with ~10 BCD tags this is trivial; for a PLC with hundreds it could matter. The data structure is `FrozenDictionary<ushort, BcdTag>`; a sorted-by-address array with a binary search for the range start would be O(log N + hits). Fine for current scale.
**N5. `Decode32` decodes high then low (`BcdCodec.cs:95-96`)** — the comment at lines 92-94 acknowledges the choice but says it's "natural order". Consumers who care about which word's bad-nibble fires first will see *high*'s exception, but the rewriter then re-checks via `HasBadNibble(rawLow)` first. Slight inconsistency between codec error attribution and pipeline log attribution. Cosmetic.
**N6. `BcdTag.HighRegister` throws `InvalidOperationException` for 16-bit tags (`BcdTag.cs:44-48`)** — fine, but combined with the validator iterating `validated.Values` and only branching on `IsThirtyTwoBit`, the throw is currently unreachable. Future maintainers might call it carelessly — documented well.
**N7. FC16 response is not handled at all (`BcdPduPipeline.cs:289-291`)** — explicit `return`. Correct per the spec (response carries no register values), but the test suite doesn't cover an FC16 response, only the request. One xUnit `[Fact]` would close the gap.
**N8. Test gap: no test asserts that a `BcdCodec.Encode16(int.MinValue)` doesn't blow up unexpectedly.** Negative inputs are tested only with `-1`. The `(uint)value > Max16` cast handles `int.MinValue` correctly (it becomes `0x80000000`), but a direct test would lock the contract.
**N9. Test gap: no test exercises a write range that ends exactly mid-32-bit-pair on the high address (`qty=2, startAddress = tag.Address - 1`)** — i.e., the inverse of the `BcdPduPipelineTests.FC16_WritePartiallyOverlapping32BitPair_PassesThroughRaw_WithPartialWarning` test. Currently the test only covers partial-overlap-on-the-low-side.
**N10. Test gap: no test covers the **mixed 16/32/non-BCD in one FC03 read** scenario.** `FC03_Mixed_16BitBcd_And_NonBcd_InSameRead_OnlyBcdSlotRewritten` covers 16 + non-BCD only; no test mixes a 32-bit pair with non-BCD or with a 16-bit tag in one request.
## What looks good
- **Codec is genuinely allocation-free on the success path.** Pure value-type math, tuple return, and bounds checks via `(uint)cast` rather than two-sided comparisons. Clean.
- **MBAP transparency invariant** (length field never modified, always re-encoding ushort→ushort) is honored end-to-end. The dedicated `PduLength_IsNeverChangedByRewriting` test enforces it.
- **Exception-response detection** (`fc & 0x80`) at `BcdPduPipeline.cs:264` is correctly placed *before* the FC switch, so FC03/04 exception responses (`0x83`/`0x84`) are not accidentally treated as register-data responses. The exception-code counter routing is also correct.
- **Pass-through for unrelated FCs (01, 02, 05, 15)** is implicit and correct — the `switch` in `ProcessRequest` and `ProcessResponse` simply doesn't match them. Tests confirm.
- **Stateless pipeline + per-call `WithCurrentRequest` clone** (`PerPlcContext.cs:61`) is a clean way to pass per-PDU state without a thread-static or a lock. The single per-call clone allocation is pragmatic.
- **`FrozenDictionary` for the tag map** is the right .NET 8+ choice for a build-once / read-many lookup table.
- **`s_emptyHits` sentinel + early returns in `TryGetForRange`** mean the no-hit path (the overwhelming majority for non-BCD addresses on a 54-PLC fleet) is allocation-free.
- **Stable `EventName`s on `[LoggerMessage]` attributes** (`RewriterLogEvents.cs:14, 31, 47`) match the naming convention in `design.md` and survive a rename refactor of the C# method names.
- **FC06 echo decode rationale** (NModbus client validation) is documented inline at line 285-286 and the slot-counter is intentionally skipped to avoid double-counting (line 448-449). Good attention to detail.
- **Partial-overlap detection in `BcdPduPipeline.ProcessFc06Request`** correctly handles *both* the low-of-pair case (via `tag.IsThirtyTwoBit` after a hit) **and** the high-of-pair case (via `TryGetForRange(address, 1)` returning a negative `OffsetWords`). Subtle but right.
## Open questions
- **Is the "two-base-10000 digits in CDAB" wire format for 32-bit BCD an intentional contract or a bug?** The design doc, the codec, the pipeline, and the tests all agree. But every standard Modbus client treats CDAB as straight binary int32. If the upstream Wonderware/Historian configurations were built knowing this, fine; if they were configured assuming standard CDAB-int32, they're silently producing garbage on writes and reading garbage on reads any time the value exceeds `9_999`. (This is a question for the integrator, not the codebase.)
- **Should the `Cache.AllowLongTtl` validation reject `CacheTtlMs > 60_000` at the `BcdTagMapBuilder` layer or only at the options-validator layer?** The builder happily accepts any `CacheTtlMs >= 0` (line 107-110), so the long-TTL gate must live elsewhere. Couldn't tell from the in-scope files alone.
- **`BcdTagMapBuilder.Build` includes `OverlappingHighRegister` errors but still emits the conflicting tag in the map (lines 130-136)**. The XML doc says "Callers should treat any error as a fatal configuration problem at startup," but if a caller ignores errors, the map is *partially valid* with overlapping entries. Is the contract "reject loudly at startup, never run with errors" enforced by the host, or is there a code path where the service runs with `Errors.Count > 0`?
- **No FC23 (Read/Write Multiple Registers) handling.** The DL260 reference doesn't list FC23 as supported, so it's safely ignored, but if one ever shows up the proxy will pass it through unrewritten. Probably fine; worth confirming the DL fleet never advertises FC23.
@@ -0,0 +1,57 @@
# Hosting / DI / Options / Windows Service Audit
Scope: `Program.cs`, `HostingExtensions.cs`, `Mbproxy.csproj`, `appsettings.json`, `Options/*.cs`, `install/`, `Mbproxy.slnx`. Cross-checked against `docs/design.md` (Configuration, Logging, Startup posture), `docs/Architecture/Overview.md`, `docs/Operations/Configuration.md`. Tests under `tests/Mbproxy.Tests/Options/MbproxyOptionsBindingTests.cs` + `tests/Mbproxy.Tests/HostSmokeTests.cs`.
## Summary
- The host pipeline is small, idiomatic for .NET 10 `Host.CreateApplicationBuilder`, and the Windows Service / console-fallback story is correctly handled via `AddWindowsService()` + `WindowsServiceHelpers.IsWindowsService()`.
- Hot-reload posture is mostly clean (`IOptionsMonitor<MbproxyOptions>` everywhere it matters), with one regression: `ShutdownCoordinator` consumes `IOptions<MbproxyOptions>` and will freeze `GracefulShutdownTimeoutMs` at startup-time value.
- `appsettings.json` shipped in `src/Mbproxy/` is an empty template (no PLCs, no BCD tags, no `Cache` section). The richer working example lives only in `install/mbproxy.config.template.json` — a maintenance-time hazard.
- Single-file self-contained publish is wired correctly for net10.0; package versions are pinned and GA (no preview/RC strings present).
## Critical findings
None. The service starts, reaches `mbproxy.startup.ready`, and shuts down cleanly per `HostSmokeTests`.
## Major findings
- **`ShutdownCoordinator` uses `IOptions<MbproxyOptions>`, defeating hot-reload of `GracefulShutdownTimeoutMs`.** `src/Mbproxy/Diagnostics/ShutdownCoordinator.cs:93,102,118` and the DI factory at `src/Mbproxy/Program.cs:42`. `IOptions<T>` is a singleton snapshot: any operator who edits `Connection.GracefulShutdownTimeoutMs` after start will silently see the original value at shutdown. Per `docs/Operations/Configuration.md:160` "`GracefulShutdownTimeoutMs` is sampled only at shutdown" — so the contract is "current config wins at stop time". Switch to `IOptionsMonitor<MbproxyOptions>` and read `_options.CurrentValue.Connection.GracefulShutdownTimeoutMs` inside `ShutdownAsync`.
- **`appsettings.json` shipped with the binary is a near-empty stub.** `src/Mbproxy/appsettings.json:1` ships `BcdTags.Global = []`, `Plcs = []`, no `Cache` section, and no example tag. The fully-commented working example lives only in `install/mbproxy.config.template.json`. Copy-on-build (`Mbproxy.csproj:52`) will publish the empty stub next to `Mbproxy.exe`, and `install.ps1:99-105` then copies *that* into `InstallPath`. The Production binary therefore carries an `appsettings.json` that only the install script's separate copy of the template will ever fix at `%ProgramData%\mbproxy\…`. A fresh `Mbproxy.exe` run from the publish folder yields a no-op service. Recommend either marking `src/Mbproxy/appsettings.json` `CopyToOutputDirectory=Never` so the install template is the single source of truth, or making the shipped file mirror the template.
- **`ConfigReconciler.Attach` happens between supervisor construction and `StartAsync` — not "BEFORE accepting hot-reload events".** `src/Mbproxy/Proxy/ProxyWorker.cs:153`. The comment claims this is safe because the channel "only enqueues" and apply runs async. That is true, but it does mean a config save that lands during the very early window between `host.RunAsync()` and `_reconciler.Attach` is observed by `IOptionsMonitor.OnChange` before the reconciler has its supervisor map. Whether that is benign depends on `ConfigReconciler` internals (it does buffer); worth a quick review to confirm the buffer is unbounded or that the missed event is replayed from `_options.CurrentValue` after `Attach`.
- **`IHostApplicationLifetime.ApplicationStopping` callback runs `coordinator.ShutdownAsync().GetAwaiter().GetResult()` synchronously on the lifetime thread.** `src/Mbproxy/Program.cs:60-66`. The comment acknowledges async isn't supported and notes the coordinator manages its own deadline, which is correct. But the drain loop in `ShutdownCoordinator.ShutdownAsync` (line 167) `Task.Delay(10, drainCts.Token)` ContinueWith chains will resume on the threadpool — fine — yet the synchronous `GetResult` blocks one IHost lifetime thread for up to `GracefulShutdownTimeoutMs + ~7 s` (5 s stop + drain + 2 s admin). That is by design, but worth documenting as the reason the `GracefulShutdownTimeoutMs` default (10 s) is bounded well below the SCM 30 s wait hint.
## Minor findings
- **Bare-config-only Serilog wiring; the design's "Event Log sink for Error+ in service mode" is satisfied via a custom sink, not the standard `Serilog.Sinks.EventLog` package.** `src/Mbproxy/HostingExtensions.cs:79-84` + `src/Mbproxy/Diagnostics/EventLogBridge.cs:25`. This is intentional (the bridge silently no-ops if the source isn't registered, avoiding admin-required `EventLog.CreateEventSource` at startup). Just call this out in `docs/Operations/Configuration.md` so reviewers don't add a duplicate sink.
- **`ProxyWorker` registered as singleton + hosted service via the `sp.GetRequiredService` factory pattern.** `src/Mbproxy/Program.cs:29-30` and the same trick at `src/Mbproxy/HostingExtensions.cs:57-58` for `AdminEndpointHost`. The pattern is correct and idiomatic; just notice that `ProxyWorker` holds a private mutable dictionary `_supervisors` (line 41) — singleton lifetime is correct because that dict is per-process, not per-PLC, and `Supervisors` is exposed as `IReadOnlyDictionary` for the status page.
- **`MbproxyOptions.Plcs` typed as `IReadOnlyList<PlcOptions>`.** `src/Mbproxy/Options/MbproxyOptions.cs:8`. Configuration binder support for `IReadOnlyList<T>` works on .NET 8+ but historically had quirks; `MbproxyOptionsBindingTests.cs:75` confirms binding works. No action needed.
- **`AddJsonFile` is implicit via `Host.CreateApplicationBuilder`.** Not called manually in `Program.cs`. The default loader uses `optional: true, reloadOnChange: true` for `appsettings.json` and `appsettings.{Environment}.json`. Per `docs/Operations/Configuration.md:13` the design states `reloadOnChange: true` — satisfied by the default. Environment overrides and command-line are also picked up automatically by `CreateApplicationBuilder(args)` at `Program.cs:9`.
- **`MbproxyOptionsValidator` is registered manually as `IValidateOptions<>` (line 28-30) AND `.ValidateOnStart()` is called (line 26).** Correct combination, but note `ValidateOnStart()` only gates startup — runtime reload validation is delegated to `ReloadValidator` (per `Configuration.md:245`). The schema-level validator does NOT run on reload via `IOptionsMonitor.OnChange` because options binding rebinds without re-invoking `IValidateOptions<>`. This is the design (the validation entry-point on reload is the reconciler), but it's a sharp edge — a width-of-8 sneaked in via hot-reload would NOT be caught by `MbproxyOptionsValidator`; it has to be re-applied by `ReloadValidator`.
- **`Mbproxy.csproj:42` describes Polly as "deferred" — but Polly is now used in `PolicyFactory.BuildBackendConnect/ListenerRecovery`.** Comment is stale; remove it.
- **No `appsettings.Development.json` exists** — fine, but the host will still attempt to load it (optional). If you want to suppress dev-only Serilog overrides in published artifacts, OK as-is.
## What looks good
- TFM `net10.0`, GA package versions `Microsoft.Extensions.Hosting.WindowsServices 10.0.8`, `Serilog.Extensions.Hosting 10.0.0`, `Polly 8.6.6` — all production releases, no `*-preview*` / `*-rc*` strings (`Mbproxy.csproj:37-43`).
- `OutputType=Exe` + `Sdk="Microsoft.NET.Sdk.Worker"` is the right combo for a Worker Service that also hosts ASP.NET Core (Kestrel admin) via `<FrameworkReference Include="Microsoft.AspNetCore.App" />` (line 31).
- Single-file publish gated to Release only (`Mbproxy.csproj:22-27`) — Debug iteration stays fast. `IncludeNativeLibrariesForSelfExtract=true` is the correct hint for self-extracting native deps.
- `AddWindowsService()` is called unconditionally (`Program.cs:12`); `EventLogBridge` wiring is gated by `WindowsServiceHelpers.IsWindowsService()` so console runs don't depend on Event Log source registration (`Program.cs:15`, `HostingExtensions.cs:79-84`).
- Cancellation flows correctly: `BackgroundService.ExecuteAsync(stoppingToken)` is plumbed into supervisor `StartAsync(stoppingToken)` and `WaitForInitialBindAttemptAsync(readyLinked.Token)` (`ProxyWorker.cs:163-177`).
- Install/uninstall scripts are idempotent: `install.ps1:71-84` stops gracefully before rebuilding, `:112-117` updates an existing service in-place rather than recreating, `:166-171` checks Event Log source existence before creating, `:154-162` preserves an existing operator config. `uninstall.ps1:75-93` archives logs to a timestamped folder rather than deleting — excellent operational instinct.
- Failure-recovery actions are configured (`install.ps1:127`): `restart/60000/restart/60000/""/0` with a 24 h reset window — restart twice then back off, which is the right policy for a fleet proxy that should not flap-restart.
- `Mbproxy.slnx` (the new XML solution format) is correctly minimal and references only the two projects.
## Open questions
1. Is the empty `src/Mbproxy/appsettings.json` shipping intentionally so operators are forced into `%ProgramData%`? If so, document it explicitly in `docs/Operations/Configuration.md`. If not, mirror the install template into the source tree.
2. Should `ShutdownCoordinator` switch to `IOptionsMonitor<T>` to honour a hot-reloaded `GracefulShutdownTimeoutMs`, or is the doc claim ("sampled only at shutdown") meant to mean "as of the last reload **before** shutdown was requested"? Either intent is fine, but the code currently means "as of process start" and that contradicts both readings.
3. Is the missing per-PLC reload-validation re-run of `MbproxyOptionsValidator` (Width != 16/32) covered by `ReloadValidator` test coverage? The width check appears to live only in `MbproxyOptionsValidator` (`MbproxyOptions.cs:62-63`) — worth confirming `ReloadValidator` re-checks it on hot-reload, or it could be skipped on a runtime config change.
@@ -0,0 +1,78 @@
# PlcMultiplexer Concurrency Review
Scope: `src/Mbproxy/Proxy/Multiplexing/*`, `src/Mbproxy/Proxy/{PlcListener,PerPlcContext,ProxyWorker,ProxyCounters}.cs`, the relevant `Options` POCOs, and the matching tests under `tests/Mbproxy.Tests/Proxy/Multiplexing/*` + `ProxyForwardingTests.cs`. Cross-checked against `docs/design.md` (Connection model, Read coalescing), `docs/Architecture/ConnectionModel.md`, `docs/Architecture/ReadCoalescing.md`.
## Summary
- The architecture is sound and the documented invariants (single backend reader/writer, claim-then-dispatch in the watchdog, remove-from-key-map-before-fanout) are upheld. Most paths are well-defended.
- Two correctness defects in the coalescing path can deliver responses without restoring the original TxId or duplicate counter increments; one in the cache path lets a hit survive a freshly-cascaded backend.
- A handful of leaks and lifecycle races exist around `_connectGate`, the outbound channel after teardown, the `_disposed` flag, and the second backend-CTS dispose.
- Test coverage is solid for happy-path multiplexing/coalescing/saturation/cascade/watchdog but does not exercise the watchdog↔response race or coalescing-mid-cascade.
## Critical Findings
### C1. Coalescing factory failure can leak the entry, deliver no response, and hang every late attacher
`PlcMultiplexer.cs:692-700` — when `_correlation.TryAdd` fails inside the factory, the code releases the TxId, logs, and **returns the stub `inFlight`**. The map keeps the just-stored entry under `key` but `inFlightForSend` stays `null`, so the post-lock branch at line 724 takes the saturation cleanup (`_inFlightByKey.TryRemove(key, out _)`) and sends an exception **only to the leader**. Any late attacher that joined between the factory call and the `TryRemove` is appended to a `List<InterestedParty>` that no one will ever drain — its pipe hangs until the upstream socket closes. Same pathology if `_allocator.TryAllocate` returns false inside the factory: the stub is stored under `key`, late attaches accumulate, and only the leader gets exception 0x04. The factory must either (a) not store on failure, or (b) the post-lock cleanup must walk all parties on the stub before removing. (See also `InFlightByKeyMap.cs:82-83`: `_entries[key] = req` runs unconditionally.)
### C2. Coalesce-hit late attaches against an FC03/FC04 cache hit can be silently dropped
`PlcMultiplexer.cs:603-624` — for an FC03/FC04 cache hit, the path returns immediately at line 623 **without** consulting `_inFlightByKey`. Meanwhile, a peer in flight on the exact same `(unit, fc, start, qty)` may already be coalescing late attaches (line 645-718). The cache lookup happens before `EnsureBackendConnectedAsync` and before coalescing — but if the cached entry was just stored by the backend reader at line 490 *while a coalesced-leader's response is still in fan-out*, a late upstream that arrives during fan-out can take the cache hit path and return correctly. The real hazard is the inverse: at `RunBackendReaderAsync` line 490 the cache `Set` happens **before** the InFlight TryRemove at line 441. A new request arriving between those two steps could miss the cache (entry not yet stored… actually the order is reversed: cache Set is at 490 and `_inFlightByKey.TryRemove` is at 441, so cache is populated *after* the key is removed — correct order). However, between line 441 and line 490 a new identical FC03 will: cache miss (not yet set), then check `_inFlightByKey` — missing too — and open a fresh round-trip. That's wasteful but not corrupting. **Downgrade to Major.**
### C3. Cache hit can be served against a backend that has just cascaded
`PlcMultiplexer.cs:610-624` — by design the cache "survives backend disconnects," but the contract in `ConnectionModel.md` says cascade closes **every attached upstream pipe**. A new pipe that connects mid-cascade and immediately reads a cached tag returns successfully, while a peer attached one millisecond earlier just had its socket closed. This is an inconsistent visibility model — but the design explicitly accepts it for cached reads. Note however: after a cascade, `_inFlightByKey.DrainAll()` runs (line 324) without delivering anything to the parties on the drained entries. Those parties' pipes are then disposed at line 338 so the hang in C1 doesn't apply here. **Acceptable** but worth a doc note.
### C4. Backend reader's send to upstream uses the backend CTS, so a slow upstream blocks the whole reader
`PlcMultiplexer.cs:545``await party.Pipe.SendResponseAsync(outFrame, ct)` uses the backend reader's `ct` (= `_backendCts.Token`). If the upstream pipe's `_responseChannel` is full (capacity 16) and the backend writer keeps producing for it, this `WriteAsync` blocks **inside the single backend reader task**, stalling responses to **every other PLC client** sharing that backend socket. The bounded outbound channel only backpressures inbound; on the response side a single wedged upstream stops the world for that PLC. Mitigations: drop-on-full for response channel, write with timeout, or fan-out tasks per party.
## Major Findings
### M1. `_disposed` is non-volatile and read on every hot path
`PlcMultiplexer.cs:86`, checked at lines 142, 181, 216, 566. A non-volatile `bool` set on the disposing thread is not guaranteed to be observed by other threads under the .NET memory model on weakly-ordered platforms (ARM). Same in `UpstreamPipe.cs:52,77`. Recommend `volatile bool` or use `Interlocked.Exchange(ref int, 1)` pattern.
### M2. `_disposeCts` is disposed while watchdog/cancel paths may still touch it
`PlcMultiplexer.cs:209` disposes `_disposeCts` after a 2-second `WaitAsync` on the watchdog. If the watchdog took >2 s to settle (e.g. blocked on a slow `SendResponseAsync`), it will subsequently call `Task.Delay(tickMs, ct)` against a disposed CTS → `ObjectDisposedException`, which the outer `catch (Exception ex)` (line 909) logs as an error. Best-effort, but noisy. Same risk: `oldCts?.Dispose()` at line 350 races with `RunBackendReaderAsync`'s `ct.IsCancellationRequested` check at line 392 and with the catch handler.
### M3. `_connectGate` is never disposed
`PlcMultiplexer.cs:283`. `SemaphoreSlim` holds an internal `WaitHandle` lazily; in practice rarely realized so leak is small, but it's a missed dispose on a long-lived object.
### M4. Counter parity contract violated when allocator saturates on a coalescing miss
`PlcMultiplexer.cs:721-733` — the miss is counted at line 721 (`IncrementCoalescedMiss`). If the factory then fails (saturation/duplicate-key), the path at line 729 sends exception 04 but **never** decrements the miss counter. The doc's identity `coalescedHitCount + coalescedMissCount = total FC03+FC04` still holds in raw counting terms (each FC03 entered the path once), but the operator semantically expects "miss = a fresh backend round-trip." Saturation produced no round-trip. Document this, or move the increment after success.
### M5. Watchdog snapshot enumerates a `ConcurrentDictionary` non-atomically across multiple ticks
`CorrelationMap.cs:75-79` uses `foreach (var kvp in _entries)` which is a snapshot enumerator but does **not** take a moment-in-time lock. An entry added after the enumerator started may or may not be seen this tick — fine. The issue is that the watchdog reads `kvp.Value.SentAtUtc` and may evaluate against an entry whose true `SentAtUtc` is later than threshold because the entry was added between the snapshot copy at the dictionary level and the field read. Since `InFlightRequest` is immutable, the field read is consistent, and a too-young entry just won't satisfy `<=` — so this is benign. Worth a comment.
### M6. Watchdog comment says "1-second floor" but code uses 100 ms
`PlcMultiplexer.cs:844-846` — the doc string and `ConnectionModel.md` both say "100 ms floor"; the inline comment at line 845 says "1-second floor". Confusing. The 100 ms floor in code is correct.
### M7. Pre-rewriter FC byte read after rewriter mutates the buffer
`PlcMultiplexer.cs:468``byte fcInResponse = frame[MbapFrame.HeaderSize]; // post-rewriter, but the FC byte is never rewritten`. True for FC03/04/06/16 responses, but if a future rewriter ever touches byte 0 of the PDU body this becomes silently wrong. Use `inFlight.Fc` from the request side instead.
### M8. `RunBackendWriterAsync` doesn't drain remaining frames on cascade
`PlcMultiplexer.cs:362-385` — when the writer faults at line 380 it triggers a cascade. Frames already enqueued in `_outboundChannel` are stranded (the channel is not completed and no subsequent writer task will ever exist against the same channel — `EnsureBackendConnectedAsync` wires a fresh task at line 269, which then drains those stranded frames into the *new* backend socket). Those frames carry old proxy TxIds whose correlation entries were just dropped by `TearDownBackendAsync`. The PLC will respond to TxIds that have no correlation entry, the reader logs nothing (line 423-428: silent drop), the upstream peers hang waiting for the watchdog. Either complete the channel on cascade and recreate it on reconnect, or drain-and-discard before re-running the writer.
### M9. `IncrementCacheMiss` always increments before the fall-through to coalescing
`PlcMultiplexer.cs:626` — for a cache-eligible read, the miss is counted, then the request falls through to coalescing. If two upstream peers issue the same read, one stores the response in cache, the other coalesces onto it. Result: one CacheMiss, one CoalescedMiss, one CoalescedHit (the second hits coalesce, not cache). If two peers issue, peer A is the leader: A → CacheMiss, CoalescedMiss; B → CacheMiss, CoalescedHit. Per-second the operator reads `CacheMiss = 2` but only one actually had to go to backend. This is reasonable but worth documenting.
## Minor Findings
- **`PlcMultiplexer.cs:445-450`**: Dead code — `elapsedMs` is computed and never used; the actual EWMA conversion happens two lines below.
- **`UpstreamPipe.cs:262-263`**: `return firstRead && remaining == count ? false : false;` — both branches return `false`. The conditional is meaningless; just `return false;`.
- **`PlcMultiplexer.cs:269-270`**: Backend tasks are created with `CancellationToken.None` so they cannot be observed via that token; relies on `cts2.Token` propagating cancellation through `SendAsync`/`ReceiveAsync`. Fine but worth a comment.
- **`PlcMultiplexer.cs:537-540`**: The `Count == 1` reuse of the original `frame` buffer is correct only because the BCD rewriter has already run. After patching `outFrame[0..2]` for a single party we mutate the same buffer the cache `Set` at line 478 already snapshotted (`Buffer.BlockCopy`), so cache integrity holds. Good — but a comment would help reviewers.
- **`InFlightByKeyMap.cs:62`** signature: `public bool TryAttachOrCreate(...)` always returns `true`; the bool is dead.
- **Test gap**: No test exercises C1 (factory-saturation/duplicate-key with multiple attachers in flight). No test exercises the watchdog↔response race (response arriving in the same tick as `SnapshotOlderThan`). No test exercises cascade racing with a brand-new accept (does the new pipe get attached to the freshly-empty `_pipes` and survive, or does it land in a zombie multiplexer mid-teardown?).
- **`ProxyWorker.cs:64-194`**: There is no add/remove reconciliation path here — this is delegated to `ConfigReconciler` via `Attach(_supervisors, opts)`. Out of scope for this audit but worth verifying that reconciler tear-down respects the same disposal ordering.
## What Looks Good
- **TxIdAllocator** is straightforward and demonstrably correct: single lock on the hot path, scan-from-cursor with wrap detection, defensive double-release no-op, exhaustion returns `false` cleanly. Stress test at `TxIdAllocatorTests.cs:101` covers concurrent allocate/release for duplicate detection.
- **Claim-then-dispatch** in the watchdog (`PlcMultiplexer.cs:863`): `TryRemove` is the single source of truth; a response that wins the race makes the watchdog skip silently. Symmetric in the response path (line 423).
- **`InFlightByKeyMap` lock discipline** — every state mutation under the same `object` lock; remove-before-iterate ordering at the multiplexer (line 441) preserves the no-late-attach invariant for fan-out. The `Concurrent_AttachOrCreate_From_Two_Threads` test (16×500 ops, single key, unbounded MaxParties → exactly one create) is a clean concurrency proof.
- **`TearDownBackendAsync` idempotency** via the `_backendLock`-protected null-swap (line 290-301) is correct; second invocation early-returns at line 303.
- **`BuildExceptionFrame`** correctly produces a real 9-byte Modbus exception frame so client libraries see a normal `ModbusException`.
- **Per-call context cloning** via `WithCurrentRequest` (`PerPlcContext.cs:61`) avoids cross-talk between concurrent multiplexed responses on the shared `PerPlcContext`.
- **Counters are uniformly Interlocked** — no `++` on shared longs, peak via CAS, EWMA via fixed-point CAS loop.
## Open Questions
1. Is the design intent that **FC03/FC04 with `MaxParties = int.MaxValue`** is supported in production, or is the test-only setting an antipattern? `InFlightByKeyMapTests.cs:208` uses it — a real backend response with 8000 parties means the reader walks 8000 frame clones synchronously inside the single backend reader (C4 amplified).
2. When `EnsureBackendConnectedAsync` returns false, `OnUpstreamFrameAsync` disposes the upstream pipe (line 636). Is that the right behavior given the cache-hit-survives-backend-down contract? An upstream that needs a non-cached read will be killed even if its next read would have been served from cache.
3. The cascade walks `_pipes.Values.ToArray()` (line 335). A pipe accepted between snapshot and `_pipes.Clear()` (line 341) is **not** disposed and **not** counted in `upstreamCount`. The next `OnUpstreamFrameAsync` it receives will see `_disposed == false` and try to reuse the multiplexer — likely hits `EnsureBackendConnectedAsync` and proceeds normally. Intentional?
4. `RunBackendReaderAsync` line 550 does `_ = TearDownBackendAsync(...)` fire-and-forget. The reader task itself then exits cleanly. But the cascade's "best-effort join" at line 348 awaits the same reader task (`reader = _backendReaderTask`) → self-join? Trace: `TearDown` snapshots `reader` from the field, the field is then set to null inside the lock, and `reader.WaitAsync` is awaited. Since the reader returned and started cascade fire-and-forget, the await on its task completes immediately — fine. Worth a comment confirming the intent.
@@ -0,0 +1,86 @@
# mbproxy Code Review — 2026-05-14
In-depth review of the `mbproxy` service through Phase 11 (opt-in response cache). Conducted via parallel area-focused review passes, then synthesised here. Each area has its own document under this folder; this Overview ranks findings across the whole codebase by impact so a maintainer can prioritise fixes without re-reading every file.
**Scope:** every `.cs` file under `src/Mbproxy/`, the design contracts in `docs/design.md` + `docs/Architecture/*` + `docs/Features/*`, the option POCOs, the hosting/install glue, and the test suite under `tests/Mbproxy.Tests/`.
**Out of scope:** the pymodbus simulator profile itself, anything under `DL260/`, anything outside `mbproxy/`.
## Reviewer Verdict
The service is well-architected and the per-area code quality is high. Concurrency primitives are correct, log events are stable and well-named, and the test suite genuinely exercises races rather than pretending to. Most of the contracts in `docs/design.md` are upheld in the code.
The fleet of critical-severity findings clusters in **two seams**:
1. **Hot-reload propagation into a running `PlcMultiplexer`.** The supervisor's `ReplaceContextAsync` swaps `_currentContext` on the supervisor but the running multiplexer captures `_ctx` at construction and never re-reads it — so tag-list reloads, cache-on-reload flushes, and the `ReadCoalescing.Enabled` toggle are visible only on the *next* listener fault, not at the next PDU. The design doc reads as if the next-PDU contract is upheld; the code only upholds it for one branch (the `BcdTagMap` read inside the rewriter's per-call clone).
2. **Backend reader/writer cascade.** When the backend socket dies, in-flight responses can sit synchronously on a wedged upstream pipe (one slow client stalls every peer), stranded outbound frames carry old TxIds against a fresh backend (orphaned responses, hung peers via watchdog), and the cascade itself races accepts that arrive between snapshot and `_pipes.Clear()`.
Below, **Critical** findings are correctness defects, hangs, data corruption, or security gaps that block production confidence. **Major** findings are design-contract violations that an operator could hit in a year of running. **Minor** findings are style, dead code, and test-only gaps.
## Critical Findings (ranked by impact)
| # | Area | Finding | File:line |
|---|------|---------|-----------|
| 1 | Supervisor / Hot-reload | `ReplaceContextAsync` swaps the supervisor's `_currentContext` but the running `PlcMultiplexer` captured `_ctx` at construction and never re-reads. The cache-flush-on-tag-list-reload contract (`design.md:155`) is not enforced for the in-flight mux; the old cache is even disposed while the running mux is still using it. | `Proxy/Supervision/PlcListenerSupervisor.cs:199-216``Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109` |
| 2 | Multiplexer | Coalescing factory failure (`TryAdd` collision, `TxIdAllocator.TryAllocate` returns false) stores a stub `InFlightRequest` under the coalescing key but only the leader gets the saturation exception. Late attachers join a `List<InterestedParty>` that no one will ever drain — pipes hang until the upstream socket closes. | `Proxy/Multiplexing/PlcMultiplexer.cs:692-733`, `Proxy/Multiplexing/InFlightByKeyMap.cs:82-83` |
| 3 | Multiplexer | Backend reader sends to upstream synchronously inside the single reader task. If one upstream's bounded response channel is full, the reader blocks — stalling every other client on the same PLC. | `Proxy/Multiplexing/PlcMultiplexer.cs:545` |
| 4 | Multiplexer | Backend writer doesn't complete/drain `_outboundChannel` on cascade. Stranded frames carry old proxy TxIds; the fresh backend socket then receives them and responds against TxIds whose correlation entries were already dropped by `TearDownBackendAsync`. Responses are silently dropped (`PlcMultiplexer.cs:423-428`); upstream peers hang waiting for the watchdog. | `Proxy/Multiplexing/PlcMultiplexer.cs:362-385`, `:423-428` |
## Major Findings (highest-impact, full list in per-area docs)
| Area | Finding | File:line |
|------|---------|-----------|
| Supervisor / Hot-reload | "Skip cache invalidations when listener state is `recovering`" (`design.md:203`) is unimplemented. `PlcMultiplexer` never consults supervisor state. Implicit gating (no backend reader during recovery → no invalidation) holds today but is fragile and undocumented in code. | `Proxy/Multiplexing/PlcMultiplexer.cs:497-509` |
| Supervisor / Hot-reload | `ConfigReconciler.Apply`'s add/restart code paths construct `PlcListenerSupervisor` **without** the `coalescingAccessor` that `ProxyWorker` builds at startup. Result: PLCs added or restarted via hot-reload always use the *default* `ReadCoalescingOptions`, and a hot-reload of `Mbproxy.Resilience.ReadCoalescing.Enabled` does NOT propagate to those supervisors. | `Configuration/ConfigReconciler.cs:294-304,378-388``Proxy/ProxyWorker.cs:129-130` |
| Supervisor / Hot-reload | `ConfigReconciler.ApplyUnderLockAsync` mutates the `_supervisors` `Dictionary` from concurrent `Task.Run` continuations. The outer apply is serialised by a semaphore, but the inner Add/Remove/Restart tasks run in parallel. `Dictionary<,>` is not thread-safe. | `Configuration/ConfigReconciler.cs:236-238,269-271,306,390` |
| Multiplexer | `_disposed` is non-volatile and read on every hot path (`PlcMultiplexer.cs:86`, `UpstreamPipe.cs:52,77`). Under the .NET memory model, on ARM hosts a stale `false` read after disposal is permitted. (x64 happens to give acquire-release on plain reads.) | `Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566` |
| Cache | `mbproxy.cache.flushed` event is defined in `CacheLogEvents.cs` and referenced in `Reference/LogEvents.md` but never emitted by any caller. The reseat path disposes the old cache rather than calling `Clear()`. | `Proxy/Cache/CacheLogEvents.cs:67-76`, `Proxy/Supervision/PlcListenerSupervisor.cs:199-216` |
| Cache | `Cache.AllowLongTtl=false` rejects `CacheTtlMs > 60_000` per the design — and the validators do enforce it — but the suite has **zero tests** for this rule (`grep -rn AllowLongTtl tests/` returns nothing). Same for the happy path with `AllowLongTtl=true`. | `Configuration/ReloadValidator.cs:90-128`, `Options/MbproxyOptions.cs:58-111` |
| Cache | `IncrementCacheMiss` always increments on cache-eligible reads *before* the fall-through to coalescing. Two peers on the same key produce two cache misses and one coalesced hit — accurate per the strict letter of the spec but means "miss" doesn't mean "backend round-trip". | `Proxy/Multiplexing/PlcMultiplexer.cs:626` |
| Bcd Rewriter | The 32-bit BCD wire format is "two base-10000 digits in CDAB", **not** standard CDAB binary Int32. Internally consistent with `design.md:123`, but every standard Modbus client (NModbus, FluentModbus, Wonderware DAServer) treats CDAB as straight binary int32 and will silently corrupt any value > 9999. Worth flagging in operator-facing docs so integrators know what wire format the proxy expects. | `Proxy/BcdPduPipeline.cs:202-206,373-379`, `Bcd/BcdCodec.cs` |
| Bcd Rewriter | `BcdTagMapBuilder` cannot detect duplicate addresses because the input collapses through a `Dictionary` first (`Add` last-write-wins via key). The `DuplicateAddress` validation error is unreachable; the design's "reject duplicate addresses within a single PLC's resolved list" is silently violated. | `Bcd/BcdTagMapBuilder.cs:58-103` |
| Admin / Diagnostics | `ShutdownCoordinator` runs via `ApplicationStopping` callback then `IHostedService.StopAsync` fires the same `ProxyWorker`/`AdminEndpointHost` again — double-stop. Drain happens but the contract is unclear. The existing `ShutdownE2ETests` don't assert `mbproxy.shutdown.complete` fired. | `Program.cs:60-66`, `Proxy/ProxyWorker.cs:196-217`, `Diagnostics/ShutdownCoordinator.cs:147` |
| Hosting | `ShutdownCoordinator` consumes `IOptions<MbproxyOptions>` (singleton snapshot) instead of `IOptionsMonitor<MbproxyOptions>`. A hot-reloaded `GracefulShutdownTimeoutMs` is silently ignored at shutdown. | `Diagnostics/ShutdownCoordinator.cs:93,102,118`, `Program.cs:42` |
| Hosting | Shipped `src/Mbproxy/appsettings.json` is a near-empty stub (no PLCs, no BCD tags, no `Cache` section). The working template lives only in `install/mbproxy.config.template.json`. A fresh `Mbproxy.exe` run from the publish folder is a no-op service. | `src/Mbproxy/appsettings.json``install/mbproxy.config.template.json` |
## Top fix recommendations (in order)
1. **Plumb context updates into the running `PlcMultiplexer`.** Either thread a `volatile PerPlcContext` reference on the mux and have `ReplaceContextAsync` swap it (so the next per-PDU `_ctx.TagMap` / `_ctx.Cache` deref sees the new map / new cache), or restart the mux when the context changes. Without this, hot-reload of tag lists, cache-flush-on-reload, and the `ReadCoalescing.Enabled` toggle all silently fail to propagate. (Resolves: Critical 1, Major "coalescingAccessor".)
2. **Fix the coalescing factory leak.** In the `_inFlightByKey.TryAttachOrCreate` factory, if `TryAdd`/`TryAllocate` fails, do not return a stub stored under the key — either fail the entire operation (so nobody else attaches), or have the post-factory cleanup walk every party on the stored stub before removing. (Resolves: Critical 2.)
3. **Decouple the backend reader from per-upstream backpressure.** Either drop-on-full for the response channel, write with a per-frame timeout, or fan out responses on a per-party `Task.Run`. (Resolves: Critical 3.)
4. **Drain or replace `_outboundChannel` on cascade.** Complete it on cascade and recreate it on next reconnect, so stranded frames never reach a fresh backend with a dead TxId space. (Resolves: Critical 4.)
5. **Make `_supervisors` thread-safe.** Either `ConcurrentDictionary<,>` or move dict mutations to a single serialisation point after `Task.WhenAll(...)` settles. (Resolves: Major "concurrent dict".)
6. **Add the `ReadCoalescing` accessor to the reconciler's add/restart paths.** Single-line oversight from when Phase 10 landed after Phase 6.
7. **Make `ShutdownCoordinator` read `IOptionsMonitor`.**
8. **Ship a usable `appsettings.json`** (or stop shipping the stub at publish time).
9. **Flag the 32-bit BCD wire format prominently in operator-facing docs.** The "two base-10000 digits in CDAB" contract lives in `design.md`/`BcdRewriting.md` but not in the README; integrators using standard CDAB-int32 clients will silently corrupt values > 9999 without knowing it.
10. **Add tests for `Cache.AllowLongTtl`.** Both rejection (>60_000 without gate) and acceptance (>60_000 with gate).
## What looks especially good
- **Three-tier test architecture** (pure unit → stub-socket integration → simulator E2E) with explicit per-tier rationale in xmldoc. The stub-backend tests catch what the pymodbus simulator can't (its concurrent-MBAP-frame bug is documented and worked around).
- **Genuine concurrency tests, not pretend ones.** `TxIdAllocatorTests.Concurrent_AllocateRelease_...` runs 100 tasks × 1000 ops asserting no duplicate live IDs; `ResponseCacheTests.Concurrent_GetSet` races 8 tasks × 500 ops; `PlcMultiplexerTests` `BackendDisconnect_CascadesToAllUpstreams` uses real sockets.
- **Counter-based assertions** across the suite — testing observable telemetry as well as behaviour, doubling as wire-format validation.
- **Polly pipelines** correctly match the design's documented attempt counts and backoffs in both `BackendConnect` and `ListenerRecovery`.
- **Bind-failure isolation on the admin endpoint** — Kestrel failing to bind logs `mbproxy.admin.bind.failed` and the proxy listeners continue. E2E-tested.
- **Single code path for "bring up a listener"** (both startup and hot-reload add go through `PlcListenerSupervisor.StartAsync`). Matches the design.
- **`FrozenDictionary` for the tag map**, allocation-free codec hot path, and `s_emptyHits` sentinel for the no-hit case keep the rewriter cheap under sustained load.
- **Stable `EventName`s** on every `[LoggerMessage]` survive a C# rename refactor.
## Per-area documents
| Area | File |
|------|------|
| BCD codec + PDU pipeline | [`BcdRewriter.md`](./BcdRewriter.md) |
| Multiplexer + concurrency | [`Multiplexer.md`](./Multiplexer.md) |
| Response cache | [`ResponseCache.md`](./ResponseCache.md) |
| Supervisor + hot reload | [`SupervisorAndHotReload.md`](./SupervisorAndHotReload.md) |
| Admin + diagnostics | [`AdminAndDiagnostics.md`](./AdminAndDiagnostics.md) |
| Hosting + options | [`HostingAndOptions.md`](./HostingAndOptions.md) |
| Test suite | [`TestSuite.md`](./TestSuite.md) |
## Methodology
Each per-area document is the verbatim (lightly cleaned) output of an independent area-scoped review pass conducted on 2026-05-14 against branch `master` at commit `53a7111`. Reviewers were given the design docs as ground truth, the in-scope files, and the corresponding tests, and asked for `file:line`-cited findings ranked by severity. The Overview above re-ranks the union of those findings across areas.
Findings reference paths relative to `mbproxy/` unless otherwise specified.
@@ -0,0 +1,236 @@
# Remediation Plan — 2026-05-14 Code Review Findings
Plan for resolving every finding raised in the 2026-05-14 review. Items are grouped into three waves by impact, with each item naming the file:line, the problem, the suggested change, and the regression test that should lock the fix in place.
**Conventions:**
- "Wave 1" = ship before next deploy. Production-observable correctness defects.
- "Wave 2" = next sprint. Contract gaps, concurrency primitives, telemetry accuracy.
- "Wave 3" = opportunistic. Cleanup, dead code, doc, missing tests.
- Effort estimates assume a developer already familiar with the multiplexer's lifecycle. ⚙ = code change, 📝 = doc change, ✅ = test addition.
**Dependency map** (Wave 1 items have load-bearing relationships):
```
W1.1 (context swap) ──┬──► W1.2 (cache.flushed emission)
├──► W2.1 (ConfigReconciler coalescingAccessor)
├──► W2.7 (skip-invalidation-while-recovering)
└──► Tests ✅ (cache flush on tag-list reload)
W1.5 (ShutdownCoordinator ownership) ──► W2.4 (IOptionsMonitor switch)
└► W2.6 (drop hardcoded 5s deadline)
W2.3 (ConfigReconciler thread safety) ──► W3 test (concurrent reload assertion)
```
Other items are independent and can be parallelised.
---
## Wave 1 — Critical Correctness (4 items)
### W1.1 — Plumb context updates into the running `PlcMultiplexer` ⚙ ✅
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109` and call sites at `:497-509`. Driven by `Proxy/Supervision/PlcListenerSupervisor.cs:199-216`.
**Problem:** `ReplaceContextAsync` swaps `_currentContext` on the supervisor, but `PlcMultiplexer` captured `_ctx = perPlcContext` at construction and never re-reads. So tag-list reloads, the cache-flush-on-tag-list contract, and any other context-derived state are visible only on the next listener fault. Worse, the old cache is disposed while the running mux still uses it.
**Change:** Make `_ctx` a `volatile` field on `PlcMultiplexer` (or wrap it in a reference type the supervisor can swap atomically). Add a public `ReplaceContext(PerPlcContext newContext)` method on the multiplexer; have `ReplaceContextAsync` call it before disposing the old context. Every place inside the multiplexer that reads `_ctx.TagMap` / `_ctx.Cache` / `_ctx.CacheInvalidator` must read the field fresh, not snapshot it into a local.
**Tests:**
-`PlcMultiplexerTests.ReplaceContext_NewTagMap_VisibleOnNextPdu` — set up a mux, send one FC03, swap the context with a different tag map, send a second FC03, assert different rewriting on the second response.
-`PlcMultiplexerTests.ReplaceContext_FlushesOldCache` — populate the old cache, swap context, assert next read goes to backend (cache miss, not a hit on the disposed cache).
-`HotReloadE2ETests.E2E_GlobalTagListReload_FlushesPlcCache` — pre-seed cache via FC03, save modified `appsettings.json`, assert `cacheEntryCount` for that PLC drops to 0 and the next read goes through.
**Effort:** ~1 day. The mechanical change is small; the risk is making sure every read-site of `_ctx.*` is migrated and no field is hoisted into a constructor capture.
---
### W1.2 — Fix coalescing factory leak ⚙ ✅
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:692-733`, `Proxy/Multiplexing/InFlightByKeyMap.cs:62-83`.
**Problem:** When `_correlation.TryAdd` or `_allocator.TryAllocate` fails inside the `_inFlightByKey.TryAttachOrCreate` factory, the code stores the stub `InFlightRequest` under the key. Late attachers join the stub's `InterestedParties` list. The post-lock cleanup at `:724` only sends the saturation exception to the leader; late attachers wait forever.
**Change:** Either (a) inside the factory, do not return / store on failure — let `TryAttachOrCreate` propagate the error so no other thread sees the stub; or (b) on the cleanup path, walk `inFlight.InterestedParties` and deliver the exception 0x04 frame to every attached party before removing the entry. Option (b) is the smaller diff. Also remove the dead `bool` return from `TryAttachOrCreate` (`InFlightByKeyMap.cs:62`) — it always returns `true`.
**Tests:**
-`PlcMultiplexerTests.CoalescingFactoryFails_AllAttachersGetException` — stub the allocator to return false, fire 5 concurrent identical FC03 reads, assert all 5 receive exception 0x04 with their original TxIds.
-`InFlightByKeyMapTests.AttachOrCreate_FactoryThrows_DoesNotStoreStub` — drive the factory to throw and assert no entry remains in the map.
**Effort:** ~half day.
---
### W1.3 — Decouple backend reader from per-upstream backpressure ⚙ ✅
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:545`.
**Problem:** `await party.Pipe.SendResponseAsync(outFrame, ct)` runs synchronously inside the single backend reader task. If one upstream's bounded response channel (capacity 16) is full, the reader stalls — every other client on that PLC stops receiving responses.
**Change:** Replace the synchronous `await` with a non-blocking `TryWrite` (drop-on-full, with a counter `responseDropForFullUpstream`), or fan out per-party with `_ = Task.Run(() => party.Pipe.SendResponseAsync(...))`. Drop-on-full is the cleaner semantics: a wedged upstream loses responses but the rest of the fleet keeps moving; the upstream's watchdog or socket closure will collect it. Add a counter for the dropped responses so it's observable on the status page.
**Tests:**
-`PlcMultiplexerTests.SlowUpstream_DoesNotStallPeerResponses` — wedge one upstream's channel reader, fire reads from a second upstream, assert the second upstream receives responses on time and the wedged upstream's drop counter increments.
**Effort:** ~half day. Decide drop-vs-fanout first; the rest is mechanical.
---
### W1.4 — Drain or replace `_outboundChannel` on cascade ⚙ ✅
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:362-385,423-428`.
**Problem:** When the backend writer faults at `:380` and triggers a cascade, frames already enqueued in `_outboundChannel` are stranded. After reconnect, a new writer task drains them into the *fresh* backend with old proxy TxIds whose correlation entries were dropped by `TearDownBackendAsync`. The PLC responds; the reader silently drops the response (`:423-428`); upstream peers wait for the watchdog to time them out — observable as a multi-second pause for every client whose request was in flight at the cascade moment.
**Change:** On cascade, complete `_outboundChannel.Writer.Complete()` before tearing down the writer task. In `EnsureBackendConnectedAsync`, recreate the channel as part of bringing up a new backend connection. Alternative: drain-and-discard (`while (TryRead) { releaseTxId; }`) before re-running the writer.
**Tests:**
-`PlcMultiplexerTests.Cascade_StrandedOutboundFrames_AreDiscarded` — fill the outbound channel, force a cascade mid-drain, reconnect, assert the next request gets a normal response and the stranded frames did not produce orphaned TxIds (counter `responseToUnknownTxId` stays at 0 for the new connection).
**Effort:** ~1 day. Care needed to not accidentally race the channel recreation against fast reconnects.
---
### W1.5 — Resolve `ShutdownCoordinator` double-stop ⚙ ✅
**File:** `Program.cs:60-66`, `Diagnostics/ShutdownCoordinator.cs:147`, `Proxy/ProxyWorker.cs:196-217`, `Admin/AdminEndpointHost.cs`.
**Problem:** `ProxyWorker` and `AdminEndpointHost` are both `IHostedService`. Their `StopAsync` runs unconditionally during host shutdown. The coordinator on `ApplicationStopping` *also* calls `StopAsync` on both — so each is stopped twice. The drain logic in the coordinator therefore runs against a connection set that is already empty, and `mbproxy.shutdown.complete` always reports `InFlightAtCancel = 0`. The existing `ShutdownE2ETests.cs:113` doesn't actually assert the coordinator ran.
**Change:** Pick one of:
- **(A)** Make `ProxyWorker` and `AdminEndpointHost` *not* implement `IHostedService` — register them as singletons only — and let the coordinator drive their entire lifecycle (start in `ApplicationStarted`, stop in `ApplicationStopping`).
- **(B)** Move the drain logic into `ProxyWorker.StopAsync` and delete the coordinator entirely. Simpler if the only thing the coordinator does is order + drain + admin-stop.
(B) is the smaller change and matches how worker services are typically structured.
**Tests:**
- ✅ Update `ShutdownE2ETests` to assert `mbproxy.shutdown.complete` fires with a positive `InFlightAtCancel` when an in-flight request is dropped at deadline.
- ✅ Assert `mbproxy.shutdown.complete` fires exactly once.
**Effort:** ~1 day if (B); ~2 days if (A) because every IHostedService gets unwound.
---
## Wave 2 — Major Fixes (20 items)
Grouped by file/area for batched PRs.
### Multiplexer / Concurrency
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.1 | `Configuration/ConfigReconciler.cs:294-304,378-388``Proxy/ProxyWorker.cs:129-130` | Add/restart paths construct `PlcListenerSupervisor` without the `coalescingAccessor` `ProxyWorker` builds at startup. Hot-reload of `ReadCoalescing.Enabled` doesn't propagate to PLCs added or restarted via reload. | Pass the same `coalescingAccessor` into `ConfigReconciler.Attach` and use it in the add/restart constructors. | ✅ `ConfigReconcilerTests.AddedPlc_HonoursCurrentReadCoalescingFlag` — reload with `Enabled=false`, add a PLC, assert that PLC's reads do not coalesce. |
| W2.2 | `Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566`; `UpstreamPipe.cs:52,77` | `_disposed` is non-volatile but read on the hot path. On weakly-ordered platforms (ARM) stale-false reads after disposal are permitted. | Mark both fields `volatile`, or replace with `Interlocked.CompareExchange(ref _disposedFlag, 1, 0)` pattern. The whole project ships win-x64 today so this is correctness defense for future portability. | (existing dispose tests are sufficient; verify via inspection.) |
| W2.3 | `Configuration/ConfigReconciler.cs:236-238,269-271,306,390` | `_supervisors` `Dictionary<,>` is mutated from concurrent `Task.Run` continuations inside `Task.WhenAll`. Outer apply is serialised by a semaphore but the inner tasks are not. | Either `ConcurrentDictionary<,>`, or collect results from `Task.WhenAll` first then mutate the dict on a single thread. | ✅ Replace existing `Concurrent_Reloads_Serialised` with one that drives genuinely-concurrent add/remove via stub supervisors and asserts no `KeyNotFoundException` / corruption. |
| W2.4 | `Proxy/Multiplexing/PlcMultiplexer.cs:721-733` (counter parity); `:626` (cache miss double-count) | Counter semantics drift from the docs: saturation increments `coalescedMissCount` but produces no backend round-trip; `cacheMiss` is incremented even when the request later coalesces onto a peer's in-flight read. | Two options: (a) restructure increments to fire only on the actual outcome (move `IncrementCoalescedMiss` after the successful send; demote cache-miss to "cache-eligible request" and add a separate `cacheElevatedToBackend` counter), or (b) leave behaviour unchanged and update `docs/Reference/LogEvents.md` + `docs/design.md` to match. (b) is the lower-effort fix. | ✅ Either way, add `PlcMultiplexerTests.CounterSemantics_DocumentedContract` that asserts the chosen rule. |
| W2.5 | `Proxy/Multiplexing/PlcMultiplexer.cs:209,350` | `_disposeCts` disposed while watchdog/cancel paths may still race against it. `oldCts?.Dispose()` at `:350` races with reader's `ct.IsCancellationRequested` check. | Wrap `_disposeCts.Dispose()` in a `try { ... } catch (ObjectDisposedException) { }` at `:209`; gate `oldCts?.Dispose()` at `:350` behind the join of the reader task. | (existing dispose tests cover the happy path; consider a stress test that interleaves Dispose with active requests.) |
| W2.6 | `Proxy/Multiplexing/PlcMultiplexer.cs:283` | `_connectGate` `SemaphoreSlim` never disposed. | Add to `Dispose`. | (no test needed.) |
| W2.7 | `Proxy/Multiplexing/PlcMultiplexer.cs:468` | Reads the FC byte from the post-rewriter buffer. Correct today (FC byte never rewritten) but fragile. | Use `inFlight.Fc` from the request side. | (existing tests cover.) |
### Cache / Hot-reload
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.8 | `Proxy/Cache/CacheLogEvents.cs:67-76`, supervisor reseat path | `mbproxy.cache.flushed` event defined and documented but never emitted. After W1.1 the reseat path can call `cache.Clear()` instead of disposing — that's the natural emission point. | After W1.1, replace `cache.Dispose()` in `ReplaceContextAsync` with `cache.Clear(reason: "tag-list-reload")` and have `Clear` emit the log event. | ✅ `ResponseCacheTests.Clear_EmitsFlushedEvent_WithReason`; ✅ `HotReloadE2ETests` add an assertion that `mbproxy.cache.flushed` fires on the tag-list reload. |
| W2.9 | `Proxy/Multiplexing/PlcMultiplexer.cs:497-509` (currently no gate) | "Skip cache invalidations when listener is `recovering`" (`design.md:203`) is unimplemented. Implicit gating holds today (no backend → no FC06/FC16 response → no invalidation), but is undocumented and will break if any new code path produces a write response off the live backend. | Either (a) add an explicit check on `SupervisorState.Recovering` before calling `_invalidator.Invalidate(...)`, or (b) accept the implicit gating and add a comment in `PlcMultiplexer.cs` near the invalidation call referencing `design.md:203`. (b) is simpler and matches reality. | ✅ Doc-only fix → no test. (a) → fault-injection test that synthesises a response during recovery and asserts no invalidation. |
| W2.10 | `Configuration/ReloadValidator.cs:80` | `ReloadValidator` runs `BcdTagMapBuilder.Build` and discards the result; the resolved per-tag `CacheTtlMs` after the per-PLC default fold is never re-checked against `Cache.AllowLongTtl`. | After Build, walk `result.Map.Values` and re-run the long-TTL gate against resolved values. | ✅ Add to `ReloadValidatorTests`: tag with explicit null TTL inheriting a per-PLC default of 60_001 ms is rejected when `AllowLongTtl=false`. |
### BCD Rewriter
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.11 | `Bcd/BcdTagMapBuilder.cs:84-103` | `DuplicateAddress` validation is unreachable — input collapses through a `Dictionary` first. Design contract "reject duplicate addresses within a single PLC's resolved list" is silently violated. | Detect duplicates *before* the dictionary collapse: walk the `Add` list with a `HashSet<ushort>`, collect duplicates, return them as `DuplicateAddress` errors. Same for the merged `Global Add Remove` resolution where Add overrides Global at the same address (intentional width-override path stays valid; only collisions within Add itself are errors). | ✅ `BcdTagMapBuilderTests.Build_DuplicateAddressInAddList_ReportsError` (currently the test acknowledges the gap). |
| W2.12 | `Bcd/BcdTagMapBuilder.cs:113-127` | `OverlappingHighRegister` reports each pair twice (once from each direction). | Track reported pairs in `HashSet<(ushort,ushort)>`, dedupe symmetric reports. | ✅ `BcdTagMapBuilderTests.Build_TwoOverlappingPairs_ReportsExactlyOneErrorPerPair`. |
| W2.13 | `Proxy/BcdPduPipeline.cs:206-211` | FC16 32-bit write silently mutates values when `clientLow > 9999` — multiplication math accepts the OOR value and re-encodes a different number. | Validate `clientLow < 10_000 && clientHigh < 10_000` before the multiplication; on failure, increment `invalidBcdWarnings` and pass through raw. | ✅ `BcdPduPipelineTests.FC16_32Bit_ClientHighOrLowOOR_PassesThroughRaw_WithInvalidBcdWarning`. |
| W2.14 | `Proxy/BcdPduPipeline.cs:159` | FC16 `byteCount` field at `pdu[5]` is parsed but discarded. Malformed claims of `qty=10` with 4 bytes of register data silently process half the request. | Add `if (pdu.Length < 6 + qty*2) return;` at the top of `ProcessFc16Request` (and increment a counter or log). | ✅ `BcdPduPipelineTests.FC16_TruncatedRegisterData_PassesThroughRawWithoutHalfRewrite`. |
### Supervisor
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.15 | `Proxy/Supervision/PlcListenerSupervisor.cs:137-144` | `WaitForInitialBindAttemptAsync` busy-polls every 10 ms, races a fast `Stopped→Bound→Stopped`, never exits if the supervisor task throws. | Replace polling loop with a `TaskCompletionSource<bool>` set on the first state transition out of `Stopped`. | ✅ `SupervisorTests.WaitForInitialBindAttemptAsync_ResolvesEvenWhenBindThrows` — drive a synthetic bind exception and assert the wait returns within ~50 ms. |
| W2.16 | `Proxy/Supervision/PlcListenerSupervisor.cs:126` | `_supervisorCts` reassigned on `StartAsync` without disposing the previous instance. | Add a guard (`throw if not Stopped`) or dispose the previous CTS before reassigning. | (covered by inspection.) |
| W2.17 | `Proxy/Supervision/PlcListenerSupervisor.cs:177-180` | `Snapshot()` reads three fields independently → status page can show inconsistent state/lastBindError/recoveryAttempts triples. | Read into locals under a single lock, or use a single `volatile` struct that's swapped atomically. | ✅ `SupervisorTests.Snapshot_DuringTransition_IsInternallyConsistent` — fire many transitions in parallel and assert no inconsistent triple ever observed. |
| W2.18 | `Options/ConnectionOptions.cs` and `MbproxyOptions.cs` validators | `Connection.{BackendConnectTimeoutMs,BackendRequestTimeoutMs,GracefulShutdownTimeoutMs}` are not validated for `> 0`. A misconfigured 0 produces immediate `CancelAfter(0)` failures. | Add `> 0` checks to `MbproxyOptionsValidator` and `ReloadValidator`. | ✅ `MbproxyOptionsBindingTests.Validation_RejectsZeroOrNegativeTimeouts`. |
### Hosting / Diagnostics
| # | File:line | Problem | Change | Test |
|---|-----------|---------|--------|------|
| W2.19 | `Diagnostics/ShutdownCoordinator.cs:93,102,118`; `Program.cs:42` | Coordinator binds `IOptions<MbproxyOptions>` (singleton snapshot). A hot-reloaded `GracefulShutdownTimeoutMs` is silently ignored. | Switch to `IOptionsMonitor<MbproxyOptions>`; read `.CurrentValue.Connection.GracefulShutdownTimeoutMs` inside `ShutdownAsync`. After W1.5 verify the coordinator still exists; if it was removed, this fix migrates to wherever drain logic lives. | ✅ `ShutdownE2ETests.E2E_HotReloadedShutdownTimeout_IsHonoured`. |
| W2.20 | `Diagnostics/ShutdownCoordinator.cs:147` | Hardcoded `TimeSpan.FromSeconds(5)` for supervisor stop deadline regardless of `GracefulShutdownTimeoutMs`. | Use the configured value. | (covered by W2.19's test if extended.) |
| W2.21 | `src/Mbproxy/appsettings.json``install/mbproxy.config.template.json` | Shipped `appsettings.json` is an empty stub. Fresh `Mbproxy.exe` from publish folder is a no-op service. | Either mirror the install template into `src/Mbproxy/appsettings.json`, or set `<None Update="appsettings.json"><CopyToOutputDirectory>Never</CopyToOutputDirectory></None>` in the csproj so the publish folder doesn't ship a stub. | ✅ `HostSmokeTests.Host_StartsWithCheckedInAppsettings_AndAtLeastOneListenerBinds`. |
| W2.22 | `Admin/StatusDto.cs:102`, `Admin/StatusSnapshotBuilder.cs:127-131` | `pdus.invalidBcdWarnings` and `backendExceptions.codeOther` are tracked in `CounterSnapshot` but never surfaced via the JSON DTO or HTML renderer. Dead increment paths. | Add the two fields to `PlcPdusStatus` / `ExceptionCounts` and wire them through `StatusSnapshotBuilder` and `StatusHtmlRenderer`. | ✅ `AdminEndpointTests.StatusJson_IncludesInvalidBcdWarnings_AfterPipelineWarning`; ✅ analogous for `codeOther`. |
| W2.23 | `Diagnostics/EventLogBridge.cs:46` | `EventLog.SourceExists` registry hit on every Error+ log line. | Cache the result in a `bool _sourceExists` field set in the constructor (one-shot). Document that source registration changes after process start require a service restart. | (covered by inspection; optional perf-sensitive E2E.) |
| W2.24 | `Proxy/ProxyWorker.cs:202-217` | Hard-coded 5-second deadline in `StopAsync`. Coexists with the coordinator's deadline; reader of just `ProxyWorker` would miss the contract. | If W1.5 keeps the coordinator: add a comment pointing at it. If W1.5 deletes the coordinator: replace the 5s with `IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs`. | (covered by W2.19's test.) |
---
## Wave 3 — Cleanup, Doc-only, Test Gaps (low priority)
### Cleanup (small ⚙)
- `Proxy/Multiplexing/PlcMultiplexer.cs:445-450` — dead `elapsedMs` calculation, never used. Remove.
- `Proxy/Multiplexing/UpstreamPipe.cs:262-263` — meaningless conditional `firstRead && remaining == count ? false : false`. Replace with `return false`.
- `Proxy/Multiplexing/InFlightByKeyMap.cs:62``TryAttachOrCreate` always returns `true`; drop the dead `bool`. (Resolved as part of W1.2.)
- `Bcd/BcdCodec.cs:103-110` and `Proxy/BcdPduPipeline.cs:455-459` — duplicated `HasBadNibble`. Promote codec's helper to `internal` and call it from the pipeline.
- `Proxy/Supervision/PlcListenerSupervisor.cs:268,323,346` — copy-pasted exception-message truncation. Extract `Truncate(string, int)` helper.
- `Proxy/Multiplexing/PlcMultiplexer.cs:844-846` — comment says "1-second floor"; code uses 100 ms. Fix comment.
- `Configuration/ConfigReconciler.cs:428` — duplicated tag-delta computation. Either expose a delta count from `ReloadPlan.Compute` or drop the log enrichment.
- `Mbproxy.csproj:42` — comment describes Polly as "deferred" but it's actively used. Remove stale comment.
- `Admin/StatusSnapshotBuilder.cs:69` — unreachable `?.Address.ToString()` fallback. Simplify to `p.RemoteEp?.ToString() ?? "?"`.
### Doc-only (📝)
- `README.md` (operator-facing) — add a callout describing the 32-bit BCD wire format as "two base-10000 digits in CDAB", not standard binary CDAB Int32. Standard Modbus clients (NModbus, FluentModbus, Wonderware DAServer) will silently corrupt values > 9999 if they assume normal CDAB-int32 semantics. *(BcdRewriter M1.)*
- `docs/design.md` and `docs/Reference/LogEvents.md` — clarify counter semantics for cache-miss-during-coalescing per the choice made in W2.4. *(Mux M9.)*
- `docs/Operations/Configuration.md` — note that the Event Log sink is a custom `EventLogBridge`, not `Serilog.Sinks.EventLog`, so reviewers don't add a duplicate sink. *(Hosting minor.)*
- `Proxy/ResponseCache.md` (or `docs/Architecture/ResponseCache.md`) — document the implicit invalidation-skip-during-recovery gating. *(Cache M1 / Supervisor M1, paired with W2.9 option (b).)*
- `docs/plan/README.md` — add a phase entry for "Phase 12: Review remediation 2026-05-14" pointing at this file, so the phase history stays complete.
### Test gaps (✅)
The following design contracts are unverified by any current test. Add them in priority order:
1.`ReloadValidatorTests``Cache.AllowLongTtl=false` rejects `CacheTtlMs > 60_000` (per-tag and per-PLC-default forms). Paired with W2.10. *(Cache test gap; design.md.)*
2.`ReloadValidatorTests``Cache.AllowLongTtl=true` happy-path acceptance.
3.`HotReloadE2ETests` — tag-list reload flushes the affected PLC's cache (paired with W1.1 + W2.8).
4.`SupervisorTests` — replace placeholder `Supervisor_RuntimeFault_TriggersRecovery` with a test that genuinely faults the accept loop and asserts state goes `Bound → Recovering → Bound`. *(Supervisor test gap.)*
5.`PlcMultiplexerTests` — TxId allocator saturation propagates a clean Modbus exception (0x06 or equivalent) to upstream, doesn't hang. *(Mux test gap.)*
6.`PlcMultiplexerTests` — coalescing factory leak under saturation with multiple attachers. (Paired with W1.2.)
7.`PlcMultiplexerTests` — backend-reader head-of-line block by a slow upstream. (Paired with W1.3.)
8.`PlcMultiplexerTests` — watchdog↔response race (response arriving in the same tick as `SnapshotOlderThan`).
9.`PlcMultiplexerTests` — cascade racing with a brand-new accept (does the new pipe survive or land in a zombie multiplexer mid-teardown?).
10.`HotReloadE2ETests` — flip `ReadCoalescing.Enabled` at runtime (not start-up) and assert subsequent reads stop coalescing. (Paired with W2.1.)
11.`BcdCodecTests``Encode16(int.MinValue)` doesn't blow up.
12.`BcdPduPipelineTests` — write range ending exactly mid-32-bit-pair on the high address (inverse of existing low-side test).
13.`BcdPduPipelineTests` — mixed 16-bit + 32-bit + non-BCD slots in a single FC03 read.
14.`BcdPduPipelineTests` — FC16 *response* handling (currently only the request is tested).
15.`ProxyForwardingTests` or similar — `qty > 128` FC03 read passes through unchanged; PLC's exception 03 surfaces to the client.
16.`ConfigReconcilerTests` — replace the soft "concurrent reloads serialised" test with one that drives genuinely-concurrent applies and asserts no overlap (using a fake supervisor with a "claim count" assertion).
17.`AdminEndpointTests` — POST/PUT/DELETE/HEAD against `/` and `/status.json` return 405 (verifies no unintended methods are bound).
---
## Out of Scope
Findings that this plan deliberately does **not** address:
- **Mux C2 / C3** (cache-hit-during-cascade visibility model). Per the design these are acceptable behaviours; the Multiplexer review downgraded them.
- Any finding that recommended adding a new feature (e.g. an `Mbproxy.AdminBindAddress` config knob) or new defense-in-depth security hardening — those were trimmed from the reviews per direction on 2026-05-14.
---
## Suggested Sequencing
A focused remediation pass might split as:
| Sprint | Items | Outcome |
|--------|-------|---------|
| Sprint 1 (~1 week) | W1.1, W1.2, W1.3, W1.4 | The four critical multiplexer/hot-reload defects are gone. Hot-reload finally works as documented. |
| Sprint 2 (~1 week) | W1.5, W2.19, W2.20, W2.24 + the related test (ShutdownE2ETests) | Shutdown path is single-owner and respects the configured timeout. |
| Sprint 3 (~1 week) | W2.1, W2.3, W2.8, W2.9, W2.10, W2.21 + cache/supervisor tests | All the major hot-reload + cache contract gaps closed. |
| Sprint 4 (~1 week) | All remaining W2 items (BCD, supervisor, multiplexer minor) | Counter accuracy, dispose hygiene, validator coverage. |
| Sprint 5 (~3 days) | Wave 3 cleanup + doc + tests | Suite + docs caught up. |
After Sprint 1 the production risk profile drops sharply. After Sprint 3 every documented design contract is enforced. Wave 3 is opportunistic and can be folded into ongoing work.
@@ -0,0 +1,61 @@
# Phase 11 Response Cache — Code Review
Scope: `src/Mbproxy/Proxy/Cache/*`, `BcdTagOptions.CacheTtlMs`, `PlcOptions.DefaultCacheTtlMs`, `MbproxyOptions.CacheOptions`, `Configuration/ReloadValidator.cs` (long-TTL gate), and the wire-up in `PlcMultiplexer.cs` (cache→coalesce→backend lookup order, post-rewriter store, write invalidation). Cross-checked against `docs/design.md` "Response cache (Phase 11)", `docs/Architecture/ResponseCache.md`, `docs/plan/11-response-cache.md`. Tests under `tests/Mbproxy.Tests/Proxy/Cache/*`.
## Summary
- The cache contract is implemented correctly in the load-bearing places: lookup-order, post-rewriter store, address-range overlap invalidation, exception-skip, unit-id isolation, multi-tag `min(TTL)` with the conservative "any TTL=0 disables the range" rule, and pre-backend-connect short-circuit on hit.
- TTL resolution is split across the options binder (per-tag/per-PLC fold at build time, `BcdTagMapBuilder.Build` lines 107-110) and the per-read collapse in `BcdTagMap.ResolveCacheTtlMs` (lines 65-78). The hot path sees a single integer per tag.
- Concurrency is a single coarse `lock` (`ResponseCache.cs:30`); LRU is a 64-bit monotonic ticker rather than a clock; LRU eviction is a linear scan.
- Several lower-priority gaps exist: `mbproxy.cache.flushed` event is never emitted, write-invalidation runs the overlap matcher against a snapshot of *all* keys (no UnitId/FC pre-filter), and the long-TTL gate is enforced only against config-shaped TTLs (not against the resolved per-tag values surfaced to the runtime).
## Critical findings
**None.** No correctness, race, or stale-data leak found that violates the design contract.
## Major findings
1. **`mbproxy.cache.flushed` event never emitted.** Defined in `src/Mbproxy/Proxy/Cache/CacheLogEvents.cs:67-76` and referenced in the design doc (`docs/design.md:225`, `docs/Architecture/ResponseCache.md:347`) and `docs/Reference/LogEvents.md:374`, but no caller exists (Grep matches: only the definition). The hot-reload PLC-cache flush in `PlcListenerSupervisor.ReplaceContextAsync` (`src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:199-216`) drops the old cache via `Dispose()` rather than `Clear()` (so the bytes/entry-count drop is implicit). Operators reading the docs will look for an event that never appears.
2. **Long-TTL gate is enforced only against `BcdTagOptions.CacheTtlMs` and `PlcOptions.DefaultCacheTtlMs` at config-shape time.** `MbproxyOptionsValidator.ValidateCacheTtl` (`src/Mbproxy/Options/MbproxyOptions.cs:102-112`) and `ReloadValidator.CheckTtl` (`src/Mbproxy/Configuration/ReloadValidator.cs:120-129`) inspect the *raw* options. After the resolver folds the per-PLC default into per-tag values (`BcdTagMapBuilder.cs:107`), a tag whose explicit `CacheTtlMs` is null and whose PLC default is `60_001` would pass each individual gate (the per-PLC default check on line 97 does catch this), but a tag whose explicit value is `30_000` and inherited default would also be 30_000 — so the explicit interaction is fine. The resolved `BcdTag.CacheTtlMs` is, however, not re-checked anywhere — consistent with intent, but worth flagging if the per-tag fold ever grows arithmetic (e.g. clamping or summing). Note: `ReloadValidator` also runs `BcdTagMapBuilder.Build` (line 80) and discards its results — the resolved TTLs are never re-validated against `AllowLongTtl`.
3. **Hot-reload flush has no test coverage.** No test in `tests/Mbproxy.Tests/Proxy/Cache/*` exercises `ReplaceContextAsync` or the dispose-old-cache path. Given that this is the only mechanism that satisfies the "any tag-list change drops the entire PLC cache" rule, a test would be valuable. (See also the Supervisor review's Critical 1 — the running multiplexer captures `_ctx` at construction and never re-reads, so even when the supervisor swaps `_currentContext` the running mux keeps using the old cache. Until that is fixed, the "flush" semantics are only realised when the listener is later restarted by Polly.)
4. **Cache survives backend disconnect — not directly tested for invalidation skip.** The `FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_IfCachePrePopulated` test (`ResponseCacheMultiplexerTests.cs:506-543`) covers the read side. There is no test for the design clause "Invalidations during a `recovering` listener state are skipped" (`docs/Architecture/ResponseCache.md:299-304`). The implementation gates this implicitly: invalidation only runs in `RunBackendReaderAsync` after a successful FC06/FC16 *response* arrives (`PlcMultiplexer.cs:497-509`), and a `Recovering` state means no backend reader → no response → no invalidation. The implicit gating is correct but fragile and undocumented in code comments.
## Minor findings
5. **`Invalidate` snapshots ALL keys with `_entries.Keys.ToArray()` then filters by UnitId/FC inside `CacheInvalidator.FindOverlapping`.** `ResponseCache.cs:157`. For a 1000-entry cache on a multi-unit gateway, every write allocates a 1000-element array and walks it. An obvious micro-optimisation is to filter inside the lock with the matcher. Acceptable at current scale (54 PLCs, 1000 entries cap, low write rate).
6. **`Dispose()` doesn't drain `_entries`.** `ResponseCache.cs:190-202` cancels the eviction loop but leaves entry buffers reachable until GC. Not a correctness issue; just worth a `Clear()` call.
7. **`SweepExpired` allocates a fresh `List<CacheKey>` every tick.** `ResponseCache.cs:261`. With `EvictionIntervalMs = 5000` and typical low expiry counts this is one allocation per 5 s — fine — but a reusable pooled list (or two-phase enumerate-and-conditional-remove using `Dictionary.Remove`'s 1-pass nature in newer runtimes) would eliminate it.
8. **`BuildCacheHitFrame` doesn't validate `cachedPdu.Length` against the MBAP 253-byte cap.** `PlcMultiplexer.cs:952-966`. Cache stores are guarded upstream (the backend reader enforces `MaxPduBodySize`), but the Length field arithmetic on line 956 silently overflows a `ushort` for >65534-byte PDUs. Defensive only.
9. **`elapsedMs` computed but unused.** `PlcMultiplexer.cs:445`. Dead code.
10. **`MaxEntriesPerPlc = 0` semantics.** `Set` is no-op (line 122) but the eviction task still runs every `EvictionIntervalMs`. Operators who set this to 0 to disable the cache will pay an idle timer cost. Document or short-circuit constructor.
11. **`Cache.AllowLongTtl` is documented as "reload-time check" but is also enforced at schema-validation time** (`MbproxyOptions.cs:74-77`, `ReloadValidator.cs:97-100`). Both layers agree but the duplication invites drift if one is updated.
## What looks good
- **Lookup order:** The cache check at `PlcMultiplexer.cs:610-629` happens before `EnsureBackendConnectedAsync` (line 634) and before the coalescing `TryAttachOrCreate` block (line 661). A hit returns immediately without creating an `InFlightByKeyMap` entry — verifiable by `CacheHit_ShortCircuits_Coalescing` test (`ResponseCacheMultiplexerTests.cs:304-343`).
- **Post-rewriter store ordering:** `_pipeline.Process(MbapDirection.ResponseToClient, ...)` runs at `PlcMultiplexer.cs:455-459`, then the cache `Set` at line 490. The `BcdDecodedBytes_AreCached_NotRawBcd` test verifies this end-to-end.
- **Multi-tag conservative rule:** `BcdTagMap.ResolveCacheTtlMs` returns 0 immediately on encountering any `ttl <= 0` (line 74), bailing the loop. Verified by `MultiTagRange_AnyZeroTtl_DisablesCachingForWholeRead` test.
- **Invalidator scope:** `CacheInvalidator.FindOverlapping` (`CacheInvalidator.cs:43-49`) explicitly checks both FC03 and FC04, and tests cover both adjacent-but-not-overlapping (register 15 case) and full/partial/disjoint overlap cases (`CacheInvalidatorTests.cs`). Different unitIds isolated correctly.
- **Exception skip:** `PlcMultiplexer.cs:469-471` checks `(fcInResponse & 0x80) != 0` before either store or invalidate. Consistent with the design clause "exception responses do not invalidate."
- **Counter parity:** Cache hit/miss only increment when `resolvedCacheTtlMs > 0` (`PlcMultiplexer.cs:613`), so uncached reads (TTL=0) bump neither, satisfying the `Hit + Miss = cache-eligible` identity. Tested by `MultiTagRange_AnyZeroTtl_DisablesCachingForWholeRead` asserting both counters at 0.
- **LRU correctness:** Tested by `LRU_TracksAccessOrder_AcrossGetAndSet`. Monotonic-counter design (`CacheEntry.cs:14-16`) is robust to wall-clock skew and clock calls.
- **Memory accounting:** `_approxBytes` decremented on every removal path (lazy expiry line 103, eviction line 228, sweep line 272, invalidate line 164, replace line 132, clear line 182). Tested by `ApproximateBytes_TracksSetReplaceAndInvalidate`.
- **Write invalidation in multiplexer is FC04-aware:** Even though writes only land on holding registers, `CacheInvalidator` evicts both FC03 and FC04 keys at the same address, matching the design's `(unitId, FC ∈ {3,4})` scope.
- **FC04 cacheability:** `PlcMultiplexer.cs:610` and `:473` both check `is 0x03 or 0x04` — input registers are cacheable, not just holding.
- **Pre-existing cache survives backend down:** `EnsureBackendConnectedAsync` is called only after the cache check returns false (line 634); `FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_IfCachePrePopulated` verifies.
## Open questions
1. The design specifies `mbproxy.cache.flushed` events with `Reason` propagation. Without a caller, what's the intended emission point — in `ReplaceContextAsync`, in the supervisor's reseat handler, or somewhere in `ConfigReconciler`? See `CacheLogEvents.cs:67-76`.
2. Is `cacheBytes` intended to reflect dictionary overhead (key+entry headers) or only PDU bytes? Currently only PDU bytes (the `Length` field). For a 1000-entry cap with small PDUs, dictionary overhead may dominate actual heap; this matters for the "Tier-2 memory-watch KPI" framing.
3. When a tag-list reload would change `MaxEntriesPerPlc` (the `Cache.MaxEntriesPerPlc` knob, not per-tag TTL), the design says "Existing entries unaffected; cap applies to subsequent inserts." But `ReplaceContextAsync` runs only on per-PLC tag-list change — a `MaxEntriesPerPlc` change alone does NOT trigger a context swap, so the running `ResponseCache` keeps its constructor-time cap. Is this acceptable, or should `Cache` knob changes propagate live?
4. Does the `BcdTagMap.ResolveCacheTtlMs` allocation behaviour (allocates the hit list when *any* tag is in range, then immediately discards on TTL=0) need optimising for the hot path? At ~54 PLCs and modest read rates probably no, but at 1000+ FC03/s per PLC it would matter.
@@ -0,0 +1,70 @@
# Listener Supervisor and Hot-Reload Subsystem — Code Review
Scope: `src/Mbproxy/Proxy/Supervision/*`, `src/Mbproxy/Configuration/*`, `Options/MbproxyOptions.cs` and friends, `Proxy/ProxyWorker.cs`. Cross-checked against `docs/design.md` (Listener auto-recovery, Configuration hot-reload, Startup posture, Failure modes), `docs/Architecture/ConnectionModel.md`, `docs/Features/HotReload.md`, `docs/Operations/Configuration.md`, `docs/plan/05-listener-supervisor.md`, `docs/plan/06-hot-reload.md`. Tests under `tests/Mbproxy.Tests/Proxy/Supervision/*` and `tests/Mbproxy.Tests/Configuration/*`.
## Summary
- The supervisor + reconciler are structurally sound: the Polly profiles match the design exactly (3-attempt backend connect at 100/500/2000 ms; 5-step recovery at 1/2/5/15/30 s + steady-state 30 s indefinitely), debounce + serialise are well-implemented, and the validator rejects every documented bad input.
- One **documented contract is unimplemented**: "skip cache invalidations when the listener is `recovering`" (`docs/design.md:203`, `docs/Features/HotReload.md` cache section). Nothing in `PlcMultiplexer` consults supervisor state — there is no plumbing from the supervisor to the cache invalidator.
- Several **race / leak windows** exist around `ReplaceContextAsync` (counter pointer not threaded into running multiplexer; old listener keeps running with stale cache after dispose) and `WaitForInitialBindAttemptAsync` (race between fast bind+stop and the wait method).
- Test coverage of the supervisor is thin in places — no unit test exercises the "RunAsync returned without cancellation" recovery branch, and no test verifies `lastBindError`/`recoveryAttempts` after a runtime fault.
## Critical findings
**C1 — Reseat does NOT swap the multiplexer's context, so cache + tag-map changes do not actually take effect on existing connections** (`PlcListenerSupervisor.cs:199-216`, `PlcMultiplexer.cs:50` / `:101` / `:109`).
`ReplaceContextAsync` only writes `_currentContext`. Per its own comment (lines 188-191), "the very next `PlcListener` constructed inside the Polly loop … picks up `newCtx`. Existing in-flight upstream pipes served by the current multiplexer keep their reference." But `PlcMultiplexer` captures the context once at construction (`_ctx = perPlcContext`, line 101) and uses that captured reference for **the response cache** (`_ctx.Cache` in invalidation paths around lines 499-506) and **the BCD rewriter map**. The currently-running listener is `_currentListener`, which is never recreated by a reseat. So in steady state (no listener fault between reseats):
- The cache-disposal logic at `PlcListenerSupervisor.cs:205-213` disposes the old cache while the running multiplexer is still **using** it. Subsequent FC03/FC04 hits / invalidations on the running mux dereference a disposed cache (eviction timer cancelled, `Dictionary` still works but state is stale).
- The "any tag-list reload flushes the affected PLC's whole cache" doctrine (CLAUDE.md:46, design.md:155) is only realised the next time the listener faults and Polly reconstructs a `PlcListener`. Until then, the running multiplexer keeps serving stale tag-map decisions and a soon-to-be-disposed cache.
This is the highest-impact gap. Either `PlcMultiplexer` needs a hot-swap path (volatile ctx ref it re-reads), or the reseat path must restart the multiplexer (not just swap a field on the supervisor that nobody reads in the hot path).
**C2 — Counters are reset on Restart even when documentation says they are preserved on Reseat only** (`ConfigReconciler.cs:284`).
The Restart branch builds `Counters = new Proxy.ProxyCounters()`. That is intentional per `HotReload.md:117` ("A restart … constructs a brand-new `ProxyCounters` because the supervisor itself is brand new."). However, `service.config.reloadCount` increments via `RecordReloadApplied` even when only validation succeeded but every restart task **threw and was logged at Error** (`ConfigReconciler.cs:309-313`). A reload where every restart fails still bumps `reloadCount` and clears the operator's history of those PLCs without warning. Consider either failing the whole apply, surfacing a partial-failure counter, or at minimum a warning-level event for partial-apply.
## Major findings
**M1 — "Skip cache invalidation while recovering" is unimplemented** (design.md:203; verified absent in `PlcMultiplexer.cs`). Grep for `recovering`, `SupervisorState.Recovering`, or any equivalent in the multiplexer returns no matches. The design clearly states "Invalidations during a `recovering` listener state are skipped (the write never reached the backend, the cached read remains valid)". Today, a write that fails the listener but somehow still makes it to the cache invalidation path could clear good entries. In practice, when a listener is `recovering` the upstream socket is closed and no requests flow, so the bug surface is small — but it's still a documented contract violation.
**M2 — `WaitForInitialBindAttemptAsync` has a race and busy-polls** (`PlcListenerSupervisor.cs:137-144`). The 10 ms polling loop will never exit if the supervisor task throws (e.g., synchronous bind exception captured by Polly that immediately retries before the test grabs the snapshot). It also masks transitions: a supervisor that does `Stopped → Bound → Stopped` between polls (fast bind followed by `StopAsync`) will exit the loop on `_supervisorTask.IsCompleted` rather than ever reflecting `Bound`. Replace with a `TaskCompletionSource` set when the first state transition out of `Stopped` happens.
**M3 — `_supervisorCts` reassignment on `StartAsync` leaks the previous CTS** (`PlcListenerSupervisor.cs:126`). If `StartAsync` is called twice (or after `StopAsync`), the previously-disposed CTS is replaced without disposing the linked-token registration. Add either a guard (`throw if not Stopped`) or dispose the previous CTS before reassigning.
**M4 — `Snapshot()` is non-atomic across three fields** (`PlcListenerSupervisor.cs:177-180`). `_state`, `_lastBindError`, `_recoveryAttempts` are read independently; the supervisor task can transition mid-snapshot so the status page can report `State=Bound` with `LastBindError` from a prior attempt and `RecoveryAttempts > 0` simultaneously. The fix is small (read into locals under a lock or under a single `Interlocked.MemoryBarrier`), and the design.md says "transitions atomic" implicitly via the `bound`/`recovering`/`stopped` machine.
**M5 — `ConfigReconciler.ApplyUnderLockAsync` mutates `_supervisors` from multiple Task.Run continuations concurrently** (`ConfigReconciler.cs:236-238`, `:269-271`, `:306`, `:390`). The Remove and Restart blocks each do `_supervisors.Remove(name)` / `_supervisors[name] = newSupervisor` from inside `Task.WhenAll(... select async ...)`. `Dictionary<TKey,TValue>` is not thread-safe; the comment at line 50 says "All mutations happen inside ApplyAsync which is serialised by the semaphore" — but the semaphore serialises *outer* Apply calls, not the inner concurrent tasks. Two PLCs being added in parallel can corrupt the dictionary. Replace with `ConcurrentDictionary` or move dict mutations to a single thread (`await Task.WhenAll(...)` first, then mutate).
**M6 — Per-restart task captures shared `backendPipeline` but does not honour `ReadCoalescingOptions` accessor** (`ConfigReconciler.cs:294-304`, `:378-388`). The constructor argument list passed to `new PlcListenerSupervisor(...)` from the reconciler omits the `coalescingAccessor` that `ProxyWorker.cs:129-130` constructs. Result: PLCs added or restarted via hot-reload always get the **default** `ReadCoalescingOptions` (Enabled=true, MaxParties=32) instead of the live monitor value, **and** a hot-reload of `ReadCoalescing.Enabled` does not propagate to those supervisors. This contradicts CLAUDE.md and design.md ("Hot-reload via `Mbproxy.Resilience.ReadCoalescing.Enabled`").
**M7 — `MbproxyOptions` has no `ConnectionOptions` validation** (`MbproxyOptions.cs`, `ReloadValidator.cs`). Neither validator rejects `BackendConnectTimeoutMs <= 0`, `BackendRequestTimeoutMs <= 0`, or `GracefulShutdownTimeoutMs <= 0`. A misconfigured 0 / negative value will produce a `CancelAfter(0)` that immediately fires and breaks every backend connect.
**M8 — `MbproxyOptionsValidator` does not enforce cross-PLC checks but `IValidateOptions` runs at startup; reload uses `ReloadValidator`** — this dual-path is fine, but `ReloadValidator` is missing **the duplicate-address-in-resolved-tag-list** test for the per-PLC `Add` list against itself when no global list collision occurs. `BcdTagMapBuilder.Build` does catch this (DuplicateAddress error), so the gap is plugged transitively. Worth a comment in `ReloadValidator.cs:78-83` to make explicit that per-PLC duplicate detection is delegated.
**M9 — `ProxyWorker.StopAsync` does NOT respect `Connection.GracefulShutdownTimeoutMs`** (`ProxyWorker.cs:202-217`). Hard-coded 5-second deadline. The graceful-drain timeout lives in a separate `ShutdownCoordinator` (Phase 08); but the worker itself drops in-flight requests after 5s, before the coordinator's 10s default. Fine if both are wired together (see ShutdownCoordinator), but a reader of just `ProxyWorker` would miss the contract. Add a comment pointing at `ShutdownCoordinator`.
## Minor findings
- **Min-1.** `RunSupervisorAsync` truncates exception messages to 256 chars in two places with copy-pasted code (`PlcListenerSupervisor.cs:268`, `:323`, `:346`). Extract a helper `Truncate(string, int)`.
- **Min-2.** `LogBindFailed` uses `EventId 41` and `EventName mbproxy.startup.bind.failed`, but `PlcListener.cs:189` (Phase-pre-supervisor leftover) defines another `LogListenerFaulted` with `EventId 22` and the **same** name `mbproxy.listener.faulted` as `PlcListenerSupervisor.cs:413` (`EventId 43`). Two `LoggerMessage` event names should be unique per-name; duplicate names confuse log scrapers.
- **Min-3.** `PolicyFactory.cs:54` only retries `ConnectionRefused`, `TimedOut`, `HostUnreachable`, `NetworkUnreachable`. `WSAEHOSTDOWN`/`HostDown` is a common transient on Windows; consider adding it.
- **Min-4.** `PolicyFactory.cs:69` logs at Debug only — design.md:211 says retries log at Debug, but operators triaging would likely want the **last** retry attempt of a backend connect at Information. Consider using `MaxAttempts - 1 == args.AttemptNumber` to elevate the final attempt.
- **Min-5.** `ConfigReconciler.ComputeGlobalTagDelta` (`:428`) recomputes a delta for the log line, duplicating logic that `ReloadPlan.Compute` already has internally via `TagMapsEqual`. Accept the slight cost or expose a delta count from `ReloadPlan`.
- **Test gap.** `SupervisorTests.cs` (6 tests) is missing the "RunAsync returned cleanly without cancellation" branch (`PlcListenerSupervisor.cs:340-350`). No test exercises the "listener accept loop ended unexpectedly" recovery transition.
- **Test gap.** `ConfigReconcilerTests.cs:235` "concurrent reloads serialised" only verifies they all succeed; it does not actually assert serialisation (the comment admits the measurement isn't perfect). Drive a real concurrent test using a fake supervisor that asserts no overlap.
- **Test gap.** No test covers the cache-disposal-while-multiplexer-still-uses-it scenario (C1) or the cross-thread `_supervisors` mutation (M5).
## What looks good
- **Polly pipelines exactly match the design.** Backend connect: `BackendConnect = { MaxAttempts = 3, BackoffMs = [100, 500, 2000] }` (`ResilienceOptions.cs:5`); listener recovery: `[1000, 2000, 5000, 15000, 30000]` then `30_000` indefinite (`ResilienceOptions.cs:7-9`). `MaxRetryAttempts = int.MaxValue` for listener recovery is correct (`PolicyFactory.cs:101`).
- **Cancellation honoured promptly on shutdown.** `_supervisorCts.CancelAsync()` cancels both Polly's delay and the inner accept loop simultaneously (`PlcListenerSupervisor.cs:159`). `StopAsync` doc comment that it "completes within ~1 s regardless of backoff window size" is accurate because Polly v8's `ExecuteAsync(ct)` propagates the token through the delay generator.
- **Bind lifecycle is idempotent.** Each Polly attempt creates a fresh `PlcListener` and disposes the failed one before re-entering the recovery delay (`PlcListenerSupervisor.cs:264`, `:311`, `:320`, `:337`). Only one `_currentListener` exists at any time. The half-bound port window is closed by `await listener.DisposeAsync()` before re-entering the retry.
- **Single code path for "bring up a listener."** Both startup (`ProxyWorker.cs:115-146`) and reload-add (`ConfigReconciler.cs:359-401`) construct a `PlcListenerSupervisor` and call `StartAsync` — no second listener-init code path. Remove and Restart both go through `supervisor.StopAsync` + `DisposeAsync`. Matches design.md:213.
- **Validator returns all errors, not just the first.** `ReloadValidator.cs:32-117` runs every check, accumulates into `errs`, returns `false` only after every rule has been checked — operator sees every problem on a single save.
- **Debounce loop is correct.** The drain pattern (`ConfigReconciler.cs:154-171`) handles editor rename-replace storms exactly as documented (250 ms quiescent window, last-write-wins via `BoundedChannelFullMode.DropOldest`).
- **`TagMapsEqual` includes `CacheTtlMs`** (`ReloadPlan.cs:113`), so per-tag TTL changes correctly route to ToReseat — matches HotReload.md:70.
- **Graceful-shutdown integration via `ShutdownCoordinator`** (separate file) handles drain-timeout properly even though `ProxyWorker.StopAsync` uses a hard 5 s.
## Open questions
1. C1 (cache + tag-map not actually swapped on reseat) is the most consequential finding — is `PerPlcContext` actually intended to be hot-swapped on the running `PlcMultiplexer`, or is the current "swap is visible only on next listener fault" the intended (and undocumented) behaviour?
2. M6: Should the reconciler's add/restart paths receive the same `coalescingAccessor` `ProxyWorker` constructs? It looks like an oversight when Phase 10 (`ReadCoalescing`) was added after Phase 6 (hot-reload).
3. The status page surface for "supervisor is recovering" is documented (`listener.state = recovering`, `listener.lastBindError`, `listener.recoveryAttempts`) but the cache-skip-during-recovery clause needs a contract owner (M1). Was the design intent to thread the supervisor's snapshot into the multiplexer, or to gate via a different signal (e.g., backend-disconnect cascade flag)?
@@ -0,0 +1,73 @@
# Test Suite Assessment
Scope: every `.cs` file under `tests/Mbproxy.Tests/` (excluding bin/obj). Cross-checked against `docs/design.md` Testing section, `docs/plan/README.md` per-phase test inventory, `docs/Testing/Simulator.md`. Per CLAUDE.md the suite was around 325 tests at Phase 9 baseline and has grown through Phase 11.
## Summary
**Counts (by `[Fact]` / `[Theory]` attribute, recursive over `tests/Mbproxy.Tests/**/*.cs` excluding `obj/`):**
| Bucket | Files | Test methods |
|---|---:|---:|
| Unit (no I/O or stub-socket only, `[Trait("Category","Unit")]`) | 26 | ~189 |
| E2E (`[Trait("Category","E2E")]`, simulator + real Kestrel + real host) | 9 | ~37 |
| Other / fixture smoke (`HostSmokeTests`, `SimulatorSmokeTests`, fixture files) | 4 | 7 |
| **Total method count** | **39** | **~233** |
Only **2 `[Theory]`** tests exist (both in `BcdCodecTests`), each with a few `[InlineData]` rows, so the actual executed-test count is roughly **240245**, slightly **below** the README/CLAUDE.md baseline of "325 tests at Phase 9 (282 unit + 43 E2E)". Either CLAUDE.md is over-reporting (perhaps counting individual `Theory` cases that have since been refactored away, or counting per-PLC parameterized runs), or test consolidation has happened since Phase 9 — worth reconciling. Phase 11 net additions are visible (Cache/* directory: 5 files, 32 facts).
**Headlines:**
- BCD codec, MBAP framing, multiplexer correlation, cache, and coalescing are all well-covered with both targeted unit tests and E2E + stub-backend integration tests; the suite has clear architectural layering (pure unit → stub-socket integration → simulator E2E).
- The E2E suite uses a **single shared `DL205SimulatorFixture`** that gracefully `Assert.Skip`s when Python/pymodbus is unavailable, but it **leans heavily on `Task.Delay` polling** (64 occurrences) and explicit time-window arithmetic — flake-prone on slow/loaded CI workers.
- The biggest concrete coverage gap is the **`Cache.AllowLongTtl` reload-rejection rule**, which is documented in `docs/design.md` and enforced in `ReloadValidator.cs` (line 100, 128) and `MbproxyOptions.cs` but has **zero corresponding test** in `ReloadValidatorTests.cs` or anywhere in the suite (`grep -rn AllowLongTtl tests/` returns nothing).
## Coverage matrix
| Subsystem | Rating | Notes |
|---|---|---|
| BCD codec (16/32-bit, partial, OOR) | **Strong** | `BcdCodecTests.cs` (19) + `BcdPduPipelineTests.cs` (26) cover encode/decode, single-reg, full 32-bit CDAB pair, partial low-only & high-only with `PartialBcdWarnings`, mixed-with-non-BCD, bad-nibble pass-through, FC06 OOR, FC16 multi-reg + partial pair, exception responses, length invariant. |
| MBAP frame parsing | **Adequate** | `MbapFrameTests.cs` (10 facts). Not read in depth — but headers in `PlcMultiplexerTests` exercise real wire frames end-to-end. |
| Multiplexer (TxId, correlation, cascade, watchdog) | **Strong** | `TxIdAllocatorTests` (8, incl. saturation, full-65536 sweep, wrap, parallel-stress concurrency), `CorrelationMapTests` (5), `RewriterCorrelationTests` (4), `PlcMultiplexerTests` (8: TxId-collision, distinct-on-wire, upstream-disconnect isolation, **backend-disconnect cascade with 3 upstreams**, **watchdog → exception 0x0B**, reconnect after cascade). |
| Coalescing | **Strong** | `ReadCoalescingTests` (8) + `CoalescingKeyTests` (5) + `InFlightByKeyMapTests` (6) + `ReadCoalescingE2ETests` (5). Covers leader/waiter, FC03/FC04 not coalesced, FC06 never coalesced, `MaxParties` cap (= 2 forces overflow), dead-upstream `CoalescedResponseToDeadUpstream` counter, hot-reload `Enabled=false` path. |
| Cache | **Strong** | `ResponseCacheTests` (10), `CacheInvalidatorTests` (8), `ResponseCacheMultiplexerTests` (8), `CacheKeyTests` (3), `CacheEntryTests` (3), `ResponseCacheE2ETests` (5). Covers TTL expiry, LRU + access-order, multi-tag min-TTL/zero-TTL disable, range-overlap (full/low/high/adjacent/disjoint/wrong-unit/zero-qty), write invalidation, **cache→coalesce short-circuit ordering**, post-rewriter byte caching, concurrent get/set stress (8 tasks × 500 ops). |
| Supervisor | **Adequate** | `SupervisorTests` (6) covers bind, port-in-use → Recovering, recovery-on-port-free, Stop-during-Recovering, RecoveryAttempts accumulation. `BackendConnectRetryTests` (3) and `PolicyFactoryTests` (5) cover Polly. **Test 4 ("RuntimeFault") is a placeholder** (test name says runtime fault but doesn't actually induce one — it self-admits this). |
| Hot reload | **Adequate** | `HotReloadE2ETests` (4): add PLC, remove PLC, change global tag list (reseat without restart), invalid reload (port collision) → no mutation. `ConfigReconcilerTests` (4) adds concurrent-reload serialisation. `ReloadPlanTests` (8) and `ReloadValidatorTests` (8) cover plan computation and validation rules **except the cache-related ones**. |
| Admin endpoint | **Strong** | `AdminEndpointTests` (6 E2E): JSON shape, post-FC03 PDU count, partial-BCD warning surfaced, HTML+meta-refresh, **bind-failure isolation** (service stays up + logs `bind-failed`), **admin-port hot-reload rebind**. `StatusSnapshotBuilderTests` (6) + `StatusHtmlRendererTests` (3). |
| Shutdown | **Strong** | `ShutdownCoordinatorTests` (4): no-conn, with-conn drain, timeout-cancel-with-count, ordering (admin stops after listeners). `ShutdownE2ETests` (2): graceful drain ≤ 10s, in-flight cancel after timeout. |
## Specific gaps (design contracts without a test)
1. **`Cache.AllowLongTtl=false` rejects `CacheTtlMs > 60_000`** — enforced at `src/Mbproxy/Configuration/ReloadValidator.cs:90-128` and `src/Mbproxy/Options/MbproxyOptions.cs:58-111`, but **no test exists in `ReloadValidatorTests.cs`** (only 8 tests covering name dup, port dup, admin collision, BCD map error, port range, admin-port range, happy path, empty name). Both the per-tag rule and the global reload rule should each get a test.
2. **Reload-driven cache flush on tag-list change** — `docs/design.md` and `CLAUDE.md` state "a tag-list reload flushes the affected PLC's response cache". `HotReloadE2ETests.E2E_ChangeGlobalBcdTagList_RewriteReflectsImmediately` proves the reseat happens but does **not** assert that any pre-existing cache entries for that PLC are dropped (e.g., set entry, mutate tag list, observe `Clear` increment / next read goes to backend). And per the Supervisor review's Critical 1, this is exactly the behaviour that today does NOT actually fire in the running multiplexer — a test would catch the regression directly.
3. **`SupervisorTests.Supervisor_RuntimeFault_TriggersRecovery`** is named for the contract but explicitly admits "we verify recovery mechanics are wired … For a full runtime-fault scenario, see the E2E tests." `SupervisorE2ETests` only has 2 tests — needs a check that they actually fault the accept loop. No test currently forces an accept-loop exception and asserts the supervisor resumes.
4. **Backend FC03/FC04 max-qty pass-through (PLC enforces 128/123 caps with exception 03)** — `dl205.md` calls this out as a quirk. The proxy is supposed to be transparent; no test sends `qty > 128` and asserts the proxy doesn't truncate / forwards as-is and surfaces exception 03 to the client.
5. **TxId allocator saturation propagation** — the allocator returns `false` when full (`TxIdAllocatorTests.Allocate_WhenSaturated_ReturnsFalse_DoesNotThrow`), but there's no `PlcMultiplexerTests` case asserting what happens when 65,536 in-flight requests pile up — does the multiplexer surface exception 0x06 (Server Device Busy) or backpressure?
6. **`Cache.AllowLongTtl=true` happy path** — paired with gap (1); should also have a test that the long-TTL value passes validation when the gate is set.
7. **Reload of `ReadCoalescing.Enabled` at runtime** — there's `ReadCoalescingE2ETests.E2E_CoalescingDisabledViaConfig_EveryReadIsAMiss` but it appears to start with the value disabled, not flip it at runtime through the hot-reload path. Per the Supervisor review's M6, hot-reload of this knob doesn't propagate through the reconciler's add/restart paths anyway — a test would catch it.
8. **Coalescing factory leak under saturation with multiple attachers** — see Multiplexer review C1. No test exercises the case where the leader's `TryAdd` or `TryAllocate` fails while late attachers have already joined the entry.
9. **Backend-reader head-of-line block by a slow upstream** — see Multiplexer review C4. No test wedges one upstream's response channel and asserts other peers on the same PLC continue to receive responses.
## Quality observations
- **Wall-clock arithmetic in coalescing tests.** `ReadCoalescingTests` uses fixed sleeps to construct the in-flight window (e.g. `ResponseDelayMs = 300; await Task.Delay(80)`). The `FiveClients_SameRequest_OneBackendRoundTrip_FiveResponses` test is hedged (`backend.RequestCount.ShouldBeLessThanOrEqualTo(2, "...one for the leader, one for any racy first-arrival")` and `CoalescedHitCount.ShouldBeGreaterThanOrEqualTo(3)`) — sensible hedging that admits the timing race exists. This is a known smell for CI.
- **Cascade test polls for counter convergence** (`PlcMultiplexerTests.BackendDisconnect_CascadesToAllUpstreams` polls 50× at 20 ms with comment explaining the inherent scheduling gap). Good defensive engineering, but indicates the underlying `BackendDisconnectCascades` increment is not synchronously observable from the test seam.
- **Naming is strong throughout**: tests follow `Subject_Condition_Expectation` style (e.g. `TwoClients_SameRequest_OnlyOneBackendRoundTrip`, `Validate_AdminPortCollidesWith_PlcListenPort_Fails`). Comments above each test explain the contract being verified.
- **AAA structure is clear** — almost every test has `// ── Helpers ──` / `// ── Tests ──` regions and explicit "Arrange / Act / Assert" implicit through whitespace.
- **Concurrency tests are genuine, not pretend.** `TxIdAllocatorTests.Concurrent_AllocateRelease_NoDuplicateIds_Under_Parallel_Stress` (100 tasks × 1000 ops verifying no duplicate live IDs via `ConcurrentDictionary.TryAdd`) and `ResponseCacheTests.Concurrent_GetSet_NoDataRace` (8 × 500) actually race tasks. Multiplexer's "two upstream concurrent" tests use real sockets with simultaneous `SendAsync` calls.
- **Skip pattern is clean and uniform**: `if (_sim.SkipReason is not null) Assert.Skip(_sim.SkipReason);` at the top of every E2E test method that needs the simulator. Cold-pip provisioning gets a 120-second window — reasonable for cold CI.
- **One redundant test pair**: `Diagnostics/ShutdownCoordinatorTests` line 27, 43, 57 each define a `StopAsync` stub — these are nested helper classes inside the test class, fine but easy to miss when reading the test method list.
- **Some E2E timeouts are tight enough to flake** (e.g. `[Fact(Timeout = 5_000)]` for `HotReloadE2ETests.E2E_AddPlcAtRuntime_NewListenerBinds_AndIsReachable`). The author already bumped one to 10s with a comment about hot-reload propagation needing headroom — others may follow.
## What looks good
- Three-tier architecture: **pure unit** (`BcdCodecTests`, `MbapFrameTests`, `TxIdAllocatorTests`, `CacheKeyTests`) → **stub-socket integration** (`PlcMultiplexerTests`, `ReadCoalescingTests`, `ResponseCacheMultiplexerTests`) → **simulator E2E** (`*E2ETests`). Each layer is justified in its xmldoc — e.g. `ReadCoalescingTests` explicitly notes "the pymodbus simulator cannot be used here — its known concurrent-MBAP-frame bug would invalidate the proxy-TxId echo path".
- Counter-based assertions everywhere (`RewrittenSlots`, `PartialBcdWarnings`, `CoalescedHitCount`, `CoalescedResponseToDeadUpstream`, `BackendDisconnectCascades`, `ReloadAppliedCount`, `ReloadRejectedCount`) — gives strong observable signals beyond happy-path correctness, and validates the status-page wire format simultaneously.
- Stub backends (`StubBackend` in `PlcMultiplexerTests`, `DelayedStubBackend` in `ReadCoalescingTests`, drain-only in watchdog test) are well-factored and reusable.
- Negative cases are first-class: bad nibbles, OOR encode, wrong unit ID, adjacent-but-not-overlapping ranges, FC06 against partial 32-bit pair, dead-upstream fan-out, port-already-in-use, duplicate config — not just happy-path coverage.
- `HotReloadE2ETests` exercises the **real `FileSystemWatcher` + `reloadOnChange: true`** path with rename-replace (the pattern that produces the 2-3 spurious events the debounce has to handle) — the most realistic possible test for that subsystem.
**Key file paths referenced:**
- `src/Mbproxy/Configuration/ReloadValidator.cs` (lines 90, 100, 128 — long-TTL gate enforcement, untested)
- `src/Mbproxy/Options/MbproxyOptions.cs` (lines 32, 58111 — long-TTL gate option + per-tag enforcement)
- `tests/Mbproxy.Tests/Configuration/ReloadValidatorTests.cs` (8 tests, missing cache rules)
- `tests/Mbproxy.Tests/Proxy/Supervision/SupervisorTests.cs` (Test 4 = placeholder)
- `tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs` (skip mechanism)