# mbproxy Code Review — 2026-05-14 In-depth review of the `mbproxy` service through Phase 11 (opt-in response cache). Conducted via parallel area-focused review passes, then synthesised here. Each area has its own document under this folder; this Overview ranks findings across the whole codebase by impact so a maintainer can prioritise fixes without re-reading every file. **Scope:** every `.cs` file under `src/Mbproxy/`, the design contracts in `docs/design.md` + `docs/Architecture/*` + `docs/Features/*`, the option POCOs, the hosting/install glue, and the test suite under `tests/Mbproxy.Tests/`. **Out of scope:** the pymodbus simulator profile itself, anything under `DL260/`, anything outside `mbproxy/`. ## Reviewer Verdict The service is well-architected and the per-area code quality is high. Concurrency primitives are correct, log events are stable and well-named, and the test suite genuinely exercises races rather than pretending to. Most of the contracts in `docs/design.md` are upheld in the code. The fleet of critical-severity findings clusters in **two seams**: 1. **Hot-reload propagation into a running `PlcMultiplexer`.** The supervisor's `ReplaceContextAsync` swaps `_currentContext` on the supervisor but the running multiplexer captures `_ctx` at construction and never re-reads it — so tag-list reloads, cache-on-reload flushes, and the `ReadCoalescing.Enabled` toggle are visible only on the *next* listener fault, not at the next PDU. The design doc reads as if the next-PDU contract is upheld; the code only upholds it for one branch (the `BcdTagMap` read inside the rewriter's per-call clone). 2. **Backend reader/writer cascade.** When the backend socket dies, in-flight responses can sit synchronously on a wedged upstream pipe (one slow client stalls every peer), stranded outbound frames carry old TxIds against a fresh backend (orphaned responses, hung peers via watchdog), and the cascade itself races accepts that arrive between snapshot and `_pipes.Clear()`. Below, **Critical** findings are correctness defects, hangs, data corruption, or security gaps that block production confidence. **Major** findings are design-contract violations that an operator could hit in a year of running. **Minor** findings are style, dead code, and test-only gaps. ## Critical Findings (ranked by impact) | # | Area | Finding | File:line | |---|------|---------|-----------| | 1 | Supervisor / Hot-reload | `ReplaceContextAsync` swaps the supervisor's `_currentContext` but the running `PlcMultiplexer` captured `_ctx` at construction and never re-reads. The cache-flush-on-tag-list-reload contract (`design.md:155`) is not enforced for the in-flight mux; the old cache is even disposed while the running mux is still using it. | `Proxy/Supervision/PlcListenerSupervisor.cs:199-216` ↔ `Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109` | | 2 | Multiplexer | Coalescing factory failure (`TryAdd` collision, `TxIdAllocator.TryAllocate` returns false) stores a stub `InFlightRequest` under the coalescing key but only the leader gets the saturation exception. Late attachers join a `List` that no one will ever drain — pipes hang until the upstream socket closes. | `Proxy/Multiplexing/PlcMultiplexer.cs:692-733`, `Proxy/Multiplexing/InFlightByKeyMap.cs:82-83` | | 3 | Multiplexer | Backend reader sends to upstream synchronously inside the single reader task. If one upstream's bounded response channel is full, the reader blocks — stalling every other client on the same PLC. | `Proxy/Multiplexing/PlcMultiplexer.cs:545` | | 4 | Multiplexer | Backend writer doesn't complete/drain `_outboundChannel` on cascade. Stranded frames carry old proxy TxIds; the fresh backend socket then receives them and responds against TxIds whose correlation entries were already dropped by `TearDownBackendAsync`. Responses are silently dropped (`PlcMultiplexer.cs:423-428`); upstream peers hang waiting for the watchdog. | `Proxy/Multiplexing/PlcMultiplexer.cs:362-385`, `:423-428` | ## Major Findings (highest-impact, full list in per-area docs) | Area | Finding | File:line | |------|---------|-----------| | Supervisor / Hot-reload | "Skip cache invalidations when listener state is `recovering`" (`design.md:203`) is unimplemented. `PlcMultiplexer` never consults supervisor state. Implicit gating (no backend reader during recovery → no invalidation) holds today but is fragile and undocumented in code. | `Proxy/Multiplexing/PlcMultiplexer.cs:497-509` | | Supervisor / Hot-reload | `ConfigReconciler.Apply`'s add/restart code paths construct `PlcListenerSupervisor` **without** the `coalescingAccessor` that `ProxyWorker` builds at startup. Result: PLCs added or restarted via hot-reload always use the *default* `ReadCoalescingOptions`, and a hot-reload of `Mbproxy.Resilience.ReadCoalescing.Enabled` does NOT propagate to those supervisors. | `Configuration/ConfigReconciler.cs:294-304,378-388` ↔ `Proxy/ProxyWorker.cs:129-130` | | Supervisor / Hot-reload | `ConfigReconciler.ApplyUnderLockAsync` mutates the `_supervisors` `Dictionary` from concurrent `Task.Run` continuations. The outer apply is serialised by a semaphore, but the inner Add/Remove/Restart tasks run in parallel. `Dictionary<,>` is not thread-safe. | `Configuration/ConfigReconciler.cs:236-238,269-271,306,390` | | Multiplexer | `_disposed` is non-volatile and read on every hot path (`PlcMultiplexer.cs:86`, `UpstreamPipe.cs:52,77`). Under the .NET memory model, on ARM hosts a stale `false` read after disposal is permitted. (x64 happens to give acquire-release on plain reads.) | `Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566` | | Cache | `mbproxy.cache.flushed` event is defined in `CacheLogEvents.cs` and referenced in `Reference/LogEvents.md` but never emitted by any caller. The reseat path disposes the old cache rather than calling `Clear()`. | `Proxy/Cache/CacheLogEvents.cs:67-76`, `Proxy/Supervision/PlcListenerSupervisor.cs:199-216` | | Cache | `Cache.AllowLongTtl=false` rejects `CacheTtlMs > 60_000` per the design — and the validators do enforce it — but the suite has **zero tests** for this rule (`grep -rn AllowLongTtl tests/` returns nothing). Same for the happy path with `AllowLongTtl=true`. | `Configuration/ReloadValidator.cs:90-128`, `Options/MbproxyOptions.cs:58-111` | | Cache | `IncrementCacheMiss` always increments on cache-eligible reads *before* the fall-through to coalescing. Two peers on the same key produce two cache misses and one coalesced hit — accurate per the strict letter of the spec but means "miss" doesn't mean "backend round-trip". | `Proxy/Multiplexing/PlcMultiplexer.cs:626` | | Bcd Rewriter | The 32-bit BCD wire format is "two base-10000 digits in CDAB", **not** standard CDAB binary Int32. Internally consistent with `design.md:123`, but every standard Modbus client (NModbus, FluentModbus, Wonderware DAServer) treats CDAB as straight binary int32 and will silently corrupt any value > 9999. Worth flagging in operator-facing docs so integrators know what wire format the proxy expects. | `Proxy/BcdPduPipeline.cs:202-206,373-379`, `Bcd/BcdCodec.cs` | | Bcd Rewriter | `BcdTagMapBuilder` cannot detect duplicate addresses because the input collapses through a `Dictionary` first (`Add` last-write-wins via key). The `DuplicateAddress` validation error is unreachable; the design's "reject duplicate addresses within a single PLC's resolved list" is silently violated. | `Bcd/BcdTagMapBuilder.cs:58-103` | | Admin / Diagnostics | `ShutdownCoordinator` runs via `ApplicationStopping` callback then `IHostedService.StopAsync` fires the same `ProxyWorker`/`AdminEndpointHost` again — double-stop. Drain happens but the contract is unclear. The existing `ShutdownE2ETests` don't assert `mbproxy.shutdown.complete` fired. | `Program.cs:60-66`, `Proxy/ProxyWorker.cs:196-217`, `Diagnostics/ShutdownCoordinator.cs:147` | | Hosting | `ShutdownCoordinator` consumes `IOptions` (singleton snapshot) instead of `IOptionsMonitor`. A hot-reloaded `GracefulShutdownTimeoutMs` is silently ignored at shutdown. | `Diagnostics/ShutdownCoordinator.cs:93,102,118`, `Program.cs:42` | | Hosting | Shipped `src/Mbproxy/appsettings.json` is a near-empty stub (no PLCs, no BCD tags, no `Cache` section). The working template lives only in `install/mbproxy.config.template.json`. A fresh `Mbproxy.exe` run from the publish folder is a no-op service. | `src/Mbproxy/appsettings.json` ↔ `install/mbproxy.config.template.json` | ## Top fix recommendations (in order) 1. **Plumb context updates into the running `PlcMultiplexer`.** Either thread a `volatile PerPlcContext` reference on the mux and have `ReplaceContextAsync` swap it (so the next per-PDU `_ctx.TagMap` / `_ctx.Cache` deref sees the new map / new cache), or restart the mux when the context changes. Without this, hot-reload of tag lists, cache-flush-on-reload, and the `ReadCoalescing.Enabled` toggle all silently fail to propagate. (Resolves: Critical 1, Major "coalescingAccessor".) 2. **Fix the coalescing factory leak.** In the `_inFlightByKey.TryAttachOrCreate` factory, if `TryAdd`/`TryAllocate` fails, do not return a stub stored under the key — either fail the entire operation (so nobody else attaches), or have the post-factory cleanup walk every party on the stored stub before removing. (Resolves: Critical 2.) 3. **Decouple the backend reader from per-upstream backpressure.** Either drop-on-full for the response channel, write with a per-frame timeout, or fan out responses on a per-party `Task.Run`. (Resolves: Critical 3.) 4. **Drain or replace `_outboundChannel` on cascade.** Complete it on cascade and recreate it on next reconnect, so stranded frames never reach a fresh backend with a dead TxId space. (Resolves: Critical 4.) 5. **Make `_supervisors` thread-safe.** Either `ConcurrentDictionary<,>` or move dict mutations to a single serialisation point after `Task.WhenAll(...)` settles. (Resolves: Major "concurrent dict".) 6. **Add the `ReadCoalescing` accessor to the reconciler's add/restart paths.** Single-line oversight from when Phase 10 landed after Phase 6. 7. **Make `ShutdownCoordinator` read `IOptionsMonitor`.** 8. **Ship a usable `appsettings.json`** (or stop shipping the stub at publish time). 9. **Flag the 32-bit BCD wire format prominently in operator-facing docs.** The "two base-10000 digits in CDAB" contract lives in `design.md`/`BcdRewriting.md` but not in the README; integrators using standard CDAB-int32 clients will silently corrupt values > 9999 without knowing it. 10. **Add tests for `Cache.AllowLongTtl`.** Both rejection (>60_000 without gate) and acceptance (>60_000 with gate). ## What looks especially good - **Three-tier test architecture** (pure unit → stub-socket integration → simulator E2E) with explicit per-tier rationale in xmldoc. The stub-backend tests catch what the pymodbus simulator can't (its concurrent-MBAP-frame bug is documented and worked around). - **Genuine concurrency tests, not pretend ones.** `TxIdAllocatorTests.Concurrent_AllocateRelease_...` runs 100 tasks × 1000 ops asserting no duplicate live IDs; `ResponseCacheTests.Concurrent_GetSet` races 8 tasks × 500 ops; `PlcMultiplexerTests` `BackendDisconnect_CascadesToAllUpstreams` uses real sockets. - **Counter-based assertions** across the suite — testing observable telemetry as well as behaviour, doubling as wire-format validation. - **Polly pipelines** correctly match the design's documented attempt counts and backoffs in both `BackendConnect` and `ListenerRecovery`. - **Bind-failure isolation on the admin endpoint** — Kestrel failing to bind logs `mbproxy.admin.bind.failed` and the proxy listeners continue. E2E-tested. - **Single code path for "bring up a listener"** (both startup and hot-reload add go through `PlcListenerSupervisor.StartAsync`). Matches the design. - **`FrozenDictionary` for the tag map**, allocation-free codec hot path, and `s_emptyHits` sentinel for the no-hit case keep the rewriter cheap under sustained load. - **Stable `EventName`s** on every `[LoggerMessage]` survive a C# rename refactor. ## Per-area documents | Area | File | |------|------| | BCD codec + PDU pipeline | [`BcdRewriter.md`](./BcdRewriter.md) | | Multiplexer + concurrency | [`Multiplexer.md`](./Multiplexer.md) | | Response cache | [`ResponseCache.md`](./ResponseCache.md) | | Supervisor + hot reload | [`SupervisorAndHotReload.md`](./SupervisorAndHotReload.md) | | Admin + diagnostics | [`AdminAndDiagnostics.md`](./AdminAndDiagnostics.md) | | Hosting + options | [`HostingAndOptions.md`](./HostingAndOptions.md) | | Test suite | [`TestSuite.md`](./TestSuite.md) | ## Methodology Each per-area document is the verbatim (lightly cleaned) output of an independent area-scoped review pass conducted on 2026-05-14 against branch `master` at commit `53a7111`. Reviewers were given the design docs as ground truth, the in-scope files, and the corresponding tests, and asked for `file:line`-cited findings ranked by severity. The Overview above re-ranks the union of those findings across areas. Findings reference paths relative to `mbproxy/` unless otherwise specified.