Eight area-focused reviews (BCD rewriter, multiplexer, response cache, supervisor + hot-reload, admin + diagnostics, hosting + options, test suite) plus an Overview that prioritises findings across areas, and a RemediationPlan that groups the work into three waves with per-item file:line citations and regression-test sketches. Findings call out: hot-reload tag-list/cache changes that don't reach the running multiplexer, a coalescing factory leak that hangs late attachers, backend-reader head-of-line block on a wedged upstream, stranded outbound frames after cascade, and ShutdownCoordinator double-stop ordering. Plus the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB, not standard binary), unreachable BcdValidationError.DuplicateAddress, mbproxy.cache.flushed event that's defined but never emitted, and missing test coverage for Cache.AllowLongTtl. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
13 KiB
mbproxy Code Review — 2026-05-14
In-depth review of the mbproxy service through Phase 11 (opt-in response cache). Conducted via parallel area-focused review passes, then synthesised here. Each area has its own document under this folder; this Overview ranks findings across the whole codebase by impact so a maintainer can prioritise fixes without re-reading every file.
Scope: every .cs file under src/Mbproxy/, the design contracts in docs/design.md + docs/Architecture/* + docs/Features/*, the option POCOs, the hosting/install glue, and the test suite under tests/Mbproxy.Tests/.
Out of scope: the pymodbus simulator profile itself, anything under DL260/, anything outside mbproxy/.
Reviewer Verdict
The service is well-architected and the per-area code quality is high. Concurrency primitives are correct, log events are stable and well-named, and the test suite genuinely exercises races rather than pretending to. Most of the contracts in docs/design.md are upheld in the code.
The fleet of critical-severity findings clusters in two seams:
- Hot-reload propagation into a running
PlcMultiplexer. The supervisor'sReplaceContextAsyncswaps_currentContexton the supervisor but the running multiplexer captures_ctxat construction and never re-reads it — so tag-list reloads, cache-on-reload flushes, and theReadCoalescing.Enabledtoggle are visible only on the next listener fault, not at the next PDU. The design doc reads as if the next-PDU contract is upheld; the code only upholds it for one branch (theBcdTagMapread inside the rewriter's per-call clone). - Backend reader/writer cascade. When the backend socket dies, in-flight responses can sit synchronously on a wedged upstream pipe (one slow client stalls every peer), stranded outbound frames carry old TxIds against a fresh backend (orphaned responses, hung peers via watchdog), and the cascade itself races accepts that arrive between snapshot and
_pipes.Clear().
Below, Critical findings are correctness defects, hangs, data corruption, or security gaps that block production confidence. Major findings are design-contract violations that an operator could hit in a year of running. Minor findings are style, dead code, and test-only gaps.
Critical Findings (ranked by impact)
| # | Area | Finding | File:line |
|---|---|---|---|
| 1 | Supervisor / Hot-reload | ReplaceContextAsync swaps the supervisor's _currentContext but the running PlcMultiplexer captured _ctx at construction and never re-reads. The cache-flush-on-tag-list-reload contract (design.md:155) is not enforced for the in-flight mux; the old cache is even disposed while the running mux is still using it. |
Proxy/Supervision/PlcListenerSupervisor.cs:199-216 ↔ Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109 |
| 2 | Multiplexer | Coalescing factory failure (TryAdd collision, TxIdAllocator.TryAllocate returns false) stores a stub InFlightRequest under the coalescing key but only the leader gets the saturation exception. Late attachers join a List<InterestedParty> that no one will ever drain — pipes hang until the upstream socket closes. |
Proxy/Multiplexing/PlcMultiplexer.cs:692-733, Proxy/Multiplexing/InFlightByKeyMap.cs:82-83 |
| 3 | Multiplexer | Backend reader sends to upstream synchronously inside the single reader task. If one upstream's bounded response channel is full, the reader blocks — stalling every other client on the same PLC. | Proxy/Multiplexing/PlcMultiplexer.cs:545 |
| 4 | Multiplexer | Backend writer doesn't complete/drain _outboundChannel on cascade. Stranded frames carry old proxy TxIds; the fresh backend socket then receives them and responds against TxIds whose correlation entries were already dropped by TearDownBackendAsync. Responses are silently dropped (PlcMultiplexer.cs:423-428); upstream peers hang waiting for the watchdog. |
Proxy/Multiplexing/PlcMultiplexer.cs:362-385, :423-428 |
Major Findings (highest-impact, full list in per-area docs)
| Area | Finding | File:line |
|---|---|---|
| Supervisor / Hot-reload | "Skip cache invalidations when listener state is recovering" (design.md:203) is unimplemented. PlcMultiplexer never consults supervisor state. Implicit gating (no backend reader during recovery → no invalidation) holds today but is fragile and undocumented in code. |
Proxy/Multiplexing/PlcMultiplexer.cs:497-509 |
| Supervisor / Hot-reload | ConfigReconciler.Apply's add/restart code paths construct PlcListenerSupervisor without the coalescingAccessor that ProxyWorker builds at startup. Result: PLCs added or restarted via hot-reload always use the default ReadCoalescingOptions, and a hot-reload of Mbproxy.Resilience.ReadCoalescing.Enabled does NOT propagate to those supervisors. |
Configuration/ConfigReconciler.cs:294-304,378-388 ↔ Proxy/ProxyWorker.cs:129-130 |
| Supervisor / Hot-reload | ConfigReconciler.ApplyUnderLockAsync mutates the _supervisors Dictionary from concurrent Task.Run continuations. The outer apply is serialised by a semaphore, but the inner Add/Remove/Restart tasks run in parallel. Dictionary<,> is not thread-safe. |
Configuration/ConfigReconciler.cs:236-238,269-271,306,390 |
| Multiplexer | _disposed is non-volatile and read on every hot path (PlcMultiplexer.cs:86, UpstreamPipe.cs:52,77). Under the .NET memory model, on ARM hosts a stale false read after disposal is permitted. (x64 happens to give acquire-release on plain reads.) |
Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566 |
| Cache | mbproxy.cache.flushed event is defined in CacheLogEvents.cs and referenced in Reference/LogEvents.md but never emitted by any caller. The reseat path disposes the old cache rather than calling Clear(). |
Proxy/Cache/CacheLogEvents.cs:67-76, Proxy/Supervision/PlcListenerSupervisor.cs:199-216 |
| Cache | Cache.AllowLongTtl=false rejects CacheTtlMs > 60_000 per the design — and the validators do enforce it — but the suite has zero tests for this rule (grep -rn AllowLongTtl tests/ returns nothing). Same for the happy path with AllowLongTtl=true. |
Configuration/ReloadValidator.cs:90-128, Options/MbproxyOptions.cs:58-111 |
| Cache | IncrementCacheMiss always increments on cache-eligible reads before the fall-through to coalescing. Two peers on the same key produce two cache misses and one coalesced hit — accurate per the strict letter of the spec but means "miss" doesn't mean "backend round-trip". |
Proxy/Multiplexing/PlcMultiplexer.cs:626 |
| Bcd Rewriter | The 32-bit BCD wire format is "two base-10000 digits in CDAB", not standard CDAB binary Int32. Internally consistent with design.md:123, but every standard Modbus client (NModbus, FluentModbus, Wonderware DAServer) treats CDAB as straight binary int32 and will silently corrupt any value > 9999. Worth flagging in operator-facing docs so integrators know what wire format the proxy expects. |
Proxy/BcdPduPipeline.cs:202-206,373-379, Bcd/BcdCodec.cs |
| Bcd Rewriter | BcdTagMapBuilder cannot detect duplicate addresses because the input collapses through a Dictionary first (Add last-write-wins via key). The DuplicateAddress validation error is unreachable; the design's "reject duplicate addresses within a single PLC's resolved list" is silently violated. |
Bcd/BcdTagMapBuilder.cs:58-103 |
| Admin / Diagnostics | ShutdownCoordinator runs via ApplicationStopping callback then IHostedService.StopAsync fires the same ProxyWorker/AdminEndpointHost again — double-stop. Drain happens but the contract is unclear. The existing ShutdownE2ETests don't assert mbproxy.shutdown.complete fired. |
Program.cs:60-66, Proxy/ProxyWorker.cs:196-217, Diagnostics/ShutdownCoordinator.cs:147 |
| Hosting | ShutdownCoordinator consumes IOptions<MbproxyOptions> (singleton snapshot) instead of IOptionsMonitor<MbproxyOptions>. A hot-reloaded GracefulShutdownTimeoutMs is silently ignored at shutdown. |
Diagnostics/ShutdownCoordinator.cs:93,102,118, Program.cs:42 |
| Hosting | Shipped src/Mbproxy/appsettings.json is a near-empty stub (no PLCs, no BCD tags, no Cache section). The working template lives only in install/mbproxy.config.template.json. A fresh Mbproxy.exe run from the publish folder is a no-op service. |
src/Mbproxy/appsettings.json ↔ install/mbproxy.config.template.json |
Top fix recommendations (in order)
- Plumb context updates into the running
PlcMultiplexer. Either thread avolatile PerPlcContextreference on the mux and haveReplaceContextAsyncswap it (so the next per-PDU_ctx.TagMap/_ctx.Cachederef sees the new map / new cache), or restart the mux when the context changes. Without this, hot-reload of tag lists, cache-flush-on-reload, and theReadCoalescing.Enabledtoggle all silently fail to propagate. (Resolves: Critical 1, Major "coalescingAccessor".) - Fix the coalescing factory leak. In the
_inFlightByKey.TryAttachOrCreatefactory, ifTryAdd/TryAllocatefails, do not return a stub stored under the key — either fail the entire operation (so nobody else attaches), or have the post-factory cleanup walk every party on the stored stub before removing. (Resolves: Critical 2.) - Decouple the backend reader from per-upstream backpressure. Either drop-on-full for the response channel, write with a per-frame timeout, or fan out responses on a per-party
Task.Run. (Resolves: Critical 3.) - Drain or replace
_outboundChannelon cascade. Complete it on cascade and recreate it on next reconnect, so stranded frames never reach a fresh backend with a dead TxId space. (Resolves: Critical 4.) - Make
_supervisorsthread-safe. EitherConcurrentDictionary<,>or move dict mutations to a single serialisation point afterTask.WhenAll(...)settles. (Resolves: Major "concurrent dict".) - Add the
ReadCoalescingaccessor to the reconciler's add/restart paths. Single-line oversight from when Phase 10 landed after Phase 6. - Make
ShutdownCoordinatorreadIOptionsMonitor. - Ship a usable
appsettings.json(or stop shipping the stub at publish time). - Flag the 32-bit BCD wire format prominently in operator-facing docs. The "two base-10000 digits in CDAB" contract lives in
design.md/BcdRewriting.mdbut not in the README; integrators using standard CDAB-int32 clients will silently corrupt values > 9999 without knowing it. - Add tests for
Cache.AllowLongTtl. Both rejection (>60_000 without gate) and acceptance (>60_000 with gate).
What looks especially good
- Three-tier test architecture (pure unit → stub-socket integration → simulator E2E) with explicit per-tier rationale in xmldoc. The stub-backend tests catch what the pymodbus simulator can't (its concurrent-MBAP-frame bug is documented and worked around).
- Genuine concurrency tests, not pretend ones.
TxIdAllocatorTests.Concurrent_AllocateRelease_...runs 100 tasks × 1000 ops asserting no duplicate live IDs;ResponseCacheTests.Concurrent_GetSetraces 8 tasks × 500 ops;PlcMultiplexerTestsBackendDisconnect_CascadesToAllUpstreamsuses real sockets. - Counter-based assertions across the suite — testing observable telemetry as well as behaviour, doubling as wire-format validation.
- Polly pipelines correctly match the design's documented attempt counts and backoffs in both
BackendConnectandListenerRecovery. - Bind-failure isolation on the admin endpoint — Kestrel failing to bind logs
mbproxy.admin.bind.failedand the proxy listeners continue. E2E-tested. - Single code path for "bring up a listener" (both startup and hot-reload add go through
PlcListenerSupervisor.StartAsync). Matches the design. FrozenDictionaryfor the tag map, allocation-free codec hot path, ands_emptyHitssentinel for the no-hit case keep the rewriter cheap under sustained load.- Stable
EventNames on every[LoggerMessage]survive a C# rename refactor.
Per-area documents
| Area | File |
|---|---|
| BCD codec + PDU pipeline | BcdRewriter.md |
| Multiplexer + concurrency | Multiplexer.md |
| Response cache | ResponseCache.md |
| Supervisor + hot reload | SupervisorAndHotReload.md |
| Admin + diagnostics | AdminAndDiagnostics.md |
| Hosting + options | HostingAndOptions.md |
| Test suite | TestSuite.md |
Methodology
Each per-area document is the verbatim (lightly cleaned) output of an independent area-scoped review pass conducted on 2026-05-14 against branch master at commit 53a7111. Reviewers were given the design docs as ground truth, the in-scope files, and the corresponding tests, and asked for file:line-cited findings ranked by severity. The Overview above re-ranks the union of those findings across areas.
Findings reference paths relative to mbproxy/ unless otherwise specified.