Files
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

13 KiB
Raw Permalink Blame History

mbproxy Code Review — 2026-05-14

In-depth review of the mbproxy service through Phase 11 (opt-in response cache). Conducted via parallel area-focused review passes, then synthesised here. Each area has its own document under this folder; this Overview ranks findings across the whole codebase by impact so a maintainer can prioritise fixes without re-reading every file.

Scope: every .cs file under src/Mbproxy/, the design contracts in docs/design.md + docs/Architecture/* + docs/Features/*, the option POCOs, the hosting/install glue, and the test suite under tests/Mbproxy.Tests/.

Out of scope: the pymodbus simulator profile itself, anything under DL260/, anything outside mbproxy/.

Reviewer Verdict

The service is well-architected and the per-area code quality is high. Concurrency primitives are correct, log events are stable and well-named, and the test suite genuinely exercises races rather than pretending to. Most of the contracts in docs/design.md are upheld in the code.

The fleet of critical-severity findings clusters in two seams:

  1. Hot-reload propagation into a running PlcMultiplexer. The supervisor's ReplaceContextAsync swaps _currentContext on the supervisor but the running multiplexer captures _ctx at construction and never re-reads it — so tag-list reloads, cache-on-reload flushes, and the ReadCoalescing.Enabled toggle are visible only on the next listener fault, not at the next PDU. The design doc reads as if the next-PDU contract is upheld; the code only upholds it for one branch (the BcdTagMap read inside the rewriter's per-call clone).
  2. Backend reader/writer cascade. When the backend socket dies, in-flight responses can sit synchronously on a wedged upstream pipe (one slow client stalls every peer), stranded outbound frames carry old TxIds against a fresh backend (orphaned responses, hung peers via watchdog), and the cascade itself races accepts that arrive between snapshot and _pipes.Clear().

Below, Critical findings are correctness defects, hangs, data corruption, or security gaps that block production confidence. Major findings are design-contract violations that an operator could hit in a year of running. Minor findings are style, dead code, and test-only gaps.

Critical Findings (ranked by impact)

# Area Finding File:line
1 Supervisor / Hot-reload ReplaceContextAsync swaps the supervisor's _currentContext but the running PlcMultiplexer captured _ctx at construction and never re-reads. The cache-flush-on-tag-list-reload contract (design.md:155) is not enforced for the in-flight mux; the old cache is even disposed while the running mux is still using it. Proxy/Supervision/PlcListenerSupervisor.cs:199-216Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109
2 Multiplexer Coalescing factory failure (TryAdd collision, TxIdAllocator.TryAllocate returns false) stores a stub InFlightRequest under the coalescing key but only the leader gets the saturation exception. Late attachers join a List<InterestedParty> that no one will ever drain — pipes hang until the upstream socket closes. Proxy/Multiplexing/PlcMultiplexer.cs:692-733, Proxy/Multiplexing/InFlightByKeyMap.cs:82-83
3 Multiplexer Backend reader sends to upstream synchronously inside the single reader task. If one upstream's bounded response channel is full, the reader blocks — stalling every other client on the same PLC. Proxy/Multiplexing/PlcMultiplexer.cs:545
4 Multiplexer Backend writer doesn't complete/drain _outboundChannel on cascade. Stranded frames carry old proxy TxIds; the fresh backend socket then receives them and responds against TxIds whose correlation entries were already dropped by TearDownBackendAsync. Responses are silently dropped (PlcMultiplexer.cs:423-428); upstream peers hang waiting for the watchdog. Proxy/Multiplexing/PlcMultiplexer.cs:362-385, :423-428

Major Findings (highest-impact, full list in per-area docs)

Area Finding File:line
Supervisor / Hot-reload "Skip cache invalidations when listener state is recovering" (design.md:203) is unimplemented. PlcMultiplexer never consults supervisor state. Implicit gating (no backend reader during recovery → no invalidation) holds today but is fragile and undocumented in code. Proxy/Multiplexing/PlcMultiplexer.cs:497-509
Supervisor / Hot-reload ConfigReconciler.Apply's add/restart code paths construct PlcListenerSupervisor without the coalescingAccessor that ProxyWorker builds at startup. Result: PLCs added or restarted via hot-reload always use the default ReadCoalescingOptions, and a hot-reload of Mbproxy.Resilience.ReadCoalescing.Enabled does NOT propagate to those supervisors. Configuration/ConfigReconciler.cs:294-304,378-388Proxy/ProxyWorker.cs:129-130
Supervisor / Hot-reload ConfigReconciler.ApplyUnderLockAsync mutates the _supervisors Dictionary from concurrent Task.Run continuations. The outer apply is serialised by a semaphore, but the inner Add/Remove/Restart tasks run in parallel. Dictionary<,> is not thread-safe. Configuration/ConfigReconciler.cs:236-238,269-271,306,390
Multiplexer _disposed is non-volatile and read on every hot path (PlcMultiplexer.cs:86, UpstreamPipe.cs:52,77). Under the .NET memory model, on ARM hosts a stale false read after disposal is permitted. (x64 happens to give acquire-release on plain reads.) Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566
Cache mbproxy.cache.flushed event is defined in CacheLogEvents.cs and referenced in Reference/LogEvents.md but never emitted by any caller. The reseat path disposes the old cache rather than calling Clear(). Proxy/Cache/CacheLogEvents.cs:67-76, Proxy/Supervision/PlcListenerSupervisor.cs:199-216
Cache Cache.AllowLongTtl=false rejects CacheTtlMs > 60_000 per the design — and the validators do enforce it — but the suite has zero tests for this rule (grep -rn AllowLongTtl tests/ returns nothing). Same for the happy path with AllowLongTtl=true. Configuration/ReloadValidator.cs:90-128, Options/MbproxyOptions.cs:58-111
Cache IncrementCacheMiss always increments on cache-eligible reads before the fall-through to coalescing. Two peers on the same key produce two cache misses and one coalesced hit — accurate per the strict letter of the spec but means "miss" doesn't mean "backend round-trip". Proxy/Multiplexing/PlcMultiplexer.cs:626
Bcd Rewriter The 32-bit BCD wire format is "two base-10000 digits in CDAB", not standard CDAB binary Int32. Internally consistent with design.md:123, but every standard Modbus client (NModbus, FluentModbus, Wonderware DAServer) treats CDAB as straight binary int32 and will silently corrupt any value > 9999. Worth flagging in operator-facing docs so integrators know what wire format the proxy expects. Proxy/BcdPduPipeline.cs:202-206,373-379, Bcd/BcdCodec.cs
Bcd Rewriter BcdTagMapBuilder cannot detect duplicate addresses because the input collapses through a Dictionary first (Add last-write-wins via key). The DuplicateAddress validation error is unreachable; the design's "reject duplicate addresses within a single PLC's resolved list" is silently violated. Bcd/BcdTagMapBuilder.cs:58-103
Admin / Diagnostics ShutdownCoordinator runs via ApplicationStopping callback then IHostedService.StopAsync fires the same ProxyWorker/AdminEndpointHost again — double-stop. Drain happens but the contract is unclear. The existing ShutdownE2ETests don't assert mbproxy.shutdown.complete fired. Program.cs:60-66, Proxy/ProxyWorker.cs:196-217, Diagnostics/ShutdownCoordinator.cs:147
Hosting ShutdownCoordinator consumes IOptions<MbproxyOptions> (singleton snapshot) instead of IOptionsMonitor<MbproxyOptions>. A hot-reloaded GracefulShutdownTimeoutMs is silently ignored at shutdown. Diagnostics/ShutdownCoordinator.cs:93,102,118, Program.cs:42
Hosting Shipped src/Mbproxy/appsettings.json is a near-empty stub (no PLCs, no BCD tags, no Cache section). The working template lives only in install/mbproxy.config.template.json. A fresh Mbproxy.exe run from the publish folder is a no-op service. src/Mbproxy/appsettings.jsoninstall/mbproxy.config.template.json

Top fix recommendations (in order)

  1. Plumb context updates into the running PlcMultiplexer. Either thread a volatile PerPlcContext reference on the mux and have ReplaceContextAsync swap it (so the next per-PDU _ctx.TagMap / _ctx.Cache deref sees the new map / new cache), or restart the mux when the context changes. Without this, hot-reload of tag lists, cache-flush-on-reload, and the ReadCoalescing.Enabled toggle all silently fail to propagate. (Resolves: Critical 1, Major "coalescingAccessor".)
  2. Fix the coalescing factory leak. In the _inFlightByKey.TryAttachOrCreate factory, if TryAdd/TryAllocate fails, do not return a stub stored under the key — either fail the entire operation (so nobody else attaches), or have the post-factory cleanup walk every party on the stored stub before removing. (Resolves: Critical 2.)
  3. Decouple the backend reader from per-upstream backpressure. Either drop-on-full for the response channel, write with a per-frame timeout, or fan out responses on a per-party Task.Run. (Resolves: Critical 3.)
  4. Drain or replace _outboundChannel on cascade. Complete it on cascade and recreate it on next reconnect, so stranded frames never reach a fresh backend with a dead TxId space. (Resolves: Critical 4.)
  5. Make _supervisors thread-safe. Either ConcurrentDictionary<,> or move dict mutations to a single serialisation point after Task.WhenAll(...) settles. (Resolves: Major "concurrent dict".)
  6. Add the ReadCoalescing accessor to the reconciler's add/restart paths. Single-line oversight from when Phase 10 landed after Phase 6.
  7. Make ShutdownCoordinator read IOptionsMonitor.
  8. Ship a usable appsettings.json (or stop shipping the stub at publish time).
  9. Flag the 32-bit BCD wire format prominently in operator-facing docs. The "two base-10000 digits in CDAB" contract lives in design.md/BcdRewriting.md but not in the README; integrators using standard CDAB-int32 clients will silently corrupt values > 9999 without knowing it.
  10. Add tests for Cache.AllowLongTtl. Both rejection (>60_000 without gate) and acceptance (>60_000 with gate).

What looks especially good

  • Three-tier test architecture (pure unit → stub-socket integration → simulator E2E) with explicit per-tier rationale in xmldoc. The stub-backend tests catch what the pymodbus simulator can't (its concurrent-MBAP-frame bug is documented and worked around).
  • Genuine concurrency tests, not pretend ones. TxIdAllocatorTests.Concurrent_AllocateRelease_... runs 100 tasks × 1000 ops asserting no duplicate live IDs; ResponseCacheTests.Concurrent_GetSet races 8 tasks × 500 ops; PlcMultiplexerTests BackendDisconnect_CascadesToAllUpstreams uses real sockets.
  • Counter-based assertions across the suite — testing observable telemetry as well as behaviour, doubling as wire-format validation.
  • Polly pipelines correctly match the design's documented attempt counts and backoffs in both BackendConnect and ListenerRecovery.
  • Bind-failure isolation on the admin endpoint — Kestrel failing to bind logs mbproxy.admin.bind.failed and the proxy listeners continue. E2E-tested.
  • Single code path for "bring up a listener" (both startup and hot-reload add go through PlcListenerSupervisor.StartAsync). Matches the design.
  • FrozenDictionary for the tag map, allocation-free codec hot path, and s_emptyHits sentinel for the no-hit case keep the rewriter cheap under sustained load.
  • Stable EventNames on every [LoggerMessage] survive a C# rename refactor.

Per-area documents

Area File
BCD codec + PDU pipeline BcdRewriter.md
Multiplexer + concurrency Multiplexer.md
Response cache ResponseCache.md
Supervisor + hot reload SupervisorAndHotReload.md
Admin + diagnostics AdminAndDiagnostics.md
Hosting + options HostingAndOptions.md
Test suite TestSuite.md

Methodology

Each per-area document is the verbatim (lightly cleaned) output of an independent area-scoped review pass conducted on 2026-05-14 against branch master at commit 53a7111. Reviewers were given the design docs as ground truth, the in-scope files, and the corresponding tests, and asked for file:line-cited findings ranked by severity. The Overview above re-ranks the union of those findings across areas.

Findings reference paths relative to mbproxy/ unless otherwise specified.