f2c6669444
Eight area-focused reviews (BCD rewriter, multiplexer, response cache, supervisor + hot-reload, admin + diagnostics, hosting + options, test suite) plus an Overview that prioritises findings across areas, and a RemediationPlan that groups the work into three waves with per-item file:line citations and regression-test sketches. Findings call out: hot-reload tag-list/cache changes that don't reach the running multiplexer, a coalescing factory leak that hangs late attachers, backend-reader head-of-line block on a wedged upstream, stranded outbound frames after cascade, and ShutdownCoordinator double-stop ordering. Plus the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB, not standard binary), unreachable BcdValidationError.DuplicateAddress, mbproxy.cache.flushed event that's defined but never emitted, and missing test coverage for Cache.AllowLongTtl. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
87 lines
13 KiB
Markdown
87 lines
13 KiB
Markdown
# mbproxy Code Review — 2026-05-14
|
||
|
||
In-depth review of the `mbproxy` service through Phase 11 (opt-in response cache). Conducted via parallel area-focused review passes, then synthesised here. Each area has its own document under this folder; this Overview ranks findings across the whole codebase by impact so a maintainer can prioritise fixes without re-reading every file.
|
||
|
||
**Scope:** every `.cs` file under `src/Mbproxy/`, the design contracts in `docs/design.md` + `docs/Architecture/*` + `docs/Features/*`, the option POCOs, the hosting/install glue, and the test suite under `tests/Mbproxy.Tests/`.
|
||
|
||
**Out of scope:** the pymodbus simulator profile itself, anything under `DL260/`, anything outside `mbproxy/`.
|
||
|
||
## Reviewer Verdict
|
||
|
||
The service is well-architected and the per-area code quality is high. Concurrency primitives are correct, log events are stable and well-named, and the test suite genuinely exercises races rather than pretending to. Most of the contracts in `docs/design.md` are upheld in the code.
|
||
|
||
The fleet of critical-severity findings clusters in **two seams**:
|
||
|
||
1. **Hot-reload propagation into a running `PlcMultiplexer`.** The supervisor's `ReplaceContextAsync` swaps `_currentContext` on the supervisor but the running multiplexer captures `_ctx` at construction and never re-reads it — so tag-list reloads, cache-on-reload flushes, and the `ReadCoalescing.Enabled` toggle are visible only on the *next* listener fault, not at the next PDU. The design doc reads as if the next-PDU contract is upheld; the code only upholds it for one branch (the `BcdTagMap` read inside the rewriter's per-call clone).
|
||
2. **Backend reader/writer cascade.** When the backend socket dies, in-flight responses can sit synchronously on a wedged upstream pipe (one slow client stalls every peer), stranded outbound frames carry old TxIds against a fresh backend (orphaned responses, hung peers via watchdog), and the cascade itself races accepts that arrive between snapshot and `_pipes.Clear()`.
|
||
|
||
Below, **Critical** findings are correctness defects, hangs, data corruption, or security gaps that block production confidence. **Major** findings are design-contract violations that an operator could hit in a year of running. **Minor** findings are style, dead code, and test-only gaps.
|
||
|
||
## Critical Findings (ranked by impact)
|
||
|
||
| # | Area | Finding | File:line |
|
||
|---|------|---------|-----------|
|
||
| 1 | Supervisor / Hot-reload | `ReplaceContextAsync` swaps the supervisor's `_currentContext` but the running `PlcMultiplexer` captured `_ctx` at construction and never re-reads. The cache-flush-on-tag-list-reload contract (`design.md:155`) is not enforced for the in-flight mux; the old cache is even disposed while the running mux is still using it. | `Proxy/Supervision/PlcListenerSupervisor.cs:199-216` ↔ `Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109` |
|
||
| 2 | Multiplexer | Coalescing factory failure (`TryAdd` collision, `TxIdAllocator.TryAllocate` returns false) stores a stub `InFlightRequest` under the coalescing key but only the leader gets the saturation exception. Late attachers join a `List<InterestedParty>` that no one will ever drain — pipes hang until the upstream socket closes. | `Proxy/Multiplexing/PlcMultiplexer.cs:692-733`, `Proxy/Multiplexing/InFlightByKeyMap.cs:82-83` |
|
||
| 3 | Multiplexer | Backend reader sends to upstream synchronously inside the single reader task. If one upstream's bounded response channel is full, the reader blocks — stalling every other client on the same PLC. | `Proxy/Multiplexing/PlcMultiplexer.cs:545` |
|
||
| 4 | Multiplexer | Backend writer doesn't complete/drain `_outboundChannel` on cascade. Stranded frames carry old proxy TxIds; the fresh backend socket then receives them and responds against TxIds whose correlation entries were already dropped by `TearDownBackendAsync`. Responses are silently dropped (`PlcMultiplexer.cs:423-428`); upstream peers hang waiting for the watchdog. | `Proxy/Multiplexing/PlcMultiplexer.cs:362-385`, `:423-428` |
|
||
|
||
## Major Findings (highest-impact, full list in per-area docs)
|
||
|
||
| Area | Finding | File:line |
|
||
|------|---------|-----------|
|
||
| Supervisor / Hot-reload | "Skip cache invalidations when listener state is `recovering`" (`design.md:203`) is unimplemented. `PlcMultiplexer` never consults supervisor state. Implicit gating (no backend reader during recovery → no invalidation) holds today but is fragile and undocumented in code. | `Proxy/Multiplexing/PlcMultiplexer.cs:497-509` |
|
||
| Supervisor / Hot-reload | `ConfigReconciler.Apply`'s add/restart code paths construct `PlcListenerSupervisor` **without** the `coalescingAccessor` that `ProxyWorker` builds at startup. Result: PLCs added or restarted via hot-reload always use the *default* `ReadCoalescingOptions`, and a hot-reload of `Mbproxy.Resilience.ReadCoalescing.Enabled` does NOT propagate to those supervisors. | `Configuration/ConfigReconciler.cs:294-304,378-388` ↔ `Proxy/ProxyWorker.cs:129-130` |
|
||
| Supervisor / Hot-reload | `ConfigReconciler.ApplyUnderLockAsync` mutates the `_supervisors` `Dictionary` from concurrent `Task.Run` continuations. The outer apply is serialised by a semaphore, but the inner Add/Remove/Restart tasks run in parallel. `Dictionary<,>` is not thread-safe. | `Configuration/ConfigReconciler.cs:236-238,269-271,306,390` |
|
||
| Multiplexer | `_disposed` is non-volatile and read on every hot path (`PlcMultiplexer.cs:86`, `UpstreamPipe.cs:52,77`). Under the .NET memory model, on ARM hosts a stale `false` read after disposal is permitted. (x64 happens to give acquire-release on plain reads.) | `Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566` |
|
||
| Cache | `mbproxy.cache.flushed` event is defined in `CacheLogEvents.cs` and referenced in `Reference/LogEvents.md` but never emitted by any caller. The reseat path disposes the old cache rather than calling `Clear()`. | `Proxy/Cache/CacheLogEvents.cs:67-76`, `Proxy/Supervision/PlcListenerSupervisor.cs:199-216` |
|
||
| Cache | `Cache.AllowLongTtl=false` rejects `CacheTtlMs > 60_000` per the design — and the validators do enforce it — but the suite has **zero tests** for this rule (`grep -rn AllowLongTtl tests/` returns nothing). Same for the happy path with `AllowLongTtl=true`. | `Configuration/ReloadValidator.cs:90-128`, `Options/MbproxyOptions.cs:58-111` |
|
||
| Cache | `IncrementCacheMiss` always increments on cache-eligible reads *before* the fall-through to coalescing. Two peers on the same key produce two cache misses and one coalesced hit — accurate per the strict letter of the spec but means "miss" doesn't mean "backend round-trip". | `Proxy/Multiplexing/PlcMultiplexer.cs:626` |
|
||
| Bcd Rewriter | The 32-bit BCD wire format is "two base-10000 digits in CDAB", **not** standard CDAB binary Int32. Internally consistent with `design.md:123`, but every standard Modbus client (NModbus, FluentModbus, Wonderware DAServer) treats CDAB as straight binary int32 and will silently corrupt any value > 9999. Worth flagging in operator-facing docs so integrators know what wire format the proxy expects. | `Proxy/BcdPduPipeline.cs:202-206,373-379`, `Bcd/BcdCodec.cs` |
|
||
| Bcd Rewriter | `BcdTagMapBuilder` cannot detect duplicate addresses because the input collapses through a `Dictionary` first (`Add` last-write-wins via key). The `DuplicateAddress` validation error is unreachable; the design's "reject duplicate addresses within a single PLC's resolved list" is silently violated. | `Bcd/BcdTagMapBuilder.cs:58-103` |
|
||
| Admin / Diagnostics | `ShutdownCoordinator` runs via `ApplicationStopping` callback then `IHostedService.StopAsync` fires the same `ProxyWorker`/`AdminEndpointHost` again — double-stop. Drain happens but the contract is unclear. The existing `ShutdownE2ETests` don't assert `mbproxy.shutdown.complete` fired. | `Program.cs:60-66`, `Proxy/ProxyWorker.cs:196-217`, `Diagnostics/ShutdownCoordinator.cs:147` |
|
||
| Hosting | `ShutdownCoordinator` consumes `IOptions<MbproxyOptions>` (singleton snapshot) instead of `IOptionsMonitor<MbproxyOptions>`. A hot-reloaded `GracefulShutdownTimeoutMs` is silently ignored at shutdown. | `Diagnostics/ShutdownCoordinator.cs:93,102,118`, `Program.cs:42` |
|
||
| Hosting | Shipped `src/Mbproxy/appsettings.json` is a near-empty stub (no PLCs, no BCD tags, no `Cache` section). The working template lives only in `install/mbproxy.config.template.json`. A fresh `Mbproxy.exe` run from the publish folder is a no-op service. | `src/Mbproxy/appsettings.json` ↔ `install/mbproxy.config.template.json` |
|
||
|
||
## Top fix recommendations (in order)
|
||
|
||
1. **Plumb context updates into the running `PlcMultiplexer`.** Either thread a `volatile PerPlcContext` reference on the mux and have `ReplaceContextAsync` swap it (so the next per-PDU `_ctx.TagMap` / `_ctx.Cache` deref sees the new map / new cache), or restart the mux when the context changes. Without this, hot-reload of tag lists, cache-flush-on-reload, and the `ReadCoalescing.Enabled` toggle all silently fail to propagate. (Resolves: Critical 1, Major "coalescingAccessor".)
|
||
2. **Fix the coalescing factory leak.** In the `_inFlightByKey.TryAttachOrCreate` factory, if `TryAdd`/`TryAllocate` fails, do not return a stub stored under the key — either fail the entire operation (so nobody else attaches), or have the post-factory cleanup walk every party on the stored stub before removing. (Resolves: Critical 2.)
|
||
3. **Decouple the backend reader from per-upstream backpressure.** Either drop-on-full for the response channel, write with a per-frame timeout, or fan out responses on a per-party `Task.Run`. (Resolves: Critical 3.)
|
||
4. **Drain or replace `_outboundChannel` on cascade.** Complete it on cascade and recreate it on next reconnect, so stranded frames never reach a fresh backend with a dead TxId space. (Resolves: Critical 4.)
|
||
5. **Make `_supervisors` thread-safe.** Either `ConcurrentDictionary<,>` or move dict mutations to a single serialisation point after `Task.WhenAll(...)` settles. (Resolves: Major "concurrent dict".)
|
||
6. **Add the `ReadCoalescing` accessor to the reconciler's add/restart paths.** Single-line oversight from when Phase 10 landed after Phase 6.
|
||
7. **Make `ShutdownCoordinator` read `IOptionsMonitor`.**
|
||
8. **Ship a usable `appsettings.json`** (or stop shipping the stub at publish time).
|
||
9. **Flag the 32-bit BCD wire format prominently in operator-facing docs.** The "two base-10000 digits in CDAB" contract lives in `design.md`/`BcdRewriting.md` but not in the README; integrators using standard CDAB-int32 clients will silently corrupt values > 9999 without knowing it.
|
||
10. **Add tests for `Cache.AllowLongTtl`.** Both rejection (>60_000 without gate) and acceptance (>60_000 with gate).
|
||
|
||
## What looks especially good
|
||
|
||
- **Three-tier test architecture** (pure unit → stub-socket integration → simulator E2E) with explicit per-tier rationale in xmldoc. The stub-backend tests catch what the pymodbus simulator can't (its concurrent-MBAP-frame bug is documented and worked around).
|
||
- **Genuine concurrency tests, not pretend ones.** `TxIdAllocatorTests.Concurrent_AllocateRelease_...` runs 100 tasks × 1000 ops asserting no duplicate live IDs; `ResponseCacheTests.Concurrent_GetSet` races 8 tasks × 500 ops; `PlcMultiplexerTests` `BackendDisconnect_CascadesToAllUpstreams` uses real sockets.
|
||
- **Counter-based assertions** across the suite — testing observable telemetry as well as behaviour, doubling as wire-format validation.
|
||
- **Polly pipelines** correctly match the design's documented attempt counts and backoffs in both `BackendConnect` and `ListenerRecovery`.
|
||
- **Bind-failure isolation on the admin endpoint** — Kestrel failing to bind logs `mbproxy.admin.bind.failed` and the proxy listeners continue. E2E-tested.
|
||
- **Single code path for "bring up a listener"** (both startup and hot-reload add go through `PlcListenerSupervisor.StartAsync`). Matches the design.
|
||
- **`FrozenDictionary` for the tag map**, allocation-free codec hot path, and `s_emptyHits` sentinel for the no-hit case keep the rewriter cheap under sustained load.
|
||
- **Stable `EventName`s** on every `[LoggerMessage]` survive a C# rename refactor.
|
||
|
||
## Per-area documents
|
||
|
||
| Area | File |
|
||
|------|------|
|
||
| BCD codec + PDU pipeline | [`BcdRewriter.md`](./BcdRewriter.md) |
|
||
| Multiplexer + concurrency | [`Multiplexer.md`](./Multiplexer.md) |
|
||
| Response cache | [`ResponseCache.md`](./ResponseCache.md) |
|
||
| Supervisor + hot reload | [`SupervisorAndHotReload.md`](./SupervisorAndHotReload.md) |
|
||
| Admin + diagnostics | [`AdminAndDiagnostics.md`](./AdminAndDiagnostics.md) |
|
||
| Hosting + options | [`HostingAndOptions.md`](./HostingAndOptions.md) |
|
||
| Test suite | [`TestSuite.md`](./TestSuite.md) |
|
||
|
||
## Methodology
|
||
|
||
Each per-area document is the verbatim (lightly cleaned) output of an independent area-scoped review pass conducted on 2026-05-14 against branch `master` at commit `53a7111`. Reviewers were given the design docs as ground truth, the in-scope files, and the corresponding tests, and asked for `file:line`-cited findings ranked by severity. The Overview above re-ranks the union of those findings across areas.
|
||
|
||
Findings reference paths relative to `mbproxy/` unless otherwise specified.
|