Files
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

87 lines
13 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# mbproxy Code Review — 2026-05-14
In-depth review of the `mbproxy` service through Phase 11 (opt-in response cache). Conducted via parallel area-focused review passes, then synthesised here. Each area has its own document under this folder; this Overview ranks findings across the whole codebase by impact so a maintainer can prioritise fixes without re-reading every file.
**Scope:** every `.cs` file under `src/Mbproxy/`, the design contracts in `docs/design.md` + `docs/Architecture/*` + `docs/Features/*`, the option POCOs, the hosting/install glue, and the test suite under `tests/Mbproxy.Tests/`.
**Out of scope:** the pymodbus simulator profile itself, anything under `DL260/`, anything outside `mbproxy/`.
## Reviewer Verdict
The service is well-architected and the per-area code quality is high. Concurrency primitives are correct, log events are stable and well-named, and the test suite genuinely exercises races rather than pretending to. Most of the contracts in `docs/design.md` are upheld in the code.
The fleet of critical-severity findings clusters in **two seams**:
1. **Hot-reload propagation into a running `PlcMultiplexer`.** The supervisor's `ReplaceContextAsync` swaps `_currentContext` on the supervisor but the running multiplexer captures `_ctx` at construction and never re-reads it — so tag-list reloads, cache-on-reload flushes, and the `ReadCoalescing.Enabled` toggle are visible only on the *next* listener fault, not at the next PDU. The design doc reads as if the next-PDU contract is upheld; the code only upholds it for one branch (the `BcdTagMap` read inside the rewriter's per-call clone).
2. **Backend reader/writer cascade.** When the backend socket dies, in-flight responses can sit synchronously on a wedged upstream pipe (one slow client stalls every peer), stranded outbound frames carry old TxIds against a fresh backend (orphaned responses, hung peers via watchdog), and the cascade itself races accepts that arrive between snapshot and `_pipes.Clear()`.
Below, **Critical** findings are correctness defects, hangs, data corruption, or security gaps that block production confidence. **Major** findings are design-contract violations that an operator could hit in a year of running. **Minor** findings are style, dead code, and test-only gaps.
## Critical Findings (ranked by impact)
| # | Area | Finding | File:line |
|---|------|---------|-----------|
| 1 | Supervisor / Hot-reload | `ReplaceContextAsync` swaps the supervisor's `_currentContext` but the running `PlcMultiplexer` captured `_ctx` at construction and never re-reads. The cache-flush-on-tag-list-reload contract (`design.md:155`) is not enforced for the in-flight mux; the old cache is even disposed while the running mux is still using it. | `Proxy/Supervision/PlcListenerSupervisor.cs:199-216``Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109` |
| 2 | Multiplexer | Coalescing factory failure (`TryAdd` collision, `TxIdAllocator.TryAllocate` returns false) stores a stub `InFlightRequest` under the coalescing key but only the leader gets the saturation exception. Late attachers join a `List<InterestedParty>` that no one will ever drain — pipes hang until the upstream socket closes. | `Proxy/Multiplexing/PlcMultiplexer.cs:692-733`, `Proxy/Multiplexing/InFlightByKeyMap.cs:82-83` |
| 3 | Multiplexer | Backend reader sends to upstream synchronously inside the single reader task. If one upstream's bounded response channel is full, the reader blocks — stalling every other client on the same PLC. | `Proxy/Multiplexing/PlcMultiplexer.cs:545` |
| 4 | Multiplexer | Backend writer doesn't complete/drain `_outboundChannel` on cascade. Stranded frames carry old proxy TxIds; the fresh backend socket then receives them and responds against TxIds whose correlation entries were already dropped by `TearDownBackendAsync`. Responses are silently dropped (`PlcMultiplexer.cs:423-428`); upstream peers hang waiting for the watchdog. | `Proxy/Multiplexing/PlcMultiplexer.cs:362-385`, `:423-428` |
## Major Findings (highest-impact, full list in per-area docs)
| Area | Finding | File:line |
|------|---------|-----------|
| Supervisor / Hot-reload | "Skip cache invalidations when listener state is `recovering`" (`design.md:203`) is unimplemented. `PlcMultiplexer` never consults supervisor state. Implicit gating (no backend reader during recovery → no invalidation) holds today but is fragile and undocumented in code. | `Proxy/Multiplexing/PlcMultiplexer.cs:497-509` |
| Supervisor / Hot-reload | `ConfigReconciler.Apply`'s add/restart code paths construct `PlcListenerSupervisor` **without** the `coalescingAccessor` that `ProxyWorker` builds at startup. Result: PLCs added or restarted via hot-reload always use the *default* `ReadCoalescingOptions`, and a hot-reload of `Mbproxy.Resilience.ReadCoalescing.Enabled` does NOT propagate to those supervisors. | `Configuration/ConfigReconciler.cs:294-304,378-388``Proxy/ProxyWorker.cs:129-130` |
| Supervisor / Hot-reload | `ConfigReconciler.ApplyUnderLockAsync` mutates the `_supervisors` `Dictionary` from concurrent `Task.Run` continuations. The outer apply is serialised by a semaphore, but the inner Add/Remove/Restart tasks run in parallel. `Dictionary<,>` is not thread-safe. | `Configuration/ConfigReconciler.cs:236-238,269-271,306,390` |
| Multiplexer | `_disposed` is non-volatile and read on every hot path (`PlcMultiplexer.cs:86`, `UpstreamPipe.cs:52,77`). Under the .NET memory model, on ARM hosts a stale `false` read after disposal is permitted. (x64 happens to give acquire-release on plain reads.) | `Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566` |
| Cache | `mbproxy.cache.flushed` event is defined in `CacheLogEvents.cs` and referenced in `Reference/LogEvents.md` but never emitted by any caller. The reseat path disposes the old cache rather than calling `Clear()`. | `Proxy/Cache/CacheLogEvents.cs:67-76`, `Proxy/Supervision/PlcListenerSupervisor.cs:199-216` |
| Cache | `Cache.AllowLongTtl=false` rejects `CacheTtlMs > 60_000` per the design — and the validators do enforce it — but the suite has **zero tests** for this rule (`grep -rn AllowLongTtl tests/` returns nothing). Same for the happy path with `AllowLongTtl=true`. | `Configuration/ReloadValidator.cs:90-128`, `Options/MbproxyOptions.cs:58-111` |
| Cache | `IncrementCacheMiss` always increments on cache-eligible reads *before* the fall-through to coalescing. Two peers on the same key produce two cache misses and one coalesced hit — accurate per the strict letter of the spec but means "miss" doesn't mean "backend round-trip". | `Proxy/Multiplexing/PlcMultiplexer.cs:626` |
| Bcd Rewriter | The 32-bit BCD wire format is "two base-10000 digits in CDAB", **not** standard CDAB binary Int32. Internally consistent with `design.md:123`, but every standard Modbus client (NModbus, FluentModbus, Wonderware DAServer) treats CDAB as straight binary int32 and will silently corrupt any value > 9999. Worth flagging in operator-facing docs so integrators know what wire format the proxy expects. | `Proxy/BcdPduPipeline.cs:202-206,373-379`, `Bcd/BcdCodec.cs` |
| Bcd Rewriter | `BcdTagMapBuilder` cannot detect duplicate addresses because the input collapses through a `Dictionary` first (`Add` last-write-wins via key). The `DuplicateAddress` validation error is unreachable; the design's "reject duplicate addresses within a single PLC's resolved list" is silently violated. | `Bcd/BcdTagMapBuilder.cs:58-103` |
| Admin / Diagnostics | `ShutdownCoordinator` runs via `ApplicationStopping` callback then `IHostedService.StopAsync` fires the same `ProxyWorker`/`AdminEndpointHost` again — double-stop. Drain happens but the contract is unclear. The existing `ShutdownE2ETests` don't assert `mbproxy.shutdown.complete` fired. | `Program.cs:60-66`, `Proxy/ProxyWorker.cs:196-217`, `Diagnostics/ShutdownCoordinator.cs:147` |
| Hosting | `ShutdownCoordinator` consumes `IOptions<MbproxyOptions>` (singleton snapshot) instead of `IOptionsMonitor<MbproxyOptions>`. A hot-reloaded `GracefulShutdownTimeoutMs` is silently ignored at shutdown. | `Diagnostics/ShutdownCoordinator.cs:93,102,118`, `Program.cs:42` |
| Hosting | Shipped `src/Mbproxy/appsettings.json` is a near-empty stub (no PLCs, no BCD tags, no `Cache` section). The working template lives only in `install/mbproxy.config.template.json`. A fresh `Mbproxy.exe` run from the publish folder is a no-op service. | `src/Mbproxy/appsettings.json``install/mbproxy.config.template.json` |
## Top fix recommendations (in order)
1. **Plumb context updates into the running `PlcMultiplexer`.** Either thread a `volatile PerPlcContext` reference on the mux and have `ReplaceContextAsync` swap it (so the next per-PDU `_ctx.TagMap` / `_ctx.Cache` deref sees the new map / new cache), or restart the mux when the context changes. Without this, hot-reload of tag lists, cache-flush-on-reload, and the `ReadCoalescing.Enabled` toggle all silently fail to propagate. (Resolves: Critical 1, Major "coalescingAccessor".)
2. **Fix the coalescing factory leak.** In the `_inFlightByKey.TryAttachOrCreate` factory, if `TryAdd`/`TryAllocate` fails, do not return a stub stored under the key — either fail the entire operation (so nobody else attaches), or have the post-factory cleanup walk every party on the stored stub before removing. (Resolves: Critical 2.)
3. **Decouple the backend reader from per-upstream backpressure.** Either drop-on-full for the response channel, write with a per-frame timeout, or fan out responses on a per-party `Task.Run`. (Resolves: Critical 3.)
4. **Drain or replace `_outboundChannel` on cascade.** Complete it on cascade and recreate it on next reconnect, so stranded frames never reach a fresh backend with a dead TxId space. (Resolves: Critical 4.)
5. **Make `_supervisors` thread-safe.** Either `ConcurrentDictionary<,>` or move dict mutations to a single serialisation point after `Task.WhenAll(...)` settles. (Resolves: Major "concurrent dict".)
6. **Add the `ReadCoalescing` accessor to the reconciler's add/restart paths.** Single-line oversight from when Phase 10 landed after Phase 6.
7. **Make `ShutdownCoordinator` read `IOptionsMonitor`.**
8. **Ship a usable `appsettings.json`** (or stop shipping the stub at publish time).
9. **Flag the 32-bit BCD wire format prominently in operator-facing docs.** The "two base-10000 digits in CDAB" contract lives in `design.md`/`BcdRewriting.md` but not in the README; integrators using standard CDAB-int32 clients will silently corrupt values > 9999 without knowing it.
10. **Add tests for `Cache.AllowLongTtl`.** Both rejection (>60_000 without gate) and acceptance (>60_000 with gate).
## What looks especially good
- **Three-tier test architecture** (pure unit → stub-socket integration → simulator E2E) with explicit per-tier rationale in xmldoc. The stub-backend tests catch what the pymodbus simulator can't (its concurrent-MBAP-frame bug is documented and worked around).
- **Genuine concurrency tests, not pretend ones.** `TxIdAllocatorTests.Concurrent_AllocateRelease_...` runs 100 tasks × 1000 ops asserting no duplicate live IDs; `ResponseCacheTests.Concurrent_GetSet` races 8 tasks × 500 ops; `PlcMultiplexerTests` `BackendDisconnect_CascadesToAllUpstreams` uses real sockets.
- **Counter-based assertions** across the suite — testing observable telemetry as well as behaviour, doubling as wire-format validation.
- **Polly pipelines** correctly match the design's documented attempt counts and backoffs in both `BackendConnect` and `ListenerRecovery`.
- **Bind-failure isolation on the admin endpoint** — Kestrel failing to bind logs `mbproxy.admin.bind.failed` and the proxy listeners continue. E2E-tested.
- **Single code path for "bring up a listener"** (both startup and hot-reload add go through `PlcListenerSupervisor.StartAsync`). Matches the design.
- **`FrozenDictionary` for the tag map**, allocation-free codec hot path, and `s_emptyHits` sentinel for the no-hit case keep the rewriter cheap under sustained load.
- **Stable `EventName`s** on every `[LoggerMessage]` survive a C# rename refactor.
## Per-area documents
| Area | File |
|------|------|
| BCD codec + PDU pipeline | [`BcdRewriter.md`](./BcdRewriter.md) |
| Multiplexer + concurrency | [`Multiplexer.md`](./Multiplexer.md) |
| Response cache | [`ResponseCache.md`](./ResponseCache.md) |
| Supervisor + hot reload | [`SupervisorAndHotReload.md`](./SupervisorAndHotReload.md) |
| Admin + diagnostics | [`AdminAndDiagnostics.md`](./AdminAndDiagnostics.md) |
| Hosting + options | [`HostingAndOptions.md`](./HostingAndOptions.md) |
| Test suite | [`TestSuite.md`](./TestSuite.md) |
## Methodology
Each per-area document is the verbatim (lightly cleaned) output of an independent area-scoped review pass conducted on 2026-05-14 against branch `master` at commit `53a7111`. Reviewers were given the design docs as ground truth, the in-scope files, and the corresponding tests, and asked for `file:line`-cited findings ranked by severity. The Overview above re-ranks the union of those findings across areas.
Findings reference paths relative to `mbproxy/` unless otherwise specified.