Files
wwtools/mbproxy/codereviews/2026-05-16/ProxyAndBcd.md
Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings
Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:08:06 -04:00

129 lines
16 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Proxy Lifecycle & BCD Codec
**Date:** 2026-05-16
**Branch:** `mbproxy-webui-dashboard` (HEAD `0308490`)
**Scope:** `src/Mbproxy/Proxy/{PlcListener,ProxyWorker,ProxyCounters,SocketKeepalive,BcdPduPipeline,NoopPduPipeline,IPduPipeline,PerPlcContext,RewriterLogEvents}.cs`, `src/Mbproxy/Proxy/Supervision/*`, `src/Mbproxy/Bcd/*`. Excludes `Proxy/Multiplexing/`, `Proxy/Cache/`, and `TagValueCapture`/`TagCaptureRegistry`.
## Summary
The proxy-lifecycle and BCD-codec subsystems are in good shape. The BCD codec itself (`BcdCodec`) is correct in all four directions, range-checked, and allocation-free; the 32-bit CDAB word order matches the documented `high*10000+low` convention and is consistently applied through `BcdPduPipeline` for FC03/04/06/16. Partial-overlap handling, invalid-BCD passthrough, exception passthrough, and the FC16 per-word base-10000 guard are all implemented correctly. The supervisor's Polly recovery loop, CTS/TCS re-arming, and graceful-shutdown sequencing are carefully built and well-documented. No Critical issues were found. The findings below are one Major correctness gap in the FC16 request length check that admits an integer-overflow bypass, plus a small set of resource/robustness and maintainability items.
**Findings by severity:** Critical 0 · Major 2 · Minor 7
---
## Major
### M1 — FC16 request payload-length check can be bypassed by an oversized `qty`, enabling out-of-PDU reads/writes mitigated only by a later per-slot check
**File:** `src/Mbproxy/Proxy/BcdPduPipeline.cs:158-167`
`ProcessFc16Request` reads `qty` straight from the wire (`pdu[3]<<8 | pdu[4]`, range 065535) and then guards with:
```csharp
if (pdu.Length < 6 + qty * 2)
return;
```
`qty` is `ushort`; `qty * 2` is computed in `int` so the multiplication itself does not overflow (max 131070), and `6 + qty*2` is fine. So the arithmetic is actually safe — but the *intent* documented in the comment ("a client claiming qty=10 with only 4 bytes of register data would otherwise have its BCD slots silently skipped") is only half-delivered. The real exposure is the opposite case the comment does not mention: the PDU is *delivered* in a buffer larger than the framed PDU. `pdu` here is a `Span<byte>` whose length is the parsed PDU length from the MBAP length field. If a malicious or buggy client sends an MBAP length that is large but a `byteCount`/`qty` that is internally inconsistent, the `pdu.Length < 6 + qty*2` check passes when `pdu.Length` is large, and then the per-slot `lowByteOff + 2 > pdu.Length` checks (lines 200, 261) are the only thing keeping the writes in bounds. That secondary check *is* present and correct, so no actual out-of-bounds write occurs — but the FC16 path never validates `byteCount` (`pdu[5]`) against `qty` at all. A request with `qty=2`, `byteCount=4`, but only the FC16-min 6 bytes plus a short tail will be partially rewritten against whatever stale bytes follow.
**Impact:** Not a memory-safety bug (the per-slot bounds checks hold), but a malformed FC16 whose `byteCount` disagrees with `qty` is silently partially rewritten instead of being passed through for the PLC to reject. The rewriter mutates register bytes the client never intended as register data. This is a wire-protocol-correctness gap for adversarial/buggy input.
**Recommendation:** Add an explicit `byteCount` consistency check after reading `qty`: `byte byteCount = pdu[5]; if (byteCount != qty * 2 || pdu.Length < 6 + byteCount) return;`. This makes the rewriter pass through any FC16 whose self-describing fields disagree, matching the documented "let the PLC's validator surface the protocol error" policy and removing reliance on the per-slot check as the safety net.
### M2 — `PlcListener.RunAsync` swallows a faulted accept loop without distinguishing a transient fault from a permanently dead listener; the supervisor then treats every non-cancellation exit identically
**File:** `src/Mbproxy/Proxy/PlcListener.cs:152-161`, interacting with `Supervision/PlcListenerSupervisor.cs:386-432`
`PlcListener.RunAsync` catches *every* non-`OperationCanceledException` exception, logs `mbproxy.listener.faulted`, and **returns normally**. Because it returns rather than rethrows, the supervisor's `await listener.RunAsync(token)` (line 388) completes without throwing. Control then falls through to lines 418-432, which treat a normal return as "listener accept loop ended unexpectedly", increment `RecoveryAttempts`, log `mbproxy.listener.ended`, and throw `InvalidOperationException` to drive a Polly retry.
The net effect is that a genuine accept-loop fault (e.g. `SocketException` from `AcceptSocketAsync`) is reported to operators as `mbproxy.listener.faulted` **and then** `mbproxy.listener.ended`, and the recovery counter is incremented once for the same event. The supervisor's own `catch (Exception runEx)` block at line 397 — which exists specifically to log the fault with the exception object attached and increment the counter — is **unreachable for any fault originating inside `RunAsync`**, because `RunAsync` never lets the exception escape.
**Impact:** Double logging (two distinct event names for one fault), the `mbproxy.listener.faulted` supervised emission (EventId 43, Warning, *with* stack trace) never fires for accept-loop faults — only the unsupervised `PlcListener` emission (EventId 22, Error, *no* exception object) does — so operators lose the stack trace the doc (`LogEvents.md` line 99) promises. The supervisor's `IncrementRecoveryAttempt(runEx.Message)` with the real reason is also never reached; the counter is bumped with the generic "Listener accept loop ended unexpectedly" string instead.
**Recommendation:** Have `PlcListener.RunAsync` rethrow after logging (`catch (Exception ex) { LogListenerFaulted(...); throw; }`), so the supervisor's `catch (Exception runEx)` path handles it as designed. Then the supervisor logs `mbproxy.listener.faulted` (EventId 43) with the exception, and the "ended unexpectedly" path is reserved for the genuinely-clean-return case it was written for. Alternatively, drop the supervisor's dead `catch (Exception runEx)` block and accept the current behaviour — but the current split is misleading and contradicts the documented event semantics.
---
## Minor
### N1 — `ProxyWorker.LogBindFailed` (EventId 21) is declared but never invoked; `LogEvents.md` documents it as emitted by `ProxyWorker`
**File:** `src/Mbproxy/Proxy/ProxyWorker.cs:382-385`; doc `docs/Reference/LogEvents.md:63`
The `[LoggerMessage]` `LogBindFailed` partial method (EventId 21, `mbproxy.startup.bind.failed`) is defined in `ProxyWorker` but has no call site — all bind-failure logging happens in `PlcListenerSupervisor.LogBindFailed` (EventId 41). `LogEvents.md` line 63 lists EventId 21 / `ProxyWorker.cs` as a source for `mbproxy.startup.bind.failed`, which no longer matches the code.
**Impact:** Dead code; misleading documentation. An operator filtering on EventId 21 will never see a hit.
**Recommendation:** Delete the unused `LogBindFailed` declaration from `ProxyWorker.cs` and remove the `21 (ProxyWorker)` / `src/Mbproxy/Proxy/ProxyWorker.cs` references from the `mbproxy.startup.bind.failed` table in `LogEvents.md`.
### N2 — `RewriterLogEvents.ExceptionPassthrough` event name `mbproxy.exception.passthrough` is inconsistent with the `mbproxy.rewrite.*` family it sits in
**File:** `src/Mbproxy/Proxy/RewriterLogEvents.cs:46-55`
`PartialBcd` and `InvalidBcd` use `mbproxy.rewrite.partial_bcd` / `mbproxy.rewrite.invalid_bcd`, but `ExceptionPassthrough` uses `mbproxy.exception.passthrough`. `LogEvents.md` (line 557) explicitly lists `exception` as its own `<area>`, so this is intentional and *documented* — but `BcdRewriting.md:146` calls it `mbproxy.rewrite.exception_passthrough`, and `BcdRewriting.md:232` repeats `mbproxy.rewrite.exception_passthrough`. The code and `LogEvents.md` agree on `mbproxy.exception.passthrough`; `BcdRewriting.md` is wrong in two places.
**Impact:** Doc drift. An operator following `BcdRewriting.md` to build a log filter would grep a name that is never emitted.
**Recommendation:** Fix the two occurrences in `docs/Features/BcdRewriting.md` (lines 146 and 232) to `mbproxy.exception.passthrough` to match `LogEvents.md` and the source.
### N3 — `PlcListener.RunAsync` orphans the per-pipe `ContinueWith` cleanup if the listener faults between `Task.Run` and dictionary insertion
**File:** `src/Mbproxy/Proxy/PlcListener.cs:136-149`
The accept loop creates `pipeTask`, inserts it into `_pipeTasks`, then attaches a `ContinueWith` that removes the entry. If `DisposeAsync` runs concurrently (it snapshots `_pipeTasks.Values` at line 178), there is a benign window where a pipe task started after the snapshot is not awaited — acceptable best-effort. More notably, the `ContinueWith` continuation uses `TaskScheduler.Default` and discards its own task (`_ = ...`); if the continuation itself throws (it cannot here, `TryRemove` does not throw) it would be an unobserved exception. Low risk, but the pattern of fire-and-forget `ContinueWith` for cleanup is fragile.
**Impact:** None in practice — `TryRemove` cannot throw. Maintainability only.
**Recommendation:** Prefer awaiting cleanup inside the `Task.Run` lambda's `finally` (the lambda already has a `finally` that disposes the pipe — add `_pipeTasks.TryRemove(pipe.Id, out _);` there) and drop the separate `ContinueWith`. This collapses two scheduling primitives into one and removes the discarded-task wart.
### N4 — `SocketKeepalive.Apply` does not catch `NotSupportedException`; some platforms throw it (not `SocketException`) for the `TcpKeepAlive*` options
**File:** `src/Mbproxy/Proxy/SocketKeepalive.cs:36-47`
`Apply` catches `SocketException` and `ObjectDisposedException`. The three `TcpKeepAlive*` socket options are documented as "not honoured on every platform"; on platforms/runtimes where the option is unrecognised, `Socket.SetSocketOption` can throw `SocketException(SocketError.ProtocolOption)` (caught) **or**, on some older runtimes/OSes, `NotSupportedException` / `PlatformNotSupportedException` for the named option enum. Those would escape and — per the comment's own intent ("must never abort a connection") — defeat the best-effort contract.
**Impact:** On a platform that throws `PlatformNotSupportedException` for `TcpKeepAliveRetryCount`, applying keepalive to a backend or accepted upstream socket would throw out of the constructor / connect path, aborting the connection. The `SO_KEEPALIVE` master option (line 29) is universally supported, so the realistic failure window is narrow, but the swallow set is incomplete relative to the stated guarantee.
**Recommendation:** Broaden the catch to also include `NotSupportedException` (which `PlatformNotSupportedException` derives from) — or simplest, catch `Exception` here given the explicit "swallow everything, keepalive is best-effort" contract documented in the XML summary.
### N5 — FC03/04 response partial-overlap warning logs `qty` (raw request qty) while the in-range computation uses `effectiveQty` (clamped to response words)
**File:** `src/Mbproxy/Proxy/BcdPduPipeline.cs:348,362-370`
`ProcessResponse` clamps `effectiveQty = min(qty, wordsInResponse)` and uses `effectiveQty` for the `lowInRange`/`highInRange` checks (lines 362-363), which is correct. But the partial-BCD warning at line 367-368 logs `startAddress, qty` — the *original* request qty, not `effectiveQty`. If a PLC returns a short response (`byteCount` smaller than `qty*2`), a 32-bit tag that *would* have been fully in range for the requested `qty` is reported as a partial overlap with the full `qty`, making the warning look like a client/config straddle when it is actually a truncated response.
**Impact:** Misleading `mbproxy.rewrite.partial_bcd` warning in the (rare) short-response case; an operator would chase a client tag-map mismatch that does not exist.
**Recommendation:** Either log `effectiveQty` in the response-path `PartialBcd` call, or — better — distinguish the two causes: a short response is a backend/transport anomaly, not a client straddle, and arguably warrants a different (or no) warning. At minimum make the logged qty match the qty actually used for the in-range decision.
### N6 — `ProxyCounters.UpdateRoundTripEwma` comment claims ~1 µs resolution but stores nanoseconds; the fixed-point scale and the comment disagree
**File:** `src/Mbproxy/Proxy/ProxyCounters.cs:411-415`
`sampleFixed = (long)(sampleMs * 1000.0)` converts milliseconds to microseconds (×1000). The comment on line 413 says "store microseconds * 1000 (i.e. nanoseconds)" — but `sampleMs * 1000` is microseconds, not nanoseconds. `Snapshot()` then divides by 1000.0 (line 466) to get milliseconds back, which is consistent with microsecond storage. So the *code* is internally consistent (ms→µs store, µs→ms read); only the comment's "(i.e. nanoseconds)" parenthetical is wrong, and the "~1 µs resolution" claim is right for microsecond storage.
**Impact:** None functional — the EWMA value is correct. Comment is misleading for a future maintainer.
**Recommendation:** Fix the line 413 comment to "store milliseconds * 1000 (i.e. microseconds)".
### N7 — `BcdTagMapBuilder` includes entries implicated in an `OverlappingHighRegister` error in the returned map; relies entirely on the caller checking `Errors.Count`
**File:** `src/Mbproxy/Bcd/BcdTagMapBuilder.cs:155-164`
The builder's own comment (lines 156-160) acknowledges that `OverlappingHighRegister`-implicated entries are *kept* in the frozen map, unlike `InvalidWidth`/`DuplicateAddress` entries which are excluded. The safety of this depends on every caller treating `Errors.Count > 0` as fatal. `ProxyWorker.ExecuteAsync` (lines 107-114) does exactly that — it skips the listener. But `ConfigReconciler` and any future caller must replicate the check; the asymmetry (some bad entries excluded, some included) is a latent footgun.
**Impact:** None today — the one production caller handles it. A future caller that builds a map and uses it without checking `Errors` would get a map with a known-bad overlapping 32-bit pair, and the rewriter would then mis-decode the overlapping registers (a 32-bit tag whose high register is another tag's address produces a `RangeHit` for both, and both would be rewritten).
**Recommendation:** For consistency and defence-in-depth, exclude `OverlappingHighRegister`-implicated entries from the frozen map the same way `InvalidWidth`/`DuplicateAddress` entries are excluded, so a map returned alongside errors is always safe to use even by a caller that forgets the check. If keeping them is deliberate (for diagnostics), document the contract on `ValidationResult.Map` itself, not just inside the builder.
---
## Notes (no finding)
- **BCD codec correctness verified.** `Encode16`/`Decode16` round-trip cleanly for `[0,9999]`; `(uint)value > Max16` correctly rejects negatives via unsigned wrap. `Encode32`/`Decode32` implement CDAB low-word-first with `high*10000+low`, matching `BcdRewriting.md` and `dl205.md`. `HasBadNibble` checks all four nibbles. The FC16 request path's per-word `clientLow/clientHigh > 9999` guard (lines 222-228) correctly prevents the documented silent-mutation bug where `(9999,9999)` would otherwise survive `Encode32`'s 99,999,999 ceiling.
- **FC06 high-register partial detection is correct.** `ProcessFc06Request` (lines 96-116) uses `TryGetForRange(address,1,...)` to catch the case where the written address is the *high* register of a configured 32-bit pair; `TryGetForRange`'s negative-`OffsetWords` semantics make `hit.OffsetWords < 0` the right discriminator.
- **Supervisor lifecycle is sound.** CTS and TCS are re-armed per `StartAsync`, the previous CTS is disposed before reassignment, `DisposeAsync` is idempotent, and the response cache's eviction timer is disposed in both `ReplaceContextAsync` (old cache) and `DisposeAsync`. Polly's listener-recovery pipeline is correctly infinite (`MaxRetryAttempts = int.MaxValue`) with cancellation as the sole exit.
- **Graceful shutdown ordering is correct.** `ProxyWorker.StopAsync` snapshots in-flight counts before `base.StopAsync`, drains via supervisor stop, stops the admin endpoint last, and disposes supervisors — the documented sequence. The `inFlightAtCancel` snapshot rationale is well-reasoned.
- **`ProxyCounters` is correctly lock-free** — all increments are `Interlocked`, `ObserveInFlight` and `UpdateRoundTripEwma` use proper CAS loops, and `Snapshot` reads each field atomically.