Files

T

Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings

Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 18:08:06 -04:00

16 KiB

Raw Blame History

Code Review — Proxy Lifecycle & BCD Codec

Date: 2026-05-16 Branch: mbproxy-webui-dashboard (HEAD 0308490) Scope: src/Mbproxy/Proxy/{PlcListener,ProxyWorker,ProxyCounters,SocketKeepalive,BcdPduPipeline,NoopPduPipeline,IPduPipeline,PerPlcContext,RewriterLogEvents}.cs, src/Mbproxy/Proxy/Supervision/*, src/Mbproxy/Bcd/*. Excludes Proxy/Multiplexing/, Proxy/Cache/, and TagValueCapture/TagCaptureRegistry.

Summary

The proxy-lifecycle and BCD-codec subsystems are in good shape. The BCD codec itself (BcdCodec) is correct in all four directions, range-checked, and allocation-free; the 32-bit CDAB word order matches the documented high*10000+low convention and is consistently applied through BcdPduPipeline for FC03/04/06/16. Partial-overlap handling, invalid-BCD passthrough, exception passthrough, and the FC16 per-word base-10000 guard are all implemented correctly. The supervisor's Polly recovery loop, CTS/TCS re-arming, and graceful-shutdown sequencing are carefully built and well-documented. No Critical issues were found. The findings below are one Major correctness gap in the FC16 request length check that admits an integer-overflow bypass, plus a small set of resource/robustness and maintainability items.

Findings by severity: Critical 0 · Major 2 · Minor 7

Major

M1 — FC16 request payload-length check can be bypassed by an oversized `qty`, enabling out-of-PDU reads/writes mitigated only by a later per-slot check

File: src/Mbproxy/Proxy/BcdPduPipeline.cs:158-167

ProcessFc16Request reads qty straight from the wire (pdu[3]<<8 | pdu[4], range 0–65535) and then guards with:

if (pdu.Length < 6 + qty * 2)
    return;

qty is ushort; qty * 2 is computed in int so the multiplication itself does not overflow (max 131070), and 6 + qty*2 is fine. So the arithmetic is actually safe — but the intent documented in the comment ("a client claiming qty=10 with only 4 bytes of register data would otherwise have its BCD slots silently skipped") is only half-delivered. The real exposure is the opposite case the comment does not mention: the PDU is delivered in a buffer larger than the framed PDU. pdu here is a Span<byte> whose length is the parsed PDU length from the MBAP length field. If a malicious or buggy client sends an MBAP length that is large but a byteCount/qty that is internally inconsistent, the pdu.Length < 6 + qty*2 check passes when pdu.Length is large, and then the per-slot lowByteOff + 2 > pdu.Length checks (lines 200, 261) are the only thing keeping the writes in bounds. That secondary check is present and correct, so no actual out-of-bounds write occurs — but the FC16 path never validates byteCount (pdu[5]) against qty at all. A request with qty=2, byteCount=4, but only the FC16-min 6 bytes plus a short tail will be partially rewritten against whatever stale bytes follow.

Impact: Not a memory-safety bug (the per-slot bounds checks hold), but a malformed FC16 whose byteCount disagrees with qty is silently partially rewritten instead of being passed through for the PLC to reject. The rewriter mutates register bytes the client never intended as register data. This is a wire-protocol-correctness gap for adversarial/buggy input.

Recommendation: Add an explicit byteCount consistency check after reading qty: byte byteCount = pdu[5]; if (byteCount != qty * 2 || pdu.Length < 6 + byteCount) return;. This makes the rewriter pass through any FC16 whose self-describing fields disagree, matching the documented "let the PLC's validator surface the protocol error" policy and removing reliance on the per-slot check as the safety net.

M2 — `PlcListener.RunAsync` swallows a faulted accept loop without distinguishing a transient fault from a permanently dead listener; the supervisor then treats every non-cancellation exit identically

File: src/Mbproxy/Proxy/PlcListener.cs:152-161, interacting with Supervision/PlcListenerSupervisor.cs:386-432

PlcListener.RunAsync catches every non-OperationCanceledException exception, logs mbproxy.listener.faulted, and returns normally. Because it returns rather than rethrows, the supervisor's await listener.RunAsync(token) (line 388) completes without throwing. Control then falls through to lines 418-432, which treat a normal return as "listener accept loop ended unexpectedly", increment RecoveryAttempts, log mbproxy.listener.ended, and throw InvalidOperationException to drive a Polly retry.

The net effect is that a genuine accept-loop fault (e.g. SocketException from AcceptSocketAsync) is reported to operators as mbproxy.listener.faulted and then mbproxy.listener.ended, and the recovery counter is incremented once for the same event. The supervisor's own catch (Exception runEx) block at line 397 — which exists specifically to log the fault with the exception object attached and increment the counter — is unreachable for any fault originating inside RunAsync, because RunAsync never lets the exception escape.

Impact: Double logging (two distinct event names for one fault), the mbproxy.listener.faulted supervised emission (EventId 43, Warning, with stack trace) never fires for accept-loop faults — only the unsupervised PlcListener emission (EventId 22, Error, no exception object) does — so operators lose the stack trace the doc (LogEvents.md line 99) promises. The supervisor's IncrementRecoveryAttempt(runEx.Message) with the real reason is also never reached; the counter is bumped with the generic "Listener accept loop ended unexpectedly" string instead.

Recommendation: Have PlcListener.RunAsync rethrow after logging (catch (Exception ex) { LogListenerFaulted(...); throw; }), so the supervisor's catch (Exception runEx) path handles it as designed. Then the supervisor logs mbproxy.listener.faulted (EventId 43) with the exception, and the "ended unexpectedly" path is reserved for the genuinely-clean-return case it was written for. Alternatively, drop the supervisor's dead catch (Exception runEx) block and accept the current behaviour — but the current split is misleading and contradicts the documented event semantics.

Minor

N1 — `ProxyWorker.LogBindFailed` (EventId 21) is declared but never invoked; `LogEvents.md` documents it as emitted by `ProxyWorker`

File: src/Mbproxy/Proxy/ProxyWorker.cs:382-385; doc docs/Reference/LogEvents.md:63

The [LoggerMessage] LogBindFailed partial method (EventId 21, mbproxy.startup.bind.failed) is defined in ProxyWorker but has no call site — all bind-failure logging happens in PlcListenerSupervisor.LogBindFailed (EventId 41). LogEvents.md line 63 lists EventId 21 / ProxyWorker.cs as a source for mbproxy.startup.bind.failed, which no longer matches the code.

Impact: Dead code; misleading documentation. An operator filtering on EventId 21 will never see a hit.

Recommendation: Delete the unused LogBindFailed declaration from ProxyWorker.cs and remove the 21 (ProxyWorker) / src/Mbproxy/Proxy/ProxyWorker.cs references from the mbproxy.startup.bind.failed table in LogEvents.md.

N2 — `RewriterLogEvents.ExceptionPassthrough` event name `mbproxy.exception.passthrough` is inconsistent with the `mbproxy.rewrite.*` family it sits in

File: src/Mbproxy/Proxy/RewriterLogEvents.cs:46-55

PartialBcd and InvalidBcd use mbproxy.rewrite.partial_bcd / mbproxy.rewrite.invalid_bcd, but ExceptionPassthrough uses mbproxy.exception.passthrough. LogEvents.md (line 557) explicitly lists exception as its own <area>, so this is intentional and documented — but BcdRewriting.md:146 calls it mbproxy.rewrite.exception_passthrough, and BcdRewriting.md:232 repeats mbproxy.rewrite.exception_passthrough. The code and LogEvents.md agree on mbproxy.exception.passthrough; BcdRewriting.md is wrong in two places.

Impact: Doc drift. An operator following BcdRewriting.md to build a log filter would grep a name that is never emitted.

Recommendation: Fix the two occurrences in docs/Features/BcdRewriting.md (lines 146 and 232) to mbproxy.exception.passthrough to match LogEvents.md and the source.

N3 — `PlcListener.RunAsync` orphans the per-pipe `ContinueWith` cleanup if the listener faults between `Task.Run` and dictionary insertion

File: src/Mbproxy/Proxy/PlcListener.cs:136-149

The accept loop creates pipeTask, inserts it into _pipeTasks, then attaches a ContinueWith that removes the entry. If DisposeAsync runs concurrently (it snapshots _pipeTasks.Values at line 178), there is a benign window where a pipe task started after the snapshot is not awaited — acceptable best-effort. More notably, the ContinueWith continuation uses TaskScheduler.Default and discards its own task (_ = ...); if the continuation itself throws (it cannot here, TryRemove does not throw) it would be an unobserved exception. Low risk, but the pattern of fire-and-forget ContinueWith for cleanup is fragile.

Impact: None in practice — TryRemove cannot throw. Maintainability only.

Recommendation: Prefer awaiting cleanup inside the Task.Run lambda's finally (the lambda already has a finally that disposes the pipe — add _pipeTasks.TryRemove(pipe.Id, out _); there) and drop the separate ContinueWith. This collapses two scheduling primitives into one and removes the discarded-task wart.

N4 — `SocketKeepalive.Apply` does not catch `NotSupportedException`; some platforms throw it (not `SocketException`) for the `TcpKeepAlive*` options

File: src/Mbproxy/Proxy/SocketKeepalive.cs:36-47

Apply catches SocketException and ObjectDisposedException. The three TcpKeepAlive* socket options are documented as "not honoured on every platform"; on platforms/runtimes where the option is unrecognised, Socket.SetSocketOption can throw SocketException(SocketError.ProtocolOption) (caught) or, on some older runtimes/OSes, NotSupportedException / PlatformNotSupportedException for the named option enum. Those would escape and — per the comment's own intent ("must never abort a connection") — defeat the best-effort contract.

Impact: On a platform that throws PlatformNotSupportedException for TcpKeepAliveRetryCount, applying keepalive to a backend or accepted upstream socket would throw out of the constructor / connect path, aborting the connection. The SO_KEEPALIVE master option (line 29) is universally supported, so the realistic failure window is narrow, but the swallow set is incomplete relative to the stated guarantee.

Recommendation: Broaden the catch to also include NotSupportedException (which PlatformNotSupportedException derives from) — or simplest, catch Exception here given the explicit "swallow everything, keepalive is best-effort" contract documented in the XML summary.

N5 — FC03/04 response partial-overlap warning logs `qty` (raw request qty) while the in-range computation uses `effectiveQty` (clamped to response words)

File: src/Mbproxy/Proxy/BcdPduPipeline.cs:348,362-370

ProcessResponse clamps effectiveQty = min(qty, wordsInResponse) and uses effectiveQty for the lowInRange/highInRange checks (lines 362-363), which is correct. But the partial-BCD warning at line 367-368 logs startAddress, qty — the original request qty, not effectiveQty. If a PLC returns a short response (byteCount smaller than qty*2), a 32-bit tag that would have been fully in range for the requested qty is reported as a partial overlap with the full qty, making the warning look like a client/config straddle when it is actually a truncated response.

Impact: Misleading mbproxy.rewrite.partial_bcd warning in the (rare) short-response case; an operator would chase a client tag-map mismatch that does not exist.

Recommendation: Either log effectiveQty in the response-path PartialBcd call, or — better — distinguish the two causes: a short response is a backend/transport anomaly, not a client straddle, and arguably warrants a different (or no) warning. At minimum make the logged qty match the qty actually used for the in-range decision.

N6 — `ProxyCounters.UpdateRoundTripEwma` comment claims ~1 µs resolution but stores nanoseconds; the fixed-point scale and the comment disagree

File: src/Mbproxy/Proxy/ProxyCounters.cs:411-415

sampleFixed = (long)(sampleMs * 1000.0) converts milliseconds to microseconds (×1000). The comment on line 413 says "store microseconds * 1000 (i.e. nanoseconds)" — but sampleMs * 1000 is microseconds, not nanoseconds. Snapshot() then divides by 1000.0 (line 466) to get milliseconds back, which is consistent with microsecond storage. So the code is internally consistent (ms→µs store, µs→ms read); only the comment's "(i.e. nanoseconds)" parenthetical is wrong, and the "~1 µs resolution" claim is right for microsecond storage.

Impact: None functional — the EWMA value is correct. Comment is misleading for a future maintainer.

Recommendation: Fix the line 413 comment to "store milliseconds * 1000 (i.e. microseconds)".

N7 — `BcdTagMapBuilder` includes entries implicated in an `OverlappingHighRegister` error in the returned map; relies entirely on the caller checking `Errors.Count`

File: src/Mbproxy/Bcd/BcdTagMapBuilder.cs:155-164

The builder's own comment (lines 156-160) acknowledges that OverlappingHighRegister-implicated entries are kept in the frozen map, unlike InvalidWidth/DuplicateAddress entries which are excluded. The safety of this depends on every caller treating Errors.Count > 0 as fatal. ProxyWorker.ExecuteAsync (lines 107-114) does exactly that — it skips the listener. But ConfigReconciler and any future caller must replicate the check; the asymmetry (some bad entries excluded, some included) is a latent footgun.

Impact: None today — the one production caller handles it. A future caller that builds a map and uses it without checking Errors would get a map with a known-bad overlapping 32-bit pair, and the rewriter would then mis-decode the overlapping registers (a 32-bit tag whose high register is another tag's address produces a RangeHit for both, and both would be rewritten).

Recommendation: For consistency and defence-in-depth, exclude OverlappingHighRegister-implicated entries from the frozen map the same way InvalidWidth/DuplicateAddress entries are excluded, so a map returned alongside errors is always safe to use even by a caller that forgets the check. If keeping them is deliberate (for diagnostics), document the contract on ValidationResult.Map itself, not just inside the builder.

Notes (no finding)

BCD codec correctness verified. Encode16/Decode16 round-trip cleanly for [0,9999]; (uint)value > Max16 correctly rejects negatives via unsigned wrap. Encode32/Decode32 implement CDAB low-word-first with high*10000+low, matching BcdRewriting.md and dl205.md. HasBadNibble checks all four nibbles. The FC16 request path's per-word clientLow/clientHigh > 9999 guard (lines 222-228) correctly prevents the documented silent-mutation bug where (9999,9999) would otherwise survive Encode32's 99,999,999 ceiling.
FC06 high-register partial detection is correct. ProcessFc06Request (lines 96-116) uses TryGetForRange(address,1,...) to catch the case where the written address is the high register of a configured 32-bit pair; TryGetForRange's negative-OffsetWords semantics make hit.OffsetWords < 0 the right discriminator.
Supervisor lifecycle is sound. CTS and TCS are re-armed per StartAsync, the previous CTS is disposed before reassignment, DisposeAsync is idempotent, and the response cache's eviction timer is disposed in both ReplaceContextAsync (old cache) and DisposeAsync. Polly's listener-recovery pipeline is correctly infinite (MaxRetryAttempts = int.MaxValue) with cancellation as the sole exit.
Graceful shutdown ordering is correct. ProxyWorker.StopAsync snapshots in-flight counts before base.StopAsync, drains via supervisor stop, stops the admin endpoint last, and disposes supervisors — the documented sequence. The inFlightAtCancel snapshot rationale is well-reasoned.
ProxyCounters is correctly lock-free — all increments are Interlocked, ObserveInFlight and UpdateRoundTripEwma use proper CAS loops, and Snapshot reads each field atomically.

16 KiB Raw Blame History Unescape Escape

Code Review — Proxy Lifecycle & BCD Codec

Summary

Major

M1 — FC16 request payload-length check can be bypassed by an oversized qty, enabling out-of-PDU reads/writes mitigated only by a later per-slot check

M2 — PlcListener.RunAsync swallows a faulted accept loop without distinguishing a transient fault from a permanently dead listener; the supervisor then treats every non-cancellation exit identically

Minor

N1 — ProxyWorker.LogBindFailed (EventId 21) is declared but never invoked; LogEvents.md documents it as emitted by ProxyWorker

N2 — RewriterLogEvents.ExceptionPassthrough event name mbproxy.exception.passthrough is inconsistent with the mbproxy.rewrite.* family it sits in

N3 — PlcListener.RunAsync orphans the per-pipe ContinueWith cleanup if the listener faults between Task.Run and dictionary insertion

N4 — SocketKeepalive.Apply does not catch NotSupportedException; some platforms throw it (not SocketException) for the TcpKeepAlive* options

N5 — FC03/04 response partial-overlap warning logs qty (raw request qty) while the in-range computation uses effectiveQty (clamped to response words)

N6 — ProxyCounters.UpdateRoundTripEwma comment claims ~1 µs resolution but stores nanoseconds; the fixed-point scale and the comment disagree

N7 — BcdTagMapBuilder includes entries implicated in an OverlappingHighRegister error in the returned map; relies entirely on the caller checking Errors.Count