wwtools/mbproxy/codereviews/2026-05-16/ProxyAndBcd.md

# Code Review — Proxy Lifecycle & BCD Codec

**Date:** 2026-05-16
**Branch:** `mbproxy-webui-dashboard` (HEAD `0308490`)
**Scope:** `src/Mbproxy/Proxy/{PlcListener,ProxyWorker,ProxyCounters,SocketKeepalive,BcdPduPipeline,NoopPduPipeline,IPduPipeline,PerPlcContext,RewriterLogEvents}.cs`, `src/Mbproxy/Proxy/Supervision/*`, `src/Mbproxy/Bcd/*`. Excludes `Proxy/Multiplexing/`, `Proxy/Cache/`, and `TagValueCapture`/`TagCaptureRegistry`.

## Summary

The proxy-lifecycle and BCD-codec subsystems are in good shape. The BCD codec itself (`BcdCodec`) is correct in all four directions, range-checked, and allocation-free; the 32-bit CDAB word order matches the documented `high*10000+low` convention and is consistently applied through `BcdPduPipeline` for FC03/04/06/16. Partial-overlap handling, invalid-BCD passthrough, exception passthrough, and the FC16 per-word base-10000 guard are all implemented correctly. The supervisor's Polly recovery loop, CTS/TCS re-arming, and graceful-shutdown sequencing are carefully built and well-documented. No Critical issues were found. The findings below are one Major correctness gap in the FC16 request length check that admits an integer-overflow bypass, plus a small set of resource/robustness and maintainability items.

**Findings by severity:** Critical 0 · Major 2 · Minor 7

---

## Major

### M1 — FC16 request payload-length check can be bypassed by an oversized `qty`, enabling out-of-PDU reads/writes mitigated only by a later per-slot check

**File:** `src/Mbproxy/Proxy/BcdPduPipeline.cs:158-167`

`ProcessFc16Request` reads `qty` straight from the wire (`pdu[3]<<8 | pdu[4]`, range 0–65535) and then guards with:

```csharp
if (pdu.Length < 6 + qty * 2)
    return;
```

`qty` is `ushort`; `qty * 2` is computed in `int` so the multiplication itself does not overflow (max 131070), and `6 + qty*2` is fine. So the arithmetic is actually safe — but the *intent* documented in the comment ("a client claiming qty=10 with only 4 bytes of register data would otherwise have its BCD slots silently skipped") is only half-delivered. The real exposure is the opposite case the comment does not mention: the PDU is *delivered* in a buffer larger than the framed PDU. `pdu` here is a `Span<byte>` whose length is the parsed PDU length from the MBAP length field. If a malicious or buggy client sends an MBAP length that is large but a `byteCount`/`qty` that is internally inconsistent, the `pdu.Length < 6 + qty*2` check passes when `pdu.Length` is large, and then the per-slot `lowByteOff + 2 > pdu.Length` checks (lines 200, 261) are the only thing keeping the writes in bounds. That secondary check *is* present and correct, so no actual out-of-bounds write occurs — but the FC16 path never validates `byteCount` (`pdu[5]`) against `qty` at all. A request with `qty=2`, `byteCount=4`, but only the FC16-min 6 bytes plus a short tail will be partially rewritten against whatever stale bytes follow.

**Impact:** Not a memory-safety bug (the per-slot bounds checks hold), but a malformed FC16 whose `byteCount` disagrees with `qty` is silently partially rewritten instead of being passed through for the PLC to reject. The rewriter mutates register bytes the client never intended as register data. This is a wire-protocol-correctness gap for adversarial/buggy input.

**Recommendation:** Add an explicit `byteCount` consistency check after reading `qty`: `byte byteCount = pdu[5]; if (byteCount != qty * 2 || pdu.Length < 6 + byteCount) return;`. This makes the rewriter pass through any FC16 whose self-describing fields disagree, matching the documented "let the PLC's validator surface the protocol error" policy and removing reliance on the per-slot check as the safety net.

### M2 — `PlcListener.RunAsync` swallows a faulted accept loop without distinguishing a transient fault from a permanently dead listener; the supervisor then treats every non-cancellation exit identically

**File:** `src/Mbproxy/Proxy/PlcListener.cs:152-161`, interacting with `Supervision/PlcListenerSupervisor.cs:386-432`

`PlcListener.RunAsync` catches *every* non-`OperationCanceledException` exception, logs `mbproxy.listener.faulted`, and **returns normally**. Because it returns rather than rethrows, the supervisor's `await listener.RunAsync(token)` (line 388) completes without throwing. Control then falls through to lines 418-432, which treat a normal return as "listener accept loop ended unexpectedly", increment `RecoveryAttempts`, log `mbproxy.listener.ended`, and throw `InvalidOperationException` to drive a Polly retry.

The net effect is that a genuine accept-loop fault (e.g. `SocketException` from `AcceptSocketAsync`) is reported to operators as `mbproxy.listener.faulted` **and then** `mbproxy.listener.ended`, and the recovery counter is incremented once for the same event. The supervisor's own `catch (Exception runEx)` block at line 397 — which exists specifically to log the fault with the exception object attached and increment the counter — is **unreachable for any fault originating inside `RunAsync`**, because `RunAsync` never lets the exception escape.

**Impact:** Double logging (two distinct event names for one fault), the `mbproxy.listener.faulted` supervised emission (EventId 43, Warning, *with* stack trace) never fires for accept-loop faults — only the unsupervised `PlcListener` emission (EventId 22, Error, *no* exception object) does — so operators lose the stack trace the doc (`LogEvents.md` line 99) promises. The supervisor's `IncrementRecoveryAttempt(runEx.Message)` with the real reason is also never reached; the counter is bumped with the generic "Listener accept loop ended unexpectedly" string instead.

**Recommendation:** Have `PlcListener.RunAsync` rethrow after logging (`catch (Exception ex) { LogListenerFaulted(...); throw; }`), so the supervisor's `catch (Exception runEx)` path handles it as designed. Then the supervisor logs `mbproxy.listener.faulted` (EventId 43) with the exception, and the "ended unexpectedly" path is reserved for the genuinely-clean-return case it was written for. Alternatively, drop the supervisor's dead `catch (Exception runEx)` block and accept the current behaviour — but the current split is misleading and contradicts the documented event semantics.

---

## Minor

### N1 — `ProxyWorker.LogBindFailed` (EventId 21) is declared but never invoked; `LogEvents.md` documents it as emitted by `ProxyWorker`

**File:** `src/Mbproxy/Proxy/ProxyWorker.cs:382-385`; doc `docs/Reference/LogEvents.md:63`

The `[LoggerMessage]` `LogBindFailed` partial method (EventId 21, `mbproxy.startup.bind.failed`) is defined in `ProxyWorker` but has no call site — all bind-failure logging happens in `PlcListenerSupervisor.LogBindFailed` (EventId 41). `LogEvents.md` line 63 lists EventId 21 / `ProxyWorker.cs` as a source for `mbproxy.startup.bind.failed`, which no longer matches the code.

**Impact:** Dead code; misleading documentation. An operator filtering on EventId 21 will never see a hit.

**Recommendation:** Delete the unused `LogBindFailed` declaration from `ProxyWorker.cs` and remove the `21 (ProxyWorker)` / `src/Mbproxy/Proxy/ProxyWorker.cs` references from the `mbproxy.startup.bind.failed` table in `LogEvents.md`.

### N2 — `RewriterLogEvents.ExceptionPassthrough` event name `mbproxy.exception.passthrough` is inconsistent with the `mbproxy.rewrite.*` family it sits in

**File:** `src/Mbproxy/Proxy/RewriterLogEvents.cs:46-55`

`PartialBcd` and `InvalidBcd` use `mbproxy.rewrite.partial_bcd` / `mbproxy.rewrite.invalid_bcd`, but `ExceptionPassthrough` uses `mbproxy.exception.passthrough`. `LogEvents.md` (line 557) explicitly lists `exception` as its own `<area>`, so this is intentional and *documented* — but `BcdRewriting.md:146` calls it `mbproxy.rewrite.exception_passthrough`, and `BcdRewriting.md:232` repeats `mbproxy.rewrite.exception_passthrough`. The code and `LogEvents.md` agree on `mbproxy.exception.passthrough`; `BcdRewriting.md` is wrong in two places.

**Impact:** Doc drift. An operator following `BcdRewriting.md` to build a log filter would grep a name that is never emitted.

**Recommendation:** Fix the two occurrences in `docs/Features/BcdRewriting.md` (lines 146 and 232) to `mbproxy.exception.passthrough` to match `LogEvents.md` and the source.

### N3 — `PlcListener.RunAsync` orphans the per-pipe `ContinueWith` cleanup if the listener faults between `Task.Run` and dictionary insertion

**File:** `src/Mbproxy/Proxy/PlcListener.cs:136-149`

The accept loop creates `pipeTask`, inserts it into `_pipeTasks`, then attaches a `ContinueWith` that removes the entry. If `DisposeAsync` runs concurrently (it snapshots `_pipeTasks.Values` at line 178), there is a benign window where a pipe task started after the snapshot is not awaited — acceptable best-effort. More notably, the `ContinueWith` continuation uses `TaskScheduler.Default` and discards its own task (`_ = ...`); if the continuation itself throws (it cannot here, `TryRemove` does not throw) it would be an unobserved exception. Low risk, but the pattern of fire-and-forget `ContinueWith` for cleanup is fragile.

**Impact:** None in practice — `TryRemove` cannot throw. Maintainability only.

**Recommendation:** Prefer awaiting cleanup inside the `Task.Run` lambda's `finally` (the lambda already has a `finally` that disposes the pipe — add `_pipeTasks.TryRemove(pipe.Id, out _);` there) and drop the separate `ContinueWith`. This collapses two scheduling primitives into one and removes the discarded-task wart.

### N4 — `SocketKeepalive.Apply` does not catch `NotSupportedException`; some platforms throw it (not `SocketException`) for the `TcpKeepAlive*` options

**File:** `src/Mbproxy/Proxy/SocketKeepalive.cs:36-47`

`Apply` catches `SocketException` and `ObjectDisposedException`. The three `TcpKeepAlive*` socket options are documented as "not honoured on every platform"; on platforms/runtimes where the option is unrecognised, `Socket.SetSocketOption` can throw `SocketException(SocketError.ProtocolOption)` (caught) **or**, on some older runtimes/OSes, `NotSupportedException` / `PlatformNotSupportedException` for the named option enum. Those would escape and — per the comment's own intent ("must never abort a connection") — defeat the best-effort contract.

**Impact:** On a platform that throws `PlatformNotSupportedException` for `TcpKeepAliveRetryCount`, applying keepalive to a backend or accepted upstream socket would throw out of the constructor / connect path, aborting the connection. The `SO_KEEPALIVE` master option (line 29) is universally supported, so the realistic failure window is narrow, but the swallow set is incomplete relative to the stated guarantee.

**Recommendation:** Broaden the catch to also include `NotSupportedException` (which `PlatformNotSupportedException` derives from) — or simplest, catch `Exception` here given the explicit "swallow everything, keepalive is best-effort" contract documented in the XML summary.

### N5 — FC03/04 response partial-overlap warning logs `qty` (raw request qty) while the in-range computation uses `effectiveQty` (clamped to response words)

**File:** `src/Mbproxy/Proxy/BcdPduPipeline.cs:348,362-370`

`ProcessResponse` clamps `effectiveQty = min(qty, wordsInResponse)` and uses `effectiveQty` for the `lowInRange`/`highInRange` checks (lines 362-363), which is correct. But the partial-BCD warning at line 367-368 logs `startAddress, qty` — the *original* request qty, not `effectiveQty`. If a PLC returns a short response (`byteCount` smaller than `qty*2`), a 32-bit tag that *would* have been fully in range for the requested `qty` is reported as a partial overlap with the full `qty`, making the warning look like a client/config straddle when it is actually a truncated response.

**Impact:** Misleading `mbproxy.rewrite.partial_bcd` warning in the (rare) short-response case; an operator would chase a client tag-map mismatch that does not exist.

**Recommendation:** Either log `effectiveQty` in the response-path `PartialBcd` call, or — better — distinguish the two causes: a short response is a backend/transport anomaly, not a client straddle, and arguably warrants a different (or no) warning. At minimum make the logged qty match the qty actually used for the in-range decision.

### N6 — `ProxyCounters.UpdateRoundTripEwma` comment claims ~1 µs resolution but stores nanoseconds; the fixed-point scale and the comment disagree

**File:** `src/Mbproxy/Proxy/ProxyCounters.cs:411-415`

`sampleFixed = (long)(sampleMs * 1000.0)` converts milliseconds to microseconds (×1000). The comment on line 413 says "store microseconds * 1000 (i.e. nanoseconds)" — but `sampleMs * 1000` is microseconds, not nanoseconds. `Snapshot()` then divides by 1000.0 (line 466) to get milliseconds back, which is consistent with microsecond storage. So the *code* is internally consistent (ms→µs store, µs→ms read); only the comment's "(i.e. nanoseconds)" parenthetical is wrong, and the "~1 µs resolution" claim is right for microsecond storage.

**Impact:** None functional — the EWMA value is correct. Comment is misleading for a future maintainer.

**Recommendation:** Fix the line 413 comment to "store milliseconds * 1000 (i.e. microseconds)".

### N7 — `BcdTagMapBuilder` includes entries implicated in an `OverlappingHighRegister` error in the returned map; relies entirely on the caller checking `Errors.Count`

**File:** `src/Mbproxy/Bcd/BcdTagMapBuilder.cs:155-164`

The builder's own comment (lines 156-160) acknowledges that `OverlappingHighRegister`-implicated entries are *kept* in the frozen map, unlike `InvalidWidth`/`DuplicateAddress` entries which are excluded. The safety of this depends on every caller treating `Errors.Count > 0` as fatal. `ProxyWorker.ExecuteAsync` (lines 107-114) does exactly that — it skips the listener. But `ConfigReconciler` and any future caller must replicate the check; the asymmetry (some bad entries excluded, some included) is a latent footgun.

**Impact:** None today — the one production caller handles it. A future caller that builds a map and uses it without checking `Errors` would get a map with a known-bad overlapping 32-bit pair, and the rewriter would then mis-decode the overlapping registers (a 32-bit tag whose high register is another tag's address produces a `RangeHit` for both, and both would be rewritten).

**Recommendation:** For consistency and defence-in-depth, exclude `OverlappingHighRegister`-implicated entries from the frozen map the same way `InvalidWidth`/`DuplicateAddress` entries are excluded, so a map returned alongside errors is always safe to use even by a caller that forgets the check. If keeping them is deliberate (for diagnostics), document the contract on `ValidationResult.Map` itself, not just inside the builder.

---

## Notes (no finding)

- **BCD codec correctness verified.** `Encode16`/`Decode16` round-trip cleanly for `[0,9999]`; `(uint)value > Max16` correctly rejects negatives via unsigned wrap. `Encode32`/`Decode32` implement CDAB low-word-first with `high*10000+low`, matching `BcdRewriting.md` and `dl205.md`. `HasBadNibble` checks all four nibbles. The FC16 request path's per-word `clientLow/clientHigh > 9999` guard (lines 222-228) correctly prevents the documented silent-mutation bug where `(9999,9999)` would otherwise survive `Encode32`'s 99,999,999 ceiling.
- **FC06 high-register partial detection is correct.** `ProcessFc06Request` (lines 96-116) uses `TryGetForRange(address,1,...)` to catch the case where the written address is the *high* register of a configured 32-bit pair; `TryGetForRange`'s negative-`OffsetWords` semantics make `hit.OffsetWords < 0` the right discriminator.
- **Supervisor lifecycle is sound.** CTS and TCS are re-armed per `StartAsync`, the previous CTS is disposed before reassignment, `DisposeAsync` is idempotent, and the response cache's eviction timer is disposed in both `ReplaceContextAsync` (old cache) and `DisposeAsync`. Polly's listener-recovery pipeline is correctly infinite (`MaxRetryAttempts = int.MaxValue`) with cancellation as the sole exit.
- **Graceful shutdown ordering is correct.** `ProxyWorker.StopAsync` snapshots in-flight counts before `base.StopAsync`, drains via supervisor stop, stops the admin endpoint last, and disposes supervisors — the documented sequence. The `inFlightAtCancel` snapshot rationale is well-reasoned.
- **`ProxyCounters` is correctly lock-free** — all increments are `Interlocked`, `ObserveInFlight` and `UpdateRoundTripEwma` use proper CAS loops, and `Snapshot` reads each field atomically.