Files
wwtools/mbproxy/codereviews/2026-05-16/Overview.md
Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings
Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:08:06 -04:00

116 lines
13 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — mbproxy (2026-05-16)
**Branch:** `mbproxy-webui-dashboard` · **HEAD:** `0308490`
**Predecessors:** `codereviews/2026-05-14/`, `codereviews/2026-05-15/` (all of their findings remediated in `554b05d`, `374eecd`, `0308490`).
## Method
This was a full-service review run as six parallel area reviews, each in its own file:
| File | Scope |
|------|-------|
| [`Multiplexing.md`](Multiplexing.md) | Backend connection layer — `Proxy/Multiplexing/`, `Proxy/Cache/`, `MbapFrame` |
| [`ProxyAndBcd.md`](ProxyAndBcd.md) | Proxy lifecycle + BCD codec — `Proxy/` core, `Proxy/Supervision/`, `Bcd/` |
| [`AdminSignalR.md`](AdminSignalR.md) | Admin endpoint, SignalR backend, `TagValueCapture`/`TagCaptureRegistry` |
| [`Frontend.md`](Frontend.md) | `Admin/wwwroot/` — HTML/CSS/JS dashboard |
| [`ConfigAndHosting.md`](ConfigAndHosting.md) | `Configuration/`, `Options/`, `Diagnostics/`, hosting, `ServiceCounters` |
| [`TestsAndConfig.md`](TestsAndConfig.md) | `tests/`, `install/`, config templates, csproj packaging |
Open each area file for the full finding text (description, impact, recommendation, line refs). This Overview consolidates, adjudicates severity where I disagree with an agent, and sets a remediation order.
## Headline verdict
The service is fundamentally sound. The hard parts — TxId multiplexing, the claim-then-dispatch watchdog, lock ordering, the BCD codec in all four directions, `Interlocked` counters, graceful-shutdown sequencing, the tab-keyed SignalR capture model — were checked and hold up. **Every Critical and Major finding from the two prior reviews is confirmed remediated with no regressions** (one frontend exception, below).
This review found **2 findings rated Critical by the area agents, 20 Major, and 38 Minor** (60 total). My own adjudication downgrades both "Criticals" to high-Major (reasoning under *Critical findings*), so I would summarise the real risk profile as: **no true Critical, ~8 Major worth scheduling, the rest opportunistic.** The two largest themes are (1) the proxy trusts self-describing Modbus wire fields more than it should, and (2) startup config validation is weaker than hot-reload config validation.
## Findings by area
| Area | Critical | Major | Minor | Most important |
|------|:--:|:--:|:--:|----------------|
| Multiplexing | 1 | 6 | 7 | C1 cache can serve a value stale across a write |
| ProxyAndBcd | 0 | 2 | 7 | M2 `PlcListener` swallows accept-loop faults |
| AdminSignalR | 0 | 2 | 7 | M1 one-cycle arm-state/tag-row inconsistency |
| Frontend | 1 | 1 | 4 | C1 `dashboard.js` `stateChip` unescaped (XSS regression) |
| ConfigAndHosting | 0 | 5 | 6 | M1 `ReloadValidator` never runs at startup |
| TestsAndConfig | 0 | 4 | 7 | M1 `AdminPort: 0` does not disable the admin endpoint |
| **Total** | **2** | **20** | **38** | |
## Critical findings — adjudicated
### Multiplexing C1 — response cache can serve a value contradicted by an in-flight write
`PlcMultiplexer.cs:826-857` (read/lookup) vs `:674-695` (invalidation on the FC06/FC16 *response*). Cache invalidation fires only when the write *response* lands. A read arriving after a write was forwarded but before its response gets a cache hit and the pre-write value. See `Multiplexing.md` C1 for the full sequence.
**My adjudication: high-Major, not Critical.** Mitigating: the cache is opt-in per tag (default OFF); the window is one backend round-trip and is already inside the documented `CacheTtlMs` staleness contract; and until the PLC acks the write, the old value is arguably the last *confirmed* value. The genuinely defective sub-case is a **write that times out** — it never produces a response, so invalidation never runs and the cache can serve a potentially-changed value for the rest of the TTL. Regardless of severity label, the fix is cheap and removes all ambiguity: **invalidate on the write request (enqueue), keeping response-side invalidation as a backstop.** Recommended.
### Frontend C1 — `dashboard.js` `stateChip` interpolates `listener.state` into `innerHTML` unescaped
`dashboard.js:48`. `detail.js`'s identical function was escaped by the prior review; the fix was not propagated to `dashboard.js`. See `Frontend.md` C1.
**My adjudication: Major, but fix immediately.** `listener.state` is a server-serialized enum, so practical exploitability today is low — but this is a straight regression from the prior XSS sweep and the fix is one character (`${escapeHtml(state)}`). It should not wait.
## Major findings worth scheduling
Grouped by theme. Full text in the area files.
**Startup vs hot-reload validation asymmetry — fix as one piece of work.**
- `ConfigAndHosting.md` M1 — `ReloadValidator.Validate` runs *only* inside the reconciler; the startup path runs only `MbproxyOptionsValidator`, which does not check duplicate `ListenPort`s, `AdminPort` collisions, duplicate PLC names, or the keepalive cross-field rule. A port-typo config starts a half-working fleet instead of failing fast — and `docs/Operations/Configuration.md:284` claims the opposite.
- `ConfigAndHosting.md` M4 — `MbproxyOptionsValidator` never range-checks `ListenPort`/`Port`/`AdminPort` or rejects empty `Host`/`Name`; `ListenPort: 0` (the omitted-key default) silently binds an ephemeral port.
- `TestsAndConfig.md` M1 — both config templates and the docs say `AdminPort: 0` *disables* the admin endpoint; it does not — Kestrel binds an ephemeral port. And `ReloadValidator` rejects `AdminPort < 1` on reload while startup accepts it. **This is the most user-facing Major: documented behaviour is false on a security-relevant setting.**
**The proxy trusts self-describing Modbus wire fields.**
- `Multiplexing.md` M2 — a single wrong MBAP `Length` desynchronises the backend stream indefinitely with no resync; the PLC goes dark until the socket happens to close.
- `Multiplexing.md` M3 — a cached FC03/FC04 payload is never validated against `2*qty`; a malformed backend response is cached and replayed to every hit for the TTL.
- `Multiplexing.md` M4 / `ProxyAndBcd.md` M1 — FC16 `qty`/`byteCount` are never cross-checked; a request whose self-describing fields disagree drives cache invalidation and partial BCD rewriting off the client's claim rather than reality. (Not memory-unsafe — per-slot bounds checks hold — but a contract gap.)
- Recommendation: add one framing-validation helper for inbound FC16 (`byteCount == 2*qty`, length consistent, `qty` within the DL260 cap) and one for cached/forwarded FC03/04 responses (`byteCount == 2*qty`); reject/teardown on mismatch.
**Hot-reload robustness.**
- `ConfigAndHosting.md` M2 — the reconciler `Restart` path removes the old supervisor from the dictionary *before* rebuilding; a transient fault during rebuild drops that PLC permanently with only an Error line. Build the new supervisor first, swap last.
- `ConfigAndHosting.md` M3 — `Resilience.BackendConnect`/`ListenerRecovery` reloads reach only added/restarted PLCs, not reseated/untouched ones — inconsistent, undocumented. Either thread a live accessor (as done for `ReadCoalescing`/`Keepalive`) or document them as restart-only.
**Connection-layer resilience.**
- `ProxyAndBcd.md` M2 — `PlcListener.RunAsync` catches and *returns* on an accept-loop fault instead of rethrowing, making the supervisor's exception-carrying `mbproxy.listener.faulted` (EventId 43) unreachable; faults are double-logged and lose their stack trace.
- `Multiplexing.md` M5 — a late-attaching coalescing party can mutate `InterestedParties` while the watchdog enumerates it, throwing `InvalidOperationException` that the outer catch turns into **permanent watchdog death for that PLC**. Snapshot the party list before fan-out.
- `Multiplexing.md` M1 — `EnsureBackendConnectedAsync` can leak a live backend socket + three tasks if `_disposed` flips during a cold-start connect. Re-check `_disposed` under `_backendLock` before publishing the socket.
- `Multiplexing.md` M6 — three failure-detection paths call `TearDownBackendAsync` fire-and-forget; a pre-await throw becomes an unobserved task exception. Attach a faulted-continuation logger.
**Silent failures.**
- `Frontend.md` M1 — `onreconnected` swallows a failed `SubscribeFleet`/`SubscribePlc`; the pill shows green while the feed is dead, with no retry. Route the warm-reconnect subscribe through the same retry as cold start.
- `AdminSignalR.md` M1 — `ReconcileArmed` and `BuildDebug` re-query the capture independently in one push cycle, so a mid-cycle disarm can push a `PlcDebugSnapshot` where `captureArmed` and the tag rows disagree for ≤1 cycle. Self-heals; derive `CaptureArmed` from the already-held `activePlcs` set instead. (I'd call this Minor — one-cycle cosmetic — but it touches the arm-state invariant the prior reviews centred on.)
- `AdminSignalR.md` M2 — `StatusBroadcaster.Start()` has no double-call guard; a second call orphans a push loop. Latent (one call site today); add an `Interlocked` flag.
## Cross-cutting Minor themes
- **Doc drift** (several): `BcdRewriting.md` uses `mbproxy.rewrite.exception_passthrough` for an event actually named `mbproxy.exception.passthrough` (`ProxyAndBcd.md` N2); `Configuration.md:284` and `HotReload.md:81/95` describe a validation model the code does not implement (`ConfigAndHosting.md` M1/N6); `ConfigReconciler.cs:364` comment claims `GetOrCreate` preserves the armed flag — it no longer does (`AdminSignalR.md` N1); `LogEvents.md` lists EventId 21 for a `ProxyWorker` event with no call site (`ProxyAndBcd.md` N1).
- **Unbounded growth on PLC churn**: frontend `prevPdu`/`rateByName` maps (`Frontend.md` N2); `ResponseCache` LRU is O(n)-per-insert with an unbounded operator-settable cap (`Multiplexing.md` N4).
- **Test hygiene**: hard-coded `AdminPort: 8080` in five test configs → parallel-run bind conflicts (`TestsAndConfig.md` M2); the sim fixture's readiness poll ignores the runner cancellation token (`TestsAndConfig.md` M3); a needless 400 ms delay in a broadcaster test (`TestsAndConfig.md` m2).
- **Best-effort swallow gaps**: `SocketKeepalive.Apply` does not catch `NotSupportedException`/`PlatformNotSupportedException` despite a "never abort a connection" contract (`ProxyAndBcd.md` N4); `EventLogBridge` has no first-failure breadcrumb where `SyslogBridge` does (`ConfigAndHosting.md` N4).
## Correction to an area finding
`TestsAndConfig.md` **m4** (and its regression-table row for prior `m6`) reports that `tests/sim/mbproxy.smoke.config.json` still cites `plans/2026-05-15-webui-dashboard.md`. **This is a false positive** — that reference was removed in commit `0308490` (task #21); the current header reads "mbproxy smoke-test configuration for the web-UI browser smoke tests." No action needed for m4.
## Prior-review regression check
All Critical and Major findings from `codereviews/2026-05-15/` are confirmed fixed and the fixes hold:
- SignalR capture-leak (prior C1/C2) — `PlcSubscriptionTracker` is tab-keyed; arm/disarm funnels through `ReconcileArmed`; the hub no longer arms captures. ✔
- XSS in `detail.js` (prior C1) — `detail.js` is fully escaped. ✔ **But the same fix was not propagated to `dashboard.js` — see Frontend C1 above (regression-by-omission).**
- Cache-hit debug-view freeze (prior TagCapture C1) — fixed via `CacheEntry.CapturedTags` replay. ✔
- Bind-failure leak (prior M5/M6) — `StartAppAsync` tears down a partially-started app/broadcaster. ✔
- The 8 Minor follow-ups (#14#21) — all closed; `TestsAndConfig.md`'s regression table confirms, with the m4/m6 false positive corrected above.
## Recommended remediation order
1. **`dashboard.js:48`** — escape `stateChip` (`${escapeHtml(state)}`). One character; closes the XSS regression.
2. **`AdminPort: 0` semantics** — either implement the documented disable (a `port == 0 → return` guard in `AdminEndpointHost.StartAppAsync`, then allow 0 in both validators) or correct the templates/docs to say the admin endpoint is always on. Pick one; today the docs lie.
3. **Single startup/reload validation gate** — run `ReloadValidator` (or a merged validator) at startup so port collisions and duplicate PLC names fail fast; fix `Configuration.md:284`.
4. **Cache invalidation on the write request** (Multiplexing C1) — invalidate overlapping FC03/04 entries when an FC06/FC16 is enqueued, not only on its response.
5. **Reconciler `Restart` ordering** (ConfigAndHosting M2) — build the new supervisor before removing the old one.
6. **`PlcListener.RunAsync` rethrow** (ProxyAndBcd M2) so the supervisor's fault path runs as designed.
7. **Watchdog party-list snapshot** (Multiplexing M5) — iterate a frozen copy so a late attach cannot kill the watchdog.
8. **Inbound FC16 / cached-response framing validation** (Multiplexing M2/M3/M4, ProxyAndBcd M1) — one helper, applied at both ends.
Items 13 are small and high-value. 48 are the connection-layer correctness work and deserve their own change with tests. The 38 Minors are opportunistic — the doc-drift cluster is worth a single sweep.