Files
wwtools/mbproxy/codereviews/2026-05-16/Overview.md
T
Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings
Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:08:06 -04:00

13 KiB
Raw Blame History

Code Review — mbproxy (2026-05-16)

Branch: mbproxy-webui-dashboard · HEAD: 0308490 Predecessors: codereviews/2026-05-14/, codereviews/2026-05-15/ (all of their findings remediated in 554b05d, 374eecd, 0308490).

Method

This was a full-service review run as six parallel area reviews, each in its own file:

File Scope
Multiplexing.md Backend connection layer — Proxy/Multiplexing/, Proxy/Cache/, MbapFrame
ProxyAndBcd.md Proxy lifecycle + BCD codec — Proxy/ core, Proxy/Supervision/, Bcd/
AdminSignalR.md Admin endpoint, SignalR backend, TagValueCapture/TagCaptureRegistry
Frontend.md Admin/wwwroot/ — HTML/CSS/JS dashboard
ConfigAndHosting.md Configuration/, Options/, Diagnostics/, hosting, ServiceCounters
TestsAndConfig.md tests/, install/, config templates, csproj packaging

Open each area file for the full finding text (description, impact, recommendation, line refs). This Overview consolidates, adjudicates severity where I disagree with an agent, and sets a remediation order.

Headline verdict

The service is fundamentally sound. The hard parts — TxId multiplexing, the claim-then-dispatch watchdog, lock ordering, the BCD codec in all four directions, Interlocked counters, graceful-shutdown sequencing, the tab-keyed SignalR capture model — were checked and hold up. Every Critical and Major finding from the two prior reviews is confirmed remediated with no regressions (one frontend exception, below).

This review found 2 findings rated Critical by the area agents, 20 Major, and 38 Minor (60 total). My own adjudication downgrades both "Criticals" to high-Major (reasoning under Critical findings), so I would summarise the real risk profile as: no true Critical, ~8 Major worth scheduling, the rest opportunistic. The two largest themes are (1) the proxy trusts self-describing Modbus wire fields more than it should, and (2) startup config validation is weaker than hot-reload config validation.

Findings by area

Area Critical Major Minor Most important
Multiplexing 1 6 7 C1 cache can serve a value stale across a write
ProxyAndBcd 0 2 7 M2 PlcListener swallows accept-loop faults
AdminSignalR 0 2 7 M1 one-cycle arm-state/tag-row inconsistency
Frontend 1 1 4 C1 dashboard.js stateChip unescaped (XSS regression)
ConfigAndHosting 0 5 6 M1 ReloadValidator never runs at startup
TestsAndConfig 0 4 7 M1 AdminPort: 0 does not disable the admin endpoint
Total 2 20 38

Critical findings — adjudicated

Multiplexing C1 — response cache can serve a value contradicted by an in-flight write

PlcMultiplexer.cs:826-857 (read/lookup) vs :674-695 (invalidation on the FC06/FC16 response). Cache invalidation fires only when the write response lands. A read arriving after a write was forwarded but before its response gets a cache hit and the pre-write value. See Multiplexing.md C1 for the full sequence.

My adjudication: high-Major, not Critical. Mitigating: the cache is opt-in per tag (default OFF); the window is one backend round-trip and is already inside the documented CacheTtlMs staleness contract; and until the PLC acks the write, the old value is arguably the last confirmed value. The genuinely defective sub-case is a write that times out — it never produces a response, so invalidation never runs and the cache can serve a potentially-changed value for the rest of the TTL. Regardless of severity label, the fix is cheap and removes all ambiguity: invalidate on the write request (enqueue), keeping response-side invalidation as a backstop. Recommended.

Frontend C1 — dashboard.js stateChip interpolates listener.state into innerHTML unescaped

dashboard.js:48. detail.js's identical function was escaped by the prior review; the fix was not propagated to dashboard.js. See Frontend.md C1.

My adjudication: Major, but fix immediately. listener.state is a server-serialized enum, so practical exploitability today is low — but this is a straight regression from the prior XSS sweep and the fix is one character (${escapeHtml(state)}). It should not wait.

Major findings worth scheduling

Grouped by theme. Full text in the area files.

Startup vs hot-reload validation asymmetry — fix as one piece of work.

  • ConfigAndHosting.md M1 — ReloadValidator.Validate runs only inside the reconciler; the startup path runs only MbproxyOptionsValidator, which does not check duplicate ListenPorts, AdminPort collisions, duplicate PLC names, or the keepalive cross-field rule. A port-typo config starts a half-working fleet instead of failing fast — and docs/Operations/Configuration.md:284 claims the opposite.
  • ConfigAndHosting.md M4 — MbproxyOptionsValidator never range-checks ListenPort/Port/AdminPort or rejects empty Host/Name; ListenPort: 0 (the omitted-key default) silently binds an ephemeral port.
  • TestsAndConfig.md M1 — both config templates and the docs say AdminPort: 0 disables the admin endpoint; it does not — Kestrel binds an ephemeral port. And ReloadValidator rejects AdminPort < 1 on reload while startup accepts it. This is the most user-facing Major: documented behaviour is false on a security-relevant setting.

The proxy trusts self-describing Modbus wire fields.

  • Multiplexing.md M2 — a single wrong MBAP Length desynchronises the backend stream indefinitely with no resync; the PLC goes dark until the socket happens to close.
  • Multiplexing.md M3 — a cached FC03/FC04 payload is never validated against 2*qty; a malformed backend response is cached and replayed to every hit for the TTL.
  • Multiplexing.md M4 / ProxyAndBcd.md M1 — FC16 qty/byteCount are never cross-checked; a request whose self-describing fields disagree drives cache invalidation and partial BCD rewriting off the client's claim rather than reality. (Not memory-unsafe — per-slot bounds checks hold — but a contract gap.)
  • Recommendation: add one framing-validation helper for inbound FC16 (byteCount == 2*qty, length consistent, qty within the DL260 cap) and one for cached/forwarded FC03/04 responses (byteCount == 2*qty); reject/teardown on mismatch.

Hot-reload robustness.

  • ConfigAndHosting.md M2 — the reconciler Restart path removes the old supervisor from the dictionary before rebuilding; a transient fault during rebuild drops that PLC permanently with only an Error line. Build the new supervisor first, swap last.
  • ConfigAndHosting.md M3 — Resilience.BackendConnect/ListenerRecovery reloads reach only added/restarted PLCs, not reseated/untouched ones — inconsistent, undocumented. Either thread a live accessor (as done for ReadCoalescing/Keepalive) or document them as restart-only.

Connection-layer resilience.

  • ProxyAndBcd.md M2 — PlcListener.RunAsync catches and returns on an accept-loop fault instead of rethrowing, making the supervisor's exception-carrying mbproxy.listener.faulted (EventId 43) unreachable; faults are double-logged and lose their stack trace.
  • Multiplexing.md M5 — a late-attaching coalescing party can mutate InterestedParties while the watchdog enumerates it, throwing InvalidOperationException that the outer catch turns into permanent watchdog death for that PLC. Snapshot the party list before fan-out.
  • Multiplexing.md M1 — EnsureBackendConnectedAsync can leak a live backend socket + three tasks if _disposed flips during a cold-start connect. Re-check _disposed under _backendLock before publishing the socket.
  • Multiplexing.md M6 — three failure-detection paths call TearDownBackendAsync fire-and-forget; a pre-await throw becomes an unobserved task exception. Attach a faulted-continuation logger.

Silent failures.

  • Frontend.md M1 — onreconnected swallows a failed SubscribeFleet/SubscribePlc; the pill shows green while the feed is dead, with no retry. Route the warm-reconnect subscribe through the same retry as cold start.
  • AdminSignalR.md M1 — ReconcileArmed and BuildDebug re-query the capture independently in one push cycle, so a mid-cycle disarm can push a PlcDebugSnapshot where captureArmed and the tag rows disagree for ≤1 cycle. Self-heals; derive CaptureArmed from the already-held activePlcs set instead. (I'd call this Minor — one-cycle cosmetic — but it touches the arm-state invariant the prior reviews centred on.)
  • AdminSignalR.md M2 — StatusBroadcaster.Start() has no double-call guard; a second call orphans a push loop. Latent (one call site today); add an Interlocked flag.

Cross-cutting Minor themes

  • Doc drift (several): BcdRewriting.md uses mbproxy.rewrite.exception_passthrough for an event actually named mbproxy.exception.passthrough (ProxyAndBcd.md N2); Configuration.md:284 and HotReload.md:81/95 describe a validation model the code does not implement (ConfigAndHosting.md M1/N6); ConfigReconciler.cs:364 comment claims GetOrCreate preserves the armed flag — it no longer does (AdminSignalR.md N1); LogEvents.md lists EventId 21 for a ProxyWorker event with no call site (ProxyAndBcd.md N1).
  • Unbounded growth on PLC churn: frontend prevPdu/rateByName maps (Frontend.md N2); ResponseCache LRU is O(n)-per-insert with an unbounded operator-settable cap (Multiplexing.md N4).
  • Test hygiene: hard-coded AdminPort: 8080 in five test configs → parallel-run bind conflicts (TestsAndConfig.md M2); the sim fixture's readiness poll ignores the runner cancellation token (TestsAndConfig.md M3); a needless 400 ms delay in a broadcaster test (TestsAndConfig.md m2).
  • Best-effort swallow gaps: SocketKeepalive.Apply does not catch NotSupportedException/PlatformNotSupportedException despite a "never abort a connection" contract (ProxyAndBcd.md N4); EventLogBridge has no first-failure breadcrumb where SyslogBridge does (ConfigAndHosting.md N4).

Correction to an area finding

TestsAndConfig.md m4 (and its regression-table row for prior m6) reports that tests/sim/mbproxy.smoke.config.json still cites plans/2026-05-15-webui-dashboard.md. This is a false positive — that reference was removed in commit 0308490 (task #21); the current header reads "mbproxy smoke-test configuration for the web-UI browser smoke tests." No action needed for m4.

Prior-review regression check

All Critical and Major findings from codereviews/2026-05-15/ are confirmed fixed and the fixes hold:

  • SignalR capture-leak (prior C1/C2) — PlcSubscriptionTracker is tab-keyed; arm/disarm funnels through ReconcileArmed; the hub no longer arms captures. ✔
  • XSS in detail.js (prior C1) — detail.js is fully escaped. ✔ But the same fix was not propagated to dashboard.js — see Frontend C1 above (regression-by-omission).
  • Cache-hit debug-view freeze (prior TagCapture C1) — fixed via CacheEntry.CapturedTags replay. ✔
  • Bind-failure leak (prior M5/M6) — StartAppAsync tears down a partially-started app/broadcaster. ✔
  • The 8 Minor follow-ups (#14#21) — all closed; TestsAndConfig.md's regression table confirms, with the m4/m6 false positive corrected above.
  1. dashboard.js:48 — escape stateChip (${escapeHtml(state)}). One character; closes the XSS regression.
  2. AdminPort: 0 semantics — either implement the documented disable (a port == 0 → return guard in AdminEndpointHost.StartAppAsync, then allow 0 in both validators) or correct the templates/docs to say the admin endpoint is always on. Pick one; today the docs lie.
  3. Single startup/reload validation gate — run ReloadValidator (or a merged validator) at startup so port collisions and duplicate PLC names fail fast; fix Configuration.md:284.
  4. Cache invalidation on the write request (Multiplexing C1) — invalidate overlapping FC03/04 entries when an FC06/FC16 is enqueued, not only on its response.
  5. Reconciler Restart ordering (ConfigAndHosting M2) — build the new supervisor before removing the old one.
  6. PlcListener.RunAsync rethrow (ProxyAndBcd M2) so the supervisor's fault path runs as designed.
  7. Watchdog party-list snapshot (Multiplexing M5) — iterate a frozen copy so a late attach cannot kill the watchdog.
  8. Inbound FC16 / cached-response framing validation (Multiplexing M2/M3/M4, ProxyAndBcd M1) — one helper, applied at both ends.

Items 13 are small and high-value. 48 are the connection-layer correctness work and deserve their own change with tests. The 38 Minors are opportunistic — the doc-drift cluster is worth a single sweep.