Code Review — mbproxy (2026-05-16)
Branch: mbproxy-webui-dashboard · HEAD: 0308490
Predecessors: codereviews/2026-05-14/, codereviews/2026-05-15/ (all of their findings remediated in 554b05d, 374eecd, 0308490).
Method
This was a full-service review run as six parallel area reviews, each in its own file:
| File | Scope |
|---|---|
| Multiplexing.md | Backend connection layer: Proxy/Multiplexing/, Proxy/Cache/, MbapFrame |
| ProxyAndBcd.md | Proxy lifecycle + BCD codec: Proxy/ core, Proxy/Supervision/, Bcd/ |
| AdminSignalR.md | Admin endpoint, SignalR backend, TagValueCapture/TagCaptureRegistry |
| Frontend.md | Admin/wwwroot/ (HTML/CSS/JS dashboard) |
| ConfigAndHosting.md | Configuration/, Options/, Diagnostics/, hosting, ServiceCounters |
| TestsAndConfig.md | tests/, install/, config templates, csproj packaging |
Open each area file for the full finding text (description, impact, recommendation, line refs). This Overview consolidates, adjudicates severity where I disagree with an agent, and sets a remediation order.
Headline verdict
The service is fundamentally sound. The hard parts — TxId multiplexing, the claim-then-dispatch watchdog, lock ordering, the BCD codec in all four directions, Interlocked counters, graceful-shutdown sequencing, the tab-keyed SignalR capture model — were checked and hold up. Every Critical and Major finding from the two prior reviews is confirmed remediated with no regressions (one frontend exception, below).
This review found 2 findings rated Critical by the area agents, 20 Major, and 38 Minor (60 total). My own adjudication downgrades both "Criticals" to high-Major (reasoning under Critical findings), so I would summarise the real risk profile as: no true Critical, ~8 Major worth scheduling, the rest opportunistic. The two largest themes are (1) the proxy trusts self-describing Modbus wire fields more than it should, and (2) startup config validation is weaker than hot-reload config validation.
Findings by area
| Area | Critical | Major | Minor | Most important |
|---|---|---|---|---|
| Multiplexing | 1 | 6 | 7 | C1 cache can serve a value stale across a write |
| ProxyAndBcd | 0 | 2 | 7 | M2 PlcListener swallows accept-loop faults |
| AdminSignalR | 0 | 2 | 7 | M1 one-cycle arm-state/tag-row inconsistency |
| Frontend | 1 | 1 | 4 | C1 dashboard.js stateChip unescaped (XSS regression) |
| ConfigAndHosting | 0 | 5 | 6 | M1 ReloadValidator never runs at startup |
| TestsAndConfig | 0 | 4 | 7 | M1 AdminPort: 0 does not disable the admin endpoint |
| Total | 2 | 20 | 38 | |
Critical findings — adjudicated
Multiplexing C1 — response cache can serve a value contradicted by an in-flight write
PlcMultiplexer.cs:826-857 (read/lookup) vs :674-695 (invalidation on the FC06/FC16 response). Cache invalidation fires only when the write response lands. A read arriving after a write was forwarded but before its response gets a cache hit and the pre-write value. See Multiplexing.md C1 for the full sequence.
My adjudication: high-Major, not Critical. Mitigating: the cache is opt-in per tag (default OFF); the window is one backend round-trip and is already inside the documented CacheTtlMs staleness contract; and until the PLC acks the write, the old value is arguably the last confirmed value. The genuinely defective sub-case is a write that times out — it never produces a response, so invalidation never runs and the cache can serve a potentially-changed value for the rest of the TTL. Regardless of severity label, the fix is cheap and removes all ambiguity: invalidate on the write request (enqueue), keeping response-side invalidation as a backstop. Recommended.
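The recommended ordering can be illustrated with a small sketch. This is JavaScript for illustration only, not the C# in PlcMultiplexer.cs; the SketchCache class, its key format, and the two hook names are assumptions made for the example.

```javascript
// Hypothetical sketch: names and shapes are illustrative, not mbproxy's API.
class SketchCache {
  constructor() {
    this.entries = new Map(); // key "start:qty" -> { start, qty, value }
  }

  put(start, qty, value) {
    this.entries.set(`${start}:${qty}`, { start, qty, value });
  }

  get(start, qty) {
    const hit = this.entries.get(`${start}:${qty}`);
    return hit ? hit.value : undefined;
  }

  // Drop every cached read whose register range overlaps [start, start + qty).
  invalidateRange(start, qty) {
    const end = start + qty;
    for (const [key, e] of this.entries) {
      if (e.start < end && start < e.start + e.qty) this.entries.delete(key);
    }
  }
}

// Invalidate when the write is enqueued, so a read that arrives before the
// write response (or after a write that times out) cannot get a stale hit.
function onWriteEnqueued(cache, start, qty) {
  cache.invalidateRange(start, qty);
}

// Response-side invalidation stays as a backstop for entries cached in between.
function onWriteResponse(cache, start, qty) {
  cache.invalidateRange(start, qty);
}
```

With this shape, a write that never receives a response has already evicted the overlapping entries, so the timed-out sub-case no longer depends on the TTL.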
Frontend C1 — dashboard.js stateChip interpolates listener.state into innerHTML unescaped
dashboard.js:48. detail.js's identical function was escaped by the prior review; the fix was not propagated to dashboard.js. See Frontend.md C1.
My adjudication: Major, but fix immediately. listener.state is a server-serialized enum, so practical exploitability today is low, but this is a straight regression from the prior XSS sweep and the fix is a one-line change (${escapeHtml(state)}). It should not wait.
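For reference, the escaped form looks roughly like the sketch below. The real escapeHtml and stateChip bodies are not reproduced in this overview, so both functions here, including the chip markup and class names, are assumed stand-ins.

```javascript
// Hypothetical stand-ins: the real escapeHtml/stateChip bodies are not shown
// in this review, so these are assumptions for illustration.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

function stateChip(state) {
  // Escape once, then reuse the safe string for both the class and the label,
  // since both end up inside innerHTML.
  const safe = escapeHtml(state);
  return `<span class="chip chip-${safe}">${safe}</span>`;
}
```

Replacing the ampersand first matters: doing it later would re-escape the entities produced by the other replacements.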
Major findings worth scheduling
Grouped by theme. Full text in the area files.
Startup vs hot-reload validation asymmetry — fix as one piece of work.
- ConfigAndHosting.md M1: ReloadValidator.Validate runs only inside the reconciler; the startup path runs only MbproxyOptionsValidator, which does not check duplicate ListenPorts, AdminPort collisions, duplicate PLC names, or the keepalive cross-field rule. A port-typo config starts a half-working fleet instead of failing fast, and docs/Operations/Configuration.md:284 claims the opposite.
- ConfigAndHosting.md M4: MbproxyOptionsValidator never range-checks ListenPort/Port/AdminPort or rejects empty Host/Name; ListenPort: 0 (the omitted-key default) silently binds an ephemeral port.
- TestsAndConfig.md M1: both config templates and the docs say AdminPort: 0 disables the admin endpoint; it does not, and Kestrel binds an ephemeral port instead. ReloadValidator also rejects AdminPort < 1 on reload while startup accepts it. This is the most user-facing Major: documented behaviour is false on a security-relevant setting.
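A single gate covering the checks listed above might look like the following sketch. It is JavaScript for illustration, not the C# validators; the config shape (a Plcs array), the function name, and the adminPortZeroDisables switch are assumptions, and the keepalive cross-field rule is left out because its definition is not given here.

```javascript
// Hypothetical config shape: { AdminPort, Plcs: [{ Name, Host, Port, ListenPort }] }.
// Field names follow the review text; the "Plcs" container and the
// adminPortZeroDisables switch are assumptions.
function validateConfig(cfg, { adminPortZeroDisables = false } = {}) {
  const errors = [];
  const inRange = (p) => Number.isInteger(p) && p >= 1 && p <= 65535;

  const adminDisabled = adminPortZeroDisables && cfg.AdminPort === 0;
  if (!adminDisabled && !inRange(cfg.AdminPort)) {
    errors.push(`AdminPort ${cfg.AdminPort} is out of range 1-65535`);
  }

  const seenPorts = new Map(); // port -> owner, used to report collisions
  if (!adminDisabled && inRange(cfg.AdminPort)) seenPorts.set(cfg.AdminPort, "AdminPort");
  const seenNames = new Set();

  for (const plc of cfg.Plcs ?? []) {
    const label = plc.Name || "(unnamed)";
    if (!plc.Name) errors.push("PLC entry has an empty Name");
    else if (seenNames.has(plc.Name)) errors.push(`duplicate PLC name ${plc.Name}`);
    else seenNames.add(plc.Name);

    if (!plc.Host) errors.push(`${label}: empty Host`);
    if (!inRange(plc.Port)) errors.push(`${label}: Port ${plc.Port} is out of range`);

    if (!inRange(plc.ListenPort)) {
      errors.push(`${label}: ListenPort ${plc.ListenPort} is out of range`);
    } else if (seenPorts.has(plc.ListenPort)) {
      errors.push(`${label}: ListenPort ${plc.ListenPort} collides with ${seenPorts.get(plc.ListenPort)}`);
    } else {
      seenPorts.set(plc.ListenPort, label);
    }
  }
  return errors;
}
```

Calling the same function from both the startup path and the reload path is the point of the exercise: whichever AdminPort: 0 semantics is chosen, it is then decided in one place.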
The proxy trusts self-describing Modbus wire fields.
- Multiplexing.md M2: a single wrong MBAP Length desynchronises the backend stream indefinitely with no resync; the PLC goes dark until the socket happens to close.
- Multiplexing.md M3: a cached FC03/FC04 payload is never validated against 2*qty; a malformed backend response is cached and replayed to every hit for the TTL.
- Multiplexing.md M4 / ProxyAndBcd.md M1: FC16 qty/byteCount are never cross-checked; a request whose self-describing fields disagree drives cache invalidation and partial BCD rewriting off the client's claim rather than reality. (Not memory-unsafe, since per-slot bounds checks hold, but a contract gap.)
- Recommendation: add one framing-validation helper for inbound FC16 (byteCount == 2*qty, length consistent, qty within the DL260 cap) and one for cached/forwarded FC03/04 responses (byteCount == 2*qty); reject/teardown on mismatch.
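The two helpers in the recommendation can be sketched as below, using the standard Modbus PDU layouts (FC16 request: function code, 2-byte start address, 2-byte quantity, 1-byte byte count, data; FC03/04 response: function code, 1-byte byte count, data). The function names are hypothetical, and because the DL260 cap is not stated here the limit is a parameter that defaults to the Modbus specification maximum of 123 registers.

```javascript
// Hypothetical helpers operating on a bare PDU (no MBAP header) as a Uint8Array.
// maxQty defaults to the Modbus spec limit of 123 for FC16; the DL260 cap the
// review mentions is not stated, so pass it in if it is lower.
function validateFc16Request(pdu, maxQty = 123) {
  if (pdu.length < 6 || pdu[0] !== 0x10) return false;
  const qty = (pdu[3] << 8) | pdu[4]; // big-endian quantity of registers
  const byteCount = pdu[5];
  return qty >= 1 && qty <= maxQty
    && byteCount === 2 * qty
    && pdu.length === 6 + byteCount;
}

// Checks a normal FC03/FC04 response against the quantity that was requested.
// Exception responses (function code with the high bit set) are not handled
// here and return false; a caller would route them separately.
function validateReadResponse(pdu, requestedQty) {
  if (pdu.length < 2 || (pdu[0] !== 0x03 && pdu[0] !== 0x04)) return false;
  const byteCount = pdu[1];
  return byteCount === 2 * requestedQty && pdu.length === 2 + byteCount;
}
```

A false result maps to the reject/teardown decision above: refuse the inbound request, or decline to cache and forward the response.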
Hot-reload robustness.
- ConfigAndHosting.md M2: the reconciler Restart path removes the old supervisor from the dictionary before rebuilding; a transient fault during rebuild drops that PLC permanently with only an Error line. Build the new supervisor first, swap last.
- ConfigAndHosting.md M3: Resilience.BackendConnect/ListenerRecovery reloads reach only added/restarted PLCs, not reseated/untouched ones, which is inconsistent and undocumented. Either thread a live accessor (as done for ReadCoalescing/Keepalive) or document them as restart-only.
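The build-first, swap-last ordering for M2 is sketched below in JavaScript. The supervisors map, the build callback, and the stop method are all assumed names, not the reconciler's actual members.

```javascript
// Hypothetical sketch: `supervisors` is a Map of name -> supervisor, and
// `build` is whatever constructs a replacement; both are assumptions.
function restartSupervisor(supervisors, name, build) {
  const old = supervisors.get(name);
  // Build first: if this throws, the map still holds the old supervisor.
  const next = build(name);
  // Swap last, then retire the old instance.
  supervisors.set(name, next);
  if (old && typeof old.stop === "function") old.stop();
  return next;
}
```

One caveat this sketch glosses over: if the old supervisor still holds the listen socket, the replacement presumably cannot bind until the old one stops, so "build" here means constructing without binding. That split is an assumption about how the real supervisor could be structured.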
Connection-layer resilience.
- ProxyAndBcd.md M2: PlcListener.RunAsync catches and returns on an accept-loop fault instead of rethrowing, making the supervisor's exception-carrying mbproxy.listener.faulted (EventId 43) unreachable; faults are double-logged and lose their stack trace.
- Multiplexing.md M5: a late-attaching coalescing party can mutate InterestedParties while the watchdog enumerates it, throwing InvalidOperationException that the outer catch turns into permanent watchdog death for that PLC. Snapshot the party list before fan-out.
- Multiplexing.md M1: EnsureBackendConnectedAsync can leak a live backend socket + three tasks if _disposed flips during a cold-start connect. Re-check _disposed under _backendLock before publishing the socket.
- Multiplexing.md M6: three failure-detection paths call TearDownBackendAsync fire-and-forget; a pre-await throw becomes an unobserved task exception. Attach a faulted-continuation logger.
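The snapshot-before-fan-out idea from M5 is shown below as a language-neutral illustration in JavaScript. JavaScript collections do not throw on mutation during iteration the way .NET's do (an unsnapshotted Set would instead keep visiting the late arrivals), so this demonstrates only the frozen-copy pattern; fanOutTimeout and its parameters are hypothetical names.

```javascript
// Hypothetical sketch: `parties` stands in for the InterestedParties list.
function fanOutTimeout(parties, notify) {
  // Copy first so that parties attached during fan-out are not visited
  // and cannot disturb this pass.
  const snapshot = Array.from(parties);
  for (const party of snapshot) {
    notify(party);
  }
  return snapshot.length; // number of parties notified in this pass
}
```

A party that attaches mid-pass is simply picked up by the next watchdog pass rather than altering the one in progress.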
Silent failures.
- Frontend.md M1: onreconnected swallows a failed SubscribeFleet/SubscribePlc; the pill shows green while the feed is dead, with no retry. Route the warm-reconnect subscribe through the same retry as cold start.
- AdminSignalR.md M1: ReconcileArmed and BuildDebug re-query the capture independently in one push cycle, so a mid-cycle disarm can push a PlcDebugSnapshot where captureArmed and the tag rows disagree for ≤1 cycle. Self-heals; derive CaptureArmed from the already-held activePlcs set instead. (I'd call this Minor, a one-cycle cosmetic issue, but it touches the arm-state invariant the prior reviews centred on.)
- AdminSignalR.md M2: StatusBroadcaster.Start() has no double-call guard; a second call orphans a push loop. Latent (one call site today); add an Interlocked flag.
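For Frontend M1, a shared retry wrapper that both the cold-start and warm-reconnect paths could call might look like this. The helper name, the attempt count, and the delay are assumptions; the dashboard's existing cold-start retry is not shown in this overview and may differ.

```javascript
// Hypothetical helper: `subscribe` is any function returning a promise, for
// example () => connection.invoke("SubscribeFleet"). Attempt count and delay
// are illustrative defaults.
async function subscribeWithRetry(subscribe, { attempts = 5, delayMs = 1000 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await subscribe();
    } catch (err) {
      lastError = err;
      // Wait between attempts, but not after the final failure.
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // let the caller mark the feed as degraded
}
```

Wired into the SignalR client's onreconnected callback, the rejection from the final failed attempt gives the UI a signal to turn the pill from green to a degraded state instead of silently showing a dead feed.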
Cross-cutting Minor themes
- Doc drift (several): BcdRewriting.md uses mbproxy.rewrite.exception_passthrough for an event actually named mbproxy.exception.passthrough (ProxyAndBcd.md N2); Configuration.md:284 and HotReload.md:81/95 describe a validation model the code does not implement (ConfigAndHosting.md M1/N6); ConfigReconciler.cs:364 comment claims GetOrCreate preserves the armed flag, which it no longer does (AdminSignalR.md N1); LogEvents.md lists EventId 21 for a ProxyWorker event with no call site (ProxyAndBcd.md N1).
- Unbounded growth on PLC churn: frontend prevPdu/rateByName maps (Frontend.md N2); ResponseCache LRU is O(n)-per-insert with an unbounded operator-settable cap (Multiplexing.md N4).
- Test hygiene: hard-coded AdminPort: 8080 in five test configs → parallel-run bind conflicts (TestsAndConfig.md M2); the sim fixture's readiness poll ignores the runner cancellation token (TestsAndConfig.md M3); a needless 400 ms delay in a broadcaster test (TestsAndConfig.md m2).
- Best-effort swallow gaps: SocketKeepalive.Apply does not catch NotSupportedException/PlatformNotSupportedException despite a "never abort a connection" contract (ProxyAndBcd.md N4); EventLogBridge has no first-failure breadcrumb where SyslogBridge does (ConfigAndHosting.md N4).
Correction to an area finding
TestsAndConfig.md m4 (and its regression-table row for prior m6) reports that tests/sim/mbproxy.smoke.config.json still cites plans/2026-05-15-webui-dashboard.md. This is a false positive — that reference was removed in commit 0308490 (task #21); the current header reads "mbproxy smoke-test configuration for the web-UI browser smoke tests." No action needed for m4.
Prior-review regression check
All Critical and Major findings from codereviews/2026-05-15/ are confirmed fixed and the fixes hold:
- SignalR capture-leak (prior C1/C2): PlcSubscriptionTracker is tab-keyed; arm/disarm funnels through ReconcileArmed; the hub no longer arms captures. ✔
- XSS in detail.js (prior C1): detail.js is fully escaped. ✔ But the same fix was not propagated to dashboard.js; see Frontend C1 above (regression-by-omission).
- Cache-hit debug-view freeze (prior TagCapture C1): fixed via CacheEntry.CapturedTags replay. ✔
- Bind-failure leak (prior M5/M6): StartAppAsync tears down a partially-started app/broadcaster. ✔
- The 8 Minor follow-ups (#14–#21): all closed; TestsAndConfig.md's regression table confirms, with the m4/m6 false positive corrected above.
Recommended remediation order
1. Escape stateChip at dashboard.js:48 (${escapeHtml(state)}): a one-line change that closes the XSS regression.
2. AdminPort 0 semantics: either implement the documented disable (a port == 0 → return guard in AdminEndpointHost.StartAppAsync, then allow 0 in both validators) or correct the templates/docs to say the admin endpoint is always on. Pick one; today the docs lie.
3. Single startup/reload validation gate: run ReloadValidator (or a merged validator) at startup so port collisions and duplicate PLC names fail fast; fix Configuration.md:284.
4. Cache invalidation on the write request (Multiplexing C1): invalidate overlapping FC03/04 entries when an FC06/FC16 is enqueued, not only on its response.
5. Reconciler Restart ordering (ConfigAndHosting M2): build the new supervisor before removing the old one.
6. PlcListener.RunAsync rethrow (ProxyAndBcd M2) so the supervisor's fault path runs as designed.
7. Watchdog party-list snapshot (Multiplexing M5): iterate a frozen copy so a late attach cannot kill the watchdog.
8. Inbound FC16 / cached-response framing validation (Multiplexing M2/M3/M4, ProxyAndBcd M1): one helper, applied at both ends.
Items 1–3 are small and high-value. 4–8 are the connection-layer correctness work and deserve their own change with tests. The 38 Minors are opportunistic — the doc-drift cluster is worth a single sweep.