b222362ce0
Fixes every finding from the codereviews/2026-05-16 multi-agent review (2 Critical, 20 Major, 38 Minor) and adds that review to the repo. Highlights: dashboard XSS escape; response cache invalidated on the write request (not just the response); ReloadValidator now runs at startup so port collisions / duplicate names / malformed Resilience profiles fail fast; AdminPort 0 genuinely disables the admin endpoint; PlcListener accept-loop faults propagate to the supervisor's faulted path; reconciler Restart builds before removing; Resilience pipelines are restart-only from a frozen snapshot; multiplexer connect-race leak, watchdog party-list snapshot, backend-response and FC16 framing validation; frontend reconnect retry and util.js load guard; plus the log-event/doc drift sweep and test-port hygiene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
116 lines
13 KiB
Markdown
116 lines
13 KiB
Markdown
# Code Review — mbproxy (2026-05-16)
|
||
|
||
**Branch:** `mbproxy-webui-dashboard` · **HEAD:** `0308490`
|
||
**Predecessors:** `codereviews/2026-05-14/`, `codereviews/2026-05-15/` (all of their findings remediated in `554b05d`, `374eecd`, `0308490`).
|
||
|
||
## Method
|
||
|
||
This was a full-service review run as six parallel area reviews, each in its own file:
|
||
|
||
| File | Scope |
|
||
|------|-------|
|
||
| [`Multiplexing.md`](Multiplexing.md) | Backend connection layer — `Proxy/Multiplexing/`, `Proxy/Cache/`, `MbapFrame` |
|
||
| [`ProxyAndBcd.md`](ProxyAndBcd.md) | Proxy lifecycle + BCD codec — `Proxy/` core, `Proxy/Supervision/`, `Bcd/` |
|
||
| [`AdminSignalR.md`](AdminSignalR.md) | Admin endpoint, SignalR backend, `TagValueCapture`/`TagCaptureRegistry` |
|
||
| [`Frontend.md`](Frontend.md) | `Admin/wwwroot/` — HTML/CSS/JS dashboard |
|
||
| [`ConfigAndHosting.md`](ConfigAndHosting.md) | `Configuration/`, `Options/`, `Diagnostics/`, hosting, `ServiceCounters` |
|
||
| [`TestsAndConfig.md`](TestsAndConfig.md) | `tests/`, `install/`, config templates, csproj packaging |
|
||
|
||
Open each area file for the full finding text (description, impact, recommendation, line refs). This Overview consolidates, adjudicates severity where I disagree with an agent, and sets a remediation order.
|
||
|
||
## Headline verdict
|
||
|
||
The service is fundamentally sound. The hard parts — TxId multiplexing, the claim-then-dispatch watchdog, lock ordering, the BCD codec in all four directions, `Interlocked` counters, graceful-shutdown sequencing, the tab-keyed SignalR capture model — were checked and hold up. **Every Critical and Major finding from the two prior reviews is confirmed remediated with no regressions** (one frontend exception, below).
|
||
|
||
This review found **2 findings rated Critical by the area agents, 20 Major, and 38 Minor** (60 total). My own adjudication downgrades both "Criticals" to high-Major (reasoning under *Critical findings*), so I would summarise the real risk profile as: **no true Critical, ~8 Major worth scheduling, the rest opportunistic.** The two largest themes are (1) the proxy trusts self-describing Modbus wire fields more than it should, and (2) startup config validation is weaker than hot-reload config validation.
|
||
|
||
## Findings by area
|
||
|
||
| Area | Critical | Major | Minor | Most important |
|
||
|------|:--:|:--:|:--:|----------------|
|
||
| Multiplexing | 1 | 6 | 7 | C1 cache can serve a value stale across a write |
|
||
| ProxyAndBcd | 0 | 2 | 7 | M2 `PlcListener` swallows accept-loop faults |
|
||
| AdminSignalR | 0 | 2 | 7 | M1 one-cycle arm-state/tag-row inconsistency |
|
||
| Frontend | 1 | 1 | 4 | C1 `dashboard.js` `stateChip` unescaped (XSS regression) |
|
||
| ConfigAndHosting | 0 | 5 | 6 | M1 `ReloadValidator` never runs at startup |
|
||
| TestsAndConfig | 0 | 4 | 7 | M1 `AdminPort: 0` does not disable the admin endpoint |
|
||
| **Total** | **2** | **20** | **38** | |
|
||
|
||
## Critical findings — adjudicated
|
||
|
||
### Multiplexing C1 — response cache can serve a value contradicted by an in-flight write
|
||
|
||
`PlcMultiplexer.cs:826-857` (read/lookup) vs `:674-695` (invalidation on the FC06/FC16 *response*). Cache invalidation fires only when the write *response* lands. A read arriving after a write was forwarded but before its response gets a cache hit and the pre-write value. See `Multiplexing.md` C1 for the full sequence.
|
||
|
||
**My adjudication: high-Major, not Critical.** Mitigating: the cache is opt-in per tag (default OFF); the window is one backend round-trip and is already inside the documented `CacheTtlMs` staleness contract; and until the PLC acks the write, the old value is arguably the last *confirmed* value. The genuinely defective sub-case is a **write that times out** — it never produces a response, so invalidation never runs and the cache can serve a potentially-changed value for the rest of the TTL. Regardless of severity label, the fix is cheap and removes all ambiguity: **invalidate on the write request (enqueue), keeping response-side invalidation as a backstop.** Recommended.
|
||
|
||
### Frontend C1 — `dashboard.js` `stateChip` interpolates `listener.state` into `innerHTML` unescaped
|
||
|
||
`dashboard.js:48`. `detail.js`'s identical function was escaped by the prior review; the fix was not propagated to `dashboard.js`. See `Frontend.md` C1.
|
||
|
||
**My adjudication: Major, but fix immediately.** `listener.state` is a server-serialized enum, so practical exploitability today is low — but this is a straight regression from the prior XSS sweep and the fix is one character (`${escapeHtml(state)}`). It should not wait.
|
||
|
||
## Major findings worth scheduling
|
||
|
||
Grouped by theme. Full text in the area files.
|
||
|
||
**Startup vs hot-reload validation asymmetry — fix as one piece of work.**
|
||
- `ConfigAndHosting.md` M1 — `ReloadValidator.Validate` runs *only* inside the reconciler; the startup path runs only `MbproxyOptionsValidator`, which does not check duplicate `ListenPort`s, `AdminPort` collisions, duplicate PLC names, or the keepalive cross-field rule. A port-typo config starts a half-working fleet instead of failing fast — and `docs/Operations/Configuration.md:284` claims the opposite.
|
||
- `ConfigAndHosting.md` M4 — `MbproxyOptionsValidator` never range-checks `ListenPort`/`Port`/`AdminPort` or rejects empty `Host`/`Name`; `ListenPort: 0` (the omitted-key default) silently binds an ephemeral port.
|
||
- `TestsAndConfig.md` M1 — both config templates and the docs say `AdminPort: 0` *disables* the admin endpoint; it does not — Kestrel binds an ephemeral port. And `ReloadValidator` rejects `AdminPort < 1` on reload while startup accepts it. **This is the most user-facing Major: documented behaviour is false on a security-relevant setting.**
|
||
|
||
**The proxy trusts self-describing Modbus wire fields.**
|
||
- `Multiplexing.md` M2 — a single wrong MBAP `Length` desynchronises the backend stream indefinitely with no resync; the PLC goes dark until the socket happens to close.
|
||
- `Multiplexing.md` M3 — a cached FC03/FC04 payload is never validated against `2*qty`; a malformed backend response is cached and replayed to every hit for the TTL.
|
||
- `Multiplexing.md` M4 / `ProxyAndBcd.md` M1 — FC16 `qty`/`byteCount` are never cross-checked; a request whose self-describing fields disagree drives cache invalidation and partial BCD rewriting off the client's claim rather than reality. (Not memory-unsafe — per-slot bounds checks hold — but a contract gap.)
|
||
- Recommendation: add one framing-validation helper for inbound FC16 (`byteCount == 2*qty`, length consistent, `qty` within the DL260 cap) and one for cached/forwarded FC03/04 responses (`byteCount == 2*qty`); reject/teardown on mismatch.
|
||
|
||
**Hot-reload robustness.**
|
||
- `ConfigAndHosting.md` M2 — the reconciler `Restart` path removes the old supervisor from the dictionary *before* rebuilding; a transient fault during rebuild drops that PLC permanently with only an Error line. Build the new supervisor first, swap last.
|
||
- `ConfigAndHosting.md` M3 — `Resilience.BackendConnect`/`ListenerRecovery` reloads reach only added/restarted PLCs, not reseated/untouched ones — inconsistent, undocumented. Either thread a live accessor (as done for `ReadCoalescing`/`Keepalive`) or document them as restart-only.
|
||
|
||
**Connection-layer resilience.**
|
||
- `ProxyAndBcd.md` M2 — `PlcListener.RunAsync` catches and *returns* on an accept-loop fault instead of rethrowing, making the supervisor's exception-carrying `mbproxy.listener.faulted` (EventId 43) unreachable; faults are double-logged and lose their stack trace.
|
||
- `Multiplexing.md` M5 — a late-attaching coalescing party can mutate `InterestedParties` while the watchdog enumerates it, throwing `InvalidOperationException` that the outer catch turns into **permanent watchdog death for that PLC**. Snapshot the party list before fan-out.
|
||
- `Multiplexing.md` M1 — `EnsureBackendConnectedAsync` can leak a live backend socket + three tasks if `_disposed` flips during a cold-start connect. Re-check `_disposed` under `_backendLock` before publishing the socket.
|
||
- `Multiplexing.md` M6 — three failure-detection paths call `TearDownBackendAsync` fire-and-forget; a pre-await throw becomes an unobserved task exception. Attach a faulted-continuation logger.
|
||
|
||
**Silent failures.**
|
||
- `Frontend.md` M1 — `onreconnected` swallows a failed `SubscribeFleet`/`SubscribePlc`; the pill shows green while the feed is dead, with no retry. Route the warm-reconnect subscribe through the same retry as cold start.
|
||
- `AdminSignalR.md` M1 — `ReconcileArmed` and `BuildDebug` re-query the capture independently in one push cycle, so a mid-cycle disarm can push a `PlcDebugSnapshot` where `captureArmed` and the tag rows disagree for ≤1 cycle. Self-heals; derive `CaptureArmed` from the already-held `activePlcs` set instead. (I'd call this Minor — one-cycle cosmetic — but it touches the arm-state invariant the prior reviews centred on.)
|
||
- `AdminSignalR.md` M2 — `StatusBroadcaster.Start()` has no double-call guard; a second call orphans a push loop. Latent (one call site today); add an `Interlocked` flag.
|
||
|
||
## Cross-cutting Minor themes
|
||
|
||
- **Doc drift** (several): `BcdRewriting.md` uses `mbproxy.rewrite.exception_passthrough` for an event actually named `mbproxy.exception.passthrough` (`ProxyAndBcd.md` N2); `Configuration.md:284` and `HotReload.md:81/95` describe a validation model the code does not implement (`ConfigAndHosting.md` M1/N6); `ConfigReconciler.cs:364` comment claims `GetOrCreate` preserves the armed flag — it no longer does (`AdminSignalR.md` N1); `LogEvents.md` lists EventId 21 for a `ProxyWorker` event with no call site (`ProxyAndBcd.md` N1).
|
||
- **Unbounded growth on PLC churn**: frontend `prevPdu`/`rateByName` maps (`Frontend.md` N2); `ResponseCache` LRU is O(n)-per-insert with an unbounded operator-settable cap (`Multiplexing.md` N4).
|
||
- **Test hygiene**: hard-coded `AdminPort: 8080` in five test configs → parallel-run bind conflicts (`TestsAndConfig.md` M2); the sim fixture's readiness poll ignores the runner cancellation token (`TestsAndConfig.md` M3); a needless 400 ms delay in a broadcaster test (`TestsAndConfig.md` m2).
|
||
- **Best-effort swallow gaps**: `SocketKeepalive.Apply` does not catch `NotSupportedException`/`PlatformNotSupportedException` despite a "never abort a connection" contract (`ProxyAndBcd.md` N4); `EventLogBridge` has no first-failure breadcrumb where `SyslogBridge` does (`ConfigAndHosting.md` N4).
|
||
|
||
## Correction to an area finding
|
||
|
||
`TestsAndConfig.md` **m4** (and its regression-table row for prior `m6`) reports that `tests/sim/mbproxy.smoke.config.json` still cites `plans/2026-05-15-webui-dashboard.md`. **This is a false positive** — that reference was removed in commit `0308490` (task #21); the current header reads "mbproxy smoke-test configuration for the web-UI browser smoke tests." No action needed for m4.
|
||
|
||
## Prior-review regression check
|
||
|
||
All Critical and Major findings from `codereviews/2026-05-15/` are confirmed fixed and the fixes hold:
|
||
|
||
- SignalR capture-leak (prior C1/C2) — `PlcSubscriptionTracker` is tab-keyed; arm/disarm funnels through `ReconcileArmed`; the hub no longer arms captures. ✔
|
||
- XSS in `detail.js` (prior C1) — `detail.js` is fully escaped. ✔ **But the same fix was not propagated to `dashboard.js` — see Frontend C1 above (regression-by-omission).**
|
||
- Cache-hit debug-view freeze (prior TagCapture C1) — fixed via `CacheEntry.CapturedTags` replay. ✔
|
||
- Bind-failure leak (prior M5/M6) — `StartAppAsync` tears down a partially-started app/broadcaster. ✔
|
||
- The 8 Minor follow-ups (#14–#21) — all closed; `TestsAndConfig.md`'s regression table confirms, with the m4/m6 false positive corrected above.
|
||
|
||
## Recommended remediation order
|
||
|
||
1. **`dashboard.js:48`** — escape `stateChip` (`${escapeHtml(state)}`). One character; closes the XSS regression.
|
||
2. **`AdminPort: 0` semantics** — either implement the documented disable (a `port == 0 → return` guard in `AdminEndpointHost.StartAppAsync`, then allow 0 in both validators) or correct the templates/docs to say the admin endpoint is always on. Pick one; today the docs lie.
|
||
3. **Single startup/reload validation gate** — run `ReloadValidator` (or a merged validator) at startup so port collisions and duplicate PLC names fail fast; fix `Configuration.md:284`.
|
||
4. **Cache invalidation on the write request** (Multiplexing C1) — invalidate overlapping FC03/04 entries when an FC06/FC16 is enqueued, not only on its response.
|
||
5. **Reconciler `Restart` ordering** (ConfigAndHosting M2) — build the new supervisor before removing the old one.
|
||
6. **`PlcListener.RunAsync` rethrow** (ProxyAndBcd M2) so the supervisor's fault path runs as designed.
|
||
7. **Watchdog party-list snapshot** (Multiplexing M5) — iterate a frozen copy so a late attach cannot kill the watchdog.
|
||
8. **Inbound FC16 / cached-response framing validation** (Multiplexing M2/M3/M4, ProxyAndBcd M1) — one helper, applied at both ends.
|
||
|
||
Items 1–3 are small and high-value. 4–8 are the connection-layer correctness work and deserve their own change with tests. The 38 Minors are opportunistic — the doc-drift cluster is worth a single sweep.
|