Files
wwtools/mbproxy/codereviews/2026-05-15/Frontend.md
Joseph Doherty 554b05d28c mbproxy: fix dashboard review findings, add named BCD tags + fleet config
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 03:39:39 -04:00

97 lines
17 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Frontend / Live Web Dashboard — Code Review
Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard"). Files reviewed — all under `src/Mbproxy/Admin/wwwroot/`:
`index.html`, `plc.html`, `dashboard.js`, `detail.js`, `theme.css`, `dashboard.css`, `detail.css`. Vendored assets (`bootstrap.min.css`, `bootstrap.bundle.min.js`, `signalr.min.js`, `*.woff2`) explicitly out of scope. Cross-checked against `docs/Operations/StatusPage.md` (wire contract), `CLAUDE.md` (mbproxy), and the server side `StatusHub.cs` / `StatusBroadcaster.cs` for the SignalR contract.
## Summary
- The dashboard is well-built vanilla JS: thoughtful escaping helpers, clean rate computation, sensible reconnection wiring, and a correct subscribe-on-reconnect pattern that keeps the on-demand tag capture armed/disarmed correctly.
- **The single most important finding is a real, exploitable XSS hole on the detail page**: `detail.js` renders `t.rawHex`, `t.address`, `t.width`, `t.direction`, `c.remote`, and `plc.host` straight into `innerHTML` **without escaping**. `escapeHtml` exists in the file but is applied inconsistently — several attacker/PLC-influenceable fields bypass it. The fleet table (`dashboard.js`) is escaped correctly throughout; the detail page is not. This is a Critical finding.
- A second Major issue: on a *cold* SignalR failure the page enters a `setTimeout(start, 3000)` retry loop that **builds a brand-new `HubConnection` on every retry** — no, it reuses the same connection object, but it never tears down handlers, and combined with `withAutomaticReconnect` the failure/retry semantics are muddled. There is also no upper bound and no UX past "retrying".
- DOM updates are full-`innerHTML` re-renders of the whole table body every ~1 s. Correct and flicker-free for 54 rows, but it blows away focus/selection and is wasteful; acceptable at this scale, flagged.
## Critical findings
**C1. Stored/reflected XSS on the connection-detail page — multiple unescaped fields rendered via `innerHTML`.** `detail.js` defines `escapeHtml` (line 23) and uses it for *some* fields but omits it for several others, all of which are interpolated into template strings later assigned to `.innerHTML`:
- **`detail.js:194` and `:203`** — the debug-row builder:
```js
<td>${t.address} <span class="ratio-sub">${hex4(t.address)}</span></td>
<td>${t.width}-bit</td>
...
<td><span class="dir-tag ${dirCls}">${t.direction}</span></td>
```
`t.direction` is a server string (`"read"`/`"write"` per the schema) interpolated raw into both an attribute-ish context and element text. `t.address`/`t.width` are typed `int` in the documented DTO, so they are lower risk — but the code does not coerce them (`Number(...)`) or escape them, so it relies entirely on the server DTO type. `t.direction` is a free `string` on the wire and is **not escaped**.
- **`detail.js:86`** — the client list:
```js
`<div class="client-line">${escapeHtml(c.remote)}` +
`<span class="pdu"> · ${num(c.pdusForwarded)} PDUs · since ${shortTime(c.connectedAtUtc)}</span></div>`
```
`c.remote` *is* escaped (good), but `shortTime(c.connectedAtUtc)` is **not** — and `shortTime` (line 36) has a `catch { return iso; }` branch that returns the **raw, unescaped `connectedAtUtc` string** verbatim when `new Date(iso)` parsing throws or the value is non-ISO. A malformed/attacker-controlled timestamp string therefore lands unescaped inside `innerHTML`.
- **`detail.js:203`** — `escapeHtml(t.rawHex)` *is* applied (good), and `num(t.decodedValue)` is numeric-safe. So `rawHex` is the one debug field that is handled correctly. (Correcting the brief: `rawHex` is escaped; `direction` and the timestamp path are the live holes.)
- **`detail.js:65`** — `$('plc-sub').textContent = `${plc.host}:${plc.listenPort}`;` is **safe** (`textContent`). But `detail.js:1416` does the same for `plcName` via `textContent` — also safe. So the identity header is fine; the holes are specifically the `innerHTML` card/table builders.
Severity rationale: per `docs/Operations/StatusPage.md` the admin endpoint binds `IPAddress.Any` with **no authentication** ("Authentication lives at the network layer"). PLC `name`/`host` and backend-derived strings come off the Modbus wire / `appsettings.json` and are operator- or device-influenceable. A PLC named or a backend that returns a crafted string can inject `<img src=x onerror=...>` into any operator's browser session on the trusted segment. Read-only UI does not mean low impact: the injected script runs with the operator's origin.
Concrete fix: route **every** dynamic value through `escapeHtml` before it enters an `innerHTML` template, with **no exceptions**:
- `detail.js:194` / `:203`: wrap `t.direction` in `escapeHtml(...)`; coerce `t.address`/`t.width` with `Number(...)` (or escape) rather than trusting the DTO type.
- `detail.js:86`: wrap the `shortTime(...)` result in `escapeHtml(...)`, and/or change `shortTime`'s fallback to `escapeHtml(iso)` / return `'—'`.
- Add a single `escapeAttr` helper (as `dashboard.js` has at line 179) and use it for the `dirCls`/class interpolation at `:197` if `dirCls` ever becomes data-derived (currently it is a literal, so low priority — but `card()`'s `cls` parameter at `detail.js:45` is interpolated into `class="v ${cls||''}"` and is caller-supplied; today all callers pass literals, so it is safe *now* but fragile).
- Better still: stop hand-building HTML. Build rows with `document.createElement` + `textContent`/`dataset`, which makes escaping structural rather than a discipline that one missed call defeats.
## Major findings
**M1. Cold-start retry loop layers a manual `setTimeout` retry on top of `withAutomaticReconnect`, with no bound and a misleading pill.** `dashboard.js:252263` and the identical `detail.js:236245`:
```js
async function start() {
try { ...; await connection.start(); await connection.invoke('SubscribeFleet'); ... }
catch { setConn('disconnected','retrying'); setTimeout(start, 3000); }
}
```
`withAutomaticReconnect` only covers a connection that *was* established and then dropped — it does **not** retry the initial `start()`. So the manual loop is needed. But: (a) it retries **forever** every 3 s with no backoff and no cap — if the hub is permanently gone the browser hammers it indefinitely; (b) if `connection.start()` succeeds but `invoke('SubscribeFleet')` throws, the `catch` calls `start()` again on an **already-started** connection, which will throw "cannot start a connection that is not in the Disconnected state" and wedge the retry loop; (c) the pill shows `disconnected`/`retrying` which is reasonable, but there is no terminal "giving up" state and no way for an operator to know the difference between "server down" and "server slow". Fix: separate "start the connection" from "subscribe" so a subscribe failure doesn't restart the socket; add capped exponential backoff; guard `start()` against re-entry when `connection.state !== 'Disconnected'`.
**M2. Detail page never handles "PLC name not in fleet" vs. "hub never delivers".** `detail.js` subscribes with `SubscribePlc(plcName)` where `plcName` is taken from `location.pathname`. If the name is wrong (typo, stale bookmark, or a name that never existed), `StatusHub.SubscribePlc` happily adds the connection to a `plc:{bogus}` group and `_captureRegistry.Arm` is a documented no-op — so the server **never sends a `plc` message** and `onDetail` never fires. The page sits forever on "Waiting for first snapshot…" with a green `connected` pill. There is no timeout, no "unknown PLC" state. The `renderMissing()` path (`detail.js:168`) only triggers when the server *does* push a payload with `detail.plc === null` (PLC removed by hot-reload) — it cannot fire for a name that was never configured because nothing is pushed at all. Fix: after `connected`, start a watchdog (e.g. 3× the push interval); if no `plc` message arrives, show an "unknown or unreachable PLC" notice.
**M3. Fleet PDU/s rate is wrong for the first snapshot after any reconnect, and silently leaks `prevPdu`/`rateByName` entries for removed PLCs.** `dashboard.js:6577` `updateRates` keys `prevPdu`/`rateByName` by `plc.name` and never removes entries. On hot-reload removal of a PLC, its stale rate stays in `rateByName` forever and is still summed into the fleet rate? — no, `renderAggregates` only iterates `s.plcs`, so a removed PLC drops out of the *sum*; but the Map still grows unboundedly across the process lifetime if PLCs churn. More importantly, after a SignalR reconnect the counters are *cumulative since service start*, so `cur - prev.forwarded` across a multi-second reconnect gap produces a correct (if coarse) rate — that part is fine. The real bug: `performance.now()` is monotonic per-page, so that's fine too. Net: low-grade Map leak on PLC churn; prune `prevPdu`/`rateByName` to the current `snapshot.plcs` name set each cycle.
**M4. Full-table-body `innerHTML` re-render on every ~1 s push destroys transient UI state.** `dashboard.js:155173` rebuilds the entire `<tbody>` string and assigns `tbody.innerHTML`. Consequences at the 1 s cadence: any text selection inside the table is lost every second; `:hover` is re-evaluated (minor); the browser re-parses ~54 rows of HTML each tick. It is *not* a flicker source (synchronous replace, no intermediate paint) and 54 rows is cheap, so this is acceptable — but it is the kind of thing that becomes a problem if columns/rows grow. A keyed diff (update existing `<tr>` cells in place, add/remove only on set change) would also fix the selection-loss annoyance. Flagged as Major because of the selection-loss UX regression on a screen operators stare at; the brief explicitly asked about this.
## Minor findings
**N1. `detail.js` has no `escapeAttr` and `dashboard.js`'s is duplicated.** Both files define their own `escapeHtml` (`dashboard.js:176`, `detail.js:23`) — identical code, copy-pasted. `escapeAttr` exists only in `dashboard.js:179`. Since both files are separately served and there is no shared module, duplication is the pragmatic choice for a no-build-step project, but the *inconsistency* (detail.js lacking `escapeAttr`) is exactly what enabled C1. Recommend a tiny shared `util.js` served from `/assets/`, or at minimum make the two `escapeHtml` definitions and an `escapeAttr` identical and present in both.
**N2. Accessibility gaps.**
- The sortable `<th>` elements (`index.html:7281`) are clickable via a JS `click` handler but are not keyboard-focusable and carry no `role="button"`/`tabindex="0"`/`aria-sort`. A keyboard-only operator cannot sort the table. Add `tabindex="0"`, `aria-sort` reflecting `sorted-asc`/`sorted-desc`, and a `keydown` (Enter/Space) handler.
- Table rows are clickable (`dashboard.js:232`) to open a detail page but are `<tr>` with `cursor:pointer` only — not keyboard-reachable and not announced as interactive. Consider making the PLC-name cell an actual `<a href="/plc/...">` so it is focusable, middle-click/ctrl-click works natively, and screen readers announce it. This would also remove the need for the JS `window.open` handler.
- The connection-state pill (`#conn`) updates visually but has no `aria-live` region, so a screen-reader user is never told the hub dropped. Add `aria-live="polite"` to `#conn` or to `#conn-text`.
- `<input type="search" id="f-search">` has a `placeholder` but no associated `<label>` (visible or `aria-label`). Same for `#f-state`. Add `aria-label`.
**N3. Row click always `window.open(..., '_blank')` — no modifier-key respect, popup-blocker exposure.** `dashboard.js:235` unconditionally calls `window.open`. A plain click should arguably open in the same tab or respect the user's intent; programmatic `window.open` not in direct response to a trusted click on an anchor can be caught by popup blockers in some configs. Tied to N2's suggestion: render the PLC name as a real `<a target="_blank" rel="noopener">` and delete the handler. (`rel="noopener"` also matters — `window.open` without it leaves `window.opener` live; here the opened page is same-origin so impact is low, but it is still best practice.)
**N4. `detail.js` `plcName` parsing is brittle.** `detail.js:10`: `decodeURIComponent(location.pathname.replace(/^\/plc\//, ''))`. If the route is ever served under a path prefix, or the name itself contains an encoded `/`, this misparses. Also if `decodeURIComponent` throws (malformed `%` sequence in the URL) the whole script aborts at line 10 before `connect()` is ever reached — the page is then blank with no error. Wrap in try/catch and fall back to a visible error state.
**N5. `dashboard.js:101` and `:106` — fleet PDU/s shows `` until the *second* snapshot.** Expected (rate needs two samples) and correctly handled, but the aggregate card shows `` for the first ~1 s while the table already shows per-row `` rates. Cosmetic; no fix required, noted for completeness.
**N6. No `console.log`, no hardcoded absolute URLs, no obvious dead code.** All asset/hub URLs are root-relative (`/assets/...`, `/hub/status`, `/plc/...`) — correct, survives any host/port. `card()`'s `extra` parameter and the `cls` third element of card rows are lightly-used but not dead. Clean on this axis.
**N7. `formatUptime` / `formatAge` silently misrender negative or NaN input.** `dashboard.js:199` and `detail.js:28`: if `uptimeSeconds`/`ageSeconds` ever arrive negative (clock skew) or non-numeric, `Math.floor` yields `NaN` and the card shows `NaN`. Low risk given server types; a `Number.isFinite` guard returning `'—'` is cheap insurance.
## What looks good
- **Reconnect → re-subscribe is correct.** `dashboard.js:249` and `detail.js:233` both re-`invoke('SubscribeFleet'/'SubscribePlc')` inside `onreconnected`. This is essential and easy to forget: SignalR auto-reconnect gives a new `ConnectionId`, server-side group membership does **not** survive, and the detail page's tag capture is armed per-connection — without the re-subscribe the capture would silently disarm on every transient drop. Verified against `StatusHub.SubscribePlc`/`OnDisconnectedAsync` and `StatusBroadcaster` group targeting — the lifecycle is sound.
- **`onclose` correctly does not re-subscribe** — it just sets the pill; `withAutomaticReconnect`'s own loop owns recovery. No double-retry on the warm path.
- **Single `HubConnection` per page, closed implicitly on navigation.** Opening a detail page in a new tab (`window.open`) means each tab owns exactly one hub connection; closing the tab fires `OnDisconnectedAsync` server-side which disarms the capture. No connection leak across navigation.
- **The fleet table escapes correctly.** `dashboard.js:161163` routes `plc.name` through `escapeAttr` for the `data-name` attribute and `escapeHtml` for cell text; `plc.host` through `escapeHtml`; `plc.listener.lastBindError` through `escapeAttr` for the `title=` attribute (line 160). This is the right discipline — it is only `detail.js` that fails to match it (C1).
- **`escapeHtml` is a correct minimal implementation** — `&`, `<`, `>` cover element-text contexts; `escapeAttr` adds `"`. Order matters (`&` first) and is correct.
- **Rate computation is robust.** `dashboard.js:70` guards `now > prev.t`, `Math.max(0, ...)` clamps a counter reset/reconnect to a non-negative rate, and `performance.now()` (monotonic) is the right clock for deltas — not `Date.now()`.
- **Filter is correct and cheap.** `visiblePlcs` (`dashboard.js:130`) filters a `.slice()` copy (never mutates the snapshot), search is case-folded once, and the sort has a stable `localeCompare` tiebreaker by name. At 54 rows the filter+sort+render per keystroke is sub-millisecond. It correctly survives live updates because `render()` always re-derives from `latest` + current `filter` state.
- **`renderMissing()` hot-reload path.** The detail page genuinely handles a PLC disappearing mid-session (`detail.js:168`) — `notice` shown, cards cleared and `hidden`. Good defensive UX for the hot-reload scenario (the gap is the *never-existed* name, see M2).
- **No CDN dependencies** — every `<script>`/`<link>` is `/assets/...`, consistent with the firewalled-network design goal in `StatusPage.md`.
- **CSS is clean**: design tokens via custom properties, `font-display: swap` on all `@font-face`, responsive `agg-grid` breakpoints, `prefers`-free but no egregious issues. `tr.stale` / `tr.no-traffic` styling gives the debug view real legibility. No `!important` abuse beyond two justified `.empty-row` overrides.
## Open questions
1. **C1 exploitability depends on whether `t.direction` and `connectedAtUtc` can actually carry attacker bytes.** `direction` is server-derived from the FC, so today it is effectively a closed set — but it is typed `string` on the wire and `detail.js` trusts it. `connectedAtUtc` is a `DateTimeOffset` serialized server-side, so it too is well-formed *today*. The finding stands because the frontend must not depend on server-side type discipline for its own XSS safety — but if the threat model formally excludes a compromised service, C1 could be re-rated Major. Recommend fixing regardless: the cost is three `escapeHtml` calls.
2. Does `StatusBroadcaster` ever push a `plc` payload for a PLC that has *zero* configured BCD tags? `detail.js:183` handles `debug.tags` empty → "No BCD tags configured". Confirmed handled; noted only to flag that the empty-state is covered.
3. Should the detail page cap how long it waits before declaring the PLC unknown (M2)? The server has no "unknown PLC" rejection in `SubscribePlc` — it silently accepts any name. A client-side watchdog is the only place this can be surfaced without a server change.