554b05d28c
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
97 lines
17 KiB
Markdown
97 lines
17 KiB
Markdown
# Frontend / Live Web Dashboard — Code Review
|
||
|
||
Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard"). Files reviewed — all under `src/Mbproxy/Admin/wwwroot/`:
|
||
`index.html`, `plc.html`, `dashboard.js`, `detail.js`, `theme.css`, `dashboard.css`, `detail.css`. Vendored assets (`bootstrap.min.css`, `bootstrap.bundle.min.js`, `signalr.min.js`, `*.woff2`) explicitly out of scope. Cross-checked against `docs/Operations/StatusPage.md` (wire contract), `CLAUDE.md` (mbproxy), and the server side `StatusHub.cs` / `StatusBroadcaster.cs` for the SignalR contract.
|
||
|
||
## Summary
|
||
|
||
- The dashboard is well-built vanilla JS: thoughtful escaping helpers, clean rate computation, sensible reconnection wiring, and a correct subscribe-on-reconnect pattern that keeps the on-demand tag capture armed/disarmed correctly.
|
||
- **The single most important finding is a real, exploitable XSS hole on the detail page**: `detail.js` renders `t.rawHex`, `t.address`, `t.width`, `t.direction`, `c.remote`, and `plc.host` straight into `innerHTML` **without escaping**. `escapeHtml` exists in the file but is applied inconsistently — several attacker/PLC-influenceable fields bypass it. The fleet table (`dashboard.js`) is escaped correctly throughout; the detail page is not. This is a Critical finding.
|
||
- A second Major issue: on a *cold* SignalR failure the page enters a `setTimeout(start, 3000)` retry loop that **builds a brand-new `HubConnection` on every retry** — no, it reuses the same connection object, but it never tears down handlers, and combined with `withAutomaticReconnect` the failure/retry semantics are muddled. There is also no upper bound and no UX past "retrying".
|
||
- DOM updates are full-`innerHTML` re-renders of the whole table body every ~1 s. Correct and flicker-free for 54 rows, but it blows away focus/selection and is wasteful; acceptable at this scale, flagged.
|
||
|
||
## Critical findings
|
||
|
||
**C1. Stored/reflected XSS on the connection-detail page — multiple unescaped fields rendered via `innerHTML`.** `detail.js` defines `escapeHtml` (line 23) and uses it for *some* fields but omits it for several others, all of which are interpolated into template strings later assigned to `.innerHTML`:
|
||
|
||
- **`detail.js:194` and `:203`** — the debug-row builder:
|
||
```js
|
||
<td>${t.address} <span class="ratio-sub">${hex4(t.address)}</span></td>
|
||
<td>${t.width}-bit</td>
|
||
...
|
||
<td><span class="dir-tag ${dirCls}">${t.direction}</span></td>
|
||
```
|
||
`t.direction` is a server string (`"read"`/`"write"` per the schema) interpolated raw into both an attribute-ish context and element text. `t.address`/`t.width` are typed `int` in the documented DTO, so they are lower risk — but the code does not coerce them (`Number(...)`) or escape them, so it relies entirely on the server DTO type. `t.direction` is a free `string` on the wire and is **not escaped**.
|
||
- **`detail.js:86`** — the client list:
|
||
```js
|
||
`<div class="client-line">${escapeHtml(c.remote)}` +
|
||
`<span class="pdu"> · ${num(c.pdusForwarded)} PDUs · since ${shortTime(c.connectedAtUtc)}</span></div>`
|
||
```
|
||
`c.remote` *is* escaped (good), but `shortTime(c.connectedAtUtc)` is **not** — and `shortTime` (line 36) has a `catch { return iso; }` branch that returns the **raw, unescaped `connectedAtUtc` string** verbatim when `new Date(iso)` parsing throws or the value is non-ISO. A malformed/attacker-controlled timestamp string therefore lands unescaped inside `innerHTML`.
|
||
- **`detail.js:203`** — `escapeHtml(t.rawHex)` *is* applied (good), and `num(t.decodedValue)` is numeric-safe. So `rawHex` is the one debug field that is handled correctly. (Correcting the brief: `rawHex` is escaped; `direction` and the timestamp path are the live holes.)
|
||
- **`detail.js:65`** — `$('plc-sub').textContent = `${plc.host}:${plc.listenPort}`;` is **safe** (`textContent`). But `detail.js:14–16` does the same for `plcName` via `textContent` — also safe. So the identity header is fine; the holes are specifically the `innerHTML` card/table builders.
|
||
|
||
Severity rationale: per `docs/Operations/StatusPage.md` the admin endpoint binds `IPAddress.Any` with **no authentication** ("Authentication lives at the network layer"). PLC `name`/`host` and backend-derived strings come off the Modbus wire / `appsettings.json` and are operator- or device-influenceable. A PLC named or a backend that returns a crafted string can inject `<img src=x onerror=...>` into any operator's browser session on the trusted segment. Read-only UI does not mean low impact: the injected script runs with the operator's origin.
|
||
|
||
Concrete fix: route **every** dynamic value through `escapeHtml` before it enters an `innerHTML` template, with **no exceptions**:
|
||
- `detail.js:194` / `:203`: wrap `t.direction` in `escapeHtml(...)`; coerce `t.address`/`t.width` with `Number(...)` (or escape) rather than trusting the DTO type.
|
||
- `detail.js:86`: wrap the `shortTime(...)` result in `escapeHtml(...)`, and/or change `shortTime`'s fallback to `escapeHtml(iso)` / return `'—'`.
|
||
- Add a single `escapeAttr` helper (as `dashboard.js` has at line 179) and use it for the `dirCls`/class interpolation at `:197` if `dirCls` ever becomes data-derived (currently it is a literal, so low priority — but `card()`'s `cls` parameter at `detail.js:45` is interpolated into `class="v ${cls||''}"` and is caller-supplied; today all callers pass literals, so it is safe *now* but fragile).
|
||
- Better still: stop hand-building HTML. Build rows with `document.createElement` + `textContent`/`dataset`, which makes escaping structural rather than a discipline that one missed call defeats.
|
||
|
||
## Major findings
|
||
|
||
**M1. Cold-start retry loop layers a manual `setTimeout` retry on top of `withAutomaticReconnect`, with no bound and a misleading pill.** `dashboard.js:252–263` and the identical `detail.js:236–245`:
|
||
```js
|
||
async function start() {
|
||
try { ...; await connection.start(); await connection.invoke('SubscribeFleet'); ... }
|
||
catch { setConn('disconnected','retrying'); setTimeout(start, 3000); }
|
||
}
|
||
```
|
||
`withAutomaticReconnect` only covers a connection that *was* established and then dropped — it does **not** retry the initial `start()`. So the manual loop is needed. But: (a) it retries **forever** every 3 s with no backoff and no cap — if the hub is permanently gone the browser hammers it indefinitely; (b) if `connection.start()` succeeds but `invoke('SubscribeFleet')` throws, the `catch` calls `start()` again on an **already-started** connection, which will throw "cannot start a connection that is not in the Disconnected state" and wedge the retry loop; (c) the pill shows `disconnected`/`retrying` which is reasonable, but there is no terminal "giving up" state and no way for an operator to know the difference between "server down" and "server slow". Fix: separate "start the connection" from "subscribe" so a subscribe failure doesn't restart the socket; add capped exponential backoff; guard `start()` against re-entry when `connection.state !== 'Disconnected'`.
|
||
|
||
**M2. Detail page never handles "PLC name not in fleet" vs. "hub never delivers".** `detail.js` subscribes with `SubscribePlc(plcName)` where `plcName` is taken from `location.pathname`. If the name is wrong (typo, stale bookmark, or a name that never existed), `StatusHub.SubscribePlc` happily adds the connection to a `plc:{bogus}` group and `_captureRegistry.Arm` is a documented no-op — so the server **never sends a `plc` message** and `onDetail` never fires. The page sits forever on "Waiting for first snapshot…" with a green `connected` pill. There is no timeout, no "unknown PLC" state. The `renderMissing()` path (`detail.js:168`) only triggers when the server *does* push a payload with `detail.plc === null` (PLC removed by hot-reload) — it cannot fire for a name that was never configured because nothing is pushed at all. Fix: after `connected`, start a watchdog (e.g. 3× the push interval); if no `plc` message arrives, show an "unknown or unreachable PLC" notice.
|
||
|
||
**M3. Fleet PDU/s rate is wrong for the first snapshot after any reconnect, and silently leaks `prevPdu`/`rateByName` entries for removed PLCs.** `dashboard.js:65–77` `updateRates` keys `prevPdu`/`rateByName` by `plc.name` and never removes entries. On hot-reload removal of a PLC, its stale rate stays in `rateByName` forever and is still summed into the fleet rate? — no, `renderAggregates` only iterates `s.plcs`, so a removed PLC drops out of the *sum*; but the Map still grows unboundedly across the process lifetime if PLCs churn. More importantly, after a SignalR reconnect the counters are *cumulative since service start*, so `cur - prev.forwarded` across a multi-second reconnect gap produces a correct (if coarse) rate — that part is fine. The real bug: `performance.now()` is monotonic per-page, so that's fine too. Net: low-grade Map leak on PLC churn; prune `prevPdu`/`rateByName` to the current `snapshot.plcs` name set each cycle.
|
||
|
||
**M4. Full-table-body `innerHTML` re-render on every ~1 s push destroys transient UI state.** `dashboard.js:155–173` rebuilds the entire `<tbody>` string and assigns `tbody.innerHTML`. Consequences at the 1 s cadence: any text selection inside the table is lost every second; `:hover` is re-evaluated (minor); the browser re-parses ~54 rows of HTML each tick. It is *not* a flicker source (synchronous replace, no intermediate paint) and 54 rows is cheap, so this is acceptable — but it is the kind of thing that becomes a problem if columns/rows grow. A keyed diff (update existing `<tr>` cells in place, add/remove only on set change) would also fix the selection-loss annoyance. Flagged as Major because of the selection-loss UX regression on a screen operators stare at; the brief explicitly asked about this.
|
||
|
||
## Minor findings
|
||
|
||
**N1. `detail.js` has no `escapeAttr` and `dashboard.js`'s is duplicated.** Both files define their own `escapeHtml` (`dashboard.js:176`, `detail.js:23`) — identical code, copy-pasted. `escapeAttr` exists only in `dashboard.js:179`. Since both files are separately served and there is no shared module, duplication is the pragmatic choice for a no-build-step project, but the *inconsistency* (detail.js lacking `escapeAttr`) is exactly what enabled C1. Recommend a tiny shared `util.js` served from `/assets/`, or at minimum make the two `escapeHtml` definitions and an `escapeAttr` identical and present in both.
|
||
|
||
**N2. Accessibility gaps.**
|
||
- The sortable `<th>` elements (`index.html:72–81`) are clickable via a JS `click` handler but are not keyboard-focusable and carry no `role="button"`/`tabindex="0"`/`aria-sort`. A keyboard-only operator cannot sort the table. Add `tabindex="0"`, `aria-sort` reflecting `sorted-asc`/`sorted-desc`, and a `keydown` (Enter/Space) handler.
|
||
- Table rows are clickable (`dashboard.js:232`) to open a detail page but are `<tr>` with `cursor:pointer` only — not keyboard-reachable and not announced as interactive. Consider making the PLC-name cell an actual `<a href="/plc/...">` so it is focusable, middle-click/ctrl-click works natively, and screen readers announce it. This would also remove the need for the JS `window.open` handler.
|
||
- The connection-state pill (`#conn`) updates visually but has no `aria-live` region, so a screen-reader user is never told the hub dropped. Add `aria-live="polite"` to `#conn` or to `#conn-text`.
|
||
- `<input type="search" id="f-search">` has a `placeholder` but no associated `<label>` (visible or `aria-label`). Same for `#f-state`. Add `aria-label`.
|
||
|
||
**N3. Row click always `window.open(..., '_blank')` — no modifier-key respect, popup-blocker exposure.** `dashboard.js:235` unconditionally calls `window.open`. A plain click should arguably open in the same tab or respect the user's intent; programmatic `window.open` not in direct response to a trusted click on an anchor can be caught by popup blockers in some configs. Tied to N2's suggestion: render the PLC name as a real `<a target="_blank" rel="noopener">` and delete the handler. (`rel="noopener"` also matters — `window.open` without it leaves `window.opener` live; here the opened page is same-origin so impact is low, but it is still best practice.)
|
||
|
||
**N4. `detail.js` `plcName` parsing is brittle.** `detail.js:10`: `decodeURIComponent(location.pathname.replace(/^\/plc\//, ''))`. If the route is ever served under a path prefix, or the name itself contains an encoded `/`, this misparses. Also if `decodeURIComponent` throws (malformed `%` sequence in the URL) the whole script aborts at line 10 before `connect()` is ever reached — the page is then blank with no error. Wrap in try/catch and fall back to a visible error state.
|
||
|
||
**N5. `dashboard.js:101` and `:106` — fleet PDU/s shows `—` until the *second* snapshot.** Expected (rate needs two samples) and correctly handled, but the aggregate card shows `—` for the first ~1 s while the table already shows per-row `—` rates. Cosmetic; no fix required, noted for completeness.
|
||
|
||
**N6. No `console.log`, no hardcoded absolute URLs, no obvious dead code.** All asset/hub URLs are root-relative (`/assets/...`, `/hub/status`, `/plc/...`) — correct, survives any host/port. `card()`'s `extra` parameter and the `cls` third element of card rows are lightly-used but not dead. Clean on this axis.
|
||
|
||
**N7. `formatUptime` / `formatAge` silently misrender negative or NaN input.** `dashboard.js:199` and `detail.js:28`: if `uptimeSeconds`/`ageSeconds` ever arrive negative (clock skew) or non-numeric, `Math.floor` yields `NaN` and the card shows `NaN`. Low risk given server types; a `Number.isFinite` guard returning `'—'` is cheap insurance.
|
||
|
||
## What looks good
|
||
|
||
- **Reconnect → re-subscribe is correct.** `dashboard.js:249` and `detail.js:233` both re-`invoke('SubscribeFleet'/'SubscribePlc')` inside `onreconnected`. This is essential and easy to forget: SignalR auto-reconnect gives a new `ConnectionId`, server-side group membership does **not** survive, and the detail page's tag capture is armed per-connection — without the re-subscribe the capture would silently disarm on every transient drop. Verified against `StatusHub.SubscribePlc`/`OnDisconnectedAsync` and `StatusBroadcaster` group targeting — the lifecycle is sound.
|
||
- **`onclose` correctly does not re-subscribe** — it just sets the pill; `withAutomaticReconnect`'s own loop owns recovery. No double-retry on the warm path.
|
||
- **Single `HubConnection` per page, closed implicitly on navigation.** Opening a detail page in a new tab (`window.open`) means each tab owns exactly one hub connection; closing the tab fires `OnDisconnectedAsync` server-side which disarms the capture. No connection leak across navigation.
|
||
- **The fleet table escapes correctly.** `dashboard.js:161–163` routes `plc.name` through `escapeAttr` for the `data-name` attribute and `escapeHtml` for cell text; `plc.host` through `escapeHtml`; `plc.listener.lastBindError` through `escapeAttr` for the `title=` attribute (line 160). This is the right discipline — it is only `detail.js` that fails to match it (C1).
|
||
- **`escapeHtml` is a correct minimal implementation** — `&`, `<`, `>` cover element-text contexts; `escapeAttr` adds `"`. Order matters (`&` first) and is correct.
|
||
- **Rate computation is robust.** `dashboard.js:70` guards `now > prev.t`, `Math.max(0, ...)` clamps a counter reset/reconnect to a non-negative rate, and `performance.now()` (monotonic) is the right clock for deltas — not `Date.now()`.
|
||
- **Filter is correct and cheap.** `visiblePlcs` (`dashboard.js:130`) filters a `.slice()` copy (never mutates the snapshot), search is case-folded once, and the sort has a stable `localeCompare` tiebreaker by name. At 54 rows the filter+sort+render per keystroke is sub-millisecond. It correctly survives live updates because `render()` always re-derives from `latest` + current `filter` state.
|
||
- **`renderMissing()` hot-reload path.** The detail page genuinely handles a PLC disappearing mid-session (`detail.js:168`) — `notice` shown, cards cleared and `hidden`. Good defensive UX for the hot-reload scenario (the gap is the *never-existed* name, see M2).
|
||
- **No CDN dependencies** — every `<script>`/`<link>` is `/assets/...`, consistent with the firewalled-network design goal in `StatusPage.md`.
|
||
- **CSS is clean**: design tokens via custom properties, `font-display: swap` on all `@font-face`, responsive `agg-grid` breakpoints, `prefers`-free but no egregious issues. `tr.stale` / `tr.no-traffic` styling gives the debug view real legibility. No `!important` abuse beyond two justified `.empty-row` overrides.
|
||
|
||
## Open questions
|
||
|
||
1. **C1 exploitability depends on whether `t.direction` and `connectedAtUtc` can actually carry attacker bytes.** `direction` is server-derived from the FC, so today it is effectively a closed set — but it is typed `string` on the wire and `detail.js` trusts it. `connectedAtUtc` is a `DateTimeOffset` serialized server-side, so it too is well-formed *today*. The finding stands because the frontend must not depend on server-side type discipline for its own XSS safety — but if the threat model formally excludes a compromised service, C1 could be re-rated Major. Recommend fixing regardless: the cost is three `escapeHtml` calls.
|
||
2. Does `StatusBroadcaster` ever push a `plc` payload for a PLC that has *zero* configured BCD tags? `detail.js:183` handles `debug.tags` empty → "No BCD tags configured". Confirmed handled; noted only to flag that the empty-state is covered.
|
||
3. Should the detail page cap how long it waits before declaring the PLC unknown (M2)? The server has no "unknown PLC" rejection in `SubscribePlc` — it silently accepts any name. A client-side watchdog is the only place this can be surfaced without a server change.
|