Files
wwtools/mbproxy/codereviews/2026-05-15/Overview.md
T
Joseph Doherty 554b05d28c mbproxy: fix dashboard review findings, add named BCD tags + fleet config
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 03:39:39 -04:00

7.4 KiB

mbproxy SignalR Web Dashboard — Code Review Overview

Scope: commit e719dd5 ("replace status page with a live SignalR web dashboard") — ~3,500 lines across 49 files. Reviewed in four subsystem passes:

  • AdminSignalR.mdsrc/Mbproxy/Admin/* (host, hub, broadcaster, push sink, subscription tracker, snapshot builder, DTOs).
  • TagCapture.mdsrc/Mbproxy/Proxy/{TagValueCapture,TagCaptureRegistry}.cs + pipeline/reconciler/worker integration.
  • Frontend.mdsrc/Mbproxy/Admin/wwwroot/* (hand-written HTML/CSS/JS only; vendored assets excluded).
  • TestsAndConfig.md — new/changed tests, MbproxyOptions/ReloadValidator, csproj EmbeddedResource, smoke config.

Verdict

The dashboard is well-architected at the macro level — the IStatusPushSink testability seam, the lock-free TagValueCapture data structure, the broadcaster's lifecycle binding to the Kestrel app, and the embedded-asset model are all sound. But the review surfaced one security bug and a cluster of concurrency bugs that share a single root cause, plus a feature-correctness gap. None of these should ship to operators as-is.

The standout pattern: two independent reviewers (AdminSignalR.md, TagCapture.md) converged on the same root cause — capture arm/disarm state is not authoritatively owned. TagValueCapture.IsArmed is carried on the transient capture instance and counted by PlcSubscriptionTracker keyed on SignalR ConnectionId. That single design choice produces C2, C3, and M3 below. Fix it once and three findings collapse.

Cross-cutting critical findings

C1 — Stored XSS on the connection-detail page. (Frontend.md C1) detail.js interpolates t.direction raw into .innerHTML (detail.js:194/203), and the client-list builder escapes c.remote but not the shortTime(c.connectedAtUtc) result — whose catch branch returns the raw timestamp string verbatim (detail.js:86). The admin endpoint binds IPAddress.Any with no authentication, and the injected strings are device-/config-influenceable. A crafted value executes script in any operator's browser on the trusted segment. Fleet table (dashboard.js) escapes correctly — only the detail page breaks discipline. Fix: three escapeHtml calls, or switch the row builders to createElement/textContent.

C2 — Capture armed forever after a SignalR reconnect. (AdminSignalR.md C1) PlcSubscriptionTracker keys subscriber counts on ConnectionId, which changes on every transport reconnect. OnDisconnectedAsync for the old connection is unordered relative to the new connection's SubscribePlc, so a reconnect-during-view leaks the count and leaves a PLC's TagValueCapture armed with no viewer for the life of the process. The documented invariant "zero hot-path cost when nobody is watching" becomes false after any reconnect — every backend read and FC06/FC16 write then pays a FrozenDictionary lookup + TagValueObservation allocation, fleet-wide. DisarmAll() only fires on admin shutdown/port hot-reload, so the leak is never bounded in steady state.

C3 — Cache hits never reach Record(). (TagCapture.md C1) The ctx.Capture?.Record(...) calls live only in BcdPduPipeline.ProcessResponse, but the Phase-11 response-cache hit path (PlcMultiplexer.cs:823-828) returns cached post-rewrite bytes without invoking the pipeline. For any BCD tag with CacheTtlMs > 0, once the cache is warm the debug view freezes at the last cache-miss observation while AgeSeconds climbs — actively misleading an operator into thinking a live tag is dead. Feature-correctness, not a crash; caching is OFF by default so the default deployment is unaffected.

Major findings (consolidated)

  • M1 — Non-atomic SubscribePlc. (AdminSignalR.md C2) AddToGroupAsync then _tracker.Add span two awaits; a same-connection OnDisconnectedAsync can interleave and arm a capture on an already-gone connection. Same root cause as C2.
  • M2 — GetOrCreate lost-update race. (TagCapture.md M1) The AddOrUpdate delegate reads existing.IsArmed, but a concurrent detail-page open can land its Arm on the about-to-be-discarded instance — publishing the rebuilt capture disarmed under an open page, or leaking it armed with no viewer.
  • M3 — Recommended fix for C2/M1/M2: make PlcSubscriptionTracker's subscriber count the single authority for arm state — key it on a stable per-tab identifier (or count distinct viewers, not connections), and derive IsArmed from the count rather than carrying it on the transient capture.
  • M4 — StatusBroadcaster.LoopAsync has zero coverage. (TestsAndConfig.md) All four broadcaster tests call PushOnceAsync directly; the production push loop, its interval hot-reload re-read, the Math.Max(100,…) floor, and cancellation are unverified. The /hub/status endpoint is never exercised end-to-end.
  • M5 — DebugJsonContext is dead code; SignalR serializes via reflection System.Text.Json. (AdminSignalR.md) A latent AOT trap; the camelCase wire-shape guarantee has no test.
  • M6 — Bind failure after _broadcaster.Start() leaks the broadcaster loop and a bound listener — the catch block's cleanup is incomplete. (AdminSignalR.md)
  • M7 — Untested arm/disarm race in TagValueCapture. (TestsAndConfig.md) Disarm() flips _armed=false then clears slots; a Record() that wins the _armed check before Disarm runs leaves a stale observation on a disarmed capture. The torn-read test only races Record vs Snapshot, never vs Disarm.
  • M8 — Frontend cold-start retry loop layers an unbounded setTimeout retry over withAutomaticReconnect and can wedge if start() succeeds but invoke() fails. (Frontend.md M1)
  • M9 — Detail page never handles an unknown PLC name — sits forever on "Waiting for first snapshot…" with a green pill. (Frontend.md M2)
  1. C1 (XSS) — smallest fix, highest severity, ships in any operator-facing build. Do first.
  2. M3 — re-root capture arm/disarm authority in PlcSubscriptionTracker; closes C2, M1, M2, M7 together. Add a concurrency test for the tracker (currently has none).
  3. C3 — add a Record() call on the cache-hit path in PlcMultiplexer, or document the debug view as cache-blind. Decide explicitly.
  4. M4 — add an end-to-end /hub/status test (real HubConnection, assert a fleet/plc message and its camelCase shape — also closes the M5 gap) and a LoopAsync interval/cancellation test.
  5. M6, M8, M9 and the Minor findings in each subsystem file.

What looks good

  • IStatusPushSink is a genuine, well-placed testability seam.
  • TagValueCapture itself — lock-free, torn-read-safe via immutable records + Volatile.Write/Read, FrozenDictionary address map — is correct. The weakness is who arms it, not the structure.
  • Broadcaster per-cycle error isolation and lifecycle binding to the Kestrel app (no leak across port hot-reloads).
  • Fleet table (dashboard.js) escapes all dynamic content; reconnect→re-subscribe is correctly wired in both JS files; no CDN deps, no stray console.log.
  • StatusHtmlRenderer removed cleanly — no dangling source or test references.
  • csproj EmbeddedResource glob is correct (Worker SDK has no competing web default globs).
  • AdminPushIntervalMs validation matches house style across both validators.