diff --git a/mbproxy/codereviews/2026-05-15/AdminSignalR.md b/mbproxy/codereviews/2026-05-15/AdminSignalR.md new file mode 100644 index 0000000..689e487 --- /dev/null +++ b/mbproxy/codereviews/2026-05-15/AdminSignalR.md @@ -0,0 +1,93 @@ +# Admin SignalR Web Dashboard — Code Review + +Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard"), files `src/Mbproxy/Admin/{AdminEndpointHost,StatusHub,StatusBroadcaster,StatusPushSink,PlcSubscriptionTracker,StatusSnapshotBuilder,DebugDto}.cs`. Cross-checked against `docs/Operations/StatusPage.md`, `docs/Reference/LogEvents.md`, the mbproxy `CLAUDE.md`, and the supporting types `Proxy/TagCaptureRegistry.cs`, `Proxy/TagValueCapture.cs`, `Proxy/ProxyWorker.cs`, `HostingExtensions.cs`, `Configuration/ConfigReconciler.cs`. Tests under `tests/Mbproxy.Tests/Admin/{StatusHubTests,StatusBroadcasterTests}.cs`. + +## Summary + +- The decomposition is sound: `IStatusPushSink` cleanly isolates the push loop from SignalR, `PlcSubscriptionTracker` is correctly single-locked, and the broadcaster lifecycle is tied to the Kestrel app's lifecycle so an `AdminPort` hot-reload re-bind does not leak a second broadcaster. Per-cycle error handling in `PushOnceAsync` is genuinely defensive. +- The most serious problem is a **subscriber-count leak in `PlcSubscriptionTracker`**: a SignalR reconnect (transport drop without a clean close) increments a PLC's count on the new connection but the old connection's `OnDisconnectedAsync` decrement is not guaranteed, so a capture can be left armed forever — or, in the opposite race, double-armed/never-disarmed. Combined with the fact that capture arming has no reference to the *connection liveness*, the on-demand-capture invariant ("armed only while a viewer is open") is not actually upheld in the field. +- A **second real bug**: `StatusHub.SubscribePlc` is not atomic — `Groups.AddToGroupAsync` then `_tracker.Add` are two awaits with an interleaving point, and `OnDisconnectedAsync` can run on the same connection between them, producing a *negative-free but still wrong* state where the group membership outlives the tracker entry (capture disarmed while the page still receives pushes) or vice versa. +- `DebugJsonContext` (in `DebugDto.cs`) is dead code — defined, never referenced. SignalR serializes `PlcDetailResponse` via the reflection-based `System.Text.Json` path, not the source-gen context. Harmless today (no trimming/AOT in the csproj) but it is a misleading artifact and a latent trap if AOT is ever enabled. +- The documented contract is matched on the fleet path but the **detail push (`PlcDetailResponse`) is undocumented in the `status.json` schema** and is not reachable from `StatusJsonContext` — only over SignalR. That is by design, but the camelCase guarantee for it rests entirely on the hub's `AddJsonProtocol` config, with no test asserting the wire shape. + +## Critical findings + +**C1. `PlcSubscriptionTracker` leaks subscriber counts across SignalR reconnects — captures get stuck armed (or never armed).** `StatusHub.cs:48-54` / `PlcSubscriptionTracker.cs:29-73`. SignalR assigns a *new* `ConnectionId` on every transport reconnect (WebSocket drop, long-polling cycle, network blip). The client (`detail.js`) re-invokes `SubscribePlc` on reconnect. The sequence on a reconnect is: + +1. Old connection's transport dies. SignalR *eventually* fires `OnDisconnectedAsync` for the old `ConnectionId` — but this is **not** ordered relative to the new connection's `OnConnectedAsync`/`SubscribePlc`, and on an ungraceful drop it may be delayed by the server's keepalive/timeout window (default ~30 s) or, if the server is shutting down, may not fire deterministically at all. +2. New connection calls `SubscribePlc` → `_tracker.Add(newId, plc)` → count `1 → 2`, returns `false`, so no re-arm (fine). +3. Old connection's `OnDisconnectedAsync` runs late → count `2 → 1`. Capture stays armed. **Correct only if the order is (2) then (3).** + +The failure case: if the old connection's `OnDisconnectedAsync` is delayed past the *new* connection also disconnecting, or if the operator closes the tab during the reconnect window, the count never returns to 0 and **the capture is armed forever with no viewer** — exactly the hot-path cost the on-demand design exists to avoid. Over a long-running service with flaky operator networks this accumulates: every PLC ever viewed ends up permanently armed. `TagValueCapture.Record` is then a non-trivial cost (`FrozenDictionary` lookup + allocation of a `TagValueObservation` + `Volatile.Write`) on the backend reader task and every FC06/FC16 upstream task, fleet-wide, forever. + +The `StatusBroadcaster.StopAsync` → `DisarmAll()` safety net only fires on admin shutdown / `AdminPort` hot-reload, not during normal operation, so it does not bound this leak in a steady-state service. + +Fix: do not rely on `ConnectionId` lifetime as the capture-arming key. Either (a) key subscriptions on a stable client-supplied identity and treat reconnects as idempotent re-subscribes, or (b) drive disarm off the broadcaster: each cycle, `ActivePlcs()` is the live set; reconcile armed captures against it (`_captureRegistry` disarms any PLC not in `ActivePlcs()`), so a leaked count is self-healing within one push interval — but a leaked *count* still keeps `ActivePlcs()` returning the PLC, so (b) alone is insufficient. The robust fix is (a): give `PlcSubscriptionTracker.RemoveConnection` a periodic sweep against SignalR's live connection set, or add a TTL/heartbeat so a connection that has not pushed a keepalive in N intervals is reaped. At minimum, document the leak and have the broadcaster log when `ActivePlcs()` count exceeds the number of distinct connections it has seen. + +**C2. `SubscribePlc` is not atomic — group membership and tracker state can diverge under a concurrent disconnect.** `StatusHub.cs:48-54`: + +```csharp +public async Task SubscribePlc(string plcName) +{ + await Groups.AddToGroupAsync(Context.ConnectionId, PlcGroup(plcName)).ConfigureAwait(false); + if (_tracker.Add(Context.ConnectionId, plcName)) + _captureRegistry.Arm(plcName); +} +``` + +There are two awaits' worth of interleaving here. SignalR will dispatch `OnDisconnectedAsync` for the same connection if the transport drops while `SubscribePlc` is mid-flight (hub method invocations and the disconnect callback are not mutually exclusive — only invocations on the *same* connection are serialized *with each other*, and `OnDisconnectedAsync` is a separate dispatch path). Concretely: + +- Connection drops after `AddToGroupAsync` completes but before `_tracker.Add`. `OnDisconnectedAsync` runs `_tracker.RemoveConnection(id)` → returns empty (nothing tracked yet). Then `SubscribePlc` resumes, `_tracker.Add` runs → count `0 → 1`, returns `true` → **`_captureRegistry.Arm(plcName)` on a connection that is already gone.** The capture is now armed with a phantom viewer. This is the same stuck-armed leak as C1, reached by a tighter race. +- The mirror case leaves a group membership with no tracker entry; benign for pushes (the dead connection just never receives them) but it confirms the two data structures are not kept consistent. + +Fix: make subscribe/unsubscribe go through a single critical section that also checks connection liveness, or — simpler — register the tracker entry *first* (synchronous, under its own lock), then `AddToGroupAsync`, and have `OnDisconnectedAsync` always run last and unconditionally. Even then the arm/disarm must be idempotent and reconciled against the live connection set (see C1). The cleanest fix addresses C1 and C2 together: treat `OnConnectedAsync`/`OnDisconnectedAsync` as the *only* mutators of tracker state, make `SubscribePlc` only adjust group membership, and arm/disarm from a periodic reconciliation in the broadcaster against SignalR's actual group/connection state. + +## Major findings + +**M1. `DebugJsonContext` is dead code; the detail payload is not source-gen serialized.** `DebugDto.cs:54-61` declares `[JsonSerializable(typeof(PlcDetailResponse))] ... DebugJsonContext`, but nothing references `DebugJsonContext.Default` anywhere in the tree (grep confirms zero usages). `AdminEndpointHost.cs:192-196` configures the SignalR JSON protocol with only a `PropertyNamingPolicy` — no `TypeInfoResolver` — so SignalR serializes `PlcDetailResponse` through the **reflection-based** `System.Text.Json` resolver. The `status.json` route uses `StatusJsonContext` (source-gen); the SignalR path does not use any source-gen context. Today this works (the csproj sets neither `PublishTrimmed` nor `PublishAot`), but: (a) `DebugJsonContext` is misleading dead code that implies the detail payload is AOT-safe when it is not; (b) if trimming/AOT is ever turned on, every SignalR push of `PlcDetailResponse` *and* of `StatusResponse` (the `"fleet"` message also goes through the reflection resolver — `StatusJsonContext` is only wired into the `status.json` `JsonSerializer.Serialize` call, not into `AddJsonProtocol`) will throw or silently emit `{}`. Fix: either delete `DebugJsonContext` and document that the SignalR path is reflection-only, or — better — wire both contexts into the hub via `AddJsonProtocol(o => o.PayloadSerializerOptions.TypeInfoResolverChain.Insert(0, StatusJsonContext.Default); ...Insert(1, DebugJsonContext.Default))` so the SignalR path is consistent with `status.json` and trim-safe. Note `DebugJsonContext` would also need `StatusResponse`/`PlcStatus` reachable since `PlcDetailResponse` embeds `PlcStatus?`. + +**M2. No test asserts the SignalR wire shape is camelCase.** `AdminEndpointHost.cs:194-196` is the *only* thing making the live feed's JSON match the documented `status.json` contract (`docs/Operations/StatusPage.md` repeatedly states "camelCase property names"). `StatusBroadcasterTests` uses `FakeStatusPushSink` and never serializes; `StatusHubTests` uses fakes. If someone removes or mis-edits the `AddJsonProtocol` lambda, every dashboard field silently becomes `PascalCase` and the JS (`dashboard.js`/`detail.js` expecting `service.uptimeSeconds` etc.) breaks with no failing test. The reflection-vs-source-gen split in M1 makes this worse — the two endpoints' naming is configured in two unrelated places. Add an integration test that stands up the hub (or at least serializes `PlcDetailResponse`/`StatusResponse` with the exact `PayloadSerializerOptions` the hub builds) and asserts `uptimeSeconds`/`captureArmed` appear lowercase-first. + +**M3. `StatusBroadcaster.PushOnceAsync` swallows a cancelled per-PLC push but keeps looping over remaining PLCs.** `StatusBroadcaster.cs:97-110`. The `catch (Exception ex) when (ex is not OperationCanceledException)` correctly lets cancellation propagate — but only out of the `await _sink.PushPlcAsync`. If `ct` is cancelled mid-loop, the `OperationCanceledException` escapes `PushOnceAsync`, which is caught by `LoopAsync`'s outer `catch (OperationCanceledException)` — fine. However, if `_builder.BuildDebug(plcName)` (synchronous, `StatusSnapshotBuilder.cs:44`) throws a *non-OCE* exception for one PLC, the `catch` logs and the loop continues — good. But `snapshot.Plcs.FirstOrDefault(...)` is re-run inside the `try` for every PLC: an O(N) scan per PLC over the `Plcs` list = O(N²) per push cycle. For 54 PLCs that is 2,916 comparisons per second — trivially cheap here, but flagged because the fleet snapshot is already a `List` and a `Dictionary` projection once per cycle would be cleaner and removes the quadratic. Minor on its own; grouped here because it sits in the same loop as a real concern: `BuildDebug` is called for every active PLC even when `snapshot.Plcs` has no matching entry (PLC removed by hot-reload) — that path is handled (`Plc: null`) but worth a test. + +**M4. `StatusBroadcaster.Start()` is fire-and-forget with no guard against being called twice.** `StatusBroadcaster.cs:51-52`: `public void Start() => _loop = Task.Run(...)`. The XML doc says "Idempotent only in the sense that it is called once" — but nothing *enforces* that. A second `Start()` overwrites `_loop`, orphaning the first loop task (it keeps running against the same `_cts`, so two loops now push concurrently every interval until cancellation). `AdminEndpointHost` only ever calls it once per `StartAppAsync`, and a new `StatusBroadcaster` is constructed each re-bind, so this is not hit today — but a one-line guard (`if (_loop is not { IsCompleted: true } and not null) throw`/return, or an `Interlocked` flag) would make the class safe to misuse. Also: `Task.Run` returns a task whose faults are observed only via `_loop` being awaited in `StopAsync`; if `StopAsync` is never called (e.g. `StartAppAsync` throws *after* `_broadcaster.Start()` but the catch at `AdminEndpointHost.cs:253` sets `_app = null` without disposing `_broadcaster`) the loop task and its `_cts` leak. See M5. + +**M5. A bind failure after `_broadcaster.Start()` leaks the broadcaster and its push loop.** `AdminEndpointHost.cs:237-258`. `StartAppAsync` does `await app.StartAsync(ct)` (line 237), then `_app = app`, then constructs and `Start()`s `_broadcaster` (lines 242-249), then `LogAdminStarted`. The whole body is wrapped in `catch (Exception ex) when (ex is not OperationCanceledException)` (line 253) which logs `mbproxy.admin.bind.failed` and sets `_app = null`. If anything between line 238 and 251 throws — e.g. `app.Services.GetRequiredService>()` fails, or the `StatusBroadcaster` constructor throws, or `LogAdminStarted` somehow throws — the catch sets `_app = null` but **does not stop the already-started `_broadcaster` nor the already-started Kestrel `app`**. The push loop keeps running forever against a sink whose hub is on a Kestrel app that `StopCurrentAppAsync` will never see (`_app` is null, `_broadcaster` field may or may not be set depending on where the throw landed). Result: a leaked Kestrel listener still bound to the port, plus a leaked broadcaster loop. The probability is low (those calls rarely throw) but the catch's cleanup is incomplete. Fix: in the catch, best-effort stop/dispose whatever was started — mirror `StopCurrentAppAsync`'s logic, or wrap the post-`StartAsync` section so a failure tears down `app` and `_broadcaster` before nulling the fields. + +**M6. `PushOnceAsync` builds the debug snapshot for a PLC whose subscriber count is stale.** Tied to C1/C2: `ActivePlcs()` (`PlcSubscriptionTracker.cs:76-82`) returns a key snapshot, and the broadcaster pushes a `"plc"` message to `PlcGroup(plcName)` for each. If the count is leaked-high (C1), the broadcaster pushes to an empty SignalR group every cycle forever — cheap (SignalR no-ops an empty group) but it also keeps calling `_builder.BuildDebug(plcName)` and, more importantly, the *capture stays armed* because nothing disarms it. This is the observable symptom of C1 inside the broadcaster. Recording it separately because the fix (broadcaster-side reconciliation) lives here. + +## Minor findings + +**N1. `StatusBroadcaster` does not use a stable log event name.** `StatusBroadcaster.cs:84,94,108,133` all call `_logger.LogError(ex, "StatusBroadcaster: ...")` with free-text messages and no `EventId`/`EventName`. Every other component in this codebase uses `[LoggerMessage]` source-gen with a stable `mbproxy.*` event name catalogued in `docs/Reference/LogEvents.md` (e.g. `mbproxy.admin.started` EventId 70, `mbproxy.admin.bind.failed` EventId 71). The broadcaster's "loop terminated unexpectedly" at line 133 is exactly the kind of event an operator would alert on, and it is invisible to event-name-based log queries. Add `[LoggerMessage]` entries (e.g. `mbproxy.admin.broadcast.failed`, `mbproxy.admin.broadcast.loop.terminated`) and register them in `LogEvents.md`. + +**N2. `LoopAsync` does `Task.Delay` *before* the first push.** `StatusBroadcaster.cs:122-124`: the loop delays `interval` ms, then pushes. The first dashboard client therefore waits up to `AdminPushIntervalMs` (default 1000 ms) for its first `"fleet"` message even though a snapshot is available immediately. `index.html`/`dashboard.js` presumably also fetch `status.json` or render empty until the first push — but a push-before-delay (or an immediate `PushOnceAsync` before entering the loop) would make the dashboard populate instantly on open. Cosmetic, but easy. + +**N3. `StatusBroadcaster.StopAsync` is not idempotent-safe against the `_loop` await.** `StatusBroadcaster.cs:57-72`: if `StopAsync` is called twice (it is reachable: `DisposeAsync` calls `StopAsync`, and `AdminEndpointHost.StopCurrentAppAsync` calls `broadcaster.DisposeAsync()` which calls `StopAsync` — only one path per instance today, but). The second call: `_cts.IsCancellationRequested` is true so it skips `CancelAsync`, then `await _loop` again (a completed task — fine), then `_captureRegistry.DisarmAll()` again (fine). Benign, but `DisposeAsync` at line 137-141 then calls `_cts.Dispose()`; a *third* path touching `_cts` after dispose would throw `ObjectDisposedException`. No live bug, but the class lacks the `_disposed` guard that `AdminEndpointHost` itself carries (and whose absence the `AdminEndpointHost` comment at lines 59-64 explicitly calls out as a regression risk). Add a `_stopped`/`_disposed` flag for symmetry. + +**N4. `OnDisconnectedAsync` does its cleanup *before* `base.OnDisconnectedAsync`.** `StatusHub.cs:60-66`. The capture disarm runs first, then `base.OnDisconnectedAsync(exception)`. This is the correct order (you want to release resources before the base teardown) and the disarm is synchronous so it cannot be lost — but note there is **no `OnConnectedAsync` override**, which is fine, and no try/finally around the `foreach`. If `_captureRegistry.Disarm` threw (it cannot — it is a dictionary lookup + volatile write), the remaining PLCs in the connection's set would not be disarmed and `base.OnDisconnectedAsync` would be skipped. Defensive only; `Disarm` is genuinely no-throw today. + +**N5. `PlcSubscriptionTracker.ActivePlcs()` allocates a fresh array every push cycle.** `PlcSubscriptionTracker.cs:80`: `_plcCounts.Keys.ToArray()` under the lock, once per `AdminPushIntervalMs`. With the common case of zero detail-page viewers it correctly returns `Array.Empty()` (line 80 short-circuits on `Count == 0`). Only allocates when someone is viewing. Fine — flagged only as a known per-cycle allocation if push interval is ever lowered aggressively. + +**N6. `ServeHtmlShell` / asset routes have no `HEAD` handling and `/plc/{name}` ignores `name`.** `AdminEndpointHost.cs:210`: `app.MapGet("/plc/{name}", (string name, HttpContext ctx) => ServeHtmlShell(ctx, "plc.html"))` — `name` is bound but unused (the page reads it client-side from the URL). Harmless, but the unused parameter will draw a compiler/analyzer warning under this project's `TreatWarningsAsErrors` unless suppressed; verify it builds. If it does build clean, fine — `MapGet` route parameters are not flagged as unused. Minor; mentioned for the reviewer to confirm against CI. + +**N7. The detail payload's `PlcDetailResponse` shape is undocumented as part of the `/status.json` contract but the doc table at `StatusPage.md:317-328` does describe it.** Actually documented — withdrawing the "undocumented" concern from the Summary's last bullet to this extent: the *fields* are in `StatusPage.md`. What is genuinely missing is a statement that this payload travels **only over SignalR** and is **not** reachable at any `GET` route, and that its serialization path differs from `status.json` (M1). One sentence in `StatusPage.md`'s "Debug View Data" section would close it. + +## What looks good + +- **Broadcaster lifecycle is correctly bound to the Kestrel app, not the host.** `AdminEndpointHost.cs:242-249` creates a fresh `StatusBroadcaster` inside `StartAppAsync` and `StopCurrentAppAsync` (lines 268-279) disposes it *before* stopping Kestrel. An `AdminPort` hot-reload therefore tears down the old broadcaster and starts a new one — no broadcaster leak across re-binds, and the `DisarmAll()` in `StopAsync` ensures the re-bind does not strand an armed capture. This directly answers the open question N3-style concern from the 2026-05-14 `AdminAndDiagnostics` review about provider/loop leaks across re-binds. +- **`IStatusPushSink` is a clean seam.** Defining the outbound side as an interface (`StatusPushSink.cs`) lets `StatusBroadcasterTests` exercise the full push-cycle logic (fleet always, per-PLC only for active PLCs, disarm-on-stop) with a recording fake and zero SignalR host — and the tests actually do this. Good testability design. +- **`PlcSubscriptionTracker` locking is correct *as a data structure*.** Single `_gate`, every method takes it, the count transitions (`0→1` returns arm-signal, `1→0` returns disarm-signal) are computed under the lock, and `RemoveConnection` correctly handles the multi-PLC-per-connection case. The bug (C1/C2) is not in the locking — it is that `ConnectionId` is the wrong key for a *lifetime* and the tracker is mutated from two un-serialized hub dispatch paths. The class itself is internally consistent. +- **`PushOnceAsync` per-stage error isolation.** `StatusBroadcaster.cs:78-110` wraps snapshot-build, fleet-push, and each per-PLC push in their own `try/catch` so one PLC's failure does not abort the cycle and a snapshot-build failure does not kill the loop. The `when (ex is not OperationCanceledException)` filters correctly let shutdown cancellation propagate to `LoopAsync`'s handler. This is the right shape. +- **`LoopAsync` re-reads `AdminPushIntervalMs` every cycle** (`StatusBroadcaster.cs:122`) so a hot-reload of the interval takes effect without restarting the loop, and floors it at 100 ms so a bad value cannot spin the CPU. Matches the hot-reload-everything posture in `CLAUDE.md`. +- **`TagValueCapture` concurrency is genuinely lock-free-correct.** `Volatile.Write`/`Volatile.Read` of references to an immutable `record` (`TagValueObservation`), `_armed` is `volatile`, and `Record` short-circuits on `!_armed` with a single volatile read before any work — so the disarmed hot path cost is one bool read, as advertised. `Disarm` clears slots so a re-arm shows only fresh data. This part of the design is solid; the weakness is *who calls Arm/Disarm and when* (C1/C2), not the capture itself. +- **`StatusSnapshotBuilder.BuildDebug` degrades gracefully for an unknown PLC** (`StatusSnapshotBuilder.cs:44-47`) — returns a disarmed empty snapshot rather than throwing, which is the correct behavior for a detail page open on a hot-reload-removed PLC, and `PlcDetailResponse.Plc` is nullable to carry that state. `ConfigReconciler.cs:259` calls `_captureRegistry.Remove(name)` on PLC removal, so the registry and the config stay consistent. +- **`StatusHub.SubscribePlc` for an unknown PLC is a documented no-op** (`StatusHub.cs:53`, `TagCaptureRegistry.Arm` no-ops a missing key) and `StatusHubTests.SubscribePlc_UnknownPlc_DoesNotThrow_AndArmsNothing` covers it. A hub method throwing would be sent to the caller as a hub error; this path correctly does not throw. +- **Asset serving is safe.** `AdminEndpointHost.cs:212-226` rejects `/`, `\`, and `..` in the asset path segment before touching `GetManifestResourceStream`, caches bytes and misses in a `static ConcurrentDictionary` shared across app re-builds, and sets `immutable` cache headers for content-addressed assets vs. `no-cache` for the HTML shells. Embedded resources mean no filesystem traversal surface at all. + +## Open questions + +1. **C1 field impact:** how often do operators' browsers reconnect in this deployment? If the dashboard is on a stable internal segment and detail pages are short-lived, the leak is slow — but it is unbounded and never self-heals in a steady-state service. Is there an existing process-restart cadence that masks it? Either way the invariant in `StatusPage.md:315` ("the hot path carries zero cost when nobody is watching") is currently false after any reconnect-during-view. +2. **C2 / hub dispatch model:** confirm against the SignalR version in use whether `OnDisconnectedAsync` for a connection can overlap an in-flight `SubscribePlc` invocation on that *same* connection. If SignalR guarantees `OnDisconnectedAsync` runs only after all in-flight invocations for that connection complete, C2's same-connection race narrows to the cross-connection (reconnect) race in C1 — still a bug, but the fix scope shrinks. +3. **M1:** is trimming/AOT on the roadmap for this service? `CLAUDE.md` mentions single-file self-contained publish but not trimming. If AOT is ever planned, M1 is upgraded to Critical (the SignalR reflection JSON path will break) and `DebugJsonContext` must be wired in, not deleted. +4. **M5:** has the post-`StartAsync` failure path (e.g. `GetRequiredService>` failing) ever been observed? It is low-probability, but the catch block's cleanup is provably incomplete — worth a deliberate decision to either fix it or document it as accepted. +5. Is there any reason `StatusJsonContext.Default` is not also wired into the hub's `AddJsonProtocol` so the fleet `"fleet"` push and `GET /status.json` share one serialization path and one camelCase configuration point (M1/M2)? diff --git a/mbproxy/codereviews/2026-05-15/Frontend.md b/mbproxy/codereviews/2026-05-15/Frontend.md new file mode 100644 index 0000000..10acde6 --- /dev/null +++ b/mbproxy/codereviews/2026-05-15/Frontend.md @@ -0,0 +1,96 @@ +# Frontend / Live Web Dashboard — Code Review + +Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard"). Files reviewed — all under `src/Mbproxy/Admin/wwwroot/`: +`index.html`, `plc.html`, `dashboard.js`, `detail.js`, `theme.css`, `dashboard.css`, `detail.css`. Vendored assets (`bootstrap.min.css`, `bootstrap.bundle.min.js`, `signalr.min.js`, `*.woff2`) explicitly out of scope. Cross-checked against `docs/Operations/StatusPage.md` (wire contract), `CLAUDE.md` (mbproxy), and the server side `StatusHub.cs` / `StatusBroadcaster.cs` for the SignalR contract. + +## Summary + +- The dashboard is well-built vanilla JS: thoughtful escaping helpers, clean rate computation, sensible reconnection wiring, and a correct subscribe-on-reconnect pattern that keeps the on-demand tag capture armed/disarmed correctly. +- **The single most important finding is a real, exploitable XSS hole on the detail page**: `detail.js` renders `t.rawHex`, `t.address`, `t.width`, `t.direction`, `c.remote`, and `plc.host` straight into `innerHTML` **without escaping**. `escapeHtml` exists in the file but is applied inconsistently — several attacker/PLC-influenceable fields bypass it. The fleet table (`dashboard.js`) is escaped correctly throughout; the detail page is not. This is a Critical finding. +- A second Major issue: on a *cold* SignalR failure the page enters a `setTimeout(start, 3000)` retry loop that **builds a brand-new `HubConnection` on every retry** — no, it reuses the same connection object, but it never tears down handlers, and combined with `withAutomaticReconnect` the failure/retry semantics are muddled. There is also no upper bound and no UX past "retrying". +- DOM updates are full-`innerHTML` re-renders of the whole table body every ~1 s. Correct and flicker-free for 54 rows, but it blows away focus/selection and is wasteful; acceptable at this scale, flagged. + +## Critical findings + +**C1. Stored/reflected XSS on the connection-detail page — multiple unescaped fields rendered via `innerHTML`.** `detail.js` defines `escapeHtml` (line 23) and uses it for *some* fields but omits it for several others, all of which are interpolated into template strings later assigned to `.innerHTML`: + +- **`detail.js:194` and `:203`** — the debug-row builder: + ```js + ${t.address} ${hex4(t.address)} + ${t.width}-bit + ... + ${t.direction} + ``` + `t.direction` is a server string (`"read"`/`"write"` per the schema) interpolated raw into both an attribute-ish context and element text. `t.address`/`t.width` are typed `int` in the documented DTO, so they are lower risk — but the code does not coerce them (`Number(...)`) or escape them, so it relies entirely on the server DTO type. `t.direction` is a free `string` on the wire and is **not escaped**. +- **`detail.js:86`** — the client list: + ```js + `
${escapeHtml(c.remote)}` + + ` · ${num(c.pdusForwarded)} PDUs · since ${shortTime(c.connectedAtUtc)}
` + ``` + `c.remote` *is* escaped (good), but `shortTime(c.connectedAtUtc)` is **not** — and `shortTime` (line 36) has a `catch { return iso; }` branch that returns the **raw, unescaped `connectedAtUtc` string** verbatim when `new Date(iso)` parsing throws or the value is non-ISO. A malformed/attacker-controlled timestamp string therefore lands unescaped inside `innerHTML`. +- **`detail.js:203`** — `escapeHtml(t.rawHex)` *is* applied (good), and `num(t.decodedValue)` is numeric-safe. So `rawHex` is the one debug field that is handled correctly. (Correcting the brief: `rawHex` is escaped; `direction` and the timestamp path are the live holes.) +- **`detail.js:65`** — `$('plc-sub').textContent = `${plc.host}:${plc.listenPort}`;` is **safe** (`textContent`). But `detail.js:14–16` does the same for `plcName` via `textContent` — also safe. So the identity header is fine; the holes are specifically the `innerHTML` card/table builders. + +Severity rationale: per `docs/Operations/StatusPage.md` the admin endpoint binds `IPAddress.Any` with **no authentication** ("Authentication lives at the network layer"). PLC `name`/`host` and backend-derived strings come off the Modbus wire / `appsettings.json` and are operator- or device-influenceable. A PLC named or a backend that returns a crafted string can inject `` into any operator's browser session on the trusted segment. Read-only UI does not mean low impact: the injected script runs with the operator's origin. + +Concrete fix: route **every** dynamic value through `escapeHtml` before it enters an `innerHTML` template, with **no exceptions**: +- `detail.js:194` / `:203`: wrap `t.direction` in `escapeHtml(...)`; coerce `t.address`/`t.width` with `Number(...)` (or escape) rather than trusting the DTO type. +- `detail.js:86`: wrap the `shortTime(...)` result in `escapeHtml(...)`, and/or change `shortTime`'s fallback to `escapeHtml(iso)` / return `'—'`. +- Add a single `escapeAttr` helper (as `dashboard.js` has at line 179) and use it for the `dirCls`/class interpolation at `:197` if `dirCls` ever becomes data-derived (currently it is a literal, so low priority — but `card()`'s `cls` parameter at `detail.js:45` is interpolated into `class="v ${cls||''}"` and is caller-supplied; today all callers pass literals, so it is safe *now* but fragile). +- Better still: stop hand-building HTML. Build rows with `document.createElement` + `textContent`/`dataset`, which makes escaping structural rather than a discipline that one missed call defeats. + +## Major findings + +**M1. Cold-start retry loop layers a manual `setTimeout` retry on top of `withAutomaticReconnect`, with no bound and a misleading pill.** `dashboard.js:252–263` and the identical `detail.js:236–245`: +```js +async function start() { + try { ...; await connection.start(); await connection.invoke('SubscribeFleet'); ... } + catch { setConn('disconnected','retrying'); setTimeout(start, 3000); } +} +``` +`withAutomaticReconnect` only covers a connection that *was* established and then dropped — it does **not** retry the initial `start()`. So the manual loop is needed. But: (a) it retries **forever** every 3 s with no backoff and no cap — if the hub is permanently gone the browser hammers it indefinitely; (b) if `connection.start()` succeeds but `invoke('SubscribeFleet')` throws, the `catch` calls `start()` again on an **already-started** connection, which will throw "cannot start a connection that is not in the Disconnected state" and wedge the retry loop; (c) the pill shows `disconnected`/`retrying` which is reasonable, but there is no terminal "giving up" state and no way for an operator to know the difference between "server down" and "server slow". Fix: separate "start the connection" from "subscribe" so a subscribe failure doesn't restart the socket; add capped exponential backoff; guard `start()` against re-entry when `connection.state !== 'Disconnected'`. + +**M2. Detail page never handles "PLC name not in fleet" vs. "hub never delivers".** `detail.js` subscribes with `SubscribePlc(plcName)` where `plcName` is taken from `location.pathname`. If the name is wrong (typo, stale bookmark, or a name that never existed), `StatusHub.SubscribePlc` happily adds the connection to a `plc:{bogus}` group and `_captureRegistry.Arm` is a documented no-op — so the server **never sends a `plc` message** and `onDetail` never fires. The page sits forever on "Waiting for first snapshot…" with a green `connected` pill. There is no timeout, no "unknown PLC" state. The `renderMissing()` path (`detail.js:168`) only triggers when the server *does* push a payload with `detail.plc === null` (PLC removed by hot-reload) — it cannot fire for a name that was never configured because nothing is pushed at all. Fix: after `connected`, start a watchdog (e.g. 3× the push interval); if no `plc` message arrives, show an "unknown or unreachable PLC" notice. + +**M3. Fleet PDU/s rate is wrong for the first snapshot after any reconnect, and silently leaks `prevPdu`/`rateByName` entries for removed PLCs.** `dashboard.js:65–77` `updateRates` keys `prevPdu`/`rateByName` by `plc.name` and never removes entries. On hot-reload removal of a PLC, its stale rate stays in `rateByName` forever and is still summed into the fleet rate? — no, `renderAggregates` only iterates `s.plcs`, so a removed PLC drops out of the *sum*; but the Map still grows unboundedly across the process lifetime if PLCs churn. More importantly, after a SignalR reconnect the counters are *cumulative since service start*, so `cur - prev.forwarded` across a multi-second reconnect gap produces a correct (if coarse) rate — that part is fine. The real bug: `performance.now()` is monotonic per-page, so that's fine too. Net: low-grade Map leak on PLC churn; prune `prevPdu`/`rateByName` to the current `snapshot.plcs` name set each cycle. + +**M4. Full-table-body `innerHTML` re-render on every ~1 s push destroys transient UI state.** `dashboard.js:155–173` rebuilds the entire `` string and assigns `tbody.innerHTML`. Consequences at the 1 s cadence: any text selection inside the table is lost every second; `:hover` is re-evaluated (minor); the browser re-parses ~54 rows of HTML each tick. It is *not* a flicker source (synchronous replace, no intermediate paint) and 54 rows is cheap, so this is acceptable — but it is the kind of thing that becomes a problem if columns/rows grow. A keyed diff (update existing `` cells in place, add/remove only on set change) would also fix the selection-loss annoyance. Flagged as Major because of the selection-loss UX regression on a screen operators stare at; the brief explicitly asked about this. + +## Minor findings + +**N1. `detail.js` has no `escapeAttr` and `dashboard.js`'s is duplicated.** Both files define their own `escapeHtml` (`dashboard.js:176`, `detail.js:23`) — identical code, copy-pasted. `escapeAttr` exists only in `dashboard.js:179`. Since both files are separately served and there is no shared module, duplication is the pragmatic choice for a no-build-step project, but the *inconsistency* (detail.js lacking `escapeAttr`) is exactly what enabled C1. Recommend a tiny shared `util.js` served from `/assets/`, or at minimum make the two `escapeHtml` definitions and an `escapeAttr` identical and present in both. + +**N2. Accessibility gaps.** +- The sortable `` elements (`index.html:72–81`) are clickable via a JS `click` handler but are not keyboard-focusable and carry no `role="button"`/`tabindex="0"`/`aria-sort`. A keyboard-only operator cannot sort the table. Add `tabindex="0"`, `aria-sort` reflecting `sorted-asc`/`sorted-desc`, and a `keydown` (Enter/Space) handler. +- Table rows are clickable (`dashboard.js:232`) to open a detail page but are `` with `cursor:pointer` only — not keyboard-reachable and not announced as interactive. Consider making the PLC-name cell an actual `` so it is focusable, middle-click/ctrl-click works natively, and screen readers announce it. This would also remove the need for the JS `window.open` handler. +- The connection-state pill (`#conn`) updates visually but has no `aria-live` region, so a screen-reader user is never told the hub dropped. Add `aria-live="polite"` to `#conn` or to `#conn-text`. +- `` has a `placeholder` but no associated `` and delete the handler. (`rel="noopener"` also matters — `window.open` without it leaves `window.opener` live; here the opened page is same-origin so impact is low, but it is still best practice.) + +**N4. `detail.js` `plcName` parsing is brittle.** `detail.js:10`: `decodeURIComponent(location.pathname.replace(/^\/plc\//, ''))`. If the route is ever served under a path prefix, or the name itself contains an encoded `/`, this misparses. Also if `decodeURIComponent` throws (malformed `%` sequence in the URL) the whole script aborts at line 10 before `connect()` is ever reached — the page is then blank with no error. Wrap in try/catch and fall back to a visible error state. + +**N5. `dashboard.js:101` and `:106` — fleet PDU/s shows `—` until the *second* snapshot.** Expected (rate needs two samples) and correctly handled, but the aggregate card shows `—` for the first ~1 s while the table already shows per-row `—` rates. Cosmetic; no fix required, noted for completeness. + +**N6. No `console.log`, no hardcoded absolute URLs, no obvious dead code.** All asset/hub URLs are root-relative (`/assets/...`, `/hub/status`, `/plc/...`) — correct, survives any host/port. `card()`'s `extra` parameter and the `cls` third element of card rows are lightly-used but not dead. Clean on this axis. + +**N7. `formatUptime` / `formatAge` silently misrender negative or NaN input.** `dashboard.js:199` and `detail.js:28`: if `uptimeSeconds`/`ageSeconds` ever arrive negative (clock skew) or non-numeric, `Math.floor` yields `NaN` and the card shows `NaN`. Low risk given server types; a `Number.isFinite` guard returning `'—'` is cheap insurance. + +## What looks good + +- **Reconnect → re-subscribe is correct.** `dashboard.js:249` and `detail.js:233` both re-`invoke('SubscribeFleet'/'SubscribePlc')` inside `onreconnected`. This is essential and easy to forget: SignalR auto-reconnect gives a new `ConnectionId`, server-side group membership does **not** survive, and the detail page's tag capture is armed per-connection — without the re-subscribe the capture would silently disarm on every transient drop. Verified against `StatusHub.SubscribePlc`/`OnDisconnectedAsync` and `StatusBroadcaster` group targeting — the lifecycle is sound. +- **`onclose` correctly does not re-subscribe** — it just sets the pill; `withAutomaticReconnect`'s own loop owns recovery. No double-retry on the warm path. +- **Single `HubConnection` per page, closed implicitly on navigation.** Opening a detail page in a new tab (`window.open`) means each tab owns exactly one hub connection; closing the tab fires `OnDisconnectedAsync` server-side which disarms the capture. No connection leak across navigation. +- **The fleet table escapes correctly.** `dashboard.js:161–163` routes `plc.name` through `escapeAttr` for the `data-name` attribute and `escapeHtml` for cell text; `plc.host` through `escapeHtml`; `plc.listener.lastBindError` through `escapeAttr` for the `title=` attribute (line 160). This is the right discipline — it is only `detail.js` that fails to match it (C1). +- **`escapeHtml` is a correct minimal implementation** — `&`, `<`, `>` cover element-text contexts; `escapeAttr` adds `"`. Order matters (`&` first) and is correct. +- **Rate computation is robust.** `dashboard.js:70` guards `now > prev.t`, `Math.max(0, ...)` clamps a counter reset/reconnect to a non-negative rate, and `performance.now()` (monotonic) is the right clock for deltas — not `Date.now()`. +- **Filter is correct and cheap.** `visiblePlcs` (`dashboard.js:130`) filters a `.slice()` copy (never mutates the snapshot), search is case-folded once, and the sort has a stable `localeCompare` tiebreaker by name. At 54 rows the filter+sort+render per keystroke is sub-millisecond. It correctly survives live updates because `render()` always re-derives from `latest` + current `filter` state. +- **`renderMissing()` hot-reload path.** The detail page genuinely handles a PLC disappearing mid-session (`detail.js:168`) — `notice` shown, cards cleared and `hidden`. Good defensive UX for the hot-reload scenario (the gap is the *never-existed* name, see M2). +- **No CDN dependencies** — every `