Files
wwtools/mbproxy/codereviews/2026-05-15/AdminSignalR.md
Joseph Doherty 554b05d28c mbproxy: fix dashboard review findings, add named BCD tags + fleet config
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 03:39:39 -04:00

24 KiB

Admin SignalR Web Dashboard — Code Review

Scope: commit e719dd5 ("replace status page with a live SignalR web dashboard"), files src/Mbproxy/Admin/{AdminEndpointHost,StatusHub,StatusBroadcaster,StatusPushSink,PlcSubscriptionTracker,StatusSnapshotBuilder,DebugDto}.cs. Cross-checked against docs/Operations/StatusPage.md, docs/Reference/LogEvents.md, the mbproxy CLAUDE.md, and the supporting types Proxy/TagCaptureRegistry.cs, Proxy/TagValueCapture.cs, Proxy/ProxyWorker.cs, HostingExtensions.cs, Configuration/ConfigReconciler.cs. Tests under tests/Mbproxy.Tests/Admin/{StatusHubTests,StatusBroadcasterTests}.cs.

Summary

  • The decomposition is sound: IStatusPushSink cleanly isolates the push loop from SignalR, PlcSubscriptionTracker is correctly single-locked, and the broadcaster lifecycle is tied to the Kestrel app's lifecycle so an AdminPort hot-reload re-bind does not leak a second broadcaster. Per-cycle error handling in PushOnceAsync is genuinely defensive.
  • The most serious problem is a subscriber-count leak in PlcSubscriptionTracker: a SignalR reconnect (transport drop without a clean close) increments a PLC's count on the new connection but the old connection's OnDisconnectedAsync decrement is not guaranteed, so a capture can be left armed forever — or, in the opposite race, double-armed/never-disarmed. Combined with the fact that capture arming has no reference to the connection liveness, the on-demand-capture invariant ("armed only while a viewer is open") is not actually upheld in the field.
  • A second real bug: StatusHub.SubscribePlc is not atomic — Groups.AddToGroupAsync then _tracker.Add are two awaits with an interleaving point, and OnDisconnectedAsync can run on the same connection between them, producing a negative-free but still wrong state where the group membership outlives the tracker entry (capture disarmed while the page still receives pushes) or vice versa.
  • DebugJsonContext (in DebugDto.cs) is dead code — defined, never referenced. SignalR serializes PlcDetailResponse via the reflection-based System.Text.Json path, not the source-gen context. Harmless today (no trimming/AOT in the csproj) but it is a misleading artifact and a latent trap if AOT is ever enabled.
  • The documented contract is matched on the fleet path but the detail push (PlcDetailResponse) is undocumented in the status.json schema and is not reachable from StatusJsonContext — only over SignalR. That is by design, but the camelCase guarantee for it rests entirely on the hub's AddJsonProtocol config, with no test asserting the wire shape.

Critical findings

C1. PlcSubscriptionTracker leaks subscriber counts across SignalR reconnects — captures get stuck armed (or never armed). StatusHub.cs:48-54 / PlcSubscriptionTracker.cs:29-73. SignalR assigns a new ConnectionId on every transport reconnect (WebSocket drop, long-polling cycle, network blip). The client (detail.js) re-invokes SubscribePlc on reconnect. The sequence on a reconnect is:

  1. Old connection's transport dies. SignalR eventually fires OnDisconnectedAsync for the old ConnectionId — but this is not ordered relative to the new connection's OnConnectedAsync/SubscribePlc, and on an ungraceful drop it may be delayed by the server's keepalive/timeout window (default ~30 s) or, if the server is shutting down, may not fire deterministically at all.
  2. New connection calls SubscribePlc_tracker.Add(newId, plc) → count 1 → 2, returns false, so no re-arm (fine).
  3. Old connection's OnDisconnectedAsync runs late → count 2 → 1. Capture stays armed. Correct only if the order is (2) then (3).

The failure case: if the old connection's OnDisconnectedAsync is delayed past the new connection also disconnecting, or if the operator closes the tab during the reconnect window, the count never returns to 0 and the capture is armed forever with no viewer — exactly the hot-path cost the on-demand design exists to avoid. Over a long-running service with flaky operator networks this accumulates: every PLC ever viewed ends up permanently armed. TagValueCapture.Record is then a non-trivial cost (FrozenDictionary lookup + allocation of a TagValueObservation + Volatile.Write) on the backend reader task and every FC06/FC16 upstream task, fleet-wide, forever.

The StatusBroadcaster.StopAsyncDisarmAll() safety net only fires on admin shutdown / AdminPort hot-reload, not during normal operation, so it does not bound this leak in a steady-state service.

Fix: do not rely on ConnectionId lifetime as the capture-arming key. Either (a) key subscriptions on a stable client-supplied identity and treat reconnects as idempotent re-subscribes, or (b) drive disarm off the broadcaster: each cycle, ActivePlcs() is the live set; reconcile armed captures against it (_captureRegistry disarms any PLC not in ActivePlcs()), so a leaked count is self-healing within one push interval — but a leaked count still keeps ActivePlcs() returning the PLC, so (b) alone is insufficient. The robust fix is (a): give PlcSubscriptionTracker.RemoveConnection a periodic sweep against SignalR's live connection set, or add a TTL/heartbeat so a connection that has not pushed a keepalive in N intervals is reaped. At minimum, document the leak and have the broadcaster log when ActivePlcs() count exceeds the number of distinct connections it has seen.

C2. SubscribePlc is not atomic — group membership and tracker state can diverge under a concurrent disconnect. StatusHub.cs:48-54:

public async Task SubscribePlc(string plcName)
{
    await Groups.AddToGroupAsync(Context.ConnectionId, PlcGroup(plcName)).ConfigureAwait(false);
    if (_tracker.Add(Context.ConnectionId, plcName))
        _captureRegistry.Arm(plcName);
}

There are two awaits' worth of interleaving here. SignalR will dispatch OnDisconnectedAsync for the same connection if the transport drops while SubscribePlc is mid-flight (hub method invocations and the disconnect callback are not mutually exclusive — only invocations on the same connection are serialized with each other, and OnDisconnectedAsync is a separate dispatch path). Concretely:

  • Connection drops after AddToGroupAsync completes but before _tracker.Add. OnDisconnectedAsync runs _tracker.RemoveConnection(id) → returns empty (nothing tracked yet). Then SubscribePlc resumes, _tracker.Add runs → count 0 → 1, returns true_captureRegistry.Arm(plcName) on a connection that is already gone. The capture is now armed with a phantom viewer. This is the same stuck-armed leak as C1, reached by a tighter race.
  • The mirror case leaves a group membership with no tracker entry; benign for pushes (the dead connection just never receives them) but it confirms the two data structures are not kept consistent.

Fix: make subscribe/unsubscribe go through a single critical section that also checks connection liveness, or — simpler — register the tracker entry first (synchronous, under its own lock), then AddToGroupAsync, and have OnDisconnectedAsync always run last and unconditionally. Even then the arm/disarm must be idempotent and reconciled against the live connection set (see C1). The cleanest fix addresses C1 and C2 together: treat OnConnectedAsync/OnDisconnectedAsync as the only mutators of tracker state, make SubscribePlc only adjust group membership, and arm/disarm from a periodic reconciliation in the broadcaster against SignalR's actual group/connection state.

Major findings

M1. DebugJsonContext is dead code; the detail payload is not source-gen serialized. DebugDto.cs:54-61 declares [JsonSerializable(typeof(PlcDetailResponse))] ... DebugJsonContext, but nothing references DebugJsonContext.Default anywhere in the tree (grep confirms zero usages). AdminEndpointHost.cs:192-196 configures the SignalR JSON protocol with only a PropertyNamingPolicy — no TypeInfoResolver — so SignalR serializes PlcDetailResponse through the reflection-based System.Text.Json resolver. The status.json route uses StatusJsonContext (source-gen); the SignalR path does not use any source-gen context. Today this works (the csproj sets neither PublishTrimmed nor PublishAot), but: (a) DebugJsonContext is misleading dead code that implies the detail payload is AOT-safe when it is not; (b) if trimming/AOT is ever turned on, every SignalR push of PlcDetailResponse and of StatusResponse (the "fleet" message also goes through the reflection resolver — StatusJsonContext is only wired into the status.json JsonSerializer.Serialize call, not into AddJsonProtocol) will throw or silently emit {}. Fix: either delete DebugJsonContext and document that the SignalR path is reflection-only, or — better — wire both contexts into the hub via AddJsonProtocol(o => o.PayloadSerializerOptions.TypeInfoResolverChain.Insert(0, StatusJsonContext.Default); ...Insert(1, DebugJsonContext.Default)) so the SignalR path is consistent with status.json and trim-safe. Note DebugJsonContext would also need StatusResponse/PlcStatus reachable since PlcDetailResponse embeds PlcStatus?.

M2. No test asserts the SignalR wire shape is camelCase. AdminEndpointHost.cs:194-196 is the only thing making the live feed's JSON match the documented status.json contract (docs/Operations/StatusPage.md repeatedly states "camelCase property names"). StatusBroadcasterTests uses FakeStatusPushSink and never serializes; StatusHubTests uses fakes. If someone removes or mis-edits the AddJsonProtocol lambda, every dashboard field silently becomes PascalCase and the JS (dashboard.js/detail.js expecting service.uptimeSeconds etc.) breaks with no failing test. The reflection-vs-source-gen split in M1 makes this worse — the two endpoints' naming is configured in two unrelated places. Add an integration test that stands up the hub (or at least serializes PlcDetailResponse/StatusResponse with the exact PayloadSerializerOptions the hub builds) and asserts uptimeSeconds/captureArmed appear lowercase-first.

M3. StatusBroadcaster.PushOnceAsync swallows a cancelled per-PLC push but keeps looping over remaining PLCs. StatusBroadcaster.cs:97-110. The catch (Exception ex) when (ex is not OperationCanceledException) correctly lets cancellation propagate — but only out of the await _sink.PushPlcAsync. If ct is cancelled mid-loop, the OperationCanceledException escapes PushOnceAsync, which is caught by LoopAsync's outer catch (OperationCanceledException) — fine. However, if _builder.BuildDebug(plcName) (synchronous, StatusSnapshotBuilder.cs:44) throws a non-OCE exception for one PLC, the catch logs and the loop continues — good. But snapshot.Plcs.FirstOrDefault(...) is re-run inside the try for every PLC: an O(N) scan per PLC over the Plcs list = O(N²) per push cycle. For 54 PLCs that is 2,916 comparisons per second — trivially cheap here, but flagged because the fleet snapshot is already a List and a Dictionary<string,PlcStatus> projection once per cycle would be cleaner and removes the quadratic. Minor on its own; grouped here because it sits in the same loop as a real concern: BuildDebug is called for every active PLC even when snapshot.Plcs has no matching entry (PLC removed by hot-reload) — that path is handled (Plc: null) but worth a test.

M4. StatusBroadcaster.Start() is fire-and-forget with no guard against being called twice. StatusBroadcaster.cs:51-52: public void Start() => _loop = Task.Run(...). The XML doc says "Idempotent only in the sense that it is called once" — but nothing enforces that. A second Start() overwrites _loop, orphaning the first loop task (it keeps running against the same _cts, so two loops now push concurrently every interval until cancellation). AdminEndpointHost only ever calls it once per StartAppAsync, and a new StatusBroadcaster is constructed each re-bind, so this is not hit today — but a one-line guard (if (_loop is not { IsCompleted: true } and not null) throw/return, or an Interlocked flag) would make the class safe to misuse. Also: Task.Run returns a task whose faults are observed only via _loop being awaited in StopAsync; if StopAsync is never called (e.g. StartAppAsync throws after _broadcaster.Start() but the catch at AdminEndpointHost.cs:253 sets _app = null without disposing _broadcaster) the loop task and its _cts leak. See M5.

M5. A bind failure after _broadcaster.Start() leaks the broadcaster and its push loop. AdminEndpointHost.cs:237-258. StartAppAsync does await app.StartAsync(ct) (line 237), then _app = app, then constructs and Start()s _broadcaster (lines 242-249), then LogAdminStarted. The whole body is wrapped in catch (Exception ex) when (ex is not OperationCanceledException) (line 253) which logs mbproxy.admin.bind.failed and sets _app = null. If anything between line 238 and 251 throws — e.g. app.Services.GetRequiredService<IHubContext<StatusHub>>() fails, or the StatusBroadcaster constructor throws, or LogAdminStarted somehow throws — the catch sets _app = null but does not stop the already-started _broadcaster nor the already-started Kestrel app. The push loop keeps running forever against a sink whose hub is on a Kestrel app that StopCurrentAppAsync will never see (_app is null, _broadcaster field may or may not be set depending on where the throw landed). Result: a leaked Kestrel listener still bound to the port, plus a leaked broadcaster loop. The probability is low (those calls rarely throw) but the catch's cleanup is incomplete. Fix: in the catch, best-effort stop/dispose whatever was started — mirror StopCurrentAppAsync's logic, or wrap the post-StartAsync section so a failure tears down app and _broadcaster before nulling the fields.

M6. PushOnceAsync builds the debug snapshot for a PLC whose subscriber count is stale. Tied to C1/C2: ActivePlcs() (PlcSubscriptionTracker.cs:76-82) returns a key snapshot, and the broadcaster pushes a "plc" message to PlcGroup(plcName) for each. If the count is leaked-high (C1), the broadcaster pushes to an empty SignalR group every cycle forever — cheap (SignalR no-ops an empty group) but it also keeps calling _builder.BuildDebug(plcName) and, more importantly, the capture stays armed because nothing disarms it. This is the observable symptom of C1 inside the broadcaster. Recording it separately because the fix (broadcaster-side reconciliation) lives here.

Minor findings

N1. StatusBroadcaster does not use a stable log event name. StatusBroadcaster.cs:84,94,108,133 all call _logger.LogError(ex, "StatusBroadcaster: ...") with free-text messages and no EventId/EventName. Every other component in this codebase uses [LoggerMessage] source-gen with a stable mbproxy.* event name catalogued in docs/Reference/LogEvents.md (e.g. mbproxy.admin.started EventId 70, mbproxy.admin.bind.failed EventId 71). The broadcaster's "loop terminated unexpectedly" at line 133 is exactly the kind of event an operator would alert on, and it is invisible to event-name-based log queries. Add [LoggerMessage] entries (e.g. mbproxy.admin.broadcast.failed, mbproxy.admin.broadcast.loop.terminated) and register them in LogEvents.md.

N2. LoopAsync does Task.Delay before the first push. StatusBroadcaster.cs:122-124: the loop delays interval ms, then pushes. The first dashboard client therefore waits up to AdminPushIntervalMs (default 1000 ms) for its first "fleet" message even though a snapshot is available immediately. index.html/dashboard.js presumably also fetch status.json or render empty until the first push — but a push-before-delay (or an immediate PushOnceAsync before entering the loop) would make the dashboard populate instantly on open. Cosmetic, but easy.

N3. StatusBroadcaster.StopAsync is not idempotent-safe against the _loop await. StatusBroadcaster.cs:57-72: if StopAsync is called twice (it is reachable: DisposeAsync calls StopAsync, and AdminEndpointHost.StopCurrentAppAsync calls broadcaster.DisposeAsync() which calls StopAsync — only one path per instance today, but). The second call: _cts.IsCancellationRequested is true so it skips CancelAsync, then await _loop again (a completed task — fine), then _captureRegistry.DisarmAll() again (fine). Benign, but DisposeAsync at line 137-141 then calls _cts.Dispose(); a third path touching _cts after dispose would throw ObjectDisposedException. No live bug, but the class lacks the _disposed guard that AdminEndpointHost itself carries (and whose absence the AdminEndpointHost comment at lines 59-64 explicitly calls out as a regression risk). Add a _stopped/_disposed flag for symmetry.

N4. OnDisconnectedAsync does its cleanup before base.OnDisconnectedAsync. StatusHub.cs:60-66. The capture disarm runs first, then base.OnDisconnectedAsync(exception). This is the correct order (you want to release resources before the base teardown) and the disarm is synchronous so it cannot be lost — but note there is no OnConnectedAsync override, which is fine, and no try/finally around the foreach. If _captureRegistry.Disarm threw (it cannot — it is a dictionary lookup + volatile write), the remaining PLCs in the connection's set would not be disarmed and base.OnDisconnectedAsync would be skipped. Defensive only; Disarm is genuinely no-throw today.

N5. PlcSubscriptionTracker.ActivePlcs() allocates a fresh array every push cycle. PlcSubscriptionTracker.cs:80: _plcCounts.Keys.ToArray() under the lock, once per AdminPushIntervalMs. With the common case of zero detail-page viewers it correctly returns Array.Empty<string>() (line 80 short-circuits on Count == 0). Only allocates when someone is viewing. Fine — flagged only as a known per-cycle allocation if push interval is ever lowered aggressively.

N6. ServeHtmlShell / asset routes have no HEAD handling and /plc/{name} ignores name. AdminEndpointHost.cs:210: app.MapGet("/plc/{name}", (string name, HttpContext ctx) => ServeHtmlShell(ctx, "plc.html"))name is bound but unused (the page reads it client-side from the URL). Harmless, but the unused parameter will draw a compiler/analyzer warning under this project's TreatWarningsAsErrors unless suppressed; verify it builds. If it does build clean, fine — MapGet route parameters are not flagged as unused. Minor; mentioned for the reviewer to confirm against CI.

N7. The detail payload's PlcDetailResponse shape is undocumented as part of the /status.json contract but the doc table at StatusPage.md:317-328 does describe it. Actually documented — withdrawing the "undocumented" concern from the Summary's last bullet to this extent: the fields are in StatusPage.md. What is genuinely missing is a statement that this payload travels only over SignalR and is not reachable at any GET route, and that its serialization path differs from status.json (M1). One sentence in StatusPage.md's "Debug View Data" section would close it.

What looks good

  • Broadcaster lifecycle is correctly bound to the Kestrel app, not the host. AdminEndpointHost.cs:242-249 creates a fresh StatusBroadcaster inside StartAppAsync and StopCurrentAppAsync (lines 268-279) disposes it before stopping Kestrel. An AdminPort hot-reload therefore tears down the old broadcaster and starts a new one — no broadcaster leak across re-binds, and the DisarmAll() in StopAsync ensures the re-bind does not strand an armed capture. This directly answers the open question N3-style concern from the 2026-05-14 AdminAndDiagnostics review about provider/loop leaks across re-binds.
  • IStatusPushSink is a clean seam. Defining the outbound side as an interface (StatusPushSink.cs) lets StatusBroadcasterTests exercise the full push-cycle logic (fleet always, per-PLC only for active PLCs, disarm-on-stop) with a recording fake and zero SignalR host — and the tests actually do this. Good testability design.
  • PlcSubscriptionTracker locking is correct as a data structure. Single _gate, every method takes it, the count transitions (0→1 returns arm-signal, 1→0 returns disarm-signal) are computed under the lock, and RemoveConnection correctly handles the multi-PLC-per-connection case. The bug (C1/C2) is not in the locking — it is that ConnectionId is the wrong key for a lifetime and the tracker is mutated from two un-serialized hub dispatch paths. The class itself is internally consistent.
  • PushOnceAsync per-stage error isolation. StatusBroadcaster.cs:78-110 wraps snapshot-build, fleet-push, and each per-PLC push in their own try/catch so one PLC's failure does not abort the cycle and a snapshot-build failure does not kill the loop. The when (ex is not OperationCanceledException) filters correctly let shutdown cancellation propagate to LoopAsync's handler. This is the right shape.
  • LoopAsync re-reads AdminPushIntervalMs every cycle (StatusBroadcaster.cs:122) so a hot-reload of the interval takes effect without restarting the loop, and floors it at 100 ms so a bad value cannot spin the CPU. Matches the hot-reload-everything posture in CLAUDE.md.
  • TagValueCapture concurrency is genuinely lock-free-correct. Volatile.Write/Volatile.Read of references to an immutable record (TagValueObservation), _armed is volatile, and Record short-circuits on !_armed with a single volatile read before any work — so the disarmed hot path cost is one bool read, as advertised. Disarm clears slots so a re-arm shows only fresh data. This part of the design is solid; the weakness is who calls Arm/Disarm and when (C1/C2), not the capture itself.
  • StatusSnapshotBuilder.BuildDebug degrades gracefully for an unknown PLC (StatusSnapshotBuilder.cs:44-47) — returns a disarmed empty snapshot rather than throwing, which is the correct behavior for a detail page open on a hot-reload-removed PLC, and PlcDetailResponse.Plc is nullable to carry that state. ConfigReconciler.cs:259 calls _captureRegistry.Remove(name) on PLC removal, so the registry and the config stay consistent.
  • StatusHub.SubscribePlc for an unknown PLC is a documented no-op (StatusHub.cs:53, TagCaptureRegistry.Arm no-ops a missing key) and StatusHubTests.SubscribePlc_UnknownPlc_DoesNotThrow_AndArmsNothing covers it. A hub method throwing would be sent to the caller as a hub error; this path correctly does not throw.
  • Asset serving is safe. AdminEndpointHost.cs:212-226 rejects /, \, and .. in the asset path segment before touching GetManifestResourceStream, caches bytes and misses in a static ConcurrentDictionary shared across app re-builds, and sets immutable cache headers for content-addressed assets vs. no-cache for the HTML shells. Embedded resources mean no filesystem traversal surface at all.

Open questions

  1. C1 field impact: how often do operators' browsers reconnect in this deployment? If the dashboard is on a stable internal segment and detail pages are short-lived, the leak is slow — but it is unbounded and never self-heals in a steady-state service. Is there an existing process-restart cadence that masks it? Either way the invariant in StatusPage.md:315 ("the hot path carries zero cost when nobody is watching") is currently false after any reconnect-during-view.
  2. C2 / hub dispatch model: confirm against the SignalR version in use whether OnDisconnectedAsync for a connection can overlap an in-flight SubscribePlc invocation on that same connection. If SignalR guarantees OnDisconnectedAsync runs only after all in-flight invocations for that connection complete, C2's same-connection race narrows to the cross-connection (reconnect) race in C1 — still a bug, but the fix scope shrinks.
  3. M1: is trimming/AOT on the roadmap for this service? CLAUDE.md mentions single-file self-contained publish but not trimming. If AOT is ever planned, M1 is upgraded to Critical (the SignalR reflection JSON path will break) and DebugJsonContext must be wired in, not deleted.
  4. M5: has the post-StartAsync failure path (e.g. GetRequiredService<IHubContext<StatusHub>> failing) ever been observed? It is low-probability, but the catch block's cleanup is provably incomplete — worth a deliberate decision to either fix it or document it as accepted.
  5. Is there any reason StatusJsonContext.Default is not also wired into the hub's AddJsonProtocol so the fleet "fleet" push and GET /status.json share one serialization path and one camelCase configuration point (M1/M2)?