wwtools/mbproxy/codereviews/2026-05-15/AdminSignalR.md

# Admin SignalR Web Dashboard — Code Review

Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard"), files `src/Mbproxy/Admin/{AdminEndpointHost,StatusHub,StatusBroadcaster,StatusPushSink,PlcSubscriptionTracker,StatusSnapshotBuilder,DebugDto}.cs`. Cross-checked against `docs/Operations/StatusPage.md`, `docs/Reference/LogEvents.md`, the mbproxy `CLAUDE.md`, and the supporting types `Proxy/TagCaptureRegistry.cs`, `Proxy/TagValueCapture.cs`, `Proxy/ProxyWorker.cs`, `HostingExtensions.cs`, `Configuration/ConfigReconciler.cs`. Tests under `tests/Mbproxy.Tests/Admin/{StatusHubTests,StatusBroadcasterTests}.cs`.

## Summary

- The decomposition is sound: `IStatusPushSink` cleanly isolates the push loop from SignalR, `PlcSubscriptionTracker` is correctly single-locked, and the broadcaster lifecycle is tied to the Kestrel app's lifecycle so an `AdminPort` hot-reload re-bind does not leak a second broadcaster. Per-cycle error handling in `PushOnceAsync` is genuinely defensive.
- The most serious problem is a **subscriber-count leak in `PlcSubscriptionTracker`**: a SignalR reconnect (transport drop without a clean close) increments a PLC's count on the new connection but the old connection's `OnDisconnectedAsync` decrement is not guaranteed, so a capture can be left armed forever — or, in the opposite race, double-armed/never-disarmed. Combined with the fact that capture arming has no reference to the *connection liveness*, the on-demand-capture invariant ("armed only while a viewer is open") is not actually upheld in the field.
- A **second real bug**: `StatusHub.SubscribePlc` is not atomic — `Groups.AddToGroupAsync` then `_tracker.Add` are two awaits with an interleaving point, and `OnDisconnectedAsync` can run on the same connection between them, producing a *negative-free but still wrong* state where the group membership outlives the tracker entry (capture disarmed while the page still receives pushes) or vice versa.
- `DebugJsonContext` (in `DebugDto.cs`) is dead code — defined, never referenced. SignalR serializes `PlcDetailResponse` via the reflection-based `System.Text.Json` path, not the source-gen context. Harmless today (no trimming/AOT in the csproj) but it is a misleading artifact and a latent trap if AOT is ever enabled.
- The documented contract is matched on the fleet path but the **detail push (`PlcDetailResponse`) is undocumented in the `status.json` schema** and is not reachable from `StatusJsonContext` — only over SignalR. That is by design, but the camelCase guarantee for it rests entirely on the hub's `AddJsonProtocol` config, with no test asserting the wire shape.

## Critical findings

**C1. `PlcSubscriptionTracker` leaks subscriber counts across SignalR reconnects — captures get stuck armed (or never armed).** `StatusHub.cs:48-54` / `PlcSubscriptionTracker.cs:29-73`. SignalR assigns a *new* `ConnectionId` on every transport reconnect (WebSocket drop, long-polling cycle, network blip). The client (`detail.js`) re-invokes `SubscribePlc` on reconnect. The sequence on a reconnect is:

1. Old connection's transport dies. SignalR *eventually* fires `OnDisconnectedAsync` for the old `ConnectionId` — but this is **not** ordered relative to the new connection's `OnConnectedAsync`/`SubscribePlc`, and on an ungraceful drop it may be delayed by the server's keepalive/timeout window (default ~30 s) or, if the server is shutting down, may not fire deterministically at all.
2. New connection calls `SubscribePlc` → `_tracker.Add(newId, plc)` → count `1 → 2`, returns `false`, so no re-arm (fine).
3. Old connection's `OnDisconnectedAsync` runs late → count `2 → 1`. Capture stays armed. **Correct only if the order is (2) then (3).**

The failure case: if the old connection's `OnDisconnectedAsync` is delayed past the *new* connection also disconnecting, or if the operator closes the tab during the reconnect window, the count never returns to 0 and **the capture is armed forever with no viewer** — exactly the hot-path cost the on-demand design exists to avoid. Over a long-running service with flaky operator networks this accumulates: every PLC ever viewed ends up permanently armed. `TagValueCapture.Record` is then a non-trivial cost (`FrozenDictionary` lookup + allocation of a `TagValueObservation` + `Volatile.Write`) on the backend reader task and every FC06/FC16 upstream task, fleet-wide, forever.

The `StatusBroadcaster.StopAsync` → `DisarmAll()` safety net only fires on admin shutdown / `AdminPort` hot-reload, not during normal operation, so it does not bound this leak in a steady-state service.

Fix: do not rely on `ConnectionId` lifetime as the capture-arming key. Either (a) key subscriptions on a stable client-supplied identity and treat reconnects as idempotent re-subscribes, or (b) drive disarm off the broadcaster: each cycle, `ActivePlcs()` is the live set; reconcile armed captures against it (`_captureRegistry` disarms any PLC not in `ActivePlcs()`), so a leaked count is self-healing within one push interval — but a leaked *count* still keeps `ActivePlcs()` returning the PLC, so (b) alone is insufficient. The robust fix is (a): give `PlcSubscriptionTracker.RemoveConnection` a periodic sweep against SignalR's live connection set, or add a TTL/heartbeat so a connection that has not pushed a keepalive in N intervals is reaped. At minimum, document the leak and have the broadcaster log when `ActivePlcs()` count exceeds the number of distinct connections it has seen.

**C2. `SubscribePlc` is not atomic — group membership and tracker state can diverge under a concurrent disconnect.** `StatusHub.cs:48-54`:

```csharp
public async Task SubscribePlc(string plcName)
{
    await Groups.AddToGroupAsync(Context.ConnectionId, PlcGroup(plcName)).ConfigureAwait(false);
    if (_tracker.Add(Context.ConnectionId, plcName))
        _captureRegistry.Arm(plcName);
}
```

There are two awaits' worth of interleaving here. SignalR will dispatch `OnDisconnectedAsync` for the same connection if the transport drops while `SubscribePlc` is mid-flight (hub method invocations and the disconnect callback are not mutually exclusive — only invocations on the *same* connection are serialized *with each other*, and `OnDisconnectedAsync` is a separate dispatch path). Concretely:

- Connection drops after `AddToGroupAsync` completes but before `_tracker.Add`. `OnDisconnectedAsync` runs `_tracker.RemoveConnection(id)` → returns empty (nothing tracked yet). Then `SubscribePlc` resumes, `_tracker.Add` runs → count `0 → 1`, returns `true` → **`_captureRegistry.Arm(plcName)` on a connection that is already gone.** The capture is now armed with a phantom viewer. This is the same stuck-armed leak as C1, reached by a tighter race.
- The mirror case leaves a group membership with no tracker entry; benign for pushes (the dead connection just never receives them) but it confirms the two data structures are not kept consistent.

Fix: make subscribe/unsubscribe go through a single critical section that also checks connection liveness, or — simpler — register the tracker entry *first* (synchronous, under its own lock), then `AddToGroupAsync`, and have `OnDisconnectedAsync` always run last and unconditionally. Even then the arm/disarm must be idempotent and reconciled against the live connection set (see C1). The cleanest fix addresses C1 and C2 together: treat `OnConnectedAsync`/`OnDisconnectedAsync` as the *only* mutators of tracker state, make `SubscribePlc` only adjust group membership, and arm/disarm from a periodic reconciliation in the broadcaster against SignalR's actual group/connection state.

## Major findings

**M1. `DebugJsonContext` is dead code; the detail payload is not source-gen serialized.** `DebugDto.cs:54-61` declares `[JsonSerializable(typeof(PlcDetailResponse))] ... DebugJsonContext`, but nothing references `DebugJsonContext.Default` anywhere in the tree (grep confirms zero usages). `AdminEndpointHost.cs:192-196` configures the SignalR JSON protocol with only a `PropertyNamingPolicy` — no `TypeInfoResolver` — so SignalR serializes `PlcDetailResponse` through the **reflection-based** `System.Text.Json` resolver. The `status.json` route uses `StatusJsonContext` (source-gen); the SignalR path does not use any source-gen context. Today this works (the csproj sets neither `PublishTrimmed` nor `PublishAot`), but: (a) `DebugJsonContext` is misleading dead code that implies the detail payload is AOT-safe when it is not; (b) if trimming/AOT is ever turned on, every SignalR push of `PlcDetailResponse` *and* of `StatusResponse` (the `"fleet"` message also goes through the reflection resolver — `StatusJsonContext` is only wired into the `status.json` `JsonSerializer.Serialize` call, not into `AddJsonProtocol`) will throw or silently emit `{}`. Fix: either delete `DebugJsonContext` and document that the SignalR path is reflection-only, or — better — wire both contexts into the hub via `AddJsonProtocol(o => o.PayloadSerializerOptions.TypeInfoResolverChain.Insert(0, StatusJsonContext.Default); ...Insert(1, DebugJsonContext.Default))` so the SignalR path is consistent with `status.json` and trim-safe. Note `DebugJsonContext` would also need `StatusResponse`/`PlcStatus` reachable since `PlcDetailResponse` embeds `PlcStatus?`.

**M2. No test asserts the SignalR wire shape is camelCase.** `AdminEndpointHost.cs:194-196` is the *only* thing making the live feed's JSON match the documented `status.json` contract (`docs/Operations/StatusPage.md` repeatedly states "camelCase property names"). `StatusBroadcasterTests` uses `FakeStatusPushSink` and never serializes; `StatusHubTests` uses fakes. If someone removes or mis-edits the `AddJsonProtocol` lambda, every dashboard field silently becomes `PascalCase` and the JS (`dashboard.js`/`detail.js` expecting `service.uptimeSeconds` etc.) breaks with no failing test. The reflection-vs-source-gen split in M1 makes this worse — the two endpoints' naming is configured in two unrelated places. Add an integration test that stands up the hub (or at least serializes `PlcDetailResponse`/`StatusResponse` with the exact `PayloadSerializerOptions` the hub builds) and asserts `uptimeSeconds`/`captureArmed` appear lowercase-first.

**M3. `StatusBroadcaster.PushOnceAsync` swallows a cancelled per-PLC push but keeps looping over remaining PLCs.** `StatusBroadcaster.cs:97-110`. The `catch (Exception ex) when (ex is not OperationCanceledException)` correctly lets cancellation propagate — but only out of the `await _sink.PushPlcAsync`. If `ct` is cancelled mid-loop, the `OperationCanceledException` escapes `PushOnceAsync`, which is caught by `LoopAsync`'s outer `catch (OperationCanceledException)` — fine. However, if `_builder.BuildDebug(plcName)` (synchronous, `StatusSnapshotBuilder.cs:44`) throws a *non-OCE* exception for one PLC, the `catch` logs and the loop continues — good. But `snapshot.Plcs.FirstOrDefault(...)` is re-run inside the `try` for every PLC: an O(N) scan per PLC over the `Plcs` list = O(N²) per push cycle. For 54 PLCs that is 2,916 comparisons per second — trivially cheap here, but flagged because the fleet snapshot is already a `List` and a `Dictionary<string,PlcStatus>` projection once per cycle would be cleaner and removes the quadratic. Minor on its own; grouped here because it sits in the same loop as a real concern: `BuildDebug` is called for every active PLC even when `snapshot.Plcs` has no matching entry (PLC removed by hot-reload) — that path is handled (`Plc: null`) but worth a test.

**M4. `StatusBroadcaster.Start()` is fire-and-forget with no guard against being called twice.** `StatusBroadcaster.cs:51-52`: `public void Start() => _loop = Task.Run(...)`. The XML doc says "Idempotent only in the sense that it is called once" — but nothing *enforces* that. A second `Start()` overwrites `_loop`, orphaning the first loop task (it keeps running against the same `_cts`, so two loops now push concurrently every interval until cancellation). `AdminEndpointHost` only ever calls it once per `StartAppAsync`, and a new `StatusBroadcaster` is constructed each re-bind, so this is not hit today — but a one-line guard (`if (_loop is not { IsCompleted: true } and not null) throw`/return, or an `Interlocked` flag) would make the class safe to misuse. Also: `Task.Run` returns a task whose faults are observed only via `_loop` being awaited in `StopAsync`; if `StopAsync` is never called (e.g. `StartAppAsync` throws *after* `_broadcaster.Start()` but the catch at `AdminEndpointHost.cs:253` sets `_app = null` without disposing `_broadcaster`) the loop task and its `_cts` leak. See M5.

**M5. A bind failure after `_broadcaster.Start()` leaks the broadcaster and its push loop.** `AdminEndpointHost.cs:237-258`. `StartAppAsync` does `await app.StartAsync(ct)` (line 237), then `_app = app`, then constructs and `Start()`s `_broadcaster` (lines 242-249), then `LogAdminStarted`. The whole body is wrapped in `catch (Exception ex) when (ex is not OperationCanceledException)` (line 253) which logs `mbproxy.admin.bind.failed` and sets `_app = null`. If anything between line 238 and 251 throws — e.g. `app.Services.GetRequiredService<IHubContext<StatusHub>>()` fails, or the `StatusBroadcaster` constructor throws, or `LogAdminStarted` somehow throws — the catch sets `_app = null` but **does not stop the already-started `_broadcaster` nor the already-started Kestrel `app`**. The push loop keeps running forever against a sink whose hub is on a Kestrel app that `StopCurrentAppAsync` will never see (`_app` is null, `_broadcaster` field may or may not be set depending on where the throw landed). Result: a leaked Kestrel listener still bound to the port, plus a leaked broadcaster loop. The probability is low (those calls rarely throw) but the catch's cleanup is incomplete. Fix: in the catch, best-effort stop/dispose whatever was started — mirror `StopCurrentAppAsync`'s logic, or wrap the post-`StartAsync` section so a failure tears down `app` and `_broadcaster` before nulling the fields.

**M6. `PushOnceAsync` builds the debug snapshot for a PLC whose subscriber count is stale.** Tied to C1/C2: `ActivePlcs()` (`PlcSubscriptionTracker.cs:76-82`) returns a key snapshot, and the broadcaster pushes a `"plc"` message to `PlcGroup(plcName)` for each. If the count is leaked-high (C1), the broadcaster pushes to an empty SignalR group every cycle forever — cheap (SignalR no-ops an empty group) but it also keeps calling `_builder.BuildDebug(plcName)` and, more importantly, the *capture stays armed* because nothing disarms it. This is the observable symptom of C1 inside the broadcaster. Recording it separately because the fix (broadcaster-side reconciliation) lives here.

## Minor findings

**N1. `StatusBroadcaster` does not use a stable log event name.** `StatusBroadcaster.cs:84,94,108,133` all call `_logger.LogError(ex, "StatusBroadcaster: ...")` with free-text messages and no `EventId`/`EventName`. Every other component in this codebase uses `[LoggerMessage]` source-gen with a stable `mbproxy.*` event name catalogued in `docs/Reference/LogEvents.md` (e.g. `mbproxy.admin.started` EventId 70, `mbproxy.admin.bind.failed` EventId 71). The broadcaster's "loop terminated unexpectedly" at line 133 is exactly the kind of event an operator would alert on, and it is invisible to event-name-based log queries. Add `[LoggerMessage]` entries (e.g. `mbproxy.admin.broadcast.failed`, `mbproxy.admin.broadcast.loop.terminated`) and register them in `LogEvents.md`.

**N2. `LoopAsync` does `Task.Delay` *before* the first push.** `StatusBroadcaster.cs:122-124`: the loop delays `interval` ms, then pushes. The first dashboard client therefore waits up to `AdminPushIntervalMs` (default 1000 ms) for its first `"fleet"` message even though a snapshot is available immediately. `index.html`/`dashboard.js` presumably also fetch `status.json` or render empty until the first push — but a push-before-delay (or an immediate `PushOnceAsync` before entering the loop) would make the dashboard populate instantly on open. Cosmetic, but easy.

**N3. `StatusBroadcaster.StopAsync` is not idempotent-safe against the `_loop` await.** `StatusBroadcaster.cs:57-72`: if `StopAsync` is called twice (it is reachable: `DisposeAsync` calls `StopAsync`, and `AdminEndpointHost.StopCurrentAppAsync` calls `broadcaster.DisposeAsync()` which calls `StopAsync` — only one path per instance today, but). The second call: `_cts.IsCancellationRequested` is true so it skips `CancelAsync`, then `await _loop` again (a completed task — fine), then `_captureRegistry.DisarmAll()` again (fine). Benign, but `DisposeAsync` at line 137-141 then calls `_cts.Dispose()`; a *third* path touching `_cts` after dispose would throw `ObjectDisposedException`. No live bug, but the class lacks the `_disposed` guard that `AdminEndpointHost` itself carries (and whose absence the `AdminEndpointHost` comment at lines 59-64 explicitly calls out as a regression risk). Add a `_stopped`/`_disposed` flag for symmetry.

**N4. `OnDisconnectedAsync` does its cleanup *before* `base.OnDisconnectedAsync`.** `StatusHub.cs:60-66`. The capture disarm runs first, then `base.OnDisconnectedAsync(exception)`. This is the correct order (you want to release resources before the base teardown) and the disarm is synchronous so it cannot be lost — but note there is **no `OnConnectedAsync` override**, which is fine, and no try/finally around the `foreach`. If `_captureRegistry.Disarm` threw (it cannot — it is a dictionary lookup + volatile write), the remaining PLCs in the connection's set would not be disarmed and `base.OnDisconnectedAsync` would be skipped. Defensive only; `Disarm` is genuinely no-throw today.

**N5. `PlcSubscriptionTracker.ActivePlcs()` allocates a fresh array every push cycle.** `PlcSubscriptionTracker.cs:80`: `_plcCounts.Keys.ToArray()` under the lock, once per `AdminPushIntervalMs`. With the common case of zero detail-page viewers it correctly returns `Array.Empty<string>()` (line 80 short-circuits on `Count == 0`). Only allocates when someone is viewing. Fine — flagged only as a known per-cycle allocation if push interval is ever lowered aggressively.

**N6. `ServeHtmlShell` / asset routes have no `HEAD` handling and `/plc/{name}` ignores `name`.** `AdminEndpointHost.cs:210`: `app.MapGet("/plc/{name}", (string name, HttpContext ctx) => ServeHtmlShell(ctx, "plc.html"))` — `name` is bound but unused (the page reads it client-side from the URL). Harmless, but the unused parameter will draw a compiler/analyzer warning under this project's `TreatWarningsAsErrors` unless suppressed; verify it builds. If it does build clean, fine — `MapGet` route parameters are not flagged as unused. Minor; mentioned for the reviewer to confirm against CI.

**N7. The detail payload's `PlcDetailResponse` shape is undocumented as part of the `/status.json` contract but the doc table at `StatusPage.md:317-328` does describe it.** Actually documented — withdrawing the "undocumented" concern from the Summary's last bullet to this extent: the *fields* are in `StatusPage.md`. What is genuinely missing is a statement that this payload travels **only over SignalR** and is **not** reachable at any `GET` route, and that its serialization path differs from `status.json` (M1). One sentence in `StatusPage.md`'s "Debug View Data" section would close it.

## What looks good

- **Broadcaster lifecycle is correctly bound to the Kestrel app, not the host.** `AdminEndpointHost.cs:242-249` creates a fresh `StatusBroadcaster` inside `StartAppAsync` and `StopCurrentAppAsync` (lines 268-279) disposes it *before* stopping Kestrel. An `AdminPort` hot-reload therefore tears down the old broadcaster and starts a new one — no broadcaster leak across re-binds, and the `DisarmAll()` in `StopAsync` ensures the re-bind does not strand an armed capture. This directly answers the open question N3-style concern from the 2026-05-14 `AdminAndDiagnostics` review about provider/loop leaks across re-binds.
- **`IStatusPushSink` is a clean seam.** Defining the outbound side as an interface (`StatusPushSink.cs`) lets `StatusBroadcasterTests` exercise the full push-cycle logic (fleet always, per-PLC only for active PLCs, disarm-on-stop) with a recording fake and zero SignalR host — and the tests actually do this. Good testability design.
- **`PlcSubscriptionTracker` locking is correct *as a data structure*.** Single `_gate`, every method takes it, the count transitions (`0→1` returns arm-signal, `1→0` returns disarm-signal) are computed under the lock, and `RemoveConnection` correctly handles the multi-PLC-per-connection case. The bug (C1/C2) is not in the locking — it is that `ConnectionId` is the wrong key for a *lifetime* and the tracker is mutated from two un-serialized hub dispatch paths. The class itself is internally consistent.
- **`PushOnceAsync` per-stage error isolation.** `StatusBroadcaster.cs:78-110` wraps snapshot-build, fleet-push, and each per-PLC push in their own `try/catch` so one PLC's failure does not abort the cycle and a snapshot-build failure does not kill the loop. The `when (ex is not OperationCanceledException)` filters correctly let shutdown cancellation propagate to `LoopAsync`'s handler. This is the right shape.
- **`LoopAsync` re-reads `AdminPushIntervalMs` every cycle** (`StatusBroadcaster.cs:122`) so a hot-reload of the interval takes effect without restarting the loop, and floors it at 100 ms so a bad value cannot spin the CPU. Matches the hot-reload-everything posture in `CLAUDE.md`.
- **`TagValueCapture` concurrency is genuinely lock-free-correct.** `Volatile.Write`/`Volatile.Read` of references to an immutable `record` (`TagValueObservation`), `_armed` is `volatile`, and `Record` short-circuits on `!_armed` with a single volatile read before any work — so the disarmed hot path cost is one bool read, as advertised. `Disarm` clears slots so a re-arm shows only fresh data. This part of the design is solid; the weakness is *who calls Arm/Disarm and when* (C1/C2), not the capture itself.
- **`StatusSnapshotBuilder.BuildDebug` degrades gracefully for an unknown PLC** (`StatusSnapshotBuilder.cs:44-47`) — returns a disarmed empty snapshot rather than throwing, which is the correct behavior for a detail page open on a hot-reload-removed PLC, and `PlcDetailResponse.Plc` is nullable to carry that state. `ConfigReconciler.cs:259` calls `_captureRegistry.Remove(name)` on PLC removal, so the registry and the config stay consistent.
- **`StatusHub.SubscribePlc` for an unknown PLC is a documented no-op** (`StatusHub.cs:53`, `TagCaptureRegistry.Arm` no-ops a missing key) and `StatusHubTests.SubscribePlc_UnknownPlc_DoesNotThrow_AndArmsNothing` covers it. A hub method throwing would be sent to the caller as a hub error; this path correctly does not throw.
- **Asset serving is safe.** `AdminEndpointHost.cs:212-226` rejects `/`, `\`, and `..` in the asset path segment before touching `GetManifestResourceStream`, caches bytes and misses in a `static ConcurrentDictionary` shared across app re-builds, and sets `immutable` cache headers for content-addressed assets vs. `no-cache` for the HTML shells. Embedded resources mean no filesystem traversal surface at all.

## Open questions

1. **C1 field impact:** how often do operators' browsers reconnect in this deployment? If the dashboard is on a stable internal segment and detail pages are short-lived, the leak is slow — but it is unbounded and never self-heals in a steady-state service. Is there an existing process-restart cadence that masks it? Either way the invariant in `StatusPage.md:315` ("the hot path carries zero cost when nobody is watching") is currently false after any reconnect-during-view.
2. **C2 / hub dispatch model:** confirm against the SignalR version in use whether `OnDisconnectedAsync` for a connection can overlap an in-flight `SubscribePlc` invocation on that *same* connection. If SignalR guarantees `OnDisconnectedAsync` runs only after all in-flight invocations for that connection complete, C2's same-connection race narrows to the cross-connection (reconnect) race in C1 — still a bug, but the fix scope shrinks.
3. **M1:** is trimming/AOT on the roadmap for this service? `CLAUDE.md` mentions single-file self-contained publish but not trimming. If AOT is ever planned, M1 is upgraded to Critical (the SignalR reflection JSON path will break) and `DebugJsonContext` must be wired in, not deleted.
4. **M5:** has the post-`StartAsync` failure path (e.g. `GetRequiredService<IHubContext<StatusHub>>` failing) ever been observed? It is low-probability, but the catch block's cleanup is provably incomplete — worth a deliberate decision to either fix it or document it as accepted.
5. Is there any reason `StatusJsonContext.Default` is not also wired into the hub's `AddJsonProtocol` so the fleet `"fleet"` push and `GET /status.json` share one serialization path and one camelCase configuration point (M1/M2)?