Files
wwtools/mbproxy/codereviews/2026-05-16/AdminSignalR.md
Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings
Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:08:06 -04:00

19 KiB

Admin SignalR / Tag Capture — Code Review

Scope: src/Mbproxy/Admin/*.cs (AdminEndpointHost, StatusBroadcaster, StatusHub, PlcSubscriptionTracker, StatusSnapshotBuilder, StatusPushSink, StatusDto, DebugDto, AssemblyVersionAccessor) plus src/Mbproxy/Proxy/TagValueCapture.cs and src/Mbproxy/Proxy/TagCaptureRegistry.cs. Branch mbproxy-webui-dashboard (HEAD 0308490). Cross-checked against docs/Operations/StatusPage.md, docs/Reference/LogEvents.md, the mbproxy CLAUDE.md, the prior reviews codereviews/2026-05-15/{AdminSignalR,TagCapture}.md, and the consuming code in ConfigReconciler.cs / ProxyWorker.cs / PlcMultiplexer.cs.

Summary

The two Critical findings from the 2026-05-15 reviews have been genuinely fixed and the fixes hold. The SignalR subscription tracker is now keyed on a stable per-page-load tabId (PlcSubscriptionTracker), not on the ConnectionId, so a transport reconnect can no longer leak a viewer; arm/disarm is funneled through a single authority (StatusBroadcaster.PushOnceAsyncTagCaptureRegistry.ReconcileArmed), removing the lost-update race; the hub no longer arms captures at all. The prior XSS surface does not exist in the current admin layer — no server-rendered HTML, no string interpolation into a response; /plc/{name} serves a static embedded shell and name is read client-side. The cache-hit debug-view freeze (prior TagCapture C1) is fixed by CacheEntry.CapturedTags replay. The endpoint is strictly read-only. M5/M6 from the prior review (bind-failure leak) are addressed: StartAppAsync now tears down a partially-started app/_broadcaster in its catch.

Remaining issues are lower severity: a real but bounded reconcile/disarm race against Snapshot(), a stale doc comment, a missing Start() double-call guard, and assorted test-gap / doc-drift items. No new Critical issues.

Findings by severity: Critical: 0 · Major: 2 · Minor: 7

Major findings

M1. ReconcileArmed disarm can clear a slot between snapshot capture and push — debug view can flicker an extra empty cycle

StatusBroadcaster.cs:108-130, TagCaptureRegistry.cs:55-68, TagValueCapture.cs:108-113.

PushOnceAsync calls ReconcileArmed(activePlcs) before the per-PLC foreach that calls _builder.BuildDebug(plcName). For a PLC that is currently armed and still has a viewer this is fine. But the ordering interacts badly with a viewer that leaves during a push cycle:

  • activePlcs is snapshotted at line 108. Suppose plc-x is in it.
  • ReconcileArmed keeps plc-x armed.
  • The viewer's OnDisconnectedAsync runs on a hub thread mid-cycle, removing plc-x from the tracker — but the foreach still iterates the stale activePlcs list and pushes one final detail message to a now-empty group. Benign (empty group no-ops).

The genuinely observable case is the inverse and is a correctness nuance, not a crash: because ReconcileArmed runs once per cycle and BuildDebug runs in the same cycle, a PLC that becomes active after line 108 (a SubscribePlc landing between line 108 and the foreach) is not in activePlcs, so it is neither armed nor pushed this cycle — it self-heals next cycle. That is acceptable. What is worth flagging: ReconcileArmed and BuildDebug are not consistent within the cycle for a PLC whose arm state flips mid-cycle, and the capture's Disarm() clears every slot (TagValueCapture.cs:111-112). If a hot-reload GetOrCreate rebuild or a ReconcileArmed disarm interleaves between ReconcileArmed and the capture.Snapshot() inside BuildDebug, the pushed PlcDebugSnapshot can report CaptureArmed = true (read at StatusSnapshotBuilder.cs:54) with all-empty tag rows, or CaptureArmed = false with populated rows.

Impact: a transient one-cycle (≤ AdminPushIntervalMs, default 1 s) inconsistency in the debug view — captureArmed and the tag rows can momentarily disagree. Not data corruption, not a leak; the next cycle is correct. Worth recording because the prior reviews treated arm-state consistency as the central invariant.

Recommendation: read capture.IsArmed and capture.Snapshot() once, atomically enough for display — or, simplest, have BuildDebug derive CaptureArmed from the same activePlcs set the broadcaster already holds rather than re-reading capture.IsArmed. Pass the active set (or a bool armed) into BuildDebug instead of letting it independently re-query the capture. That makes the pushed payload internally consistent by construction.

M2. StatusBroadcaster.Start() has no guard against a second call — a double-start orphans a push loop

StatusBroadcaster.cs:56-57: public void Start() => _loop = Task.Run(() => LoopAsync(_cts.Token));

A second Start() overwrites _loop, orphaning the first loop task. The first task keeps running against the same _cts, so two loops now push the fleet snapshot every interval until cancellation, and StopAsync only awaits the second _loop — the orphaned first loop is never awaited and its faults are unobserved. The XML doc ("Idempotent only in the sense that it is called once") documents the assumption but nothing enforces it.

Today AdminEndpointHost constructs a fresh StatusBroadcaster per StartAppAsync and calls Start() exactly once, so this is not hit in production. But StatusBroadcasterTests.Loop_PushesRepeatedly_ThenStopsAfterStopAsync (line 175) calls h.Broadcaster.Start() on a harness whose broadcaster was not started by a hidden path — that one is fine — yet the class remains trivially misusable, and IAsyncDisposable.DisposeAsyncStopAsync would only ever clean up one of two loops.

Recommendation: add a one-line guard — Interlocked.CompareExchange on an int _started, or if (_loop is not { Status: TaskStatus.RanToCompletion or ... } ) ... — and either throw or no-op on the second call. An Interlocked flag is the cleanest. Pair it with a brief test asserting a second Start() does not double the push rate.

Minor findings

N1. Stale comment in ConfigReconciler claims GetOrCreate preserves the armed flag

ConfigReconciler.cs:364-365: "Rebuild the capture for the new tag set; GetOrCreate preserves the armed flag so an open detail page keeps capturing across the reload."

TagCaptureRegistry.GetOrCreate (TagCaptureRegistry.cs:33-37) no longer does this — the update delegate (_, _) => new TagValueCapture(map.All) builds a fresh disarmed capture and copies nothing. The XML doc on GetOrCreate itself (lines 27-31) correctly describes the new behavior ("The rebuilt capture is disarmed; StatusBroadcaster re-arms it on its next push cycle"). The ConfigReconciler comment is the stale one and now contradicts the code. It is the only surviving reference to the old armed-flag-preservation design.

Impact: misleading to a future maintainer; could prompt a "fix" that reintroduces the lost-update race the redesign eliminated.

Recommendation: update the ConfigReconciler.cs:364-365 comment to: "Rebuild the capture for the new tag set; the rebuilt capture is disarmed and StatusBroadcaster re-arms it within one push cycle if the PLC still has a viewer."

N2. TagValueCapture.Record's Record-vs-Disarm re-check still has a narrow stale-slot window

TagValueCapture.cs:122-143. The defensive re-read of _armed after the Volatile.Write is correct in spirit and closes the common case. But it is not airtight: Disarm() does _armed = false then clears slots (TagValueCapture.cs:110-112). Interleaving Record against Disarm:

  1. Record: _armed true → passes the gate.
  2. Disarm: _armed = false; begins the slot-clear loop, clears slot idx.
  3. Record: Volatile.Write(ref _slots[idx], newObs) — writes after Disarm already cleared idx.
  4. Record: re-reads _armed → false → Volatile.Write(ref _slots[idx], null). Slot ends null. Correct.

That path self-corrects. The genuinely uncovered ordering is step 3 landing after step 4's own null-write would have run — it cannot, single-threaded within Record — so Record is in fact safe for its own write. The residual risk is purely that Record is called concurrently from two threads for the same address (e.g. an FC03 backend response and an FC06 upstream write racing on the same tag): both pass the gate, both write, the re-check on the later writer can null the slot even though the capture is armed, dropping that observation until the next traffic. This is a lost update, not stale data, and the slot self-heals on the next PDU.

Impact: under concurrent same-address traffic with a near-simultaneous disarm, a single observation can be dropped. Functionally negligible for a debug view; flagged only for completeness since the prior review called this race out and it is narrowed but not fully eliminated.

Recommendation: no code change required — document in the Record comment that the guarantee is "armed captures never retain stale data after disarm," not "every observation while armed is retained." If stricter retention is ever wanted, Disarm would need to set _armed = false after clearing slots and Record would re-validate, but that inverts a different race; the current trade-off is the right one for a debug view.

N3. No test exercises a reconnect that drops the new connection first

StatusHubTests.Reconnect_SameTab_NewConnection_DoesNotLeakViewer (line 50) and PlcSubscriptionTrackerTests.SameTab_TwoConnections_StaysActiveUntilLastConnectionGone (line 27) both cover the reconnect-overlap case where the old connection's OnDisconnectedAsync fires last. Neither covers the mirror ordering (new connection's disconnect arrives before the old one's), nor the case where SubscribePlc for the new connection arrives before RemoveConnection for the old one but with a different tabId (e.g. a hard page reload that generates a fresh tabId). The tracker handles all of these correctly by construction, but the test suite asserts only one of the orderings.

Recommendation: add a test where two connections of the same tab are removed in reverse order, and one where a page reload (new tabId, old tab still has a live connection mid-teardown) is shown not to leak.

N4. SubscribePlc registers the tracker entry before the group join — a throw in AddToGroupAsync leaves a tracked viewer with no group membership

StatusHub.cs:46-54. The current order is deliberate (comment lines 48-51: register with the tracker first so this connection's own OnDisconnectedAsync sees consistent state). But if Groups.AddToGroupAsync throws (transport fault mid-invocation), the tracker now holds a subscription for a connection that is not in the PLC's SignalR group. The capture for that PLC gets armed by the next ReconcileArmed, but the broadcaster pushes detail messages to a group the connection never joined, so the page receives nothing while the capture is needlessly armed — until OnDisconnectedAsync eventually fires for that connection and RemoveConnection cleans it up. Self-healing (the disconnect callback always runs eventually) and the capture cost is bounded, so this is Minor — but the asymmetry is real.

Recommendation: either accept it (the disconnect callback is the backstop, and OnDisconnectedAsync is guaranteed to run for any connection that completed OnConnectedAsync) and add a one-line comment to that effect, or wrap the body so a failed AddToGroupAsync rolls back the tracker entry via RemoveConnection. The accept-and-document option is sufficient.

N5. StatusBroadcaster swallows every non-OCE push exception with no rate limiting

StatusBroadcaster.cs:96-130. Per-stage try/catch correctly isolates a snapshot-build failure, a fleet-push failure, and each per-PLC push failure, and the [LoggerMessage] events (mbproxy.admin.broadcast.{snapshot,fleet,detail}.failed, EventIds 72-74) match LogEvents.md. Good. But a persistently failing push (e.g. a wedged SignalR transport) logs one Error event per cycle — at the default 1 s cadence that is 86,400 error log lines/day for one stuck PLC. LogEvents.md for mbproxy.admin.broadcast.detail.failed says "Sustained occurrences ... mean that PLC's detail page is not receiving live updates" — so the volume is by design as an operator signal, but it can drown the rolling file.

Recommendation: consider logging the first failure at Error and subsequent consecutive failures at Debug (or every Nth), resetting on the first success. Minor — the documented contract permits the current behavior, so this is an enhancement, not a defect.

N6. DebugDtoSerializationTests re-creates the hub options by hand rather than asserting against the real configuration

DebugDtoSerializationTests.cs:18-19 builds new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase } and comments that this is "The exact policy AdminEndpointHost configures." It is a copy of AdminEndpointHost.cs:200-202, not a reference to it — if someone edits the AddJsonProtocol lambda (adds DefaultIgnoreCondition, changes the policy, etc.) the test still passes against its stale copy. HubStatusE2ETests does exercise the real hub end-to-end and asserts captureArmed/service/plcs appear, which is the genuine guard; the unit test is a weaker duplicate.

Recommendation: low priority — either delete the unit test as redundant with the E2E coverage, or extract the JsonSerializerOptions configuration into a shared static (e.g. AdminEndpointHost.HubJsonOptions) that both the production AddJsonProtocol call and the test reference, so they cannot drift.

N7. Doc drift: StatusPage.md does not state the detail payload travels only over SignalR, and LogEvents.md EventId 75 wording is fine but the broadcaster's mbproxy.admin.broadcast.* family is not cross-referenced from the doc's source list

docs/Operations/StatusPage.md "Debug View Data" (lines 313-331) documents every PlcDetailResponse field but never says the payload is reachable only over the /hub/status SignalR feed and has no GET route — a scraper author could reasonably expect a /plc/{name}.json twin. Prior review N7 raised exactly this and it is still open. Separately, StatusPage.md:23-25 describes the bind-failure and AdminPort hot-reload behavior accurately, and LogEvents.md lists the four mbproxy.admin.broadcast.* events correctly — no defect there, just confirming.

Recommendation: add one sentence to StatusPage.md's "Debug View Data" section: "PlcDetailResponse is delivered only over the /hub/status SignalR feed (the \"plc\" message); there is no GET route for it, and it is serialized through the SignalR JSON protocol, not StatusJsonContext."

What looks good

  • The prior Critical SignalR capture-leak (2026-05-15 C1/C2) is genuinely fixed. PlcSubscriptionTracker is keyed on tabId, not ConnectionId; a reconnect is "same tab acquires a new connection," so the viewer count cannot leak across a transport drop. StatusHub no longer arms/disarms captures — OnConnectedAsync is not even overridden — and arm/disarm flows through StatusBroadcaster.PushOnceAsync → TagCaptureRegistry.ReconcileArmed, a single-threaded once-per-cycle authority. ReconcileArmed reconciles all captures against the live ActivePlcs() set every cycle, so any stale arm state self-heals within one AdminPushIntervalMs. The StatusHubTests / PlcSubscriptionTrackerTests reconnect and concurrency-stress tests cover the core invariant.
  • The prior TagCapture Critical (cache-hit debug-view freeze) is fixed. PlcMultiplexer.cs:653-666 captures TagValueObservations into CacheEntry.CapturedTags at store time, and the cache-hit path (PlcMultiplexer.cs:844-850) replays them into the armed capture re-stamped to the hit time. The debug view now reflects cache-served reads.
  • No XSS / output-encoding surface. The admin layer renders no server-side HTML. / and /plc/{name} serve static embedded index.html / plc.html byte blobs (ServeHtmlShell); name is bound but unused and read client-side. /status.json and the SignalR payloads are JSON-serialized, never string-interpolated. There is nowhere for attacker-controlled input to reach an HTML response.
  • Endpoint is strictly read-only. Routes are GET /, GET /plc/{name}, GET /assets/{path}, GET /status.json, and the hub. The hub exposes only SubscribeFleet / SubscribePlc (group joins + tracker mutation) and OnDisconnectedAsync. No mutation route, no control action, no log download.
  • Asset serving is traversal-safe. AdminEndpointHost.cs:221 rejects /, \, and .. in the path segment before any resource lookup; assets are embedded manifest resources, so there is no filesystem surface. Bytes and misses are cached in a static ConcurrentDictionary.
  • Bind-failure cleanup is now complete (prior M5/M6 fixed). StartAppAsync declares app outside the try and its catch (AdminEndpointHost.cs:258-280) best-effort disposes the _broadcaster and stops/disposes the partially-started Kestrel app, so a throw after app.StartAsync no longer leaks a bound listener or a running push loop.
  • Lifecycle / hot-reload of AdminPort is correct. _broadcaster is constructed per StartAppAsync and disposed before Kestrel stops in StopCurrentAppAsync; the OnChange callback is guarded by the _disposed flag both before queueing and after the threadpool picks it up, and re-checks newPort == _currentPort under _lock. DisposeAsync is idempotent via _disposed. CTS/semaphore disposal is race-guarded.
  • StatusBroadcaster resource hygiene. _cts is disposed in DisposeAsync after StopAsync awaits the loop; StopAsync carries a _stopped flag so a double-stop (DisposeAsync after an explicit StopAsync) does not touch the disposed CTS or re-cancel. LoopAsync re-reads AdminPushIntervalMs each cycle (hot-reloadable) and floors it at 100 ms; it pushes before delaying so a freshly-connected dashboard populates immediately.
  • TagValueCapture torn-read safety holds. TagValueObservation is an immutable sealed record; slots are reference-typed and only swapped via Volatile.Write / read via Volatile.Read; _armed is volatile; the disarmed hot path is a single volatile-bool read after one ?. null-check. FrozenDictionary is built once and never mutated.
  • StatusSnapshotBuilder is lock-free and side-effect-free, degrades gracefully for an unknown PLC (BuildDebug returns a disarmed empty snapshot), and the CounterSnapshot fallback for a not-yet-started supervisor is exhaustive.
  • StatusBroadcaster.PushOnceAsync indexes the fleet snapshot into a Dictionary once per cycle (StatusBroadcaster.cs:113-114) — the prior review's O(N²) FirstOrDefault-per-PLC concern is resolved.