mbproxy: fix dashboard review findings, add named BCD tags + fleet config

Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-16 03:39:39 -04:00
parent e719dd51c1
commit 554b05d28c
27 changed files with 964 additions and 83 deletions
@@ -0,0 +1,285 @@
# Code Review — Tests & Config for the SignalR Dashboard (commit `e719dd5`)
Scope: the test files, config/options/build changes, and the smoke config introduced or
modified by `e719dd5` ("replace status page with a live SignalR web dashboard"). Production
SignalR/capture code (`StatusHub`, `StatusBroadcaster`, `TagValueCapture`,
`TagCaptureRegistry`, `AdminEndpointHost`) was read for context but is reviewed only insofar
as it tells us whether the new tests actually exercise the risky paths. The wider UI
(`dashboard.js`, `detail.js`, HTML/CSS) is explicitly out of scope.
## Summary
The new tests are competently written, follow the existing `Subject_Condition_Expectation`
naming, and the `IStatusPushSink` seam is a genuinely good design that makes the broadcaster
unit-testable without a SignalR host. The `TagValueCapture` torn-read test is a real
concurrency test, not a pretend one. **But the concurrency-sensitive paths the commit message
advertises — "armed only while a detail page is open", arm/disarm under churn, the broadcaster
*loop* — are tested only at the single-threaded happy-path level.** The most serious gaps:
1. The **`StatusBroadcaster.LoopAsync` background loop has zero test coverage** — every
broadcaster test calls the internal `PushOnceAsync` directly. The loop's interval
re-read, the `Math.Max(100, ...)` floor, the cancellation path, and the
loop-terminated-unexpectedly catch are all unverified.
2. The **arm/disarm-vs-record race in `TagValueCapture` is real and untested.** `Disarm()`
sets `_armed = false` *then* clears slots; `Record()` checks `_armed` *then* writes. A
`Record` that passes the `_armed` check before `Disarm` flips it can write a slot *after*
`Disarm` cleared it — leaving a stale observation on a disarmed capture. The torn-read
test only races `Record` against `Snapshot`, never against `Disarm`/`Arm`.
3. The **`PlcSubscriptionTracker` — the lock-guarded type whose entire job is concurrent
subscriber counting — has no direct test at all** and no concurrent test anywhere. Its
behaviour is only incidentally exercised through single-threaded `StatusHubTests`.
None are release-blockers for a read-only admin page on a trusted segment, but #2 is a latent
correctness bug and #1/#3 mean a regression in the live-feed plumbing would ship silently.
## Findings
### Critical
None. This is a read-only admin surface; nothing here can corrupt the Modbus path.
### Major
**M1 — `StatusBroadcaster.LoopAsync` has no test.**
`src/Mbproxy/Admin/StatusBroadcaster.cs:113-135`; `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs`.
All four broadcaster tests drive `PushOnceAsync` directly. The actual background loop —
which is what runs in production — is untested. Specifically uncovered:
- the per-cycle `_options.CurrentValue.AdminPushIntervalMs` re-read (the documented
hot-reload-without-restart behaviour);
- the `Math.Max(100, ...)` floor that defends against a bad interval slipping past
validation (note: validation rejects `<= 0`, so the floor only ever matters for values
`1..99` — itself an untested corner);
- that `StartAsync``StopAsync` actually terminates the loop and that `StopAsync` is safe
to call when the loop never started.
Fix: add a test that constructs the broadcaster with `AdminPushIntervalMs` ~100ms, calls
`Start()`, polls `Sink.FleetPushes.Count` until `>= 2` with a generous deadline + the test
`CancellationToken`, then `StopAsync()` and asserts the count stops growing. Keep timing
hedged (`ShouldBeGreaterThanOrEqualTo`), consistent with the existing coalescing tests.
**M2 — Arm/disarm-vs-record race in `TagValueCapture` is real and untested.**
`src/Mbproxy/Proxy/TagValueCapture.cs:103-129`. `Disarm()` does `_armed = false` then clears
slots; `Record()` does `if (!_armed) return;` then `Volatile.Write` the slot. Interleaving:
`Record` reads `_armed == true``Disarm` sets `false` and clears all slots → `Record`
writes its observation. Result: a *disarmed* capture holds a non-null slot, and the next
`Snapshot()` reports stale traffic with `UpdatedAtUtc != null` — exactly the "reopened page
shows stale data" failure the on-demand design says it prevents. The hot path makes this
unlikely (record traffic stops when no viewer is attached) but it is not impossible: a
detail page closing while an in-flight FC03 response is being rewritten hits it.
`ConcurrentRecordAndSnapshot_NeverYieldsTornSlot` (line 118) races `Record` vs `Snapshot`
only — never vs `Disarm`. Fix in production: have `Disarm()` set `_armed = false`, then
clear slots, then clear *again* (or re-clear after a memory barrier), or re-check `_armed`
inside `Record` after the `Volatile.Write` and null the slot if disarmed. At minimum add a
test that hammers `Arm`/`Disarm` on one task while `Record` runs on another and asserts a
disarmed capture's `Snapshot()` is all-null. The review should not let this ship "tested"
when the test deliberately avoids the racy interleaving.
**M3 — `PlcSubscriptionTracker` has no direct or concurrent test.**
`src/Mbproxy/Admin/PlcSubscriptionTracker.cs` (whole file); no `PlcSubscriptionTrackerTests.cs`
exists. This is the single lock-guarded type whose correctness drives capture arm/disarm.
`StatusHubTests` exercises it only single-threaded and indirectly. Untested behaviour that
has no coverage anywhere:
- the redundant re-subscribe path (`Add` returns `false` when the same connection
re-subscribes the same PLC — `set.Add` fails);
- `RemoveConnection` for an unknown connection id returning `Array.Empty<string>()`;
- a connection subscribed to *multiple* PLCs being torn down in one `RemoveConnection`;
- concurrent `Add`/`RemoveConnection` from many "connections" — the lock is claimed
thread-safe but nothing proves the count never goes negative or leaks.
Fix: add `PlcSubscriptionTrackerTests` with the single-threaded cases plus one
parallel-stress test (N tasks each Add-then-Remove, assert `ActivePlcs()` is empty and no
exception), mirroring `TxIdAllocatorTests`' concurrency style.
**M4 — `StatusHub` group-leave on disconnect is not verified.**
`src/Mbproxy/Admin/StatusHub.cs:60-66`; `StatusHubTests.cs`.
`OnDisconnectedAsync` is tested for the *capture-disarm* side effect, but the hub never
calls `Groups.RemoveFromGroupAsync` — and `FakeGroupManager` records `Removed` but **no test
ever asserts `groups.Removed`**. Worth confirming this is intentional (real SignalR auto-
removes a disconnected connection from all groups, so an explicit `RemoveFromGroupAsync` is
genuinely unnecessary) — but then `FakeGroupManager.Removed` is dead code that implies a
contract the hub does not honour. Either delete `Removed` from the fake, or if the hub is
*supposed* to leave fleet groups explicitly, that's a production bug. As written, a reviewer
cannot tell which. Fix: drop the unused `Removed` list, or add a comment on the fake
explaining SignalR's implicit group cleanup so the gap is not mistaken for an omission.
### Minor
**m1 — `SignalRFakes` do not model the real SignalR failure surface.**
`tests/Mbproxy.Tests/Admin/SignalRFakes.cs`. The fakes are faithful to the *shapes* but
model only the success path: `FakeGroupManager.AddToGroupAsync` always succeeds and is
synchronous; `FakeStatusPushSink.PushFleetAsync` never throws and never observes the
`CancellationToken`. The production code has explicit `try/catch` around
`_sink.PushFleetAsync` / `PushPlcAsync` (`StatusBroadcaster.cs:88-110`) — a sink that throws
is a real scenario (a SignalR send to a dropped connection) and that catch is **completely
uncovered**. The `BuildDebug` failure catch (`StatusBroadcaster.cs:82-86`) is likewise
uncovered. Fix: add a throwing variant of `FakeStatusPushSink` and assert `PushOnceAsync`
swallows the exception and still attempts the per-PLC pushes; assert the `ct` is honoured
(an `OperationCanceledException` from the sink must *not* be logged-and-swallowed the same
way — the `when (ex is not OperationCanceledException)` filter is untested).
**m2 — `FakeHubCallerContext.ConnectionAborted` is hardcoded to `CancellationToken.None`.**
`SignalRFakes.cs:23`. Real SignalR fires this token on disconnect. No current test needs it,
but if anyone later tests connection-abort handling the fake will silently mask it. Low risk;
note it with a comment so a future test author knows the fake is inert here.
**m3 — Asset content-type test relies on a hand-maintained `[InlineData]` allow-list.**
`AdminEndpointTests.Get_Asset_ReturnsCorrectContentType` covers 5 of the 14 embedded files.
`bootstrap.bundle.min.js`, `detail.css`, `detail.js`, `index.html`, `plc.html`,
`ibm-plex-sans-400/600.woff2` are not asserted. More importantly there is **no test that the
csproj glob actually embedded every wwwroot file** — if someone adds `favicon.ico` to
`wwwroot/` and the glob silently misses it (it won't, `*.*` catches it, but a `.gitignore`'d
or renamed file would), nothing fails. Fix: add a test that enumerates
`typeof(AdminEndpointHost).Assembly.GetManifestResourceNames()`, filters the
`Mbproxy.Admin.wwwroot.` prefix, and asserts every file physically present in `wwwroot/` has
a matching resource (or just assert the count). This is the only thing that would catch a
broken `EmbeddedResource` glob.
**m4 — `Get_Asset` does not assert the bytes are the *right* asset.**
`AdminEndpointTests.cs` (the new theory). It asserts `bytes.Length > 0` and the content
type, but not that `dashboard.js` contains a known marker (it already does this for the HTML
shells via `ShouldContain("/assets/dashboard.js")`). A resource-name collision or a wrong
`ContentTypeFor` mapping for a *correctly-served-but-wrong* file would pass. Cheap to harden:
for the text assets assert a known substring.
**m5 — `csproj` `EmbeddedResource Include="Admin\wwwroot\*.*"` — glob caveats.**
`src/Mbproxy/Mbproxy.csproj` (the new `ItemGroup`). The SDK is `Microsoft.NET.Sdk.Worker`,
which does **not** enable web default-item globs, so there is no double-include of `.css`/
`.js`/`.html` as `Content` — good. Two real caveats:
- `*.*` matches only files containing a dot. Every current asset has an extension so this is
fine, but it is a non-obvious constraint; `**\*` or just `*` would be more honest. The
comment says "intentionally flat", so `*.*` is acceptable, but worth a one-word note that
extensionless files would be skipped.
- Globs in an `<ItemGroup>` are evaluated at project-evaluation time; a newly-added asset is
*not* picked up by an incremental build that didn't re-evaluate. In practice `dotnet
build`/`publish` always re-evaluates, so this is a non-issue — but it is the kind of thing
that bites a watch-mode developer. No fix needed; flagged for completeness.
The resource-name → request-path mapping (`Mbproxy.Admin.wwwroot.<file>`) is correct for the
flat directory and matches `AssetResourcePrefix` in `AdminEndpointHost.cs:301`.
**m6 — `mbproxy.smoke.config.json` — valid but with stale/odd values.**
`tests/sim/mbproxy.smoke.config.json`. The schema is current (`AdminPushIntervalMs` present,
`Keepalive` field names match `KeepaliveOptions`, no removed keys). Issues:
- The header comment says the simulator runs on `127.0.0.1:5020` and both `line-a`/`line-b`
point `Port: 5020` — consistent. Good.
- `BackendHeartbeatIdleMs: 10000` while `BackendRequestTimeoutMs: 2000` — satisfies the
"must be greater than BackendRequestTimeoutMs" cross-field rule. Fine.
- `line-dead` has no `BcdTags`, so it inherits the empty `Global: []` set — fine, the intent
is just an unreachable backend.
- The file has **no `Resilience` and no `Cache` section.** Both are optional (defaults
apply), so the config binds — but a smoke config whose stated purpose is exercising the
dashboard's "problems only" filter would be more representative with the defaults made
explicit, or at least a comment that defaults are intentionally relied on.
- Comment line 1 references "Phase 4/5 web-UI browser smoke tests" and
`plans/2026-05-15-webui-dashboard.md`. `plans/` is untracked (`?? plans/` in git status)
and per recent commit `7466a46` the project has been *retiring* plan docs. A smoke config
pointing at an untracked plan file is a dangling reference — either commit the plan or
drop the citation.
- No `Resilience.ReadCoalescing` means coalescing runs with defaults during the smoke run;
acceptable but undocumented.
**m7 — `ConfigReconcilerTests` change is a minimal compile-fix, not a behaviour test.**
`tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` (the 1-line diff). The new
`TagCaptureRegistry` ctor arg is passed as a throwaway `new TagCaptureRegistry()`. That is
the correct minimal change, but it means **the reconciler's new responsibility — calling
`TagCaptureRegistry.GetOrCreate` on PLC add and `Remove` on PLC removal during hot-reload —
has no test.** `ConfigReconciler.cs` was modified in this commit (`+11` lines) to wire the
registry; nothing asserts that a hot-reload-added PLC gets a capture or that a removed PLC's
capture is dropped. Fix: in `ConfigReconcilerTests`, pass a real registry the test holds a
reference to, trigger an add/remove reload, and assert `registry.TryGet` reflects it.
**m8 — `ReloadValidator`/`MbproxyOptions` `AdminPushIntervalMs` validation: consistent, but
the upper bound is unguarded.** `ReloadValidator.cs` and `MbproxyOptionsValidator` both
reject `<= 0` — consistent with how `GracefulShutdownTimeoutMs` and the keepalive intervals
are validated (lower-bound-only, `> 0`). So the *new* validation matches house style. But
note no interval option in this codebase has an *upper* bound, and `AdminPushIntervalMs` is
the one most likely to be fat-fingered into something absurd (`1000000` = 16-minute feed).
Not a regression and not inconsistent — flagged only because the dashboard's whole value
proposition is "live". The `LoopAsync` `Math.Max(100, ...)` floor protects against tiny
values; nothing protects against huge ones. Optional: a soft upper bound (e.g. reject
`> 60_000`) or just a doc note. The two new `ReloadValidatorTests` (zero, negative) and two
new `MbproxyOptionsBindingTests` (default 1000, binds 250) are correct and adequate for the
bounds that *are* enforced.
**m9 — `StatusHubTests.SecondSubscriber_FirstLeaveKeepsArmed_LastLeaveDisarms` passes
`null` to `OnDisconnectedAsync`.** Line 72/76. Real SignalR passes a non-null `Exception`
on an abnormal disconnect and `null` on a clean one, so `null` is a legal value — but the
test never exercises the exception-carrying path. `StatusHub.OnDisconnectedAsync` ignores
the exception entirely, so this is harmless today; flagged so a future change that *uses*
the exception doesn't find itself untested.
## Coverage gaps (behavior in `e719dd5` with NO test)
1. **`StatusBroadcaster.LoopAsync`** — the entire background loop (M1).
2. **Arm/disarm vs. `Record` race** in `TagValueCapture` (M2).
3. **`PlcSubscriptionTracker`** — no direct test, no concurrency test (M3).
4. **Sink-throws / build-throws error handling** in `PushOnceAsync` — the `try/catch`
blocks and the `when (ex is not OperationCanceledException)` filter (m1).
5. **`ConfigReconciler``TagCaptureRegistry` wiring** — capture created on hot-reload PLC
add, removed on PLC remove (m7).
6. **`AdminEndpointHost` AdminPort hot-reload tears down the broadcaster and disarms
captures** — `StopCurrentAppAsync` disposes `_broadcaster` (which `DisarmAll`s); the
broadcaster's `StopAsync_DisarmsEveryCapture` test proves the disarm in isolation but no
test proves the *hot-reload rebind* path runs it. (`AdminEndpointTests` has an admin-port
rebind test from the prior commit — extend it to assert capture disarm.)
7. **`/hub/status` SignalR endpoint** — not reachable by any test. The 405-methods test
explicitly excludes it. No test connects a SignalR client and verifies a `fleet`/`plc`
message is actually received end-to-end. The `IStatusPushSink` unit tests cover the
broadcaster logic but nothing covers `SignalRStatusPushSink` + `MapHub` wiring. Given the
project already stands up a real Kestrel host in `AdminEndpointTests`, one
`HubConnectionBuilder`-based E2E test (`SubscribeFleet` → await a `fleet` message) would
close the single biggest "does the feature actually work" gap.
8. **Multi-tag / 32-bit capture via `BcdPduPipeline`**`BcdPduPipelineCaptureTests` covers
FC03 16/32-bit and FC06/FC16 16-bit, but not FC16 *32-bit* (CDAB pair write) capture, and
not a PDU spanning a BCD tag *and* a non-BCD register (does capture record only the BCD
one?). The pure `TagValueCaptureTests` cover 32-bit, but the *pipeline hook* for 32-bit
writes is unverified.
9. **`Get_PlcDetailRoute_ReturnsDetailShell`** uses `/plc/anything` — it never checks that a
PLC name with a slash, encoded characters, or empty segment is handled. The route is
`/plc/{name}` and the name is read client-side, so this is low-risk, but
`/plc/` (empty) and `/plc/a%2Fb` are untested.
## What looks good
- **`IStatusPushSink` is a clean seam.** Extracting the outbound side behind an interface so
the broadcaster loop logic is testable without a SignalR host is exactly right, and
`FakeStatusPushSink` using `ConcurrentBag` is the correct call for a type that production
pushes to from a background thread.
- **`TagValueCapture` torn-read test is genuine.** `ConcurrentRecordAndSnapshot_NeverYields
TornSlot` races 4 writers × 200k ops against a reader and checks a real invariant
(`DecodedValue == RawLow + RawHigh`). This is the right way to test the `Volatile.Write`
immutable-record swap, and it would actually catch a regression to a mutable slot.
- **`SignalRFakes` model the right shapes.** `FakeHubCallerContext : HubCallerContext` with
the abstract members overridden, `FakeGroupManager : IGroupManager`, and the hub-property
injection (`Context = ...`, `Groups = ...`) is the standard, correct way to unit-test a
SignalR `Hub` without a host. No mocking framework is dragged in.
- **`BcdPduPipelineCaptureTests` regression guards are well-chosen.** The disarmed-capture
and null-capture cases assert the rewrite *still happens byte-identically* — that is the
load-bearing property (capture must never perturb the proxy's transparency) and it is
explicitly tested.
- **`TagCaptureRegistryTests.GetOrCreate_Rebuild_PreservesArmedFlag`** correctly tests the
hot-reload reseat contract (rebuilt capture is a new instance but keeps `IsArmed`), and
`UnknownPlc_Operations_AreSafeNoOps` covers the no-op-for-ghost-PLC contract that
`StatusHub` relies on.
- **`StatusHtmlRenderer` removal is clean.** No source or test file references it; the only
remaining mentions are in `plans/`, `docs/`, and prior `codereviews/` — all expected.
- **The `AdminPushIntervalMs` validation is consistent with the codebase.** Lower-bound-only
`> 0` checks in both `MbproxyOptionsValidator` and `ReloadValidator`, error-message format
matching the sibling checks, and tests for default + bind + zero + negative. This is the
right pattern; the new tests are adequate for what is enforced.
- **The csproj `EmbeddedResource` glob is correct** for the flat-directory design, the
comment accurately documents the resource-name → request-path mapping, and there is no
accidental double-include because the Worker SDK does not enable web default globs.
- **`mbproxy.smoke.config.json` binds against the current schema** — no stale keys, the
three-PLC line-a/line-b/line-dead topology is a sensible smoke surface, and
`AdminPushIntervalMs` is present and explicit.
## Key file references
- `src/Mbproxy/Admin/StatusBroadcaster.cs:113-135` — untested `LoopAsync` (M1).
- `src/Mbproxy/Proxy/TagValueCapture.cs:103-129` — arm/disarm vs. record race (M2).
- `src/Mbproxy/Admin/PlcSubscriptionTracker.cs` — no test file exists (M3).
- `src/Mbproxy/Admin/StatusHub.cs:60-66` — `OnDisconnectedAsync`; `FakeGroupManager.Removed`
is unasserted dead code (M4).
- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs` — success-path-only fakes (m1, m2).
- `tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` — compile-fix only, reconciler
↔ registry wiring untested (m7).
- `src/Mbproxy/Mbproxy.csproj` — `EmbeddedResource Include="Admin\wwwroot\*.*"` (m5).
- `tests/sim/mbproxy.smoke.config.json` — dangling `plans/` reference (m6).