mbproxy: fix dashboard review findings, add named BCD tags + fleet config

Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 03:39:39 -04:00
parent e719dd51c1
commit 554b05d28c
27 changed files with 964 additions and 83 deletions
@@ -0,0 +1,285 @@
+# Code Review — Tests & Config for the SignalR Dashboard (commit `e719dd5`)
+
+Scope: the test files, config/options/build changes, and the smoke config introduced or
+modified by `e719dd5` ("replace status page with a live SignalR web dashboard"). Production
+SignalR/capture code (`StatusHub`, `StatusBroadcaster`, `TagValueCapture`,
+`TagCaptureRegistry`, `AdminEndpointHost`) was read for context but is reviewed only insofar
+as it tells us whether the new tests actually exercise the risky paths. The wider UI
+(`dashboard.js`, `detail.js`, HTML/CSS) is explicitly out of scope.
+
+## Summary
+
+The new tests are competently written, follow the existing `Subject_Condition_Expectation`
+naming, and the `IStatusPushSink` seam is a genuinely good design that makes the broadcaster
+unit-testable without a SignalR host. The `TagValueCapture` torn-read test is a real
+concurrency test, not a pretend one. **But the concurrency-sensitive paths the commit message
+advertises — "armed only while a detail page is open", arm/disarm under churn, the broadcaster
+*loop* — are tested only at the single-threaded happy-path level.** The most serious gaps:
+
+1. The **`StatusBroadcaster.LoopAsync` background loop has zero test coverage** — every
+   broadcaster test calls the internal `PushOnceAsync` directly. The loop's interval
+   re-read, the `Math.Max(100, ...)` floor, the cancellation path, and the
+   loop-terminated-unexpectedly catch are all unverified.
+2. The **arm/disarm-vs-record race in `TagValueCapture` is real and untested.** `Disarm()`
+   sets `_armed = false` *then* clears slots; `Record()` checks `_armed` *then* writes. A
+   `Record` that passes the `_armed` check before `Disarm` flips it can write a slot *after*
+   `Disarm` cleared it — leaving a stale observation on a disarmed capture. The torn-read
+   test only races `Record` against `Snapshot`, never against `Disarm`/`Arm`.
+3. The **`PlcSubscriptionTracker` — the lock-guarded type whose entire job is concurrent
+   subscriber counting — has no direct test at all** and no concurrent test anywhere. Its
+   behaviour is only incidentally exercised through single-threaded `StatusHubTests`.
+
+None are release-blockers for a read-only admin page on a trusted segment, but #2 is a latent
+correctness bug and #1/#3 mean a regression in the live-feed plumbing would ship silently.
+
+## Findings
+
+### Critical
+
+None. This is a read-only admin surface; nothing here can corrupt the Modbus path.
+
+### Major
+
+**M1 — `StatusBroadcaster.LoopAsync` has no test.**
+`src/Mbproxy/Admin/StatusBroadcaster.cs:113-135`; `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs`.
+All four broadcaster tests drive `PushOnceAsync` directly. The actual background loop —
+which is what runs in production — is untested. Specifically uncovered:
+- the per-cycle `_options.CurrentValue.AdminPushIntervalMs` re-read (the documented
+  hot-reload-without-restart behaviour);
+- the `Math.Max(100, ...)` floor that defends against a bad interval slipping past
+  validation (note: validation rejects `<= 0`, so the floor only ever matters for values
+  `1..99` — itself an untested corner);
+- that `StartAsync`→`StopAsync` actually terminates the loop and that `StopAsync` is safe
+  to call when the loop never started.
+Fix: add a test that constructs the broadcaster with `AdminPushIntervalMs` ~100ms, calls
+`Start()`, polls `Sink.FleetPushes.Count` until `>= 2` with a generous deadline + the test
+`CancellationToken`, then `StopAsync()` and asserts the count stops growing. Keep timing
+hedged (`ShouldBeGreaterThanOrEqualTo`), consistent with the existing coalescing tests.
+
+**M2 — Arm/disarm-vs-record race in `TagValueCapture` is real and untested.**
+`src/Mbproxy/Proxy/TagValueCapture.cs:103-129`. `Disarm()` does `_armed = false` then clears
+slots; `Record()` does `if (!_armed) return;` then `Volatile.Write` the slot. Interleaving:
+`Record` reads `_armed == true` → `Disarm` sets `false` and clears all slots → `Record`
+writes its observation. Result: a *disarmed* capture holds a non-null slot, and the next
+`Snapshot()` reports stale traffic with `UpdatedAtUtc != null` — exactly the "reopened page
+shows stale data" failure the on-demand design says it prevents. The hot path makes this
+unlikely (record traffic stops when no viewer is attached) but it is not impossible: a
+detail page closing while an in-flight FC03 response is being rewritten hits it.
+`ConcurrentRecordAndSnapshot_NeverYieldsTornSlot` (line 118) races `Record` vs `Snapshot`
+only — never vs `Disarm`. Fix in production: have `Disarm()` set `_armed = false`, then
+clear slots, then clear *again* (or re-clear after a memory barrier), or re-check `_armed`
+inside `Record` after the `Volatile.Write` and null the slot if disarmed. At minimum add a
+test that hammers `Arm`/`Disarm` on one task while `Record` runs on another and asserts a
+disarmed capture's `Snapshot()` is all-null. The review should not let this ship "tested"
+when the test deliberately avoids the racy interleaving.
+
+**M3 — `PlcSubscriptionTracker` has no direct or concurrent test.**
+`src/Mbproxy/Admin/PlcSubscriptionTracker.cs` (whole file); no `PlcSubscriptionTrackerTests.cs`
+exists. This is the single lock-guarded type whose correctness drives capture arm/disarm.
+`StatusHubTests` exercises it only single-threaded and indirectly. Untested behaviour that
+has no coverage anywhere:
+- the redundant re-subscribe path (`Add` returns `false` when the same connection
+  re-subscribes the same PLC — `set.Add` fails);
+- `RemoveConnection` for an unknown connection id returning `Array.Empty<string>()`;
+- a connection subscribed to *multiple* PLCs being torn down in one `RemoveConnection`;
+- concurrent `Add`/`RemoveConnection` from many "connections" — the lock is claimed
+  thread-safe but nothing proves the count never goes negative or leaks.
+Fix: add `PlcSubscriptionTrackerTests` with the single-threaded cases plus one
+parallel-stress test (N tasks each Add-then-Remove, assert `ActivePlcs()` is empty and no
+exception), mirroring `TxIdAllocatorTests`' concurrency style.
+
+**M4 — `StatusHub` group-leave on disconnect is not verified.**
+`src/Mbproxy/Admin/StatusHub.cs:60-66`; `StatusHubTests.cs`.
+`OnDisconnectedAsync` is tested for the *capture-disarm* side effect, but the hub never
+calls `Groups.RemoveFromGroupAsync` — and `FakeGroupManager` records `Removed` but **no test
+ever asserts `groups.Removed`**. Worth confirming this is intentional (real SignalR auto-
+removes a disconnected connection from all groups, so an explicit `RemoveFromGroupAsync` is
+genuinely unnecessary) — but then `FakeGroupManager.Removed` is dead code that implies a
+contract the hub does not honour. Either delete `Removed` from the fake, or if the hub is
+*supposed* to leave fleet groups explicitly, that's a production bug. As written, a reviewer
+cannot tell which. Fix: drop the unused `Removed` list, or add a comment on the fake
+explaining SignalR's implicit group cleanup so the gap is not mistaken for an omission.
+
+### Minor
+
+**m1 — `SignalRFakes` do not model the real SignalR failure surface.**
+`tests/Mbproxy.Tests/Admin/SignalRFakes.cs`. The fakes are faithful to the *shapes* but
+model only the success path: `FakeGroupManager.AddToGroupAsync` always succeeds and is
+synchronous; `FakeStatusPushSink.PushFleetAsync` never throws and never observes the
+`CancellationToken`. The production code has explicit `try/catch` around
+`_sink.PushFleetAsync` / `PushPlcAsync` (`StatusBroadcaster.cs:88-110`) — a sink that throws
+is a real scenario (a SignalR send to a dropped connection) and that catch is **completely
+uncovered**. The `BuildDebug` failure catch (`StatusBroadcaster.cs:82-86`) is likewise
+uncovered. Fix: add a throwing variant of `FakeStatusPushSink` and assert `PushOnceAsync`
+swallows the exception and still attempts the per-PLC pushes; assert the `ct` is honoured
+(an `OperationCanceledException` from the sink must *not* be logged-and-swallowed the same
+way — the `when (ex is not OperationCanceledException)` filter is untested).
+
+**m2 — `FakeHubCallerContext.ConnectionAborted` is hardcoded to `CancellationToken.None`.**
+`SignalRFakes.cs:23`. Real SignalR fires this token on disconnect. No current test needs it,
+but if anyone later tests connection-abort handling the fake will silently mask it. Low risk;
+note it with a comment so a future test author knows the fake is inert here.
+
+**m3 — Asset content-type test relies on a hand-maintained `[InlineData]` allow-list.**
+`AdminEndpointTests.Get_Asset_ReturnsCorrectContentType` covers 5 of the 14 embedded files.
+`bootstrap.bundle.min.js`, `detail.css`, `detail.js`, `index.html`, `plc.html`,
+`ibm-plex-sans-400/600.woff2` are not asserted. More importantly there is **no test that the
+csproj glob actually embedded every wwwroot file** — if someone adds `favicon.ico` to
+`wwwroot/` and the glob silently misses it (it won't, `*.*` catches it, but a `.gitignore`'d
+or renamed file would), nothing fails. Fix: add a test that enumerates
+`typeof(AdminEndpointHost).Assembly.GetManifestResourceNames()`, filters the
+`Mbproxy.Admin.wwwroot.` prefix, and asserts every file physically present in `wwwroot/` has
+a matching resource (or just assert the count). This is the only thing that would catch a
+broken `EmbeddedResource` glob.
+
+**m4 — `Get_Asset` does not assert the bytes are the *right* asset.**
+`AdminEndpointTests.cs` (the new theory). It asserts `bytes.Length > 0` and the content
+type, but not that `dashboard.js` contains a known marker (it already does this for the HTML
+shells via `ShouldContain("/assets/dashboard.js")`). A resource-name collision or a wrong
+`ContentTypeFor` mapping for a *correctly-served-but-wrong* file would pass. Cheap to harden:
+for the text assets assert a known substring.
+
+**m5 — `csproj` `EmbeddedResource Include="Admin\wwwroot\*.*"` — glob caveats.**
+`src/Mbproxy/Mbproxy.csproj` (the new `ItemGroup`). The SDK is `Microsoft.NET.Sdk.Worker`,
+which does **not** enable web default-item globs, so there is no double-include of `.css`/
+`.js`/`.html` as `Content` — good. Two real caveats:
+- `*.*` matches only files containing a dot. Every current asset has an extension so this is
+  fine, but it is a non-obvious constraint; `**\*` or just `*` would be more honest. The
+  comment says "intentionally flat", so `*.*` is acceptable, but worth a one-word note that
+  extensionless files would be skipped.
+- Globs in an `<ItemGroup>` are evaluated at project-evaluation time; a newly-added asset is
+  *not* picked up by an incremental build that didn't re-evaluate. In practice `dotnet
+  build`/`publish` always re-evaluates, so this is a non-issue — but it is the kind of thing
+  that bites a watch-mode developer. No fix needed; flagged for completeness.
+The resource-name → request-path mapping (`Mbproxy.Admin.wwwroot.<file>`) is correct for the
+flat directory and matches `AssetResourcePrefix` in `AdminEndpointHost.cs:301`.
+
+**m6 — `mbproxy.smoke.config.json` — valid but with stale/odd values.**
+`tests/sim/mbproxy.smoke.config.json`. The schema is current (`AdminPushIntervalMs` present,
+`Keepalive` field names match `KeepaliveOptions`, no removed keys). Issues:
+- The header comment says the simulator runs on `127.0.0.1:5020` and both `line-a`/`line-b`
+  point `Port: 5020` — consistent. Good.
+- `BackendHeartbeatIdleMs: 10000` while `BackendRequestTimeoutMs: 2000` — satisfies the
+  "must be greater than BackendRequestTimeoutMs" cross-field rule. Fine.
+- `line-dead` has no `BcdTags`, so it inherits the empty `Global: []` set — fine, the intent
+  is just an unreachable backend.
+- The file has **no `Resilience` and no `Cache` section.** Both are optional (defaults
+  apply), so the config binds — but a smoke config whose stated purpose is exercising the
+  dashboard's "problems only" filter would be more representative with the defaults made
+  explicit, or at least a comment that defaults are intentionally relied on.
+- Comment line 1 references "Phase 4/5 web-UI browser smoke tests" and
+  `plans/2026-05-15-webui-dashboard.md`. `plans/` is untracked (`?? plans/` in git status)
+  and per recent commit `7466a46` the project has been *retiring* plan docs. A smoke config
+  pointing at an untracked plan file is a dangling reference — either commit the plan or
+  drop the citation.
+- No `Resilience.ReadCoalescing` means coalescing runs with defaults during the smoke run;
+  acceptable but undocumented.
+
+**m7 — `ConfigReconcilerTests` change is a minimal compile-fix, not a behaviour test.**
+`tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` (the 1-line diff). The new
+`TagCaptureRegistry` ctor arg is passed as a throwaway `new TagCaptureRegistry()`. That is
+the correct minimal change, but it means **the reconciler's new responsibility — calling
+`TagCaptureRegistry.GetOrCreate` on PLC add and `Remove` on PLC removal during hot-reload —
+has no test.** `ConfigReconciler.cs` was modified in this commit (`+11` lines) to wire the
+registry; nothing asserts that a hot-reload-added PLC gets a capture or that a removed PLC's
+capture is dropped. Fix: in `ConfigReconcilerTests`, pass a real registry the test holds a
+reference to, trigger an add/remove reload, and assert `registry.TryGet` reflects it.
+
+**m8 — `ReloadValidator`/`MbproxyOptions` `AdminPushIntervalMs` validation: consistent, but
+the upper bound is unguarded.** `ReloadValidator.cs` and `MbproxyOptionsValidator` both
+reject `<= 0` — consistent with how `GracefulShutdownTimeoutMs` and the keepalive intervals
+are validated (lower-bound-only, `> 0`). So the *new* validation matches house style. But
+note no interval option in this codebase has an *upper* bound, and `AdminPushIntervalMs` is
+the one most likely to be fat-fingered into something absurd (`1000000` = 16-minute feed).
+Not a regression and not inconsistent — flagged only because the dashboard's whole value
+proposition is "live". The `LoopAsync` `Math.Max(100, ...)` floor protects against tiny
+values; nothing protects against huge ones. Optional: a soft upper bound (e.g. reject
+`> 60_000`) or just a doc note. The two new `ReloadValidatorTests` (zero, negative) and two
+new `MbproxyOptionsBindingTests` (default 1000, binds 250) are correct and adequate for the
+bounds that *are* enforced.
+
+**m9 — `StatusHubTests.SecondSubscriber_FirstLeaveKeepsArmed_LastLeaveDisarms` passes
+`null` to `OnDisconnectedAsync`.** Line 72/76. Real SignalR passes a non-null `Exception`
+on an abnormal disconnect and `null` on a clean one, so `null` is a legal value — but the
+test never exercises the exception-carrying path. `StatusHub.OnDisconnectedAsync` ignores
+the exception entirely, so this is harmless today; flagged so a future change that *uses*
+the exception doesn't find itself untested.
+
+## Coverage gaps (behavior in `e719dd5` with NO test)
+
+1. **`StatusBroadcaster.LoopAsync`** — the entire background loop (M1).
+2. **Arm/disarm vs. `Record` race** in `TagValueCapture` (M2).
+3. **`PlcSubscriptionTracker`** — no direct test, no concurrency test (M3).
+4. **Sink-throws / build-throws error handling** in `PushOnceAsync` — the `try/catch`
+   blocks and the `when (ex is not OperationCanceledException)` filter (m1).
+5. **`ConfigReconciler` ↔ `TagCaptureRegistry` wiring** — capture created on hot-reload PLC
+   add, removed on PLC remove (m7).
+6. **`AdminEndpointHost` AdminPort hot-reload tears down the broadcaster and disarms
+   captures** — `StopCurrentAppAsync` disposes `_broadcaster` (which `DisarmAll`s); the
+   broadcaster's `StopAsync_DisarmsEveryCapture` test proves the disarm in isolation but no
+   test proves the *hot-reload rebind* path runs it. (`AdminEndpointTests` has an admin-port
+   rebind test from the prior commit — extend it to assert capture disarm.)
+7. **`/hub/status` SignalR endpoint** — not reachable by any test. The 405-methods test
+   explicitly excludes it. No test connects a SignalR client and verifies a `fleet`/`plc`
+   message is actually received end-to-end. The `IStatusPushSink` unit tests cover the
+   broadcaster logic but nothing covers `SignalRStatusPushSink` + `MapHub` wiring. Given the
+   project already stands up a real Kestrel host in `AdminEndpointTests`, one
+   `HubConnectionBuilder`-based E2E test (`SubscribeFleet` → await a `fleet` message) would
+   close the single biggest "does the feature actually work" gap.
+8. **Multi-tag / 32-bit capture via `BcdPduPipeline`** — `BcdPduPipelineCaptureTests` covers
+   FC03 16/32-bit and FC06/FC16 16-bit, but not FC16 *32-bit* (CDAB pair write) capture, and
+   not a PDU spanning a BCD tag *and* a non-BCD register (does capture record only the BCD
+   one?). The pure `TagValueCaptureTests` cover 32-bit, but the *pipeline hook* for 32-bit
+   writes is unverified.
+9. **`Get_PlcDetailRoute_ReturnsDetailShell`** uses `/plc/anything` — it never checks that a
+   PLC name with a slash, encoded characters, or empty segment is handled. The route is
+   `/plc/{name}` and the name is read client-side, so this is low-risk, but
+   `/plc/` (empty) and `/plc/a%2Fb` are untested.
+
+## What looks good
+
+- **`IStatusPushSink` is a clean seam.** Extracting the outbound side behind an interface so
+  the broadcaster loop logic is testable without a SignalR host is exactly right, and
+  `FakeStatusPushSink` using `ConcurrentBag` is the correct call for a type that production
+  pushes to from a background thread.
+- **`TagValueCapture` torn-read test is genuine.** `ConcurrentRecordAndSnapshot_NeverYields
+  TornSlot` races 4 writers × 200k ops against a reader and checks a real invariant
+  (`DecodedValue == RawLow + RawHigh`). This is the right way to test the `Volatile.Write`
+  immutable-record swap, and it would actually catch a regression to a mutable slot.
+- **`SignalRFakes` model the right shapes.** `FakeHubCallerContext : HubCallerContext` with
+  the abstract members overridden, `FakeGroupManager : IGroupManager`, and the hub-property
+  injection (`Context = ...`, `Groups = ...`) is the standard, correct way to unit-test a
+  SignalR `Hub` without a host. No mocking framework is dragged in.
+- **`BcdPduPipelineCaptureTests` regression guards are well-chosen.** The disarmed-capture
+  and null-capture cases assert the rewrite *still happens byte-identically* — that is the
+  load-bearing property (capture must never perturb the proxy's transparency) and it is
+  explicitly tested.
+- **`TagCaptureRegistryTests.GetOrCreate_Rebuild_PreservesArmedFlag`** correctly tests the
+  hot-reload reseat contract (rebuilt capture is a new instance but keeps `IsArmed`), and
+  `UnknownPlc_Operations_AreSafeNoOps` covers the no-op-for-ghost-PLC contract that
+  `StatusHub` relies on.
+- **`StatusHtmlRenderer` removal is clean.** No source or test file references it; the only
+  remaining mentions are in `plans/`, `docs/`, and prior `codereviews/` — all expected.
+- **The `AdminPushIntervalMs` validation is consistent with the codebase.** Lower-bound-only
+  `> 0` checks in both `MbproxyOptionsValidator` and `ReloadValidator`, error-message format
+  matching the sibling checks, and tests for default + bind + zero + negative. This is the
+  right pattern; the new tests are adequate for what is enforced.
+- **The csproj `EmbeddedResource` glob is correct** for the flat-directory design, the
+  comment accurately documents the resource-name → request-path mapping, and there is no
+  accidental double-include because the Worker SDK does not enable web default globs.
+- **`mbproxy.smoke.config.json` binds against the current schema** — no stale keys, the
+  three-PLC line-a/line-b/line-dead topology is a sensible smoke surface, and
+  `AdminPushIntervalMs` is present and explicit.
+
+## Key file references
+
+- `src/Mbproxy/Admin/StatusBroadcaster.cs:113-135` — untested `LoopAsync` (M1).
+- `src/Mbproxy/Proxy/TagValueCapture.cs:103-129` — arm/disarm vs. record race (M2).
+- `src/Mbproxy/Admin/PlcSubscriptionTracker.cs` — no test file exists (M3).
+- `src/Mbproxy/Admin/StatusHub.cs:60-66` — `OnDisconnectedAsync`; `FakeGroupManager.Removed`
+  is unasserted dead code (M4).
+- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs` — success-path-only fakes (m1, m2).
+- `tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` — compile-fix only, reconciler
+  ↔ registry wiring untested (m7).
+- `src/Mbproxy/Mbproxy.csproj` — `EmbeddedResource Include="Admin\wwwroot\*.*"` (m5).
+- `tests/sim/mbproxy.smoke.config.json` — dangling `plans/` reference (m6).