Files
wwtools/mbproxy/codereviews/2026-05-15/TestsAndConfig.md
Joseph Doherty 554b05d28c mbproxy: fix dashboard review findings, add named BCD tags + fleet config
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 03:39:39 -04:00

286 lines
20 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Tests & Config for the SignalR Dashboard (commit `e719dd5`)
Scope: the test files, config/options/build changes, and the smoke config introduced or
modified by `e719dd5` ("replace status page with a live SignalR web dashboard"). Production
SignalR/capture code (`StatusHub`, `StatusBroadcaster`, `TagValueCapture`,
`TagCaptureRegistry`, `AdminEndpointHost`) was read for context but is reviewed only insofar
as it tells us whether the new tests actually exercise the risky paths. The wider UI
(`dashboard.js`, `detail.js`, HTML/CSS) is explicitly out of scope.
## Summary
The new tests are competently written, follow the existing `Subject_Condition_Expectation`
naming, and the `IStatusPushSink` seam is a genuinely good design that makes the broadcaster
unit-testable without a SignalR host. The `TagValueCapture` torn-read test is a real
concurrency test, not a pretend one. **But the concurrency-sensitive paths the commit message
advertises — "armed only while a detail page is open", arm/disarm under churn, the broadcaster
*loop* — are tested only at the single-threaded happy-path level.** The most serious gaps:
1. The **`StatusBroadcaster.LoopAsync` background loop has zero test coverage** — every
broadcaster test calls the internal `PushOnceAsync` directly. The loop's interval
re-read, the `Math.Max(100, ...)` floor, the cancellation path, and the
loop-terminated-unexpectedly catch are all unverified.
2. The **arm/disarm-vs-record race in `TagValueCapture` is real and untested.** `Disarm()`
sets `_armed = false` *then* clears slots; `Record()` checks `_armed` *then* writes. A
`Record` that passes the `_armed` check before `Disarm` flips it can write a slot *after*
`Disarm` cleared it — leaving a stale observation on a disarmed capture. The torn-read
test only races `Record` against `Snapshot`, never against `Disarm`/`Arm`.
3. The **`PlcSubscriptionTracker` — the lock-guarded type whose entire job is concurrent
subscriber counting — has no direct test at all** and no concurrent test anywhere. Its
behaviour is only incidentally exercised through single-threaded `StatusHubTests`.
None are release-blockers for a read-only admin page on a trusted segment, but #2 is a latent
correctness bug and #1/#3 mean a regression in the live-feed plumbing would ship silently.
## Findings
### Critical
None. This is a read-only admin surface; nothing here can corrupt the Modbus path.
### Major
**M1 — `StatusBroadcaster.LoopAsync` has no test.**
`src/Mbproxy/Admin/StatusBroadcaster.cs:113-135`; `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs`.
All four broadcaster tests drive `PushOnceAsync` directly. The actual background loop —
which is what runs in production — is untested. Specifically uncovered:
- the per-cycle `_options.CurrentValue.AdminPushIntervalMs` re-read (the documented
hot-reload-without-restart behaviour);
- the `Math.Max(100, ...)` floor that defends against a bad interval slipping past
validation (note: validation rejects `<= 0`, so the floor only ever matters for values
`1..99` — itself an untested corner);
- that `StartAsync``StopAsync` actually terminates the loop and that `StopAsync` is safe
to call when the loop never started.
Fix: add a test that constructs the broadcaster with `AdminPushIntervalMs` ~100ms, calls
`Start()`, polls `Sink.FleetPushes.Count` until `>= 2` with a generous deadline + the test
`CancellationToken`, then `StopAsync()` and asserts the count stops growing. Keep timing
hedged (`ShouldBeGreaterThanOrEqualTo`), consistent with the existing coalescing tests.
**M2 — Arm/disarm-vs-record race in `TagValueCapture` is real and untested.**
`src/Mbproxy/Proxy/TagValueCapture.cs:103-129`. `Disarm()` does `_armed = false` then clears
slots; `Record()` does `if (!_armed) return;` then `Volatile.Write` the slot. Interleaving:
`Record` reads `_armed == true``Disarm` sets `false` and clears all slots → `Record`
writes its observation. Result: a *disarmed* capture holds a non-null slot, and the next
`Snapshot()` reports stale traffic with `UpdatedAtUtc != null` — exactly the "reopened page
shows stale data" failure the on-demand design says it prevents. The hot path makes this
unlikely (record traffic stops when no viewer is attached) but it is not impossible: a
detail page closing while an in-flight FC03 response is being rewritten hits it.
`ConcurrentRecordAndSnapshot_NeverYieldsTornSlot` (line 118) races `Record` vs `Snapshot`
only — never vs `Disarm`. Fix in production: have `Disarm()` set `_armed = false`, then
clear slots, then clear *again* (or re-clear after a memory barrier), or re-check `_armed`
inside `Record` after the `Volatile.Write` and null the slot if disarmed. At minimum add a
test that hammers `Arm`/`Disarm` on one task while `Record` runs on another and asserts a
disarmed capture's `Snapshot()` is all-null. The review should not let this ship "tested"
when the test deliberately avoids the racy interleaving.
**M3 — `PlcSubscriptionTracker` has no direct or concurrent test.**
`src/Mbproxy/Admin/PlcSubscriptionTracker.cs` (whole file); no `PlcSubscriptionTrackerTests.cs`
exists. This is the single lock-guarded type whose correctness drives capture arm/disarm.
`StatusHubTests` exercises it only single-threaded and indirectly. Untested behaviour that
has no coverage anywhere:
- the redundant re-subscribe path (`Add` returns `false` when the same connection
re-subscribes the same PLC — `set.Add` fails);
- `RemoveConnection` for an unknown connection id returning `Array.Empty<string>()`;
- a connection subscribed to *multiple* PLCs being torn down in one `RemoveConnection`;
- concurrent `Add`/`RemoveConnection` from many "connections" — the lock is claimed
thread-safe but nothing proves the count never goes negative or leaks.
Fix: add `PlcSubscriptionTrackerTests` with the single-threaded cases plus one
parallel-stress test (N tasks each Add-then-Remove, assert `ActivePlcs()` is empty and no
exception), mirroring `TxIdAllocatorTests`' concurrency style.
**M4 — `StatusHub` group-leave on disconnect is not verified.**
`src/Mbproxy/Admin/StatusHub.cs:60-66`; `StatusHubTests.cs`.
`OnDisconnectedAsync` is tested for the *capture-disarm* side effect, but the hub never
calls `Groups.RemoveFromGroupAsync` — and `FakeGroupManager` records `Removed` but **no test
ever asserts `groups.Removed`**. Worth confirming this is intentional (real SignalR auto-
removes a disconnected connection from all groups, so an explicit `RemoveFromGroupAsync` is
genuinely unnecessary) — but then `FakeGroupManager.Removed` is dead code that implies a
contract the hub does not honour. Either delete `Removed` from the fake, or if the hub is
*supposed* to leave fleet groups explicitly, that's a production bug. As written, a reviewer
cannot tell which. Fix: drop the unused `Removed` list, or add a comment on the fake
explaining SignalR's implicit group cleanup so the gap is not mistaken for an omission.
### Minor
**m1 — `SignalRFakes` do not model the real SignalR failure surface.**
`tests/Mbproxy.Tests/Admin/SignalRFakes.cs`. The fakes are faithful to the *shapes* but
model only the success path: `FakeGroupManager.AddToGroupAsync` always succeeds and is
synchronous; `FakeStatusPushSink.PushFleetAsync` never throws and never observes the
`CancellationToken`. The production code has explicit `try/catch` around
`_sink.PushFleetAsync` / `PushPlcAsync` (`StatusBroadcaster.cs:88-110`) — a sink that throws
is a real scenario (a SignalR send to a dropped connection) and that catch is **completely
uncovered**. The `BuildDebug` failure catch (`StatusBroadcaster.cs:82-86`) is likewise
uncovered. Fix: add a throwing variant of `FakeStatusPushSink` and assert `PushOnceAsync`
swallows the exception and still attempts the per-PLC pushes; assert the `ct` is honoured
(an `OperationCanceledException` from the sink must *not* be logged-and-swallowed the same
way — the `when (ex is not OperationCanceledException)` filter is untested).
**m2 — `FakeHubCallerContext.ConnectionAborted` is hardcoded to `CancellationToken.None`.**
`SignalRFakes.cs:23`. Real SignalR fires this token on disconnect. No current test needs it,
but if anyone later tests connection-abort handling the fake will silently mask it. Low risk;
note it with a comment so a future test author knows the fake is inert here.
**m3 — Asset content-type test relies on a hand-maintained `[InlineData]` allow-list.**
`AdminEndpointTests.Get_Asset_ReturnsCorrectContentType` covers 5 of the 14 embedded files.
`bootstrap.bundle.min.js`, `detail.css`, `detail.js`, `index.html`, `plc.html`,
`ibm-plex-sans-400/600.woff2` are not asserted. More importantly there is **no test that the
csproj glob actually embedded every wwwroot file** — if someone adds `favicon.ico` to
`wwwroot/` and the glob silently misses it (it won't, `*.*` catches it, but a `.gitignore`'d
or renamed file would), nothing fails. Fix: add a test that enumerates
`typeof(AdminEndpointHost).Assembly.GetManifestResourceNames()`, filters the
`Mbproxy.Admin.wwwroot.` prefix, and asserts every file physically present in `wwwroot/` has
a matching resource (or just assert the count). This is the only thing that would catch a
broken `EmbeddedResource` glob.
**m4 — `Get_Asset` does not assert the bytes are the *right* asset.**
`AdminEndpointTests.cs` (the new theory). It asserts `bytes.Length > 0` and the content
type, but not that `dashboard.js` contains a known marker (it already does this for the HTML
shells via `ShouldContain("/assets/dashboard.js")`). A resource-name collision or a wrong
`ContentTypeFor` mapping for a *correctly-served-but-wrong* file would pass. Cheap to harden:
for the text assets assert a known substring.
**m5 — `csproj` `EmbeddedResource Include="Admin\wwwroot\*.*"` — glob caveats.**
`src/Mbproxy/Mbproxy.csproj` (the new `ItemGroup`). The SDK is `Microsoft.NET.Sdk.Worker`,
which does **not** enable web default-item globs, so there is no double-include of `.css`/
`.js`/`.html` as `Content` — good. Two real caveats:
- `*.*` matches only files containing a dot. Every current asset has an extension so this is
fine, but it is a non-obvious constraint; `**\*` or just `*` would be more honest. The
comment says "intentionally flat", so `*.*` is acceptable, but worth a one-word note that
extensionless files would be skipped.
- Globs in an `<ItemGroup>` are evaluated at project-evaluation time; a newly-added asset is
*not* picked up by an incremental build that didn't re-evaluate. In practice `dotnet
build`/`publish` always re-evaluates, so this is a non-issue — but it is the kind of thing
that bites a watch-mode developer. No fix needed; flagged for completeness.
The resource-name → request-path mapping (`Mbproxy.Admin.wwwroot.<file>`) is correct for the
flat directory and matches `AssetResourcePrefix` in `AdminEndpointHost.cs:301`.
**m6 — `mbproxy.smoke.config.json` — valid but with stale/odd values.**
`tests/sim/mbproxy.smoke.config.json`. The schema is current (`AdminPushIntervalMs` present,
`Keepalive` field names match `KeepaliveOptions`, no removed keys). Issues:
- The header comment says the simulator runs on `127.0.0.1:5020` and both `line-a`/`line-b`
point `Port: 5020` — consistent. Good.
- `BackendHeartbeatIdleMs: 10000` while `BackendRequestTimeoutMs: 2000` — satisfies the
"must be greater than BackendRequestTimeoutMs" cross-field rule. Fine.
- `line-dead` has no `BcdTags`, so it inherits the empty `Global: []` set — fine, the intent
is just an unreachable backend.
- The file has **no `Resilience` and no `Cache` section.** Both are optional (defaults
apply), so the config binds — but a smoke config whose stated purpose is exercising the
dashboard's "problems only" filter would be more representative with the defaults made
explicit, or at least a comment that defaults are intentionally relied on.
- Comment line 1 references "Phase 4/5 web-UI browser smoke tests" and
`plans/2026-05-15-webui-dashboard.md`. `plans/` is untracked (`?? plans/` in git status)
and per recent commit `7466a46` the project has been *retiring* plan docs. A smoke config
pointing at an untracked plan file is a dangling reference — either commit the plan or
drop the citation.
- No `Resilience.ReadCoalescing` means coalescing runs with defaults during the smoke run;
acceptable but undocumented.
**m7 — `ConfigReconcilerTests` change is a minimal compile-fix, not a behaviour test.**
`tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` (the 1-line diff). The new
`TagCaptureRegistry` ctor arg is passed as a throwaway `new TagCaptureRegistry()`. That is
the correct minimal change, but it means **the reconciler's new responsibility — calling
`TagCaptureRegistry.GetOrCreate` on PLC add and `Remove` on PLC removal during hot-reload —
has no test.** `ConfigReconciler.cs` was modified in this commit (`+11` lines) to wire the
registry; nothing asserts that a hot-reload-added PLC gets a capture or that a removed PLC's
capture is dropped. Fix: in `ConfigReconcilerTests`, pass a real registry the test holds a
reference to, trigger an add/remove reload, and assert `registry.TryGet` reflects it.
**m8 — `ReloadValidator`/`MbproxyOptions` `AdminPushIntervalMs` validation: consistent, but
the upper bound is unguarded.** `ReloadValidator.cs` and `MbproxyOptionsValidator` both
reject `<= 0` — consistent with how `GracefulShutdownTimeoutMs` and the keepalive intervals
are validated (lower-bound-only, `> 0`). So the *new* validation matches house style. But
note no interval option in this codebase has an *upper* bound, and `AdminPushIntervalMs` is
the one most likely to be fat-fingered into something absurd (`1000000` = 16-minute feed).
Not a regression and not inconsistent — flagged only because the dashboard's whole value
proposition is "live". The `LoopAsync` `Math.Max(100, ...)` floor protects against tiny
values; nothing protects against huge ones. Optional: a soft upper bound (e.g. reject
`> 60_000`) or just a doc note. The two new `ReloadValidatorTests` (zero, negative) and two
new `MbproxyOptionsBindingTests` (default 1000, binds 250) are correct and adequate for the
bounds that *are* enforced.
**m9 — `StatusHubTests.SecondSubscriber_FirstLeaveKeepsArmed_LastLeaveDisarms` passes
`null` to `OnDisconnectedAsync`.** Line 72/76. Real SignalR passes a non-null `Exception`
on an abnormal disconnect and `null` on a clean one, so `null` is a legal value — but the
test never exercises the exception-carrying path. `StatusHub.OnDisconnectedAsync` ignores
the exception entirely, so this is harmless today; flagged so a future change that *uses*
the exception doesn't find itself untested.
## Coverage gaps (behavior in `e719dd5` with NO test)
1. **`StatusBroadcaster.LoopAsync`** — the entire background loop (M1).
2. **Arm/disarm vs. `Record` race** in `TagValueCapture` (M2).
3. **`PlcSubscriptionTracker`** — no direct test, no concurrency test (M3).
4. **Sink-throws / build-throws error handling** in `PushOnceAsync` — the `try/catch`
blocks and the `when (ex is not OperationCanceledException)` filter (m1).
5. **`ConfigReconciler``TagCaptureRegistry` wiring** — capture created on hot-reload PLC
add, removed on PLC remove (m7).
6. **`AdminEndpointHost` AdminPort hot-reload tears down the broadcaster and disarms
captures** — `StopCurrentAppAsync` disposes `_broadcaster` (which `DisarmAll`s); the
broadcaster's `StopAsync_DisarmsEveryCapture` test proves the disarm in isolation but no
test proves the *hot-reload rebind* path runs it. (`AdminEndpointTests` has an admin-port
rebind test from the prior commit — extend it to assert capture disarm.)
7. **`/hub/status` SignalR endpoint** — not reachable by any test. The 405-methods test
explicitly excludes it. No test connects a SignalR client and verifies a `fleet`/`plc`
message is actually received end-to-end. The `IStatusPushSink` unit tests cover the
broadcaster logic but nothing covers `SignalRStatusPushSink` + `MapHub` wiring. Given the
project already stands up a real Kestrel host in `AdminEndpointTests`, one
`HubConnectionBuilder`-based E2E test (`SubscribeFleet` → await a `fleet` message) would
close the single biggest "does the feature actually work" gap.
8. **Multi-tag / 32-bit capture via `BcdPduPipeline`**`BcdPduPipelineCaptureTests` covers
FC03 16/32-bit and FC06/FC16 16-bit, but not FC16 *32-bit* (CDAB pair write) capture, and
not a PDU spanning a BCD tag *and* a non-BCD register (does capture record only the BCD
one?). The pure `TagValueCaptureTests` cover 32-bit, but the *pipeline hook* for 32-bit
writes is unverified.
9. **`Get_PlcDetailRoute_ReturnsDetailShell`** uses `/plc/anything` — it never checks that a
PLC name with a slash, encoded characters, or empty segment is handled. The route is
`/plc/{name}` and the name is read client-side, so this is low-risk, but
`/plc/` (empty) and `/plc/a%2Fb` are untested.
## What looks good
- **`IStatusPushSink` is a clean seam.** Extracting the outbound side behind an interface so
the broadcaster loop logic is testable without a SignalR host is exactly right, and
`FakeStatusPushSink` using `ConcurrentBag` is the correct call for a type that production
pushes to from a background thread.
- **`TagValueCapture` torn-read test is genuine.** `ConcurrentRecordAndSnapshot_NeverYields
TornSlot` races 4 writers × 200k ops against a reader and checks a real invariant
(`DecodedValue == RawLow + RawHigh`). This is the right way to test the `Volatile.Write`
immutable-record swap, and it would actually catch a regression to a mutable slot.
- **`SignalRFakes` model the right shapes.** `FakeHubCallerContext : HubCallerContext` with
the abstract members overridden, `FakeGroupManager : IGroupManager`, and the hub-property
injection (`Context = ...`, `Groups = ...`) is the standard, correct way to unit-test a
SignalR `Hub` without a host. No mocking framework is dragged in.
- **`BcdPduPipelineCaptureTests` regression guards are well-chosen.** The disarmed-capture
and null-capture cases assert the rewrite *still happens byte-identically* — that is the
load-bearing property (capture must never perturb the proxy's transparency) and it is
explicitly tested.
- **`TagCaptureRegistryTests.GetOrCreate_Rebuild_PreservesArmedFlag`** correctly tests the
hot-reload reseat contract (rebuilt capture is a new instance but keeps `IsArmed`), and
`UnknownPlc_Operations_AreSafeNoOps` covers the no-op-for-ghost-PLC contract that
`StatusHub` relies on.
- **`StatusHtmlRenderer` removal is clean.** No source or test file references it; the only
remaining mentions are in `plans/`, `docs/`, and prior `codereviews/` — all expected.
- **The `AdminPushIntervalMs` validation is consistent with the codebase.** Lower-bound-only
`> 0` checks in both `MbproxyOptionsValidator` and `ReloadValidator`, error-message format
matching the sibling checks, and tests for default + bind + zero + negative. This is the
right pattern; the new tests are adequate for what is enforced.
- **The csproj `EmbeddedResource` glob is correct** for the flat-directory design, the
comment accurately documents the resource-name → request-path mapping, and there is no
accidental double-include because the Worker SDK does not enable web default globs.
- **`mbproxy.smoke.config.json` binds against the current schema** — no stale keys, the
three-PLC line-a/line-b/line-dead topology is a sensible smoke surface, and
`AdminPushIntervalMs` is present and explicit.
## Key file references
- `src/Mbproxy/Admin/StatusBroadcaster.cs:113-135` — untested `LoopAsync` (M1).
- `src/Mbproxy/Proxy/TagValueCapture.cs:103-129` — arm/disarm vs. record race (M2).
- `src/Mbproxy/Admin/PlcSubscriptionTracker.cs` — no test file exists (M3).
- `src/Mbproxy/Admin/StatusHub.cs:60-66` — `OnDisconnectedAsync`; `FakeGroupManager.Removed`
is unasserted dead code (M4).
- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs` — success-path-only fakes (m1, m2).
- `tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` — compile-fix only, reconciler
↔ registry wiring untested (m7).
- `src/Mbproxy/Mbproxy.csproj` — `EmbeddedResource Include="Admin\wwwroot\*.*"` (m5).
- `tests/sim/mbproxy.smoke.config.json` — dangling `plans/` reference (m6).