554b05d28c
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
286 lines
20 KiB
Markdown
286 lines
20 KiB
Markdown
# Code Review — Tests & Config for the SignalR Dashboard (commit `e719dd5`)
|
||
|
||
Scope: the test files, config/options/build changes, and the smoke config introduced or
|
||
modified by `e719dd5` ("replace status page with a live SignalR web dashboard"). Production
|
||
SignalR/capture code (`StatusHub`, `StatusBroadcaster`, `TagValueCapture`,
|
||
`TagCaptureRegistry`, `AdminEndpointHost`) was read for context but is reviewed only insofar
|
||
as it tells us whether the new tests actually exercise the risky paths. The wider UI
|
||
(`dashboard.js`, `detail.js`, HTML/CSS) is explicitly out of scope.
|
||
|
||
## Summary
|
||
|
||
The new tests are competently written, follow the existing `Subject_Condition_Expectation`
|
||
naming, and the `IStatusPushSink` seam is a genuinely good design that makes the broadcaster
|
||
unit-testable without a SignalR host. The `TagValueCapture` torn-read test is a real
|
||
concurrency test, not a pretend one. **But the concurrency-sensitive paths the commit message
|
||
advertises — "armed only while a detail page is open", arm/disarm under churn, the broadcaster
|
||
*loop* — are tested only at the single-threaded happy-path level.** The most serious gaps:
|
||
|
||
1. The **`StatusBroadcaster.LoopAsync` background loop has zero test coverage** — every
|
||
broadcaster test calls the internal `PushOnceAsync` directly. The loop's interval
|
||
re-read, the `Math.Max(100, ...)` floor, the cancellation path, and the
|
||
loop-terminated-unexpectedly catch are all unverified.
|
||
2. The **arm/disarm-vs-record race in `TagValueCapture` is real and untested.** `Disarm()`
|
||
sets `_armed = false` *then* clears slots; `Record()` checks `_armed` *then* writes. A
|
||
`Record` that passes the `_armed` check before `Disarm` flips it can write a slot *after*
|
||
`Disarm` cleared it — leaving a stale observation on a disarmed capture. The torn-read
|
||
test only races `Record` against `Snapshot`, never against `Disarm`/`Arm`.
|
||
3. The **`PlcSubscriptionTracker` — the lock-guarded type whose entire job is concurrent
|
||
subscriber counting — has no direct test at all** and no concurrent test anywhere. Its
|
||
behaviour is only incidentally exercised through single-threaded `StatusHubTests`.
|
||
|
||
None are release-blockers for a read-only admin page on a trusted segment, but #2 is a latent
|
||
correctness bug and #1/#3 mean a regression in the live-feed plumbing would ship silently.
|
||
|
||
## Findings
|
||
|
||
### Critical
|
||
|
||
None. This is a read-only admin surface; nothing here can corrupt the Modbus path.
|
||
|
||
### Major
|
||
|
||
**M1 — `StatusBroadcaster.LoopAsync` has no test.**
|
||
`src/Mbproxy/Admin/StatusBroadcaster.cs:113-135`; `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs`.
|
||
All four broadcaster tests drive `PushOnceAsync` directly. The actual background loop —
|
||
which is what runs in production — is untested. Specifically uncovered:
|
||
- the per-cycle `_options.CurrentValue.AdminPushIntervalMs` re-read (the documented
|
||
hot-reload-without-restart behaviour);
|
||
- the `Math.Max(100, ...)` floor that defends against a bad interval slipping past
|
||
validation (note: validation rejects `<= 0`, so the floor only ever matters for values
|
||
`1..99` — itself an untested corner);
|
||
- that `StartAsync`→`StopAsync` actually terminates the loop and that `StopAsync` is safe
|
||
to call when the loop never started.
|
||
Fix: add a test that constructs the broadcaster with `AdminPushIntervalMs` ~100ms, calls
|
||
`Start()`, polls `Sink.FleetPushes.Count` until `>= 2` with a generous deadline + the test
|
||
`CancellationToken`, then `StopAsync()` and asserts the count stops growing. Keep timing
|
||
hedged (`ShouldBeGreaterThanOrEqualTo`), consistent with the existing coalescing tests.
|
||
|
||
**M2 — Arm/disarm-vs-record race in `TagValueCapture` is real and untested.**
|
||
`src/Mbproxy/Proxy/TagValueCapture.cs:103-129`. `Disarm()` does `_armed = false` then clears
|
||
slots; `Record()` does `if (!_armed) return;` then `Volatile.Write` the slot. Interleaving:
|
||
`Record` reads `_armed == true` → `Disarm` sets `false` and clears all slots → `Record`
|
||
writes its observation. Result: a *disarmed* capture holds a non-null slot, and the next
|
||
`Snapshot()` reports stale traffic with `UpdatedAtUtc != null` — exactly the "reopened page
|
||
shows stale data" failure the on-demand design says it prevents. The hot path makes this
|
||
unlikely (record traffic stops when no viewer is attached) but it is not impossible: a
|
||
detail page closing while an in-flight FC03 response is being rewritten hits it.
|
||
`ConcurrentRecordAndSnapshot_NeverYieldsTornSlot` (line 118) races `Record` vs `Snapshot`
|
||
only — never vs `Disarm`. Fix in production: have `Disarm()` set `_armed = false`, then
|
||
clear slots, then clear *again* (or re-clear after a memory barrier), or re-check `_armed`
|
||
inside `Record` after the `Volatile.Write` and null the slot if disarmed. At minimum add a
|
||
test that hammers `Arm`/`Disarm` on one task while `Record` runs on another and asserts a
|
||
disarmed capture's `Snapshot()` is all-null. The review should not let this ship "tested"
|
||
when the test deliberately avoids the racy interleaving.
|
||
|
||
**M3 — `PlcSubscriptionTracker` has no direct or concurrent test.**
|
||
`src/Mbproxy/Admin/PlcSubscriptionTracker.cs` (whole file); no `PlcSubscriptionTrackerTests.cs`
|
||
exists. This is the single lock-guarded type whose correctness drives capture arm/disarm.
|
||
`StatusHubTests` exercises it only single-threaded and indirectly. Untested behaviour that
|
||
has no coverage anywhere:
|
||
- the redundant re-subscribe path (`Add` returns `false` when the same connection
|
||
re-subscribes the same PLC — `set.Add` fails);
|
||
- `RemoveConnection` for an unknown connection id returning `Array.Empty<string>()`;
|
||
- a connection subscribed to *multiple* PLCs being torn down in one `RemoveConnection`;
|
||
- concurrent `Add`/`RemoveConnection` from many "connections" — the lock is claimed
|
||
thread-safe but nothing proves the count never goes negative or leaks.
|
||
Fix: add `PlcSubscriptionTrackerTests` with the single-threaded cases plus one
|
||
parallel-stress test (N tasks each Add-then-Remove, assert `ActivePlcs()` is empty and no
|
||
exception), mirroring `TxIdAllocatorTests`' concurrency style.
|
||
|
||
**M4 — `StatusHub` group-leave on disconnect is not verified.**
|
||
`src/Mbproxy/Admin/StatusHub.cs:60-66`; `StatusHubTests.cs`.
|
||
`OnDisconnectedAsync` is tested for the *capture-disarm* side effect, but the hub never
|
||
calls `Groups.RemoveFromGroupAsync` — and `FakeGroupManager` records `Removed` but **no test
|
||
ever asserts `groups.Removed`**. Worth confirming this is intentional (real SignalR auto-
|
||
removes a disconnected connection from all groups, so an explicit `RemoveFromGroupAsync` is
|
||
genuinely unnecessary) — but then `FakeGroupManager.Removed` is dead code that implies a
|
||
contract the hub does not honour. Either delete `Removed` from the fake, or if the hub is
|
||
*supposed* to leave fleet groups explicitly, that's a production bug. As written, a reviewer
|
||
cannot tell which. Fix: drop the unused `Removed` list, or add a comment on the fake
|
||
explaining SignalR's implicit group cleanup so the gap is not mistaken for an omission.
|
||
|
||
### Minor
|
||
|
||
**m1 — `SignalRFakes` do not model the real SignalR failure surface.**
|
||
`tests/Mbproxy.Tests/Admin/SignalRFakes.cs`. The fakes are faithful to the *shapes* but
|
||
model only the success path: `FakeGroupManager.AddToGroupAsync` always succeeds and is
|
||
synchronous; `FakeStatusPushSink.PushFleetAsync` never throws and never observes the
|
||
`CancellationToken`. The production code has explicit `try/catch` around
|
||
`_sink.PushFleetAsync` / `PushPlcAsync` (`StatusBroadcaster.cs:88-110`) — a sink that throws
|
||
is a real scenario (a SignalR send to a dropped connection) and that catch is **completely
|
||
uncovered**. The `BuildDebug` failure catch (`StatusBroadcaster.cs:82-86`) is likewise
|
||
uncovered. Fix: add a throwing variant of `FakeStatusPushSink` and assert `PushOnceAsync`
|
||
swallows the exception and still attempts the per-PLC pushes; assert the `ct` is honoured
|
||
(an `OperationCanceledException` from the sink must *not* be logged-and-swallowed the same
|
||
way — the `when (ex is not OperationCanceledException)` filter is untested).
|
||
|
||
**m2 — `FakeHubCallerContext.ConnectionAborted` is hardcoded to `CancellationToken.None`.**
|
||
`SignalRFakes.cs:23`. Real SignalR fires this token on disconnect. No current test needs it,
|
||
but if anyone later tests connection-abort handling the fake will silently mask it. Low risk;
|
||
note it with a comment so a future test author knows the fake is inert here.
|
||
|
||
**m3 — Asset content-type test relies on a hand-maintained `[InlineData]` allow-list.**
|
||
`AdminEndpointTests.Get_Asset_ReturnsCorrectContentType` covers 5 of the 14 embedded files.
|
||
`bootstrap.bundle.min.js`, `detail.css`, `detail.js`, `index.html`, `plc.html`,
|
||
`ibm-plex-sans-400/600.woff2` are not asserted. More importantly there is **no test that the
|
||
csproj glob actually embedded every wwwroot file** — if someone adds `favicon.ico` to
|
||
`wwwroot/` and the glob silently misses it (it won't, `*.*` catches it, but a `.gitignore`'d
|
||
or renamed file would), nothing fails. Fix: add a test that enumerates
|
||
`typeof(AdminEndpointHost).Assembly.GetManifestResourceNames()`, filters the
|
||
`Mbproxy.Admin.wwwroot.` prefix, and asserts every file physically present in `wwwroot/` has
|
||
a matching resource (or just assert the count). This is the only thing that would catch a
|
||
broken `EmbeddedResource` glob.
|
||
|
||
**m4 — `Get_Asset` does not assert the bytes are the *right* asset.**
|
||
`AdminEndpointTests.cs` (the new theory). It asserts `bytes.Length > 0` and the content
|
||
type, but not that `dashboard.js` contains a known marker (it already does this for the HTML
|
||
shells via `ShouldContain("/assets/dashboard.js")`). A resource-name collision or a wrong
|
||
`ContentTypeFor` mapping for a *correctly-served-but-wrong* file would pass. Cheap to harden:
|
||
for the text assets assert a known substring.
|
||
|
||
**m5 — `csproj` `EmbeddedResource Include="Admin\wwwroot\*.*"` — glob caveats.**
|
||
`src/Mbproxy/Mbproxy.csproj` (the new `ItemGroup`). The SDK is `Microsoft.NET.Sdk.Worker`,
|
||
which does **not** enable web default-item globs, so there is no double-include of `.css`/
|
||
`.js`/`.html` as `Content` — good. Two real caveats:
|
||
- `*.*` matches only files containing a dot. Every current asset has an extension so this is
|
||
fine, but it is a non-obvious constraint; `**\*` or just `*` would be more honest. The
|
||
comment says "intentionally flat", so `*.*` is acceptable, but worth a one-word note that
|
||
extensionless files would be skipped.
|
||
- Globs in an `<ItemGroup>` are evaluated at project-evaluation time; a newly-added asset is
|
||
*not* picked up by an incremental build that didn't re-evaluate. In practice `dotnet
|
||
build`/`publish` always re-evaluates, so this is a non-issue — but it is the kind of thing
|
||
that bites a watch-mode developer. No fix needed; flagged for completeness.
|
||
The resource-name → request-path mapping (`Mbproxy.Admin.wwwroot.<file>`) is correct for the
|
||
flat directory and matches `AssetResourcePrefix` in `AdminEndpointHost.cs:301`.
|
||
|
||
**m6 — `mbproxy.smoke.config.json` — valid but with stale/odd values.**
|
||
`tests/sim/mbproxy.smoke.config.json`. The schema is current (`AdminPushIntervalMs` present,
|
||
`Keepalive` field names match `KeepaliveOptions`, no removed keys). Issues:
|
||
- The header comment says the simulator runs on `127.0.0.1:5020` and both `line-a`/`line-b`
|
||
point `Port: 5020` — consistent. Good.
|
||
- `BackendHeartbeatIdleMs: 10000` while `BackendRequestTimeoutMs: 2000` — satisfies the
|
||
"must be greater than BackendRequestTimeoutMs" cross-field rule. Fine.
|
||
- `line-dead` has no `BcdTags`, so it inherits the empty `Global: []` set — fine, the intent
|
||
is just an unreachable backend.
|
||
- The file has **no `Resilience` and no `Cache` section.** Both are optional (defaults
|
||
apply), so the config binds — but a smoke config whose stated purpose is exercising the
|
||
dashboard's "problems only" filter would be more representative with the defaults made
|
||
explicit, or at least a comment that defaults are intentionally relied on.
|
||
- Comment line 1 references "Phase 4/5 web-UI browser smoke tests" and
|
||
`plans/2026-05-15-webui-dashboard.md`. `plans/` is untracked (`?? plans/` in git status)
|
||
and per recent commit `7466a46` the project has been *retiring* plan docs. A smoke config
|
||
pointing at an untracked plan file is a dangling reference — either commit the plan or
|
||
drop the citation.
|
||
- No `Resilience.ReadCoalescing` means coalescing runs with defaults during the smoke run;
|
||
acceptable but undocumented.
|
||
|
||
**m7 — `ConfigReconcilerTests` change is a minimal compile-fix, not a behaviour test.**
|
||
`tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` (the 1-line diff). The new
|
||
`TagCaptureRegistry` ctor arg is passed as a throwaway `new TagCaptureRegistry()`. That is
|
||
the correct minimal change, but it means **the reconciler's new responsibility — calling
|
||
`TagCaptureRegistry.GetOrCreate` on PLC add and `Remove` on PLC removal during hot-reload —
|
||
has no test.** `ConfigReconciler.cs` was modified in this commit (`+11` lines) to wire the
|
||
registry; nothing asserts that a hot-reload-added PLC gets a capture or that a removed PLC's
|
||
capture is dropped. Fix: in `ConfigReconcilerTests`, pass a real registry the test holds a
|
||
reference to, trigger an add/remove reload, and assert `registry.TryGet` reflects it.
|
||
|
||
**m8 — `ReloadValidator`/`MbproxyOptions` `AdminPushIntervalMs` validation: consistent, but
|
||
the upper bound is unguarded.** `ReloadValidator.cs` and `MbproxyOptionsValidator` both
|
||
reject `<= 0` — consistent with how `GracefulShutdownTimeoutMs` and the keepalive intervals
|
||
are validated (lower-bound-only, `> 0`). So the *new* validation matches house style. But
|
||
note no interval option in this codebase has an *upper* bound, and `AdminPushIntervalMs` is
|
||
the one most likely to be fat-fingered into something absurd (`1000000` = 16-minute feed).
|
||
Not a regression and not inconsistent — flagged only because the dashboard's whole value
|
||
proposition is "live". The `LoopAsync` `Math.Max(100, ...)` floor protects against tiny
|
||
values; nothing protects against huge ones. Optional: a soft upper bound (e.g. reject
|
||
`> 60_000`) or just a doc note. The two new `ReloadValidatorTests` (zero, negative) and two
|
||
new `MbproxyOptionsBindingTests` (default 1000, binds 250) are correct and adequate for the
|
||
bounds that *are* enforced.
|
||
|
||
**m9 — `StatusHubTests.SecondSubscriber_FirstLeaveKeepsArmed_LastLeaveDisarms` passes
|
||
`null` to `OnDisconnectedAsync`.** Line 72/76. Real SignalR passes a non-null `Exception`
|
||
on an abnormal disconnect and `null` on a clean one, so `null` is a legal value — but the
|
||
test never exercises the exception-carrying path. `StatusHub.OnDisconnectedAsync` ignores
|
||
the exception entirely, so this is harmless today; flagged so a future change that *uses*
|
||
the exception doesn't find itself untested.
|
||
|
||
## Coverage gaps (behavior in `e719dd5` with NO test)
|
||
|
||
1. **`StatusBroadcaster.LoopAsync`** — the entire background loop (M1).
|
||
2. **Arm/disarm vs. `Record` race** in `TagValueCapture` (M2).
|
||
3. **`PlcSubscriptionTracker`** — no direct test, no concurrency test (M3).
|
||
4. **Sink-throws / build-throws error handling** in `PushOnceAsync` — the `try/catch`
|
||
blocks and the `when (ex is not OperationCanceledException)` filter (m1).
|
||
5. **`ConfigReconciler` ↔ `TagCaptureRegistry` wiring** — capture created on hot-reload PLC
|
||
add, removed on PLC remove (m7).
|
||
6. **`AdminEndpointHost` AdminPort hot-reload tears down the broadcaster and disarms
|
||
captures** — `StopCurrentAppAsync` disposes `_broadcaster` (which `DisarmAll`s); the
|
||
broadcaster's `StopAsync_DisarmsEveryCapture` test proves the disarm in isolation but no
|
||
test proves the *hot-reload rebind* path runs it. (`AdminEndpointTests` has an admin-port
|
||
rebind test from the prior commit — extend it to assert capture disarm.)
|
||
7. **`/hub/status` SignalR endpoint** — not reachable by any test. The 405-methods test
|
||
explicitly excludes it. No test connects a SignalR client and verifies a `fleet`/`plc`
|
||
message is actually received end-to-end. The `IStatusPushSink` unit tests cover the
|
||
broadcaster logic but nothing covers `SignalRStatusPushSink` + `MapHub` wiring. Given the
|
||
project already stands up a real Kestrel host in `AdminEndpointTests`, one
|
||
`HubConnectionBuilder`-based E2E test (`SubscribeFleet` → await a `fleet` message) would
|
||
close the single biggest "does the feature actually work" gap.
|
||
8. **Multi-tag / 32-bit capture via `BcdPduPipeline`** — `BcdPduPipelineCaptureTests` covers
|
||
FC03 16/32-bit and FC06/FC16 16-bit, but not FC16 *32-bit* (CDAB pair write) capture, and
|
||
not a PDU spanning a BCD tag *and* a non-BCD register (does capture record only the BCD
|
||
one?). The pure `TagValueCaptureTests` cover 32-bit, but the *pipeline hook* for 32-bit
|
||
writes is unverified.
|
||
9. **`Get_PlcDetailRoute_ReturnsDetailShell`** uses `/plc/anything` — it never checks that a
|
||
PLC name with a slash, encoded characters, or empty segment is handled. The route is
|
||
`/plc/{name}` and the name is read client-side, so this is low-risk, but
|
||
`/plc/` (empty) and `/plc/a%2Fb` are untested.
|
||
|
||
## What looks good
|
||
|
||
- **`IStatusPushSink` is a clean seam.** Extracting the outbound side behind an interface so
|
||
the broadcaster loop logic is testable without a SignalR host is exactly right, and
|
||
`FakeStatusPushSink` using `ConcurrentBag` is the correct call for a type that production
|
||
pushes to from a background thread.
|
||
- **`TagValueCapture` torn-read test is genuine.** `ConcurrentRecordAndSnapshot_NeverYields
|
||
TornSlot` races 4 writers × 200k ops against a reader and checks a real invariant
|
||
(`DecodedValue == RawLow + RawHigh`). This is the right way to test the `Volatile.Write`
|
||
immutable-record swap, and it would actually catch a regression to a mutable slot.
|
||
- **`SignalRFakes` model the right shapes.** `FakeHubCallerContext : HubCallerContext` with
|
||
the abstract members overridden, `FakeGroupManager : IGroupManager`, and the hub-property
|
||
injection (`Context = ...`, `Groups = ...`) is the standard, correct way to unit-test a
|
||
SignalR `Hub` without a host. No mocking framework is dragged in.
|
||
- **`BcdPduPipelineCaptureTests` regression guards are well-chosen.** The disarmed-capture
|
||
and null-capture cases assert the rewrite *still happens byte-identically* — that is the
|
||
load-bearing property (capture must never perturb the proxy's transparency) and it is
|
||
explicitly tested.
|
||
- **`TagCaptureRegistryTests.GetOrCreate_Rebuild_PreservesArmedFlag`** correctly tests the
|
||
hot-reload reseat contract (rebuilt capture is a new instance but keeps `IsArmed`), and
|
||
`UnknownPlc_Operations_AreSafeNoOps` covers the no-op-for-ghost-PLC contract that
|
||
`StatusHub` relies on.
|
||
- **`StatusHtmlRenderer` removal is clean.** No source or test file references it; the only
|
||
remaining mentions are in `plans/`, `docs/`, and prior `codereviews/` — all expected.
|
||
- **The `AdminPushIntervalMs` validation is consistent with the codebase.** Lower-bound-only
|
||
`> 0` checks in both `MbproxyOptionsValidator` and `ReloadValidator`, error-message format
|
||
matching the sibling checks, and tests for default + bind + zero + negative. This is the
|
||
right pattern; the new tests are adequate for what is enforced.
|
||
- **The csproj `EmbeddedResource` glob is correct** for the flat-directory design, the
|
||
comment accurately documents the resource-name → request-path mapping, and there is no
|
||
accidental double-include because the Worker SDK does not enable web default globs.
|
||
- **`mbproxy.smoke.config.json` binds against the current schema** — no stale keys, the
|
||
three-PLC line-a/line-b/line-dead topology is a sensible smoke surface, and
|
||
`AdminPushIntervalMs` is present and explicit.
|
||
|
||
## Key file references
|
||
|
||
- `src/Mbproxy/Admin/StatusBroadcaster.cs:113-135` — untested `LoopAsync` (M1).
|
||
- `src/Mbproxy/Proxy/TagValueCapture.cs:103-129` — arm/disarm vs. record race (M2).
|
||
- `src/Mbproxy/Admin/PlcSubscriptionTracker.cs` — no test file exists (M3).
|
||
- `src/Mbproxy/Admin/StatusHub.cs:60-66` — `OnDisconnectedAsync`; `FakeGroupManager.Removed`
|
||
is unasserted dead code (M4).
|
||
- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs` — success-path-only fakes (m1, m2).
|
||
- `tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` — compile-fix only, reconciler
|
||
↔ registry wiring untested (m7).
|
||
- `src/Mbproxy/Mbproxy.csproj` — `EmbeddedResource Include="Admin\wwwroot\*.*"` (m5).
|
||
- `tests/sim/mbproxy.smoke.config.json` — dangling `plans/` reference (m6).
|