wwtools/mbproxy/codereviews/2026-05-15/TestsAndConfig.md

# Code Review — Tests & Config for the SignalR Dashboard (commit `e719dd5`)

Scope: the test files, config/options/build changes, and the smoke config introduced or
modified by `e719dd5` ("replace status page with a live SignalR web dashboard"). Production
SignalR/capture code (`StatusHub`, `StatusBroadcaster`, `TagValueCapture`,
`TagCaptureRegistry`, `AdminEndpointHost`) was read for context but is reviewed only insofar
as it tells us whether the new tests actually exercise the risky paths. The wider UI
(`dashboard.js`, `detail.js`, HTML/CSS) is explicitly out of scope.

## Summary

The new tests are competently written, follow the existing `Subject_Condition_Expectation`
naming, and the `IStatusPushSink` seam is a genuinely good design that makes the broadcaster
unit-testable without a SignalR host. The `TagValueCapture` torn-read test is a real
concurrency test, not a pretend one. **But the concurrency-sensitive paths the commit message
advertises — "armed only while a detail page is open", arm/disarm under churn, the broadcaster
*loop* — are tested only at the single-threaded happy-path level.** The most serious gaps:

1. The **`StatusBroadcaster.LoopAsync` background loop has zero test coverage** — every
   broadcaster test calls the internal `PushOnceAsync` directly. The loop's interval
   re-read, the `Math.Max(100, ...)` floor, the cancellation path, and the
   loop-terminated-unexpectedly catch are all unverified.
2. The **arm/disarm-vs-record race in `TagValueCapture` is real and untested.** `Disarm()`
   sets `_armed = false` *then* clears slots; `Record()` checks `_armed` *then* writes. A
   `Record` that passes the `_armed` check before `Disarm` flips it can write a slot *after*
   `Disarm` cleared it — leaving a stale observation on a disarmed capture. The torn-read
   test only races `Record` against `Snapshot`, never against `Disarm`/`Arm`.
3. The **`PlcSubscriptionTracker` — the lock-guarded type whose entire job is concurrent
   subscriber counting — has no direct test at all** and no concurrent test anywhere. Its
   behaviour is only incidentally exercised through single-threaded `StatusHubTests`.

None are release-blockers for a read-only admin page on a trusted segment, but #2 is a latent
correctness bug and #1/#3 mean a regression in the live-feed plumbing would ship silently.

## Findings

### Critical

None. This is a read-only admin surface; nothing here can corrupt the Modbus path.

### Major

**M1 — `StatusBroadcaster.LoopAsync` has no test.**
`src/Mbproxy/Admin/StatusBroadcaster.cs:113-135`; `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs`.
All four broadcaster tests drive `PushOnceAsync` directly. The actual background loop —
which is what runs in production — is untested. Specifically uncovered:
- the per-cycle `_options.CurrentValue.AdminPushIntervalMs` re-read (the documented
  hot-reload-without-restart behaviour);
- the `Math.Max(100, ...)` floor that defends against a bad interval slipping past
  validation (note: validation rejects `<= 0`, so the floor only ever matters for values
  `1..99` — itself an untested corner);
- that `StartAsync`→`StopAsync` actually terminates the loop and that `StopAsync` is safe
  to call when the loop never started.
Fix: add a test that constructs the broadcaster with `AdminPushIntervalMs` ~100ms, calls
`Start()`, polls `Sink.FleetPushes.Count` until `>= 2` with a generous deadline + the test
`CancellationToken`, then `StopAsync()` and asserts the count stops growing. Keep timing
hedged (`ShouldBeGreaterThanOrEqualTo`), consistent with the existing coalescing tests.

**M2 — Arm/disarm-vs-record race in `TagValueCapture` is real and untested.**
`src/Mbproxy/Proxy/TagValueCapture.cs:103-129`. `Disarm()` does `_armed = false` then clears
slots; `Record()` does `if (!_armed) return;` then `Volatile.Write` the slot. Interleaving:
`Record` reads `_armed == true` → `Disarm` sets `false` and clears all slots → `Record`
writes its observation. Result: a *disarmed* capture holds a non-null slot, and the next
`Snapshot()` reports stale traffic with `UpdatedAtUtc != null` — exactly the "reopened page
shows stale data" failure the on-demand design says it prevents. The hot path makes this
unlikely (record traffic stops when no viewer is attached) but it is not impossible: a
detail page closing while an in-flight FC03 response is being rewritten hits it.
`ConcurrentRecordAndSnapshot_NeverYieldsTornSlot` (line 118) races `Record` vs `Snapshot`
only — never vs `Disarm`. Fix in production: have `Disarm()` set `_armed = false`, then
clear slots, then clear *again* (or re-clear after a memory barrier), or re-check `_armed`
inside `Record` after the `Volatile.Write` and null the slot if disarmed. At minimum add a
test that hammers `Arm`/`Disarm` on one task while `Record` runs on another and asserts a
disarmed capture's `Snapshot()` is all-null. The review should not let this ship "tested"
when the test deliberately avoids the racy interleaving.

**M3 — `PlcSubscriptionTracker` has no direct or concurrent test.**
`src/Mbproxy/Admin/PlcSubscriptionTracker.cs` (whole file); no `PlcSubscriptionTrackerTests.cs`
exists. This is the single lock-guarded type whose correctness drives capture arm/disarm.
`StatusHubTests` exercises it only single-threaded and indirectly. Untested behaviour that
has no coverage anywhere:
- the redundant re-subscribe path (`Add` returns `false` when the same connection
  re-subscribes the same PLC — `set.Add` fails);
- `RemoveConnection` for an unknown connection id returning `Array.Empty<string>()`;
- a connection subscribed to *multiple* PLCs being torn down in one `RemoveConnection`;
- concurrent `Add`/`RemoveConnection` from many "connections" — the lock is claimed
  thread-safe but nothing proves the count never goes negative or leaks.
Fix: add `PlcSubscriptionTrackerTests` with the single-threaded cases plus one
parallel-stress test (N tasks each Add-then-Remove, assert `ActivePlcs()` is empty and no
exception), mirroring `TxIdAllocatorTests`' concurrency style.

**M4 — `StatusHub` group-leave on disconnect is not verified.**
`src/Mbproxy/Admin/StatusHub.cs:60-66`; `StatusHubTests.cs`.
`OnDisconnectedAsync` is tested for the *capture-disarm* side effect, but the hub never
calls `Groups.RemoveFromGroupAsync` — and `FakeGroupManager` records `Removed` but **no test
ever asserts `groups.Removed`**. Worth confirming this is intentional (real SignalR auto-
removes a disconnected connection from all groups, so an explicit `RemoveFromGroupAsync` is
genuinely unnecessary) — but then `FakeGroupManager.Removed` is dead code that implies a
contract the hub does not honour. Either delete `Removed` from the fake, or if the hub is
*supposed* to leave fleet groups explicitly, that's a production bug. As written, a reviewer
cannot tell which. Fix: drop the unused `Removed` list, or add a comment on the fake
explaining SignalR's implicit group cleanup so the gap is not mistaken for an omission.

### Minor

**m1 — `SignalRFakes` do not model the real SignalR failure surface.**
`tests/Mbproxy.Tests/Admin/SignalRFakes.cs`. The fakes are faithful to the *shapes* but
model only the success path: `FakeGroupManager.AddToGroupAsync` always succeeds and is
synchronous; `FakeStatusPushSink.PushFleetAsync` never throws and never observes the
`CancellationToken`. The production code has explicit `try/catch` around
`_sink.PushFleetAsync` / `PushPlcAsync` (`StatusBroadcaster.cs:88-110`) — a sink that throws
is a real scenario (a SignalR send to a dropped connection) and that catch is **completely
uncovered**. The `BuildDebug` failure catch (`StatusBroadcaster.cs:82-86`) is likewise
uncovered. Fix: add a throwing variant of `FakeStatusPushSink` and assert `PushOnceAsync`
swallows the exception and still attempts the per-PLC pushes; assert the `ct` is honoured
(an `OperationCanceledException` from the sink must *not* be logged-and-swallowed the same
way — the `when (ex is not OperationCanceledException)` filter is untested).

**m2 — `FakeHubCallerContext.ConnectionAborted` is hardcoded to `CancellationToken.None`.**
`SignalRFakes.cs:23`. Real SignalR fires this token on disconnect. No current test needs it,
but if anyone later tests connection-abort handling the fake will silently mask it. Low risk;
note it with a comment so a future test author knows the fake is inert here.

**m3 — Asset content-type test relies on a hand-maintained `[InlineData]` allow-list.**
`AdminEndpointTests.Get_Asset_ReturnsCorrectContentType` covers 5 of the 14 embedded files.
`bootstrap.bundle.min.js`, `detail.css`, `detail.js`, `index.html`, `plc.html`,
`ibm-plex-sans-400/600.woff2` are not asserted. More importantly there is **no test that the
csproj glob actually embedded every wwwroot file** — if someone adds `favicon.ico` to
`wwwroot/` and the glob silently misses it (it won't, `*.*` catches it, but a `.gitignore`'d
or renamed file would), nothing fails. Fix: add a test that enumerates
`typeof(AdminEndpointHost).Assembly.GetManifestResourceNames()`, filters the
`Mbproxy.Admin.wwwroot.` prefix, and asserts every file physically present in `wwwroot/` has
a matching resource (or just assert the count). This is the only thing that would catch a
broken `EmbeddedResource` glob.

**m4 — `Get_Asset` does not assert the bytes are the *right* asset.**
`AdminEndpointTests.cs` (the new theory). It asserts `bytes.Length > 0` and the content
type, but not that `dashboard.js` contains a known marker (it already does this for the HTML
shells via `ShouldContain("/assets/dashboard.js")`). A resource-name collision or a wrong
`ContentTypeFor` mapping for a *correctly-served-but-wrong* file would pass. Cheap to harden:
for the text assets assert a known substring.

**m5 — `csproj` `EmbeddedResource Include="Admin\wwwroot\*.*"` — glob caveats.**
`src/Mbproxy/Mbproxy.csproj` (the new `ItemGroup`). The SDK is `Microsoft.NET.Sdk.Worker`,
which does **not** enable web default-item globs, so there is no double-include of `.css`/
`.js`/`.html` as `Content` — good. Two real caveats:
- `*.*` matches only files containing a dot. Every current asset has an extension so this is
  fine, but it is a non-obvious constraint; `**\*` or just `*` would be more honest. The
  comment says "intentionally flat", so `*.*` is acceptable, but worth a one-word note that
  extensionless files would be skipped.
- Globs in an `<ItemGroup>` are evaluated at project-evaluation time; a newly-added asset is
  *not* picked up by an incremental build that didn't re-evaluate. In practice `dotnet
  build`/`publish` always re-evaluates, so this is a non-issue — but it is the kind of thing
  that bites a watch-mode developer. No fix needed; flagged for completeness.
The resource-name → request-path mapping (`Mbproxy.Admin.wwwroot.<file>`) is correct for the
flat directory and matches `AssetResourcePrefix` in `AdminEndpointHost.cs:301`.

**m6 — `mbproxy.smoke.config.json` — valid but with stale/odd values.**
`tests/sim/mbproxy.smoke.config.json`. The schema is current (`AdminPushIntervalMs` present,
`Keepalive` field names match `KeepaliveOptions`, no removed keys). Issues:
- The header comment says the simulator runs on `127.0.0.1:5020` and both `line-a`/`line-b`
  point `Port: 5020` — consistent. Good.
- `BackendHeartbeatIdleMs: 10000` while `BackendRequestTimeoutMs: 2000` — satisfies the
  "must be greater than BackendRequestTimeoutMs" cross-field rule. Fine.
- `line-dead` has no `BcdTags`, so it inherits the empty `Global: []` set — fine, the intent
  is just an unreachable backend.
- The file has **no `Resilience` and no `Cache` section.** Both are optional (defaults
  apply), so the config binds — but a smoke config whose stated purpose is exercising the
  dashboard's "problems only" filter would be more representative with the defaults made
  explicit, or at least a comment that defaults are intentionally relied on.
- Comment line 1 references "Phase 4/5 web-UI browser smoke tests" and
  `plans/2026-05-15-webui-dashboard.md`. `plans/` is untracked (`?? plans/` in git status)
  and per recent commit `7466a46` the project has been *retiring* plan docs. A smoke config
  pointing at an untracked plan file is a dangling reference — either commit the plan or
  drop the citation.
- No `Resilience.ReadCoalescing` means coalescing runs with defaults during the smoke run;
  acceptable but undocumented.

**m7 — `ConfigReconcilerTests` change is a minimal compile-fix, not a behaviour test.**
`tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` (the 1-line diff). The new
`TagCaptureRegistry` ctor arg is passed as a throwaway `new TagCaptureRegistry()`. That is
the correct minimal change, but it means **the reconciler's new responsibility — calling
`TagCaptureRegistry.GetOrCreate` on PLC add and `Remove` on PLC removal during hot-reload —
has no test.** `ConfigReconciler.cs` was modified in this commit (`+11` lines) to wire the
registry; nothing asserts that a hot-reload-added PLC gets a capture or that a removed PLC's
capture is dropped. Fix: in `ConfigReconcilerTests`, pass a real registry the test holds a
reference to, trigger an add/remove reload, and assert `registry.TryGet` reflects it.

**m8 — `ReloadValidator`/`MbproxyOptions` `AdminPushIntervalMs` validation: consistent, but
the upper bound is unguarded.** `ReloadValidator.cs` and `MbproxyOptionsValidator` both
reject `<= 0` — consistent with how `GracefulShutdownTimeoutMs` and the keepalive intervals
are validated (lower-bound-only, `> 0`). So the *new* validation matches house style. But
note no interval option in this codebase has an *upper* bound, and `AdminPushIntervalMs` is
the one most likely to be fat-fingered into something absurd (`1000000` = 16-minute feed).
Not a regression and not inconsistent — flagged only because the dashboard's whole value
proposition is "live". The `LoopAsync` `Math.Max(100, ...)` floor protects against tiny
values; nothing protects against huge ones. Optional: a soft upper bound (e.g. reject
`> 60_000`) or just a doc note. The two new `ReloadValidatorTests` (zero, negative) and two
new `MbproxyOptionsBindingTests` (default 1000, binds 250) are correct and adequate for the
bounds that *are* enforced.

**m9 — `StatusHubTests.SecondSubscriber_FirstLeaveKeepsArmed_LastLeaveDisarms` passes
`null` to `OnDisconnectedAsync`.** Line 72/76. Real SignalR passes a non-null `Exception`
on an abnormal disconnect and `null` on a clean one, so `null` is a legal value — but the
test never exercises the exception-carrying path. `StatusHub.OnDisconnectedAsync` ignores
the exception entirely, so this is harmless today; flagged so a future change that *uses*
the exception doesn't find itself untested.

## Coverage gaps (behavior in `e719dd5` with NO test)

1. **`StatusBroadcaster.LoopAsync`** — the entire background loop (M1).
2. **Arm/disarm vs. `Record` race** in `TagValueCapture` (M2).
3. **`PlcSubscriptionTracker`** — no direct test, no concurrency test (M3).
4. **Sink-throws / build-throws error handling** in `PushOnceAsync` — the `try/catch`
   blocks and the `when (ex is not OperationCanceledException)` filter (m1).
5. **`ConfigReconciler` ↔ `TagCaptureRegistry` wiring** — capture created on hot-reload PLC
   add, removed on PLC remove (m7).
6. **`AdminEndpointHost` AdminPort hot-reload tears down the broadcaster and disarms
   captures** — `StopCurrentAppAsync` disposes `_broadcaster` (which `DisarmAll`s); the
   broadcaster's `StopAsync_DisarmsEveryCapture` test proves the disarm in isolation but no
   test proves the *hot-reload rebind* path runs it. (`AdminEndpointTests` has an admin-port
   rebind test from the prior commit — extend it to assert capture disarm.)
7. **`/hub/status` SignalR endpoint** — not reachable by any test. The 405-methods test
   explicitly excludes it. No test connects a SignalR client and verifies a `fleet`/`plc`
   message is actually received end-to-end. The `IStatusPushSink` unit tests cover the
   broadcaster logic but nothing covers `SignalRStatusPushSink` + `MapHub` wiring. Given the
   project already stands up a real Kestrel host in `AdminEndpointTests`, one
   `HubConnectionBuilder`-based E2E test (`SubscribeFleet` → await a `fleet` message) would
   close the single biggest "does the feature actually work" gap.
8. **Multi-tag / 32-bit capture via `BcdPduPipeline`** — `BcdPduPipelineCaptureTests` covers
   FC03 16/32-bit and FC06/FC16 16-bit, but not FC16 *32-bit* (CDAB pair write) capture, and
   not a PDU spanning a BCD tag *and* a non-BCD register (does capture record only the BCD
   one?). The pure `TagValueCaptureTests` cover 32-bit, but the *pipeline hook* for 32-bit
   writes is unverified.
9. **`Get_PlcDetailRoute_ReturnsDetailShell`** uses `/plc/anything` — it never checks that a
   PLC name with a slash, encoded characters, or empty segment is handled. The route is
   `/plc/{name}` and the name is read client-side, so this is low-risk, but
   `/plc/` (empty) and `/plc/a%2Fb` are untested.

## What looks good

- **`IStatusPushSink` is a clean seam.** Extracting the outbound side behind an interface so
  the broadcaster loop logic is testable without a SignalR host is exactly right, and
  `FakeStatusPushSink` using `ConcurrentBag` is the correct call for a type that production
  pushes to from a background thread.
- **`TagValueCapture` torn-read test is genuine.** `ConcurrentRecordAndSnapshot_NeverYields
  TornSlot` races 4 writers × 200k ops against a reader and checks a real invariant
  (`DecodedValue == RawLow + RawHigh`). This is the right way to test the `Volatile.Write`
  immutable-record swap, and it would actually catch a regression to a mutable slot.
- **`SignalRFakes` model the right shapes.** `FakeHubCallerContext : HubCallerContext` with
  the abstract members overridden, `FakeGroupManager : IGroupManager`, and the hub-property
  injection (`Context = ...`, `Groups = ...`) is the standard, correct way to unit-test a
  SignalR `Hub` without a host. No mocking framework is dragged in.
- **`BcdPduPipelineCaptureTests` regression guards are well-chosen.** The disarmed-capture
  and null-capture cases assert the rewrite *still happens byte-identically* — that is the
  load-bearing property (capture must never perturb the proxy's transparency) and it is
  explicitly tested.
- **`TagCaptureRegistryTests.GetOrCreate_Rebuild_PreservesArmedFlag`** correctly tests the
  hot-reload reseat contract (rebuilt capture is a new instance but keeps `IsArmed`), and
  `UnknownPlc_Operations_AreSafeNoOps` covers the no-op-for-ghost-PLC contract that
  `StatusHub` relies on.
- **`StatusHtmlRenderer` removal is clean.** No source or test file references it; the only
  remaining mentions are in `plans/`, `docs/`, and prior `codereviews/` — all expected.
- **The `AdminPushIntervalMs` validation is consistent with the codebase.** Lower-bound-only
  `> 0` checks in both `MbproxyOptionsValidator` and `ReloadValidator`, error-message format
  matching the sibling checks, and tests for default + bind + zero + negative. This is the
  right pattern; the new tests are adequate for what is enforced.
- **The csproj `EmbeddedResource` glob is correct** for the flat-directory design, the
  comment accurately documents the resource-name → request-path mapping, and there is no
  accidental double-include because the Worker SDK does not enable web default globs.
- **`mbproxy.smoke.config.json` binds against the current schema** — no stale keys, the
  three-PLC line-a/line-b/line-dead topology is a sensible smoke surface, and
  `AdminPushIntervalMs` is present and explicit.

## Key file references

- `src/Mbproxy/Admin/StatusBroadcaster.cs:113-135` — untested `LoopAsync` (M1).
- `src/Mbproxy/Proxy/TagValueCapture.cs:103-129` — arm/disarm vs. record race (M2).
- `src/Mbproxy/Admin/PlcSubscriptionTracker.cs` — no test file exists (M3).
- `src/Mbproxy/Admin/StatusHub.cs:60-66` — `OnDisconnectedAsync`; `FakeGroupManager.Removed`
  is unasserted dead code (M4).
- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs` — success-path-only fakes (m1, m2).
- `tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` — compile-fix only, reconciler
  ↔ registry wiring untested (m7).
- `src/Mbproxy/Mbproxy.csproj` — `EmbeddedResource Include="Admin\wwwroot\*.*"` (m5).
- `tests/sim/mbproxy.smoke.config.json` — dangling `plans/` reference (m6).