Files
wwtools/mbproxy/codereviews/2026-05-15/TestsAndConfig.md
T
Joseph Doherty 554b05d28c mbproxy: fix dashboard review findings, add named BCD tags + fleet config
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 03:39:39 -04:00

20 KiB
Raw Blame History

Code Review — Tests & Config for the SignalR Dashboard (commit e719dd5)

Scope: the test files, config/options/build changes, and the smoke config introduced or modified by e719dd5 ("replace status page with a live SignalR web dashboard"). Production SignalR/capture code (StatusHub, StatusBroadcaster, TagValueCapture, TagCaptureRegistry, AdminEndpointHost) was read for context but is reviewed only insofar as it tells us whether the new tests actually exercise the risky paths. The wider UI (dashboard.js, detail.js, HTML/CSS) is explicitly out of scope.

Summary

The new tests are competently written, follow the existing Subject_Condition_Expectation naming, and the IStatusPushSink seam is a genuinely good design that makes the broadcaster unit-testable without a SignalR host. The TagValueCapture torn-read test is a real concurrency test, not a pretend one. But the concurrency-sensitive paths the commit message advertises — "armed only while a detail page is open", arm/disarm under churn, the broadcaster loop — are tested only at the single-threaded happy-path level. The most serious gaps:

  1. The StatusBroadcaster.LoopAsync background loop has zero test coverage — every broadcaster test calls the internal PushOnceAsync directly. The loop's interval re-read, the Math.Max(100, ...) floor, the cancellation path, and the loop-terminated-unexpectedly catch are all unverified.
  2. The arm/disarm-vs-record race in TagValueCapture is real and untested. Disarm() sets _armed = false then clears slots; Record() checks _armed then writes. A Record that passes the _armed check before Disarm flips it can write a slot after Disarm cleared it — leaving a stale observation on a disarmed capture. The torn-read test only races Record against Snapshot, never against Disarm/Arm.
  3. The PlcSubscriptionTracker — the lock-guarded type whose entire job is concurrent subscriber counting — has no direct test at all and no concurrent test anywhere. Its behaviour is only incidentally exercised through single-threaded StatusHubTests.

None are release-blockers for a read-only admin page on a trusted segment, but #2 is a latent correctness bug and #1/#3 mean a regression in the live-feed plumbing would ship silently.

Findings

Critical

None. This is a read-only admin surface; nothing here can corrupt the Modbus path.

Major

M1 — StatusBroadcaster.LoopAsync has no test. src/Mbproxy/Admin/StatusBroadcaster.cs:113-135; tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs. All four broadcaster tests drive PushOnceAsync directly. The actual background loop — which is what runs in production — is untested. Specifically uncovered:

  • the per-cycle _options.CurrentValue.AdminPushIntervalMs re-read (the documented hot-reload-without-restart behaviour);
  • the Math.Max(100, ...) floor that defends against a bad interval slipping past validation (note: validation rejects <= 0, so the floor only ever matters for values 1..99 — itself an untested corner);
  • that StartAsyncStopAsync actually terminates the loop and that StopAsync is safe to call when the loop never started. Fix: add a test that constructs the broadcaster with AdminPushIntervalMs ~100ms, calls Start(), polls Sink.FleetPushes.Count until >= 2 with a generous deadline + the test CancellationToken, then StopAsync() and asserts the count stops growing. Keep timing hedged (ShouldBeGreaterThanOrEqualTo), consistent with the existing coalescing tests.

M2 — Arm/disarm-vs-record race in TagValueCapture is real and untested. src/Mbproxy/Proxy/TagValueCapture.cs:103-129. Disarm() does _armed = false then clears slots; Record() does if (!_armed) return; then Volatile.Write the slot. Interleaving: Record reads _armed == trueDisarm sets false and clears all slots → Record writes its observation. Result: a disarmed capture holds a non-null slot, and the next Snapshot() reports stale traffic with UpdatedAtUtc != null — exactly the "reopened page shows stale data" failure the on-demand design says it prevents. The hot path makes this unlikely (record traffic stops when no viewer is attached) but it is not impossible: a detail page closing while an in-flight FC03 response is being rewritten hits it. ConcurrentRecordAndSnapshot_NeverYieldsTornSlot (line 118) races Record vs Snapshot only — never vs Disarm. Fix in production: have Disarm() set _armed = false, then clear slots, then clear again (or re-clear after a memory barrier), or re-check _armed inside Record after the Volatile.Write and null the slot if disarmed. At minimum add a test that hammers Arm/Disarm on one task while Record runs on another and asserts a disarmed capture's Snapshot() is all-null. The review should not let this ship "tested" when the test deliberately avoids the racy interleaving.

M3 — PlcSubscriptionTracker has no direct or concurrent test. src/Mbproxy/Admin/PlcSubscriptionTracker.cs (whole file); no PlcSubscriptionTrackerTests.cs exists. This is the single lock-guarded type whose correctness drives capture arm/disarm. StatusHubTests exercises it only single-threaded and indirectly. Untested behaviour that has no coverage anywhere:

  • the redundant re-subscribe path (Add returns false when the same connection re-subscribes the same PLC — set.Add fails);
  • RemoveConnection for an unknown connection id returning Array.Empty<string>();
  • a connection subscribed to multiple PLCs being torn down in one RemoveConnection;
  • concurrent Add/RemoveConnection from many "connections" — the lock is claimed thread-safe but nothing proves the count never goes negative or leaks. Fix: add PlcSubscriptionTrackerTests with the single-threaded cases plus one parallel-stress test (N tasks each Add-then-Remove, assert ActivePlcs() is empty and no exception), mirroring TxIdAllocatorTests' concurrency style.

M4 — StatusHub group-leave on disconnect is not verified. src/Mbproxy/Admin/StatusHub.cs:60-66; StatusHubTests.cs. OnDisconnectedAsync is tested for the capture-disarm side effect, but the hub never calls Groups.RemoveFromGroupAsync — and FakeGroupManager records Removed but no test ever asserts groups.Removed. Worth confirming this is intentional (real SignalR auto- removes a disconnected connection from all groups, so an explicit RemoveFromGroupAsync is genuinely unnecessary) — but then FakeGroupManager.Removed is dead code that implies a contract the hub does not honour. Either delete Removed from the fake, or if the hub is supposed to leave fleet groups explicitly, that's a production bug. As written, a reviewer cannot tell which. Fix: drop the unused Removed list, or add a comment on the fake explaining SignalR's implicit group cleanup so the gap is not mistaken for an omission.

Minor

m1 — SignalRFakes do not model the real SignalR failure surface. tests/Mbproxy.Tests/Admin/SignalRFakes.cs. The fakes are faithful to the shapes but model only the success path: FakeGroupManager.AddToGroupAsync always succeeds and is synchronous; FakeStatusPushSink.PushFleetAsync never throws and never observes the CancellationToken. The production code has explicit try/catch around _sink.PushFleetAsync / PushPlcAsync (StatusBroadcaster.cs:88-110) — a sink that throws is a real scenario (a SignalR send to a dropped connection) and that catch is completely uncovered. The BuildDebug failure catch (StatusBroadcaster.cs:82-86) is likewise uncovered. Fix: add a throwing variant of FakeStatusPushSink and assert PushOnceAsync swallows the exception and still attempts the per-PLC pushes; assert the ct is honoured (an OperationCanceledException from the sink must not be logged-and-swallowed the same way — the when (ex is not OperationCanceledException) filter is untested).

m2 — FakeHubCallerContext.ConnectionAborted is hardcoded to CancellationToken.None. SignalRFakes.cs:23. Real SignalR fires this token on disconnect. No current test needs it, but if anyone later tests connection-abort handling the fake will silently mask it. Low risk; note it with a comment so a future test author knows the fake is inert here.

m3 — Asset content-type test relies on a hand-maintained [InlineData] allow-list. AdminEndpointTests.Get_Asset_ReturnsCorrectContentType covers 5 of the 14 embedded files. bootstrap.bundle.min.js, detail.css, detail.js, index.html, plc.html, ibm-plex-sans-400/600.woff2 are not asserted. More importantly there is no test that the csproj glob actually embedded every wwwroot file — if someone adds favicon.ico to wwwroot/ and the glob silently misses it (it won't, *.* catches it, but a .gitignore'd or renamed file would), nothing fails. Fix: add a test that enumerates typeof(AdminEndpointHost).Assembly.GetManifestResourceNames(), filters the Mbproxy.Admin.wwwroot. prefix, and asserts every file physically present in wwwroot/ has a matching resource (or just assert the count). This is the only thing that would catch a broken EmbeddedResource glob.

m4 — Get_Asset does not assert the bytes are the right asset. AdminEndpointTests.cs (the new theory). It asserts bytes.Length > 0 and the content type, but not that dashboard.js contains a known marker (it already does this for the HTML shells via ShouldContain("/assets/dashboard.js")). A resource-name collision or a wrong ContentTypeFor mapping for a correctly-served-but-wrong file would pass. Cheap to harden: for the text assets assert a known substring.

m5 — csproj EmbeddedResource Include="Admin\wwwroot\*.*" — glob caveats. src/Mbproxy/Mbproxy.csproj (the new ItemGroup). The SDK is Microsoft.NET.Sdk.Worker, which does not enable web default-item globs, so there is no double-include of .css/ .js/.html as Content — good. Two real caveats:

  • *.* matches only files containing a dot. Every current asset has an extension so this is fine, but it is a non-obvious constraint; **\* or just * would be more honest. The comment says "intentionally flat", so *.* is acceptable, but worth a one-word note that extensionless files would be skipped.
  • Globs in an <ItemGroup> are evaluated at project-evaluation time; a newly-added asset is not picked up by an incremental build that didn't re-evaluate. In practice dotnet build/publish always re-evaluates, so this is a non-issue — but it is the kind of thing that bites a watch-mode developer. No fix needed; flagged for completeness. The resource-name → request-path mapping (Mbproxy.Admin.wwwroot.<file>) is correct for the flat directory and matches AssetResourcePrefix in AdminEndpointHost.cs:301.

m6 — mbproxy.smoke.config.json — valid but with stale/odd values. tests/sim/mbproxy.smoke.config.json. The schema is current (AdminPushIntervalMs present, Keepalive field names match KeepaliveOptions, no removed keys). Issues:

  • The header comment says the simulator runs on 127.0.0.1:5020 and both line-a/line-b point Port: 5020 — consistent. Good.
  • BackendHeartbeatIdleMs: 10000 while BackendRequestTimeoutMs: 2000 — satisfies the "must be greater than BackendRequestTimeoutMs" cross-field rule. Fine.
  • line-dead has no BcdTags, so it inherits the empty Global: [] set — fine, the intent is just an unreachable backend.
  • The file has no Resilience and no Cache section. Both are optional (defaults apply), so the config binds — but a smoke config whose stated purpose is exercising the dashboard's "problems only" filter would be more representative with the defaults made explicit, or at least a comment that defaults are intentionally relied on.
  • Comment line 1 references "Phase 4/5 web-UI browser smoke tests" and plans/2026-05-15-webui-dashboard.md. plans/ is untracked (?? plans/ in git status) and per recent commit 7466a46 the project has been retiring plan docs. A smoke config pointing at an untracked plan file is a dangling reference — either commit the plan or drop the citation.
  • No Resilience.ReadCoalescing means coalescing runs with defaults during the smoke run; acceptable but undocumented.

m7 — ConfigReconcilerTests change is a minimal compile-fix, not a behaviour test. tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs (the 1-line diff). The new TagCaptureRegistry ctor arg is passed as a throwaway new TagCaptureRegistry(). That is the correct minimal change, but it means the reconciler's new responsibility — calling TagCaptureRegistry.GetOrCreate on PLC add and Remove on PLC removal during hot-reload — has no test. ConfigReconciler.cs was modified in this commit (+11 lines) to wire the registry; nothing asserts that a hot-reload-added PLC gets a capture or that a removed PLC's capture is dropped. Fix: in ConfigReconcilerTests, pass a real registry the test holds a reference to, trigger an add/remove reload, and assert registry.TryGet reflects it.

m8 — ReloadValidator/MbproxyOptions AdminPushIntervalMs validation: consistent, but the upper bound is unguarded. ReloadValidator.cs and MbproxyOptionsValidator both reject <= 0 — consistent with how GracefulShutdownTimeoutMs and the keepalive intervals are validated (lower-bound-only, > 0). So the new validation matches house style. But note no interval option in this codebase has an upper bound, and AdminPushIntervalMs is the one most likely to be fat-fingered into something absurd (1000000 = 16-minute feed). Not a regression and not inconsistent — flagged only because the dashboard's whole value proposition is "live". The LoopAsync Math.Max(100, ...) floor protects against tiny values; nothing protects against huge ones. Optional: a soft upper bound (e.g. reject > 60_000) or just a doc note. The two new ReloadValidatorTests (zero, negative) and two new MbproxyOptionsBindingTests (default 1000, binds 250) are correct and adequate for the bounds that are enforced.

m9 — StatusHubTests.SecondSubscriber_FirstLeaveKeepsArmed_LastLeaveDisarms passes null to OnDisconnectedAsync. Line 72/76. Real SignalR passes a non-null Exception on an abnormal disconnect and null on a clean one, so null is a legal value — but the test never exercises the exception-carrying path. StatusHub.OnDisconnectedAsync ignores the exception entirely, so this is harmless today; flagged so a future change that uses the exception doesn't find itself untested.

Coverage gaps (behavior in e719dd5 with NO test)

  1. StatusBroadcaster.LoopAsync — the entire background loop (M1).
  2. Arm/disarm vs. Record race in TagValueCapture (M2).
  3. PlcSubscriptionTracker — no direct test, no concurrency test (M3).
  4. Sink-throws / build-throws error handling in PushOnceAsync — the try/catch blocks and the when (ex is not OperationCanceledException) filter (m1).
  5. ConfigReconcilerTagCaptureRegistry wiring — capture created on hot-reload PLC add, removed on PLC remove (m7).
  6. AdminEndpointHost AdminPort hot-reload tears down the broadcaster and disarms capturesStopCurrentAppAsync disposes _broadcaster (which DisarmAlls); the broadcaster's StopAsync_DisarmsEveryCapture test proves the disarm in isolation but no test proves the hot-reload rebind path runs it. (AdminEndpointTests has an admin-port rebind test from the prior commit — extend it to assert capture disarm.)
  7. /hub/status SignalR endpoint — not reachable by any test. The 405-methods test explicitly excludes it. No test connects a SignalR client and verifies a fleet/plc message is actually received end-to-end. The IStatusPushSink unit tests cover the broadcaster logic but nothing covers SignalRStatusPushSink + MapHub wiring. Given the project already stands up a real Kestrel host in AdminEndpointTests, one HubConnectionBuilder-based E2E test (SubscribeFleet → await a fleet message) would close the single biggest "does the feature actually work" gap.
  8. Multi-tag / 32-bit capture via BcdPduPipelineBcdPduPipelineCaptureTests covers FC03 16/32-bit and FC06/FC16 16-bit, but not FC16 32-bit (CDAB pair write) capture, and not a PDU spanning a BCD tag and a non-BCD register (does capture record only the BCD one?). The pure TagValueCaptureTests cover 32-bit, but the pipeline hook for 32-bit writes is unverified.
  9. Get_PlcDetailRoute_ReturnsDetailShell uses /plc/anything — it never checks that a PLC name with a slash, encoded characters, or empty segment is handled. The route is /plc/{name} and the name is read client-side, so this is low-risk, but /plc/ (empty) and /plc/a%2Fb are untested.

What looks good

  • IStatusPushSink is a clean seam. Extracting the outbound side behind an interface so the broadcaster loop logic is testable without a SignalR host is exactly right, and FakeStatusPushSink using ConcurrentBag is the correct call for a type that production pushes to from a background thread.
  • TagValueCapture torn-read test is genuine. ConcurrentRecordAndSnapshot_NeverYields TornSlot races 4 writers × 200k ops against a reader and checks a real invariant (DecodedValue == RawLow + RawHigh). This is the right way to test the Volatile.Write immutable-record swap, and it would actually catch a regression to a mutable slot.
  • SignalRFakes model the right shapes. FakeHubCallerContext : HubCallerContext with the abstract members overridden, FakeGroupManager : IGroupManager, and the hub-property injection (Context = ..., Groups = ...) is the standard, correct way to unit-test a SignalR Hub without a host. No mocking framework is dragged in.
  • BcdPduPipelineCaptureTests regression guards are well-chosen. The disarmed-capture and null-capture cases assert the rewrite still happens byte-identically — that is the load-bearing property (capture must never perturb the proxy's transparency) and it is explicitly tested.
  • TagCaptureRegistryTests.GetOrCreate_Rebuild_PreservesArmedFlag correctly tests the hot-reload reseat contract (rebuilt capture is a new instance but keeps IsArmed), and UnknownPlc_Operations_AreSafeNoOps covers the no-op-for-ghost-PLC contract that StatusHub relies on.
  • StatusHtmlRenderer removal is clean. No source or test file references it; the only remaining mentions are in plans/, docs/, and prior codereviews/ — all expected.
  • The AdminPushIntervalMs validation is consistent with the codebase. Lower-bound-only > 0 checks in both MbproxyOptionsValidator and ReloadValidator, error-message format matching the sibling checks, and tests for default + bind + zero + negative. This is the right pattern; the new tests are adequate for what is enforced.
  • The csproj EmbeddedResource glob is correct for the flat-directory design, the comment accurately documents the resource-name → request-path mapping, and there is no accidental double-include because the Worker SDK does not enable web default globs.
  • mbproxy.smoke.config.json binds against the current schema — no stale keys, the three-PLC line-a/line-b/line-dead topology is a sensible smoke surface, and AdminPushIntervalMs is present and explicit.

Key file references

  • src/Mbproxy/Admin/StatusBroadcaster.cs:113-135 — untested LoopAsync (M1).
  • src/Mbproxy/Proxy/TagValueCapture.cs:103-129 — arm/disarm vs. record race (M2).
  • src/Mbproxy/Admin/PlcSubscriptionTracker.cs — no test file exists (M3).
  • src/Mbproxy/Admin/StatusHub.cs:60-66OnDisconnectedAsync; FakeGroupManager.Removed is unasserted dead code (M4).
  • tests/Mbproxy.Tests/Admin/SignalRFakes.cs — success-path-only fakes (m1, m2).
  • tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs — compile-fix only, reconciler ↔ registry wiring untested (m7).
  • src/Mbproxy/Mbproxy.csprojEmbeddedResource Include="Admin\wwwroot\*.*" (m5).
  • tests/sim/mbproxy.smoke.config.json — dangling plans/ reference (m6).