mbproxy: fix dashboard review findings, add named BCD tags + fleet config

Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-16 03:39:39 -04:00
parent e719dd51c1
commit 554b05d28c
27 changed files with 964 additions and 83 deletions
@@ -0,0 +1,93 @@
# Admin SignalR Web Dashboard — Code Review
Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard"), files `src/Mbproxy/Admin/{AdminEndpointHost,StatusHub,StatusBroadcaster,StatusPushSink,PlcSubscriptionTracker,StatusSnapshotBuilder,DebugDto}.cs`. Cross-checked against `docs/Operations/StatusPage.md`, `docs/Reference/LogEvents.md`, the mbproxy `CLAUDE.md`, and the supporting types `Proxy/TagCaptureRegistry.cs`, `Proxy/TagValueCapture.cs`, `Proxy/ProxyWorker.cs`, `HostingExtensions.cs`, `Configuration/ConfigReconciler.cs`. Tests under `tests/Mbproxy.Tests/Admin/{StatusHubTests,StatusBroadcasterTests}.cs`.
## Summary
- The decomposition is sound: `IStatusPushSink` cleanly isolates the push loop from SignalR, `PlcSubscriptionTracker` is correctly single-locked, and the broadcaster lifecycle is tied to the Kestrel app's lifecycle so an `AdminPort` hot-reload re-bind does not leak a second broadcaster. Per-cycle error handling in `PushOnceAsync` is genuinely defensive.
- The most serious problem is a **subscriber-count leak in `PlcSubscriptionTracker`**: a SignalR reconnect (transport drop without a clean close) increments a PLC's count on the new connection but the old connection's `OnDisconnectedAsync` decrement is not guaranteed, so a capture can be left armed forever — or, in the opposite race, double-armed/never-disarmed. Combined with the fact that capture arming has no reference to the *connection liveness*, the on-demand-capture invariant ("armed only while a viewer is open") is not actually upheld in the field.
- A **second real bug**: `StatusHub.SubscribePlc` is not atomic. It awaits `Groups.AddToGroupAsync` and then calls `_tracker.Add` synchronously, which leaves an interleaving point between the two steps. `OnDisconnectedAsync` can run on the same connection in that gap, producing a state where counts never go negative but are still wrong: the group membership outlives the tracker entry (capture disarmed while the page still receives pushes) or vice versa.
- `DebugJsonContext` (in `DebugDto.cs`) is dead code — defined, never referenced. SignalR serializes `PlcDetailResponse` via the reflection-based `System.Text.Json` path, not the source-gen context. Harmless today (no trimming/AOT in the csproj) but it is a misleading artifact and a latent trap if AOT is ever enabled.
- The documented contract is matched on the fleet path but the **detail push (`PlcDetailResponse`) is undocumented in the `status.json` schema** and is not reachable from `StatusJsonContext` — only over SignalR. That is by design, but the camelCase guarantee for it rests entirely on the hub's `AddJsonProtocol` config, with no test asserting the wire shape.
## Critical findings
**C1. `PlcSubscriptionTracker` leaks subscriber counts across SignalR reconnects — captures get stuck armed (or never armed).** `StatusHub.cs:48-54` / `PlcSubscriptionTracker.cs:29-73`. SignalR assigns a *new* `ConnectionId` on every transport reconnect (WebSocket drop, long-polling cycle, network blip). The client (`detail.js`) re-invokes `SubscribePlc` on reconnect. The sequence on a reconnect is:
1. Old connection's transport dies. SignalR *eventually* fires `OnDisconnectedAsync` for the old `ConnectionId` — but this is **not** ordered relative to the new connection's `OnConnectedAsync`/`SubscribePlc`, and on an ungraceful drop it may be delayed by the server's keepalive/timeout window (default ~30 s) or, if the server is shutting down, may not fire deterministically at all.
2. New connection calls `SubscribePlc` → `_tracker.Add(newId, plc)` → count `1 → 2`, returns `false`, so no re-arm (fine).
3. Old connection's `OnDisconnectedAsync` runs late → count `2 → 1`. Capture stays armed. **Correct only if the order is (2) then (3).**
The failure case: if the old connection's `OnDisconnectedAsync` is delayed past the *new* connection also disconnecting, or if the operator closes the tab during the reconnect window, the count never returns to 0 and **the capture is armed forever with no viewer** — exactly the hot-path cost the on-demand design exists to avoid. Over a long-running service with flaky operator networks this accumulates: every PLC ever viewed ends up permanently armed. `TagValueCapture.Record` is then a non-trivial cost (`FrozenDictionary` lookup + allocation of a `TagValueObservation` + `Volatile.Write`) on the backend reader task and every FC06/FC16 upstream task, fleet-wide, forever.
The `StatusBroadcaster.StopAsync` → `DisarmAll()` safety net only fires on admin shutdown / `AdminPort` hot-reload, not during normal operation, so it does not bound this leak in a steady-state service.
Fix: do not rely on `ConnectionId` lifetime as the capture-arming key. Two options: (a) key subscriptions on a stable client-supplied identity and treat reconnects as idempotent re-subscribes; or (b) drive disarm off the broadcaster, so that each cycle it reconciles armed captures against `ActivePlcs()` (`_captureRegistry` disarms any PLC not in the set). Option (b) alone is insufficient, because a leaked *count* keeps `ActivePlcs()` returning the PLC. A robust fix therefore also needs a liveness check: give `PlcSubscriptionTracker` a periodic sweep (via `RemoveConnection`) against SignalR's live connection set, or add a TTL/heartbeat so a connection that has not sent a keepalive in N intervals is reaped. At minimum, document the leak and have the broadcaster log when the `ActivePlcs()` count exceeds the number of distinct connections it has seen.
**C2. `SubscribePlc` is not atomic — group membership and tracker state can diverge under a concurrent disconnect.** `StatusHub.cs:48-54`:
```csharp
public async Task SubscribePlc(string plcName)
{
await Groups.AddToGroupAsync(Context.ConnectionId, PlcGroup(plcName)).ConfigureAwait(false);
if (_tracker.Add(Context.ConnectionId, plcName))
_captureRegistry.Arm(plcName);
}
```
There is one interleaving point here: the awaited `AddToGroupAsync`, followed by the synchronous `_tracker.Add`. SignalR will dispatch `OnDisconnectedAsync` for the same connection if the transport drops while `SubscribePlc` is mid-flight (hub method invocations and the disconnect callback are not mutually exclusive; only invocations on the *same* connection are serialized *with each other*, and `OnDisconnectedAsync` is a separate dispatch path). Concretely:
- Connection drops after `AddToGroupAsync` completes but before `_tracker.Add`. `OnDisconnectedAsync` runs `_tracker.RemoveConnection(id)` → returns empty (nothing tracked yet). Then `SubscribePlc` resumes, `_tracker.Add` runs → count `0 → 1`, returns `true` → **`_captureRegistry.Arm(plcName)` runs on a connection that is already gone.** The capture is now armed with a phantom viewer. This is the same stuck-armed leak as C1, reached by a tighter race.
- The mirror case leaves a group membership with no tracker entry; benign for pushes (the dead connection just never receives them) but it confirms the two data structures are not kept consistent.
Fix: make subscribe/unsubscribe go through a single critical section that also checks connection liveness, or — simpler — register the tracker entry *first* (synchronous, under its own lock), then `AddToGroupAsync`, and have `OnDisconnectedAsync` always run last and unconditionally. Even then the arm/disarm must be idempotent and reconciled against the live connection set (see C1). The cleanest fix addresses C1 and C2 together: treat `OnConnectedAsync`/`OnDisconnectedAsync` as the *only* mutators of tracker state, make `SubscribePlc` only adjust group membership, and arm/disarm from a periodic reconciliation in the broadcaster against SignalR's actual group/connection state.
## Major findings
**M1. `DebugJsonContext` is dead code; the detail payload is not source-gen serialized.** `DebugDto.cs:54-61` declares `[JsonSerializable(typeof(PlcDetailResponse))] ... DebugJsonContext`, but nothing references `DebugJsonContext.Default` anywhere in the tree (grep confirms zero usages). `AdminEndpointHost.cs:192-196` configures the SignalR JSON protocol with only a `PropertyNamingPolicy` — no `TypeInfoResolver` — so SignalR serializes `PlcDetailResponse` through the **reflection-based** `System.Text.Json` resolver. The `status.json` route uses `StatusJsonContext` (source-gen); the SignalR path does not use any source-gen context. Today this works (the csproj sets neither `PublishTrimmed` nor `PublishAot`), but: (a) `DebugJsonContext` is misleading dead code that implies the detail payload is AOT-safe when it is not; (b) if trimming/AOT is ever turned on, every SignalR push of `PlcDetailResponse` *and* of `StatusResponse` (the `"fleet"` message also goes through the reflection resolver — `StatusJsonContext` is only wired into the `status.json` `JsonSerializer.Serialize` call, not into `AddJsonProtocol`) will throw or silently emit `{}`. Fix: either delete `DebugJsonContext` and document that the SignalR path is reflection-only, or — better — wire both contexts into the hub via `AddJsonProtocol(o => o.PayloadSerializerOptions.TypeInfoResolverChain.Insert(0, StatusJsonContext.Default); ...Insert(1, DebugJsonContext.Default))` so the SignalR path is consistent with `status.json` and trim-safe. Note `DebugJsonContext` would also need `StatusResponse`/`PlcStatus` reachable since `PlcDetailResponse` embeds `PlcStatus?`.
**M2. No test asserts the SignalR wire shape is camelCase.** `AdminEndpointHost.cs:194-196` is the *only* thing making the live feed's JSON match the documented `status.json` contract (`docs/Operations/StatusPage.md` repeatedly states "camelCase property names"). `StatusBroadcasterTests` uses `FakeStatusPushSink` and never serializes; `StatusHubTests` uses fakes. If someone removes or mis-edits the `AddJsonProtocol` lambda, every dashboard field silently becomes `PascalCase` and the JS (`dashboard.js`/`detail.js` expecting `service.uptimeSeconds` etc.) breaks with no failing test. The reflection-vs-source-gen split in M1 makes this worse — the two endpoints' naming is configured in two unrelated places. Add an integration test that stands up the hub (or at least serializes `PlcDetailResponse`/`StatusResponse` with the exact `PayloadSerializerOptions` the hub builds) and asserts `uptimeSeconds`/`captureArmed` appear lowercase-first.
**M3. `StatusBroadcaster.PushOnceAsync` swallows a cancelled per-PLC push but keeps looping over remaining PLCs.** `StatusBroadcaster.cs:97-110`. The `catch (Exception ex) when (ex is not OperationCanceledException)` correctly lets cancellation propagate — but only out of the `await _sink.PushPlcAsync`. If `ct` is cancelled mid-loop, the `OperationCanceledException` escapes `PushOnceAsync`, which is caught by `LoopAsync`'s outer `catch (OperationCanceledException)` — fine. However, if `_builder.BuildDebug(plcName)` (synchronous, `StatusSnapshotBuilder.cs:44`) throws a *non-OCE* exception for one PLC, the `catch` logs and the loop continues — good. But `snapshot.Plcs.FirstOrDefault(...)` is re-run inside the `try` for every PLC: an O(N) scan per PLC over the `Plcs` list = O(N²) per push cycle. For 54 PLCs that is 2,916 comparisons per second — trivially cheap here, but flagged because the fleet snapshot is already a `List` and a `Dictionary<string,PlcStatus>` projection once per cycle would be cleaner and removes the quadratic. Minor on its own; grouped here because it sits in the same loop as a real concern: `BuildDebug` is called for every active PLC even when `snapshot.Plcs` has no matching entry (PLC removed by hot-reload) — that path is handled (`Plc: null`) but worth a test.
**M4. `StatusBroadcaster.Start()` is fire-and-forget with no guard against being called twice.** `StatusBroadcaster.cs:51-52`: `public void Start() => _loop = Task.Run(...)`. The XML doc says "Idempotent only in the sense that it is called once" — but nothing *enforces* that. A second `Start()` overwrites `_loop`, orphaning the first loop task (it keeps running against the same `_cts`, so two loops now push concurrently every interval until cancellation). `AdminEndpointHost` only ever calls it once per `StartAppAsync`, and a new `StatusBroadcaster` is constructed each re-bind, so this is not hit today — but a one-line guard (`if (_loop is not { IsCompleted: true } and not null) throw`/return, or an `Interlocked` flag) would make the class safe to misuse. Also: `Task.Run` returns a task whose faults are observed only via `_loop` being awaited in `StopAsync`; if `StopAsync` is never called (e.g. `StartAppAsync` throws *after* `_broadcaster.Start()` but the catch at `AdminEndpointHost.cs:253` sets `_app = null` without disposing `_broadcaster`) the loop task and its `_cts` leak. See M5.
**M5. A bind failure after `_broadcaster.Start()` leaks the broadcaster and its push loop.** `AdminEndpointHost.cs:237-258`. `StartAppAsync` does `await app.StartAsync(ct)` (line 237), then `_app = app`, then constructs and `Start()`s `_broadcaster` (lines 242-249), then `LogAdminStarted`. The whole body is wrapped in `catch (Exception ex) when (ex is not OperationCanceledException)` (line 253) which logs `mbproxy.admin.bind.failed` and sets `_app = null`. If anything between line 238 and 251 throws — e.g. `app.Services.GetRequiredService<IHubContext<StatusHub>>()` fails, or the `StatusBroadcaster` constructor throws, or `LogAdminStarted` somehow throws — the catch sets `_app = null` but **does not stop the already-started `_broadcaster` nor the already-started Kestrel `app`**. The push loop keeps running forever against a sink whose hub is on a Kestrel app that `StopCurrentAppAsync` will never see (`_app` is null, `_broadcaster` field may or may not be set depending on where the throw landed). Result: a leaked Kestrel listener still bound to the port, plus a leaked broadcaster loop. The probability is low (those calls rarely throw) but the catch's cleanup is incomplete. Fix: in the catch, best-effort stop/dispose whatever was started — mirror `StopCurrentAppAsync`'s logic, or wrap the post-`StartAsync` section so a failure tears down `app` and `_broadcaster` before nulling the fields.
**M6. `PushOnceAsync` builds the debug snapshot for a PLC whose subscriber count is stale.** Tied to C1/C2: `ActivePlcs()` (`PlcSubscriptionTracker.cs:76-82`) returns a key snapshot, and the broadcaster pushes a `"plc"` message to `PlcGroup(plcName)` for each. If the count is leaked-high (C1), the broadcaster pushes to an empty SignalR group every cycle forever — cheap (SignalR no-ops an empty group) but it also keeps calling `_builder.BuildDebug(plcName)` and, more importantly, the *capture stays armed* because nothing disarms it. This is the observable symptom of C1 inside the broadcaster. Recording it separately because the fix (broadcaster-side reconciliation) lives here.
## Minor findings
**N1. `StatusBroadcaster` does not use a stable log event name.** `StatusBroadcaster.cs:84,94,108,133` all call `_logger.LogError(ex, "StatusBroadcaster: ...")` with free-text messages and no `EventId`/`EventName`. Every other component in this codebase uses `[LoggerMessage]` source-gen with a stable `mbproxy.*` event name catalogued in `docs/Reference/LogEvents.md` (e.g. `mbproxy.admin.started` EventId 70, `mbproxy.admin.bind.failed` EventId 71). The broadcaster's "loop terminated unexpectedly" at line 133 is exactly the kind of event an operator would alert on, and it is invisible to event-name-based log queries. Add `[LoggerMessage]` entries (e.g. `mbproxy.admin.broadcast.failed`, `mbproxy.admin.broadcast.loop.terminated`) and register them in `LogEvents.md`.
**N2. `LoopAsync` does `Task.Delay` *before* the first push.** `StatusBroadcaster.cs:122-124`: the loop delays `interval` ms, then pushes. The first dashboard client therefore waits up to `AdminPushIntervalMs` (default 1000 ms) for its first `"fleet"` message even though a snapshot is available immediately. `index.html`/`dashboard.js` presumably also fetch `status.json` or render empty until the first push — but a push-before-delay (or an immediate `PushOnceAsync` before entering the loop) would make the dashboard populate instantly on open. Cosmetic, but easy.
**N3. `StatusBroadcaster.StopAsync` is not idempotent-safe against the `_loop` await.** `StatusBroadcaster.cs:57-72`: a second `StopAsync` call is reachable in principle, since `DisposeAsync` calls `StopAsync` and `AdminEndpointHost.StopCurrentAppAsync` calls `broadcaster.DisposeAsync()`, although today there is only one path per instance. On a second call `_cts.IsCancellationRequested` is true, so it skips `CancelAsync`, then awaits `_loop` again (a completed task, fine), then calls `_captureRegistry.DisarmAll()` again (fine). Benign, but `DisposeAsync` at lines 137-141 then calls `_cts.Dispose()`; a *third* path touching `_cts` after dispose would throw `ObjectDisposedException`. No live bug, but the class lacks the `_disposed` guard that `AdminEndpointHost` itself carries (and whose absence the `AdminEndpointHost` comment at lines 59-64 explicitly calls out as a regression risk). Add a `_stopped`/`_disposed` flag for symmetry.
**N4. `OnDisconnectedAsync` does its cleanup *before* `base.OnDisconnectedAsync`.** `StatusHub.cs:60-66`. The capture disarm runs first, then `base.OnDisconnectedAsync(exception)`. This is the correct order (you want to release resources before the base teardown) and the disarm is synchronous so it cannot be lost — but note there is **no `OnConnectedAsync` override**, which is fine, and no try/finally around the `foreach`. If `_captureRegistry.Disarm` threw (it cannot — it is a dictionary lookup + volatile write), the remaining PLCs in the connection's set would not be disarmed and `base.OnDisconnectedAsync` would be skipped. Defensive only; `Disarm` is genuinely no-throw today.
**N5. `PlcSubscriptionTracker.ActivePlcs()` allocates a fresh array every push cycle.** `PlcSubscriptionTracker.cs:80`: `_plcCounts.Keys.ToArray()` under the lock, once per `AdminPushIntervalMs`. With the common case of zero detail-page viewers it correctly returns `Array.Empty<string>()` (line 80 short-circuits on `Count == 0`). Only allocates when someone is viewing. Fine — flagged only as a known per-cycle allocation if push interval is ever lowered aggressively.
**N6. `ServeHtmlShell` / asset routes have no `HEAD` handling and `/plc/{name}` ignores `name`.** `AdminEndpointHost.cs:210`: `app.MapGet("/plc/{name}", (string name, HttpContext ctx) => ServeHtmlShell(ctx, "plc.html"))`. Here `name` is bound but unused (the page reads it client-side from the URL). Harmless, but the unused parameter could draw a compiler/analyzer warning under this project's `TreatWarningsAsErrors` unless suppressed; verify it builds. If it does build clean, fine: `MapGet` route parameters are not flagged as unused. Minor; mentioned for the reviewer to confirm against CI.
**N7. `StatusPage.md` documents the `PlcDetailResponse` fields but not its transport.** The doc table at `StatusPage.md:317-328` does describe the payload's fields, so the "undocumented" concern in the Summary's last bullet is withdrawn to that extent. What is genuinely missing is a statement that this payload travels **only over SignalR**, is **not** reachable at any `GET` route, and is serialized along a different path from `status.json` (M1). One sentence in `StatusPage.md`'s "Debug View Data" section would close it.
## What looks good
- **Broadcaster lifecycle is correctly bound to the Kestrel app, not the host.** `AdminEndpointHost.cs:242-249` creates a fresh `StatusBroadcaster` inside `StartAppAsync` and `StopCurrentAppAsync` (lines 268-279) disposes it *before* stopping Kestrel. An `AdminPort` hot-reload therefore tears down the old broadcaster and starts a new one — no broadcaster leak across re-binds, and the `DisarmAll()` in `StopAsync` ensures the re-bind does not strand an armed capture. This directly answers the open question N3-style concern from the 2026-05-14 `AdminAndDiagnostics` review about provider/loop leaks across re-binds.
- **`IStatusPushSink` is a clean seam.** Defining the outbound side as an interface (`StatusPushSink.cs`) lets `StatusBroadcasterTests` exercise the full push-cycle logic (fleet always, per-PLC only for active PLCs, disarm-on-stop) with a recording fake and zero SignalR host — and the tests actually do this. Good testability design.
- **`PlcSubscriptionTracker` locking is correct *as a data structure*.** Single `_gate`, every method takes it, the count transitions (`0→1` returns arm-signal, `1→0` returns disarm-signal) are computed under the lock, and `RemoveConnection` correctly handles the multi-PLC-per-connection case. The bug (C1/C2) is not in the locking — it is that `ConnectionId` is the wrong key for a *lifetime* and the tracker is mutated from two un-serialized hub dispatch paths. The class itself is internally consistent.
- **`PushOnceAsync` per-stage error isolation.** `StatusBroadcaster.cs:78-110` wraps snapshot-build, fleet-push, and each per-PLC push in their own `try/catch` so one PLC's failure does not abort the cycle and a snapshot-build failure does not kill the loop. The `when (ex is not OperationCanceledException)` filters correctly let shutdown cancellation propagate to `LoopAsync`'s handler. This is the right shape.
- **`LoopAsync` re-reads `AdminPushIntervalMs` every cycle** (`StatusBroadcaster.cs:122`) so a hot-reload of the interval takes effect without restarting the loop, and floors it at 100 ms so a bad value cannot spin the CPU. Matches the hot-reload-everything posture in `CLAUDE.md`.
- **`TagValueCapture` concurrency is genuinely lock-free-correct.** `Volatile.Write`/`Volatile.Read` of references to an immutable `record` (`TagValueObservation`), `_armed` is `volatile`, and `Record` short-circuits on `!_armed` with a single volatile read before any work — so the disarmed hot path cost is one bool read, as advertised. `Disarm` clears slots so a re-arm shows only fresh data. This part of the design is solid; the weakness is *who calls Arm/Disarm and when* (C1/C2), not the capture itself.
- **`StatusSnapshotBuilder.BuildDebug` degrades gracefully for an unknown PLC** (`StatusSnapshotBuilder.cs:44-47`) — returns a disarmed empty snapshot rather than throwing, which is the correct behavior for a detail page open on a hot-reload-removed PLC, and `PlcDetailResponse.Plc` is nullable to carry that state. `ConfigReconciler.cs:259` calls `_captureRegistry.Remove(name)` on PLC removal, so the registry and the config stay consistent.
- **`StatusHub.SubscribePlc` for an unknown PLC is a documented no-op** (`StatusHub.cs:53`, `TagCaptureRegistry.Arm` no-ops a missing key) and `StatusHubTests.SubscribePlc_UnknownPlc_DoesNotThrow_AndArmsNothing` covers it. A hub method throwing would be sent to the caller as a hub error; this path correctly does not throw.
- **Asset serving is safe.** `AdminEndpointHost.cs:212-226` rejects `/`, `\`, and `..` in the asset path segment before touching `GetManifestResourceStream`, caches bytes and misses in a `static ConcurrentDictionary` shared across app re-builds, and sets `immutable` cache headers for content-addressed assets vs. `no-cache` for the HTML shells. Embedded resources mean no filesystem traversal surface at all.
## Open questions
1. **C1 field impact:** how often do operators' browsers reconnect in this deployment? If the dashboard is on a stable internal segment and detail pages are short-lived, the leak is slow — but it is unbounded and never self-heals in a steady-state service. Is there an existing process-restart cadence that masks it? Either way the invariant in `StatusPage.md:315` ("the hot path carries zero cost when nobody is watching") is currently false after any reconnect-during-view.
2. **C2 / hub dispatch model:** confirm against the SignalR version in use whether `OnDisconnectedAsync` for a connection can overlap an in-flight `SubscribePlc` invocation on that *same* connection. If SignalR guarantees `OnDisconnectedAsync` runs only after all in-flight invocations for that connection complete, C2's same-connection race narrows to the cross-connection (reconnect) race in C1 — still a bug, but the fix scope shrinks.
3. **M1:** is trimming/AOT on the roadmap for this service? `CLAUDE.md` mentions single-file self-contained publish but not trimming. If AOT is ever planned, M1 is upgraded to Critical (the SignalR reflection JSON path will break) and `DebugJsonContext` must be wired in, not deleted.
4. **M5:** has the post-`StartAsync` failure path (e.g. `GetRequiredService<IHubContext<StatusHub>>` failing) ever been observed? It is low-probability, but the catch block's cleanup is provably incomplete — worth a deliberate decision to either fix it or document it as accepted.
5. Is there any reason `StatusJsonContext.Default` is not also wired into the hub's `AddJsonProtocol` so the fleet `"fleet"` push and `GET /status.json` share one serialization path and one camelCase configuration point (M1/M2)?
@@ -0,0 +1,96 @@
# Frontend / Live Web Dashboard — Code Review
Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard"). Files reviewed — all under `src/Mbproxy/Admin/wwwroot/`:
`index.html`, `plc.html`, `dashboard.js`, `detail.js`, `theme.css`, `dashboard.css`, `detail.css`. Vendored assets (`bootstrap.min.css`, `bootstrap.bundle.min.js`, `signalr.min.js`, `*.woff2`) explicitly out of scope. Cross-checked against `docs/Operations/StatusPage.md` (wire contract), `CLAUDE.md` (mbproxy), and the server side `StatusHub.cs` / `StatusBroadcaster.cs` for the SignalR contract.
## Summary
- The dashboard is well-built vanilla JS: thoughtful escaping helpers, clean rate computation, sensible reconnection wiring, and a correct subscribe-on-reconnect pattern that keeps the on-demand tag capture armed/disarmed correctly.
- **The single most important finding is a real, exploitable XSS hole on the detail page**: `detail.js` renders `t.direction`, `t.address`, `t.width`, and the `shortTime(c.connectedAtUtc)` fallback straight into `innerHTML` **without escaping**. `escapeHtml` exists in the file but is applied inconsistently: `t.rawHex` and `c.remote` go through it, while several attacker/PLC-influenceable fields bypass it. The fleet table (`dashboard.js`) is escaped correctly throughout; the detail page is not. This is a Critical finding.
- A second Major issue: on a *cold* SignalR failure the page enters a `setTimeout(start, 3000)` retry loop that reuses the same `HubConnection` object but never tears down its handlers; layered on top of `withAutomaticReconnect`, the failure/retry semantics are muddled. There is also no upper bound on retries and no UX past "retrying".
- DOM updates are full-`innerHTML` re-renders of the whole table body every ~1 s. Correct and flicker-free for 54 rows, but it blows away focus/selection and is wasteful; acceptable at this scale, flagged.
## Critical findings
**C1. Stored/reflected XSS on the connection-detail page — multiple unescaped fields rendered via `innerHTML`.** `detail.js` defines `escapeHtml` (line 23) and uses it for *some* fields but omits it for several others, all of which are interpolated into template strings later assigned to `.innerHTML`:
- **`detail.js:194` and `:203`** — the debug-row builder:
```js
<td>${t.address} <span class="ratio-sub">${hex4(t.address)}</span></td>
<td>${t.width}-bit</td>
...
<td><span class="dir-tag ${dirCls}">${t.direction}</span></td>
```
`t.direction` is a server string (`"read"`/`"write"` per the schema) interpolated raw into both an attribute-ish context and element text. `t.address`/`t.width` are typed `int` in the documented DTO, so they are lower risk — but the code does not coerce them (`Number(...)`) or escape them, so it relies entirely on the server DTO type. `t.direction` is a free `string` on the wire and is **not escaped**.
- **`detail.js:86`** — the client list:
```js
`<div class="client-line">${escapeHtml(c.remote)}` +
`<span class="pdu"> · ${num(c.pdusForwarded)} PDUs · since ${shortTime(c.connectedAtUtc)}</span></div>`
```
`c.remote` *is* escaped (good), but `shortTime(c.connectedAtUtc)` is **not** — and `shortTime` (line 36) has a `catch { return iso; }` branch that returns the **raw, unescaped `connectedAtUtc` string** verbatim when `new Date(iso)` parsing throws or the value is non-ISO. A malformed/attacker-controlled timestamp string therefore lands unescaped inside `innerHTML`.
- **`detail.js:203`** — `escapeHtml(t.rawHex)` *is* applied (good), and `num(t.decodedValue)` is numeric-safe. So `rawHex` is the one debug field that is handled correctly. (Correcting the brief: `rawHex` is escaped; `direction` and the timestamp path are the live holes.)
- **`detail.js:65`**: ``$('plc-sub').textContent = `${plc.host}:${plc.listenPort}`;`` is **safe** (`textContent`), and `detail.js:14-16` sets `plcName` via `textContent` as well, which is also safe. So the identity header is fine; the holes are specifically the `innerHTML` card/table builders.
Severity rationale: per `docs/Operations/StatusPage.md` the admin endpoint binds `IPAddress.Any` with **no authentication** ("Authentication lives at the network layer"). PLC `name`/`host` and backend-derived strings come off the Modbus wire / `appsettings.json` and are operator- or device-influenceable. A PLC given a crafted name, or a backend that returns a crafted string, can inject `<img src=x onerror=...>` into any operator's browser session on the trusted segment. Read-only UI does not mean low impact: the injected script runs with the operator's origin.
Concrete fix: route **every** dynamic value through `escapeHtml` before it enters an `innerHTML` template, with **no exceptions**:
- `detail.js:194` / `:203`: wrap `t.direction` in `escapeHtml(...)`; coerce `t.address`/`t.width` with `Number(...)` (or escape) rather than trusting the DTO type.
- `detail.js:86`: wrap the `shortTime(...)` result in `escapeHtml(...)`, and/or change `shortTime`'s fallback to `escapeHtml(iso)` / return `'—'`.
- Add a single `escapeAttr` helper (as `dashboard.js` has at line 179) and use it for the `dirCls`/class interpolation at `:197` if `dirCls` ever becomes data-derived (currently it is a literal, so low priority — but `card()`'s `cls` parameter at `detail.js:45` is interpolated into `class="v ${cls||''}"` and is caller-supplied; today all callers pass literals, so it is safe *now* but fragile).
- Better still: stop hand-building HTML. Build rows with `document.createElement` + `textContent`/`dataset`, which makes escaping structural rather than a discipline that one missed call defeats.
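The escaping fixes above can be sketched as follows. This is a minimal illustration, not the actual `detail.js` source: the function bodies, the `debugRowHtml` name, the `dir-write`/`dir-read` class names, and the `'n/a'` fallback are assumptions made for the example. The point is that every dynamic value is either coerced to a number or escaped, and that the timestamp helper never echoes its raw input.

```javascript
// Hypothetical helpers; names follow the review, bodies are illustrative only.
function escapeHtml(value) {
  return String(value ?? '')
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Returns HH:MM:SS (UTC) for a parseable timestamp, otherwise a fixed
// placeholder. The raw input string is never returned to the caller.
function shortTime(iso) {
  const d = new Date(iso);
  return Number.isNaN(d.getTime()) ? 'n/a' : d.toISOString().slice(11, 19);
}

// Builds one debug-table row. Numbers are coerced, strings are escaped, and
// the CSS class is chosen from a fixed set rather than taken from the data.
function debugRowHtml(t) {
  const address = Number(t.address);
  const width = Number(t.width);
  const dirCls = t.direction === 'write' ? 'dir-write' : 'dir-read';
  return `<tr><td>${address}</td><td>${width}-bit</td>` +
    `<td><span class="dir-tag ${dirCls}">${escapeHtml(t.direction)}</span></td></tr>`;
}
```

A non-numeric `address` or `width` renders as `NaN` here rather than as markup, which is ugly but inert.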
## Major findings
**M1. Cold-start retry loop layers a manual `setTimeout` retry on top of `withAutomaticReconnect`, with no bound and a misleading pill.** `dashboard.js:252-263` and the identical `detail.js:236-245`:
```js
async function start() {
try { ...; await connection.start(); await connection.invoke('SubscribeFleet'); ... }
catch { setConn('disconnected','retrying'); setTimeout(start, 3000); }
}
```
`withAutomaticReconnect` only covers a connection that *was* established and then dropped — it does **not** retry the initial `start()`. So the manual loop is needed. But: (a) it retries **forever** every 3 s with no backoff and no cap — if the hub is permanently gone the browser hammers it indefinitely; (b) if `connection.start()` succeeds but `invoke('SubscribeFleet')` throws, the `catch` calls `start()` again on an **already-started** connection, which will throw "cannot start a connection that is not in the Disconnected state" and wedge the retry loop; (c) the pill shows `disconnected`/`retrying` which is reasonable, but there is no terminal "giving up" state and no way for an operator to know the difference between "server down" and "server slow". Fix: separate "start the connection" from "subscribe" so a subscribe failure doesn't restart the socket; add capped exponential backoff; guard `start()` against re-entry when `connection.state !== 'Disconnected'`.
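One possible shape for that fix, as a hedged sketch: `connectAndSubscribe`, `backoffDelay`, the `maxAttempts` default, and the `'gave-up'` status are hypothetical names and values, and `connection` is assumed to expose `state`, `start()`, and `invoke()` the way a SignalR `HubConnection` does. The sleep function is injected so the loop can be exercised without real timers.

```javascript
// Capped exponential backoff: base, 2x base, 4x base, ... never above capMs.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Starts the connection (only if it is stopped) and then subscribes.
// Resolves to true on success and false once maxAttempts is exhausted.
async function connectAndSubscribe(connection, method, opts = {}) {
  const {
    maxAttempts = 8,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    onStatus = () => {},
  } = opts;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // A subscribe failure must not call start() on a live connection,
      // so the socket is only started from the Disconnected state.
      if (connection.state === 'Disconnected') await connection.start();
      await connection.invoke(method);
      onStatus('connected');
      return true;
    } catch {
      onStatus('retrying');
      if (attempt + 1 < maxAttempts) await sleep(backoffDelay(attempt));
    }
  }
  onStatus('gave-up'); // terminal state the pill can render distinctly
  return false;
}
```

With the defaults, the waits run 1 s, 2 s, 4 s, 8 s, 16 s and then stay at the 30 s cap, so a permanently missing hub is no longer polled every 3 s.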
**M2. Detail page never handles "PLC name not in fleet" vs. "hub never delivers".** `detail.js` subscribes with `SubscribePlc(plcName)` where `plcName` is taken from `location.pathname`. If the name is wrong (typo, stale bookmark, or a name that never existed), `StatusHub.SubscribePlc` happily adds the connection to a `plc:{bogus}` group and `_captureRegistry.Arm` is a documented no-op — so the server **never sends a `plc` message** and `onDetail` never fires. The page sits forever on "Waiting for first snapshot…" with a green `connected` pill. There is no timeout, no "unknown PLC" state. The `renderMissing()` path (`detail.js:168`) only triggers when the server *does* push a payload with `detail.plc === null` (PLC removed by hot-reload) — it cannot fire for a name that was never configured because nothing is pushed at all. Fix: after `connected`, start a watchdog (e.g. 3× the push interval); if no `plc` message arrives, show an "unknown or unreachable PLC" notice.
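A watchdog of this kind could look roughly like the following. It is a sketch only: `createWatchdog`, its `arm`/`feed` methods and the injectable `timers` object are hypothetical names, and the timeout value is left to the caller because the client does not know the server's push interval.

```javascript
// Hypothetical helper: calls `onTimeout` once if `feed()` is not called within
// `timeoutMs` of `arm()`. The timer functions are injectable so the logic can
// be exercised without real delays.
function createWatchdog(timeoutMs, onTimeout, timers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle),
}) {
  let handle = null;
  const clear = () => {
    if (handle !== null) {
      timers.clear(handle);
      handle = null;
    }
  };
  return {
    // Call after the subscribe invoke resolves.
    arm() {
      clear();
      handle = timers.set(() => { handle = null; onTimeout(); }, timeoutMs);
    },
    // Call from the `plc` message handler: a payload arrived, so stand down.
    feed: clear,
  };
}
```

On the detail page this would be armed once `SubscribePlc` succeeds and fed from `onDetail`; the timeout callback would show the "unknown or unreachable PLC" notice.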
**M3. `prevPdu`/`rateByName` leak entries for removed PLCs; the rate math itself holds up.** `dashboard.js:65-77` `updateRates` keys `prevPdu`/`rateByName` by `plc.name` and never removes entries. After a hot-reload removes a PLC, its stale rate is not summed into the fleet rate, because `renderAggregates` only iterates `s.plcs`, but the entry stays in the Map, which therefore grows without bound for as long as the page stays open if PLCs churn. The rate computation was checked and is sound: the counters are cumulative since service start, so `cur - prev.forwarded` across a multi-second reconnect gap yields a correct (if coarse) rate, and `performance.now()` is monotonic per page. Net: a low-grade Map leak on PLC churn. Fix: prune `prevPdu`/`rateByName` to the current `snapshot.plcs` name set each cycle.
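The pruning step is small. The sketch below assumes both `prevPdu` and `rateByName` are `Map` instances keyed by `plc.name` and that a snapshot carries a `plcs` array; `pruneToSnapshot` is a hypothetical helper name.

```javascript
// Removes, from each Map, every key that is not a PLC name in the snapshot.
// Assumes the Maps are keyed by `plc.name`, as the review describes.
function pruneToSnapshot(snapshot, ...maps) {
  const live = new Set(snapshot.plcs.map((plc) => plc.name));
  for (const map of maps) {
    for (const name of map.keys()) {
      // Deleting the current key while iterating a Map is well-defined in JS.
      if (!live.has(name)) map.delete(name);
    }
  }
}
```

Calling `pruneToSnapshot(snapshot, prevPdu, rateByName)` at the top of `updateRates` would keep both Maps bounded by the current fleet size.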
**M4. Full-table-body `innerHTML` re-render on every ~1 s push destroys transient UI state.** `dashboard.js:155-173` rebuilds the entire `<tbody>` string and assigns `tbody.innerHTML`. Consequences at the 1 s cadence: any text selection inside the table is lost every second; `:hover` is re-evaluated (minor); the browser re-parses ~54 rows of HTML each tick. It is *not* a flicker source (synchronous replace, no intermediate paint) and 54 rows is cheap, so the cost is acceptable today, but it becomes a problem if columns or rows grow. A keyed diff (update existing `<tr>` cells in place, add/remove only on set change) would also fix the selection-loss annoyance. Flagged as Major because of the selection-loss UX regression on a screen operators stare at; the brief explicitly asked about this.
## Minor findings
**N1. `detail.js` has no `escapeAttr` and `dashboard.js`'s is duplicated.** Both files define their own `escapeHtml` (`dashboard.js:176`, `detail.js:23`) — identical code, copy-pasted. `escapeAttr` exists only in `dashboard.js:179`. Since both files are separately served and there is no shared module, duplication is the pragmatic choice for a no-build-step project, but the *inconsistency* (detail.js lacking `escapeAttr`) is exactly what enabled C1. Recommend a tiny shared `util.js` served from `/assets/`, or at minimum make the two `escapeHtml` definitions and an `escapeAttr` identical and present in both.
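A shared helper file could be as small as the sketch below. The behaviour mirrors what the review describes for the existing helpers (`&`, `<`, `>` for text, plus `"` for attributes); the null handling and the idea of serving it as a single `util.js` are assumptions, and `escapeAttr` is only safe inside double-quoted attributes.

```javascript
// Escapes a value for use as element text inside an innerHTML string.
function escapeHtml(value) {
  return String(value ?? '')
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Escapes a value for use inside a double-quoted attribute value.
function escapeAttr(value) {
  return escapeHtml(value).replace(/"/g, '&quot;');
}
```

With both pages loading the same file, a missing `escapeAttr` on one page stops being possible.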
**N2. Accessibility gaps.**
- The sortable `<th>` elements (`index.html:72-81`) are clickable via a JS `click` handler but are not keyboard-focusable and carry no `role="button"`/`tabindex="0"`/`aria-sort`. A keyboard-only operator cannot sort the table. Add `tabindex="0"`, `aria-sort` reflecting `sorted-asc`/`sorted-desc`, and a `keydown` (Enter/Space) handler.
- Table rows are clickable (`dashboard.js:232`) to open a detail page but are `<tr>` with `cursor:pointer` only — not keyboard-reachable and not announced as interactive. Consider making the PLC-name cell an actual `<a href="/plc/...">` so it is focusable, middle-click/ctrl-click works natively, and screen readers announce it. This would also remove the need for the JS `window.open` handler.
- The connection-state pill (`#conn`) updates visually but has no `aria-live` region, so a screen-reader user is never told the hub dropped. Add `aria-live="polite"` to `#conn` or to `#conn-text`.
- `<input type="search" id="f-search">` has a `placeholder` but no associated `<label>` (visible or `aria-label`). Same for `#f-state`. Add `aria-label`.
**N3. Row click always `window.open(..., '_blank')` — no modifier-key respect, popup-blocker exposure.** `dashboard.js:235` unconditionally calls `window.open`. A plain click should arguably open in the same tab or respect the user's intent; programmatic `window.open` not in direct response to a trusted click on an anchor can be caught by popup blockers in some configs. Tied to N2's suggestion: render the PLC name as a real `<a target="_blank" rel="noopener">` and delete the handler. (`rel="noopener"` also matters — `window.open` without it leaves `window.opener` live; here the opened page is same-origin so impact is low, but it is still best practice.)
**N4. `detail.js` `plcName` parsing is brittle.** `detail.js:10`: `decodeURIComponent(location.pathname.replace(/^\/plc\//, ''))`. If the route is ever served under a path prefix, or the name itself contains an encoded `/`, this misparses. Also if `decodeURIComponent` throws (malformed `%` sequence in the URL) the whole script aborts at line 10 before `connect()` is ever reached — the page is then blank with no error. Wrap in try/catch and fall back to a visible error state.
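One way to make the parse non-fatal is sketched below. `parsePlcName` is a hypothetical helper; it assumes the route stays at `/plc/<name>` with no path prefix, so it covers the malformed-escape and encoded-slash cases but not a future prefix change.

```javascript
// Returns the decoded PLC name, or null when the path is not `/plc/<name>`
// or the name contains a malformed percent-escape.
function parsePlcName(pathname) {
  // One path segment only: an encoded slash (%2F) stays inside the segment.
  const match = /^\/plc\/([^/]+)\/?$/.exec(pathname);
  if (!match) return null;
  try {
    return decodeURIComponent(match[1]);
  } catch {
    return null; // decodeURIComponent throws URIError on input such as "%"
  }
}
```

The caller can then render a visible error state when the result is `null` instead of letting the script abort before `connect()`.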
**N5. `dashboard.js:101` and `:106`: fleet PDU/s shows `—` until the *second* snapshot.** Expected (rate needs two samples) and correctly handled, but the aggregate card shows `—` for the first ~1 s while the table already shows per-row `—` rates. Cosmetic; no fix required, noted for completeness.
**N6. No `console.log`, no hardcoded absolute URLs, no obvious dead code.** All asset/hub URLs are root-relative (`/assets/...`, `/hub/status`, `/plc/...`) — correct, survives any host/port. `card()`'s `extra` parameter and the `cls` third element of card rows are lightly-used but not dead. Clean on this axis.
**N7. `formatUptime` / `formatAge` silently misrender negative or NaN input.** `dashboard.js:199` and `detail.js:28`: if `uptimeSeconds`/`ageSeconds` ever arrive non-numeric, `Math.floor` yields `NaN` and the card shows `NaN`; a negative value (clock skew) passes through `Math.floor` unchanged in sign and renders as a nonsensical negative duration. Low risk given server types; a `Number.isFinite` guard (plus a `< 0` check) returning `'—'` is cheap insurance.
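The guard could look like the sketch below. `formatSeconds` and its `h m s` output format are hypothetical, since the real `formatUptime`/`formatAge` bodies are not quoted in this review; only the guard on the first line is the point.

```javascript
// Hypothetical formatter: renders a duration in seconds, or the placeholder
// for anything that is not a finite, non-negative number.
function formatSeconds(seconds) {
  if (!Number.isFinite(seconds) || seconds < 0) return '—';
  const total = Math.floor(seconds);
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  const s = total % 60;
  return `${h}h ${m}m ${s}s`;
}
```

`Number.isFinite` does not coerce, so a string such as `'12'` is also rejected rather than silently parsed.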
## What looks good
- **Reconnect → re-subscribe is correct.** `dashboard.js:249` and `detail.js:233` both re-`invoke('SubscribeFleet'/'SubscribePlc')` inside `onreconnected`. This is essential and easy to forget: SignalR auto-reconnect gives a new `ConnectionId`, server-side group membership does **not** survive, and the detail page's tag capture is armed per-connection — without the re-subscribe the capture would silently disarm on every transient drop. Verified against `StatusHub.SubscribePlc`/`OnDisconnectedAsync` and `StatusBroadcaster` group targeting — the lifecycle is sound.
- **`onclose` correctly does not re-subscribe** — it just sets the pill; `withAutomaticReconnect`'s own loop owns recovery. No double-retry on the warm path.
- **Single `HubConnection` per page, closed implicitly on navigation.** Opening a detail page in a new tab (`window.open`) means each tab owns exactly one hub connection; closing the tab fires `OnDisconnectedAsync` server-side which disarms the capture. No connection leak across navigation.
- **The fleet table escapes correctly.** `dashboard.js:161-163` routes `plc.name` through `escapeAttr` for the `data-name` attribute and `escapeHtml` for cell text; `plc.host` through `escapeHtml`; `plc.listener.lastBindError` through `escapeAttr` for the `title=` attribute (line 160). This is the right discipline; only `detail.js` fails to match it (C1).
- **`escapeHtml` is a correct minimal implementation** — `&`, `<`, `>` cover element-text contexts; `escapeAttr` adds `"`. Order matters (`&` first) and is correct.
- **Rate computation is robust.** `dashboard.js:70` guards `now > prev.t`, `Math.max(0, ...)` clamps a counter reset/reconnect to a non-negative rate, and `performance.now()` (monotonic) is the right clock for deltas — not `Date.now()`.
- **Filter is correct and cheap.** `visiblePlcs` (`dashboard.js:130`) filters a `.slice()` copy (never mutates the snapshot), search is case-folded once, and the sort has a stable `localeCompare` tiebreaker by name. At 54 rows the filter+sort+render per keystroke is sub-millisecond. It correctly survives live updates because `render()` always re-derives from `latest` + current `filter` state.
- **`renderMissing()` hot-reload path.** The detail page genuinely handles a PLC disappearing mid-session (`detail.js:168`) — `notice` shown, cards cleared and `hidden`. Good defensive UX for the hot-reload scenario (the gap is the *never-existed* name, see M2).
- **No CDN dependencies** — every `<script>`/`<link>` is `/assets/...`, consistent with the firewalled-network design goal in `StatusPage.md`.
- **CSS is clean**: design tokens via custom properties, `font-display: swap` on all `@font-face`, responsive `agg-grid` breakpoints; no `prefers-*` media queries, but no egregious issues. `tr.stale` / `tr.no-traffic` styling gives the debug view real legibility. No `!important` abuse beyond two justified `.empty-row` overrides.
## Open questions
1. **C1 exploitability depends on whether `t.direction` and `connectedAtUtc` can actually carry attacker bytes.** `direction` is server-derived from the FC, so today it is effectively a closed set — but it is typed `string` on the wire and `detail.js` trusts it. `connectedAtUtc` is a `DateTimeOffset` serialized server-side, so it too is well-formed *today*. The finding stands because the frontend must not depend on server-side type discipline for its own XSS safety — but if the threat model formally excludes a compromised service, C1 could be re-rated Major. Recommend fixing regardless: the cost is three `escapeHtml` calls.
2. Does `StatusBroadcaster` ever push a `plc` payload for a PLC that has *zero* configured BCD tags? `detail.js:183` handles `debug.tags` empty → "No BCD tags configured". Confirmed handled; noted only to flag that the empty-state is covered.
3. Should the detail page cap how long it waits before declaring the PLC unknown (M2)? The server has no "unknown PLC" rejection in `SubscribePlc` — it silently accepts any name. A client-side watchdog is the only place this can be surfaced without a server change.
@@ -0,0 +1,52 @@
# mbproxy SignalR Web Dashboard — Code Review Overview
Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard") — ~3,500 lines across 49 files. Reviewed in four subsystem passes:
- [`AdminSignalR.md`](AdminSignalR.md) — `src/Mbproxy/Admin/*` (host, hub, broadcaster, push sink, subscription tracker, snapshot builder, DTOs).
- [`TagCapture.md`](TagCapture.md) — `src/Mbproxy/Proxy/{TagValueCapture,TagCaptureRegistry}.cs` + pipeline/reconciler/worker integration.
- [`Frontend.md`](Frontend.md) — `src/Mbproxy/Admin/wwwroot/*` (hand-written HTML/CSS/JS only; vendored assets excluded).
- [`TestsAndConfig.md`](TestsAndConfig.md) — new/changed tests, `MbproxyOptions`/`ReloadValidator`, csproj `EmbeddedResource`, smoke config.
## Verdict
The dashboard is well-architected at the macro level — the `IStatusPushSink` testability seam, the lock-free `TagValueCapture` data structure, the broadcaster's lifecycle binding to the Kestrel app, and the embedded-asset model are all sound. But the review surfaced **one security bug and a cluster of concurrency bugs that share a single root cause**, plus a feature-correctness gap. None of these should ship to operators as-is.
The standout pattern: **two independent reviewers (`AdminSignalR.md`, `TagCapture.md`) converged on the same root cause**: capture arm/disarm state is not authoritatively owned. `TagValueCapture.IsArmed` is carried on the transient capture instance and counted by `PlcSubscriptionTracker` keyed on SignalR `ConnectionId`. That single design choice produces C2, M1, and M2 below (M3 is the consolidated fix). Fix it once and three findings collapse.
## Cross-cutting critical findings
**C1 — Stored XSS on the connection-detail page.** (`Frontend.md` C1) `detail.js` interpolates `t.direction` raw into `.innerHTML` (`detail.js:194/203`), and the client-list builder escapes `c.remote` but not the `shortTime(c.connectedAtUtc)` result — whose `catch` branch returns the raw timestamp string verbatim (`detail.js:86`). The admin endpoint binds `IPAddress.Any` with **no authentication**, and the injected strings are device-/config-influenceable. A crafted value executes script in any operator's browser on the trusted segment. Fleet table (`dashboard.js`) escapes correctly — only the detail page breaks discipline. **Fix:** three `escapeHtml` calls, or switch the row builders to `createElement`/`textContent`.
**C2 — Capture armed forever after a SignalR reconnect.** (`AdminSignalR.md` C1) `PlcSubscriptionTracker` keys subscriber counts on `ConnectionId`, which changes on every transport reconnect. `OnDisconnectedAsync` for the old connection is unordered relative to the new connection's `SubscribePlc`, so a reconnect-during-view leaks the count and leaves a PLC's `TagValueCapture` **armed with no viewer for the life of the process**. The documented invariant "zero hot-path cost when nobody is watching" becomes false after any reconnect — every backend read and FC06/FC16 write then pays a `FrozenDictionary` lookup + `TagValueObservation` allocation, fleet-wide. `DisarmAll()` only fires on admin shutdown/port hot-reload, so the leak is never bounded in steady state.
**C3 — Cache hits never reach `Record()`.** (`TagCapture.md` C1) The `ctx.Capture?.Record(...)` calls live only in `BcdPduPipeline.ProcessResponse`, but the Phase-11 response-cache hit path (`PlcMultiplexer.cs:823-828`) returns cached post-rewrite bytes without invoking the pipeline. For any BCD tag with `CacheTtlMs > 0`, once the cache is warm the debug view **freezes at the last cache-miss observation while `AgeSeconds` climbs** — actively misleading an operator into thinking a live tag is dead. Feature-correctness, not a crash; caching is OFF by default so the default deployment is unaffected.
## Major findings (consolidated)
- **M1 — Non-atomic `SubscribePlc`.** (`AdminSignalR.md` C2) `AddToGroupAsync` then `_tracker.Add` span two awaits; a same-connection `OnDisconnectedAsync` can interleave and arm a capture on an already-gone connection. Same root cause as C2.
- **M2 — `GetOrCreate` lost-update race.** (`TagCapture.md` M2) The `AddOrUpdate` delegate reads `existing.IsArmed`, but a concurrent detail-page open can land its `Arm` on the about-to-be-discarded instance, publishing the rebuilt capture disarmed under an open page; a `Disarm` lost the same way leaks the capture armed with no viewer.
- **M3 — Recommended fix for C2/M1/M2:** make `PlcSubscriptionTracker`'s subscriber count the *single authority* for arm state — key it on a stable per-tab identifier (or count distinct viewers, not connections), and derive `IsArmed` from the count rather than carrying it on the transient capture.
- **M4 — `StatusBroadcaster.LoopAsync` has zero coverage.** (`TestsAndConfig.md`) All four broadcaster tests call `PushOnceAsync` directly; the production push loop, its interval hot-reload re-read, the `Math.Max(100,…)` floor, and cancellation are unverified. The `/hub/status` endpoint is never exercised end-to-end.
- **M5 — `DebugJsonContext` is dead code; SignalR serializes via reflection `System.Text.Json`.** (`AdminSignalR.md`) A latent AOT trap; the camelCase wire-shape guarantee has no test.
- **M6 — Bind failure after `_broadcaster.Start()` leaks the broadcaster loop and a bound listener** — the catch block's cleanup is incomplete. (`AdminSignalR.md`)
- **M7 — Untested arm/disarm race in `TagValueCapture`.** (`TestsAndConfig.md`) `Disarm()` flips `_armed=false` then clears slots; a `Record()` that wins the `_armed` check before `Disarm` runs leaves a stale observation on a disarmed capture. The torn-read test only races `Record` vs `Snapshot`, never vs `Disarm`.
- **M8 — Frontend cold-start retry loop** layers an unbounded `setTimeout` retry over `withAutomaticReconnect` and can wedge if `start()` succeeds but `invoke()` fails. (`Frontend.md` M1)
- **M9 — Detail page never handles an unknown PLC name** — sits forever on "Waiting for first snapshot…" with a green pill. (`Frontend.md` M2)
## Recommended remediation order
1. **C1 (XSS)** — smallest fix, highest severity, ships in any operator-facing build. Do first.
2. **M3** — re-root capture arm/disarm authority in `PlcSubscriptionTracker`; closes C2, M1, M2, M7 together. Add a concurrency test for the tracker (currently has none).
3. **C3** — add a `Record()` call on the cache-hit path in `PlcMultiplexer`, or document the debug view as cache-blind. Decide explicitly.
4. **M4** — add an end-to-end `/hub/status` test (real `HubConnection`, assert a `fleet`/`plc` message and its camelCase shape — also closes the M5 gap) and a `LoopAsync` interval/cancellation test.
5. **M6, M8, M9** and the Minor findings in each subsystem file.
## What looks good
- `IStatusPushSink` is a genuine, well-placed testability seam.
- `TagValueCapture` itself — lock-free, torn-read-safe via immutable records + `Volatile.Write`/`Read`, `FrozenDictionary` address map — is correct. The weakness is *who arms it*, not the structure.
- Broadcaster per-cycle error isolation and lifecycle binding to the Kestrel app (no leak across port hot-reloads).
- Fleet table (`dashboard.js`) escapes all dynamic content; reconnect→re-subscribe is correctly wired in both JS files; no CDN deps, no stray `console.log`.
- `StatusHtmlRenderer` removed cleanly — no dangling source or test references.
- csproj `EmbeddedResource` glob is correct (Worker SDK has no competing web default globs).
- `AdminPushIntervalMs` validation matches house style across both validators.
@@ -0,0 +1,70 @@
# Tag-Value Capture Review
Scope: commit `e719dd5` ("replace status page with a live SignalR web dashboard"), restricted to the on-demand tag-value capture feature:
`src/Mbproxy/Proxy/TagValueCapture.cs`, `src/Mbproxy/Proxy/TagCaptureRegistry.cs`, the `PerPlcContext.cs` / `BcdPduPipeline.cs` / `ProxyWorker.cs` / `ConfigReconciler.cs` / `HostingExtensions.cs` deltas. Cross-checked against `mbproxy/CLAUDE.md` (design intent: capture armed only while a detail page is open; disarmed hot-path cost = one nullable-deref + one volatile read; torn-read safety via immutable records swapped with `Volatile.Write`) and the surrounding admin layer (`StatusHub`, `PlcSubscriptionTracker`, `StatusBroadcaster`, `AdminEndpointHost`, `StatusSnapshotBuilder`, `PlcMultiplexer`).
## Summary
- The disarmed hot path is genuinely near-zero: one `?.` null check plus one volatile-bool read. No allocations, no dictionary lookup, no lock when disarmed — the design contract holds.
- Torn-read safety is correct: `TagValueObservation` is an immutable `record`, slots are reference-typed and only ever swapped via `Volatile.Write` / read via `Volatile.Read`. `FrozenDictionary` is built once in the constructor and never mutated. No defect here.
- The single material correctness gap is **feature, not crash**: with the Phase-11 response cache enabled, an FC03/FC04 **cache hit bypasses the pipeline entirely**, so `Record` never fires for cached reads — the debug view silently freezes for any cacheable tag while caching is on.
- One real lifecycle leak: a `TagCaptureRegistry` reseat/restart on a PLC that is **not** currently being viewed still rebuilds the capture, and `GetOrCreate`'s armed-flag preservation has a benign-but-real race against `Arm`/`Disarm`.
- `ProcessFc06Response` does not `Record`, which is defensible but leaves the write path slightly asymmetric — noted as Minor.
## Critical Findings
### C1. Response-cache hits never reach `Record` — the debug view freezes for cached tags
`PlcMultiplexer.cs:817-828` vs `BcdPduPipeline.cs:408,437`. The capture `Record` calls live exclusively inside `BcdPduPipeline.ProcessResponse`, which only runs when a backend response is rewritten (`PlcMultiplexer.cs:618`). The Phase-11 cache-hit path at `PlcMultiplexer.cs:823-828` builds the upstream frame straight from `cached.PduBytes` via `BuildCacheHitFrame` and returns — **the pipeline is never invoked, so `ctx.Capture?.Record(...)` never fires**. Consequence: for any BCD tag with `CacheTtlMs > 0`, once the cache is warm the connection-detail debug view shows a value that is frozen at the last *cache-miss* observation and ages indefinitely (`AgeSeconds` keeps climbing, `UpdatedAtUtc` never advances), even though clients are reading the tag every poll cycle. An operator using the debug view to confirm "is this tag live?" is actively misled — the tag *is* live, the proxy just isn't recording the cache-served reads.
This is a feature-correctness defect, not a crash, but it directly defeats the stated purpose of the debug view ("real-time debug view of raw PLC-side BCD vs. decoded client-side values"). Note caching is OFF by default, so the default deployment is unaffected — hence Critical-for-the-feature rather than Critical-for-the-service.
Fix: record on the cache-hit path too. The cleanest option is, inside the `responseCache.TryGet` hit branch, to decode the cached PDU against the request range and call `ctx.Capture?.Record(...)` — but the cache stores **post-rewrite** bytes (binary, already decoded), so the raw BCD nibbles are no longer available there. Better: store the `TagValueObservation`(s) alongside the cache entry at `Set` time (line 658) and re-publish them into the capture on a hit; or have the cache entry retain the pre-rewrite raw words. At minimum, document in `docs/Architecture/ResponseCache.md` and the debug-view docs that the debug view does not reflect cache-served reads when caching is enabled, and surface "served from cache" in the UI so the stale age is not mistaken for a dead tag.
## Major Findings
### M1. `GetOrCreate` rebuilds (and re-arms) the capture for PLCs nobody is viewing
`TagCaptureRegistry.cs:29-39`, `ConfigReconciler.cs:308,364,400`. On every Reseat/Restart/Add, `ConfigReconciler` calls `GetOrCreate`, whose `AddOrUpdate` update-delegate **always constructs a brand-new `TagValueCapture`** and copies `existing.IsArmed` onto it. For the overwhelmingly common case — a PLC with no open detail page — this allocates a fresh capture (arrays + `FrozenDictionary`) on every hot-reload for every reconciled PLC, throwing away an identically-shaped object. `FrozenDictionary` construction is not free; doing 54 of them on a tag-list reload is wasteful churn. Functionally harmless, but it contradicts the "on-demand, cheap when no viewer" spirit. Fix: when `!existing.IsArmed` **and** the resolved tag set is unchanged, return `existing` unchanged. A cheap tag-set equality check (ordered address+width sequence) avoids the rebuild for the no-op-reload case entirely.
### M2. Armed-flag preservation in `GetOrCreate` races `Arm`/`Disarm`
`TagCaptureRegistry.cs:33-38`. The update delegate reads `existing.IsArmed`, builds `rebuilt`, conditionally `rebuilt.Arm()`s, and returns it. `ConcurrentDictionary.AddOrUpdate` may invoke this delegate **more than once** under contention, and, more importantly, there is no synchronization between this delegate and a concurrent `StatusHub.SubscribePlc` → `Registry.Arm` / `OnDisconnectedAsync` → `Registry.Disarm`. Interleavings that lose an arm/disarm:
- A viewer opens the detail page (`Arm`) *after* the delegate reads `existing.IsArmed == false` but *before* `AddOrUpdate` swaps `rebuilt` in. The `Arm` lands on `existing`, which is then discarded — the new `rebuilt` is published **disarmed**. The detail page stays open but capture is silently off until the next subscribe event (there is none — subscription already completed).
- Symmetrically, a `Disarm` racing the rebuild can be lost, leaving a capture armed with no viewer (a slow leak — see M3).
This is a genuine lost-update race. The window is small (config reload concurrent with a detail-page open) but real. Fix: serialize capture arm-state transitions — e.g. funnel `Arm`/`Disarm`/`GetOrCreate` through the `PlcSubscriptionTracker`'s authoritative subscriber count: after a reseat, `GetOrCreate` should set the rebuilt capture's armed state from `tracker.ActivePlcs()` (the source of truth) rather than from the soon-to-be-discarded `existing` object. That makes the tracker — not a transient capture instance — the single owner of arm state.
### M3. A capture can stay armed forever if the last viewer's disconnect cleanup is lost
`StatusHub.cs:60-66`, `PlcSubscriptionTracker.cs:50-73`, `TagCaptureRegistry.cs:56-60`. Disarm happens only on three paths: `OnDisconnectedAsync`, `StatusBroadcaster.StopAsync` → `DisarmAll`, and (indirectly) a hot-reload `Remove`. `OnDisconnectedAsync` is best-effort in SignalR: for an abruptly-killed browser tab it fires on the **server transport-timeout** (default ~30 s for WebSockets, longer for long-polling), which is acceptable. But two narrower holes remain:
- If `Remove` is called for a hot-reload-removed PLC (`ConfigReconciler.cs:259`) while a detail page is still open, the capture object is dropped from the registry but the `PlcSubscriptionTracker` still holds the connection→PLC subscription. The eventual `OnDisconnectedAsync` calls `Registry.Disarm(plcName)` which is now a no-op (PLC unknown) — fine — but the subscriber count for the removed PLC is never reconciled, and if that PLC name is later re-added by another hot-reload, `GetOrCreate` creates a *disarmed* capture even though a stale subscription still nominally exists. Minor in practice.
- Combined with M2: a `Disarm` lost to the rebuild race leaves a capture armed with no viewer until the next `DisarmAll` (admin port hot-reload or shutdown). That capture keeps doing the full `Record` work (volatile write + `FrozenDictionary` lookup + record allocation) on the proxy hot path for every BCD PDU, indefinitely. Low-frequency trigger, unbounded duration. The fix for M2 closes this.
There is no periodic reconciliation of "armed captures vs. tracker subscriber counts." Consider a guard: `StatusBroadcaster.PushOnceAsync` already enumerates `tracker.ActivePlcs()` every tick — it could cheaply assert/repair that exactly those PLCs (and no others) are armed, turning any lost arm/disarm into a self-healing condition within one push interval.
## Minor Findings
- **`BcdPduPipeline.cs:448-479`: `ProcessFc06Response` does not `Record`.** FC06 write is captured on the *request* path (`ProcessFc06Request`, line 144) with the client's binary `value` and the `encoded` BCD it sent to the PLC. The FC06 *response* decodes the PLC's BCD echo back to binary but does not record. This is defensible (the request already captured the write, and recording the echo would just re-stamp `UpdatedAtUtc`), and the inline comment correctly notes the counter is intentionally not double-incremented. But it is asymmetric with FC03/FC04, which record on the response. Leave as-is, or add a one-line comment in `ProcessFc06Response` stating capture is intentionally request-side only for FC06; otherwise a future reader will "fix" the perceived omission.
- **`BcdPduPipeline.cs:251` — FC16 32-bit `Record` passes `binaryValue` reconstructed as `clientHigh * 10_000 + clientLow`.** This is the base-10000 CDAB reconstruction used for encoding, consistent with the FC03/FC04 read path (`Decode32`) and with `DebugDto.ToTagDto`'s `0x{RawHigh:X4}{RawLow:X4}` rendering. Correct — but note the `DecodedValue` shown for a 32-bit tag is the base-10000 composed integer, not a true binary 32-bit value. That matches the rest of the proxy's 32-bit BCD model; just confirm the UI labels it consistently. No code change needed.
- **`TagValueCapture.cs:136-146`: `Snapshot()` allocates a fresh `TagValueObservation` for every empty slot on every call.** `StatusBroadcaster` calls `BuildDebug` → `Snapshot` once per push interval per *viewed* PLC, so this is bounded and cheap (only viewed PLCs, low cadence). Not worth fixing, but the placeholder records for never-seen tags could be cached once at construction since they are constant. Optional micro-optimization.
- **`TagValueCapture.cs:68-74` — constructor de-dups tags by `Address` via `GroupBy(...).Select(g => g.First())`.** If the resolved `BcdTagMap` ever contains two tags at the same address with different `Width` (it should not — the map resolver should reject that), the capture silently keeps the first and the debug view width could disagree with what the rewriter actually does. Low risk given upstream validation, but a defensive assert or a comment that the map is already de-duplicated would help.
- **`TagCaptureRegistry.cs:45-46`: `TryGet`'s `out` uses `capture!` to suppress nullability.** Fine: callers (`StatusSnapshotBuilder.BuildDebug:46`) correctly branch on the bool first. No issue; noted only for completeness.
- **`PerPlcContext.cs` clone — `WithCurrentRequest` copies `Capture` by reference.** Correct: the capture is per-PLC and shared across all per-call context clones, which is exactly what's wanted (all concurrent responses for one PLC record into the same capture). Confirmed not a bug.
## What Looks Good
- **Disarmed hot-path cost meets the contract.** `Record` is `if (!_armed) return;`, and `_armed` is a `volatile bool`, so it is one volatile read; reaching `Record` at all is one `?.` null check on `ctx.Capture`. No allocation, no dictionary lookup, no lock on the disarmed path. The CLAUDE.md claim ("one nullable-deref + one volatile read when disarmed") is accurate.
- **Torn-read safety is correctly implemented.** `TagValueObservation` is a `sealed record` with init-only positional members — genuinely immutable. Slots are `TagValueObservation?[]`, mutated only via `Volatile.Write` and read only via `Volatile.Read`. Reference assignment is atomic on all .NET-supported architectures, and the record being immutable means a reader either sees the old reference or the fully-constructed new one — never a torn object. `Disarm` clears slots with `Volatile.Write(..., null)`. No lock needed and none taken.
- **`FrozenDictionary` usage is correct.** Built once in the constructor from a plain `Dictionary`, never mutated afterward, only read on the hot path — exactly the intended `FrozenDictionary` use case (build-once, read-many, allocation-free lookup).
- **`Snapshot()` always returns one row per configured tag**, substituting a placeholder (`UpdatedAtUtc = null`, zero values) for never-seen slots, so the debug view renders a stable row set — good UX decision, and `DebugDto.ToTagDto` honours it (`HasValue`, `RawHex = "—"`).
- **`PlcSubscriptionTracker` is clean.** Single-lock, low-frequency, reference-counted; `Add` returns "first subscriber" and `RemoveConnection` returns "PLCs whose count hit 0" — exactly the arm/disarm edges. The lock is appropriate for the low churn.
- **`StatusBroadcaster.StopAsync` calls `DisarmAll`** explicitly, and `AdminEndpointHost.StopCurrentAppAsync` disposes the broadcaster *before* stopping Kestrel — so an AdminPort hot-reload that tears down the SignalR host without firing per-connection `OnDisconnectedAsync` still disarms every capture. This is the deliberate backstop for the "browser tab killed / host torn down" case and it is wired correctly.
- **`ConfigReconciler.Remove` is called on the PLC-removed path** (`ConfigReconciler.cs:259`) before the supervisor is stopped, so a hot-reload-removed PLC does not leak its capture in the registry.
- **DI wiring is correct.** `TagCaptureRegistry` is a singleton in the outer container (`HostingExtensions.cs`), and `AdminEndpointHost.StartAppAsync:197` re-registers the *same instance* into the inner Kestrel container so `StatusHub` shares it — not a second copy. Verified.
## Open Questions
1. C1: is the debug view *expected* to reflect cache-served reads? If the product decision is "debug view shows wire traffic only," then C1 downgrades to a documentation gap — but the UI must then clearly distinguish "cached, no recent wire read" from "tag dead." If the decision is "debug view shows what the client sees," C1 is a real defect and the cache path must record.
2. M2/M3: should arm state be owned by `PlcSubscriptionTracker` (the authoritative subscriber count) rather than carried on the transient `TagValueCapture` instance and copied across rebuilds? That single change removes the lost-update race and makes `GetOrCreate` stateless w.r.t. arm state.
3. Is there value in `StatusBroadcaster` self-healing arm state each tick from `ActivePlcs()`? It already enumerates that set; the reconcile is nearly free and converts any lost arm/disarm into a one-interval transient.
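Question 3 can be made concrete with a small sketch. It assumes, hypothetically, that arm state is visible as a name-to-flag map and that the active set is what `ActivePlcs()` returns; `ArmReconciler` and its signature are illustrative and do not exist in the codebase.

```csharp
using System;
using System.Collections.Generic;

public static class ArmReconciler
{
    // armed: PLC name -> current arm flag (hypothetical view of the capture registry).
    // activePlcs: PLCs that currently have at least one subscriber.
    // Returns how many captures had to be corrected on this tick.
    public static int Reconcile(IDictionary<string, bool> armed, IReadOnlySet<string> activePlcs)
    {
        var corrected = 0;
        foreach (var plc in new List<string>(armed.Keys)) // copy keys: values are edited in the loop
        {
            var shouldBeArmed = activePlcs.Contains(plc);
            if (armed[plc] != shouldBeArmed)
            {
                armed[plc] = shouldBeArmed;
                corrected++;
            }
        }
        return corrected;
    }

    public static void Main()
    {
        var armed = new Dictionary<string, bool> { ["line-a"] = false, ["line-b"] = true };
        var active = new HashSet<string> { "line-a" };
        Console.WriteLine(Reconcile(armed, active)); // 2: line-a armed, line-b disarmed
        Console.WriteLine(Reconcile(armed, active)); // 0: already consistent
    }
}
```

Run once per push interval, a reconcile of this shape would bound any lost arm or disarm to a single interval.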
@@ -0,0 +1,285 @@
# Code Review — Tests & Config for the SignalR Dashboard (commit `e719dd5`)
Scope: the test files, config/options/build changes, and the smoke config introduced or
modified by `e719dd5` ("replace status page with a live SignalR web dashboard"). Production
SignalR/capture code (`StatusHub`, `StatusBroadcaster`, `TagValueCapture`,
`TagCaptureRegistry`, `AdminEndpointHost`) was read for context but is reviewed only insofar
as it tells us whether the new tests actually exercise the risky paths. The wider UI
(`dashboard.js`, `detail.js`, HTML/CSS) is explicitly out of scope.
## Summary
The new tests are competently written, follow the existing `Subject_Condition_Expectation`
naming, and the `IStatusPushSink` seam is a genuinely good design that makes the broadcaster
unit-testable without a SignalR host. The `TagValueCapture` torn-read test is a real
concurrency test, not a pretend one. **But the concurrency-sensitive paths the commit message
advertises — "armed only while a detail page is open", arm/disarm under churn, the broadcaster
*loop* — are tested only at the single-threaded happy-path level.** The most serious gaps:
1. The **`StatusBroadcaster.LoopAsync` background loop has zero test coverage** — every
broadcaster test calls the internal `PushOnceAsync` directly. The loop's interval
re-read, the `Math.Max(100, ...)` floor, the cancellation path, and the
loop-terminated-unexpectedly catch are all unverified.
2. The **arm/disarm-vs-record race in `TagValueCapture` is real and untested.** `Disarm()`
sets `_armed = false` *then* clears slots; `Record()` checks `_armed` *then* writes. A
`Record` that passes the `_armed` check before `Disarm` flips it can write a slot *after*
`Disarm` cleared it — leaving a stale observation on a disarmed capture. The torn-read
test only races `Record` against `Snapshot`, never against `Disarm`/`Arm`.
3. The **`PlcSubscriptionTracker` — the lock-guarded type whose entire job is concurrent
subscriber counting — has no direct test at all** and no concurrent test anywhere. Its
behaviour is only incidentally exercised through single-threaded `StatusHubTests`.
None are release-blockers for a read-only admin page on a trusted segment, but #2 is a latent
correctness bug and #1/#3 mean a regression in the live-feed plumbing would ship silently.
## Findings
### Critical
None. This is a read-only admin surface; nothing here can corrupt the Modbus path.
### Major
**M1 — `StatusBroadcaster.LoopAsync` has no test.**
`src/Mbproxy/Admin/StatusBroadcaster.cs:113-135`; `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs`.
All four broadcaster tests drive `PushOnceAsync` directly. The actual background loop —
which is what runs in production — is untested. Specifically uncovered:
- the per-cycle `_options.CurrentValue.AdminPushIntervalMs` re-read (the documented
hot-reload-without-restart behaviour);
- the `Math.Max(100, ...)` floor that defends against a bad interval slipping past
validation (note: validation rejects `<= 0`, so the floor only ever matters for values
`1..99` — itself an untested corner);
- that `StartAsync` → `StopAsync` actually terminates the loop and that `StopAsync` is safe
to call when the loop never started.
Fix: add a test that constructs the broadcaster with `AdminPushIntervalMs` ~100ms, calls
`Start()`, polls `Sink.FleetPushes.Count` until `>= 2` with a generous deadline + the test
`CancellationToken`, then `StopAsync()` and asserts the count stops growing. Keep timing
hedged (`ShouldBeGreaterThanOrEqualTo`), consistent with the existing coalescing tests.
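To make the uncovered behaviour concrete, here is a minimal sketch of a loop with a per-cycle interval re-read and the `Math.Max(100, ...)` floor. `PushLoopSketch` and its members are hypothetical; only the `> 0` validation rule and the floor come from the finding, and the real `LoopAsync` may be structured differently.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not the real StatusBroadcaster.
public static class PushLoopSketch
{
    // Mirrors the validation rule described in the finding: reject <= 0.
    public static bool PassesValidation(int configuredMs) => configuredMs > 0;

    // The floor only changes values that pass validation but are below 100 (1..99).
    public static int EffectiveIntervalMs(int configuredMs) => Math.Max(100, configuredMs);

    // Re-reads the interval on every cycle, so a changed value applies without a restart.
    public static async Task RunAsync(Func<int> readIntervalMs, Func<Task> pushOnce, CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                await pushOnce();
                await Task.Delay(EffectiveIntervalMs(readIntervalMs()), ct);
            }
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown path: the delay was cancelled.
        }
    }

    public static void Main()
    {
        foreach (var ms in new[] { 1, 99, 100, 250 })
            Console.WriteLine($"{ms} -> {EffectiveIntervalMs(ms)}");
    }
}
```

A loop test along these lines can count pushes and cancel from inside `pushOnce`, which avoids asserting on wall-clock timing.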
**M2 — Arm/disarm-vs-record race in `TagValueCapture` is real and untested.**
`src/Mbproxy/Proxy/TagValueCapture.cs:103-129`. `Disarm()` does `_armed = false` then clears
slots; `Record()` does `if (!_armed) return;` then `Volatile.Write` the slot. Interleaving:
`Record` reads `_armed == true` → `Disarm` sets `false` and clears all slots → `Record`
writes its observation. Result: a *disarmed* capture holds a non-null slot, and the next
`Snapshot()` reports stale traffic with `UpdatedAtUtc != null` — exactly the "reopened page
shows stale data" failure the on-demand design says it prevents. The hot path makes this
unlikely (record traffic stops when no viewer is attached) but it is not impossible: a
detail page closing while an in-flight FC03 response is being rewritten hits it.
`ConcurrentRecordAndSnapshot_NeverYieldsTornSlot` (line 118) races `Record` vs `Snapshot`
only, never vs `Disarm`. Fix in production: re-check `_armed` inside `Record` after the slot
write and null the slot if the capture was disarmed in between. The write and the re-check
need a full fence between them (e.g. publish the slot with `Interlocked.Exchange`), because a
`Volatile.Write` followed by a `Volatile.Read` can still be reordered. Having `Disarm()` clear
the slots a second time only narrows the window, since a late `Record` can still land after
the second clear. At minimum add a test that hammers `Arm`/`Disarm` on one task while `Record`
runs on another and asserts a disarmed capture's `Snapshot()` is all-null. The capture should
not be described as tested while the only concurrency test never exercises the racy interleaving.
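The interleaving in M2 can be reproduced deterministically with a hook between the check and the write. The sketch below is an assumption-laden illustration: `MiniCapture`, its single slot, and the `betweenCheckAndWrite` hook are hypothetical and are not the real `TagValueCapture` API. It contrasts check-then-write with one possible re-check variant, and it does not address a disarm followed by an immediate re-arm, which would need something like a generation counter.

```csharp
#nullable enable
using System;
using System.Threading;

public sealed class MiniCapture
{
    private bool _armed;
    private object? _slot;

    public void Arm() => Volatile.Write(ref _armed, true);

    public void Disarm()
    {
        Volatile.Write(ref _armed, false);
        Volatile.Write(ref _slot, null);
    }

    public object? Snapshot() => Volatile.Read(ref _slot);

    // Check-then-write, as the finding describes. The hook is test-only and lets a
    // caller force Disarm to run between the check and the write.
    public void RecordUnchecked(object obs, Action? betweenCheckAndWrite = null)
    {
        if (!Volatile.Read(ref _armed)) return;
        betweenCheckAndWrite?.Invoke();
        Volatile.Write(ref _slot, obs);
    }

    // Same, plus a re-check after publishing.
    public void RecordRechecked(object obs, Action? betweenCheckAndWrite = null)
    {
        if (!Volatile.Read(ref _armed)) return;
        betweenCheckAndWrite?.Invoke();
        Interlocked.Exchange(ref _slot, obs);                  // full fence before the re-check
        if (!Volatile.Read(ref _armed))
            Interlocked.CompareExchange(ref _slot, null, obs); // withdraw only our own observation
    }
}

public static class MiniCaptureDemo
{
    public static void Main()
    {
        var racy = new MiniCapture();
        racy.Arm();
        racy.RecordUnchecked("late", racy.Disarm);
        Console.WriteLine(racy.Snapshot() is null); // False: stale slot on a disarmed capture

        var guarded = new MiniCapture();
        guarded.Arm();
        guarded.RecordRechecked("late", guarded.Disarm);
        Console.WriteLine(guarded.Snapshot() is null); // True
    }
}
```

The hook-based reproduction complements, rather than replaces, a multi-task hammer test: it pins the exact interleaving, while the hammer test looks for interleavings nobody thought of.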
**M3 — `PlcSubscriptionTracker` has no direct or concurrent test.**
`src/Mbproxy/Admin/PlcSubscriptionTracker.cs` (whole file); no `PlcSubscriptionTrackerTests.cs`
exists. This is the single lock-guarded type whose correctness drives capture arm/disarm.
`StatusHubTests` exercises it only single-threaded and indirectly. Untested behaviour that
has no coverage anywhere:
- the redundant re-subscribe path (`Add` returns `false` when the same connection
re-subscribes the same PLC — `set.Add` fails);
- `RemoveConnection` for an unknown connection id returning `Array.Empty<string>()`;
- a connection subscribed to *multiple* PLCs being torn down in one `RemoveConnection`;
- concurrent `Add`/`RemoveConnection` from many "connections" — the lock is claimed
thread-safe but nothing proves the count never goes negative or leaks.
Fix: add `PlcSubscriptionTrackerTests` with the single-threaded cases plus one
parallel-stress test (N tasks each Add-then-Remove, assert `ActivePlcs()` is empty and no
exception), mirroring `TxIdAllocatorTests`' concurrency style.
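A sketch of the cases listed above, written against a stand-in tracker so it is self-contained. `MiniTracker` is hypothetical and only mirrors the contract described in this review (`Add` returns the first-subscriber edge, `RemoveConnection` returns the PLCs whose count hit zero); a real test would target `PlcSubscriptionTracker` directly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public sealed class MiniTracker
{
    private readonly object _gate = new();
    private readonly Dictionary<string, HashSet<string>> _plcsByConnection = new();
    private readonly Dictionary<string, int> _countByPlc = new();

    // True when this is the PLC's first subscriber (the "arm" edge).
    public bool Add(string connectionId, string plc)
    {
        lock (_gate)
        {
            if (!_plcsByConnection.TryGetValue(connectionId, out var set))
                _plcsByConnection[connectionId] = set = new HashSet<string>();
            if (!set.Add(plc)) return false; // redundant re-subscribe
            var count = _countByPlc.GetValueOrDefault(plc) + 1;
            _countByPlc[plc] = count;
            return count == 1;
        }
    }

    // PLCs whose subscriber count reached zero (the "disarm" edge).
    public IReadOnlyList<string> RemoveConnection(string connectionId)
    {
        lock (_gate)
        {
            if (!_plcsByConnection.Remove(connectionId, out var set)) return Array.Empty<string>();
            var emptied = new List<string>();
            foreach (var plc in set)
                if (--_countByPlc[plc] == 0) { _countByPlc.Remove(plc); emptied.Add(plc); }
            return emptied;
        }
    }

    public IReadOnlyCollection<string> ActivePlcs()
    {
        lock (_gate) return _countByPlc.Keys.ToArray();
    }
}

public static class MiniTrackerDemo
{
    public static void Main()
    {
        var tracker = new MiniTracker();
        // Parallel stress: every "connection" subscribes and then disconnects.
        Parallel.For(0, 1000, i =>
        {
            var id = $"conn-{i}";
            tracker.Add(id, $"plc-{i % 4}");
            tracker.RemoveConnection(id);
        });
        Console.WriteLine(tracker.ActivePlcs().Count); // 0
    }
}
```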
**M4 — `StatusHub` group-leave on disconnect is not verified.**
`src/Mbproxy/Admin/StatusHub.cs:60-66`; `StatusHubTests.cs`.
`OnDisconnectedAsync` is tested for the *capture-disarm* side effect, but the hub never
calls `Groups.RemoveFromGroupAsync` — and `FakeGroupManager` records `Removed` but **no test
ever asserts `groups.Removed`**. Worth confirming this is intentional (real SignalR auto-
removes a disconnected connection from all groups, so an explicit `RemoveFromGroupAsync` is
genuinely unnecessary) — but then `FakeGroupManager.Removed` is dead code that implies a
contract the hub does not honour. Either delete `Removed` from the fake, or if the hub is
*supposed* to leave fleet groups explicitly, that's a production bug. As written, a reviewer
cannot tell which. Fix: drop the unused `Removed` list, or add a comment on the fake
explaining SignalR's implicit group cleanup so the gap is not mistaken for an omission.
### Minor
**m1 — `SignalRFakes` do not model the real SignalR failure surface.**
`tests/Mbproxy.Tests/Admin/SignalRFakes.cs`. The fakes are faithful to the *shapes* but
model only the success path: `FakeGroupManager.AddToGroupAsync` always succeeds and is
synchronous; `FakeStatusPushSink.PushFleetAsync` never throws and never observes the
`CancellationToken`. The production code has explicit `try/catch` around
`_sink.PushFleetAsync` / `PushPlcAsync` (`StatusBroadcaster.cs:88-110`) — a sink that throws
is a real scenario (a SignalR send to a dropped connection) and that catch is **completely
uncovered**. The `BuildDebug` failure catch (`StatusBroadcaster.cs:82-86`) is likewise
uncovered. Fix: add a throwing variant of `FakeStatusPushSink` and assert `PushOnceAsync`
swallows the exception and still attempts the per-PLC pushes; assert the `ct` is honoured
(an `OperationCanceledException` from the sink must *not* be logged-and-swallowed the same
way — the `when (ex is not OperationCanceledException)` filter is untested).
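A throwing fake of the kind suggested here might look like the following. The interface and the push cycle are simplified, hypothetical stand-ins (`IPushSink`, `ThrowingFleetSink`, `PushCycle`); the real `IStatusPushSink` signatures are not shown in this review, so treat the shapes as assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape, modelled loosely on IStatusPushSink.
public interface IPushSink
{
    Task PushFleetAsync(string payload, CancellationToken ct);
    Task PushPlcAsync(string plc, string payload, CancellationToken ct);
}

// Fails every fleet push with a configurable exception; records per-PLC pushes.
public sealed class ThrowingFleetSink : IPushSink
{
    public List<string> PlcPushes { get; } = new();
    public Exception FleetFailure { get; init; } = new InvalidOperationException("send failed");

    public Task PushFleetAsync(string payload, CancellationToken ct) => Task.FromException(FleetFailure);

    public Task PushPlcAsync(string plc, string payload, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        PlcPushes.Add(plc);
        return Task.CompletedTask;
    }
}

public static class PushCycle
{
    public static async Task PushOnceAsync(IPushSink sink, IEnumerable<string> plcs, CancellationToken ct)
    {
        try { await sink.PushFleetAsync("fleet", ct); }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Swallowed here (production code would log): one failed send must not end the cycle.
        }

        foreach (var plc in plcs)
        {
            try { await sink.PushPlcAsync(plc, "detail", ct); }
            catch (Exception ex) when (ex is not OperationCanceledException) { }
        }
    }

    public static void Main()
    {
        var sink = new ThrowingFleetSink();
        PushOnceAsync(sink, new[] { "line-a", "line-b" }, CancellationToken.None).GetAwaiter().GetResult();
        Console.WriteLine(sink.PlcPushes.Count); // 2: per-PLC pushes still ran
    }
}
```

Setting `FleetFailure` to an `OperationCanceledException` exercises the other half of the filter: the exception should propagate instead of being swallowed.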
**m2 — `FakeHubCallerContext.ConnectionAborted` is hardcoded to `CancellationToken.None`.**
`SignalRFakes.cs:23`. Real SignalR fires this token on disconnect. No current test needs it,
but if anyone later tests connection-abort handling the fake will silently mask it. Low risk;
note it with a comment so a future test author knows the fake is inert here.
**m3 — Asset content-type test relies on a hand-maintained `[InlineData]` allow-list.**
`AdminEndpointTests.Get_Asset_ReturnsCorrectContentType` covers 5 of the 14 embedded files.
`bootstrap.bundle.min.js`, `detail.css`, `detail.js`, `index.html`, `plc.html`,
`ibm-plex-sans-400/600.woff2` are not asserted. More importantly there is **no test that the
csproj glob actually embedded every wwwroot file** — if someone adds `favicon.ico` to
`wwwroot/` and the glob silently misses it (it won't, `*.*` catches it, but a `.gitignore`'d
or renamed file would), nothing fails. Fix: add a test that enumerates
`typeof(AdminEndpointHost).Assembly.GetManifestResourceNames()`, filters the
`Mbproxy.Admin.wwwroot.` prefix, and asserts every file physically present in `wwwroot/` has
a matching resource (or just assert the count). This is the only thing that would catch a
broken `EmbeddedResource` glob.
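The comparison at the heart of that test can be kept as a pure function, which makes it easy to check on its own. `EmbeddedAssetCheck` is a hypothetical helper; it assumes the flat directory described in this review and that MSBuild leaves the file names unmangled in the resource names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EmbeddedAssetCheck
{
    // Flat directory assumed, so a resource name is simply prefix + file name.
    // Returns the files on disk that have no matching embedded resource.
    public static IReadOnlyList<string> MissingResources(
        IEnumerable<string> fileNames, IEnumerable<string> resourceNames, string prefix)
    {
        var embedded = resourceNames
            .Where(n => n.StartsWith(prefix, StringComparison.Ordinal))
            .Select(n => n.Substring(prefix.Length))
            .ToHashSet(StringComparer.Ordinal);

        return fileNames
            .Where(f => !embedded.Contains(f))
            .OrderBy(f => f, StringComparer.Ordinal)
            .ToList();
    }

    public static void Main()
    {
        var missing = MissingResources(
            new[] { "dashboard.js", "favicon.ico" },
            new[] { "Mbproxy.Admin.wwwroot.dashboard.js" },
            "Mbproxy.Admin.wwwroot.");
        Console.WriteLine(string.Join(", ", missing)); // favicon.ico
    }
}
```

In the real test the inputs would come from `Directory.EnumerateFiles` over `wwwroot/` (file names only) and from `typeof(AdminEndpointHost).Assembly.GetManifestResourceNames()`, with the assertion that the result is empty.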
**m4 — `Get_Asset` does not assert the bytes are the *right* asset.**
`AdminEndpointTests.cs` (the new theory). It asserts `bytes.Length > 0` and the content
type, but not that `dashboard.js` contains a known marker (it already does this for the HTML
shells via `ShouldContain("/assets/dashboard.js")`). A resource-name collision or a wrong
`ContentTypeFor` mapping for a *correctly-served-but-wrong* file would pass. Cheap to harden:
for the text assets assert a known substring.
**m5 — `csproj` `EmbeddedResource Include="Admin\wwwroot\*.*"` — glob caveats.**
`src/Mbproxy/Mbproxy.csproj` (the new `ItemGroup`). The SDK is `Microsoft.NET.Sdk.Worker`,
which does **not** enable web default-item globs, so there is no double-include of `.css`/
`.js`/`.html` as `Content` — good. Two real caveats:
- `*.*` matches only files containing a dot. Every current asset has an extension so this is
fine, but it is a non-obvious constraint; `**\*` or just `*` would be more honest. The
comment says "intentionally flat", so `*.*` is acceptable, but worth a one-word note that
extensionless files would be skipped.
- Globs in an `<ItemGroup>` are evaluated at project-evaluation time; a newly-added asset is
*not* picked up by an incremental build that didn't re-evaluate. In practice `dotnet
build`/`publish` always re-evaluates, so this is a non-issue — but it is the kind of thing
that bites a watch-mode developer. No fix needed; flagged for completeness.
The resource-name → request-path mapping (`Mbproxy.Admin.wwwroot.<file>`) is correct for the
flat directory and matches `AssetResourcePrefix` in `AdminEndpointHost.cs:301`.
**m6 — `mbproxy.smoke.config.json` — valid but with stale/odd values.**
`tests/sim/mbproxy.smoke.config.json`. The schema is current (`AdminPushIntervalMs` present,
`Keepalive` field names match `KeepaliveOptions`, no removed keys). Observations:
- The header comment says the simulator runs on `127.0.0.1:5020` and both `line-a`/`line-b`
point `Port: 5020` — consistent. Good.
- `BackendHeartbeatIdleMs: 10000` while `BackendRequestTimeoutMs: 2000` — satisfies the
"must be greater than BackendRequestTimeoutMs" cross-field rule. Fine.
- `line-dead` has no `BcdTags`, so it inherits the empty `Global: []` set — fine, the intent
is just an unreachable backend.
- The file has **no `Resilience` and no `Cache` section.** Both are optional (defaults
apply), so the config binds — but a smoke config whose stated purpose is exercising the
dashboard's "problems only" filter would be more representative with the defaults made
explicit, or at least a comment that defaults are intentionally relied on.
- Comment line 1 references "Phase 4/5 web-UI browser smoke tests" and
`plans/2026-05-15-webui-dashboard.md`. `plans/` is untracked (`?? plans/` in git status)
and per recent commit `7466a46` the project has been *retiring* plan docs. A smoke config
pointing at an untracked plan file is a dangling reference — either commit the plan or
drop the citation.
- No `Resilience.ReadCoalescing` means coalescing runs with defaults during the smoke run;
acceptable but undocumented.
**m7 — `ConfigReconcilerTests` change is a minimal compile-fix, not a behaviour test.**
`tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` (the 1-line diff). The new
`TagCaptureRegistry` ctor arg is passed as a throwaway `new TagCaptureRegistry()`. That is
the correct minimal change, but it means **the reconciler's new responsibility — calling
`TagCaptureRegistry.GetOrCreate` on PLC add and `Remove` on PLC removal during hot-reload —
has no test.** `ConfigReconciler.cs` was modified in this commit (`+11` lines) to wire the
registry; nothing asserts that a hot-reload-added PLC gets a capture or that a removed PLC's
capture is dropped. Fix: in `ConfigReconcilerTests`, pass a real registry the test holds a
reference to, trigger an add/remove reload, and assert `registry.TryGet` reflects it.
**m8 — `ReloadValidator`/`MbproxyOptions` `AdminPushIntervalMs` validation: consistent, but
the upper bound is unguarded.** `ReloadValidator.cs` and `MbproxyOptionsValidator` both
reject `<= 0` — consistent with how `GracefulShutdownTimeoutMs` and the keepalive intervals
are validated (lower-bound-only, `> 0`). So the *new* validation matches house style. But
note no interval option in this codebase has an *upper* bound, and `AdminPushIntervalMs` is
the one most likely to be fat-fingered into something absurd (`1000000` = 16-minute feed).
Not a regression and not inconsistent — flagged only because the dashboard's whole value
proposition is "live". The `LoopAsync` `Math.Max(100, ...)` floor protects against tiny
values; nothing protects against huge ones. Optional: a soft upper bound (e.g. reject
`> 60_000`) or just a doc note. The two new `ReloadValidatorTests` (zero, negative) and two
new `MbproxyOptionsBindingTests` (default 1000, binds 250) are correct and adequate for the
bounds that *are* enforced.
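If the optional soft upper bound were adopted, the rule could be as small as the sketch below. `AdminPushIntervalRule`, the `60_000` limit, and the message wording are illustrative assumptions; the existing validators' actual error format is not shown in this review.

```csharp
#nullable enable
using System;

public static class AdminPushIntervalRule
{
    // Illustrative limit taken from the "e.g. reject > 60_000" suggestion.
    public const int SoftMaxMs = 60_000;

    // Returns null when the value is acceptable, otherwise an error message.
    public static string? Validate(int ms) =>
        ms <= 0 ? "AdminPushIntervalMs must be > 0."
        : ms > SoftMaxMs ? $"AdminPushIntervalMs must be <= {SoftMaxMs}."
        : null;

    public static void Main()
    {
        foreach (var ms in new[] { 0, 250, 1_000_000 })
            Console.WriteLine($"{ms}: {Validate(ms) ?? "ok"}");
    }
}
```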
**m9 — `StatusHubTests.SecondSubscriber_FirstLeaveKeepsArmed_LastLeaveDisarms` passes
`null` to `OnDisconnectedAsync`.** Line 72/76. Real SignalR passes a non-null `Exception`
on an abnormal disconnect and `null` on a clean one, so `null` is a legal value — but the
test never exercises the exception-carrying path. `StatusHub.OnDisconnectedAsync` ignores
the exception entirely, so this is harmless today; flagged so a future change that *uses*
the exception doesn't find itself untested.
## Coverage gaps (behavior in `e719dd5` with NO test)
1. **`StatusBroadcaster.LoopAsync`** — the entire background loop (M1).
2. **Arm/disarm vs. `Record` race** in `TagValueCapture` (M2).
3. **`PlcSubscriptionTracker`** — no direct test, no concurrency test (M3).
4. **Sink-throws / build-throws error handling** in `PushOnceAsync` — the `try/catch`
blocks and the `when (ex is not OperationCanceledException)` filter (m1).
5. **`ConfigReconciler` ↔ `TagCaptureRegistry` wiring** — capture created on hot-reload PLC
add, removed on PLC remove (m7).
6. **`AdminEndpointHost` AdminPort hot-reload tears down the broadcaster and disarms
captures** — `StopCurrentAppAsync` disposes `_broadcaster` (which `DisarmAll`s); the
broadcaster's `StopAsync_DisarmsEveryCapture` test proves the disarm in isolation but no
test proves the *hot-reload rebind* path runs it. (`AdminEndpointTests` has an admin-port
rebind test from the prior commit — extend it to assert capture disarm.)
7. **`/hub/status` SignalR endpoint** — not reachable by any test. The 405-methods test
explicitly excludes it. No test connects a SignalR client and verifies a `fleet`/`plc`
message is actually received end-to-end. The `IStatusPushSink` unit tests cover the
broadcaster logic but nothing covers `SignalRStatusPushSink` + `MapHub` wiring. Given the
project already stands up a real Kestrel host in `AdminEndpointTests`, one
`HubConnectionBuilder`-based E2E test (`SubscribeFleet` → await a `fleet` message) would
close the single biggest "does the feature actually work" gap.
8. **Multi-tag / 32-bit capture via `BcdPduPipeline`**: `BcdPduPipelineCaptureTests` covers
FC03 16/32-bit and FC06/FC16 16-bit, but not FC16 *32-bit* (CDAB pair write) capture, and
not a PDU spanning a BCD tag *and* a non-BCD register (does capture record only the BCD
one?). The pure `TagValueCaptureTests` cover 32-bit, but the *pipeline hook* for 32-bit
writes is unverified.
9. **`Get_PlcDetailRoute_ReturnsDetailShell`** uses `/plc/anything` — it never checks that a
PLC name with a slash, encoded characters, or empty segment is handled. The route is
`/plc/{name}` and the name is read client-side, so this is low-risk, but
`/plc/` (empty) and `/plc/a%2Fb` are untested.
## What looks good
- **`IStatusPushSink` is a clean seam.** Extracting the outbound side behind an interface so
the broadcaster loop logic is testable without a SignalR host is exactly right, and
`FakeStatusPushSink` using `ConcurrentBag` is the correct call for a type that production
pushes to from a background thread.
- **`TagValueCapture` torn-read test is genuine.**
`ConcurrentRecordAndSnapshot_NeverYieldsTornSlot` races 4 writers × 200k ops against a reader and checks a real invariant
(`DecodedValue == RawLow + RawHigh`). This is the right way to test the `Volatile.Write`
immutable-record swap, and it would actually catch a regression to a mutable slot.
- **`SignalRFakes` model the right shapes.** `FakeHubCallerContext : HubCallerContext` with
the abstract members overridden, `FakeGroupManager : IGroupManager`, and the hub-property
injection (`Context = ...`, `Groups = ...`) is the standard, correct way to unit-test a
SignalR `Hub` without a host. No mocking framework is dragged in.
- **`BcdPduPipelineCaptureTests` regression guards are well-chosen.** The disarmed-capture
and null-capture cases assert the rewrite *still happens byte-identically* — that is the
load-bearing property (capture must never perturb the proxy's transparency) and it is
explicitly tested.
- **`TagCaptureRegistryTests.GetOrCreate_Rebuild_PreservesArmedFlag`** correctly tests the
hot-reload reseat contract (rebuilt capture is a new instance but keeps `IsArmed`), and
`UnknownPlc_Operations_AreSafeNoOps` covers the no-op-for-ghost-PLC contract that
`StatusHub` relies on.
- **`StatusHtmlRenderer` removal is clean.** No source or test file references it; the only
remaining mentions are in `plans/`, `docs/`, and prior `codereviews/` — all expected.
- **The `AdminPushIntervalMs` validation is consistent with the codebase.** Lower-bound-only
`> 0` checks in both `MbproxyOptionsValidator` and `ReloadValidator`, error-message format
matching the sibling checks, and tests for default + bind + zero + negative. This is the
right pattern; the new tests are adequate for what is enforced.
- **The csproj `EmbeddedResource` glob is correct** for the flat-directory design, the
comment accurately documents the resource-name → request-path mapping, and there is no
accidental double-include because the Worker SDK does not enable web default globs.
- **`mbproxy.smoke.config.json` binds against the current schema** — no stale keys, the
three-PLC line-a/line-b/line-dead topology is a sensible smoke surface, and
`AdminPushIntervalMs` is present and explicit.
## Key file references
- `src/Mbproxy/Admin/StatusBroadcaster.cs:113-135` — untested `LoopAsync` (M1).
- `src/Mbproxy/Proxy/TagValueCapture.cs:103-129` — arm/disarm vs. record race (M2).
- `src/Mbproxy/Admin/PlcSubscriptionTracker.cs` — no test file exists (M3).
- `src/Mbproxy/Admin/StatusHub.cs:60-66` — `OnDisconnectedAsync`; `FakeGroupManager.Removed`
is unasserted dead code (M4).
- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs` — success-path-only fakes (m1, m2).
- `tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs` — compile-fix only, reconciler
↔ registry wiring untested (m7).
- `src/Mbproxy/Mbproxy.csproj` — `EmbeddedResource Include="Admin\wwwroot\*.*"` (m5).
- `tests/sim/mbproxy.smoke.config.json` — dangling `plans/` reference (m6).
@@ -158,6 +158,7 @@ The fleet-wide BCD tag list. Every PLC starts with this set, then applies its pe
| `Address` | ushort | `0` | `[0, 65535]` | Modbus PDU address (decimal). Address `0` is valid on DL205/DL260 — do not skip it. Octal V-memory addresses must be converted: `V2000` octal = decimal `1024`. |
| `Width` | byte | `0` | `{ 16, 32 }` | Bit width. `16` is one register holding 4 BCD digits (`0` to `9999`). `32` is a CDAB-ordered register pair at `Address` (low word) and `Address+1` (high word). |
| `CacheTtlMs` | int? | `null` | `>= 0`, `<= 60000` unless `Cache.AllowLongTtl = true` | Optional per-tag opt-in to the response cache. `null` falls back to the PLC's `DefaultCacheTtlMs`. `0` explicitly disables caching for this tag even when the PLC default is non-zero. |
| `Name` | string? | `null` | free-form | Optional human-friendly label (e.g. `"Left AirSP"`). Shown on the connection-detail debug view as the row heading. No effect on Modbus rewriting — purely a display aid. |
`MbproxyOptionsValidator` rejects any entry whose `Width` is not `16` or `32`. See [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) for the wire encoding rules and the multi-tag-overlap validation that runs in `BcdTagMapBuilder`.
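The `Address` and `Width` rows can be illustrated with a short sketch: converting an octal V-memory address to the decimal PDU address, and decoding 16-bit and CDAB-ordered 32-bit BCD values. `DirectLogicAddress` and its methods are hypothetical helpers, not part of mbproxy, and the invalid-nibble handling is an assumption; see the linked BcdRewriting document for the proxy's actual rules.

```csharp
using System;

public static class DirectLogicAddress
{
    // "V2000" or "2000" (octal V-memory) -> decimal Modbus PDU address.
    public static ushort FromVMemory(string vAddress)
    {
        var digits = vAddress.TrimStart('V', 'v');
        return checked((ushort)Convert.ToInt32(digits, 8));
    }

    // One register holding 4 BCD digits: 0x1234 -> 1234.
    public static int DecodeBcd16(ushort raw)
    {
        var value = 0;
        for (var shift = 12; shift >= 0; shift -= 4)
        {
            var digit = (raw >> shift) & 0xF;
            if (digit > 9) throw new FormatException($"Invalid BCD nibble in 0x{raw:X4}.");
            value = value * 10 + digit;
        }
        return value;
    }

    // CDAB pair: low word at Address, high word at Address+1.
    public static int DecodeBcd32(ushort lowWord, ushort highWord) =>
        DecodeBcd16(highWord) * 10_000 + DecodeBcd16(lowWord);

    public static void Main()
    {
        Console.WriteLine(FromVMemory("V2000"));           // 1024
        Console.WriteLine(DecodeBcd16(0x1234));            // 1234
        Console.WriteLine(DecodeBcd32(0x5678, 0x1234));    // 12345678
    }
}
```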
@@ -314,12 +314,15 @@ The UI is a Bootstrap 5 single-page app served from embedded assets under `src/M
The detail page's debug view is fed by an **on-demand per-tag value capture** (`Proxy/TagValueCapture.cs`, one per PLC, held in `Proxy/TagCaptureRegistry.cs`). The `BcdPduPipeline` records the last raw/decoded value for each configured BCD tag — but only while the capture is *armed*. `StatusHub` arms a PLC's capture when the first detail page subscribes and disarms it (clearing all slots) when the last viewer leaves, so the hot path carries zero cost when nobody is watching. The per-PLC payload is `PlcDetailResponse` (`src/Mbproxy/Admin/DebugDto.cs`):
> When the response cache is enabled, an FC03/FC04 **cache hit** bypasses the pipeline. To keep the debug view live for cached tags, each cache entry carries the tag observations captured when it was stored (only when a viewer was armed at that time); a hit replays them into the capture, re-stamped to the hit time. The debug view therefore reflects the value the client actually receives — cache-served reads included — not only backend round-trips.
| JSON path | Type | Meaning |
|---|---|---|
| `plc` | `PlcStatus?` | The standard per-PLC status row, or `null` if the PLC was removed by a hot-reload. |
| `debug.captureArmed` | `bool` | Whether a detail page currently has the capture armed. |
| `debug.tags[].address` | `int` | BCD tag PDU address. |
| `debug.tags[].width` | `int` | 16 or 32. |
| `debug.tags[].name` | `string?` | Optional human-friendly tag label from config (`BcdTags[].Name`); `null` when unset. Shown as the debug-row heading, with the PDU address as sub-text. |
| `debug.tags[].hasValue` | `bool` | `false` until the first observation since the capture was armed. |
| `debug.tags[].direction` | `string` | `"read"` (FC03/FC04) or `"write"` (FC06/FC16). |
| `debug.tags[].rawHex` | `string` | Raw PLC-side value as BCD nibbles — `0xLLLL` (16-bit) or `0xHHHHLLLL` (32-bit). |
@@ -25,21 +25,33 @@
// Remove — removes specific addresses from the effective set for that PLC.
// Effective set = (Global ∪ Add) ∖ Remove, resolved per PDU.
"BcdTags": {
// Fleet-wide BCD tag list from tags.txt — applies to every PLC.
// Address = Modbus PDU-decimal = (4xxxx Modbus address - 40001), which also
// equals the DirectLOGIC V-memory address converted octal → decimal
// (e.g. 41549 / V3014 → 41549 - 40001 = 1548 ; octal 3014 = 1548).
// A 32-bit tag is ONE entry at its low/base address; it covers Address &
// Address+1 (CDAB: low word at Address, high word at Address+1).
// Name (optional) is a human-friendly label shown on the connection-detail
// debug view; it has no effect on rewriting. tags.txt's "Data Direction" is
// informational — the proxy rewrites BCD on whichever FC touches the address.
// CacheTtlMs (optional, per entry) opts a tag into the Phase-11 response cache;
// omitted / 0 = uncached (the default for every tag).
"Global": [
// V2000 (octal) = decimal address 1024. 16-bit BCD counter.
{ "Address": 1024, "Width": 16 },
// ── 16-bit setpoints — BCD16, HMI-written ────────────────────────────
{ "Address": 1536, "Width": 16, "Name": "Left ArgonSP" }, // 41537
{ "Address": 1539, "Width": 16, "Name": "Right ArgonSP" }, // 41540
{ "Address": 1544, "Width": 16, "Name": "Left ChlorineSP" }, // 41545 · V3010
{ "Address": 1545, "Width": 16, "Name": "Right ChlorineSP" }, // 41546 · V3011
{ "Address": 1546, "Width": 16, "Name": "Left HydrogenSP" }, // 41547 · V3012
{ "Address": 1547, "Width": 16, "Name": "Right HydrogenSP" }, // 41548 · V3013
{ "Address": 1548, "Width": 16, "Name": "Left AirSP" }, // 41549 · V3014
{ "Address": 1549, "Width": 16, "Name": "Right AirSP" }, // 41550 · V3015
// V2040 (octal) = decimal address 1056. 32-bit BCD total at 1056/1057.
{ "Address": 1056, "Width": 32 },
// V2100 (octal) = decimal address 1088. 16-bit BCD setpoint.
//
// Phase 11: CacheTtlMs (optional) opts this tag into the response cache. With
// CacheTtlMs > 0 set, upstream clients reading this register will see values up
// to CacheTtlMs MILLISECONDS OLD — explicit acknowledgement of the staleness
// window is required by enabling it. Default (omitted or 0) = cache disabled
// for this tag. The cache is OFF by default for every tag.
{ "Address": 1088, "Width": 16 /* , "CacheTtlMs": 1000 */ }
// ── 32-bit runtimes — BCD32, read; CDAB pair spans Address & Address+1 ─
{ "Address": 4616, "Width": 32, "Name": "MTA Runtime Left (min)" }, // 44617/44618 · V11010
{ "Address": 4618, "Width": 32, "Name": "MTA Runtime Right (min)" }, // 44619/44620 · V11012
{ "Address": 4626, "Width": 32, "Name": "FRR Runtime Left (min)" }, // 44627/44628 · V11022
{ "Address": 4628, "Width": 32, "Name": "FRR Runtime Right (min)" } // 44629/44630 · V11024
]
},
@@ -52,26 +64,12 @@
// port will cause a backend connect failure and an immediate upstream disconnect.
"Plcs": [
{
"Name": "Line1-Mixer", // Human-readable name (shown on status page and in logs)
"ListenPort": 5020, // Port the proxy listens on (upstream clients connect here)
"Host": "10.0.1.1", // PLC IP address or hostname
"Port": 502, // PLC Modbus TCP port (almost always 502)
"BcdTags": {
// Additional 32-bit tag specific to this PLC only.
"Add": [
{ "Address": 1200, "Width": 32 }
],
// Remove address 1056 from the Global list for this PLC
// (this mixer doesn't use the 32-bit BCD total).
"Remove": [ 1056 ]
}
},
{
"Name": "Line1-Conveyor",
"ListenPort": 5021,
"Host": "10.0.1.2",
"Port": 502
// No BcdTags override — uses the Global set as-is.
"Name": "Z28061", // Human-readable name (shown on status page and in logs)
"ListenPort": 5020, // Port the proxy listens on (upstream clients connect here)
"Host": "10.210.192.5", // PLC IP address or hostname
"Port": 502 // PLC Modbus TCP port (almost always 502)
// No BcdTags override — uses the Global set as-is. Per-PLC overrides are
// available: "BcdTags": { "Add": [ ... ], "Remove": [ ... ] }.
}
// Add one entry per PLC. Ports must be unique per host. Typical fleet: 54 PLCs.
],
@@ -29,21 +29,33 @@
// Remove — removes specific addresses from the effective set for that PLC.
// Effective set = (Global ∪ Add) ∖ Remove, resolved per PDU.
"BcdTags": {
// Fleet-wide BCD tag list from tags.txt — applies to every PLC.
// Address = Modbus PDU-decimal = (4xxxx Modbus address - 40001), which also
// equals the DirectLOGIC V-memory address converted octal → decimal
// (e.g. 41549 / V3014 → 41549 - 40001 = 1548 ; octal 3014 = 1548).
// A 32-bit tag is ONE entry at its low/base address; it covers Address &
// Address+1 (CDAB: low word at Address, high word at Address+1).
// Name (optional) is a human-friendly label shown on the connection-detail
// debug view; it has no effect on rewriting. tags.txt's "Data Direction" is
// informational — the proxy rewrites BCD on whichever FC touches the address.
// CacheTtlMs (optional, per entry) opts a tag into the Phase-11 response cache;
// omitted / 0 = uncached (the default for every tag).
"Global": [
// V2000 (octal) = decimal address 1024. 16-bit BCD counter.
{ "Address": 1024, "Width": 16 },
// ── 16-bit setpoints — BCD16, HMI-written ────────────────────────────
{ "Address": 1536, "Width": 16, "Name": "Left ArgonSP" }, // 41537
{ "Address": 1539, "Width": 16, "Name": "Right ArgonSP" }, // 41540
{ "Address": 1544, "Width": 16, "Name": "Left ChlorineSP" }, // 41545 · V3010
{ "Address": 1545, "Width": 16, "Name": "Right ChlorineSP" }, // 41546 · V3011
{ "Address": 1546, "Width": 16, "Name": "Left HydrogenSP" }, // 41547 · V3012
{ "Address": 1547, "Width": 16, "Name": "Right HydrogenSP" }, // 41548 · V3013
{ "Address": 1548, "Width": 16, "Name": "Left AirSP" }, // 41549 · V3014
{ "Address": 1549, "Width": 16, "Name": "Right AirSP" }, // 41550 · V3015
// V2040 (octal) = decimal address 1056. 32-bit BCD total at 1056/1057.
{ "Address": 1056, "Width": 32 },
// V2100 (octal) = decimal address 1088. 16-bit BCD setpoint.
//
// Phase 11: CacheTtlMs (optional) opts this tag into the response cache. With
// CacheTtlMs > 0 set, upstream clients reading this register will see values up
// to CacheTtlMs MILLISECONDS OLD — explicit acknowledgement of the staleness
// window is required by enabling it. Default (omitted or 0) = cache disabled
// for this tag. The cache is OFF by default for every tag.
{ "Address": 1088, "Width": 16 /* , "CacheTtlMs": 1000 */ }
// ── 32-bit runtimes — BCD32, read; CDAB pair spans Address & Address+1 ─
{ "Address": 4616, "Width": 32, "Name": "MTA Runtime Left (min)" }, // 44617/44618 · V11010
{ "Address": 4618, "Width": 32, "Name": "MTA Runtime Right (min)" }, // 44619/44620 · V11012
{ "Address": 4626, "Width": 32, "Name": "FRR Runtime Left (min)" }, // 44627/44628 · V11022
{ "Address": 4628, "Width": 32, "Name": "FRR Runtime Right (min)" } // 44629/44630 · V11024
]
},
@@ -56,26 +68,12 @@
// port will cause a backend connect failure and an immediate upstream disconnect.
"Plcs": [
{
"Name": "Line1-Mixer", // Human-readable name (shown on status page and in logs)
"ListenPort": 5020, // Port the proxy listens on (upstream clients connect here)
"Host": "10.0.1.1", // PLC IP address or hostname
"Port": 502, // PLC Modbus TCP port (almost always 502)
"BcdTags": {
// Additional 32-bit tag specific to this PLC only.
"Add": [
{ "Address": 1200, "Width": 32 }
],
// Remove address 1056 from the Global list for this PLC
// (this mixer doesn't use the 32-bit BCD total).
"Remove": [ 1056 ]
}
},
{
"Name": "Line1-Conveyor",
"ListenPort": 5021,
"Host": "10.0.1.2",
"Port": 502
// No BcdTags override — uses the Global set as-is.
"Name": "Z28061", // Human-readable name (shown on status page and in logs)
"ListenPort": 5020, // Port the proxy listens on (upstream clients connect here)
"Host": "10.210.192.5", // PLC IP address or hostname
"Port": 502 // PLC Modbus TCP port (almost always 502)
// No BcdTags override — uses the Global set as-is. Per-PLC overrides are
// available: "BcdTags": { "Add": [ ... ], "Remove": [ ... ] }.
}
// Add one entry per PLC. Ports must be unique per host. Typical fleet: 54 PLCs.
],
+23
@@ -10,6 +10,10 @@
framework-dependent\ ~1.6 MB — requires the .NET 10 + ASP.NET Core runtime
preinstalled on the target.
Each folder also receives a current appsettings.json — the platform-appropriate
install template (Windows or Linux, selected by -Rid) — so every publish-out
flavour is a complete, deployable folder.
The runtime is selected with -Rid (default win-x64). The binary is Mbproxy.exe on
Windows RIDs and Mbproxy on Linux/macOS RIDs.
@@ -70,6 +74,25 @@ Write-Host "`n=== Publishing framework-dependent ($Rid, ~1.6 MB) ===" -Foregroun
& dotnet publish $csproj -c Release -r $Rid -p:SelfContained=false -p:PublishSingleFile=true -o $frameworkDependentOut --nologo
if ($LASTEXITCODE -ne 0) { throw "framework-dependent publish failed (exit $LASTEXITCODE)" }
# ── Ship the platform-appropriate config template as appsettings.json ──────────
# dotnet publish already copies it via the Mbproxy.csproj <Content> link, but that
# link uses PreserveNewest — an incremental (non-Clean) run can leave a stale
# config behind. Copy it explicitly so every publish-out flavour is guaranteed a
# current appsettings.json, and so the config's source is obvious.
$configTemplate = if ($Rid -like 'win-*') {
Join-Path $repoRoot 'install\mbproxy.config.template.json'
} else {
Join-Path $repoRoot 'install\mbproxy.linux.config.template.json'
}
if (-not (Test-Path $configTemplate)) { throw "Cannot find config template: $configTemplate" }
Write-Host "`n=== Config (appsettings.json) ===" -ForegroundColor Cyan
foreach ($flavour in 'self-contained','framework-dependent') {
$dest = Join-Path $OutputDir "$flavour\appsettings.json"
Copy-Item -LiteralPath $configTemplate -Destination $dest -Force
Write-Host (" {0,-22} <- {1}" -f $flavour, $configTemplate)
}
function Format-Size {
param([long]$Bytes)
if ($Bytes -ge 1MB) { '{0:N1} MB' -f ($Bytes / 1MB) }
+26
@@ -10,6 +10,10 @@
# framework-dependent/ ~1.6 MB — requires the .NET 10 + ASP.NET Core runtime
# preinstalled on the target.
#
# Each folder also receives a current appsettings.json — the platform-appropriate
# install template (Windows or Linux, selected by -r RID) — so every publish-out
# flavour is a complete, deployable folder.
#
# Both builds use the Release configuration and inherit the publish settings in
# src/Mbproxy/Mbproxy.csproj (those settings are gated on an explicit RID, which
# is supplied here). The framework-dependent build overrides SelfContained=false.
@@ -68,6 +72,28 @@ echo "=== Publishing framework-dependent ($rid, ~1.6 MB) ==="
dotnet publish "$csproj" -c Release -r "$rid" \
-p:SelfContained=false -p:PublishSingleFile=true -o "$framework_dependent_out" --nologo
# Ship the platform-appropriate config template as appsettings.json.
# dotnet publish already copies it via the Mbproxy.csproj <Content> link, but that
# link uses PreserveNewest — an incremental (non-clean) run can leave a stale config
# behind. Copy it explicitly so every publish-out flavour is guaranteed a current
# appsettings.json, and so the config's source is obvious.
if [[ "$rid" == win-* ]]; then
config_template="$repo_root/install/mbproxy.config.template.json"
else
config_template="$repo_root/install/mbproxy.linux.config.template.json"
fi
if [[ ! -f "$config_template" ]]; then
echo "Cannot find config template: $config_template" >&2
exit 1
fi
echo
echo "=== Config (appsettings.json) ==="
for flavour in self-contained framework-dependent; do
cp -f "$config_template" "$output_dir/$flavour/appsettings.json"
printf ' %-22s <- %s\n' "$flavour" "$config_template"
done
echo
echo "=== Result ($rid) ==="
for flavour in self-contained framework-dependent; do
+2
@@ -35,6 +35,8 @@ public sealed record PlcDebugSnapshot(
public sealed record TagValueDto(
int Address,
int Width,
/// <summary>Optional human-friendly tag label from config; <c>null</c> when unset.</summary>
string? Name,
bool HasValue,
/// <summary><c>"read"</c> (FC03/FC04) or <c>"write"</c> (FC06/FC16).</summary>
string Direction,
@@ -67,6 +67,7 @@ internal sealed class StatusSnapshotBuilder
return new TagValueDto(
Address: o.Address,
Width: o.Width,
Name: o.Name,
HasValue: hasValue,
Direction: o.Direction == CaptureDirection.Write ? "write" : "read",
RawHex: rawHex,
@@ -115,6 +115,17 @@
.debug-table td.num { text-align: right; font-family: var(--mono); }
.debug-table tbody tr:last-child td { border-bottom: none; }
/* First column: friendly tag name over its PDU address, or a bare address. */
.debug-table .tag-name { display: block; font-weight: 600; }
.debug-table .tag-addr {
display: block;
margin-top: 0.1rem;
font-family: var(--mono);
font-size: 0.72rem;
color: var(--ink-faint);
}
.debug-table .tag-addr-hex { color: var(--ink-faint); font-size: 0.76rem; }
.debug-table .raw { color: var(--accent-deep); }
.debug-table .dec { font-weight: 600; }
.debug-table tr.stale td { color: var(--ink-faint); }
+18 -7
@@ -25,6 +25,17 @@
}
function hex4(n) { return '0x' + (n & 0xffff).toString(16).toUpperCase().padStart(4, '0'); }
// First debug-row cell: the tag's friendly name (when configured) over its PDU
// address, or just the address when unnamed. All dynamic text is escaped.
function tagCell(t) {
const addr = Number(t.address);
if (t.name) {
return `<td><span class="tag-name">${escapeHtml(t.name)}</span>` +
`<span class="tag-addr">PDU ${addr} · ${hex4(addr)}</span></td>`;
}
return `<td>${addr} <span class="tag-addr-hex">${hex4(addr)}</span></td>`;
}
function formatAge(sec) {
if (sec === null || sec === undefined) return '—';
if (sec < 1) return 'now';
@@ -54,7 +65,7 @@
const cls = state === 'bound' ? 'chip-ok'
: state === 'recovering' ? 'chip-warn'
: 'chip-idle';
return `<span class="chip ${cls}">${state}</span>`;
return `<span class="chip ${cls}">${escapeHtml(state)}</span>`;
}
// ── Render: PLC counters ───────────────────────────────────────────────
@@ -83,7 +94,7 @@
// Clients
const clientLines = (plc.clients.remoteEndpoints || []).map(c =>
`<div class="client-line">${escapeHtml(c.remote)}` +
`<span class="pdu"> · ${num(c.pdusForwarded)} PDUs · since ${shortTime(c.connectedAtUtc)}</span></div>`
`<span class="pdu"> · ${num(c.pdusForwarded)} PDUs · since ${escapeHtml(shortTime(c.connectedAtUtc))}</span></div>`
).join('');
cards.push(card('Clients', [
['Connected', num(plc.clients.connected)],
@@ -188,8 +199,8 @@
tbody.innerHTML = debug.tags.map(t => {
if (!t.hasValue) {
return `<tr class="no-traffic">
<td>${t.address} <span class="ratio-sub">${hex4(t.address)}</span></td>
<td>${t.width}-bit</td>
${tagCell(t)}
<td>${Number(t.width)}-bit</td>
<td colspan="3">no traffic yet</td>
<td class="num"></td>
</tr>`;
@@ -197,9 +208,9 @@
const dirCls = t.direction === 'write' ? 'dir-write' : 'dir-read';
const stale = (t.ageSeconds || 0) > 30 ? ' stale' : '';
return `<tr class="${stale.trim()}">
<td>${t.address} <span class="ratio-sub">${hex4(t.address)}</span></td>
<td>${t.width}-bit</td>
<td><span class="dir-tag ${dirCls}">${t.direction}</span></td>
${tagCell(t)}
<td>${Number(t.width)}-bit</td>
<td><span class="dir-tag ${dirCls}">${escapeHtml(t.direction)}</span></td>
<td class="num raw">${escapeHtml(t.rawHex)}</td>
<td class="num dec">${num(t.decodedValue)}</td>
<td class="num">${formatAge(t.ageSeconds)}</td>
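The hunks above route every dynamic string through `escapeHtml` before it reaches `innerHTML`, but the helper's body is not part of this diff. A minimal sketch of what such a helper typically looks like, assuming it escapes the five HTML-significant characters and renders null/undefined as an empty string (both are assumptions; the real implementation may differ):

```javascript
// Hypothetical stand-in for the dashboard's escapeHtml helper (its real body is
// not shown in this diff). Escapes the characters that are significant in HTML
// text and attribute contexts.
function escapeHtml(value) {
  if (value === null || value === undefined) return '';
  return String(value)
    .replace(/&/g, '&amp;')   // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Same shape as tagCell above: a hostile tag name becomes inert text.
const hostileName = '<img src=x onerror=alert(1)>';
const cell = `<span class="tag-name">${escapeHtml(hostileName)}</span>`;
console.log(cell);
```

Numeric fields such as `t.address` and `t.width` do not need this treatment when they are coerced with `Number(...)` first, which is the approach the diff takes.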
+1 -1
@@ -49,7 +49,7 @@
<table class="debug-table">
<thead>
<tr>
<th>Tag (PDU addr)</th>
<th>Tag</th>
<th>Width</th>
<th>Direction</th>
<th class="num">PLC side (raw BCD)</th>
+6 -3
@@ -8,14 +8,17 @@ namespace Mbproxy.Bcd;
/// milliseconds. 0 (the default) means caching is disabled for this tag. Positive values
/// cap upstream staleness; the multi-tag-range read uses <c>min(TTLs)</c> across all
/// matched tags and treats any 0 in the range as "uncached for the whole read."</para>
///
/// <para><b><see cref="Name"/></b> is an optional human-friendly label carried through
/// to the connection-detail debug view. It has no effect on Modbus rewriting.</para>
/// </summary>
public sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs = 0)
public sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs = 0, string? Name = null)
{
/// <summary>
/// Creates a <see cref="BcdTag"/> and validates that Width is 16 or 32.
/// </summary>
/// <exception cref="ArgumentException">Width is not 16 or 32.</exception>
public static BcdTag Create(ushort address, byte width, int cacheTtlMs = 0)
public static BcdTag Create(ushort address, byte width, int cacheTtlMs = 0, string? name = null)
{
if (width != 16 && width != 32)
throw new ArgumentException(
@@ -27,7 +30,7 @@ public sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs = 0)
$"BCD tag CacheTtlMs must be >= 0; got {cacheTtlMs} at address {address}.",
nameof(cacheTtlMs));
return new BcdTag(address, width, cacheTtlMs);
return new BcdTag(address, width, cacheTtlMs, name);
}
/// <summary>True when this tag occupies two registers (32-bit BCD).</summary>
+1 -1
@@ -122,7 +122,7 @@ public static class BcdTagMapBuilder
int resolvedTtl = opt.CacheTtlMs ?? perPlcDefaultCacheTtlMs;
if (resolvedTtl < 0) resolvedTtl = 0;
validated[addr] = BcdTag.Create(addr, opt.Width, resolvedTtl);
validated[addr] = BcdTag.Create(addr, opt.Width, resolvedTtl, opt.Name);
}
// High-register collision check (only meaningful for 32-bit entries).
@@ -12,4 +12,11 @@ public sealed class BcdTagOptions
/// values cap the staleness window in milliseconds.
/// </summary>
public int? CacheTtlMs { get; init; }
/// <summary>
/// Optional human-friendly label identifying this tag (e.g. <c>"Left AirSP"</c>).
/// Free-form; shown on the connection-detail debug view. Null/omitted renders as
/// the bare PDU address. Has no effect on Modbus rewriting — purely a display aid.
/// </summary>
public string? Name { get; init; }
}
@@ -14,10 +14,18 @@ namespace Mbproxy.Proxy.Cache;
/// monotonic ticker; LRU eviction picks the entry with the smallest tick. Using a long
/// instead of <see cref="DateTimeOffset.UtcNow"/> on every access keeps the hot path free
/// of clock calls and works correctly even if the wall clock moves backward.</para>
///
/// <para><b><see cref="CapturedTags"/></b> holds the BCD-tag observations captured (raw
/// nibbles + decoded value) when this entry was stored — but only when the connection's
/// debug-view capture was armed at insert time; otherwise <c>null</c>. A cache hit bypasses
/// the BCD pipeline, so without this the debug view would never see cache-served reads.
/// On a hit these are replayed into the capture so the detail page reflects what the
/// client actually receives. See <see cref="ResponseCache"/> and the debug-view docs.</para>
/// </summary>
internal sealed record CacheEntry(
byte[] PduBytes,
DateTimeOffset CachedAtUtc,
DateTimeOffset ExpiresAtUtc,
int Length,
long LastUsedTick);
long LastUsedTick,
IReadOnlyList<TagValueObservation>? CapturedTags = null);
@@ -645,6 +645,14 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
byte[] pduSnapshot = new byte[pduBodyLen];
Buffer.BlockCopy(frame, MbapFrame.HeaderSize, pduSnapshot, 0, pduBodyLen);
// If a detail-page viewer has armed this PLC's capture, the
// pipeline above just recorded fresh observations for the BCD
// tags in this read range. Attach them to the cache entry so a
// later hit (which bypasses the pipeline) can replay them into
// the debug view — otherwise the view would freeze for the TTL.
var capturedTags = CaptureRangeObservations(
inFlight.StartAddress, inFlight.Qty);
var cacheKey = new CacheKey(
inFlight.UnitId, inFlight.Fc,
inFlight.StartAddress, inFlight.Qty);
@@ -654,7 +662,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
CachedAtUtc: now,
ExpiresAtUtc: now.AddMilliseconds(inFlight.ResolvedCacheTtlMs),
Length: pduSnapshot.Length,
LastUsedTick: 0); // ResponseCache.Set stamps the real tick
LastUsedTick: 0, // ResponseCache.Set stamps the real tick
CapturedTags: capturedTags);
postCache.Set(cacheKey, entry);
CacheLogEvents.Store(_logger, _plc.Name,
@@ -825,6 +834,21 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
_ctx.Counters.IncrementCacheHit();
CacheLogEvents.Hit(_logger, _plc.Name, unitId, fcByte, startAddr, qty);
// A cache hit bypasses the BCD pipeline, so the debug-view capture
// would otherwise never see cache-served reads. Replay the
// observations captured when this entry was stored — re-stamped now,
// since the client receives this value right now — so the detail
// page reflects what the client actually gets. Entries stored while
// no viewer was armed carry no observations; those tags self-heal on
// the next cache miss.
if (_ctx.Capture is { IsArmed: true } capture &&
cached.CapturedTags is { Count: > 0 } cachedTags)
{
foreach (var obs in cachedTags)
capture.Record(obs.Address, obs.RawLow, obs.RawHigh,
obs.DecodedValue, CaptureDirection.Read);
}
byte[] hitFrame = BuildCacheHitFrame(originalTxId, unitId, cached.PduBytes);
await pipe.SendResponseAsync(hitFrame, ct).ConfigureAwait(false);
// Outbound bytes for cache-hit response.
@@ -1345,6 +1369,35 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
return frame;
}
/// <summary>
/// Snapshots the debug-view observations for the BCD tags within the given FC03/FC04
/// read range, for attaching to the response-cache entry being stored. Returns
/// <c>null</c> when no detail-page viewer has armed this PLC's capture (the common
/// case — zero work) or when no observed tag falls in the range. The pipeline records
/// observations on the cache-miss path just before the entry is stored; this captures
/// them so a later cache hit — which bypasses the pipeline — can replay them. Scoped
/// to the read range so a hit never re-stamps a tag that was not part of this read.
/// </summary>
private IReadOnlyList<TagValueObservation>? CaptureRangeObservations(ushort startAddress, ushort qty)
{
if (_ctx.Capture is not { IsArmed: true } capture)
return null;
if (!_ctx.TagMap.TryGetForRange(startAddress, qty, out var hits) || hits.Count == 0)
return null;
var inRange = new HashSet<ushort>();
foreach (var hit in hits)
inRange.Add(hit.Tag.Address);
List<TagValueObservation>? observed = null;
foreach (var obs in capture.Snapshot())
{
if (obs.UpdatedAtUtc is not null && inRange.Contains(obs.Address))
(observed ??= []).Add(obs);
}
return observed;
}
private static byte[] BuildExceptionFrame(ushort originalTxId, byte unitId, byte fc, byte exceptionCode)
{
// Modbus exception PDU = [fc | 0x80][exceptionCode].
+8 -3
@@ -18,11 +18,13 @@ public enum CaptureDirection
///
/// <para>"PLC side" is always the BCD-encoded form on the Modbus wire to/from the
/// device; "client side" is always the decoded binary integer the upstream client
/// reads or wrote. <see cref="RawHigh"/> is 0 for a 16-bit tag.</para>
/// reads or wrote. <see cref="RawHigh"/> is 0 for a 16-bit tag. <see cref="Name"/> is
/// the tag's optional human-friendly label (null when the config gave none).</para>
/// </summary>
public sealed record TagValueObservation(
ushort Address,
byte Width,
string? Name,
ushort RawLow,
ushort RawHigh,
int DecodedValue,
@@ -54,6 +56,7 @@ internal sealed class TagValueCapture
// Slot index → tag identity. Parallel to _slots; immutable after construction.
private readonly ushort[] _addresses;
private readonly byte[] _widths;
private readonly string?[] _names;
// Slot index → last observation (null = no traffic captured yet). Each element is
// swapped via Volatile.Write; never mutated in place.
@@ -75,6 +78,7 @@ internal sealed class TagValueCapture
_addresses = new ushort[ordered.Length];
_widths = new byte[ordered.Length];
_names = new string?[ordered.Length];
_slots = new TagValueObservation?[ordered.Length];
var index = new Dictionary<ushort, int>(ordered.Length);
@@ -82,6 +86,7 @@ internal sealed class TagValueCapture
{
_addresses[i] = ordered[i].Address;
_widths[i] = ordered[i].Width;
_names[i] = ordered[i].Name;
index[ordered[i].Address] = i;
}
_addressToSlot = index.ToFrozenDictionary();
@@ -124,7 +129,7 @@ internal sealed class TagValueCapture
Volatile.Write(
ref _slots[idx],
new TagValueObservation(
_addresses[idx], _widths[idx], rawLow, rawHigh, decoded, direction,
_addresses[idx], _widths[idx], _names[idx], rawLow, rawHigh, decoded, direction,
DateTimeOffset.UtcNow));
}
@@ -140,7 +145,7 @@ internal sealed class TagValueCapture
{
result[i] = Volatile.Read(ref _slots[i])
?? new TagValueObservation(
_addresses[i], _widths[i], 0, 0, 0, CaptureDirection.Read, null);
_addresses[i], _widths[i], _names[i], 0, 0, 0, CaptureDirection.Read, null);
}
return result;
}
+17
@@ -0,0 +1,17 @@
Original Tag Address New Tag Address Modbus Address Description Data Type Data Direction
V3014 41549 Left AirSP BCD16 Write-Only
41537 Left ArgonSP BCD16 Write-Only
V3010 41545 Left ChlorineSP BCD16 Write-Only
V3012 41547 Left HydrogenSP BCD16 Write-Only
V3015 41550 Right AirSP BCD16 Write-Only
41540 Right ArgonSP BCD16 Write-Only
V3011 41546 Right ChlorineSP BCD16 Write-Only
V3013 41548 Right HydrogenSP BCD16 Write-Only
V1700 V11010 44617 MTA Runtime Left in Minuntes BCD32 Read-Only
V1701 V11011 44618 MTA Runtime Left in Minuntes BCD32 Read-Only
V1702 V11012 44619 MTA Runtime Right in Minuntes BCD32 Read-Only
V1703 V11013 44620 MTA Runtime Right in Minuntes BCD32 Read-Only
V1710 V11022 44627 FRR Runtime Left in Minutes BCD32 Read-Only
V1711 V11023 44628 FRR Runtime Left in Minutes BCD32 Read-Only
V1712 V11024 44629 FRR Runtime Right in Minutes BCD32 Read-Only
V1713 V11025 44630 FRR Runtime Right in Minutes BCD32 Read-Only
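The table lists each tag by its 4xxxx Modbus register number and its V-memory address, while the config's `Address` field is the zero-based PDU address. A small sketch of the arithmetic, assuming the usual conventions that holding register 40001 maps to PDU address 0 and that V-memory addresses are octal (both consistent with config comments such as `41550 · V3015` for address 1549). The function names are hypothetical:

```javascript
// Hypothetical helpers; names are illustrative and not part of mbproxy.

// 4xxxx holding-register number -> zero-based PDU address (40001 -> 0).
function modbusToPdu(registerNumber) {
  return registerNumber - 40001;
}

// Octal V-memory address such as "V3015" -> decimal PDU address.
function vMemoryToPdu(vAddress) {
  return parseInt(vAddress.replace(/^V/i, ''), 8);
}

console.log(modbusToPdu(41550), vMemoryToPdu('V3015'));   // both 1549
console.log(modbusToPdu(44617), vMemoryToPdu('V11010'));  // both 4616
```

Cross-checking both routes for every row is a cheap way to catch a transcription slip before an address lands in `appsettings.json`.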
@@ -54,6 +54,29 @@ public sealed class BcdTagMapBuilderTests
t32.Width.ShouldBe((byte)32);
}
[Fact]
public void Build_CarriesOptionalTagName_IntoResolvedMap()
{
// The optional human-friendly Name flows from config options through to the
// resolved BcdTag; an omitted Name resolves to null.
var global = new BcdTagListOptions
{
Global =
[
new BcdTagOptions { Address = 1548, Width = 16, Name = "Left AirSP" },
new BcdTagOptions { Address = 1080, Width = 32 },
],
};
var result = BcdTagMapBuilder.Build(global, perPlc: null);
result.Errors.ShouldBeEmpty();
result.Map.TryGet(1548, out var named).ShouldBeTrue();
named.Name.ShouldBe("Left AirSP");
result.Map.TryGet(1080, out var unnamed).ShouldBeTrue();
unnamed.Name.ShouldBeNull();
}
[Fact]
public void Build_PerPlcAdd_AppendsToGlobal()
{
@@ -45,6 +45,7 @@ public sealed class MbproxyOptionsBindingTests
{
["Mbproxy:BcdTags:Global:0:Address"] = "1072",
["Mbproxy:BcdTags:Global:0:Width"] = "16",
["Mbproxy:BcdTags:Global:0:Name"] = "Left AirSP",
["Mbproxy:BcdTags:Global:1:Address"] = "1080",
["Mbproxy:BcdTags:Global:1:Width"] = "32",
});
@@ -52,8 +53,10 @@ public sealed class MbproxyOptionsBindingTests
options.BcdTags.Global.Count.ShouldBe(2);
options.BcdTags.Global[0].Address.ShouldBe((ushort)1072);
options.BcdTags.Global[0].Width.ShouldBe((byte)16);
options.BcdTags.Global[0].Name.ShouldBe("Left AirSP");
options.BcdTags.Global[1].Address.ShouldBe((ushort)1080);
options.BcdTags.Global[1].Width.ShouldBe((byte)32);
options.BcdTags.Global[1].Name.ShouldBeNull("Name is optional — an omitted entry binds to null");
}
// -------------------------------------------------------------------------
@@ -542,4 +542,75 @@ public sealed class ResponseCacheMultiplexerTests
l.Stop();
}
}
[Fact]
public async Task CacheHit_RecordsServedRead_IntoArmedDebugCapture()
{
// C3 regression guard: a cache hit bypasses the BCD pipeline, so without the
// cache-entry observation replay the connection-detail debug view would freeze
// for the whole TTL on a cached tag. A hit must re-record the served read into
// the (armed) capture so the debug view reflects what the client receives.
int backendPort = PickFreePort();
await using var backend = new StubBackend(backendPort) { RegisterValue = 0x1234 };
using var cache = new ResponseCache(maxEntriesPerPlc: 64, evictionIntervalMs: 5000);
var tag = BcdTag.Create(100, 16, cacheTtlMs: 5000);
// A detail-page viewer has armed this PLC's debug-view capture.
var capture = new TagValueCapture([tag]);
capture.Arm();
var frozen = new[] { tag }.ToDictionary(t => t.Address).ToFrozenDictionary();
var ctx = new PerPlcContext
{
PlcName = "PLC1",
TagMap = new BcdTagMap(frozen),
Counters = new ProxyCounters(),
Logger = NullLogger.Instance,
Cache = cache,
Capture = capture,
};
var plc = new PlcOptions { Name = "PLC1", ListenPort = 0, Host = "127.0.0.1", Port = backendPort };
await using var mux = BuildMux(plc, ctx);
var (c, p, l) = await ConnectClientAsync(mux, plc.Name);
try
{
// First read — cache miss. The pipeline records the observation and the
// entry is stored with the per-tag observations attached.
await c.SendAsync(BuildFc03(0x0001, 100, 1), SocketFlags.None);
_ = await ReadOneFrameAsync(c, TestContext.Current.CancellationToken);
var afterMiss = capture.Snapshot().Single(o => o.Address == 100);
afterMiss.UpdatedAtUtc.ShouldNotBeNull("the cache-miss read must record an observation");
afterMiss.DecodedValue.ShouldBe(1234);
afterMiss.RawLow.ShouldBe((ushort)0x1234);
// Clear the capture's slots (models the debug view holding no fresh data),
// then re-arm. Only the cache-hit replay can repopulate slot 100 now — the
// backend is not contacted again.
capture.Disarm();
capture.Arm();
capture.Snapshot().Single(o => o.Address == 100).UpdatedAtUtc
.ShouldBeNull("Disarm must clear the slot");
// Second read — cache hit. No backend round-trip; the pipeline is bypassed.
await c.SendAsync(BuildFc03(0x0002, 100, 1), SocketFlags.None);
_ = await ReadOneFrameAsync(c, TestContext.Current.CancellationToken);
backend.RequestCount.ShouldBe(1, "the second read must be served from the cache");
var afterHit = capture.Snapshot().Single(o => o.Address == 100);
afterHit.UpdatedAtUtc.ShouldNotBeNull(
"a cache hit must re-record the served read so the debug view does not freeze");
afterHit.DecodedValue.ShouldBe(1234, "the replayed observation carries the decoded value");
afterHit.RawLow.ShouldBe((ushort)0x1234, "the replayed observation carries the raw BCD nibbles");
afterHit.Direction.ShouldBe(CaptureDirection.Read);
}
finally
{
c.Dispose();
await p.DisposeAsync();
l.Stop();
}
}
}
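The test above relies on the raw register `0x1234` decoding to `1234`. A minimal sketch of that BCD decode, assuming each nibble holds one decimal digit and, for 32-bit tags, that the register at `Address` carries the low four digits and `Address+1` the high four (an assumption based on the "CDAB" note in the config; mbproxy's actual decoder is not shown in this diff):

```javascript
// Illustrative BCD decode; not mbproxy's implementation.
function decodeBcd16(raw) {
  let value = 0;
  for (let shift = 12; shift >= 0; shift -= 4) {
    const digit = (raw >> shift) & 0xf;
    if (digit > 9) throw new RangeError(`invalid BCD nibble 0x${digit.toString(16)}`);
    value = value * 10 + digit;
  }
  return value;
}

// Assumed word order: rawLow at Address, rawHigh at Address + 1.
function decodeBcd32(rawLow, rawHigh) {
  return decodeBcd16(rawHigh) * 10000 + decodeBcd16(rawLow);
}

console.log(decodeBcd16(0x1234));          // 1234
console.log(decodeBcd32(0x5678, 0x1234));  // 12345678
```

Rejecting nibbles above 9 is a design choice of this sketch; whether the proxy rejects, clamps, or passes through invalid BCD is not stated in the diff.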
@@ -43,6 +43,27 @@ public sealed class TagValueCaptureTests
slot.UpdatedAtUtc.ShouldNotBeNull();
}
[Fact]
public void Snapshot_CarriesTagName_FromConfiguredTag()
{
// The capture surfaces a BCD tag's optional friendly name on every
// observation — placeholder rows (no traffic) and recorded values alike —
// so the debug view can label rows. An unnamed tag surfaces a null Name.
var capture = new TagValueCapture(
[
BcdTag.Create(100, 16, name: "Left AirSP"),
BcdTag.Create(200, 32),
]);
var before = capture.Snapshot();
before.Single(o => o.Address == 100).Name.ShouldBe("Left AirSP");
before.Single(o => o.Address == 200).Name.ShouldBeNull();
capture.Arm();
capture.Record(100, 0x1234, 0, 1234, CaptureDirection.Read);
capture.Snapshot().Single(o => o.Address == 100).Name.ShouldBe("Left AirSP");
}
[Fact]
public void Armed_Record_UnknownAddress_IsIgnored()
{