diff --git a/mbproxy/codereviews/2026-05-16/AdminSignalR.md b/mbproxy/codereviews/2026-05-16/AdminSignalR.md new file mode 100644 index 0000000..fd4c008 --- /dev/null +++ b/mbproxy/codereviews/2026-05-16/AdminSignalR.md @@ -0,0 +1,110 @@ +# Admin SignalR / Tag Capture — Code Review + +Scope: `src/Mbproxy/Admin/*.cs` (AdminEndpointHost, StatusBroadcaster, StatusHub, PlcSubscriptionTracker, StatusSnapshotBuilder, StatusPushSink, StatusDto, DebugDto, AssemblyVersionAccessor) plus `src/Mbproxy/Proxy/TagValueCapture.cs` and `src/Mbproxy/Proxy/TagCaptureRegistry.cs`. Branch `mbproxy-webui-dashboard` (HEAD `0308490`). Cross-checked against `docs/Operations/StatusPage.md`, `docs/Reference/LogEvents.md`, the mbproxy `CLAUDE.md`, the prior reviews `codereviews/2026-05-15/{AdminSignalR,TagCapture}.md`, and the consuming code in `ConfigReconciler.cs` / `ProxyWorker.cs` / `PlcMultiplexer.cs`. + +## Summary + +The two Critical findings from the 2026-05-15 reviews have been genuinely fixed and the fixes hold. The SignalR subscription tracker is now keyed on a stable per-page-load `tabId` (`PlcSubscriptionTracker`), not on the `ConnectionId`, so a transport reconnect can no longer leak a viewer; arm/disarm is funneled through a single authority (`StatusBroadcaster.PushOnceAsync` → `TagCaptureRegistry.ReconcileArmed`), removing the lost-update race; the hub no longer arms captures at all. The prior XSS surface does not exist in the current admin layer — no server-rendered HTML, no string interpolation into a response; `/plc/{name}` serves a static embedded shell and `name` is read client-side. The cache-hit debug-view freeze (prior TagCapture C1) is fixed by `CacheEntry.CapturedTags` replay. The endpoint is strictly read-only. `M5`/`M6` from the prior review (bind-failure leak) are addressed: `StartAppAsync` now tears down a partially-started `app`/`_broadcaster` in its catch. + +Remaining issues are lower severity: a real but bounded reconcile/disarm race against `Snapshot()`, a stale doc comment, a missing `Start()` double-call guard, and assorted test-gap / doc-drift items. No new Critical issues. + +Findings by severity: **Critical: 0 · Major: 2 · Minor: 7** + +## Major findings + +### M1. `ReconcileArmed` disarm can clear a slot between snapshot capture and push — debug view can flicker an extra empty cycle + +`StatusBroadcaster.cs:108-130`, `TagCaptureRegistry.cs:55-68`, `TagValueCapture.cs:108-113`. + +`PushOnceAsync` calls `ReconcileArmed(activePlcs)` *before* the per-PLC `foreach` that calls `_builder.BuildDebug(plcName)`. For a PLC that is currently armed and still has a viewer this is fine. But the ordering interacts badly with a viewer that leaves *during* a push cycle: + +- `activePlcs` is snapshotted at line 108. Suppose `plc-x` is in it. +- `ReconcileArmed` keeps `plc-x` armed. +- The viewer's `OnDisconnectedAsync` runs on a hub thread mid-cycle, removing `plc-x` from the tracker — but the `foreach` still iterates the stale `activePlcs` list and pushes one final detail message to a now-empty group. Benign (empty group no-ops). + +The genuinely observable case is the inverse and is a correctness nuance, not a crash: because `ReconcileArmed` runs once per cycle and `BuildDebug` runs in the same cycle, a PLC that becomes active *after* line 108 (a `SubscribePlc` landing between line 108 and the `foreach`) is not in `activePlcs`, so it is neither armed nor pushed this cycle — it self-heals next cycle. That is acceptable. What is worth flagging: `ReconcileArmed` and `BuildDebug` are not consistent within the cycle for a PLC whose arm state flips mid-cycle, and the capture's `Disarm()` clears every slot (`TagValueCapture.cs:111-112`). If a hot-reload `GetOrCreate` rebuild or a `ReconcileArmed` disarm interleaves between `ReconcileArmed` and the `capture.Snapshot()` inside `BuildDebug`, the pushed `PlcDebugSnapshot` can report `CaptureArmed = true` (read at `StatusSnapshotBuilder.cs:54`) with all-empty tag rows, or `CaptureArmed = false` with populated rows. + +Impact: a transient one-cycle (≤ `AdminPushIntervalMs`, default 1 s) inconsistency in the debug view — `captureArmed` and the tag rows can momentarily disagree. Not data corruption, not a leak; the next cycle is correct. Worth recording because the prior reviews treated arm-state consistency as the central invariant. + +Recommendation: read `capture.IsArmed` and `capture.Snapshot()` once, atomically enough for display — or, simplest, have `BuildDebug` derive `CaptureArmed` from the same `activePlcs` set the broadcaster already holds rather than re-reading `capture.IsArmed`. Pass the active set (or a `bool armed`) into `BuildDebug` instead of letting it independently re-query the capture. That makes the pushed payload internally consistent by construction. + +### M2. `StatusBroadcaster.Start()` has no guard against a second call — a double-start orphans a push loop + +`StatusBroadcaster.cs:56-57`: `public void Start() => _loop = Task.Run(() => LoopAsync(_cts.Token));` + +A second `Start()` overwrites `_loop`, orphaning the first loop task. The first task keeps running against the same `_cts`, so two loops now push the fleet snapshot every interval until cancellation, and `StopAsync` only `await`s the *second* `_loop` — the orphaned first loop is never awaited and its faults are unobserved. The XML doc ("Idempotent only in the sense that it is called once") documents the assumption but nothing enforces it. + +Today `AdminEndpointHost` constructs a fresh `StatusBroadcaster` per `StartAppAsync` and calls `Start()` exactly once, so this is not hit in production. But `StatusBroadcasterTests.Loop_PushesRepeatedly_ThenStopsAfterStopAsync` (line 175) calls `h.Broadcaster.Start()` on a harness whose broadcaster was *not* started by a hidden path — that one is fine — yet the class remains trivially misusable, and `IAsyncDisposable.DisposeAsync` → `StopAsync` would only ever clean up one of two loops. + +Recommendation: add a one-line guard — `Interlocked.CompareExchange` on an `int _started`, or `if (_loop is not { Status: TaskStatus.RanToCompletion or ... } ) ...` — and either throw or no-op on the second call. An `Interlocked` flag is the cleanest. Pair it with a brief test asserting a second `Start()` does not double the push rate. + +## Minor findings + +### N1. Stale comment in `ConfigReconciler` claims `GetOrCreate` preserves the armed flag + +`ConfigReconciler.cs:364-365`: *"Rebuild the capture for the new tag set; GetOrCreate preserves the armed flag so an open detail page keeps capturing across the reload."* + +`TagCaptureRegistry.GetOrCreate` (`TagCaptureRegistry.cs:33-37`) no longer does this — the update delegate `(_, _) => new TagValueCapture(map.All)` builds a fresh **disarmed** capture and copies nothing. The XML doc on `GetOrCreate` itself (lines 27-31) correctly describes the new behavior ("The rebuilt capture is disarmed; StatusBroadcaster re-arms it on its next push cycle"). The `ConfigReconciler` comment is the stale one and now contradicts the code. It is the only surviving reference to the old armed-flag-preservation design. + +Impact: misleading to a future maintainer; could prompt a "fix" that reintroduces the lost-update race the redesign eliminated. + +Recommendation: update the `ConfigReconciler.cs:364-365` comment to: *"Rebuild the capture for the new tag set; the rebuilt capture is disarmed and StatusBroadcaster re-arms it within one push cycle if the PLC still has a viewer."* + +### N2. `TagValueCapture.Record`'s Record-vs-Disarm re-check still has a narrow stale-slot window + +`TagValueCapture.cs:122-143`. The defensive re-read of `_armed` after the `Volatile.Write` is correct in spirit and closes the common case. But it is not airtight: `Disarm()` does `_armed = false` *then* clears slots (`TagValueCapture.cs:110-112`). Interleaving `Record` against `Disarm`: + +1. `Record`: `_armed` true → passes the gate. +2. `Disarm`: `_armed = false`; begins the slot-clear loop, clears slot `idx`. +3. `Record`: `Volatile.Write(ref _slots[idx], newObs)` — writes *after* Disarm already cleared `idx`. +4. `Record`: re-reads `_armed` → false → `Volatile.Write(ref _slots[idx], null)`. Slot ends null. **Correct.** + +That path self-corrects. The genuinely uncovered ordering is step 3 landing *after* step 4's own null-write would have run — it cannot, single-threaded within `Record` — so `Record` is in fact safe for its own write. The residual risk is purely that `Record` is called *concurrently from two threads* for the same address (e.g. an FC03 backend response and an FC06 upstream write racing on the same tag): both pass the gate, both write, the re-check on the later writer can null the slot even though the capture is armed, dropping that observation until the next traffic. This is a lost *update*, not stale data, and the slot self-heals on the next PDU. + +Impact: under concurrent same-address traffic with a near-simultaneous disarm, a single observation can be dropped. Functionally negligible for a debug view; flagged only for completeness since the prior review called this race out and it is narrowed but not fully eliminated. + +Recommendation: no code change required — document in the `Record` comment that the guarantee is "armed captures never retain stale data after disarm," not "every observation while armed is retained." If stricter retention is ever wanted, `Disarm` would need to set `_armed = false` *after* clearing slots and `Record` would re-validate, but that inverts a different race; the current trade-off is the right one for a debug view. + +### N3. No test exercises a reconnect that drops the *new* connection first + +`StatusHubTests.Reconnect_SameTab_NewConnection_DoesNotLeakViewer` (line 50) and `PlcSubscriptionTrackerTests.SameTab_TwoConnections_StaysActiveUntilLastConnectionGone` (line 27) both cover the reconnect-overlap case where the *old* connection's `OnDisconnectedAsync` fires last. Neither covers the mirror ordering (new connection's disconnect arrives before the old one's), nor the case where `SubscribePlc` for the new connection arrives *before* `RemoveConnection` for the old one but with a different `tabId` (e.g. a hard page reload that generates a fresh `tabId`). The tracker handles all of these correctly by construction, but the test suite asserts only one of the orderings. + +Recommendation: add a test where two connections of the same tab are removed in reverse order, and one where a page reload (new `tabId`, old tab still has a live connection mid-teardown) is shown not to leak. + +### N4. `SubscribePlc` registers the tracker entry before the group join — a throw in `AddToGroupAsync` leaves a tracked viewer with no group membership + +`StatusHub.cs:46-54`. The current order is deliberate (comment lines 48-51: register with the tracker first so this connection's own `OnDisconnectedAsync` sees consistent state). But if `Groups.AddToGroupAsync` throws (transport fault mid-invocation), the tracker now holds a subscription for a connection that is not in the PLC's SignalR group. The capture for that PLC gets armed by the next `ReconcileArmed`, but the broadcaster pushes detail messages to a group the connection never joined, so the page receives nothing while the capture is needlessly armed — until `OnDisconnectedAsync` eventually fires for that connection and `RemoveConnection` cleans it up. Self-healing (the disconnect callback always runs eventually) and the capture cost is bounded, so this is Minor — but the asymmetry is real. + +Recommendation: either accept it (the disconnect callback is the backstop, and `OnDisconnectedAsync` is guaranteed to run for any connection that completed `OnConnectedAsync`) and add a one-line comment to that effect, or wrap the body so a failed `AddToGroupAsync` rolls back the tracker entry via `RemoveConnection`. The accept-and-document option is sufficient. + +### N5. `StatusBroadcaster` swallows every non-OCE push exception with no rate limiting + +`StatusBroadcaster.cs:96-130`. Per-stage `try/catch` correctly isolates a snapshot-build failure, a fleet-push failure, and each per-PLC push failure, and the `[LoggerMessage]` events (`mbproxy.admin.broadcast.{snapshot,fleet,detail}.failed`, EventIds 72-74) match `LogEvents.md`. Good. But a persistently failing push (e.g. a wedged SignalR transport) logs one `Error` event *per cycle* — at the default 1 s cadence that is 86,400 error log lines/day for one stuck PLC. `LogEvents.md` for `mbproxy.admin.broadcast.detail.failed` says "Sustained occurrences ... mean that PLC's detail page is not receiving live updates" — so the volume is by design as an operator signal, but it can drown the rolling file. + +Recommendation: consider logging the first failure at `Error` and subsequent consecutive failures at `Debug` (or every Nth), resetting on the first success. Minor — the documented contract permits the current behavior, so this is an enhancement, not a defect. + +### N6. `DebugDtoSerializationTests` re-creates the hub options by hand rather than asserting against the real configuration + +`DebugDtoSerializationTests.cs:18-19` builds `new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase }` and comments that this is "The exact policy AdminEndpointHost configures." It is a *copy* of `AdminEndpointHost.cs:200-202`, not a reference to it — if someone edits the `AddJsonProtocol` lambda (adds `DefaultIgnoreCondition`, changes the policy, etc.) the test still passes against its stale copy. `HubStatusE2ETests` does exercise the real hub end-to-end and asserts `captureArmed`/`service`/`plcs` appear, which is the genuine guard; the unit test is a weaker duplicate. + +Recommendation: low priority — either delete the unit test as redundant with the E2E coverage, or extract the `JsonSerializerOptions` configuration into a shared static (e.g. `AdminEndpointHost.HubJsonOptions`) that both the production `AddJsonProtocol` call and the test reference, so they cannot drift. + +### N7. Doc drift: `StatusPage.md` does not state the detail payload travels only over SignalR, and `LogEvents.md` `EventId` 75 wording is fine but the broadcaster's `mbproxy.admin.broadcast.*` family is not cross-referenced from the doc's source list + +`docs/Operations/StatusPage.md` "Debug View Data" (lines 313-331) documents every `PlcDetailResponse` field but never says the payload is reachable *only* over the `/hub/status` SignalR feed and has no `GET` route — a scraper author could reasonably expect a `/plc/{name}.json` twin. Prior review N7 raised exactly this and it is still open. Separately, `StatusPage.md:23-25` describes the bind-failure and `AdminPort` hot-reload behavior accurately, and `LogEvents.md` lists the four `mbproxy.admin.broadcast.*` events correctly — no defect there, just confirming. + +Recommendation: add one sentence to `StatusPage.md`'s "Debug View Data" section: *"`PlcDetailResponse` is delivered only over the `/hub/status` SignalR feed (the `\"plc\"` message); there is no `GET` route for it, and it is serialized through the SignalR JSON protocol, not `StatusJsonContext`."* + +## What looks good + +- **The prior Critical SignalR capture-leak (2026-05-15 C1/C2) is genuinely fixed.** `PlcSubscriptionTracker` is keyed on `tabId`, not `ConnectionId`; a reconnect is "same tab acquires a new connection," so the viewer count cannot leak across a transport drop. `StatusHub` no longer arms/disarms captures — `OnConnectedAsync` is not even overridden — and arm/disarm flows through `StatusBroadcaster.PushOnceAsync → TagCaptureRegistry.ReconcileArmed`, a single-threaded once-per-cycle authority. `ReconcileArmed` reconciles *all* captures against the live `ActivePlcs()` set every cycle, so any stale arm state self-heals within one `AdminPushIntervalMs`. The `StatusHubTests` / `PlcSubscriptionTrackerTests` reconnect and concurrency-stress tests cover the core invariant. +- **The prior TagCapture Critical (cache-hit debug-view freeze) is fixed.** `PlcMultiplexer.cs:653-666` captures `TagValueObservation`s into `CacheEntry.CapturedTags` at store time, and the cache-hit path (`PlcMultiplexer.cs:844-850`) replays them into the armed capture re-stamped to the hit time. The debug view now reflects cache-served reads. +- **No XSS / output-encoding surface.** The admin layer renders no server-side HTML. `/` and `/plc/{name}` serve static embedded `index.html` / `plc.html` byte blobs (`ServeHtmlShell`); `name` is bound but unused and read client-side. `/status.json` and the SignalR payloads are JSON-serialized, never string-interpolated. There is nowhere for attacker-controlled input to reach an HTML response. +- **Endpoint is strictly read-only.** Routes are `GET /`, `GET /plc/{name}`, `GET /assets/{path}`, `GET /status.json`, and the hub. The hub exposes only `SubscribeFleet` / `SubscribePlc` (group joins + tracker mutation) and `OnDisconnectedAsync`. No mutation route, no control action, no log download. +- **Asset serving is traversal-safe.** `AdminEndpointHost.cs:221` rejects `/`, `\`, and `..` in the path segment before any resource lookup; assets are embedded manifest resources, so there is no filesystem surface. Bytes and misses are cached in a `static ConcurrentDictionary`. +- **Bind-failure cleanup is now complete (prior M5/M6 fixed).** `StartAppAsync` declares `app` outside the `try` and its `catch` (`AdminEndpointHost.cs:258-280`) best-effort disposes the `_broadcaster` and stops/disposes the partially-started Kestrel `app`, so a throw after `app.StartAsync` no longer leaks a bound listener or a running push loop. +- **Lifecycle / hot-reload of `AdminPort` is correct.** `_broadcaster` is constructed per `StartAppAsync` and disposed before Kestrel stops in `StopCurrentAppAsync`; the `OnChange` callback is guarded by the `_disposed` flag both before queueing and after the threadpool picks it up, and re-checks `newPort == _currentPort` under `_lock`. `DisposeAsync` is idempotent via `_disposed`. CTS/semaphore disposal is race-guarded. +- **`StatusBroadcaster` resource hygiene.** `_cts` is disposed in `DisposeAsync` after `StopAsync` awaits the loop; `StopAsync` carries a `_stopped` flag so a double-stop (DisposeAsync after an explicit StopAsync) does not touch the disposed CTS or re-cancel. `LoopAsync` re-reads `AdminPushIntervalMs` each cycle (hot-reloadable) and floors it at 100 ms; it pushes before delaying so a freshly-connected dashboard populates immediately. +- **`TagValueCapture` torn-read safety holds.** `TagValueObservation` is an immutable `sealed record`; slots are reference-typed and only swapped via `Volatile.Write` / read via `Volatile.Read`; `_armed` is `volatile`; the disarmed hot path is a single volatile-bool read after one `?.` null-check. `FrozenDictionary` is built once and never mutated. +- **`StatusSnapshotBuilder` is lock-free and side-effect-free**, degrades gracefully for an unknown PLC (`BuildDebug` returns a disarmed empty snapshot), and the `CounterSnapshot` fallback for a not-yet-started supervisor is exhaustive. +- **`StatusBroadcaster.PushOnceAsync` indexes the fleet snapshot into a `Dictionary` once per cycle** (`StatusBroadcaster.cs:113-114`) — the prior review's O(N²) `FirstOrDefault`-per-PLC concern is resolved. diff --git a/mbproxy/codereviews/2026-05-16/ConfigAndHosting.md b/mbproxy/codereviews/2026-05-16/ConfigAndHosting.md new file mode 100644 index 0000000..2c8942e --- /dev/null +++ b/mbproxy/codereviews/2026-05-16/ConfigAndHosting.md @@ -0,0 +1,157 @@ +# Code Review — Configuration, Options, Hosting, Diagnostics + +**Scope:** `src/Mbproxy/Configuration/`, `src/Mbproxy/Options/`, `src/Mbproxy/Diagnostics/`, `src/Mbproxy/HostingExtensions.cs`, `Program.cs`, `ServiceCounters.cs` (with `ProxyWorker.cs` and `PlcListenerSupervisor.cs` read for context). +**Branch:** `mbproxy-webui-dashboard` @ `0308490`. +**Date:** 2026-05-16. + +## Summary + +The configuration/hosting subsystem is in good shape overall: the hot-reload reconciler is cleanly structured, the debounce loop and `_applySemaphore` serialization are correct and leak-free, the diagnostic-sink platform selection is well factored and unit-testable, and `ServiceCounters` uses `Interlocked` correctly. Prior-review remediations appear intact. The most consequential issue found is that `ReloadValidator` is **never invoked at startup** — it runs only inside `ConfigReconciler.ApplyUnderLockAsync` — so an `appsettings.json` that has duplicate `ListenPort`s, an `AdminPort` collision, a bad keepalive cross-field relationship, or a non-positive timeout will start the service in a broken state instead of failing fast, directly contradicting the documented contract. A second real bug: the `Restart` reconcile path removes a supervisor from the dictionary *before* rebuilding it, so a transient failure during rebuild silently drops the PLC with no listener and no recovery. The remaining findings are validation-completeness gaps and maintainability items. + +**Findings count:** Critical 0 · Major 5 · Minor 6 + +--- + +## Major + +### M1 — `ReloadValidator` never runs at startup; cross-PLC config errors are not fail-fast + +**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:232` (only call site); `src/Mbproxy/Proxy/ProxyWorker.cs:91-180`; `src/Mbproxy/HostingExtensions.cs:21-43`; `docs/Operations/Configuration.md:284`. + +`ReloadValidator.Validate` is called from exactly one place — `ApplyUnderLockAsync` (`ConfigReconciler.cs:232`). The startup path is `ProxyWorker.ExecuteAsync`, which builds per-PLC tag maps directly via `BcdTagMapBuilder.Build` (`ProxyWorker.cs:101`) and never calls `ReloadValidator`. The only schema validation that runs at startup is `MbproxyOptionsValidator` via `.ValidateOnStart()` (`HostingExtensions.cs:21-26`), and that validator does **not** check the cross-PLC rules. + +Consequences at startup (none of these are caught): +- **Duplicate `ListenPort`** across two PLCs — both supervisors try to bind the same port; one wins, the other enters the *infinite* `ListenerRecovery` loop forever. No fail-fast, no clear error. +- **`AdminPort` colliding with a `ListenPort`** — same silent-conflict outcome. +- **Duplicate PLC `Name`** — `ProxyWorker.cs:125` and `:179` write `_supervisors[plc.Name] = ...`, so the second entry silently overwrites the first; one configured PLC simply never gets a listener and its supervisor object leaks (created at `:165`, overwritten, never started/disposed). +- **Bad keepalive cross-field rule** (`BackendHeartbeatIdleMs <= BackendRequestTimeoutMs`) — only enforced in `ReloadValidator` (`ReloadValidator.cs:163`), so a fresh install with this mistake produces a continuously-firing heartbeat. + +`docs/Operations/Configuration.md:284` explicitly states "`ReloadValidator.Validate` runs on every config load (startup and hot reload) … On rejection at startup, the service exits non-zero." That is false for the current code. + +**Impact:** A config mistake that should abort startup instead produces a half-working fleet that "looks" up. The most likely operational mistakes (port typos, copy-pasted PLC blocks) are exactly the ones not caught. + +**Recommendation:** Run `ReloadValidator.Validate` against `_options.CurrentValue` at the top of `ProxyWorker.ExecuteAsync` (before building any supervisor) and, on failure, log `mbproxy.config.reload.rejected` (or a startup-specific event) and fail the host (`throw` / `StopApplication`). Alternatively, fold the cross-PLC rules into `MbproxyOptionsValidator` so `.ValidateOnStart()` covers them. Either way, make startup and reload share one validation gate, and fix the doc. + +--- + +### M2 — `Restart` reconcile path drops the PLC if the rebuild throws + +**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:286-340`. + +In the restart branch, the old supervisor is removed from the dictionary first (`TryRemove(name, out var old)`, `:292`), then stopped/disposed, then a fresh context + supervisor is built (`:302-330`) and re-inserted (`:332`). The whole body is wrapped in `try { … } catch (Exception ex) { _logger.LogError(...) }` (`:335-339`). + +If anything between the `TryRemove` and the `_supervisors[name] = newSupervisor` re-insert throws — e.g. `BcdTagMapBuilder.Build` faults, `PolicyFactory`/`PerPlcContext` construction throws, an `OOM`, or `StopAsync` on the old supervisor throws something other than what it swallows — the catch logs and returns, but the dictionary no longer contains `name`. The PLC now has **no listener at all** and nothing ever recovers it: the next reload's `ReloadPlan.Compute` sees `name` present in both `current.Plcs` and `next.Plcs` (because `_currentOptions` is updated to `next` at `:437` regardless), so it lands in "unchanged" or `ToReseat`/`ToRestart` again only if options differ — an identical subsequent save produces *no* plan entry, leaving the PLC permanently dark until a manual config edit forces a restart. + +The same structural risk exists in the `Add` path (`:388-431`), but there it is benign: a failed add just means the PLC was never there to begin with, and a subsequent reload will re-`Add` it because it is still absent from `_currentOptions`. + +**Impact:** A transient fault during a hot-reload restart permanently removes a PLC from service with only an Error log line. For a 54-PLC fleet this is a silent single-line outage. + +**Recommendation:** Build the new context + supervisor *before* removing/stopping the old one (or into a local first), and only swap into the dictionary once construction succeeded. On a build failure, leave the old supervisor in place and surface the error. At minimum, in the `catch`, if `name` is no longer in `_supervisors`, re-queue the PLC for `Add` on the next pass or roll `_currentOptions[name]` back to the old value so the next `Compute` re-detects it. + +--- + +### M3 — Startup-built supervisors and reconciler-built supervisors do not use the same `MaxParties` / coalescing live-config path consistently with the rest of `Resilience` + +**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:281-330, 383-421`; `src/Mbproxy/Proxy/ProxyWorker.cs:138-141`. + +The backend-connect Polly pipeline is rebuilt inside `ApplyUnderLockAsync` from `next.Resilience.BackendConnect` (`:282-284`, `:384-386`) and the listener-recovery pipeline from `next.Resilience.ListenerRecovery` (`:314`, `:405`) — so add/restart supervisors *do* pick up reloaded `BackendConnect`/`ListenerRecovery` values. But existing supervisors that land in `ToReseat` (or in neither bucket) keep the pipelines built at startup (`ProxyWorker.cs:139`). A reload that changes only `Resilience.BackendConnect.MaxAttempts` or `Resilience.ListenerRecovery.*` therefore takes effect for *added/restarted* PLCs but silently does **not** propagate to the majority of PLCs that were not otherwise touched. The hot-reload propagation tables in `docs/Features/HotReload.md:45-58` and `docs/Operations/Configuration.md:423-435` list `ReadCoalescing` and `Backend*TimeoutMs` but say nothing about `Resilience.BackendConnect`/`ListenerRecovery` — operators will reasonably assume they hot-reload like everything else. + +**Impact:** Inconsistent reload semantics: the same key behaves differently depending on whether a PLC happened to be added/restarted in the same save. Hard to diagnose ("I changed `MaxAttempts`, three PLCs honour it, fifty-one don't"). + +**Recommendation:** Either (a) document explicitly that `Resilience.BackendConnect` and `Resilience.ListenerRecovery` require a service restart (like `AdminPort`), and stop rebuilding them in the reconciler so behaviour is uniform; or (b) make them genuinely hot-reloadable by threading a live accessor (as already done for `ReadCoalescing`/`Keepalive`) so all supervisors re-read them. Option (a) is the smaller, safer change. + +--- + +### M4 — `MbproxyOptionsValidator` does not validate `ListenPort` range or `Host`, leaving binding edge cases uncaught at schema time + +**File:** `src/Mbproxy/Options/MbproxyOptions.cs:62-144`; `src/Mbproxy/Options/PlcOptions.cs`. + +`MbproxyOptionsValidator` validates `Width`, cache TTLs, cache-size knobs, connection timeouts, `AdminPushIntervalMs`, and keepalive ranges — but never `PlcOptions.ListenPort` (range/uniqueness), `PlcOptions.Name` (empty/unique), `PlcOptions.Host` (empty), `PlcOptions.Port` (range), or `AdminPort` (range/collision). Those live only in `ReloadValidator`. Given M1 (ReloadValidator not run at startup), this is the *only* validator on the startup path, so: +- `ListenPort = 0` (the C# default when the key is omitted, see `PlcOptions.cs:6`) passes schema validation. The supervisor then tries to bind port 0, which the OS interprets as "pick an ephemeral port" — the listener binds a random port and clients can never reach it. No error is raised anywhere. +- An empty `Host` produces a `SocketException`/`ArgumentException` only at first backend connect, surfacing as a recoverable runtime fault rather than a config rejection. + +`docs/Operations/Configuration.md:130-135` documents `Name`, `ListenPort`, `Host` as **required** — but nothing enforces "required" at bind time. + +**Impact:** A PLC block missing `ListenPort` or `Host` starts a useless listener with no diagnostic. Combined with M1, the only realistic guard (`ReloadValidator`) is never consulted at startup. + +**Recommendation:** Add `ListenPort ∈ [1,65535]`, non-empty `Name`, non-empty `Host`, and `Port ∈ [1,65535]` checks to `MbproxyOptionsValidator` (it already iterates `options.Plcs`), or — preferably — resolve M1 so `ReloadValidator` runs at startup and consider whether the two validators should be merged to remove the consistency burden entirely. + +--- + +### M5 — `Resilience.BackendConnect.BackoffMs` length is never validated against `MaxAttempts` + +**Files:** `src/Mbproxy/Options/ResilienceOptions.cs:19-29`; `src/Mbproxy/Options/MbproxyOptions.cs:62-144`; `src/Mbproxy/Configuration/ReloadValidator.cs`; `src/Mbproxy/Proxy/Supervision/PolicyFactory.cs:43-75`. + +`docs/Operations/Configuration.md:220` states `BackoffMs` "Must have `MaxAttempts - 1` entries." Nothing enforces this. Neither `MbproxyOptionsValidator` nor `ReloadValidator` looks at `RetryProfile.MaxAttempts`, `BackoffMs`, `RecoveryProfile.InitialBackoffMs`, or `SteadyStateMs` at all. `PolicyFactory.BuildBackendConnect` (`:46`) does `Math.Max(1, MaxAttempts)` and clamps the backoff index to the last element (`:59-61`) — so it is *crash-safe*, but a config with `MaxAttempts = 5` and a single `BackoffMs` entry, or negative backoff values, silently produces behaviour that does not match the operator's intent (negative ms become `TimeSpan` that Polly may treat as zero/throw depending on version). `MaxAttempts = 0` is silently coerced to 1. + +**Impact:** Low-blast-radius but a documented invariant ("must have `MaxAttempts - 1` entries") is unenforced, and a negative backoff or zero `MaxAttempts` is accepted without warning. Doc and code disagree. + +**Recommendation:** Add a validation rule (in whichever validator survives M1's consolidation): `MaxAttempts >= 1`; `BackoffMs.Count >= MaxAttempts - 1`; every `BackoffMs`/`InitialBackoffMs` entry `>= 0`; `SteadyStateMs > 0`. Or, if the clamp-and-coerce behaviour is genuinely intended, relax the doc wording to match. + +--- + +## Minor + +### N1 — `ApplyUnderLockAsync` logs `mbproxy.config.reload.applied` and bumps the success counter even when every step failed + +**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:335-443`. + +By design, a step that throws is caught and logged and the loop continues (`:335`, `:373`, `:426`), then step 7 unconditionally sets `_currentOptions = next`, calls `RecordReloadApplied`, and logs `applied` (`:437-441`). If, say, every restart in `ToRestart` threw, the operator sees `config.reloadCount` increment and an INFO `mbproxy.config.reload.applied` line — with no signal that the apply was partial or total-failure. The `Plcs*` counts in the event are *planned* counts (`:243-246`), not *achieved* counts. + +**Impact:** Status page and logs report a clean reload when reconciliation actually failed. Misleading during incident triage. + +**Recommendation:** Track per-step failures (a counter or list). If any step threw, log at `Warning` with a "partial" qualifier (or emit a distinct event) and either skip `RecordReloadApplied` or expose a `reloadPartialCount`. At minimum, include achieved-vs-planned counts in the event. + +### N2 — Reseat step swaps `_currentOptions` membership assumptions but uses `next.Plcs.First(...)` which throws inside the loop + +**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:353`. + +`var plcNew = next.Plcs.First(p => p.Name == name);` — `ReloadPlan.Compute` guarantees `name` exists in `next`, so this will not throw in practice, but `First` with no match throws `InvalidOperationException`, which the surrounding `catch` (`:373`) would log as a generic "Error reseating context". A `FirstOrDefault` with an explicit null-guard, or passing the `PlcOptions` through `ReloadPlan.ToReseat` alongside the map, would make the invariant explicit and the failure mode obvious. Minor robustness/clarity. + +**Recommendation:** Extend `ToReseat` tuple to `(string Name, PlcOptions New, BcdTagMap NewMap)` so the reconciler never re-resolves by name. + +### N3 — `ComputeGlobalTagDelta` throws on a duplicate global address; `ReloadPlan.Compute` indexes PLCs with `ToDictionary` likewise + +**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:465-466`; `src/Mbproxy/Configuration/ReloadPlan.cs:39-40, 83-84`. + +`before.Global.ToDictionary(t => t.Address)` (`:465`) throws `ArgumentException` if `BcdTags.Global` contains two entries at the same address. `BcdTagMapBuilder` detects duplicate addresses as a validation error, and `ReloadValidator` rejects such a snapshot — *but only `next`*. `ComputeGlobalTagDelta` is called with `_currentOptions.BcdTags` as `before` (`:249`); `_currentOptions` was validated when it became current, so this is safe today. It is, however, a latent fragility: any future path that sets `_currentOptions` without full validation, or a `Global` list with duplicate addresses that `BcdTagMapBuilder` happens not to flag, turns a cosmetic delta computation into an unhandled exception inside `ApplyUnderLockAsync` (caught only by the debounce-loop catch-all, aborting the whole apply). Same applies to `ReloadPlan.Compute`'s `ToDictionary(p => p.Name)` — safe only because `next` was validated first. + +**Recommendation:** Use `ToDictionary` with a last-wins lambda or `DistinctBy`, or wrap `ComputeGlobalTagDelta` defensively. It is a pure cosmetic counter — it should never be able to abort a reload. + +### N4 — `EventLogBridge` caches `_sourceExists` at construction; a post-install source registration never takes effect, but the doc oversells "no per-event registry traffic" + +**File:** `src/Mbproxy/Diagnostics/EventLogBridge.cs:62-80`. + +Behaviour is correct and intentional (documented at `Configuration.md:100`). One nuance: `Emit` still wraps `EventLog.WriteEntry` in a `try/catch` (`:101-109`) that swallows everything — so if the source is deleted *after* startup, every Error+ event silently no-ops with zero diagnostics. That is acceptable for a logging sink (must never recurse/crash), but there is no `SelfLog.WriteLine` breadcrumb as `SyslogBridge` has (`SyslogBridge.cs:46`). For symmetry and debuggability, emit one `SelfLog` line the first time a write fails. + +**Recommendation:** Add a one-shot `SelfLog.WriteLine` on first write failure in `EventLogBridge.Emit` to match `SyslogBridge`'s degradation breadcrumb. + +### N5 — `ConfigReconciler.Dispose` blocks up to 2 s synchronously on `_debounceLoop.Wait` + +**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:485-501`. + +`Dispose` cancels `_disposalCts` and then `_debounceLoop.Wait(TimeSpan.FromSeconds(2))`. Since `ConfigReconciler` is a DI singleton, this runs on the host's shutdown path (`ServiceProvider.Dispose`). If an apply is mid-flight when shutdown begins, the loop will not observe cancellation until the current `ApplyAsync` returns (the apply's per-step CTS are linked to `ct` = `_disposalCts.Token`, so they *do* cancel — good), so 2 s is normally ample. But a hung supervisor `StopAsync` inside an apply could make `Dispose` block for the full 2 s and then *abandon* a still-running loop thread. Acceptable, but worth a comment that the 2 s is a hard cap and the loop may be abandoned; also consider implementing `IAsyncDisposable` so the host can await it cleanly rather than block a thread. + +**Recommendation:** Document the 2 s cap and abandonment behaviour, or convert to `IAsyncDisposable`. + +### N6 — Doc drift: `Configuration.md:284` claims startup rejection, and the validator-ordering claims in HotReload.md are slightly off + +**Files:** `docs/Operations/Configuration.md:284, 314-318`; `docs/Features/HotReload.md:81, 95`. + +Beyond M1's headline doc bug: `HotReload.md:81` asserts "There is no second validator that could disagree with the first" and `:95` says "The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording." In reality there *are* two validators (`MbproxyOptionsValidator` and `ReloadValidator`) with overlapping-but-not-identical rules and different wording (e.g. `MbproxyOptionsValidator` says "exceeds the 60_000 ms safety cap" while `ReloadValidator` says "exceeds 60_000 ms without Cache.AllowLongTtl=true"). The "same error wording" claim is inaccurate. `Configuration.md:298` rule 7 ("Width … enforced by `MbproxyOptionsValidator`") correctly notes Width is *not* in `ReloadValidator` — which means a hot reload that somehow bound a bad Width would not be caught by `ReloadValidator` (it relies on `BcdTagMapBuilder` to reject it; verify that builder does). + +**Recommendation:** Once M1/M4 are resolved (ideally by consolidating to one validator), rewrite these doc passages to describe the actual single gate. If two validators remain, drop the "same error wording" claim. + +--- + +## Items checked and found correct (no finding) + +- **Debounce loop** (`ConfigReconciler.cs:165-218`): the linked-CTS-per-iteration pattern is correct; `debounceCts` is `using`-scoped per loop iteration so no CTS leak; the `OperationCanceledException when (!ct.IsCancellationRequested)` filter correctly distinguishes window-elapsed from disposal. +- **`_applySemaphore` serialization** (`:150-161`): correct `WaitAsync`/`try`/`finally`/`Release`; disposed once in `Dispose`. +- **Channel** (`:73-77`): bounded capacity 1 with `DropOldest` is the right choice for a coalesced "something changed" signal; no unbounded growth. +- **`OnChange` registration** (`:105-111`): non-blocking `TryWrite`, registration disposed in `Dispose` (`:487`). +- **`ServiceCounters`** (`ServiceCounters.cs`): all reads/writes via `Interlocked`; the ticks-as-long pattern for `LastReloadUtc` is sound; `CompareExchange(ref x, 0, 0)` as a volatile read is correct. +- **`DiagnosticSinkSelector.Select`** (`DiagnosticSinkSelector.cs:51-59`): pure, correct precedence (Windows wins, `isSystemd` only consulted off-Windows); the `OperatingSystem.IsWindows()` re-guard at `HostingExtensions.cs:101` correctly satisfies the platform analyzer for `[SupportedOSPlatform("windows")]` `EventLogBridge`. +- **Graceful shutdown wiring** (`ProxyWorker.cs:288-351`): `GracefulShutdownTimeoutMs` read live from `CurrentValue`; in-flight snapshot taken before `base.StopAsync`; admin stopped last; supervisors disposed last. Ordering matches the documented contract. +- **`ReloadPlan.Compute`** (`ReloadPlan.cs`): identity keyed on `Name`, `TagMapsEqual` includes `CacheTtlMs`, reseat-vs-restart split is correct. +- **Reseat counter preservation** (`ConfigReconciler.cs:359`): `supervisor.CurrentCounters` passed into the new context — correct, matches `HotReload.md:117`. diff --git a/mbproxy/codereviews/2026-05-16/Frontend.md b/mbproxy/codereviews/2026-05-16/Frontend.md new file mode 100644 index 0000000..3d1f4ba --- /dev/null +++ b/mbproxy/codereviews/2026-05-16/Frontend.md @@ -0,0 +1,160 @@ +# Frontend / Live Web Dashboard — Code Review + +Scope: commit `0308490` ("close out the dashboard code-review minor findings"). Files reviewed under `src/Mbproxy/Admin/wwwroot/`: `index.html`, `plc.html`, `dashboard.js`, `detail.js`, `util.js`, `theme.css`, `dashboard.css`, `detail.css`. Vendored assets out of scope. Cross-checked against `docs/Operations/StatusPage.md`, `src/Mbproxy/Admin/StatusDto.cs`, `src/Mbproxy/Admin/DebugDto.cs`, `src/Mbproxy/Admin/StatusBroadcaster.cs`, and `src/Mbproxy/Admin/StatusHub.cs`. Prior findings from `codereviews/2026-05-15/Frontend.md` were verified: the XSS on `detail.js` (C1 in that review) and the `shortTime`/`escapeHtml` gap are confirmed fixed. The cold-start retry loop, URL-decode guard, NaN guards, accessibility attributes, and `util.js` extraction are all verified resolved. This review covers the current code fresh. + +**Finding counts: 1 Critical, 1 Major, 4 Minor.** + +--- + +## Critical Findings + +### C1 — `dashboard.js` `stateChip` renders `listener.state` into `innerHTML` without escaping +`src/Mbproxy/Admin/wwwroot/dashboard.js:48` + +**What is wrong.** `stateChip(state)` in `dashboard.js` returns: +```js +return `${state}`; // line 48 +``` +The `state` value is a server-supplied string from `plc.listener.state`, interpolated raw into an `innerHTML`-bound template string with no `escapeHtml` call. It flows into `innerHTML` at line 166: +```js +${stateChip(plc.listener.state)} +``` + +**Inconsistency with `detail.js`.** The identical `stateChip` function in `detail.js` was fixed by the prior review and now correctly uses `escapeHtml(state)` at line 76: +```js +return `${escapeHtml(state)}`; +``` +The fix was applied only to `detail.js` and was not propagated back to `dashboard.js`. + +**Impact.** `listener.state` is a `string` on the SignalR wire. Its values in the current implementation are the closed set `"bound"` / `"recovering"` / `"stopped"`, produced by the supervisor state machine — so the immediate exploitability is limited. However, the frontend must not rely on server-side type discipline for its own XSS safety (same rationale as the prior C1 finding). Under the same threat model as the prior review — no authentication on the admin port, PLC names and other state fields originate from operator-editable `appsettings.json` — any path that injects an unexpected string value into `listener.state` (serialization change, a future hot-reload edge case, a compromised service) would execute arbitrary script in every open fleet-dashboard tab. The fleet dashboard is more exposed than the detail page because it is the persistent long-lived view. + +**Fix.** One character change: replace `${state}` with `${escapeHtml(state)}` at `dashboard.js:48`. `escapeHtml` is already imported from `window.mbproxyUtil` at line 18. + +--- + +## Major Findings + +### M1 — `onreconnected` silently swallows subscribe failure; `connected` pill is misleading when the subscribe invoke is rejected +`src/Mbproxy/Admin/wwwroot/dashboard.js:254`, `src/Mbproxy/Admin/wwwroot/detail.js:285` + +**What is wrong.** Both pages' `onreconnected` callbacks fire after `withAutomaticReconnect` restores the transport. They call the hub method (`SubscribeFleet` / `SubscribePlc`) and swallow any error with `.catch(() => {})`: + +```js +// dashboard.js:252-255 +connection.onreconnected(() => { + setConn('connected'); + connection.invoke('SubscribeFleet').catch(() => {}); +}); +``` + +If the hub method throws (hub overloaded, exception in `OnConnectedAsync`, network jitter between the reconnect and the invoke), the connection is live but the subscription is silently dropped. The pill shows `connected` (green) while the page receives zero further updates. There is no retry, no error indication to the operator, and no watchdog that would notice the dead feed. The cold-start path (`connect()`) retries with backoff and guards against re-entering on a live socket; the warm-reconnect path has no equivalent safety net. + +**Impact.** On a typical LAN deployment with stable infrastructure this will be rare, but it is the kind of silent failure that is hard to diagnose. An operator watching a page that appears connected but shows stale counter values has no indication anything is wrong. The detail page additionally does not re-arm `armSnapshotWatchdog()` on the reconnect path, so the 6-second "no data" notice will not fire either if the subscribe silently fails. + +**Fix.** Add a minimal retry for the `onreconnected` subscribe. The simplest form: invoke, and on rejection set the pill to `'disconnected'`/`'retrying'` and call `connect()` (the cold-start function already guards against re-starting a non-Disconnected connection). Alternatively, factor the subscribe into a shared helper used by both the cold-start and warm-reconnect paths, so the retry discipline is written once. + +--- + +## Minor Findings + +### N1 — `detail.js` does not destructure `escapeAttr` from `window.mbproxyUtil`; the attribute-escaping helper is silently unavailable +`src/Mbproxy/Admin/wwwroot/detail.js:23` + +**What is wrong.** `detail.js` imports only `escapeHtml`: +```js +const { escapeHtml } = window.mbproxyUtil; // line 23 — escapeAttr not imported +``` +`escapeAttr` is defined in `util.js` (line 15) and is available on `window.mbproxyUtil`, but `detail.js` never references it. All current attribute-context class interpolations in `detail.js` (the `cls` slots at lines 65, 76; `dirCls` at line 222; `stale.trim()` at line 224) happen to be computed from literal strings, so there is no current XSS. The risk is latent: any future change that passes a server-derived value into a class or other attribute slot in `detail.js` will produce an unescaped-attribute injection with no indication at the call site that the escaping tool is absent. The inconsistency was directly implicated in the prior review's C1 finding (the fix added `escapeHtml` to `detail.js` but did not import `escapeAttr`). + +**Recommendation.** Change line 23 to: +```js +const { escapeHtml, escapeAttr } = window.mbproxyUtil; +``` +No other changes needed — the helper just needs to be in scope so future reviewers and the linter can verify attribute contexts are correctly escaped. + +--- + +### N2 — `prevPdu` and `rateByName` Maps are never pruned; stale entries accumulate across hot-reload PLC removal +`src/Mbproxy/Admin/wwwroot/dashboard.js:11–12, 66–78` + +**What is wrong.** `updateRates(snapshot)` (lines 66–78) only writes entries for PLCs present in the current snapshot; it never deletes entries for PLCs removed by a hot-reload. Over a long session with PLC churn, `prevPdu` and `rateByName` grow without bound. Concretely: `rateByName.size` (line 102) is used to decide whether the fleet PDU/s card shows a rate or `—`; after a PLC is removed its stale entry keeps `rateByName.size > 0`, so the card shows `0` instead of the arguably more correct `—` for a fleet that has shrunk to zero active PLCs. `renderAggregates` already iterates only `s.plcs` (line 91) so the stale rate values are never summed — this is correct — but the Map still grows. + +**Impact.** Negligible in the current 54-PLC deployment with low PLC churn rate. Becomes visible if PLCs are repeatedly added and removed via hot-reload over a long-lived browser session. The `rateByName.size` false-positive is the most user-visible symptom. + +**Recommendation.** At the end of `updateRates`, prune both Maps to the current snapshot's PLC name set: +```js +const currentNames = new Set(snapshot.plcs.map(p => p.name)); +for (const k of prevPdu.keys()) { if (!currentNames.has(k)) prevPdu.delete(k); } +for (const k of rateByName.keys()) { if (!currentNames.has(k)) rateByName.delete(k); } +``` + +--- + +### N3 — Empty `class=""` attribute emitted on non-stale debug rows; `stale.trim()` call is a no-op +`src/Mbproxy/Admin/wwwroot/detail.js:223–224` + +**What is wrong.** The stale/non-stale class is computed as: +```js +const stale = (t.ageSeconds || 0) > 30 ? ' stale' : ''; // line 223 +return ` // line 224 +``` +When the row is not stale, `stale` is `''`, `stale.trim()` is `''`, and the emitted markup is ``. The `.trim()` call is defensive against the leading space in `' stale'`, but a simpler and cleaner expression that also avoids the empty-attribute emission is: +```js +const staleCls = (t.ageSeconds || 0) > 30 ? 'stale' : ''; +return ` +``` +Or simply omit `.trim()` and use `class="${stale}"` — the leading space is harmless in CSS class matching. This is cosmetic. + +**Impact.** Zero functional impact. The empty `class=""` attribute is parsed and ignored by every browser. Flagged as a code-cleanliness observation. + +--- + +### N4 — `window.mbproxyUtil` is destructured at script-top-level; a util.js load failure aborts the entire page silently +`src/Mbproxy/Admin/wwwroot/dashboard.js:18`, `src/Mbproxy/Admin/wwwroot/detail.js:23` + +**What is wrong.** Both scripts destructure `window.mbproxyUtil` at the top of their IIFE, before any DOM or error-handling setup: +```js +const { escapeHtml, escapeAttr } = window.mbproxyUtil; // dashboard.js:18 +const { escapeHtml } = window.mbproxyUtil; // detail.js:23 +``` +If `util.js` fails to load (404 after a publish error, embedded-asset routing bug, script tag typo), `window.mbproxyUtil` is `undefined`, the destructuring throws a `TypeError` at the first `const`, and the remaining script — including `DOMContentLoaded`, error notices, and the SignalR connection — never runs. The page renders its static HTML indefinitely with no visible error. The HTML `