mbproxy: remediate the 2026-05-16 code-review findings

Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-16 18:08:06 -04:00
parent 0308490aef
commit b222362ce0
45 changed files with 1735 additions and 151 deletions
@@ -0,0 +1,110 @@
# Admin SignalR / Tag Capture — Code Review
Scope: `src/Mbproxy/Admin/*.cs` (AdminEndpointHost, StatusBroadcaster, StatusHub, PlcSubscriptionTracker, StatusSnapshotBuilder, StatusPushSink, StatusDto, DebugDto, AssemblyVersionAccessor) plus `src/Mbproxy/Proxy/TagValueCapture.cs` and `src/Mbproxy/Proxy/TagCaptureRegistry.cs`. Branch `mbproxy-webui-dashboard` (HEAD `0308490`). Cross-checked against `docs/Operations/StatusPage.md`, `docs/Reference/LogEvents.md`, the mbproxy `CLAUDE.md`, the prior reviews `codereviews/2026-05-15/{AdminSignalR,TagCapture}.md`, and the consuming code in `ConfigReconciler.cs` / `ProxyWorker.cs` / `PlcMultiplexer.cs`.
## Summary
The two Critical findings from the 2026-05-15 reviews have been genuinely fixed and the fixes hold. The SignalR subscription tracker is now keyed on a stable per-page-load `tabId` (`PlcSubscriptionTracker`), not on the `ConnectionId`, so a transport reconnect can no longer leak a viewer; arm/disarm is funneled through a single authority (`StatusBroadcaster.PushOnceAsync``TagCaptureRegistry.ReconcileArmed`), removing the lost-update race; the hub no longer arms captures at all. The prior XSS surface does not exist in the current admin layer — no server-rendered HTML, no string interpolation into a response; `/plc/{name}` serves a static embedded shell and `name` is read client-side. The cache-hit debug-view freeze (prior TagCapture C1) is fixed by `CacheEntry.CapturedTags` replay. The endpoint is strictly read-only. `M5`/`M6` from the prior review (bind-failure leak) are addressed: `StartAppAsync` now tears down a partially-started `app`/`_broadcaster` in its catch.
Remaining issues are lower severity: a real but bounded reconcile/disarm race against `Snapshot()`, a stale doc comment, a missing `Start()` double-call guard, and assorted test-gap / doc-drift items. No new Critical issues.
Findings by severity: **Critical: 0 · Major: 2 · Minor: 7**
## Major findings
### M1. `ReconcileArmed` disarm can clear a slot between snapshot capture and push — debug view can flicker an extra empty cycle
`StatusBroadcaster.cs:108-130`, `TagCaptureRegistry.cs:55-68`, `TagValueCapture.cs:108-113`.
`PushOnceAsync` calls `ReconcileArmed(activePlcs)` *before* the per-PLC `foreach` that calls `_builder.BuildDebug(plcName)`. For a PLC that is currently armed and still has a viewer this is fine. But the ordering interacts badly with a viewer that leaves *during* a push cycle:
- `activePlcs` is snapshotted at line 108. Suppose `plc-x` is in it.
- `ReconcileArmed` keeps `plc-x` armed.
- The viewer's `OnDisconnectedAsync` runs on a hub thread mid-cycle, removing `plc-x` from the tracker — but the `foreach` still iterates the stale `activePlcs` list and pushes one final detail message to a now-empty group. Benign (empty group no-ops).
The genuinely observable case is the inverse and is a correctness nuance, not a crash: because `ReconcileArmed` runs once per cycle and `BuildDebug` runs in the same cycle, a PLC that becomes active *after* line 108 (a `SubscribePlc` landing between line 108 and the `foreach`) is not in `activePlcs`, so it is neither armed nor pushed this cycle — it self-heals next cycle. That is acceptable. What is worth flagging: `ReconcileArmed` and `BuildDebug` are not consistent within the cycle for a PLC whose arm state flips mid-cycle, and the capture's `Disarm()` clears every slot (`TagValueCapture.cs:111-112`). If a hot-reload `GetOrCreate` rebuild or a `ReconcileArmed` disarm interleaves between `ReconcileArmed` and the `capture.Snapshot()` inside `BuildDebug`, the pushed `PlcDebugSnapshot` can report `CaptureArmed = true` (read at `StatusSnapshotBuilder.cs:54`) with all-empty tag rows, or `CaptureArmed = false` with populated rows.
Impact: a transient one-cycle (≤ `AdminPushIntervalMs`, default 1 s) inconsistency in the debug view — `captureArmed` and the tag rows can momentarily disagree. Not data corruption, not a leak; the next cycle is correct. Worth recording because the prior reviews treated arm-state consistency as the central invariant.
Recommendation: read `capture.IsArmed` and `capture.Snapshot()` once, atomically enough for display — or, simplest, have `BuildDebug` derive `CaptureArmed` from the same `activePlcs` set the broadcaster already holds rather than re-reading `capture.IsArmed`. Pass the active set (or a `bool armed`) into `BuildDebug` instead of letting it independently re-query the capture. That makes the pushed payload internally consistent by construction.
### M2. `StatusBroadcaster.Start()` has no guard against a second call — a double-start orphans a push loop
`StatusBroadcaster.cs:56-57`: `public void Start() => _loop = Task.Run(() => LoopAsync(_cts.Token));`
A second `Start()` overwrites `_loop`, orphaning the first loop task. The first task keeps running against the same `_cts`, so two loops now push the fleet snapshot every interval until cancellation, and `StopAsync` only `await`s the *second* `_loop` — the orphaned first loop is never awaited and its faults are unobserved. The XML doc ("Idempotent only in the sense that it is called once") documents the assumption but nothing enforces it.
Today `AdminEndpointHost` constructs a fresh `StatusBroadcaster` per `StartAppAsync` and calls `Start()` exactly once, so this is not hit in production. But `StatusBroadcasterTests.Loop_PushesRepeatedly_ThenStopsAfterStopAsync` (line 175) calls `h.Broadcaster.Start()` on a harness whose broadcaster was *not* started by a hidden path — that one is fine — yet the class remains trivially misusable, and `IAsyncDisposable.DisposeAsync``StopAsync` would only ever clean up one of two loops.
Recommendation: add a one-line guard — `Interlocked.CompareExchange` on an `int _started`, or `if (_loop is not { Status: TaskStatus.RanToCompletion or ... } ) ...` — and either throw or no-op on the second call. An `Interlocked` flag is the cleanest. Pair it with a brief test asserting a second `Start()` does not double the push rate.
## Minor findings
### N1. Stale comment in `ConfigReconciler` claims `GetOrCreate` preserves the armed flag
`ConfigReconciler.cs:364-365`: *"Rebuild the capture for the new tag set; GetOrCreate preserves the armed flag so an open detail page keeps capturing across the reload."*
`TagCaptureRegistry.GetOrCreate` (`TagCaptureRegistry.cs:33-37`) no longer does this — the update delegate `(_, _) => new TagValueCapture(map.All)` builds a fresh **disarmed** capture and copies nothing. The XML doc on `GetOrCreate` itself (lines 27-31) correctly describes the new behavior ("The rebuilt capture is disarmed; StatusBroadcaster re-arms it on its next push cycle"). The `ConfigReconciler` comment is the stale one and now contradicts the code. It is the only surviving reference to the old armed-flag-preservation design.
Impact: misleading to a future maintainer; could prompt a "fix" that reintroduces the lost-update race the redesign eliminated.
Recommendation: update the `ConfigReconciler.cs:364-365` comment to: *"Rebuild the capture for the new tag set; the rebuilt capture is disarmed and StatusBroadcaster re-arms it within one push cycle if the PLC still has a viewer."*
### N2. `TagValueCapture.Record`'s Record-vs-Disarm re-check still has a narrow stale-slot window
`TagValueCapture.cs:122-143`. The defensive re-read of `_armed` after the `Volatile.Write` is correct in spirit and closes the common case. But it is not airtight: `Disarm()` does `_armed = false` *then* clears slots (`TagValueCapture.cs:110-112`). Interleaving `Record` against `Disarm`:
1. `Record`: `_armed` true → passes the gate.
2. `Disarm`: `_armed = false`; begins the slot-clear loop, clears slot `idx`.
3. `Record`: `Volatile.Write(ref _slots[idx], newObs)` — writes *after* Disarm already cleared `idx`.
4. `Record`: re-reads `_armed` → false → `Volatile.Write(ref _slots[idx], null)`. Slot ends null. **Correct.**
That path self-corrects. The genuinely uncovered ordering is step 3 landing *after* step 4's own null-write would have run — it cannot, single-threaded within `Record` — so `Record` is in fact safe for its own write. The residual risk is purely that `Record` is called *concurrently from two threads* for the same address (e.g. an FC03 backend response and an FC06 upstream write racing on the same tag): both pass the gate, both write, the re-check on the later writer can null the slot even though the capture is armed, dropping that observation until the next traffic. This is a lost *update*, not stale data, and the slot self-heals on the next PDU.
Impact: under concurrent same-address traffic with a near-simultaneous disarm, a single observation can be dropped. Functionally negligible for a debug view; flagged only for completeness since the prior review called this race out and it is narrowed but not fully eliminated.
Recommendation: no code change required — document in the `Record` comment that the guarantee is "armed captures never retain stale data after disarm," not "every observation while armed is retained." If stricter retention is ever wanted, `Disarm` would need to set `_armed = false` *after* clearing slots and `Record` would re-validate, but that inverts a different race; the current trade-off is the right one for a debug view.
### N3. No test exercises a reconnect that drops the *new* connection first
`StatusHubTests.Reconnect_SameTab_NewConnection_DoesNotLeakViewer` (line 50) and `PlcSubscriptionTrackerTests.SameTab_TwoConnections_StaysActiveUntilLastConnectionGone` (line 27) both cover the reconnect-overlap case where the *old* connection's `OnDisconnectedAsync` fires last. Neither covers the mirror ordering (new connection's disconnect arrives before the old one's), nor the case where `SubscribePlc` for the new connection arrives *before* `RemoveConnection` for the old one but with a different `tabId` (e.g. a hard page reload that generates a fresh `tabId`). The tracker handles all of these correctly by construction, but the test suite asserts only one of the orderings.
Recommendation: add a test where two connections of the same tab are removed in reverse order, and one where a page reload (new `tabId`, old tab still has a live connection mid-teardown) is shown not to leak.
### N4. `SubscribePlc` registers the tracker entry before the group join — a throw in `AddToGroupAsync` leaves a tracked viewer with no group membership
`StatusHub.cs:46-54`. The current order is deliberate (comment lines 48-51: register with the tracker first so this connection's own `OnDisconnectedAsync` sees consistent state). But if `Groups.AddToGroupAsync` throws (transport fault mid-invocation), the tracker now holds a subscription for a connection that is not in the PLC's SignalR group. The capture for that PLC gets armed by the next `ReconcileArmed`, but the broadcaster pushes detail messages to a group the connection never joined, so the page receives nothing while the capture is needlessly armed — until `OnDisconnectedAsync` eventually fires for that connection and `RemoveConnection` cleans it up. Self-healing (the disconnect callback always runs eventually) and the capture cost is bounded, so this is Minor — but the asymmetry is real.
Recommendation: either accept it (the disconnect callback is the backstop, and `OnDisconnectedAsync` is guaranteed to run for any connection that completed `OnConnectedAsync`) and add a one-line comment to that effect, or wrap the body so a failed `AddToGroupAsync` rolls back the tracker entry via `RemoveConnection`. The accept-and-document option is sufficient.
### N5. `StatusBroadcaster` swallows every non-OCE push exception with no rate limiting
`StatusBroadcaster.cs:96-130`. Per-stage `try/catch` correctly isolates a snapshot-build failure, a fleet-push failure, and each per-PLC push failure, and the `[LoggerMessage]` events (`mbproxy.admin.broadcast.{snapshot,fleet,detail}.failed`, EventIds 72-74) match `LogEvents.md`. Good. But a persistently failing push (e.g. a wedged SignalR transport) logs one `Error` event *per cycle* — at the default 1 s cadence that is 86,400 error log lines/day for one stuck PLC. `LogEvents.md` for `mbproxy.admin.broadcast.detail.failed` says "Sustained occurrences ... mean that PLC's detail page is not receiving live updates" — so the volume is by design as an operator signal, but it can drown the rolling file.
Recommendation: consider logging the first failure at `Error` and subsequent consecutive failures at `Debug` (or every Nth), resetting on the first success. Minor — the documented contract permits the current behavior, so this is an enhancement, not a defect.
### N6. `DebugDtoSerializationTests` re-creates the hub options by hand rather than asserting against the real configuration
`DebugDtoSerializationTests.cs:18-19` builds `new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase }` and comments that this is "The exact policy AdminEndpointHost configures." It is a *copy* of `AdminEndpointHost.cs:200-202`, not a reference to it — if someone edits the `AddJsonProtocol` lambda (adds `DefaultIgnoreCondition`, changes the policy, etc.) the test still passes against its stale copy. `HubStatusE2ETests` does exercise the real hub end-to-end and asserts `captureArmed`/`service`/`plcs` appear, which is the genuine guard; the unit test is a weaker duplicate.
Recommendation: low priority — either delete the unit test as redundant with the E2E coverage, or extract the `JsonSerializerOptions` configuration into a shared static (e.g. `AdminEndpointHost.HubJsonOptions`) that both the production `AddJsonProtocol` call and the test reference, so they cannot drift.
### N7. Doc drift: `StatusPage.md` does not state the detail payload travels only over SignalR, and `LogEvents.md` `EventId` 75 wording is fine but the broadcaster's `mbproxy.admin.broadcast.*` family is not cross-referenced from the doc's source list
`docs/Operations/StatusPage.md` "Debug View Data" (lines 313-331) documents every `PlcDetailResponse` field but never says the payload is reachable *only* over the `/hub/status` SignalR feed and has no `GET` route — a scraper author could reasonably expect a `/plc/{name}.json` twin. Prior review N7 raised exactly this and it is still open. Separately, `StatusPage.md:23-25` describes the bind-failure and `AdminPort` hot-reload behavior accurately, and `LogEvents.md` lists the four `mbproxy.admin.broadcast.*` events correctly — no defect there, just confirming.
Recommendation: add one sentence to `StatusPage.md`'s "Debug View Data" section: *"`PlcDetailResponse` is delivered only over the `/hub/status` SignalR feed (the `\"plc\"` message); there is no `GET` route for it, and it is serialized through the SignalR JSON protocol, not `StatusJsonContext`."*
## What looks good
- **The prior Critical SignalR capture-leak (2026-05-15 C1/C2) is genuinely fixed.** `PlcSubscriptionTracker` is keyed on `tabId`, not `ConnectionId`; a reconnect is "same tab acquires a new connection," so the viewer count cannot leak across a transport drop. `StatusHub` no longer arms/disarms captures — `OnConnectedAsync` is not even overridden — and arm/disarm flows through `StatusBroadcaster.PushOnceAsync → TagCaptureRegistry.ReconcileArmed`, a single-threaded once-per-cycle authority. `ReconcileArmed` reconciles *all* captures against the live `ActivePlcs()` set every cycle, so any stale arm state self-heals within one `AdminPushIntervalMs`. The `StatusHubTests` / `PlcSubscriptionTrackerTests` reconnect and concurrency-stress tests cover the core invariant.
- **The prior TagCapture Critical (cache-hit debug-view freeze) is fixed.** `PlcMultiplexer.cs:653-666` captures `TagValueObservation`s into `CacheEntry.CapturedTags` at store time, and the cache-hit path (`PlcMultiplexer.cs:844-850`) replays them into the armed capture re-stamped to the hit time. The debug view now reflects cache-served reads.
- **No XSS / output-encoding surface.** The admin layer renders no server-side HTML. `/` and `/plc/{name}` serve static embedded `index.html` / `plc.html` byte blobs (`ServeHtmlShell`); `name` is bound but unused and read client-side. `/status.json` and the SignalR payloads are JSON-serialized, never string-interpolated. There is nowhere for attacker-controlled input to reach an HTML response.
- **Endpoint is strictly read-only.** Routes are `GET /`, `GET /plc/{name}`, `GET /assets/{path}`, `GET /status.json`, and the hub. The hub exposes only `SubscribeFleet` / `SubscribePlc` (group joins + tracker mutation) and `OnDisconnectedAsync`. No mutation route, no control action, no log download.
- **Asset serving is traversal-safe.** `AdminEndpointHost.cs:221` rejects `/`, `\`, and `..` in the path segment before any resource lookup; assets are embedded manifest resources, so there is no filesystem surface. Bytes and misses are cached in a `static ConcurrentDictionary`.
- **Bind-failure cleanup is now complete (prior M5/M6 fixed).** `StartAppAsync` declares `app` outside the `try` and its `catch` (`AdminEndpointHost.cs:258-280`) best-effort disposes the `_broadcaster` and stops/disposes the partially-started Kestrel `app`, so a throw after `app.StartAsync` no longer leaks a bound listener or a running push loop.
- **Lifecycle / hot-reload of `AdminPort` is correct.** `_broadcaster` is constructed per `StartAppAsync` and disposed before Kestrel stops in `StopCurrentAppAsync`; the `OnChange` callback is guarded by the `_disposed` flag both before queueing and after the threadpool picks it up, and re-checks `newPort == _currentPort` under `_lock`. `DisposeAsync` is idempotent via `_disposed`. CTS/semaphore disposal is race-guarded.
- **`StatusBroadcaster` resource hygiene.** `_cts` is disposed in `DisposeAsync` after `StopAsync` awaits the loop; `StopAsync` carries a `_stopped` flag so a double-stop (DisposeAsync after an explicit StopAsync) does not touch the disposed CTS or re-cancel. `LoopAsync` re-reads `AdminPushIntervalMs` each cycle (hot-reloadable) and floors it at 100 ms; it pushes before delaying so a freshly-connected dashboard populates immediately.
- **`TagValueCapture` torn-read safety holds.** `TagValueObservation` is an immutable `sealed record`; slots are reference-typed and only swapped via `Volatile.Write` / read via `Volatile.Read`; `_armed` is `volatile`; the disarmed hot path is a single volatile-bool read after one `?.` null-check. `FrozenDictionary` is built once and never mutated.
- **`StatusSnapshotBuilder` is lock-free and side-effect-free**, degrades gracefully for an unknown PLC (`BuildDebug` returns a disarmed empty snapshot), and the `CounterSnapshot` fallback for a not-yet-started supervisor is exhaustive.
- **`StatusBroadcaster.PushOnceAsync` indexes the fleet snapshot into a `Dictionary` once per cycle** (`StatusBroadcaster.cs:113-114`) — the prior review's O(N²) `FirstOrDefault`-per-PLC concern is resolved.
@@ -0,0 +1,157 @@
# Code Review — Configuration, Options, Hosting, Diagnostics
**Scope:** `src/Mbproxy/Configuration/`, `src/Mbproxy/Options/`, `src/Mbproxy/Diagnostics/`, `src/Mbproxy/HostingExtensions.cs`, `Program.cs`, `ServiceCounters.cs` (with `ProxyWorker.cs` and `PlcListenerSupervisor.cs` read for context).
**Branch:** `mbproxy-webui-dashboard` @ `0308490`.
**Date:** 2026-05-16.
## Summary
The configuration/hosting subsystem is in good shape overall: the hot-reload reconciler is cleanly structured, the debounce loop and `_applySemaphore` serialization are correct and leak-free, the diagnostic-sink platform selection is well factored and unit-testable, and `ServiceCounters` uses `Interlocked` correctly. Prior-review remediations appear intact. The most consequential issue found is that `ReloadValidator` is **never invoked at startup** — it runs only inside `ConfigReconciler.ApplyUnderLockAsync` — so an `appsettings.json` that has duplicate `ListenPort`s, an `AdminPort` collision, a bad keepalive cross-field relationship, or a non-positive timeout will start the service in a broken state instead of failing fast, directly contradicting the documented contract. A second real bug: the `Restart` reconcile path removes a supervisor from the dictionary *before* rebuilding it, so a transient failure during rebuild silently drops the PLC with no listener and no recovery. The remaining findings are validation-completeness gaps and maintainability items.
**Findings count:** Critical 0 · Major 5 · Minor 6
---
## Major
### M1 — `ReloadValidator` never runs at startup; cross-PLC config errors are not fail-fast
**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:232` (only call site); `src/Mbproxy/Proxy/ProxyWorker.cs:91-180`; `src/Mbproxy/HostingExtensions.cs:21-43`; `docs/Operations/Configuration.md:284`.
`ReloadValidator.Validate` is called from exactly one place — `ApplyUnderLockAsync` (`ConfigReconciler.cs:232`). The startup path is `ProxyWorker.ExecuteAsync`, which builds per-PLC tag maps directly via `BcdTagMapBuilder.Build` (`ProxyWorker.cs:101`) and never calls `ReloadValidator`. The only schema validation that runs at startup is `MbproxyOptionsValidator` via `.ValidateOnStart()` (`HostingExtensions.cs:21-26`), and that validator does **not** check the cross-PLC rules.
Consequences at startup (none of these are caught):
- **Duplicate `ListenPort`** across two PLCs — both supervisors try to bind the same port; one wins, the other enters the *infinite* `ListenerRecovery` loop forever. No fail-fast, no clear error.
- **`AdminPort` colliding with a `ListenPort`** — same silent-conflict outcome.
- **Duplicate PLC `Name`** — `ProxyWorker.cs:125` and `:179` write `_supervisors[plc.Name] = ...`, so the second entry silently overwrites the first; one configured PLC simply never gets a listener and its supervisor object leaks (created at `:165`, overwritten, never started/disposed).
- **Bad keepalive cross-field rule** (`BackendHeartbeatIdleMs <= BackendRequestTimeoutMs`) — only enforced in `ReloadValidator` (`ReloadValidator.cs:163`), so a fresh install with this mistake produces a continuously-firing heartbeat.
`docs/Operations/Configuration.md:284` explicitly states "`ReloadValidator.Validate` runs on every config load (startup and hot reload) … On rejection at startup, the service exits non-zero." That is false for the current code.
**Impact:** A config mistake that should abort startup instead produces a half-working fleet that "looks" up. The most likely operational mistakes (port typos, copy-pasted PLC blocks) are exactly the ones not caught.
**Recommendation:** Run `ReloadValidator.Validate` against `_options.CurrentValue` at the top of `ProxyWorker.ExecuteAsync` (before building any supervisor) and, on failure, log `mbproxy.config.reload.rejected` (or a startup-specific event) and fail the host (`throw` / `StopApplication`). Alternatively, fold the cross-PLC rules into `MbproxyOptionsValidator` so `.ValidateOnStart()` covers them. Either way, make startup and reload share one validation gate, and fix the doc.
---
### M2 — `Restart` reconcile path drops the PLC if the rebuild throws
**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:286-340`.
In the restart branch, the old supervisor is removed from the dictionary first (`TryRemove(name, out var old)`, `:292`), then stopped/disposed, then a fresh context + supervisor is built (`:302-330`) and re-inserted (`:332`). The whole body is wrapped in `try { … } catch (Exception ex) { _logger.LogError(...) }` (`:335-339`).
If anything between the `TryRemove` and the `_supervisors[name] = newSupervisor` re-insert throws — e.g. `BcdTagMapBuilder.Build` faults, `PolicyFactory`/`PerPlcContext` construction throws, an `OOM`, or `StopAsync` on the old supervisor throws something other than what it swallows — the catch logs and returns, but the dictionary no longer contains `name`. The PLC now has **no listener at all** and nothing ever recovers it: the next reload's `ReloadPlan.Compute` sees `name` present in both `current.Plcs` and `next.Plcs` (because `_currentOptions` is updated to `next` at `:437` regardless), so it lands in "unchanged" or `ToReseat`/`ToRestart` again only if options differ — an identical subsequent save produces *no* plan entry, leaving the PLC permanently dark until a manual config edit forces a restart.
The same structural risk exists in the `Add` path (`:388-431`), but there it is benign: a failed add just means the PLC was never there to begin with, and a subsequent reload will re-`Add` it because it is still absent from `_currentOptions`.
**Impact:** A transient fault during a hot-reload restart permanently removes a PLC from service with only an Error log line. For a 54-PLC fleet this is a silent single-line outage.
**Recommendation:** Build the new context + supervisor *before* removing/stopping the old one (or into a local first), and only swap into the dictionary once construction succeeded. On a build failure, leave the old supervisor in place and surface the error. At minimum, in the `catch`, if `name` is no longer in `_supervisors`, re-queue the PLC for `Add` on the next pass or roll `_currentOptions[name]` back to the old value so the next `Compute` re-detects it.
---
### M3 — Startup-built supervisors and reconciler-built supervisors do not use the same `MaxParties` / coalescing live-config path consistently with the rest of `Resilience`
**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:281-330, 383-421`; `src/Mbproxy/Proxy/ProxyWorker.cs:138-141`.
The backend-connect Polly pipeline is rebuilt inside `ApplyUnderLockAsync` from `next.Resilience.BackendConnect` (`:282-284`, `:384-386`) and the listener-recovery pipeline from `next.Resilience.ListenerRecovery` (`:314`, `:405`) — so add/restart supervisors *do* pick up reloaded `BackendConnect`/`ListenerRecovery` values. But existing supervisors that land in `ToReseat` (or in neither bucket) keep the pipelines built at startup (`ProxyWorker.cs:139`). A reload that changes only `Resilience.BackendConnect.MaxAttempts` or `Resilience.ListenerRecovery.*` therefore takes effect for *added/restarted* PLCs but silently does **not** propagate to the majority of PLCs that were not otherwise touched. The hot-reload propagation tables in `docs/Features/HotReload.md:45-58` and `docs/Operations/Configuration.md:423-435` list `ReadCoalescing` and `Backend*TimeoutMs` but say nothing about `Resilience.BackendConnect`/`ListenerRecovery` — operators will reasonably assume they hot-reload like everything else.
**Impact:** Inconsistent reload semantics: the same key behaves differently depending on whether a PLC happened to be added/restarted in the same save. Hard to diagnose ("I changed `MaxAttempts`, three PLCs honour it, fifty-one don't").
**Recommendation:** Either (a) document explicitly that `Resilience.BackendConnect` and `Resilience.ListenerRecovery` require a service restart (like `AdminPort`), and stop rebuilding them in the reconciler so behaviour is uniform; or (b) make them genuinely hot-reloadable by threading a live accessor (as already done for `ReadCoalescing`/`Keepalive`) so all supervisors re-read them. Option (a) is the smaller, safer change.
---
### M4 — `MbproxyOptionsValidator` does not validate `ListenPort` range or `Host`, leaving binding edge cases uncaught at schema time
**File:** `src/Mbproxy/Options/MbproxyOptions.cs:62-144`; `src/Mbproxy/Options/PlcOptions.cs`.
`MbproxyOptionsValidator` validates `Width`, cache TTLs, cache-size knobs, connection timeouts, `AdminPushIntervalMs`, and keepalive ranges — but never `PlcOptions.ListenPort` (range/uniqueness), `PlcOptions.Name` (empty/unique), `PlcOptions.Host` (empty), `PlcOptions.Port` (range), or `AdminPort` (range/collision). Those live only in `ReloadValidator`. Given M1 (ReloadValidator not run at startup), this is the *only* validator on the startup path, so:
- `ListenPort = 0` (the C# default when the key is omitted, see `PlcOptions.cs:6`) passes schema validation. The supervisor then tries to bind port 0, which the OS interprets as "pick an ephemeral port" — the listener binds a random port and clients can never reach it. No error is raised anywhere.
- An empty `Host` produces a `SocketException`/`ArgumentException` only at first backend connect, surfacing as a recoverable runtime fault rather than a config rejection.
`docs/Operations/Configuration.md:130-135` documents `Name`, `ListenPort`, `Host` as **required** — but nothing enforces "required" at bind time.
**Impact:** A PLC block missing `ListenPort` or `Host` starts a useless listener with no diagnostic. Combined with M1, the only realistic guard (`ReloadValidator`) is never consulted at startup.
**Recommendation:** Add `ListenPort ∈ [1,65535]`, non-empty `Name`, non-empty `Host`, and `Port ∈ [1,65535]` checks to `MbproxyOptionsValidator` (it already iterates `options.Plcs`), or — preferably — resolve M1 so `ReloadValidator` runs at startup and consider whether the two validators should be merged to remove the consistency burden entirely.
---
### M5 — `Resilience.BackendConnect.BackoffMs` length is never validated against `MaxAttempts`
**Files:** `src/Mbproxy/Options/ResilienceOptions.cs:19-29`; `src/Mbproxy/Options/MbproxyOptions.cs:62-144`; `src/Mbproxy/Configuration/ReloadValidator.cs`; `src/Mbproxy/Proxy/Supervision/PolicyFactory.cs:43-75`.
`docs/Operations/Configuration.md:220` states `BackoffMs` "Must have `MaxAttempts - 1` entries." Nothing enforces this. Neither `MbproxyOptionsValidator` nor `ReloadValidator` looks at `RetryProfile.MaxAttempts`, `BackoffMs`, `RecoveryProfile.InitialBackoffMs`, or `SteadyStateMs` at all. `PolicyFactory.BuildBackendConnect` (`:46`) does `Math.Max(1, MaxAttempts)` and clamps the backoff index to the last element (`:59-61`) — so it is *crash-safe*, but a config with `MaxAttempts = 5` and a single `BackoffMs` entry, or negative backoff values, silently produces behaviour that does not match the operator's intent (negative ms become `TimeSpan` that Polly may treat as zero/throw depending on version). `MaxAttempts = 0` is silently coerced to 1.
**Impact:** Low-blast-radius but a documented invariant ("must have `MaxAttempts - 1` entries") is unenforced, and a negative backoff or zero `MaxAttempts` is accepted without warning. Doc and code disagree.
**Recommendation:** Add a validation rule (in whichever validator survives M1's consolidation): `MaxAttempts >= 1`; `BackoffMs.Count >= MaxAttempts - 1`; every `BackoffMs`/`InitialBackoffMs` entry `>= 0`; `SteadyStateMs > 0`. Or, if the clamp-and-coerce behaviour is genuinely intended, relax the doc wording to match.
---
## Minor
### N1 — `ApplyUnderLockAsync` logs `mbproxy.config.reload.applied` and bumps the success counter even when every step failed
**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:335-443`.
By design, a step that throws is caught and logged and the loop continues (`:335`, `:373`, `:426`), then step 7 unconditionally sets `_currentOptions = next`, calls `RecordReloadApplied`, and logs `applied` (`:437-441`). If, say, every restart in `ToRestart` threw, the operator sees `config.reloadCount` increment and an INFO `mbproxy.config.reload.applied` line — with no signal that the apply was partial or total-failure. The `Plcs*` counts in the event are *planned* counts (`:243-246`), not *achieved* counts.
**Impact:** Status page and logs report a clean reload when reconciliation actually failed. Misleading during incident triage.
**Recommendation:** Track per-step failures (a counter or list). If any step threw, log at `Warning` with a "partial" qualifier (or emit a distinct event) and either skip `RecordReloadApplied` or expose a `reloadPartialCount`. At minimum, include achieved-vs-planned counts in the event.
### N2 — Reseat step swaps `_currentOptions` membership assumptions but uses `next.Plcs.First(...)` which throws inside the loop
**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:353`.
`var plcNew = next.Plcs.First(p => p.Name == name);``ReloadPlan.Compute` guarantees `name` exists in `next`, so this will not throw in practice, but `First` with no match throws `InvalidOperationException`, which the surrounding `catch` (`:373`) would log as a generic "Error reseating context". A `FirstOrDefault` with an explicit null-guard, or passing the `PlcOptions` through `ReloadPlan.ToReseat` alongside the map, would make the invariant explicit and the failure mode obvious. Minor robustness/clarity.
**Recommendation:** Extend `ToReseat` tuple to `(string Name, PlcOptions New, BcdTagMap NewMap)` so the reconciler never re-resolves by name.
### N3 — `ComputeGlobalTagDelta` throws on a duplicate global address; `ReloadPlan.Compute` indexes PLCs with `ToDictionary` likewise
**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:465-466`; `src/Mbproxy/Configuration/ReloadPlan.cs:39-40, 83-84`.
`before.Global.ToDictionary(t => t.Address)` (`:465`) throws `ArgumentException` if `BcdTags.Global` contains two entries at the same address. `BcdTagMapBuilder` detects duplicate addresses as a validation error, and `ReloadValidator` rejects such a snapshot — *but only `next`*. `ComputeGlobalTagDelta` is called with `_currentOptions.BcdTags` as `before` (`:249`); `_currentOptions` was validated when it became current, so this is safe today. It is, however, a latent fragility: any future path that sets `_currentOptions` without full validation, or a `Global` list with duplicate addresses that `BcdTagMapBuilder` happens not to flag, turns a cosmetic delta computation into an unhandled exception inside `ApplyUnderLockAsync` (caught only by the debounce-loop catch-all, aborting the whole apply). Same applies to `ReloadPlan.Compute`'s `ToDictionary(p => p.Name)` — safe only because `next` was validated first.
**Recommendation:** Use `ToDictionary` with a last-wins lambda or `DistinctBy`, or wrap `ComputeGlobalTagDelta` defensively. It is a pure cosmetic counter — it should never be able to abort a reload.
### N4 — `EventLogBridge` caches `_sourceExists` at construction; a post-install source registration never takes effect, but the doc oversells "no per-event registry traffic"
**File:** `src/Mbproxy/Diagnostics/EventLogBridge.cs:62-80`.
Behaviour is correct and intentional (documented at `Configuration.md:100`). One nuance: `Emit` still wraps `EventLog.WriteEntry` in a `try/catch` (`:101-109`) that swallows everything — so if the source is deleted *after* startup, every Error+ event silently no-ops with zero diagnostics. That is acceptable for a logging sink (must never recurse/crash), but there is no `SelfLog.WriteLine` breadcrumb as `SyslogBridge` has (`SyslogBridge.cs:46`). For symmetry and debuggability, emit one `SelfLog` line the first time a write fails.
**Recommendation:** Add a one-shot `SelfLog.WriteLine` on first write failure in `EventLogBridge.Emit` to match `SyslogBridge`'s degradation breadcrumb.
### N5 — `ConfigReconciler.Dispose` blocks up to 2 s synchronously on `_debounceLoop.Wait`
**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:485-501`.
`Dispose` cancels `_disposalCts` and then `_debounceLoop.Wait(TimeSpan.FromSeconds(2))`. Since `ConfigReconciler` is a DI singleton, this runs on the host's shutdown path (`ServiceProvider.Dispose`). If an apply is mid-flight when shutdown begins, the loop will not observe cancellation until the current `ApplyAsync` returns (the apply's per-step CTS are linked to `ct` = `_disposalCts.Token`, so they *do* cancel — good), so 2 s is normally ample. But a hung supervisor `StopAsync` inside an apply could make `Dispose` block for the full 2 s and then *abandon* a still-running loop thread. Acceptable, but worth a comment that the 2 s is a hard cap and the loop may be abandoned; also consider implementing `IAsyncDisposable` so the host can await it cleanly rather than block a thread.
**Recommendation:** Document the 2 s cap and abandonment behaviour, or convert to `IAsyncDisposable`.
### N6 — Doc drift: `Configuration.md:284` claims startup rejection, and the validator-ordering claims in HotReload.md are slightly off
**Files:** `docs/Operations/Configuration.md:284, 314-318`; `docs/Features/HotReload.md:81, 95`.
Beyond M1's headline doc bug: `HotReload.md:81` asserts "There is no second validator that could disagree with the first" and `:95` says "The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording." In reality there *are* two validators (`MbproxyOptionsValidator` and `ReloadValidator`) with overlapping-but-not-identical rules and different wording (e.g. `MbproxyOptionsValidator` says "exceeds the 60_000 ms safety cap" while `ReloadValidator` says "exceeds 60_000 ms without Cache.AllowLongTtl=true"). The "same error wording" claim is inaccurate. `Configuration.md:298` rule 7 ("Width … enforced by `MbproxyOptionsValidator`") correctly notes Width is *not* in `ReloadValidator` — which means a hot reload that somehow bound a bad Width would not be caught by `ReloadValidator` (it relies on `BcdTagMapBuilder` to reject it; verify that builder does).
**Recommendation:** Once M1/M4 are resolved (ideally by consolidating to one validator), rewrite these doc passages to describe the actual single gate. If two validators remain, drop the "same error wording" claim.
---
## Items checked and found correct (no finding)
- **Debounce loop** (`ConfigReconciler.cs:165-218`): the linked-CTS-per-iteration pattern is correct; `debounceCts` is `using`-scoped per loop iteration so no CTS leak; the `OperationCanceledException when (!ct.IsCancellationRequested)` filter correctly distinguishes window-elapsed from disposal.
- **`_applySemaphore` serialization** (`:150-161`): correct `WaitAsync`/`try`/`finally`/`Release`; disposed once in `Dispose`.
- **Channel** (`:73-77`): bounded capacity 1 with `DropOldest` is the right choice for a coalesced "something changed" signal; no unbounded growth.
- **`OnChange` registration** (`:105-111`): non-blocking `TryWrite`, registration disposed in `Dispose` (`:487`).
- **`ServiceCounters`** (`ServiceCounters.cs`): all reads/writes via `Interlocked`; the ticks-as-long pattern for `LastReloadUtc` is sound; `CompareExchange(ref x, 0, 0)` as a volatile read is correct.
- **`DiagnosticSinkSelector.Select`** (`DiagnosticSinkSelector.cs:51-59`): pure, correct precedence (Windows wins, `isSystemd` only consulted off-Windows); the `OperatingSystem.IsWindows()` re-guard at `HostingExtensions.cs:101` correctly satisfies the platform analyzer for `[SupportedOSPlatform("windows")]` `EventLogBridge`.
- **Graceful shutdown wiring** (`ProxyWorker.cs:288-351`): `GracefulShutdownTimeoutMs` read live from `CurrentValue`; in-flight snapshot taken before `base.StopAsync`; admin stopped last; supervisors disposed last. Ordering matches the documented contract.
- **`ReloadPlan.Compute`** (`ReloadPlan.cs`): identity keyed on `Name`, `TagMapsEqual` includes `CacheTtlMs`, reseat-vs-restart split is correct.
- **Reseat counter preservation** (`ConfigReconciler.cs:359`): `supervisor.CurrentCounters` passed into the new context — correct, matches `HotReload.md:117`.
+160
View File
@@ -0,0 +1,160 @@
# Frontend / Live Web Dashboard — Code Review
Scope: commit `0308490` ("close out the dashboard code-review minor findings"). Files reviewed under `src/Mbproxy/Admin/wwwroot/`: `index.html`, `plc.html`, `dashboard.js`, `detail.js`, `util.js`, `theme.css`, `dashboard.css`, `detail.css`. Vendored assets out of scope. Cross-checked against `docs/Operations/StatusPage.md`, `src/Mbproxy/Admin/StatusDto.cs`, `src/Mbproxy/Admin/DebugDto.cs`, `src/Mbproxy/Admin/StatusBroadcaster.cs`, and `src/Mbproxy/Admin/StatusHub.cs`. Prior findings from `codereviews/2026-05-15/Frontend.md` were verified: the XSS on `detail.js` (C1 in that review) and the `shortTime`/`escapeHtml` gap are confirmed fixed. The cold-start retry loop, URL-decode guard, NaN guards, accessibility attributes, and `util.js` extraction are all verified resolved. This review covers the current code fresh.
**Finding counts: 1 Critical, 1 Major, 4 Minor.**
---
## Critical Findings
### C1 — `dashboard.js` `stateChip` renders `listener.state` into `innerHTML` without escaping
`src/Mbproxy/Admin/wwwroot/dashboard.js:48`
**What is wrong.** `stateChip(state)` in `dashboard.js` returns:
```js
return `<span class="chip ${cls}">${state}</span>`; // line 48
```
The `state` value is a server-supplied string from `plc.listener.state`, interpolated raw into an `innerHTML`-bound template string with no `escapeHtml` call. It flows into `innerHTML` at line 166:
```js
<td>${stateChip(plc.listener.state)}</td>
```
**Inconsistency with `detail.js`.** The identical `stateChip` function in `detail.js` was fixed by the prior review and now correctly uses `escapeHtml(state)` at line 76:
```js
return `<span class="chip ${cls}">${escapeHtml(state)}</span>`;
```
The fix was applied only to `detail.js` and was not propagated back to `dashboard.js`.
**Impact.** `listener.state` is a `string` on the SignalR wire. Its values in the current implementation are the closed set `"bound"` / `"recovering"` / `"stopped"`, produced by the supervisor state machine — so the immediate exploitability is limited. However, the frontend must not rely on server-side type discipline for its own XSS safety (same rationale as the prior C1 finding). Under the same threat model as the prior review — no authentication on the admin port, PLC names and other state fields originate from operator-editable `appsettings.json` — any path that injects an unexpected string value into `listener.state` (serialization change, a future hot-reload edge case, a compromised service) would execute arbitrary script in every open fleet-dashboard tab. The fleet dashboard is more exposed than the detail page because it is the persistent long-lived view.
**Fix.** One character change: replace `${state}` with `${escapeHtml(state)}` at `dashboard.js:48`. `escapeHtml` is already imported from `window.mbproxyUtil` at line 18.
---
## Major Findings
### M1 — `onreconnected` silently swallows subscribe failure; `connected` pill is misleading when the subscribe invoke is rejected
`src/Mbproxy/Admin/wwwroot/dashboard.js:254`, `src/Mbproxy/Admin/wwwroot/detail.js:285`
**What is wrong.** Both pages' `onreconnected` callbacks fire after `withAutomaticReconnect` restores the transport. They call the hub method (`SubscribeFleet` / `SubscribePlc`) and swallow any error with `.catch(() => {})`:
```js
// dashboard.js:252-255
connection.onreconnected(() => {
setConn('connected');
connection.invoke('SubscribeFleet').catch(() => {});
});
```
If the hub method throws (hub overloaded, exception in `OnConnectedAsync`, network jitter between the reconnect and the invoke), the connection is live but the subscription is silently dropped. The pill shows `connected` (green) while the page receives zero further updates. There is no retry, no error indication to the operator, and no watchdog that would notice the dead feed. The cold-start path (`connect()`) retries with backoff and guards against re-entering on a live socket; the warm-reconnect path has no equivalent safety net.
**Impact.** On a typical LAN deployment with stable infrastructure this will be rare, but it is the kind of silent failure that is hard to diagnose. An operator watching a page that appears connected but shows stale counter values has no indication anything is wrong. The detail page additionally does not re-arm `armSnapshotWatchdog()` on the reconnect path, so the 6-second "no data" notice will not fire either if the subscribe silently fails.
**Fix.** Add a minimal retry for the `onreconnected` subscribe. The simplest form: invoke, and on rejection set the pill to `'disconnected'`/`'retrying'` and call `connect()` (the cold-start function already guards against re-starting a non-Disconnected connection). Alternatively, factor the subscribe into a shared helper used by both the cold-start and warm-reconnect paths, so the retry discipline is written once.
---
## Minor Findings
### N1 — `detail.js` does not destructure `escapeAttr` from `window.mbproxyUtil`; the attribute-escaping helper is silently unavailable
`src/Mbproxy/Admin/wwwroot/detail.js:23`
**What is wrong.** `detail.js` imports only `escapeHtml`:
```js
const { escapeHtml } = window.mbproxyUtil; // line 23 — escapeAttr not imported
```
`escapeAttr` is defined in `util.js` (line 15) and is available on `window.mbproxyUtil`, but `detail.js` never references it. All current attribute-context class interpolations in `detail.js` (the `cls` slots at lines 65, 76; `dirCls` at line 222; `stale.trim()` at line 224) happen to be computed from literal strings, so there is no current XSS. The risk is latent: any future change that passes a server-derived value into a class or other attribute slot in `detail.js` will produce an unescaped-attribute injection with no indication at the call site that the escaping tool is absent. The inconsistency was directly implicated in the prior review's C1 finding (the fix added `escapeHtml` to `detail.js` but did not import `escapeAttr`).
**Recommendation.** Change line 23 to:
```js
const { escapeHtml, escapeAttr } = window.mbproxyUtil;
```
No other changes needed — the helper just needs to be in scope so future reviewers and the linter can verify attribute contexts are correctly escaped.
---
### N2 — `prevPdu` and `rateByName` Maps are never pruned; stale entries accumulate across hot-reload PLC removal
`src/Mbproxy/Admin/wwwroot/dashboard.js:1112, 6678`
**What is wrong.** `updateRates(snapshot)` (lines 6678) only writes entries for PLCs present in the current snapshot; it never deletes entries for PLCs removed by a hot-reload. Over a long session with PLC churn, `prevPdu` and `rateByName` grow without bound. Concretely: `rateByName.size` (line 102) is used to decide whether the fleet PDU/s card shows a rate or `—`; after a PLC is removed its stale entry keeps `rateByName.size > 0`, so the card shows `0` instead of the arguably more correct `—` for a fleet that has shrunk to zero active PLCs. `renderAggregates` already iterates only `s.plcs` (line 91) so the stale rate values are never summed — this is correct — but the Map still grows.
**Impact.** Negligible in the current 54-PLC deployment with low PLC churn rate. Becomes visible if PLCs are repeatedly added and removed via hot-reload over a long-lived browser session. The `rateByName.size` false-positive is the most user-visible symptom.
**Recommendation.** At the end of `updateRates`, prune both Maps to the current snapshot's PLC name set:
```js
const currentNames = new Set(snapshot.plcs.map(p => p.name));
for (const k of prevPdu.keys()) { if (!currentNames.has(k)) prevPdu.delete(k); }
for (const k of rateByName.keys()) { if (!currentNames.has(k)) rateByName.delete(k); }
```
---
### N3 — Empty `class=""` attribute emitted on non-stale debug rows; `stale.trim()` call is a no-op
`src/Mbproxy/Admin/wwwroot/detail.js:223224`
**What is wrong.** The stale/non-stale class is computed as:
```js
const stale = (t.ageSeconds || 0) > 30 ? ' stale' : ''; // line 223
return `<tr class="${stale.trim()}"> // line 224
```
When the row is not stale, `stale` is `''`, `stale.trim()` is `''`, and the emitted markup is `<tr class="">`. The `.trim()` call is defensive against the leading space in `' stale'`, but a simpler and cleaner expression that also avoids the empty-attribute emission is:
```js
const staleCls = (t.ageSeconds || 0) > 30 ? 'stale' : '';
return `<tr${staleCls ? ` class="${staleCls}"` : ''}>
```
Or simply omit `.trim()` and use `class="${stale}"` — the leading space is harmless in CSS class matching. This is cosmetic.
**Impact.** Zero functional impact. The empty `class=""` attribute is parsed and ignored by every browser. Flagged as a code-cleanliness observation.
---
### N4 — `window.mbproxyUtil` is destructured at script-top-level; a util.js load failure aborts the entire page silently
`src/Mbproxy/Admin/wwwroot/dashboard.js:18`, `src/Mbproxy/Admin/wwwroot/detail.js:23`
**What is wrong.** Both scripts destructure `window.mbproxyUtil` at the top of their IIFE, before any DOM or error-handling setup:
```js
const { escapeHtml, escapeAttr } = window.mbproxyUtil; // dashboard.js:18
const { escapeHtml } = window.mbproxyUtil; // detail.js:23
```
If `util.js` fails to load (404 after a publish error, embedded-asset routing bug, script tag typo), `window.mbproxyUtil` is `undefined`, the destructuring throws a `TypeError` at the first `const`, and the remaining script — including `DOMContentLoaded`, error notices, and the SignalR connection — never runs. The page renders its static HTML indefinitely with no visible error. The HTML `<noscript>` fallback is absent.
**Impact.** Unlikely in production given the assets are embedded in the binary and verified by `EmbeddedAssetsTests`. Worth noting because the failure mode is silent — no browser console error surfaces to the operator, and the page appears partially rendered rather than broken.
**Recommendation.** Add a null-guard:
```js
if (!window.mbproxyUtil) {
document.body.innerHTML = '<p style="padding:2rem;color:red">Admin UI failed to load — missing util.js. Check the browser console.</p>';
throw new Error('window.mbproxyUtil not defined');
}
const { escapeHtml, escapeAttr } = window.mbproxyUtil;
```
Or equivalently move the destructuring inside `DOMContentLoaded` after a check. The `throw` intentionally aborts the script — at least with a visible error — rather than letting it limp along with broken escaping.
---
## What Looks Good
- **All prior C1 and M-series findings from `codereviews/2026-05-15/Frontend.md` are confirmed fixed.** Specifically: `t.direction`, `t.rawHex`, `t.name`, `shortTime(c.connectedAtUtc)`, and `l.lastBindError` in `detail.js` are all now correctly wrapped in `escapeHtml`. The `shortTime` fallback now goes through `escapeHtml` at the call site (line 105). The `plcNameError` guard prevents the script from silently failing on a malformed URL. The `util.js` shared helper was introduced and is loaded before both page scripts. The `armSnapshotWatchdog` addresses the "unknown PLC sits forever on waiting" scenario. Keyboard/Enter/Space sort handlers and `aria-sort` on `<th>` elements are in place. `aria-live="polite"` on the connection pill and `aria-label` on the toolbar inputs are present. The PLC name is a real `<a>` link with `rel="noopener"` — no more `window.open`.
- **`detail.js` `stateChip` is correctly escaped.** The fix from the prior review was applied to `detail.js:76` (`escapeHtml(state)`). Only `dashboard.js` has the regression (C1 above).
- **`detail.js` `card()` `v` slot is safe through consistent caller discipline.** All values passed as the second element of card rows are either `stateChip()` output (internally escaped), `num()` output (numeric-only), `ratioText()` output (`n%` or `—`), or explicitly `escapeHtml`-wrapped. No server string lands unescaped in the `v` slot.
- **SignalR lifecycle is correct.** `onreconnected` re-invokes `SubscribeFleet`/`SubscribePlc` — essential for group re-subscription after a transport reconnect. `onclose` does not re-subscribe (correct — `withAutomaticReconnect` owns the warm path). `tabId` is stable per page-load (not per ConnectionId) so transport reconnects do not leak armed captures server-side. `gotSnapshot()` correctly clears the watchdog timer on the first data arrival.
- **Cold-start retry is well-structured.** The `connect()` function guards `connection.state === Disconnected` before calling `start()`, preventing a subscribe-failure from re-starting a live socket. Capped exponential backoff (`retryMs = Math.min(retryMs * 2, 30000)`) prevents infinite tight retries.
- **All DTO field names are correct.** `detail.js` and `dashboard.js` read `coalescedHitCount`, `cacheHitCount`, `backendHeartbeatsSent`, `backendHeartbeatsFailed`, `backendIdleDisconnects`, `disconnectCascades`, `queueDepth`, `txIdWraps`, `inFlight`, `maxInFlight`, `invalidBcdWarnings`, `exceptionsByCode.codeOther`, etc. — all matching the camelCase JSON policy applied by `StatusJsonContext`/`JsonKnownNamingPolicy.CamelCase`. No field name typos found.
- **`escapeHtml` and `escapeAttr` in `util.js` are correct.** `&` is replaced before `<`/`>` (correct order), and `escapeAttr` adds `"` on top of `escapeHtml`. Both functions are pure and side-effect-free.
- **No CDN dependencies.** All `<script>`/`<link>` tags reference `/assets/…` — consistent with the firewalled-network design intent documented in `StatusPage.md`.
- **`formatUptime` and `formatAge` both guard `Number.isFinite` and `< 0`.** NaN/negative clock skew returns `—` rather than `NaN` text.
- **`tagCell(t)` correctly coerces `t.address` with `Number()`.** The address is rendered as a pure decimal or hex numeric, never as a raw server string.
- **`debug-rows` empty-state and `colspan` counts are consistent with the HTML `<thead>`.** The debug table has 6 columns; all static empty-row `<td colspan="6">` and the `no-traffic` span layout (1+1+3+1) sum to 6. The KPI table has 10 columns; `colspan="10"` matches. No cross-file DOM-id inconsistencies found.
- **CSS is clean.** `focus-visible` outlines on sortable `<th>` elements and the PLC-name `<a>` provide keyboard focus indicators. No `!important` abuse. Design tokens are consistent across the three CSS files.
@@ -0,0 +1,158 @@
# Code Review — Backend Connection Layer (Multiplexing + Cache)
**Date:** 2026-05-16
**Branch:** `mbproxy-webui-dashboard` (HEAD `0308490`)
**Scope:** `src/Mbproxy/Proxy/Multiplexing/` (PlcMultiplexer, UpstreamPipe, CorrelationMap, InFlightByKeyMap, InFlightRequest, TxIdAllocator, CoalescingKey, `*LogEvents.cs`), `src/Mbproxy/Proxy/Cache/` (ResponseCache, CacheEntry, CacheKey, CacheInvalidator, CacheLogEvents), `src/Mbproxy/Proxy/MbapFrame.cs`.
## Summary
The connection layer is well-structured and the threading invariants documented in `docs/Architecture/ConnectionModel.md` are largely honoured by the code: single backend writer/reader, per-pipe write loop, `ConcurrentDictionary`-backed correlation map, and a claim-then-dispatch watchdog. The lock-ordering discipline (map lock → no async inside it; connect gate serialising connect vs. teardown) is sound, and the prior reviews' remediations appear intact. However this review found one **wire-protocol correctness bug** that can deliver a stale value to a client across a write (a real data-integrity hazard for an SCADA proxy), one **resource-leak** path, and several **major** correctness gaps around partial-PDU parsing, cache-invalidation arithmetic overflow, and a watchdog/late-attach race window. None of the findings is a deadlock or crash, but the cache-vs-write ordering issue is genuine silent-wrong-value territory and should be treated as the headline item.
**Findings by severity:** Critical: 1 · Major: 6 · Minor: 7
---
## Critical
### C1 — Cache hit can serve a value contradicted by an in-flight write to the same register (read-after-write inversion)
**Files:** `PlcMultiplexer.cs:826-857` (cache lookup), `PlcMultiplexer.cs:674-695` (invalidation on FC06/FC16 *response*).
**What's wrong.** Cache invalidation for a write fires only in `RunBackendReaderAsync` *after the FC06/FC16 response lands from the PLC* (lines 674-695). The cache lookup on a read (lines 826-857) happens with no awareness of writes that are currently *in flight but not yet acked*. Consider this realistic sequence on one PLC, all on the same register range:
1. Client A reads `[100..110)` → cache miss → backend round-trip → response stored in cache (TTL 500 ms).
2. Client B writes register 105 (FC06) → request enqueued, travels to PLC, PLC applies it.
3. Client C reads `[100..110)`**cache hit** → C receives the *pre-write* value.
4. The FC06 response finally arrives → invalidation runs → cache entry for `[100..110)` dropped.
Between steps 2 and 4 (one full backend round-trip, plus outbound-channel queue depth, plus ECOM scan time — easily tens of ms, far less than a 500 ms TTL) the cache continues to serve the *old* value even though the PLC has already accepted the new one. Client C is told register 105 is the old value *after* the write that changed it was already accepted upstream by the proxy. For a Modbus proxy fronting SCADA, that is a silent wrong value delivered to a client — the exact failure mode the review brief calls Critical.
The doc (`ResponseCache.md` "Write Invalidation") frames invalidation as "a successful FC06/FC16 *response* invalidates" — i.e. the design intentionally invalidates on the response, not the request. That is defensible *only* if the cache also refuses to serve a range while a write to an overlapping range is in flight. It does not: there is no in-flight-write tracking on the read path.
**Impact.** Silent stale value across a write, bounded by one backend round-trip rather than by `CacheTtlMs`. Operationally most visible when a client writes a setpoint then another client (or the same client on a different connection) immediately reads it back and sees the old value. Note coalescing does *not* have this problem — it dies on response — but the cache extends the staleness window past the write.
**Recommendation.** Invalidate (or suppress) on the **request** side, not only the response side. When an FC06/FC16 frame is parsed in `OnUpstreamFrameAsync` (the FC06/FC16 branch around lines 805-817), immediately call `responseCache.Invalidate(unitId, startAddr, qty)` *before* the write is enqueued, and additionally track the overlapping range as "write pending" so reads that arrive before the write response lands also miss. The simplest correct fix: invalidate on request enqueue *and* keep the existing response-side invalidation as a backstop. The minor cost (a write may evict an entry that the write later turns out to fail with an exception) is acceptable — a failed write just causes a cache miss and a fresh read, which is always safe; serving a stale value is not.
---
## Major
### M1 — `EnsureBackendConnectedAsync` leaks the backend socket if `_disposed` flips during connect
**File:** `PlcMultiplexer.cs:287-358`.
**What's wrong.** A caller can enter `EnsureBackendConnectedAsync`, pass the `_disposed` check at line 289, acquire `_connectGate`, and run a multi-second Polly connect (lines 310-324). If `DisposeAsync` runs concurrently, it sets `_disposed = true`, cancels `_disposeCts`, and calls `TearDownBackendAsync` — which acquires the gate (or times out after 2 s and proceeds gate-less), and at that moment `_backendSocket` is still `null` because the connecting task has not yet reached line 341. `TearDownBackendAsync` sees `oldSocket is null && oldCts is null` and returns without closing anything. The connecting task then completes its `ConnectAsync`, takes `_backendLock`, and assigns a **fully live `Socket`** plus three running tasks (`_backendWriterTask`/`ReaderTask`/`HeartbeatTask`) into the now-disposed multiplexer. Nothing ever disposes that socket or joins those tasks; `_disposeCts` is already disposed so `CancellationTokenSource.CreateLinkedTokenSource(_disposeCts.Token)` at line 338 throws `ObjectDisposedException`, leaving `backend` connected and undisposed.
**Impact.** Leaked TCP socket to the PLC (one of the ECOM's 4 client slots held forever) plus three orphaned background tasks per occurrence. Occurs on a hot-reload PLC-remove or service stop that races a cold-start connect — not common, but it happens under exactly the churn the supervisor is designed to handle.
**Recommendation.** After a successful connect, re-check `_disposed` (and `_disposeCts.IsCancellationRequested`) *under `_backendLock`* before publishing the socket; if disposal won the race, dispose `backend` and return `false`. Wrap the `CreateLinkedTokenSource` in the same guarded block so a disposed `_disposeCts` cannot throw past the socket assignment.
### M2 — Backend reader drops a frame whose PDU body length disagrees with the PLC, with no resync
**File:** `PlcMultiplexer.cs:540-574`, `UpstreamPipe.cs:111-163`.
**What's wrong.** Both `FillAsync` loops trust the MBAP `Length` field absolutely. If the backend (or a middlebox) ever produces a frame whose `Length` field is wrong — too short — the reader consumes `length-1` body bytes, then loops and reinterprets the *next* bytes (which are really still part of the previous PDU) as a fresh 7-byte MBAP header. From that point every subsequent frame on the socket is mis-framed: `proxyTxId` is garbage, `CorrelationMap.TryRemove` misses, frames are silently dropped (line 580-584), and every real in-flight request leaks until the watchdog times it out with 0x0B. The socket never faults, so no cascade fires; the PLC effectively goes dark for `BackendRequestTimeoutMs` repeatedly. The only oversized-frame guard (line 561) catches `length-1 > 253` but not a *too small* / mis-aligned length.
**Impact.** A single corrupt length field on the wire desynchronises the backend stream indefinitely. Recovery depends entirely on the backend eventually closing the socket. Realistic on flaky industrial networks; the brief explicitly lists "middlebox drops a packet" as a covered scenario.
**Recommendation.** This is hard to fully solve without a length-validation/resync strategy, but at minimum: when `RunBackendReaderAsync` removes a correlation entry it can sanity-check the response PDU length against the request (e.g. an FC03 response body should be `2 + 2*qty`); a mismatch is strong evidence of desync and should `break` to force a teardown+cascade rather than silently fan out a wrong-sized payload. At a minimum, document that the reader assumes a well-framed backend and add a desync-suspected counter when `TryRemove` misses repeatedly.
### M3 — FC03/FC04 cache hit fans out a payload whose register count is not validated against the request
**Files:** `PlcMultiplexer.cs:826-857` (`BuildCacheHitFrame`), `PlcMultiplexer.cs:1356-1370`.
**What's wrong.** `CacheKey` includes `Qty`, so a hit only occurs for an identical `(unit,fc,start,qty)` — good. But `BuildCacheHitFrame` splices `cached.PduBytes` onto a fresh MBAP header and sets `Length = 1 + pduLen` with no upper-bound check. The stored PDU came from the backend reader (line 645-646, snapshot of `pduBodyLen` bytes which *is* bounded to ≤253 at line 561), so today it cannot exceed spec. However nothing asserts the cached PDU's declared byte-count actually matches `2*qty`. If a backend ever returned a short/long FC03 body that still passed the ≤253 check, that malformed payload is now cached and replayed to *every* future hit for the lifetime of the TTL — the cache amplifies one bad backend response into many bad client responses. Coalescing has the same single-response blast radius but only for the concurrently-attached parties; the cache widens it across time.
**Impact.** Lower likelihood than C1/M2 but the same class of harm — a wrong-shaped frame delivered to clients, amplified by caching.
**Recommendation.** When storing an FC03/FC04 response (line 641-667), validate the PDU body: `frame[HeaderSize+1]` (byte count) must equal `2*qty` and `pduBodyLen` must equal `2 + 2*qty`. Skip the cache store (and ideally skip the fan-out / force teardown) on mismatch. This also strengthens M2's resync story.
### M4 — `CacheInvalidator.FindOverlapping` and the FC16 parse can integer-overflow the half-open upper bound
**Files:** `CacheInvalidator.cs:38,46`, `PlcMultiplexer.cs:812-817`.
**What's wrong.** `writeEnd = writeStart + writeQty` and `keyEnd = key.StartAddress + key.Qty` are computed as `int` from two `ushort`s — that part is fine (max 0x1FFFE, no overflow). But the FC16 parse at `PlcMultiplexer.cs:815-816` reads `qty` straight from the wire with no clamp: a malicious or buggy client can send FC16 with `qty = 0xFFFF`. That `qty` is then used both as the `InFlightRequest.Qty` and, on the FC06/FC16 response, passed to `postCache.Invalidate(unitId, startAddress, qty)`. The proxy never validates that `qty` is within the DL260's FC16 cap of 100 (per `dl205.md` / CLAUDE.md) — the comment says "the PLC enforces the cap with exception 03", which is true for the *write itself*, but the **invalidation still runs with the bogus 65535-register span** because invalidation is gated only on `isException == false`. If the PLC accepts a smaller write and returns success, an attacker-supplied wide `qty` cannot reach here — but an FC16 request whose declared `qty` does not match its byte-count payload, accepted by a lax backend, would invalidate a 64K-register span and flush most of the cache. More concretely: the FC16 *request* `qty` and the *response* `qty` are assumed identical (the response echoes start+qty), but the code uses the **request-side** `inFlight.Qty` for invalidation (line 687) — so a request that lies about `qty` drives invalidation regardless of what the PLC actually wrote.
**Impact.** Cache-flush amplification / mild DoS on the cache from a malformed FC16. Not memory-unsafe (the `int` math does not overflow), but a contract violation: invalidation should reflect what was *written*, and it currently reflects what the client *claimed*.
**Recommendation.** Validate FC16 framing in `OnUpstreamFrameAsync`: require `qty` in `1..123` (or the DL260 cap of 100) and require `frame.Length >= pduOffset + 6 + 2*qty` and `byteCount == 2*qty`; reject (exception 03) otherwise rather than forwarding. Also prefer driving invalidation from the FC16 *response* PDU's echoed start/qty rather than the request's.
### M5 — Watchdog can miss a late-attached coalescing party, hanging that upstream until *its* socket times out
**Files:** `PlcMultiplexer.cs:1126-1161`, `InFlightByKeyMap.cs:59-83`.
**What's wrong.** The watchdog snapshots stale entries from `CorrelationMap` (line 1123), then for each one does `TryRemove` from `CorrelationMap` (line 1131) and only *afterwards* removes the `CoalescingKey` from `InFlightByKeyMap` (line 1160). Between the `CorrelationMap.TryRemove` and the `_inFlightByKey.TryRemove` there is a window where a new FC03/FC04 request can call `AttachOrCreate`, find the still-present coalescing entry, and append itself to `req.InterestedParties`. The watchdog then walks `req.InterestedParties` at line 1165 — but it captured `req` from the `CorrelationMap.TryRemove` out-param, and `InterestedParties` is the *same mutable `List<InterestedParty>`*, so a party appended after the `foreach` started its enumeration would throw `InvalidOperationException` ("collection modified") — caught by the outer `catch (Exception)` at line 1195, which logs and **exits the watchdog loop entirely**. A party appended *before* the `foreach` but after the `CorrelationMap.TryRemove` is delivered the 0x0B (fine), but a party that attached and whose request was *never enqueued to the backend* (the coalescing entry is now orphaned — no proxy TxId of its own) gets nothing if the enumeration already passed it.
The doc (`ReadCoalescing.md` "Multi-writer multi-reader safety") claims the watchdog "removes the coalescing key before it walks `req.InterestedParties`". The code does the opposite ordering: `CorrelationMap` removal first, then the walk at 1165, then `_inFlightByKey` removal at 1160 is actually *before* the walk — re-reading: line 1157-1161 (`_inFlightByKey.TryRemove`) runs before line 1165 (`foreach`). So the key *is* removed before the walk. Good. **But** the `CoalescingKey` removal at 1160 only happens for `req.Fc is 0x03 or 0x04`; a non-coalescing FC still has no key. The real residual hole: a late attach in the window between `CorrelationMap.TryRemove` (1131) and `_inFlightByKey.TryRemove` (1160) appends to the list, and the `foreach` at 1165 then enumerates a list that *was mutated after `TryRemove`*. If the append lands during the `foreach`, `InvalidOperationException` kills the watchdog (caught at 1195 → loop exits → **no more timeouts ever fire for this PLC**).
**Impact.** A rare race can permanently disable the timeout watchdog for one PLC, after which any lost backend response leaks its correlation entry forever and hangs upstream clients indefinitely — defeating the entire reason the watchdog exists.
**Recommendation.** Remove the `CoalescingKey` from `_inFlightByKey` *before* removing from `CorrelationMap`, mirroring the backend reader's ordering (reader does `CorrelationMap.TryRemove` then `_inFlightByKey.TryRemove` — actually the reader has the same ordering; see N-note). The robust fix: in `InFlightByKeyMap`, hand fan-out callers a *frozen copy* of the party list at removal time, or snapshot `req.InterestedParties.ToArray()` immediately after the claim and iterate the copy. Apply the same snapshot in `RunBackendReaderAsync`'s fan-out loop (line 710) for symmetry.
### M6 — `RunBackendReaderAsync` cascade-on-EOF/fault is fire-and-forget; an exception inside `TearDownBackendAsync` is unobserved
**Files:** `PlcMultiplexer.cs:753-766`, `538-537`, `RunBackendHeartbeat/Watchdog` similar.
**What's wrong.** Reader EOF (line 756), reader fault (765), and writer fault (536) all start `TearDownBackendAsync(...)` with `_ = Task.Run`-style fire-and-forget (just `_ = TearDownBackendAsync(...)`). `TearDownBackendAsync` is `async Task`; if it throws *synchronously before its first await* — or after, on a path the `try/finally` doesn't cover — the exception lands on an unobserved `Task`. The code guards the common case (`if (!_disposeCts.IsCancellationRequested)`) to avoid the disposed-`_connectGate` race, but `_connectGate.WaitAsync` can still throw `ObjectDisposedException` if disposal flips *between* the check and the call — that path is handled inside `TearDownBackendAsync` (line 398-403, returns). The residual risk is narrower than M1 but the pattern (unobserved fire-and-forget `async Task` from three call sites) is fragile: any future exception added to `TearDownBackendAsync`'s pre-await section becomes a process-level `UnobservedTaskException`.
**Impact.** Latent; today probably benign. Worth hardening because all three of the layer's failure-detection paths funnel through it.
**Recommendation.** Wrap the fire-and-forget in a helper that attaches a continuation logging any fault: `_ = TearDownBackendAsync(...).ContinueWith(t => _logger.LogError(t.Exception, ...), TaskContinuationOptions.OnlyOnFaulted)`. Or make the call sites `await` it where the calling context allows (the reader/writer tasks can `await` since they are about to exit anyway).
---
## Minor
### N1 — `UpstreamPipe.RunReadLoopAsync` permits a 7-byte zero-body frame but `OnUpstreamFrameAsync` cannot route it
**Files:** `UpstreamPipe.cs:133-141`, `PlcMultiplexer.cs:775-797`.
A `Length < 1` frame is forwarded as a header-only `byte[7]` (line 136-138). `OnUpstreamFrameAsync` then reads `fcByte = frame.Length > pduOffset ? ... : 0``fcByte = 0`, falls through every FC branch, allocates a proxy TxId, and forwards a 7-byte frame to the backend. A 7-byte MBAP frame with `Length=0` is itself malformed Modbus (`Length` must be ≥2: UnitId+FC). The proxy should reject it rather than allocate a TxId and forward garbage. Recommend: in the read loop, treat `length < 2` as a protocol error and close the pipe (consistent with the oversized-frame handling).
### N2 — Backend reader's `_inFlightByKey.TryRemove` runs *after* `CorrelationMap.TryRemove` but the doc invariant wants it before fan-out only
**File:** `PlcMultiplexer.cs:580-607`.
The ordering is correct for the stated invariant (key removed at 602-607, fan-out at 710), but the `CoalescingKey` is reconstructed from `inFlight.UnitId/Fc/StartAddress/Qty` (line 604-605) — if `inFlight` were ever a heartbeat with `Fc=0x03` this would spuriously probe the map; it is guarded by the `IsHeartbeat` early-`continue` at 595-596, so this is safe today. Worth a one-line comment that the FC03/FC04 branch is heartbeat-free by construction, so a future refactor doesn't reorder the `IsHeartbeat` check below it.
### N3 — `TxIdAllocator.Release` silently no-ops a double-release, masking the documented `TearDownBackendAsync` race
**Files:** `TxIdAllocator.cs:119-129`, `PlcMultiplexer.cs:377-385`.
The "KNOWN RACE" comment block at `PlcMultiplexer.cs:377-385` describes a double-release that frees a legitimately in-flight slot. `Release` cannot detect this — it just checks `_inUse[id]`. Consider having the allocator track a generation/epoch per slot, or at minimum increment a `doubleReleaseSuspected` counter when `Release` is called on an already-free slot, so the (admittedly very rare) silent-request-drop the comment accepts becomes observable in production rather than invisible.
### N4 — `ResponseCache` LRU eviction is O(n) per insert and the doc's 1000-entry assumption is unenforced at the call path
**File:** `ResponseCache.cs:206-230`.
`EvictLeastRecentlyUsed` is a full linear scan, acceptable at the documented 1000-entry cap. But `_maxEntries` comes from `Cache.MaxEntriesPerPlc` which is operator-settable with no documented ceiling in this file; a fat-fingered `MaxEntriesPerPlc = 1_000_000` turns every cache insert into a million-element scan under the lock, stalling the backend reader for every cache-miss store. Recommend either documenting/validating a hard ceiling or switching to an intrusive LRU (linked-list + dict) if large caps must be supported.
### N5 — `SnapshotOlderThan` allocates a `List` every watchdog tick even when nothing is stale
**File:** `CorrelationMap.cs:72-81`, `PlcMultiplexer.cs:1123-1124`.
The watchdog ticks every `BackendRequestTimeoutMs/4` (≥100 ms) and always allocates a fresh `List`, then checks `.Count == 0`. On an idle fleet of 54 PLCs that is 540 throwaway lists/sec. Trivial, but `SnapshotOlderThan` could return `Array.Empty` when the map is empty, or the watchdog could skip the call when `_correlation.Count == 0`.
### N6 — `CacheEntry.CapturedTags` replay on a cache hit re-stamps observations "now" for every hit, including hits served during a backend outage
**File:** `PlcMultiplexer.cs:844-850`.
When the cache serves a hit during a `recovering` backend, the debug view's `capture.Record(...)` is called with `CaptureDirection.Read` and the current time, making the detail page show a "fresh" read of a tag that has not actually been read from the PLC for up to `CacheTtlMs`. This is arguably misleading for an operator debugging during an outage — the value is cache-aged, not live. Consider tagging replayed observations as cache-served, or stamping them with the entry's `CachedAtUtc` rather than now.
### N7 — Doc drift: `ConnectionModel.md` says the cascade closes "every attached pipe" but the inline comment at the cascade still references the superseded "in flight" wording
**File:** `PlcMultiplexer.cs:448-450`.
The comment reads "Close every attached pipe that had a request in flight; the others will simply re-issue" — directly contradicted by the next line ("Per docs/.../ConnectionModel.md, ALL attached upstreams cascade") and by `upstreamCount = _pipes.Count`. The code is correct (all pipes cascade); the first comment sentence is stale and should be deleted to avoid confusing a future reader into thinking idle pipes survive.
---
## Notes on things checked and found correct
- **TxId wraparound** (`TxIdAllocator.cs`): the forward scan with `_inFlightCount < SlotCount` guard correctly terminates; `_wrapCount` increments on `0xFFFF→0x0000` cursor roll. No off-by-one. Reuse-distance maximisation works as documented.
- **Claim-then-dispatch in the watchdog** (`PlcMultiplexer.cs:1131`): the `TryRemove` race against a real response is correctly handled — whoever wins `TryRemove` owns dispatch; the loser skips. No double-delivery of a real response + 0x0B for the *non-coalesced* case.
- **MBAP TxId preservation**: proxy TxId is written at request enqueue (lines 1014-1015 / 1080-1081) and the original is restored per-party on the response (lines 736-737); heartbeat and cache-hit and exception frames all carry the correct TxId. Verified correct.
- **Single-party fan-out buffer reuse** (line 732-734): the `Count == 1` fast path reuses the buffer; multi-party clones per party. No aliasing bug.
- **Connect gate vs. teardown**: `_connectGate` correctly serialises `EnsureBackendConnectedAsync` against `TearDownBackendAsync`; the bounded 2 s teardown wait and its documented best-effort fallback are reasonable (the residual race is C-noted at N3, not a new finding).
- **`InFlightByKeyMap` lock discipline**: no async work happens under the lock; the factory is invoked under the lock but does only synchronous allocation/`TryAdd`. No lock-ordering inversion against `TxIdAllocator`'s lock (allocator lock is always taken *inside* the map lock, never the reverse).
- **Cache `recovering`-state invalidation skip**: the structural argument in the inline comment (lines 676-685) holds — no backend reader exists in `recovering`, so no FC06/FC16 response can drive an invalidation. Correct as written (but see C1, which is a *different* and real problem on the read side).
+115
View File
@@ -0,0 +1,115 @@
# Code Review — mbproxy (2026-05-16)
**Branch:** `mbproxy-webui-dashboard` · **HEAD:** `0308490`
**Predecessors:** `codereviews/2026-05-14/`, `codereviews/2026-05-15/` (all of their findings remediated in `554b05d`, `374eecd`, `0308490`).
## Method
This was a full-service review run as six parallel area reviews, each in its own file:
| File | Scope |
|------|-------|
| [`Multiplexing.md`](Multiplexing.md) | Backend connection layer — `Proxy/Multiplexing/`, `Proxy/Cache/`, `MbapFrame` |
| [`ProxyAndBcd.md`](ProxyAndBcd.md) | Proxy lifecycle + BCD codec — `Proxy/` core, `Proxy/Supervision/`, `Bcd/` |
| [`AdminSignalR.md`](AdminSignalR.md) | Admin endpoint, SignalR backend, `TagValueCapture`/`TagCaptureRegistry` |
| [`Frontend.md`](Frontend.md) | `Admin/wwwroot/` — HTML/CSS/JS dashboard |
| [`ConfigAndHosting.md`](ConfigAndHosting.md) | `Configuration/`, `Options/`, `Diagnostics/`, hosting, `ServiceCounters` |
| [`TestsAndConfig.md`](TestsAndConfig.md) | `tests/`, `install/`, config templates, csproj packaging |
Open each area file for the full finding text (description, impact, recommendation, line refs). This Overview consolidates, adjudicates severity where I disagree with an agent, and sets a remediation order.
## Headline verdict
The service is fundamentally sound. The hard parts — TxId multiplexing, the claim-then-dispatch watchdog, lock ordering, the BCD codec in all four directions, `Interlocked` counters, graceful-shutdown sequencing, the tab-keyed SignalR capture model — were checked and hold up. **Every Critical and Major finding from the two prior reviews is confirmed remediated with no regressions** (one frontend exception, below).
This review found **2 findings rated Critical by the area agents, 20 Major, and 38 Minor** (60 total). My own adjudication downgrades both "Criticals" to high-Major (reasoning under *Critical findings*), so I would summarise the real risk profile as: **no true Critical, ~8 Major worth scheduling, the rest opportunistic.** The two largest themes are (1) the proxy trusts self-describing Modbus wire fields more than it should, and (2) startup config validation is weaker than hot-reload config validation.
## Findings by area
| Area | Critical | Major | Minor | Most important |
|------|:--:|:--:|:--:|----------------|
| Multiplexing | 1 | 6 | 7 | C1 cache can serve a value stale across a write |
| ProxyAndBcd | 0 | 2 | 7 | M2 `PlcListener` swallows accept-loop faults |
| AdminSignalR | 0 | 2 | 7 | M1 one-cycle arm-state/tag-row inconsistency |
| Frontend | 1 | 1 | 4 | C1 `dashboard.js` `stateChip` unescaped (XSS regression) |
| ConfigAndHosting | 0 | 5 | 6 | M1 `ReloadValidator` never runs at startup |
| TestsAndConfig | 0 | 4 | 7 | M1 `AdminPort: 0` does not disable the admin endpoint |
| **Total** | **2** | **20** | **38** | |
## Critical findings — adjudicated
### Multiplexing C1 — response cache can serve a value contradicted by an in-flight write
`PlcMultiplexer.cs:826-857` (read/lookup) vs `:674-695` (invalidation on the FC06/FC16 *response*). Cache invalidation fires only when the write *response* lands. A read arriving after a write was forwarded but before its response gets a cache hit and the pre-write value. See `Multiplexing.md` C1 for the full sequence.
**My adjudication: high-Major, not Critical.** Mitigating: the cache is opt-in per tag (default OFF); the window is one backend round-trip and is already inside the documented `CacheTtlMs` staleness contract; and until the PLC acks the write, the old value is arguably the last *confirmed* value. The genuinely defective sub-case is a **write that times out** — it never produces a response, so invalidation never runs and the cache can serve a potentially-changed value for the rest of the TTL. Regardless of severity label, the fix is cheap and removes all ambiguity: **invalidate on the write request (enqueue), keeping response-side invalidation as a backstop.** Recommended.
### Frontend C1 — `dashboard.js` `stateChip` interpolates `listener.state` into `innerHTML` unescaped
`dashboard.js:48`. `detail.js`'s identical function was escaped by the prior review; the fix was not propagated to `dashboard.js`. See `Frontend.md` C1.
**My adjudication: Major, but fix immediately.** `listener.state` is a server-serialized enum, so practical exploitability today is low — but this is a straight regression from the prior XSS sweep and the fix is one character (`${escapeHtml(state)}`). It should not wait.
## Major findings worth scheduling
Grouped by theme. Full text in the area files.
**Startup vs hot-reload validation asymmetry — fix as one piece of work.**
- `ConfigAndHosting.md` M1 — `ReloadValidator.Validate` runs *only* inside the reconciler; the startup path runs only `MbproxyOptionsValidator`, which does not check duplicate `ListenPort`s, `AdminPort` collisions, duplicate PLC names, or the keepalive cross-field rule. A port-typo config starts a half-working fleet instead of failing fast — and `docs/Operations/Configuration.md:284` claims the opposite.
- `ConfigAndHosting.md` M4 — `MbproxyOptionsValidator` never range-checks `ListenPort`/`Port`/`AdminPort` or rejects empty `Host`/`Name`; `ListenPort: 0` (the omitted-key default) silently binds an ephemeral port.
- `TestsAndConfig.md` M1 — both config templates and the docs say `AdminPort: 0` *disables* the admin endpoint; it does not — Kestrel binds an ephemeral port. And `ReloadValidator` rejects `AdminPort < 1` on reload while startup accepts it. **This is the most user-facing Major: documented behaviour is false on a security-relevant setting.**
**The proxy trusts self-describing Modbus wire fields.**
- `Multiplexing.md` M2 — a single wrong MBAP `Length` desynchronises the backend stream indefinitely with no resync; the PLC goes dark until the socket happens to close.
- `Multiplexing.md` M3 — a cached FC03/FC04 payload is never validated against `2*qty`; a malformed backend response is cached and replayed to every hit for the TTL.
- `Multiplexing.md` M4 / `ProxyAndBcd.md` M1 — FC16 `qty`/`byteCount` are never cross-checked; a request whose self-describing fields disagree drives cache invalidation and partial BCD rewriting off the client's claim rather than reality. (Not memory-unsafe — per-slot bounds checks hold — but a contract gap.)
- Recommendation: add one framing-validation helper for inbound FC16 (`byteCount == 2*qty`, length consistent, `qty` within the DL260 cap) and one for cached/forwarded FC03/04 responses (`byteCount == 2*qty`); reject/teardown on mismatch.
**Hot-reload robustness.**
- `ConfigAndHosting.md` M2 — the reconciler `Restart` path removes the old supervisor from the dictionary *before* rebuilding; a transient fault during rebuild drops that PLC permanently with only an Error line. Build the new supervisor first, swap last.
- `ConfigAndHosting.md` M3 — `Resilience.BackendConnect`/`ListenerRecovery` reloads reach only added/restarted PLCs, not reseated/untouched ones — inconsistent, undocumented. Either thread a live accessor (as done for `ReadCoalescing`/`Keepalive`) or document them as restart-only.
**Connection-layer resilience.**
- `ProxyAndBcd.md` M2 — `PlcListener.RunAsync` catches and *returns* on an accept-loop fault instead of rethrowing, making the supervisor's exception-carrying `mbproxy.listener.faulted` (EventId 43) unreachable; faults are double-logged and lose their stack trace.
- `Multiplexing.md` M5 — a late-attaching coalescing party can mutate `InterestedParties` while the watchdog enumerates it, throwing `InvalidOperationException` that the outer catch turns into **permanent watchdog death for that PLC**. Snapshot the party list before fan-out.
- `Multiplexing.md` M1 — `EnsureBackendConnectedAsync` can leak a live backend socket + three tasks if `_disposed` flips during a cold-start connect. Re-check `_disposed` under `_backendLock` before publishing the socket.
- `Multiplexing.md` M6 — three failure-detection paths call `TearDownBackendAsync` fire-and-forget; a pre-await throw becomes an unobserved task exception. Attach a faulted-continuation logger.
**Silent failures.**
- `Frontend.md` M1 — `onreconnected` swallows a failed `SubscribeFleet`/`SubscribePlc`; the pill shows green while the feed is dead, with no retry. Route the warm-reconnect subscribe through the same retry as cold start.
- `AdminSignalR.md` M1 — `ReconcileArmed` and `BuildDebug` re-query the capture independently in one push cycle, so a mid-cycle disarm can push a `PlcDebugSnapshot` where `captureArmed` and the tag rows disagree for ≤1 cycle. Self-heals; derive `CaptureArmed` from the already-held `activePlcs` set instead. (I'd call this Minor — one-cycle cosmetic — but it touches the arm-state invariant the prior reviews centred on.)
- `AdminSignalR.md` M2 — `StatusBroadcaster.Start()` has no double-call guard; a second call orphans a push loop. Latent (one call site today); add an `Interlocked` flag.
## Cross-cutting Minor themes
- **Doc drift** (several): `BcdRewriting.md` uses `mbproxy.rewrite.exception_passthrough` for an event actually named `mbproxy.exception.passthrough` (`ProxyAndBcd.md` N2); `Configuration.md:284` and `HotReload.md:81/95` describe a validation model the code does not implement (`ConfigAndHosting.md` M1/N6); `ConfigReconciler.cs:364` comment claims `GetOrCreate` preserves the armed flag — it no longer does (`AdminSignalR.md` N1); `LogEvents.md` lists EventId 21 for a `ProxyWorker` event with no call site (`ProxyAndBcd.md` N1).
- **Unbounded growth on PLC churn**: frontend `prevPdu`/`rateByName` maps (`Frontend.md` N2); `ResponseCache` LRU is O(n)-per-insert with an unbounded operator-settable cap (`Multiplexing.md` N4).
- **Test hygiene**: hard-coded `AdminPort: 8080` in five test configs → parallel-run bind conflicts (`TestsAndConfig.md` M2); the sim fixture's readiness poll ignores the runner cancellation token (`TestsAndConfig.md` M3); a needless 400 ms delay in a broadcaster test (`TestsAndConfig.md` m2).
- **Best-effort swallow gaps**: `SocketKeepalive.Apply` does not catch `NotSupportedException`/`PlatformNotSupportedException` despite a "never abort a connection" contract (`ProxyAndBcd.md` N4); `EventLogBridge` has no first-failure breadcrumb where `SyslogBridge` does (`ConfigAndHosting.md` N4).
## Correction to an area finding
`TestsAndConfig.md` **m4** (and its regression-table row for prior `m6`) reports that `tests/sim/mbproxy.smoke.config.json` still cites `plans/2026-05-15-webui-dashboard.md`. **This is a false positive** — that reference was removed in commit `0308490` (task #21); the current header reads "mbproxy smoke-test configuration for the web-UI browser smoke tests." No action needed for m4.
## Prior-review regression check
All Critical and Major findings from `codereviews/2026-05-15/` are confirmed fixed and the fixes hold:
- SignalR capture-leak (prior C1/C2) — `PlcSubscriptionTracker` is tab-keyed; arm/disarm funnels through `ReconcileArmed`; the hub no longer arms captures. ✔
- XSS in `detail.js` (prior C1) — `detail.js` is fully escaped. ✔ **But the same fix was not propagated to `dashboard.js` — see Frontend C1 above (regression-by-omission).**
- Cache-hit debug-view freeze (prior TagCapture C1) — fixed via `CacheEntry.CapturedTags` replay. ✔
- Bind-failure leak (prior M5/M6) — `StartAppAsync` tears down a partially-started app/broadcaster. ✔
- The 8 Minor follow-ups (#14#21) — all closed; `TestsAndConfig.md`'s regression table confirms, with the m4/m6 false positive corrected above.
## Recommended remediation order
1. **`dashboard.js:48`** — escape `stateChip` (`${escapeHtml(state)}`). One character; closes the XSS regression.
2. **`AdminPort: 0` semantics** — either implement the documented disable (a `port == 0 → return` guard in `AdminEndpointHost.StartAppAsync`, then allow 0 in both validators) or correct the templates/docs to say the admin endpoint is always on. Pick one; today the docs lie.
3. **Single startup/reload validation gate** — run `ReloadValidator` (or a merged validator) at startup so port collisions and duplicate PLC names fail fast; fix `Configuration.md:284`.
4. **Cache invalidation on the write request** (Multiplexing C1) — invalidate overlapping FC03/04 entries when an FC06/FC16 is enqueued, not only on its response.
5. **Reconciler `Restart` ordering** (ConfigAndHosting M2) — build the new supervisor before removing the old one.
6. **`PlcListener.RunAsync` rethrow** (ProxyAndBcd M2) so the supervisor's fault path runs as designed.
7. **Watchdog party-list snapshot** (Multiplexing M5) — iterate a frozen copy so a late attach cannot kill the watchdog.
8. **Inbound FC16 / cached-response framing validation** (Multiplexing M2/M3/M4, ProxyAndBcd M1) — one helper, applied at both ends.
Items 13 are small and high-value. 48 are the connection-layer correctness work and deserve their own change with tests. The 38 Minors are opportunistic — the doc-drift cluster is worth a single sweep.
@@ -0,0 +1,128 @@
# Code Review — Proxy Lifecycle & BCD Codec
**Date:** 2026-05-16
**Branch:** `mbproxy-webui-dashboard` (HEAD `0308490`)
**Scope:** `src/Mbproxy/Proxy/{PlcListener,ProxyWorker,ProxyCounters,SocketKeepalive,BcdPduPipeline,NoopPduPipeline,IPduPipeline,PerPlcContext,RewriterLogEvents}.cs`, `src/Mbproxy/Proxy/Supervision/*`, `src/Mbproxy/Bcd/*`. Excludes `Proxy/Multiplexing/`, `Proxy/Cache/`, and `TagValueCapture`/`TagCaptureRegistry`.
## Summary
The proxy-lifecycle and BCD-codec subsystems are in good shape. The BCD codec itself (`BcdCodec`) is correct in all four directions, range-checked, and allocation-free; the 32-bit CDAB word order matches the documented `high*10000+low` convention and is consistently applied through `BcdPduPipeline` for FC03/04/06/16. Partial-overlap handling, invalid-BCD passthrough, exception passthrough, and the FC16 per-word base-10000 guard are all implemented correctly. The supervisor's Polly recovery loop, CTS/TCS re-arming, and graceful-shutdown sequencing are carefully built and well-documented. No Critical issues were found. The findings below are one Major correctness gap in the FC16 request length check that admits an integer-overflow bypass, plus a small set of resource/robustness and maintainability items.
**Findings by severity:** Critical 0 · Major 2 · Minor 7
---
## Major
### M1 — FC16 request payload-length check can be bypassed by an oversized `qty`, enabling out-of-PDU reads/writes mitigated only by a later per-slot check
**File:** `src/Mbproxy/Proxy/BcdPduPipeline.cs:158-167`
`ProcessFc16Request` reads `qty` straight from the wire (`pdu[3]<<8 | pdu[4]`, range 065535) and then guards with:
```csharp
if (pdu.Length < 6 + qty * 2)
return;
```
`qty` is `ushort`; `qty * 2` is computed in `int` so the multiplication itself does not overflow (max 131070), and `6 + qty*2` is fine. So the arithmetic is actually safe — but the *intent* documented in the comment ("a client claiming qty=10 with only 4 bytes of register data would otherwise have its BCD slots silently skipped") is only half-delivered. The real exposure is the opposite case the comment does not mention: the PDU is *delivered* in a buffer larger than the framed PDU. `pdu` here is a `Span<byte>` whose length is the parsed PDU length from the MBAP length field. If a malicious or buggy client sends an MBAP length that is large but a `byteCount`/`qty` that is internally inconsistent, the `pdu.Length < 6 + qty*2` check passes when `pdu.Length` is large, and then the per-slot `lowByteOff + 2 > pdu.Length` checks (lines 200, 261) are the only thing keeping the writes in bounds. That secondary check *is* present and correct, so no actual out-of-bounds write occurs — but the FC16 path never validates `byteCount` (`pdu[5]`) against `qty` at all. A request with `qty=2`, `byteCount=4`, but only the FC16-min 6 bytes plus a short tail will be partially rewritten against whatever stale bytes follow.
**Impact:** Not a memory-safety bug (the per-slot bounds checks hold), but a malformed FC16 whose `byteCount` disagrees with `qty` is silently partially rewritten instead of being passed through for the PLC to reject. The rewriter mutates register bytes the client never intended as register data. This is a wire-protocol-correctness gap for adversarial/buggy input.
**Recommendation:** Add an explicit `byteCount` consistency check after reading `qty`: `byte byteCount = pdu[5]; if (byteCount != qty * 2 || pdu.Length < 6 + byteCount) return;`. This makes the rewriter pass through any FC16 whose self-describing fields disagree, matching the documented "let the PLC's validator surface the protocol error" policy and removing reliance on the per-slot check as the safety net.
### M2 — `PlcListener.RunAsync` swallows a faulted accept loop without distinguishing a transient fault from a permanently dead listener; the supervisor then treats every non-cancellation exit identically
**File:** `src/Mbproxy/Proxy/PlcListener.cs:152-161`, interacting with `Supervision/PlcListenerSupervisor.cs:386-432`
`PlcListener.RunAsync` catches *every* non-`OperationCanceledException` exception, logs `mbproxy.listener.faulted`, and **returns normally**. Because it returns rather than rethrows, the supervisor's `await listener.RunAsync(token)` (line 388) completes without throwing. Control then falls through to lines 418-432, which treat a normal return as "listener accept loop ended unexpectedly", increment `RecoveryAttempts`, log `mbproxy.listener.ended`, and throw `InvalidOperationException` to drive a Polly retry.
The net effect is that a genuine accept-loop fault (e.g. `SocketException` from `AcceptSocketAsync`) is reported to operators as `mbproxy.listener.faulted` **and then** `mbproxy.listener.ended`, and the recovery counter is incremented once for the same event. The supervisor's own `catch (Exception runEx)` block at line 397 — which exists specifically to log the fault with the exception object attached and increment the counter — is **unreachable for any fault originating inside `RunAsync`**, because `RunAsync` never lets the exception escape.
**Impact:** Double logging (two distinct event names for one fault), the `mbproxy.listener.faulted` supervised emission (EventId 43, Warning, *with* stack trace) never fires for accept-loop faults — only the unsupervised `PlcListener` emission (EventId 22, Error, *no* exception object) does — so operators lose the stack trace the doc (`LogEvents.md` line 99) promises. The supervisor's `IncrementRecoveryAttempt(runEx.Message)` with the real reason is also never reached; the counter is bumped with the generic "Listener accept loop ended unexpectedly" string instead.
**Recommendation:** Have `PlcListener.RunAsync` rethrow after logging (`catch (Exception ex) { LogListenerFaulted(...); throw; }`), so the supervisor's `catch (Exception runEx)` path handles it as designed. Then the supervisor logs `mbproxy.listener.faulted` (EventId 43) with the exception, and the "ended unexpectedly" path is reserved for the genuinely-clean-return case it was written for. Alternatively, drop the supervisor's dead `catch (Exception runEx)` block and accept the current behaviour — but the current split is misleading and contradicts the documented event semantics.
---
## Minor
### N1 — `ProxyWorker.LogBindFailed` (EventId 21) is declared but never invoked; `LogEvents.md` documents it as emitted by `ProxyWorker`
**File:** `src/Mbproxy/Proxy/ProxyWorker.cs:382-385`; doc `docs/Reference/LogEvents.md:63`
The `[LoggerMessage]` `LogBindFailed` partial method (EventId 21, `mbproxy.startup.bind.failed`) is defined in `ProxyWorker` but has no call site — all bind-failure logging happens in `PlcListenerSupervisor.LogBindFailed` (EventId 41). `LogEvents.md` line 63 lists EventId 21 / `ProxyWorker.cs` as a source for `mbproxy.startup.bind.failed`, which no longer matches the code.
**Impact:** Dead code; misleading documentation. An operator filtering on EventId 21 will never see a hit.
**Recommendation:** Delete the unused `LogBindFailed` declaration from `ProxyWorker.cs` and remove the `21 (ProxyWorker)` / `src/Mbproxy/Proxy/ProxyWorker.cs` references from the `mbproxy.startup.bind.failed` table in `LogEvents.md`.
### N2 — `RewriterLogEvents.ExceptionPassthrough` event name `mbproxy.exception.passthrough` is inconsistent with the `mbproxy.rewrite.*` family it sits in
**File:** `src/Mbproxy/Proxy/RewriterLogEvents.cs:46-55`
`PartialBcd` and `InvalidBcd` use `mbproxy.rewrite.partial_bcd` / `mbproxy.rewrite.invalid_bcd`, but `ExceptionPassthrough` uses `mbproxy.exception.passthrough`. `LogEvents.md` (line 557) explicitly lists `exception` as its own `<area>`, so this is intentional and *documented* — but `BcdRewriting.md:146` calls it `mbproxy.rewrite.exception_passthrough`, and `BcdRewriting.md:232` repeats `mbproxy.rewrite.exception_passthrough`. The code and `LogEvents.md` agree on `mbproxy.exception.passthrough`; `BcdRewriting.md` is wrong in two places.
**Impact:** Doc drift. An operator following `BcdRewriting.md` to build a log filter would grep a name that is never emitted.
**Recommendation:** Fix the two occurrences in `docs/Features/BcdRewriting.md` (lines 146 and 232) to `mbproxy.exception.passthrough` to match `LogEvents.md` and the source.
### N3 — `PlcListener.RunAsync` orphans the per-pipe `ContinueWith` cleanup if the listener faults between `Task.Run` and dictionary insertion
**File:** `src/Mbproxy/Proxy/PlcListener.cs:136-149`
The accept loop creates `pipeTask`, inserts it into `_pipeTasks`, then attaches a `ContinueWith` that removes the entry. If `DisposeAsync` runs concurrently (it snapshots `_pipeTasks.Values` at line 178), there is a benign window where a pipe task started after the snapshot is not awaited — acceptable best-effort. More notably, the `ContinueWith` continuation uses `TaskScheduler.Default` and discards its own task (`_ = ...`); if the continuation itself throws (it cannot here, `TryRemove` does not throw) it would be an unobserved exception. Low risk, but the pattern of fire-and-forget `ContinueWith` for cleanup is fragile.
**Impact:** None in practice — `TryRemove` cannot throw. Maintainability only.
**Recommendation:** Prefer awaiting cleanup inside the `Task.Run` lambda's `finally` (the lambda already has a `finally` that disposes the pipe — add `_pipeTasks.TryRemove(pipe.Id, out _);` there) and drop the separate `ContinueWith`. This collapses two scheduling primitives into one and removes the discarded-task wart.
### N4 — `SocketKeepalive.Apply` does not catch `NotSupportedException`; some platforms throw it (not `SocketException`) for the `TcpKeepAlive*` options
**File:** `src/Mbproxy/Proxy/SocketKeepalive.cs:36-47`
`Apply` catches `SocketException` and `ObjectDisposedException`. The three `TcpKeepAlive*` socket options are documented as "not honoured on every platform"; on platforms/runtimes where the option is unrecognised, `Socket.SetSocketOption` can throw `SocketException(SocketError.ProtocolOption)` (caught) **or**, on some older runtimes/OSes, `NotSupportedException` / `PlatformNotSupportedException` for the named option enum. Those would escape and — per the comment's own intent ("must never abort a connection") — defeat the best-effort contract.
**Impact:** On a platform that throws `PlatformNotSupportedException` for `TcpKeepAliveRetryCount`, applying keepalive to a backend or accepted upstream socket would throw out of the constructor / connect path, aborting the connection. The `SO_KEEPALIVE` master option (line 29) is universally supported, so the realistic failure window is narrow, but the swallow set is incomplete relative to the stated guarantee.
**Recommendation:** Broaden the catch to also include `NotSupportedException` (which `PlatformNotSupportedException` derives from) — or simplest, catch `Exception` here given the explicit "swallow everything, keepalive is best-effort" contract documented in the XML summary.
### N5 — FC03/04 response partial-overlap warning logs `qty` (raw request qty) while the in-range computation uses `effectiveQty` (clamped to response words)
**File:** `src/Mbproxy/Proxy/BcdPduPipeline.cs:348,362-370`
`ProcessResponse` clamps `effectiveQty = min(qty, wordsInResponse)` and uses `effectiveQty` for the `lowInRange`/`highInRange` checks (lines 362-363), which is correct. But the partial-BCD warning at line 367-368 logs `startAddress, qty` — the *original* request qty, not `effectiveQty`. If a PLC returns a short response (`byteCount` smaller than `qty*2`), a 32-bit tag that *would* have been fully in range for the requested `qty` is reported as a partial overlap with the full `qty`, making the warning look like a client/config straddle when it is actually a truncated response.
**Impact:** Misleading `mbproxy.rewrite.partial_bcd` warning in the (rare) short-response case; an operator would chase a client tag-map mismatch that does not exist.
**Recommendation:** Either log `effectiveQty` in the response-path `PartialBcd` call, or — better — distinguish the two causes: a short response is a backend/transport anomaly, not a client straddle, and arguably warrants a different (or no) warning. At minimum make the logged qty match the qty actually used for the in-range decision.
### N6 — `ProxyCounters.UpdateRoundTripEwma` comment claims ~1 µs resolution but stores nanoseconds; the fixed-point scale and the comment disagree
**File:** `src/Mbproxy/Proxy/ProxyCounters.cs:411-415`
`sampleFixed = (long)(sampleMs * 1000.0)` converts milliseconds to microseconds (×1000). The comment on line 413 says "store microseconds * 1000 (i.e. nanoseconds)" — but `sampleMs * 1000` is microseconds, not nanoseconds. `Snapshot()` then divides by 1000.0 (line 466) to get milliseconds back, which is consistent with microsecond storage. So the *code* is internally consistent (ms→µs store, µs→ms read); only the comment's "(i.e. nanoseconds)" parenthetical is wrong, and the "~1 µs resolution" claim is right for microsecond storage.
**Impact:** None functional — the EWMA value is correct. Comment is misleading for a future maintainer.
**Recommendation:** Fix the line 413 comment to "store milliseconds * 1000 (i.e. microseconds)".
### N7 — `BcdTagMapBuilder` includes entries implicated in an `OverlappingHighRegister` error in the returned map; relies entirely on the caller checking `Errors.Count`
**File:** `src/Mbproxy/Bcd/BcdTagMapBuilder.cs:155-164`
The builder's own comment (lines 156-160) acknowledges that `OverlappingHighRegister`-implicated entries are *kept* in the frozen map, unlike `InvalidWidth`/`DuplicateAddress` entries which are excluded. The safety of this depends on every caller treating `Errors.Count > 0` as fatal. `ProxyWorker.ExecuteAsync` (lines 107-114) does exactly that — it skips the listener. But `ConfigReconciler` and any future caller must replicate the check; the asymmetry (some bad entries excluded, some included) is a latent footgun.
**Impact:** None today — the one production caller handles it. A future caller that builds a map and uses it without checking `Errors` would get a map with a known-bad overlapping 32-bit pair, and the rewriter would then mis-decode the overlapping registers (a 32-bit tag whose high register is another tag's address produces a `RangeHit` for both, and both would be rewritten).
**Recommendation:** For consistency and defence-in-depth, exclude `OverlappingHighRegister`-implicated entries from the frozen map the same way `InvalidWidth`/`DuplicateAddress` entries are excluded, so a map returned alongside errors is always safe to use even by a caller that forgets the check. If keeping them is deliberate (for diagnostics), document the contract on `ValidationResult.Map` itself, not just inside the builder.
---
## Notes (no finding)
- **BCD codec correctness verified.** `Encode16`/`Decode16` round-trip cleanly for `[0,9999]`; `(uint)value > Max16` correctly rejects negatives via unsigned wrap. `Encode32`/`Decode32` implement CDAB low-word-first with `high*10000+low`, matching `BcdRewriting.md` and `dl205.md`. `HasBadNibble` checks all four nibbles. The FC16 request path's per-word `clientLow/clientHigh > 9999` guard (lines 222-228) correctly prevents the documented silent-mutation bug where `(9999,9999)` would otherwise survive `Encode32`'s 99,999,999 ceiling.
- **FC06 high-register partial detection is correct.** `ProcessFc06Request` (lines 96-116) uses `TryGetForRange(address,1,...)` to catch the case where the written address is the *high* register of a configured 32-bit pair; `TryGetForRange`'s negative-`OffsetWords` semantics make `hit.OffsetWords < 0` the right discriminator.
- **Supervisor lifecycle is sound.** CTS and TCS are re-armed per `StartAsync`, the previous CTS is disposed before reassignment, `DisposeAsync` is idempotent, and the response cache's eviction timer is disposed in both `ReplaceContextAsync` (old cache) and `DisposeAsync`. Polly's listener-recovery pipeline is correctly infinite (`MaxRetryAttempts = int.MaxValue`) with cancellation as the sole exit.
- **Graceful shutdown ordering is correct.** `ProxyWorker.StopAsync` snapshots in-flight counts before `base.StopAsync`, drains via supervisor stop, stops the admin endpoint last, and disposes supervisors — the documented sequence. The `inFlightAtCancel` snapshot rationale is well-reasoned.
- **`ProxyCounters` is correctly lock-free** — all increments are `Interlocked`, `ObserveInFlight` and `UpdateRoundTripEwma` use proper CAS loops, and `Snapshot` reads each field atomically.
@@ -0,0 +1,312 @@
# Code Review — Tests, Install/Packaging, and Config (HEAD `0308490`)
Scope: `tests/Mbproxy.Tests/` (all test code), `tests/sim/` (simulator profile and fixture),
`install/` (all scripts and config templates), `src/Mbproxy/Mbproxy.csproj`, and
`tests/Mbproxy.Tests/Mbproxy.Tests.csproj`. Prior-review findings from `codereviews/2026-05-15/TestsAndConfig.md`
were read for regression context; the code is reviewed fresh against the current HEAD.
## Summary
The prior-review remediation is genuine and thorough: every `Major` finding from the
2026-05-15 review has been addressed — `StatusBroadcaster.LoopAsync` now has a proper
timing-hedged loop test (`Loop_PushesRepeatedly_ThenStopsAfterStopAsync`), the
`Record`/`Disarm` race is fixed in production and covered by
`ConcurrentRecordAndDisarm_LeavesNoStaleObservation`, `PlcSubscriptionTracker` now has a
dedicated test file including a concurrency stress test, the `ThrowingStatusPushSink` fake
covers both the swallow-path and the `OperationCanceledException` filter, and
`ConfigReconcilerTests` now holds a real registry and asserts add/remove lifecycle. The
`HubStatusE2ETests` E2E pair covers the `MapHub`/`SignalRStatusPushSink` gap that the prior
review flagged as the single biggest "does the feature actually work" gap. The `EmbeddedAssetsTests`
glob guard and the `Get_Asset` byte-verbatim assertion also close the prior `m3`/`m4` nits.
No prior-review finding has regressed. The remaining findings below are all new, introduced by
the current HEAD or present in the pre-existing code but newly visible against the completed
remediation baseline.
**Findings: 0 Critical · 4 Major · 7 Minor**
---
## Critical
None.
---
## Major
**M1 — `AdminPort=0` silently binds a random OS-assigned port instead of disabling the admin endpoint.**
`install/mbproxy.config.template.json:83`, `install/mbproxy.linux.config.template.json:87`,
`docs/Operations/Configuration.md:54`; `src/Mbproxy/Admin/AdminEndpointHost.cs:190`.
Both config templates carry the comment `// Set to 0 to disable the admin endpoint.`, and the
Operations docs repeat this. The comment is wrong. `AdminEndpointHost.StartAppAsync` calls
`k.Listen(System.Net.IPAddress.Any, port)` with whatever `port` is — Kestrel interprets port 0
as "pick a free OS-assigned port" (ephemeral port), not as "skip the bind". There is no code in
`StartAppAsync` or its callers that special-cases port 0 as a disable signal. A production
operator who sets `AdminPort: 0` to disable the admin surface does not get a disabled endpoint;
they get an admin endpoint on an unknown ephemeral port that is not logged in the startup banner
in a predictable way. This is made worse by:
1. `ReloadValidator.Validate` (line 6568) rejects `AdminPort < 1`, so a hot-reload with `AdminPort: 0`
is correctly blocked — but the *initial* startup path does not go through `ReloadValidator`
(only `MbproxyOptionsValidator`, which does not range-check `AdminPort`). So port 0 at
startup actually takes effect, while the same value in a reload is rejected. The two paths
are inconsistent.
2. Several tests intentionally use `["Mbproxy:AdminPort"] = "0"` with the comment "disable
admin to avoid port conflicts" (`ShutdownE2ETests.cs:163`, `StatusBroadcasterTests.cs:47`,
`StatusSnapshotBuilderTests.cs:266`). These tests are not broken — they work because Kestrel
binds on 0 and no test probes the admin port — but they silently start an admin endpoint
rather than suppressing it. If those tests ever run concurrently with another test that
probes all listening ports they can interfere.
**Impact:** Operators following the documented `AdminPort: 0` procedure to disable the admin
surface on a sensitive network have an undocumented, unauthorised HTTP listener running. This is
a documentation/behavioural inconsistency on a production-critical setting.
**Recommendation:** Implement the documented disable. In `AdminEndpointHost.StartAppAsync`, add
`if (port == 0) return;` before the builder construction. Then update `ReloadValidator` to allow
0 (skip collision check) and update `MbproxyOptionsValidator` to allow 0. The templates and docs
are then correct. Alternatively, remove the "Set to 0 to disable" comment and document that the
admin endpoint is always started, and rename the test comments to reflect reality.
---
**M2 — Multiple E2E tests hard-code `AdminPort: 8080`, causing bind-conflict failures when run in parallel or when port 8080 is already in use.**
`tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs:67,242,310`; `tests/Mbproxy.Tests/Proxy/RewriterE2ETests.cs:388`;
`tests/Mbproxy.Tests/HostSmokeTests.cs:111`.
`ProxyForwardingTests` (3 configs), `RewriterE2ETests` (1 config), and the shared
`TestHostBuilderExtensions.ConfigureForTest` used by all three `HostSmokeTests` hard-code
`["Mbproxy:AdminPort"] = "8080"`. When these tests run in parallel (xUnit v3's default policy
for non-collection tests) they each try to bind a Kestrel listener on port 8080. The first
binder wins; all subsequent tests get a `bind.failed` log and a missing admin endpoint. While
the proxy tests themselves do not assert admin-endpoint behaviour, the spurious bind-failed error
can mask real failures in CI logs and reduces test isolation. `HostSmokeTests` compounds this:
it is not in any `[Collection]` so it runs concurrently with everything else.
**Impact:** Flaky CI runs when multiple test classes execute simultaneously. Currently mitigated
by the `AdminEndpointHost` bind-failure-is-non-fatal path, so test assertions still pass — but
the coverage is degraded (admin is not actually up) and the log noise makes CI output harder to
read.
**Recommendation:** Replace the hard-coded `"8080"` values with `PickFreePort()` calls (the
same pattern already used by `AdminEndpointTests` and `HubStatusE2ETests`). Update
`TestHostBuilderExtensions.ConfigureForTest` to accept a port parameter with default 8080 for
callers that do not care (smoke tests do not probe the admin surface and an ephemeral-port bind
is fine).
---
**M3 — `DL205SimulatorFixture` swallows the test's own `CancellationToken` in the readiness poll, blocking test cancellation for up to 120 seconds.**
`tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs:121-163`.
`InitializeAsync` constructs a `CancellationTokenSource` with a hard-coded 120-second deadline
(line 61) and a `linked` token that chains the deadline against `CancellationToken.None` (line
122):
```csharp
using var linked = CancellationTokenSource.CreateLinkedTokenSource(
deadline.Token, CancellationToken.None);
```
The test framework's cancellation token — which xUnit v3 exposes as
`TestContext.Current.CancellationToken` — is never linked in. If the test runner cancels the
test suite (CI timeout, keyboard interrupt), the readiness poll continues for up to 120 more
seconds and the spawned Python process is not killed until `DisposeAsync` finally runs. On a CI
runner with a per-job timeout this can cause the whole runner to be killed mid-cleanup, leaving
orphaned Python processes.
**Impact:** Poor CI behaviour under cancellation; can leave zombie `python`/`pwsh` processes.
**Recommendation:** Link the fixture against the framework's application lifetime or accept a
`CancellationToken` through the `IAsyncLifetime` interface (xUnit v3 provides one via
`InitializeAsync(CancellationToken)`). Alternatively, use
`TestContext.Current.CancellationToken` from inside `InitializeAsync` if xUnit v3's fixture
`IAsyncLifetime` overload provides it:
```csharp
public async ValueTask InitializeAsync() // or the overload with CancellationToken
```
At minimum, change `CancellationToken.None` to the method's incoming token or
`TestContext.Current.CancellationToken` inside `linked` so a kill signal actually propagates.
---
**M4 — `install.ps1` copies the config template to `$dataDir` (ProgramData) but the `ReloadValidator` AdminPort=0 / `MbproxyOptionsValidator` gap means an operator using the template's `AdminPort: 0` disable hint starts a service with an undocumented ephemeral admin port.**
(Secondary impact of M1, listed separately because the install path is independently broken.)
`install/install.ps1:153-162`; `install/mbproxy.config.template.json:83-84`.
When `install.ps1` seeds the template as the initial config and an operator follows the inline
comment to set `AdminPort: 0`, the service starts with Kestrel binding on an OS-assigned
ephemeral port, not disabled. The `install.ps1` script itself has no bug — the error is upstream
in M1 — but the install path is where the misleading comment is first acted upon. Listing it
separately so the install script gets a concrete fix recommendation: if M1 is fixed by adding a
`port == 0 → skip` guard in `AdminEndpointHost`, the template comment becomes correct and this
finding is resolved simultaneously. If M1 is fixed differently, the template comment must still
be updated to match whatever the new semantics are.
**Impact:** Same as M1 — an unexpected open HTTP port on a "disabled" admin config.
**Recommendation:** Fix M1 first; this finding closes automatically once the template's
`AdminPort: 0` comment reflects actual behaviour.
---
## Minor
**m1 — `HubStatusE2ETests.PickFreePort` has a TOCTOU race: the port can be stolen between `l.Stop()` and Kestrel's bind.**
`tests/Mbproxy.Tests/Admin/HubStatusE2ETests.cs:141-147`.
The helper stops the listener and returns the port number — same pattern used in a dozen other
test files. This is acknowledged as acceptable in the `DL205SimulatorFixture` TOCTOU comment
(line 72). No change needed, but worth noting since `HubStatusE2ETests` is one of the few tests
that allocates two ports in the same method (admin + proxy) and has a wider race window. Low
priority; flagged for awareness.
---
**m2 — `Loop_PushesRepeatedly_ThenStopsAfterStopAsync` has a 400ms post-stop idle assertion that can be a source of slowness but not flakiness.**
`tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs:188-190`.
```csharp
await Task.Delay(400, TestContext.Current.CancellationToken);
h.Sink.FleetPushes.Count.ShouldBe(afterStop, "no pushes may occur after StopAsync");
```
The test correctly waits 400 ms after `StopAsync` to confirm silence. Given `AdminPushIntervalMs=100`
and that `StopAsync` awaits the loop task (`await _loop`), the loop is guaranteed terminated
before `StopAsync` returns — so zero extra delay is needed for the assertion to be correct. The
400 ms delay adds wall-clock time to every test run without buying additional confidence. The
existing `_loop` await in `StopAsync` is the real guarantee.
**Recommendation:** Drop the `Task.Delay(400, ...)` line and assert the count immediately after
`StopAsync`. The test would be equally correct and ~400 ms faster.
---
**m3 — `FakeGroupManager.Removed` list is still dead code: no test ever asserts it.**
`tests/Mbproxy.Tests/Admin/SignalRFakes.cs:30-31`.
This was noted as prior finding M4 in the 2026-05-15 review ("either delete `Removed` from the
fake, or add a comment"). The list is still present and still has no asserting test.
`StatusHub.OnDisconnectedAsync` never calls `RemoveFromGroupAsync` (relying on SignalR's
implicit group cleanup), so no assertion ever fires. The unasserted field is a minor maintenance
hazard: a future author may assume this is the intended assertion target and waste time debugging
a test that never fails.
**Recommendation:** Drop the `Removed` list from `FakeGroupManager`, or add a single inline
comment: `// SignalR auto-removes disconnected connections from groups; explicit RemoveFromGroupAsync is not called.`
---
**m4 — `mbproxy.smoke.config.json` still carries a dangling reference to `plans/`.**
`tests/sim/mbproxy.smoke.config.json:1`.
The header comment references `plans/2026-05-15-webui-dashboard.md`. As of commit `7466a46`
("retire superseded design/plan docs"), `plans/` is untracked (`?? plans/` in git status) and
contains no file of that name. The comment is a stale citation that will silently fail any
documentation-link check.
**Recommendation:** Replace the reference with a plain description of the smoke file's purpose,
removing the specific plan filename.
---
**m5 — Both install templates have `AdminPort: 8080` hard-coded in the `Plcs` section inline comment, but the comment already appears in the admin-port section — the duplication in the PLC entry comment is not wrong but potentially confusing.**
`install/mbproxy.config.template.json:79` (`"// GET /status.json → …"` inside the PLC
description); both templates are identical here.
This is a documentation nit: the `Plcs` array comment block on lines 6075 is accurate, and the
preceding inline comment about the 4-client cap remains true and relevant. No functional issue.
**Recommendation:** No change required; noted for completeness.
---
**m6 — `publish.ps1` and `publish.sh` do not exit non-zero if the config template copy fails.**
`install/publish.ps1:87-94`; `install/publish.sh:85-95`.
Both scripts call `throw` / `exit 1` if the template file is missing, which is correct. However
the `Copy-Item -Force` / `cp -f` steps that copy the template into each output flavour silently
succeed even if the destination directory does not exist (the directory was just created by
`dotnet publish`). If `dotnet publish` produced no output directory (a degenerate build failure
that `dotnet publish` should have already reported as non-zero), the copy also fails with an
error but neither script re-checks `$LASTEXITCODE` after the copy loop. In practice `dotnet publish`
failing before this point already causes both scripts to throw/exit (the preceding
`if ($LASTEXITCODE -ne 0) { throw }` guards). This is low-risk but the post-copy result
verification (the `Format-Size` / `du -h` summary block) is output-only and does not fail the
script if a binary is missing — it emits a `Write-Warning` / `echo WARNING` instead.
**Recommendation:** In the result summary loop of both scripts, change the missing-binary branch
from a warning to a non-zero exit:
```powershell
# publish.ps1
if (Test-Path $bin) { ... } else { throw "Expected binary not found: $bin" }
```
```bash
# publish.sh
if [[ -f "$bin" ]]; then ... else echo "ERROR: expected binary not found: $bin" >&2; exit 1; fi
```
---
**m7 — `mbproxy.service` `ReadWritePaths` does not include `/etc/mbproxy`, so a hot-reload write by the service account to its own config directory is blocked by `ProtectSystem=strict`.**
`install/mbproxy.service:40`.
```ini
ReadWritePaths=/var/log/mbproxy /var/cache/mbproxy
```
`ProtectSystem=strict` makes the entire filesystem read-only except paths listed in
`ReadWritePaths`. The service itself never writes to `/etc/mbproxy` (it only reads
`appsettings.json`), so this is not a runtime bug. However:
1. The comment at line 8 states `WorkingDirectory=/etc/mbproxy` — the service account needs
read access there, which `ProtectSystem=strict` provides (read-only is allowed).
2. If an operator's monitoring or deployment tooling updates the config in-place (e.g.
`scp`-ing a new `appsettings.json` as the `mbproxy` user via an SSH key), that write will
be blocked by `ProtectSystem=strict`. The operator would need to write as root and chown/chmod
as appropriate.
This is likely intentional (config is an admin operation), but the systemd unit has no comment
explaining why `/etc/mbproxy` is intentionally absent from `ReadWritePaths`. It trips up
operators who expect the service account to own its config.
**Recommendation:** Add a comment after the `ReadWritePaths` line: `# /etc/mbproxy is intentionally absent — config changes require root (ProtectSystem=strict makes it read-only for the service account).`
---
## Regression check against 2026-05-15 findings
| Prior finding | Status |
|---|---|
| M1 — `LoopAsync` untested | Closed — `Loop_PushesRepeatedly_ThenStopsAfterStopAsync` added |
| M2 — `Record`/`Disarm` race | Closed — production fix + `ConcurrentRecordAndDisarm_LeavesNoStaleObservation` |
| M3 — `PlcSubscriptionTracker` no direct/concurrent test | Closed — `PlcSubscriptionTrackerTests.cs` added with stress test |
| M4 — `FakeGroupManager.Removed` dead code | Partially addressed: no test added; `Removed` list is still present — see new `m3` |
| m1 — `ThrowingStatusPushSink` missing | Closed — `ThrowingStatusPushSink` and two swallow/propagate tests added |
| m2 — `FakeHubCallerContext.ConnectionAborted` hardcoded | No change; accepted as low-risk |
| m3 — asset allow-list coverage | Closed — `EmbeddedAssetsTests` covers glob; `Get_Asset` asserts bytes verbatim |
| m4 — `Get_Asset` byte assertion missing | Closed — `ReadEmbeddedAsset` comparison added |
| m5 — `*.*` glob caveats | Unchanged; accepted |
| m6 — `mbproxy.smoke.config.json` dangling plans reference | Not fixed — see new `m4` |
| m7 — `ConfigReconcilerTests` registry wiring untested | Closed — `Apply_AddThenRemovePlc_TagCaptureRegistryTracksRoster` added |
| m8 — `AdminPushIntervalMs` upper bound unguarded | Closed — upper bound added to both `ReloadValidator` and `MbproxyOptionsValidator`; four new tests |
| m9 — `OnDisconnectedAsync` always called with null | No change; accepted as low-risk |
| Coverage gap 7 — `/hub/status` SignalR E2E | Closed — `HubStatusE2ETests` (2 facts) added |
## Key file references
- `install/mbproxy.config.template.json:83`, `install/mbproxy.linux.config.template.json:87` — "Set to 0 to disable" comment (M1)
- `src/Mbproxy/Admin/AdminEndpointHost.cs:190``k.Listen(..., port)` with no port-0 guard (M1)
- `src/Mbproxy/Configuration/ReloadValidator.cs:65-68` — rejects `AdminPort < 1` on hot-reload but not at startup (M1)
- `tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs:67,242,310`, `RewriterE2ETests.cs:388`, `HostSmokeTests.cs:111` — hard-coded port 8080 (M2)
- `tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs:121-122``CancellationToken.None` swallows test-runner cancellation (M3)
- `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs:188` — unnecessary 400 ms idle delay (m2)
- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs:30-31``Removed` list never asserted (m3)
- `tests/sim/mbproxy.smoke.config.json:1` — dangling `plans/` reference (m4)
- `install/publish.ps1:103-110`, `install/publish.sh:99-106` — warning-only when expected binary missing (m6)
- `install/mbproxy.service:40``ReadWritePaths` silent about `/etc/mbproxy` exclusion (m7)
+2 -2
View File
@@ -143,7 +143,7 @@ DL205 / DL260 BCD is non-negative in the default ladder pattern. `BcdCodec.Encod
## Exception Pass-Through
Modbus exception responses pass through unchanged. The rewriter detects an exception response by the high bit of the function code (`fc & 0x80 != 0`), emits a `mbproxy.rewrite.exception_passthrough` event, increments the per-FC exception counter, and returns without touching the payload.
Modbus exception responses pass through unchanged. The rewriter detects an exception response by the high bit of the function code (`fc & 0x80 != 0`), emits a `mbproxy.exception.passthrough` event, increments the per-FC exception counter, and returns without touching the payload.
Covered exception codes:
@@ -229,7 +229,7 @@ The rewriter feeds two counters that surface on the status page:
An out-of-range value (`< 0` or `> 9999` for 16-bit; `< 0` or `> 99_999_999` for 32-bit) on a write, or a bad nibble (`>= 0xA`) on a read, increments an internal invalid-BCD counter and emits `mbproxy.rewrite.invalid_bcd` at warning. The PDU passes through raw in that case; the rewriter never substitutes a value the client did not send (writes) or the PLC did not return (reads).
Both counters are exposed on the status page; see [`../Operations/StatusPage.md`](../Operations/StatusPage.md). The corresponding log events (`mbproxy.rewrite.partial_bcd`, `mbproxy.rewrite.invalid_bcd`, `mbproxy.rewrite.exception_passthrough`) are catalogued in [`../Reference/LogEvents.md`](../Reference/LogEvents.md). Partial-overlap troubleshooting is covered in [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md).
Both counters are exposed on the status page; see [`../Operations/StatusPage.md`](../Operations/StatusPage.md). The corresponding log events (`mbproxy.rewrite.partial_bcd`, `mbproxy.rewrite.invalid_bcd`, `mbproxy.exception.passthrough`) are catalogued in [`../Reference/LogEvents.md`](../Reference/LogEvents.md). Partial-overlap troubleshooting is covered in [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md).
The `dl205.json` pymodbus simulator profile encodes BCD test fixtures used by the integration test suite; see [`../Testing/Simulator.md`](../Testing/Simulator.md).
+9 -7
View File
@@ -56,6 +56,7 @@ If a step throws, the exception is logged at Error and the loop continues with t
| `Cache.EvictionIntervalMs` | Read by the next eviction loop tick. |
| `Resilience.ReadCoalescing.Enabled` flipped to `false` | Already-running coalesced entries drain naturally. Subsequent reads bypass coalescing. |
| `Resilience.ReadCoalescing.MaxParties` | Applies to subsequent attaches. Existing in-flight entries keep their current cap. |
| `Resilience.BackendConnect.*` or `Resilience.ListenerRecovery.*` | **Restart-only.** The backend-connect and listener-recovery Polly pipelines are built from the `Resilience` snapshot taken at service startup; the reconciler builds add/restart supervisors from that same frozen snapshot, so a hot-reload of these values does not propagate to any PLC. Restart the service to change them. |
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole. The current in-memory config stays in effect. `mbproxy.config.reload.rejected` is logged at Error. |
The "next-PDU" wording is load-bearing for the tag-list rows: the rewriter does not snapshot the tag map at connection accept time. It resolves the map for the active PLC at the start of every request frame, so a hot-reloaded tag list is in effect for the very next request, even on existing TCP connections.
@@ -78,21 +79,22 @@ The `ReloadPlan` distinguishes two kinds of "PLC is still here but changed":
3. Merge in `Plcs[i].BcdTags.Add` entries — if an address already exists in the working set, the `Add` entry wins. This is how a per-PLC width override is expressed (the global lists a 16-bit tag at the same address; the per-PLC `Add` overrides it to 32-bit).
4. Fold `Plcs[i].DefaultCacheTtlMs` into any tag whose explicit `CacheTtlMs` is null.
The same builder runs both at startup and during reload validation, so a configuration that builds cleanly at startup is guaranteed to build cleanly at reload, and vice versa. There is no second validator that could disagree with the first.
The same builder runs both at startup and during reload validation, so a configuration that builds cleanly at startup is guaranteed to build cleanly at reload, and vice versa.
## Validation Rules
`ReloadValidator.Validate` is the gate the hot-reload path consults directly. It runs the following checks in order:
`ReloadValidator.Validate` is the configuration gate. It runs at **startup** (in `ProxyWorker.ExecuteAsync`, before any supervisor is built — a rejection logs `mbproxy.startup.config.rejected` and the service exits non-zero) **and** on every hot reload. It runs the following checks in order:
1. PLC names are non-empty and unique under ordinal comparison.
2. Every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the `Plcs` list.
3. `AdminPort` is in `[1, 65535]` and does not collide with any `ListenPort`.
2. Every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the `Plcs` list; every `Host` is non-empty and every backend `Port` is in `[1, 65535]`.
3. `AdminPort` is in `[1, 65535]`, or `0` to disable the admin endpoint; a non-zero `AdminPort` must not collide with any `ListenPort`.
4. For each PLC, `BcdTagMapBuilder.Build(next.BcdTags, plc.BcdTags, plc.DefaultCacheTtlMs)` reports no errors. This delegates the per-PLC well-formedness checks — duplicate addresses within a single resolved list, and 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry — to the single source of truth used at startup.
5. Cache TTL bounds: every `BcdTag.CacheTtlMs` and every `Plcs[i].DefaultCacheTtlMs` must be `>= 0`, and any value above `60_000` ms requires `Cache.AllowLongTtl = true`. `Cache.MaxEntriesPerPlc` and `Cache.EvictionIntervalMs` must be `>= 0`.
5. Cache TTL bounds: every `BcdTag.CacheTtlMs` and every `Plcs[i].DefaultCacheTtlMs` must be `>= 0`, and any value above `60_000` ms requires `Cache.AllowLongTtl = true`. `Cache.MaxEntriesPerPlc` must be in `[0, 100000]` and `Cache.EvictionIntervalMs` must be `>= 0`.
6. `AdminPushIntervalMs` is in `[1, 60000]`; connection timeouts are `> 0`; the keepalive cross-field rule holds; and the `Resilience` profiles are well-formed (`BackendConnect.MaxAttempts >= 1` with at least `MaxAttempts - 1` non-negative `BackoffMs` entries, `ListenerRecovery.SteadyStateMs > 0`, `ReadCoalescing.MaxParties >= 1`).
A failure at any step appends to the error list but the validator runs to completion so the operator sees every problem with a single save. If the list is non-empty, the reload is rejected atomically and no state mutates.
A failure at any step appends to the error list but the validator runs to completion so the operator sees every problem with a single save. If the list is non-empty, the reload is rejected atomically and no state mutates (at startup, the service refuses to start).
Schema-level checks — invalid `Width` values on a `BcdTagOptions`, type mismatches, malformed JSON — are also enforced by `MbproxyOptionsValidator` (`IValidateOptions<MbproxyOptions>`) at bind time. The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording.
Schema-level checks — invalid `Width` values on a `BcdTagOptions`, type mismatches, malformed JSON — are also enforced by `MbproxyOptionsValidator` (`IValidateOptions<MbproxyOptions>`) at bind time. The two validators overlap deliberately; their error wording is similar but not guaranteed identical.
### Rejected-reload example
+6 -5
View File
@@ -281,21 +281,22 @@ The cache itself is described in detail in [`../Architecture/ResponseCache.md`](
## Validation Rules
`ReloadValidator.Validate` runs on every config load (startup and hot reload) and rejects the entire snapshot if any rule fails. On rejection at startup, the service exits non-zero. On rejection at runtime, the current in-memory config stays in effect and `mbproxy.config.reload.rejected` is logged at `Error`.
`ReloadValidator.Validate` runs on every config load (startup and hot reload) and rejects the entire snapshot if any rule fails. On rejection at startup, the service logs `mbproxy.startup.config.rejected` at `Error` and exits non-zero. On rejection at runtime, the current in-memory config stays in effect and `mbproxy.config.reload.rejected` is logged at `Error`.
Rules (in order):
1. **PLC names**: every `Plcs[i].Name` is non-empty and unique (ordinal comparison).
2. **ListenPort**: every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the array.
3. **AdminPort**: in `[1, 65535]` and does not collide with any `ListenPort`.
2. **ListenPort / Host / Port**: every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the array; every `Host` is non-empty; every backend `Port` is in `[1, 65535]`.
3. **AdminPort**: in `[1, 65535]`, or `0` to disable the admin endpoint; a non-zero value does not collide with any `ListenPort`.
4. **BCD tag map** per PLC, delegated to `BcdTagMapBuilder.Build`:
- duplicate addresses within a single PLC's resolved tag list
- 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry at that address
5. **Cache TTL bounds**:
- any `CacheTtlMs` or `DefaultCacheTtlMs` less than 0 is rejected
- any `CacheTtlMs` or `DefaultCacheTtlMs` greater than `60_000` is rejected unless `Cache.AllowLongTtl = true`
6. **Cache size knobs**: `Cache.MaxEntriesPerPlc >= 0`, `Cache.EvictionIntervalMs >= 0`.
7. **Width**: every `BcdTagOptions.Width` is `16` or `32` (enforced by `MbproxyOptionsValidator` at schema time).
6. **Cache size knobs**: `Cache.MaxEntriesPerPlc` in `[0, 100000]`, `Cache.EvictionIntervalMs >= 0`.
7. **AdminPushIntervalMs / timeouts / keepalive / Resilience**: `AdminPushIntervalMs` in `[1, 60000]`; connection timeouts `> 0`; the keepalive cross-field rule (`BackendHeartbeatIdleMs > BackendRequestTimeoutMs`); and well-formed `Resilience` profiles (`BackendConnect.MaxAttempts >= 1` with `>= MaxAttempts - 1` non-negative `BackoffMs` entries, `ListenerRecovery.SteadyStateMs > 0`, `ReadCoalescing.MaxParties >= 1`).
8. **Width**: every `BcdTagOptions.Width` is `16` or `32` (also enforced by `MbproxyOptionsValidator` at schema time).
Sample rejection messages (logged at `Error` with the structured property `errors` carrying the full list):
+2
View File
@@ -330,6 +330,8 @@ The detail page's debug view is fed by an **on-demand per-tag value capture** (`
| `debug.tags[].updatedAtUtc` | `string?` | ISO-8601 time of the observation; `null` when no traffic yet. |
| `debug.tags[].ageSeconds` | `double?` | Seconds since the observation; `null` when no traffic yet. |
`PlcDetailResponse` is delivered **only** over the `/hub/status` SignalR feed (the `"plc"` message); there is no `GET` route for it, and it is serialized through the SignalR JSON protocol rather than `StatusJsonContext`. Scrapers that want per-PLC counters use the `plcs[]` array of `GET /status.json` instead — the debug-view capture has no JSON-twin endpoint.
## How to Scrape It
The JSON twin is plain HTTP. Any monitoring system that can curl an endpoint can scrape it.
+15 -3
View File
@@ -45,6 +45,18 @@ Fires once after `ProxyWorker.StartAsync` has spun up every per-PLC supervisor a
**Operator action:** if the two counts disagree, search for `mbproxy.startup.bind.failed` entries to identify the missing PLCs.
### mbproxy.startup.config.rejected
**Level:** Error &middot; **EventId:** 2 &middot; **Source:** `src/Mbproxy/Proxy/ProxyWorker.cs`
| Property | Type | Meaning |
|----------|------|---------|
| `Errors` | `string` | Concatenated validation failures (one per `;`). |
Fires once at startup when `ReloadValidator.Validate` rejects the initial `appsettings.json` — duplicate listen ports, an `AdminPort` collision, duplicate PLC names, a malformed BCD tag list, a bad keepalive cross-field relationship, or an invalid `Resilience` profile. The service then exits non-zero; no listeners are started. This is the startup-time twin of `mbproxy.config.reload.rejected`.
**Operator action:** fix the offending entry in `appsettings.json` and restart the service. The error text names every failed rule.
### mbproxy.startup.bind
**Level:** Information &middot; **EventId:** 20 (`PlcListener`) / 40 (`PlcListenerSupervisor`) &middot; **Source:** `src/Mbproxy/Proxy/PlcListener.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
@@ -60,7 +72,7 @@ Fires when a per-PLC `TcpListener` successfully binds its configured port. Emitt
### mbproxy.startup.bind.failed
**Level:** Error &middot; **EventId:** 21 (`ProxyWorker`) / 41 (`PlcListenerSupervisor`) &middot; **Source:** `src/Mbproxy/Proxy/ProxyWorker.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
**Level:** Error &middot; **EventId:** 41 &middot; **Source:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
| Property | Type | Meaning |
|----------|------|---------|
@@ -88,7 +100,7 @@ Fires after the supervisor's Polly recovery pipeline successfully rebinds a list
### mbproxy.listener.faulted
**Level:** Error (`PlcListener`) / Warning (`PlcListenerSupervisor`) &middot; **EventId:** 22 / 43 &middot; **Source:** `src/Mbproxy/Proxy/PlcListener.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
**Level:** Warning &middot; **EventId:** 43 &middot; **Source:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
| Property | Type | Meaning |
|----------|------|---------|
@@ -96,7 +108,7 @@ Fires after the supervisor's Polly recovery pipeline successfully rebinds a list
| `Port` | `int` | Port whose listener faulted. |
| `Reason` | `string` | Top-level exception message. |
Fires when a listener's accept loop throws. The two sources emit at different levels deliberately: the unsupervised `PlcListener` instance logs at `Error` (a terminal condition for that listener), while the supervised emission is `Warning` because Polly will retry. The supervised path attaches the exception object as the `LoggerMessage` exception parameter, so the stack trace is captured.
Fires when a listener's accept loop throws. `PlcListener.RunAsync` propagates the fault to its `PlcListenerSupervisor`, which logs this event at `Warning` (Polly will retry) with the exception object attached as the `LoggerMessage` exception parameter, so the stack trace is captured.
**Operator action:** if the same `Plc` produces repeated faults inside a few minutes, inspect the network path. A burst of faults paired with `mbproxy.multiplex.backend.disconnected` indicates the PLC itself is unhealthy rather than a proxy issue.
+3
View File
@@ -38,6 +38,9 @@ ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/log/mbproxy /var/cache/mbproxy
# /etc/mbproxy is intentionally absent from ReadWritePaths: the service only READS its
# config (ProtectSystem=strict still allows reads), and config changes are an admin
# operation. Editing appsettings.json must be done as root, not by the service account.
# If any configured ListenPort is below 1024, also add:
# AmbientCapabilities=CAP_NET_BIND_SERVICE
+3 -1
View File
@@ -106,7 +106,9 @@ foreach ($flavour in 'self-contained','framework-dependent') {
$size = (Get-Item $bin).Length
Write-Host (" {0,-22} {1,10} {2}" -f $flavour, (Format-Size $size), $bin)
} else {
Write-Warning "Missing: $bin"
# A missing expected binary means the publish silently produced nothing usable —
# fail the script rather than emit a warning a CI job would scroll past.
throw "Expected published binary not found: $bin"
}
}
Write-Host ""
+4 -1
View File
@@ -102,7 +102,10 @@ for flavour in self-contained framework-dependent; do
size="$(du -h "$bin" | cut -f1)"
printf ' %-22s %8s %s\n' "$flavour" "$size" "$bin"
else
echo " WARNING: missing $bin" >&2
# A missing expected binary means the publish silently produced nothing
# usable — fail rather than emit a warning a CI job would scroll past.
echo "ERROR: expected published binary not found: $bin" >&2
exit 1
fi
done
echo
+23 -5
View File
@@ -161,11 +161,22 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
/// <summary>
/// Builds and starts a Kestrel <see cref="WebApplication"/> on <paramref name="port"/>.
/// On bind failure, logs the error and sets <c>_app = null</c> — does NOT throw.
/// Caller must hold <c>_lock</c> or be in a single-threaded context (StartAsync).
/// <paramref name="port"/> 0 means the admin endpoint is disabled — no listener is
/// started. On bind failure, logs the error and sets <c>_app = null</c> — does NOT
/// throw. Caller must hold <c>_lock</c> or be in a single-threaded context (StartAsync).
/// </summary>
private async Task StartAppAsync(int port, CancellationToken ct)
{
// AdminPort 0 means "admin endpoint disabled" — the documented disable switch in
// the config templates. Kestrel would otherwise interpret port 0 as "bind an
// OS-assigned ephemeral port", leaving an unadvertised HTTP listener running
// (review TestsAndConfig M1). Skip the build entirely.
if (port == 0)
{
_logger.LogInformation("Admin endpoint disabled (Mbproxy.AdminPort = 0).");
return;
}
// Declared outside the try so the catch can dispose a built-but-not-fully-started
// app on a bind failure (M6 — otherwise a built WebApplication or a started
// Kestrel listener leaks on any throw after Build()).
@@ -197,9 +208,7 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
// does not trim/AOT, so a reflection JSON path is acceptable here.
builder.Services
.AddSignalR()
.AddJsonProtocol(o =>
o.PayloadSerializerOptions.PropertyNamingPolicy =
System.Text.Json.JsonNamingPolicy.CamelCase);
.AddJsonProtocol(o => ConfigureHubPayloadJson(o.PayloadSerializerOptions));
builder.Services.AddSingleton(_subscriptionTracker);
app = builder.Build();
@@ -318,6 +327,15 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
await app.DisposeAsync().ConfigureAwait(false);
}
/// <summary>
/// Configures the JSON serialization for the SignalR hub payloads: camelCase property
/// names, matching the dashboard JS field names and the <c>GET /status.json</c> wire
/// shape. Extracted as a single shared method so <c>DebugDtoSerializationTests</c> can
/// assert against the exact same configuration the hub uses — neither side can drift.
/// </summary>
internal static void ConfigureHubPayloadJson(System.Text.Json.JsonSerializerOptions options)
=> options.PropertyNamingPolicy = System.Text.Json.JsonNamingPolicy.CamelCase;
// ── Embedded asset serving ───────────────────────────────────────────────
private const string AssetResourcePrefix = "Mbproxy.Admin.wwwroot.";
+36 -6
View File
@@ -37,6 +37,19 @@ internal sealed partial class StatusBroadcaster : IAsyncDisposable
// _disposed flag, and defends a future caller from touching the disposed CTS.
private bool _stopped;
// 0 until the first Start(); a second Start() is a no-op (would otherwise orphan the
// first loop task — StopAsync only awaits the latest _loop).
private int _started;
// Count of consecutive push cycles that hit at least one failure. Used to throttle
// the per-cycle error logging so a wedged SignalR transport does not flood the
// rolling file (the first failure and then one per ~minute are logged). Touched only
// from the single-threaded push loop.
private int _consecutivePushFailures;
private bool ShouldLogPushFailure()
=> _consecutivePushFailures == 0 || _consecutivePushFailures % 60 == 0;
public StatusBroadcaster(
IStatusPushSink sink,
StatusSnapshotBuilder builder,
@@ -53,8 +66,13 @@ internal sealed partial class StatusBroadcaster : IAsyncDisposable
_logger = logger;
}
/// <summary>Starts the push loop. Idempotent only in the sense that it is called once.</summary>
public void Start() => _loop = Task.Run(() => LoopAsync(_cts.Token));
/// <summary>Starts the push loop. Idempotent — a second call is a no-op.</summary>
public void Start()
{
if (Interlocked.CompareExchange(ref _started, 1, 0) != 0)
return; // already started — do not orphan the running loop
_loop = Task.Run(() => LoopAsync(_cts.Token));
}
/// <summary>
/// Stops the push loop and disarms every tag-value capture.
@@ -82,6 +100,10 @@ internal sealed partial class StatusBroadcaster : IAsyncDisposable
/// <summary>One push cycle. Exposed internally so unit tests can drive it deterministically.</summary>
internal async Task PushOnceAsync(CancellationToken ct)
{
// Tracks whether this cycle hit any failure, to drive the consecutive-failure
// log throttle (see ShouldLogPushFailure).
bool anyFailure = false;
StatusResponse snapshot;
try
{
@@ -89,7 +111,8 @@ internal sealed partial class StatusBroadcaster : IAsyncDisposable
}
catch (Exception ex)
{
LogSnapshotFailed(_logger, ex);
if (ShouldLogPushFailure()) LogSnapshotFailed(_logger, ex);
_consecutivePushFailures++;
return;
}
@@ -99,7 +122,8 @@ internal sealed partial class StatusBroadcaster : IAsyncDisposable
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
LogFleetPushFailed(_logger, ex);
if (ShouldLogPushFailure()) LogFleetPushFailed(_logger, ex);
anyFailure = true;
}
// Reconcile capture arm state from the live viewer set. This is the single
@@ -119,15 +143,21 @@ internal sealed partial class StatusBroadcaster : IAsyncDisposable
try
{
var plc = plcsByName!.GetValueOrDefault(plcName);
var debug = _builder.BuildDebug(plcName);
// armedOverride: true — every plcName here came from activePlcs, which
// ReconcileArmed was just driven from, so the capture is armed by intent.
var debug = _builder.BuildDebug(plcName, armedOverride: true);
var detail = new PlcDetailResponse(plc, debug);
await _sink.PushPlcAsync(plcName, detail, ct).ConfigureAwait(false);
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
LogDetailPushFailed(_logger, plcName, ex);
if (ShouldLogPushFailure()) LogDetailPushFailed(_logger, plcName, ex);
anyFailure = true;
}
}
// Advance / reset the consecutive-failure run that throttles error logging.
_consecutivePushFailures = anyFailure ? _consecutivePushFailures + 1 : 0;
}
private async Task LoopAsync(CancellationToken ct)
+6
View File
@@ -49,6 +49,12 @@ internal sealed class StatusHub : Hub
// own OnDisconnectedAsync — which SignalR dispatches only after this invocation's
// Task completes — always observes a consistent state. Capture arming is NOT done
// here; StatusBroadcaster reconciles it each cycle from the tracker.
//
// If AddToGroupAsync below throws (a mid-invocation transport fault), the tracker
// entry is left in place; OnDisconnectedAsync is the backstop — it always runs for
// a connection that completed OnConnectedAsync, and RemoveConnection then releases
// the tab. The transient cost is at most one armed capture with no live group
// member until that disconnect callback fires.
_tracker.SubscribePlc(Context.ConnectionId, tabId, plcName);
await Groups.AddToGroupAsync(Context.ConnectionId, PlcGroup(plcName)).ConfigureAwait(false);
}
@@ -40,8 +40,16 @@ internal sealed class StatusSnapshotBuilder
/// for every configured BCD tag. Returns an empty, disarmed snapshot when
/// <paramref name="plcName"/> is unknown (e.g. a detail page open for a PLC removed
/// by hot-reload).
///
/// <para><paramref name="armedOverride"/> lets the caller supply the armed flag rather
/// than have this method independently re-read <c>capture.IsArmed</c>. The broadcaster
/// passes <c>true</c> because it only builds a debug snapshot for PLCs it just
/// reconciled armed in the same push cycle — so the pushed payload's <c>CaptureArmed</c>
/// flag is consistent with that decision by construction, instead of racing a
/// disarm between the reconcile and this read (review AdminSignalR M1). When omitted,
/// the live <c>capture.IsArmed</c> is used.</para>
/// </summary>
public PlcDebugSnapshot BuildDebug(string plcName)
public PlcDebugSnapshot BuildDebug(string plcName, bool? armedOverride = null)
{
if (!_captureRegistry.TryGet(plcName, out var capture))
return new PlcDebugSnapshot(CaptureArmed: false, Tags: Array.Empty<TagValueDto>());
@@ -51,7 +59,7 @@ internal sealed class StatusSnapshotBuilder
.Select(o => ToTagDto(o, now))
.ToList();
return new PlcDebugSnapshot(capture.IsArmed, tags);
return new PlcDebugSnapshot(armedOverride ?? capture.IsArmed, tags);
}
private static TagValueDto ToTagDto(TagValueObservation o, DateTimeOffset now)
+19 -5
View File
@@ -15,6 +15,15 @@
// ── Helpers ────────────────────────────────────────────────────────────
const $ = (id) => document.getElementById(id);
// util.js must load before this script. If it failed to load, fail loud and
// visible rather than letting the destructure throw and abort the page silently.
if (!window.mbproxyUtil) {
document.body.innerHTML =
'<p style="padding:2rem;font-family:sans-serif;color:#b00">' +
'Admin UI failed to load (util.js missing). Check the browser console.</p>';
throw new Error('window.mbproxyUtil is not defined');
}
const { escapeHtml, escapeAttr } = window.mbproxyUtil;
function num(n) {
@@ -45,7 +54,7 @@
const cls = state === 'bound' ? 'chip-ok'
: state === 'recovering' ? 'chip-warn'
: 'chip-idle';
return `<span class="chip ${cls}">${state}</span>`;
return `<span class="chip ${cls}">${escapeHtml(state)}</span>`;
}
function ratioCell(hit, miss) {
@@ -75,6 +84,11 @@
}
prevPdu.set(plc.name, { forwarded: cur, t: now });
}
// Prune entries for PLCs no longer in the snapshot (hot-reload removal) so the
// Maps don't grow unbounded and rateByName.size stays an accurate "have rates" gauge.
const currentNames = new Set(snapshot.plcs.map(p => p.name));
for (const k of prevPdu.keys()) if (!currentNames.has(k)) prevPdu.delete(k);
for (const k of rateByName.keys()) if (!currentNames.has(k)) rateByName.delete(k);
}
function rateOf(name) {
@@ -249,10 +263,10 @@
connection.on('fleet', onSnapshot);
connection.onreconnecting(() => setConn('connecting', 'reconnecting'));
connection.onreconnected(() => {
setConn('connected');
connection.invoke('SubscribeFleet').catch(() => {});
});
// On a warm reconnect the transport is back but the group subscription is not —
// re-subscribe through connect(), which retries with backoff if the invoke fails
// (instead of silently leaving a live connection with no feed).
connection.onreconnected(() => connect());
connection.onclose(() => setConn('disconnected', 'disconnected'));
// Cold start. withAutomaticReconnect only recovers an already-established
+16 -7
View File
@@ -20,7 +20,16 @@
}
const $ = (id) => document.getElementById(id);
const { escapeHtml } = window.mbproxyUtil;
// util.js must load before this script. If it failed to load, fail loud and
// visible rather than letting the destructure throw and abort the page silently.
if (!window.mbproxyUtil) {
document.body.innerHTML =
'<p style="padding:2rem;font-family:sans-serif;color:#b00">' +
'Admin UI failed to load (util.js missing). Check the browser console.</p>';
throw new Error('window.mbproxyUtil is not defined');
}
const { escapeHtml, escapeAttr } = window.mbproxyUtil;
document.title = `mbproxy — ${plcName}`;
$('crumb-name').textContent = plcName;
@@ -220,8 +229,8 @@
</tr>`;
}
const dirCls = t.direction === 'write' ? 'dir-write' : 'dir-read';
const stale = (t.ageSeconds || 0) > 30 ? ' stale' : '';
return `<tr class="${stale.trim()}">
const staleAttr = (t.ageSeconds || 0) > 30 ? ' class="stale"' : '';
return `<tr${staleAttr}>
${tagCell(t)}
<td>${Number(t.width)}-bit</td>
<td><span class="dir-tag ${dirCls}">${escapeHtml(t.direction)}</span></td>
@@ -280,10 +289,10 @@
connection.on('plc', onDetail);
connection.onreconnecting(() => setConn('connecting', 'reconnecting'));
connection.onreconnected(() => {
setConn('connected');
connection.invoke('SubscribePlc', plcName, tabId).catch(() => {});
});
// On a warm reconnect the transport is back but the group subscription is not —
// re-subscribe through connect(), which retries with backoff if the invoke fails
// and re-arms the snapshot watchdog (instead of silently showing a dead feed).
connection.onreconnected(() => connect());
connection.onclose(() => setConn('disconnected', 'disconnected'));
// Cold start. withAutomaticReconnect only recovers an already-established
+15 -6
View File
@@ -130,6 +130,7 @@ public static class BcdTagMapBuilder
// and (101,W=32)) would otherwise produce two BcdErrors — one from each
// direction. Track reported (low,high) pairs so each collision logs once.
var reportedCollisions = new HashSet<(ushort, ushort)>();
var collidingAddresses = new HashSet<ushort>();
foreach (var tag in validated.Values)
{
if (!tag.IsThirtyTwoBit)
@@ -138,6 +139,10 @@ public static class BcdTagMapBuilder
ushort highReg = tag.HighRegister;
if (validated.TryGetValue(highReg, out var collision))
{
// Both entries of the colliding pair are unsafe to keep in the map.
collidingAddresses.Add(tag.Address);
collidingAddresses.Add(collision.Address);
// Canonicalise the pair so (a,b) and (b,a) collapse.
var pair = tag.Address < collision.Address
? (tag.Address, collision.Address)
@@ -152,12 +157,16 @@ public static class BcdTagMapBuilder
}
}
// ── Step 5: build the frozen map from entries that have no errors ─────
// Entries implicated in an OverlappingHighRegister error are still included
// in the map so that the caller can see all context; the error list tells them
// the config is invalid and must be corrected before the service is safe to run.
// (If callers want to exclude bad entries they should check Errors.Count > 0
// and refuse to start the listener for that PLC.)
// ── Step 5: build the frozen map from entries that passed validation ──
// Exclude entries implicated in an OverlappingHighRegister collision — consistent
// with how InvalidWidth/DuplicateAddress entries are already dropped above — so a
// map returned alongside a non-empty Errors list is always safe for a caller to
// use even if it forgets to check Errors.Count (review ProxyAndBcd N7). Callers
// SHOULD still treat Errors.Count > 0 as a fatal config problem and refuse to
// start the listener; this is defence-in-depth, not a substitute for that check.
foreach (var addr in collidingAddresses)
validated.Remove(addr);
var frozen = validated.ToFrozenDictionary();
var map = frozen.Count > 0 ? new BcdTagMap(frozen) : BcdTagMap.Empty;
@@ -66,6 +66,16 @@ internal sealed partial class ConfigReconciler : IDisposable
// restarted via hot-reload honour the current `Connection.Keepalive` values.
private Func<KeepaliveOptions>? _keepaliveAccessor;
// Resilience options frozen at startup. The reconciler builds the backend-connect and
// listener-recovery Polly pipelines from THIS snapshot, never from the reloaded
// `next.Resilience` — so an added/restarted PLC uses the same retry/recovery profile
// as the PLCs that were never touched. `Resilience.BackendConnect` /
// `Resilience.ListenerRecovery` are therefore restart-only (documented in
// docs/Features/HotReload.md); this avoids the inconsistency where the same key
// behaved differently per PLC depending on whether it was added in the reload
// (review ConfigAndHosting M3).
private ResilienceOptions? _startupResilience;
// ── Debounce + serialisation machinery ───────────────────────────────────────────────
// Channel carries Unit to signal "something changed — please check".
@@ -135,6 +145,7 @@ internal sealed partial class ConfigReconciler : IDisposable
_currentOptions = initialOptions;
_coalescingAccessor = coalescingAccessor;
_keepaliveAccessor = keepaliveAccessor;
_startupResilience = initialOptions.Resilience;
}
// ── ApplyAsync (exposed for tests) ───────────────────────────────────────────────────
@@ -248,6 +259,12 @@ internal sealed partial class ConfigReconciler : IDisposable
// Compute global tag delta (count of entries that differ).
int globalTagDelta = ComputeGlobalTagDelta(_currentOptions.BcdTags, next.BcdTags);
// Counts per-step failures across the parallel apply tasks so step 7 can report a
// reload that was applied only partially, instead of logging an unqualified
// "applied" (review ConfigAndHosting N1). Incremented via Interlocked because the
// Remove/Restart/Add tasks run concurrently under Task.WhenAll.
int stepFailures = 0;
// ── 3. Apply: Remove ─────────────────────────────────────────────────
if (plan.ToRemove.Count > 0)
{
@@ -266,6 +283,7 @@ internal sealed partial class ConfigReconciler : IDisposable
}
catch (Exception ex)
{
Interlocked.Increment(ref stepFailures);
_logger.LogError(ex, "Error stopping supervisor for removed PLC '{Plc}': {Message}",
name, ex.Message);
}
@@ -278,7 +296,9 @@ internal sealed partial class ConfigReconciler : IDisposable
// ── 4. Apply: Restart (stop + rebuild + start) ───────────────────────
if (plan.ToRestart.Count > 0)
{
var resilienceOpts = next.Resilience;
// Pipelines built from the FROZEN startup Resilience snapshot, not
// next.Resilience — see _startupResilience (review ConfigAndHosting M3).
var resilienceOpts = _startupResilience ?? next.Resilience;
var backendPipeline = PolicyFactory.BuildBackendConnect(
resilienceOpts.BackendConnect,
_loggerFactory.CreateLogger("Mbproxy.Proxy.BackendConnect"));
@@ -288,17 +308,11 @@ internal sealed partial class ConfigReconciler : IDisposable
var (name, plcNew) = entry;
try
{
// Stop old supervisor.
if (_supervisors.TryRemove(name, out var old))
{
using var stopCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
stopCts.CancelAfter(TimeSpan.FromSeconds(10));
await old.StopAsync(stopCts.Token).ConfigureAwait(false);
await old.DisposeAsync().ConfigureAwait(false);
}
// Build fresh context. Pass DefaultCacheTtlMs so per-PLC default
// caching folds into the resolved tag map.
// Build the new context + supervisor FIRST, while the old supervisor
// is still in the dictionary and running. If any construction step
// throws (BcdTagMapBuilder, PolicyFactory, ctor), the catch leaves the
// old supervisor in place — a transient rebuild fault no longer drops
// the PLC from service permanently (review ConfigAndHosting M2).
var result = BcdTagMapBuilder.Build(next.BcdTags, plcNew.BcdTags, plcNew.DefaultCacheTtlMs);
var newCtx = new PerPlcContext
{
@@ -310,7 +324,6 @@ internal sealed partial class ConfigReconciler : IDisposable
Capture = _captureRegistry.GetOrCreate(plcNew.Name, result.Map),
};
// Build and start new supervisor.
var recoveryPipeline = PolicyFactory.BuildListenerRecovery(
resilienceOpts.ListenerRecovery,
_loggerFactory.CreateLogger($"Mbproxy.Proxy.ListenerRecovery.{plcNew.Name}"));
@@ -329,11 +342,22 @@ internal sealed partial class ConfigReconciler : IDisposable
_coalescingAccessor,
_keepaliveAccessor);
// Construction succeeded — now swap. Stop the old supervisor, then
// publish and start the new one.
if (_supervisors.TryRemove(name, out var old))
{
using var stopCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
stopCts.CancelAfter(TimeSpan.FromSeconds(10));
await old.StopAsync(stopCts.Token).ConfigureAwait(false);
await old.DisposeAsync().ConfigureAwait(false);
}
_supervisors[name] = newSupervisor;
await newSupervisor.StartAsync(ct).ConfigureAwait(false);
}
catch (Exception ex)
{
Interlocked.Increment(ref stepFailures);
_logger.LogError(ex, "Error restarting supervisor for PLC '{Plc}': {Message}",
name, ex.Message);
}
@@ -350,7 +374,13 @@ internal sealed partial class ConfigReconciler : IDisposable
try
{
var plcNew = next.Plcs.First(p => p.Name == name);
// ReloadPlan.Compute guarantees `name` is present in next.Plcs; the
// explicit null-guard makes that invariant a clean skip rather than an
// InvalidOperationException mislabelled as a reseat error (review N2).
var plcNew = next.Plcs.FirstOrDefault(p => p.Name == name);
if (plcNew is null)
continue;
var newCtx = new PerPlcContext
{
PlcName = name,
@@ -361,8 +391,9 @@ internal sealed partial class ConfigReconciler : IDisposable
// Any reseat (tag-map change) constructs a fresh cache. The
// supervisor disposes the old one inside ReplaceContextAsync.
Cache = BuildCacheIfNeeded(newMap, next.Cache),
// Rebuild the capture for the new tag set; GetOrCreate preserves the
// armed flag so an open detail page keeps capturing across the reload.
// Rebuild the capture for the new tag set. The rebuilt capture is
// disarmed; StatusBroadcaster re-arms it within one push cycle if the
// PLC still has a detail-page viewer.
Capture = _captureRegistry.GetOrCreate(name, newMap),
};
@@ -372,6 +403,7 @@ internal sealed partial class ConfigReconciler : IDisposable
}
catch (Exception ex)
{
Interlocked.Increment(ref stepFailures);
_logger.LogError(ex, "Error reseating context for PLC '{Plc}': {Message}",
name, ex.Message);
}
@@ -380,7 +412,8 @@ internal sealed partial class ConfigReconciler : IDisposable
// ── 6. Apply: Add new PLCs ────────────────────────────────────────────
if (plan.ToAdd.Count > 0)
{
var resilienceOpts = next.Resilience;
// Frozen startup Resilience snapshot, not next.Resilience (M3).
var resilienceOpts = _startupResilience ?? next.Resilience;
var backendPipeline = PolicyFactory.BuildBackendConnect(
resilienceOpts.BackendConnect,
_loggerFactory.CreateLogger("Mbproxy.Proxy.BackendConnect"));
@@ -425,6 +458,7 @@ internal sealed partial class ConfigReconciler : IDisposable
}
catch (Exception ex)
{
Interlocked.Increment(ref stepFailures);
_logger.LogError(ex, "Error adding supervisor for PLC '{Plc}': {Message}",
plcNew.Name, ex.Message);
}
@@ -438,6 +472,12 @@ internal sealed partial class ConfigReconciler : IDisposable
var appliedAt = DateTimeOffset.UtcNow;
_serviceCounters.RecordReloadApplied(appliedAt);
if (stepFailures > 0)
_logger.LogWarning(
"Config reload applied with {Failures} step failure(s) — see the preceding " +
"error log lines; the affected PLC(s) may not be in their intended state.",
stepFailures);
LogReloadApplied(_logger, plcsAdded, plcsRemoved, plcsRestarted, plcsReseated, globalTagDelta);
return true;
@@ -461,9 +501,18 @@ internal sealed partial class ConfigReconciler : IDisposable
private static int ComputeGlobalTagDelta(BcdTagListOptions before, BcdTagListOptions after)
{
// Count entries in before but not in after (removed), plus entries in after
// but not in before (added), plus entries with the same address but different width.
var beforeDict = before.Global.ToDictionary(t => t.Address);
var afterDict = after.Global.ToDictionary(t => t.Address);
// but not in before (added), plus entries with the same address but different
// width. Last-wins indexing (not ToDictionary, which throws on a duplicate
// address) keeps this purely-cosmetic delta from ever aborting a reload — a
// duplicate-address config is already rejected by ReloadValidator (review N3).
static Dictionary<ushort, BcdTagOptions> Index(BcdTagListOptions o)
{
var d = new Dictionary<ushort, BcdTagOptions>(o.Global.Count);
foreach (var t in o.Global) d[t.Address] = t;
return d;
}
var beforeDict = Index(before);
var afterDict = Index(after);
int delta = 0;
foreach (var addr in beforeDict.Keys.Union(afterDict.Keys).Distinct())
@@ -489,6 +538,12 @@ internal sealed partial class ConfigReconciler : IDisposable
try
{
// Hard 2 s cap on the synchronous join. The debounce loop observes
// cancellation promptly when idle; if an apply is mid-flight its per-step
// CTS are linked to _disposalCts and also cancel. The cap exists only to
// bound host-shutdown latency if a supervisor StopAsync inside an apply
// hangs — in that case the loop task is abandoned (left running) rather
// than blocking shutdown indefinitely.
_debounceLoop.Wait(TimeSpan.FromSeconds(2));
}
catch
@@ -61,15 +61,27 @@ internal static class ReloadValidator
}
}
// ── 3. AdminPort range and collision ─────────────────────────────────
int adminPort = next.AdminPort;
if (adminPort is < 1 or > 65535)
// ── 2b. Per-PLC backend Host / Port ───────────────────────────────────
// A missing Host surfaces only as a runtime connect fault; an out-of-range Port
// is never reachable. Reject both at the config gate (review ConfigAndHosting M4).
foreach (var plc in next.Plcs)
{
errs.Add($"AdminPort {adminPort} is out of range [1, 65535].");
if (string.IsNullOrWhiteSpace(plc.Host))
errs.Add($"Plc '{plc.Name}': Host must be non-empty.");
if (plc.Port is < 1 or > 65535)
errs.Add($"Plc '{plc.Name}': Port {plc.Port} is out of range [1, 65535].");
}
else if (seenPorts.TryGetValue(adminPort, out string? clashPlc))
// ── 3. AdminPort range and collision ─────────────────────────────────
// AdminPort 0 is valid: it disables the admin endpoint entirely (review
// TestsAndConfig M1). The range and collision checks apply only to a real port.
int adminPort = next.AdminPort;
if (adminPort != 0)
{
errs.Add($"AdminPort {adminPort} collides with ListenPort of PLC '{clashPlc}'.");
if (adminPort is < 1 or > 65535)
errs.Add($"AdminPort {adminPort} is out of range [1, 65535] (or 0 to disable).");
else if (seenPorts.TryGetValue(adminPort, out string? clashPlc))
errs.Add($"AdminPort {adminPort} collides with ListenPort of PLC '{clashPlc}'.");
}
if (next.AdminPushIntervalMs <= 0 || next.AdminPushIntervalMs > 60_000)
@@ -130,6 +142,8 @@ internal static class ReloadValidator
}
if (next.Cache.MaxEntriesPerPlc < 0)
errs.Add($"Cache.MaxEntriesPerPlc must be >= 0; got {next.Cache.MaxEntriesPerPlc}.");
else if (next.Cache.MaxEntriesPerPlc > 100_000)
errs.Add($"Cache.MaxEntriesPerPlc must be <= 100000; got {next.Cache.MaxEntriesPerPlc}.");
if (next.Cache.EvictionIntervalMs < 0)
errs.Add($"Cache.EvictionIntervalMs must be >= 0; got {next.Cache.EvictionIntervalMs}.");
@@ -149,22 +163,55 @@ internal static class ReloadValidator
// Schema bounds are also checked in MbproxyOptionsValidator; re-checking here keeps
// the hot-reload gate self-contained. The cross-field rule (heartbeat interval must
// sit above the request timeout, or it would fire continuously) lives only here.
//
// The tunable checks are gated on Keepalive.Enabled: when keepalive is off the
// probe interval / idle threshold / cross-field relationship are all inert
// (SocketKeepalive.Apply and the heartbeat loop both no-op), so a disabled-
// keepalive config must not be rejected for the values of knobs it never reads.
var ka = next.Connection.Keepalive;
if (ka.TcpIdleTimeMs <= 0)
errs.Add($"Connection.Keepalive.TcpIdleTimeMs must be > 0; got {ka.TcpIdleTimeMs}.");
if (ka.TcpProbeIntervalMs <= 0)
errs.Add($"Connection.Keepalive.TcpProbeIntervalMs must be > 0; got {ka.TcpProbeIntervalMs}.");
if (ka.TcpProbeCount <= 0)
errs.Add($"Connection.Keepalive.TcpProbeCount must be > 0; got {ka.TcpProbeCount}.");
if (ka.BackendHeartbeatProbeAddress is < 0 or > 65535)
if (ka.Enabled)
{
if (ka.TcpIdleTimeMs <= 0)
errs.Add($"Connection.Keepalive.TcpIdleTimeMs must be > 0; got {ka.TcpIdleTimeMs}.");
if (ka.TcpProbeIntervalMs <= 0)
errs.Add($"Connection.Keepalive.TcpProbeIntervalMs must be > 0; got {ka.TcpProbeIntervalMs}.");
if (ka.TcpProbeCount <= 0)
errs.Add($"Connection.Keepalive.TcpProbeCount must be > 0; got {ka.TcpProbeCount}.");
if (ka.BackendHeartbeatProbeAddress is < 0 or > 65535)
errs.Add(
$"Connection.Keepalive.BackendHeartbeatProbeAddress must be in [0, 65535]; " +
$"got {ka.BackendHeartbeatProbeAddress}.");
if (ka.BackendHeartbeatIdleMs <= next.Connection.BackendRequestTimeoutMs)
errs.Add(
$"Connection.Keepalive.BackendHeartbeatIdleMs ({ka.BackendHeartbeatIdleMs}) must be greater " +
$"than Connection.BackendRequestTimeoutMs ({next.Connection.BackendRequestTimeoutMs}); " +
"a heartbeat interval at or below the request timeout would fire continuously.");
}
// ── 7. Resilience profiles ────────────────────────────────────────────
// PolicyFactory clamps these defensively at runtime, but a config whose
// BackoffMs is shorter than MaxAttempts-1, or carries negative delays, does not
// behave as the operator intends — reject it at the gate (review M5).
var bc = next.Resilience.BackendConnect;
if (bc.MaxAttempts < 1)
errs.Add($"Resilience.BackendConnect.MaxAttempts must be >= 1; got {bc.MaxAttempts}.");
if (bc.BackoffMs.Count < bc.MaxAttempts - 1)
errs.Add(
$"Connection.Keepalive.BackendHeartbeatProbeAddress must be in [0, 65535]; " +
$"got {ka.BackendHeartbeatProbeAddress}.");
if (ka.BackendHeartbeatIdleMs <= next.Connection.BackendRequestTimeoutMs)
$"Resilience.BackendConnect.BackoffMs must have at least MaxAttempts-1 " +
$"({Math.Max(0, bc.MaxAttempts - 1)}) entries; got {bc.BackoffMs.Count}.");
if (bc.BackoffMs.Any(ms => ms < 0))
errs.Add("Resilience.BackendConnect.BackoffMs entries must all be >= 0.");
var lr = next.Resilience.ListenerRecovery;
if (lr.InitialBackoffMs.Any(ms => ms < 0))
errs.Add("Resilience.ListenerRecovery.InitialBackoffMs entries must all be >= 0.");
if (lr.SteadyStateMs <= 0)
errs.Add($"Resilience.ListenerRecovery.SteadyStateMs must be > 0; got {lr.SteadyStateMs}.");
if (next.Resilience.ReadCoalescing.MaxParties < 1)
errs.Add(
$"Connection.Keepalive.BackendHeartbeatIdleMs ({ka.BackendHeartbeatIdleMs}) must be greater " +
$"than Connection.BackendRequestTimeoutMs ({next.Connection.BackendRequestTimeoutMs}); " +
"a heartbeat interval at or below the request timeout would fire continuously.");
$"Resilience.ReadCoalescing.MaxParties must be >= 1; got " +
$"{next.Resilience.ReadCoalescing.MaxParties}.");
errors = errs;
return errs.Count == 0;
@@ -59,6 +59,11 @@ internal sealed class EventLogBridge : ILogEventSink
// pick up; in practice install.ps1 registers it once at install.
private readonly bool _sourceExists;
// One-shot guard: 0 until the first Event Log write fails, then 1. Lets Emit drop a
// single SelfLog breadcrumb (parity with SyslogBridge's degradation breadcrumb)
// without flooding SelfLog on every subsequent Error+ event.
private int _writeFailureReported;
public EventLogBridge(bool enabled)
{
_enabled = enabled;
@@ -102,10 +107,14 @@ internal sealed class EventLogBridge : ILogEventSink
{
EventLog.WriteEntry(Source, message, type);
}
catch
catch (Exception ex)
{
// Swallow: if the Event Log write fails (e.g., source not registered,
// quota exceeded) we must not crash the application or recurse.
// Swallow: if the Event Log write fails (e.g., source removed after start,
// quota exceeded) we must not crash the application or recurse. Leave one
// SelfLog breadcrumb the first time so the degradation is diagnosable.
if (Interlocked.Exchange(ref _writeFailureReported, 1) == 0)
Serilog.Debugging.SelfLog.WriteLine(
"EventLogBridge: Event Log write failed (further failures suppressed): {0}", ex);
}
}
}
@@ -97,9 +97,14 @@ public sealed class MbproxyOptionsValidator : IValidateOptions<MbproxyOptions>
}
}
// Cache section ranges.
// Cache section ranges. MaxEntriesPerPlc has a hard ceiling because the cache's
// LRU eviction is an O(n) scan under a lock — a fat-fingered seven-figure value
// would stall the backend reader on every cache-miss store.
if (options.Cache.MaxEntriesPerPlc < 0)
errors.Add($"Cache.MaxEntriesPerPlc must be >= 0; got {options.Cache.MaxEntriesPerPlc}.");
else if (options.Cache.MaxEntriesPerPlc > 100_000)
errors.Add(
$"Cache.MaxEntriesPerPlc must be <= 100000; got {options.Cache.MaxEntriesPerPlc}.");
if (options.Cache.EvictionIntervalMs < 0)
errors.Add($"Cache.EvictionIntervalMs must be >= 0; got {options.Cache.EvictionIntervalMs}.");
+13 -8
View File
@@ -157,14 +157,15 @@ internal sealed class BcdPduPipeline : IPduPipeline
ushort startAddress = (ushort)((pdu[1] << 8) | pdu[2]);
ushort qty = (ushort)((pdu[3] << 8) | pdu[4]);
byte byteCount = pdu[5];
// Validate the request is fully sized for `qty` registers (each 2 bytes after
// the byteCount byte). A client claiming qty=10 with only 4 bytes of register
// data would otherwise have its BCD slots silently skipped by the per-slot
// bounds check below — half the request rewritten, half not. Returning here
// passes the malformed PDU through unchanged so the PLC's own validator
// surfaces the protocol error.
if (pdu.Length < 6 + qty * 2)
// The FC16 PDU is self-describing in three places that must agree: qty (register
// count), byteCount, and the actual data tail. Validate all three before touching
// any register data. A request whose byteCount disagrees with qty, or whose tail
// is short, is passed through unchanged so the PLC's own validator surfaces the
// protocol error — the rewriter must not partially mutate bytes the client never
// framed as register data (review ProxyAndBcd M1).
if (byteCount != qty * 2 || pdu.Length < 6 + byteCount)
return;
if (!ctx.TagMap.TryGetForRange(startAddress, qty, out var hits))
@@ -364,8 +365,12 @@ internal sealed class BcdPduPipeline : IPduPipeline
if (!lowInRange || !highInRange)
{
// Log effectiveQty — the range actually used for the in-range test —
// not the raw request qty. A short backend response clamps
// effectiveQty below qty; logging qty would mis-report a truncated
// response as a client/config straddle (review ProxyAndBcd N5).
RewriterLogEvents.PartialBcd(ctx.Logger, ctx.PlcName,
tag.Address, startAddress, qty);
tag.Address, startAddress, effectiveQty);
ctx.Counters.IncrementPartialBcd();
continue;
}
@@ -334,10 +334,21 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
return false;
}
// Successful connect. Wire up the backend tasks.
var cts2 = CancellationTokenSource.CreateLinkedTokenSource(_disposeCts.Token);
// Successful connect. Re-check disposal under the lock before publishing the
// socket: a DisposeAsync that ran during the (possibly multi-second) connect
// would otherwise strand this live socket and its three tasks in a disposed
// multiplexer, leaking an ECOM client slot forever (review M1). We hold
// _backendLock here, and DisposeAsync's TearDownBackendAsync also takes it
// before _disposeCts is disposed — so CreateLinkedTokenSource is safe inside.
lock (_backendLock)
{
if (_disposed || _disposeCts.IsCancellationRequested)
{
backend.Dispose();
return false;
}
var cts2 = CancellationTokenSource.CreateLinkedTokenSource(_disposeCts.Token);
_backendSocket = backend;
_backendCts = cts2;
// Seed the idle timer so the heartbeat loop measures idleness from connect.
@@ -445,9 +456,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
int upstreamCount = 0;
if (cascadeUpstreams)
{
// Close every attached pipe that had a request in flight; the others will
// simply re-issue on next request through a fresh backend connect.
// Per docs/Architecture/ConnectionModel.md, ALL attached upstreams cascade on backend disconnect.
// Per docs/Architecture/ConnectionModel.md, ALL attached upstream pipes
// cascade-close on a backend disconnect — not just those with a request
// in flight. Clients re-issue on their next request through a fresh
// backend connect.
upstreamCount = _pipes.Count;
// Snapshot keys before disposal modifies the dictionary indirectly.
@@ -499,6 +511,23 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
}
}
/// <summary>
/// Fire-and-forget a backend teardown from a failure-detection path (writer/reader
/// fault, reader EOF, malformed response). Attaches a faulted continuation so a throw
/// inside <see cref="TearDownBackendAsync"/> is logged rather than surfacing as an
/// unobserved <see cref="TaskScheduler.UnobservedTaskException"/>.
/// </summary>
private void FireTeardown(string reason)
{
_ = TearDownBackendAsync(reason, cascadeUpstreams: true)
.ContinueWith(
t => _logger.LogError(t.Exception,
"Backend teardown faulted: Plc={Plc} Reason={Reason}", _plc.Name, reason),
CancellationToken.None,
TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
TaskScheduler.Default);
}
// ── Backend writer / reader tasks ─────────────────────────────────────────
private async Task RunBackendWriterAsync(Socket backend, CancellationToken ct)
@@ -533,7 +562,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// race against it, hitting a disposed _connectGate and producing an
// unobserved-task exception.
if (!_disposeCts.IsCancellationRequested)
_ = TearDownBackendAsync($"writer fault: {ex.Message}", cascadeUpstreams: true);
FireTeardown($"writer fault: {ex.Message}");
}
}
@@ -595,10 +624,34 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
if (inFlight.IsHeartbeat)
continue;
// Validate an FC03/FC04 success response's shape against the request
// before it is rewritten, cached, or fanned out. A backend that frames a
// response wrong (truncated body, bad MBAP Length) would otherwise desync
// the stream indefinitely or get a malformed payload cached and replayed
// for the whole TTL. FC03/FC04 success PDU = [fc][byteCount][data] with
// byteCount == 2*qty; on any mismatch force a teardown (the stream can no
// longer be trusted) rather than fan out a wrong-shaped frame.
if (inFlight.Fc is 0x03 or 0x04 && (frame[MbapFrame.HeaderSize] & 0x80) == 0)
{
int expectedBody = 2 + 2 * inFlight.Qty;
byte byteCount = pduBodyLen >= 2 ? frame[MbapFrame.HeaderSize + 1] : (byte)0;
if (pduBodyLen != expectedBody || byteCount != 2 * inFlight.Qty)
{
_logger.LogWarning(
"Malformed FC{Fc:X2} backend response for Plc={Plc}: pduBody={Body} expected={Expected} byteCount={ByteCount} — forcing teardown",
inFlight.Fc, _plc.Name, pduBodyLen, expectedBody, byteCount);
if (!_disposeCts.IsCancellationRequested)
FireTeardown("malformed backend response");
return;
}
}
// For FC03/FC04 reads, also clear the coalescing-by-key entry so a
// brand-new identical request issued AFTER this response is treated as a
// miss (opens a fresh round-trip). The TryRemove is best-effort: a
// watchdog timeout or cascade may have already removed it.
// watchdog timeout or cascade may have already removed it. This branch is
// heartbeat-free by construction — the IsHeartbeat check above already
// `continue`d — so the CoalescingKey rebuilt here is always a real read.
if (inFlight.Fc is 0x03 or 0x04)
{
var coalKey = new CoalescingKey(inFlight.UnitId, inFlight.Fc,
@@ -753,7 +806,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// Reader exited cleanly — backend closed by remote. Cascade. Skip if
// dispose is already in progress (see writer-side comment above).
if (!_disposeCts.IsCancellationRequested)
_ = TearDownBackendAsync("backend reader EOF", cascadeUpstreams: true);
FireTeardown("backend reader EOF");
}
catch (OperationCanceledException)
{
@@ -762,7 +815,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
catch (Exception ex)
{
if (!_disposeCts.IsCancellationRequested)
_ = TearDownBackendAsync($"reader fault: {ex.Message}", cascadeUpstreams: true);
FireTeardown($"reader fault: {ex.Message}");
}
}
@@ -809,11 +862,35 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
startAddr = (ushort)((frame[pduOffset + 1] << 8) | frame[pduOffset + 2]);
qty = 1;
}
else if (fcByte == 0x10 && frame.Length >= pduOffset + 5)
else if (fcByte == 0x10 && frame.Length >= pduOffset + 6)
{
// FC16 = Write Multiple Registers. PDU: [fc=10][startHi][startLo][qtyHi][qtyLo][byteCount]...
// Only trust the claimed qty for cache invalidation when the FC16 frame is
// internally consistent (byteCount == 2*qty and the register payload is fully
// present). A malformed FC16 is still forwarded — the PLC rejects it with
// exception 03 — but it must not drive cache invalidation off a span the
// client merely claimed (review Multiplexing M4).
startAddr = (ushort)((frame[pduOffset + 1] << 8) | frame[pduOffset + 2]);
qty = (ushort)((frame[pduOffset + 3] << 8) | frame[pduOffset + 4]);
ushort claimedQty = (ushort)((frame[pduOffset + 3] << 8) | frame[pduOffset + 4]);
byte byteCount = frame[pduOffset + 5];
if (byteCount == claimedQty * 2 && frame.Length >= pduOffset + 6 + byteCount)
qty = claimedQty;
}
// Write-request-side cache invalidation. Drop overlapping cached FC03/FC04 entries
// the moment an FC06/FC16 is forwarded — not only when its response lands — so a
// concurrent read in the in-flight window, or a write whose response never
// arrives, cannot serve a value the write changed (review Multiplexing C1). The
// response-side invalidation in RunBackendReaderAsync remains as a backstop for
// entries stored during the in-flight window.
if (fcByte is 0x06 or 0x10 && qty > 0 && _ctx.Cache is { } writeCache)
{
int invalidated = writeCache.Invalidate(unitId, startAddr, qty);
if (invalidated > 0)
{
_ctx.Counters.AddCacheInvalidations(invalidated);
CacheLogEvents.Invalidated(_logger, _plc.Name, unitId, startAddr, qty, invalidated);
}
}
// Response-cache path. Cache check happens BEFORE coalescing AND before we
@@ -1119,6 +1196,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
{
await Task.Delay(tickMs, ct).ConfigureAwait(false);
// Skip the snapshot allocation entirely on an idle PLC (the common case
// across a 54-PLC fleet) — no in-flight requests, nothing to time out.
if (_correlation.Count == 0) continue;
var threshold = DateTimeOffset.UtcNow.AddMilliseconds(-_connectionOptions.BackendRequestTimeoutMs);
var stale = _correlation.SnapshotOlderThan(threshold);
if (stale.Count == 0) continue;
@@ -1162,7 +1243,14 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
long elapsedMs = (long)(DateTimeOffset.UtcNow - req.SentAtUtc).TotalMilliseconds;
foreach (var party in req.InterestedParties)
// Snapshot the party list before fan-out. For a coalesced FC03/FC04 a
// late attach could have appended to the live List between the
// CorrelationMap claim above and the _inFlightByKey removal; iterating
// the live List while it mutates throws InvalidOperationException,
// which the outer catch would turn into permanent watchdog death for
// this PLC (review Multiplexing M5). The _inFlightByKey.TryRemove above
// takes the map lock, so once it returns the list is stable to copy.
foreach (var party in req.InterestedParties.ToArray())
{
MultiplexerLogEvents.RequestTimeout(
_logger, _plc.Name, proxyTxId, party.OriginalTxId, req.Fc, elapsedMs);
@@ -32,6 +32,7 @@ internal sealed class TxIdAllocator
private ushort _next; // rolling cursor; 0 on construction
private int _inFlightCount; // 0..65536
private long _wrapCount; // monotonic; never resets
private long _doubleReleaseCount; // monotonic; Release called on an already-free slot
/// <summary>
/// Number of currently-in-flight proxy TxIds (i.e., allocated but not yet released).
@@ -56,6 +57,14 @@ internal sealed class TxIdAllocator
/// </summary>
public long WrapCount => Interlocked.Read(ref _wrapCount);
/// <summary>
/// Number of times <see cref="Release"/> was called on a slot that was already free.
/// A double-release is normally a benign cascade-vs-timeout race, but a sustained
/// non-zero rate points at the documented <c>TearDownBackendAsync</c> gate-not-held
/// race actually firing — making the otherwise-silent request drop observable.
/// </summary>
public long DoubleReleaseCount => Interlocked.Read(ref _doubleReleaseCount);
/// <summary>
/// Attempts to allocate the next free proxy TxId.
/// Returns <c>true</c> with <paramref name="id"/> set when an ID was allocated.
@@ -125,6 +134,12 @@ internal sealed class TxIdAllocator
_inUse[id] = false;
_inFlightCount--;
}
else
{
// Double-release: the slot was already free. Harmless to the allocator
// (idempotent) but tracked so the rare cascade-vs-timeout race is visible.
Interlocked.Increment(ref _doubleReleaseCount);
}
}
}
@@ -130,14 +130,16 @@ internal sealed partial class UpstreamPipe : IAsyncDisposable
out _, out _, out ushort length, out _))
return;
if (length < 1)
if (length < 2)
{
// Length field claims no body — forward the header alone via a fresh buffer.
byte[] degenerate = new byte[MbapFrame.HeaderSize];
Buffer.BlockCopy(headerBuf, 0, degenerate, 0, MbapFrame.HeaderSize);
await onFrame(degenerate, token).ConfigureAwait(false);
Interlocked.Increment(ref _pdusForwardedCount);
continue;
// A valid MBAP Length covers at least UnitId(1) + FC(1) = 2 bytes. A
// frame claiming less is malformed Modbus — there is no FC to route on
// and no PDU to forward. Close the upstream rather than allocate a
// proxy TxId and push a 7-byte garbage frame at the backend (review N1).
_logger.LogWarning(
"Malformed upstream frame: Plc={Plc} MbapLength={Length} < 2 — closing pipe",
_plcName, length);
return;
}
int pduBodyLen = length - 1;
+14 -10
View File
@@ -142,22 +142,30 @@ internal sealed partial class PlcListener : IAsyncDisposable
finally
{
await pipe.DisposeAsync().ConfigureAwait(false);
// Self-evict from the task table in the same finally that disposes
// the pipe — one scheduling primitive instead of a separate
// fire-and-forget ContinueWith.
_pipeTasks.TryRemove(pipe.Id, out _);
}
}, CancellationToken.None);
_pipeTasks[pipe.Id] = pipeTask;
_ = pipeTask.ContinueWith(prev => _pipeTasks.TryRemove(pipe.Id, out _), TaskScheduler.Default);
// If the pipe task already ran to completion (and its finally's TryRemove
// fired before this add), evict the now-stale completed entry.
if (pipeTask.IsCompleted)
_pipeTasks.TryRemove(pipe.Id, out _);
}
}
catch (OperationCanceledException)
{
// Normal shutdown.
}
catch (Exception ex)
{
// Listener faulted — log and return. The supervisor will restart.
LogListenerFaulted(_listenerLogger, _plc.Name, _plc.ListenPort, ex.Message);
}
// Any other exception (an accept-loop fault) propagates to the caller. The
// PlcListenerSupervisor's run handler catches it, logs mbproxy.listener.faulted
// (EventId 43, Warning, WITH the exception + stack trace), and drives Polly
// recovery. RunAsync must not swallow the fault here — doing so made that
// supervised fault path unreachable and downgraded faults to a generic
// "ended unexpectedly" with no stack trace (review ProxyAndBcd M2).
}
// ── IAsyncDisposable ──────────────────────────────────────────────────────────────────
@@ -197,8 +205,4 @@ internal sealed partial class PlcListener : IAsyncDisposable
[LoggerMessage(EventId = 20, EventName = "mbproxy.startup.bind",
Level = LogLevel.Information, Message = "Listener bound: Plc={Plc} Port={Port}")]
private static partial void LogBound(ILogger logger, string plc, int port);
[LoggerMessage(EventId = 22, EventName = "mbproxy.listener.faulted",
Level = LogLevel.Error, Message = "Listener faulted: Plc={Plc} Port={Port} Reason={Reason}")]
private static partial void LogListenerFaulted(ILogger logger, string plc, int port, string reason);
}
+1 -1
View File
@@ -410,7 +410,7 @@ internal sealed class ProxyCounters
// Convert ticks to microseconds.
double sampleMs = (double)elapsedTicks / System.Diagnostics.Stopwatch.Frequency * 1000.0;
// Fixed-point: store microseconds * 1000 (i.e. nanoseconds) as long for CAS.
// Fixed-point: store milliseconds * 1000 (i.e. microseconds) as long for CAS.
// This gives ~1 µs resolution which is fine for Modbus round-trips (1100 ms range).
long sampleFixed = (long)(sampleMs * 1000.0);
+20 -3
View File
@@ -93,6 +93,23 @@ internal sealed partial class ProxyWorker : BackgroundService
var opts = _options.CurrentValue;
int plcsConfigured = opts.Plcs.Count;
// ── 0. Fail-fast on configuration errors ─────────────────────────────────────
// ReloadValidator is the single config gate — it checks cross-PLC rules
// (duplicate listen ports, AdminPort collisions, duplicate PLC names, backend
// Host/Port, the keepalive cross-field rule, Resilience profiles) that
// MbproxyOptionsValidator / .ValidateOnStart() does not. It previously ran only
// on hot-reload; running it here too means a config mistake aborts startup with
// a clear error instead of silently producing a half-working fleet (review
// ConfigAndHosting M1/M4/M5). The host's default BackgroundService exception
// behaviour (StopHost) turns the throw into a non-zero exit.
if (!ReloadValidator.Validate(opts, out var configErrors))
{
LogConfigRejectedAtStartup(_logger, string.Join("; ", configErrors));
throw new InvalidOperationException(
"mbproxy configuration is invalid; service will not start. " +
string.Join("; ", configErrors));
}
// ── 1. Build per-PLC BCD tag maps ────────────────────────────────────────────
var plcContexts = new Dictionary<string, PerPlcContext>(opts.Plcs.Count, StringComparer.Ordinal);
@@ -379,10 +396,10 @@ internal sealed partial class ProxyWorker : BackgroundService
Message = "mbproxy service ready — ListenersBound={ListenersBound} PlcsConfigured={PlcsConfigured}")]
private static partial void LogStartupReady(ILogger logger, int listenersBound, int plcsConfigured);
[LoggerMessage(EventId = 21, EventName = "mbproxy.startup.bind.failed",
[LoggerMessage(EventId = 2, EventName = "mbproxy.startup.config.rejected",
Level = LogLevel.Error,
Message = "Failed to bind listener: Plc={Plc} Port={Port} Reason={Reason}")]
private static partial void LogBindFailed(ILogger logger, string plc, int port, string reason);
Message = "Startup configuration rejected — service will not start: {Errors}")]
private static partial void LogConfigRejectedAtStartup(ILogger logger, string errors);
[LoggerMessage(EventId = 80, EventName = "mbproxy.shutdown.complete",
Level = LogLevel.Information,
@@ -45,5 +45,12 @@ internal static class SocketKeepalive
{
// Socket closed concurrently — nothing to do.
}
catch (NotSupportedException)
{
// Some platforms/runtimes throw NotSupportedException (or its derived
// PlatformNotSupportedException) rather than SocketException for an
// unrecognised TcpKeepAlive* option. Swallow it too — the "must never abort a
// connection" contract above applies regardless of the exception type.
}
}
}
@@ -115,6 +115,13 @@ internal sealed class TagValueCapture
/// <summary>
/// Records the last value of the BCD tag at <paramref name="address"/>. No-op when
/// disarmed or when <paramref name="address"/> is not a tracked tag.
///
/// <para>Concurrency guarantee: a disarmed capture never <i>retains</i> a stale
/// observation (the post-write re-check below undoes a write that raced a
/// <see cref="Disarm"/>). It is <b>not</b> guaranteed that every observation recorded
/// while armed survives — two threads recording the same address concurrent with a
/// near-simultaneous disarm can drop one update. That is a lost update on a debug
/// view, not stale data, and the slot self-heals on the next traffic.</para>
/// </summary>
/// <param name="rawLow">BCD-encoded low word as it sits on the PLC wire.</param>
/// <param name="rawHigh">BCD-encoded high word (0 for a 16-bit tag).</param>
@@ -14,9 +14,16 @@ namespace Mbproxy.Tests.Admin;
[Trait("Category", "Unit")]
public sealed class DebugDtoSerializationTests
{
// The exact policy AdminEndpointHost configures on the hub's PayloadSerializerOptions.
private static readonly JsonSerializerOptions HubOptions =
new() { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };
// The exact configuration AdminEndpointHost applies to the hub's
// PayloadSerializerOptions — referenced, not copied, so the two cannot drift.
private static readonly JsonSerializerOptions HubOptions = BuildHubOptions();
private static JsonSerializerOptions BuildHubOptions()
{
var o = new JsonSerializerOptions();
AdminEndpointHost.ConfigureHubPayloadJson(o);
return o;
}
[Fact]
public void PlcDetailResponse_SerializesWithCamelCaseFieldNames()
@@ -39,6 +39,23 @@ public sealed class PlcSubscriptionTrackerTests
t.ActivePlcs().ShouldBeEmpty();
}
[Fact]
public void SameTab_TwoConnections_RemovedNewestFirst_StaysActiveUntilLast()
{
// Mirror of SameTab_TwoConnections_StaysActiveUntilLastConnectionGone: the
// reconnect's NEW connection is the one that drops first (the order is not
// guaranteed). The tab must still be alive on the surviving old connection.
var t = new PlcSubscriptionTracker();
t.SubscribePlc("c-old", "tab", "plc");
t.SubscribePlc("c-new", "tab", "plc");
t.RemoveConnection("c-new");
t.ActivePlcs().ShouldContain("plc", "the tab still holds the old connection");
t.RemoveConnection("c-old");
t.ActivePlcs().ShouldBeEmpty();
}
[Fact]
public void DistinctTabs_AreCountedSeparately()
{
@@ -24,11 +24,10 @@ internal sealed class FakeHubCallerContext : HubCallerContext
public override void Abort() { }
}
/// <summary>Records every group join/leave so tests can assert membership changes.</summary>
/// <summary>Records every group join so tests can assert membership changes.</summary>
internal sealed class FakeGroupManager : IGroupManager
{
public List<(string ConnectionId, string Group)> Added { get; } = [];
public List<(string ConnectionId, string Group)> Removed { get; } = [];
public Task AddToGroupAsync(string connectionId, string groupName, CancellationToken cancellationToken = default)
{
@@ -36,11 +35,10 @@ internal sealed class FakeGroupManager : IGroupManager
return Task.CompletedTask;
}
// StatusHub.OnDisconnectedAsync never calls RemoveFromGroupAsync — SignalR removes a
// disconnected connection from its groups implicitly. Nothing to record here.
public Task RemoveFromGroupAsync(string connectionId, string groupName, CancellationToken cancellationToken = default)
{
Removed.Add((connectionId, groupName));
return Task.CompletedTask;
}
=> Task.CompletedTask;
}
/// <summary>Records every push so <see cref="StatusBroadcaster"/> tests can assert routing.</summary>
@@ -183,9 +183,10 @@ public sealed class StatusBroadcasterTests
h.Sink.FleetPushes.Count.ShouldBeGreaterThanOrEqualTo(3,
"the background loop must push the fleet snapshot every interval");
// StopAsync awaits the loop task before returning, so the loop is guaranteed
// terminated here — no settling delay is needed for the assertion to be sound.
await h.Broadcaster.StopAsync();
int afterStop = h.Sink.FleetPushes.Count;
await Task.Delay(400, TestContext.Current.CancellationToken);
h.Sink.FleetPushes.Count.ShouldBe(afterStop, "no pushes may occur after StopAsync");
}
}
@@ -68,6 +68,28 @@ public sealed class StatusHubTests
tracker.ActivePlcs().ShouldBeEmpty("no viewer may be stranded after the tab closes");
}
[Fact]
public async Task Reconnect_SameTab_NewConnectionDisconnectsFirst_DoesNotLeakViewer()
{
// Mirror of the reconnect test above with the disconnect ordering reversed: the
// NEW connection's OnDisconnectedAsync arrives before the old one's. SignalR does
// not guarantee the order, so the tracker must be correct either way.
var tracker = new PlcSubscriptionTracker();
var first = MakeHub("conn-old", tracker, out _);
await first.SubscribePlc("plc-1", "tab-A");
var second = MakeHub("conn-new", tracker, out _);
await second.SubscribePlc("plc-1", "tab-A");
await second.OnDisconnectedAsync(null); // the new connection drops first
tracker.ActivePlcs().ShouldContain("plc-1",
"the tab is still open on the original connection");
await first.OnDisconnectedAsync(null);
tracker.ActivePlcs().ShouldBeEmpty("no viewer may be stranded after the tab closes");
}
[Fact]
public async Task TwoTabs_FirstCloseKeepsActive_LastCloseClears()
{
@@ -105,10 +105,12 @@ internal static class TestHostBuilderExtensions
this HostApplicationBuilder builder,
Serilog.ILogger serilogLogger)
{
// Minimal in-memory config so AddMbproxyOptions doesn't fail.
// Minimal in-memory config so AddMbproxyOptions doesn't fail. AdminPort 0
// disables the admin endpoint — the smoke tests do not exercise it, and a fixed
// port would collide under parallel test execution.
builder.Configuration.AddInMemoryCollection(new Dictionary<string, string?>
{
["Mbproxy:AdminPort"] = "8080",
["Mbproxy:AdminPort"] = "0",
});
builder.Services.AddSerilog(serilogLogger, dispose: false);
@@ -342,6 +342,9 @@ public sealed class MultiplexerE2ETests
["Mbproxy:Connection:BackendConnectTimeoutMs"] = "3000",
// Long request timeout so the watchdog doesn't fire during the test's wait window.
["Mbproxy:Connection:BackendRequestTimeoutMs"] = "30000",
// This test exercises backend disconnect, not keepalive — disable keepalive so
// the 30 s request timeout above doesn't trip the heartbeat cross-field rule.
["Mbproxy:Connection:Keepalive:Enabled"] = "false",
// Aggressive backend retry so the second connect happens fast.
["Mbproxy:Resilience:BackendConnect:MaxAttempts"] = "5",
["Mbproxy:Resilience:BackendConnect:BackoffMs:0"] = "50",
@@ -458,8 +461,11 @@ public sealed class MultiplexerE2ETests
var config = MakeBaseConfig(proxyPort);
config["Mbproxy:AdminPort"] = adminPort.ToString();
// Short idle window so the heartbeat fires several times within the test budget.
// BackendRequestTimeoutMs is lowered below the 700 ms idle window so the
// heartbeat cross-field rule (idle > request timeout) holds.
config["Mbproxy:Connection:Keepalive:Enabled"] = "true";
config["Mbproxy:Connection:Keepalive:BackendHeartbeatIdleMs"] = "700";
config["Mbproxy:Connection:BackendRequestTimeoutMs"] = "500";
var host = BuildBcdHost(config);
using var startCts = new CancellationTokenSource(TimeSpan.FromSeconds(3));
@@ -64,7 +64,9 @@ public sealed class ProxyForwardingTests
var config = new Dictionary<string, string?>
{
["Mbproxy:AdminPort"] = "8080",
// 0 disables the admin endpoint — this test does not exercise it, and a
// fixed port would collide under parallel test execution.
["Mbproxy:AdminPort"] = "0",
[$"Mbproxy:Plcs:0:Name"] = "TestPLC",
[$"Mbproxy:Plcs:0:ListenPort"] = proxyPort.ToString(),
[$"Mbproxy:Plcs:0:Host"] = _sim.Host,
@@ -239,7 +241,9 @@ public sealed class ProxyForwardingTests
var config = new Dictionary<string, string?>
{
["Mbproxy:AdminPort"] = "8080",
// 0 disables the admin endpoint — this test does not exercise it, and a
// fixed port would collide under parallel test execution.
["Mbproxy:AdminPort"] = "0",
[$"Mbproxy:Plcs:0:Name"] = "BadPLC",
[$"Mbproxy:Plcs:0:ListenPort"] = proxyPort.ToString(),
[$"Mbproxy:Plcs:0:Host"] = "127.0.0.1",
@@ -307,7 +311,9 @@ public sealed class ProxyForwardingTests
var config = new Dictionary<string, string?>
{
["Mbproxy:AdminPort"] = "8080",
// 0 disables the admin endpoint — this test does not exercise it, and a
// fixed port would collide under parallel test execution.
["Mbproxy:AdminPort"] = "0",
[$"Mbproxy:Plcs:0:Name"] = "TestPLC",
[$"Mbproxy:Plcs:0:ListenPort"] = proxyPort.ToString(),
[$"Mbproxy:Plcs:0:Host"] = _sim.Host,
@@ -385,7 +385,9 @@ public sealed class RewriterE2ETests
var config = new Dictionary<string, string?>
{
["Mbproxy:AdminPort"] = "8080",
// 0 disables the admin endpoint — this test does not exercise it, and a
// fixed port would collide under parallel test execution.
["Mbproxy:AdminPort"] = "0",
["Mbproxy:Plcs:0:Name"] = "TestPLC",
["Mbproxy:Plcs:0:ListenPort"] = proxyPort.ToString(),
["Mbproxy:Plcs:0:Host"] = _sim.Host,
@@ -118,9 +118,12 @@ public sealed class DL205SimulatorFixture : IAsyncLifetime
_process.BeginErrorReadLine();
// ── 5. Poll for TCP readiness (up to ReadinessTimeout) ───────────────
// Link the readiness deadline against the test-runner's cancellation token so a
// CI job timeout / keyboard interrupt aborts the poll promptly instead of running
// the full 120 s and leaving the spawned Python process orphaned (review M3).
using var deadline = new CancellationTokenSource(ReadinessTimeout);
using var linked = CancellationTokenSource.CreateLinkedTokenSource(
deadline.Token, CancellationToken.None);
deadline.Token, TestContext.Current.CancellationToken);
bool ready = false;
while (!linked.Token.IsCancellationRequested)