mbproxy: remediate the 2026-05-16 code-review findings

Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-16 18:08:06 -04:00
parent 0308490aef
commit b222362ce0
45 changed files with 1735 additions and 151 deletions
+2 -2
@@ -143,7 +143,7 @@ DL205 / DL260 BCD is non-negative in the default ladder pattern. `BcdCodec.Encod
## Exception Pass-Through
Modbus exception responses pass through unchanged. The rewriter detects an exception response by the high bit of the function code (`fc & 0x80 != 0`), emits a `mbproxy.rewrite.exception_passthrough` event, increments the per-FC exception counter, and returns without touching the payload.
Modbus exception responses pass through unchanged. The rewriter detects an exception response by the high bit of the function code (`fc & 0x80 != 0`), emits a `mbproxy.exception.passthrough` event, increments the per-FC exception counter, and returns without touching the payload.
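The high-bit test described above can be sketched in a few lines. This is an illustrative Python sketch, not the rewriter's actual code; the function names are hypothetical. The parentheses around the mask are written explicitly because in C-family languages `!=` binds tighter than `&`.

```python
EXCEPTION_FLAG = 0x80  # high bit of the Modbus function-code byte


def is_exception_response(function_code: int) -> bool:
    """Return True when the function-code byte has its high bit set."""
    return (function_code & EXCEPTION_FLAG) != 0


def original_function_code(function_code: int) -> int:
    """Clear the exception flag to recover the request's function code."""
    return function_code & 0x7F


if __name__ == "__main__":
    # A failed FC03 (read holding registers) comes back with function code 0x83.
    print(is_exception_response(0x83), original_function_code(0x83))
    print(is_exception_response(0x03))
```

A per-FC exception counter would presumably be keyed by the value `original_function_code` returns, though the source does not say how the counter is indexed.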
Covered exception codes:
@@ -229,7 +229,7 @@ The rewriter feeds two counters that surface on the status page:
An out-of-range value (`< 0` or `> 9999` for 16-bit; `< 0` or `> 99_999_999` for 32-bit) on a write, or a bad nibble (`>= 0xA`) on a read, increments an internal invalid-BCD counter and emits `mbproxy.rewrite.invalid_bcd` at warning. The PDU passes through raw in that case; the rewriter never substitutes a value the client did not send (writes) or the PLC did not return (reads).
Both counters are exposed on the status page; see [`../Operations/StatusPage.md`](../Operations/StatusPage.md). The corresponding log events (`mbproxy.rewrite.partial_bcd`, `mbproxy.rewrite.invalid_bcd`, `mbproxy.rewrite.exception_passthrough`) are catalogued in [`../Reference/LogEvents.md`](../Reference/LogEvents.md). Partial-overlap troubleshooting is covered in [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md).
Both counters are exposed on the status page; see [`../Operations/StatusPage.md`](../Operations/StatusPage.md). The corresponding log events (`mbproxy.rewrite.partial_bcd`, `mbproxy.rewrite.invalid_bcd`, `mbproxy.exception.passthrough`) are catalogued in [`../Reference/LogEvents.md`](../Reference/LogEvents.md). Partial-overlap troubleshooting is covered in [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md).
The `dl205.json` pymodbus simulator profile encodes BCD test fixtures used by the integration test suite; see [`../Testing/Simulator.md`](../Testing/Simulator.md).
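The 16-bit range and nibble checks described in this hunk can be illustrated with a minimal sketch. It assumes one BCD digit per nibble with the most significant digit in the high nibble; the function names are hypothetical and are not the project's `BcdCodec` API. Returning `None` stands in for the documented behaviour of counting the value as invalid and passing the PDU through raw.

```python
from typing import Optional


def encode_bcd16(value: int) -> Optional[int]:
    """Pack a decimal value 0..9999 into four BCD nibbles; None if out of range."""
    if value < 0 or value > 9999:
        return None  # caller would count invalid BCD and forward the raw PDU
    raw = 0
    for shift in (0, 4, 8, 12):  # least significant digit first
        raw |= (value % 10) << shift
        value //= 10
    return raw


def decode_bcd16(raw: int) -> Optional[int]:
    """Unpack four BCD nibbles; None if any nibble is >= 0xA."""
    value = 0
    for shift in (12, 8, 4, 0):  # most significant digit first
        nibble = (raw >> shift) & 0xF
        if nibble >= 0xA:
            return None  # bad nibble on a read: pass through raw
        value = value * 10 + nibble
    return value
```

The 32-bit case is left out on purpose: it depends on the register word order, which this hunk does not state.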
+9 -7
@@ -56,6 +56,7 @@ If a step throws, the exception is logged at Error and the loop continues with t
| `Cache.EvictionIntervalMs` | Read by the next eviction loop tick. |
| `Resilience.ReadCoalescing.Enabled` flipped to `false` | Already-running coalesced entries drain naturally. Subsequent reads bypass coalescing. |
| `Resilience.ReadCoalescing.MaxParties` | Applies to subsequent attaches. Existing in-flight entries keep their current cap. |
| `Resilience.BackendConnect.*` or `Resilience.ListenerRecovery.*` | **Restart-only.** The backend-connect and listener-recovery Polly pipelines are built from the `Resilience` snapshot taken at service startup; the reconciler builds add/restart supervisors from that same frozen snapshot, so a hot-reload of these values does not propagate to any PLC. Restart the service to change them. |
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole. The current in-memory config stays in effect. `mbproxy.config.reload.rejected` is logged at Error. |
The "next-PDU" wording is load-bearing for the tag-list rows: the rewriter does not snapshot the tag map at connection accept time. It resolves the map for the active PLC at the start of every request frame, so a hot-reloaded tag list is in effect for the very next request, even on existing TCP connections.
@@ -78,21 +79,22 @@ The `ReloadPlan` distinguishes two kinds of "PLC is still here but changed":
3. Merge in `Plcs[i].BcdTags.Add` entries — if an address already exists in the working set, the `Add` entry wins. This is how a per-PLC width override is expressed (the global lists a 16-bit tag at the same address; the per-PLC `Add` overrides it to 32-bit).
4. Fold `Plcs[i].DefaultCacheTtlMs` into any tag whose explicit `CacheTtlMs` is null.
The same builder runs both at startup and during reload validation, so a configuration that builds cleanly at startup is guaranteed to build cleanly at reload, and vice versa. There is no second validator that could disagree with the first.
The same builder runs both at startup and during reload validation, so a configuration that builds cleanly at startup is guaranteed to build cleanly at reload, and vice versa.
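Steps 3 and 4 of the builder can be sketched as follows. This is a hedged illustration only: it starts from an already-resolved working set (the earlier steps are not shown in this hunk), models tags as plain dictionaries keyed by `Address`, and uses a hypothetical function name rather than the real `BcdTagMapBuilder` signature.

```python
from typing import Dict, List, Optional

Tag = Dict[str, Optional[int]]


def apply_plc_overrides(
    working: Dict[int, Tag],
    add_entries: List[Tag],
    default_cache_ttl_ms: Optional[int],
) -> Dict[int, Tag]:
    """Merge per-PLC Add entries (Add wins), then fold in the default TTL."""
    merged = {address: dict(tag) for address, tag in working.items()}

    # Step 3: an Add entry replaces any existing entry at the same address.
    for tag in add_entries:
        merged[tag["Address"]] = dict(tag)

    # Step 4: only tags with no explicit CacheTtlMs inherit the PLC default.
    for tag in merged.values():
        if tag.get("CacheTtlMs") is None:
            tag["CacheTtlMs"] = default_cache_ttl_ms
    return merged
```

Note that an explicit `CacheTtlMs` of `0` is kept as-is in this sketch; only a null value is treated as "unset".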
## Validation Rules
`ReloadValidator.Validate` is the gate the hot-reload path consults directly. It runs the following checks in order:
`ReloadValidator.Validate` is the configuration gate. It runs at **startup** (in `ProxyWorker.ExecuteAsync`, before any supervisor is built — a rejection logs `mbproxy.startup.config.rejected` and the service exits non-zero) **and** on every hot reload. It runs the following checks in order:
1. PLC names are non-empty and unique under ordinal comparison.
2. Every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the `Plcs` list.
3. `AdminPort` is in `[1, 65535]` and does not collide with any `ListenPort`.
2. Every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the `Plcs` list; every `Host` is non-empty and every backend `Port` is in `[1, 65535]`.
3. `AdminPort` is in `[1, 65535]`, or `0` to disable the admin endpoint; a non-zero `AdminPort` must not collide with any `ListenPort`.
4. For each PLC, `BcdTagMapBuilder.Build(next.BcdTags, plc.BcdTags, plc.DefaultCacheTtlMs)` reports no errors. This delegates the per-PLC well-formedness checks — duplicate addresses within a single resolved list, and 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry — to the single source of truth used at startup.
5. Cache TTL bounds: every `BcdTag.CacheTtlMs` and every `Plcs[i].DefaultCacheTtlMs` must be `>= 0`, and any value above `60_000` ms requires `Cache.AllowLongTtl = true`. `Cache.MaxEntriesPerPlc` and `Cache.EvictionIntervalMs` must be `>= 0`.
5. Cache TTL bounds: every `BcdTag.CacheTtlMs` and every `Plcs[i].DefaultCacheTtlMs` must be `>= 0`, and any value above `60_000` ms requires `Cache.AllowLongTtl = true`. `Cache.MaxEntriesPerPlc` must be in `[0, 100000]` and `Cache.EvictionIntervalMs` must be `>= 0`.
6. `AdminPushIntervalMs` is in `[1, 60000]`; connection timeouts are `> 0`; the keepalive cross-field rule holds; and the `Resilience` profiles are well-formed (`BackendConnect.MaxAttempts >= 1` with at least `MaxAttempts - 1` non-negative `BackoffMs` entries, `ListenerRecovery.SteadyStateMs > 0`, `ReadCoalescing.MaxParties >= 1`).
A failure at any step appends to the error list but the validator runs to completion so the operator sees every problem with a single save. If the list is non-empty, the reload is rejected atomically and no state mutates.
A failure at any step appends to the error list but the validator runs to completion so the operator sees every problem with a single save. If the list is non-empty, the reload is rejected atomically and no state mutates (at startup, the service refuses to start).
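The run-to-completion behaviour can be sketched with an error-accumulating validator covering rules 1 to 3. This is an assumption-laden illustration, not `ReloadValidator` itself: the config is a plain dictionary, the error wording is invented, and the `Host` and backend `Port` checks are omitted for brevity.

```python
from typing import Dict, List


def validate(config: Dict) -> List[str]:
    """Collect every rule violation instead of stopping at the first one."""
    errors: List[str] = []
    plcs = config["Plcs"]

    # Rule 1: names are non-empty and unique (Python str equality is ordinal).
    seen_names = set()
    for i, plc in enumerate(plcs):
        name = plc["Name"]
        if not name:
            errors.append(f"Plcs[{i}].Name is empty")
        elif name in seen_names:
            errors.append(f"duplicate PLC name '{name}'")
        seen_names.add(name)

    # Rule 2: listen ports are in [1, 65535] and unique.
    seen_ports = set()
    for i, plc in enumerate(plcs):
        port = plc["ListenPort"]
        if not 1 <= port <= 65535:
            errors.append(f"Plcs[{i}].ListenPort {port} is out of range")
        elif port in seen_ports:
            errors.append(f"duplicate ListenPort {port}")
        seen_ports.add(port)

    # Rule 3: AdminPort 0 disables the endpoint; otherwise range and collision.
    admin = config["AdminPort"]
    if admin != 0:
        if not 1 <= admin <= 65535:
            errors.append(f"AdminPort {admin} is out of range")
        elif admin in seen_ports:
            errors.append(f"AdminPort {admin} collides with a ListenPort")

    return errors
```

A caller would reject the whole snapshot when the returned list is non-empty, which is what lets an operator see every problem after a single save.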
Schema-level checks — invalid `Width` values on a `BcdTagOptions`, type mismatches, malformed JSON — are also enforced by `MbproxyOptionsValidator` (`IValidateOptions<MbproxyOptions>`) at bind time. The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording.
Schema-level checks — invalid `Width` values on a `BcdTagOptions`, type mismatches, malformed JSON — are also enforced by `MbproxyOptionsValidator` (`IValidateOptions<MbproxyOptions>`) at bind time. The two validators overlap deliberately; their error wording is similar but not guaranteed identical.
### Rejected-reload example
+6 -5
@@ -281,21 +281,22 @@ The cache itself is described in detail in [`../Architecture/ResponseCache.md`](
## Validation Rules
`ReloadValidator.Validate` runs on every config load (startup and hot reload) and rejects the entire snapshot if any rule fails. On rejection at startup, the service exits non-zero. On rejection at runtime, the current in-memory config stays in effect and `mbproxy.config.reload.rejected` is logged at `Error`.
`ReloadValidator.Validate` runs on every config load (startup and hot reload) and rejects the entire snapshot if any rule fails. On rejection at startup, the service logs `mbproxy.startup.config.rejected` at `Error` and exits non-zero. On rejection at runtime, the current in-memory config stays in effect and `mbproxy.config.reload.rejected` is logged at `Error`.
Rules (in order):
1. **PLC names**: every `Plcs[i].Name` is non-empty and unique (ordinal comparison).
2. **ListenPort**: every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the array.
3. **AdminPort**: in `[1, 65535]` and does not collide with any `ListenPort`.
2. **ListenPort / Host / Port**: every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the array; every `Host` is non-empty; every backend `Port` is in `[1, 65535]`.
3. **AdminPort**: in `[1, 65535]`, or `0` to disable the admin endpoint; a non-zero value does not collide with any `ListenPort`.
4. **BCD tag map** per PLC, delegated to `BcdTagMapBuilder.Build`:
- duplicate addresses within a single PLC's resolved tag list
- 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry at that address
5. **Cache TTL bounds**:
- any `CacheTtlMs` or `DefaultCacheTtlMs` less than 0 is rejected
- any `CacheTtlMs` or `DefaultCacheTtlMs` greater than `60_000` is rejected unless `Cache.AllowLongTtl = true`
6. **Cache size knobs**: `Cache.MaxEntriesPerPlc >= 0`, `Cache.EvictionIntervalMs >= 0`.
7. **Width**: every `BcdTagOptions.Width` is `16` or `32` (enforced by `MbproxyOptionsValidator` at schema time).
6. **Cache size knobs**: `Cache.MaxEntriesPerPlc` in `[0, 100000]`, `Cache.EvictionIntervalMs >= 0`.
7. **AdminPushIntervalMs / timeouts / keepalive / Resilience**: `AdminPushIntervalMs` in `[1, 60000]`; connection timeouts `> 0`; the keepalive cross-field rule (`BackendHeartbeatIdleMs > BackendRequestTimeoutMs`); and well-formed `Resilience` profiles (`BackendConnect.MaxAttempts >= 1` with `>= MaxAttempts - 1` non-negative `BackoffMs` entries, `ListenerRecovery.SteadyStateMs > 0`, `ReadCoalescing.MaxParties >= 1`).
8. **Width**: every `BcdTagOptions.Width` is `16` or `32` (also enforced by `MbproxyOptionsValidator` at schema time).
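The keepalive and `Resilience` checks in rule 7 can be sketched as below. This is one reading of the rule, stated as an assumption: the `BackoffMs` list needs at least `MaxAttempts - 1` entries, and every entry must be non-negative. Function names and error wording are hypothetical.

```python
from typing import List


def validate_keepalive(heartbeat_idle_ms: int, request_timeout_ms: int) -> List[str]:
    """Cross-field rule: BackendHeartbeatIdleMs > BackendRequestTimeoutMs."""
    if heartbeat_idle_ms > request_timeout_ms:
        return []
    return ["BackendHeartbeatIdleMs must exceed BackendRequestTimeoutMs"]


def validate_resilience(
    max_attempts: int,
    backoff_ms: List[int],
    steady_state_ms: int,
    max_parties: int,
) -> List[str]:
    """Check the Resilience profile fields, collecting every failure."""
    errors: List[str] = []
    if max_attempts < 1:
        errors.append("BackendConnect.MaxAttempts must be >= 1")
    if len(backoff_ms) < max_attempts - 1:
        errors.append("BackendConnect.BackoffMs has too few entries")
    if any(delay < 0 for delay in backoff_ms):
        errors.append("BackendConnect.BackoffMs entries must be non-negative")
    if steady_state_ms <= 0:
        errors.append("ListenerRecovery.SteadyStateMs must be > 0")
    if max_parties < 1:
        errors.append("ReadCoalescing.MaxParties must be >= 1")
    return errors
```

For example, three attempts need at least two backoff delays, one between each pair of consecutive attempts.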
Sample rejection messages (logged at `Error` with the structured property `errors` carrying the full list):
+2
@@ -330,6 +330,8 @@ The detail page's debug view is fed by an **on-demand per-tag value capture** (`
| `debug.tags[].updatedAtUtc` | `string?` | ISO-8601 time of the observation; `null` when no traffic yet. |
| `debug.tags[].ageSeconds` | `double?` | Seconds since the observation; `null` when no traffic yet. |
`PlcDetailResponse` is delivered **only** over the `/hub/status` SignalR feed (the `"plc"` message); there is no `GET` route for it, and it is serialized through the SignalR JSON protocol rather than `StatusJsonContext`. Scrapers that want per-PLC counters use the `plcs[]` array of `GET /status.json` instead — the debug-view capture has no JSON-twin endpoint.
## How to Scrape It
The JSON twin is plain HTTP. Any monitoring system that can curl an endpoint can scrape it.
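A minimal scraping sketch, using only the Python standard library. The base URL passed to `fetch_plcs` is an assumption (the admin host and port depend on the deployment); the only payload field relied on is the `plcs[]` array named above, and both function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def parse_plcs(body: str) -> List[Dict[str, Any]]:
    """Return the plcs[] array from a /status.json body, or [] if absent."""
    document = json.loads(body)
    return document.get("plcs", [])


def fetch_plcs(base_url: str, timeout_s: float = 5.0) -> List[Dict[str, Any]]:
    """GET <base_url>/status.json and return its per-PLC entries."""
    url = f"{base_url}/status.json"
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return parse_plcs(response.read().decode("utf-8"))
```

Splitting the parsing from the HTTP call keeps the parsing step testable without a running proxy.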
+15 -3
@@ -45,6 +45,18 @@ Fires once after `ProxyWorker.StartAsync` has spun up every per-PLC supervisor a
**Operator action:** if the two counts disagree, search for `mbproxy.startup.bind.failed` entries to identify the missing PLCs.
### mbproxy.startup.config.rejected
**Level:** Error &middot; **EventId:** 2 &middot; **Source:** `src/Mbproxy/Proxy/ProxyWorker.cs`
| Property | Type | Meaning |
|----------|------|---------|
| `Errors` | `string` | Concatenated validation failures (one per `;`). |
Fires once at startup when `ReloadValidator.Validate` rejects the initial `appsettings.json` — duplicate listen ports, an `AdminPort` collision, duplicate PLC names, a malformed BCD tag list, a bad keepalive cross-field relationship, or an invalid `Resilience` profile. The service then exits non-zero; no listeners are started. This is the startup-time twin of `mbproxy.config.reload.rejected`.
**Operator action:** fix the offending entry in `appsettings.json` and restart the service. The error text names every failed rule.
### mbproxy.startup.bind
**Level:** Information &middot; **EventId:** 20 (`PlcListener`) / 40 (`PlcListenerSupervisor`) &middot; **Source:** `src/Mbproxy/Proxy/PlcListener.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
@@ -60,7 +72,7 @@ Fires when a per-PLC `TcpListener` successfully binds its configured port. Emitt
### mbproxy.startup.bind.failed
**Level:** Error &middot; **EventId:** 21 (`ProxyWorker`) / 41 (`PlcListenerSupervisor`) &middot; **Source:** `src/Mbproxy/Proxy/ProxyWorker.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
**Level:** Error &middot; **EventId:** 41 &middot; **Source:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
| Property | Type | Meaning |
|----------|------|---------|
@@ -88,7 +100,7 @@ Fires after the supervisor's Polly recovery pipeline successfully rebinds a list
### mbproxy.listener.faulted
**Level:** Error (`PlcListener`) / Warning (`PlcListenerSupervisor`) &middot; **EventId:** 22 / 43 &middot; **Source:** `src/Mbproxy/Proxy/PlcListener.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
**Level:** Warning &middot; **EventId:** 43 &middot; **Source:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
| Property | Type | Meaning |
|----------|------|---------|
@@ -96,7 +108,7 @@ Fires after the supervisor's Polly recovery pipeline successfully rebinds a list
| `Port` | `int` | Port whose listener faulted. |
| `Reason` | `string` | Top-level exception message. |
Fires when a listener's accept loop throws. The two sources emit at different levels deliberately: the unsupervised `PlcListener` instance logs at `Error` (a terminal condition for that listener), while the supervised emission is `Warning` because Polly will retry. The supervised path attaches the exception object as the `LoggerMessage` exception parameter, so the stack trace is captured.
Fires when a listener's accept loop throws. `PlcListener.RunAsync` propagates the fault to its `PlcListenerSupervisor`, which logs this event at `Warning` (Polly will retry) with the exception object attached as the `LoggerMessage` exception parameter, so the stack trace is captured.
**Operator action:** if the same `Plc` produces repeated faults inside a few minutes, inspect the network path. A burst of faults paired with `mbproxy.multiplex.backend.disconnected` indicates the PLC itself is unhealthy rather than a proxy issue.