b222362ce0
Fixes every finding from the codereviews/2026-05-16 multi-agent review (2 Critical, 20 Major, 38 Minor) and adds that review to the repo. Highlights: dashboard XSS escape; response cache invalidated on the write request (not just the response); ReloadValidator now runs at startup so port collisions / duplicate names / malformed Resilience profiles fail fast; AdminPort 0 genuinely disables the admin endpoint; PlcListener accept-loop faults propagate to the supervisor's faulted path; reconciler Restart builds before removing; Resilience pipelines are restart-only from a frozen snapshot; multiplexer connect-race leak, watchdog party-list snapshot, backend-response and FC16 framing validation; frontend reconnect retry and util.js load guard; plus the log-event/doc drift sweep and test-port hygiene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
313 lines
18 KiB
Markdown
313 lines
18 KiB
Markdown
# Code Review — Tests, Install/Packaging, and Config (HEAD `0308490`)
|
||
|
||
Scope: `tests/Mbproxy.Tests/` (all test code), `tests/sim/` (simulator profile and fixture),
|
||
`install/` (all scripts and config templates), `src/Mbproxy/Mbproxy.csproj`, and
|
||
`tests/Mbproxy.Tests/Mbproxy.Tests.csproj`. Prior-review findings from `codereviews/2026-05-15/TestsAndConfig.md`
|
||
were read for regression context; the code is reviewed fresh against the current HEAD.
|
||
|
||
## Summary
|
||
|
||
The prior-review remediation is genuine and thorough: every `Major` finding from the
|
||
2026-05-15 review has been addressed — `StatusBroadcaster.LoopAsync` now has a proper
|
||
timing-hedged loop test (`Loop_PushesRepeatedly_ThenStopsAfterStopAsync`), the
|
||
`Record`/`Disarm` race is fixed in production and covered by
|
||
`ConcurrentRecordAndDisarm_LeavesNoStaleObservation`, `PlcSubscriptionTracker` now has a
|
||
dedicated test file including a concurrency stress test, the `ThrowingStatusPushSink` fake
|
||
covers both the swallow-path and the `OperationCanceledException` filter, and
|
||
`ConfigReconcilerTests` now holds a real registry and asserts add/remove lifecycle. The
|
||
`HubStatusE2ETests` E2E pair covers the `MapHub`/`SignalRStatusPushSink` gap that the prior
|
||
review flagged as the single biggest "does the feature actually work" gap. The `EmbeddedAssetsTests`
|
||
glob guard and the `Get_Asset` byte-verbatim assertion also close the prior `m3`/`m4` nits.
|
||
No prior-review finding has regressed. The remaining findings below are all new, introduced by
|
||
the current HEAD or present in the pre-existing code but newly visible against the completed
|
||
remediation baseline.
|
||
|
||
**Findings: 0 Critical · 4 Major · 7 Minor**
|
||
|
||
---
|
||
|
||
## Critical
|
||
|
||
None.
|
||
|
||
---
|
||
|
||
## Major
|
||
|
||
**M1 — `AdminPort=0` silently binds a random OS-assigned port instead of disabling the admin endpoint.**
|
||
`install/mbproxy.config.template.json:83`, `install/mbproxy.linux.config.template.json:87`,
|
||
`docs/Operations/Configuration.md:54`; `src/Mbproxy/Admin/AdminEndpointHost.cs:190`.
|
||
|
||
Both config templates carry the comment `// Set to 0 to disable the admin endpoint.`, and the
|
||
Operations docs repeat this. The comment is wrong. `AdminEndpointHost.StartAppAsync` calls
|
||
`k.Listen(System.Net.IPAddress.Any, port)` with whatever `port` is — Kestrel interprets port 0
|
||
as "pick a free OS-assigned port" (ephemeral port), not as "skip the bind". There is no code in
|
||
`StartAppAsync` or its callers that special-cases port 0 as a disable signal. A production
|
||
operator who sets `AdminPort: 0` to disable the admin surface does not get a disabled endpoint;
|
||
they get an admin endpoint on an unknown ephemeral port that is not logged in the startup banner
|
||
in a predictable way. This is made worse by:
|
||
|
||
1. `ReloadValidator.Validate` (line 65–68) rejects `AdminPort < 1`, so a hot-reload with `AdminPort: 0`
|
||
is correctly blocked — but the *initial* startup path does not go through `ReloadValidator`
|
||
(only `MbproxyOptionsValidator`, which does not range-check `AdminPort`). So port 0 at
|
||
startup actually takes effect, while the same value in a reload is rejected. The two paths
|
||
are inconsistent.
|
||
2. Several tests intentionally use `["Mbproxy:AdminPort"] = "0"` with the comment "disable
|
||
admin to avoid port conflicts" (`ShutdownE2ETests.cs:163`, `StatusBroadcasterTests.cs:47`,
|
||
`StatusSnapshotBuilderTests.cs:266`). These tests are not broken — they work because Kestrel
|
||
binds on 0 and no test probes the admin port — but they silently start an admin endpoint
|
||
rather than suppressing it. If those tests ever run concurrently with another test that
|
||
probes all listening ports they can interfere.
|
||
|
||
**Impact:** Operators following the documented `AdminPort: 0` procedure to disable the admin
|
||
surface on a sensitive network have an undocumented, unauthorised HTTP listener running. This is
|
||
a documentation/behavioural inconsistency on a production-critical setting.
|
||
|
||
**Recommendation:** Implement the documented disable. In `AdminEndpointHost.StartAppAsync`, add
|
||
`if (port == 0) return;` before the builder construction. Then update `ReloadValidator` to allow
|
||
0 (skip collision check) and update `MbproxyOptionsValidator` to allow 0. The templates and docs
|
||
are then correct. Alternatively, remove the "Set to 0 to disable" comment and document that the
|
||
admin endpoint is always started, and rename the test comments to reflect reality.
|
||
|
||
---
|
||
|
||
**M2 — Multiple E2E tests hard-code `AdminPort: 8080`, causing bind-conflict failures when run in parallel or when port 8080 is already in use.**
|
||
`tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs:67,242,310`; `tests/Mbproxy.Tests/Proxy/RewriterE2ETests.cs:388`;
|
||
`tests/Mbproxy.Tests/HostSmokeTests.cs:111`.
|
||
|
||
`ProxyForwardingTests` (3 configs), `RewriterE2ETests` (1 config), and the shared
|
||
`TestHostBuilderExtensions.ConfigureForTest` used by all three `HostSmokeTests` hard-code
|
||
`["Mbproxy:AdminPort"] = "8080"`. When these tests run in parallel (xUnit v3's default policy
|
||
for non-collection tests) they each try to bind a Kestrel listener on port 8080. The first
|
||
binder wins; all subsequent tests get a `bind.failed` log and a missing admin endpoint. While
|
||
the proxy tests themselves do not assert admin-endpoint behaviour, the spurious bind-failed error
|
||
can mask real failures in CI logs and reduces test isolation. `HostSmokeTests` compounds this:
|
||
it is not in any `[Collection]` so it runs concurrently with everything else.
|
||
|
||
**Impact:** Flaky CI runs when multiple test classes execute simultaneously. Currently mitigated
|
||
by the `AdminEndpointHost` bind-failure-is-non-fatal path, so test assertions still pass — but
|
||
the coverage is degraded (admin is not actually up) and the log noise makes CI output harder to
|
||
read.
|
||
|
||
**Recommendation:** Replace the hard-coded `"8080"` values with `PickFreePort()` calls (the
|
||
same pattern already used by `AdminEndpointTests` and `HubStatusE2ETests`). Update
|
||
`TestHostBuilderExtensions.ConfigureForTest` to accept a port parameter with default 8080 for
|
||
callers that do not care (smoke tests do not probe the admin surface and an ephemeral-port bind
|
||
is fine).
|
||
|
||
---
|
||
|
||
**M3 — `DL205SimulatorFixture` swallows the test's own `CancellationToken` in the readiness poll, blocking test cancellation for up to 120 seconds.**
|
||
`tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs:121-163`.
|
||
|
||
`InitializeAsync` constructs a `CancellationTokenSource` with a hard-coded 120-second deadline
|
||
(line 61) and a `linked` token that chains the deadline against `CancellationToken.None` (line
|
||
122):
|
||
|
||
```csharp
|
||
using var linked = CancellationTokenSource.CreateLinkedTokenSource(
|
||
deadline.Token, CancellationToken.None);
|
||
```
|
||
|
||
The test framework's cancellation token — which xUnit v3 exposes as
|
||
`TestContext.Current.CancellationToken` — is never linked in. If the test runner cancels the
|
||
test suite (CI timeout, keyboard interrupt), the readiness poll continues for up to 120 more
|
||
seconds and the spawned Python process is not killed until `DisposeAsync` finally runs. On a CI
|
||
runner with a per-job timeout this can cause the whole runner to be killed mid-cleanup, leaving
|
||
orphaned Python processes.
|
||
|
||
**Impact:** Poor CI behaviour under cancellation; can leave zombie `python`/`pwsh` processes.
|
||
|
||
**Recommendation:** Link the fixture against the framework's application lifetime or accept a
|
||
`CancellationToken` through the `IAsyncLifetime` interface (xUnit v3 provides one via
|
||
`InitializeAsync(CancellationToken)`). Alternatively, use
|
||
`TestContext.Current.CancellationToken` from inside `InitializeAsync` if xUnit v3's fixture
|
||
`IAsyncLifetime` overload provides it:
|
||
|
||
```csharp
|
||
public async ValueTask InitializeAsync() // or the overload with CancellationToken
|
||
```
|
||
|
||
At minimum, change `CancellationToken.None` to the method's incoming token or
|
||
`TestContext.Current.CancellationToken` inside `linked` so a kill signal actually propagates.
|
||
|
||
---
|
||
|
||
**M4 — `install.ps1` copies the config template to `$dataDir` (ProgramData) but the `ReloadValidator` AdminPort=0 / `MbproxyOptionsValidator` gap means an operator using the template's `AdminPort: 0` disable hint starts a service with an undocumented ephemeral admin port.**
|
||
(Secondary impact of M1, listed separately because the install path is independently broken.)
|
||
`install/install.ps1:153-162`; `install/mbproxy.config.template.json:83-84`.
|
||
|
||
When `install.ps1` seeds the template as the initial config and an operator follows the inline
|
||
comment to set `AdminPort: 0`, the service starts with Kestrel binding on an OS-assigned
|
||
ephemeral port, not disabled. The `install.ps1` script itself has no bug — the error is upstream
|
||
in M1 — but the install path is where the misleading comment is first acted upon. Listing it
|
||
separately so the install script gets a concrete fix recommendation: if M1 is fixed by adding a
|
||
`port == 0 → skip` guard in `AdminEndpointHost`, the template comment becomes correct and this
|
||
finding is resolved simultaneously. If M1 is fixed differently, the template comment must still
|
||
be updated to match whatever the new semantics are.
|
||
|
||
**Impact:** Same as M1 — an unexpected open HTTP port on a "disabled" admin config.
|
||
|
||
**Recommendation:** Fix M1 first; this finding closes automatically once the template's
|
||
`AdminPort: 0` comment reflects actual behaviour.
|
||
|
||
---
|
||
|
||
## Minor
|
||
|
||
**m1 — `HubStatusE2ETests.PickFreePort` has a TOCTOU race: the port can be stolen between `l.Stop()` and Kestrel's bind.**
|
||
`tests/Mbproxy.Tests/Admin/HubStatusE2ETests.cs:141-147`.
|
||
|
||
The helper stops the listener and returns the port number — same pattern used in a dozen other
|
||
test files. This is acknowledged as acceptable in the `DL205SimulatorFixture` TOCTOU comment
|
||
(line 72). No change needed, but worth noting since `HubStatusE2ETests` is one of the few tests
|
||
that allocates two ports in the same method (admin + proxy) and has a wider race window. Low
|
||
priority; flagged for awareness.
|
||
|
||
---
|
||
|
||
**m2 — `Loop_PushesRepeatedly_ThenStopsAfterStopAsync` has a 400ms post-stop idle assertion that can be a source of slowness but not flakiness.**
|
||
`tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs:188-190`.
|
||
|
||
```csharp
|
||
await Task.Delay(400, TestContext.Current.CancellationToken);
|
||
h.Sink.FleetPushes.Count.ShouldBe(afterStop, "no pushes may occur after StopAsync");
|
||
```
|
||
|
||
The test correctly waits 400 ms after `StopAsync` to confirm silence. Given `AdminPushIntervalMs=100`
|
||
and that `StopAsync` awaits the loop task (`await _loop`), the loop is guaranteed terminated
|
||
before `StopAsync` returns — so zero extra delay is needed for the assertion to be correct. The
|
||
400 ms delay adds wall-clock time to every test run without buying additional confidence. The
|
||
existing `_loop` await in `StopAsync` is the real guarantee.
|
||
|
||
**Recommendation:** Drop the `Task.Delay(400, ...)` line and assert the count immediately after
|
||
`StopAsync`. The test would be equally correct and ~400 ms faster.
|
||
|
||
---
|
||
|
||
**m3 — `FakeGroupManager.Removed` list is still dead code: no test ever asserts it.**
|
||
`tests/Mbproxy.Tests/Admin/SignalRFakes.cs:30-31`.
|
||
|
||
This was noted as prior finding M4 in the 2026-05-15 review ("either delete `Removed` from the
|
||
fake, or add a comment"). The list is still present and still has no asserting test.
|
||
`StatusHub.OnDisconnectedAsync` never calls `RemoveFromGroupAsync` (relying on SignalR's
|
||
implicit group cleanup), so no assertion ever fires. The unasserted field is a minor maintenance
|
||
hazard: a future author may assume this is the intended assertion target and waste time debugging
|
||
a test that never fails.
|
||
|
||
**Recommendation:** Drop the `Removed` list from `FakeGroupManager`, or add a single inline
|
||
comment: `// SignalR auto-removes disconnected connections from groups; explicit RemoveFromGroupAsync is not called.`
|
||
|
||
---
|
||
|
||
**m4 — `mbproxy.smoke.config.json` still carries a dangling reference to `plans/`.**
|
||
`tests/sim/mbproxy.smoke.config.json:1`.
|
||
|
||
The header comment references `plans/2026-05-15-webui-dashboard.md`. As of commit `7466a46`
|
||
("retire superseded design/plan docs"), `plans/` is untracked (`?? plans/` in git status) and
|
||
contains no file of that name. The comment is a stale citation that will silently fail any
|
||
documentation-link check.
|
||
|
||
**Recommendation:** Replace the reference with a plain description of the smoke file's purpose,
|
||
removing the specific plan filename.
|
||
|
||
---
|
||
|
||
**m5 — Both install templates have `AdminPort: 8080` hard-coded in the `Plcs` section inline comment, but the comment already appears in the admin-port section — the duplication in the PLC entry comment is not wrong but potentially confusing.**
|
||
`install/mbproxy.config.template.json:79` (`"// GET /status.json → …"` inside the PLC
|
||
description); both templates are identical here.
|
||
|
||
This is a documentation nit: the `Plcs` array comment block on lines 60–75 is accurate, and the
|
||
preceding inline comment about the 4-client cap remains true and relevant. No functional issue.
|
||
|
||
**Recommendation:** No change required; noted for completeness.
|
||
|
||
---
|
||
|
||
**m6 — `publish.ps1` and `publish.sh` do not exit non-zero if the config template copy fails.**
|
||
`install/publish.ps1:87-94`; `install/publish.sh:85-95`.
|
||
|
||
Both scripts call `throw` / `exit 1` if the template file is missing, which is correct. However
|
||
the `Copy-Item -Force` / `cp -f` steps that copy the template into each output flavour silently
|
||
succeed even if the destination directory does not exist (the directory was just created by
|
||
`dotnet publish`). If `dotnet publish` produced no output directory (a degenerate build failure
|
||
that `dotnet publish` should have already reported as non-zero), the copy also fails with an
|
||
error but neither script re-checks `$LASTEXITCODE` after the copy loop. In practice `dotnet publish`
|
||
failing before this point already causes both scripts to throw/exit (the preceding
|
||
`if ($LASTEXITCODE -ne 0) { throw }` guards). This is low-risk but the post-copy result
|
||
verification (the `Format-Size` / `du -h` summary block) is output-only and does not fail the
|
||
script if a binary is missing — it emits a `Write-Warning` / `echo WARNING` instead.
|
||
|
||
**Recommendation:** In the result summary loop of both scripts, change the missing-binary branch
|
||
from a warning to a non-zero exit:
|
||
|
||
```powershell
|
||
# publish.ps1
|
||
if (Test-Path $bin) { ... } else { throw "Expected binary not found: $bin" }
|
||
```
|
||
|
||
```bash
|
||
# publish.sh
|
||
if [[ -f "$bin" ]]; then ... else echo "ERROR: expected binary not found: $bin" >&2; exit 1; fi
|
||
```
|
||
|
||
---
|
||
|
||
**m7 — `mbproxy.service` `ReadWritePaths` does not include `/etc/mbproxy`, so a hot-reload write by the service account to its own config directory is blocked by `ProtectSystem=strict`.**
|
||
`install/mbproxy.service:40`.
|
||
|
||
```ini
|
||
ReadWritePaths=/var/log/mbproxy /var/cache/mbproxy
|
||
```
|
||
|
||
`ProtectSystem=strict` makes the entire filesystem read-only except paths listed in
|
||
`ReadWritePaths`. The service itself never writes to `/etc/mbproxy` (it only reads
|
||
`appsettings.json`), so this is not a runtime bug. However:
|
||
|
||
1. The comment at line 8 states `WorkingDirectory=/etc/mbproxy` — the service account needs
|
||
read access there, which `ProtectSystem=strict` provides (read-only is allowed).
|
||
2. If an operator's monitoring or deployment tooling updates the config in-place (e.g.
|
||
`scp`-ing a new `appsettings.json` as the `mbproxy` user via an SSH key), that write will
|
||
be blocked by `ProtectSystem=strict`. The operator would need to write as root and chown/chmod
|
||
as appropriate.
|
||
|
||
This is likely intentional (config is an admin operation), but the systemd unit has no comment
|
||
explaining why `/etc/mbproxy` is intentionally absent from `ReadWritePaths`. It trips up
|
||
operators who expect the service account to own its config.
|
||
|
||
**Recommendation:** Add a comment after the `ReadWritePaths` line: `# /etc/mbproxy is intentionally absent — config changes require root (ProtectSystem=strict makes it read-only for the service account).`
|
||
|
||
---
|
||
|
||
## Regression check against 2026-05-15 findings
|
||
|
||
| Prior finding | Status |
|
||
|---|---|
|
||
| M1 — `LoopAsync` untested | Closed — `Loop_PushesRepeatedly_ThenStopsAfterStopAsync` added |
|
||
| M2 — `Record`/`Disarm` race | Closed — production fix + `ConcurrentRecordAndDisarm_LeavesNoStaleObservation` |
|
||
| M3 — `PlcSubscriptionTracker` no direct/concurrent test | Closed — `PlcSubscriptionTrackerTests.cs` added with stress test |
|
||
| M4 — `FakeGroupManager.Removed` dead code | Partially addressed: no test added; `Removed` list is still present — see new `m3` |
|
||
| m1 — `ThrowingStatusPushSink` missing | Closed — `ThrowingStatusPushSink` and two swallow/propagate tests added |
|
||
| m2 — `FakeHubCallerContext.ConnectionAborted` hardcoded | No change; accepted as low-risk |
|
||
| m3 — asset allow-list coverage | Closed — `EmbeddedAssetsTests` covers glob; `Get_Asset` asserts bytes verbatim |
|
||
| m4 — `Get_Asset` byte assertion missing | Closed — `ReadEmbeddedAsset` comparison added |
|
||
| m5 — `*.*` glob caveats | Unchanged; accepted |
|
||
| m6 — `mbproxy.smoke.config.json` dangling plans reference | Not fixed — see new `m4` |
|
||
| m7 — `ConfigReconcilerTests` registry wiring untested | Closed — `Apply_AddThenRemovePlc_TagCaptureRegistryTracksRoster` added |
|
||
| m8 — `AdminPushIntervalMs` upper bound unguarded | Closed — upper bound added to both `ReloadValidator` and `MbproxyOptionsValidator`; four new tests |
|
||
| m9 — `OnDisconnectedAsync` always called with null | No change; accepted as low-risk |
|
||
| Coverage gap 7 — `/hub/status` SignalR E2E | Closed — `HubStatusE2ETests` (2 facts) added |
|
||
|
||
## Key file references
|
||
|
||
- `install/mbproxy.config.template.json:83`, `install/mbproxy.linux.config.template.json:87` — "Set to 0 to disable" comment (M1)
|
||
- `src/Mbproxy/Admin/AdminEndpointHost.cs:190` — `k.Listen(..., port)` with no port-0 guard (M1)
|
||
- `src/Mbproxy/Configuration/ReloadValidator.cs:65-68` — rejects `AdminPort < 1` on hot-reload but not at startup (M1)
|
||
- `tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs:67,242,310`, `RewriterE2ETests.cs:388`, `HostSmokeTests.cs:111` — hard-coded port 8080 (M2)
|
||
- `tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs:121-122` — `CancellationToken.None` swallows test-runner cancellation (M3)
|
||
- `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs:188` — unnecessary 400 ms idle delay (m2)
|
||
- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs:30-31` — `Removed` list never asserted (m3)
|
||
- `tests/sim/mbproxy.smoke.config.json:1` — dangling `plans/` reference (m4)
|
||
- `install/publish.ps1:103-110`, `install/publish.sh:99-106` — warning-only when expected binary missing (m6)
|
||
- `install/mbproxy.service:40` — `ReadWritePaths` silent about `/etc/mbproxy` exclusion (m7)
|