Files
wwtools/mbproxy/codereviews/2026-05-16/TestsAndConfig.md
Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings
Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:08:06 -04:00

313 lines
18 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Tests, Install/Packaging, and Config (HEAD `0308490`)
Scope: `tests/Mbproxy.Tests/` (all test code), `tests/sim/` (simulator profile and fixture),
`install/` (all scripts and config templates), `src/Mbproxy/Mbproxy.csproj`, and
`tests/Mbproxy.Tests/Mbproxy.Tests.csproj`. Prior-review findings from `codereviews/2026-05-15/TestsAndConfig.md`
were read for regression context; the code is reviewed fresh against the current HEAD.
## Summary
The prior-review remediation is genuine and thorough: every `Major` finding from the
2026-05-15 review has been addressed — `StatusBroadcaster.LoopAsync` now has a proper
timing-hedged loop test (`Loop_PushesRepeatedly_ThenStopsAfterStopAsync`), the
`Record`/`Disarm` race is fixed in production and covered by
`ConcurrentRecordAndDisarm_LeavesNoStaleObservation`, `PlcSubscriptionTracker` now has a
dedicated test file including a concurrency stress test, the `ThrowingStatusPushSink` fake
covers both the swallow-path and the `OperationCanceledException` filter, and
`ConfigReconcilerTests` now holds a real registry and asserts add/remove lifecycle. The
`HubStatusE2ETests` E2E pair covers the `MapHub`/`SignalRStatusPushSink` gap that the prior
review flagged as the single biggest "does the feature actually work" gap. The `EmbeddedAssetsTests`
glob guard and the `Get_Asset` byte-verbatim assertion also close the prior `m3`/`m4` nits.
No prior-review finding has regressed. The remaining findings below are all new, introduced by
the current HEAD or present in the pre-existing code but newly visible against the completed
remediation baseline.
**Findings: 0 Critical · 4 Major · 7 Minor**
---
## Critical
None.
---
## Major
**M1 — `AdminPort=0` silently binds a random OS-assigned port instead of disabling the admin endpoint.**
`install/mbproxy.config.template.json:83`, `install/mbproxy.linux.config.template.json:87`,
`docs/Operations/Configuration.md:54`; `src/Mbproxy/Admin/AdminEndpointHost.cs:190`.
Both config templates carry the comment `// Set to 0 to disable the admin endpoint.`, and the
Operations docs repeat this. The comment is wrong. `AdminEndpointHost.StartAppAsync` calls
`k.Listen(System.Net.IPAddress.Any, port)` with whatever `port` is — Kestrel interprets port 0
as "pick a free OS-assigned port" (ephemeral port), not as "skip the bind". There is no code in
`StartAppAsync` or its callers that special-cases port 0 as a disable signal. A production
operator who sets `AdminPort: 0` to disable the admin surface does not get a disabled endpoint;
they get an admin endpoint on an unknown ephemeral port that is not logged in the startup banner
in a predictable way. This is made worse by:
1. `ReloadValidator.Validate` (line 6568) rejects `AdminPort < 1`, so a hot-reload with `AdminPort: 0`
is correctly blocked — but the *initial* startup path does not go through `ReloadValidator`
(only `MbproxyOptionsValidator`, which does not range-check `AdminPort`). So port 0 at
startup actually takes effect, while the same value in a reload is rejected. The two paths
are inconsistent.
2. Several tests intentionally use `["Mbproxy:AdminPort"] = "0"` with the comment "disable
admin to avoid port conflicts" (`ShutdownE2ETests.cs:163`, `StatusBroadcasterTests.cs:47`,
`StatusSnapshotBuilderTests.cs:266`). These tests are not broken — they work because Kestrel
binds on 0 and no test probes the admin port — but they silently start an admin endpoint
rather than suppressing it. If those tests ever run concurrently with another test that
probes all listening ports they can interfere.
**Impact:** Operators following the documented `AdminPort: 0` procedure to disable the admin
surface on a sensitive network have an undocumented, unauthorised HTTP listener running. This is
a documentation/behavioural inconsistency on a production-critical setting.
**Recommendation:** Implement the documented disable. In `AdminEndpointHost.StartAppAsync`, add
`if (port == 0) return;` before the builder construction. Then update `ReloadValidator` to allow
0 (skip collision check) and update `MbproxyOptionsValidator` to allow 0. The templates and docs
are then correct. Alternatively, remove the "Set to 0 to disable" comment and document that the
admin endpoint is always started, and rename the test comments to reflect reality.
---
**M2 — Multiple E2E tests hard-code `AdminPort: 8080`, causing bind-conflict failures when run in parallel or when port 8080 is already in use.**
`tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs:67,242,310`; `tests/Mbproxy.Tests/Proxy/RewriterE2ETests.cs:388`;
`tests/Mbproxy.Tests/HostSmokeTests.cs:111`.
`ProxyForwardingTests` (3 configs), `RewriterE2ETests` (1 config), and the shared
`TestHostBuilderExtensions.ConfigureForTest` used by all three `HostSmokeTests` hard-code
`["Mbproxy:AdminPort"] = "8080"`. When these tests run in parallel (xUnit v3's default policy
for non-collection tests) they each try to bind a Kestrel listener on port 8080. The first
binder wins; all subsequent tests get a `bind.failed` log and a missing admin endpoint. While
the proxy tests themselves do not assert admin-endpoint behaviour, the spurious bind-failed error
can mask real failures in CI logs and reduces test isolation. `HostSmokeTests` compounds this:
it is not in any `[Collection]` so it runs concurrently with everything else.
**Impact:** Flaky CI runs when multiple test classes execute simultaneously. Currently mitigated
by the `AdminEndpointHost` bind-failure-is-non-fatal path, so test assertions still pass — but
the coverage is degraded (admin is not actually up) and the log noise makes CI output harder to
read.
**Recommendation:** Replace the hard-coded `"8080"` values with `PickFreePort()` calls (the
same pattern already used by `AdminEndpointTests` and `HubStatusE2ETests`). Update
`TestHostBuilderExtensions.ConfigureForTest` to accept a port parameter with default 8080 for
callers that do not care (smoke tests do not probe the admin surface and an ephemeral-port bind
is fine).
---
**M3 — `DL205SimulatorFixture` swallows the test's own `CancellationToken` in the readiness poll, blocking test cancellation for up to 120 seconds.**
`tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs:121-163`.
`InitializeAsync` constructs a `CancellationTokenSource` with a hard-coded 120-second deadline
(line 61) and a `linked` token that chains the deadline against `CancellationToken.None` (line
122):
```csharp
using var linked = CancellationTokenSource.CreateLinkedTokenSource(
deadline.Token, CancellationToken.None);
```
The test framework's cancellation token — which xUnit v3 exposes as
`TestContext.Current.CancellationToken` — is never linked in. If the test runner cancels the
test suite (CI timeout, keyboard interrupt), the readiness poll continues for up to 120 more
seconds and the spawned Python process is not killed until `DisposeAsync` finally runs. On a CI
runner with a per-job timeout this can cause the whole runner to be killed mid-cleanup, leaving
orphaned Python processes.
**Impact:** Poor CI behaviour under cancellation; can leave zombie `python`/`pwsh` processes.
**Recommendation:** Link the fixture against the framework's application lifetime or accept a
`CancellationToken` through the `IAsyncLifetime` interface (xUnit v3 provides one via
`InitializeAsync(CancellationToken)`). Alternatively, use
`TestContext.Current.CancellationToken` from inside `InitializeAsync` if xUnit v3's fixture
`IAsyncLifetime` overload provides it:
```csharp
public async ValueTask InitializeAsync() // or the overload with CancellationToken
```
At minimum, change `CancellationToken.None` to the method's incoming token or
`TestContext.Current.CancellationToken` inside `linked` so a kill signal actually propagates.
---
**M4 — `install.ps1` copies the config template to `$dataDir` (ProgramData) but the `ReloadValidator` AdminPort=0 / `MbproxyOptionsValidator` gap means an operator using the template's `AdminPort: 0` disable hint starts a service with an undocumented ephemeral admin port.**
(Secondary impact of M1, listed separately because the install path is independently broken.)
`install/install.ps1:153-162`; `install/mbproxy.config.template.json:83-84`.
When `install.ps1` seeds the template as the initial config and an operator follows the inline
comment to set `AdminPort: 0`, the service starts with Kestrel binding on an OS-assigned
ephemeral port, not disabled. The `install.ps1` script itself has no bug — the error is upstream
in M1 — but the install path is where the misleading comment is first acted upon. Listing it
separately so the install script gets a concrete fix recommendation: if M1 is fixed by adding a
`port == 0 → skip` guard in `AdminEndpointHost`, the template comment becomes correct and this
finding is resolved simultaneously. If M1 is fixed differently, the template comment must still
be updated to match whatever the new semantics are.
**Impact:** Same as M1 — an unexpected open HTTP port on a "disabled" admin config.
**Recommendation:** Fix M1 first; this finding closes automatically once the template's
`AdminPort: 0` comment reflects actual behaviour.
---
## Minor
**m1 — `HubStatusE2ETests.PickFreePort` has a TOCTOU race: the port can be stolen between `l.Stop()` and Kestrel's bind.**
`tests/Mbproxy.Tests/Admin/HubStatusE2ETests.cs:141-147`.
The helper stops the listener and returns the port number — same pattern used in a dozen other
test files. This is acknowledged as acceptable in the `DL205SimulatorFixture` TOCTOU comment
(line 72). No change needed, but worth noting since `HubStatusE2ETests` is one of the few tests
that allocates two ports in the same method (admin + proxy) and has a wider race window. Low
priority; flagged for awareness.
---
**m2 — `Loop_PushesRepeatedly_ThenStopsAfterStopAsync` has a 400ms post-stop idle assertion that can be a source of slowness but not flakiness.**
`tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs:188-190`.
```csharp
await Task.Delay(400, TestContext.Current.CancellationToken);
h.Sink.FleetPushes.Count.ShouldBe(afterStop, "no pushes may occur after StopAsync");
```
The test correctly waits 400 ms after `StopAsync` to confirm silence. Given `AdminPushIntervalMs=100`
and that `StopAsync` awaits the loop task (`await _loop`), the loop is guaranteed terminated
before `StopAsync` returns — so zero extra delay is needed for the assertion to be correct. The
400 ms delay adds wall-clock time to every test run without buying additional confidence. The
existing `_loop` await in `StopAsync` is the real guarantee.
**Recommendation:** Drop the `Task.Delay(400, ...)` line and assert the count immediately after
`StopAsync`. The test would be equally correct and ~400 ms faster.
---
**m3 — `FakeGroupManager.Removed` list is still dead code: no test ever asserts it.**
`tests/Mbproxy.Tests/Admin/SignalRFakes.cs:30-31`.
This was noted as prior finding M4 in the 2026-05-15 review ("either delete `Removed` from the
fake, or add a comment"). The list is still present and still has no asserting test.
`StatusHub.OnDisconnectedAsync` never calls `RemoveFromGroupAsync` (relying on SignalR's
implicit group cleanup), so no assertion ever fires. The unasserted field is a minor maintenance
hazard: a future author may assume this is the intended assertion target and waste time debugging
a test that never fails.
**Recommendation:** Drop the `Removed` list from `FakeGroupManager`, or add a single inline
comment: `// SignalR auto-removes disconnected connections from groups; explicit RemoveFromGroupAsync is not called.`
---
**m4 — `mbproxy.smoke.config.json` still carries a dangling reference to `plans/`.**
`tests/sim/mbproxy.smoke.config.json:1`.
The header comment references `plans/2026-05-15-webui-dashboard.md`. As of commit `7466a46`
("retire superseded design/plan docs"), `plans/` is untracked (`?? plans/` in git status) and
contains no file of that name. The comment is a stale citation that will silently fail any
documentation-link check.
**Recommendation:** Replace the reference with a plain description of the smoke file's purpose,
removing the specific plan filename.
---
**m5 — Both install templates have `AdminPort: 8080` hard-coded in the `Plcs` section inline comment, but the comment already appears in the admin-port section — the duplication in the PLC entry comment is not wrong but potentially confusing.**
`install/mbproxy.config.template.json:79` (`"// GET /status.json → …"` inside the PLC
description); both templates are identical here.
This is a documentation nit: the `Plcs` array comment block on lines 6075 is accurate, and the
preceding inline comment about the 4-client cap remains true and relevant. No functional issue.
**Recommendation:** No change required; noted for completeness.
---
**m6 — `publish.ps1` and `publish.sh` do not exit non-zero if the config template copy fails.**
`install/publish.ps1:87-94`; `install/publish.sh:85-95`.
Both scripts call `throw` / `exit 1` if the template file is missing, which is correct. However
the `Copy-Item -Force` / `cp -f` steps that copy the template into each output flavour silently
succeed even if the destination directory does not exist (the directory was just created by
`dotnet publish`). If `dotnet publish` produced no output directory (a degenerate build failure
that `dotnet publish` should have already reported as non-zero), the copy also fails with an
error but neither script re-checks `$LASTEXITCODE` after the copy loop. In practice `dotnet publish`
failing before this point already causes both scripts to throw/exit (the preceding
`if ($LASTEXITCODE -ne 0) { throw }` guards). This is low-risk but the post-copy result
verification (the `Format-Size` / `du -h` summary block) is output-only and does not fail the
script if a binary is missing — it emits a `Write-Warning` / `echo WARNING` instead.
**Recommendation:** In the result summary loop of both scripts, change the missing-binary branch
from a warning to a non-zero exit:
```powershell
# publish.ps1
if (Test-Path $bin) { ... } else { throw "Expected binary not found: $bin" }
```
```bash
# publish.sh
if [[ -f "$bin" ]]; then ... else echo "ERROR: expected binary not found: $bin" >&2; exit 1; fi
```
---
**m7 — `mbproxy.service` `ReadWritePaths` does not include `/etc/mbproxy`, so a hot-reload write by the service account to its own config directory is blocked by `ProtectSystem=strict`.**
`install/mbproxy.service:40`.
```ini
ReadWritePaths=/var/log/mbproxy /var/cache/mbproxy
```
`ProtectSystem=strict` makes the entire filesystem read-only except paths listed in
`ReadWritePaths`. The service itself never writes to `/etc/mbproxy` (it only reads
`appsettings.json`), so this is not a runtime bug. However:
1. The comment at line 8 states `WorkingDirectory=/etc/mbproxy` — the service account needs
read access there, which `ProtectSystem=strict` provides (read-only is allowed).
2. If an operator's monitoring or deployment tooling updates the config in-place (e.g.
`scp`-ing a new `appsettings.json` as the `mbproxy` user via an SSH key), that write will
be blocked by `ProtectSystem=strict`. The operator would need to write as root and chown/chmod
as appropriate.
This is likely intentional (config is an admin operation), but the systemd unit has no comment
explaining why `/etc/mbproxy` is intentionally absent from `ReadWritePaths`. It trips up
operators who expect the service account to own its config.
**Recommendation:** Add a comment after the `ReadWritePaths` line: `# /etc/mbproxy is intentionally absent — config changes require root (ProtectSystem=strict makes it read-only for the service account).`
---
## Regression check against 2026-05-15 findings
| Prior finding | Status |
|---|---|
| M1 — `LoopAsync` untested | Closed — `Loop_PushesRepeatedly_ThenStopsAfterStopAsync` added |
| M2 — `Record`/`Disarm` race | Closed — production fix + `ConcurrentRecordAndDisarm_LeavesNoStaleObservation` |
| M3 — `PlcSubscriptionTracker` no direct/concurrent test | Closed — `PlcSubscriptionTrackerTests.cs` added with stress test |
| M4 — `FakeGroupManager.Removed` dead code | Partially addressed: no test added; `Removed` list is still present — see new `m3` |
| m1 — `ThrowingStatusPushSink` missing | Closed — `ThrowingStatusPushSink` and two swallow/propagate tests added |
| m2 — `FakeHubCallerContext.ConnectionAborted` hardcoded | No change; accepted as low-risk |
| m3 — asset allow-list coverage | Closed — `EmbeddedAssetsTests` covers glob; `Get_Asset` asserts bytes verbatim |
| m4 — `Get_Asset` byte assertion missing | Closed — `ReadEmbeddedAsset` comparison added |
| m5 — `*.*` glob caveats | Unchanged; accepted |
| m6 — `mbproxy.smoke.config.json` dangling plans reference | Not fixed — see new `m4` |
| m7 — `ConfigReconcilerTests` registry wiring untested | Closed — `Apply_AddThenRemovePlc_TagCaptureRegistryTracksRoster` added |
| m8 — `AdminPushIntervalMs` upper bound unguarded | Closed — upper bound added to both `ReloadValidator` and `MbproxyOptionsValidator`; four new tests |
| m9 — `OnDisconnectedAsync` always called with null | No change; accepted as low-risk |
| Coverage gap 7 — `/hub/status` SignalR E2E | Closed — `HubStatusE2ETests` (2 facts) added |
## Key file references
- `install/mbproxy.config.template.json:83`, `install/mbproxy.linux.config.template.json:87` — "Set to 0 to disable" comment (M1)
- `src/Mbproxy/Admin/AdminEndpointHost.cs:190``k.Listen(..., port)` with no port-0 guard (M1)
- `src/Mbproxy/Configuration/ReloadValidator.cs:65-68` — rejects `AdminPort < 1` on hot-reload but not at startup (M1)
- `tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs:67,242,310`, `RewriterE2ETests.cs:388`, `HostSmokeTests.cs:111` — hard-coded port 8080 (M2)
- `tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs:121-122``CancellationToken.None` swallows test-runner cancellation (M3)
- `tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs:188` — unnecessary 400 ms idle delay (m2)
- `tests/Mbproxy.Tests/Admin/SignalRFakes.cs:30-31``Removed` list never asserted (m3)
- `tests/sim/mbproxy.smoke.config.json:1` — dangling `plans/` reference (m4)
- `install/publish.ps1:103-110`, `install/publish.sh:99-106` — warning-only when expected binary missing (m6)
- `install/mbproxy.service:40``ReadWritePaths` silent about `/etc/mbproxy` exclusion (m7)