Files

T

Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings

Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 18:08:06 -04:00

18 KiB

Raw Blame History

Code Review — Tests, Install/Packaging, and Config (HEAD `0308490`)

Scope: tests/Mbproxy.Tests/ (all test code), tests/sim/ (simulator profile and fixture), install/ (all scripts and config templates), src/Mbproxy/Mbproxy.csproj, and tests/Mbproxy.Tests/Mbproxy.Tests.csproj. Prior-review findings from codereviews/2026-05-15/TestsAndConfig.md were read for regression context; the code is reviewed fresh against the current HEAD.

Summary

The prior-review remediation is genuine and thorough: every Major finding from the 2026-05-15 review has been addressed — StatusBroadcaster.LoopAsync now has a proper timing-hedged loop test (Loop_PushesRepeatedly_ThenStopsAfterStopAsync), the Record/Disarm race is fixed in production and covered by ConcurrentRecordAndDisarm_LeavesNoStaleObservation, PlcSubscriptionTracker now has a dedicated test file including a concurrency stress test, the ThrowingStatusPushSink fake covers both the swallow-path and the OperationCanceledException filter, and ConfigReconcilerTests now holds a real registry and asserts add/remove lifecycle. The HubStatusE2ETests E2E pair covers the MapHub/SignalRStatusPushSink gap that the prior review flagged as the single biggest "does the feature actually work" gap. The EmbeddedAssetsTests glob guard and the Get_Asset byte-verbatim assertion also close the prior m3/m4 nits. No prior-review finding has regressed. The remaining findings below are all new, introduced by the current HEAD or present in the pre-existing code but newly visible against the completed remediation baseline.

Findings: 0 Critical · 4 Major · 7 Minor

Critical

None.

Major

M1 — AdminPort=0 silently binds a random OS-assigned port instead of disabling the admin endpoint. install/mbproxy.config.template.json:83, install/mbproxy.linux.config.template.json:87, docs/Operations/Configuration.md:54; src/Mbproxy/Admin/AdminEndpointHost.cs:190.

Both config templates carry the comment // Set to 0 to disable the admin endpoint., and the Operations docs repeat this. The comment is wrong. AdminEndpointHost.StartAppAsync calls k.Listen(System.Net.IPAddress.Any, port) with whatever port is — Kestrel interprets port 0 as "pick a free OS-assigned port" (ephemeral port), not as "skip the bind". There is no code in StartAppAsync or its callers that special-cases port 0 as a disable signal. A production operator who sets AdminPort: 0 to disable the admin surface does not get a disabled endpoint; they get an admin endpoint on an unknown ephemeral port that is not logged in the startup banner in a predictable way. This is made worse by:

ReloadValidator.Validate (line 65–68) rejects AdminPort < 1, so a hot-reload with AdminPort: 0 is correctly blocked — but the initial startup path does not go through ReloadValidator (only MbproxyOptionsValidator, which does not range-check AdminPort). So port 0 at startup actually takes effect, while the same value in a reload is rejected. The two paths are inconsistent.
Several tests intentionally use ["Mbproxy:AdminPort"] = "0" with the comment "disable admin to avoid port conflicts" (ShutdownE2ETests.cs:163, StatusBroadcasterTests.cs:47, StatusSnapshotBuilderTests.cs:266). These tests are not broken — they work because Kestrel binds on 0 and no test probes the admin port — but they silently start an admin endpoint rather than suppressing it. If those tests ever run concurrently with another test that probes all listening ports they can interfere.

Impact: Operators following the documented AdminPort: 0 procedure to disable the admin surface on a sensitive network have an undocumented, unauthorised HTTP listener running. This is a documentation/behavioural inconsistency on a production-critical setting.

Recommendation: Implement the documented disable. In AdminEndpointHost.StartAppAsync, add if (port == 0) return; before the builder construction. Then update ReloadValidator to allow 0 (skip collision check) and update MbproxyOptionsValidator to allow 0. The templates and docs are then correct. Alternatively, remove the "Set to 0 to disable" comment and document that the admin endpoint is always started, and rename the test comments to reflect reality.

M2 — Multiple E2E tests hard-code AdminPort: 8080, causing bind-conflict failures when run in parallel or when port 8080 is already in use. tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs:67,242,310; tests/Mbproxy.Tests/Proxy/RewriterE2ETests.cs:388; tests/Mbproxy.Tests/HostSmokeTests.cs:111.

ProxyForwardingTests (3 configs), RewriterE2ETests (1 config), and the shared TestHostBuilderExtensions.ConfigureForTest used by all three HostSmokeTests hard-code ["Mbproxy:AdminPort"] = "8080". When these tests run in parallel (xUnit v3's default policy for non-collection tests) they each try to bind a Kestrel listener on port 8080. The first binder wins; all subsequent tests get a bind.failed log and a missing admin endpoint. While the proxy tests themselves do not assert admin-endpoint behaviour, the spurious bind-failed error can mask real failures in CI logs and reduces test isolation. HostSmokeTests compounds this: it is not in any [Collection] so it runs concurrently with everything else.

Impact: Flaky CI runs when multiple test classes execute simultaneously. Currently mitigated by the AdminEndpointHost bind-failure-is-non-fatal path, so test assertions still pass — but the coverage is degraded (admin is not actually up) and the log noise makes CI output harder to read.

Recommendation: Replace the hard-coded "8080" values with PickFreePort() calls (the same pattern already used by AdminEndpointTests and HubStatusE2ETests). Update TestHostBuilderExtensions.ConfigureForTest to accept a port parameter with default 8080 for callers that do not care (smoke tests do not probe the admin surface and an ephemeral-port bind is fine).

M3 — DL205SimulatorFixture swallows the test's own CancellationToken in the readiness poll, blocking test cancellation for up to 120 seconds. tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs:121-163.

InitializeAsync constructs a CancellationTokenSource with a hard-coded 120-second deadline (line 61) and a linked token that chains the deadline against CancellationToken.None (line 122):

using var linked = CancellationTokenSource.CreateLinkedTokenSource(
    deadline.Token, CancellationToken.None);

The test framework's cancellation token — which xUnit v3 exposes as TestContext.Current.CancellationToken — is never linked in. If the test runner cancels the test suite (CI timeout, keyboard interrupt), the readiness poll continues for up to 120 more seconds and the spawned Python process is not killed until DisposeAsync finally runs. On a CI runner with a per-job timeout this can cause the whole runner to be killed mid-cleanup, leaving orphaned Python processes.

Impact: Poor CI behaviour under cancellation; can leave zombie python/pwsh processes.

Recommendation: Link the fixture against the framework's application lifetime or accept a CancellationToken through the IAsyncLifetime interface (xUnit v3 provides one via InitializeAsync(CancellationToken)). Alternatively, use TestContext.Current.CancellationToken from inside InitializeAsync if xUnit v3's fixture IAsyncLifetime overload provides it:

public async ValueTask InitializeAsync()   // or the overload with CancellationToken

At minimum, change CancellationToken.None to the method's incoming token or TestContext.Current.CancellationToken inside linked so a kill signal actually propagates.

M4 — install.ps1 copies the config template to $dataDir (ProgramData) but the ReloadValidator AdminPort=0 / MbproxyOptionsValidator gap means an operator using the template's AdminPort: 0 disable hint starts a service with an undocumented ephemeral admin port. (Secondary impact of M1, listed separately because the install path is independently broken.) install/install.ps1:153-162; install/mbproxy.config.template.json:83-84.

When install.ps1 seeds the template as the initial config and an operator follows the inline comment to set AdminPort: 0, the service starts with Kestrel binding on an OS-assigned ephemeral port, not disabled. The install.ps1 script itself has no bug — the error is upstream in M1 — but the install path is where the misleading comment is first acted upon. Listing it separately so the install script gets a concrete fix recommendation: if M1 is fixed by adding a port == 0 → skip guard in AdminEndpointHost, the template comment becomes correct and this finding is resolved simultaneously. If M1 is fixed differently, the template comment must still be updated to match whatever the new semantics are.

Impact: Same as M1 — an unexpected open HTTP port on a "disabled" admin config.

Recommendation: Fix M1 first; this finding closes automatically once the template's AdminPort: 0 comment reflects actual behaviour.

Minor

m1 — HubStatusE2ETests.PickFreePort has a TOCTOU race: the port can be stolen between l.Stop() and Kestrel's bind. tests/Mbproxy.Tests/Admin/HubStatusE2ETests.cs:141-147.

The helper stops the listener and returns the port number — same pattern used in a dozen other test files. This is acknowledged as acceptable in the DL205SimulatorFixture TOCTOU comment (line 72). No change needed, but worth noting since HubStatusE2ETests is one of the few tests that allocates two ports in the same method (admin + proxy) and has a wider race window. Low priority; flagged for awareness.

m2 — Loop_PushesRepeatedly_ThenStopsAfterStopAsync has a 400ms post-stop idle assertion that can be a source of slowness but not flakiness. tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs:188-190.

await Task.Delay(400, TestContext.Current.CancellationToken);
h.Sink.FleetPushes.Count.ShouldBe(afterStop, "no pushes may occur after StopAsync");

The test correctly waits 400 ms after StopAsync to confirm silence. Given AdminPushIntervalMs=100 and that StopAsync awaits the loop task (await _loop), the loop is guaranteed terminated before StopAsync returns — so zero extra delay is needed for the assertion to be correct. The 400 ms delay adds wall-clock time to every test run without buying additional confidence. The existing _loop await in StopAsync is the real guarantee.

Recommendation: Drop the Task.Delay(400, ...) line and assert the count immediately after StopAsync. The test would be equally correct and ~400 ms faster.

m3 — FakeGroupManager.Removed list is still dead code: no test ever asserts it. tests/Mbproxy.Tests/Admin/SignalRFakes.cs:30-31.

This was noted as prior finding M4 in the 2026-05-15 review ("either delete Removed from the fake, or add a comment"). The list is still present and still has no asserting test. StatusHub.OnDisconnectedAsync never calls RemoveFromGroupAsync (relying on SignalR's implicit group cleanup), so no assertion ever fires. The unasserted field is a minor maintenance hazard: a future author may assume this is the intended assertion target and waste time debugging a test that never fails.

Recommendation: Drop the Removed list from FakeGroupManager, or add a single inline comment: // SignalR auto-removes disconnected connections from groups; explicit RemoveFromGroupAsync is not called.

m4 — mbproxy.smoke.config.json still carries a dangling reference to plans/. tests/sim/mbproxy.smoke.config.json:1.

The header comment references plans/2026-05-15-webui-dashboard.md. As of commit 7466a46 ("retire superseded design/plan docs"), plans/ is untracked (?? plans/ in git status) and contains no file of that name. The comment is a stale citation that will silently fail any documentation-link check.

Recommendation: Replace the reference with a plain description of the smoke file's purpose, removing the specific plan filename.

m5 — Both install templates have AdminPort: 8080 hard-coded in the Plcs section inline comment, but the comment already appears in the admin-port section — the duplication in the PLC entry comment is not wrong but potentially confusing. install/mbproxy.config.template.json:79 ("// GET /status.json → …" inside the PLC description); both templates are identical here.

This is a documentation nit: the Plcs array comment block on lines 60–75 is accurate, and the preceding inline comment about the 4-client cap remains true and relevant. No functional issue.

Recommendation: No change required; noted for completeness.

m6 — publish.ps1 and publish.sh do not exit non-zero if the config template copy fails. install/publish.ps1:87-94; install/publish.sh:85-95.

Both scripts call throw / exit 1 if the template file is missing, which is correct. However the Copy-Item -Force / cp -f steps that copy the template into each output flavour silently succeed even if the destination directory does not exist (the directory was just created by dotnet publish). If dotnet publish produced no output directory (a degenerate build failure that dotnet publish should have already reported as non-zero), the copy also fails with an error but neither script re-checks $LASTEXITCODE after the copy loop. In practice dotnet publish failing before this point already causes both scripts to throw/exit (the preceding if ($LASTEXITCODE -ne 0) { throw } guards). This is low-risk but the post-copy result verification (the Format-Size / du -h summary block) is output-only and does not fail the script if a binary is missing — it emits a Write-Warning / echo WARNING instead.

Recommendation: In the result summary loop of both scripts, change the missing-binary branch from a warning to a non-zero exit:

# publish.ps1
if (Test-Path $bin) { ... } else { throw "Expected binary not found: $bin" }

# publish.sh
if [[ -f "$bin" ]]; then ... else echo "ERROR: expected binary not found: $bin" >&2; exit 1; fi

m7 — mbproxy.service ReadWritePaths does not include /etc/mbproxy, so a hot-reload write by the service account to its own config directory is blocked by ProtectSystem=strict. install/mbproxy.service:40.

ReadWritePaths=/var/log/mbproxy /var/cache/mbproxy

ProtectSystem=strict makes the entire filesystem read-only except paths listed in ReadWritePaths. The service itself never writes to /etc/mbproxy (it only reads appsettings.json), so this is not a runtime bug. However:

The comment at line 8 states WorkingDirectory=/etc/mbproxy — the service account needs read access there, which ProtectSystem=strict provides (read-only is allowed).
If an operator's monitoring or deployment tooling updates the config in-place (e.g. scp-ing a new appsettings.json as the mbproxy user via an SSH key), that write will be blocked by ProtectSystem=strict. The operator would need to write as root and chown/chmod as appropriate.

This is likely intentional (config is an admin operation), but the systemd unit has no comment explaining why /etc/mbproxy is intentionally absent from ReadWritePaths. It trips up operators who expect the service account to own its config.

Recommendation: Add a comment after the ReadWritePaths line: # /etc/mbproxy is intentionally absent — config changes require root (ProtectSystem=strict makes it read-only for the service account).

Regression check against 2026-05-15 findings

Prior finding	Status
M1 — `LoopAsync` untested	Closed — `Loop_PushesRepeatedly_ThenStopsAfterStopAsync` added
M2 — `Record`/`Disarm` race	Closed — production fix + `ConcurrentRecordAndDisarm_LeavesNoStaleObservation`
M3 — `PlcSubscriptionTracker` no direct/concurrent test	Closed — `PlcSubscriptionTrackerTests.cs` added with stress test
M4 — `FakeGroupManager.Removed` dead code	Partially addressed: no test added; `Removed` list is still present — see new `m3`
m1 — `ThrowingStatusPushSink` missing	Closed — `ThrowingStatusPushSink` and two swallow/propagate tests added
m2 — `FakeHubCallerContext.ConnectionAborted` hardcoded	No change; accepted as low-risk
m3 — asset allow-list coverage	Closed — `EmbeddedAssetsTests` covers glob; `Get_Asset` asserts bytes verbatim
m4 — `Get_Asset` byte assertion missing	Closed — `ReadEmbeddedAsset` comparison added
m5 — `.` glob caveats	Unchanged; accepted
m6 — `mbproxy.smoke.config.json` dangling plans reference	Not fixed — see new `m4`
m7 — `ConfigReconcilerTests` registry wiring untested	Closed — `Apply_AddThenRemovePlc_TagCaptureRegistryTracksRoster` added
m8 — `AdminPushIntervalMs` upper bound unguarded	Closed — upper bound added to both `ReloadValidator` and `MbproxyOptionsValidator`; four new tests
m9 — `OnDisconnectedAsync` always called with null	No change; accepted as low-risk
Coverage gap 7 — `/hub/status` SignalR E2E	Closed — `HubStatusE2ETests` (2 facts) added

Key file references

install/mbproxy.config.template.json:83, install/mbproxy.linux.config.template.json:87 — "Set to 0 to disable" comment (M1)
src/Mbproxy/Admin/AdminEndpointHost.cs:190 — k.Listen(..., port) with no port-0 guard (M1)
src/Mbproxy/Configuration/ReloadValidator.cs:65-68 — rejects AdminPort < 1 on hot-reload but not at startup (M1)
tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs:67,242,310, RewriterE2ETests.cs:388, HostSmokeTests.cs:111 — hard-coded port 8080 (M2)
tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs:121-122 — CancellationToken.None swallows test-runner cancellation (M3)
tests/Mbproxy.Tests/Admin/StatusBroadcasterTests.cs:188 — unnecessary 400 ms idle delay (m2)
tests/Mbproxy.Tests/Admin/SignalRFakes.cs:30-31 — Removed list never asserted (m3)
tests/sim/mbproxy.smoke.config.json:1 — dangling plans/ reference (m4)
install/publish.ps1:103-110, install/publish.sh:99-106 — warning-only when expected binary missing (m6)
install/mbproxy.service:40 — ReadWritePaths silent about /etc/mbproxy exclusion (m7)

18 KiB Raw Blame History Unescape Escape

Code Review — Tests, Install/Packaging, and Config (HEAD 0308490)