Files
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

9.0 KiB
Raw Permalink Blame History

Admin Endpoint & Diagnostics Subsystem — Code Review

Scope: src/Mbproxy/Admin/*, src/Mbproxy/Diagnostics/*, src/Mbproxy/{ServiceCounters,Proxy/ProxyCounters}.cs. Cross-checked against docs/design.md (Status page, Failure modes), docs/Architecture/Overview.md, docs/Operations/StatusPage.md, docs/Operations/Troubleshooting.md, docs/Reference/LogEvents.md. Tests under tests/Mbproxy.Tests/Admin/* + tests/Mbproxy.Tests/Diagnostics/*.

Summary

  • Endpoint surface is correctly minimal (two read-only routes), HTML escaping is sound, JSON shape matches the design spec, and the overall code quality is high. The Interlocked-only counter design is correct for x64.
  • The ShutdownCoordinator ordering does not match its own doc — ProxyWorker.StopAsync already stops/disposes supervisors during host shutdown before the ApplicationStopping callback runs the coordinator, so by the time the coordinator polls InFlightCount everything is already 0.
  • A subtle 32-bit double/DateTimeOffset torn-read issue exists in ProxyCounters.UpdateRoundTripEwma and ServiceCounters.LastReloadUtc that is benign on x64 but worth noting.

Critical findings

C1. ShutdownCoordinator never actually drains anything in production. Program.cs:6066 registers the coordinator on ApplicationStopping, but ProxyWorker.StopAsync (src/Mbproxy/Proxy/ProxyWorker.cs:196217) is IHostedService — the host runs StopAsync on every hosted service before firing ApplicationStopped. ApplicationStopping fires first, so the coordinator runs first — but the coordinator's "Step 1" calls supervisor.StopAsync again, which ProxyWorker.StopAsync will then call a second time. More importantly, the coordinator calls _adminEndpoint.StopAsync inside the lifetime callback, then the host immediately calls AdminEndpointHost.StopAsync again. Both supervisor and admin host get a redundant double-stop. The existing ShutdownE2ETests.cs:41 only asserts upstream socket closure, not that the drain actually occurred — so the gap is invisible from tests. Recommended: either (a) make ProxyWorker and AdminEndpointHost not implement IHostedService and let the coordinator own their lifecycle, or (b) move the drain logic into ProxyWorker.StopAsync and delete the coordinator.

Major findings

M1. Coordinator step 1's 5-second deadline is independent of GracefulShutdownTimeoutMs. src/Mbproxy/Diagnostics/ShutdownCoordinator.cs:147 hardcodes TimeSpan.FromSeconds(5) for the supervisor stop deadline. If GracefulShutdownTimeoutMs is set to e.g. 30s, supervisor stop will still time-bound at 5s. The doc comment at lines 7684 implies the configured timeout governs the whole sequence.

M2. EWMA double field has tearing potential on 32-bit hosts. src/Mbproxy/Proxy/ProxyCounters.cs:410 reads Interlocked.Read(ref _lastRoundTripUsEwma) correctly, but the value is divided by 1000.0 to produce a double returned in the snapshot. The underlying long is read atomically; that part is fine on any host. However, LastReloadUtc in ServiceCounters.cs:31 constructs a new DateTimeOffset from Interlocked.Read'd ticks — also fine. Conclusion: no actual tearing risk on x64; the project is RuntimeIdentifier=win-x64 (csproj line 25). Worth a one-line comment in ProxyCounters.cs:192 noting "x64 only — long fields are atomic per ECMA-335 only at 8-byte alignment, and we ship win-x64."

M3. EventLogBridge.Emit calls EventLog.SourceExists on every write. src/Mbproxy/Diagnostics/EventLogBridge.cs:46 — this is a registry hit per Error+ event. Cache the result in a field after the first call (one-shot at construction) to avoid the registry round-trip on every error-level log line under fault storms.

M4. JSON DTO shape — pdus.invalidBcdWarnings and backendExceptionOther are tracked in CounterSnapshot but never surfaced. ProxyCounters.cs:218 increments _invalidBcdWarnings; Snapshot() returns it (line 399) but StatusSnapshotBuilder.cs:127131 builds PlcPdusStatus without it. Same for BackendExceptionOther (ProxyCounters.cs:232, surfaced in snapshot at line 405, but not in ExceptionCounts at StatusDto.cs:102). These are the only two counters in code that don't appear on the wire. Either add invalidBcdWarnings to PlcPdusStatus and codeOther to ExceptionCounts, or remove the dead increment paths.

M5. JSON field naming policy may not apply to nested record parameter names under source-gen. StatusDto.cs:117 uses JsonKnownNamingPolicy.CamelCase on the source-gen context. The E2E test (AdminEndpointTests.cs:66, 91, 93) confirms uptimeSeconds, byFc, lastRoundTripMs are emitted in camelCase, so this is verified working. No action needed; flagged because it's a known source-gen gotcha worth keeping a test for.

Minor findings

N1. StatusSnapshotBuilder.cs:69 falls back to p.RemoteEp?.Address.ToString() after the ?.ToString() — unreachable. If p.RemoteEp is null the first ?. returns null; the second ?.ToString() also returns null → final "?". The middle expression p.RemoteEp?.Address.ToString() can never be reached because if p.RemoteEp is non-null, the first expression already produced a string. Simplify to p.RemoteEp?.ToString() ?? "?".

N2. Auto-refresh of 5 s × 50 simultaneous viewers = 10 req/s. Cheap (snapshot is in-memory, no I/O), but worth noting that the renderer allocates a fresh StringBuilder(4096) and List<PlcStatus>(opts.Plcs.Count) per request. For 54 PLCs and a sustained dashboard wall, that's ~10 ms of allocation per second — fine; flagging only as a future optimisation if memory profiling shows it.

N3. EventLogBridge.cs:57 truncation math is approximatemessage.Length * 2 > MaxMessageBytes assumes 2 bytes per char (UTF-16), which is correct for the underlying string but the EventLog.WriteEntry API actually has a ~32 KB character limit, not byte. The current truncation is conservative (errs short) which is fine.

N4. AdminEndpointHost.cs:179 2-second stop deadline is hardcoded; not configurable. Acceptable but undocumented.

What looks good

  • Bind-failure isolation. AdminEndpointHost.cs:159164 correctly catches non-cancellation exceptions from app.StartAsync, logs mbproxy.admin.bind.failed (event ID 71), and continues with _app = null. The proxy listeners are completely independent, exactly as designed. The E2E test at AdminEndpointTests.cs:200246 validates this.
  • Hot-reload of AdminPort. AdminEndpointHost.cs:6389 uses a semaphore + double-check pattern that correctly serializes concurrent OnChange callbacks, with E2E coverage at AdminEndpointTests.cs:251311.
  • HTML escaping is applied at every dynamic-content insertion site: StatusHtmlRenderer.cs:51, 59, 96, 97, 108, 112, 128. The XSS attack surface (PLC Name, Host, bind error messages, client Remote) is covered.
  • JSON serialization uses source generation (StatusJsonContext.Default.StatusResponse at AdminEndpointHost.cs:150) — AOT-friendly, no reflection.
  • Counter design is well thought through. ProxyCounters.ObserveInFlight (line 307) uses CAS for the high-water mark; UpdateRoundTripEwma uses fixed-point microseconds-as-long for atomic CAS on the EWMA (line 352). ICacheStatsProvider and IMultiplexCountersProvider are clean abstractions that decouple the snapshot path from the multiplexer's lifecycle.
  • AssemblyVersionAccessor.cs:1923 correctly uses GetCustomAttribute (works under single-file publish) with a sensible "0.0.0" fallback. Mbproxy.csproj:12 sets <InformationalVersion>1.0.0</InformationalVersion> per CLAUDE.md.
  • Coordinator testability. ShutdownCoordinator.cs:115 second constructor accepts ISupervisorHandle and IAdminEndpointHandle abstractions, allowing the unit tests at ShutdownCoordinatorTests.cs to verify ordering, drain, timeout, and admin-stop without touching real Kestrel/listeners.
  • EventLogBridge.cs:46 correctly never tries to create the source — install.ps1:1 registers it via New-EventLog -Source 'mbproxy' -LogName 'Application', and the bridge silently no-ops if the source is missing (line 46). Cannot crash the service.

Open questions

  1. C1 verification: does the existing ShutdownE2ETests.E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout actually exercise the coordinator's drain path, or does it only verify host stop timing? The test at ShutdownE2ETests.cs:113 doesn't read mbproxy.shutdown.complete to confirm the coordinator actually ran. Worth adding an assertion.
  2. StatusSnapshotBuilder.cs:67 materializes clientSnapshots.ToList() per snapshot. With 54 PLCs × ~4 clients each × 5-second refresh = 216 list allocations every 5s. Acceptable; flagging for memory-pressure dashboards.
  3. The ForwardingLoggerProvider (AdminEndpointHost.cs:218224) creates a new logger per category but Dispose() is a no-op — confirm this doesn't leak loggers across hot-reload re-binds (each re-bind builds a new WebApplication and a new provider instance).