Files
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

53 lines
9.0 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Admin Endpoint & Diagnostics Subsystem — Code Review
Scope: `src/Mbproxy/Admin/*`, `src/Mbproxy/Diagnostics/*`, `src/Mbproxy/{ServiceCounters,Proxy/ProxyCounters}.cs`. Cross-checked against `docs/design.md` (Status page, Failure modes), `docs/Architecture/Overview.md`, `docs/Operations/StatusPage.md`, `docs/Operations/Troubleshooting.md`, `docs/Reference/LogEvents.md`. Tests under `tests/Mbproxy.Tests/Admin/*` + `tests/Mbproxy.Tests/Diagnostics/*`.
## Summary
- Endpoint surface is correctly minimal (two read-only routes), HTML escaping is sound, JSON shape matches the design spec, and the overall code quality is high. The `Interlocked`-only counter design is correct for x64.
- The `ShutdownCoordinator` ordering does **not** match its own doc — `ProxyWorker.StopAsync` already stops/disposes supervisors during host shutdown **before** the `ApplicationStopping` callback runs the coordinator, so by the time the coordinator polls `InFlightCount` everything is already 0.
- A subtle 32-bit `double`/`DateTimeOffset` torn-read issue exists in `ProxyCounters.UpdateRoundTripEwma` and `ServiceCounters.LastReloadUtc` that is benign on x64 but worth noting.
## Critical findings
**C1. ShutdownCoordinator never actually drains anything in production.** `Program.cs:6066` registers the coordinator on `ApplicationStopping`, but `ProxyWorker.StopAsync` (`src/Mbproxy/Proxy/ProxyWorker.cs:196217`) is `IHostedService` — the host runs `StopAsync` on every hosted service **before** firing `ApplicationStopped`. `ApplicationStopping` fires *first*, so the coordinator runs first — but the coordinator's "Step 1" calls `supervisor.StopAsync` again, which `ProxyWorker.StopAsync` will then call a *second* time. More importantly, the coordinator calls `_adminEndpoint.StopAsync` inside the lifetime callback, then the host immediately calls `AdminEndpointHost.StopAsync` again. Both supervisor and admin host get a redundant double-stop. The existing `ShutdownE2ETests.cs:41` only asserts upstream socket closure, not that the drain actually occurred — so the gap is invisible from tests. Recommended: either (a) make `ProxyWorker` and `AdminEndpointHost` *not* implement `IHostedService` and let the coordinator own their lifecycle, or (b) move the drain logic into `ProxyWorker.StopAsync` and delete the coordinator.
## Major findings
**M1. Coordinator step 1's 5-second deadline is independent of `GracefulShutdownTimeoutMs`.** `src/Mbproxy/Diagnostics/ShutdownCoordinator.cs:147` hardcodes `TimeSpan.FromSeconds(5)` for the supervisor stop deadline. If `GracefulShutdownTimeoutMs` is set to e.g. 30s, supervisor stop will still time-bound at 5s. The doc comment at lines 7684 implies the configured timeout governs the whole sequence.
**M2. EWMA double field has tearing potential on 32-bit hosts.** `src/Mbproxy/Proxy/ProxyCounters.cs:410` reads `Interlocked.Read(ref _lastRoundTripUsEwma)` correctly, but the value is divided by `1000.0` to produce a `double` returned in the snapshot. The underlying long is read atomically; that part is fine on any host. However, `LastReloadUtc` in `ServiceCounters.cs:31` constructs a new `DateTimeOffset` from `Interlocked.Read`'d ticks — also fine. **Conclusion: no actual tearing risk on x64**; the project is `RuntimeIdentifier=win-x64` (csproj line 25). Worth a one-line comment in `ProxyCounters.cs:192` noting "x64 only — long fields are atomic per ECMA-335 only at 8-byte alignment, and we ship win-x64."
**M3. `EventLogBridge.Emit` calls `EventLog.SourceExists` on every write.** `src/Mbproxy/Diagnostics/EventLogBridge.cs:46` — this is a registry hit per Error+ event. Cache the result in a field after the first call (one-shot at construction) to avoid the registry round-trip on every error-level log line under fault storms.
**M4. JSON DTO shape — `pdus.invalidBcdWarnings` and `backendExceptionOther` are tracked in `CounterSnapshot` but never surfaced.** `ProxyCounters.cs:218` increments `_invalidBcdWarnings`; `Snapshot()` returns it (line 399) but `StatusSnapshotBuilder.cs:127131` builds `PlcPdusStatus` without it. Same for `BackendExceptionOther` (`ProxyCounters.cs:232`, surfaced in snapshot at line 405, but not in `ExceptionCounts` at `StatusDto.cs:102`). These are the only two counters in code that don't appear on the wire. Either add `invalidBcdWarnings` to `PlcPdusStatus` and `codeOther` to `ExceptionCounts`, or remove the dead increment paths.
**M5. JSON field naming policy may not apply to nested `record` parameter names under source-gen.** `StatusDto.cs:117` uses `JsonKnownNamingPolicy.CamelCase` on the source-gen context. The E2E test (`AdminEndpointTests.cs:66, 91, 93`) confirms `uptimeSeconds`, `byFc`, `lastRoundTripMs` are emitted in camelCase, so this is verified working. No action needed; flagged because it's a known source-gen gotcha worth keeping a test for.
## Minor findings
**N1. `StatusSnapshotBuilder.cs:69` falls back to `p.RemoteEp?.Address.ToString()` after the `?.ToString()` — unreachable.** If `p.RemoteEp` is null the first `?.` returns null; the second `?.ToString()` also returns null → final `"?"`. The middle expression `p.RemoteEp?.Address.ToString()` can never be reached because if `p.RemoteEp` is non-null, the first expression already produced a string. Simplify to `p.RemoteEp?.ToString() ?? "?"`.
**N2. Auto-refresh of 5 s × 50 simultaneous viewers = 10 req/s.** Cheap (snapshot is in-memory, no I/O), but worth noting that the renderer allocates a fresh `StringBuilder(4096)` and `List<PlcStatus>(opts.Plcs.Count)` per request. For 54 PLCs and a sustained dashboard wall, that's ~10 ms of allocation per second — fine; flagging only as a future optimisation if memory profiling shows it.
**N3. `EventLogBridge.cs:57` truncation math is approximate**`message.Length * 2 > MaxMessageBytes` assumes 2 bytes per char (UTF-16), which is correct for the underlying string but the `EventLog.WriteEntry` API actually has a ~32 KB *character* limit, not byte. The current truncation is conservative (errs short) which is fine.
**N4. `AdminEndpointHost.cs:179` 2-second stop deadline is hardcoded; not configurable.** Acceptable but undocumented.
## What looks good
- **Bind-failure isolation.** `AdminEndpointHost.cs:159164` correctly catches non-cancellation exceptions from `app.StartAsync`, logs `mbproxy.admin.bind.failed` (event ID 71), and continues with `_app = null`. The proxy listeners are completely independent, exactly as designed. The E2E test at `AdminEndpointTests.cs:200246` validates this.
- **Hot-reload of `AdminPort`.** `AdminEndpointHost.cs:6389` uses a semaphore + double-check pattern that correctly serializes concurrent `OnChange` callbacks, with E2E coverage at `AdminEndpointTests.cs:251311`.
- **HTML escaping is applied at every dynamic-content insertion site:** `StatusHtmlRenderer.cs:51, 59, 96, 97, 108, 112, 128`. The XSS attack surface (PLC `Name`, `Host`, bind error messages, client `Remote`) is covered.
- **JSON serialization uses source generation** (`StatusJsonContext.Default.StatusResponse` at `AdminEndpointHost.cs:150`) — AOT-friendly, no reflection.
- **Counter design is well thought through.** `ProxyCounters.ObserveInFlight` (line 307) uses CAS for the high-water mark; `UpdateRoundTripEwma` uses fixed-point microseconds-as-long for atomic CAS on the EWMA (line 352). `ICacheStatsProvider` and `IMultiplexCountersProvider` are clean abstractions that decouple the snapshot path from the multiplexer's lifecycle.
- **`AssemblyVersionAccessor.cs:1923`** correctly uses `GetCustomAttribute` (works under single-file publish) with a sensible `"0.0.0"` fallback. `Mbproxy.csproj:12` sets `<InformationalVersion>1.0.0</InformationalVersion>` per CLAUDE.md.
- **Coordinator testability.** `ShutdownCoordinator.cs:115` second constructor accepts `ISupervisorHandle` and `IAdminEndpointHandle` abstractions, allowing the unit tests at `ShutdownCoordinatorTests.cs` to verify ordering, drain, timeout, and admin-stop without touching real Kestrel/listeners.
- **`EventLogBridge.cs:46`** correctly never tries to *create* the source — `install.ps1:1` registers it via `New-EventLog -Source 'mbproxy' -LogName 'Application'`, and the bridge silently no-ops if the source is missing (line 46). Cannot crash the service.
## Open questions
1. **C1 verification:** does the existing `ShutdownE2ETests.E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout` actually exercise the coordinator's drain path, or does it only verify host stop timing? The test at `ShutdownE2ETests.cs:113` doesn't read `mbproxy.shutdown.complete` to confirm the coordinator actually ran. Worth adding an assertion.
2. `StatusSnapshotBuilder.cs:67` materializes `clientSnapshots.ToList()` per snapshot. With 54 PLCs × ~4 clients each × 5-second refresh = 216 list allocations every 5s. Acceptable; flagging for memory-pressure dashboards.
3. The `ForwardingLoggerProvider` (`AdminEndpointHost.cs:218224`) creates a new logger per category but `Dispose()` is a no-op — confirm this doesn't leak loggers across hot-reload re-binds (each re-bind builds a new `WebApplication` and a new provider instance).