f2c6669444
Eight area-focused reviews (BCD rewriter, multiplexer, response cache, supervisor + hot-reload, admin + diagnostics, hosting + options, test suite) plus an Overview that prioritises findings across areas, and a RemediationPlan that groups the work into three waves with per-item file:line citations and regression-test sketches. Findings call out: hot-reload tag-list/cache changes that don't reach the running multiplexer, a coalescing factory leak that hangs late attachers, backend-reader head-of-line block on a wedged upstream, stranded outbound frames after cascade, and ShutdownCoordinator double-stop ordering. Plus the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB, not standard binary), unreachable BcdValidationError.DuplicateAddress, mbproxy.cache.flushed event that's defined but never emitted, and missing test coverage for Cache.AllowLongTtl. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
53 lines
9.0 KiB
Markdown
53 lines
9.0 KiB
Markdown
# Admin Endpoint & Diagnostics Subsystem — Code Review
|
||
|
||
Scope: `src/Mbproxy/Admin/*`, `src/Mbproxy/Diagnostics/*`, `src/Mbproxy/{ServiceCounters,Proxy/ProxyCounters}.cs`. Cross-checked against `docs/design.md` (Status page, Failure modes), `docs/Architecture/Overview.md`, `docs/Operations/StatusPage.md`, `docs/Operations/Troubleshooting.md`, `docs/Reference/LogEvents.md`. Tests under `tests/Mbproxy.Tests/Admin/*` + `tests/Mbproxy.Tests/Diagnostics/*`.
|
||
|
||
## Summary
|
||
|
||
- Endpoint surface is correctly minimal (two read-only routes), HTML escaping is sound, JSON shape matches the design spec, and the overall code quality is high. The `Interlocked`-only counter design is correct for x64.
|
||
- The `ShutdownCoordinator` ordering does **not** match its own doc — `ProxyWorker.StopAsync` already stops/disposes supervisors during host shutdown **before** the `ApplicationStopping` callback runs the coordinator, so by the time the coordinator polls `InFlightCount` everything is already 0.
|
||
- A subtle 32-bit `double`/`DateTimeOffset` torn-read issue exists in `ProxyCounters.UpdateRoundTripEwma` and `ServiceCounters.LastReloadUtc` that is benign on x64 but worth noting.
|
||
|
||
## Critical findings
|
||
|
||
**C1. ShutdownCoordinator never actually drains anything in production.** `Program.cs:60–66` registers the coordinator on `ApplicationStopping`, but `ProxyWorker.StopAsync` (`src/Mbproxy/Proxy/ProxyWorker.cs:196–217`) is `IHostedService` — the host runs `StopAsync` on every hosted service **before** firing `ApplicationStopped`. `ApplicationStopping` fires *first*, so the coordinator runs first — but the coordinator's "Step 1" calls `supervisor.StopAsync` again, which `ProxyWorker.StopAsync` will then call a *second* time. More importantly, the coordinator calls `_adminEndpoint.StopAsync` inside the lifetime callback, then the host immediately calls `AdminEndpointHost.StopAsync` again. Both supervisor and admin host get a redundant double-stop. The existing `ShutdownE2ETests.cs:41` only asserts upstream socket closure, not that the drain actually occurred — so the gap is invisible from tests. Recommended: either (a) make `ProxyWorker` and `AdminEndpointHost` *not* implement `IHostedService` and let the coordinator own their lifecycle, or (b) move the drain logic into `ProxyWorker.StopAsync` and delete the coordinator.
|
||
|
||
## Major findings
|
||
|
||
**M1. Coordinator step 1's 5-second deadline is independent of `GracefulShutdownTimeoutMs`.** `src/Mbproxy/Diagnostics/ShutdownCoordinator.cs:147` hardcodes `TimeSpan.FromSeconds(5)` for the supervisor stop deadline. If `GracefulShutdownTimeoutMs` is set to e.g. 30s, supervisor stop will still time-bound at 5s. The doc comment at lines 76–84 implies the configured timeout governs the whole sequence.
|
||
|
||
**M2. EWMA double field has tearing potential on 32-bit hosts.** `src/Mbproxy/Proxy/ProxyCounters.cs:410` reads `Interlocked.Read(ref _lastRoundTripUsEwma)` correctly, but the value is divided by `1000.0` to produce a `double` returned in the snapshot. The underlying long is read atomically; that part is fine on any host. However, `LastReloadUtc` in `ServiceCounters.cs:31` constructs a new `DateTimeOffset` from `Interlocked.Read`'d ticks — also fine. **Conclusion: no actual tearing risk on x64**; the project is `RuntimeIdentifier=win-x64` (csproj line 25). Worth a one-line comment in `ProxyCounters.cs:192` noting "x64 only — long fields are atomic per ECMA-335 only at 8-byte alignment, and we ship win-x64."
|
||
|
||
**M3. `EventLogBridge.Emit` calls `EventLog.SourceExists` on every write.** `src/Mbproxy/Diagnostics/EventLogBridge.cs:46` — this is a registry hit per Error+ event. Cache the result in a field after the first call (one-shot at construction) to avoid the registry round-trip on every error-level log line under fault storms.
|
||
|
||
**M4. JSON DTO shape — `pdus.invalidBcdWarnings` and `backendExceptionOther` are tracked in `CounterSnapshot` but never surfaced.** `ProxyCounters.cs:218` increments `_invalidBcdWarnings`; `Snapshot()` returns it (line 399) but `StatusSnapshotBuilder.cs:127–131` builds `PlcPdusStatus` without it. Same for `BackendExceptionOther` (`ProxyCounters.cs:232`, surfaced in snapshot at line 405, but not in `ExceptionCounts` at `StatusDto.cs:102`). These are the only two counters in code that don't appear on the wire. Either add `invalidBcdWarnings` to `PlcPdusStatus` and `codeOther` to `ExceptionCounts`, or remove the dead increment paths.
|
||
|
||
**M5. JSON field naming policy may not apply to nested `record` parameter names under source-gen.** `StatusDto.cs:117` uses `JsonKnownNamingPolicy.CamelCase` on the source-gen context. The E2E test (`AdminEndpointTests.cs:66, 91, 93`) confirms `uptimeSeconds`, `byFc`, `lastRoundTripMs` are emitted in camelCase, so this is verified working. No action needed; flagged because it's a known source-gen gotcha worth keeping a test for.
|
||
|
||
## Minor findings
|
||
|
||
**N1. `StatusSnapshotBuilder.cs:69` falls back to `p.RemoteEp?.Address.ToString()` after the `?.ToString()` — unreachable.** If `p.RemoteEp` is null the first `?.` returns null; the second `?.ToString()` also returns null → final `"?"`. The middle expression `p.RemoteEp?.Address.ToString()` can never be reached because if `p.RemoteEp` is non-null, the first expression already produced a string. Simplify to `p.RemoteEp?.ToString() ?? "?"`.
|
||
|
||
**N2. Auto-refresh of 5 s × 50 simultaneous viewers = 10 req/s.** Cheap (snapshot is in-memory, no I/O), but worth noting that the renderer allocates a fresh `StringBuilder(4096)` and `List<PlcStatus>(opts.Plcs.Count)` per request. For 54 PLCs and a sustained dashboard wall, that's ~10 ms of allocation per second — fine; flagging only as a future optimisation if memory profiling shows it.
|
||
|
||
**N3. `EventLogBridge.cs:57` truncation math is approximate** — `message.Length * 2 > MaxMessageBytes` assumes 2 bytes per char (UTF-16), which is correct for the underlying string but the `EventLog.WriteEntry` API actually has a ~32 KB *character* limit, not byte. The current truncation is conservative (errs short) which is fine.
|
||
|
||
**N4. `AdminEndpointHost.cs:179` 2-second stop deadline is hardcoded; not configurable.** Acceptable but undocumented.
|
||
|
||
## What looks good
|
||
|
||
- **Bind-failure isolation.** `AdminEndpointHost.cs:159–164` correctly catches non-cancellation exceptions from `app.StartAsync`, logs `mbproxy.admin.bind.failed` (event ID 71), and continues with `_app = null`. The proxy listeners are completely independent, exactly as designed. The E2E test at `AdminEndpointTests.cs:200–246` validates this.
|
||
- **Hot-reload of `AdminPort`.** `AdminEndpointHost.cs:63–89` uses a semaphore + double-check pattern that correctly serializes concurrent `OnChange` callbacks, with E2E coverage at `AdminEndpointTests.cs:251–311`.
|
||
- **HTML escaping is applied at every dynamic-content insertion site:** `StatusHtmlRenderer.cs:51, 59, 96, 97, 108, 112, 128`. The XSS attack surface (PLC `Name`, `Host`, bind error messages, client `Remote`) is covered.
|
||
- **JSON serialization uses source generation** (`StatusJsonContext.Default.StatusResponse` at `AdminEndpointHost.cs:150`) — AOT-friendly, no reflection.
|
||
- **Counter design is well thought through.** `ProxyCounters.ObserveInFlight` (line 307) uses CAS for the high-water mark; `UpdateRoundTripEwma` uses fixed-point microseconds-as-long for atomic CAS on the EWMA (line 352). `ICacheStatsProvider` and `IMultiplexCountersProvider` are clean abstractions that decouple the snapshot path from the multiplexer's lifecycle.
|
||
- **`AssemblyVersionAccessor.cs:19–23`** correctly uses `GetCustomAttribute` (works under single-file publish) with a sensible `"0.0.0"` fallback. `Mbproxy.csproj:12` sets `<InformationalVersion>1.0.0</InformationalVersion>` per CLAUDE.md.
|
||
- **Coordinator testability.** `ShutdownCoordinator.cs:115` second constructor accepts `ISupervisorHandle` and `IAdminEndpointHandle` abstractions, allowing the unit tests at `ShutdownCoordinatorTests.cs` to verify ordering, drain, timeout, and admin-stop without touching real Kestrel/listeners.
|
||
- **`EventLogBridge.cs:46`** correctly never tries to *create* the source — `install.ps1:1` registers it via `New-EventLog -Source 'mbproxy' -LogName 'Application'`, and the bridge silently no-ops if the source is missing (line 46). Cannot crash the service.
|
||
|
||
## Open questions
|
||
|
||
1. **C1 verification:** does the existing `ShutdownE2ETests.E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout` actually exercise the coordinator's drain path, or does it only verify host stop timing? The test at `ShutdownE2ETests.cs:113` doesn't read `mbproxy.shutdown.complete` to confirm the coordinator actually ran. Worth adding an assertion.
|
||
2. `StatusSnapshotBuilder.cs:67` materializes `clientSnapshots.ToList()` per snapshot. With 54 PLCs × ~4 clients each × 5-second refresh = 216 list allocations every 5s. Acceptable; flagging for memory-pressure dashboards.
|
||
3. The `ForwardingLoggerProvider` (`AdminEndpointHost.cs:218–224`) creates a new logger per category but `Dispose()` is a no-op — confirm this doesn't leak loggers across hot-reload re-binds (each re-bind builds a new `WebApplication` and a new provider instance).
|