Eight area-focused reviews (BCD rewriter, multiplexer, response cache, supervisor + hot-reload, admin + diagnostics, hosting + options, test suite) plus an Overview that prioritises findings across areas, and a RemediationPlan that groups the work into three waves with per-item file:line citations and regression-test sketches. Findings call out: hot-reload tag-list/cache changes that don't reach the running multiplexer, a coalescing factory leak that hangs late attachers, backend-reader head-of-line block on a wedged upstream, stranded outbound frames after cascade, and ShutdownCoordinator double-stop ordering. Plus the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB, not standard binary), unreachable BcdValidationError.DuplicateAddress, mbproxy.cache.flushed event that's defined but never emitted, and missing test coverage for Cache.AllowLongTtl. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9.0 KiB
Admin Endpoint & Diagnostics Subsystem — Code Review
Scope: src/Mbproxy/Admin/*, src/Mbproxy/Diagnostics/*, src/Mbproxy/{ServiceCounters,Proxy/ProxyCounters}.cs. Cross-checked against docs/design.md (Status page, Failure modes), docs/Architecture/Overview.md, docs/Operations/StatusPage.md, docs/Operations/Troubleshooting.md, docs/Reference/LogEvents.md. Tests under tests/Mbproxy.Tests/Admin/* + tests/Mbproxy.Tests/Diagnostics/*.
Summary
- Endpoint surface is correctly minimal (two read-only routes), HTML escaping is sound, JSON shape matches the design spec, and the overall code quality is high. The
Interlocked-only counter design is correct for x64. - The
ShutdownCoordinatorordering does not match its own doc —ProxyWorker.StopAsyncalready stops/disposes supervisors during host shutdown before theApplicationStoppingcallback runs the coordinator, so by the time the coordinator pollsInFlightCounteverything is already 0. - A subtle 32-bit
double/DateTimeOffsettorn-read issue exists inProxyCounters.UpdateRoundTripEwmaandServiceCounters.LastReloadUtcthat is benign on x64 but worth noting.
Critical findings
C1. ShutdownCoordinator never actually drains anything in production. Program.cs:60–66 registers the coordinator on ApplicationStopping, but ProxyWorker.StopAsync (src/Mbproxy/Proxy/ProxyWorker.cs:196–217) is IHostedService — the host runs StopAsync on every hosted service before firing ApplicationStopped. ApplicationStopping fires first, so the coordinator runs first — but the coordinator's "Step 1" calls supervisor.StopAsync again, which ProxyWorker.StopAsync will then call a second time. More importantly, the coordinator calls _adminEndpoint.StopAsync inside the lifetime callback, then the host immediately calls AdminEndpointHost.StopAsync again. Both supervisor and admin host get a redundant double-stop. The existing ShutdownE2ETests.cs:41 only asserts upstream socket closure, not that the drain actually occurred — so the gap is invisible from tests. Recommended: either (a) make ProxyWorker and AdminEndpointHost not implement IHostedService and let the coordinator own their lifecycle, or (b) move the drain logic into ProxyWorker.StopAsync and delete the coordinator.
Major findings
M1. Coordinator step 1's 5-second deadline is independent of GracefulShutdownTimeoutMs. src/Mbproxy/Diagnostics/ShutdownCoordinator.cs:147 hardcodes TimeSpan.FromSeconds(5) for the supervisor stop deadline. If GracefulShutdownTimeoutMs is set to e.g. 30s, supervisor stop will still time-bound at 5s. The doc comment at lines 76–84 implies the configured timeout governs the whole sequence.
M2. EWMA double field has tearing potential on 32-bit hosts. src/Mbproxy/Proxy/ProxyCounters.cs:410 reads Interlocked.Read(ref _lastRoundTripUsEwma) correctly, but the value is divided by 1000.0 to produce a double returned in the snapshot. The underlying long is read atomically; that part is fine on any host. However, LastReloadUtc in ServiceCounters.cs:31 constructs a new DateTimeOffset from Interlocked.Read'd ticks — also fine. Conclusion: no actual tearing risk on x64; the project is RuntimeIdentifier=win-x64 (csproj line 25). Worth a one-line comment in ProxyCounters.cs:192 noting "x64 only — long fields are atomic per ECMA-335 only at 8-byte alignment, and we ship win-x64."
M3. EventLogBridge.Emit calls EventLog.SourceExists on every write. src/Mbproxy/Diagnostics/EventLogBridge.cs:46 — this is a registry hit per Error+ event. Cache the result in a field after the first call (one-shot at construction) to avoid the registry round-trip on every error-level log line under fault storms.
M4. JSON DTO shape — pdus.invalidBcdWarnings and backendExceptionOther are tracked in CounterSnapshot but never surfaced. ProxyCounters.cs:218 increments _invalidBcdWarnings; Snapshot() returns it (line 399) but StatusSnapshotBuilder.cs:127–131 builds PlcPdusStatus without it. Same for BackendExceptionOther (ProxyCounters.cs:232, surfaced in snapshot at line 405, but not in ExceptionCounts at StatusDto.cs:102). These are the only two counters in code that don't appear on the wire. Either add invalidBcdWarnings to PlcPdusStatus and codeOther to ExceptionCounts, or remove the dead increment paths.
M5. JSON field naming policy may not apply to nested record parameter names under source-gen. StatusDto.cs:117 uses JsonKnownNamingPolicy.CamelCase on the source-gen context. The E2E test (AdminEndpointTests.cs:66, 91, 93) confirms uptimeSeconds, byFc, lastRoundTripMs are emitted in camelCase, so this is verified working. No action needed; flagged because it's a known source-gen gotcha worth keeping a test for.
Minor findings
N1. StatusSnapshotBuilder.cs:69 falls back to p.RemoteEp?.Address.ToString() after the ?.ToString() — unreachable. If p.RemoteEp is null the first ?. returns null; the second ?.ToString() also returns null → final "?". The middle expression p.RemoteEp?.Address.ToString() can never be reached because if p.RemoteEp is non-null, the first expression already produced a string. Simplify to p.RemoteEp?.ToString() ?? "?".
N2. Auto-refresh of 5 s × 50 simultaneous viewers = 10 req/s. Cheap (snapshot is in-memory, no I/O), but worth noting that the renderer allocates a fresh StringBuilder(4096) and List<PlcStatus>(opts.Plcs.Count) per request. For 54 PLCs and a sustained dashboard wall, that's ~10 ms of allocation per second — fine; flagging only as a future optimisation if memory profiling shows it.
N3. EventLogBridge.cs:57 truncation math is approximate — message.Length * 2 > MaxMessageBytes assumes 2 bytes per char (UTF-16), which is correct for the underlying string but the EventLog.WriteEntry API actually has a ~32 KB character limit, not byte. The current truncation is conservative (errs short) which is fine.
N4. AdminEndpointHost.cs:179 2-second stop deadline is hardcoded; not configurable. Acceptable but undocumented.
What looks good
- Bind-failure isolation.
AdminEndpointHost.cs:159–164correctly catches non-cancellation exceptions fromapp.StartAsync, logsmbproxy.admin.bind.failed(event ID 71), and continues with_app = null. The proxy listeners are completely independent, exactly as designed. The E2E test atAdminEndpointTests.cs:200–246validates this. - Hot-reload of
AdminPort.AdminEndpointHost.cs:63–89uses a semaphore + double-check pattern that correctly serializes concurrentOnChangecallbacks, with E2E coverage atAdminEndpointTests.cs:251–311. - HTML escaping is applied at every dynamic-content insertion site:
StatusHtmlRenderer.cs:51, 59, 96, 97, 108, 112, 128. The XSS attack surface (PLCName,Host, bind error messages, clientRemote) is covered. - JSON serialization uses source generation (
StatusJsonContext.Default.StatusResponseatAdminEndpointHost.cs:150) — AOT-friendly, no reflection. - Counter design is well thought through.
ProxyCounters.ObserveInFlight(line 307) uses CAS for the high-water mark;UpdateRoundTripEwmauses fixed-point microseconds-as-long for atomic CAS on the EWMA (line 352).ICacheStatsProviderandIMultiplexCountersProviderare clean abstractions that decouple the snapshot path from the multiplexer's lifecycle. AssemblyVersionAccessor.cs:19–23correctly usesGetCustomAttribute(works under single-file publish) with a sensible"0.0.0"fallback.Mbproxy.csproj:12sets<InformationalVersion>1.0.0</InformationalVersion>per CLAUDE.md.- Coordinator testability.
ShutdownCoordinator.cs:115second constructor acceptsISupervisorHandleandIAdminEndpointHandleabstractions, allowing the unit tests atShutdownCoordinatorTests.csto verify ordering, drain, timeout, and admin-stop without touching real Kestrel/listeners. EventLogBridge.cs:46correctly never tries to create the source —install.ps1:1registers it viaNew-EventLog -Source 'mbproxy' -LogName 'Application', and the bridge silently no-ops if the source is missing (line 46). Cannot crash the service.
Open questions
- C1 verification: does the existing
ShutdownE2ETests.E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeoutactually exercise the coordinator's drain path, or does it only verify host stop timing? The test atShutdownE2ETests.cs:113doesn't readmbproxy.shutdown.completeto confirm the coordinator actually ran. Worth adding an assertion. StatusSnapshotBuilder.cs:67materializesclientSnapshots.ToList()per snapshot. With 54 PLCs × ~4 clients each × 5-second refresh = 216 list allocations every 5s. Acceptable; flagging for memory-pressure dashboards.- The
ForwardingLoggerProvider(AdminEndpointHost.cs:218–224) creates a new logger per category butDispose()is a no-op — confirm this doesn't leak loggers across hot-reload re-binds (each re-bind builds a newWebApplicationand a new provider instance).