Files
wwtools/mbproxy/codereviews/2026-05-16/ConfigAndHosting.md
Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings
Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:08:06 -04:00

20 KiB

Code Review — Configuration, Options, Hosting, Diagnostics

Scope: src/Mbproxy/Configuration/, src/Mbproxy/Options/, src/Mbproxy/Diagnostics/, src/Mbproxy/HostingExtensions.cs, Program.cs, ServiceCounters.cs (with ProxyWorker.cs and PlcListenerSupervisor.cs read for context). Branch: mbproxy-webui-dashboard @ 0308490. Date: 2026-05-16.

Summary

The configuration/hosting subsystem is in good shape overall: the hot-reload reconciler is cleanly structured, the debounce loop and _applySemaphore serialization are correct and leak-free, the diagnostic-sink platform selection is well factored and unit-testable, and ServiceCounters uses Interlocked correctly. Prior-review remediations appear intact. The most consequential issue found is that ReloadValidator is never invoked at startup — it runs only inside ConfigReconciler.ApplyUnderLockAsync — so an appsettings.json that has duplicate ListenPorts, an AdminPort collision, a bad keepalive cross-field relationship, or a non-positive timeout will start the service in a broken state instead of failing fast, directly contradicting the documented contract. A second real bug: the Restart reconcile path removes a supervisor from the dictionary before rebuilding it, so a transient failure during rebuild silently drops the PLC with no listener and no recovery. The remaining findings are validation-completeness gaps and maintainability items.

Findings count: Critical 0 · Major 5 · Minor 6


Major

M1 — ReloadValidator never runs at startup; cross-PLC config errors are not fail-fast

Files: src/Mbproxy/Configuration/ConfigReconciler.cs:232 (only call site); src/Mbproxy/Proxy/ProxyWorker.cs:91-180; src/Mbproxy/HostingExtensions.cs:21-43; docs/Operations/Configuration.md:284.

ReloadValidator.Validate is called from exactly one place — ApplyUnderLockAsync (ConfigReconciler.cs:232). The startup path is ProxyWorker.ExecuteAsync, which builds per-PLC tag maps directly via BcdTagMapBuilder.Build (ProxyWorker.cs:101) and never calls ReloadValidator. The only schema validation that runs at startup is MbproxyOptionsValidator via .ValidateOnStart() (HostingExtensions.cs:21-26), and that validator does not check the cross-PLC rules.

Consequences at startup (none of these are caught):

  • Duplicate ListenPort across two PLCs — both supervisors try to bind the same port; one wins, the other enters the infinite ListenerRecovery loop forever. No fail-fast, no clear error.
  • AdminPort colliding with a ListenPort — same silent-conflict outcome.
  • Duplicate PLC NameProxyWorker.cs:125 and :179 write _supervisors[plc.Name] = ..., so the second entry silently overwrites the first; one configured PLC simply never gets a listener and its supervisor object leaks (created at :165, overwritten, never started/disposed).
  • Bad keepalive cross-field rule (BackendHeartbeatIdleMs <= BackendRequestTimeoutMs) — only enforced in ReloadValidator (ReloadValidator.cs:163), so a fresh install with this mistake produces a continuously-firing heartbeat.

docs/Operations/Configuration.md:284 explicitly states "ReloadValidator.Validate runs on every config load (startup and hot reload) … On rejection at startup, the service exits non-zero." That is false for the current code.

Impact: A config mistake that should abort startup instead produces a half-working fleet that "looks" up. The most likely operational mistakes (port typos, copy-pasted PLC blocks) are exactly the ones not caught.

Recommendation: Run ReloadValidator.Validate against _options.CurrentValue at the top of ProxyWorker.ExecuteAsync (before building any supervisor) and, on failure, log mbproxy.config.reload.rejected (or a startup-specific event) and fail the host (throw / StopApplication). Alternatively, fold the cross-PLC rules into MbproxyOptionsValidator so .ValidateOnStart() covers them. Either way, make startup and reload share one validation gate, and fix the doc.


M2 — Restart reconcile path drops the PLC if the rebuild throws

File: src/Mbproxy/Configuration/ConfigReconciler.cs:286-340.

In the restart branch, the old supervisor is removed from the dictionary first (TryRemove(name, out var old), :292), then stopped/disposed, then a fresh context + supervisor is built (:302-330) and re-inserted (:332). The whole body is wrapped in try { … } catch (Exception ex) { _logger.LogError(...) } (:335-339).

If anything between the TryRemove and the _supervisors[name] = newSupervisor re-insert throws — e.g. BcdTagMapBuilder.Build faults, PolicyFactory/PerPlcContext construction throws, an OOM, or StopAsync on the old supervisor throws something other than what it swallows — the catch logs and returns, but the dictionary no longer contains name. The PLC now has no listener at all and nothing ever recovers it: the next reload's ReloadPlan.Compute sees name present in both current.Plcs and next.Plcs (because _currentOptions is updated to next at :437 regardless), so it lands in "unchanged" or ToReseat/ToRestart again only if options differ — an identical subsequent save produces no plan entry, leaving the PLC permanently dark until a manual config edit forces a restart.

The same structural risk exists in the Add path (:388-431), but there it is benign: a failed add just means the PLC was never there to begin with, and a subsequent reload will re-Add it because it is still absent from _currentOptions.

Impact: A transient fault during a hot-reload restart permanently removes a PLC from service with only an Error log line. For a 54-PLC fleet this is a silent single-line outage.

Recommendation: Build the new context + supervisor before removing/stopping the old one (or into a local first), and only swap into the dictionary once construction succeeded. On a build failure, leave the old supervisor in place and surface the error. At minimum, in the catch, if name is no longer in _supervisors, re-queue the PLC for Add on the next pass or roll _currentOptions[name] back to the old value so the next Compute re-detects it.


M3 — Startup-built supervisors and reconciler-built supervisors do not use the same MaxParties / coalescing live-config path consistently with the rest of Resilience

Files: src/Mbproxy/Configuration/ConfigReconciler.cs:281-330, 383-421; src/Mbproxy/Proxy/ProxyWorker.cs:138-141.

The backend-connect Polly pipeline is rebuilt inside ApplyUnderLockAsync from next.Resilience.BackendConnect (:282-284, :384-386) and the listener-recovery pipeline from next.Resilience.ListenerRecovery (:314, :405) — so add/restart supervisors do pick up reloaded BackendConnect/ListenerRecovery values. But existing supervisors that land in ToReseat (or in neither bucket) keep the pipelines built at startup (ProxyWorker.cs:139). A reload that changes only Resilience.BackendConnect.MaxAttempts or Resilience.ListenerRecovery.* therefore takes effect for added/restarted PLCs but silently does not propagate to the majority of PLCs that were not otherwise touched. The hot-reload propagation tables in docs/Features/HotReload.md:45-58 and docs/Operations/Configuration.md:423-435 list ReadCoalescing and Backend*TimeoutMs but say nothing about Resilience.BackendConnect/ListenerRecovery — operators will reasonably assume they hot-reload like everything else.

Impact: Inconsistent reload semantics: the same key behaves differently depending on whether a PLC happened to be added/restarted in the same save. Hard to diagnose ("I changed MaxAttempts, three PLCs honour it, fifty-one don't").

Recommendation: Either (a) document explicitly that Resilience.BackendConnect and Resilience.ListenerRecovery require a service restart (like AdminPort), and stop rebuilding them in the reconciler so behaviour is uniform; or (b) make them genuinely hot-reloadable by threading a live accessor (as already done for ReadCoalescing/Keepalive) so all supervisors re-read them. Option (a) is the smaller, safer change.


M4 — MbproxyOptionsValidator does not validate ListenPort range or Host, leaving binding edge cases uncaught at schema time

File: src/Mbproxy/Options/MbproxyOptions.cs:62-144; src/Mbproxy/Options/PlcOptions.cs.

MbproxyOptionsValidator validates Width, cache TTLs, cache-size knobs, connection timeouts, AdminPushIntervalMs, and keepalive ranges — but never PlcOptions.ListenPort (range/uniqueness), PlcOptions.Name (empty/unique), PlcOptions.Host (empty), PlcOptions.Port (range), or AdminPort (range/collision). Those live only in ReloadValidator. Given M1 (ReloadValidator not run at startup), this is the only validator on the startup path, so:

  • ListenPort = 0 (the C# default when the key is omitted, see PlcOptions.cs:6) passes schema validation. The supervisor then tries to bind port 0, which the OS interprets as "pick an ephemeral port" — the listener binds a random port and clients can never reach it. No error is raised anywhere.
  • An empty Host produces a SocketException/ArgumentException only at first backend connect, surfacing as a recoverable runtime fault rather than a config rejection.

docs/Operations/Configuration.md:130-135 documents Name, ListenPort, Host as required — but nothing enforces "required" at bind time.

Impact: A PLC block missing ListenPort or Host starts a useless listener with no diagnostic. Combined with M1, the only realistic guard (ReloadValidator) is never consulted at startup.

Recommendation: Add ListenPort ∈ [1,65535], non-empty Name, non-empty Host, and Port ∈ [1,65535] checks to MbproxyOptionsValidator (it already iterates options.Plcs), or — preferably — resolve M1 so ReloadValidator runs at startup and consider whether the two validators should be merged to remove the consistency burden entirely.


M5 — Resilience.BackendConnect.BackoffMs length is never validated against MaxAttempts

Files: src/Mbproxy/Options/ResilienceOptions.cs:19-29; src/Mbproxy/Options/MbproxyOptions.cs:62-144; src/Mbproxy/Configuration/ReloadValidator.cs; src/Mbproxy/Proxy/Supervision/PolicyFactory.cs:43-75.

docs/Operations/Configuration.md:220 states BackoffMs "Must have MaxAttempts - 1 entries." Nothing enforces this. Neither MbproxyOptionsValidator nor ReloadValidator looks at RetryProfile.MaxAttempts, BackoffMs, RecoveryProfile.InitialBackoffMs, or SteadyStateMs at all. PolicyFactory.BuildBackendConnect (:46) does Math.Max(1, MaxAttempts) and clamps the backoff index to the last element (:59-61) — so it is crash-safe, but a config with MaxAttempts = 5 and a single BackoffMs entry, or negative backoff values, silently produces behaviour that does not match the operator's intent (negative ms become TimeSpan that Polly may treat as zero/throw depending on version). MaxAttempts = 0 is silently coerced to 1.

Impact: Low-blast-radius but a documented invariant ("must have MaxAttempts - 1 entries") is unenforced, and a negative backoff or zero MaxAttempts is accepted without warning. Doc and code disagree.

Recommendation: Add a validation rule (in whichever validator survives M1's consolidation): MaxAttempts >= 1; BackoffMs.Count >= MaxAttempts - 1; every BackoffMs/InitialBackoffMs entry >= 0; SteadyStateMs > 0. Or, if the clamp-and-coerce behaviour is genuinely intended, relax the doc wording to match.


Minor

N1 — ApplyUnderLockAsync logs mbproxy.config.reload.applied and bumps the success counter even when every step failed

File: src/Mbproxy/Configuration/ConfigReconciler.cs:335-443.

By design, a step that throws is caught and logged and the loop continues (:335, :373, :426), then step 7 unconditionally sets _currentOptions = next, calls RecordReloadApplied, and logs applied (:437-441). If, say, every restart in ToRestart threw, the operator sees config.reloadCount increment and an INFO mbproxy.config.reload.applied line — with no signal that the apply was partial or total-failure. The Plcs* counts in the event are planned counts (:243-246), not achieved counts.

Impact: Status page and logs report a clean reload when reconciliation actually failed. Misleading during incident triage.

Recommendation: Track per-step failures (a counter or list). If any step threw, log at Warning with a "partial" qualifier (or emit a distinct event) and either skip RecordReloadApplied or expose a reloadPartialCount. At minimum, include achieved-vs-planned counts in the event.

N2 — Reseat step swaps _currentOptions membership assumptions but uses next.Plcs.First(...) which throws inside the loop

File: src/Mbproxy/Configuration/ConfigReconciler.cs:353.

var plcNew = next.Plcs.First(p => p.Name == name);ReloadPlan.Compute guarantees name exists in next, so this will not throw in practice, but First with no match throws InvalidOperationException, which the surrounding catch (:373) would log as a generic "Error reseating context". A FirstOrDefault with an explicit null-guard, or passing the PlcOptions through ReloadPlan.ToReseat alongside the map, would make the invariant explicit and the failure mode obvious. Minor robustness/clarity.

Recommendation: Extend ToReseat tuple to (string Name, PlcOptions New, BcdTagMap NewMap) so the reconciler never re-resolves by name.

N3 — ComputeGlobalTagDelta throws on a duplicate global address; ReloadPlan.Compute indexes PLCs with ToDictionary likewise

Files: src/Mbproxy/Configuration/ConfigReconciler.cs:465-466; src/Mbproxy/Configuration/ReloadPlan.cs:39-40, 83-84.

before.Global.ToDictionary(t => t.Address) (:465) throws ArgumentException if BcdTags.Global contains two entries at the same address. BcdTagMapBuilder detects duplicate addresses as a validation error, and ReloadValidator rejects such a snapshot — but only next. ComputeGlobalTagDelta is called with _currentOptions.BcdTags as before (:249); _currentOptions was validated when it became current, so this is safe today. It is, however, a latent fragility: any future path that sets _currentOptions without full validation, or a Global list with duplicate addresses that BcdTagMapBuilder happens not to flag, turns a cosmetic delta computation into an unhandled exception inside ApplyUnderLockAsync (caught only by the debounce-loop catch-all, aborting the whole apply). Same applies to ReloadPlan.Compute's ToDictionary(p => p.Name) — safe only because next was validated first.

Recommendation: Use ToDictionary with a last-wins lambda or DistinctBy, or wrap ComputeGlobalTagDelta defensively. It is a pure cosmetic counter — it should never be able to abort a reload.

N4 — EventLogBridge caches _sourceExists at construction; a post-install source registration never takes effect, but the doc oversells "no per-event registry traffic"

File: src/Mbproxy/Diagnostics/EventLogBridge.cs:62-80.

Behaviour is correct and intentional (documented at Configuration.md:100). One nuance: Emit still wraps EventLog.WriteEntry in a try/catch (:101-109) that swallows everything — so if the source is deleted after startup, every Error+ event silently no-ops with zero diagnostics. That is acceptable for a logging sink (must never recurse/crash), but there is no SelfLog.WriteLine breadcrumb as SyslogBridge has (SyslogBridge.cs:46). For symmetry and debuggability, emit one SelfLog line the first time a write fails.

Recommendation: Add a one-shot SelfLog.WriteLine on first write failure in EventLogBridge.Emit to match SyslogBridge's degradation breadcrumb.

N5 — ConfigReconciler.Dispose blocks up to 2 s synchronously on _debounceLoop.Wait

File: src/Mbproxy/Configuration/ConfigReconciler.cs:485-501.

Dispose cancels _disposalCts and then _debounceLoop.Wait(TimeSpan.FromSeconds(2)). Since ConfigReconciler is a DI singleton, this runs on the host's shutdown path (ServiceProvider.Dispose). If an apply is mid-flight when shutdown begins, the loop will not observe cancellation until the current ApplyAsync returns (the apply's per-step CTS are linked to ct = _disposalCts.Token, so they do cancel — good), so 2 s is normally ample. But a hung supervisor StopAsync inside an apply could make Dispose block for the full 2 s and then abandon a still-running loop thread. Acceptable, but worth a comment that the 2 s is a hard cap and the loop may be abandoned; also consider implementing IAsyncDisposable so the host can await it cleanly rather than block a thread.

Recommendation: Document the 2 s cap and abandonment behaviour, or convert to IAsyncDisposable.

N6 — Doc drift: Configuration.md:284 claims startup rejection, and the validator-ordering claims in HotReload.md are slightly off

Files: docs/Operations/Configuration.md:284, 314-318; docs/Features/HotReload.md:81, 95.

Beyond M1's headline doc bug: HotReload.md:81 asserts "There is no second validator that could disagree with the first" and :95 says "The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording." In reality there are two validators (MbproxyOptionsValidator and ReloadValidator) with overlapping-but-not-identical rules and different wording (e.g. MbproxyOptionsValidator says "exceeds the 60_000 ms safety cap" while ReloadValidator says "exceeds 60_000 ms without Cache.AllowLongTtl=true"). The "same error wording" claim is inaccurate. Configuration.md:298 rule 7 ("Width … enforced by MbproxyOptionsValidator") correctly notes Width is not in ReloadValidator — which means a hot reload that somehow bound a bad Width would not be caught by ReloadValidator (it relies on BcdTagMapBuilder to reject it; verify that builder does).

Recommendation: Once M1/M4 are resolved (ideally by consolidating to one validator), rewrite these doc passages to describe the actual single gate. If two validators remain, drop the "same error wording" claim.


Items checked and found correct (no finding)

  • Debounce loop (ConfigReconciler.cs:165-218): the linked-CTS-per-iteration pattern is correct; debounceCts is using-scoped per loop iteration so no CTS leak; the OperationCanceledException when (!ct.IsCancellationRequested) filter correctly distinguishes window-elapsed from disposal.
  • _applySemaphore serialization (:150-161): correct WaitAsync/try/finally/Release; disposed once in Dispose.
  • Channel (:73-77): bounded capacity 1 with DropOldest is the right choice for a coalesced "something changed" signal; no unbounded growth.
  • OnChange registration (:105-111): non-blocking TryWrite, registration disposed in Dispose (:487).
  • ServiceCounters (ServiceCounters.cs): all reads/writes via Interlocked; the ticks-as-long pattern for LastReloadUtc is sound; CompareExchange(ref x, 0, 0) as a volatile read is correct.
  • DiagnosticSinkSelector.Select (DiagnosticSinkSelector.cs:51-59): pure, correct precedence (Windows wins, isSystemd only consulted off-Windows); the OperatingSystem.IsWindows() re-guard at HostingExtensions.cs:101 correctly satisfies the platform analyzer for [SupportedOSPlatform("windows")] EventLogBridge.
  • Graceful shutdown wiring (ProxyWorker.cs:288-351): GracefulShutdownTimeoutMs read live from CurrentValue; in-flight snapshot taken before base.StopAsync; admin stopped last; supervisors disposed last. Ordering matches the documented contract.
  • ReloadPlan.Compute (ReloadPlan.cs): identity keyed on Name, TagMapsEqual includes CacheTtlMs, reseat-vs-restart split is correct.
  • Reseat counter preservation (ConfigReconciler.cs:359): supervisor.CurrentCounters passed into the new context — correct, matches HotReload.md:117.