# Code Review — Configuration, Options, Hosting, Diagnostics **Scope:** `src/Mbproxy/Configuration/`, `src/Mbproxy/Options/`, `src/Mbproxy/Diagnostics/`, `src/Mbproxy/HostingExtensions.cs`, `Program.cs`, `ServiceCounters.cs` (with `ProxyWorker.cs` and `PlcListenerSupervisor.cs` read for context). **Branch:** `mbproxy-webui-dashboard` @ `0308490`. **Date:** 2026-05-16. ## Summary The configuration/hosting subsystem is in good shape overall: the hot-reload reconciler is cleanly structured, the debounce loop and `_applySemaphore` serialization are correct and leak-free, the diagnostic-sink platform selection is well factored and unit-testable, and `ServiceCounters` uses `Interlocked` correctly. Prior-review remediations appear intact. The most consequential issue found is that `ReloadValidator` is **never invoked at startup** — it runs only inside `ConfigReconciler.ApplyUnderLockAsync` — so an `appsettings.json` that has duplicate `ListenPort`s, an `AdminPort` collision, a bad keepalive cross-field relationship, or a non-positive timeout will start the service in a broken state instead of failing fast, directly contradicting the documented contract. A second real bug: the `Restart` reconcile path removes a supervisor from the dictionary *before* rebuilding it, so a transient failure during rebuild silently drops the PLC with no listener and no recovery. The remaining findings are validation-completeness gaps and maintainability items. **Findings count:** Critical 0 · Major 5 · Minor 6 --- ## Major ### M1 — `ReloadValidator` never runs at startup; cross-PLC config errors are not fail-fast **Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:232` (only call site); `src/Mbproxy/Proxy/ProxyWorker.cs:91-180`; `src/Mbproxy/HostingExtensions.cs:21-43`; `docs/Operations/Configuration.md:284`. `ReloadValidator.Validate` is called from exactly one place — `ApplyUnderLockAsync` (`ConfigReconciler.cs:232`). The startup path is `ProxyWorker.ExecuteAsync`, which builds per-PLC tag maps directly via `BcdTagMapBuilder.Build` (`ProxyWorker.cs:101`) and never calls `ReloadValidator`. The only schema validation that runs at startup is `MbproxyOptionsValidator` via `.ValidateOnStart()` (`HostingExtensions.cs:21-26`), and that validator does **not** check the cross-PLC rules. Consequences at startup (none of these are caught): - **Duplicate `ListenPort`** across two PLCs — both supervisors try to bind the same port; one wins, the other enters the *infinite* `ListenerRecovery` loop forever. No fail-fast, no clear error. - **`AdminPort` colliding with a `ListenPort`** — same silent-conflict outcome. - **Duplicate PLC `Name`** — `ProxyWorker.cs:125` and `:179` write `_supervisors[plc.Name] = ...`, so the second entry silently overwrites the first; one configured PLC simply never gets a listener and its supervisor object leaks (created at `:165`, overwritten, never started/disposed). - **Bad keepalive cross-field rule** (`BackendHeartbeatIdleMs <= BackendRequestTimeoutMs`) — only enforced in `ReloadValidator` (`ReloadValidator.cs:163`), so a fresh install with this mistake produces a continuously-firing heartbeat. `docs/Operations/Configuration.md:284` explicitly states "`ReloadValidator.Validate` runs on every config load (startup and hot reload) … On rejection at startup, the service exits non-zero." That is false for the current code. **Impact:** A config mistake that should abort startup instead produces a half-working fleet that "looks" up. The most likely operational mistakes (port typos, copy-pasted PLC blocks) are exactly the ones not caught. **Recommendation:** Run `ReloadValidator.Validate` against `_options.CurrentValue` at the top of `ProxyWorker.ExecuteAsync` (before building any supervisor) and, on failure, log `mbproxy.config.reload.rejected` (or a startup-specific event) and fail the host (`throw` / `StopApplication`). Alternatively, fold the cross-PLC rules into `MbproxyOptionsValidator` so `.ValidateOnStart()` covers them. Either way, make startup and reload share one validation gate, and fix the doc. --- ### M2 — `Restart` reconcile path drops the PLC if the rebuild throws **File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:286-340`. In the restart branch, the old supervisor is removed from the dictionary first (`TryRemove(name, out var old)`, `:292`), then stopped/disposed, then a fresh context + supervisor is built (`:302-330`) and re-inserted (`:332`). The whole body is wrapped in `try { … } catch (Exception ex) { _logger.LogError(...) }` (`:335-339`). If anything between the `TryRemove` and the `_supervisors[name] = newSupervisor` re-insert throws — e.g. `BcdTagMapBuilder.Build` faults, `PolicyFactory`/`PerPlcContext` construction throws, an `OOM`, or `StopAsync` on the old supervisor throws something other than what it swallows — the catch logs and returns, but the dictionary no longer contains `name`. The PLC now has **no listener at all** and nothing ever recovers it: the next reload's `ReloadPlan.Compute` sees `name` present in both `current.Plcs` and `next.Plcs` (because `_currentOptions` is updated to `next` at `:437` regardless), so it lands in "unchanged" or `ToReseat`/`ToRestart` again only if options differ — an identical subsequent save produces *no* plan entry, leaving the PLC permanently dark until a manual config edit forces a restart. The same structural risk exists in the `Add` path (`:388-431`), but there it is benign: a failed add just means the PLC was never there to begin with, and a subsequent reload will re-`Add` it because it is still absent from `_currentOptions`. **Impact:** A transient fault during a hot-reload restart permanently removes a PLC from service with only an Error log line. For a 54-PLC fleet this is a silent single-line outage. **Recommendation:** Build the new context + supervisor *before* removing/stopping the old one (or into a local first), and only swap into the dictionary once construction succeeded. On a build failure, leave the old supervisor in place and surface the error. At minimum, in the `catch`, if `name` is no longer in `_supervisors`, re-queue the PLC for `Add` on the next pass or roll `_currentOptions[name]` back to the old value so the next `Compute` re-detects it. --- ### M3 — Startup-built supervisors and reconciler-built supervisors do not use the same `MaxParties` / coalescing live-config path consistently with the rest of `Resilience` **Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:281-330, 383-421`; `src/Mbproxy/Proxy/ProxyWorker.cs:138-141`. The backend-connect Polly pipeline is rebuilt inside `ApplyUnderLockAsync` from `next.Resilience.BackendConnect` (`:282-284`, `:384-386`) and the listener-recovery pipeline from `next.Resilience.ListenerRecovery` (`:314`, `:405`) — so add/restart supervisors *do* pick up reloaded `BackendConnect`/`ListenerRecovery` values. But existing supervisors that land in `ToReseat` (or in neither bucket) keep the pipelines built at startup (`ProxyWorker.cs:139`). A reload that changes only `Resilience.BackendConnect.MaxAttempts` or `Resilience.ListenerRecovery.*` therefore takes effect for *added/restarted* PLCs but silently does **not** propagate to the majority of PLCs that were not otherwise touched. The hot-reload propagation tables in `docs/Features/HotReload.md:45-58` and `docs/Operations/Configuration.md:423-435` list `ReadCoalescing` and `Backend*TimeoutMs` but say nothing about `Resilience.BackendConnect`/`ListenerRecovery` — operators will reasonably assume they hot-reload like everything else. **Impact:** Inconsistent reload semantics: the same key behaves differently depending on whether a PLC happened to be added/restarted in the same save. Hard to diagnose ("I changed `MaxAttempts`, three PLCs honour it, fifty-one don't"). **Recommendation:** Either (a) document explicitly that `Resilience.BackendConnect` and `Resilience.ListenerRecovery` require a service restart (like `AdminPort`), and stop rebuilding them in the reconciler so behaviour is uniform; or (b) make them genuinely hot-reloadable by threading a live accessor (as already done for `ReadCoalescing`/`Keepalive`) so all supervisors re-read them. Option (a) is the smaller, safer change. --- ### M4 — `MbproxyOptionsValidator` does not validate `ListenPort` range or `Host`, leaving binding edge cases uncaught at schema time **File:** `src/Mbproxy/Options/MbproxyOptions.cs:62-144`; `src/Mbproxy/Options/PlcOptions.cs`. `MbproxyOptionsValidator` validates `Width`, cache TTLs, cache-size knobs, connection timeouts, `AdminPushIntervalMs`, and keepalive ranges — but never `PlcOptions.ListenPort` (range/uniqueness), `PlcOptions.Name` (empty/unique), `PlcOptions.Host` (empty), `PlcOptions.Port` (range), or `AdminPort` (range/collision). Those live only in `ReloadValidator`. Given M1 (ReloadValidator not run at startup), this is the *only* validator on the startup path, so: - `ListenPort = 0` (the C# default when the key is omitted, see `PlcOptions.cs:6`) passes schema validation. The supervisor then tries to bind port 0, which the OS interprets as "pick an ephemeral port" — the listener binds a random port and clients can never reach it. No error is raised anywhere. - An empty `Host` produces a `SocketException`/`ArgumentException` only at first backend connect, surfacing as a recoverable runtime fault rather than a config rejection. `docs/Operations/Configuration.md:130-135` documents `Name`, `ListenPort`, `Host` as **required** — but nothing enforces "required" at bind time. **Impact:** A PLC block missing `ListenPort` or `Host` starts a useless listener with no diagnostic. Combined with M1, the only realistic guard (`ReloadValidator`) is never consulted at startup. **Recommendation:** Add `ListenPort ∈ [1,65535]`, non-empty `Name`, non-empty `Host`, and `Port ∈ [1,65535]` checks to `MbproxyOptionsValidator` (it already iterates `options.Plcs`), or — preferably — resolve M1 so `ReloadValidator` runs at startup and consider whether the two validators should be merged to remove the consistency burden entirely. --- ### M5 — `Resilience.BackendConnect.BackoffMs` length is never validated against `MaxAttempts` **Files:** `src/Mbproxy/Options/ResilienceOptions.cs:19-29`; `src/Mbproxy/Options/MbproxyOptions.cs:62-144`; `src/Mbproxy/Configuration/ReloadValidator.cs`; `src/Mbproxy/Proxy/Supervision/PolicyFactory.cs:43-75`. `docs/Operations/Configuration.md:220` states `BackoffMs` "Must have `MaxAttempts - 1` entries." Nothing enforces this. Neither `MbproxyOptionsValidator` nor `ReloadValidator` looks at `RetryProfile.MaxAttempts`, `BackoffMs`, `RecoveryProfile.InitialBackoffMs`, or `SteadyStateMs` at all. `PolicyFactory.BuildBackendConnect` (`:46`) does `Math.Max(1, MaxAttempts)` and clamps the backoff index to the last element (`:59-61`) — so it is *crash-safe*, but a config with `MaxAttempts = 5` and a single `BackoffMs` entry, or negative backoff values, silently produces behaviour that does not match the operator's intent (negative ms become `TimeSpan` that Polly may treat as zero/throw depending on version). `MaxAttempts = 0` is silently coerced to 1. **Impact:** Low-blast-radius but a documented invariant ("must have `MaxAttempts - 1` entries") is unenforced, and a negative backoff or zero `MaxAttempts` is accepted without warning. Doc and code disagree. **Recommendation:** Add a validation rule (in whichever validator survives M1's consolidation): `MaxAttempts >= 1`; `BackoffMs.Count >= MaxAttempts - 1`; every `BackoffMs`/`InitialBackoffMs` entry `>= 0`; `SteadyStateMs > 0`. Or, if the clamp-and-coerce behaviour is genuinely intended, relax the doc wording to match. --- ## Minor ### N1 — `ApplyUnderLockAsync` logs `mbproxy.config.reload.applied` and bumps the success counter even when every step failed **File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:335-443`. By design, a step that throws is caught and logged and the loop continues (`:335`, `:373`, `:426`), then step 7 unconditionally sets `_currentOptions = next`, calls `RecordReloadApplied`, and logs `applied` (`:437-441`). If, say, every restart in `ToRestart` threw, the operator sees `config.reloadCount` increment and an INFO `mbproxy.config.reload.applied` line — with no signal that the apply was partial or total-failure. The `Plcs*` counts in the event are *planned* counts (`:243-246`), not *achieved* counts. **Impact:** Status page and logs report a clean reload when reconciliation actually failed. Misleading during incident triage. **Recommendation:** Track per-step failures (a counter or list). If any step threw, log at `Warning` with a "partial" qualifier (or emit a distinct event) and either skip `RecordReloadApplied` or expose a `reloadPartialCount`. At minimum, include achieved-vs-planned counts in the event. ### N2 — Reseat step swaps `_currentOptions` membership assumptions but uses `next.Plcs.First(...)` which throws inside the loop **File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:353`. `var plcNew = next.Plcs.First(p => p.Name == name);` — `ReloadPlan.Compute` guarantees `name` exists in `next`, so this will not throw in practice, but `First` with no match throws `InvalidOperationException`, which the surrounding `catch` (`:373`) would log as a generic "Error reseating context". A `FirstOrDefault` with an explicit null-guard, or passing the `PlcOptions` through `ReloadPlan.ToReseat` alongside the map, would make the invariant explicit and the failure mode obvious. Minor robustness/clarity. **Recommendation:** Extend `ToReseat` tuple to `(string Name, PlcOptions New, BcdTagMap NewMap)` so the reconciler never re-resolves by name. ### N3 — `ComputeGlobalTagDelta` throws on a duplicate global address; `ReloadPlan.Compute` indexes PLCs with `ToDictionary` likewise **Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:465-466`; `src/Mbproxy/Configuration/ReloadPlan.cs:39-40, 83-84`. `before.Global.ToDictionary(t => t.Address)` (`:465`) throws `ArgumentException` if `BcdTags.Global` contains two entries at the same address. `BcdTagMapBuilder` detects duplicate addresses as a validation error, and `ReloadValidator` rejects such a snapshot — *but only `next`*. `ComputeGlobalTagDelta` is called with `_currentOptions.BcdTags` as `before` (`:249`); `_currentOptions` was validated when it became current, so this is safe today. It is, however, a latent fragility: any future path that sets `_currentOptions` without full validation, or a `Global` list with duplicate addresses that `BcdTagMapBuilder` happens not to flag, turns a cosmetic delta computation into an unhandled exception inside `ApplyUnderLockAsync` (caught only by the debounce-loop catch-all, aborting the whole apply). Same applies to `ReloadPlan.Compute`'s `ToDictionary(p => p.Name)` — safe only because `next` was validated first. **Recommendation:** Use `ToDictionary` with a last-wins lambda or `DistinctBy`, or wrap `ComputeGlobalTagDelta` defensively. It is a pure cosmetic counter — it should never be able to abort a reload. ### N4 — `EventLogBridge` caches `_sourceExists` at construction; a post-install source registration never takes effect, but the doc oversells "no per-event registry traffic" **File:** `src/Mbproxy/Diagnostics/EventLogBridge.cs:62-80`. Behaviour is correct and intentional (documented at `Configuration.md:100`). One nuance: `Emit` still wraps `EventLog.WriteEntry` in a `try/catch` (`:101-109`) that swallows everything — so if the source is deleted *after* startup, every Error+ event silently no-ops with zero diagnostics. That is acceptable for a logging sink (must never recurse/crash), but there is no `SelfLog.WriteLine` breadcrumb as `SyslogBridge` has (`SyslogBridge.cs:46`). For symmetry and debuggability, emit one `SelfLog` line the first time a write fails. **Recommendation:** Add a one-shot `SelfLog.WriteLine` on first write failure in `EventLogBridge.Emit` to match `SyslogBridge`'s degradation breadcrumb. ### N5 — `ConfigReconciler.Dispose` blocks up to 2 s synchronously on `_debounceLoop.Wait` **File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:485-501`. `Dispose` cancels `_disposalCts` and then `_debounceLoop.Wait(TimeSpan.FromSeconds(2))`. Since `ConfigReconciler` is a DI singleton, this runs on the host's shutdown path (`ServiceProvider.Dispose`). If an apply is mid-flight when shutdown begins, the loop will not observe cancellation until the current `ApplyAsync` returns (the apply's per-step CTS are linked to `ct` = `_disposalCts.Token`, so they *do* cancel — good), so 2 s is normally ample. But a hung supervisor `StopAsync` inside an apply could make `Dispose` block for the full 2 s and then *abandon* a still-running loop thread. Acceptable, but worth a comment that the 2 s is a hard cap and the loop may be abandoned; also consider implementing `IAsyncDisposable` so the host can await it cleanly rather than block a thread. **Recommendation:** Document the 2 s cap and abandonment behaviour, or convert to `IAsyncDisposable`. ### N6 — Doc drift: `Configuration.md:284` claims startup rejection, and the validator-ordering claims in HotReload.md are slightly off **Files:** `docs/Operations/Configuration.md:284, 314-318`; `docs/Features/HotReload.md:81, 95`. Beyond M1's headline doc bug: `HotReload.md:81` asserts "There is no second validator that could disagree with the first" and `:95` says "The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording." In reality there *are* two validators (`MbproxyOptionsValidator` and `ReloadValidator`) with overlapping-but-not-identical rules and different wording (e.g. `MbproxyOptionsValidator` says "exceeds the 60_000 ms safety cap" while `ReloadValidator` says "exceeds 60_000 ms without Cache.AllowLongTtl=true"). The "same error wording" claim is inaccurate. `Configuration.md:298` rule 7 ("Width … enforced by `MbproxyOptionsValidator`") correctly notes Width is *not* in `ReloadValidator` — which means a hot reload that somehow bound a bad Width would not be caught by `ReloadValidator` (it relies on `BcdTagMapBuilder` to reject it; verify that builder does). **Recommendation:** Once M1/M4 are resolved (ideally by consolidating to one validator), rewrite these doc passages to describe the actual single gate. If two validators remain, drop the "same error wording" claim. --- ## Items checked and found correct (no finding) - **Debounce loop** (`ConfigReconciler.cs:165-218`): the linked-CTS-per-iteration pattern is correct; `debounceCts` is `using`-scoped per loop iteration so no CTS leak; the `OperationCanceledException when (!ct.IsCancellationRequested)` filter correctly distinguishes window-elapsed from disposal. - **`_applySemaphore` serialization** (`:150-161`): correct `WaitAsync`/`try`/`finally`/`Release`; disposed once in `Dispose`. - **Channel** (`:73-77`): bounded capacity 1 with `DropOldest` is the right choice for a coalesced "something changed" signal; no unbounded growth. - **`OnChange` registration** (`:105-111`): non-blocking `TryWrite`, registration disposed in `Dispose` (`:487`). - **`ServiceCounters`** (`ServiceCounters.cs`): all reads/writes via `Interlocked`; the ticks-as-long pattern for `LastReloadUtc` is sound; `CompareExchange(ref x, 0, 0)` as a volatile read is correct. - **`DiagnosticSinkSelector.Select`** (`DiagnosticSinkSelector.cs:51-59`): pure, correct precedence (Windows wins, `isSystemd` only consulted off-Windows); the `OperatingSystem.IsWindows()` re-guard at `HostingExtensions.cs:101` correctly satisfies the platform analyzer for `[SupportedOSPlatform("windows")]` `EventLogBridge`. - **Graceful shutdown wiring** (`ProxyWorker.cs:288-351`): `GracefulShutdownTimeoutMs` read live from `CurrentValue`; in-flight snapshot taken before `base.StopAsync`; admin stopped last; supervisors disposed last. Ordering matches the documented contract. - **`ReloadPlan.Compute`** (`ReloadPlan.cs`): identity keyed on `Name`, `TagMapsEqual` includes `CacheTtlMs`, reseat-vs-restart split is correct. - **Reseat counter preservation** (`ConfigReconciler.cs:359`): `supervisor.CurrentCounters` passed into the new context — correct, matches `HotReload.md:117`.