# Code Review — Configuration, Options, Hosting, Diagnostics

**Scope:** `src/Mbproxy/Configuration/`, `src/Mbproxy/Options/`, `src/Mbproxy/Diagnostics/`, `src/Mbproxy/HostingExtensions.cs`, `Program.cs`, `ServiceCounters.cs` (with `ProxyWorker.cs` and `PlcListenerSupervisor.cs` read for context).
**Branch:** `mbproxy-webui-dashboard` @ `0308490`.
**Date:** 2026-05-16.

## Summary

The configuration/hosting subsystem is in good shape overall: the hot-reload reconciler is cleanly structured, the debounce loop and `_applySemaphore` serialization are correct and leak-free, the diagnostic-sink platform selection is well factored and unit-testable, and `ServiceCounters` uses `Interlocked` correctly. Prior-review remediations appear intact. The most consequential issue found is that `ReloadValidator` is **never invoked at startup** — it runs only inside `ConfigReconciler.ApplyUnderLockAsync` — so an `appsettings.json` that has duplicate `ListenPort`s, an `AdminPort` collision, a bad keepalive cross-field relationship, or a non-positive timeout will start the service in a broken state instead of failing fast, directly contradicting the documented contract. A second real bug: the `Restart` reconcile path removes a supervisor from the dictionary *before* rebuilding it, so a transient failure during rebuild silently drops the PLC with no listener and no recovery. The remaining findings are validation-completeness gaps and maintainability items.

**Findings count:** Critical 0 · Major 5 · Minor 6

---

## Major

### M1 — `ReloadValidator` never runs at startup; cross-PLC config errors are not fail-fast

**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:232` (only call site); `src/Mbproxy/Proxy/ProxyWorker.cs:91-180`; `src/Mbproxy/HostingExtensions.cs:21-43`; `docs/Operations/Configuration.md:284`.

`ReloadValidator.Validate` is called from exactly one place — `ApplyUnderLockAsync` (`ConfigReconciler.cs:232`). The startup path is `ProxyWorker.ExecuteAsync`, which builds per-PLC tag maps directly via `BcdTagMapBuilder.Build` (`ProxyWorker.cs:101`) and never calls `ReloadValidator`. The only schema validation that runs at startup is `MbproxyOptionsValidator` via `.ValidateOnStart()` (`HostingExtensions.cs:21-26`), and that validator does **not** check the cross-PLC rules.

Consequences at startup (none of these are caught):
- **Duplicate `ListenPort`** across two PLCs — both supervisors try to bind the same port; one wins, the other enters the *infinite* `ListenerRecovery` loop forever. No fail-fast, no clear error.
- **`AdminPort` colliding with a `ListenPort`** — same silent-conflict outcome.
- **Duplicate PLC `Name`** — `ProxyWorker.cs:125` and `:179` write `_supervisors[plc.Name] = ...`, so the second entry silently overwrites the first; one configured PLC simply never gets a listener and its supervisor object leaks (created at `:165`, overwritten, never started/disposed).
- **Bad keepalive cross-field rule** (`BackendHeartbeatIdleMs <= BackendRequestTimeoutMs`) — only enforced in `ReloadValidator` (`ReloadValidator.cs:163`), so a fresh install with this mistake produces a continuously-firing heartbeat.

`docs/Operations/Configuration.md:284` explicitly states "`ReloadValidator.Validate` runs on every config load (startup and hot reload) … On rejection at startup, the service exits non-zero." That is false for the current code.

**Impact:** A config mistake that should abort startup instead produces a half-working fleet that "looks" up. The most likely operational mistakes (port typos, copy-pasted PLC blocks) are exactly the ones not caught.

**Recommendation:** Run `ReloadValidator.Validate` against `_options.CurrentValue` at the top of `ProxyWorker.ExecuteAsync` (before building any supervisor) and, on failure, log `mbproxy.config.reload.rejected` (or a startup-specific event) and fail the host (`throw` / `StopApplication`). Alternatively, fold the cross-PLC rules into `MbproxyOptionsValidator` so `.ValidateOnStart()` covers them. Either way, make startup and reload share one validation gate, and fix the doc.

---

### M2 — `Restart` reconcile path drops the PLC if the rebuild throws

**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:286-340`.

In the restart branch, the old supervisor is removed from the dictionary first (`TryRemove(name, out var old)`, `:292`), then stopped/disposed, then a fresh context + supervisor is built (`:302-330`) and re-inserted (`:332`). The whole body is wrapped in `try { … } catch (Exception ex) { _logger.LogError(...) }` (`:335-339`).

If anything between the `TryRemove` and the `_supervisors[name] = newSupervisor` re-insert throws — e.g. `BcdTagMapBuilder.Build` faults, `PolicyFactory`/`PerPlcContext` construction throws, an `OOM`, or `StopAsync` on the old supervisor throws something other than what it swallows — the catch logs and returns, but the dictionary no longer contains `name`. The PLC now has **no listener at all** and nothing ever recovers it: the next reload's `ReloadPlan.Compute` sees `name` present in both `current.Plcs` and `next.Plcs` (because `_currentOptions` is updated to `next` at `:437` regardless), so it lands in "unchanged" or `ToReseat`/`ToRestart` again only if options differ — an identical subsequent save produces *no* plan entry, leaving the PLC permanently dark until a manual config edit forces a restart.

The same structural risk exists in the `Add` path (`:388-431`), but there it is benign: a failed add just means the PLC was never there to begin with, and a subsequent reload will re-`Add` it because it is still absent from `_currentOptions`.

**Impact:** A transient fault during a hot-reload restart permanently removes a PLC from service with only an Error log line. For a 54-PLC fleet this is a silent single-line outage.

**Recommendation:** Build the new context + supervisor *before* removing/stopping the old one (or into a local first), and only swap into the dictionary once construction succeeded. On a build failure, leave the old supervisor in place and surface the error. At minimum, in the `catch`, if `name` is no longer in `_supervisors`, re-queue the PLC for `Add` on the next pass or roll `_currentOptions[name]` back to the old value so the next `Compute` re-detects it.

---

### M3 — Startup-built supervisors and reconciler-built supervisors do not use the same `MaxParties` / coalescing live-config path consistently with the rest of `Resilience`

**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:281-330, 383-421`; `src/Mbproxy/Proxy/ProxyWorker.cs:138-141`.

The backend-connect Polly pipeline is rebuilt inside `ApplyUnderLockAsync` from `next.Resilience.BackendConnect` (`:282-284`, `:384-386`) and the listener-recovery pipeline from `next.Resilience.ListenerRecovery` (`:314`, `:405`) — so add/restart supervisors *do* pick up reloaded `BackendConnect`/`ListenerRecovery` values. But existing supervisors that land in `ToReseat` (or in neither bucket) keep the pipelines built at startup (`ProxyWorker.cs:139`). A reload that changes only `Resilience.BackendConnect.MaxAttempts` or `Resilience.ListenerRecovery.*` therefore takes effect for *added/restarted* PLCs but silently does **not** propagate to the majority of PLCs that were not otherwise touched. The hot-reload propagation tables in `docs/Features/HotReload.md:45-58` and `docs/Operations/Configuration.md:423-435` list `ReadCoalescing` and `Backend*TimeoutMs` but say nothing about `Resilience.BackendConnect`/`ListenerRecovery` — operators will reasonably assume they hot-reload like everything else.

**Impact:** Inconsistent reload semantics: the same key behaves differently depending on whether a PLC happened to be added/restarted in the same save. Hard to diagnose ("I changed `MaxAttempts`, three PLCs honour it, fifty-one don't").

**Recommendation:** Either (a) document explicitly that `Resilience.BackendConnect` and `Resilience.ListenerRecovery` require a service restart (like `AdminPort`), and stop rebuilding them in the reconciler so behaviour is uniform; or (b) make them genuinely hot-reloadable by threading a live accessor (as already done for `ReadCoalescing`/`Keepalive`) so all supervisors re-read them. Option (a) is the smaller, safer change.

---

### M4 — `MbproxyOptionsValidator` does not validate `ListenPort` range or `Host`, leaving binding edge cases uncaught at schema time

**File:** `src/Mbproxy/Options/MbproxyOptions.cs:62-144`; `src/Mbproxy/Options/PlcOptions.cs`.

`MbproxyOptionsValidator` validates `Width`, cache TTLs, cache-size knobs, connection timeouts, `AdminPushIntervalMs`, and keepalive ranges — but never `PlcOptions.ListenPort` (range/uniqueness), `PlcOptions.Name` (empty/unique), `PlcOptions.Host` (empty), `PlcOptions.Port` (range), or `AdminPort` (range/collision). Those live only in `ReloadValidator`. Given M1 (ReloadValidator not run at startup), this is the *only* validator on the startup path, so:
- `ListenPort = 0` (the C# default when the key is omitted, see `PlcOptions.cs:6`) passes schema validation. The supervisor then tries to bind port 0, which the OS interprets as "pick an ephemeral port" — the listener binds a random port and clients can never reach it. No error is raised anywhere.
- An empty `Host` produces a `SocketException`/`ArgumentException` only at first backend connect, surfacing as a recoverable runtime fault rather than a config rejection.

`docs/Operations/Configuration.md:130-135` documents `Name`, `ListenPort`, `Host` as **required** — but nothing enforces "required" at bind time.

**Impact:** A PLC block missing `ListenPort` or `Host` starts a useless listener with no diagnostic. Combined with M1, the only realistic guard (`ReloadValidator`) is never consulted at startup.

**Recommendation:** Add `ListenPort ∈ [1,65535]`, non-empty `Name`, non-empty `Host`, and `Port ∈ [1,65535]` checks to `MbproxyOptionsValidator` (it already iterates `options.Plcs`), or — preferably — resolve M1 so `ReloadValidator` runs at startup and consider whether the two validators should be merged to remove the consistency burden entirely.

---

### M5 — `Resilience.BackendConnect.BackoffMs` length is never validated against `MaxAttempts`

**Files:** `src/Mbproxy/Options/ResilienceOptions.cs:19-29`; `src/Mbproxy/Options/MbproxyOptions.cs:62-144`; `src/Mbproxy/Configuration/ReloadValidator.cs`; `src/Mbproxy/Proxy/Supervision/PolicyFactory.cs:43-75`.

`docs/Operations/Configuration.md:220` states `BackoffMs` "Must have `MaxAttempts - 1` entries." Nothing enforces this. Neither `MbproxyOptionsValidator` nor `ReloadValidator` looks at `RetryProfile.MaxAttempts`, `BackoffMs`, `RecoveryProfile.InitialBackoffMs`, or `SteadyStateMs` at all. `PolicyFactory.BuildBackendConnect` (`:46`) does `Math.Max(1, MaxAttempts)` and clamps the backoff index to the last element (`:59-61`) — so it is *crash-safe*, but a config with `MaxAttempts = 5` and a single `BackoffMs` entry, or negative backoff values, silently produces behaviour that does not match the operator's intent (negative ms become `TimeSpan` that Polly may treat as zero/throw depending on version). `MaxAttempts = 0` is silently coerced to 1.

**Impact:** Low-blast-radius but a documented invariant ("must have `MaxAttempts - 1` entries") is unenforced, and a negative backoff or zero `MaxAttempts` is accepted without warning. Doc and code disagree.

**Recommendation:** Add a validation rule (in whichever validator survives M1's consolidation): `MaxAttempts >= 1`; `BackoffMs.Count >= MaxAttempts - 1`; every `BackoffMs`/`InitialBackoffMs` entry `>= 0`; `SteadyStateMs > 0`. Or, if the clamp-and-coerce behaviour is genuinely intended, relax the doc wording to match.

---

## Minor

### N1 — `ApplyUnderLockAsync` logs `mbproxy.config.reload.applied` and bumps the success counter even when every step failed

**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:335-443`.

By design, a step that throws is caught and logged and the loop continues (`:335`, `:373`, `:426`), then step 7 unconditionally sets `_currentOptions = next`, calls `RecordReloadApplied`, and logs `applied` (`:437-441`). If, say, every restart in `ToRestart` threw, the operator sees `config.reloadCount` increment and an INFO `mbproxy.config.reload.applied` line — with no signal that the apply was partial or total-failure. The `Plcs*` counts in the event are *planned* counts (`:243-246`), not *achieved* counts.

**Impact:** Status page and logs report a clean reload when reconciliation actually failed. Misleading during incident triage.

**Recommendation:** Track per-step failures (a counter or list). If any step threw, log at `Warning` with a "partial" qualifier (or emit a distinct event) and either skip `RecordReloadApplied` or expose a `reloadPartialCount`. At minimum, include achieved-vs-planned counts in the event.

### N2 — Reseat step swaps `_currentOptions` membership assumptions but uses `next.Plcs.First(...)` which throws inside the loop

**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:353`.

`var plcNew = next.Plcs.First(p => p.Name == name);` — `ReloadPlan.Compute` guarantees `name` exists in `next`, so this will not throw in practice, but `First` with no match throws `InvalidOperationException`, which the surrounding `catch` (`:373`) would log as a generic "Error reseating context". A `FirstOrDefault` with an explicit null-guard, or passing the `PlcOptions` through `ReloadPlan.ToReseat` alongside the map, would make the invariant explicit and the failure mode obvious. Minor robustness/clarity.

**Recommendation:** Extend `ToReseat` tuple to `(string Name, PlcOptions New, BcdTagMap NewMap)` so the reconciler never re-resolves by name.

### N3 — `ComputeGlobalTagDelta` throws on a duplicate global address; `ReloadPlan.Compute` indexes PLCs with `ToDictionary` likewise

**Files:** `src/Mbproxy/Configuration/ConfigReconciler.cs:465-466`; `src/Mbproxy/Configuration/ReloadPlan.cs:39-40, 83-84`.

`before.Global.ToDictionary(t => t.Address)` (`:465`) throws `ArgumentException` if `BcdTags.Global` contains two entries at the same address. `BcdTagMapBuilder` detects duplicate addresses as a validation error, and `ReloadValidator` rejects such a snapshot — *but only `next`*. `ComputeGlobalTagDelta` is called with `_currentOptions.BcdTags` as `before` (`:249`); `_currentOptions` was validated when it became current, so this is safe today. It is, however, a latent fragility: any future path that sets `_currentOptions` without full validation, or a `Global` list with duplicate addresses that `BcdTagMapBuilder` happens not to flag, turns a cosmetic delta computation into an unhandled exception inside `ApplyUnderLockAsync` (caught only by the debounce-loop catch-all, aborting the whole apply). Same applies to `ReloadPlan.Compute`'s `ToDictionary(p => p.Name)` — safe only because `next` was validated first.

**Recommendation:** Use `ToDictionary` with a last-wins lambda or `DistinctBy`, or wrap `ComputeGlobalTagDelta` defensively. It is a pure cosmetic counter — it should never be able to abort a reload.

### N4 — `EventLogBridge` caches `_sourceExists` at construction; a post-install source registration never takes effect, but the doc oversells "no per-event registry traffic"

**File:** `src/Mbproxy/Diagnostics/EventLogBridge.cs:62-80`.

Behaviour is correct and intentional (documented at `Configuration.md:100`). One nuance: `Emit` still wraps `EventLog.WriteEntry` in a `try/catch` (`:101-109`) that swallows everything — so if the source is deleted *after* startup, every Error+ event silently no-ops with zero diagnostics. That is acceptable for a logging sink (must never recurse/crash), but there is no `SelfLog.WriteLine` breadcrumb as `SyslogBridge` has (`SyslogBridge.cs:46`). For symmetry and debuggability, emit one `SelfLog` line the first time a write fails.

**Recommendation:** Add a one-shot `SelfLog.WriteLine` on first write failure in `EventLogBridge.Emit` to match `SyslogBridge`'s degradation breadcrumb.

### N5 — `ConfigReconciler.Dispose` blocks up to 2 s synchronously on `_debounceLoop.Wait`

**File:** `src/Mbproxy/Configuration/ConfigReconciler.cs:485-501`.

`Dispose` cancels `_disposalCts` and then `_debounceLoop.Wait(TimeSpan.FromSeconds(2))`. Since `ConfigReconciler` is a DI singleton, this runs on the host's shutdown path (`ServiceProvider.Dispose`). If an apply is mid-flight when shutdown begins, the loop will not observe cancellation until the current `ApplyAsync` returns (the apply's per-step CTS are linked to `ct` = `_disposalCts.Token`, so they *do* cancel — good), so 2 s is normally ample. But a hung supervisor `StopAsync` inside an apply could make `Dispose` block for the full 2 s and then *abandon* a still-running loop thread. Acceptable, but worth a comment that the 2 s is a hard cap and the loop may be abandoned; also consider implementing `IAsyncDisposable` so the host can await it cleanly rather than block a thread.

**Recommendation:** Document the 2 s cap and abandonment behaviour, or convert to `IAsyncDisposable`.

### N6 — Doc drift: `Configuration.md:284` claims startup rejection, and the validator-ordering claims in HotReload.md are slightly off

**Files:** `docs/Operations/Configuration.md:284, 314-318`; `docs/Features/HotReload.md:81, 95`.

Beyond M1's headline doc bug: `HotReload.md:81` asserts "There is no second validator that could disagree with the first" and `:95` says "The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording." In reality there *are* two validators (`MbproxyOptionsValidator` and `ReloadValidator`) with overlapping-but-not-identical rules and different wording (e.g. `MbproxyOptionsValidator` says "exceeds the 60_000 ms safety cap" while `ReloadValidator` says "exceeds 60_000 ms without Cache.AllowLongTtl=true"). The "same error wording" claim is inaccurate. `Configuration.md:298` rule 7 ("Width … enforced by `MbproxyOptionsValidator`") correctly notes Width is *not* in `ReloadValidator` — which means a hot reload that somehow bound a bad Width would not be caught by `ReloadValidator` (it relies on `BcdTagMapBuilder` to reject it; verify that builder does).

**Recommendation:** Once M1/M4 are resolved (ideally by consolidating to one validator), rewrite these doc passages to describe the actual single gate. If two validators remain, drop the "same error wording" claim.

---

## Items checked and found correct (no finding)

- **Debounce loop** (`ConfigReconciler.cs:165-218`): the linked-CTS-per-iteration pattern is correct; `debounceCts` is `using`-scoped per loop iteration so no CTS leak; the `OperationCanceledException when (!ct.IsCancellationRequested)` filter correctly distinguishes window-elapsed from disposal.
- **`_applySemaphore` serialization** (`:150-161`): correct `WaitAsync`/`try`/`finally`/`Release`; disposed once in `Dispose`.
- **Channel** (`:73-77`): bounded capacity 1 with `DropOldest` is the right choice for a coalesced "something changed" signal; no unbounded growth.
- **`OnChange` registration** (`:105-111`): non-blocking `TryWrite`, registration disposed in `Dispose` (`:487`).
- **`ServiceCounters`** (`ServiceCounters.cs`): all reads/writes via `Interlocked`; the ticks-as-long pattern for `LastReloadUtc` is sound; `CompareExchange(ref x, 0, 0)` as a volatile read is correct.
- **`DiagnosticSinkSelector.Select`** (`DiagnosticSinkSelector.cs:51-59`): pure, correct precedence (Windows wins, `isSystemd` only consulted off-Windows); the `OperatingSystem.IsWindows()` re-guard at `HostingExtensions.cs:101` correctly satisfies the platform analyzer for `[SupportedOSPlatform("windows")]` `EventLogBridge`.
- **Graceful shutdown wiring** (`ProxyWorker.cs:288-351`): `GracefulShutdownTimeoutMs` read live from `CurrentValue`; in-flight snapshot taken before `base.StopAsync`; admin stopped last; supervisors disposed last. Ordering matches the documented contract.
- **`ReloadPlan.Compute`** (`ReloadPlan.cs`): identity keyed on `Name`, `TagMapsEqual` includes `CacheTtlMs`, reseat-vs-restart split is correct.
- **Reseat counter preservation** (`ConfigReconciler.cs:359`): `supervisor.CurrentCounters` passed into the new context — correct, matches `HotReload.md:117`.