Hot Reload

A save to appsettings.json propagates to a running mbproxy without restarting the service. This document explains the mechanism, the reconcile pipeline, and what each configuration change does to the running state.

How Reload Works

Microsoft.Extensions.Configuration loads appsettings.json with reloadOnChange: true. Every consumer reads its options through IOptionsMonitor<MbproxyOptions> instead of capturing a one-shot IOptions<T> snapshot at construction. When the framework's FileSystemWatcher sees the file change, it re-parses the JSON, re-binds the option tree, and notifies subscribers through IOptionsMonitor.OnChange.

The chosen mechanism is deliberate. There is no custom file watcher, no IPC channel, no admin-port mutation endpoint, and no SIGHUP-style trigger. An operator edits the file in place (or a deployment tool atomically rewrites it) and the running service catches up. The reload contract is identical whether the service is running interactively, as a Windows Service under the SCM, or as a Linux systemd unit.

The OnChange callback can fire multiple times for a single logical save because text editors on Windows commonly use a rename-and-replace pattern that produces two or three FileSystemWatcher events. The reconciler debounces these inside its own background loop with a 250 ms quiescent window so a single save produces a single apply.

Debounce window

The debounce window is held in ConfigReconciler.DebounceWindow = TimeSpan.FromMilliseconds(250). The loop reads from the change channel, then keeps re-arming a linked CancellationTokenSource with a 250 ms expiry and waits again. As long as new signals keep arriving inside the window, the loop drains them and keeps waiting. When the window elapses with no new signal the loop falls through and calls ApplyAsync against IOptionsMonitor.CurrentValue. The window is short enough that operators perceive saves as instant and long enough to absorb every editor save pattern observed in practice (rename-and-replace, write-truncate-write, Notepad, Visual Studio Code, PowerShell Set-Content).
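The drain-and-re-arm behaviour can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class name `DebounceLoop`, the `Signal` method, and the channel capacity of 1 are assumptions made for the example. Only the 250 ms window, the bounded `Channel<bool>` with `DropOldest`, and the linked `CancellationTokenSource` come from the description above.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for the reconciler's background loop.
public sealed class DebounceLoop
{
    public static readonly TimeSpan DebounceWindow = TimeSpan.FromMilliseconds(250);

    // Assumed capacity of 1: with DropOldest a writer never blocks,
    // and one pending signal is enough to wake the loop.
    private readonly Channel<bool> _signals = Channel.CreateBounded<bool>(
        new BoundedChannelOptions(1) { FullMode = BoundedChannelFullMode.DropOldest });

    // Called from the OnChange callback; returns immediately.
    public void Signal() => _signals.Writer.TryWrite(true);

    public async Task RunAsync(Func<Task> applyAsync, CancellationToken stop)
    {
        try
        {
            while (await _signals.Reader.WaitToReadAsync(stop))
            {
                while (true)
                {
                    // Drain whatever has arrived so far.
                    while (_signals.Reader.TryRead(out _)) { }

                    // Re-arm the quiescent window, linked to the shutdown token.
                    using var window = CancellationTokenSource.CreateLinkedTokenSource(stop);
                    window.CancelAfter(DebounceWindow);
                    try
                    {
                        // A new signal inside the window restarts the wait.
                        if (!await _signals.Reader.WaitToReadAsync(window.Token)) return;
                    }
                    catch (OperationCanceledException) when (!stop.IsCancellationRequested)
                    {
                        break; // the window elapsed with no new signal
                    }
                }
                await applyAsync();
            }
        }
        catch (OperationCanceledException)
        {
            // Shutdown was requested.
        }
    }
}
```

In this sketch a burst of `Signal()` calls that arrive less than 250 ms apart collapses into one `applyAsync` invocation, which is the property the debounce is meant to provide.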

The Reconcile Pipeline

Three types in src/Mbproxy/Configuration/ carry the reload contract from "framework noticed the file changed" to "the running service matches the new file":

  • ReloadValidator (src/Mbproxy/Configuration/ReloadValidator.cs) — runs cross-PLC and per-PLC checks before the reload is allowed to take effect. The validator is a static gate: Validate(MbproxyOptions next, out IReadOnlyList<string> errors) returns false and a list of error strings if the snapshot is malformed, and the apply step bails out before touching any state.
  • ReloadPlan (src/Mbproxy/Configuration/ReloadPlan.cs) — an immutable record produced by the pure function ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next). It buckets PLCs into ToAdd, ToRemove, ToRestart (network identity changed), and ToReseat (only the resolved BcdTagMap changed). PLC identity is keyed on Name, not ListenPort, so a port change is still the same PLC and goes to ToRestart rather than ToRemove + ToAdd.
  • ConfigReconciler (src/Mbproxy/Configuration/ConfigReconciler.cs) — subscribes to IOptionsMonitor.OnChange, debounces and serialises change events through a bounded Channel<bool> and a SemaphoreSlim(1, 1), then runs the plan: removes go first (concurrent), restarts next (concurrent), reseats apply via PlcListenerSupervisor.ReplaceContextAsync, and adds finish last.

The reconciler's OnChange handler does not block. It writes to a Channel<bool> with BoundedChannelFullMode.DropOldest so a busy reload queue never stalls the configuration framework. A dedicated background loop drains the channel, applies the 250 ms debounce, and then calls ApplyAsync on the latest snapshot exposed by IOptionsMonitor.CurrentValue. The last enqueued change wins.

The apply itself runs under _applySemaphore (a SemaphoreSlim(1, 1)) so two saves arriving in rapid succession are serialised and never interleave. If a second save lands while the first apply is still running, it queues at the semaphore and runs against whatever CurrentValue exposes when its turn comes — which is the freshest options snapshot, not necessarily the one that caused the wake-up.

Apply order

ApplyUnderLockAsync runs the steps in this order against the latest snapshot, starting with validation:

  1. Validate. If ReloadValidator.Validate returns errors, log mbproxy.config.reload.rejected, increment the rejected counter, and return without mutating state.
  2. Compute. Call ReloadPlan.Compute(_currentOptions, next) to bucket PLCs into ToAdd, ToRemove, ToRestart, and ToReseat.
  3. Remove. Stop every supervisor in ToRemove concurrently with a 10-second stop timeout, then dispose.
  4. Restart. Stop the old supervisor, build a fresh PerPlcContext (which includes a new ResponseCache when any resolved tag opts in), and start a new PlcListenerSupervisor on the new endpoint. Restarts run concurrently across affected PLCs.
  5. Reseat. For each PLC in ToReseat, build a new context that preserves the existing Counters (so operators see real history across the reseat) and call PlcListenerSupervisor.ReplaceContextAsync with a 5-second timeout.
  6. Add. Build and start a new supervisor for every PLC in ToAdd concurrently.
  7. Record. Update _currentOptions to next, call ServiceCounters.RecordReloadApplied, and log mbproxy.config.reload.applied with the apply counts and the global tag delta.

If a step throws, the exception is logged at Error and the loop continues with the remaining steps. The validator catches every precondition that can be checked from the configuration alone, so a runtime exception here is a true bug worth surfacing. The host stays up regardless.
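The "log and continue" behaviour for a throwing step can be illustrated with a small sketch. The names `ApplyStepsSketch` and `RunAsync` are hypothetical, and the sketch collects failures in a list where the service would log them at Error. It does not model the concurrency or the stop timeouts described in the steps above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ApplyStepsSketch
{
    // Runs the steps in order. A throwing step is recorded
    // and the remaining steps still run.
    public static async Task<IReadOnlyList<string>> RunAsync(
        IEnumerable<(string Name, Func<Task> Run)> steps)
    {
        var failures = new List<string>();
        foreach (var (name, run) in steps)
        {
            try
            {
                await run();
            }
            catch (Exception ex)
            {
                // Stand-in for an Error-level log entry.
                failures.Add($"{name}: {ex.Message}");
            }
        }
        return failures;
    }
}
```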

Per-Change-Kind Reconcile Table

| Change in appsettings.json | Propagation |
|---|---|
| BcdTags.Global add / remove / width | The rewriter dereferences IOptionsMonitor per PDU. The next PDU sees the new map. In-flight requests are not retroactively touched. |
| Plcs[i].BcdTags.Add or Plcs[i].BcdTags.Remove | Same as above: next-PDU resolution against the rebuilt map. |
| New Plcs[i] entry | ConfigReconciler builds a fresh PerPlcContext and PlcListenerSupervisor, which binds the new port under the same eager-then-auto-recover policy used at service startup. |
| Plcs[i] removed | The supervisor for that PLC is stopped (10 s stop timeout) and disposed, which closes every upstream client connection bound to that listener. |
| Plcs[i].ListenPort or Host changed | Planned as a restart; the effect is equivalent to remove + add. The supervisor stops the old listener, the reconciler rebuilds the context, and a new supervisor starts on the new endpoint. |
| Connection.BackendConnectTimeoutMs and the other Backend*TimeoutMs values | The next backend connect or request reads the new value through the monitor. In-flight operations keep their already-applied timeout. |
| BcdTags.*.CacheTtlMs or Plcs[i].DefaultCacheTtlMs | A tag-map reseat constructs a fresh ResponseCache for that PLC, which drops every cached entry for that PLC. Entries re-populate on demand under the new TTL. Per-tag flush granularity is intentionally not implemented. |
| Cache.AllowLongTtl | Enforced at the next reload validation. A TTL change that depends on the flag must be saved in the same edit as the flag. |
| Cache.MaxEntriesPerPlc | Applies to subsequent inserts. Existing entries are not pruned. |
| Cache.EvictionIntervalMs | Read by the next eviction loop tick. |
| Resilience.ReadCoalescing.Enabled flipped to false | Already-running coalesced entries drain naturally. Subsequent reads bypass coalescing. |
| Resilience.ReadCoalescing.MaxParties | Applies to subsequent attaches. Existing in-flight entries keep their current cap. |
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, CacheTtlMs > 60_000 without Cache.AllowLongTtl = true) | Reload is rejected as a whole. The current in-memory config stays in effect. mbproxy.config.reload.rejected is logged at Error. |

The "next-PDU" wording is load-bearing for the tag-list rows: the rewriter does not snapshot the tag map at connection accept time. It resolves the map for the active PLC at the start of every request frame, so a hot-reloaded tag list is in effect for the very next request, even on existing TCP connections.

Reseat vs. restart

The ReloadPlan distinguishes two kinds of "PLC is still here but changed":

  • Restart is triggered when Host, ListenPort, or backend Port differ between the old and new PlcOptions. The TCP socket has to close and reopen on a new endpoint, so there is no way to preserve the listener — the supervisor stops and a brand-new one starts.
  • Reseat is triggered when only the resolved BcdTagMap differs (which ReloadPlan.Compute checks structurally through TagMapsEqual: same set of (Address, Width, CacheTtlMs) triples). The listener socket and the upstream pipes stay open. Only the PerPlcContext swaps.

TagMapsEqual includes BcdTag.CacheTtlMs in the comparison so a per-tag TTL change or a Plcs[i].DefaultCacheTtlMs change (which folds into per-tag TTLs through BcdTagMapBuilder.Build) also routes to ToReseat and so also drops the cache. A Plcs[i] whose options are byte-identical to the previous snapshot lands in neither bucket and the supervisor is left alone.
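The bucketing rules can be sketched as a pure function. The record shapes below (`Tag`, `Plc`, `Plan`) are simplified, hypothetical stand-ins for the real option types, and `PlanSketch.Compute` takes PLC lists rather than full MbproxyOptions snapshots. The sketch only aims to show Name-keyed identity, the restart test on Host, ListenPort, and backend Port, and the set-based tag comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified shapes; the real option types carry more fields.
public sealed record Tag(int Address, int Width, int? CacheTtlMs);
public sealed record Plc(string Name, string Host, int Port, int ListenPort,
                         IReadOnlyList<Tag> ResolvedTags);
public sealed record Plan(IReadOnlyList<string> ToAdd, IReadOnlyList<string> ToRemove,
                          IReadOnlyList<string> ToRestart, IReadOnlyList<string> ToReseat);

public static class PlanSketch
{
    public static Plan Compute(IEnumerable<Plc> current, IEnumerable<Plc> next)
    {
        // Identity is the PLC name, so a port change is still "the same PLC".
        var old = current.ToDictionary(p => p.Name, StringComparer.Ordinal);
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var add = new List<string>();
        var restart = new List<string>();
        var reseat = new List<string>();

        foreach (var n in next)
        {
            seen.Add(n.Name);
            if (!old.TryGetValue(n.Name, out var o))
            {
                add.Add(n.Name);
            }
            else if (o.Host != n.Host || o.Port != n.Port || o.ListenPort != n.ListenPort)
            {
                restart.Add(n.Name);   // network identity changed
            }
            else if (!TagMapsEqual(o.ResolvedTags, n.ResolvedTags))
            {
                reseat.Add(n.Name);    // only the resolved tag map changed
            }
            // Otherwise unchanged: the supervisor is left alone.
        }

        var remove = old.Keys.Where(name => !seen.Contains(name)).ToList();
        return new Plan(add, remove, restart, reseat);
    }

    // Same set of (Address, Width, CacheTtlMs) triples; record equality is by value.
    private static bool TagMapsEqual(IReadOnlyList<Tag> a, IReadOnlyList<Tag> b) =>
        new HashSet<Tag>(a).SetEquals(b);
}
```

Because the function has no side effects, a plan can be computed and inspected in a unit test without starting any listener.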

Tag map resolution

BcdTagMapBuilder.Build is the single source of truth for what the resolved tag list looks like for one PLC. The hybrid resolution it implements is:

  1. Start with BcdTags.Global from the root options.
  2. Remove every address present in Plcs[i].BcdTags.Remove.
  3. Merge in Plcs[i].BcdTags.Add entries — if an address already exists in the working set, the Add entry wins. This is how a per-PLC width override is expressed (the global lists a 16-bit tag at the same address; the per-PLC Add overrides it to 32-bit).
  4. Fold Plcs[i].DefaultCacheTtlMs into any tag whose explicit CacheTtlMs is null.

The same builder runs both at startup and during reload validation, so a configuration that builds cleanly at startup is guaranteed to build cleanly at reload, and vice versa. There is no second validator that could disagree with the first.
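The four resolution steps might look like this in code. It is a sketch under stated assumptions: `TagSpec` and `TagMapSketch.Build` are hypothetical names and signatures, and the error reporting that the real builder performs (duplicate addresses, 32-bit overlap) is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical tag shape: register address, width in bits, optional TTL.
public sealed record TagSpec(int Address, int Width, int? CacheTtlMs = null);

public static class TagMapSketch
{
    public static IReadOnlyList<TagSpec> Build(
        IEnumerable<TagSpec> global,
        IEnumerable<int> remove,
        IEnumerable<TagSpec> add,
        int? defaultCacheTtlMs)
    {
        // 1. Start from the global list, keyed by address.
        var map = new SortedDictionary<int, TagSpec>();
        foreach (var tag in global) map[tag.Address] = tag;

        // 2. Drop the per-PLC removals.
        foreach (var address in remove) map.Remove(address);

        // 3. Per-PLC additions win over an existing entry at the same address.
        foreach (var tag in add) map[tag.Address] = tag;

        // 4. Fold the PLC default TTL into tags that have no explicit TTL.
        return map.Values
            .Select(t => t.CacheTtlMs is null ? t with { CacheTtlMs = defaultCacheTtlMs } : t)
            .ToList();
    }
}
```

For example, a global 16-bit tag at address 100 combined with a per-PLC Add of a 32-bit tag at the same address resolves to the 32-bit entry.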

Validation Rules

ReloadValidator.Validate is the gate the hot-reload path consults directly. It runs the following checks in order:

  1. PLC names are non-empty and unique under ordinal comparison.
  2. Every Plcs[i].ListenPort is in [1, 65535] and unique across the Plcs list.
  3. AdminPort is in [1, 65535] and does not collide with any ListenPort.
  4. For each PLC, BcdTagMapBuilder.Build(next.BcdTags, plc.BcdTags, plc.DefaultCacheTtlMs) reports no errors. This delegates the per-PLC well-formedness checks — duplicate addresses within a single resolved list, and 32-bit entries whose high register (Address + 1) overlaps a separate 16-bit entry — to the single source of truth used at startup.
  5. Cache TTL bounds: every BcdTag.CacheTtlMs and every Plcs[i].DefaultCacheTtlMs must be >= 0, and any value above 60_000 ms requires Cache.AllowLongTtl = true. Cache.MaxEntriesPerPlc and Cache.EvictionIntervalMs must be >= 0.

A failure at any step appends to the error list but the validator runs to completion so the operator sees every problem with a single save. If the list is non-empty, the reload is rejected atomically and no state mutates.
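A run-to-completion validator of this kind can be sketched as below. `PlcEntry` and `ValidatorSketch` are hypothetical, only a subset of the rules is covered (the tag-map build check is omitted), and the message wording is illustrative apart from the duplicate-ListenPort text, which follows the example shown later in this document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, reduced PLC shape for the example.
public sealed record PlcEntry(string Name, int ListenPort, int? DefaultCacheTtlMs = null);

public static class ValidatorSketch
{
    public static bool Validate(IReadOnlyList<PlcEntry> plcs, int adminPort,
                                bool allowLongTtl, out IReadOnlyList<string> errors)
    {
        var list = new List<string>();
        var names = new HashSet<string>(StringComparer.Ordinal);
        var ports = new Dictionary<int, string>();   // ListenPort -> owning PLC name

        foreach (var plc in plcs)
        {
            if (string.IsNullOrEmpty(plc.Name))
                list.Add("Plc name must not be empty.");
            else if (!names.Add(plc.Name))
                list.Add($"Plc '{plc.Name}': Duplicate name.");

            if (plc.ListenPort is < 1 or > 65535)
                list.Add($"Plc '{plc.Name}': ListenPort {plc.ListenPort} is out of range.");
            else if (ports.TryGetValue(plc.ListenPort, out var owner))
                list.Add($"Plc '{plc.Name}': Duplicate ListenPort {plc.ListenPort} (already used by '{owner}').");
            else
                ports[plc.ListenPort] = plc.Name;

            if (plc.DefaultCacheTtlMs is int ttl)
            {
                if (ttl < 0)
                    list.Add($"Plc '{plc.Name}': DefaultCacheTtlMs must be >= 0.");
                else if (ttl > 60_000 && !allowLongTtl)
                    list.Add($"Plc '{plc.Name}': DefaultCacheTtlMs above 60000 requires Cache.AllowLongTtl.");
            }
        }

        if (adminPort is < 1 or > 65535)
            list.Add($"AdminPort {adminPort} is out of range.");
        else if (ports.TryGetValue(adminPort, out var clash))
            list.Add($"AdminPort {adminPort} collides with ListenPort of '{clash}'.");

        // Every check has run; the caller rejects the reload if anything was collected.
        errors = list;
        return list.Count == 0;
    }
}
```

A caller can then join the collected messages with `string.Join("; ", errors)` to produce a single log property.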

Schema-level checks — invalid Width values on a BcdTagOptions, type mismatches, malformed JSON — are also enforced by MbproxyOptionsValidator (IValidateOptions<MbproxyOptions>) at bind time. The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording.

Rejected-reload example

A duplicate ListenPort in the saved file produces an error like the following on the rejected log line:

Config reload rejected — Errors=Plc 'plc-02': Duplicate ListenPort 5020 (already used by 'plc-01').

When several rules trip on the same save, the validator joins the messages with "; " so the operator sees every problem from one file save. The current in-memory configuration is unchanged, every supervisor keeps running on its existing context, and the next valid save runs the whole apply against the then-current state.

What Stays vs. What Changes Mid-Flight

The reload contract is built around a simple invariant: a Modbus request that has already started routing keeps the configuration it started with. The next request after the reload picks up the new values.

The rewriter is the clearest example. BcdPduPipeline dereferences the tag map at the start of every PDU. A request that is already in the multiplexer's outbound queue is rewritten against the map that was current when it arrived. The very next request on the same TCP connection sees the new map. This avoids a torn behaviour where one PDU is half-rewritten under the old tag list and half under the new — every PDU is fully consistent with exactly one snapshot of the map.
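One way to get the "exactly one snapshot per PDU" property is to read a shared reference once at the start of each request and use only that local copy afterwards. The sketch below shows the idea with hypothetical types (`TagMapHolder`, `PduSketch`). It is not BcdPduPipeline, and it treats every tag as a single register for brevity.

```csharp
using System.Collections.Generic;
using System.Threading;

// Holds the current tag map (address -> width in bits); a reload swaps the reference.
public sealed class TagMapHolder
{
    private IReadOnlyDictionary<int, int> _map;

    public TagMapHolder(IReadOnlyDictionary<int, int> initial) => _map = initial;

    public void Swap(IReadOnlyDictionary<int, int> next) => Volatile.Write(ref _map, next);

    public IReadOnlyDictionary<int, int> Snapshot() => Volatile.Read(ref _map);
}

public static class PduSketch
{
    // Counts how many registers in the requested range are tagged.
    public static int CountTaggedRegisters(TagMapHolder holder, int startAddress, int count)
    {
        // Dereferenced once, at PDU start. A swap that happens after this line
        // cannot affect the rest of this request.
        var map = holder.Snapshot();

        int hits = 0;
        for (int address = startAddress; address < startAddress + count; address++)
        {
            if (map.ContainsKey(address)) hits++;
        }
        return hits;
    }
}
```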

The same principle applies to timeouts. Connection.BackendConnectTimeoutMs and the per-operation timeout values are read through IOptionsMonitor.CurrentValue at the point the operation starts. A backend connect that has already entered its retry pipeline keeps its already-applied timeout for the remainder of that attempt. The next backend connect reads the new value.

The reseat path is the only place where running state changes mid-connection. A reseat swaps the entire PerPlcContext (TagMap, Counters, Cache) via PlcListenerSupervisor.ReplaceContextAsync. The listener socket and the existing upstream pipes survive the swap. The brief transition window between the old context and the new is documented in code: any PDU mid-flight at the swap point may observe the boundary, but the rewriter only consults the map at PDU start, so the practical effect is the same next-PDU resolution rule.

Counters are explicitly preserved across a reseat. The reconciler reads supervisor.CurrentCounters and passes the same ProxyCounters instance into the new context so request counts, rewrite counts, and error counts do not reset to zero every time an operator tweaks a tag. A restart, by contrast, constructs a brand-new ProxyCounters because the supervisor itself is brand new.
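The difference between the two paths comes down to which counters instance the new context receives. A minimal sketch, using hypothetical `CountersSketch` and `ContextSketch` types rather than the real ProxyCounters and PerPlcContext:

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class CountersSketch
{
    private long _requests;
    public long Requests => Interlocked.Read(ref _requests);
    public void RecordRequest() => Interlocked.Increment(ref _requests);
}

// Reduced context: just a tag map (address -> width in bits) and the counters.
public sealed record ContextSketch(IReadOnlyDictionary<int, int> TagMap, CountersSketch Counters);

public static class ReseatSketch
{
    // Reseat: new tag map, same counters instance, so history carries over.
    public static ContextSketch Reseat(ContextSketch current, IReadOnlyDictionary<int, int> newMap) =>
        new ContextSketch(newMap, current.Counters);

    // Restart: everything is rebuilt, so counters start from zero.
    public static ContextSketch Restart(IReadOnlyDictionary<int, int> newMap) =>
        new ContextSketch(newMap, new CountersSketch());
}
```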

Effect on upstream sockets

The fate of an open upstream client socket depends on which bucket its PLC lands in:

  • Reseat. The socket stays open. The client never notices the reload happened; only its next request frame resolves against the new tag map.
  • Restart. The old listener stops, which closes every upstream socket bound to it. The client sees a TCP close and is expected to reconnect (Wonderware DAServer, generic Modbus masters, and the supported gateways all do this automatically). When it reconnects, it lands on the new listener at the new endpoint.
  • Remove. Same as a restart from the client's perspective: the listener stops and every connection closes. If the operator also removed the IP from the upstream client's configuration, the client stops reconnecting; otherwise the reconnect attempts simply fail with ECONNREFUSED until the PLC reappears.
  • Add. No effect on any existing socket. The new listener simply starts accepting on its ListenPort.

Cache and Hot-Reload

Any tag-list change that affects a PLC drops the entire ResponseCache for that PLC. The reseat path constructs a fresh cache through ConfigReconciler.BuildCacheIfNeeded, which inspects the resolved map and returns a new ResponseCache when at least one tag opts in, or null otherwise. The supervisor disposes the old cache during ReplaceContextAsync.
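The "new cache or null" decision can be sketched as follows. The assumption that a tag opts in when its resolved CacheTtlMs is greater than zero is a guess made for the example, and `ResponseCacheSketch` is a placeholder for the real ResponseCache.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Placeholder cache: the key and value shapes are illustrative only.
public sealed class ResponseCacheSketch
{
    public Dictionary<int, ushort[]> Entries { get; } = new();
}

public static class CacheBuildSketch
{
    // Assumption: a tag "opts in" when its resolved CacheTtlMs is greater than zero.
    // Returns a fresh, empty cache when any tag opts in, or null when none does.
    public static ResponseCacheSketch? BuildCacheIfNeeded(IEnumerable<int?> resolvedTtlsMs) =>
        resolvedTtlsMs.Any(ttl => ttl is > 0) ? new ResponseCacheSketch() : null;
}
```

Because the returned cache is always a new instance, every entry held by the previous cache is dropped as a side effect of the swap.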

Per-tag granular flush is intentionally not implemented. The reasoning is correctness over micro-optimisation:

  • A width change between 16-bit and 32-bit can invalidate cached entries at neighbouring addresses, not just at the changed tag.
  • A tag removal means a cached value is no longer rewritten on the way out, so the cached entry that was valid one millisecond ago is now serving the wrong shape.
  • A TTL change on one tag does not influence neighbouring entries, but the cost of tracking per-entry TTL versions and replaying flushes outweighs the cost of repopulating on demand.

A wholesale drop is the simple correct move. Entries repopulate on demand at the next read against the new TTL, and a 54-PLC fleet with second-scale TTLs warms back to steady state within a handful of poll intervals.

Cache.MaxEntriesPerPlc and Cache.EvictionIntervalMs deliberately do not trigger a reseat. A change to either value is structurally invisible to TagMapsEqual (which only inspects the resolved tag triples), so no cache rebuild happens. MaxEntriesPerPlc is enforced on subsequent inserts only; existing entries above the new cap stay until natural LRU eviction reaches them. EvictionIntervalMs is sampled at each tick of the eviction loop, so a change is picked up when the tick already scheduled under the old interval fires.

Reload Events

Each apply attempt surfaces one of two events in the structured log, depending on whether the reload was applied or rejected:

[LoggerMessage(EventId = 60, EventName = "mbproxy.config.reload.applied",
    Level = LogLevel.Information,
    Message = "Config reload applied — PlcsAdded={PlcsAdded} PlcsRemoved={PlcsRemoved} " +
              "PlcsRestarted={PlcsRestarted} PlcsReseated={PlcsReseated} GlobalTagDelta={GlobalTagDelta}")]
private static partial void LogReloadApplied(
    ILogger logger, int plcsAdded, int plcsRemoved, int plcsRestarted, int plcsReseated, int globalTagDelta);

[LoggerMessage(EventId = 61, EventName = "mbproxy.config.reload.rejected",
    Level = LogLevel.Error,
    Message = "Config reload rejected — Errors={Errors}")]
private static partial void LogReloadRejected(ILogger logger, string errors);

mbproxy.config.reload.applied carries the counts from the executed ReloadPlan plus a GlobalTagDelta computed by ConfigReconciler.ComputeGlobalTagDelta, which counts how many global tag entries differ between the old and new options snapshots (added, removed, or width-changed).

mbproxy.config.reload.rejected carries the joined error string from ReloadValidator.Validate. The reconciler also increments service-wide counters through ServiceCounters.RecordReloadApplied and ServiceCounters.RecordReloadRejected, which surface on the status page as config.reloadCount, config.reloadRejectedCount, and config.lastReloadUtc. Both event names are catalogued in ../Reference/LogEvents.md.

Reading the events

A healthy reload looks like this in the log stream:

INFO mbproxy.config.reload.applied — PlcsAdded=1 PlcsRemoved=0 PlcsRestarted=0 PlcsReseated=2 GlobalTagDelta=3

The properties answer four questions at a glance: how many new listeners were brought up, how many old listeners were torn down, how many existing listeners moved to a new endpoint (and therefore disconnected their clients), and how many existing listeners had their tag maps swapped underneath open connections. GlobalTagDelta reports the number of BcdTags.Global entries that differ between snapshots; it counts each address once whether the difference is "added", "removed", or "width changed".
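The once-per-address counting rule for GlobalTagDelta can be sketched as below. This is an illustration of the rule as described, not the body of ConfigReconciler.ComputeGlobalTagDelta, and it models each tag list as a hypothetical address-to-width dictionary.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TagDeltaSketch
{
    // Each map is address -> width in bits.
    public static int ComputeGlobalTagDelta(
        IReadOnlyDictionary<int, int> oldTags,
        IReadOnlyDictionary<int, int> newTags)
    {
        // Visit every address present in either snapshot exactly once.
        return oldTags.Keys.Union(newTags.Keys).Count(address =>
            !oldTags.TryGetValue(address, out var oldWidth) ||   // added
            !newTags.TryGetValue(address, out var newWidth) ||   // removed
            oldWidth != newWidth);                                // width changed
    }
}
```

With an old list of {100: 16, 200: 16, 300: 32} and a new list of {100: 16, 200: 32, 400: 16}, the sketch counts address 200 (width changed), 300 (removed), and 400 (added), for a delta of 3.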

A rejected reload looks like this:

ERROR mbproxy.config.reload.rejected — Errors=Plc 'plc-02': Duplicate ListenPort 5020 (already used by 'plc-01').; Plc 'plc-03': BCD tag map error (DuplicateAddress): Address 1072 appears twice in resolved tag list.

The validator's errors are joined with "; " so a single rejected event captures every problem. The matching config.reloadRejectedCount counter on the status page increments by one per rejected save, not per error inside the save.