mbproxy: cross-platform support — Linux/systemd alongside Windows

Make the service build, run, and install on Linux as a first-class
target while keeping the Windows Service + Event Log behaviour intact.

- Build: drop the hardcoded win-x64 RID — single-file publish now works
  for any RID. publish.ps1 gains -Rid; new publish.sh for Linux hosts.
- Diagnostics: DiagnosticSinkSelector picks the Error+ sink per host —
  Windows Event Log under the SCM, local syslog under systemd
  (Serilog.Sinks.SyslogMessages), none for interactive runs. The
  EventLog truncation helper is extracted so it is testable cross-OS.
- Host: Program.cs registers AddSystemd() alongside AddWindowsService().
- Config: a RID-conditioned appsettings template ships Windows or Unix
  paths; both templates are schema-validated by a test.
- Install: systemd unit (Type=exec) plus install.sh / uninstall.sh.
  Also fixes two cross-platform bugs found while testing: install.ps1
  and uninstall.ps1 used New-EventLog / Remove-EventLog (absent in
  PowerShell 7), and the E2E sim launcher hardcoded Windows venv paths.
- Docs updated across README, CLAUDE.md, and docs/ for dual-platform.

413 tests pass on Windows; 374 (all non-simulator) on Linux.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-15 09:41:59 -04:00
parent 0868613890
commit b330faff03
29 changed files with 1805 additions and 106 deletions
+1 -1
View File
@@ -6,7 +6,7 @@ This document is the entry point for readers new to the codebase. It sketches th
## Runtime Shape
The process is a single .NET 10 Generic Host worker. `Microsoft.Extensions.Hosting.WindowsServices` registers the host as a Windows Service so the same binary runs interactively (for development) or under the SCM (in production). All configuration binds from `appsettings.json` through `IOptionsMonitor<MbproxyOptions>`, which makes the tag list and PLC roster hot-reloadable without process restart. `ProxyWorker` is the long-lived `BackgroundService` that owns startup, shutdown, and the listener supervisors for every PLC. A small Kestrel admin endpoint runs in the same process to serve the read-only status page.
The process is a single .NET 10 Generic Host worker. It registers both `Microsoft.Extensions.Hosting.WindowsServices` and `Microsoft.Extensions.Hosting.Systemd` — each a no-op off its own init system — so the same binary runs interactively (for development), as a Windows Service under the SCM, or as a Linux systemd unit. All configuration binds from `appsettings.json` through `IOptionsMonitor<MbproxyOptions>`, which makes the tag list and PLC roster hot-reloadable without process restart. `ProxyWorker` is the long-lived `BackgroundService` that owns startup, shutdown, and the listener supervisors for every PLC. A small Kestrel admin endpoint runs in the same process to serve the read-only status page.
There is no in-process database, no message broker, and no persistent cache file: state is per-PLC, in-memory, and ephemeral. Restarting the service drops every in-flight request and every cached response. Upstream clients are expected to reconnect and reissue; the proxy never replays a request on their behalf.
+1 -1
View File
@@ -6,7 +6,7 @@ A save to `appsettings.json` propagates to a running `mbproxy` without restartin
`Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`. Every consumer reads its options through `IOptionsMonitor<MbproxyOptions>` instead of capturing a one-shot `IOptions<T>` snapshot at construction. When the framework's `FileSystemWatcher` sees the file change, it re-parses the JSON, re-binds the option tree, and notifies subscribers through `IOptionsMonitor.OnChange`.
The chosen mechanism is deliberate. There is no custom file watcher, no IPC channel, no admin-port mutation endpoint, and no SIGHUP-style trigger. An operator edits the file in place (or a deployment tool atomically rewrites it) and the running service catches up. The reload contract is identical whether the service is running interactively or as a Windows Service under the SCM.
The chosen mechanism is deliberate. There is no custom file watcher, no IPC channel, no admin-port mutation endpoint, and no SIGHUP-style trigger. An operator edits the file in place (or a deployment tool atomically rewrites it) and the running service catches up. The reload contract is identical whether the service is running interactively, as a Windows Service under the SCM, or as a Linux systemd unit.
The `OnChange` callback can fire multiple times for a single logical save because text editors on Windows commonly use a rename-and-replace pattern that produces two or three `FileSystemWatcher` events. The reconciler debounces these inside its own background loop with a 250 ms quiescent window so a single save produces a single apply.
+31 -4
View File
@@ -7,8 +7,11 @@
The configuration loader resolves `appsettings.json` relative to the executable.
- **Development run** (`dotnet run`): `src/Mbproxy/appsettings.json` next to the build output.
- **Single-file publish** (`dotnet publish -c Release -r win-x64`): `appsettings.json` next to `Mbproxy.exe` in the publish folder.
- **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`. The install script copies the template at `install/mbproxy.config.template.json` to this path the first time only — an existing file is preserved across reinstalls.
- **Single-file publish** (`dotnet publish -c Release -r <rid>`): `appsettings.json` next to the published binary. A `win-x64` publish ships `install/mbproxy.config.template.json`; a `linux-x64` publish ships `install/mbproxy.linux.config.template.json` (same keys, Unix log path) — each linked into the bundle as `appsettings.json`.
- **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`, seeded by `install.ps1` from `mbproxy.config.template.json`.
- **Installed as a systemd unit**: `/etc/mbproxy/appsettings.json` (the unit's `WorkingDirectory`), seeded by `install.sh` from the Linux template.
In both installed cases the install script copies the template only when no config already exists — an existing file is preserved across reinstalls.
The file is loaded with `reloadOnChange: true`. All consumers read through `IOptionsMonitor<MbproxyOptions>`, so a save propagates without restarting the service. See [`../Features/HotReload.md`](../Features/HotReload.md) for per-key propagation semantics.
@@ -51,11 +54,19 @@ Every supported key under `Mbproxy:*`, populated to a representative default:
// Read-only HTTP status page. Set to 0 to disable.
"AdminPort": 8080,
// Backend connection / request / shutdown timeouts.
// Backend connection / request / shutdown timeouts and keepalive.
"Connection": {
"BackendConnectTimeoutMs": 3000,
"BackendRequestTimeoutMs": 3000,
"GracefulShutdownTimeoutMs": 10000
"GracefulShutdownTimeoutMs": 10000,
"Keepalive": {
"Enabled": true,
"TcpIdleTimeMs": 30000,
"TcpProbeIntervalMs": 5000,
"TcpProbeCount": 4,
"BackendHeartbeatIdleMs": 30000,
"BackendHeartbeatProbeAddress": 0
}
},
// Polly resilience policies.
@@ -169,6 +180,21 @@ Operational sizing notes:
- A 3 s request timeout is generous compared with typical DL205/DL260 scan times (a few ms to tens of ms for FC03 of 100 registers). The slack absorbs PLC scan-overlap jitter without faulting the upstream client.
- `GracefulShutdownTimeoutMs` should be less than the Service Control Manager's stop deadline. The default 10 s suits a fleet of 54 PLCs; on a much larger fleet, raise both the SCM wait hint and this value in lockstep.
## `Mbproxy.Connection.Keepalive`
TCP keepalive and backend heartbeat settings. Source: `KeepaliveOptions.cs`. Enabled by default — the DL205/DL260 ECOM never emits TCP keepalives, so an idle socket is otherwise dropped by middleboxes after 25 minutes. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md) for the full design.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `Enabled` | bool | `true` | Master switch. When `false`, neither `SO_KEEPALIVE` nor the backend heartbeat is applied and the proxy behaves exactly as a pre-keepalive build. |
| `TcpIdleTimeMs` | int | `30000` | `SO_KEEPALIVE` idle time before the OS sends its first probe. Applied to the backend socket and accepted upstream sockets. |
| `TcpProbeIntervalMs` | int | `5000` | `SO_KEEPALIVE` interval between probes once idle. |
| `TcpProbeCount` | int | `4` | `SO_KEEPALIVE` unanswered probes before the OS declares the socket dead. |
| `BackendHeartbeatIdleMs` | int | `30000` | After this much backend idle, the proxy issues a synthetic FC03 qty=1 read to keep the path warm and prove the ECOM still answers Modbus. Must be greater than `BackendRequestTimeoutMs`. |
| `BackendHeartbeatProbeAddress` | int | `0` | Modbus PDU address the heartbeat FC03 probe reads. Address `0` (`V0`) is valid on DL205/DL260 in factory absolute mode. Range `[0, 65535]`. |
On hot reload, the heartbeat interval and probe address are re-read on every loop tick. The `Tcp*` socket options are applied at connect/accept time, so a reload affects only sockets opened after the change. A reload where `BackendHeartbeatIdleMs <= BackendRequestTimeoutMs` is rejected — a heartbeat interval at or below the request timeout would fire continuously.
## `Mbproxy.Resilience`
Polly retry pipelines for backend connect, listener bind, and the in-flight read coalescer. Source: `ResilienceOptions.cs`.
@@ -391,6 +417,7 @@ A reduced view of [`../Features/HotReload.md`](../Features/HotReload.md), restri
| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream connections for that PLC. |
| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
| `Connection.Backend*TimeoutMs` | Next backend connect or request uses the new value. |
| `Connection.Keepalive` heartbeat fields | Re-read on every heartbeat loop tick. `Tcp*` socket options apply to backend/upstream sockets opened after the change. |
| `AdminPort` | Requires a service restart — the Kestrel admin host is built once at startup. |
| `Resilience.ReadCoalescing.Enabled` | Hot-reloadable; in-flight coalesced entries drain naturally. |
| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` | Tag-map reseat for the affected PLC drops that PLC's entire cache. |
+25 -2
View File
@@ -2,7 +2,9 @@
Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps.
The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the Windows Application Event Log under source `mbproxy`.
The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log` on Windows, or `/var/log/mbproxy/mbproxy-<date>.log` on Linux. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the **Windows Application Event Log** (Windows Service) or the **local syslog / journal** (systemd) under source `mbproxy` — view the latter with `journalctl -t mbproxy` or `journalctl -u mbproxy`.
Paths and service commands below are written for Windows (`%ProgramData%`, `sc.exe`); the systemd equivalents are `/etc/mbproxy` + `/var/log/mbproxy` and `systemctl start|stop|status mbproxy`.
## Service Startup Failures
@@ -124,7 +126,28 @@ The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The l
1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds.
2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair.
3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause; reduce the upstream client's poll interval below the middlebox idle timeout to keep traffic flowing.
3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause. Keepalive is enabled by default and should already be preventing this — confirm `Connection.Keepalive.Enabled` is `true` and that `BackendHeartbeatIdleMs` is comfortably below the middlebox idle timeout. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).
### Backend keepalive heartbeat failing
**Symptom.** A PLC's backend connection is torn down while idle — no client was actively talking to it. `plcs[].backend.backendIdleDisconnects` increments and the upstream clients (if any were attached) are cascaded.
**Where to look.**
- Log events: `mbproxy.keepalive.heartbeat.timeout` (Warning) followed by `mbproxy.keepalive.backend.idle_disconnect` (Information).
- Status fields: `plcs[].backend.backendHeartbeatsSent`, `backendHeartbeatsFailed`, `backendIdleDisconnects`.
**Root causes.**
- The ECOM is reachable at the IP layer but no longer answering Modbus (firmware hang, ECOM reset mid-session).
- The path died between heartbeats and the heartbeat was the first request to discover it — this is the feature working as intended (the failure is found during idle, not on a client request).
- `BackendHeartbeatProbeAddress` points at an address the PLC rejects. The default (0 = `V0`) is safe on DL205/DL260; only an operator override could break this.
**Remediation.**
1. A single idle-disconnect that recovers on the next client request needs no action — the proxy reconnected the path proactively.
2. Repeated idle-disconnects on one PLC mean it keeps going dark while idle. Investigate the device and the network path; the proxy has no state to repair.
3. If `backendHeartbeatsFailed` climbs but the PLC answers real client requests fine, check that `BackendHeartbeatProbeAddress` is a register the device actually serves.
### Request timeout watchdog firing
+49 -3
View File
@@ -6,9 +6,9 @@ The stable catalog of every `mbproxy.*` event name the service emits, with its l
The service uses [Serilog](https://serilog.net/) wired through the `Microsoft.Extensions.Logging` bridge. Three sinks are configured (see `src/Mbproxy/HostingExtensions.cs`):
- **Console** — written to stdout for interactive `--console` runs and for the SCM stdout capture.
- **Rolling file** — under `%ProgramData%\mbproxy\logs\` (`mbproxy-<date>.log`).
- **Windows Event Log** — only when running as a Windows Service, and only for events at `Error` and above (see `src/Mbproxy/Diagnostics/EventLogBridge.cs`).
- **Console** — stdout; captured by the Windows SCM or by systemd-journald.
- **Rolling file** — `%ProgramData%\mbproxy\logs\` on Windows, `/var/log/mbproxy/` on Linux (`mbproxy-<date>.log`).
- **Platform diagnostic sink** — `Error`+ events only. `DiagnosticSinkSelector` picks it once at the composition root: the **Windows Application Event Log** under the SCM (`EventLogBridge`), **local syslog** under systemd (`SyslogBridge`), or none for interactive/dev runs.
Every event uses source-generated `[LoggerMessage]` definitions, so the property names below match the message template token-for-token. The default minimum level is `Information`; lower the floor for `Mbproxy.*` categories via the standard `Logging:LogLevel` configuration to surface `Debug` events such as the coalesce and cache traces.
@@ -385,6 +385,51 @@ Fires whenever the entire per-PLC cache is wiped at once — primarily after a b
**Operator action:** none unless flushes happen on a tight loop, which would indicate the backend connection itself is unstable.
## Keepalive
See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md) for the backend heartbeat design.
### mbproxy.keepalive.heartbeat.sent
**Level:** Debug &middot; **EventId:** 150 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/KeepaliveLogEvents.cs`
| Property | Type | Meaning |
|----------|------|---------|
| `Plc` | `string` | Configured PLC name. |
| `ProxyTxId` | `ushort` | Proxy-allocated TxId carrying the synthetic FC03 probe. |
| `Address` | `ushort` | Modbus address the probe reads (`BackendHeartbeatProbeAddress`). |
Fires each time the heartbeat loop issues a probe on an idle backend socket — at most one per `BackendHeartbeatIdleMs` per idle PLC.
**Operator action:** none. Debug-level; useful only when confirming the heartbeat is alive.
### mbproxy.keepalive.heartbeat.timeout
**Level:** Warning &middot; **EventId:** 151 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/KeepaliveLogEvents.cs`
| Property | Type | Meaning |
|----------|------|---------|
| `Plc` | `string` | Configured PLC name. |
| `ProxyTxId` | `ushort` | Proxy TxId of the unanswered probe. |
| `ElapsedMs` | `long` | Milliseconds from probe send to timeout. |
Fires when a heartbeat probe is not answered within `BackendRequestTimeoutMs` — the backend is connected but no longer answering Modbus.
**Operator action:** check the PLC and the network path. Paired with `mbproxy.keepalive.backend.idle_disconnect` for the same PLC.
### mbproxy.keepalive.backend.idle_disconnect
**Level:** Information &middot; **EventId:** 152 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/KeepaliveLogEvents.cs`
| Property | Type | Meaning |
|----------|------|---------|
| `Plc` | `string` | Configured PLC name. |
| `ElapsedMs` | `long` | Milliseconds the failed heartbeat waited before the teardown. |
Fires when a failed heartbeat triggers a proactive backend teardown. Every attached upstream pipe is cascaded; clients reconnect on their next request. This is the keepalive feature doing its job — finding a dead path during idle instead of on the next real request.
**Operator action:** none if isolated. Repeated idle-disconnects on one PLC indicate it keeps going dark while idle — investigate the device or the network path.
## BCD Rewriter
### mbproxy.rewrite.partial_bcd
@@ -495,5 +540,6 @@ Lifecycle events (`startup.*`, `listener.*`, `admin.*`, `shutdown.*`, `config.re
- [Response Cache](../Architecture/ResponseCache.md) — context for the `mbproxy.cache.*` events.
- [Status Page](../Operations/StatusPage.md) — counter equivalents for the high-volume Debug-level events.
- [Read Coalescing](../Architecture/ReadCoalescing.md) — context for the `mbproxy.coalesce.*` events.
- [Keepalive](../Architecture/Keepalive.md) — context for the `mbproxy.keepalive.*` events.
- [BCD Rewriting](../Features/BcdRewriting.md) — context for the `mbproxy.rewrite.*` and `mbproxy.exception.passthrough` events.
- [Hot Reload](../Features/HotReload.md) — context for the `mbproxy.config.reload.*` events.