mbproxy: cross-platform support — Linux/systemd alongside Windows

Make the service build, run, and install on Linux as a first-class target while keeping the Windows Service + Event Log behaviour intact. - Build: drop the hardcoded win-x64 RID — single-file publish now works for any RID. publish.ps1 gains -Rid; new publish.sh for Linux hosts. - Diagnostics: DiagnosticSinkSelector picks the Error+ sink per host — Windows Event Log under the SCM, local syslog under systemd (Serilog.Sinks.SyslogMessages), none for interactive runs. The EventLog truncation helper is extracted so it is testable cross-OS. - Host: Program.cs registers AddSystemd() alongside AddWindowsService(). - Config: a RID-conditioned appsettings template ships Windows or Unix paths; both templates are schema-validated by a test. - Install: systemd unit (Type=exec) plus install.sh / uninstall.sh. Also fixes two cross-platform bugs found while testing: install.ps1 and uninstall.ps1 used New-EventLog / Remove-EventLog (absent in PowerShell 7), and the E2E sim launcher hardcoded Windows venv paths. - Docs updated across README, CLAUDE.md, and docs/ for dual-platform. 413 tests pass on Windows; 374 (all non-simulator) on Linux. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:41:59 -04:00
parent 0868613890
commit b330faff03
29 changed files with 1805 additions and 106 deletions
@@ -7,8 +7,11 @@
 The configuration loader resolves `appsettings.json` relative to the executable.

 - **Development run** (`dotnet run`): `src/Mbproxy/appsettings.json` next to the build output.
- **Single-file publish** (`dotnet publish -c Release -r win-x64`): `appsettings.json` next to `Mbproxy.exe` in the publish folder.
- **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`. The install script copies the template at `install/mbproxy.config.template.json` to this path the first time only — an existing file is preserved across reinstalls.
+- **Single-file publish** (`dotnet publish -c Release -r <rid>`): `appsettings.json` next to the published binary. A `win-x64` publish ships `install/mbproxy.config.template.json`; a `linux-x64` publish ships `install/mbproxy.linux.config.template.json` (same keys, Unix log path) — each linked into the bundle as `appsettings.json`.
+- **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`, seeded by `install.ps1` from `mbproxy.config.template.json`.
+- **Installed as a systemd unit**: `/etc/mbproxy/appsettings.json` (the unit's `WorkingDirectory`), seeded by `install.sh` from the Linux template.
+
+In both installed cases the install script copies the template only when no config already exists — an existing file is preserved across reinstalls.

 The file is loaded with `reloadOnChange: true`. All consumers read through `IOptionsMonitor<MbproxyOptions>`, so a save propagates without restarting the service. See [`../Features/HotReload.md`](../Features/HotReload.md) for per-key propagation semantics.

@@ -51,11 +54,19 @@ Every supported key under `Mbproxy:*`, populated to a representative default:
    // Read-only HTTP status page. Set to 0 to disable.
    "AdminPort": 8080,

-    // Backend connection / request / shutdown timeouts.
+    // Backend connection / request / shutdown timeouts and keepalive.
    "Connection": {
      "BackendConnectTimeoutMs":   3000,
      "BackendRequestTimeoutMs":   3000,
-      "GracefulShutdownTimeoutMs": 10000
+      "GracefulShutdownTimeoutMs": 10000,
+      "Keepalive": {
+        "Enabled":                      true,
+        "TcpIdleTimeMs":                30000,
+        "TcpProbeIntervalMs":           5000,
+        "TcpProbeCount":                4,
+        "BackendHeartbeatIdleMs":       30000,
+        "BackendHeartbeatProbeAddress": 0
+      }
    },

    // Polly resilience policies.
@@ -169,6 +180,21 @@ Operational sizing notes:
 - A 3 s request timeout is generous compared with typical DL205/DL260 scan times (a few ms to tens of ms for FC03 of 100 registers). The slack absorbs PLC scan-overlap jitter without faulting the upstream client.
 - `GracefulShutdownTimeoutMs` should be less than the Service Control Manager's stop deadline. The default 10 s suits a fleet of 54 PLCs; on a much larger fleet, raise both the SCM wait hint and this value in lockstep.

+## `Mbproxy.Connection.Keepalive`
+
+TCP keepalive and backend heartbeat settings. Source: `KeepaliveOptions.cs`. Enabled by default — the DL205/DL260 ECOM never emits TCP keepalives, so an idle socket is otherwise dropped by middleboxes after 2–5 minutes. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md) for the full design.
+
+| Field | Type | Default | Notes |
+|-------|------|---------|-------|
+| `Enabled` | bool | `true` | Master switch. When `false`, neither `SO_KEEPALIVE` nor the backend heartbeat is applied and the proxy behaves exactly as a pre-keepalive build. |
+| `TcpIdleTimeMs` | int | `30000` | `SO_KEEPALIVE` idle time before the OS sends its first probe. Applied to the backend socket and accepted upstream sockets. |
+| `TcpProbeIntervalMs` | int | `5000` | `SO_KEEPALIVE` interval between probes once idle. |
+| `TcpProbeCount` | int | `4` | `SO_KEEPALIVE` unanswered probes before the OS declares the socket dead. |
+| `BackendHeartbeatIdleMs` | int | `30000` | After this much backend idle, the proxy issues a synthetic FC03 qty=1 read to keep the path warm and prove the ECOM still answers Modbus. Must be greater than `BackendRequestTimeoutMs`. |
+| `BackendHeartbeatProbeAddress` | int | `0` | Modbus PDU address the heartbeat FC03 probe reads. Address `0` (`V0`) is valid on DL205/DL260 in factory absolute mode. Range `[0, 65535]`. |
+
+On hot reload, the heartbeat interval and probe address are re-read on every loop tick. The `Tcp*` socket options are applied at connect/accept time, so a reload affects only sockets opened after the change. A reload where `BackendHeartbeatIdleMs <= BackendRequestTimeoutMs` is rejected — a heartbeat interval at or below the request timeout would fire continuously.
+
 ## `Mbproxy.Resilience`

 Polly retry pipelines for backend connect, listener bind, and the in-flight read coalescer. Source: `ResilienceOptions.cs`.
@@ -391,6 +417,7 @@ A reduced view of [`../Features/HotReload.md`](../Features/HotReload.md), restri
 | `Plcs[i]` removed | Supervisor stops the listener and closes all upstream connections for that PLC. |
 | `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
 | `Connection.Backend*TimeoutMs` | Next backend connect or request uses the new value. |
+| `Connection.Keepalive` heartbeat fields | Re-read on every heartbeat loop tick. `Tcp*` socket options apply to backend/upstream sockets opened after the change. |
 | `AdminPort` | Requires a service restart — the Kestrel admin host is built once at startup. |
 | `Resilience.ReadCoalescing.Enabled` | Hot-reloadable; in-flight coalesced entries drain naturally. |
 | `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` | Tag-map reseat for the affected PLC drops that PLC's entire cache. |
@@ -2,7 +2,9 @@

 Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps.

-The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the Windows Application Event Log under source `mbproxy`.
+The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log` on Windows, or `/var/log/mbproxy/mbproxy-<date>.log` on Linux. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the **Windows Application Event Log** (Windows Service) or the **local syslog / journal** (systemd) under source `mbproxy` — view the latter with `journalctl -t mbproxy` or `journalctl -u mbproxy`.
+
+Paths and service commands below are written for Windows (`%ProgramData%`, `sc.exe`); the systemd equivalents are `/etc/mbproxy` + `/var/log/mbproxy` and `systemctl start|stop|status mbproxy`.

 ## Service Startup Failures

@@ -124,7 +126,28 @@ The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The l

 1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds.
 2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair.
-3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause; reduce the upstream client's poll interval below the middlebox idle timeout to keep traffic flowing.
+3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause. Keepalive is enabled by default and should already be preventing this — confirm `Connection.Keepalive.Enabled` is `true` and that `BackendHeartbeatIdleMs` is comfortably below the middlebox idle timeout. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).
+
+### Backend keepalive heartbeat failing
+
+**Symptom.** A PLC's backend connection is torn down while idle — no client was actively talking to it. `plcs[].backend.backendIdleDisconnects` increments and the upstream clients (if any were attached) are cascaded.
+
+**Where to look.**
+
+- Log events: `mbproxy.keepalive.heartbeat.timeout` (Warning) followed by `mbproxy.keepalive.backend.idle_disconnect` (Information).
+- Status fields: `plcs[].backend.backendHeartbeatsSent`, `backendHeartbeatsFailed`, `backendIdleDisconnects`.
+
+**Root causes.**
+
+- The ECOM is reachable at the IP layer but no longer answering Modbus (firmware hang, ECOM reset mid-session).
+- The path died between heartbeats and the heartbeat was the first request to discover it — this is the feature working as intended (the failure is found during idle, not on a client request).
+- `BackendHeartbeatProbeAddress` points at an address the PLC rejects. The default (0 = `V0`) is safe on DL205/DL260; only an operator override could break this.
+
+**Remediation.**
+
+1. A single idle-disconnect that recovers on the next client request needs no action — the proxy reconnected the path proactively.
+2. Repeated idle-disconnects on one PLC mean it keeps going dark while idle. Investigate the device and the network path; the proxy has no state to repair.
+3. If `backendHeartbeatsFailed` climbs but the PLC answers real client requests fine, check that `BackendHeartbeatProbeAddress` is a register the device actually serves.

 ### Request timeout watchdog firing