mbproxy: cross-platform support — Linux/systemd alongside Windows

Make the service build, run, and install on Linux as a first-class target while keeping the Windows Service + Event Log behaviour intact. - Build: drop the hardcoded win-x64 RID — single-file publish now works for any RID. publish.ps1 gains -Rid; new publish.sh for Linux hosts. - Diagnostics: DiagnosticSinkSelector picks the Error+ sink per host — Windows Event Log under the SCM, local syslog under systemd (Serilog.Sinks.SyslogMessages), none for interactive runs. The EventLog truncation helper is extracted so it is testable cross-OS. - Host: Program.cs registers AddSystemd() alongside AddWindowsService(). - Config: a RID-conditioned appsettings template ships Windows or Unix paths; both templates are schema-validated by a test. - Install: systemd unit (Type=exec) plus install.sh / uninstall.sh. Also fixes two cross-platform bugs found while testing: install.ps1 and uninstall.ps1 used New-EventLog / Remove-EventLog (absent in PowerShell 7), and the E2E sim launcher hardcoded Windows venv paths. - Docs updated across README, CLAUDE.md, and docs/ for dual-platform. 413 tests pass on Windows; 374 (all non-simulator) on Linux. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:41:59 -04:00
parent 0868613890
commit b330faff03
29 changed files with 1805 additions and 106 deletions
@@ -2,7 +2,9 @@

 Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps.

-The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the Windows Application Event Log under source `mbproxy`.
+The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log` on Windows, or `/var/log/mbproxy/mbproxy-<date>.log` on Linux. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the **Windows Application Event Log** (Windows Service) or the **local syslog / journal** (systemd) under source `mbproxy` — view the latter with `journalctl -t mbproxy` or `journalctl -u mbproxy`.
+
+Paths and service commands below are written for Windows (`%ProgramData%`, `sc.exe`); the systemd equivalents are `/etc/mbproxy` + `/var/log/mbproxy` and `systemctl start|stop|status mbproxy`.

 ## Service Startup Failures

@@ -124,7 +126,28 @@ The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The l

 1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds.
 2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair.
-3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause; reduce the upstream client's poll interval below the middlebox idle timeout to keep traffic flowing.
+3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause. Keepalive is enabled by default and should already be preventing this — confirm `Connection.Keepalive.Enabled` is `true` and that `BackendHeartbeatIdleMs` is comfortably below the middlebox idle timeout. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).
+
+### Backend keepalive heartbeat failing
+
+**Symptom.** A PLC's backend connection is torn down while idle — no client was actively talking to it. `plcs[].backend.backendIdleDisconnects` increments and the upstream clients (if any were attached) are cascaded.
+
+**Where to look.**
+
+- Log events: `mbproxy.keepalive.heartbeat.timeout` (Warning) followed by `mbproxy.keepalive.backend.idle_disconnect` (Information).
+- Status fields: `plcs[].backend.backendHeartbeatsSent`, `backendHeartbeatsFailed`, `backendIdleDisconnects`.
+
+**Root causes.**
+
+- The ECOM is reachable at the IP layer but no longer answering Modbus (firmware hang, ECOM reset mid-session).
+- The path died between heartbeats and the heartbeat was the first request to discover it — this is the feature working as intended (the failure is found during idle, not on a client request).
+- `BackendHeartbeatProbeAddress` points at an address the PLC rejects. The default (0 = `V0`) is safe on DL205/DL260; only an operator override could break this.
+
+**Remediation.**
+
+1. A single idle-disconnect that recovers on the next client request needs no action — the proxy reconnected the path proactively.
+2. Repeated idle-disconnects on one PLC mean it keeps going dark while idle. Investigate the device and the network path; the proxy has no state to repair.
+3. If `backendHeartbeatsFailed` climbs but the PLC answers real client requests fine, check that `BackendHeartbeatProbeAddress` is a register the device actually serves.

 ### Request timeout watchdog firing