mbproxy: cross-platform support — Linux/systemd alongside Windows

Make the service build, run, and install on Linux as a first-class
target while keeping the Windows Service + Event Log behaviour intact.

- Build: drop the hardcoded win-x64 RID — single-file publish now works
  for any RID. publish.ps1 gains -Rid; new publish.sh for Linux hosts.
- Diagnostics: DiagnosticSinkSelector picks the Error+ sink per host —
  Windows Event Log under the SCM, local syslog under systemd
  (Serilog.Sinks.SyslogMessages), none for interactive runs. The
  EventLog truncation helper is extracted so it is testable cross-OS.
- Host: Program.cs registers AddSystemd() alongside AddWindowsService().
- Config: a RID-conditioned appsettings template ships Windows or Unix
  paths; both templates are schema-validated by a test.
- Install: systemd unit (Type=exec) plus install.sh / uninstall.sh.
  Also fixes two cross-platform bugs found while testing: install.ps1
  and uninstall.ps1 used New-EventLog / Remove-EventLog (absent in
  PowerShell 7), and the E2E sim launcher hardcoded Windows venv paths.
- Docs updated across README, CLAUDE.md, and docs/ for dual-platform.

413 tests pass on Windows; 374 (all non-simulator) on Linux.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-15 09:41:59 -04:00
parent 0868613890
commit b330faff03
29 changed files with 1805 additions and 106 deletions
+25 -2
@@ -2,7 +2,9 @@
Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps.
The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the Windows Application Event Log under source `mbproxy`.
The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log` on Windows, or `/var/log/mbproxy/mbproxy-<date>.log` on Linux. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the **Windows Application Event Log** (Windows Service) or the **local syslog / journal** (systemd) under source `mbproxy` — view the latter with `journalctl -t mbproxy` or `journalctl -u mbproxy`.
Paths and service commands below are written for Windows (`%ProgramData%`, `sc.exe`); the systemd equivalents are `/etc/mbproxy` + `/var/log/mbproxy` and `systemctl start|stop|status mbproxy`.
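As a rough illustration of the platform split described above, the following sketch resolves the rolling-log location for either host type. The directories and the `mbproxy-` file prefix come from the text; the helper names (`log_glob`, `newest_log`) are hypothetical, and because the exact `<date>` format is not stated here, the sketch matches it with a wildcard instead of guessing.

```python
import glob
import os
import sys

# Directories and file prefix are taken from the playbook text.
# The <date> portion is matched with a wildcard because its format is not stated.
LOG_GLOBS = {
    "windows": r"C:\ProgramData\mbproxy\logs\mbproxy-*.log",
    "linux": "/var/log/mbproxy/mbproxy-*.log",
}


def log_glob(platform=None):
    """Return the rolling-log glob for 'windows' or 'linux' (auto-detected if omitted)."""
    if platform is None:
        platform = "windows" if sys.platform.startswith("win") else "linux"
    return LOG_GLOBS[platform]


def newest_log(platform=None):
    """Return the most recently modified rolling log, or None if none exist."""
    paths = glob.glob(log_glob(platform))
    return max(paths, key=os.path.getmtime) if paths else None
```

On a systemd host, the Error-level mirror is read separately with `journalctl -t mbproxy` or `journalctl -u mbproxy`; this helper only covers the rolling file.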
## Service Startup Failures
@@ -124,7 +126,28 @@ The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The l
1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds.
2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair.
3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause; reduce the upstream client's poll interval below the middlebox idle timeout to keep traffic flowing.
3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause. Keepalive is enabled by default and should already be preventing this — confirm `Connection.Keepalive.Enabled` is `true` and that `BackendHeartbeatIdleMs` is comfortably below the middlebox idle timeout. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).
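The keepalive check in step 3 can be sketched as a small validation function. This is only an illustration: the function name, the idea of passing the effective values in directly, and the `margin` of 0.5 used to approximate "comfortably below" are all assumptions, not something mbproxy defines.

```python
def keepalive_problems(enabled, backend_heartbeat_idle_ms, middlebox_idle_timeout_ms, margin=0.5):
    """Return a list of human-readable problems with the effective keepalive settings.

    enabled                    -- effective value of Connection.Keepalive.Enabled
    backend_heartbeat_idle_ms  -- effective value of BackendHeartbeatIdleMs
    middlebox_idle_timeout_ms  -- idle timeout of the firewall/NAT on the path (measured or documented)
    margin                     -- assumed safety factor; 0.5 means "at most half the timeout"
    """
    problems = []
    if not enabled:
        problems.append("Connection.Keepalive.Enabled is not true")
    limit_ms = middlebox_idle_timeout_ms * margin
    if backend_heartbeat_idle_ms >= limit_ms:
        problems.append(
            f"BackendHeartbeatIdleMs={backend_heartbeat_idle_ms} is not comfortably below "
            f"the middlebox idle timeout ({middlebox_idle_timeout_ms} ms)"
        )
    return problems
```

An empty list means both conditions from step 3 hold; any entry points at the setting to adjust.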
### Backend keepalive heartbeat failing
**Symptom.** A PLC's backend connection is torn down while idle — no client was actively talking to it. `plcs[].backend.backendIdleDisconnects` increments and the upstream clients (if any were attached) are cascaded.
**Where to look.**
- Log events: `mbproxy.keepalive.heartbeat.timeout` (Warning) followed by `mbproxy.keepalive.backend.idle_disconnect` (Information).
- Status fields: `plcs[].backend.backendHeartbeatsSent`, `backendHeartbeatsFailed`, `backendIdleDisconnects`.
**Root causes.**
- The ECOM is reachable at the IP layer but no longer answering Modbus (firmware hang, ECOM reset mid-session).
- The path died between heartbeats and the heartbeat was the first request to discover it — this is the feature working as intended (the failure is found during idle, not on a client request).
- `BackendHeartbeatProbeAddress` points at an address the PLC rejects. The default (0 = `V0`) is safe on DL205/DL260; only an operator override could break this.
**Remediation.**
1. A single idle-disconnect that recovers on the next client request needs no action — the proxy reconnected the path proactively.
2. Repeated idle-disconnects on one PLC mean it keeps going dark while idle. Investigate the device and the network path; the proxy has no state to repair.
3. If `backendHeartbeatsFailed` climbs but the PLC answers real client requests fine, check that `BackendHeartbeatProbeAddress` is a register the device actually serves.
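One way to apply the remediation steps above is to compare two snapshots of a PLC's `backend` object from `status.json` taken some time apart. The sketch below assumes each snapshot is the already-parsed `plcs[].backend` dictionary for the same PLC; the function names and the returned wording are hypothetical, and the counters alone cannot confirm whether real client requests are succeeding, so the result is a hint rather than a verdict.

```python
# Counter names are the status fields listed under "Where to look".
FIELDS = ("backendHeartbeatsSent", "backendHeartbeatsFailed", "backendIdleDisconnects")


def backend_deltas(before, after):
    """Return how much each keepalive counter grew between two snapshots."""
    return {f: after.get(f, 0) - before.get(f, 0) for f in FIELDS}


def triage(before, after):
    """Map counter growth onto the remediation steps in this section."""
    d = backend_deltas(before, after)
    if d["backendIdleDisconnects"] > 1:
        return "repeated idle-disconnects: investigate the device and the network path"
    if d["backendIdleDisconnects"] == 1:
        return "single idle-disconnect: no action if the next client request succeeds"
    if d["backendHeartbeatsFailed"] > 0:
        return "heartbeats failing: check BackendHeartbeatProbeAddress if client requests are fine"
    return "no keepalive failures in this window"
```

For example, with `before` and `after` both showing zero failures and zero idle disconnects, `triage` reports no keepalive failures; if `backendIdleDisconnects` grew by three between the snapshots, it points at step 2.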
### Request timeout watchdog firing