mbproxy: add keepalive / connection monitoring

The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-15 09:40:54 -04:00
parent 7466a46aa7
commit 0868613890
25 changed files with 1135 additions and 25 deletions
+20 -4
View File
@@ -135,6 +135,16 @@ These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is
| `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. |
| `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. |
### Keepalive counters
These fields describe the backend keepalive heartbeat. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.backendHeartbeatsSent` | `long` | `CounterSnapshot.BackendHeartbeatsSent` | Synthetic FC03 heartbeat probes issued on this PLC's idle backend socket. |
| `backend.backendHeartbeatsFailed` | `long` | `CounterSnapshot.BackendHeartbeatsFailed` | Heartbeat probes not answered within `BackendRequestTimeoutMs`. Each failure tears the backend down. |
| `backend.backendIdleDisconnects` | `long` | `CounterSnapshot.BackendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat — an event count, distinct from `disconnectCascades` (which counts cascaded pipes). Sustained growth means a PLC is repeatedly going dark while idle. |
### Bytes
| JSON path | Type | Source | Meaning |
@@ -224,7 +234,10 @@ A representative two-PLC deployment, ~2 hours into a run:
"cacheMissCount": 88691,
"cacheInvalidations": 6203,
"cacheEntryCount": 47,
"cacheBytes": 18512
"cacheBytes": 18512,
"backendHeartbeatsSent": 412,
"backendHeartbeatsFailed": 0,
"backendIdleDisconnects": 0
},
"bytes": {
"upstreamIn": 4108290,
@@ -267,7 +280,10 @@ A representative two-PLC deployment, ~2 hours into a run:
"cacheMissCount": 0,
"cacheInvalidations": 0,
"cacheEntryCount": 0,
"cacheBytes": 0
"cacheBytes": 0,
"backendHeartbeatsSent": 0,
"backendHeartbeatsFailed": 0,
"backendIdleDisconnects": 0
},
"bytes": { "upstreamIn": 0, "upstreamOut": 0 }
}
@@ -282,10 +298,10 @@ The HTML renderer is `StatusHtmlRenderer.Render(StatusResponse)` in `src/Mbproxy
Structure:
1. **Header summary** — version, formatted uptime (`Nh MMm SSs`), `bound/configured` listener tally, last reload timestamp, reload count with a `(N rejected)` suffix when applicable.
2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell.
2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell, keepalive cell.
3. **State cell error detail** — when `state == "recovering"`, the cell also shows `lastBindError` and `(attempt N)` in a small red span.
The coalescing and cache cells each render as `<pct>% (<hits>)`. When neither has been exercised (`hit + miss == 0`), the cell renders an em-dash to keep the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
The coalescing and cache cells each render as `<pct>% (<hits>)`. When neither has been exercised (`hit + miss == 0`), the cell renders an em-dash to keep the column narrow. The keepalive cell shows the heartbeat-sent count, with `(fail N, idle-disc N)` appended only when either is non-zero. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
The page does not depend on JavaScript. Refresh is driven entirely by the `<meta http-equiv="refresh" content="5">` tag, so any browser — including text-mode browsers — sees the same view.