mbproxy: add keepalive / connection monitoring

The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket can be silently dropped by a middlebox (switch, firewall, NAT) after 2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream sockets, and drive a periodic synthetic FC03 heartbeat on each idle backend socket so a dead path is detected before a real client request hits it. Controlled by Connection.Keepalive (ON by default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -240,6 +240,7 @@ The per-request timeout watchdog described above is the production defence again
- [`./Overview.md`](./Overview.md) — proxy architecture entry point
- [`./ReadCoalescing.md`](./ReadCoalescing.md) — FC03/FC04 fan-out built on `InterestedParties`
- [`./ResponseCache.md`](./ResponseCache.md) — per-PLC FC03/FC04 cache layered in front of this multiplexer
- [`./Keepalive.md`](./Keepalive.md) — TCP keepalive and the backend heartbeat that keeps this socket warm
- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `Connection.BackendConnectTimeoutMs`, `Connection.BackendRequestTimeoutMs`, retry tuning
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades` counters
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.multiplex.*` structured log events
@@ -0,0 +1,76 @@
# Keepalive & Connection Monitoring

The DL205/DL260 ECOM does not emit TCP keepalives (see [`../Reference/dl205.md`](../Reference/dl205.md) → "Behavioural Oddities"). An idle socket can therefore be silently dropped by middleboxes — switches, firewalls, NAT — typically after 2–5 minutes. The proxy holds one **persistent backend socket per PLC** ([`./ConnectionModel.md`](./ConnectionModel.md)) plus many accepted upstream client sockets, so it needs its own keepalive on both sides.

Keepalive is **enabled by default** and is governed by the `Connection.Keepalive` option block (see [`../Operations/Configuration.md`](../Operations/Configuration.md)). Set `Connection.Keepalive.Enabled = false` to restore pre-keepalive behaviour exactly.

## Two mechanisms

| Mechanism | Scope | Detects |
|-----------|-------|---------|
| OS TCP keepalive (`SO_KEEPALIVE`) | Backend socket **and** accepted upstream sockets | A peer whose TCP stack is gone (host down, cable pulled, half-open socket). |
| Application heartbeat (FC03 probe) | Backend socket only | The above **plus** a middlebox idle-drop and an ECOM that is connected-but-not-answering Modbus. |

The application heartbeat is the load-bearing mechanism; OS keepalive is a cheap belt-and-suspenders that also covers the window between heartbeat ticks.

## Backend: OS TCP keepalive

`SocketKeepalive.Apply` sets `SO_KEEPALIVE` plus the idle-time / probe-interval / probe-count tunables on the backend `Socket` right after it is created in `PlcMultiplexer.EnsureBackendConnectedAsync`. The tunables come from `Connection.Keepalive.Tcp*`. Socket options are applied **at connect time** — a hot-reload of the `Tcp*` values only affects backend sockets opened *after* the change.
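To make the socket-option step concrete, here is a minimal Python sketch of the same idea. It is not the proxy's C# `SocketKeepalive.Apply`: the function name `apply_keepalive`, its parameters, and the values passed in the usage line are assumptions for illustration. The per-socket tunables `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are Linux names and are guarded, because other platforms expose different knobs.

```python
import socket


def apply_keepalive(sock: socket.socket, idle_s: int, interval_s: int, count: int) -> None:
    """Enable OS keepalive on `sock` and tune it where the platform allows."""
    # Turn keepalive probing on; this part is portable.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Idle time before the first probe, interval between probes, and the
    # number of unanswered probes before the OS declares the peer dead.
    tunables = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", count),
    )
    for name, value in tunables:
        option = getattr(socket, name, None)  # absent on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


# Usage: options are set on the socket object itself, so they apply only to
# sockets configured after a settings change (hypothetical values).
backend = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
apply_keepalive(backend, idle_s=30, interval_s=10, count=3)
backend.close()
```

Because the options live on each socket, a changed setting cannot retroactively alter a socket that is already open, which matches the connect-time behaviour described above.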

## Backend: application heartbeat

A per-`PlcMultiplexer` background loop (`RunBackendHeartbeatAsync`) is started alongside the backend writer and reader on each successful connect, under the same `_backendCts`, and dies with them on teardown.

- The multiplexer tracks `_lastBackendActivityUtc`, updated by **both** the writer (on every send) and the reader (on every received frame). Real traffic in either direction therefore suppresses the heartbeat.
- Each tick (a quarter of `BackendHeartbeatIdleMs`, floored at 500 ms), if the socket has been idle longer than `BackendHeartbeatIdleMs`, the loop issues a **synthetic FC03 qty=1 read** at `BackendHeartbeatProbeAddress` (default 0 = `V0`, valid on DL205/DL260). FC08 (Diagnostics) is **not** supported by the DL260 ECOM, so the probe must be a real register read.
- The probe targets the unit ID of the most recent upstream request, so it reaches the same Modbus unit the real clients successfully use.
- The probe takes a real proxy TxId and a `CorrelationMap` entry flagged `InFlightRequest.IsHeartbeat`. It is enqueued straight onto the backend outbound channel, **bypassing** the read-coalescing and response-cache paths.
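The tick arithmetic and the probe frame described in the list above can be sketched as follows. This is an illustrative Python sketch, not the proxy's C# loop; the function names and the millisecond clock arguments are assumptions. The frame layout is the standard Modbus TCP one: a 7-byte MBAP header followed by a 5-byte FC03 PDU.

```python
import struct

MIN_TICK_MS = 500  # floor on the tick period, per the text above


def tick_interval_ms(idle_ms: int) -> int:
    """Tick at a quarter of the idle threshold, but never faster than the floor."""
    return max(idle_ms // 4, MIN_TICK_MS)


def should_probe(now_ms: int, last_activity_ms: int, idle_ms: int) -> bool:
    """Probe only when the socket has been idle longer than the threshold."""
    return now_ms - last_activity_ms > idle_ms


def build_fc03_probe(tx_id: int, unit_id: int, address: int = 0) -> bytes:
    """Build a Modbus TCP 'read holding registers' request for one register.

    MBAP header: transaction id, protocol id (always 0), length of the
    bytes that follow (unit id + 5-byte PDU = 6), unit id.
    PDU: function code 0x03, starting address, quantity 1.
    """
    return struct.pack(">HHHBBHH", tx_id, 0, 6, unit_id, 0x03, address, 1)


print(tick_interval_ms(60_000))              # 15000
print(tick_interval_ms(1_000))               # 500
print(build_fc03_probe(0x1234, 1).hex())     # 123400000006010300000001
```

Since any send or receive moves the last-activity timestamp forward, `should_probe` stays false while real traffic is flowing, so the probe costs nothing on a busy link.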

### Heartbeat response

The backend reader recognises an `IsHeartbeat` correlation entry, refreshes the idle timer (already done on frame receipt), frees the TxId, and **drops the payload** — no rewriter, no cache write-through, no fan-out, and no round-trip-EWMA sample (the synthetic probe never pollutes the client-facing RTT metric).
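A small Python sketch of that branch in the reader, under stated assumptions: the `InFlight` and `Reader` types, the `rtt_samples` and `forwarded` lists, and the returned status strings are all hypothetical stand-ins for the proxy's real C# structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InFlight:
    tx_id: int
    is_heartbeat: bool = False


@dataclass
class Reader:
    correlation: Dict[int, InFlight] = field(default_factory=dict)
    rtt_samples: List[float] = field(default_factory=list)
    forwarded: List[bytes] = field(default_factory=list)

    def on_response(self, tx_id: int, payload: bytes, rtt_ms: float) -> str:
        # Removing the entry is what frees the TxId for reuse.
        entry = self.correlation.pop(tx_id, None)
        if entry is None:
            return "unknown"
        if entry.is_heartbeat:
            # Synthetic probe: no RTT sample, no fan-out, payload discarded.
            return "dropped"
        self.rtt_samples.append(rtt_ms)
        self.forwarded.append(payload)
        return "forwarded"


reader = Reader()
reader.correlation[7] = InFlight(7, is_heartbeat=True)
reader.correlation[8] = InFlight(8)
print(reader.on_response(7, b"\x03\x02\x00\x00", 4.0))  # dropped
print(reader.on_response(8, b"\x03\x02\x00\x2a", 5.0))  # forwarded
print(reader.rtt_samples)                               # [5.0]
```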

### Heartbeat timeout

If a probe is not answered within `BackendRequestTimeoutMs`, the per-request timeout watchdog ([`./ConnectionModel.md`](./ConnectionModel.md) → "Per-Request Timeout Watchdog") finds the stale `IsHeartbeat` entry and — instead of dispatching a 0x0B exception to a (non-existent) upstream party — calls `TearDownBackendAsync`, cascading every attached upstream pipe.

This is a **proactive** version of the existing backend-disconnect cascade: the dead path is found during idle instead of corrupting the next real client request. Reconnect stays lazy — the heartbeat keeps an *existing* backend warm, it never resurrects a dead one and adds no eager-reconnect spinner. Clients reconnect on their next request, exactly as for an organic cascade.
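The watchdog decision can be sketched in Python as below. This is an assumption-laden illustration rather than the proxy's code: `Pending`, `sweep`, and the action tuples are hypothetical, and treating an entry as stale once its age reaches the timeout is one reasonable reading of "not answered within".

```python
from dataclasses import dataclass
from typing import List, Tuple

GATEWAY_TARGET_FAILED_TO_RESPOND = 0x0B  # Modbus exception code named in the text


@dataclass
class Pending:
    tx_id: int
    sent_at_ms: int
    is_heartbeat: bool = False


def sweep(pending: List[Pending], now_ms: int, timeout_ms: int) -> List[Tuple]:
    """Return the actions the watchdog would take for stale entries."""
    actions: List[Tuple] = []
    for entry in pending:
        if now_ms - entry.sent_at_ms < timeout_ms:
            continue  # still within its deadline
        if entry.is_heartbeat:
            # Nobody is waiting for a probe reply, so there is no one to send
            # an exception to. The path is presumed dead: tear the backend
            # down, which supersedes any per-request exceptions.
            return [("teardown_backend",)]
        actions.append(("send_exception", entry.tx_id, GATEWAY_TARGET_FAILED_TO_RESPOND))
    return actions


print(sweep([Pending(1, 0)], now_ms=3000, timeout_ms=3000))
# [('send_exception', 1, 11)]
print(sweep([Pending(1, 0), Pending(2, 0, True)], now_ms=3000, timeout_ms=3000))
# [('teardown_backend',)]
```

Note that the sketch never schedules a reconnect; in keeping with the lazy-reconnect behaviour above, that is left to the next real client request.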

`BackendHeartbeatIdleMs` must be greater than `BackendRequestTimeoutMs` (enforced by the reload validator) — a heartbeat interval at or below the request timeout would fire continuously.
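The constraint above amounts to a one-line check. A minimal Python sketch, assuming a hypothetical `validate_keepalive` function that returns a list of error strings (the real reload validator's shape and messages are not shown in this document):

```python
from typing import List


def validate_keepalive(heartbeat_idle_ms: int, request_timeout_ms: int) -> List[str]:
    """Reject settings where the idle threshold does not exceed the request timeout."""
    errors: List[str] = []
    if heartbeat_idle_ms <= request_timeout_ms:
        errors.append(
            "BackendHeartbeatIdleMs must be greater than BackendRequestTimeoutMs"
        )
    return errors


print(validate_keepalive(60_000, 3_000))  # []
print(validate_keepalive(3_000, 3_000))   # one error: equal is not enough
```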

## Upstream: OS TCP keepalive

`SocketKeepalive.Apply` is also called on each accepted client `Socket` in the `UpstreamPipe` constructor. This is the **only** standard keepalive available on the upstream side: Modbus TCP is strictly client-initiated, so the proxy — a server to its clients — cannot send an unsolicited application heartbeat to a client. OS keepalive lets the proxy's TCP stack probe each client; a dead or half-open client then faults the pipe's read loop, the pipe is disposed, and its correlation / coalescing slots are freed instead of leaking until the proxy next tries to write.

## Counters

Per-PLC, exposed on the status page (see [`../Operations/StatusPage.md`](../Operations/StatusPage.md)):

| Counter | Meaning |
|---------|---------|
| `backendHeartbeatsSent` | Heartbeat probes issued on idle backend sockets. |
| `backendHeartbeatsFailed` | Probes not answered within `BackendRequestTimeoutMs`. |
| `backendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat (event count — distinct from `disconnectCascades`, which counts cascaded pipes). |

## Log events

`mbproxy.keepalive.*` — see [`../Reference/LogEvents.md`](../Reference/LogEvents.md):

- `mbproxy.keepalive.heartbeat.sent` (Debug)
- `mbproxy.keepalive.heartbeat.timeout` (Warning)
- `mbproxy.keepalive.backend.idle_disconnect` (Information)

## Hot reload

`Connection.Keepalive` is read through a live accessor (`Func<KeepaliveOptions>`), so a reload of `appsettings.json` propagates without a listener restart:

- The **heartbeat** interval and probe address are re-read on every loop tick.
- The **TCP socket options** are applied at connect/accept time, so a reload affects only sockets opened after the change.
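The live-accessor pattern behind the two bullets above can be illustrated with a short Python sketch. The proxy itself uses a C# `Func<KeepaliveOptions>`; here `OptionsHolder`, `heartbeat_tick`, the snake_case field names, and the numeric values are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class KeepaliveOptions:
    backend_heartbeat_idle_ms: int
    backend_heartbeat_probe_address: int = 0


class OptionsHolder:
    """Holds the current options; reload() swaps in a new immutable snapshot."""

    def __init__(self, initial: KeepaliveOptions) -> None:
        self._current = initial

    def get(self) -> KeepaliveOptions:
        return self._current

    def reload(self, updated: KeepaliveOptions) -> None:
        self._current = updated


def heartbeat_tick(get_options: Callable[[], KeepaliveOptions]) -> int:
    """One loop tick: re-read options through the accessor each time."""
    return get_options().backend_heartbeat_idle_ms


holder = OptionsHolder(KeepaliveOptions(backend_heartbeat_idle_ms=60_000))
before = heartbeat_tick(holder.get)
holder.reload(KeepaliveOptions(backend_heartbeat_idle_ms=30_000))
after = heartbeat_tick(holder.get)
print(before, after)  # 60000 30000
```

The loop is handed the accessor rather than a copy of the options, which is why the next tick sees the reloaded value without the loop being restarted. Socket options behave differently because they are copied onto the socket once, at connect or accept time.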

## Related documentation

- [`./ConnectionModel.md`](./ConnectionModel.md) — backend socket lifecycle, the timeout watchdog, and the disconnect cascade this feature hooks into
- [`../Operations/Configuration.md`](../Operations/Configuration.md) — the `Connection.Keepalive` option block
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — keepalive counters
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.keepalive.*` events
- [`../Reference/dl205.md`](../Reference/dl205.md) — the device "no keepalive" oddity and FC03/FC08 support
@@ -135,6 +135,16 @@ These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is
| `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. |
| `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. |

### Keepalive counters

These fields describe the backend keepalive heartbeat. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).

| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.backendHeartbeatsSent` | `long` | `CounterSnapshot.BackendHeartbeatsSent` | Synthetic FC03 heartbeat probes issued on this PLC's idle backend socket. |
| `backend.backendHeartbeatsFailed` | `long` | `CounterSnapshot.BackendHeartbeatsFailed` | Heartbeat probes not answered within `BackendRequestTimeoutMs`. Each failure tears the backend down. |
| `backend.backendIdleDisconnects` | `long` | `CounterSnapshot.BackendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat — an event count, distinct from `disconnectCascades` (which counts cascaded pipes). Sustained growth means a PLC is repeatedly going dark while idle. |

### Bytes

| JSON path | Type | Source | Meaning |
@@ -224,7 +234,10 @@ A representative two-PLC deployment, ~2 hours into a run:
"cacheMissCount": 88691,
"cacheInvalidations": 6203,
"cacheEntryCount": 47,
- "cacheBytes": 18512
+ "cacheBytes": 18512,
+ "backendHeartbeatsSent": 412,
+ "backendHeartbeatsFailed": 0,
+ "backendIdleDisconnects": 0
},
"bytes": {
"upstreamIn": 4108290,
@@ -267,7 +280,10 @@ A representative two-PLC deployment, ~2 hours into a run:
"cacheMissCount": 0,
"cacheInvalidations": 0,
"cacheEntryCount": 0,
- "cacheBytes": 0
+ "cacheBytes": 0,
+ "backendHeartbeatsSent": 0,
+ "backendHeartbeatsFailed": 0,
+ "backendIdleDisconnects": 0
},
"bytes": { "upstreamIn": 0, "upstreamOut": 0 }
}
@@ -282,10 +298,10 @@ The HTML renderer is `StatusHtmlRenderer.Render(StatusResponse)` in `src/Mbproxy
Structure:

1. **Header summary** — version, formatted uptime (`Nh MMm SSs`), `bound/configured` listener tally, last reload timestamp, reload count with a `(N rejected)` suffix when applicable.
- 2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell.
+ 2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell, keepalive cell.
3. **State cell error detail** — when `state == "recovering"`, the cell also shows `lastBindError` and `(attempt N)` in a small red span.

- The coalescing and cache cells each render as `<pct>% (<hits>)`. When neither has been exercised (`hit + miss == 0`), the cell renders an em-dash to keep the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
+ The coalescing and cache cells each render as `<pct>% (<hits>)`. When neither has been exercised (`hit + miss == 0`), the cell renders an em-dash to keep the column narrow. The keepalive cell shows the heartbeat-sent count, with `(fail N, idle-disc N)` appended only when either is non-zero. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
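The keepalive cell rule can be sketched in a few lines of Python. This is only an illustration of the stated rule, not `StatusHtmlRenderer`: the function name and the exact spacing of the output are assumptions.

```python
def render_keepalive_cell(sent: int, failed: int, idle_disconnects: int) -> str:
    """Show the sent count; append failure detail only when there is any."""
    if failed == 0 and idle_disconnects == 0:
        return str(sent)
    return f"{sent} (fail {failed}, idle-disc {idle_disconnects})"


print(render_keepalive_cell(412, 0, 0))  # 412
print(render_keepalive_cell(412, 2, 1))  # 412 (fail 2, idle-disc 1)
```

Suppressing the suffix in the healthy case keeps the column narrow, in the same spirit as the placeholder used for unexercised coalescing and cache cells.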

The page does not depend on JavaScript. Refresh is driven entirely by the `<meta http-equiv="refresh" content="5">` tag, so any browser — including text-mode browsers — sees the same view.