0868613890
The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket can be silently dropped by a middlebox (switch, firewall, NAT) after 2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream sockets, and drive a periodic synthetic FC03 heartbeat on each idle backend socket so a dead path is detected before a real client request hits it. Controlled by Connection.Keepalive (ON by default). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
77 lines
6.4 KiB
Markdown
77 lines
6.4 KiB
Markdown
# Keepalive & Connection Monitoring
|
||
|
||
The DL205/DL260 ECOM does not emit TCP keepalives (see [`../Reference/dl205.md`](../Reference/dl205.md) → "Behavioural Oddities"). An idle socket is silently dropped by middleboxes — switches, firewalls, NAT — typically after 2–5 minutes. The proxy holds one **persistent backend socket per PLC** ([`./ConnectionModel.md`](./ConnectionModel.md)) plus many accepted upstream client sockets, so it needs its own keepalive on both sides.
|
||
|
||
Keepalive is **enabled by default** and is governed by the `Connection.Keepalive` option block (see [`../Operations/Configuration.md`](../Operations/Configuration.md)). Set `Connection.Keepalive.Enabled = false` to restore pre-keepalive behaviour exactly.
|
||
|
||
## Two mechanisms
|
||
|
||
| Mechanism | Scope | Detects |
|
||
|-----------|-------|---------|
|
||
| OS TCP keepalive (`SO_KEEPALIVE`) | Backend socket **and** accepted upstream sockets | A peer whose TCP stack is gone (host down, cable pulled, half-open socket). |
|
||
| Application heartbeat (FC03 probe) | Backend socket only | The above **plus** a middlebox idle-drop and an ECOM that is connected-but-not-answering Modbus. |
|
||
|
||
The application heartbeat is the load-bearing mechanism; OS keepalive is a cheap belt-and-suspenders that also covers the window between heartbeat ticks.
|
||
|
||
## Backend: OS TCP keepalive
|
||
|
||
`SocketKeepalive.Apply` sets `SO_KEEPALIVE` plus the idle-time / probe-interval / probe-count tunables on the backend `Socket` right after it is created in `PlcMultiplexer.EnsureBackendConnectedAsync`. The tunables come from `Connection.Keepalive.Tcp*`. Socket options are applied **at connect time** — a hot-reload of the `Tcp*` values only affects backend sockets opened *after* the change.
|
||
|
||
## Backend: application heartbeat
|
||
|
||
A per-`PlcMultiplexer` background loop (`RunBackendHeartbeatAsync`) is started alongside the backend writer and reader on each successful connect, under the same `_backendCts`, and dies with them on teardown.
|
||
|
||
- The multiplexer tracks `_lastBackendActivityUtc`, updated by **both** the writer (on every send) and the reader (on every received frame). Real traffic in either direction therefore suppresses the heartbeat.
|
||
- Each tick (a quarter of `BackendHeartbeatIdleMs`, floored at 500 ms), if the socket has been idle longer than `BackendHeartbeatIdleMs`, the loop issues a **synthetic FC03 qty=1 read** at `BackendHeartbeatProbeAddress` (default 0 = `V0`, valid on DL205/DL260). FC08 (Diagnostics) is **not** supported by the DL260 ECOM, so the probe must be a real register read.
|
||
- The probe targets the unit ID of the most recent upstream request, so it reaches the same Modbus unit the real clients successfully use.
|
||
- The probe takes a real proxy TxId and a `CorrelationMap` entry flagged `InFlightRequest.IsHeartbeat`. It is enqueued straight onto the backend outbound channel, **bypassing** the read-coalescing and response-cache paths.
|
||
|
||
### Heartbeat response
|
||
|
||
The backend reader recognises an `IsHeartbeat` correlation entry, refreshes the idle timer (already done on frame receipt), frees the TxId, and **drops the payload** — no rewriter, no cache write-through, no fan-out, and no round-trip-EWMA sample (the synthetic probe never pollutes the client-facing RTT metric).
|
||
|
||
### Heartbeat timeout
|
||
|
||
If a probe is not answered within `BackendRequestTimeoutMs`, the per-request timeout watchdog ([`./ConnectionModel.md`](./ConnectionModel.md) → "Per-Request Timeout Watchdog") finds the stale `IsHeartbeat` entry and — instead of dispatching a 0x0B exception to a (non-existent) upstream party — calls `TearDownBackendAsync`, cascading every attached upstream pipe.
|
||
|
||
This is a **proactive** version of the existing backend-disconnect cascade: the dead path is found during idle instead of corrupting the next real client request. Reconnect stays lazy — the heartbeat keeps an *existing* backend warm, it never resurrects a dead one and adds no eager-reconnect spinner. Clients reconnect on their next request, exactly as for an organic cascade.
|
||
|
||
`BackendHeartbeatIdleMs` must be greater than `BackendRequestTimeoutMs` (enforced by the reload validator) — a heartbeat interval at or below the request timeout would fire continuously.
|
||
|
||
## Upstream: OS TCP keepalive
|
||
|
||
`SocketKeepalive.Apply` is also called on each accepted client `Socket` in the `UpstreamPipe` constructor. This is the **only** standard keepalive available on the upstream side: Modbus TCP is strictly client-initiated, so the proxy — a server to its clients — cannot send an unsolicited application heartbeat to a client. OS keepalive lets the proxy's TCP stack probe each client; a dead or half-open client then faults the pipe's read loop, the pipe is disposed, and its correlation / coalescing slots are freed instead of leaking until the proxy next tries to write.
|
||
|
||
## Counters
|
||
|
||
Per-PLC, exposed on the status page (see [`../Operations/StatusPage.md`](../Operations/StatusPage.md)):
|
||
|
||
| Counter | Meaning |
|
||
|---------|---------|
|
||
| `backendHeartbeatsSent` | Heartbeat probes issued on idle backend sockets. |
|
||
| `backendHeartbeatsFailed` | Probes not answered within `BackendRequestTimeoutMs`. |
|
||
| `backendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat (event count — distinct from `disconnectCascades`, which counts cascaded pipes). |
|
||
|
||
## Log events
|
||
|
||
`mbproxy.keepalive.*` — see [`../Reference/LogEvents.md`](../Reference/LogEvents.md):
|
||
|
||
- `mbproxy.keepalive.heartbeat.sent` (Debug)
|
||
- `mbproxy.keepalive.heartbeat.timeout` (Warning)
|
||
- `mbproxy.keepalive.backend.idle_disconnect` (Information)
|
||
|
||
## Hot reload
|
||
|
||
`Connection.Keepalive` is read through a live accessor (`Func<KeepaliveOptions>`), so a reload of `appsettings.json` propagates without a listener restart:
|
||
|
||
- The **heartbeat** interval and probe address are re-read on every loop tick.
|
||
- The **TCP socket options** are applied at connect/accept time, so a reload affects only sockets opened after the change.
|
||
|
||
## Related documentation
|
||
|
||
- [`./ConnectionModel.md`](./ConnectionModel.md) — backend socket lifecycle, the timeout watchdog, and the disconnect cascade this feature hooks into
|
||
- [`../Operations/Configuration.md`](../Operations/Configuration.md) — the `Connection.Keepalive` option block
|
||
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — keepalive counters
|
||
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.keepalive.*` events
|
||
- [`../Reference/dl205.md`](../Reference/dl205.md) — the device "no keepalive" oddity and FC03/FC08 support
|