Joseph Doherty 0868613890 mbproxy: add keepalive / connection monitoring

The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-15 09:40:54 -04:00

Keepalive & Connection Monitoring

The DL205/DL260 ECOM does not emit TCP keepalives (see ../Reference/dl205.md → "Behavioural Oddities"). Middleboxes such as switches, firewalls, and NAT devices can therefore silently drop an idle socket, typically after 2–5 minutes. The proxy holds one persistent backend socket per PLC (./ConnectionModel.md) plus many accepted upstream client sockets, so it needs its own keepalive on both sides.

Keepalive is enabled by default and is governed by the Connection.Keepalive option block (see ../Operations/Configuration.md). Set Connection.Keepalive.Enabled = false to restore pre-keepalive behaviour exactly.

Two mechanisms

| Mechanism | Scope | Detects |
| --- | --- | --- |
| OS TCP keepalive (`SO_KEEPALIVE`) | Backend socket and accepted upstream sockets | A peer whose TCP stack is gone (host down, cable pulled, half-open socket). |
| Application heartbeat (FC03 probe) | Backend socket only | The above plus a middlebox idle-drop and an ECOM that is connected-but-not-answering Modbus. |

The application heartbeat is the load-bearing mechanism; OS keepalive is a cheap belt-and-suspenders measure that also covers the window between heartbeat ticks.

Backend: OS TCP keepalive

SocketKeepalive.Apply sets SO_KEEPALIVE plus the idle-time / probe-interval / probe-count tunables on the backend Socket right after it is created in PlcMultiplexer.EnsureBackendConnectedAsync. The tunables come from Connection.Keepalive.Tcp*. Socket options are applied at connect time — a hot-reload of the Tcp* values only affects backend sockets opened after the change.
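As an illustration of what applying these options involves at the socket level, here is a minimal Python sketch. It is not the proxy's `SocketKeepalive.Apply`; the function name `apply_keepalive` and the values 30 / 5 / 3 are assumptions chosen for the example, and the tunable option names vary by operating system, so the sketch sets only those the platform defines.

```python
import socket

def apply_keepalive(sock, idle_s, interval_s, count):
    """Enable SO_KEEPALIVE and, where the platform exposes them, its tunables."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tunable names are platform-specific; skip any this OS does not define.
    for name, value in (("TCP_KEEPIDLE", idle_s),     # idle time before first probe
                        ("TCP_KEEPINTVL", interval_s),  # gap between probes
                        ("TCP_KEEPCNT", count)):        # unanswered probes before reset
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)

# Illustrative values only; the real ones come from configuration.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
apply_keepalive(sock, idle_s=30, interval_s=5, count=3)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Because the options live on the socket object, changing the configured values later has no effect on a socket that already exists, which is why a reload only reaches sockets opened afterwards.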

Backend: application heartbeat

A per-PlcMultiplexer background loop (RunBackendHeartbeatAsync) is started alongside the backend writer and reader on each successful connect, under the same _backendCts, and dies with them on teardown.

- The multiplexer tracks _lastBackendActivityUtc, updated by both the writer (on every send) and the reader (on every received frame). Real traffic in either direction therefore suppresses the heartbeat.
- Each tick (a quarter of BackendHeartbeatIdleMs, floored at 500 ms), if the socket has been idle longer than BackendHeartbeatIdleMs, the loop issues a synthetic FC03 qty=1 read at BackendHeartbeatProbeAddress (default 0 = V0, valid on DL205/DL260). FC08 (Diagnostics) is not supported by the DL260 ECOM, so the probe must be a real register read.
- The probe targets the unit ID of the most recent upstream request, so it reaches the same Modbus unit the real clients successfully use.
- The probe takes a real proxy TxId and a CorrelationMap entry flagged InFlightRequest.IsHeartbeat. It is enqueued straight onto the backend outbound channel, bypassing the read-coalescing and response-cache paths.
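The tick arithmetic, the idle test, and the probe frame described above can be sketched in a few lines of Python. This is a sketch under assumptions, not the proxy's code: the function names are hypothetical, and the frame layout is the standard Modbus TCP encoding (MBAP header followed by a Read Holding Registers PDU).

```python
import struct

def heartbeat_tick_ms(idle_ms):
    """Tick period: a quarter of the idle threshold, never below 500 ms."""
    return max(idle_ms // 4, 500)

def should_probe(now_ms, last_activity_ms, idle_ms):
    """Probe only when the socket has been idle longer than the threshold."""
    return now_ms - last_activity_ms > idle_ms

def build_fc03_probe(tx_id, unit_id, address):
    """Modbus TCP frame for FC03 (Read Holding Registers) with quantity 1.

    MBAP header: transaction id, protocol id (always 0), length of the
    bytes that follow (unit id + 5-byte PDU = 6), unit id.
    PDU: function code 0x03, start address, quantity.
    """
    return struct.pack(">HHHBBHH", tx_id, 0, 6, unit_id, 0x03, address, 1)

probe = build_fc03_probe(tx_id=0x1234, unit_id=1, address=0)
```

With these definitions, an idle threshold of 30000 ms gives a 7500 ms tick, while a threshold of 1000 ms is clamped to the 500 ms floor.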

Heartbeat response

The backend reader recognises an IsHeartbeat correlation entry, refreshes the idle timer (already done on frame receipt), frees the TxId, and drops the payload — no rewriter, no cache write-through, no fan-out, and no round-trip-EWMA sample (the synthetic probe never pollutes the client-facing RTT metric).
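The reader-side branch can be pictured with the following Python sketch. It is a simplified model, not the proxy's reader: the `Reader` and `InFlight` types and their fields are hypothetical stand-ins for the CorrelationMap and InFlightRequest mentioned above, and "delivery" is reduced to appending to a list.

```python
from dataclasses import dataclass, field

@dataclass
class InFlight:
    is_heartbeat: bool
    sent_ms: int

@dataclass
class Reader:
    correlation: dict = field(default_factory=dict)   # proxy TxId -> InFlight
    last_activity_ms: int = 0
    rtt_samples: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

    def on_frame(self, tx_id, payload, now_ms):
        # Any received frame counts as backend activity.
        self.last_activity_ms = now_ms
        # Removing the entry frees the TxId for reuse.
        entry = self.correlation.pop(tx_id, None)
        if entry is None:
            return
        if entry.is_heartbeat:
            # Synthetic probe: drop the payload, record no RTT sample.
            return
        self.rtt_samples.append(now_ms - entry.sent_ms)
        self.delivered.append(payload)

reader = Reader()
reader.correlation[1] = InFlight(is_heartbeat=True, sent_ms=90)
reader.correlation[2] = InFlight(is_heartbeat=False, sent_ms=120)
reader.on_frame(1, b"probe-reply", now_ms=100)
reader.on_frame(2, b"real-reply", now_ms=150)
```

After both frames, only the real reply has been delivered and only its round trip has been sampled; the heartbeat reply contributed nothing except the refreshed activity timestamp.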

Heartbeat timeout

If a probe is not answered within BackendRequestTimeoutMs, the per-request timeout watchdog (./ConnectionModel.md → "Per-Request Timeout Watchdog") finds the stale IsHeartbeat entry. A heartbeat has no upstream party to receive a 0x0B exception, so instead of dispatching one the watchdog calls TearDownBackendAsync, which cascades the disconnect to every attached upstream pipe.
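The two outcomes of the watchdog sweep can be sketched as follows in Python. This is an illustrative model only: `sweep`, `build_exception_frame`, and the tuple layout of the in-flight table are hypothetical, and the sketch reuses the same TxId in the exception frame for brevity, whereas a real proxy would answer with the client's own transaction ID.

```python
import struct

GATEWAY_TARGET_NO_RESPONSE = 0x0B  # Modbus exception code 0x0B

def build_exception_frame(tx_id, unit_id, function_code, exception_code):
    """Modbus TCP exception response: function code with the high bit set,
    followed by the exception code. MBAP length = unit id + 2 PDU bytes = 3."""
    return struct.pack(">HHHBBB", tx_id, 0, 3, unit_id,
                       function_code | 0x80, exception_code)

def sweep(in_flight, now_ms, timeout_ms):
    """in_flight maps tx_id -> (sent_ms, is_heartbeat, unit_id, function_code).

    Returns ("teardown", None) as soon as a stale heartbeat is found,
    otherwise ("exceptions", frames) for the stale real requests.
    """
    frames = []
    for tx_id, (sent_ms, is_heartbeat, unit_id, fc) in sorted(in_flight.items()):
        if now_ms - sent_ms <= timeout_ms:
            continue  # still within its deadline
        if is_heartbeat:
            return ("teardown", None)  # nobody to notify; the path is dead
        frames.append(build_exception_frame(
            tx_id, unit_id, fc, GATEWAY_TARGET_NO_RESPONSE))
    return ("exceptions", frames)

stale_real = sweep({1: (0, False, 1, 0x03)}, now_ms=1000, timeout_ms=500)
stale_probe = sweep({1: (0, True, 1, 0x03)}, now_ms=1000, timeout_ms=500)
```

The design point the sketch tries to capture is that a stale heartbeat is treated as evidence about the whole backend path, not about a single request, so it short-circuits to a teardown.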

This is a proactive version of the existing backend-disconnect cascade: the dead path is found during idle instead of corrupting the next real client request. Reconnect stays lazy: the heartbeat keeps an existing backend warm, but it never resurrects a dead one and adds no eager-reconnect spinner. Clients reconnect on their next request, exactly as for an organic cascade.

BackendHeartbeatIdleMs must be greater than BackendRequestTimeoutMs (enforced by the reload validator) — a heartbeat interval at or below the request timeout would fire continuously.
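The constraint amounts to a one-line comparison at validation time. The Python sketch below shows the shape of such a check; the function name, the error wording, and the example values are assumptions, not the reload validator's actual code.

```python
def validate_heartbeat(idle_ms, request_timeout_ms):
    """Return a list of validation errors (empty when the settings are valid)."""
    errors = []
    if idle_ms <= request_timeout_ms:
        errors.append(
            "BackendHeartbeatIdleMs (%d) must be greater than "
            "BackendRequestTimeoutMs (%d)" % (idle_ms, request_timeout_ms)
        )
    return errors

ok = validate_heartbeat(idle_ms=30000, request_timeout_ms=3000)
bad = validate_heartbeat(idle_ms=3000, request_timeout_ms=3000)
```

Note that equal values are rejected as well, matching the "at or below" wording above.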

Upstream: OS TCP keepalive

SocketKeepalive.Apply is also called on each accepted client Socket in the UpstreamPipe constructor. This is the only standard keepalive available on the upstream side: Modbus TCP is strictly client-initiated, so the proxy — a server to its clients — cannot send an unsolicited application heartbeat to a client. OS keepalive lets the proxy's TCP stack probe each client; a dead or half-open client then faults the pipe's read loop, the pipe is disposed, and its correlation / coalescing slots are freed instead of leaking until the proxy next tries to write.

Counters

Per-PLC, exposed on the status page (see ../Operations/StatusPage.md):

| Counter | Meaning |
| --- | --- |
| `backendHeartbeatsSent` | Heartbeat probes issued on idle backend sockets. |
| `backendHeartbeatsFailed` | Probes not answered within `BackendRequestTimeoutMs`. |
| `backendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat (an event count, distinct from `disconnectCascades`, which counts cascaded pipes). |

Log events

mbproxy.keepalive.* — see ../Reference/LogEvents.md:

- mbproxy.keepalive.heartbeat.sent (Debug)
- mbproxy.keepalive.heartbeat.timeout (Warning)
- mbproxy.keepalive.backend.idle_disconnect (Information)

Hot reload

Connection.Keepalive is read through a live accessor (Func<KeepaliveOptions>), so a reload of appsettings.json propagates without a listener restart:

- The heartbeat interval and probe address are re-read on every loop tick.
- The TCP socket options are applied at connect/accept time, so a reload affects only sockets opened after the change.
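The live-accessor pattern behind this behaviour can be sketched in Python as a callable that always returns the current options object. The sketch is illustrative: `OptionsStore`, `heartbeat_tick`, and the field names are hypothetical, standing in for the `Func<KeepaliveOptions>` accessor and the heartbeat loop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeepaliveOptions:
    enabled: bool
    idle_ms: int
    probe_address: int

class OptionsStore:
    """Holds the current options; reload() swaps in a new immutable snapshot."""
    def __init__(self, initial):
        self._current = initial

    def reload(self, new_options):
        self._current = new_options

    def get(self):
        return self._current

def heartbeat_tick(get_options):
    # The loop holds the accessor, not a snapshot, so each tick sees new values.
    opts = get_options()
    return (opts.idle_ms, opts.probe_address)

store = OptionsStore(KeepaliveOptions(enabled=True, idle_ms=30000, probe_address=0))
before = heartbeat_tick(store.get)
store.reload(KeepaliveOptions(enabled=True, idle_ms=60000, probe_address=100))
after = heartbeat_tick(store.get)
```

Passing the accessor rather than the options object is what lets a running loop pick up a reload on its next tick, while anything that copied values at construction time (such as socket options) keeps the old ones.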

Related documentation

- ./ConnectionModel.md: backend socket lifecycle, the timeout watchdog, and the disconnect cascade this feature hooks into
- ../Operations/Configuration.md: the Connection.Keepalive option block
- ../Operations/StatusPage.md: keepalive counters
- ../Reference/LogEvents.md: mbproxy.keepalive.* events
- ../Reference/dl205.md: the device "no keepalive" oddity and FC03/FC08 support
