Joseph Doherty 0868613890 mbproxy: add keepalive / connection monitoring
The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:40:54 -04:00


Keepalive & Connection Monitoring

The DL205/DL260 ECOM does not emit TCP keepalives (see ../Reference/dl205.md → "Behavioural Oddities"). An idle socket can be silently dropped by a middlebox (switch, firewall, NAT), typically after 2 to 5 minutes. The proxy holds one persistent backend socket per PLC (./ConnectionModel.md) plus many accepted upstream client sockets, so it needs its own keepalive on both sides.

Keepalive is enabled by default and is governed by the Connection.Keepalive option block (see ../Operations/Configuration.md). Set Connection.Keepalive.Enabled = false to restore pre-keepalive behaviour exactly.

Two mechanisms

| Mechanism | Scope | Detects |
|---|---|---|
| OS TCP keepalive (SO_KEEPALIVE) | Backend socket and accepted upstream sockets | A peer whose TCP stack is gone (host down, cable pulled, half-open socket). |
| Application heartbeat (FC03 probe) | Backend socket only | The above plus a middlebox idle-drop and an ECOM that is connected-but-not-answering Modbus. |

The application heartbeat is the load-bearing mechanism; OS keepalive is a cheap belt-and-suspenders that also covers the window between heartbeat ticks.

Backend: OS TCP keepalive

SocketKeepalive.Apply sets SO_KEEPALIVE plus the idle-time / probe-interval / probe-count tunables on the backend Socket right after it is created in PlcMultiplexer.EnsureBackendConnectedAsync. The tunables come from Connection.Keepalive.Tcp*. Socket options are applied at connect time — a hot-reload of the Tcp* values only affects backend sockets opened after the change.
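
To make the three tunables concrete, here is a minimal Python sketch of enabling OS keepalive on a socket. It is an illustration only, not the SocketKeepalive.Apply source (the proxy is a .NET service). The helper name apply_keepalive and its parameters are hypothetical, and the TCP_KEEP* constants are platform-dependent, so each is applied only where the running platform defines it.

```python
import socket


def apply_keepalive(sock, idle_s, interval_s, count):
    """Enable SO_KEEPALIVE and, where available, the three keepalive tunables.

    Hypothetical helper for illustration; names and units are assumptions.
    Returns the names of the tunables that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle time before the first probe, gap between probes, and number of
    # unanswered probes before the OS declares the connection dead.
    tunables = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", count),
    )
    applied = []
    for name, value in tunables:
        opt = getattr(socket, name, None)  # not every OS defines all three
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied.append(name)
    return applied
```

Because the options are set on the socket object itself, calling such a helper again with new values would not touch sockets that are already open, which matches the connect-time behaviour described above.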

Backend: application heartbeat

A per-PlcMultiplexer background loop (RunBackendHeartbeatAsync) is started alongside the backend writer and reader on each successful connect, under the same _backendCts, and dies with them on teardown.

  • The multiplexer tracks _lastBackendActivityUtc, updated by both the writer (on every send) and the reader (on every received frame). Real traffic in either direction therefore suppresses the heartbeat.
  • Each tick (a quarter of BackendHeartbeatIdleMs, floored at 500 ms), if the socket has been idle longer than BackendHeartbeatIdleMs, the loop issues a synthetic FC03 qty=1 read at BackendHeartbeatProbeAddress (default 0 = V0, valid on DL205/DL260). FC08 (Diagnostics) is not supported by the DL260 ECOM, so the probe must be a real register read.
  • The probe targets the unit ID of the most recent upstream request, so it reaches the same Modbus unit the real clients successfully use.
  • The probe takes a real proxy TxId and a CorrelationMap entry flagged InFlightRequest.IsHeartbeat. It is enqueued straight onto the backend outbound channel, bypassing the read-coalescing and response-cache paths.
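
The tick arithmetic and the probe frame described in the list above can be sketched as follows. This is a minimal Python illustration, not the RunBackendHeartbeatAsync source; the function names are hypothetical, and "floored at 500 ms" is read here as a lower bound of 500 ms on the tick interval. The frame layout follows the standard Modbus TCP ADU (MBAP header plus FC03 PDU).

```python
import struct


def tick_interval_ms(idle_ms):
    # A quarter of the idle threshold, but never below 500 ms.
    return max(idle_ms // 4, 500)


def should_probe(now_ms, last_activity_ms, idle_ms):
    # Probe only when the socket has been idle strictly longer than the threshold.
    return (now_ms - last_activity_ms) > idle_ms


def build_fc03_probe(tx_id, unit_id, address=0):
    """Build a Modbus TCP request: FC03 (Read Holding Registers), quantity 1."""
    protocol_id = 0       # always 0 for Modbus
    length = 6            # bytes after the length field: unit id + 5-byte PDU
    function_code = 0x03
    quantity = 1
    # MBAP header (tx id, protocol id, length, unit id) followed by the PDU
    # (function code, start address, quantity), all big-endian.
    return struct.pack(">HHHBBHH", tx_id, protocol_id, length,
                       unit_id, function_code, address, quantity)
```

For example, `build_fc03_probe(0x1234, 1)` yields the 12 bytes `12 34 00 00 00 06 01 03 00 00 00 01`, and an idle threshold of 30000 ms gives a tick every 7500 ms.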

Heartbeat response

The backend reader recognises an IsHeartbeat correlation entry, refreshes the idle timer (already done on frame receipt), frees the TxId, and drops the payload — no rewriter, no cache write-through, no fan-out, and no round-trip-EWMA sample (the synthetic probe never pollutes the client-facing RTT metric).
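
A minimal Python sketch of that reader branch, under stated assumptions: the correlation map is modelled as a plain dict keyed by TxId, the InFlight record and on_backend_frame function are hypothetical stand-ins for the real types, and the RTT metric is reduced to a list of samples.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    sent_at_ms: int
    is_heartbeat: bool = False


def on_backend_frame(correlation, rtt_samples, tx_id, payload, now_ms):
    """Return the payload to forward upstream, or None if it must be dropped."""
    entry = correlation.pop(tx_id, None)  # removing the entry frees the TxId
    if entry is None:
        return None                       # unknown or already expired TxId
    if entry.is_heartbeat:
        return None                       # probe reply: no RTT sample, nothing forwarded
    rtt_samples.append(now_ms - entry.sent_at_ms)
    return payload
```

The heartbeat branch returns before the RTT sample is recorded, which is what keeps synthetic probes out of the client-facing metric.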

Heartbeat timeout

If a probe is not answered within BackendRequestTimeoutMs, the per-request timeout watchdog (./ConnectionModel.md → "Per-Request Timeout Watchdog") finds the stale IsHeartbeat entry and — instead of dispatching a 0x0B exception to a (non-existent) upstream party — calls TearDownBackendAsync, cascading every attached upstream pipe.

This is a proactive version of the existing backend-disconnect cascade: the dead path is found during idle instead of corrupting the next real client request. Reconnect stays lazy — the heartbeat keeps an existing backend warm, it never resurrects a dead one and adds no eager-reconnect spinner. Clients reconnect on their next request, exactly as for an organic cascade.
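
The watchdog's two outcomes can be sketched in Python as below. This is an illustration under assumptions, not the proxy's watchdog: the Pending record and sweep function are hypothetical, and the result is returned as a tuple rather than acted on. The exception PDU follows the Modbus convention of the request function code with the high bit set, followed by the exception code (0x0B, Gateway Target Device Failed to Respond).

```python
from dataclasses import dataclass

GATEWAY_TARGET_FAILED_TO_RESPOND = 0x0B


@dataclass
class Pending:
    tx_id: int
    sent_at_ms: int
    function_code: int
    is_heartbeat: bool = False


def sweep(pending, now_ms, timeout_ms):
    """Return ("teardown", None) or ("exceptions", [(tx_id, pdu), ...])."""
    stale = [p for p in pending if now_ms - p.sent_at_ms > timeout_ms]
    # A stale heartbeat has no upstream party to notify: the backend path
    # itself is considered dead, so the whole connection is torn down.
    if any(p.is_heartbeat for p in stale):
        return ("teardown", None)
    # Stale client requests each get a 0x0B exception response instead.
    pdus = [
        (p.tx_id, bytes([p.function_code | 0x80, GATEWAY_TARGET_FAILED_TO_RESPOND]))
        for p in stale
    ]
    return ("exceptions", pdus)
```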

BackendHeartbeatIdleMs must be greater than BackendRequestTimeoutMs (enforced by the reload validator) — a heartbeat interval at or below the request timeout would fire continuously.
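
As a sketch of the rule the reload validator enforces, assuming a hypothetical function name and error type (the real validator's shape is not shown in this document):

```python
def validate_heartbeat_timing(idle_ms, request_timeout_ms):
    """Reject settings where the idle threshold does not exceed the request timeout."""
    if idle_ms <= request_timeout_ms:
        raise ValueError(
            "BackendHeartbeatIdleMs (%d) must be greater than "
            "BackendRequestTimeoutMs (%d)" % (idle_ms, request_timeout_ms)
        )
```

Note that equal values are rejected too, since the requirement is strictly greater than.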

Upstream: OS TCP keepalive

SocketKeepalive.Apply is also called on each accepted client Socket in the UpstreamPipe constructor. This is the only standard keepalive available on the upstream side: Modbus TCP is strictly client-initiated, so the proxy — a server to its clients — cannot send an unsolicited application heartbeat to a client. OS keepalive lets the proxy's TCP stack probe each client; a dead or half-open client then faults the pipe's read loop, the pipe is disposed, and its correlation / coalescing slots are freed instead of leaking until the proxy next tries to write.

Counters

Per-PLC, exposed on the status page (see ../Operations/StatusPage.md):

| Counter | Meaning |
|---|---|
| backendHeartbeatsSent | Heartbeat probes issued on idle backend sockets. |
| backendHeartbeatsFailed | Probes not answered within BackendRequestTimeoutMs. |
| backendIdleDisconnects | Backend teardowns triggered by a failed heartbeat (event count, distinct from disconnectCascades, which counts cascaded pipes). |

Log events

mbproxy.keepalive.* — see ../Reference/LogEvents.md:

  • mbproxy.keepalive.heartbeat.sent (Debug)
  • mbproxy.keepalive.heartbeat.timeout (Warning)
  • mbproxy.keepalive.backend.idle_disconnect (Information)

Hot reload

Connection.Keepalive is read through a live accessor (Func<KeepaliveOptions>), so a reload of appsettings.json propagates without a listener restart:

  • The heartbeat interval and probe address are re-read on every loop tick.
  • The TCP socket options are applied at connect/accept time, so a reload affects only sockets opened after the change.
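
The live-accessor pattern can be sketched in Python as follows. This is an analogy to Func<KeepaliveOptions>, not the proxy's code: OptionsHolder and plan_tick are hypothetical names, and the default values shown (other than probe address 0) are placeholders rather than the proxy's actual defaults.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class KeepaliveOptions:
    enabled: bool = True
    heartbeat_idle_ms: int = 30_000   # placeholder default, an assumption
    probe_address: int = 0


class OptionsHolder:
    """Holds the current immutable snapshot; a reload swaps the whole object."""

    def __init__(self, initial):
        self._current = initial

    def get(self):
        return self._current

    def reload(self, new_options):
        self._current = new_options


def plan_tick(get_options: Callable[[], KeepaliveOptions]):
    # Re-read the options on every tick so a reload takes effect on the next one.
    opts = get_options()
    return (max(opts.heartbeat_idle_ms // 4, 500), opts.probe_address)
```

The loop is handed `holder.get` rather than a snapshot, so after `holder.reload(...)` the next `plan_tick(holder.get)` sees the new interval and probe address without any restart.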