Phase 3 PR 53 -- Transport reconnect-on-drop + SO_KEEPALIVE (DL260 no-keepalive quirk) #52

Merged
dohertj2 merged 1 commit from phase-3-pr53-dl205-reconnect into v2 2026-04-18 22:35:42 -04:00

Summary

DL260's H2-ECOM100 does NOT send TCP keepalives, so intermediate NAT/firewall middleboxes can silently close idle sockets after 2-5 minutes. Previously, the first SendAsync after such a drop surfaced to the caller as IOException/EndOfStreamException even though the PLC was fine.

Changes in ModbusTcpTransport:

  • On socket-level failure (IOException, SocketException, EndOfStreamException, ObjectDisposedException) with AutoReconnect enabled, tear down the dead socket, reconnect, and retry the PDU exactly once. Single retry only: further failures propagate so driver health reflects reality.
  • Protocol-layer failures (ModbusException codes like 02 Illegal Data Address) are specifically NOT caught, since a retry would just return the same code.
  • Enable SO_KEEPALIVE with 30s idle + 10s interval × 3 retries, so a dead socket is detected in ~60s (30 + 10 × 3) instead of only after the OS default of 2 hours idle before the first probe.
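The keepalive names in the commit (TcpKeepAliveTime, TcpKeepAliveInterval, TcpKeepAliveRetryCount) correspond to .NET socket options. As a rough, language-neutral illustration only, the same timing can be requested from Python's socket module with the Linux-style TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT options. `enable_keepalive` and `detection_time_s` are hypothetical helper names; options the platform does not expose are skipped, mirroring the PR's best-effort `catch {}`.

```python
import socket

# Timing from the PR: 30 s idle before the first probe, one probe every 10 s,
# give up after 3 unanswered probes.
KEEPALIVE_IDLE_S = 30
KEEPALIVE_INTERVAL_S = 10
KEEPALIVE_PROBES = 3


def enable_keepalive(sock: socket.socket) -> None:
    """Turn on SO_KEEPALIVE, then tune timing where the OS allows it (best-effort)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (("TCP_KEEPIDLE", KEEPALIVE_IDLE_S),
                        ("TCP_KEEPINTVL", KEEPALIVE_INTERVAL_S),
                        ("TCP_KEEPCNT", KEEPALIVE_PROBES)):
        option = getattr(socket, name, None)
        if option is None:
            continue  # this platform does not expose the knob at all
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            pass  # knob exists but is rejected: silently keep the OS default


def detection_time_s() -> int:
    """Worst-case time to declare an idle peer dead: idle + interval * probes."""
    return KEEPALIVE_IDLE_S + KEEPALIVE_INTERVAL_S * KEEPALIVE_PROBES
```

With these values `detection_time_s()` gives the ~60 s figure quoted above.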

New ModbusDriverOptions.AutoReconnect (default true). Opting out preserves the old fail-loud semantics for callers that want explicit reconnect control.
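The reconnect-and-retry-once behaviour described above can be sketched in Python, assuming a connection object with hypothetical `transact(pdu)` and `close()` methods. `ReconnectingTransport`, `ModbusProtocolError` and `is_socket_level_failure` are illustrative names, not the C# ModbusTcpTransport API; Python's `OSError` (which covers `ConnectionResetError` and `BrokenPipeError`) and `EOFError` stand in for the .NET socket-level exceptions.

```python
from typing import Callable, Protocol


class Connection(Protocol):
    """Hypothetical connection interface: one request/response round trip."""
    def transact(self, pdu: bytes) -> bytes: ...
    def close(self) -> None: ...


class ModbusProtocolError(Exception):
    """Stand-in for ModbusException: the PLC answered with an exception code."""
    def __init__(self, code: int):
        super().__init__(f"Modbus exception code {code:02d}")
        self.code = code


def is_socket_level_failure(exc: BaseException) -> bool:
    # Reset, broken pipe, short read: worth one reconnect. A ModbusProtocolError
    # would come back identical after reconnecting, so it is not matched here.
    return isinstance(exc, (OSError, EOFError))


class ReconnectingTransport:
    def __init__(self, connect: Callable[[], Connection], auto_reconnect: bool = True):
        self._connect = connect
        self._auto_reconnect = auto_reconnect
        self._conn = connect()

    def _send_once(self, pdu: bytes) -> bytes:
        return self._conn.transact(pdu)

    def send(self, pdu: bytes) -> bytes:
        try:
            return self._send_once(pdu)
        except Exception as exc:
            if not (self._auto_reconnect and is_socket_level_failure(exc)):
                raise  # protocol errors (and everything when opted out) propagate
        # Tear down the dead connection, reconnect, and retry exactly once.
        try:
            self._conn.close()
        except OSError:
            pass  # closing an already-dead socket is best-effort
        self._conn = self._connect()
        return self._send_once(pdu)  # a second failure reaches the caller
```

Capping the retry at one attempt means a PLC that is genuinely down still fails on the second error instead of being hidden behind a retry loop.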

Validation

  • 142/142 Modbus.Tests pass (the new ModbusTcpReconnectTests boots a real in-process TcpListener that forcibly closes after one transaction; it asserts that auto-retry succeeds AND that the opt-out path propagates the drop)
  • 11/11 DL205 integration tests pass; no regression from the transport change
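The real-socket reconnect test above can be approximated in Python with a loopback listener. This sketch assumes plain Modbus TCP framing (7-byte MBAP header followed by the PDU); `FlakyServer`, `Client`, `fc03_request` and `fc03_response` are hypothetical stand-ins for FlakeyModbusServer and ModbusTcpTransport, and the server answers every FC03 read with one register value of 0x1234.

```python
import socket
import struct
import threading

HOST = "127.0.0.1"


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; an orderly close by the peer becomes EOFError."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed the connection")  # ~ EndOfStreamException
        buf += chunk
    return buf


def fc03_request(tid: int, unit: int = 1, start: int = 0, count: int = 1) -> bytes:
    pdu = struct.pack(">BHH", 0x03, start, count)
    # MBAP: transaction id, protocol id (0), length of unit id + PDU, unit id
    return struct.pack(">HHHB", tid, 0, len(pdu) + 1, unit) + pdu


def fc03_response(request: bytes) -> bytes:
    tid, pid, _length, unit = struct.unpack(">HHHB", request[:7])
    pdu = bytes([0x03, 0x02, 0x12, 0x34])  # FC03, 2 data bytes, one register
    return struct.pack(">HHHB", tid, pid, len(pdu) + 1, unit) + pdu


class FlakyServer:
    """Answers one 12-byte FC03 request per connection, then closes that connection."""

    def __init__(self) -> None:
        self._listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._listener.bind((HOST, 0))  # port 0: let the OS pick a free port
        self._listener.listen()
        self.port = self._listener.getsockname()[1]
        self.connections = 0
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self) -> None:
        while True:
            conn, _ = self._listener.accept()
            self.connections += 1
            with conn:
                try:
                    conn.sendall(fc03_response(recv_exact(conn, 12)))
                    conn.shutdown(socket.SHUT_RDWR)  # simulate the middlebox drop
                except (OSError, EOFError):
                    pass


class Client:
    def __init__(self, port: int, auto_reconnect: bool = True) -> None:
        self.port = port
        self.auto_reconnect = auto_reconnect
        self.sock = socket.create_connection((HOST, port))

    def _send_once(self, request: bytes) -> bytes:
        self.sock.sendall(request)
        header = recv_exact(self.sock, 7)
        (length,) = struct.unpack(">H", header[4:6])
        return header + recv_exact(self.sock, length - 1)  # unit id already read

    def send(self, request: bytes) -> bytes:
        try:
            return self._send_once(request)
        except (OSError, EOFError):
            if not self.auto_reconnect:
                raise  # legacy fail-loud behaviour
        self.sock.close()
        self.sock = socket.create_connection((HOST, self.port))
        return self._send_once(request)  # single retry
```

Depending on timing, the second send on the dropped socket sees either a clean EOF or a connection reset; both count as socket-level failures, which is why `OSError` and `EOFError` are caught together.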

Test plan

  • [x] Real-socket reconnect test with in-process flaky server
  • [x] Auto-reconnect off path verified
  • [x] No-regression across full Modbus suite
  • [x] Live pymodbus dl205 tests green
dohertj2 added 1 commit 2026-04-18 22:35:37 -04:00
Phase 3 PR 53 -- Transport reconnect-on-drop + SO_KEEPALIVE for DL205 no-keepalive quirk.

AutomationDirect H2-ECOM100 does NOT send TCP keepalives per docs/v2/dl205.md behavioral-oddities section -- any NAT/firewall device between the gateway and the PLC can silently close an idle socket after 2-5 minutes of inactivity. The PLC itself never notices, and the first SendAsync after the drop would previously surface to the caller as IOException / EndOfStreamException / SocketException even though the PLC is perfectly healthy.

PR 53 makes ModbusTcpTransport survive mid-session socket drops: SendAsync wraps the previous body as SendOnceAsync; if the first attempt fails with a socket-layer error (IOException, SocketException, EndOfStreamException, ObjectDisposedException) AND autoReconnect is enabled (default true), the transport tears down the dead socket, calls ConnectAsync to re-establish the connection, and resends the PDU exactly once. Deliberately single-retry -- further failures propagate so the driver health surface reflects the real state, with no masking of a dead PLC.

Protocol-layer failures (e.g. ModbusException with exception code 02) are specifically NOT caught by the reconnect path -- they would just come back with the same exception code after the reconnect, so retrying is wasted wire time. IsSocketLevelFailure is the discriminator between socket-level and protocol-level failures.

Also enables SO_KEEPALIVE on the TcpClient with aggressive timing: TcpKeepAliveTime=30s, TcpKeepAliveInterval=10s, TcpKeepAliveRetryCount=3. Total time-to-detect-dead-socket = 30 + 10*3 = 60s, vs the OS default of 2 hours idle before the first probe is even sent. Best-effort: older OSes that don't expose the fine-grained keepalive knobs silently skip them (catch {}).
New ModbusDriverOptions.AutoReconnect bool (default true) threads through to the default transport factory in ModbusDriver -- callers wanting the old 'fail loud on drop' behavior can set AutoReconnect=false, or use a custom transportFactory that ignores the option.

Unit tests: ModbusTcpReconnectTests boots a FlakeyModbusServer in-process (real TcpListener on loopback) that serves one valid FC03 response, then forcibly shuts down the socket. Transport_recovers_from_mid_session_drop_and_retries_successfully issues two consecutive SendAsync calls and asserts both return valid PDUs -- the second must trigger the reconnect path transparently. Transport_without_AutoReconnect_propagates_drop_to_caller asserts the legacy behavior when the opt-out is taken. These validate real socket semantics rather than mocked exceptions.

142/142 Modbus.Tests pass (113 prior + 2 mapper + 2 reconnect + 25 accumulated across PRs 45-52); 11/11 DL205 integration tests still pass with MODBUS_SIM_PROFILE=dl205 -- no regression from the transport change.

793c787315
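The option plumbing described in the commit can be sketched in Python. `ModbusDriverOptions`, `ModbusDriver` and the transport-factory idea mirror names from the commit, but the `host`/`port` fields, the `TransportSettings` type and the factory signature are assumptions for illustration, not the project's C# API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ModbusDriverOptions:
    host: str = "127.0.0.1"      # hypothetical field
    port: int = 502              # hypothetical field (standard Modbus TCP port)
    auto_reconnect: bool = True  # the new option; defaults to on


@dataclass(frozen=True)
class TransportSettings:
    """Hypothetical stand-in for a constructed ModbusTcpTransport."""
    host: str
    port: int
    auto_reconnect: bool


TransportFactory = Callable[[ModbusDriverOptions], TransportSettings]


def default_transport_factory(options: ModbusDriverOptions) -> TransportSettings:
    # The default factory threads AutoReconnect through to the transport.
    return TransportSettings(options.host, options.port, options.auto_reconnect)


class ModbusDriver:
    def __init__(self, options: ModbusDriverOptions,
                 transport_factory: Optional[TransportFactory] = None) -> None:
        factory = transport_factory or default_transport_factory
        self.transport = factory(options)
```

A custom factory is free to ignore the option: `ModbusDriver(ModbusDriverOptions(), lambda o: TransportSettings(o.host, o.port, False))` always yields a fail-loud transport regardless of `auto_reconnect`.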
dohertj2 merged commit d5c6280333 into v2 2026-04-18 22:35:42 -04:00

Reference: dohertj2/lmxopcua#52