mbproxy: add keepalive / connection monitoring

The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-15 09:40:54 -04:00
parent 7466a46aa7
commit 0868613890
25 changed files with 1135 additions and 25 deletions
@@ -240,6 +240,7 @@ The per-request timeout watchdog described above is the production defence again
- [`./Overview.md`](./Overview.md) — proxy architecture entry point
- [`./ReadCoalescing.md`](./ReadCoalescing.md) — FC03/FC04 fan-out built on `InterestedParties`
- [`./ResponseCache.md`](./ResponseCache.md) — per-PLC FC03/FC04 cache layered in front of this multiplexer
- [`./Keepalive.md`](./Keepalive.md) — TCP keepalive and the backend heartbeat that keeps this socket warm
- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `Connection.BackendConnectTimeoutMs`, `Connection.BackendRequestTimeoutMs`, retry tuning
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades` counters
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.multiplex.*` structured log events
+76
View File
@@ -0,0 +1,76 @@
# Keepalive & Connection Monitoring
The DL205/DL260 ECOM does not emit TCP keepalives (see [`../Reference/dl205.md`](../Reference/dl205.md) → "Behavioural Oddities"). An idle socket is silently dropped by middleboxes — switches, firewalls, NAT — typically after 25 minutes. The proxy holds one **persistent backend socket per PLC** ([`./ConnectionModel.md`](./ConnectionModel.md)) plus many accepted upstream client sockets, so it needs its own keepalive on both sides.
Keepalive is **enabled by default** and is governed by the `Connection.Keepalive` option block (see [`../Operations/Configuration.md`](../Operations/Configuration.md)). Set `Connection.Keepalive.Enabled = false` to restore pre-keepalive behaviour exactly.
## Two mechanisms
| Mechanism | Scope | Detects |
|-----------|-------|---------|
| OS TCP keepalive (`SO_KEEPALIVE`) | Backend socket **and** accepted upstream sockets | A peer whose TCP stack is gone (host down, cable pulled, half-open socket). |
| Application heartbeat (FC03 probe) | Backend socket only | The above **plus** a middlebox idle-drop and an ECOM that is connected-but-not-answering Modbus. |
The application heartbeat is the load-bearing mechanism; OS keepalive is a cheap belt-and-suspenders that also covers the window between heartbeat ticks.
## Backend: OS TCP keepalive
`SocketKeepalive.Apply` sets `SO_KEEPALIVE` plus the idle-time / probe-interval / probe-count tunables on the backend `Socket` right after it is created in `PlcMultiplexer.EnsureBackendConnectedAsync`. The tunables come from `Connection.Keepalive.Tcp*`. Socket options are applied **at connect time** — a hot-reload of the `Tcp*` values only affects backend sockets opened *after* the change.
## Backend: application heartbeat
A per-`PlcMultiplexer` background loop (`RunBackendHeartbeatAsync`) is started alongside the backend writer and reader on each successful connect, under the same `_backendCts`, and dies with them on teardown.
- The multiplexer tracks `_lastBackendActivityUtc`, updated by **both** the writer (on every send) and the reader (on every received frame). Real traffic in either direction therefore suppresses the heartbeat.
- Each tick (a quarter of `BackendHeartbeatIdleMs`, floored at 500 ms), if the socket has been idle longer than `BackendHeartbeatIdleMs`, the loop issues a **synthetic FC03 qty=1 read** at `BackendHeartbeatProbeAddress` (default 0 = `V0`, valid on DL205/DL260). FC08 (Diagnostics) is **not** supported by the DL260 ECOM, so the probe must be a real register read.
- The probe targets the unit ID of the most recent upstream request, so it reaches the same Modbus unit the real clients successfully use.
- The probe takes a real proxy TxId and a `CorrelationMap` entry flagged `InFlightRequest.IsHeartbeat`. It is enqueued straight onto the backend outbound channel, **bypassing** the read-coalescing and response-cache paths.
### Heartbeat response
The backend reader recognises an `IsHeartbeat` correlation entry, refreshes the idle timer (already done on frame receipt), frees the TxId, and **drops the payload** — no rewriter, no cache write-through, no fan-out, and no round-trip-EWMA sample (the synthetic probe never pollutes the client-facing RTT metric).
### Heartbeat timeout
If a probe is not answered within `BackendRequestTimeoutMs`, the per-request timeout watchdog ([`./ConnectionModel.md`](./ConnectionModel.md) → "Per-Request Timeout Watchdog") finds the stale `IsHeartbeat` entry and — instead of dispatching a 0x0B exception to a (non-existent) upstream party — calls `TearDownBackendAsync`, cascading every attached upstream pipe.
This is a **proactive** version of the existing backend-disconnect cascade: the dead path is found during idle instead of corrupting the next real client request. Reconnect stays lazy — the heartbeat keeps an *existing* backend warm, it never resurrects a dead one and adds no eager-reconnect spinner. Clients reconnect on their next request, exactly as for an organic cascade.
`BackendHeartbeatIdleMs` must be greater than `BackendRequestTimeoutMs` (enforced by the reload validator) — a heartbeat interval at or below the request timeout would fire continuously.
## Upstream: OS TCP keepalive
`SocketKeepalive.Apply` is also called on each accepted client `Socket` in the `UpstreamPipe` constructor. This is the **only** standard keepalive available on the upstream side: Modbus TCP is strictly client-initiated, so the proxy — a server to its clients — cannot send an unsolicited application heartbeat to a client. OS keepalive lets the proxy's TCP stack probe each client; a dead or half-open client then faults the pipe's read loop, the pipe is disposed, and its correlation / coalescing slots are freed instead of leaking until the proxy next tries to write.
## Counters
Per-PLC, exposed on the status page (see [`../Operations/StatusPage.md`](../Operations/StatusPage.md)):
| Counter | Meaning |
|---------|---------|
| `backendHeartbeatsSent` | Heartbeat probes issued on idle backend sockets. |
| `backendHeartbeatsFailed` | Probes not answered within `BackendRequestTimeoutMs`. |
| `backendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat (event count — distinct from `disconnectCascades`, which counts cascaded pipes). |
## Log events
`mbproxy.keepalive.*` — see [`../Reference/LogEvents.md`](../Reference/LogEvents.md):
- `mbproxy.keepalive.heartbeat.sent` (Debug)
- `mbproxy.keepalive.heartbeat.timeout` (Warning)
- `mbproxy.keepalive.backend.idle_disconnect` (Information)
## Hot reload
`Connection.Keepalive` is read through a live accessor (`Func<KeepaliveOptions>`), so a reload of `appsettings.json` propagates without a listener restart:
- The **heartbeat** interval and probe address are re-read on every loop tick.
- The **TCP socket options** are applied at connect/accept time, so a reload affects only sockets opened after the change.
## Related documentation
- [`./ConnectionModel.md`](./ConnectionModel.md) — backend socket lifecycle, the timeout watchdog, and the disconnect cascade this feature hooks into
- [`../Operations/Configuration.md`](../Operations/Configuration.md) — the `Connection.Keepalive` option block
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — keepalive counters
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.keepalive.*` events
- [`../Reference/dl205.md`](../Reference/dl205.md) — the device "no keepalive" oddity and FC03/FC08 support
+20 -4
View File
@@ -135,6 +135,16 @@ These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is
| `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. |
| `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. |
### Keepalive counters
These fields describe the backend keepalive heartbeat. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.backendHeartbeatsSent` | `long` | `CounterSnapshot.BackendHeartbeatsSent` | Synthetic FC03 heartbeat probes issued on this PLC's idle backend socket. |
| `backend.backendHeartbeatsFailed` | `long` | `CounterSnapshot.BackendHeartbeatsFailed` | Heartbeat probes not answered within `BackendRequestTimeoutMs`. Each failure tears the backend down. |
| `backend.backendIdleDisconnects` | `long` | `CounterSnapshot.BackendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat — an event count, distinct from `disconnectCascades` (which counts cascaded pipes). Sustained growth means a PLC is repeatedly going dark while idle. |
### Bytes
| JSON path | Type | Source | Meaning |
@@ -224,7 +234,10 @@ A representative two-PLC deployment, ~2 hours into a run:
"cacheMissCount": 88691,
"cacheInvalidations": 6203,
"cacheEntryCount": 47,
"cacheBytes": 18512
"cacheBytes": 18512,
"backendHeartbeatsSent": 412,
"backendHeartbeatsFailed": 0,
"backendIdleDisconnects": 0
},
"bytes": {
"upstreamIn": 4108290,
@@ -267,7 +280,10 @@ A representative two-PLC deployment, ~2 hours into a run:
"cacheMissCount": 0,
"cacheInvalidations": 0,
"cacheEntryCount": 0,
"cacheBytes": 0
"cacheBytes": 0,
"backendHeartbeatsSent": 0,
"backendHeartbeatsFailed": 0,
"backendIdleDisconnects": 0
},
"bytes": { "upstreamIn": 0, "upstreamOut": 0 }
}
@@ -282,10 +298,10 @@ The HTML renderer is `StatusHtmlRenderer.Render(StatusResponse)` in `src/Mbproxy
Structure:
1. **Header summary** — version, formatted uptime (`Nh MMm SSs`), `bound/configured` listener tally, last reload timestamp, reload count with a `(N rejected)` suffix when applicable.
2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell.
2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell, keepalive cell.
3. **State cell error detail** — when `state == "recovering"`, the cell also shows `lastBindError` and `(attempt N)` in a small red span.
The coalescing and cache cells each render as `<pct>% (<hits>)`. When neither has been exercised (`hit + miss == 0`), the cell renders an em-dash to keep the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
The coalescing and cache cells each render as `<pct>% (<hits>)`. When neither has been exercised (`hit + miss == 0`), the cell renders an em-dash to keep the column narrow. The keepalive cell shows the heartbeat-sent count, with `(fail N, idle-disc N)` appended only when either is non-zero. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
The page does not depend on JavaScript. Refresh is driven entirely by the `<meta http-equiv="refresh" content="5">` tag, so any browser — including text-mode browsers — sees the same view.
+28 -1
View File
@@ -99,7 +99,34 @@
// Max time (ms) to wait for in-flight PDUs to complete during graceful shutdown
// (sc.exe stop / Windows Service stop signal). After this deadline the coordinator
// cancels remaining work and proceeds. Keep at or below the SCM wait-hint (30 s).
"GracefulShutdownTimeoutMs": 10000
"GracefulShutdownTimeoutMs": 10000,
// ── Keepalive / connection monitoring ───────────────────────────────────
// The DL205/DL260 ECOM does not emit TCP keepalives, so an idle backend
// socket can be silently dropped by a middlebox (switch, firewall, NAT)
// after 2-5 minutes. This section enables OS-level SO_KEEPALIVE on both
// backend and upstream sockets, and drives a periodic Modbus FC03 heartbeat
// on each idle backend socket so a dead path is detected before a real
// client request hits it. See docs/Architecture/Keepalive.md.
"Keepalive": {
// Master switch. false → no SO_KEEPALIVE and no heartbeat; the proxy
// behaves exactly as a pre-keepalive build.
"Enabled": true,
// SO_KEEPALIVE: idle time (ms) before the OS sends its first probe.
"TcpIdleTimeMs": 30000,
// SO_KEEPALIVE: interval (ms) between probes once the idle time elapses.
"TcpProbeIntervalMs": 5000,
// SO_KEEPALIVE: unanswered probes before the OS declares the socket dead.
"TcpProbeCount": 4,
// Backend heartbeat: after this much backend idle (ms) the proxy issues a
// synthetic FC03 qty=1 read to keep the path warm and prove the ECOM is
// still answering Modbus. Must be greater than BackendRequestTimeoutMs.
"BackendHeartbeatIdleMs": 30000,
// FC03 PDU address the heartbeat reads. 0 = V0, valid on DL205/DL260.
"BackendHeartbeatProbeAddress": 0
}
},
// ── Resilience policies ─────────────────────────────────────────────────────────────
+7 -1
View File
@@ -103,7 +103,13 @@ public sealed record PlcBackendStatus(
long CacheMissCount,
long CacheInvalidations,
long CacheEntryCount,
long CacheBytes);
long CacheBytes,
/// <summary>Backend keepalive heartbeat probes issued on idle backend sockets.</summary>
long BackendHeartbeatsSent,
/// <summary>Keepalive heartbeat probes that timed out (backend not answering).</summary>
long BackendHeartbeatsFailed,
/// <summary>Backend teardowns triggered by a failed keepalive heartbeat.</summary>
long BackendIdleDisconnects);
/// <summary>Modbus exception counts by code.</summary>
public sealed record ExceptionCounts(
@@ -88,6 +88,9 @@ internal static class StatusHtmlRenderer
// an em-dash when no cache-eligible reads have occurred. Page-weight budget
// assertion stays under 50 KB for the 54-PLC fleet.
sb.Append("<th>Cache</th>");
// Keepalive column — heartbeats sent, with failure / idle-disconnect counts
// shown only when non-zero.
sb.Append("<th>Keepalive</th>");
sb.Append("</tr></thead><tbody>");
foreach (var plc in status.Plcs)
@@ -185,6 +188,24 @@ internal static class StatusHtmlRenderer
sb.Append(pct).Append("% (").Append(cacheHit).Append(')');
}
sb.Append("</td>");
// Keepalive cell — heartbeats sent; failures + idle-disconnects appended
// only when non-zero to keep the cell narrow.
long hbSent = plc.Backend.BackendHeartbeatsSent;
long hbFailed = plc.Backend.BackendHeartbeatsFailed;
long hbIdle = plc.Backend.BackendIdleDisconnects;
sb.Append("<td>");
if (hbSent == 0 && hbFailed == 0 && hbIdle == 0)
{
sb.Append("&mdash;");
}
else
{
sb.Append(hbSent);
if (hbFailed > 0 || hbIdle > 0)
sb.Append(" (fail ").Append(hbFailed)
.Append(", idle-disc ").Append(hbIdle).Append(')');
}
sb.Append("</td>");
sb.Append("</tr>");
}
@@ -108,7 +108,10 @@ internal sealed class StatusSnapshotBuilder
CacheInvalidations: 0,
CacheEntryCount: 0,
CacheBytes: 0,
ResponseDropForFullUpstream: 0);
ResponseDropForFullUpstream: 0,
BackendHeartbeatsSent: 0,
BackendHeartbeatsFailed: 0,
BackendIdleDisconnects: 0);
long connectsSuccess = counters.ConnectsSuccess;
long connectsFailed = counters.ConnectsFailed;
@@ -152,7 +155,10 @@ internal sealed class StatusSnapshotBuilder
CacheMissCount: counters.CacheMissCount,
CacheInvalidations: counters.CacheInvalidations,
CacheEntryCount: counters.CacheEntryCount,
CacheBytes: counters.CacheBytes),
CacheBytes: counters.CacheBytes,
BackendHeartbeatsSent: counters.BackendHeartbeatsSent,
BackendHeartbeatsFailed: counters.BackendHeartbeatsFailed,
BackendIdleDisconnects: counters.BackendIdleDisconnects),
Bytes: new PlcBytesStatus(
UpstreamIn: counters.BytesUpstreamIn,
UpstreamOut: counters.BytesUpstreamOut)));
@@ -61,6 +61,10 @@ internal sealed partial class ConfigReconciler : IDisposable
// and a hot-reload of `Enabled = false` would not propagate to them.
private Func<ReadCoalescingOptions>? _coalescingAccessor;
// Live accessor for KeepaliveOptions, threaded through Attach so PLCs added or
// restarted via hot-reload honour the current `Connection.Keepalive` values.
private Func<KeepaliveOptions>? _keepaliveAccessor;
// ── Debounce + serialisation machinery ───────────────────────────────────────────────
// Channel carries Unit to signal "something changed — please check".
@@ -121,11 +125,13 @@ internal sealed partial class ConfigReconciler : IDisposable
public void Attach(
ConcurrentDictionary<string, PlcListenerSupervisor> supervisors,
MbproxyOptions initialOptions,
Func<ReadCoalescingOptions>? coalescingAccessor = null)
Func<ReadCoalescingOptions>? coalescingAccessor = null,
Func<KeepaliveOptions>? keepaliveAccessor = null)
{
_supervisors = supervisors;
_currentOptions = initialOptions;
_coalescingAccessor = coalescingAccessor;
_keepaliveAccessor = keepaliveAccessor;
}
// ── ApplyAsync (exposed for tests) ───────────────────────────────────────────────────
@@ -315,7 +321,8 @@ internal sealed partial class ConfigReconciler : IDisposable
recoveryPipeline,
_loggerFactory.CreateLogger<PlcListenerSupervisor>(),
backendPipeline,
_coalescingAccessor);
_coalescingAccessor,
_keepaliveAccessor);
_supervisors[name] = newSupervisor;
await newSupervisor.StartAsync(ct).ConfigureAwait(false);
@@ -401,7 +408,8 @@ internal sealed partial class ConfigReconciler : IDisposable
recoveryPipeline,
_loggerFactory.CreateLogger<PlcListenerSupervisor>(),
backendPipeline,
_coalescingAccessor);
_coalescingAccessor,
_keepaliveAccessor);
_supervisors[plcNew.Name] = newSupervisor;
await newSupervisor.StartAsync(ct).ConfigureAwait(false);
@@ -141,6 +141,27 @@ internal static class ReloadValidator
errs.Add(
$"Connection.GracefulShutdownTimeoutMs must be > 0; got {next.Connection.GracefulShutdownTimeoutMs}.");
// ── 6. Keepalive section ──────────────────────────────────────────────
// Schema bounds are also checked in MbproxyOptionsValidator; re-checking here keeps
// the hot-reload gate self-contained. The cross-field rule (heartbeat interval must
// sit above the request timeout, or it would fire continuously) lives only here.
var ka = next.Connection.Keepalive;
if (ka.TcpIdleTimeMs <= 0)
errs.Add($"Connection.Keepalive.TcpIdleTimeMs must be > 0; got {ka.TcpIdleTimeMs}.");
if (ka.TcpProbeIntervalMs <= 0)
errs.Add($"Connection.Keepalive.TcpProbeIntervalMs must be > 0; got {ka.TcpProbeIntervalMs}.");
if (ka.TcpProbeCount <= 0)
errs.Add($"Connection.Keepalive.TcpProbeCount must be > 0; got {ka.TcpProbeCount}.");
if (ka.BackendHeartbeatProbeAddress is < 0 or > 65535)
errs.Add(
$"Connection.Keepalive.BackendHeartbeatProbeAddress must be in [0, 65535]; " +
$"got {ka.BackendHeartbeatProbeAddress}.");
if (ka.BackendHeartbeatIdleMs <= next.Connection.BackendRequestTimeoutMs)
errs.Add(
$"Connection.Keepalive.BackendHeartbeatIdleMs ({ka.BackendHeartbeatIdleMs}) must be greater " +
$"than Connection.BackendRequestTimeoutMs ({next.Connection.BackendRequestTimeoutMs}); " +
"a heartbeat interval at or below the request timeout would fire continuously.");
errors = errs;
return errs.Count == 0;
}
@@ -9,4 +9,9 @@ public sealed class ConnectionOptions
/// graceful shutdown before cancelling them. Default: 10000 (10 s).
/// </summary>
public int GracefulShutdownTimeoutMs { get; init; } = 10000;
/// <summary>
/// TCP keepalive and backend-heartbeat connection-monitoring settings. Enabled by default.
/// </summary>
public KeepaliveOptions Keepalive { get; init; } = new();
}
@@ -0,0 +1,52 @@
namespace Mbproxy.Options;
/// <summary>
/// TCP keepalive and application-level connection-monitoring settings.
///
/// <para>The DL205/DL260 ECOM does not emit TCP keepalives, so an idle backend socket can be
/// silently dropped by a middlebox (switch, firewall, NAT) after 2-5 minutes. These knobs
/// (a) enable OS-level <c>SO_KEEPALIVE</c> on both backend and accepted upstream sockets and
/// (b) drive a periodic Modbus FC03 heartbeat on each idle backend socket so the path stays
/// warm and a dead ECOM is detected before a real client request hits it.</para>
/// </summary>
public sealed class KeepaliveOptions
{
/// <summary>
/// Master switch. When <c>false</c>, neither <c>SO_KEEPALIVE</c> nor the backend heartbeat
/// is applied and the proxy behaves exactly as a pre-keepalive build. Default: <c>true</c>.
/// </summary>
public bool Enabled { get; init; } = true;
/// <summary>
/// <c>SO_KEEPALIVE</c> idle time in milliseconds — how long a socket may be idle before the
/// OS sends its first keepalive probe. Applied to backend and accepted upstream sockets.
/// Default: 30000 (30 s).
/// </summary>
public int TcpIdleTimeMs { get; init; } = 30000;
/// <summary>
/// <c>SO_KEEPALIVE</c> interval in milliseconds between keepalive probes once the idle time
/// has elapsed. Default: 5000 (5 s).
/// </summary>
public int TcpProbeIntervalMs { get; init; } = 5000;
/// <summary>
/// <c>SO_KEEPALIVE</c> probe count — unanswered probes before the OS declares the socket
/// dead. Default: 4.
/// </summary>
public int TcpProbeCount { get; init; } = 4;
/// <summary>
/// Backend application heartbeat: after this many milliseconds with no backend traffic, the
/// multiplexer issues a synthetic FC03 qty=1 read to keep the socket warm and prove the ECOM
/// is still answering Modbus. Must be greater than <see cref="ConnectionOptions.BackendRequestTimeoutMs"/>.
/// Default: 30000 (30 s).
/// </summary>
public int BackendHeartbeatIdleMs { get; init; } = 30000;
/// <summary>
/// Modbus PDU address read by the backend heartbeat FC03 probe. Address 0 (V0) is valid on
/// DL205/DL260 in factory absolute mode. Default: 0.
/// </summary>
public int BackendHeartbeatProbeAddress { get; init; } = 0;
}
@@ -106,6 +106,22 @@ public sealed class MbproxyOptionsValidator : IValidateOptions<MbproxyOptions>
errors.Add(
$"Connection.GracefulShutdownTimeoutMs must be > 0; got {options.Connection.GracefulShutdownTimeoutMs}.");
// Keepalive section ranges. Cross-field rules (heartbeat interval vs request
// timeout) are enforced in ReloadValidator.
var ka = options.Connection.Keepalive;
if (ka.TcpIdleTimeMs <= 0)
errors.Add($"Connection.Keepalive.TcpIdleTimeMs must be > 0; got {ka.TcpIdleTimeMs}.");
if (ka.TcpProbeIntervalMs <= 0)
errors.Add($"Connection.Keepalive.TcpProbeIntervalMs must be > 0; got {ka.TcpProbeIntervalMs}.");
if (ka.TcpProbeCount <= 0)
errors.Add($"Connection.Keepalive.TcpProbeCount must be > 0; got {ka.TcpProbeCount}.");
if (ka.BackendHeartbeatIdleMs <= 0)
errors.Add($"Connection.Keepalive.BackendHeartbeatIdleMs must be > 0; got {ka.BackendHeartbeatIdleMs}.");
if (ka.BackendHeartbeatProbeAddress is < 0 or > 65535)
errors.Add(
$"Connection.Keepalive.BackendHeartbeatProbeAddress must be in [0, 65535]; " +
$"got {ka.BackendHeartbeatProbeAddress}.");
return errors.Count > 0
? ValidateOptionsResult.Fail(errors)
: ValidateOptionsResult.Success;
@@ -28,6 +28,12 @@ internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);
/// read coalescing uses to fan out a single PLC response to multiple upstream clients.
/// Reviewer note: do <i>not</i> simplify back to a single <c>UpstreamPipe</c> field.</para>
/// </summary>
/// <param name="IsHeartbeat">
/// <c>true</c> for the synthetic FC03 keepalive probe issued by the backend heartbeat
/// loop. Heartbeat entries carry no <see cref="InterestedParties"/>: the backend reader
/// drops the response (no fan-out, no rewriter, no cache) and the timeout watchdog tears
/// the backend down instead of dispatching a 0x0B exception. Defaults to <c>false</c>.
/// </param>
internal sealed record InFlightRequest(
byte UnitId,
byte Fc,
@@ -35,4 +41,5 @@ internal sealed record InFlightRequest(
ushort Qty,
IReadOnlyList<InterestedParty> InterestedParties,
DateTimeOffset SentAtUtc,
int ResolvedCacheTtlMs = 0);
int ResolvedCacheTtlMs = 0,
bool IsHeartbeat = false);
@@ -0,0 +1,54 @@
namespace Mbproxy.Proxy.Multiplexing;
/// <summary>
/// Source-generated <see cref="LoggerMessage"/> definitions for the backend keepalive
/// heartbeat. Event names are stable — do not rename without updating
/// docs/Reference/LogEvents.md's event-name table.
/// </summary>
internal static partial class KeepaliveLogEvents
{
/// <summary>
/// Emitted each time the heartbeat loop issues a synthetic FC03 probe on an idle
/// backend socket. Debug level — one per <c>BackendHeartbeatIdleMs</c> per idle PLC.
/// </summary>
[LoggerMessage(
EventId = 150,
EventName = "mbproxy.keepalive.heartbeat.sent",
Level = LogLevel.Debug,
Message = "Keepalive heartbeat sent: Plc={Plc} ProxyTxId={ProxyTxId} Address={Address}")]
public static partial void HeartbeatSent(
ILogger logger,
string plc,
ushort proxyTxId,
ushort address);
/// <summary>
/// Emitted when a keepalive heartbeat probe is not answered within
/// <c>BackendRequestTimeoutMs</c>. The backend is connected-but-not-answering; the
/// multiplexer tears it down (see <see cref="BackendIdleDisconnect"/>).
/// </summary>
[LoggerMessage(
EventId = 151,
EventName = "mbproxy.keepalive.heartbeat.timeout",
Level = LogLevel.Warning,
Message = "Keepalive heartbeat timed out: Plc={Plc} ProxyTxId={ProxyTxId} ElapsedMs={ElapsedMs}")]
public static partial void HeartbeatTimeout(
ILogger logger,
string plc,
ushort proxyTxId,
long elapsedMs);
/// <summary>
/// Emitted when a failed keepalive heartbeat triggers a proactive backend teardown.
/// Every attached upstream pipe is cascaded; clients reconnect on their next request.
/// </summary>
[LoggerMessage(
EventId = 152,
EventName = "mbproxy.keepalive.backend.idle_disconnect",
Level = LogLevel.Information,
Message = "Backend torn down by keepalive: Plc={Plc} HeartbeatElapsedMs={ElapsedMs}")]
public static partial void BackendIdleDisconnect(
ILogger logger,
string plc,
long elapsedMs);
}
@@ -61,6 +61,12 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// `() => optionsMonitor.CurrentValue.Resilience.ReadCoalescing`. Tests default to a
// fresh `ReadCoalescingOptions()` (Enabled = true, MaxParties = 32).
private readonly Func<ReadCoalescingOptions> _coalescingOptions;
// Live keepalive config accessor. Read at backend-connect time (TCP SO_KEEPALIVE) and
// on each heartbeat-loop tick (idle threshold + probe address) so a hot-reload of
// `Connection.Keepalive` propagates without a listener restart. Production wires this
// to `() => optionsMonitor.CurrentValue.Connection.Keepalive`; the fallback reads the
// construction-time `ConnectionOptions` snapshot.
private readonly Func<KeepaliveOptions> _keepaliveOptions;
private readonly TxIdAllocator _allocator = new();
private readonly CorrelationMap _correlation = new();
@@ -86,6 +92,19 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
private CancellationTokenSource? _backendCts;
private Task? _backendWriterTask;
private Task? _backendReaderTask;
private Task? _backendHeartbeatTask;
// UTC ticks of the last backend socket activity (a send OR a received frame). Updated
// by the writer and reader tasks; read by the heartbeat loop to decide whether the
// socket has been idle long enough to warrant a probe. Interlocked for cross-task
// coherence.
private long _lastBackendActivityTicks;
// Unit ID of the most recent upstream request. The synthetic heartbeat reuses it so
// the probe targets the same Modbus unit the real clients successfully talk to.
// Defaults to 0 until the first upstream frame is seen; by the time a heartbeat can
// fire the backend socket exists, which means at least one upstream frame arrived.
private int _lastSeenUnitId;
private readonly CancellationTokenSource _disposeCts = new();
// Volatile so the disposing thread's write is observed by every hot-path reader
@@ -102,7 +121,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
PerPlcContext perPlcContext,
ILogger<PlcMultiplexer> logger,
ResiliencePipeline? backendConnectPipeline = null,
Func<ReadCoalescingOptions>? coalescingOptions = null)
Func<ReadCoalescingOptions>? coalescingOptions = null,
Func<KeepaliveOptions>? keepaliveOptions = null)
{
_plc = plc;
_connectionOptions = connectionOptions;
@@ -111,6 +131,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
_logger = logger;
_backendConnectPipeline = backendConnectPipeline;
_coalescingOptions = coalescingOptions ?? (static () => new ReadCoalescingOptions());
_keepaliveOptions = keepaliveOptions ?? (() => _connectionOptions.Keepalive);
// Register the per-PLC cache as the live stats source for the snapshot path.
// Cache may be null when the per-PLC context has not been wired with one
@@ -282,6 +303,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// Build a fresh backend socket and Polly-connect.
var backend = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
{ NoDelay = true };
SocketKeepalive.Apply(backend, _keepaliveOptions());
try
{
@@ -318,8 +340,11 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
{
_backendSocket = backend;
_backendCts = cts2;
// Seed the idle timer so the heartbeat loop measures idleness from connect.
Interlocked.Exchange(ref _lastBackendActivityTicks, DateTime.UtcNow.Ticks);
_backendWriterTask = Task.Run(() => RunBackendWriterAsync(backend, cts2.Token), CancellationToken.None);
_backendReaderTask = Task.Run(() => RunBackendReaderAsync(backend, cts2.Token), CancellationToken.None);
_backendHeartbeatTask = Task.Run(() => RunBackendHeartbeatAsync(cts2.Token), CancellationToken.None);
}
_ctx.Counters.IncrementConnectSuccess();
@@ -381,18 +406,20 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
{
Socket? oldSocket;
CancellationTokenSource? oldCts;
Task? writer, reader;
Task? writer, reader, heartbeat;
lock (_backendLock)
{
oldSocket = _backendSocket;
oldCts = _backendCts;
writer = _backendWriterTask;
reader = _backendReaderTask;
heartbeat = _backendHeartbeatTask;
_backendSocket = null;
_backendCts = null;
_backendWriterTask = null;
_backendReaderTask = null;
_backendHeartbeatTask = null;
}
if (oldSocket is null && oldCts is null) return;
@@ -454,6 +481,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// Best-effort join.
try { if (writer is not null) await writer.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false); } catch { /* swallow */ }
try { if (reader is not null) await reader.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false); } catch { /* swallow */ }
try { if (heartbeat is not null) await heartbeat.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false); } catch { /* swallow */ }
oldCts?.Dispose();
@@ -489,6 +517,9 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
if (n == 0) throw new SocketException((int)SocketError.ConnectionReset);
sent += n;
}
// A send counts as backend activity — it suppresses the idle heartbeat.
Interlocked.Exchange(ref _lastBackendActivityTicks, DateTime.UtcNow.Ticks);
}
}
catch (OperationCanceledException)
@@ -542,6 +573,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
if (!await FillAsync(backend, frame, MbapFrame.HeaderSize, pduBodyLen, ct).ConfigureAwait(false))
break;
// A received frame counts as backend activity — it suppresses (and, for a
// heartbeat response, satisfies) the idle heartbeat.
Interlocked.Exchange(ref _lastBackendActivityTicks, DateTime.UtcNow.Ticks);
if (!_correlation.TryRemove(proxyTxId, out var inFlight))
{
// No correlation entry — either a stale response after cascade, or
@@ -552,6 +587,14 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// Free the allocator slot immediately so it can be reused.
_allocator.Release(proxyTxId);
// Keepalive heartbeat response — the probe came back, the backend is alive.
// The activity timestamp was already refreshed above. There is no upstream
// party, no cache eligibility, and no rewriting to do: drop the payload and
// skip the EWMA update so the synthetic probe never pollutes the
// client-facing round-trip metric.
if (inFlight.IsHeartbeat)
continue;
// For FC03/FC04 reads, also clear the coalescing-by-key entry so a
// brand-new identical request issued AFTER this response is treated as a
// miss (opens a fresh round-trip). The TryRemove is best-effort: a
@@ -727,6 +770,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
out ushort originalTxId, out _, out _, out byte unitId))
return;
// Remember the unit ID so the backend keepalive heartbeat probes the same Modbus
// unit the real clients are known to reach successfully.
Volatile.Write(ref _lastSeenUnitId, unitId);
// Count inbound bytes from the upstream client. Surfaces in bytes.upstreamIn on
// the status page. Counted ONCE per parsed frame regardless of subsequent
// routing (cache hit, coalesce, backend round-trip, exception).
@@ -1062,6 +1109,23 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
_allocator.Release(proxyTxId);
// Keepalive heartbeat that never came back. The backend is no longer
// answering Modbus even though the socket may still look connected —
// tear it down proactively (cascading every attached pipe) so the
// failure is found here, during idle, instead of corrupting the next
// real client request. There is no upstream party to send a 0x0B to.
if (req.IsHeartbeat)
{
long hbElapsedMs = (long)(DateTimeOffset.UtcNow - req.SentAtUtc).TotalMilliseconds;
KeepaliveLogEvents.HeartbeatTimeout(_logger, _plc.Name, proxyTxId, hbElapsedMs);
_ctx.Counters.IncrementBackendHeartbeatFailed();
_ctx.Counters.IncrementBackendIdleDisconnect();
KeepaliveLogEvents.BackendIdleDisconnect(_logger, _plc.Name, hbElapsedMs);
if (!_disposeCts.IsCancellationRequested)
_ = TearDownBackendAsync("keepalive heartbeat timeout", cascadeUpstreams: true);
continue;
}
// Also clear the coalescing-by-key entry. A late attach that raced
// in just before the watchdog claim will still receive the 0x0B
// exception via this entry's InterestedParties list (List<T>
@@ -1110,6 +1174,124 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
}
}
// ── Backend keepalive heartbeat ───────────────────────────────────────────
/// <summary>
/// Backend keepalive heartbeat loop. Started alongside the writer/reader on each
/// successful connect and cancelled with them on teardown. While the backend socket
/// has been idle (no send or receive) for longer than
/// <see cref="KeepaliveOptions.BackendHeartbeatIdleMs"/>, it issues a synthetic FC03
/// qty=1 read so the path stays warm against middlebox idle-drop and a backend that is
/// connected-but-not-answering is detected here rather than on the next client request.
///
/// <para>The probe response is consumed by <see cref="RunBackendReaderAsync"/> (which
/// recognises <see cref="InFlightRequest.IsHeartbeat"/> and drops it); a probe that
/// never returns is timed out by <see cref="RunRequestTimeoutWatchdogAsync"/>, which
/// tears the backend down. The heartbeat keeps an <i>existing</i> backend warm — it
/// never resurrects a dead one (reconnect stays gated on the next upstream frame).</para>
/// </summary>
private async Task RunBackendHeartbeatAsync(CancellationToken ct)
{
try
{
while (!ct.IsCancellationRequested)
{
var ka = _keepaliveOptions();
int idleMs = Math.Max(1000, ka.BackendHeartbeatIdleMs);
// Tick at a quarter of the idle window so a freshly-elapsed idle period is
// noticed promptly, floored at 500 ms so the loop never busy-wakes.
int tickMs = Math.Max(500, idleMs / 4);
await Task.Delay(tickMs, ct).ConfigureAwait(false);
if (!ka.Enabled)
continue;
long lastTicks = Interlocked.Read(ref _lastBackendActivityTicks);
double idleElapsedMs =
(DateTime.UtcNow - new DateTime(lastTicks, DateTimeKind.Utc)).TotalMilliseconds;
if (idleElapsedMs < idleMs)
continue;
SendHeartbeat(ka);
}
}
catch (OperationCanceledException)
{
// Normal teardown.
}
catch (Exception ex)
{
_logger.LogError(ex, "Backend heartbeat loop faulted: Plc={Plc}", _plc.Name);
}
}
/// <summary>
/// Builds and enqueues one synthetic FC03 qty=1 heartbeat request onto the backend
/// outbound channel. The correlation entry is flagged <see cref="InFlightRequest.IsHeartbeat"/>
/// so the reader and watchdog treat it specially; it carries no interested parties and
/// bypasses the coalescing and cache paths entirely.
/// </summary>
private void SendHeartbeat(KeepaliveOptions ka)
{
// A saturated TxId space means the backend is busy (65,536 requests in flight),
// which is the opposite of idle — skip this tick rather than force a probe.
if (!_allocator.TryAllocate(out ushort proxyTxId))
return;
byte unitId = (byte)Volatile.Read(ref _lastSeenUnitId);
ushort address = (ushort)ka.BackendHeartbeatProbeAddress;
var inFlight = new InFlightRequest(
UnitId: unitId,
Fc: 0x03,
StartAddress: address,
Qty: 1,
InterestedParties: Array.Empty<InterestedParty>(),
SentAtUtc: DateTimeOffset.UtcNow,
ResolvedCacheTtlMs: 0,
IsHeartbeat: true);
if (!_correlation.TryAdd(proxyTxId, inFlight))
{
_allocator.Release(proxyTxId);
return;
}
byte[] frame = BuildHeartbeatFrame(proxyTxId, unitId, address);
// Non-blocking enqueue: if the channel is full the backend is not idle (a race), and
// if it is completed the backend is tearing down — either way, undo and skip.
if (!_outboundChannel.Writer.TryWrite(frame))
{
if (_correlation.TryRemove(proxyTxId, out _))
_allocator.Release(proxyTxId);
return;
}
_ctx.Counters.IncrementBackendHeartbeatSent();
KeepaliveLogEvents.HeartbeatSent(_logger, _plc.Name, proxyTxId, address);
}
/// <summary>
/// Builds a 12-byte MBAP-framed FC03 (Read Holding Registers) request reading one
/// register at <paramref name="address"/> — the keepalive heartbeat probe PDU.
/// </summary>
private static byte[] BuildHeartbeatFrame(ushort proxyTxId, byte unitId, ushort address)
{
// PDU = [fc=03][addrHi][addrLo][qtyHi][qtyLo]. MBAP length = UnitId(1) + PDU(5) = 6.
var frame = new byte[MbapFrame.HeaderSize + 5];
frame[0] = (byte)(proxyTxId >> 8);
frame[1] = (byte)(proxyTxId & 0xFF);
frame[2] = 0; frame[3] = 0; // ProtocolId
frame[4] = 0; frame[5] = 6; // Length
frame[6] = unitId;
frame[7] = 0x03; // FC03 Read Holding Registers
frame[8] = (byte)(address >> 8);
frame[9] = (byte)(address & 0xFF);
frame[10] = 0; frame[11] = 1; // Qty = 1
return frame;
}
// ── Helpers ───────────────────────────────────────────────────────────────
/// <summary>
@@ -1,6 +1,7 @@
using System.Net;
using System.Net.Sockets;
using System.Threading.Channels;
using Mbproxy.Options;
namespace Mbproxy.Proxy.Multiplexing;
@@ -77,10 +78,15 @@ internal sealed partial class UpstreamPipe : IAsyncDisposable
/// </summary>
public bool IsAlive => !_disposed && !_cts.IsCancellationRequested;
public UpstreamPipe(Socket upstream, string plcName, ILogger logger)
public UpstreamPipe(Socket upstream, string plcName, ILogger logger, KeepaliveOptions? keepalive = null)
{
_upstream = upstream;
_upstream.NoDelay = true;
// Enable OS TCP keepalive on the accepted client socket so a half-open/dead
// client (gone without a TCP FIN) faults the read loop and is reaped, instead of
// leaking a pipe + correlation slots until the proxy next tries to write to it.
if (keepalive is not null)
SocketKeepalive.Apply(_upstream, keepalive);
RemoteEp = upstream.RemoteEndPoint as IPEndPoint;
_plcName = plcName;
_logger = logger;
+10 -3
View File
@@ -30,6 +30,10 @@ internal sealed partial class PlcListener : IAsyncDisposable
private readonly PerPlcContext? _perPlcContext;
private readonly ResiliencePipeline? _backendConnectPipeline;
private readonly Func<ReadCoalescingOptions>? _coalescingOptions;
// Live keepalive accessor (TCP SO_KEEPALIVE on accepted upstream sockets + the backend
// heartbeat). Non-null after construction — falls back to the construction-time
// ConnectionOptions snapshot when no live accessor is supplied.
private readonly Func<KeepaliveOptions> _keepaliveOptions;
private TcpListener? _listener;
private PlcMultiplexer? _multiplexer;
@@ -62,7 +66,8 @@ internal sealed partial class PlcListener : IAsyncDisposable
ILogger pipeLogger,
PerPlcContext? perPlcContext = null,
ResiliencePipeline? backendConnectPipeline = null,
Func<ReadCoalescingOptions>? coalescingOptions = null)
Func<ReadCoalescingOptions>? coalescingOptions = null,
Func<KeepaliveOptions>? keepaliveOptions = null)
{
_plc = plc;
_connectionOptions = connectionOptions;
@@ -73,6 +78,7 @@ internal sealed partial class PlcListener : IAsyncDisposable
_perPlcContext = perPlcContext;
_backendConnectPipeline = backendConnectPipeline;
_coalescingOptions = coalescingOptions;
_keepaliveOptions = keepaliveOptions ?? (() => _connectionOptions.Keepalive);
}
/// <summary>
@@ -103,7 +109,8 @@ internal sealed partial class PlcListener : IAsyncDisposable
ctx,
_multiplexerLogger,
_backendConnectPipeline,
_coalescingOptions);
_coalescingOptions,
_keepaliveOptions);
}
/// <summary>
@@ -125,7 +132,7 @@ internal sealed partial class PlcListener : IAsyncDisposable
{
Socket upstream = await _listener.AcceptSocketAsync(ct).ConfigureAwait(false);
var pipe = new UpstreamPipe(upstream, _plc.Name, _pipeLogger);
var pipe = new UpstreamPipe(upstream, _plc.Name, _pipeLogger, _keepaliveOptions());
var pipeTask = Task.Run(async () =>
{
try
+39 -2
View File
@@ -133,7 +133,24 @@ public sealed record CounterSnapshot(
/// socket fast enough to keep up with the backend; the wedged client loses its own
/// responses but its peers on the same PLC continue to receive theirs.
/// </summary>
long ResponseDropForFullUpstream);
long ResponseDropForFullUpstream,
/// <summary>
/// Cumulative count of backend keepalive heartbeat probes issued (synthetic FC03
/// qty=1 reads sent on an idle backend socket).
/// </summary>
long BackendHeartbeatsSent,
/// <summary>
/// Cumulative count of backend keepalive heartbeat probes that were not answered
/// within <c>BackendRequestTimeoutMs</c>. Each failure triggers a proactive backend
/// teardown (see <see cref="BackendIdleDisconnects"/>).
/// </summary>
long BackendHeartbeatsFailed,
/// <summary>
/// Cumulative count of backend teardowns triggered by a failed keepalive heartbeat.
/// Distinct from <see cref="BackendDisconnectCascades"/> (which counts cascaded
/// pipes); this counts the disconnect <i>events</i> attributed to keepalive.
/// </summary>
long BackendIdleDisconnects);
/// <summary>
/// Thread-safe per-PLC counters backed by <see cref="System.Threading.Interlocked"/> longs.
@@ -184,6 +201,11 @@ internal sealed class ProxyCounters
// and account.
private long _responseDropForFullUpstream;
// Backend keepalive heartbeat counters.
private long _backendHeartbeatsSent;
private long _backendHeartbeatsFailed;
private long _backendIdleDisconnects;
// Live cache state pulled from a per-PLC ResponseCache on each snapshot. The
// multiplexer registers a single provider via SetCacheStatsProvider so the status
// page sees current entry-count / bytes without a separate poll.
@@ -315,6 +337,18 @@ internal sealed class ProxyCounters
public void IncrementResponseDropForFullUpstream()
=> Interlocked.Increment(ref _responseDropForFullUpstream);
/// <summary>Records one backend keepalive heartbeat probe sent.</summary>
public void IncrementBackendHeartbeatSent()
=> Interlocked.Increment(ref _backendHeartbeatsSent);
/// <summary>Records one backend keepalive heartbeat probe that timed out.</summary>
public void IncrementBackendHeartbeatFailed()
=> Interlocked.Increment(ref _backendHeartbeatsFailed);
/// <summary>Records one backend teardown triggered by a failed keepalive heartbeat.</summary>
public void IncrementBackendIdleDisconnect()
=> Interlocked.Increment(ref _backendIdleDisconnects);
/// <summary>
/// Wires the per-PLC <see cref="Cache.ResponseCache"/> as the live stats source for
/// the snapshot path. Pass <c>null</c> to detach during disposal.
@@ -445,7 +479,10 @@ internal sealed class ProxyCounters
CacheInvalidations: Interlocked.Read(ref _cacheInvalidations),
CacheEntryCount: cacheEntries,
CacheBytes: cacheBytes,
ResponseDropForFullUpstream: Interlocked.Read(ref _responseDropForFullUpstream));
ResponseDropForFullUpstream: Interlocked.Read(ref _responseDropForFullUpstream),
BackendHeartbeatsSent: Interlocked.Read(ref _backendHeartbeatsSent),
BackendHeartbeatsFailed: Interlocked.Read(ref _backendHeartbeatsFailed),
BackendIdleDisconnects: Interlocked.Read(ref _backendIdleDisconnects));
}
}
+10 -2
View File
@@ -150,6 +150,11 @@ internal sealed partial class ProxyWorker : BackgroundService
Func<ReadCoalescingOptions> coalescingAccessor =
() => _options.CurrentValue.Resilience.ReadCoalescing;
// Live accessor for KeepaliveOptions so a hot-reload of `Connection.Keepalive`
// propagates to the backend heartbeat loop and to upstream-socket keepalive.
Func<KeepaliveOptions> keepaliveAccessor =
() => _options.CurrentValue.Connection.Keepalive;
var supervisor = new PlcListenerSupervisor(
plc,
opts.Connection,
@@ -161,7 +166,8 @@ internal sealed partial class ProxyWorker : BackgroundService
recoveryPipeline,
_loggerFactory.CreateLogger<PlcListenerSupervisor>(),
backendPipeline,
coalescingAccessor);
coalescingAccessor,
keepaliveAccessor);
_supervisors[plc.Name] = supervisor;
}
@@ -175,7 +181,9 @@ internal sealed partial class ProxyWorker : BackgroundService
// (add/restart paths) honour hot-reloaded ReadCoalescing values.
Func<ReadCoalescingOptions> reconcilerCoalescingAccessor =
() => _options.CurrentValue.Resilience.ReadCoalescing;
_reconciler.Attach(_supervisors, opts, reconcilerCoalescingAccessor);
Func<KeepaliveOptions> reconcilerKeepaliveAccessor =
() => _options.CurrentValue.Connection.Keepalive;
_reconciler.Attach(_supervisors, opts, reconcilerCoalescingAccessor, reconcilerKeepaliveAccessor);
if (_supervisors.Count == 0)
{
@@ -0,0 +1,49 @@
using System.Net.Sockets;
using Mbproxy.Options;
namespace Mbproxy.Proxy;
/// <summary>
/// Applies OS-level TCP keepalive (<c>SO_KEEPALIVE</c> plus the idle-time / probe-interval /
/// probe-count tunables) to a socket. Used on both the backend socket (proxy → PLC) and
/// accepted upstream sockets (client → proxy) so the OS detects a dead peer on an
/// otherwise-idle connection — the DL205/DL260 ECOM never emits keepalives of its own.
/// </summary>
internal static class SocketKeepalive
{
/// <summary>
/// Enables TCP keepalive on <paramref name="socket"/> from <paramref name="options"/>.
/// A no-op when <see cref="KeepaliveOptions.Enabled"/> is <c>false</c>.
///
/// <para>Failures are swallowed: keepalive is a best-effort belt-and-suspenders measure
/// (the backend application heartbeat is the load-bearing mechanism) and must never
/// abort a connection. The three TCP tunables are also not honoured on every platform;
/// a refusal there is benign.</para>
/// </summary>
public static void Apply(Socket socket, KeepaliveOptions options)
{
if (!options.Enabled) return;
try
{
socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
// SocketOptionName.TcpKeepAliveTime / TcpKeepAliveInterval are specified in
// SECONDS; round the configured milliseconds up to at least one second.
int idleSec = Math.Max(1, (options.TcpIdleTimeMs + 999) / 1000);
int intervalSec = Math.Max(1, (options.TcpProbeIntervalMs + 999) / 1000);
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, idleSec);
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, intervalSec);
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, options.TcpProbeCount);
}
catch (SocketException)
{
// Platform refused a tunable — keepalive stays best-effort.
}
catch (ObjectDisposedException)
{
// Socket closed concurrently — nothing to do.
}
}
}
@@ -38,6 +38,7 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
private readonly ILogger<PlcListenerSupervisor> _logger;
private readonly ResiliencePipeline? _backendConnectPipeline;
private readonly Func<ReadCoalescingOptions>? _coalescingOptions;
private readonly Func<KeepaliveOptions>? _keepaliveOptions;
// ── Mutable state ────────────────────────────────────────────────────────────────────
@@ -94,7 +95,8 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
ResiliencePipeline recoveryPipeline,
ILogger<PlcListenerSupervisor> logger,
ResiliencePipeline? backendConnectPipeline = null,
Func<ReadCoalescingOptions>? coalescingOptions = null)
Func<ReadCoalescingOptions>? coalescingOptions = null,
Func<KeepaliveOptions>? keepaliveOptions = null)
{
_plc = plc;
_connectionOptions = connectionOptions;
@@ -108,6 +110,7 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
_logger = logger;
_backendConnectPipeline = backendConnectPipeline;
_coalescingOptions = coalescingOptions;
_keepaliveOptions = keepaliveOptions;
}
/// <summary>
@@ -325,7 +328,8 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
_pipeLogger,
_currentContext,
_backendConnectPipeline,
_coalescingOptions);
_coalescingOptions,
_keepaliveOptions);
// Expose the current listener for status-page pair enumeration.
_currentListener = listener;
@@ -53,7 +53,9 @@ public sealed class StatusHtmlRendererTests
CoalescedHitCount: 0, CoalescedMissCount: 0,
CoalescedResponseToDeadUpstream: 0,
CacheHitCount: 0, CacheMissCount: 0,
CacheInvalidations: 0, CacheEntryCount: 0, CacheBytes: 0),
CacheInvalidations: 0, CacheEntryCount: 0, CacheBytes: 0,
BackendHeartbeatsSent: 0, BackendHeartbeatsFailed: 0,
BackendIdleDisconnects: 0),
Bytes: new PlcBytesStatus(1024, 2048));
}
@@ -264,4 +264,74 @@ public sealed class ReloadValidatorTests
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("GracefulShutdownTimeoutMs"));
}
// ── Keepalive section ─────────────────────────────────────────────────────
[Fact]
public void Validate_DefaultKeepalive_Passes()
{
// Default ConnectionOptions → default KeepaliveOptions (idle 30 s, request 3 s).
var opts = MakeOptions([MakePlc("PLC-A", 5020)]);
bool valid = ReloadValidator.Validate(opts, out _);
Assert.True(valid);
}
[Fact]
public void Validate_NonPositiveTcpProbeCount_Fails()
{
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
Connection = new ConnectionOptions
{
Keepalive = new KeepaliveOptions { TcpProbeCount = 0 },
},
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("TcpProbeCount"));
}
[Fact]
public void Validate_OutOfRangeHeartbeatProbeAddress_Fails()
{
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
Connection = new ConnectionOptions
{
Keepalive = new KeepaliveOptions { BackendHeartbeatProbeAddress = 70000 },
},
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("BackendHeartbeatProbeAddress"));
}
[Fact]
public void Validate_HeartbeatIdleNotAboveRequestTimeout_Fails()
{
// BackendHeartbeatIdleMs must sit ABOVE BackendRequestTimeoutMs, else a heartbeat
// would be timed out as fast as it could be issued.
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
Connection = new ConnectionOptions
{
BackendRequestTimeoutMs = 3000,
Keepalive = new KeepaliveOptions { BackendHeartbeatIdleMs = 3000 },
},
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("BackendHeartbeatIdleMs"));
}
}
@@ -0,0 +1,366 @@
using System.Net;
using System.Net.Sockets;
using Mbproxy.Options;
using Mbproxy.Proxy;
using Mbproxy.Proxy.Multiplexing;
using Microsoft.Extensions.Logging.Abstractions;
using Shouldly;
using Xunit;
namespace Mbproxy.Tests.Proxy.Multiplexing;
/// <summary>
/// Tests for the backend keepalive heartbeat and the <see cref="SocketKeepalive"/> helper.
/// The heartbeat tests run the real <see cref="PlcMultiplexer"/> against a stub backend
/// (real sockets, no simulator) with a deliberately short <c>BackendHeartbeatIdleMs</c>.
/// </summary>
[Trait("Category", "Unit")]
public sealed class KeepaliveTests
{
// ── Helpers ────────────────────────────────────────────────────────────────
private static int PickFreePort()
{
var l = new TcpListener(IPAddress.Loopback, 0);
l.Start();
int port = ((IPEndPoint)l.LocalEndpoint).Port;
l.Stop();
return port;
}
private static async Task<byte[]> ReadExactAsync(Socket socket, int count, CancellationToken ct)
{
var buf = new byte[count];
int read = 0;
while (read < count)
{
int n = await socket.ReceiveAsync(buf.AsMemory(read, count - read), SocketFlags.None, ct);
if (n == 0) throw new IOException("EOF");
read += n;
}
return buf;
}
private static async Task<byte[]> ReadOneFrameAsync(Socket socket, CancellationToken ct)
{
var header = await ReadExactAsync(socket, 7, ct);
ushort length = (ushort)((header[4] << 8) | header[5]);
int bodyLen = length - 1;
var body = bodyLen > 0 ? await ReadExactAsync(socket, bodyLen, ct) : Array.Empty<byte>();
var frame = new byte[7 + bodyLen];
Buffer.BlockCopy(header, 0, frame, 0, 7);
if (bodyLen > 0) Buffer.BlockCopy(body, 0, frame, 7, bodyLen);
return frame;
}
private static byte[] BuildFc03ReadFrame(ushort txId, ushort start, ushort qty, byte unitId = 1)
=>
[
(byte)(txId >> 8), (byte)(txId & 0xFF),
0x00, 0x00,
0x00, 0x06,
unitId,
0x03,
(byte)(start >> 8), (byte)(start & 0xFF),
(byte)(qty >> 8), (byte)(qty & 0xFF),
];
private static byte[] BuildFc03Response(ushort txId, byte unitId, ushort register)
{
// Body = FC(1) + byteCount(1) + data(2) = 4. MBAP length = UnitId(1) + body(4) = 5.
var frame = new byte[7 + 4];
frame[0] = (byte)(txId >> 8);
frame[1] = (byte)(txId & 0xFF);
frame[2] = 0; frame[3] = 0;
frame[4] = 0; frame[5] = 5; // length
frame[6] = unitId;
frame[7] = 0x03;
frame[8] = 2; // byte count
frame[9] = (byte)(register >> 8);
frame[10] = (byte)(register & 0xFF);
return frame;
}
private static PerPlcContext MakeContext(string name) => new()
{
PlcName = name,
TagMap = Mbproxy.Bcd.BcdTagMap.Empty,
Counters = new ProxyCounters(),
Logger = NullLogger.Instance,
};
/// <summary>
/// Stub backend that echoes FC03 responses (including the synthetic heartbeat probe,
/// which is itself an FC03). When <see cref="Silent"/> is set it reads and drains
/// requests but never responds — used to drive heartbeat timeouts.
/// </summary>
private sealed class StubBackend : IAsyncDisposable
{
public int Port { get; }
public volatile bool Silent;
private int _requestCount;
public int RequestCount => Volatile.Read(ref _requestCount);
private readonly TcpListener _listener;
private readonly CancellationTokenSource _cts = new();
private readonly List<Task> _clientTasks = new();
public StubBackend(int port, bool silent = false)
{
Port = port;
Silent = silent;
_listener = new TcpListener(IPAddress.Loopback, port);
_listener.Start();
_ = AcceptLoop();
}
private async Task AcceptLoop()
{
try
{
while (!_cts.IsCancellationRequested)
{
Socket s = await _listener.AcceptSocketAsync(_cts.Token);
var task = Task.Run(() => HandleAsync(s));
lock (_clientTasks) _clientTasks.Add(task);
}
}
catch { /* shutdown */ }
}
private async Task HandleAsync(Socket s)
{
try
{
while (!_cts.IsCancellationRequested)
{
var req = await ReadOneFrameAsync(s, _cts.Token);
if (req.Length < 8) break;
Interlocked.Increment(ref _requestCount);
if (Silent) continue;
ushort txId = (ushort)((req[0] << 8) | req[1]);
byte unitId = req[6];
byte fc = req[7];
if (fc != 0x03) break;
await s.SendAsync(BuildFc03Response(txId, unitId, 0x1234), SocketFlags.None, _cts.Token);
}
}
catch { /* normal */ }
finally { try { s.Dispose(); } catch { } }
}
public async ValueTask DisposeAsync()
{
await _cts.CancelAsync();
try { _listener.Stop(); } catch { }
Task[] snap;
lock (_clientTasks) snap = _clientTasks.ToArray();
try { await Task.WhenAll(snap).WaitAsync(TimeSpan.FromSeconds(2)); } catch { }
_cts.Dispose();
}
}
private static PlcMultiplexer BuildMux(PlcOptions plc, ConnectionOptions connOpts, PerPlcContext ctx)
=> new(
plc, connOpts,
new BcdPduPipeline(),
ctx,
NullLogger<PlcMultiplexer>.Instance,
backendConnectPipeline: null);
private static async Task<(Socket client, UpstreamPipe pipe, TcpListener listener)>
ConnectClientAsync(PlcMultiplexer mux, string plcName)
{
int proxyPort = PickFreePort();
var listener = new TcpListener(IPAddress.Loopback, proxyPort);
listener.Start();
var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
{ NoDelay = true };
await client.ConnectAsync(IPAddress.Loopback, proxyPort);
var upstream = await listener.AcceptSocketAsync();
var pipe = new UpstreamPipe(upstream, plcName, NullLogger.Instance);
_ = Task.Run(() => mux.StartPipeAsync(pipe, CancellationToken.None));
return (client, pipe, listener);
}
// ── SocketKeepalive helper ─────────────────────────────────────────────────
[Fact]
public void SocketKeepalive_Apply_Enabled_TurnsOnKeepAlive()
{
using var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
SocketKeepalive.Apply(socket, new KeepaliveOptions
{
Enabled = true,
TcpIdleTimeMs = 30000,
TcpProbeIntervalMs = 5000,
TcpProbeCount = 4,
});
int keepAlive = (int)socket.GetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive)!;
keepAlive.ShouldNotBe(0, "SO_KEEPALIVE must be enabled after Apply");
}
[Fact]
public void SocketKeepalive_Apply_Disabled_IsNoOp()
{
using var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
SocketKeepalive.Apply(socket, new KeepaliveOptions { Enabled = false });
int keepAlive = (int)socket.GetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive)!;
keepAlive.ShouldBe(0, "Apply with Enabled=false must not touch the socket");
}
// ── Backend heartbeat ──────────────────────────────────────────────────────
[Fact]
public async Task Heartbeat_FiresOnIdleBackend_AndIsAnswered_NoCascade()
{
int backendPort = PickFreePort();
await using var backend = new StubBackend(backendPort);
var ctx = MakeContext("PLC1");
var plc = new PlcOptions { Name = "PLC1", ListenPort = 0, Host = "127.0.0.1", Port = backendPort };
var connOpts = new ConnectionOptions
{
Keepalive = new KeepaliveOptions { Enabled = true, BackendHeartbeatIdleMs = 600 },
};
await using var mux = BuildMux(plc, connOpts, ctx);
var (client, pipe, listener) = await ConnectClientAsync(mux, plc.Name);
try
{
// One real round-trip brings the backend up and starts the heartbeat loop.
await client.SendAsync(BuildFc03ReadFrame(0x0001, 0, 1), SocketFlags.None);
_ = await ReadOneFrameAsync(client, TestContext.Current.CancellationToken);
// Idle the connection past the heartbeat threshold a few times over.
long sent = 0;
for (int i = 0; i < 60; i++)
{
sent = ctx.Counters.Snapshot().BackendHeartbeatsSent;
if (sent >= 1) break;
await Task.Delay(100, TestContext.Current.CancellationToken);
}
sent.ShouldBeGreaterThanOrEqualTo(1, "an idle backend must receive at least one heartbeat probe");
var snap = ctx.Counters.Snapshot();
snap.BackendHeartbeatsFailed.ShouldBe(0, "an answered heartbeat must not count as failed");
snap.BackendIdleDisconnects.ShouldBe(0, "an answered heartbeat must not tear the backend down");
// The client connection survived — a fresh request still round-trips.
await client.SendAsync(BuildFc03ReadFrame(0x0002, 0, 1), SocketFlags.None);
var rsp = await ReadOneFrameAsync(client, TestContext.Current.CancellationToken)
.WaitAsync(TimeSpan.FromSeconds(3), TestContext.Current.CancellationToken);
((ushort)((rsp[0] << 8) | rsp[1])).ShouldBe((ushort)0x0002);
}
finally
{
client.Dispose();
await pipe.DisposeAsync();
listener.Stop();
}
}
[Fact]
public async Task Heartbeat_SuppressedByRealTraffic()
{
int backendPort = PickFreePort();
await using var backend = new StubBackend(backendPort);
var ctx = MakeContext("PLC1");
var plc = new PlcOptions { Name = "PLC1", ListenPort = 0, Host = "127.0.0.1", Port = backendPort };
// Idle threshold well above the request cadence below.
var connOpts = new ConnectionOptions
{
Keepalive = new KeepaliveOptions { Enabled = true, BackendHeartbeatIdleMs = 1500 },
};
await using var mux = BuildMux(plc, connOpts, ctx);
var (client, pipe, listener) = await ConnectClientAsync(mux, plc.Name);
try
{
// Steady real traffic every ~200 ms for ~2.4 s. Each round-trip refreshes the
// activity timestamp, so the 1500 ms idle threshold is never reached.
for (ushort i = 1; i <= 12; i++)
{
await client.SendAsync(BuildFc03ReadFrame(i, 0, 1), SocketFlags.None);
_ = await ReadOneFrameAsync(client, TestContext.Current.CancellationToken)
.WaitAsync(TimeSpan.FromSeconds(3), TestContext.Current.CancellationToken);
await Task.Delay(200, TestContext.Current.CancellationToken);
}
ctx.Counters.Snapshot().BackendHeartbeatsSent
.ShouldBe(0, "real traffic must keep resetting the idle timer so no heartbeat fires");
}
finally
{
client.Dispose();
await pipe.DisposeAsync();
listener.Stop();
}
}
[Fact]
public async Task Heartbeat_Timeout_TearsDownBackend_AndCascades()
{
int backendPort = PickFreePort();
// Silent from the start: the backend accepts the TCP connection and drains every
// frame (including the heartbeat) but never replies.
await using var backend = new StubBackend(backendPort, silent: true);
var ctx = MakeContext("PLC1");
var plc = new PlcOptions { Name = "PLC1", ListenPort = 0, Host = "127.0.0.1", Port = backendPort };
var connOpts = new ConnectionOptions
{
BackendRequestTimeoutMs = 500,
Keepalive = new KeepaliveOptions { Enabled = true, BackendHeartbeatIdleMs = 700 },
};
await using var mux = BuildMux(plc, connOpts, ctx);
var (client, pipe, listener) = await ConnectClientAsync(mux, plc.Name);
try
{
// First request brings the backend TCP connection up and starts the heartbeat
// loop. It will itself time out with 0x0B (the backend never answers) — drain
// and ignore that frame.
await client.SendAsync(BuildFc03ReadFrame(0x0001, 0, 1), SocketFlags.None);
try
{
_ = await ReadOneFrameAsync(client, TestContext.Current.CancellationToken)
.WaitAsync(TimeSpan.FromSeconds(2), TestContext.Current.CancellationToken);
}
catch { /* 0x0B or socket close — not what this test asserts */ }
// The heartbeat fires on the idle socket, never gets answered, and the watchdog
// times it out — which tears the backend down.
long failed = 0, idleDisc = 0;
for (int i = 0; i < 80; i++)
{
var snap = ctx.Counters.Snapshot();
failed = snap.BackendHeartbeatsFailed;
idleDisc = snap.BackendIdleDisconnects;
if (failed >= 1 && idleDisc >= 1) break;
await Task.Delay(100, TestContext.Current.CancellationToken);
}
failed.ShouldBeGreaterThanOrEqualTo(1, "an unanswered heartbeat must count as failed");
idleDisc.ShouldBeGreaterThanOrEqualTo(1, "a failed heartbeat must trigger a backend idle-disconnect");
ctx.Counters.Snapshot().BackendHeartbeatsSent
.ShouldBeGreaterThanOrEqualTo(1, "a heartbeat must have been sent before it could fail");
}
finally
{
client.Dispose();
await pipe.DisposeAsync();
listener.Stop();
}
}
}
@@ -438,6 +438,69 @@ public sealed class MultiplexerE2ETests
}
}
// ── E2E 6: Backend keepalive heartbeat keeps an idle connection warm ─────────────
/// <summary>
/// With keepalive enabled, an idle backend connection receives periodic FC03 heartbeat
/// probes. This test idles a simulator-backed connection past
/// <c>BackendHeartbeatIdleMs</c>, verifies <c>backendHeartbeatsSent</c> climbs on the
/// status page, and confirms a later real read still round-trips on the same
/// (un-cascaded) connection.
/// </summary>
[Fact(Timeout = 8_000)]
public async Task E2E_Keepalive_IdleBackend_ReceivesHeartbeats_AndStaysUsable()
{
if (_sim.SkipReason is not null) Assert.Skip(_sim.SkipReason);
int proxyPort = PickFreePort();
int adminPort = PickFreePort();
var config = MakeBaseConfig(proxyPort);
config["Mbproxy:AdminPort"] = adminPort.ToString();
// Short idle window so the heartbeat fires several times within the test budget.
config["Mbproxy:Connection:Keepalive:Enabled"] = "true";
config["Mbproxy:Connection:Keepalive:BackendHeartbeatIdleMs"] = "700";
var host = BuildBcdHost(config);
using var startCts = new CancellationTokenSource(TimeSpan.FromSeconds(3));
await host.StartAsync(startCts.Token);
await using var hd = new AsyncHostDispose(host);
await Task.Delay(200, TestContext.Current.CancellationToken);
using (var client = new TcpClient())
{
await client.ConnectAsync("127.0.0.1", proxyPort, TestContext.Current.CancellationToken);
var master = new ModbusFactory().CreateMaster(client);
// One read brings the backend up and starts the heartbeat loop.
_ = master.ReadHoldingRegisters(1, 0, 1);
// Idle the connection so the heartbeat loop fires repeatedly.
await Task.Delay(2500, TestContext.Current.CancellationToken);
// A later read still succeeds — the connection was never cascaded.
ushort[] regs = master.ReadHoldingRegisters(1, 0, 1);
regs.Length.ShouldBe(1, "the idle-then-active connection must still serve reads");
}
using var httpClient = new HttpClient();
var resp = await httpClient.GetStringAsync(
$"http://127.0.0.1:{adminPort}/status.json",
TestContext.Current.CancellationToken);
using var doc = JsonDocument.Parse(resp);
var backend = doc.RootElement.GetProperty("plcs")[0].GetProperty("backend");
backend.TryGetProperty("backendHeartbeatsSent", out _)
.ShouldBeTrue("status.json must expose backend.backendHeartbeatsSent");
backend.GetProperty("backendHeartbeatsSent").GetInt64()
.ShouldBeGreaterThanOrEqualTo(1, "an idle backend must have received at least one heartbeat");
backend.GetProperty("backendHeartbeatsFailed").GetInt64()
.ShouldBe(0, "every heartbeat against the live simulator must be answered");
backend.GetProperty("backendIdleDisconnects").GetInt64()
.ShouldBe(0, "an answered heartbeat must never tear the backend down");
}
// ── Helpers ──────────────────────────────────────────────────────────────────────
private Dictionary<string, string?> MakeBaseConfig(int proxyPort) => new()