diff --git a/mbproxy/README.md b/mbproxy/README.md index ce8d0bb..d80129f 100644 --- a/mbproxy/README.md +++ b/mbproxy/README.md @@ -31,6 +31,36 @@ DL260/ DL205/DL260 reference material and pymodbus simulator prof | pymodbus simulator profile (register seeds for E2E tests) | [`DL260/dl205.json`](DL260/dl205.json) | | Agent-oriented coding guide (architecture bullets, device quirks, phase context) | [`CLAUDE.md`](CLAUDE.md) | +## Detailed documentation + +The `docs/` tree is organized by topic. Start with [`docs/design.md`](docs/design.md) for the canonical end-to-end design; jump to the focused pages below when you need depth on one area. + +### Architecture + +- [`Architecture/Overview.md`](docs/Architecture/Overview.md) — Listener topology, request flow, per-PLC isolation. +- [`Architecture/ConnectionModel.md`](docs/Architecture/ConnectionModel.md) — Single backend connection per PLC, TxId multiplexing, request-timeout watchdog, disconnect cascade. +- [`Architecture/ReadCoalescing.md`](docs/Architecture/ReadCoalescing.md) — In-flight FC03/FC04 deduplication via `InFlightByKeyMap`. +- [`Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md) — Opt-in per-tag response cache with bounded operator-configured staleness. + +### Features + +- [`Features/BcdRewriting.md`](docs/Features/BcdRewriting.md) — BCD codec, CDAB word order, FC03/04/06/16 scope, partial-overlap policy. +- [`Features/HotReload.md`](docs/Features/HotReload.md) — `IOptionsMonitor`-driven config reload with per-change-kind reconcile rules. + +### Operations + +- [`Operations/Configuration.md`](docs/Operations/Configuration.md) — Full `appsettings.json` reference: every `Mbproxy:*` key, default, and validation rule. +- [`Operations/StatusPage.md`](docs/Operations/StatusPage.md) — Admin endpoint surface (`/`, `/status.json`) with every JSON field documented. +- [`Operations/Troubleshooting.md`](docs/Operations/Troubleshooting.md) — Diagnosis playbook keyed to log events and status counters. 
+ +### Reference + +- [`Reference/LogEvents.md`](docs/Reference/LogEvents.md) — Stable `mbproxy.*` event catalog (28 events across 7 categories). + +### Testing + +- [`Testing/Simulator.md`](docs/Testing/Simulator.md) — pymodbus DL205 fixture, skip policy, and the load-bearing pymodbus 3.13 framer quirk. + ## Build and run **Build (Debug, multi-file — fast for iteration):** diff --git a/mbproxy/docs/Architecture/ConnectionModel.md b/mbproxy/docs/Architecture/ConnectionModel.md new file mode 100644 index 0000000..75fd10f --- /dev/null +++ b/mbproxy/docs/Architecture/ConnectionModel.md @@ -0,0 +1,247 @@ +# Connection Model + +The proxy holds one persistent backend TCP socket per PLC and multiplexes many upstream client connections onto it by rewriting the MBAP transaction ID on every request and restoring each client's original TxId on the matching response. + +## Why One Backend Connection Per PLC + +An earlier design opened a fresh backend socket for each accepted upstream client (1:1 pairs). That model collapsed against the **AutomationDirect H2-ECOM100**, which caps simultaneous TCP clients at **4 per PLC** (see [`../../DL260/dl205.md`](../../DL260/dl205.md) under "Behavioural Oddities"). The fifth upstream client to attach to a busy PLC was refused at connect, with no recourse other than waiting for an existing pair to drop. + +Multiplexing replaces 1:1 upstream-to-backend pairs with N:1 upstream-to-multiplexer-to-backend: + +- The proxy occupies exactly one of the ECOM's 4 TCP client slots per PLC, regardless of how many upstream clients are attached. +- Upstream-side concurrency is no longer bounded by the controller's accept queue. +- Serialisation shifts from the PLC accept queue to the proxy's outbound channel (`_outboundChannel` in `PlcMultiplexer`). + +The honest trade-off: the wire-rate ceiling does not change.
The ECOM serialises requests internally at roughly 2–10 ms per scan, so the multiplexer cannot deliver more PDUs per second to one PLC than the 1:1 model could. What multiplexing buys is connection headroom, plus the data structures that read coalescing and the response cache hook into. + +### Why TxId rewriting rather than connection pooling + +The MBAP transaction ID is a 16-bit field at bytes 0–1 of every Modbus TCP frame, and the Modbus TCP specification explicitly permits clients to pipeline requests under different TxIds on a single connection. The PLC echoes each request's TxId on the matching response. The multiplexer exploits that contract: by allocating a proxy-side TxId per request and substituting it for the upstream client's TxId on the wire, many upstream clients can have concurrent requests outstanding on one backend socket without their MBAP frames ever colliding. A connection pool, by contrast, would still need either one backend socket per concurrent request (defeating the ECOM cap workaround) or a serialisation lock on each pooled socket (defeating concurrency). + +## Components + +The load-bearing types all live in [`../../src/Mbproxy/Proxy/Multiplexing/`](../../src/Mbproxy/Proxy/Multiplexing). + +### Type roster + +| Type | File | Role | +|------|------|------| +| `PlcMultiplexer` | `PlcMultiplexer.cs` | Owns the backend socket, the outbound channel, the backend writer and reader tasks, the per-request timeout watchdog, and the set of attached upstream pipes. One instance per PLC. | +| `UpstreamPipe` | `UpstreamPipe.cs` | Per-upstream-client wrapper around an accepted `Socket`. Owns a read task that drives `PlcMultiplexer.OnUpstreamFrameAsync`, plus a write task that drains a bounded `_responseChannel` (capacity 16) onto the socket. | +| `TxIdAllocator` | `TxIdAllocator.cs` | Proxy-side 16-bit TxId allocator. Backed by a `bool[65536]` plus a rolling `_next` cursor under a single lock. 
Exposes `TryAllocate`, `Release`, `InFlightCount`, and `WrapCount`. | +| `CorrelationMap` | `CorrelationMap.cs` | `ConcurrentDictionary` mapping proxy TxId to its in-flight record. Exposes `TryAdd`, `TryRemove`, `DrainAll`, and `SnapshotOlderThan`. | +| `InFlightRequest` | `InFlightRequest.cs` | Record carrying `UnitId`, `Fc`, `StartAddress`, `Qty`, `IReadOnlyList InterestedParties`, `SentAtUtc`, and `ResolvedCacheTtlMs`. | +| `InterestedParty` | `InFlightRequest.cs` | Record `(UpstreamPipe Pipe, ushort OriginalTxId)` identifying who receives the response and which TxId to restore. | + +### Threading invariants + +The multiplexer relies on a handful of single-owner rules that keep the wire-touching code lock-free: + +- **One backend writer.** Only `RunBackendWriterAsync` calls `backend.SendAsync`. The single-writer drain of `_outboundChannel.Reader.ReadAllAsync` means no socket-level send lock is needed. +- **One backend reader.** Only `RunBackendReaderAsync` calls `backend.ReceiveAsync`. The reader is the sole producer of `CorrelationMap.TryRemove` for the response path. +- **Per-pipe write loop.** Each `UpstreamPipe` has exactly one task that drains `_responseChannel` to its upstream socket. The multiplexer fan-out path only enqueues; it never writes to the socket directly. +- **Per-pipe read loop.** A single read task per pipe parses MBAP frames and calls `OnUpstreamFrameAsync` sequentially. A single upstream client therefore cannot multi-PDU-pipeline itself; concurrency comes from having many pipes. + +`TxIdAllocator` holds an internal lock for `TryAllocate` / `Release`. Contention is low in practice — one PLC's wire rate is bounded by the ECOM scan time — and the lock is preferred over a lock-free approach so the saturation, cascade, and Polly-retry paths remain deterministic. 
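A minimal sketch of the allocator shape described above, assuming a `bool[65536]` occupancy table, a rolling cursor, and one lock. The class name `TxIdAllocatorSketch` and its internals are hypothetical; the real `TxIdAllocator.cs` may count wraps or handle double releases differently.

```csharp
using System;

// Hypothetical illustration only; not the contents of TxIdAllocator.cs.
public sealed class TxIdAllocatorSketch
{
    private readonly bool[] _inUse = new bool[65536]; // one flag per 16-bit TxId
    private readonly object _lock = new();
    private int _next;      // rolling cursor, always in 0..65535
    private int _inFlight;  // number of flags currently set
    private long _wraps;    // times the cursor rolled 0xFFFF -> 0x0000

    public int InFlightCount { get { lock (_lock) return _inFlight; } }
    public long WrapCount { get { lock (_lock) return _wraps; } }

    public bool TryAllocate(out ushort txId)
    {
        lock (_lock)
        {
            // Saturated: every slot is in flight, so the caller must reject the request.
            if (_inFlight == _inUse.Length) { txId = 0; return false; }

            // At least one slot is free, so this scan terminates.
            while (_inUse[_next]) Advance();

            txId = (ushort)_next;
            _inUse[_next] = true;
            _inFlight++;
            Advance(); // do not hand the same TxId straight back out
            return true;
        }
    }

    public void Release(ushort txId)
    {
        lock (_lock)
        {
            if (!_inUse[txId]) return; // assumption: a double release is a no-op
            _inUse[txId] = false;
            _inFlight--;
        }
    }

    private void Advance()
    {
        _next++;
        if (_next == _inUse.Length) { _next = 0; _wraps++; }
    }
}
```

Because the cursor keeps moving forward, a freshly released TxId is not reissued until the cursor comes back around, which gives a late response for the old request time to arrive and be dropped before the slot is reused.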
+ +### Why ConcurrentDictionary for the correlation map + +`CorrelationMap` is backed by `ConcurrentDictionary` even though the request-side adds and the response-side removes are nominally single-threaded each. Three independent paths can remove an entry concurrently with each other: the backend reader on a normal response, the watchdog on a timeout, and the cascade walker on a backend disconnect. Two adders (the coalescing path's factory and the non-coalescing fast path) can also race against a removal if the backend response arrives mid-add. The `ConcurrentDictionary` makes those `TryAdd`/`TryRemove` calls atomic, which is what the "claim then dispatch" pattern in the watchdog and reader relies on for correctness. + +## Upstream To Multiplexer Path + +`PlcListener` accepts an upstream `Socket` and constructs an `UpstreamPipe` around it. `PlcMultiplexer.StartPipeAsync` attaches the pipe, spins up its write loop, and invokes `RunReadLoopAsync` with `OnUpstreamFrameAsync` as the per-frame callback. When the read loop returns (clean upstream EOF, socket fault, or cascade), a `ContinueWith` removes the pipe from `_pipes`; disposal itself is owned by the listener. + +### Frame parsing + +The pipe's read loop reads a 7-byte MBAP header into a stack-buffered array, parses the `Length` field, allocates a fresh `byte[]` sized to header + (`Length` − 1) bytes, fills the PDU body, and hands the complete frame to the callback. Frames whose length field claims a body larger than `MbapFrame.MaxPduBodySize` are treated as a protocol error and close the upstream pipe; a zero-body length is permitted (the header alone is forwarded). The buffer ownership transfers to the multiplexer with each call so the multiplexer can store it in the `CorrelationMap` entry without coordinating buffer lifetimes back to the pipe. + +Each call to `OnUpstreamFrameAsync`: + +1. Parses the MBAP header to extract the upstream client's `originalTxId` and the `unitId`. +2. 
For FC03, FC04, FC06, and FC16 it also pulls `startAddress` and `qty` out of the PDU; these feed the cache, the read-coalescing key, and the response BCD rewriter. +3. (Response cache, FC03/FC04 only) checks `_ctx.Cache` via a `CacheKey`. A hit short-circuits the entire path — including the backend connect attempt — and returns a synthesised frame. +4. Calls `EnsureBackendConnectedAsync`, which lazily brings up the backend socket through a Polly retry pipeline driven by `Connection.BackendConnectTimeoutMs`. +5. (Read coalescing, FC03/FC04 only, when enabled) consults `InFlightByKeyMap` to either attach to an existing peer in flight or open a new entry. +6. On a coalescing miss or any non-coalescing FC: calls `TxIdAllocator.TryAllocate(out ushort proxyTxId)`. Saturation returns false and the client receives a Modbus exception code 4 (Slave Device Failure). +7. Builds an `InFlightRequest`, registers it in `CorrelationMap.TryAdd(proxyTxId, ...)`, and observes the new peak via `ObserveInFlight`. +8. Runs the BCD rewriter over the request payload through `_pipeline.Process(MbapDirection.RequestToBackend, ...)`. +9. Overwrites bytes 0 and 1 of the MBAP header with the big-endian encoding of `proxyTxId`. +10. Enqueues the frame onto `_outboundChannel` via `_outboundChannel.Writer.WriteAsync`. The channel is bounded at 256 with `BoundedChannelFullMode.Wait`, so a saturated outbound queue backpressures the upstream rather than dropping frames. + +```csharp +// Sketch of the proxy-TxId rewrite (PlcMultiplexer.OnUpstreamFrameAsync): +if (!_allocator.TryAllocate(out ushort proxyTxIdFc)) { /* exception 4 */ } +_correlation.TryAdd(proxyTxIdFc, inFlightNc); +_pipeline.Process(MbapDirection.RequestToBackend, header, body, requestCtxNc); +frame[0] = (byte)(proxyTxIdFc >> 8); +frame[1] = (byte)(proxyTxIdFc & 0xFF); +await _outboundChannel.Writer.WriteAsync(frame, ct).ConfigureAwait(false); +``` + +After enqueuing, the upstream read loop is free to read the next frame. 
There is no per-pipe in-flight gate beyond what the upstream client itself imposes by reading from a single TCP stream. + +### Saturation handling + +`TxIdAllocator.TryAllocate` returns `false` only when all 65,536 slots are simultaneously in flight against one PLC. In that state `OnUpstreamFrameAsync` calls `BuildExceptionFrame(originalTxId, unitId, fcByte, exceptionCode: 4)` and enqueues the frame straight onto the requesting pipe's response channel — the upstream client sees a clean Modbus exception code 4 (Slave Device Failure) rather than a hung socket. The same path emits `MultiplexerLogEvents.Saturated` with the remote endpoint string for operator triage. + +### Lazy backend connect + +The backend socket starts offline. `EnsureBackendConnectedAsync` runs under a `SemaphoreSlim` named `_connectGate` so concurrent upstream frames during a cold start serialise their connect attempts. The first caller through the gate builds a fresh `Socket`, sets `NoDelay = true`, and runs `ConnectAsync` under either the supplied `_backendConnectPipeline` (Polly resilience pipeline) or a plain `CancellationToken` linked to `Connection.BackendConnectTimeoutMs`. On failure it logs `MultiplexerLogEvents.BackendFailed`, increments the per-PLC `connectsFailed` counter, and returns `false`; the upstream pipe is disposed by the caller. On success it spawns the backend writer and reader tasks under a fresh `CancellationTokenSource` linked to `_disposeCts`, increments `connectsSuccess`, and logs `MultiplexerLogEvents.BackendConnected`. + +A double-checked fast path before the gate avoids the semaphore acquire on the happy path: the moment `_backendSocket is { Connected: true }` and `_backendCts is { IsCancellationRequested: false }`, `EnsureBackendConnectedAsync` returns immediately without taking the lock. The lazy-connect contract therefore costs one volatile read per request after the first successful connect. 
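The double-checked gate can be sketched as follows. This is an illustration under stated assumptions, not the real `EnsureBackendConnectedAsync`: a `volatile bool` stands in for the `_backendSocket` and `_backendCts` checks, and the connect step is a hypothetical delegate rather than a Polly pipeline.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration of a double-checked lazy connect.
public sealed class LazyGateSketch
{
    private readonly SemaphoreSlim _connectGate = new(1, 1);
    private readonly Func<CancellationToken, Task<bool>> _connect; // stand-in for the real connect
    private volatile bool _connected; // stand-in for "socket is connected and CTS is live"

    public LazyGateSketch(Func<CancellationToken, Task<bool>> connect) => _connect = connect;

    public async Task<bool> EnsureConnectedAsync(CancellationToken ct)
    {
        // Fast path: one volatile read, no semaphore acquire.
        if (_connected) return true;

        await _connectGate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            // Second check: another caller may have connected while this one waited.
            if (_connected) return true;

            _connected = await _connect(ct).ConfigureAwait(false);
            return _connected;
        }
        finally
        {
            _connectGate.Release();
        }
    }
}
```

A failed attempt leaves the flag unset, so the next caller through the gate tries again rather than inheriting a cached failure.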
+ +## Multiplexer To Backend Path + +The backend side is two tasks plus one bounded channel. `EnsureBackendConnectedAsync` launches two tasks against the backend socket on first connect, both under a single `_backendCts`: + +- **`RunBackendWriterAsync`** — single consumer of `_outboundChannel.Reader.ReadAllAsync`. Writes every frame to the backend socket via `SendAsync` with a loop to handle short writes. Single-writer means no socket-level lock is needed for sends. +- **`RunBackendReaderAsync`** — single producer reading frames off the backend socket. For each frame: + 1. Parses the MBAP header to extract `proxyTxId` and `length`. + 2. Reads the PDU body into a fresh `byte[]`. + 3. Calls `CorrelationMap.TryRemove(proxyTxId, out var inFlight)`. A miss (no entry) drops the frame silently — usually a stale response after a cascade. + 4. Frees the allocator slot via `_allocator.Release(proxyTxId)`. + 5. Updates the per-PLC EWMA round-trip via `UpdateRoundTripEwma` using `inFlight.SentAtUtc`. + 6. Runs the response-side BCD rewriter through `_pipeline.Process(MbapDirection.ResponseToClient, ...)`. The rewriter needs `inFlight.StartAddress` and `inFlight.Qty` because the FC03/FC04 response PDU does not echo the read range. + 7. (Cache write-through, post-rewriter) on a non-exception response, stores FC03/FC04 entries in `_ctx.Cache` or invalidates overlapping entries on FC06/FC16. + 8. Walks `inFlight.InterestedParties`. For each party with a live pipe, copies the frame, restores `party.OriginalTxId` into header bytes 0–1, and calls `party.Pipe.SendResponseAsync` to enqueue the frame onto that pipe's response channel. + +Single-reader on the backend socket plus per-pipe response channels means every cross-task hand-off goes through a `Channel` — no locks on the wire-touching code paths. 
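Step 8 of the reader loop, restoring each party's TxId on its own copy of the frame, might look like the sketch below. The `FanOutSketch` helper is hypothetical and takes only the list of original TxIds; the real reader works with `InterestedParty` records and live pipes.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical illustration of per-party TxId restoration.
public static class FanOutSketch
{
    // Returns one independent frame per party, each with its own TxId in bytes 0-1.
    public static List<byte[]> FanOut(byte[] backendFrame, IReadOnlyList<ushort> originalTxIds)
    {
        var frames = new List<byte[]>(originalTxIds.Count);
        foreach (ushort originalTxId in originalTxIds)
        {
            // Clone so that no two response channels share a buffer.
            byte[] copy = (byte[])backendFrame.Clone();

            // The MBAP transaction ID is big-endian at bytes 0-1.
            BinaryPrimitives.WriteUInt16BigEndian(copy.AsSpan(0, 2), originalTxId);
            frames.Add(copy);
        }
        return frames;
    }
}
```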
+ +### Frame fan-out + +When `inFlight.InterestedParties.Count == 1` — the common non-coalesced case — the reader optimises by passing the original frame buffer through to `SendResponseAsync` without copying. When the list has more than one party (a coalesced FC03/FC04 read), the reader clones the frame for each party before patching in its `OriginalTxId`, so each pipe's response channel owns an independent buffer. + +A party whose pipe reports `IsAlive == false` is skipped. For multi-party FC03/FC04 frames the skip path also increments the per-PLC `coalescedResponseToDeadUpstream` counter and logs `CoalescingLogEvents.DeadUpstream`, so operators can correlate cascade-mid-flight events with which reads were affected. + +## Per-Request Timeout Watchdog + +`RunRequestTimeoutWatchdogAsync` is launched from the multiplexer constructor and runs for the lifetime of the multiplexer. It ticks every `BackendRequestTimeoutMs / 4`, floored at 100 ms, and on each tick calls `CorrelationMap.SnapshotOlderThan(DateTimeOffset.UtcNow.AddMilliseconds(-BackendRequestTimeoutMs))`. + +For each stale entry the watchdog: + +1. Tries to claim the entry via `_correlation.TryRemove(proxyTxId, out var req)`. A failed claim means a response, cascade, or another watchdog tick already removed it — skip. +2. Releases the proxy TxId via `_allocator.Release(proxyTxId)`. +3. For FC03/FC04, also removes the matching `CoalescingKey` from `InFlightByKeyMap` so a brand-new identical request opens a fresh round-trip rather than attaching to a corpse. +4. Walks `req.InterestedParties` and, for each live pipe, delivers a synthesised Modbus exception frame with function code `req.Fc | 0x80` and exception code `0x0B` (Gateway Target Device Failed To Respond), with the party's `OriginalTxId` patched back into the MBAP header. + +The watchdog exists because the multiplexed model has no per-pair fault-on-timeout backstop. 
In the 1:1 model, a lost response simply sat on a dead socket that the upstream eventually closed; in the multiplexed model, a single missing or mis-echoed response would leak its `CorrelationMap` entry forever and hang every upstream party attached to it. Specific failure modes the watchdog covers: + +- The PLC drops a response (busy controller, scan-time excursion). +- A middlebox drops a packet on a long-idle backend socket. +- A backend mis-echoes the MBAP TxId — including pymodbus 3.13.0's deferred-handler bug noted below. + +### Why claim then release + +The watchdog reads the stale set via `SnapshotOlderThan` (a non-removing scan) and only then competes for each entry via `TryRemove`. The two-step is deliberate: a response arriving between the snapshot and the claim wins the `TryRemove` race and the watchdog skips that entry. Without the claim race, the upstream party could receive both a real response and a 0x0B exception for the same request, which would corrupt clients that expect responses in TxId order. + +### Tick cadence + +The 100 ms floor on `tickMs` keeps the watchdog from busy-waking when an operator configures `BackendRequestTimeoutMs` below 400 ms. With the production default of 3000 ms the watchdog ticks every 750 ms, which keeps timeout dispatch latency well under one second past the threshold. + +### Exception frame shape + +`BuildExceptionFrame` produces a 9-byte synthetic response: 7-byte MBAP header plus a 2-byte exception PDU. The function code byte is OR'd with `0x80` to flag the response as an exception, and the second PDU byte carries the exception code (`0x04` for allocator saturation, `0x0B` for the watchdog). The `Length` field in the MBAP header is set to 3 (`UnitId` + exception FC + exception code) and the `ProtocolId` is zero per the Modbus TCP spec. Clients written against a real DL260 see exactly the same frame layout a controller would emit, so client libraries surface a normal `ModbusException` rather than a transport error. 
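The 9-byte layout above can be written out directly. This is a sketch based only on the field description in this section; the containing class name is hypothetical and the real `BuildExceptionFrame` may be structured differently.

```csharp
using System;

// Hypothetical illustration of the synthetic exception frame layout.
public static class ExceptionFrameSketch
{
    public static byte[] BuildExceptionFrame(ushort originalTxId, byte unitId, byte fc, byte exceptionCode)
    {
        return new byte[]
        {
            (byte)(originalTxId >> 8),   // TxId high byte (big-endian)
            (byte)(originalTxId & 0xFF), // TxId low byte
            0x00, 0x00,                  // ProtocolId = 0 for Modbus TCP
            0x00, 0x03,                  // Length = UnitId + exception FC + exception code
            unitId,
            (byte)(fc | 0x80),           // high bit marks the response as an exception
            exceptionCode,               // 0x04 for saturation, 0x0B for the watchdog
        };
    }
}
```

For example, a timed-out FC03 read from unit 1 with original TxId 0x1234 yields the bytes `12 34 00 00 00 03 01 83 0B`.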
+ +## Backend Disconnect Cascade + +When the backend socket dies — reader EOF, writer fault, PLC reboot, network partition, or middlebox idle drop — `TearDownBackendAsync(reason, cascadeUpstreams: true)` runs: + +1. Cancels `_backendCts`, which terminates both backend tasks. +2. Shuts down and disposes the backend `Socket`. +3. Calls `CorrelationMap.DrainAll`, releases every allocator slot, and collects every `InterestedParty`'s pipe ID. +4. Calls `InFlightByKeyMap.DrainAll` so stale coalescing entries cannot outlive the backend they were aimed at. +5. Disposes every attached `UpstreamPipe` and clears `_pipes`. +6. Increments `BackendDisconnectCascades` on the per-PLC counters by the number of upstream pipes that were attached (`AddDisconnectCascades(upstreamCount)`). +7. Logs a `MultiplexerLogEvents.BackendDisconnected` event with the upstream count, drained correlation count, and a reason string. + +The rationale: a backend disconnect invalidates every in-flight response, and there is no clean way to mid-flight-rebind upstream clients to a fresh backend socket without risking silent data loss. Cascading the disconnect upstream is loud (clients re-issue immediately) but unambiguous — every upstream sees its socket close, no zombie upstream sockets hold stale state. The next upstream frame after the cascade triggers a fresh Polly-driven backend connect. + +### Failure detection paths + +Three independent paths can initiate a cascade: + +1. **Reader EOF.** `RunBackendReaderAsync` sees a clean zero-byte read from `ReceiveAsync` and falls out of the loop. It calls `TearDownBackendAsync("backend reader EOF", cascadeUpstreams: true)` as a fire-and-forget task. +2. **Reader fault or writer fault.** Either backend task catches a non-cancellation exception and calls `TearDownBackendAsync($"reader fault: {ex.Message}", ...)` or the equivalent writer-fault path. +3. 
**Watchdog-driven indirect failure.** A backend that mis-echoes TxIds will not itself fault the socket; the watchdog eventually times out the leaked correlation entries and delivers 0x0B exceptions. The socket stays up unless the backend then also stops responding to subsequent requests. + +`TearDownBackendAsync` is idempotent against itself — the `lock (_backendLock)` block atomically swaps the live socket and task references to `null`, so a second invocation sees `oldSocket is null && oldCts is null` and returns without re-cascading. + +### Why every attached upstream cascades + +An earlier sketch cascaded only upstream pipes that had a request in flight at the moment of disconnect. The current implementation cascades every attached pipe, in flight or idle. The reason: an idle upstream pipe is one that the proxy has been quietly answering from cache or that has simply not sent a request recently. After a backend disconnect, the proxy has no way to prove the PLC's state still matches what those idle clients last saw — a PLC reboot, ladder edit, or operator write between the disconnect and reconnect can have moved the values out from under them. Closing every upstream socket is the unambiguous signal that "the link to the device was lost; rebuild your state from scratch." Clients reconnect on their own next request. + +### Connect-on-next-frame, not eager reconnect + +The cascade tears down the backend without scheduling a reconnect. The next upstream frame that arrives invokes `EnsureBackendConnectedAsync`, which constructs a fresh socket and runs the Polly connect pipeline. The rationale is that an eager reconnect spinner would hammer a downed PLC at the configured backoff schedule even when no clients are attached; gating reconnect on client demand avoids waste during long PLC outages without sacrificing recovery latency once clients return. + +## Wire-Rate Considerations + +The multiplexer is not a throughput multiplier. 
The ECOM serialises every request it receives on its single internal scan, so PDUs-per-second to one PLC is bounded by `1 / ecom_scan_ms` regardless of how many upstream clients the proxy fans in. What changes: + +- **Connection count.** Upstream-side connection count is now limited by the OS socket budget and `OutboundChannelCapacity` (256), not by the ECOM's 4-client cap. +- **Coalescing opportunity.** Identical concurrent FC03/FC04 reads attach to the same `InFlightRequest` via `InFlightByKeyMap`, so the proxy issues one backend round-trip and fans the response out to all attached parties (see [`./ReadCoalescing.md`](./ReadCoalescing.md)). +- **Cache short-circuit.** FC03/FC04 reads with a resolved per-tag TTL never reach the wire while the cached PDU is fresh (see [`./ResponseCache.md`](./ResponseCache.md)). + +The proxy can hand more concurrent upstream clients a result on a hot tag than the bare PLC can serve simultaneously. It cannot let those clients hammer the PLC harder than the PLC's scan time allows. + +### Counters exposed by the status page + +`PlcMultiplexer` implements `IMultiplexCountersProvider` and registers itself with the per-PLC counters object during construction. The status page reads these values per snapshot: + +| Counter | Source | Meaning | +|---------|--------|---------| +| `inFlight` | `TxIdAllocator.InFlightCount` | Proxy TxIds currently allocated against this PLC. | +| `maxInFlight` | `Counters.ObserveInFlight` peak | High-water mark since service start. | +| `txIdWraps` | `TxIdAllocator.WrapCount` | Times the rolling cursor has rolled 0xFFFF → 0x0000. Sustained non-zero means very high churn. | +| `queueDepth` | `_outboundChannel.Reader.Count` | Frames sitting in the outbound channel waiting for the backend writer. Persistent depth means the PLC is the bottleneck. | +| `disconnectCascades` | `Counters.AddDisconnectCascades` | Cumulative count of upstream pipes cascaded by backend disconnects. 
Rises in chunks equal to the attached pipe count at cascade time. | +| `connectsSuccess` / `connectsFailed` | `Counters.IncrementConnectSuccess` / `IncrementConnectFailed` | Per-PLC backend connect outcomes. | + +### Interpreting non-zero txIdWraps + +Each `WrapCount` increment means the allocator has issued at least 65,536 TxIds against one PLC since service start. On a steady 10 ms-per-PDU pace that takes about 11 minutes; sustained wraps therefore indicate request rates in the hundreds-per-second range, well above what an ECOM-served PLC can answer. Wraps without a matching rise in `inFlight` simply reflect cumulative volume and are benign. Wraps that climb alongside a high `inFlight` value indicate the PLC is back-pressuring; check `queueDepth` and the EWMA round-trip on the same snapshot. + +### Interpreting non-zero queueDepth + +`_outboundChannel` is bounded at 256 with `BoundedChannelFullMode.Wait`. A persistent depth above zero means the backend writer is not draining as fast as upstream pipes are submitting — the PLC has become the bottleneck. A queue that climbs toward 256 means upstream pipes are starting to block on `WriteAsync`; that backpressure walks back up the per-pipe read loop and ultimately stalls the upstream client's send buffer, which is the correct behaviour for an overloaded PLC. + +A queue depth above zero with `inFlight` also climbing suggests the PLC is keeping up with requests but slowly; the EWMA round-trip on the same snapshot will confirm. A queue depth above zero with `inFlight` flat at the allocator's saturation ceiling indicates a stuck backend (no responses arriving, no slots freeing); the watchdog will eventually clear the stuck entries via 0x0B exceptions. 
+ +### Memory footprint per PLC + +Each `PlcMultiplexer` holds a `bool[65536]` for the TxId allocator (~64 KB), the `ConcurrentDictionary` for the correlation map (sized to peak in-flight, typically tens of bytes per entry plus the `byte[]` frame buffers referenced by the entries), the bounded outbound channel (≤ 256 frames in flight; each frame at most 260 bytes), and the per-pipe response channels (≤ 16 frames per attached pipe). With ~54 PLCs the allocator alone accounts for roughly 3.4 MB; the rest is request-rate dependent and well within the service's measured ~30 MB working set under load. + +## Lifecycle And Disposal + +`PlcMultiplexer.DisposeAsync` is idempotent and runs in this order: + +1. Sets `_disposed = true` and unhooks the live `IMultiplexCountersProvider` registration so a concurrent status snapshot does not observe internal state mid-teardown. +2. Cancels `_disposeCts`, which cooperatively stops the watchdog task. +3. Awaits the watchdog with a 2-second timeout so its in-flight 0x0B dispatches settle before tests assert against counter values. +4. Calls `TearDownBackendAsync("disposing", cascadeUpstreams: true)` to close the backend, drain `CorrelationMap`, drain `InFlightByKeyMap`, and dispose every attached pipe. +5. Completes the outbound channel writer, then disposes any pipes that were not already cleared by the cascade walk. +6. Disposes `_disposeCts`. + +`UpstreamPipe.DisposeAsync` is similarly idempotent: it completes its response channel writer, cancels its internal CTS, shuts the upstream socket down both ways, and emits a `MultiplexerLogEvents.ClientDisconnected` event with the remote endpoint string and a reason. Disposal can be triggered by the listener (clean upstream EOF), by the read or write loop encountering a socket error, or by the cascade walk. 
+ +## pymodbus 3.13.0 Simulator Quirk + +The pymodbus simulator's `ServerRequestHandler` stores a single `last_pdu` field per connection and schedules deferred response handlers via `asyncio.call_soon`. When two MBAP frames arrive in the same recv buffer — exactly the workload the multiplexer can produce on its shared backend socket — the second frame's `last_pdu` overwrites the first before either deferred handler runs. Both responses then carry the second request's TxId. + +### Why this only matters in tests + +The real H2-ECOM100 does not have this bug; it echoes per-request TxIds correctly. Multiplexer correctness under genuine backend concurrency is proven by the unit tests in `PlcMultiplexerTests` against a stub backend that respects MBAP TxIds, not via the simulator. The E2E suite paces requests against the pymodbus simulator to keep it in known-good single-PDU mode. + +The per-request timeout watchdog described above is the production defence against any backend (real or simulated) that mis-echoes a TxId: the unanswered `InFlightRequest` ages past `BackendRequestTimeoutMs` and the upstream party receives a clean Modbus exception 0x0B rather than a hung socket. 
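The claim race that makes this defence safe, where a late response and the watchdog compete for the same entry and only one wins, can be sketched with a plain `ConcurrentDictionary`. The `WatchdogClaimSketch` type is hypothetical and tracks only send timestamps; the strict "older than" comparison is an assumption, and the real `CorrelationMap` stores full `InFlightRequest` records.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of snapshot-then-claim timeout handling.
public sealed class WatchdogClaimSketch
{
    private readonly ConcurrentDictionary<ushort, DateTimeOffset> _sentAt = new();

    public bool Track(ushort proxyTxId, DateTimeOffset sentAtUtc) => _sentAt.TryAdd(proxyTxId, sentAtUtc);

    // Reader path: true only if this caller removed the entry.
    public bool ClaimForResponse(ushort proxyTxId) => _sentAt.TryRemove(proxyTxId, out _);

    // Watchdog tick: snapshot stale keys without removing, then compete for each one.
    public List<ushort> ClaimTimedOut(DateTimeOffset now, TimeSpan timeout)
    {
        DateTimeOffset cutoff = now - timeout;
        List<ushort> stale = _sentAt.Where(kv => kv.Value < cutoff).Select(kv => kv.Key).ToList();

        var claimed = new List<ushort>();
        foreach (ushort id in stale)
        {
            // A response that landed after the snapshot has already removed the entry; skip it.
            if (_sentAt.TryRemove(id, out _)) claimed.Add(id);
        }
        return claimed;
    }

    // Tick interval: a quarter of the request timeout, floored at 100 ms.
    public static int TickMs(int backendRequestTimeoutMs) => Math.Max(100, backendRequestTimeoutMs / 4);
}
```

Only the caller whose `TryRemove` succeeds goes on to release the TxId and dispatch a frame, so an upstream party never receives both a real response and a 0x0B exception for the same request.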
+ +## Related Documentation + +- [`./Overview.md`](./Overview.md) — proxy architecture entry point +- [`./ReadCoalescing.md`](./ReadCoalescing.md) — FC03/FC04 fan-out built on `InterestedParties` +- [`./ResponseCache.md`](./ResponseCache.md) — per-PLC FC03/FC04 cache layered in front of this multiplexer +- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `Connection.BackendConnectTimeoutMs`, `Connection.BackendRequestTimeoutMs`, retry tuning +- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades` counters +- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.multiplex.*` structured log events +- [`../Testing/Simulator.md`](../Testing/Simulator.md) — pymodbus 3.13.0 deferred-handler quirk in detail +- [`../../DL260/dl205.md`](../../DL260/dl205.md) — DL205/DL260 quirks including the 4-client ECOM cap diff --git a/mbproxy/docs/Architecture/Overview.md b/mbproxy/docs/Architecture/Overview.md new file mode 100644 index 0000000..f6f7819 --- /dev/null +++ b/mbproxy/docs/Architecture/Overview.md @@ -0,0 +1,150 @@ +# Architecture Overview + +`mbproxy` is a .NET 10 background service that sits inline between Modbus TCP clients and a fleet of AutomationDirect DL205/DL260 PLCs, rewriting BCD-encoded registers in both directions while multiplexing many upstream clients onto one persistent backend socket per PLC. + +This document is the entry point for readers new to the codebase. It sketches the runtime shape, the listener topology, the per-PLC isolation model, and the path a single Modbus frame takes from accept to response, and then hands off to the per-feature documents under `docs/Architecture/`, `docs/Features/`, and `docs/Operations/`. + +## Runtime Shape + +The process is a single .NET 10 Generic Host worker. 
`Microsoft.Extensions.Hosting.WindowsServices` registers the host as a Windows Service so the same binary runs interactively (for development) or under the SCM (in production). All configuration binds from `appsettings.json` through `IOptionsMonitor`, which makes the tag list and PLC roster hot-reloadable without process restart. `ProxyWorker` is the long-lived `BackgroundService` that owns startup, shutdown, and the listener supervisors for every PLC. A small Kestrel admin endpoint runs in the same process to serve the read-only status page. + +There is no in-process database, no message broker, and no persistent cache file: state is per-PLC, in-memory, and ephemeral. Restarting the service drops every in-flight request and every cached response. Upstream clients are expected to reconnect and reissue; the proxy never replays a request on their behalf. + +## Listener Topology + +The proxy opens **one `TcpListener` per PLC** on a distinct port. A client picks which PLC it is talking to by choosing which port to connect to. There is no protocol-level routing — port number is the PLC identity. This keeps the upstream surface trivial for Wonderware, Historian gateways, and generic Modbus clients that already know how to point at `host:port`, and it means no per-frame header inspection is needed to decide where a request is going. + +```text +Client A ──┐ +Client B ──┼──→ proxy:5020 ──→ PLC #1 (10.0.1.1:502) + ├──→ proxy:5021 ──→ PLC #2 (10.0.1.2:502) + │ ... + └──→ proxy:5073 ──→ PLC #54 (10.0.1.54:502) +``` + +Each listener runs under a `PlcListenerSupervisor` that owns its bind lifecycle. If a bind fails at startup or the listener faults at runtime, the supervisor reattempts under a Polly retry pipeline; the same code path also brings up newly-added PLCs from hot-reload and tears down removed ones. The supervisor's state (`SupervisorState`) is observable on the status page so an operator can tell at a glance whether a port is bound, recovering, or shut down. 
+ +Because port identity is PLC identity, adding a PLC is purely a configuration change — append an entry to `Mbproxy.Plcs` with a free `ListenPort`, save, and the supervisor reconciliation loop binds the new port without touching any other PLC. Removing a PLC follows the same path in reverse. + +## Per-PLC Isolation + +Every PLC gets its own `PerPlcContext` carrying that PLC's `PlcMultiplexer`, `CorrelationMap`, `TxIdAllocator`, `InFlightByKeyMap`, optional `ResponseCache`, `CacheInvalidator`, and `BcdPduPipeline`. There is no shared mutable state across PLCs at the request path. + +The consequence is fault containment: + +- A slow or dead backend on PLC #17 cannot block the request loop for PLC #18. Each multiplexer owns its own outbound channel and its own backend reader/writer task pair. +- A flood of in-flight requests on one PLC consumes only that PLC's TxId allocator (the 16-bit space is per-PLC, not global). +- A backend disconnect on one PLC cascades only to that PLC's attached upstream pipes; the rest of the fleet is unaffected. +- Hot-reload of one PLC's tag list rewrites only that PLC's `BcdPduPipeline` view of the tag map. Other PLCs do not observe the swap. + +The listener topology and the per-PLC component graph are deliberately aligned: one port, one supervisor, one multiplexer, one backend socket, one cache instance. + +Cross-PLC state exists only in three places, and each is read-mostly: the bound `IOptionsMonitor` snapshot, the global Serilog logger, and the service-wide counter set surfaced on the status page. Counters are written via lock-free `Interlocked` operations on disjoint per-PLC fields, then summed when the status page is rendered. + +This isolation is what lets the service operate degraded without operator intervention. 
If three PLCs drop off the network, the supervisor for each enters `recovering`, their multiplexers tear down their backend sockets, attached upstream clients are disconnected, and the remaining 51 PLCs keep serving traffic with no measurable impact. When the dropped PLCs come back, their supervisors rebind their listeners and the next upstream request triggers a fresh backend connect through the Polly pipeline — no fleet-wide restart, no manual reconnect, no shared state to flush. + +## Request Flow + +The path of an FC03 read from an upstream client through the proxy and back. The cache check, the coalescing check, and the BCD rewrite all sit between the upstream parse and the backend send so the multiplexer can short-circuit the backend entirely when it does not need to be involved. Steps the upstream client never sees are indented. + +```text +Upstream client + │ TCP connect → proxy:5020 + ▼ +PlcListener (PlcListener.cs) accepts the socket + │ + ▼ +UpstreamPipe wraps the socket: read loop + bounded response channel + │ parses MBAP frames off the wire, hands each frame to: + ▼ +PlcMultiplexer.OnUpstreamFrameAsync(pipe, frame, ct) + │ + │ 1. Parse MBAP header → originalTxId, unitId + │ 2. Parse PDU → fc, startAddr, qty + │ 3. (FC03/FC04 only) ResponseCache.TryGet(CacheKey) + │ ├─ HIT → splice cached payload onto a fresh MBAP header + │ │ with originalTxId, push to upstream channel, DONE. + │ └─ MISS → fall through. + │ 4. InFlightByKeyMap coalesce check + │ ├─ duplicate read in flight → attach as additional waiter, + │ │ share the eventual response, DONE for this frame. + │ └─ first-of-key → become the leader, fall through. + │ 5. BcdPduPipeline rewrites request payload (FC06/FC16) binary → BCD + │ 6. TxIdAllocator hands out a free proxyTxId + │ 7. CorrelationMap[proxyTxId] = InFlightRequest(pipe, originalTxId, ...) + │ 8. 
Overwrite MBAP TxId field with proxyTxId; enqueue to outbound channel + ▼ +Backend writer task drains the outbound channel + │ → single persistent socket → PLC :502 + ▼ +PLC responds; backend reader task picks the frame off the socket + │ + │ 9. Look up proxyTxId in CorrelationMap; recover original requester(s) + │ 10. BcdPduPipeline rewrites response payload (FC03/FC04) BCD → binary + │ 11. ResponseCache stores the rewritten payload (if TTL > 0) + │ 12. Fan out to every waiter on the InFlightByKey entry, restoring each + │ waiter's originalTxId before pushing into its UpstreamPipe channel + ▼ +UpstreamPipe writer task drains its response channel → upstream socket + │ + ▼ +Upstream client sees a response with the TxId it originally sent. +``` + +Writes (FC06, FC16) take a shorter path: no cache lookup, no coalescing, but the request payload is BCD-rewritten before forwarding, and the response triggers `CacheInvalidator` to evict any overlapping cached read ranges so the next read does not serve stale data. + +A few invariants are worth flagging because they shape the design: + +- **Original TxId is preserved end-to-end.** The multiplexer rewrites the wire TxId for routing, but every upstream client sees the exact 16-bit value it sent. `InFlightRequest` carries the original TxId alongside the upstream pipe reference. +- **Single backend writer, single backend reader.** No socket-level synchronisation is needed because exactly one task writes to the backend socket and exactly one task reads from it. The outbound channel funnels every request through that single writer. +- **The cache check happens before backend connect.** If every read in a request is cache-served and the backend is currently disconnected, the upstream client still gets a response. The cache survives backend transitions intentionally. 
+- **No mid-request retries on writes.** FC06 and FC16 are non-idempotent on BCD tags (a partial-applied multi-register write could leave a 32-bit BCD value mid-transition), so a backend failure during a write surfaces as Modbus exception 0x0B and the client decides how to recover. + +## Component Map + +The major components a reader will hit when tracing a request, with their file locations under `src/Mbproxy/`. The list is ordered by where each component sits in the request path — accept loop at the top, rewrite at the bottom. + +- **`ProxyWorker`** — `Proxy/ProxyWorker.cs`. The `BackgroundService` host; reconciles the configured PLC list with the supervisor roster on startup and on `IOptionsMonitor` change events. +- **`PlcListenerSupervisor`** — `Proxy/Supervision/PlcListenerSupervisor.cs`. Owns one PLC's listener lifecycle (bind, run, recover, shut down). Uses Polly for bounded recovery. +- **`PlcListener`** — `Proxy/PlcListener.cs`. The actual `TcpListener` accept loop for one PLC; hands every accepted socket to that PLC's multiplexer as a new `UpstreamPipe`. +- **`UpstreamPipe`** — `Proxy/Multiplexing/UpstreamPipe.cs`. One per upstream socket. Frame-parses inbound bytes and pushes parsed MBAP frames into the multiplexer; drains outbound responses from a bounded channel back to the client. +- **`PlcMultiplexer`** — `Proxy/Multiplexing/PlcMultiplexer.cs`. The per-PLC fanin/fanout core. Owns the persistent backend socket, the outbound write loop, the backend read loop, the per-request watchdog, and the cascade-on-backend-disconnect contract. Entry point `OnUpstreamFrameAsync` is where every upstream frame enters the request path; it is the single function that ties cache, coalescing, BCD rewrite, TxId allocation, and correlation together. +- **`CorrelationMap`** — `Proxy/Multiplexing/CorrelationMap.cs`. Maps `proxyTxId → InFlightRequest` so backend responses can be routed back to the originating upstream pipe(s). 
Also the surface the watchdog scans for stale entries. +- **`TxIdAllocator`** — `Proxy/Multiplexing/TxIdAllocator.cs`. Allocates and recycles the per-PLC 16-bit proxy TxId space used by the multiplexer. +- **`InFlightByKeyMap`** — `Proxy/Multiplexing/InFlightByKeyMap.cs`. The read-coalescing seam: keys on `(unitId, fc, startAddr, qty)` so duplicate concurrent reads share one backend round-trip and one response. +- **`ResponseCache`** — `Proxy/Cache/ResponseCache.cs`. Opt-in per-tag-range TTL cache for FC03/FC04 responses. A cache hit short-circuits the backend entirely; cache lookup happens before the multiplexer even ensures the backend is connected. +- **`CacheInvalidator`** — `Proxy/Cache/CacheInvalidator.cs`. Invalidates cached read ranges that overlap with successful FC06/FC16 writes, so writes never leave stale reads behind. +- **`BcdPduPipeline`** — `Proxy/BcdPduPipeline.cs`. The actual BCD rewrite: walks request and response PDUs against the resolved tag map and re-encodes each configured register between BCD nibbles and binary integers. 32-bit BCD tags spanning the CDAB word pair are rewritten as a unit. Non-BCD registers pass through untouched, and any function code the pipeline does not own (diagnostics, exceptions, coil and discrete-input functions) is forwarded byte-for-byte. + +`PerPlcContext` (`Proxy/PerPlcContext.cs`) is the container that binds these together for one PLC and is the handle the supervisor and multiplexer carry around. + +Two supporting abstractions are worth knowing about even though they do not appear in the per-frame path: + +- **`IPduPipeline`** — the rewrite-pipeline interface (`Proxy/IPduPipeline.cs`). `BcdPduPipeline` is the production implementation; `NoopPduPipeline` is the test/passthrough implementation used when no BCD tags are configured for a PLC. +- **`MbapFrame`** — the static helper (`Proxy/MbapFrame.cs`) that parses and serialises the 7-byte MBAP header. 
Every component that touches the wire goes through this helper rather than indexing raw byte arrays directly. + +Counters and structured log event names emitted from these components are catalogued in `ProxyCounters` (`Proxy/ProxyCounters.cs`) and the various `*LogEvents` static classes (`MultiplexerLogEvents`, `CoalescingLogEvents`, `CacheLogEvents`, `RewriterLogEvents`). A reader following a runtime symptom back to its source should grep for the event-name constants in those files first. + +## Where to Read Next + +For the wire-level details of how one backend socket fans out to many upstream clients — TxId rewriting, the correlation map, the per-request watchdog, the backend disconnect cascade — read [`./ConnectionModel.md`](./ConnectionModel.md). It is the most load-bearing internal document; almost every failure-mode question routes through it. + +For the read-coalescing seam (when duplicate concurrent reads collapse onto one backend request) read [`./ReadCoalescing.md`](./ReadCoalescing.md). For the opt-in TTL cache and how writes invalidate overlapping read ranges read [`./ResponseCache.md`](./ResponseCache.md). The BCD rewrite itself — what gets rewritten, what passes through, and how CDAB 32-bit values are handled — is in [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md). + +Operators looking for configuration shape, hot-reload semantics, and the status page should start at [`../Operations/Configuration.md`](../Operations/Configuration.md) and [`../Operations/StatusPage.md`](../Operations/StatusPage.md). When something is misbehaving in production, [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) and [`../Reference/LogEvents.md`](../Reference/LogEvents.md) are the two places to look first. + +The simulator used by the end-to-end test suite — a `pymodbus`-based stand-in for a real DL205 — has its own document at [`../Testing/Simulator.md`](../Testing/Simulator.md). 
Test-only quirks of that simulator are called out there rather than in the production docs, because the real DL260 ECOM does not share them. + +## Related Documentation + +- [`./ConnectionModel.md`](./ConnectionModel.md) — TxId multiplexing, correlation map, per-request watchdog. +- [`./ReadCoalescing.md`](./ReadCoalescing.md) — how `InFlightByKeyMap` collapses duplicate concurrent reads. +- [`./ResponseCache.md`](./ResponseCache.md) — `ResponseCache` and `CacheInvalidator` semantics. +- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — the `BcdPduPipeline` rewrite rules. +- [`../Features/HotReload.md`](../Features/HotReload.md) — `IOptionsMonitor` propagation and supervisor reconciliation. +- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `appsettings.json` schema and tag list shape. +- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — the Kestrel admin endpoint and counter catalog. +- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — stable structured log event names. +- [`../design.md`](../design.md) — canonical design decisions and rationale. +- [`../Testing/Simulator.md`](../Testing/Simulator.md) — `pymodbus` DL205 simulator used by the end-to-end suite. +- [`../plan/README.md`](../plan/README.md) — phase plan with per-phase test inventory. diff --git a/mbproxy/docs/Architecture/ReadCoalescing.md b/mbproxy/docs/Architecture/ReadCoalescing.md new file mode 100644 index 0000000..70e0eca --- /dev/null +++ b/mbproxy/docs/Architecture/ReadCoalescing.md @@ -0,0 +1,243 @@ +# Read Coalescing + +In-flight read coalescing collapses identical FC03/FC04 requests that arrive +while a backend response is still in flight onto a single backend round-trip, +then fans the single response out to every attached upstream client with each +client's original MBAP transaction ID restored. 
+ +## What Coalescing Does + +When two upstream clients each send `(unitId=1, FC=3, start=100, qty=10)` +within the in-flight window of a previously-routed request, the second +arrival attaches to the existing `InFlightRequest` instead of opening a new +proxy transaction ID and a second backend round-trip. The PLC's reply is +delivered to both upstream pipes; each pipe sees its own MBAP `TxId` +restored on its copy of the response. + +The value each upstream sees is the same value an uncoalesced request would +have returned within the PLC's own scan-time precision (microseconds to +~10 ms typical window). Coalescing is not a cache layer — once the response +fans out, the in-flight entry dies, and a subsequent identical read opens a +fresh round-trip. Bounded-staleness caching is a separate feature; see +[`./ResponseCache.md`](./ResponseCache.md). + +## The Coalescing Key + +The lookup tuple is defined in `CoalescingKey.cs`: + +```csharp +internal readonly record struct CoalescingKey( + byte UnitId, + byte Fc, + ushort StartAddress, + ushort Qty); +``` + +Record-struct value equality drives the dictionary lookup in +`InFlightByKeyMap`. Several axes never coalesce, by design: + +- **Function code.** FC03 (Read Holding Registers) and FC04 (Read Input + Registers) read different Modbus tables on the device. Their responses + are not interchangeable, so they do not share a key even at the same + address. +- **Unit ID.** Distinct unit IDs behind a shared socket address different + Modbus personalities — coalescing never crosses a unit boundary. +- **Start address and quantity.** Two reads with overlapping but + non-identical ranges never coalesce. Range-overlap logic exists for cache + invalidation, not for coalescing. + +## Eligibility + +Only FC03 and FC04 enter the coalescing path. The multiplexer's request +handler parses the function code from the inbound PDU and gates on +`fcByte is 0x03 or 0x04` before consulting `_inFlightByKey`. 
+ +- FC06 (Write Single Register) and FC16 (Write Multiple Registers) are + non-idempotent on BCD tags — a second send would write the value twice. + Writes bypass coalescing entirely and always take the one-round-trip path. +- Exception responses do not coalesce. Each upstream sees an exception + delivered against its own MBAP `TxId` through the normal correlation map + fan-out; there is no special exception-deduplication path. + +## The InterestedParties Seam + +The data shape that powers fan-out lives on `InFlightRequest`: + +```csharp +internal sealed record InFlightRequest( + byte UnitId, + byte Fc, + ushort StartAddress, + ushort Qty, + IReadOnlyList<InterestedParty> InterestedParties, + DateTimeOffset SentAtUtc, + int ResolvedCacheTtlMs = 0); + +internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId); +``` + +Each `InterestedParty` records the upstream pipe to deliver the response to +and the original MBAP `TxId` that pipe sent. The backend reader iterates +this list, patches each party's `OriginalTxId` into a per-party copy of the +response frame, and hands the frame to `party.Pipe.SendResponseAsync`. + +### Multi-writer multi-reader safety + +The list typed as `IReadOnlyList<InterestedParty>` on the public surface is +in fact a mutable `List<InterestedParty>` underneath. `InFlightByKeyMap` +serialises every state mutation under a single `object` lock: + +- `TryAttachOrCreate` looks up the key, casts the existing + `InterestedParties` back to `List<InterestedParty>`, and appends the new + party — all under the lock. +- The backend reader calls `TryRemove(coalKey, out _)` **before** it + iterates the parties list during fan-out. Once the key is gone from the + map, no future attach can find it, so no further appends can occur. + +The reader's removal-before-iteration ordering is the load-bearing +invariant. By the time fan-out reads the list, the list is effectively +frozen — there is no other writer that can reach it.
The watchdog timeout +path observes the same protocol: it removes the coalescing key before it +walks `req.InterestedParties` to deliver exception 0x0B. + +The reverse race (reader removes first, then a late attach arrives) is +impossible by construction — `TryRemove` and `TryAttachOrCreate` both take +the same map lock, so any late attach is serialised either entirely before +the removal (and is part of the fan-out) or entirely after (and opens a +fresh entry under a new factory call). + +## MaxParties Cap + +`ResilienceOptions.cs` exposes the load-shedding cap: + +```csharp +public sealed class ReadCoalescingOptions +{ + public bool Enabled { get; init; } = true; + public int MaxParties { get; init; } = 32; +} +``` + +`Mbproxy.Resilience.ReadCoalescing.MaxParties` defaults to 32. Inside +`TryAttachOrCreate`, an existing entry is only extended when +`existingList.Count < maxParties`; once the cap is hit, the next identical +arrival falls through to the factory branch and opens a fresh in-flight +entry (which means a fresh backend round-trip). + +The cap bounds two costs: + +- **Fan-out cost per entry** at O(MaxParties). The backend reader's + per-party copy-and-patch loop runs at most `MaxParties` times for any + single response. +- **Backend reader latency under pile-on.** A single pathologically popular + read (every HMI hitting the same tag at the same second) cannot stretch + one fan-out arbitrarily long. + +## Hot-Reloadable On/Off + +`Mbproxy.Resilience.ReadCoalescing.Enabled` defaults to `true`. The +multiplexer holds a `Func<ReadCoalescingOptions>` accessor that production +binds to `() => optionsMonitor.CurrentValue.Resilience.ReadCoalescing`, so +a hot-reload of `appsettings.json` propagates immediately on the next +inbound PDU. + +Flipping `Enabled` to `false` at runtime does not disturb already-coalesced +entries: existing fan-outs drain through the backend reader naturally.
+Subsequent FC03/FC04 requests skip the coalescing branch entirely and take +the one-proxy-TxId-per-upstream-request path verbatim. + +The same accessor reads `MaxParties` per PDU, so an operator can raise or +lower the cap without restarting the service. + +## Lookup Order in the Multiplexer's Read Path + +`OnUpstreamFrameAsync` consults three tiers in fixed order for FC03/FC04: + +1. **Cache** — if `_ctx.Cache` is wired and `_ctx.TagMap.ResolveCacheTtlMs` + returns a positive TTL for the read range, the response cache is + checked first. A hit short-circuits everything, including the + `EnsureBackendConnectedAsync` call. See + [`./ResponseCache.md`](./ResponseCache.md). +2. **Coalesce** — on a cache miss (or no cache configured), the request + consults `_inFlightByKey` via `TryAttachOrCreate`. A hit attaches the + new party to an in-flight peer and emits no backend traffic. +3. **Backend** — on a coalescing miss, the factory branch allocates a + proxy `TxId` through `TxIdAllocator`, registers the entry in + `CorrelationMap`, runs the BCD rewriter on the request PDU, and queues + the frame onto the outbound channel. + +The order is load-bearing. Cache hits avoid both backend traffic **and** +any coalescing-entry housekeeping. Coalescing hits avoid the backend but +still incur a list-append and a fan-out. Backend round-trips are the most +expensive of the three. + +## Counter Accounting + +`PerPlcContext.Counters` exposes three coalescing-specific counters, all +surfaced on the status page: + +- **`coalescedHitCount`** — increments inside `OnUpstreamFrameAsync` when + `TryAttachOrCreate` returns `wasNew == false` (the request attached to + an existing in-flight entry). +- **`coalescedMissCount`** — increments when `wasNew == true`. 
The + non-coalescing FC03/FC04 path also increments this counter when + coalescing is disabled, so the identity `coalescedHitCount + + coalescedMissCount == total FC03+FC04 requests since multiplexer + construction` holds regardless of `Enabled`. +- **`coalescedResponseToDeadUpstream`** — increments inside the backend + reader's fan-out loop when a coalesced party's pipe has gone dead + (`party.Pipe.IsAlive == false`) before the response landed. Only + counted when the in-flight entry had more than one party — single-party + dead-upstream skips are the normal Phase-9 behaviour and are silent. + +When `ReadCoalescing.Enabled == false`, `coalescedHitCount` remains zero +and every FC03/FC04 read increments `coalescedMissCount`. Aggregate fleet +metrics (hit ratio, requests per second) read directly from these +counters; see [`../Operations/StatusPage.md`](../Operations/StatusPage.md). + +The Debug-level log events `mbproxy.coalesce.hit`, +`mbproxy.coalesce.miss`, and `mbproxy.coalesce.dead_upstream` mirror each +counter increment; see [`../Reference/LogEvents.md`](../Reference/LogEvents.md). + +## Transparency Contract Preserved + +Each upstream client receives the same response shape it would have +received from a one-to-one proxy: + +- **Original MBAP `TxId` restored.** The backend reader patches + `outFrame[0..2]` with `party.OriginalTxId` for each party in the + `InterestedParties` list. The proxy's internal TxId never reaches an + upstream socket. +- **BCD rewriter runs once.** `_pipeline.Process(ResponseToClient, ...)` + fires exactly once against the shared backend response buffer. Cached + rewriter context (start address, quantity) comes from the + `InFlightRequest` that opened the round-trip. +- **One-party fan-out reuses the buffer.** When + `inFlight.InterestedParties.Count == 1`, the backend reader assigns the + original `frame` reference to `outFrame` instead of cloning, saving the + allocation. 
Multi-party fan-outs clone the frame per party so each can + carry a distinct `TxId` without trampling its peers. + +Coalescing is invisible at the wire-protocol layer. An upstream client +cannot tell whether its read was served by a fresh backend round-trip or +by attaching to a peer's in-flight request — only the timing distribution +changes. + +## Related Documentation + +- [`./ConnectionModel.md`](./ConnectionModel.md) — multiplexer overview; + the `InterestedParties` seam, `CorrelationMap`, and `TxIdAllocator` live + here. +- [`./ResponseCache.md`](./ResponseCache.md) — bounded-staleness cache that + sits above coalescing in the lookup order; cache hits short-circuit + coalescing entirely. +- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — exposes + `coalescedHitCount`, `coalescedMissCount`, and + `coalescedResponseToDeadUpstream` per PLC and as fleet aggregates. +- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — full + `mbproxy.coalesce.*` event catalogue with event IDs. +- [`../Operations/Configuration.md`](../Operations/Configuration.md) — + binding for `Mbproxy.Resilience.ReadCoalescing.Enabled` and `MaxParties`, + hot-reload semantics. +- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — the + rewriter that runs once on the shared response buffer before fan-out. diff --git a/mbproxy/docs/Architecture/ResponseCache.md b/mbproxy/docs/Architecture/ResponseCache.md new file mode 100644 index 0000000..420e0b5 --- /dev/null +++ b/mbproxy/docs/Architecture/ResponseCache.md @@ -0,0 +1,398 @@ +# Response Cache + +The response cache is an opt-in per-tag, bounded-staleness layer that serves +FC03 and FC04 reads from in-process memory. It sits above read coalescing in +the request path so a hit avoids both the coalescing entry and the backend +round-trip entirely. + +## Cache Contract + +The cache is **off by default for every tag**. 
`CacheTtlMs = 0` on every BCD +tag is the default state, and a deployment that ships without any TTL +configuration behaves identically to one compiled without the cache at all +— no in-memory entries are created, every FC03/FC04 read falls through to +the coalescing-then-backend path, and counters that track cache activity +stay at zero. + +Operators opt a tag in by setting a positive `CacheTtlMs`. That positive +value is the explicit acknowledgement of the staleness window: the operator +is stating, "I am willing for upstream clients to see a value up to N +milliseconds old in exchange for taking the read off the backend." There is +no implicit cache enablement. There is no global cache toggle that turns +caching on for previously-uncached tags. Every cached tag is one whose +configuration has a positive TTL on its line. + +This stance is the design-contract pivot the cache introduces: before it, +the proxy is purely transparent except for BCD rewriting. With the cache, +the proxy is transparent **by default**, with an opt-in cache layer the +operator can engage tag-by-tag. + +## TTL Resolution Order + +Each FC03/FC04 read range resolves to one effective TTL through three +tiers: + +1. **Explicit per-tag.** `BcdTagOptions.CacheTtlMs` on the tag entry. A + non-null value wins regardless of the per-PLC default. An explicit `0` + here disables caching for that tag even when the PLC default is + positive. +2. **Per-PLC default.** `PlcOptions.DefaultCacheTtlMs` applies to any tag + whose explicit `CacheTtlMs` is `null` (unset). A `0` default means "no + caching by default at this PLC." +3. **Zero.** With nothing set at either tier, the resolved TTL is `0` and + the read is uncached. + +`BcdTagMap.ResolveCacheTtlMs(startAddress, qty)` implements the per-read +resolution. It enumerates the BCD tags whose register footprints intersect +the requested range and returns the smallest positive TTL across the hits, +or `0` if the range covers no configured tags or if any covered tag +resolves to a non-positive TTL.
+ +```csharp +public int ResolveCacheTtlMs(ushort startAddress, ushort qty) +{ + if (!TryGetForRange(startAddress, qty, out var hits) || hits.Count == 0) + return 0; + + int min = int.MaxValue; + foreach (var hit in hits) + { + int ttl = hit.Tag.CacheTtlMs; + if (ttl <= 0) return 0; + if (ttl < min) min = ttl; + } + return min == int.MaxValue ? 0 : min; +} +``` + +The `hit.Tag.CacheTtlMs` value resolved on each `BcdTag` already reflects +the explicit-then-default order — the options binder resolves the per-tag +override against the per-PLC default at config build time, so the runtime +hot path sees a single integer per tag. + +## Multi-Tag Range TTL Rule + +When a single FC03/FC04 read covers multiple configured BCD tags, the +effective TTL is the minimum across them: + +```text +range covers tags { A:TTL=500, B:TTL=2000, C:TTL=100 } → effective TTL = 100 +range covers tags { A:TTL=500, B:TTL=0 (uncached) } → effective TTL = 0 +range covers tags { A:TTL=500 } → effective TTL = 500 +range covers no configured tags → effective TTL = 0 +``` + +If any covered tag has `CacheTtlMs = 0`, the whole read is uncached. The +rationale is conservative-by-design: a multi-tag read whose narrowest TTL +is, for example, 100 ms cannot be served safely from an entry that was +stored under a tag with TTL 2 s, because that entry's freshness was only +guaranteed by the longer window. Rather than partition a range read across +heterogeneous TTLs or invent inheritance rules that an operator would have +to reason about per-deployment, the cache refuses to serve any multi-tag +read whose narrowest covered TTL is zero. Operators who want a tag cached +in isolation but uncached when read alongside an uncached neighbour get the +expected behaviour by leaving the neighbour at `CacheTtlMs = 0`. + +A read whose range covers no configured BCD tags also resolves to `0`. 
+There is nothing to be conservative about because the cache only serves +ranges that contain rewriter-tracked tags — a read of plain non-BCD +registers does not engage the cache regardless of any per-PLC default. + +## Lookup Order + +The multiplexer's FC03/FC04 path consults three tiers in fixed order: + +1. **Cache.** When `_ctx.Cache` is wired and `BcdTagMap.ResolveCacheTtlMs` + returns a positive TTL for the read range, `ResponseCache.TryGet` is + called against a `CacheKey(unitId, fc, startAddress, qty)`. A hit + splices the cached payload onto a fresh MBAP header carrying the + original upstream TxId, pushes the frame onto that pipe's response + channel, and **returns without engaging coalescing or the backend at + all**. +2. **Coalesce.** On a cache miss (or when the resolved TTL is zero), the + request is offered to `InFlightByKeyMap.TryAttachOrCreate`. A hit + attaches the new party to a peer's in-flight request. +3. **Backend.** On a coalescing miss, the request opens a proxy TxId, + registers a `CorrelationMap` entry, runs the BCD rewriter on any FC06 + or FC16 payload, and queues the frame onto the outbound channel. + +The cache check happens **before** the multiplexer's +`EnsureBackendConnectedAsync` call. A cache hit serves the upstream even +when the backend socket is currently disconnected or recovering. This is +not an accident — the cached payload's freshness is bounded by its TTL, +not by the liveness of the backend socket. See +[`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) for +the operator view of cache-served reads during a backend outage. + +## Storage Format: Post-Rewriter Bytes + +`CacheEntry.PduBytes` holds the **post-rewriter response PDU body** — the +function code byte, the byte count, and the rewriter-decoded register +data, with no MBAP header. The backend reader task decodes the response +through `BcdPduPipeline` first and only then hands the rewritten payload +to `ResponseCache.Set`. 
+ +```csharp +internal sealed record CacheEntry( + byte[] PduBytes, + DateTimeOffset CachedAtUtc, + DateTimeOffset ExpiresAtUtc, + int Length, + long LastUsedTick); +``` + +Storing post-rewriter bytes is both a CPU optimisation and a correctness +guarantee: + +- **CPU.** A cache hit returns ready-to-send bytes. The rewriter does not + re-run per hit; only the MBAP header is regenerated to carry the + upstream's original TxId. +- **Correctness.** An entry decoded against an earlier rewriter version + never gets retroactively re-transformed against a newer version. If the + rewriter's behaviour changes mid-process (it does not today, but the + guarantee is durable across future changes), in-flight cached entries + age out under their TTL and are replaced by fresh entries decoded + through the new rewriter. A bidirectional re-encode never happens to an + already-stored entry. + +## Write Invalidation by Address Range Overlap + +A successful (non-exception) FC06 or FC16 response invalidates every +cached FC03 or FC04 entry whose address range +`[StartAddress, StartAddress + Qty)` overlaps the write range +`[writeStart, writeStart + writeQty)`. The pure overlap math lives in +`CacheInvalidator.FindOverlapping`: + +```csharp +int writeEnd = writeStart + writeQty; // half-open upper bound + +foreach (var key in haystack) +{ + if (key.UnitId != unitId) continue; + if (key.Fc != 0x03 && key.Fc != 0x04) continue; + + int keyEnd = key.StartAddress + key.Qty; + // Overlap iff writeStart < keyEnd AND key.StartAddress < writeEnd. 
+ if (writeStart < keyEnd && key.StartAddress < writeEnd) + hits.Add(key); +} +``` + +Worked examples on a single unit ID: + +```text +Write to register 105 (qty=1) + └─ invalidates cached FC03 [100..110) — register 105 is inside the cached range + └─ leaves cached FC03 [200..210) untouched + +Write to registers [10..15) (qty=5) + └─ leaves cached FC03 [15..20) untouched — half-open intervals, 15 is not in [10..15) + +Write to registers [98..108) (qty=10) + └─ invalidates cached FC03 [100..110) — ranges overlap on [100..108) +``` + +Three properties of the invalidator deserve calling out: + +- **Exception responses do not invalidate.** A Modbus exception (code 01, + 02, 03, 04, or any other) means the write did not take effect on the + PLC. The cached read is still consistent with the device, so the + invalidator is not engaged. +- **Different unit IDs never invalidate each other.** Multi-drop and + gateway personalities behind a shared socket address logically separate + Modbus tables. `CacheKey.UnitId` discriminates. +- **Only FC03 and FC04 entries are evicted.** The cache never stores write + responses, so the invalidator's function-code filter is defensive + rather than load-bearing. + +## Bounded Capacity (LRU) + +Each `ResponseCache` instance is capped at `Cache.MaxEntriesPerPlc` +(default 1000). When the dictionary is at the cap and a fresh insert +arrives, `EvictLeastRecentlyUsed` walks the entries and removes the one +with the smallest `CacheEntry.LastUsedTick`. The linear scan is +intentional — at 1000 entries the scan is cheaper than the network +round-trip the cache is saving, and a sorted secondary structure would +add complexity for no measurable win. + +`LastUsedTick` is a monotonic 64-bit counter incremented on every hit and +every fresh insert. Using the counter rather than `DateTimeOffset.UtcNow` +keeps the hot path free of clock calls and survives wall-clock skew. + +A background task drives proactive expiry. 
The constructor starts a +`PeriodicTimer` at `Cache.EvictionIntervalMs` (default 5000 ms; values +under 100 ms are clamped at 100 ms to prevent tight loops) and the +eviction loop sweeps every entry whose `ExpiresAtUtc` has passed. The +loop is the safety net that keeps abandoned entries — say, those for a +PLC whose upstream clients have all dropped — from holding memory until +process exit. Lazy expiry on `TryGet` still removes entries on demand +when traffic is steady; the background loop only matters under low- or +zero-traffic conditions. + +## Long-TTL Safety Gate + +`MbproxyOptionsValidator.ValidateCacheTtl` rejects any explicit +`CacheTtlMs > 60_000` unless `Cache.AllowLongTtl = true`. The same gate +applies to `PlcOptions.DefaultCacheTtlMs`. The rejection runs at config +bind / hot-reload time, so a misconfigured `appsettings.json` fails fast +before the cache sees the value. + +The gate exists to catch the "left at 1 hour by accident" mistake — a +deployment where a developer set `CacheTtlMs = 3_600_000` for a debugging +session and the value survived into production. Operators who legitimately +need long TTLs (slow-moving setpoints, configuration values that change +once per shift) flip `Cache.AllowLongTtl` to `true` as the explicit +acknowledgement that the long staleness window is intentional. + +## Cache and the Rewriter + +The BCD rewriter runs **once** on the cache-miss path: the backend reader +task decodes the response through `BcdPduPipeline` and only then hands the +decoded bytes to `ResponseCache.Set`. Cache hits return the stored +post-rewriter bytes directly. + +This division has two consequences worth restating: + +- **The rewriter cost is amortised across hits.** A high cache hit ratio + on a tag-dense PLC drops the per-request rewriter cost from "every + response" to "every cache-miss response," which on a hot register at + TTL=500 ms is one-in-many. 
+- **The cached payload is decoupled from the rewriter implementation.** + An entry stored under one rewriter does not get re-transformed if the + rewriter changes. Entries age out under TTL and are replaced by fresh + entries decoded under the current rewriter — there is no in-place + recomputation pass. + +## Hot-Reload Semantics + +Configuration changes propagate through `IOptionsMonitor`. +The cache reacts to five kinds of change: + +| Change | Cache behaviour | +|--------|----------------| +| Tag's `CacheTtlMs` changed (`0 → N`, `N → 0`, `N → M`) | Entire PLC cache is flushed via `ResponseCache.Clear()`; entries re-populate on demand under the new TTL. | +| New PLC added / removed | New PLC starts with an empty cache; removed PLC's `ResponseCache` is disposed with the multiplexer. | +| `Cache.AllowLongTtl` flipped | Validation runs on the next reload only; existing entries are unaffected. | +| `Cache.MaxEntriesPerPlc` changed | Existing entries are unaffected; the new cap applies to subsequent inserts. | +| `Cache.EvictionIntervalMs` changed | Existing eviction loop continues with its old period; subsequent loops use the new interval. | + +Per-tag flush granularity is intentionally not implemented. The clean move +is "any tag-list change to a PLC → drop every entry for that PLC and let +the natural traffic re-populate." Tracking which keys correspond to which +tag IDs adds bookkeeping for no operational win — a tag-list reload is +already a once-in-a-while event, and the rebuild cost on the affected +PLC's hot keys is one round-trip per key under traffic. + +See [`../Features/HotReload.md`](../Features/HotReload.md) for the +broader `IOptionsMonitor` propagation model. + +## Cache Survives Backend Disconnects + +A cached entry's data was valid when stored.
A subsequent backend +disconnect does not retroactively invalidate it — the value the upstream +client sees on a hit is the value the PLC reported within the TTL +window, irrespective of whether the backend socket is up at the moment +of the hit. This is the cache's most operationally visible property +during PLC outages: upstream consumers that read hot tags within the +cache window continue to receive responses while the listener supervisor +is in `recovering` state. + +The companion rule on the write side keeps the invariant consistent: +**invalidations during a `recovering` listener state are skipped**. If +the backend is down, an FC06 or FC16 write did not reach the PLC, so the +cached read is still consistent with the device's actual state. Skipping +the invalidation matches reality — the write did not take effect, so the +read is not stale. + +## No Persistence + +The cache is purely in-memory. Process restart wipes every entry. There +is no file-backed snapshot, no Redis or other external store, and no +last-known-good replay. A restarted service rebuilds its cache from +fresh backend round-trips driven by upstream traffic, exactly as it +would after a TTL-induced flush. + +The omission is intentional, for two reasons. First, the staleness contract is bounded +by `CacheTtlMs` measured from when the data was first read, and a +persisted entry would re-emerge with an unknown wall-clock age — every +invariant the cache offers would need a freshness field, freshness +arithmetic on load, and recovery against a clock that may have jumped. +Second, the operational model is that the proxy is a stateless +transformer; treating its cache as durable state would change the +deployment story for no measurable production benefit. + +## Counter Accounting + +`ProxyCounters` exposes five cache counters per PLC; together with the +derived `cacheHitRatio` they are surfaced on the +status page as both per-PLC and fleet-aggregate values: + +- **`cacheHitCount`** — FC03/FC04 requests served from the cache.
Bumped + inside `OnUpstreamFrameAsync` when `ResponseCache.TryGet` returns true. +- **`cacheMissCount`** — FC03/FC04 requests whose resolved TTL was + positive but whose key was not in the cache (or whose entry had + expired). The identity `cacheHitCount + cacheMissCount = total + cache-eligible FC03/FC04 requests` holds — reads whose effective TTL + is `0` (uncached) increment neither counter. +- **`cacheHitRatio`** — derived on the status page snapshot as + `cacheHitCount / (cacheHitCount + cacheMissCount)` when the + denominator is non-zero. +- **`cacheInvalidations`** — count of cache entries invalidated by + successful FC06/FC16 write responses, summed across writes. +- **`cacheEntryCount`** — point-in-time snapshot of + `ResponseCache.Count` (Tier-2 memory-watch KPI). +- **`cacheBytes`** — point-in-time approximation of cached PDU bytes, + computed as the running sum of `CacheEntry.Length` across entries + (Tier-2 memory-watch KPI). + +The structured log events `mbproxy.cache.hit`, `mbproxy.cache.miss`, +`mbproxy.cache.store`, `mbproxy.cache.invalidated`, and +`mbproxy.cache.flushed` (defined in `CacheLogEvents`) mirror the counter +increments at Debug level for incident-time diagnosis. Counters are the +steady-state observability surface; the events are for tracing one +request through the cache when something looks wrong. See +[`../Operations/StatusPage.md`](../Operations/StatusPage.md) and +[`../Reference/LogEvents.md`](../Reference/LogEvents.md). + +## Design-Contract Note + +The cache changes the proxy's posture from "purely transparent except +for BCD rewriting" to "transparent by default, with an opt-in cache +layer." The transition is deliberate and operator-driven: setting +`CacheTtlMs > 0` on a tag is the explicit consent to the staleness +window, and a deployment that ships no positive TTLs is observationally +indistinguishable from one compiled without the cache code path. 
+ +There is no global switch, no implicit warm-up, and no behavioural +divergence from the transparent baseline until the operator opts in +tag-by-tag. The cache is the only place in the proxy where an upstream +read can resolve to a value that did not just round-trip the wire, and +its engagement is gated entirely by the per-tag and per-PLC TTL +configuration described above. + +## Related Documentation + +- [`./ConnectionModel.md`](./ConnectionModel.md) — TxId multiplexing, + correlation map, and the backend socket the cache short-circuits on a + hit. +- [`./ReadCoalescing.md`](./ReadCoalescing.md) — sits below the cache in + the lookup order; cache hits short-circuit coalescing entirely. +- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — the + `BcdPduPipeline` whose post-decode bytes the cache stores. +- [`../Features/HotReload.md`](../Features/HotReload.md) — the + `IOptionsMonitor` propagation that drives the per-PLC flush on + tag-list change. +- [`../Operations/Configuration.md`](../Operations/Configuration.md) — + binding for `BcdTagOptions.CacheTtlMs`, + `PlcOptions.DefaultCacheTtlMs`, and the `Cache` section + (`AllowLongTtl`, `MaxEntriesPerPlc`, `EvictionIntervalMs`). +- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — exposes + `cacheHitCount`, `cacheMissCount`, `cacheHitRatio`, + `cacheInvalidations`, `cacheEntryCount`, and `cacheBytes`. +- [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) + — the operator view of cache-served reads while a backend is in + `recovering` state. +- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — full + `mbproxy.cache.*` event catalogue with event IDs. +- [`../Testing/Simulator.md`](../Testing/Simulator.md) — the + `pymodbus` DL205 stand-in used by the end-to-end cache tests. +- [`../design.md`](../design.md) — canonical design decisions and + rationale. 
diff --git a/mbproxy/docs/Features/BcdRewriting.md b/mbproxy/docs/Features/BcdRewriting.md new file mode 100644 index 0000000..aab4f3a --- /dev/null +++ b/mbproxy/docs/Features/BcdRewriting.md @@ -0,0 +1,252 @@ +# BCD Rewriting + +The BCD rewriter is the inline codec that translates DirectLOGIC's native Binary-Coded Decimal register values to and from plain binary integers on every relevant Modbus TCP PDU. It is the one place in the proxy that knows which registers are BCD, so upstream consumers can treat the wire as plain `Int16` / `Int32`. + +## Why BCD Rewriting Exists + +The DL205 / DL260 family stores numeric V-memory register values in native BCD, not binary. The decimal integer `1234` in `V2000` lands on the Modbus wire as `0x1234` (nibbles `1`, `2`, `3`, `4`) — not as the binary `0x04D2`. See [`../../DL260/dl205.md`](../../DL260/dl205.md) for the device-side rationale and the V-memory ↔ Modbus translation rules. + +Upstream consumers (Wonderware, Historian, OPC UA gateways, generic Modbus clients written against the standard) expect plain binary integers. Asking every consumer to BCD-decode the wire is brittle: each consumer would carry the same tag list, the same word-order quirks, and the same risk of drift. The rewriter centralises that translation so the rest of the world sees plain `Int16` / `Int32` and the proxy is the single source of truth for "which addresses are BCD." + +The rewriter touches only the BCD slots declared in configuration. Every other byte of the PDU — non-BCD registers, coils, discrete inputs, diagnostic function codes, exception responses — passes through unchanged. MBAP transaction IDs, unit IDs, and the MBAP length field are preserved end-to-end; the rewriter only re-encodes payload bytes whose width does not change. + +## CDAB Word Order for 32-Bit Values + +A 32-bit BCD value spans a register pair at `Address` and `Address+1` in CDAB (low-word-first) order: + +- The register at `Address` holds the **low 4 BCD digits**. 
+- The register at `Address+1` holds the **high 4 BCD digits**. +- Decoded decimal = `Decode16(high) * 10_000 + Decode16(low)`. + +This follows directly from DirectLOGIC's CDAB word convention (see [`../../DL260/dl205.md`](../../DL260/dl205.md) → Word Order). + +Worked example — the register pair `[0x1234][0x5678]` reads on the wire as the low word `0x1234` first and the high word `0x5678` second: + +```text +Address: raw 0x1234 → low 4 digits = 1234 +Address+1: raw 0x5678 → high 4 digits = 5678 + +Decoded decimal = 5678 * 10_000 + 1234 = 56_781_234 +``` + +`BcdCodec.Encode32` and `BcdCodec.Decode32` in [`../../src/Mbproxy/Bcd/BcdCodec.cs`](../../src/Mbproxy/Bcd/BcdCodec.cs) implement this in both directions. `Encode32(12_345_678)` returns `(low: 0x5678, high: 0x1234)`. + +The 16-bit codec is a straight nibble pack / unpack: + +```csharp +// From BcdCodec.cs — Encode16 packs four decimal digits into four BCD nibbles. +int d3 = value / 1000; +int d2 = (value / 100) % 10; +int d1 = (value / 10) % 10; +int d0 = value % 10; +return (ushort)((d3 << 12) | (d2 << 8) | (d1 << 4) | d0); +``` + +`Decode16` is the reverse, with a `HasBadNibble` guard that throws `FormatException` if any nibble is `>= 0xA`. The Phase-04 rewrite pipeline catches the exception and surfaces it as a `mbproxy.rewrite.invalid_bcd` warning event instead of corrupting the payload. + +## BCD Tag Configuration Shape + +Every BCD register the rewriter handles is described by a `BcdTag` record from [`../../src/Mbproxy/Bcd/BcdTag.cs`](../../src/Mbproxy/Bcd/BcdTag.cs): + +```csharp +public sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs = 0) +{ + public bool IsThirtyTwoBit => Width == 32; + public ushort HighRegister => /* Address + 1 for 32-bit tags */; +} +``` + +- `Address` is the **Modbus PDU register address** (zero-based, decimal). Configuration must translate from octal V-memory to PDU-decimal before reaching this struct — `V2000` octal = decimal 1024 = `0x0400`. 
The proxy does not perform that translation itself. +- `Width` is `16` (single register) or `32` (CDAB register pair at `Address` and `Address+1`). `BcdTag.Create` rejects any other width. +- `CacheTtlMs` is the Phase-11 response-cache opt-in (covered separately in [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md)); it has no effect on rewriter behaviour. + +The wire-format options shape lives in [`../../src/Mbproxy/Options/BcdTagOptions.cs`](../../src/Mbproxy/Options/BcdTagOptions.cs) and [`../../src/Mbproxy/Options/BcdTagListOptions.cs`](../../src/Mbproxy/Options/BcdTagListOptions.cs). Configured tags resolve through `BcdTagMapBuilder.Build` (see [`../../src/Mbproxy/Bcd/BcdTagMapBuilder.cs`](../../src/Mbproxy/Bcd/BcdTagMapBuilder.cs)) into an immutable `BcdTagMap` ([`../../src/Mbproxy/Bcd/BcdTagMap.cs`](../../src/Mbproxy/Bcd/BcdTagMap.cs)) per PLC. + +Holding-register (FC03) and input-register (FC04) addresses share the **same** configured tag space. The DL205 / DL260 surfaces V-memory through both tables, so the rewriter applies the configured tag list against both FC03 and FC04 responses. + +## Function-Code Scope Table + +The rewriter touches payloads only for the function codes below. Every other FC — coils (FC01, FC05, FC15), discrete inputs (FC02), diagnostics, exception responses — passes through byte-for-byte. 
+ +| FC | Direction | Action | +|----|-----------|--------| +| 03 | Request | Pass through (read; no payload rewrite needed) | +| 03 | Response | Re-encode covered BCD slots from raw nibbles → binary integer | +| 04 | Request | Pass through | +| 04 | Response | Same as FC03 response | +| 06 | Request | Re-encode binary integer → BCD nibbles before forwarding | +| 06 | Response | Decode BCD nibbles → binary integer on the echo (NModbus-style clients validate the echo and would throw otherwise) | +| 16 | Request | Re-encode binary integer → BCD nibbles per register, over the configured slots | +| 16 | Response | Pass through (the response carries only start+qty, not values) | + +The FC06 response decode is non-obvious: the PLC echoes back the value it actually wrote, which is now BCD-encoded because the proxy rewrote the request on the way in. Clients that validate that the echo equals the value they sent (NModbus and similar libraries do this) would throw on the round-trip if the proxy did not decode the echo back. + +`BcdPduPipeline.Process` dispatches on direction first, then on FC: + +```csharp +public void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, + Span<byte> pdu, PduContext context) +{ + if (context is not PerPlcContext ctx) return; + if (pdu.Length < 1) return; + + byte fc = pdu[0]; + ctx.Counters.IncrementPdusForwarded(); + ctx.Counters.IncrementFcCount(fc); + + if (direction == MbapDirection.RequestToBackend) + ProcessRequest(fc, pdu, ctx); + else + ProcessResponse(fc, pdu, ctx); +} +``` + +`PerPlcContext` carries the `BcdTagMap`, the per-PLC `ProxyCounters`, the logger, and the matched `InFlightRequest` from the multiplexer's correlation map. If a caller passes a plain `PduContext` (e.g. a test harness using `NoopPduPipeline` alongside the BCD pipeline), the rewriter returns without touching the PDU. + +## Partial-Overlap Policy + +A request that touches only **one** register of a configured 32-bit BCD pair cannot be re-encoded correctly. There are two shapes: + +1.
An FC03 / FC04 read whose range covers the low address but not the high address (`qty=1` at the low address) or vice versa. +2. An FC06 write to either the low or high address of a 32-bit pair, or an FC16 write whose range covers only one of the two registers. + +In every case the rewriter **passes the PDU through raw** and emits a `mbproxy.rewrite.partial_bcd` warning. The `PartialBcdWarnings` counter increments per occurrence. + +The proxy never synthesises a Modbus exception for a partial-overlap. Exception response codes are reserved for transport failure (the per-request watchdog manufactures `0x0B` Gateway Target Device Failed To Respond; the PLC itself produces `0x01`–`0x04`). Using an exception code to signal a configuration / client mismatch would conflate "the device or the path failed" with "the client straddled a 32-bit boundary," and operators chasing the exception would look at the wrong layer. + +The rationale for warn-plus-passthrough rather than silent rewrite: silently rewriting only the half the client touched would corrupt the value (a 16-bit BCD encode of a 32-bit binary integer is meaningless). A warning-plus-raw passthrough surfaces the misconfiguration loudly while leaving the client to discover the mismatch in its own data path. + +The FC16 request path makes the partial-overlap decision per-tag inside its loop over `TryGetForRange` hits: + +```csharp +if (tag.IsThirtyTwoBit) +{ + bool lowInRange = offsetWords >= 0 && offsetWords < qty; + bool highInRange = (offsetWords + 1) >= 0 && (offsetWords + 1) < qty; + + if (!lowInRange || !highInRange) + { + RewriterLogEvents.PartialBcd(ctx.Logger, ctx.PlcName, + tag.Address, startAddress, qty); + ctx.Counters.IncrementPartialBcd(); + continue; + } + // ...both registers in range — reconstruct, encode, write back... 
+} +``` + +For a 32-bit FC16 write where both registers are in range, the rewriter reconstructs the client's 32-bit binary value from the CDAB pair (`clientHigh * 10_000 + clientLow`), runs `BcdCodec.Encode32` to produce the BCD register pair, and writes both registers back to the PDU buffer in place. + +## Unsigned Only + +DL205 / DL260 BCD is non-negative in the default ladder pattern. `BcdCodec.Encode16` rejects values outside `[0, 9999]`; `BcdCodec.Encode32` rejects values outside `[0, 99_999_999]`. The rewriter does not implement signed BCD; signed conventions vary by site and any value out of range surfaces as `mbproxy.rewrite.invalid_bcd` rather than being silently coerced. + +## Exception Pass-Through + +Modbus exception responses pass through unchanged. The rewriter detects an exception response by the high bit of the function code (`fc & 0x80 != 0`), emits a `mbproxy.rewrite.exception_passthrough` event, increments the per-FC exception counter, and returns without touching the payload. + +Covered exception codes: + +- `0x01` Illegal Function +- `0x02` Illegal Data Address +- `0x03` Illegal Data Value +- `0x04` Server Device Failure +- `0x0B` Gateway Target Device Failed To Respond — manufactured by the per-request watchdog when a correlation entry ages past `Connection.BackendRequestTimeoutMs`. The rewriter does not distinguish proxy-manufactured from PLC-originated exception codes; both pass through identically. + +The rewriter increments `Counters.IncrementBackendException(exceptionCode)` per exception so the four common codes surface on the status page through `ExceptionCounts` (`Code01`, `Code02`, `Code03`, `Code04`). The Gateway-Target `0x0B` is also recorded but is more usefully traced through the watchdog log events rather than the per-code counter slot. 
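
The detection rule described in this section can be sketched in a few lines. The helper names `IsExceptionResponse` and `ParseException` are hypothetical and are not taken from `BcdPduPipeline`; the sketch assumes only the standard Modbus exception PDU layout (`[fc | 0x80][exceptionCode]`). Note that C# needs parentheses around the mask, because `!=` binds more tightly than `&`.

```csharp
using System;

// Hypothetical helpers for illustration; not the BcdPduPipeline source.

// A Modbus exception response sets the high bit of the function code.
// The parentheses matter: in C#, != has higher precedence than &.
static bool IsExceptionResponse(ReadOnlySpan<byte> pdu) =>
    pdu.Length >= 2 && (pdu[0] & 0x80) != 0;

// The low seven bits carry the original function code; the next byte is the exception code.
static (byte OriginalFc, byte ExceptionCode) ParseException(ReadOnlySpan<byte> pdu) =>
    ((byte)(pdu[0] & 0x7F), pdu[1]);

byte[] illegalAddress = { 0x83, 0x02 };          // FC03 read rejected with Illegal Data Address
byte[] normalRead = { 0x03, 0x02, 0x12, 0x34 };  // ordinary FC03 response, one register

if (IsExceptionResponse(illegalAddress))
{
    var (fc, code) = ParseException(illegalAddress);
    Console.WriteLine($"exception on FC{fc:X2}, code 0x{code:X2}");  // exception on FC03, code 0x02
}
Console.WriteLine(IsExceptionResponse(normalRead));                  // False
```

Because the check reads only the first two bytes, it behaves the same for PLC-originated codes and for the watchdog's `0x0B`, which matches the pass-through behaviour described above.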
+ +## Where the Rewriter Runs in the Pipeline + +The rewriter is implemented as `BcdPduPipeline` in [`../../src/Mbproxy/Proxy/BcdPduPipeline.cs`](../../src/Mbproxy/Proxy/BcdPduPipeline.cs), registered as the singleton `IPduPipeline` in production. The class is stateless; per-call state arrives via the `PerPlcContext` passed into `Process`, which carries the `BcdTagMap`, the per-PLC counters, the logger, and (on the response path) the matched `InFlightRequest` from the multiplexer's correlation map. + +Per-PLC pipeline ordering: + +```text +Upstream request → + [cache lookup (Phase 11)] → + [coalesce check (Phase 10)] → + [BCD rewriter — request path] → + backend send + +Backend response → + [BCD rewriter — response path] → + [response-cache populate (Phase 11)] → + [fanout to all coalesced parties] +``` + +The rewriter runs **once per request** on the multiplexer's outbound path and **once per response** on the inbound path. Per-party MBAP TxId restoration happens after the rewriter on fanout, so the rewriter only ever sees the canonical (shared) PDU buffer. + +For Phase-11 cache hits, the response cache stores **POST-rewriter bytes** — the rewriter is bypassed on hits, both as a CPU optimisation and as a correctness guarantee (a future rewriter change does not retroactively re-transform an entry that was decoded against an earlier rewriter version). See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md). + +On the response path, the rewriter cannot infer the original `(StartAddress, Qty)` of an FC03 / FC04 read from the response alone — the response carries only `[fc][byteCount][reg0Hi][reg0Lo]...`. The multiplexer's `CorrelationMap` keys the matched `InFlightRequest` to the response and attaches it to `PerPlcContext.CurrentRequest` before invoking the rewriter, so concurrent responses from different upstream clients each decode against their own request range without cross-talk. If `CurrentRequest` is null (e.g. 
a unit-test fixture invoking the pipeline directly), the rewriter passes the response bytes through unchanged. + +## Hybrid Tag Resolution + +For each PLC, the effective BCD tag list is `(Global − Remove) ∪ Add`, resolved by `BcdTagMapBuilder.Build` in this order: + +1. Seed the working set from `BcdTagListOptions.Global`. +2. Apply `PlcBcdOverrides.Remove` — drop every address listed. `Remove` matches by address only; width is irrelevant. +3. Apply `PlcBcdOverrides.Add` — insert each entry into the working set. If an address already exists from `Global`, the `Add` entry **wins** (this is how a per-PLC width override is expressed: list the same address in `Add` with a different `Width`). + +The shapes are declared in [`../../src/Mbproxy/Options/BcdTagListOptions.cs`](../../src/Mbproxy/Options/BcdTagListOptions.cs): + +```csharp +public sealed class BcdTagListOptions +{ + public IReadOnlyList Global { get; init; } = []; +} + +public sealed class PlcBcdOverrides +{ + public IReadOnlyList Add { get; init; } = []; + public IReadOnlyList Remove { get; init; } = []; +} +``` + +Resolution produces a `ValidationResult` carrying the resolved `BcdTagMap`, a list of `BcdError` entries, and a list of `BcdWarning` entries. Callers treat any non-empty `Errors` list as a fatal configuration problem for that PLC. + +The user-facing syntax for `Global` + per-PLC `Add` / `Remove` is documented in [`../Operations/Configuration.md`](../Operations/Configuration.md). + +`BcdTagMap.TryGetForRange` is the hot-path range scan used by both the request and response paths. It returns every `BcdTag` whose register footprint intersects `[startAddress, startAddress + qty)`, each carrying its zero-based word `OffsetWords` relative to `startAddress`. A 32-bit tag whose low word starts **before** the range but whose high word lies inside the range returns with a **negative** `OffsetWords` — that is the partial-overlap signal the rewriter consumes when deciding whether to re-encode or warn.
The no-hit path returns the empty-list singleton without allocating. + +## Validation at Startup and Hot-Reload + +`BcdTagMapBuilder.Build` runs the same validation pipeline at process start and on every hot-reload of `appsettings.json`. The validation results fall into three fatal error kinds, defined in [`../../src/Mbproxy/Bcd/BcdValidationError.cs`](../../src/Mbproxy/Bcd/BcdValidationError.cs), plus one non-fatal warning: + +- `BcdValidationError.DuplicateAddress` — the same address appears more than once in the **resolved** list (after `Remove` and `Add` have been applied). Fatal error; the entry is excluded from the map. +- `BcdValidationError.OverlappingHighRegister` — a 32-bit entry's high register (`Address+1`) collides with the `Address` of a separate entry in the resolved list. Fatal error. +- `BcdValidationError.InvalidWidth` — an entry's `Width` is not `16` or `32`. Fatal error; the entry is excluded. +- `BcdWarning` — a `Remove` entry whose address does not appear in `Global`. Non-fatal, but typically indicates stale configuration (the global entry was removed without cleaning up the per-PLC override). + +A successful hot-reload that changes the resolved tag list reseats the per-PLC `BcdTagMap` and, for Phase 11, flushes the entire PLC response cache (see [`./HotReload.md`](./HotReload.md)). In-flight requests already past the rewriter are not retroactively re-rewritten; the next PDU sees the new map. A failed validation rejects the reload as a whole and the previous map stays in effect. + +## Counter Accounting + +The rewriter feeds two counters that surface on the status page: + +- `pdus.rewrittenSlots` — `RewrittenSlots` on `PlcPdusStatus`, incremented per re-encoded register. A 32-bit BCD pair counts as 2 slots; a 16-bit tag counts as 1. The FC06 echo decode is **not** counted to avoid double-counting the FC06 request that already incremented the slot on the way out.
+- `pdus.partialBcdWarnings` — `PartialBcdWarnings` on `PlcPdusStatus`, incremented once per partial-overlap event (request or response path). + +An out-of-range value (`< 0` or `> 9999` for 16-bit; `< 0` or `> 99_999_999` for 32-bit) on a write, or a bad nibble (`>= 0xA`) on a read, increments an internal invalid-BCD counter and emits `mbproxy.rewrite.invalid_bcd` at warning. The PDU passes through raw in that case; the rewriter never substitutes a value the client did not send (writes) or the PLC did not return (reads). + +Both counters are exposed on the status page; see [`../Operations/StatusPage.md`](../Operations/StatusPage.md). The corresponding log events (`mbproxy.rewrite.partial_bcd`, `mbproxy.rewrite.invalid_bcd`, `mbproxy.rewrite.exception_passthrough`) are catalogued in [`../Reference/LogEvents.md`](../Reference/LogEvents.md). Partial-overlap troubleshooting is covered in [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md). + +The `dl205.json` pymodbus simulator profile encodes BCD test fixtures used by the integration test suite; see [`../Testing/Simulator.md`](../Testing/Simulator.md). + +A few invariants the rewriter relies on and the test suite enforces: + +- The MBAP length field is **never** modified. Every re-encoded slot is the same byte width as the original (16-bit register in, 16-bit register out), so the PDU length is byte-stable. +- The rewriter is **stateless** at the class level. `BcdPduPipeline` holds no fields; everything per-call arrives via `PerPlcContext`. The same instance is safe to call concurrently from multiple upstream-read tasks and the single backend reader task on a given multiplexer. +- The rewriter operates on the canonical (shared) PDU buffer. Per-party MBAP TxId restoration on coalesced fanout happens **after** the rewriter, so any per-party byte copy only happens when fanout has more than one party. 
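
The two invalid-BCD conditions described in this section, an out-of-range value on a write and a bad nibble on a read, can be illustrated with a standalone 16-bit codec. This is a minimal sketch, not the `BcdCodec` source: the `Encode16` arithmetic mirrors the excerpt quoted earlier in this document, while the `Decode16` body, the `Decode32` signature, and the choice of `ArgumentOutOfRangeException` for range failures are assumptions (the document states only that `Decode16` throws `FormatException` on a bad nibble).

```csharp
using System;

// Sketch only; the real BcdCodec may differ in signatures and exception types.

static ushort Encode16(int value)
{
    // Assumed exception type; the document says only that out-of-range values are rejected.
    if (value < 0 || value > 9999)
        throw new ArgumentOutOfRangeException(nameof(value));
    int d3 = value / 1000;
    int d2 = (value / 100) % 10;
    int d1 = (value / 10) % 10;
    int d0 = value % 10;
    return (ushort)((d3 << 12) | (d2 << 8) | (d1 << 4) | d0);
}

static int Decode16(ushort raw)
{
    int result = 0;
    // Walk the four nibbles from most significant to least significant.
    for (int shift = 12; shift >= 0; shift -= 4)
    {
        int nibble = (raw >> shift) & 0xF;
        if (nibble >= 0xA)
            throw new FormatException($"Bad BCD nibble in 0x{raw:X4}");
        result = result * 10 + nibble;
    }
    return result;
}

// CDAB: the low word arrives first on the wire, the high word second.
static int Decode32(ushort low, ushort high) =>
    Decode16(high) * 10_000 + Decode16(low);

Console.WriteLine($"0x{Encode16(1234):X4}");   // 0x1234
Console.WriteLine(Decode16(0x1234));            // 1234
Console.WriteLine(Decode32(0x1234, 0x5678));    // 56781234
```

In the proxy itself, per the text above, neither failure reaches the client as an exception: the pipeline counts it, logs `mbproxy.rewrite.invalid_bcd`, and passes the PDU through raw.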
+ +## Related Documentation + +- [`../Architecture/Overview.md`](../Architecture/Overview.md) — service-wide architecture and per-PLC pipeline shape +- [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) — Phase-11 response cache; the cache stores post-rewriter bytes and bypasses the rewriter on hits +- [`./HotReload.md`](./HotReload.md) — hot-reload semantics for BCD tag-list changes +- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `BcdTags.Global` and per-PLC `Add` / `Remove` syntax +- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `pdus.rewrittenSlots` and `pdus.partialBcdWarnings` exposure +- [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) — diagnosing partial-overlap warnings +- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.rewrite.*` event catalogue +- [`../Testing/Simulator.md`](../Testing/Simulator.md) — the `dl205.json` simulator profile that encodes BCD test fixtures +- [`../../DL260/dl205.md`](../../DL260/dl205.md) — DL205 / DL260 BCD encoding, CDAB word order, and V-memory ↔ Modbus translation diff --git a/mbproxy/docs/Features/HotReload.md b/mbproxy/docs/Features/HotReload.md new file mode 100644 index 0000000..7df3275 --- /dev/null +++ b/mbproxy/docs/Features/HotReload.md @@ -0,0 +1,189 @@ +# Hot Reload + +A save to `appsettings.json` propagates to a running `mbproxy` without restarting the service. This document explains the mechanism, the reconcile pipeline, and what each configuration change does to the running state. + +## How Reload Works + +`Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`. Every consumer reads its options through `IOptionsMonitor` instead of capturing a one-shot `IOptions` snapshot at construction. When the framework's `FileSystemWatcher` sees the file change, it re-parses the JSON, re-binds the option tree, and notifies subscribers through `IOptionsMonitor.OnChange`. 
+ +The chosen mechanism is deliberate. There is no custom file watcher, no IPC channel, no admin-port mutation endpoint, and no SIGHUP-style trigger. An operator edits the file in place (or a deployment tool atomically rewrites it) and the running service catches up. The reload contract is identical whether the service is running interactively or as a Windows Service under the SCM. + +The `OnChange` callback can fire multiple times for a single logical save because text editors on Windows commonly use a rename-and-replace pattern that produces two or three `FileSystemWatcher` events. The reconciler debounces these inside its own background loop with a 250 ms quiescent window so a single save produces a single apply. + +### Debounce window + +The debounce window is held in `ConfigReconciler.DebounceWindow = TimeSpan.FromMilliseconds(250)`. The loop reads from the change channel, then keeps re-arming a linked `CancellationTokenSource` with a 250 ms expiry and waits again. As long as new signals keep arriving inside the window, the loop drains them and keeps waiting. When the window elapses with no new signal the loop falls through and calls `ApplyAsync` against `IOptionsMonitor.CurrentValue`. The window is short enough that operators perceive saves as instant and long enough to absorb every editor save pattern observed in practice (rename-and-replace, write-truncate-write, Notepad, Visual Studio Code, PowerShell `Set-Content`). + +## The Reconcile Pipeline + +Three types in `src/Mbproxy/Configuration/` carry the reload contract from "framework noticed the file changed" to "the running service matches the new file": + +- `ReloadValidator` (`src/Mbproxy/Configuration/ReloadValidator.cs`) — runs cross-PLC and per-PLC checks before the reload is allowed to take effect. 
The validator is a static gate: `Validate(MbproxyOptions next, out IReadOnlyList errors)` returns `false` and a list of error strings if the snapshot is malformed, and the apply step bails out before touching any state. +- `ReloadPlan` (`src/Mbproxy/Configuration/ReloadPlan.cs`) — an immutable record produced by the pure function `ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next)`. It buckets PLCs into `ToAdd`, `ToRemove`, `ToRestart` (network identity changed), and `ToReseat` (only the resolved `BcdTagMap` changed). PLC identity is keyed on `Name`, not `ListenPort`, so a port change is still the same PLC and goes to `ToRestart` rather than `ToRemove` + `ToAdd`. +- `ConfigReconciler` (`src/Mbproxy/Configuration/ConfigReconciler.cs`) — subscribes to `IOptionsMonitor.OnChange`, debounces and serialises change events through a bounded `Channel` and a `SemaphoreSlim(1, 1)`, then runs the plan: removes go first (concurrent), restarts next (concurrent), reseats apply via `PlcListenerSupervisor.ReplaceContextAsync`, and adds finish last. + +The reconciler's `OnChange` handler does not block. It writes to a `Channel` with `BoundedChannelFullMode.DropOldest` so a busy reload queue never stalls the configuration framework. A dedicated background loop drains the channel, applies the 250 ms debounce, and then calls `ApplyAsync` on the latest snapshot exposed by `IOptionsMonitor.CurrentValue`. The last enqueued change wins. + +The apply itself runs under `_applySemaphore` (a `SemaphoreSlim(1, 1)`) so two saves arriving in rapid succession are serialised and never interleave. If a second save lands while the first apply is still running, it queues at the semaphore and runs against whatever `CurrentValue` exposes when its turn comes — which is the freshest options snapshot, not necessarily the one that caused the wake-up. + +### Apply order + +`ApplyUnderLockAsync` runs the steps in this order against the freshly validated snapshot: + +1. 
**Validate.** If `ReloadValidator.Validate` returns errors, log `mbproxy.config.reload.rejected`, increment the rejected counter, and return without mutating state. +2. **Compute.** Call `ReloadPlan.Compute(_currentOptions, next)` to bucket PLCs into `ToAdd`, `ToRemove`, `ToRestart`, and `ToReseat`. +3. **Remove.** Stop every supervisor in `ToRemove` concurrently with a 10-second stop timeout, then dispose. +4. **Restart.** Stop the old supervisor, build a fresh `PerPlcContext` (which includes a new `ResponseCache` when any resolved tag opts in), and start a new `PlcListenerSupervisor` on the new endpoint. Restarts run concurrently across affected PLCs. +5. **Reseat.** For each PLC in `ToReseat`, build a new context that preserves the existing `Counters` (so operators see real history across the reseat) and call `PlcListenerSupervisor.ReplaceContextAsync` with a 5-second timeout. +6. **Add.** Build and start a new supervisor for every PLC in `ToAdd` concurrently. +7. **Record.** Update `_currentOptions` to `next`, call `ServiceCounters.RecordReloadApplied`, and log `mbproxy.config.reload.applied` with the apply counts and the global tag delta. + +If a step throws, the exception is logged at Error and the loop continues with the remaining steps. The validator catches every precondition that can be checked from the configuration alone, so a runtime exception here is a true bug worth surfacing. The host stays up regardless. + +## Per-Change-Kind Reconcile Table + +| Change in `appsettings.json` | Propagation | +|------------------------------|-------------| +| `BcdTags.Global` add / remove / width | The rewriter dereferences `IOptionsMonitor` per PDU. The next PDU sees the new map. In-flight requests are not retroactively touched. | +| `Plcs[i].BcdTags.Add` or `Plcs[i].BcdTags.Remove` | Same as above — next-PDU resolution against the rebuilt map. 
| +| New `Plcs[i]` entry | `ConfigReconciler` builds a fresh `PerPlcContext` and `PlcListenerSupervisor`, which binds the new port under the same eager-then-auto-recover policy used at service startup. | +| `Plcs[i]` removed | The supervisor for that PLC is stopped (10 s stop timeout) and disposed, which closes every upstream client connection bound to that listener. | +| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. The supervisor stops the old listener, the reconciler rebuilds the context, and a new supervisor starts on the new endpoint. | +| `Connection.BackendConnectTimeoutMs` and the other `Backend*TimeoutMs` values | The next backend connect or request reads the new value through the monitor. In-flight operations keep their already-applied timeout. | +| `BcdTags.*.CacheTtlMs` or `Plcs[i].DefaultCacheTtlMs` | A tag-map reseat constructs a fresh `ResponseCache` for that PLC, which drops every cached entry for that PLC. Entries re-populate on demand under the new TTL. Per-tag flush granularity is intentionally not implemented. | +| `Cache.AllowLongTtl` | Enforced at the next reload validation. A pending reload that depends on it must save together. | +| `Cache.MaxEntriesPerPlc` | Applies to subsequent inserts. Existing entries are not pruned. | +| `Cache.EvictionIntervalMs` | Read by the next eviction loop tick. | +| `Resilience.ReadCoalescing.Enabled` flipped to `false` | Already-running coalesced entries drain naturally. Subsequent reads bypass coalescing. | +| `Resilience.ReadCoalescing.MaxParties` | Applies to subsequent attaches. Existing in-flight entries keep their current cap. | +| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole. The current in-memory config stays in effect. `mbproxy.config.reload.rejected` is logged at Error. 
| + +The "next-PDU" wording is load-bearing for the tag-list rows: the rewriter does not snapshot the tag map at connection accept time. It resolves the map for the active PLC at the start of every request frame, so a hot-reloaded tag list is in effect for the very next request, even on existing TCP connections. + +### Reseat vs. restart + +The `ReloadPlan` distinguishes two kinds of "PLC is still here but changed": + +- **Restart** is triggered when `Host`, `ListenPort`, or backend `Port` differ between the old and new `PlcOptions`. The TCP socket has to close and reopen on a new endpoint, so there is no way to preserve the listener — the supervisor stops and a brand-new one starts. +- **Reseat** is triggered when only the resolved `BcdTagMap` differs (which `ReloadPlan.Compute` checks structurally through `TagMapsEqual`: same set of `(Address, Width, CacheTtlMs)` triples). The listener socket and the upstream pipes stay open. Only the `PerPlcContext` swaps. + +`TagMapsEqual` includes `BcdTag.CacheTtlMs` in the comparison so a per-tag TTL change or a `Plcs[i].DefaultCacheTtlMs` change (which folds into per-tag TTLs through `BcdTagMapBuilder.Build`) also routes to `ToReseat` and so also drops the cache. A `Plcs[i]` whose options are byte-identical to the previous snapshot lands in neither bucket and the supervisor is left alone. + +### Tag map resolution + +`BcdTagMapBuilder.Build` is the single source of truth for what the resolved tag list looks like for one PLC. The hybrid resolution it implements is: + +1. Start with `BcdTags.Global` from the root options. +2. Remove every address present in `Plcs[i].BcdTags.Remove`. +3. Merge in `Plcs[i].BcdTags.Add` entries — if an address already exists in the working set, the `Add` entry wins. This is how a per-PLC width override is expressed (the global lists a 16-bit tag at the same address; the per-PLC `Add` overrides it to 32-bit). +4. Fold `Plcs[i].DefaultCacheTtlMs` into any tag whose explicit `CacheTtlMs` is null. 
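The four resolution steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `BcdTagMapBuilder.Build`: the `Tag` record, the `TagMapSketch` class, and the `Resolve` method are hypothetical names, and the error reporting the real builder performs (duplicate addresses, 32-bit overlap) is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a resolved tag; not the project's own type.
public sealed record Tag(ushort Address, byte Width, int? CacheTtlMs = null);

public static class TagMapSketch
{
    // Resolves the effective tag list for one PLC, ordered by address.
    public static IReadOnlyList<Tag> Resolve(
        IEnumerable<Tag> global,
        IEnumerable<Tag> add,
        IEnumerable<ushort> remove,
        int defaultCacheTtlMs)
    {
        // Step 1: start with the global list, keyed by address.
        // (A duplicate global address silently overwrites here; the real
        // builder is described as reporting it as an error instead.)
        var working = new SortedDictionary<ushort, Tag>();
        foreach (var tag in global) working[tag.Address] = tag;

        // Step 2: drop every address listed in Remove.
        foreach (var address in remove) working.Remove(address);

        // Step 3: merge Add; an Add entry replaces a same-address entry.
        foreach (var tag in add) working[tag.Address] = tag;

        // Step 4: fold the PLC default into tags with no explicit TTL.
        return working.Values
            .Select(t => t with { CacheTtlMs = t.CacheTtlMs ?? defaultCacheTtlMs })
            .ToList();
    }
}
```

Because `Remove` is applied before `Add` in this sketch, an address that appears in both lists ends up present with the `Add` entry's width and TTL.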
+ +The same builder runs both at startup and during reload validation, so a configuration that builds cleanly at startup is guaranteed to build cleanly at reload, and vice versa. There is no second validator that could disagree with the first. + +## Validation Rules + +`ReloadValidator.Validate` is the gate the hot-reload path consults directly. It runs the following checks in order: + +1. PLC names are non-empty and unique under ordinal comparison. +2. Every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the `Plcs` list. +3. `AdminPort` is in `[1, 65535]` and does not collide with any `ListenPort`. +4. For each PLC, `BcdTagMapBuilder.Build(next.BcdTags, plc.BcdTags, plc.DefaultCacheTtlMs)` reports no errors. This delegates the per-PLC well-formedness checks — duplicate addresses within a single resolved list, and 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry — to the single source of truth used at startup. +5. Cache TTL bounds: every `BcdTag.CacheTtlMs` and every `Plcs[i].DefaultCacheTtlMs` must be `>= 0`, and any value above `60_000` ms requires `Cache.AllowLongTtl = true`. `Cache.MaxEntriesPerPlc` and `Cache.EvictionIntervalMs` must be `>= 0`. + +A failure at any step appends to the error list but the validator runs to completion so the operator sees every problem with a single save. If the list is non-empty, the reload is rejected atomically and no state mutates. + +Schema-level checks — invalid `Width` values on a `BcdTagOptions`, type mismatches, malformed JSON — are also enforced by `MbproxyOptionsValidator` (`IValidateOptions`) at bind time. The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording. + +### Rejected-reload example + +A duplicate `ListenPort` in the saved file produces an error like the following on the rejected log line: + +```text +Config reload rejected — Errors=Plc 'plc-02': Duplicate ListenPort 5020 (already used by 'plc-01'). 
+``` + +When several rules trip on the same save, the validator joins them with `; ` so the operator sees every problem from one file save. The current in-memory configuration is unchanged, every supervisor keeps running on its existing context, and the next valid save will replay the whole apply against the now-current state. + +## What Stays vs. What Changes Mid-Flight + +The reload contract is built around a simple invariant: a Modbus request that has already started routing keeps the configuration it started with. The next request after the reload picks up the new values. + +The rewriter is the clearest example. `BcdPduPipeline` dereferences the tag map at the start of every PDU. A request that is already in the multiplexer's outbound queue is rewritten against the map that was current when it arrived. The very next request on the same TCP connection sees the new map. This avoids a torn behaviour where one PDU is half-rewritten under the old tag list and half under the new — every PDU is fully consistent with exactly one snapshot of the map. + +The same principle applies to timeouts. `Connection.BackendConnectTimeoutMs` and the per-operation timeout values are read through `IOptionsMonitor.CurrentValue` at the point the operation starts. A backend connect that has already entered its retry pipeline keeps its already-applied timeout for the remainder of that attempt. The next backend connect reads the new value. + +The reseat path is the only place where running state changes mid-connection. A reseat swaps the entire `PerPlcContext` — `TagMap`, `Counters`, `Cache` — via `PlcListenerSupervisor.ReplaceContextAsync`. The listener socket and the existing upstream pipes survive the swap. The brief transition window between the old context and the new is documented in code: any PDU mid-flight at the swap point may observe the boundary, but the rewriter only consults the map at PDU start, so the practical effect is the same next-PDU resolution rule. 
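One common way to get this "consult once at PDU start" behaviour is to publish the context through a single reference that is swapped atomically. The sketch below illustrates that pattern only; `ContextSketch`, `ContextHolderSketch`, `Capture`, and `Replace` are hypothetical names, and the real `PlcListenerSupervisor.ReplaceContextAsync` may be implemented differently.

```csharp
using System.Threading;

// Hypothetical stand-in for PerPlcContext; only a label is carried here.
public sealed record ContextSketch(string Label);

public sealed class ContextHolderSketch
{
    private ContextSketch _current;

    public ContextHolderSketch(ContextSketch initial) => _current = initial;

    // Called once at the start of each PDU. The caller keeps using the
    // returned reference for the whole PDU, so it sees exactly one snapshot.
    public ContextSketch Capture() => Volatile.Read(ref _current);

    // Publishes the new context atomically and returns the old one so the
    // caller can dispose anything it owns (for example, the old cache).
    public ContextSketch Replace(ContextSketch next) =>
        Interlocked.Exchange(ref _current, next);
}
```

A PDU that captured the old context before the swap finishes against the old context, and the next PDU on the same connection captures the new one.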
+ +Counters are explicitly preserved across a reseat. The reconciler reads `supervisor.CurrentCounters` and passes the same `ProxyCounters` instance into the new context so request counts, rewrite counts, and error counts do not reset to zero every time an operator tweaks a tag. A restart, by contrast, constructs a brand-new `ProxyCounters` because the supervisor itself is brand new. + +### Effect on upstream sockets + +The fate of an open upstream client socket depends on which bucket its PLC lands in: + +- **Reseat.** The socket stays open. The client never notices the reload happened; only its next request frame resolves against the new tag map. +- **Restart.** The old listener stops, which closes every upstream socket bound to it. The client sees a TCP close and is expected to reconnect (Wonderware DAServer, generic Modbus masters, and the supported gateways all do this automatically). When it reconnects, it lands on the new listener at the new endpoint. +- **Remove.** Same as a restart from the client's perspective: the listener stops and every connection closes. If the operator also removed the IP from the upstream client's configuration, the client stops reconnecting; otherwise the reconnect attempts simply fail with `ECONNREFUSED` until the PLC reappears. +- **Add.** No effect on any existing socket. The new listener simply starts accepting on its `ListenPort`. + +## Cache and Hot-Reload + +Any tag-list change that affects a PLC drops the entire `ResponseCache` for that PLC. The reseat path constructs a fresh cache through `ConfigReconciler.BuildCacheIfNeeded`, which inspects the resolved map and returns a new `ResponseCache` when at least one tag opts in, or `null` otherwise. The supervisor disposes the old cache during `ReplaceContextAsync`. + +Per-tag granular flush is intentionally not implemented. 
The reasoning is correctness over micro-optimisation: + +- A width change between 16-bit and 32-bit can invalidate cached entries at neighbouring addresses, not just at the changed tag. +- A tag removal means a cached value is no longer rewritten on the way out, so the cached entry that was valid one millisecond ago is now serving the wrong shape. +- A TTL change on one tag does not influence neighbouring entries, but the cost of tracking per-entry TTL versions and replaying flushes outweighs the cost of repopulating on demand. + +A wholesale drop is the simple correct move. Entries repopulate on demand at the next read against the new TTL, and a 54-PLC fleet with second-scale TTLs warms back to steady state within a handful of poll intervals. + +`Cache.MaxEntriesPerPlc` and `Cache.EvictionIntervalMs` deliberately do **not** trigger a reseat. A change to either value is structurally invisible to `TagMapsEqual` (which only inspects the resolved tag triples), so no cache rebuild happens. `MaxEntriesPerPlc` is enforced on subsequent inserts only — existing entries above the new cap stay until natural LRU eviction reaches them. `EvictionIntervalMs` is sampled by each fresh tick of the eviction loop, so a change takes effect at the next tick of the old interval. 
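The comparison that decides whether a reload rebuilds the cache can be sketched as a set equality over the resolved triples. This is an illustration under stated assumptions, not the project's `TagMapsEqual`: the class name and the tuple-based signature are hypothetical.

```csharp
using System.Collections.Generic;

public static class TagMapCompareSketch
{
    // True when both maps hold the same set of (Address, Width, CacheTtlMs)
    // triples. Order is ignored; nothing outside the triples is consulted.
    public static bool TagMapsEqual(
        IEnumerable<(ushort Address, byte Width, int CacheTtlMs)> left,
        IEnumerable<(ushort Address, byte Width, int CacheTtlMs)> right)
    {
        var set = new HashSet<(ushort, byte, int)>(left);
        return set.SetEquals(right);
    }
}
```

Since `Cache.MaxEntriesPerPlc` and `Cache.EvictionIntervalMs` are not inputs to a comparison of this shape, changing them cannot flip the result, which matches the no-reseat behaviour described above.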
+ +## Reload Events + +Each reconciler run emits exactly one of two events in the structured log, depending on whether the reload was applied or rejected: + +```csharp +[LoggerMessage(EventId = 60, EventName = "mbproxy.config.reload.applied", + Level = LogLevel.Information, + Message = "Config reload applied — PlcsAdded={PlcsAdded} PlcsRemoved={PlcsRemoved} " + + "PlcsRestarted={PlcsRestarted} PlcsReseated={PlcsReseated} GlobalTagDelta={GlobalTagDelta}")] +private static partial void LogReloadApplied( + ILogger logger, int plcsAdded, int plcsRemoved, int plcsRestarted, int plcsReseated, int globalTagDelta); + +[LoggerMessage(EventId = 61, EventName = "mbproxy.config.reload.rejected", + Level = LogLevel.Error, + Message = "Config reload rejected — Errors={Errors}")] +private static partial void LogReloadRejected(ILogger logger, string errors); +``` + +`mbproxy.config.reload.applied` carries the counts from the executed `ReloadPlan` plus a `GlobalTagDelta` computed by `ConfigReconciler.ComputeGlobalTagDelta`, which counts how many global tag entries differ between the old and new options snapshots (added, removed, or width-changed). + +`mbproxy.config.reload.rejected` carries the joined error string from `ReloadValidator.Validate`. The reconciler also increments service-wide counters through `ServiceCounters.RecordReloadApplied` and `ServiceCounters.RecordReloadRejected`, which surface on the status page as `config.reloadCount`, `config.reloadRejectedCount`, and `config.lastReloadUtc`. Both event names are catalogued in [`../Reference/LogEvents.md`](../Reference/LogEvents.md). 
+ +### Reading the events + +A healthy reload looks like this in the log stream: + +```text +INFO mbproxy.config.reload.applied — PlcsAdded=1 PlcsRemoved=0 PlcsRestarted=0 PlcsReseated=2 GlobalTagDelta=3 +``` + +The properties answer four questions at a glance: how many new listeners were brought up, how many old listeners were torn down, how many existing listeners moved to a new endpoint (and therefore disconnected their clients), and how many existing listeners had their tag maps swapped underneath open connections. `GlobalTagDelta` reports the number of `BcdTags.Global` entries that differ between snapshots; it counts each address once whether the difference is "added", "removed", or "width changed". + +A rejected reload looks like this: + +```text +ERROR mbproxy.config.reload.rejected — Errors=Plc 'plc-02': Duplicate ListenPort 5020 (already used by 'plc-01').; Plc 'plc-03': BCD tag map error (DuplicateAddress): Address 1072 appears twice in resolved tag list. +``` + +Every error from the validator concatenates with `; ` so a single rejected event captures every problem. The matching `config.reloadRejectedCount` counter on the status page increments by one per rejected save, not per error inside the save. + +## Related Documentation + +- [Architecture Overview](../Architecture/Overview.md) +- [Response Cache](../Architecture/ResponseCache.md) +- [BCD Rewriting](./BcdRewriting.md) +- [Configuration Reference](../Operations/Configuration.md) +- [Log Events](../Reference/LogEvents.md) +- [Status Page](../Operations/StatusPage.md) diff --git a/mbproxy/docs/Operations/Configuration.md b/mbproxy/docs/Operations/Configuration.md new file mode 100644 index 0000000..d5a1907 --- /dev/null +++ b/mbproxy/docs/Operations/Configuration.md @@ -0,0 +1,422 @@ +# Configuration Reference + +`mbproxy` binds its runtime configuration from `appsettings.json` under the `Mbproxy` section. 
This document is the full reference for every supported key, its type, default, range, and validation rules. + +## File Location + +The configuration loader resolves `appsettings.json` relative to the executable. + +- **Development run** (`dotnet run`): `src/Mbproxy/appsettings.json` next to the build output. +- **Single-file publish** (`dotnet publish -c Release -r win-x64`): `appsettings.json` next to `Mbproxy.exe` in the publish folder. +- **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`. The install script copies the template at `install/mbproxy.config.template.json` to this path the first time only — an existing file is preserved across reinstalls. + +The file is loaded with `reloadOnChange: true`. All consumers read through `IOptionsMonitor`, so a save propagates without restarting the service. See [`../Features/HotReload.md`](../Features/HotReload.md) for per-key propagation semantics. + +The .NET configuration provider accepts `//` and `/* */` comments (JSONC) in `appsettings.json` when loaded through `Host.CreateApplicationBuilder`. The install template ships with comments. + +Environment variables and command-line arguments are also accepted by the host. Either form can override any `Mbproxy:*` key; for example, `Mbproxy__AdminPort=9090` (double-underscore segment separator) overrides the JSON. Environment overrides are useful for ephemeral diagnostic switches but should not replace the file as the source of truth — `ReloadValidator` runs against the merged configuration on every reload. + +## Top-Level Schema + +Every supported key under `Mbproxy:*`, populated to a representative default: + +```jsonc +{ + "Mbproxy": { + + // Global BCD tag list — applies to every PLC unless overridden per-PLC. 
+ "BcdTags": { + "Global": [ + { "Address": 1024, "Width": 16 }, // 16-bit BCD register + { "Address": 1056, "Width": 32 }, // 32-bit BCD pair (CDAB) + { "Address": 1088, "Width": 16, "CacheTtlMs": 1000 } // opt-in cache, 1 s TTL + ] + }, + + // One entry per PLC. Each maps an upstream proxy port to a backend Modbus TCP endpoint. + "Plcs": [ + { + "Name": "Line1-Mixer", + "ListenPort": 5020, + "Host": "10.0.1.1", + "Port": 502, + "DefaultCacheTtlMs": 0, + "BcdTags": { + "Add": [ { "Address": 1200, "Width": 32 } ], + "Remove": [ 1056 ] + } + } + ], + + // Read-only HTTP status page. Set to 0 to disable. + "AdminPort": 8080, + + // Backend connection / request / shutdown timeouts. + "Connection": { + "BackendConnectTimeoutMs": 3000, + "BackendRequestTimeoutMs": 3000, + "GracefulShutdownTimeoutMs": 10000 + }, + + // Polly resilience policies. + "Resilience": { + "BackendConnect": { + "MaxAttempts": 3, + "BackoffMs": [ 100, 500, 2000 ] + }, + "ListenerRecovery": { + "InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ], + "SteadyStateMs": 30000 + }, + "ReadCoalescing": { + "Enabled": true, + "MaxParties": 32 + } + }, + + // Response-cache safety knobs. The cache is off by default per tag. + "Cache": { + "AllowLongTtl": false, + "MaxEntriesPerPlc": 1000, + "EvictionIntervalMs": 5000 + } + } +} +``` + +`Serilog` configuration is documented in [`./Troubleshooting.md`](./Troubleshooting.md) and lives outside the `Mbproxy` section. + +## `Mbproxy.AdminPort` + +Port for the read-only HTTP status server. Binds to all interfaces on startup. + +| Field | Type | Default | Range | +|-------|------|---------|-------| +| `AdminPort` | int | `8080` | `[1, 65535]` | + +`ReloadValidator` rejects values outside `[1, 65535]` and rejects collisions with any `Plcs[i].ListenPort`. Source: `MbproxyOptions.AdminPort`. + +The server exposes `GET /` (auto-refreshing HTML) and `GET /status.json`. See [`./StatusPage.md`](./StatusPage.md) for the schema. 
+ +Authentication is assumed at the network layer (trusted internal segment). The endpoint is read-only — there are no `POST` / `PUT` / `DELETE` routes — so the risk surface is limited to status disclosure. Place the admin port behind a firewall rule that allows only operator workstations. + +## `Mbproxy.Plcs[]` + +One entry per PLC. The array drives the listener supervisor; on reload, entries added here cause new listeners to bind and entries removed here cause listeners to stop. Source: `PlcOptions.cs`. + +| Field | Type | Default | Required | Notes | +|-------|------|---------|----------|-------| +| `Name` | string | `""` | yes | Non-empty, unique across the array. Shown on the status page and in structured logs as `plc`. | +| `ListenPort` | int | `0` | yes | Port the proxy listens on. `[1, 65535]`. Unique across the array. Cannot collide with `AdminPort`. | +| `Host` | string | `""` | yes | PLC IP address or hostname. | +| `Port` | int | `502` | no | Backend Modbus TCP port on the PLC. | +| `BcdTags` | object | `null` | no | Per-PLC overrides on top of `Mbproxy.BcdTags.Global`. See below. | +| `DefaultCacheTtlMs` | int | `0` | no | Fallback TTL in milliseconds for any tag on this PLC whose explicit `CacheTtlMs` is unset (`null`). `0` disables caching by default. | + +### `Plcs[i].BcdTags` + +Per-PLC override block. Resolution: the effective tag list for a PLC is `(Global − Remove) ∪ Add`, applied in that order, with the `Add` entry winning on `Width` and `CacheTtlMs` when an address appears in both `Global` and `Add`. Source: `BcdTagListOptions.PlcBcdOverrides`. + +| Field | Type | Default | Notes | +|-------|------|---------|-------| +| `Add` | `BcdTagOptions[]` | `[]` | Tags to append for this PLC. Can override a `Global` entry's `Width` by repeating the address. Each entry follows the `BcdTagOptions` shape (see next section). | +| `Remove` | `ushort[]` | `[]` | Addresses to drop from this PLC's effective list. Matches by address. 
| + +The full tag-list resolution algorithm — `Add` width override semantics, overlap detection, and per-PLC tag-map flushing on reload — is documented in [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md). + +A subtle case worth pinning down: when an address appears in both `Mbproxy.BcdTags.Global[]` and `Plcs[i].BcdTags.Add[]`, the per-PLC `Add` entry wins on `Width` and `CacheTtlMs`. This is how a per-PLC width override is expressed (for example, a 16-bit tag globally, promoted to 32-bit on the one PLC that uses the wider format). To strip a global tag from a PLC entirely, use `Remove`; do not add a same-address entry with `Width = 0`. + +## `Mbproxy.BcdTags.Global[]` + +The fleet-wide BCD tag list. Every PLC starts with this set, then applies its per-PLC `Add` / `Remove` overrides. Source: `BcdTagListOptions.Global`, entries of type `BcdTagOptions`. + +| Field | Type | Default | Range | Notes | +|-------|------|---------|-------|-------| +| `Address` | ushort | `0` | `[0, 65535]` | Modbus PDU address (decimal). Address `0` is valid on DL205/DL260 — do not skip it. Octal V-memory addresses must be converted: `V2000` octal = decimal `1024`. | +| `Width` | byte | `0` | `{ 16, 32 }` | Bit width. `16` is one register holding 4 BCD digits (`0–9999`). `32` is a CDAB-ordered register pair at `Address` (low word) and `Address+1` (high word). | +| `CacheTtlMs` | int? | `null` | `>= 0`, `<= 60000` unless `Cache.AllowLongTtl = true` | Optional per-tag opt-in to the response cache. `null` falls back to the PLC's `DefaultCacheTtlMs`. `0` explicitly disables caching for this tag even when the PLC default is non-zero. | + +`MbproxyOptionsValidator` rejects any entry whose `Width` is not `16` or `32`. See [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) for the wire encoding rules and the multi-tag-overlap validation that runs in `BcdTagMapBuilder`. 
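The octal-to-decimal conversion called out in the `Address` notes can be checked with a few lines of code. This is a convenience sketch for operators, not part of `mbproxy`: the `VMemorySketch` class and the `ToPduAddress` helper are hypothetical names.

```csharp
using System;

public static class VMemorySketch
{
    // Converts a DirectLOGIC V-memory label such as "V2000" (octal digits)
    // to the decimal Modbus PDU address the configuration file expects.
    public static ushort ToPduAddress(string vMemory)
    {
        if (string.IsNullOrEmpty(vMemory) || (vMemory[0] != 'V' && vMemory[0] != 'v'))
            throw new FormatException("Expected a label of the form V<octal digits>.");

        // Base 8 parse; digits 8 or 9 are rejected with a FormatException.
        return Convert.ToUInt16(vMemory.Substring(1), 8);
    }
}
```

For example, `VMemorySketch.ToPduAddress("V2000")` yields `1024`, the same value the `Address` notes give for `V2000`.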
+ +Address conversion examples for operators coming from DirectLOGIC ladder: + +| V-memory (octal) | Modbus PDU (decimal) | +|------------------|----------------------| +| `V2000` | `1024` | +| `V2040` | `1056` | +| `V2100` | `1088` | +| `V2200` | `1152` | + +The proxy expects PDU-decimal addresses. Do not use octal V-memory addresses and do not use 1-based `4xxxx` Modbus references — both will resolve to the wrong register. + +## `Mbproxy.Connection` + +Backend connection and shutdown timeouts. Source: `ConnectionOptions.cs`. + +| Field | Type | Default | Notes | +|-------|------|---------|-------| +| `BackendConnectTimeoutMs` | int | `3000` | Max time in milliseconds to wait for one TCP connect to the backend PLC. Each Polly retry attempt is bounded by its own copy of this timeout — total worst-case connect time is `MaxAttempts * BackendConnectTimeoutMs` plus the configured backoffs. | +| `BackendRequestTimeoutMs` | int | `3000` | Max time in milliseconds to wait for the PLC to respond to a forwarded PDU. On timeout the upstream client is disconnected. FC06 / FC16 writes are not retried because they are non-idempotent on BCD tags; FC03 / FC04 reads are also not retried mid-request (a fresh upstream request takes the full pipeline again). | +| `GracefulShutdownTimeoutMs` | int | `10000` | Max time in milliseconds the shutdown coordinator waits for in-flight PDUs to drain after a stop signal (`sc.exe stop` or Windows Service stop). After this deadline, remaining work is cancelled. Keep at or below the Service Control Manager wait hint (30 s). | + +On hot reload, `BackendConnectTimeoutMs` and `BackendRequestTimeoutMs` apply to the next backend connect or request — in-flight operations keep their already-applied timeout. `GracefulShutdownTimeoutMs` is sampled only at shutdown. + +Operational sizing notes: + +- The default 3 s connect timeout is appropriate for a local Ethernet segment to a healthy ECOM100. 
On WAN paths or for devices behind switches with slow MAC-table aging, raise to 5–10 s. +- A 3 s request timeout is generous compared with typical DL205/DL260 scan times (a few ms to tens of ms for FC03 of 100 registers). The slack absorbs PLC scan-overlap jitter without faulting the upstream client. +- `GracefulShutdownTimeoutMs` should be less than the Service Control Manager's stop deadline. The default 10 s suits a fleet of 54 PLCs; on a much larger fleet, raise both the SCM wait hint and this value in lockstep. + +## `Mbproxy.Resilience` + +Polly retry pipelines for backend connect, listener bind, and the in-flight read coalescer. Source: `ResilienceOptions.cs`. + +### `Mbproxy.Resilience.BackendConnect` + +Bounded retries on the backend TCP connect path. Mid-request failures (during a forwarded PDU) are never retried. + +| Field | Type | Default | Notes | +|-------|------|---------|-------| +| `MaxAttempts` | int | `3` | Total connect tries, including the first. `1` disables retries. | +| `BackoffMs` | int[] | `[100, 500, 2000]` | Delay in milliseconds between attempts. Must have `MaxAttempts - 1` entries. | + +### `Mbproxy.Resilience.ListenerRecovery` + +Unbounded retries on the listener bind path. If a PLC's `ListenPort` cannot be bound (port in use, bad interface, transient OS error), the supervisor cycles through `InitialBackoffMs` once, then repeats `SteadyStateMs` forever. The same recovery code path also reacts to a listener that faults at runtime (for example, the underlying socket dies) and to listeners that come online from a hot-reload that adds a new PLC. + +| Field | Type | Default | Notes | +|-------|------|---------|-------| +| `InitialBackoffMs` | int[] | `[1000, 2000, 5000, 15000, 30000]` | Backoff schedule for the first N retries after a fault. | +| `SteadyStateMs` | int | `30000` | Backoff for every retry after the initial schedule is exhausted. Runs indefinitely. 
| + +### `Mbproxy.Resilience.ReadCoalescing` + +In-flight de-duplication of identical FC03 / FC04 reads. When multiple upstream clients issue the same `(unitId, fc, startAddress, qty)` tuple while a matching backend round-trip is already in flight, the late arrivals attach to the existing entry and the single response is fanned out to every party. See [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md). + +| Field | Type | Default | Notes | +|-------|------|---------|-------| +| `Enabled` | bool | `true` | Master switch. Hot-reloadable; flipping to `false` lets already-coalesced entries drain naturally. | +| `MaxParties` | int | `32` | Per-entry cap on attached parties. Past this cap, the next identical request opens a fresh in-flight entry. | + +Writes (FC06 / FC16) are never coalesced. FC03 and FC04 never share an entry. Different `unitId` bytes never share an entry. + +Total FC03 + FC04 request accounting is preserved across the coalescing path: `coalescedHitCount + coalescedMissCount` equals the total reads observed by the multiplexer since startup. `coalescedHitCount` stays at `0` while `Enabled = false`, but every read still increments `coalescedMissCount`. See [`./StatusPage.md`](./StatusPage.md) for the full counter catalogue. + +## `Mbproxy.Cache` + +Service-wide safety knobs for the opt-in response cache. The cache is off by default per tag — this section only governs the limits when an operator opts a tag in via `CacheTtlMs` or `DefaultCacheTtlMs`. Source: `CacheOptions` in `MbproxyOptions.cs`. + +| Field | Type | Default | Notes | +|-------|------|---------|-------| +| `AllowLongTtl` | bool | `false` | Gate for any `CacheTtlMs > 60_000`. When `false`, `ReloadValidator` rejects any tag or PLC default that exceeds 60 s. Set to `true` to opt in explicitly. | +| `MaxEntriesPerPlc` | int | `1000` | LRU cap on the number of entries per PLC. When full, the next insert evicts the least-recently-used entry. Must be `>= 0`. 
`0` is accepted but means "evict every insert immediately" — effectively the cache is disabled even for tags with non-zero TTL. | +| `EvictionIntervalMs` | int | `5000` | Background eviction loop tick in milliseconds. Each tick scans the per-PLC caches and removes entries past their `ExpiresAtUtc`. Must be `>= 0`; values below 100 ms are clamped at 100 ms internally to avoid pathologically tight loops. | + +On hot reload, `AllowLongTtl` is enforced by the next reload validation. `MaxEntriesPerPlc` applies to subsequent inserts (existing entries are not pruned). `EvictionIntervalMs` is read by each fresh eviction loop iteration. + +Any tag-list change for a given PLC drops that PLC's entire cache on reload — per-tag flush granularity is intentionally not implemented. New entries re-populate on demand under the new TTL. Process restart wipes every cache; there is no persistence and no last-known-good snapshot. + +See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) for the cache contract (lookup order, write-invalidation by address-range overlap, post-rewriter byte storage). + +## Per-Tag `CacheTtlMs` + +Per-tag opt-in to the cache. The same field appears on every `BcdTagOptions` entry — both `Mbproxy.BcdTags.Global[]` and `Mbproxy.Plcs[i].BcdTags.Add[]`. + +| Value | Meaning | +|-------|---------| +| `null` (omitted) | Unset. Falls back to `Plcs[i].DefaultCacheTtlMs`. | +| `0` | Caching explicitly disabled for this tag, even if the PLC default is non-zero. | +| `1..60000` | Cache enabled with this TTL in milliseconds. | +| `> 60000` | Rejected at reload unless `Cache.AllowLongTtl = true`. | + +TTL resolution order for any single tag: **explicit per-tag value → per-PLC `DefaultCacheTtlMs` → 0 (off)**. + +For multi-tag read ranges, the effective TTL is `min(TTLs)` across all configured tags inside the read range. If any tag in the range has `CacheTtlMs = 0`, the entire read is uncached. 
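The two TTL rules above (per-tag fallback, then the minimum across a read range) can be sketched as follows. The class and method names are hypothetical and this is not the cache's actual code. In particular, treating a read range that contains no configured tags as uncached is an assumption the text above does not state.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CacheTtlSketch
{
    // Resolution order for one tag: explicit per-tag value, else the PLC's
    // DefaultCacheTtlMs (which itself defaults to 0, meaning off).
    public static int ResolveTagTtl(int? tagCacheTtlMs, int plcDefaultCacheTtlMs)
        => tagCacheTtlMs ?? plcDefaultCacheTtlMs;

    // Effective TTL for a multi-tag read: the minimum of the resolved TTLs
    // of the configured tags inside the range. A result of 0 means the read
    // is not cached. ASSUMPTION: an empty range is treated as uncached.
    public static int EffectiveReadTtl(IEnumerable<int> resolvedTtlsInRange)
    {
        var ttls = resolvedTtlsInRange.ToList();
        return ttls.Count == 0 ? 0 : ttls.Min();
    }
}
```

Taking the minimum means a single tag with `CacheTtlMs = 0` in the range forces the whole read to be uncached, as described above.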
+ +The cache itself is described in detail in [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md). The properties most relevant to operators setting TTLs: + +- **Lookup order is cache → coalesce → backend.** A cache hit short-circuits the read coalescer entirely. +- **Writes invalidate by address-range overlap.** A successful FC06 / FC16 response invalidates every cached FC03 / FC04 entry whose read range overlaps the write range — not just exact-key matches. Exception responses do not invalidate (the write did not take effect on the PLC). +- **Cache stores post-rewriter bytes.** Hits never re-invoke the BCD rewriter. Tag-list reloads flush the affected PLC's whole cache so a rewriter-relevant change cannot serve stale post-rewriter bytes from before the change. +- **Different `unitId` bytes never invalidate each other.** Invalidation is scoped to `(unitId, FC ∈ {3, 4})`. + +## Validation Rules + +`ReloadValidator.Validate` runs on every config load (startup and hot reload) and rejects the entire snapshot if any rule fails. On rejection at startup, the service exits non-zero. On rejection at runtime, the current in-memory config stays in effect and `mbproxy.config.reload.rejected` is logged at `Error`. + +Rules (in order): + +1. **PLC names**: every `Plcs[i].Name` is non-empty and unique (ordinal comparison). +2. **ListenPort**: every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the array. +3. **AdminPort**: in `[1, 65535]` and does not collide with any `ListenPort`. +4. **BCD tag map** per PLC, delegated to `BcdTagMapBuilder.Build`: + - duplicate addresses within a single PLC's resolved tag list + - 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry at that address +5. **Cache TTL bounds**: + - any `CacheTtlMs` or `DefaultCacheTtlMs` less than 0 is rejected + - any `CacheTtlMs` or `DefaultCacheTtlMs` greater than `60_000` is rejected unless `Cache.AllowLongTtl = true` +6. 
**Cache size knobs**: `Cache.MaxEntriesPerPlc >= 0`, `Cache.EvictionIntervalMs >= 0`. +7. **Width**: every `BcdTagOptions.Width` is `16` or `32` (enforced by `MbproxyOptionsValidator` at schema time). + +Sample rejection messages (logged at `Error` with the structured property `errors` carrying the full list): + +```text +Plc 'Line1-Mixer': Duplicate ListenPort 5020 (already used by 'Line1-Conveyor'). +AdminPort 5020 collides with ListenPort of PLC 'Line1-Mixer'. +Plc 'Line1-Mixer': BCD tag map error (DuplicateAddress): address 1024 appears twice. +BcdTags.Global Address 1024: CacheTtlMs=120000 exceeds 60_000 ms without Cache.AllowLongTtl=true. +Plcs[2] (Line2-Press): DefaultCacheTtlMs must be >= 0. +``` + +Warning case (not a rejection): + +- `Plcs[i].BcdTags.Remove[]` entries that do not match any global tag address are logged as warnings — probably stale config, but the reload proceeds. + +Two additional rejection categories handled earlier in the pipeline: + +- **Type-mismatched / malformed JSON.** The .NET configuration binder rejects values whose type does not match the bound property (for example, a string in `BackendConnectTimeoutMs`). At startup this aborts the host; on hot reload the binder retains the previous snapshot and the reload never reaches `ReloadValidator`. +- **Width invalid.** `MbproxyOptionsValidator` rejects any `BcdTagOptions.Width` that is not `16` or `32`. This runs as part of options validation before `ReloadValidator` and surfaces the same way as schema errors. + +See [`../Features/HotReload.md`](../Features/HotReload.md) for the full reload-acceptance flow, including the log event names emitted on acceptance (`mbproxy.config.reload.applied`) and rejection (`mbproxy.config.reload.rejected`). + +## Two Concrete Examples + +The minimal and production examples below are both complete `appsettings.json` snippets — paste either one and the service will start without further edits beyond the addresses and ports. 
+ +### Minimal + +One PLC, no BCD tags, no cache. The proxy is pure pass-through. + +```jsonc +{ + "Mbproxy": { + "BcdTags": { "Global": [] }, + "Plcs": [ + { + "Name": "Line1-Mixer", + "ListenPort": 5020, + "Host": "10.0.1.1" + } + ], + "AdminPort": 8080 + } +} +``` + +Everything else picks up defaults: `Port = 502`, `Connection.BackendConnectTimeoutMs = 3000`, `Connection.BackendRequestTimeoutMs = 3000`, `Connection.GracefulShutdownTimeoutMs = 10000`, `Resilience.BackendConnect.MaxAttempts = 3`, `Resilience.ReadCoalescing.Enabled = true`, `Cache.AllowLongTtl = false`, `Cache.MaxEntriesPerPlc = 1000`, `Cache.EvictionIntervalMs = 5000`, and so on. + +Behaviour in this snapshot: every byte passes through unchanged in both directions, FC03 / FC04 reads are still subject to in-flight coalescing (the feature is on by default), and no responses are cached. + +### Production + +Three PLCs, a global BCD tag list, one PLC with overrides, cache enabled on hot reads. + +```jsonc +{ + "Mbproxy": { + "BcdTags": { + "Global": [ + { "Address": 1024, "Width": 16 }, // V2000 — 16-bit BCD counter + { "Address": 1056, "Width": 32 }, // V2040 — 32-bit BCD total + { "Address": 1088, "Width": 16, "CacheTtlMs": 1000 } // V2100 — setpoint, 1 s cache + ] + }, + "Plcs": [ + { + "Name": "Line1-Mixer", + "ListenPort": 5020, + "Host": "10.0.1.1", + "Port": 502, + "DefaultCacheTtlMs": 0, + "BcdTags": { + "Add": [ { "Address": 1200, "Width": 32 } ], + "Remove": [ 1056 ] + } + }, + { + "Name": "Line1-Conveyor", + "ListenPort": 5021, + "Host": "10.0.1.2" + }, + { + "Name": "Line2-Press", + "ListenPort": 5022, + "Host": "10.0.2.1", + "DefaultCacheTtlMs": 500 + } + ], + "AdminPort": 8080, + "Connection": { + "BackendConnectTimeoutMs": 3000, + "BackendRequestTimeoutMs": 3000, + "GracefulShutdownTimeoutMs": 10000 + }, + "Resilience": { + "BackendConnect": { "MaxAttempts": 3, "BackoffMs": [ 100, 500, 2000 ] }, + "ListenerRecovery": { "InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ], 
"SteadyStateMs": 30000 }, + "ReadCoalescing": { "Enabled": true, "MaxParties": 32 } + }, + "Cache": { + "AllowLongTtl": false, + "MaxEntriesPerPlc": 1000, + "EvictionIntervalMs": 5000 + } + } +} +``` + +In this snapshot, `Line1-Mixer` adds a 32-bit tag at `1200` and removes the global 32-bit tag at `1056`. `Line2-Press` opts every tag in (whose `CacheTtlMs` is `null`) into a 500 ms cache via its `DefaultCacheTtlMs`. The setpoint at `1088` already has an explicit per-tag TTL and that value wins. + +The effective tag map per PLC after resolution: + +| PLC | Effective tag list | +|-----|--------------------| +| `Line1-Mixer` | `1024` (16-bit), `1088` (16-bit, `CacheTtlMs = 1000`), `1200` (32-bit). `1056` is removed. | +| `Line1-Conveyor` | `1024` (16-bit), `1056` (32-bit), `1088` (16-bit, `CacheTtlMs = 1000`). | +| `Line2-Press` | `1024` (16-bit, effective `CacheTtlMs = 500` via PLC default), `1056` (32-bit, effective `CacheTtlMs = 500`), `1088` (16-bit, effective `CacheTtlMs = 1000` from explicit per-tag value). | + +Any FC03 / FC04 read whose register range overlaps `Line2-Press`'s tag `1088` resolves to the per-tag 1 s TTL. A read that spans tags with different TTLs takes `min(TTLs)` across the range; a read that includes a tag with `CacheTtlMs = 0` is uncached even if every other tag in the range is opted in. + +## Hot-Reload Propagation Summary + +A reduced view of [`../Features/HotReload.md`](../Features/HotReload.md), restricted to the keys documented here. Every accepted reload emits `mbproxy.config.reload.applied` at `Information` with a summary of which PLCs were added or removed and the size of the tag-list delta. + +| Change | Propagation | +|--------|-------------| +| `BcdTags.Global` add / remove / width | Rewriter dereferences `IOptionsMonitor` per PDU. Next PDU sees the new map; in-flight PDUs are not retroactively touched. | +| `Plcs[i].BcdTags.{Add,Remove}` | Same per-PDU resolution as above, scoped to the affected PLC. 
| +| New `Plcs[i]` entry | Listener supervisor binds the new port under `ListenerRecovery`. | +| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream connections for that PLC. | +| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. | +| `Connection.Backend*TimeoutMs` | Next backend connect or request uses the new value. | +| `AdminPort` | Requires a service restart — the Kestrel admin host is built once at startup. | +| `Resilience.ReadCoalescing.Enabled` | Hot-reloadable; in-flight coalesced entries drain naturally. | +| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` | Tag-map reseat for the affected PLC drops that PLC's entire cache. | +| `Cache.AllowLongTtl` / `MaxEntriesPerPlc` / `EvictionIntervalMs` | Enforced on next reload validation / next insert / next eviction tick respectively. | + +## Where Options Live in Code + +| Section | File | Binding class | +|---------|------|---------------| +| Root | `src/Mbproxy/Options/MbproxyOptions.cs` | `MbproxyOptions` | +| `Plcs[]` | `src/Mbproxy/Options/PlcOptions.cs` | `PlcOptions` | +| `BcdTags.Global[]` entry shape | `src/Mbproxy/Options/BcdTagOptions.cs` | `BcdTagOptions` | +| `BcdTags.Global` / `Plcs[i].BcdTags` | `src/Mbproxy/Options/BcdTagListOptions.cs` | `BcdTagListOptions`, `PlcBcdOverrides` | +| `Connection` | `src/Mbproxy/Options/ConnectionOptions.cs` | `ConnectionOptions` | +| `Resilience` | `src/Mbproxy/Options/ResilienceOptions.cs` | `ResilienceOptions`, `RetryProfile`, `RecoveryProfile`, `ReadCoalescingOptions` | +| `Cache` | `src/Mbproxy/Options/MbproxyOptions.cs` | `CacheOptions` (declared alongside `MbproxyOptions` in the same file) | +| Schema validation | `src/Mbproxy/Options/MbproxyOptions.cs` | `MbproxyOptionsValidator` | +| Reload validation | `src/Mbproxy/Configuration/ReloadValidator.cs` | `ReloadValidator` | +| Tag-map resolution | `src/Mbproxy/Bcd/BcdTagMapBuilder.cs` | `BcdTagMapBuilder` | +| Reload reconciliation | 
`src/Mbproxy/Configuration/ConfigReconciler.cs` | `ConfigReconciler`, `ReloadPlan` | + +All option classes are registered through `services.Configure(...)` against the `Mbproxy:*` section in `Program.cs`. `IOptionsMonitor` is the runtime read path; direct `IOptions` injection is not used because it does not propagate reloads. + +## Related Documentation + +- [`../Features/HotReload.md`](../Features/HotReload.md) — reload acceptance flow and per-key propagation semantics +- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — tag-list resolution, wire encoding, multi-tag overlap rules +- [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) — cache contract, lookup order, write invalidation +- [`./StatusPage.md`](./StatusPage.md) — schema served by `AdminPort` +- [`./Troubleshooting.md`](./Troubleshooting.md) — Serilog block and common config rejection diagnostics +- [`../../README.md`](../../README.md) — install and operational entry point diff --git a/mbproxy/docs/Operations/StatusPage.md b/mbproxy/docs/Operations/StatusPage.md new file mode 100644 index 0000000..6cd1c5f --- /dev/null +++ b/mbproxy/docs/Operations/StatusPage.md @@ -0,0 +1,334 @@ +# Status Page + +The status page is the operator-facing view of the running service: an auto-refreshing HTML dashboard at `GET /` and a JSON twin at `GET /status.json` that monitoring scrapers consume. This document describes the endpoint surface, every wire-level field, and how counters map back to architecture decisions. + +## Endpoint Surface + +The admin endpoint is owned by `AdminEndpointHost` (see `src/Mbproxy/Admin/AdminEndpointHost.cs`). It exposes exactly two routes: + +- `GET /` — a single self-contained HTML document with a `<meta http-equiv="refresh" content="5">` tag. The page refreshes every five seconds by reload, not by JavaScript polling. There is no JS bundle, no external CSS, no remote fonts, and no favicon fetch.
+- `GET /status.json` — the same in-memory snapshot serialized as JSON via the source-generated `StatusJsonContext` (camelCase property names). + +The endpoint is **read-only**. There are no admin actions exposed — no kick-client, no force-reload, no listener restart, no log download. Reload happens automatically via `IOptionsMonitor`; listener recovery is owned by the supervisor. Authentication lives at the network layer: the service binds to `IPAddress.Any` on the admin port and assumes the deployment runs in a trusted internal segment behind a firewall. + +Both routes call `StatusSnapshotBuilder.Build()` for every request. The builder reads atomic counters directly from the supervisor map and per-PLC `ProxyCounters`; it holds no locks and performs no I/O. + +## Port and Configuration + +The listen port is read from `Mbproxy.AdminPort` and defaults to `8080`. Configuration semantics for this key live in [`./Configuration.md`](./Configuration.md). + +If Kestrel cannot bind the configured port at startup (port already in use, missing permissions on a reserved range, etc.) the host logs `mbproxy.admin.bind.failed` at `Error` level with the underlying reason. The host then sets `_app = null` and returns — the rest of the service keeps running. The Modbus listener supervisors are completely independent of the admin endpoint, so a bind failure here is non-fatal for proxying. See [`../Reference/LogEvents.md`](../Reference/LogEvents.md) for the event-id catalogue. + +If `Mbproxy.AdminPort` changes via hot-reload, the currently-running Kestrel app is stopped (2 s deadline) and a new one is started on the new port. Other config changes do not touch the admin endpoint. + +## Service-Wide Fields + +Top-level fields come from `ServiceFields` and `ListenersAggregate` in `src/Mbproxy/Admin/StatusDto.cs`. 
+ +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `service.uptimeSeconds` | `long` | `ServiceFields.UptimeSeconds` | Seconds since process start, computed as `now - ServiceCounters.StartedAtUtc` at snapshot time. | +| `service.version` | `string` | `ServiceFields.Version` via `AssemblyVersionAccessor` | `AssemblyInformationalVersion` of the running assembly. Useful for confirming a deployment took effect. | +| `service.configLastReloadUtc` | `DateTimeOffset?` | `ServiceCounters.LastReloadUtc` | Wall-clock time of the most recent **accepted** hot-reload. `null` if no reload has occurred since process start. See [`../Features/HotReload.md`](../Features/HotReload.md). | +| `service.configReloadCount` | `int` | `ServiceCounters.ReloadAppliedCount` | Number of `appsettings.json` reloads that validated and applied since process start. | +| `service.configReloadRejectedCount` | `int` | `ServiceCounters.ReloadRejectedCount` | Number of reload attempts rejected by validation. A non-zero value here paired with a stale `configLastReloadUtc` indicates the operator's last edit was malformed and the service is still running the previous config. | +| `listeners.bound` | `int` | `boundCount` accumulated while iterating `opts.Plcs` | Count of PLC entries whose supervisor currently reports `SupervisorState.Bound`. | +| `listeners.configured` | `int` | `opts.Plcs.Count` | Total number of PLC entries in the active configuration. | + +Operator triggers: + +- `listeners.bound < listeners.configured` for more than one refresh cycle indicates one or more listeners are stuck recovering. Drill into the per-PLC `listener.state` and `listener.lastBindError` fields below. +- `configReloadRejectedCount` rising means edits are reaching the watcher but failing validation — check the live log for `mbproxy.config.reload.rejected`. + +## Per-PLC Fields + +Each entry in `plcs[]` is a `PlcStatus` (see `src/Mbproxy/Admin/StatusDto.cs`). 
The builder iterates `opts.Plcs` in configured order, looks up the matching supervisor in `ProxyWorker.Supervisors`, and projects the supervisor's `CurrentCounters.Snapshot()` into wire fields. + +### Identity + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `name` | `string` | `PlcOptions.Name` | Stable identifier from `appsettings.json`. Used as the dictionary key for supervisor lookup. | +| `host` | `string` | `PlcOptions.Host` | Backend PLC host (IP or DNS name) the proxy connects out to. | +| `listenPort` | `int` | `PlcOptions.ListenPort` | Local TCP port the proxy binds for upstream clients connecting *to* the proxy. | + +### Listener state + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `listener.state` | `string` | `SupervisorSnapshot.State` mapped to `"bound"` / `"recovering"` / `"stopped"` | Current supervisor state. `bound` = TCP listener is accepting connections; `recovering` = Polly retry loop is trying to re-bind after a fault; `stopped` = no supervisor entry (typically a PLC that was just added and not yet started). | +| `listener.lastBindError` | `string?` | `SupervisorSnapshot.LastBindError` | Message from the last bind exception. Populated whenever `state == "recovering"`. Common values: `"Address already in use"`, `"Permission denied"`. | +| `listener.recoveryAttempts` | `int` | `SupervisorSnapshot.RecoveryAttempts` | Number of bind retries since the supervisor entered recovery. Resets on a successful bind. A monotonically rising value indicates the underlying problem is persistent. | + +### Client tracking + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `clients.connected` | `int` | `clientSnapshots.Count` | Number of currently-connected upstream clients. Capped by the H2-ECOM100 four-client ceiling; values at 4 imply additional upstream connect attempts will be refused by the PLC. | +| `clients.remoteEndpoints[].remote` | `string` | `UpstreamPipe.RemoteEp` | Upstream TCP endpoint as `ip:port`. 
| +| `clients.remoteEndpoints[].connectedAtUtc` | `DateTimeOffset` | `UpstreamPipe.ConnectedAtUtc` | Wall-clock time the upstream socket was accepted. Useful for spotting zombie sockets that survived a network outage. | +| `clients.remoteEndpoints[].pdusForwarded` | `long` | `UpstreamPipe.PdusForwardedCount` | PDUs forwarded on this specific upstream pipe since it connected. Lets you see which client is responsible for what fraction of fleet traffic. | + +### PDU traffic + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `pdus.forwarded` | `long` | `CounterSnapshot.PdusForwarded` | Total PDUs (requests + responses) that traversed the proxy for this PLC since start. Increments once per PDU handed to the rewriter. | +| `pdus.byFc.fc03` | `long` | `CounterSnapshot.Fc03` | Count of FC03 (read holding registers) requests seen. | +| `pdus.byFc.fc04` | `long` | `CounterSnapshot.Fc04` | Count of FC04 (read input registers) requests seen. | +| `pdus.byFc.fc06` | `long` | `CounterSnapshot.Fc06` | Count of FC06 (write single register) requests seen. | +| `pdus.byFc.fc16` | `long` | `CounterSnapshot.Fc16` | Count of FC16 (write multiple registers) requests seen. | +| `pdus.byFc.other` | `long` | `CounterSnapshot.FcOther` | All other function codes (FC01/02/05/15, diagnostic codes, etc.) seen. The proxy forwards these untouched. | +| `pdus.rewrittenSlots` | `long` | `CounterSnapshot.RewrittenSlots` | Number of register slots the BCD rewriter touched, counting reads and writes. Indicates how much of the traffic actually hits BCD-configured addresses. See [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md). | +| `pdus.partialBcdWarnings` | `long` | `CounterSnapshot.PartialBcdWarnings` | Count of requests whose `[start, qty)` range partially overlapped a 32-bit BCD tag without fully covering its CDAB word pair. 
A rising value here is an operator signal: an upstream client is requesting partial-overlap reads, which the proxy cannot rewrite safely — review tag-list addresses or fix the client's request shape. | + +### Backend health + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `backend.connectsSuccess` | `long` | `CounterSnapshot.ConnectsSuccess` | Successful backend TCP connects since start. Increments once per accepted upstream client (the proxy opens one backend socket per upstream client). | +| `backend.connectsFailed` | `long` | `CounterSnapshot.ConnectsFailed` | Failed backend TCP connects after the Polly retry budget is exhausted (3 attempts at 100/500/2000 ms). A rising counter means the backend host is unreachable or the PLC is at its connection cap. | +| `backend.exceptionsByCode.code01` | `long` | `CounterSnapshot.BackendException01` | Count of Modbus exception responses with code 01 (Illegal Function) received from the PLC. Typically indicates a client is sending function codes the PLC does not support. | +| `backend.exceptionsByCode.code02` | `long` | `CounterSnapshot.BackendException02` | Code 02 (Illegal Data Address) — the requested register range is out of the PLC's V-memory map. | +| `backend.exceptionsByCode.code03` | `long` | `CounterSnapshot.BackendException03` | Code 03 (Illegal Data Value) — quantity exceeds the PLC's per-FC cap (FC03/04 = 128 registers, FC16 = 100). | +| `backend.exceptionsByCode.code04` | `long` | `CounterSnapshot.BackendException04` | Code 04 (Server Device Failure) — internal PLC fault, often correlated with the PLC entering STOP mode. | +| `backend.lastRoundTripMs` | `double` | `CounterSnapshot.LastRoundTripMs` | Exponentially-weighted moving average of recent successful request → response round-trip times in milliseconds. Tracks PLC responsiveness; sustained values above the historical baseline indicate backend latency degradation. 
| + +### Multiplexer state + +These five fields describe the per-PLC backend multiplexer. See [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md) for the design rationale and how transaction-id (TxId) reuse and queueing work. + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `backend.inFlight` | `long` | `CounterSnapshot.InFlightCount` | Number of MBAP transactions currently in flight on the backend socket (request sent, response pending). | +| `backend.maxInFlight` | `long` | `CounterSnapshot.MaxInFlight` | High-water mark of `inFlight` since start. Used to size the queue and to verify the multiplexer is in fact pipelining requests. | +| `backend.txIdWraps` | `long` | `CounterSnapshot.TxIdWraps` | Times the 16-bit MBAP transaction-id allocator has wrapped through `0xFFFF`. A rising rate quantifies sustained request volume. | +| `backend.disconnectCascades` | `long` | `CounterSnapshot.BackendDisconnectCascades` | Times a backend disconnect cascaded into closing all upstream pipes that were waiting on in-flight TxIds. Each cascade aborts every queued request bound for that PLC. | +| `backend.queueDepth` | `long` | `CounterSnapshot.BackendQueueDepth` | Current count of requests queued behind the multiplexer's TxId allocator and write semaphore. A sustained non-zero queue means the multiplexer is the bottleneck (backend slower than upstream demand). | + +### Coalescing counters + +These fields describe duplicate-read coalescing on FC03/FC04. See [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md) for the matching criteria and lifecycle. + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `backend.coalescedHitCount` | `long` | `CounterSnapshot.CoalescedHitCount` | Reads that attached to an already-in-flight identical read instead of issuing a new backend request. 
| +| `backend.coalescedMissCount` | `long` | `CounterSnapshot.CoalescedMissCount` | Reads that did not find a matching in-flight request and issued their own. The dashboard-side ratio is `hit / (hit + miss)`; the wire format intentionally does **not** carry the derived ratio (consumers compute it). | +| `backend.coalescedResponseToDeadUpstream` | `long` | `CounterSnapshot.CoalescedResponseToDeadUpstream` | Coalesced responses that arrived after their attached upstream pipe had closed. Normal in bursty traffic; sustained growth indicates upstream clients are aborting too quickly. | + +### Cache counters + +These fields describe the short-TTL response cache for FC03/FC04. See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md). + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `backend.cacheHitCount` | `long` | `CounterSnapshot.CacheHitCount` | Reads served from the cache without touching the backend at all. | +| `backend.cacheMissCount` | `long` | `CounterSnapshot.CacheMissCount` | Cache-eligible reads that fell through to the backend. The derived `cacheHitRatio` is `hit / (hit + miss)`; like coalescing, it is **not** carried on the wire. | +| `backend.cacheInvalidations` | `long` | `CounterSnapshot.CacheInvalidations` | Times a write (FC06/FC16) invalidated overlapping cache entries on this PLC. A high invalidation rate relative to writes means write coverage is broad and the cache is doing less work. | + +### Cache memory-watch + +These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is per-PLC; the dashboard aggregates these across the fleet. + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. | +| `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). 
Used to detect runaway growth from a chatty client. | + +### Bytes + +| JSON path | Type | Source | Meaning | +|---|---|---|---| +| `bytes.upstreamIn` | `long` | `CounterSnapshot.BytesUpstreamIn` | Total bytes read from upstream client sockets bound to this PLC since start. | +| `bytes.upstreamOut` | `long` | `CounterSnapshot.BytesUpstreamOut` | Total bytes written back to upstream client sockets bound to this PLC since start. | + +## Counter Atomicity + +All counters are `System.Threading.Interlocked` longs. Each read in `StatusSnapshotBuilder.Build()` is atomic per field; no locks are held across the snapshot build, and the build itself does no I/O. + +The practical consequence: a single `/status.json` request returns a coherent value for any **one** counter, but the assembled response is **not** a globally consistent snapshot — different per-PLC counters may straddle increments by microseconds. For example, `pdus.forwarded` for PLC A and `pdus.forwarded` for PLC B are not guaranteed to reflect the same instant. This is acceptable for dashboards and rate calculations; do not use these counters for fine-grained accounting. 
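Because the wire format carries only raw hit and miss counters, a scraper has to derive the ratios itself. The sketch below shows one way to do that in Python from a single `plcs[]` entry of `/status.json`. The function names and the `coalescedHitRatio` key are hypothetical, and returning `None` when no reads have been counted is an assumption rather than documented behaviour. Since each counter is read atomically but independently, a hit and a miss value may straddle an increment, so the result is best treated as approximate.

```python
from typing import Optional


def ratio(hit: int, miss: int) -> Optional[float]:
    """hit / (hit + miss), or None when nothing has been counted yet."""
    total = hit + miss
    return hit / total if total else None


def derived_ratios(plc: dict) -> dict:
    """Compute the ratios that /status.json intentionally leaves to consumers."""
    backend = plc["backend"]
    return {
        # Hypothetical key name; the status page does not name this ratio.
        "coalescedHitRatio": ratio(
            backend["coalescedHitCount"], backend["coalescedMissCount"]
        ),
        "cacheHitRatio": ratio(
            backend["cacheHitCount"], backend["cacheMissCount"]
        ),
    }


# One plcs[] entry, trimmed to the fields this sketch reads.
sample_plc = {
    "backend": {
        "coalescedHitCount": 41892,
        "coalescedMissCount": 177012,
        "cacheHitCount": 88321,
        "cacheMissCount": 88691,
    }
}

for name, value in derived_ratios(sample_plc).items():
    print(name, "n/a" if value is None else f"{value:.3f}")
```

For alerting, computing the ratio over the delta between two scrapes is usually more informative than the since-start value, because the cumulative counters smooth out recent changes.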
+ +## Example JSON Response + +A representative two-PLC deployment, ~2 hours into a run: + +```json +{ + "service": { + "uptimeSeconds": 7234, + "version": "1.0.0", + "configLastReloadUtc": "2026-05-13T14:02:11+00:00", + "configReloadCount": 2, + "configReloadRejectedCount": 0 + }, + "listeners": { + "bound": 2, + "configured": 2 + }, + "plcs": [ + { + "name": "line1-press", + "host": "10.20.30.41", + "listenPort": 5021, + "listener": { + "state": "bound", + "lastBindError": null, + "recoveryAttempts": 0 + }, + "clients": { + "connected": 2, + "remoteEndpoints": [ + { + "remote": "10.20.40.10:51223", + "connectedAtUtc": "2026-05-13T12:01:55+00:00", + "pdusForwarded": 184213 + }, + { + "remote": "10.20.40.11:53901", + "connectedAtUtc": "2026-05-13T13:30:02+00:00", + "pdusForwarded": 41008 + } + ] + }, + "pdus": { + "forwarded": 225221, + "byFc": { + "fc03": 218904, + "fc04": 0, + "fc06": 12, + "fc16": 6203, + "other": 102 + }, + "rewrittenSlots": 1318622, + "partialBcdWarnings": 0 + }, + "backend": { + "connectsSuccess": 2, + "connectsFailed": 0, + "exceptionsByCode": { + "code01": 0, + "code02": 14, + "code03": 0, + "code04": 0 + }, + "lastRoundTripMs": 12.4, + "inFlight": 1, + "maxInFlight": 4, + "txIdWraps": 3, + "disconnectCascades": 0, + "queueDepth": 0, + "coalescedHitCount": 41892, + "coalescedMissCount": 177012, + "coalescedResponseToDeadUpstream": 7, + "cacheHitCount": 88321, + "cacheMissCount": 88691, + "cacheInvalidations": 6203, + "cacheEntryCount": 47, + "cacheBytes": 18512 + }, + "bytes": { + "upstreamIn": 4108290, + "upstreamOut": 12993021 + } + }, + { + "name": "line2-oven", + "host": "10.20.30.42", + "listenPort": 5022, + "listener": { + "state": "recovering", + "lastBindError": "Address already in use", + "recoveryAttempts": 12 + }, + "clients": { + "connected": 0, + "remoteEndpoints": [] + }, + "pdus": { + "forwarded": 0, + "byFc": { "fc03": 0, "fc04": 0, "fc06": 0, "fc16": 0, "other": 0 }, + "rewrittenSlots": 0, + "partialBcdWarnings": 0 + }, + 
"backend": { + "connectsSuccess": 0, + "connectsFailed": 0, + "exceptionsByCode": { "code01": 0, "code02": 0, "code03": 0, "code04": 0 }, + "lastRoundTripMs": 0.0, + "inFlight": 0, + "maxInFlight": 0, + "txIdWraps": 0, + "disconnectCascades": 0, + "queueDepth": 0, + "coalescedHitCount": 0, + "coalescedMissCount": 0, + "coalescedResponseToDeadUpstream": 0, + "cacheHitCount": 0, + "cacheMissCount": 0, + "cacheInvalidations": 0, + "cacheEntryCount": 0, + "cacheBytes": 0 + }, + "bytes": { "upstreamIn": 0, "upstreamOut": 0 } + } + ] +} +``` + +## HTML Page Layout + +The HTML renderer is `StatusHtmlRenderer.Render(StatusResponse)` in `src/Mbproxy/Admin/StatusHtmlRenderer.cs`. The page is one document, inline CSS in a `