diff --git a/mbproxy/README.md b/mbproxy/README.md
index ce8d0bb..d80129f 100644
--- a/mbproxy/README.md
+++ b/mbproxy/README.md
@@ -31,6 +31,36 @@ DL260/                DL205/DL260 reference material and pymodbus simulator prof
 | pymodbus simulator profile (register seeds for E2E tests) | [`DL260/dl205.json`](DL260/dl205.json) |
 | Agent-oriented coding guide (architecture bullets, device quirks, phase context) | [`CLAUDE.md`](CLAUDE.md) |
 
+## Detailed documentation
+
+The `docs/` tree is organized by topic. Start with [`docs/design.md`](docs/design.md) for the canonical end-to-end design; jump to the focused pages below when you need depth on one area.
+
+### Architecture
+
+- [`Architecture/Overview.md`](docs/Architecture/Overview.md) — Listener topology, request flow, per-PLC isolation.
+- [`Architecture/ConnectionModel.md`](docs/Architecture/ConnectionModel.md) — Single backend connection per PLC, TxId multiplexing, request-timeout watchdog, disconnect cascade.
+- [`Architecture/ReadCoalescing.md`](docs/Architecture/ReadCoalescing.md) — In-flight FC03/FC04 deduplication via `InFlightByKeyMap`.
+- [`Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md) — Opt-in per-tag response cache with bounded operator-configured staleness.
+
+### Features
+
+- [`Features/BcdRewriting.md`](docs/Features/BcdRewriting.md) — BCD codec, CDAB word order, FC03/04/06/16 scope, partial-overlap policy.
+- [`Features/HotReload.md`](docs/Features/HotReload.md) — `IOptionsMonitor`-driven config reload with per-change-kind reconcile rules.
+
+### Operations
+
+- [`Operations/Configuration.md`](docs/Operations/Configuration.md) — Full `appsettings.json` reference: every `Mbproxy:*` key, default, and validation rule.
+- [`Operations/StatusPage.md`](docs/Operations/StatusPage.md) — Admin endpoint surface (`/`, `/status.json`) with every JSON field documented.
+- [`Operations/Troubleshooting.md`](docs/Operations/Troubleshooting.md) — Diagnosis playbook keyed to log events and status counters.
+
+### Reference
+
+- [`Reference/LogEvents.md`](docs/Reference/LogEvents.md) — Stable `mbproxy.*` event catalog (28 events across 7 categories).
+
+### Testing
+
+- [`Testing/Simulator.md`](docs/Testing/Simulator.md) — pymodbus DL205 fixture, skip policy, and the load-bearing pymodbus 3.13 framer quirk.
+
 ## Build and run
 
 **Build (Debug, multi-file — fast for iteration):**
diff --git a/mbproxy/docs/Architecture/ConnectionModel.md b/mbproxy/docs/Architecture/ConnectionModel.md
new file mode 100644
index 0000000..75fd10f
--- /dev/null
+++ b/mbproxy/docs/Architecture/ConnectionModel.md
@@ -0,0 +1,247 @@
+# Connection Model
+
+The proxy holds one persistent backend TCP socket per PLC and multiplexes many upstream client connections onto it by rewriting the MBAP transaction ID on every request and restoring each client's original TxId on the matching response.
+
+## Why One Backend Connection Per PLC
+
+An earlier design opened a fresh backend socket for each accepted upstream client (1:1 pairs). That model collapsed against the **AutomationDirect H2-ECOM100**, which caps simultaneous TCP clients at **4 per PLC** (see [`../../DL260/dl205.md`](../../DL260/dl205.md) under "Behavioural Oddities"). The fifth upstream client to attach to a busy PLC was refused at connect, with no recourse other than waiting for an existing pair to drop.
+
+Multiplexing replaces the 1:1 upstream-to-backend pairing with N:1 upstream-to-multiplexer-to-backend:
+
+- The proxy occupies exactly one of the ECOM's 4 TCP client slots per PLC, regardless of how many upstream clients are attached.
+- Upstream-side concurrency is no longer bounded by the controller's accept queue.
+- Serialisation shifts from the PLC accept queue to the proxy's outbound channel (`_outboundChannel` in `PlcMultiplexer`).
+
+The honest trade-off: the wire-rate ceiling does not change. The ECOM serialises requests internally at roughly 2–10 ms per scan, so the multiplexer cannot deliver more PDUs per second to one PLC than the 1:1 model could. What multiplexing buys is connection headroom, plus the data structures that read coalescing and the response cache hook into.
+
+### Why TxId rewriting rather than connection pooling
+
+The MBAP transaction ID is a 16-bit field at bytes 0–1 of every Modbus TCP frame, and the Modbus TCP specification explicitly permits clients to pipeline requests under different TxIds on a single connection. The PLC echoes each request's TxId on the matching response. The multiplexer exploits that contract: by allocating a proxy-side TxId per request and substituting it for the upstream client's TxId on the wire, many upstream clients can have concurrent requests outstanding on one backend socket without their MBAP frames ever colliding. A connection pool, by contrast, would still need either one backend socket per concurrent request (defeating the ECOM cap workaround) or a serialisation lock on each pooled socket (defeating concurrency).
+
+## Components
+
+The load-bearing types all live in [`../../src/Mbproxy/Proxy/Multiplexing/`](../../src/Mbproxy/Proxy/Multiplexing).
+
+### Type roster
+
+| Type | File | Role |
+|------|------|------|
+| `PlcMultiplexer` | `PlcMultiplexer.cs` | Owns the backend socket, the outbound channel, the backend writer and reader tasks, the per-request timeout watchdog, and the set of attached upstream pipes. One instance per PLC. |
+| `UpstreamPipe` | `UpstreamPipe.cs` | Per-upstream-client wrapper around an accepted `Socket`. Owns a read task that drives `PlcMultiplexer.OnUpstreamFrameAsync`, plus a write task that drains a bounded `_responseChannel` (capacity 16) onto the socket. |
+| `TxIdAllocator` | `TxIdAllocator.cs` | Proxy-side 16-bit TxId allocator. Backed by a `bool[65536]` plus a rolling `_next` cursor under a single lock. Exposes `TryAllocate`, `Release`, `InFlightCount`, and `WrapCount`. |
+| `CorrelationMap` | `CorrelationMap.cs` | `ConcurrentDictionary<ushort, InFlightRequest>` mapping proxy TxId to its in-flight record. Exposes `TryAdd`, `TryRemove`, `DrainAll`, and `SnapshotOlderThan`. |
+| `InFlightRequest` | `InFlightRequest.cs` | Record carrying `UnitId`, `Fc`, `StartAddress`, `Qty`, `IReadOnlyList<InterestedParty> InterestedParties`, `SentAtUtc`, and `ResolvedCacheTtlMs`. |
+| `InterestedParty` | `InFlightRequest.cs` | Record `(UpstreamPipe Pipe, ushort OriginalTxId)` identifying who receives the response and which TxId to restore. |
+
+### Threading invariants
+
+The multiplexer relies on a handful of single-owner rules that keep the wire-touching code lock-free:
+
+- **One backend writer.** Only `RunBackendWriterAsync` calls `backend.SendAsync`. The single-writer drain of `_outboundChannel.Reader.ReadAllAsync` means no socket-level send lock is needed.
+- **One backend reader.** Only `RunBackendReaderAsync` calls `backend.ReceiveAsync`. On the response path, the reader is the only caller of `CorrelationMap.TryRemove`.
+- **Per-pipe write loop.** Each `UpstreamPipe` has exactly one task that drains `_responseChannel` to its upstream socket. The multiplexer fan-out path only enqueues; it never writes to the socket directly.
+- **Per-pipe read loop.** A single read task per pipe parses MBAP frames and calls `OnUpstreamFrameAsync` sequentially, so frames from one upstream client enter the multiplexer one at a time, in arrival order. Most concurrency therefore comes from having many pipes.
+
+`TxIdAllocator` holds an internal lock for `TryAllocate` / `Release`. Contention is low in practice — one PLC's wire rate is bounded by the ECOM scan time — and the lock is preferred over a lock-free approach so the saturation, cascade, and Polly-retry paths remain deterministic.
+
+### Why ConcurrentDictionary for the correlation map
+
+`CorrelationMap` is backed by `ConcurrentDictionary<ushort, InFlightRequest>` even though the request-side adds and the response-side removes are nominally single-threaded each. Three independent paths can remove an entry concurrently with each other: the backend reader on a normal response, the watchdog on a timeout, and the cascade walker on a backend disconnect. Two adders (the coalescing path's factory and the non-coalescing fast path) can also race against a removal if the backend response arrives mid-add. The `ConcurrentDictionary` makes those `TryAdd`/`TryRemove` calls atomic, which is what the "claim then dispatch" pattern in the watchdog and reader relies on for correctness.
+
+## Upstream To Multiplexer Path
+
+`PlcListener` accepts an upstream `Socket` and constructs an `UpstreamPipe` around it. `PlcMultiplexer.StartPipeAsync` attaches the pipe, spins up its write loop, and invokes `RunReadLoopAsync` with `OnUpstreamFrameAsync` as the per-frame callback. When the read loop returns (clean upstream EOF, socket fault, or cascade), a `ContinueWith` removes the pipe from `_pipes`; disposal itself is owned by the listener.
+
+### Frame parsing
+
+The pipe's read loop reads a 7-byte MBAP header into a stack-buffered array, parses the `Length` field, allocates a fresh `byte[]` sized to header + (`Length` − 1) bytes, fills the PDU body, and hands the complete frame to the callback. Frames whose length field claims a body larger than `MbapFrame.MaxPduBodySize` are treated as a protocol error and close the upstream pipe; a zero-body length is permitted (the header alone is forwarded). The buffer ownership transfers to the multiplexer with each call so the multiplexer can store it in the `CorrelationMap` entry without coordinating buffer lifetimes back to the pipe.
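+
+As a rough illustration of the sizing rule above, the sketch below derives the total frame size from a 7-byte MBAP header. `TotalFrameSize` is a hypothetical helper, not the pipe's actual code, and the 253-byte limit is an assumption (the Modbus maximum PDU size); the real value of `MbapFrame.MaxPduBodySize` is not stated here. Rejecting a `Length` of zero is likewise an assumption.
+
+```csharp
+using System;
+using System.Buffers.Binary;
+
+// Assumed limit: 253 bytes is the Modbus maximum PDU size. The actual
+// MbapFrame.MaxPduBodySize value may differ.
+const int MaxPduBodySize = 253;
+const int HeaderSize = 7; // TxId(2) + ProtocolId(2) + Length(2) + UnitId(1)
+
+// Returns the total frame size (header + PDU body), or -1 on a protocol error.
+int TotalFrameSize(ReadOnlySpan<byte> header)
+{
+    if (header.Length < HeaderSize) return -1;
+    // Length sits at bytes 4-5, big-endian, and counts UnitId + PDU body.
+    int length = BinaryPrimitives.ReadUInt16BigEndian(header.Slice(4, 2));
+    int body = length - 1; // subtract the UnitId byte already in the header
+    if (body < 0 || body > MaxPduBodySize) return -1;
+    return HeaderSize + body;
+}
+
+// FC03 read request: Length = 6 (UnitId + 5-byte PDU), so 12 bytes in total.
+byte[] sample = { 0x00, 0x2A, 0x00, 0x00, 0x00, 0x06, 0x01 };
+Console.WriteLine(TotalFrameSize(sample)); // 12
+```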
+
+Each call to `OnUpstreamFrameAsync`:
+
+1. Parses the MBAP header to extract the upstream client's `originalTxId` and the `unitId`.
+2. For FC03, FC04, FC06, and FC16 it also pulls `startAddress` and `qty` out of the PDU; these feed the cache, the read-coalescing key, and the response BCD rewriter.
+3. (Response cache, FC03/FC04 only) checks `_ctx.Cache` via a `CacheKey`. A hit short-circuits the entire path — including the backend connect attempt — and returns a synthesised frame.
+4. Calls `EnsureBackendConnectedAsync`, which lazily brings up the backend socket through a Polly retry pipeline driven by `Connection.BackendConnectTimeoutMs`.
+5. (Read coalescing, FC03/FC04 only, when enabled) consults `InFlightByKeyMap` to either attach to an existing peer in flight or open a new entry.
+6. On a coalescing miss or any non-coalescing FC: calls `TxIdAllocator.TryAllocate(out ushort proxyTxId)`. Saturation returns false and the client receives a Modbus exception code 4 (Slave Device Failure).
+7. Builds an `InFlightRequest`, registers it in `CorrelationMap.TryAdd(proxyTxId, ...)`, and observes the new peak via `ObserveInFlight`.
+8. Runs the BCD rewriter over the request payload through `_pipeline.Process(MbapDirection.RequestToBackend, ...)`.
+9. Overwrites bytes 0 and 1 of the MBAP header with the big-endian encoding of `proxyTxId`.
+10. Enqueues the frame onto `_outboundChannel` via `_outboundChannel.Writer.WriteAsync`. The channel is bounded at 256 with `BoundedChannelFullMode.Wait`, so a saturated outbound queue backpressures the upstream rather than dropping frames.
+
+```csharp
+// Sketch of the proxy-TxId rewrite (PlcMultiplexer.OnUpstreamFrameAsync):
+if (!_allocator.TryAllocate(out ushort proxyTxIdFc))
+{
+    /* saturated: reply with exception code 4 and stop */
+    return;
+}
+_correlation.TryAdd(proxyTxIdFc, inFlightNc);   // register before the frame reaches the wire
+_pipeline.Process(MbapDirection.RequestToBackend, header, body, requestCtxNc); // request-side BCD rewrite
+frame[0] = (byte)(proxyTxIdFc >> 8);            // proxy TxId, high byte (big-endian)
+frame[1] = (byte)(proxyTxIdFc & 0xFF);          // proxy TxId, low byte
+await _outboundChannel.Writer.WriteAsync(frame, ct).ConfigureAwait(false);
+```
+
+After enqueuing, the upstream read loop is free to read the next frame. There is no per-pipe in-flight gate beyond what the upstream client itself imposes by reading from a single TCP stream.
+
+### Saturation handling
+
+`TxIdAllocator.TryAllocate` returns `false` only when all 65,536 slots are simultaneously in flight against one PLC. In that state `OnUpstreamFrameAsync` calls `BuildExceptionFrame(originalTxId, unitId, fcByte, exceptionCode: 4)` and enqueues the frame straight onto the requesting pipe's response channel — the upstream client sees a clean Modbus exception code 4 (Slave Device Failure) rather than a hung socket. The same path emits `MultiplexerLogEvents.Saturated` with the remote endpoint string for operator triage.
+
+### Lazy backend connect
+
+The backend socket starts offline. `EnsureBackendConnectedAsync` runs under a `SemaphoreSlim` named `_connectGate` so concurrent upstream frames during a cold start serialise their connect attempts. The first caller through the gate builds a fresh `Socket`, sets `NoDelay = true`, and runs `ConnectAsync` under either the supplied `_backendConnectPipeline` (Polly resilience pipeline) or a plain `CancellationToken` linked to `Connection.BackendConnectTimeoutMs`. On failure it logs `MultiplexerLogEvents.BackendFailed`, increments the per-PLC `connectsFailed` counter, and returns `false`; the upstream pipe is disposed by the caller. On success it spawns the backend writer and reader tasks under a fresh `CancellationTokenSource` linked to `_disposeCts`, increments `connectsSuccess`, and logs `MultiplexerLogEvents.BackendConnected`.
+
+A double-checked fast path before the gate avoids the semaphore acquire on the happy path: when `_backendSocket is { Connected: true }` and `_backendCts is { IsCancellationRequested: false }`, `EnsureBackendConnectedAsync` returns immediately without entering `_connectGate`. The lazy-connect contract therefore costs one volatile read per request after the first successful connect.
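+
+The gate-plus-fast-path pattern described above can be sketched in isolation. This is a minimal illustration, not the multiplexer's code: `connectGate`, `connected`, and `connectAttempts` are hypothetical names, a plain `bool` stands in for the socket and CTS checks, and `Task.Delay` stands in for the real connect.
+
+```csharp
+using System;
+using System.Linq;
+using System.Threading;
+using System.Threading.Tasks;
+
+var connectGate = new SemaphoreSlim(1, 1);
+bool connected = false;
+int connectAttempts = 0;
+
+async Task<bool> EnsureConnectedAsync()
+{
+    // Fast path: already connected, so skip the semaphore entirely.
+    if (Volatile.Read(ref connected)) return true;
+
+    await connectGate.WaitAsync();
+    try
+    {
+        // Re-check under the gate: another caller may have connected meanwhile.
+        if (connected) return true;
+        connectAttempts++;
+        await Task.Delay(20); // stands in for the real socket connect
+        Volatile.Write(ref connected, true);
+        return true;
+    }
+    finally
+    {
+        connectGate.Release();
+    }
+}
+
+// Eight frames arrive during a cold start; only one connect attempt is made.
+await Task.WhenAll(Enumerable.Range(0, 8).Select(_ => EnsureConnectedAsync()));
+Console.WriteLine(connectAttempts); // 1
+```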
+
+## Multiplexer To Backend Path
+
+The backend side is two tasks plus one bounded channel. On first connect, `EnsureBackendConnectedAsync` launches both tasks against the backend socket under a single `_backendCts`:
+
+- **`RunBackendWriterAsync`** — single consumer of `_outboundChannel.Reader.ReadAllAsync`. Writes every frame to the backend socket via `SendAsync` with a loop to handle short writes. Single-writer means no socket-level lock is needed for sends.
+- **`RunBackendReaderAsync`** — single producer reading frames off the backend socket. For each frame:
+  1. Parses the MBAP header to extract `proxyTxId` and `length`.
+  2. Reads the PDU body into a fresh `byte[]`.
+  3. Calls `CorrelationMap.TryRemove(proxyTxId, out var inFlight)`. A miss (no entry) drops the frame silently — usually a stale response after a cascade.
+  4. Frees the allocator slot via `_allocator.Release(proxyTxId)`.
+  5. Updates the per-PLC EWMA round-trip via `UpdateRoundTripEwma` using `inFlight.SentAtUtc`.
+  6. Runs the response-side BCD rewriter through `_pipeline.Process(MbapDirection.ResponseToClient, ...)`. The rewriter needs `inFlight.StartAddress` and `inFlight.Qty` because the FC03/FC04 response PDU does not echo the read range.
+  7. (Cache write-through, post-rewriter) on a non-exception response, stores FC03/FC04 entries in `_ctx.Cache` or invalidates overlapping entries on FC06/FC16.
+  8. Walks `inFlight.InterestedParties`. For each party with a live pipe, restores `party.OriginalTxId` into header bytes 0–1 (on a per-party copy of the frame when more than one party is attached) and calls `party.Pipe.SendResponseAsync` to enqueue the frame onto that pipe's response channel.
+
+Single-reader on the backend socket plus per-pipe response channels means every cross-task hand-off goes through a `Channel<byte[]>` — no locks on the wire-touching code paths.
+
+### Frame fan-out
+
+When `inFlight.InterestedParties.Count == 1` — the common non-coalesced case — the reader optimises by passing the original frame buffer through to `SendResponseAsync` without copying. When the list has more than one party (a coalesced FC03/FC04 read), the reader clones the frame for each party before patching in its `OriginalTxId`, so each pipe's response channel owns an independent buffer.
+
+A party whose pipe reports `IsAlive == false` is skipped. For multi-party FC03/FC04 frames the skip path also increments the per-PLC `coalescedResponseToDeadUpstream` counter and logs `CoalescingLogEvents.DeadUpstream`, so operators can correlate cascade-mid-flight events with which reads were affected.
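+
+The copy-versus-reuse rule can be sketched as follows. `FanOut` is a hypothetical helper that takes the parties' original TxIds directly; the real reader works on `InterestedParty` records and enqueues onto pipe channels rather than returning a list.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Returns one buffer per interested party, with that party's original TxId
+// patched into MBAP bytes 0-1 (big-endian).
+List<byte[]> FanOut(byte[] frame, IReadOnlyList<ushort> originalTxIds)
+{
+    var result = new List<byte[]>(originalTxIds.Count);
+    foreach (ushort txId in originalTxIds)
+    {
+        // Single party: reuse the backend buffer. Several parties: clone, so
+        // each response channel owns an independent buffer.
+        byte[] copy = originalTxIds.Count == 1 ? frame : (byte[])frame.Clone();
+        copy[0] = (byte)(txId >> 8);
+        copy[1] = (byte)(txId & 0xFF);
+        result.Add(copy);
+    }
+    return result;
+}
+
+// FC03 response carrying one register, as received under proxy TxId 0x0007.
+byte[] response = { 0x00, 0x07, 0x00, 0x00, 0x00, 0x05, 0x01, 0x03, 0x02, 0x12, 0x34 };
+var frames = FanOut(response, new ushort[] { 0x0101, 0x0202 });
+Console.WriteLine(BitConverter.ToString(frames[1], 0, 2)); // 02-02
+```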
+
+## Per-Request Timeout Watchdog
+
+`RunRequestTimeoutWatchdogAsync` is launched from the multiplexer constructor and runs for the lifetime of the multiplexer. It ticks every `BackendRequestTimeoutMs / 4`, floored at 100 ms, and on each tick calls `CorrelationMap.SnapshotOlderThan(DateTimeOffset.UtcNow.AddMilliseconds(-BackendRequestTimeoutMs))`.
+
+For each stale entry the watchdog:
+
+1. Tries to claim the entry via `_correlation.TryRemove(proxyTxId, out var req)`. A failed claim means a response, cascade, or another watchdog tick already removed it — skip.
+2. Releases the proxy TxId via `_allocator.Release(proxyTxId)`.
+3. For FC03/FC04, also removes the matching `CoalescingKey` from `InFlightByKeyMap` so a brand-new identical request opens a fresh round-trip rather than attaching to a corpse.
+4. Walks `req.InterestedParties` and, for each live pipe, delivers a synthesised Modbus exception frame with function code `req.Fc | 0x80` and exception code `0x0B` (Gateway Target Device Failed To Respond), with the party's `OriginalTxId` patched back into the MBAP header.
+
+The watchdog exists because the multiplexed model has no per-pair fault-on-timeout backstop. In the 1:1 model, a lost response simply sat on a dead socket that the upstream eventually closed; in the multiplexed model, a single missing or mis-echoed response would leak its `CorrelationMap` entry forever and hang every upstream party attached to it. Specific failure modes the watchdog covers:
+
+- The PLC drops a response (busy controller, scan-time excursion).
+- A middlebox drops a packet on a long-idle backend socket.
+- A backend mis-echoes the MBAP TxId — including pymodbus 3.13.0's deferred-handler bug noted below.
+
+### Why claim then release
+
+The watchdog reads the stale set via `SnapshotOlderThan` (a non-removing scan) and only then competes for each entry via `TryRemove`. The two-step is deliberate: a response arriving between the snapshot and the claim wins the `TryRemove` race and the watchdog skips that entry. Without the claim race, the upstream party could receive both a real response and a 0x0B exception for the same request, which would break clients that expect exactly one response per TxId.
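+
+A small self-contained demonstration of why the claim is safe: however many contenders call `ConcurrentDictionary.TryRemove` for the same key, exactly one of them gets `true`. The string payload and the `winners` counter are illustrative stand-ins, not the proxy's types.
+
+```csharp
+using System;
+using System.Collections.Concurrent;
+using System.Threading;
+using System.Threading.Tasks;
+
+var correlation = new ConcurrentDictionary<ushort, string>();
+correlation.TryAdd(42, "in-flight request");
+
+int winners = 0;
+// Three contenders race to claim the same proxy TxId, standing in for the
+// backend reader, the watchdog, and the cascade walker.
+Parallel.For(0, 3, i =>
+{
+    if (correlation.TryRemove(42, out _))
+    {
+        // Only the winner of the claim would release the TxId and dispatch.
+        Interlocked.Increment(ref winners);
+    }
+});
+
+Console.WriteLine(winners); // 1
+```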
+
+### Tick cadence
+
+The 100 ms floor on `tickMs` keeps the watchdog from busy-waking when an operator configures `BackendRequestTimeoutMs` below 400 ms. With the production default of 3000 ms the watchdog ticks every 750 ms, which keeps timeout dispatch latency well under one second past the threshold.
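+
+The cadence arithmetic is easy to check by hand. The sketch below assumes integer division and ignores processing time; `WatchdogTickMs` and `WorstCaseDispatchMs` are hypothetical names used only for illustration.
+
+```csharp
+using System;
+
+// Tick interval as described: a quarter of the timeout, floored at 100 ms.
+int WatchdogTickMs(int backendRequestTimeoutMs) =>
+    Math.Max(100, backendRequestTimeoutMs / 4);
+
+// Worst case: an entry goes stale just after a tick and waits one full tick.
+int WorstCaseDispatchMs(int backendRequestTimeoutMs) =>
+    backendRequestTimeoutMs + WatchdogTickMs(backendRequestTimeoutMs);
+
+Console.WriteLine(WatchdogTickMs(3000));      // 750
+Console.WriteLine(WatchdogTickMs(200));       // 100 (floor applies below 400 ms)
+Console.WriteLine(WorstCaseDispatchMs(3000)); // 3750
+```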
+
+### Exception frame shape
+
+`BuildExceptionFrame` produces a 9-byte synthetic response: 7-byte MBAP header plus a 2-byte exception PDU. The function code byte is OR'd with `0x80` to flag the response as an exception, and the second PDU byte carries the exception code (`0x04` for allocator saturation, `0x0B` for the watchdog). The `Length` field in the MBAP header is set to 3 (`UnitId` + exception FC + exception code) and the `ProtocolId` is zero per the Modbus TCP spec. Clients written against a real DL260 see exactly the same frame layout a controller would emit, so client libraries surface a normal `ModbusException` rather than a transport error.
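+
+The byte layout described above can be written out directly. `BuildExceptionFrameSketch` is a hypothetical stand-in; the real `BuildExceptionFrame` may differ in signature and in how it allocates the buffer.
+
+```csharp
+using System;
+
+byte[] BuildExceptionFrameSketch(ushort txId, byte unitId, byte fc, byte exceptionCode)
+{
+    return new byte[]
+    {
+        (byte)(txId >> 8), (byte)(txId & 0xFF), // TxId, big-endian
+        0x00, 0x00,                             // ProtocolId = 0 (Modbus)
+        0x00, 0x03,                             // Length = UnitId + FC + code
+        unitId,
+        (byte)(fc | 0x80),                      // exception flag on the FC
+        exceptionCode,                          // 0x04 saturation, 0x0B timeout
+    };
+}
+
+// Watchdog timeout for an FC03 read from unit 1, original TxId 0x1234.
+byte[] timeout = BuildExceptionFrameSketch(0x1234, unitId: 1, fc: 0x03, exceptionCode: 0x0B);
+Console.WriteLine(BitConverter.ToString(timeout)); // 12-34-00-00-00-03-01-83-0B
+```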
+
+## Backend Disconnect Cascade
+
+When the backend socket dies — reader EOF, writer fault, PLC reboot, network partition, or middlebox idle drop — `TearDownBackendAsync(reason, cascadeUpstreams: true)` runs:
+
+1. Cancels `_backendCts`, which terminates both backend tasks.
+2. Shuts down and disposes the backend `Socket`.
+3. Calls `CorrelationMap.DrainAll`, releases every allocator slot, and collects every `InterestedParty`'s pipe ID.
+4. Calls `InFlightByKeyMap.DrainAll` so stale coalescing entries cannot outlive the backend they were aimed at.
+5. Disposes every attached `UpstreamPipe` and clears `_pipes`.
+6. Increments `BackendDisconnectCascades` on the per-PLC counters by the number of upstream pipes that were attached (`AddDisconnectCascades(upstreamCount)`).
+7. Logs a `MultiplexerLogEvents.BackendDisconnected` event with the upstream count, drained correlation count, and a reason string.
+
+The rationale: a backend disconnect invalidates every in-flight response, and there is no clean way to mid-flight-rebind upstream clients to a fresh backend socket without risking silent data loss. Cascading the disconnect upstream is loud (clients re-issue immediately) but unambiguous: every upstream sees its socket close, and no zombie upstream socket holds stale state. The next upstream frame after the cascade triggers a fresh Polly-driven backend connect.
+
+### Failure detection paths
+
+Three failure paths are relevant here. The first two initiate a cascade directly; the third is absorbed by the watchdog without one:
+
+1. **Reader EOF.** `RunBackendReaderAsync` sees a clean zero-byte read from `ReceiveAsync` and falls out of the loop. It calls `TearDownBackendAsync("backend reader EOF", cascadeUpstreams: true)` as a fire-and-forget task.
+2. **Reader fault or writer fault.** Either backend task catches a non-cancellation exception and calls `TearDownBackendAsync($"reader fault: {ex.Message}", ...)` or the equivalent writer-fault path.
+3. **Watchdog-driven indirect failure.** A backend that mis-echoes TxIds will not itself fault the socket; the watchdog eventually times out the leaked correlation entries and delivers 0x0B exceptions. The socket stays up unless the backend then also stops responding to subsequent requests.
+
+`TearDownBackendAsync` is idempotent against itself — the `lock (_backendLock)` block atomically swaps the live socket and task references to `null`, so a second invocation sees `oldSocket is null && oldCts is null` and returns without re-cascading.
+
+### Why every attached upstream cascades
+
+An earlier sketch cascaded only upstream pipes that had a request in flight at the moment of disconnect. The current implementation cascades every attached pipe, in flight or idle. The reason: an idle upstream pipe is one that the proxy has been quietly answering from cache or that has simply not sent a request recently. After a backend disconnect, the proxy has no way to prove the PLC's state still matches what those idle clients last saw — a PLC reboot, ladder edit, or operator write between the disconnect and reconnect can have moved the values out from under them. Closing every upstream socket is the unambiguous signal that "the link to the device was lost; rebuild your state from scratch." Clients reconnect on their own next request.
+
+### Connect-on-next-frame, not eager reconnect
+
+The cascade tears down the backend without scheduling a reconnect. The next upstream frame that arrives invokes `EnsureBackendConnectedAsync`, which constructs a fresh socket and runs the Polly connect pipeline. The rationale is that an eager reconnect spinner would hammer a downed PLC at the configured backoff schedule even when no clients are attached; gating reconnect on client demand avoids waste during long PLC outages without sacrificing recovery latency once clients return.
+
+## Wire-Rate Considerations
+
+The multiplexer is not a throughput multiplier. The ECOM serialises every request it receives on its single internal scan, so PDUs-per-second to one PLC is bounded by `1000 / ecom_scan_ms` (about 100 to 500 at the 2–10 ms scan times quoted earlier) regardless of how many upstream clients the proxy fans in. What changes:
+
+- **Connection count.** Upstream-side connection count is now limited by the OS socket budget rather than by the ECOM's 4-client cap; `OutboundChannelCapacity` (256) bounds how many frames can queue for the backend writer, not how many clients can attach.
+- **Coalescing opportunity.** Identical concurrent FC03/FC04 reads attach to the same `InFlightRequest` via `InFlightByKeyMap`, so the proxy issues one backend round-trip and fans the response out to all attached parties (see [`./ReadCoalescing.md`](./ReadCoalescing.md)).
+- **Cache short-circuit.** FC03/FC04 reads with a resolved per-tag TTL never reach the wire while the cached PDU is fresh (see [`./ResponseCache.md`](./ResponseCache.md)).
+
+The proxy can hand more concurrent upstream clients a result on a hot tag than the bare PLC can serve simultaneously. It cannot let those clients hammer the PLC harder than the PLC's scan time allows.
+
+### Counters exposed by the status page
+
+`PlcMultiplexer` implements `IMultiplexCountersProvider` and registers itself with the per-PLC counters object during construction. The status page reads these values per snapshot:
+
+| Counter | Source | Meaning |
+|---------|--------|---------|
+| `inFlight` | `TxIdAllocator.InFlightCount` | Proxy TxIds currently allocated against this PLC. |
+| `maxInFlight` | `Counters.ObserveInFlight` peak | High-water mark since service start. |
+| `txIdWraps` | `TxIdAllocator.WrapCount` | Times the rolling cursor has rolled 0xFFFF → 0x0000. A steadily rising value means very high churn. |
+| `queueDepth` | `_outboundChannel.Reader.Count` | Frames sitting in the outbound channel waiting for the backend writer. Persistent depth means the PLC is the bottleneck. |
+| `disconnectCascades` | `Counters.AddDisconnectCascades` | Cumulative count of upstream pipes cascaded by backend disconnects. Rises in chunks equal to the attached pipe count at cascade time. |
+| `connectsSuccess` / `connectsFailed` | `Counters.IncrementConnectSuccess` / `IncrementConnectFailed` | Per-PLC backend connect outcomes. |
+
+### Interpreting non-zero txIdWraps
+
+Each `WrapCount` increment means the allocator has issued another 65,536 TxIds against one PLC. At a steady 10 ms per PDU (100 PDUs per second) one wrap takes about 11 minutes (65,536 × 10 ms ≈ 655 s); frequent wraps therefore indicate request rates in the hundreds-per-second range, at or above what an ECOM at a 2–10 ms scan can answer. Wraps without a matching rise in `inFlight` simply reflect cumulative volume and are benign. Wraps that climb alongside a high `inFlight` value indicate the PLC is back-pressuring; check `queueDepth` and the EWMA round-trip on the same snapshot.
+
+### Interpreting non-zero queueDepth
+
+`_outboundChannel` is bounded at 256 with `BoundedChannelFullMode.Wait`. A persistent depth above zero means the backend writer is not draining as fast as upstream pipes are submitting — the PLC has become the bottleneck. A queue that climbs toward 256 means upstream pipes are starting to block on `WriteAsync`; that backpressure walks back up the per-pipe read loop and ultimately stalls the upstream client's send buffer, which is the correct behaviour for an overloaded PLC.
+
+A queue depth above zero with `inFlight` also climbing suggests the PLC is keeping up with requests but slowly; the EWMA round-trip on the same snapshot will confirm. A queue depth above zero with `inFlight` flat at the allocator's saturation ceiling indicates a stuck backend (no responses arriving, no slots freeing); the watchdog will eventually clear the stuck entries via 0x0B exceptions.
+
+### Memory footprint per PLC
+
+Each `PlcMultiplexer` holds a `bool[65536]` for the TxId allocator (~64 KB), the `ConcurrentDictionary` for the correlation map (sized to peak in-flight, typically tens of bytes per entry plus the `byte[]` frame buffers referenced by the entries), the bounded outbound channel (≤ 256 frames in flight; each frame at most 260 bytes), and the per-pipe response channels (≤ 16 frames per attached pipe). With ~54 PLCs the allocator alone accounts for roughly 3.4 MB; the rest is request-rate dependent and well within the service's measured ~30 MB working set under load.
+
+## Lifecycle And Disposal
+
+`PlcMultiplexer.DisposeAsync` is idempotent and runs in this order:
+
+1. Sets `_disposed = true` and unhooks the live `IMultiplexCountersProvider` registration so a concurrent status snapshot does not observe internal state mid-teardown.
+2. Cancels `_disposeCts`, which cooperatively stops the watchdog task.
+3. Awaits the watchdog with a 2-second timeout so its in-flight 0x0B dispatches settle before tests assert against counter values.
+4. Calls `TearDownBackendAsync("disposing", cascadeUpstreams: true)` to close the backend, drain `CorrelationMap`, drain `InFlightByKeyMap`, and dispose every attached pipe.
+5. Completes the outbound channel writer, then disposes any pipes that were not already cleared by the cascade walk.
+6. Disposes `_disposeCts`.
+
+`UpstreamPipe.DisposeAsync` is similarly idempotent: it completes its response channel writer, cancels its internal CTS, shuts the upstream socket down both ways, and emits a `MultiplexerLogEvents.ClientDisconnected` event with the remote endpoint string and a reason. Disposal can be triggered by the listener (clean upstream EOF), by the read or write loop encountering a socket error, or by the cascade walk.
+
+## pymodbus 3.13.0 Simulator Quirk
+
+The pymodbus simulator's `ServerRequestHandler` stores a single `last_pdu` field per connection and schedules deferred response handlers via the asyncio event loop's `call_soon`. When two MBAP frames arrive in the same recv buffer (exactly the workload the multiplexer can produce on its shared backend socket), the second frame's `last_pdu` overwrites the first before either deferred handler runs. Both responses then carry the second request's TxId.
+
+### Why this only matters in tests
+
+The real H2-ECOM100 does not have this bug; it echoes per-request TxIds correctly. Multiplexer correctness under genuine backend concurrency is proven by the unit tests in `PlcMultiplexerTests` against a stub backend that respects MBAP TxIds, not via the simulator. The E2E suite paces requests against the pymodbus simulator to keep it in known-good single-PDU mode.
+
+The per-request timeout watchdog described above is the production defence against any backend (real or simulated) that mis-echoes a TxId: the unanswered `InFlightRequest` ages past `BackendRequestTimeoutMs` and the upstream party receives a clean Modbus exception 0x0B rather than a hung socket.
+
+## Related Documentation
+
+- [`./Overview.md`](./Overview.md) — proxy architecture entry point
+- [`./ReadCoalescing.md`](./ReadCoalescing.md) — FC03/FC04 fan-out built on `InterestedParties`
+- [`./ResponseCache.md`](./ResponseCache.md) — per-PLC FC03/FC04 cache layered in front of this multiplexer
+- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `Connection.BackendConnectTimeoutMs`, `Connection.BackendRequestTimeoutMs`, retry tuning
+- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades` counters
+- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.multiplex.*` structured log events
+- [`../Testing/Simulator.md`](../Testing/Simulator.md) — pymodbus 3.13.0 deferred-handler quirk in detail
+- [`../../DL260/dl205.md`](../../DL260/dl205.md) — DL205/DL260 quirks including the 4-client ECOM cap
diff --git a/mbproxy/docs/Architecture/Overview.md b/mbproxy/docs/Architecture/Overview.md
new file mode 100644
index 0000000..f6f7819
--- /dev/null
+++ b/mbproxy/docs/Architecture/Overview.md
@@ -0,0 +1,150 @@
+# Architecture Overview
+
+`mbproxy` is a .NET 10 background service that sits inline between Modbus TCP clients and a fleet of AutomationDirect DL205/DL260 PLCs, rewriting BCD-encoded registers in both directions while multiplexing many upstream clients onto one persistent backend socket per PLC.
+
+This document is the entry point for readers new to the codebase. It sketches the runtime shape, the listener topology, the per-PLC isolation model, and the path a single Modbus frame takes from accept to response, and then hands off to the per-feature documents under `docs/Architecture/`, `docs/Features/`, and `docs/Operations/`.
+
+## Runtime Shape
+
+The process is a single .NET 10 Generic Host worker. `Microsoft.Extensions.Hosting.WindowsServices` registers the host as a Windows Service so the same binary runs interactively (for development) or under the SCM (in production). All configuration binds from `appsettings.json` through `IOptionsMonitor<MbproxyOptions>`, which makes the tag list and PLC roster hot-reloadable without process restart. `ProxyWorker` is the long-lived `BackgroundService` that owns startup, shutdown, and the listener supervisors for every PLC. A small Kestrel admin endpoint runs in the same process to serve the read-only status page.
+
+There is no in-process database, no message broker, and no persistent cache file: state is per-PLC, in-memory, and ephemeral. Restarting the service drops every in-flight request and every cached response. Upstream clients are expected to reconnect and reissue; the proxy never replays a request on their behalf.
+
+## Listener Topology
+
+The proxy opens **one `TcpListener` per PLC** on a distinct port. A client picks which PLC it is talking to by choosing which port to connect to. There is no protocol-level routing — port number is the PLC identity. This keeps the upstream surface trivial for Wonderware, Historian gateways, and generic Modbus clients that already know how to point at `host:port`, and it means no per-frame header inspection is needed to decide where a request is going.
+
+```text
+Client A ──┐
+Client B ──┼──→ proxy:5020 ──→ PLC #1  (10.0.1.1:502)
+           ├──→ proxy:5021 ──→ PLC #2  (10.0.1.2:502)
+           │       ...
+           └──→ proxy:5073 ──→ PLC #54 (10.0.1.54:502)
+```
+
+Each listener runs under a `PlcListenerSupervisor` that owns its bind lifecycle. If a bind fails at startup or the listener faults at runtime, the supervisor reattempts under a Polly retry pipeline; the same code path also brings up newly-added PLCs from hot-reload and tears down removed ones. The supervisor's state (`SupervisorState`) is observable on the status page so an operator can tell at a glance whether a port is bound, recovering, or shut down.
+
+Because port identity is PLC identity, adding a PLC is purely a configuration change — append an entry to `Mbproxy.Plcs` with a free `ListenPort`, save, and the supervisor reconciliation loop binds the new port without touching any other PLC. Removing a PLC follows the same path in reverse.
+
+## Per-PLC Isolation
+
+Every PLC gets its own `PerPlcContext` carrying that PLC's `PlcMultiplexer`, `CorrelationMap`, `TxIdAllocator`, `InFlightByKeyMap`, optional `ResponseCache`, `CacheInvalidator`, and `BcdPduPipeline`. There is no shared mutable state across PLCs on the request path.
+
+The consequence is fault containment:
+
+- A slow or dead backend on PLC #17 cannot block the request loop for PLC #18. Each multiplexer owns its own outbound channel and its own backend reader/writer task pair.
+- A flood of in-flight requests on one PLC consumes only that PLC's TxId allocator (the 16-bit space is per-PLC, not global).
+- A backend disconnect on one PLC cascades only to that PLC's attached upstream pipes; the rest of the fleet is unaffected.
+- Hot-reload of one PLC's tag list rewrites only that PLC's `BcdPduPipeline` view of the tag map. Other PLCs do not observe the swap.
+
+The listener topology and the per-PLC component graph are deliberately aligned: one port, one supervisor, one multiplexer, one backend socket, one cache instance.
+
+Cross-PLC state exists only in three places: the bound `IOptionsMonitor<MbproxyOptions>` snapshot, the global Serilog logger, and the service-wide counter set surfaced on the status page. The first two are read-mostly. Counters are written via lock-free `Interlocked` operations on disjoint per-PLC fields, then summed when the status page is rendered.
+
+This isolation is what lets the service operate degraded without operator intervention. If three PLCs drop off the network, the supervisor for each enters `recovering`, their multiplexers tear down their backend sockets, attached upstream clients are disconnected, and the remaining 51 PLCs keep serving traffic with no measurable impact. When the dropped PLCs come back, their supervisors rebind their listeners and the next upstream request triggers a fresh backend connect through the Polly pipeline — no fleet-wide restart, no manual reconnect, no shared state to flush.
+
+## Request Flow
+
+The diagram below traces an FC03 read from an upstream client through the proxy and back. The cache check, the coalescing check, and the BCD rewrite all sit between the upstream parse and the backend send so the multiplexer can short-circuit the backend entirely when it does not need to be involved. Steps the upstream client never sees are indented.
+
+```text
+Upstream client
+   │  TCP connect → proxy:5020
+   ▼
+PlcListener (PlcListener.cs) accepts the socket
+   │
+   ▼
+UpstreamPipe wraps the socket: read loop + bounded response channel
+   │  parses MBAP frames off the wire, hands each frame to:
+   ▼
+PlcMultiplexer.OnUpstreamFrameAsync(pipe, frame, ct)
+   │
+   │  1. Parse MBAP header → originalTxId, unitId
+   │  2. Parse PDU → fc, startAddr, qty
+   │  3. (FC03/FC04 only) ResponseCache.TryGet(CacheKey)
+   │       ├─ HIT  → splice cached payload onto a fresh MBAP header
+   │       │         with originalTxId, push to upstream channel, DONE.
+   │       └─ MISS → fall through.
+   │  4. InFlightByKeyMap coalesce check
+   │       ├─ duplicate read in flight → attach as additional waiter,
+   │       │   share the eventual response, DONE for this frame.
+   │       └─ first-of-key → become the leader, fall through.
+   │  5. BcdPduPipeline rewrites request payload (FC06/FC16) binary → BCD
+   │  6. TxIdAllocator hands out a free proxyTxId
+   │  7. CorrelationMap[proxyTxId] = InFlightRequest(pipe, originalTxId, ...)
+   │  8. Overwrite MBAP TxId field with proxyTxId; enqueue to outbound channel
+   ▼
+Backend writer task drains the outbound channel
+   │  → single persistent socket → PLC :502
+   ▼
+PLC responds; backend reader task picks the frame off the socket
+   │
+   │  9. Look up proxyTxId in CorrelationMap; recover original requester(s)
+   │ 10. BcdPduPipeline rewrites response payload (FC03/FC04) BCD → binary
+   │ 11. ResponseCache stores the rewritten payload (if TTL > 0)
+   │ 12. Fan out to every waiter on the InFlightByKey entry, restoring each
+   │     waiter's originalTxId before pushing into its UpstreamPipe channel
+   ▼
+UpstreamPipe writer task drains its response channel → upstream socket
+   │
+   ▼
+Upstream client sees a response with the TxId it originally sent.
+```
+
+Writes (FC06, FC16) take a shorter path: no cache lookup, no coalescing, but the request payload is BCD-rewritten before forwarding, and the response triggers `CacheInvalidator` to evict any overlapping cached read ranges so the next read does not serve stale data.
+
+A few invariants are worth flagging because they shape the design:
+
+- **Original TxId is preserved end-to-end.** The multiplexer rewrites the wire TxId for routing, but every upstream client sees the exact 16-bit value it sent. `InFlightRequest` carries the original TxId alongside the upstream pipe reference.
+- **Single backend writer, single backend reader.** No socket-level synchronisation is needed because exactly one task writes to the backend socket and exactly one task reads from it. The outbound channel funnels every request through that single writer.
+- **The cache check happens before backend connect.** If a read is served from the cache while the backend is disconnected, the upstream client still gets a response. The cache survives backend transitions intentionally.
+- **No mid-request retries on writes.** FC06 and FC16 are non-idempotent on BCD tags (a partially applied multi-register write could leave a 32-bit BCD value mid-transition), so a backend failure during a write surfaces as Modbus exception 0x0B and the client decides how to recover.
+
+## Component Map
+
+The major components a reader will hit when tracing a request, with their file locations under `src/Mbproxy/`. The list is ordered by where each component sits in the request path — accept loop at the top, rewrite at the bottom.
+
+- **`ProxyWorker`** — `Proxy/ProxyWorker.cs`. The `BackgroundService` host; reconciles the configured PLC list with the supervisor roster on startup and on `IOptionsMonitor` change events.
+- **`PlcListenerSupervisor`** — `Proxy/Supervision/PlcListenerSupervisor.cs`. Owns one PLC's listener lifecycle (bind, run, recover, shut down). Uses Polly for bounded recovery.
+- **`PlcListener`** — `Proxy/PlcListener.cs`. The actual `TcpListener` accept loop for one PLC; hands every accepted socket to that PLC's multiplexer as a new `UpstreamPipe`.
+- **`UpstreamPipe`** — `Proxy/Multiplexing/UpstreamPipe.cs`. One per upstream socket. Frame-parses inbound bytes and pushes parsed MBAP frames into the multiplexer; drains outbound responses from a bounded channel back to the client.
+- **`PlcMultiplexer`** — `Proxy/Multiplexing/PlcMultiplexer.cs`. The per-PLC fan-in/fan-out core. Owns the persistent backend socket, the outbound write loop, the backend read loop, the per-request watchdog, and the cascade-on-backend-disconnect contract. Entry point `OnUpstreamFrameAsync` is where every upstream frame enters the request path; it is the single function that ties cache, coalescing, BCD rewrite, TxId allocation, and correlation together.
+- **`CorrelationMap`** — `Proxy/Multiplexing/CorrelationMap.cs`. Maps `proxyTxId → InFlightRequest` so backend responses can be routed back to the originating upstream pipe(s). Also the surface the watchdog scans for stale entries.
+- **`TxIdAllocator`** — `Proxy/Multiplexing/TxIdAllocator.cs`. Allocates and recycles the per-PLC 16-bit proxy TxId space used by the multiplexer.
+- **`InFlightByKeyMap`** — `Proxy/Multiplexing/InFlightByKeyMap.cs`. The read-coalescing seam: keys on `(unitId, fc, startAddr, qty)` so duplicate concurrent reads share one backend round-trip and one response.
+- **`ResponseCache`** — `Proxy/Cache/ResponseCache.cs`. Opt-in per-tag-range TTL cache for FC03/FC04 responses. A cache hit short-circuits the backend entirely; cache lookup happens before the multiplexer even ensures the backend is connected.
+- **`CacheInvalidator`** — `Proxy/Cache/CacheInvalidator.cs`. Invalidates cached read ranges that overlap with successful FC06/FC16 writes, so writes never leave stale reads behind.
+- **`BcdPduPipeline`** — `Proxy/BcdPduPipeline.cs`. The actual BCD rewrite: walks request and response PDUs against the resolved tag map and re-encodes each configured register between BCD nibbles and binary integers. 32-bit BCD tags spanning the CDAB word pair are rewritten as a unit. Non-BCD registers pass through untouched, and any function code the pipeline does not own (diagnostics, exceptions, coil and discrete-input functions) is forwarded byte-for-byte.
+
+`PerPlcContext` (`Proxy/PerPlcContext.cs`) is the container that binds these together for one PLC and is the handle the supervisor and multiplexer carry around.
+
+Two supporting abstractions are worth knowing about even though they do not appear in the per-frame path:
+
+- **`IPduPipeline`** — the rewrite-pipeline interface (`Proxy/IPduPipeline.cs`). `BcdPduPipeline` is the production implementation; `NoopPduPipeline` is the test/passthrough implementation used when no BCD tags are configured for a PLC.
+- **`MbapFrame`** — the static helper (`Proxy/MbapFrame.cs`) that parses and serialises the 7-byte MBAP header. Every component that touches the wire goes through this helper rather than indexing raw byte arrays directly.
+
+Counters and structured log event names emitted from these components are catalogued in `ProxyCounters` (`Proxy/ProxyCounters.cs`) and the various `*LogEvents` static classes (`MultiplexerLogEvents`, `CoalescingLogEvents`, `CacheLogEvents`, `RewriterLogEvents`). A reader following a runtime symptom back to its source should grep for the event-name constants in those files first.
+
+## Where to Read Next
+
+For the wire-level details of how one backend socket fans out to many upstream clients — TxId rewriting, the correlation map, the per-request watchdog, the backend disconnect cascade — read [`./ConnectionModel.md`](./ConnectionModel.md). It is the most load-bearing internal document; almost every failure-mode question routes through it.
+
+For the read-coalescing seam (when duplicate concurrent reads collapse onto one backend request) read [`./ReadCoalescing.md`](./ReadCoalescing.md). For the opt-in TTL cache and how writes invalidate overlapping read ranges read [`./ResponseCache.md`](./ResponseCache.md). The BCD rewrite itself — what gets rewritten, what passes through, and how CDAB 32-bit values are handled — is in [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md).
+
+Operators looking for configuration shape, hot-reload semantics, and the status page should start at [`../Operations/Configuration.md`](../Operations/Configuration.md) and [`../Operations/StatusPage.md`](../Operations/StatusPage.md). When something is misbehaving in production, [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) and [`../Reference/LogEvents.md`](../Reference/LogEvents.md) are the two places to look first.
+
+The simulator used by the end-to-end test suite — a `pymodbus`-based stand-in for a real DL205 — has its own document at [`../Testing/Simulator.md`](../Testing/Simulator.md). Test-only quirks of that simulator are called out there rather than in the production docs, because the real DL260 ECOM does not share them.
+
+## Related Documentation
+
+- [`./ConnectionModel.md`](./ConnectionModel.md) — TxId multiplexing, correlation map, per-request watchdog.
+- [`./ReadCoalescing.md`](./ReadCoalescing.md) — how `InFlightByKeyMap` collapses duplicate concurrent reads.
+- [`./ResponseCache.md`](./ResponseCache.md) — `ResponseCache` and `CacheInvalidator` semantics.
+- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — the `BcdPduPipeline` rewrite rules.
+- [`../Features/HotReload.md`](../Features/HotReload.md) — `IOptionsMonitor` propagation and supervisor reconciliation.
+- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `appsettings.json` schema and tag list shape.
+- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — the Kestrel admin endpoint and counter catalog.
+- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — stable structured log event names.
+- [`../design.md`](../design.md) — canonical design decisions and rationale.
+- [`../Testing/Simulator.md`](../Testing/Simulator.md) — `pymodbus` DL205 simulator used by the end-to-end suite.
+- [`../plan/README.md`](../plan/README.md) — phase plan with per-phase test inventory.
diff --git a/mbproxy/docs/Architecture/ReadCoalescing.md b/mbproxy/docs/Architecture/ReadCoalescing.md
new file mode 100644
index 0000000..70e0eca
--- /dev/null
+++ b/mbproxy/docs/Architecture/ReadCoalescing.md
@@ -0,0 +1,243 @@
+# Read Coalescing
+
+In-flight read coalescing collapses identical FC03/FC04 requests that arrive
+while a backend response is still in flight onto a single backend round-trip,
+then fans the single response out to every attached upstream client with each
+client's original MBAP transaction ID restored.
+
+## What Coalescing Does
+
+When two upstream clients each send `(unitId=1, FC=3, start=100, qty=10)`
+within the in-flight window of a previously-routed request, the second
+arrival attaches to the existing `InFlightRequest` instead of opening a new
+proxy transaction ID and a second backend round-trip. The PLC's reply is
+delivered to both upstream pipes; each pipe sees its own MBAP `TxId`
+restored on its copy of the response.
+
+The value each upstream sees is the same value an uncoalesced request would
+have returned within the PLC's own scan-time precision (microseconds to
+~10 ms typical window). Coalescing is not a cache layer — once the response
+fans out, the in-flight entry dies, and a subsequent identical read opens a
+fresh round-trip. Bounded-staleness caching is a separate feature; see
+[`./ResponseCache.md`](./ResponseCache.md).
+
+## The Coalescing Key
+
+The lookup tuple is defined in `CoalescingKey.cs`:
+
+```csharp
+internal readonly record struct CoalescingKey(
+    byte UnitId,
+    byte Fc,
+    ushort StartAddress,
+    ushort Qty);
+```
+
+Record-struct value equality drives the dictionary lookup in
+`InFlightByKeyMap`. Several axes never coalesce, by design:
+
+- **Function code.** FC03 (Read Holding Registers) and FC04 (Read Input
+  Registers) read different Modbus tables on the device. Their responses
+  are not interchangeable, so they do not share a key even at the same
+  address.
+- **Unit ID.** Distinct unit IDs behind a shared socket target different
+  Modbus personalities, so coalescing never crosses a unit boundary.
+- **Start address and quantity.** Two reads with overlapping but
+  non-identical ranges never coalesce. Range-overlap logic exists for cache
+  invalidation, not for coalescing.
+
+## Eligibility
+
+Only FC03 and FC04 enter the coalescing path. The multiplexer's request
+handler parses the function code from the inbound PDU and gates on
+`fcByte is 0x03 or 0x04` before consulting `_inFlightByKey`.
+
+- FC06 (Write Single Register) and FC16 (Write Multiple Registers) are
+  non-idempotent on BCD tags — a second send would write the value twice.
+  Writes bypass coalescing entirely and always take the one-round-trip path.
+- Exception responses do not coalesce. Each upstream sees an exception
+  delivered against its own MBAP `TxId` through the normal correlation map
+  fan-out; there is no special exception-deduplication path.
+
+## The InterestedParties Seam
+
+The data shape that powers fan-out lives on `InFlightRequest`:
+
+```csharp
+internal sealed record InFlightRequest(
+    byte UnitId,
+    byte Fc,
+    ushort StartAddress,
+    ushort Qty,
+    IReadOnlyList<InterestedParty> InterestedParties,
+    DateTimeOffset SentAtUtc,
+    int ResolvedCacheTtlMs = 0);
+
+internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);
+```
+
+Each `InterestedParty` records the upstream pipe to deliver the response to
+and the original MBAP `TxId` that pipe sent. The backend reader iterates
+this list, patches each party's `OriginalTxId` into a per-party copy of the
+response frame, and hands the frame to `party.Pipe.SendResponseAsync`.
+
+### Multi-Writer Multi-Reader Safety
+
+The list typed as `IReadOnlyList<InterestedParty>` on the public surface is
+in fact a mutable `List<InterestedParty>` underneath. `InFlightByKeyMap`
+serialises every state mutation under a single `object` lock:
+
+- `TryAttachOrCreate` looks up the key, casts the existing
+  `InterestedParties` back to `List<InterestedParty>`, and appends the new
+  party — all under the lock.
+- The backend reader calls `TryRemove(coalKey, out _)` **before** it
+  iterates the parties list during fan-out. Once the key is gone from the
+  map, no future attach can find it, so no further appends can occur.
+
+The reader's removal-before-iteration ordering is the load-bearing
+invariant. By the time fan-out reads the list, the list is effectively
+frozen — there is no other writer that can reach it. The watchdog timeout
+path observes the same protocol: it removes the coalescing key before it
+walks `req.InterestedParties` to deliver exception 0x0B.
+
+The reverse race (reader removes first, then a late attach arrives) is
+harmless by construction: `TryRemove` and `TryAttachOrCreate` both take
+the same map lock, so any late attach is serialised either entirely before
+the removal (and is part of the fan-out) or entirely after (and opens a
+fresh entry under a new factory call).
+
+## MaxParties Cap
+
+`ResilienceOptions.cs` exposes the load-shedding cap:
+
+```csharp
+public sealed class ReadCoalescingOptions
+{
+    public bool Enabled { get; init; } = true;
+    public int MaxParties { get; init; } = 32;
+}
+```
+
+`Mbproxy.Resilience.ReadCoalescing.MaxParties` defaults to 32. Inside
+`TryAttachOrCreate`, an existing entry is only extended when
+`existingList.Count < maxParties`; once the cap is hit, the next identical
+arrival falls through to the factory branch and opens a fresh in-flight
+entry (which means a fresh backend round-trip).
+
+The cap bounds two costs:
+
+- **Fan-out cost per entry** at O(MaxParties). The backend reader's
+  per-party copy-and-patch loop runs at most `MaxParties` times for any
+  single response.
+- **Backend reader latency under pile-on.** A single pathologically popular
+  read (every HMI hitting the same tag at the same second) cannot stretch
+  one fan-out arbitrarily long.
+
+## Hot-Reloadable On/Off
+
+`Mbproxy.Resilience.ReadCoalescing.Enabled` defaults to `true`. The
+multiplexer holds a `Func<ReadCoalescingOptions>` accessor that production
+binds to `() => optionsMonitor.CurrentValue.Resilience.ReadCoalescing`, so
+a hot-reload of `appsettings.json` propagates immediately on the next
+inbound PDU.
+
+Flipping `Enabled` to `false` at runtime does not disturb already-coalesced
+entries: existing fan-outs drain through the backend reader naturally.
+Subsequent FC03/FC04 requests skip the coalescing branch entirely and take
+the one-proxy-TxId-per-upstream-request path verbatim.
+
+The same accessor reads `MaxParties` per PDU, so an operator can raise or
+lower the cap without restarting the service.
+
+## Lookup Order in the Multiplexer's Read Path
+
+`OnUpstreamFrameAsync` consults three tiers in fixed order for FC03/FC04:
+
+1. **Cache** — if `_ctx.Cache` is wired and `_ctx.TagMap.ResolveCacheTtlMs`
+   returns a positive TTL for the read range, the response cache is
+   checked first. A hit short-circuits everything, including the
+   `EnsureBackendConnectedAsync` call. See
+   [`./ResponseCache.md`](./ResponseCache.md).
+2. **Coalesce** — on a cache miss (or no cache configured), the request
+   consults `_inFlightByKey` via `TryAttachOrCreate`. A hit attaches the
+   new party to an in-flight peer and emits no backend traffic.
+3. **Backend** — on a coalescing miss, the factory branch allocates a
+   proxy `TxId` through `TxIdAllocator`, registers the entry in
+   `CorrelationMap`, runs the BCD rewriter on the request PDU, and queues
+   the frame onto the outbound channel.
+
+The order is load-bearing. Cache hits avoid both backend traffic **and**
+any coalescing-entry housekeeping. Coalescing hits avoid the backend but
+still incur a list-append and a fan-out. Backend round-trips are the most
+expensive of the three.
+
+## Counter Accounting
+
+`PerPlcContext.Counters` exposes three coalescing-specific counters, all
+surfaced on the status page:
+
+- **`coalescedHitCount`** — increments inside `OnUpstreamFrameAsync` when
+  `TryAttachOrCreate` returns `wasNew == false` (the request attached to
+  an existing in-flight entry).
+- **`coalescedMissCount`** — increments when `wasNew == true`. The
+  non-coalescing FC03/FC04 path also increments this counter when
+  coalescing is disabled, so the identity `coalescedHitCount +
+  coalescedMissCount == total FC03+FC04 requests since multiplexer
+  construction` holds regardless of `Enabled`.
+- **`coalescedResponseToDeadUpstream`** — increments inside the backend
+  reader's fan-out loop when a coalesced party's pipe has gone dead
+  (`party.Pipe.IsAlive == false`) before the response landed. Only
+  counted when the in-flight entry had more than one party — single-party
+  dead-upstream skips are the normal Phase-9 behaviour and are silent.
+
+When `ReadCoalescing.Enabled == false`, `coalescedHitCount` remains zero
+and every FC03/FC04 read increments `coalescedMissCount`. Aggregate fleet
+metrics (hit ratio, requests per second) read directly from these
+counters; see [`../Operations/StatusPage.md`](../Operations/StatusPage.md).
+
+The Debug-level log events `mbproxy.coalesce.hit`,
+`mbproxy.coalesce.miss`, and `mbproxy.coalesce.dead_upstream` mirror each
+counter increment; see [`../Reference/LogEvents.md`](../Reference/LogEvents.md).
+
+## Transparency Contract Preserved
+
+Each upstream client receives the same response shape it would have
+received from a one-to-one proxy:
+
+- **Original MBAP `TxId` restored.** The backend reader patches
+  `outFrame[0..2]` with `party.OriginalTxId` for each party in the
+  `InterestedParties` list. The proxy's internal TxId never reaches an
+  upstream socket.
+- **BCD rewriter runs once.** `_pipeline.Process(ResponseToClient, ...)`
+  fires exactly once against the shared backend response buffer. The
+  rewriter context (start address, quantity) comes from the
+  `InFlightRequest` that opened the round-trip.
+- **One-party fan-out reuses the buffer.** When
+  `inFlight.InterestedParties.Count == 1`, the backend reader assigns the
+  original `frame` reference to `outFrame` instead of cloning, saving the
+  allocation. Multi-party fan-outs clone the frame per party so each can
+  carry a distinct `TxId` without trampling its peers.
+
+Coalescing is invisible at the wire-protocol layer. An upstream client
+cannot tell whether its read was served by a fresh backend round-trip or
+by attaching to a peer's in-flight request — only the timing distribution
+changes.
+
+## Related Documentation
+
+- [`./ConnectionModel.md`](./ConnectionModel.md) — multiplexer overview;
+  the `InterestedParties` seam, `CorrelationMap`, and `TxIdAllocator` live
+  here.
+- [`./ResponseCache.md`](./ResponseCache.md) — bounded-staleness cache that
+  sits above coalescing in the lookup order; cache hits short-circuit
+  coalescing entirely.
+- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — exposes
+  `coalescedHitCount`, `coalescedMissCount`, and
+  `coalescedResponseToDeadUpstream` per PLC and as fleet aggregates.
+- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — full
+  `mbproxy.coalesce.*` event catalogue with event IDs.
+- [`../Operations/Configuration.md`](../Operations/Configuration.md) —
+  binding for `Mbproxy.Resilience.ReadCoalescing.Enabled` and `MaxParties`,
+  hot-reload semantics.
+- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — the
+  rewriter that runs once on the shared response buffer before fan-out.
diff --git a/mbproxy/docs/Architecture/ResponseCache.md b/mbproxy/docs/Architecture/ResponseCache.md
new file mode 100644
index 0000000..420e0b5
--- /dev/null
+++ b/mbproxy/docs/Architecture/ResponseCache.md
@@ -0,0 +1,398 @@
+# Response Cache
+
+The response cache is an opt-in per-tag, bounded-staleness layer that serves
+FC03 and FC04 reads from in-process memory. It sits above read coalescing in
+the request path so a hit avoids both the coalescing entry and the backend
+round-trip entirely.
+
+## Cache Contract
+
+The cache is **off by default for every tag**. A resolved `CacheTtlMs` of
+`0` on every BCD tag is the default state, and a deployment that ships
+without any TTL configuration behaves identically to one compiled without
+the cache at all: no in-memory entries are created, every FC03/FC04 read
+falls through to the coalescing-then-backend path, and counters that track
+cache activity stay at zero.
+
+Operators opt a tag in by giving it a positive TTL, either through its own
+`CacheTtlMs` or through the per-PLC `DefaultCacheTtlMs`. That positive
+value is the explicit acknowledgement of the staleness window: the operator
+is stating, "I am willing for upstream clients to see a value up to N
+milliseconds old in exchange for taking the read off the backend." There is
+no implicit cache enablement and no service-wide toggle. Every cached tag
+is one whose resolved TTL is positive in the operator's configuration.
+
+This stance is the design-contract pivot the cache introduces: before it,
+the proxy is purely transparent except for BCD rewriting. With the cache,
+the proxy is transparent **by default**, with an opt-in cache layer the
+operator can engage tag-by-tag.
+
+## TTL Resolution Order
+
+Each FC03/FC04 read range resolves to one effective TTL through three
+tiers:
+
+1. **Explicit per-tag.** `BcdTagOptions.CacheTtlMs` on the tag entry. A
+   non-null value wins regardless of the per-PLC default. An explicit `0`
+   here disables caching for that tag even when the PLC default is
+   positive.
+2. **Per-PLC default.** `PlcOptions.DefaultCacheTtlMs` applies to any tag
+   whose explicit `CacheTtlMs` is `null` (unset). A `0` default means "no
+   caching by default at this PLC."
+3. **Zero.** With nothing set at either tier, the resolved TTL is `0` and
+   the read is uncached.
+
+`BcdTagMap.ResolveCacheTtlMs(startAddress, qty)` implements the per-read
+resolution. It enumerates the BCD tags whose register footprints intersect
+the requested range and returns the smallest TTL across the hits, or `0`
+if any hit is uncached or the range covers no configured tags.
+
+```csharp
+public int ResolveCacheTtlMs(ushort startAddress, ushort qty)
+{
+    if (!TryGetForRange(startAddress, qty, out var hits) || hits.Count == 0)
+        return 0;
+
+    int min = int.MaxValue;
+    foreach (var hit in hits)
+    {
+        int ttl = hit.Tag.CacheTtlMs;
+        if (ttl <= 0) return 0;
+        if (ttl < min) min = ttl;
+    }
+    return min == int.MaxValue ? 0 : min;
+}
+```
+
+The `hit.Tag.CacheTtlMs` value resolved on each `BcdTag` already reflects
+the explicit-then-default order — the options binder resolves the per-tag
+override against the per-PLC default at config build time, so the runtime
+hot path sees a single integer per tag.
+
+## Multi-Tag Range TTL Rule
+
+When a single FC03/FC04 read covers multiple configured BCD tags, the
+effective TTL is the minimum across them:
+
+```text
+range covers tags { A:TTL=500, B:TTL=2000, C:TTL=100 } → effective TTL = 100
+range covers tags { A:TTL=500, B:TTL=0 (uncached)    } → effective TTL = 0
+range covers tags { A:TTL=500 }                        → effective TTL = 500
+range covers no configured tags                        → effective TTL = 0
+```
+
+If any covered tag has `CacheTtlMs = 0`, the whole read is uncached. The
+rationale is conservative-by-design: a multi-tag read whose narrowest TTL
+is, for example, 100 ms cannot be served safely from an entry that was
+stored under a tag with TTL 2 s, because that entry's freshness was only
+guaranteed by the longer window. Rather than partition a range read across
+heterogeneous TTLs or invent inheritance rules that an operator would have
+to reason about per-deployment, the cache refuses to serve any multi-tag
+read whose narrowest covered TTL is zero. Operators who want a tag cached
+in isolation but uncached when read alongside an uncached neighbour get the
+expected behaviour by leaving the neighbour at `CacheTtlMs = 0`.
+
+A read whose range covers no configured BCD tags also resolves to `0`.
+There is nothing to be conservative about because the cache only serves
+ranges that contain rewriter-tracked tags — a read of plain non-BCD
+registers does not engage the cache regardless of any per-PLC default.
+
+## Lookup Order
+
+The multiplexer's FC03/FC04 path consults three tiers in fixed order:
+
+1. **Cache.** When `_ctx.Cache` is wired and `BcdTagMap.ResolveCacheTtlMs`
+   returns a positive TTL for the read range, `ResponseCache.TryGet` is
+   called against a `CacheKey(unitId, fc, startAddress, qty)`. A hit
+   splices the cached payload onto a fresh MBAP header carrying the
+   original upstream TxId, pushes the frame onto that pipe's response
+   channel, and **returns without engaging coalescing or the backend at
+   all**.
+2. **Coalesce.** On a cache miss (or when the resolved TTL is zero), the
+   request is offered to `InFlightByKeyMap.TryAttachOrCreate`. A hit
+   attaches the new party to a peer's in-flight request.
+3. **Backend.** On a coalescing miss, the request opens a proxy TxId,
+   registers a `CorrelationMap` entry, passes through the BCD rewriter
+   (which alters only FC06/FC16 payloads), and joins the outbound channel.
+
+The cache check happens **before** the multiplexer's
+`EnsureBackendConnectedAsync` call. A cache hit serves the upstream even
+when the backend socket is currently disconnected or recovering. This is
+not an accident — the cached payload's freshness is bounded by its TTL,
+not by the liveness of the backend socket. See
+[`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) for
+the operator view of cache-served reads during a backend outage.
+
+## Storage Format: Post-Rewriter Bytes
+
+`CacheEntry.PduBytes` holds the **post-rewriter response PDU body** — the
+function code byte, the byte count, and the rewriter-decoded register
+data, with no MBAP header. The backend reader task decodes the response
+through `BcdPduPipeline` first and only then hands the rewritten payload
+to `ResponseCache.Set`.
+
+```csharp
+internal sealed record CacheEntry(
+    byte[] PduBytes,
+    DateTimeOffset CachedAtUtc,
+    DateTimeOffset ExpiresAtUtc,
+    int Length,
+    long LastUsedTick);
+```
+
+Storing post-rewriter bytes is both a CPU optimisation and a correctness
+guarantee:
+
+- **CPU.** A cache hit returns ready-to-send bytes. The rewriter does not
+  re-run per hit; only the MBAP header is regenerated to carry the
+  upstream's original TxId.
+- **Correctness.** An entry decoded against an earlier rewriter version
+  never gets retroactively re-transformed against a newer version. If the
+  rewriter's behaviour changes mid-process (it does not today, but the
+  guarantee is durable across future changes), existing cached entries
+  age out under their TTL and are replaced by fresh entries decoded
+  through the new rewriter. A bidirectional re-encode never happens to an
+  already-stored entry.
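+
+The header regeneration on a hit might look roughly like the sketch
+below. It follows the standard 7-byte MBAP layout (transaction ID,
+protocol ID, length, unit ID); the `MbapSpliceSketch` class and
+`BuildFrame` method are hypothetical names, not the proxy's actual code.
+
+```csharp
+using System;
+using System.Buffers.Binary;
+
+public static class MbapSpliceSketch
+{
+    // Builds a full Modbus TCP frame: 7-byte MBAP header followed by the cached PDU.
+    public static byte[] BuildFrame(ushort txId, byte unitId, ReadOnlySpan<byte> pdu)
+    {
+        var frame = new byte[7 + pdu.Length];
+        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), txId); // upstream TxId
+        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);    // protocol ID 0 = Modbus
+        // The length field counts the unit ID byte plus the PDU bytes.
+        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), (ushort)(pdu.Length + 1));
+        frame[6] = unitId;
+        pdu.CopyTo(frame.AsSpan(7));
+        return frame;
+    }
+}
+```
+
+Only the seven header bytes are written per hit; the PDU bytes are
+copied as stored, which is why the rewriter does not need to run again.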
+
+## Write Invalidation by Address Range Overlap
+
+A successful (non-exception) FC06 or FC16 response invalidates every
+cached FC03 or FC04 entry whose address range
+`[StartAddress, StartAddress + Qty)` overlaps the write range
+`[writeStart, writeStart + writeQty)`. The pure overlap math lives in
+`CacheInvalidator.FindOverlapping`:
+
+```csharp
+int writeEnd = writeStart + writeQty;   // half-open upper bound
+
+foreach (var key in haystack)
+{
+    if (key.UnitId != unitId) continue;
+    if (key.Fc != 0x03 && key.Fc != 0x04) continue;
+
+    int keyEnd = key.StartAddress + key.Qty;
+    // Overlap iff writeStart < keyEnd AND key.StartAddress < writeEnd.
+    if (writeStart < keyEnd && key.StartAddress < writeEnd)
+        hits.Add(key);
+}
+```
+
+Worked examples on a single unit ID:
+
+```text
+Write to register 105 (qty=1)
+  └─ invalidates cached FC03 [100..110) — register 105 is inside the cached range
+  └─ leaves    cached FC03 [200..210) untouched
+
+Write to registers [10..15) (qty=5)
+  └─ leaves    cached FC03 [15..20) untouched — half-open intervals, 15 is not in [10..15)
+
+Write to registers [98..108) (qty=10)
+  └─ invalidates cached FC03 [100..110) — ranges overlap on [100..108)
+```
+
+Three properties of the invalidator deserve calling out:
+
+- **Exception responses do not invalidate.** A Modbus exception (code 01,
+  02, 03, 04, or any other) means the write did not take effect on the
+  PLC. The cached read is still consistent with the device, so the
+  invalidator is not engaged.
+- **Different unit IDs never invalidate each other.** Multi-drop devices
+  and gateway personalities behind a shared socket expose logically
+  separate Modbus tables, and `CacheKey.UnitId` discriminates between them.
+- **Only FC03 and FC04 entries are evicted.** The cache never stores write
+  responses, so the invalidator's function-code filter is defensive
+  rather than load-bearing.
+
+## Bounded Capacity (LRU)
+
+Each `ResponseCache` instance is capped at `Cache.MaxEntriesPerPlc`
+(default 1000). When the dictionary is at the cap and a fresh insert
+arrives, `EvictLeastRecentlyUsed` walks the entries and removes the one
+with the smallest `CacheEntry.LastUsedTick`. The linear scan is
+intentional — at 1000 entries the scan is cheaper than the network
+round-trip the cache is saving, and a sorted secondary structure would
+add complexity for no measurable win.
+
+`LastUsedTick` is a monotonic 64-bit counter incremented on every hit and
+every fresh insert. Using the counter rather than `DateTimeOffset.UtcNow`
+keeps the hot path free of clock calls and survives wall-clock skew.
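+
+A minimal sketch of the counter-based LRU scan, under simplifying
+assumptions: keys are plain strings, only the tick is stored, and the
+dictionary is not guarded for concurrent access. `LruSketch` and its
+members are hypothetical and do not mirror the real `ResponseCache`.
+
+```csharp
+using System.Collections.Generic;
+using System.Threading;
+
+public sealed class LruSketch
+{
+    private readonly Dictionary<string, long> _lastUsed = new(); // key -> last-used tick
+    private readonly int _maxEntries;
+    private long _tick; // monotonic counter; no clock calls on the hot path
+
+    public LruSketch(int maxEntries) => _maxEntries = maxEntries;
+
+    public int Count => _lastUsed.Count;
+    public bool Contains(string key) => _lastUsed.ContainsKey(key);
+
+    // Records a hit by bumping the entry's tick; returns false on a miss.
+    public bool Touch(string key)
+    {
+        if (!_lastUsed.ContainsKey(key)) return false;
+        _lastUsed[key] = Interlocked.Increment(ref _tick);
+        return true;
+    }
+
+    // Inserts a key, evicting the least recently used entry when at the cap.
+    public void Insert(string key)
+    {
+        if (!_lastUsed.ContainsKey(key) && _lastUsed.Count >= _maxEntries)
+            EvictLeastRecentlyUsed();
+        _lastUsed[key] = Interlocked.Increment(ref _tick);
+    }
+
+    private void EvictLeastRecentlyUsed()
+    {
+        string? victim = null;
+        long oldest = long.MaxValue;
+        foreach (var (key, tick) in _lastUsed) // linear scan over every entry
+        {
+            if (tick < oldest) { oldest = tick; victim = key; }
+        }
+        if (victim is not null) _lastUsed.Remove(victim);
+    }
+}
+```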
+
+A background task drives proactive expiry. The constructor starts a
+`PeriodicTimer` at `Cache.EvictionIntervalMs` (default 5000 ms; values
+under 100 ms are clamped at 100 ms to prevent tight loops) and the
+eviction loop sweeps every entry whose `ExpiresAtUtc` has passed. The
+loop is the safety net that keeps abandoned entries — say, those for a
+PLC whose upstream clients have all dropped — from holding memory until
+process exit. Lazy expiry on `TryGet` still removes entries on demand
+when traffic is steady; the background loop only matters under low- or
+zero-traffic conditions.
+
+## Long-TTL Safety Gate
+
+`MbproxyOptionsValidator.ValidateCacheTtl` rejects any explicit
+`CacheTtlMs > 60_000` unless `Cache.AllowLongTtl = true`. The same gate
+applies to `PlcOptions.DefaultCacheTtlMs`. The rejection runs at config
+bind / hot-reload time, so a misconfigured `appsettings.json` fails fast
+before the cache sees the value.
+
+The gate exists to catch the "left at 1 hour by accident" mistake — a
+deployment where a developer set `CacheTtlMs = 3_600_000` for a debugging
+session and the value survived into production. Operators who legitimately
+need long TTLs (slow-moving setpoints, configuration values that change
+once per shift) flip `Cache.AllowLongTtl` to `true` as the explicit
+acknowledgement that the long staleness window is intentional.
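+
+The gate reduces to a single comparison. The sketch below is
+illustrative only: `LongTtlGateSketch`, `Validate`, and the message text
+are hypothetical, and the real `ValidateCacheTtl` may report failures
+differently.
+
+```csharp
+public static class LongTtlGateSketch
+{
+    public const int LongTtlThresholdMs = 60_000;
+
+    // Returns null when the TTL is acceptable, otherwise a validation message.
+    public static string? Validate(int cacheTtlMs, bool allowLongTtl)
+    {
+        if (cacheTtlMs > LongTtlThresholdMs && !allowLongTtl)
+            return $"CacheTtlMs {cacheTtlMs} exceeds {LongTtlThresholdMs} ms; " +
+                   "set Cache.AllowLongTtl = true to allow it.";
+        return null;
+    }
+}
+```
+
+A TTL of exactly `60_000` passes without the flag; `60_001` does not.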
+
+## Cache and the Rewriter
+
+The BCD rewriter runs **once** on the cache-miss path: the backend reader
+task decodes the response through `BcdPduPipeline` and only then hands the
+decoded bytes to `ResponseCache.Set`. Cache hits return the stored
+post-rewriter bytes directly.
+
+This division has two consequences worth restating:
+
+- **The rewriter cost is amortised across hits.** A high cache hit ratio
+  on a tag-dense PLC drops the per-request rewriter cost from "every
+  response" to "every cache-miss response," which on a hot register at
+  TTL=500 ms is one-in-many.
+- **The cached payload is decoupled from the rewriter implementation.**
+  An entry stored under one rewriter does not get re-transformed if the
+  rewriter changes. Entries age out under TTL and are replaced by fresh
+  entries decoded under the current rewriter — there is no in-place
+  recomputation pass.
+
+## Hot-Reload Semantics
+
+Configuration changes propagate through `IOptionsMonitor<MbproxyOptions>`.
+The cache reacts to five kinds of change:
+
+| Change | Cache behaviour |
+|--------|----------------|
+| Tag's `CacheTtlMs` changed (`0 → N`, `N → 0`, `N → M`) | Entire PLC cache is flushed via `ResponseCache.Clear()`; entries re-populate on demand under the new TTL. |
+| New PLC added / removed | New PLC starts with an empty cache; removed PLC's `ResponseCache` is disposed with the multiplexer. |
+| `Cache.AllowLongTtl` flipped | Validation runs on the next reload only; existing entries are unaffected. |
+| `Cache.MaxEntriesPerPlc` changed | Existing entries are unaffected; the new cap applies to subsequent inserts. |
+| `Cache.EvictionIntervalMs` changed | Existing eviction loop continues with its old period; subsequent loops use the new interval. |
+
+Per-tag flush granularity is intentionally not implemented. The clean move
+is "any tag-list change to a PLC → drop every entry for that PLC and let
+the natural traffic re-populate." Tracking which keys correspond to which
+tag IDs adds bookkeeping for no operational win — a tag-list reload is
+already a once-in-a-while event, and the rebuild cost on the affected
+PLC's hot keys is one round-trip per key under traffic.
+
+See [`../Features/HotReload.md`](../Features/HotReload.md) for the
+broader `IOptionsMonitor` propagation model.
+
+## Cache Survives Backend Disconnects
+
+A cached entry's data was valid when stored. A subsequent backend
+disconnect does not retroactively invalidate it — the value the upstream
+client sees on a hit is the value the PLC reported within the TTL
+window, irrespective of whether the backend socket is up at the moment
+of the hit. This is the cache's most operationally visible property
+during PLC outages: upstream consumers that read hot tags within the
+cache window continue to receive responses while the listener supervisor
+is in `recovering` state.
+
+The companion rule on the write side keeps the invariant consistent:
+**invalidations during a `recovering` listener state are skipped**. If
+the backend is down, an FC06 or FC16 write did not reach the PLC, so the
+cached read is still consistent with the device's actual state. Skipping
+the invalidation matches reality — the write did not take effect, so the
+read is not stale.
+
+## No Persistence
+
+The cache is purely in-memory. Process restart wipes every entry. There
+is no file-backed snapshot, no Redis or other external store, and no
+last-known-good replay. A restarted service rebuilds its cache from
+fresh backend round-trips driven by upstream traffic, exactly as it
+would after a TTL-induced flush.
+
+This is intentional, for two reasons. First, the staleness contract is bounded
+by `CacheTtlMs` measured from when the data was first read, and a
+persisted entry would re-emerge with an unknown wall-clock age — every
+invariant the cache offers would need a freshness field, freshness
+arithmetic on load, and recovery against a clock that may have jumped.
+Second, the operational model is that the proxy is a stateless
+transformer; treating its cache as durable state would change the
+deployment story for no measurable production benefit.
+
+## Counter Accounting
+
+`ProxyCounters` exposes five cache counters per PLC, plus a derived hit
+ratio, surfaced on the status page as both per-PLC and fleet-aggregate
+values:
+
+- **`cacheHitCount`** — FC03/FC04 requests served from the cache. Bumped
+  inside `OnUpstreamFrameAsync` when `ResponseCache.TryGet` returns true.
+- **`cacheMissCount`** — FC03/FC04 requests whose resolved TTL was
+  positive but whose key was not in the cache (or whose entry had
+  expired). The identity `cacheHitCount + cacheMissCount = total
+  cache-eligible FC03/FC04 requests` holds — reads whose effective TTL
+  is `0` (uncached) increment neither counter.
+- **`cacheHitRatio`** — derived on the status page snapshot as
+  `cacheHitCount / (cacheHitCount + cacheMissCount)` when the
+  denominator is non-zero.
+- **`cacheInvalidations`** — count of cache entries invalidated by
+  successful FC06/FC16 write responses, summed across writes.
+- **`cacheEntryCount`** — point-in-time snapshot of
+  `ResponseCache.Count` (Tier-2 memory-watch KPI).
+- **`cacheBytes`** — point-in-time approximation of cached PDU bytes,
+  computed as the running sum of `CacheEntry.Length` across entries
+  (Tier-2 memory-watch KPI).
+
+The structured log events `mbproxy.cache.hit`, `mbproxy.cache.miss`,
+`mbproxy.cache.store`, `mbproxy.cache.invalidated`, and
+`mbproxy.cache.flushed` (defined in `CacheLogEvents`) mirror the counter
+increments at Debug level for incident-time diagnosis. Counters are the
+steady-state observability surface; the events are for tracing one
+request through the cache when something looks wrong. See
+[`../Operations/StatusPage.md`](../Operations/StatusPage.md) and
+[`../Reference/LogEvents.md`](../Reference/LogEvents.md).
+
+## Design-Contract Note
+
+The cache changes the proxy's posture from "purely transparent except
+for BCD rewriting" to "transparent by default, with an opt-in cache
+layer." The transition is deliberate and operator-driven: setting
+`CacheTtlMs > 0` on a tag is the explicit consent to the staleness
+window, and a deployment that ships no positive TTLs is observationally
+indistinguishable from one compiled without the cache code path.
+
+There is no global switch, no implicit warm-up, and no behavioural
+divergence from the transparent baseline until the operator opts in
+tag-by-tag. The cache is the only place in the proxy where an upstream
+read can resolve to a value that did not just round-trip the wire, and
+its engagement is gated entirely by the per-tag and per-PLC TTL
+configuration described above.
+
+## Related Documentation
+
+- [`./ConnectionModel.md`](./ConnectionModel.md) — TxId multiplexing,
+  correlation map, and the backend socket the cache short-circuits on a
+  hit.
+- [`./ReadCoalescing.md`](./ReadCoalescing.md) — sits below the cache in
+  the lookup order; cache hits short-circuit coalescing entirely.
+- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — the
+  `BcdPduPipeline` whose post-decode bytes the cache stores.
+- [`../Features/HotReload.md`](../Features/HotReload.md) — the
+  `IOptionsMonitor` propagation that drives the per-PLC flush on
+  tag-list change.
+- [`../Operations/Configuration.md`](../Operations/Configuration.md) —
+  binding for `BcdTagOptions.CacheTtlMs`,
+  `PlcOptions.DefaultCacheTtlMs`, and the `Cache` section
+  (`AllowLongTtl`, `MaxEntriesPerPlc`, `EvictionIntervalMs`).
+- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — exposes
+  `cacheHitCount`, `cacheMissCount`, `cacheHitRatio`,
+  `cacheInvalidations`, `cacheEntryCount`, and `cacheBytes`.
+- [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md)
+  — the operator view of cache-served reads while a backend is in
+  `recovering` state.
+- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — full
+  `mbproxy.cache.*` event catalogue with event IDs.
+- [`../Testing/Simulator.md`](../Testing/Simulator.md) — the
+  `pymodbus` DL205 stand-in used by the end-to-end cache tests.
+- [`../design.md`](../design.md) — canonical design decisions and
+  rationale.
diff --git a/mbproxy/docs/Features/BcdRewriting.md b/mbproxy/docs/Features/BcdRewriting.md
new file mode 100644
index 0000000..aab4f3a
--- /dev/null
+++ b/mbproxy/docs/Features/BcdRewriting.md
@@ -0,0 +1,252 @@
+# BCD Rewriting
+
+The BCD rewriter is the inline codec that translates DirectLOGIC's native Binary-Coded Decimal register values to and from plain binary integers on every relevant Modbus TCP PDU. It is the one place in the proxy that knows which registers are BCD, so upstream consumers can treat the wire as plain `Int16` / `Int32`.
+
+## Why BCD Rewriting Exists
+
+The DL205 / DL260 family stores numeric V-memory register values in native BCD, not binary. The decimal integer `1234` in `V2000` lands on the Modbus wire as `0x1234` (nibbles `1`, `2`, `3`, `4`) — not as the binary `0x04D2`. See [`../../DL260/dl205.md`](../../DL260/dl205.md) for the device-side rationale and the V-memory ↔ Modbus translation rules.
+
+Upstream consumers (Wonderware, Historian, OPC UA gateways, generic Modbus clients written against the standard) expect plain binary integers. Asking every consumer to BCD-decode the wire is brittle: each consumer would carry the same tag list, the same word-order quirks, and the same risk of drift. The rewriter centralises that translation so the rest of the world sees plain `Int16` / `Int32` and the proxy is the single source of truth for "which addresses are BCD."
+
+The rewriter touches only the BCD slots declared in configuration. Every other byte of the PDU — non-BCD registers, coils, discrete inputs, diagnostic function codes, exception responses — passes through unchanged. MBAP transaction IDs, unit IDs, and the MBAP length field are preserved end-to-end; the rewriter only re-encodes payload bytes whose width does not change.
+
+## CDAB Word Order for 32-Bit Values
+
+A 32-bit BCD value spans a register pair at `Address` and `Address+1` in CDAB (low-word-first) order:
+
+- The register at `Address` holds the **low 4 BCD digits**.
+- The register at `Address+1` holds the **high 4 BCD digits**.
+- Decoded decimal = `Decode16(high) * 10_000 + Decode16(low)`.
+
+This follows directly from DirectLOGIC's CDAB word convention (see [`../../DL260/dl205.md`](../../DL260/dl205.md) → Word Order).
+
+Worked example — the register pair `[0x1234][0x5678]` reads on the wire as the low word `0x1234` first and the high word `0x5678` second:
+
+```text
+Address:    raw 0x1234 → low  4 digits = 1234
+Address+1:  raw 0x5678 → high 4 digits = 5678
+
+Decoded decimal = 5678 * 10_000 + 1234 = 56_781_234
+```
+
+`BcdCodec.Encode32` and `BcdCodec.Decode32` in [`../../src/Mbproxy/Bcd/BcdCodec.cs`](../../src/Mbproxy/Bcd/BcdCodec.cs) implement this in both directions. `Encode32(12_345_678)` returns `(low: 0x5678, high: 0x1234)`.
+
+The 16-bit codec is a straight nibble pack / unpack:
+
+```csharp
+// From BcdCodec.cs — Encode16 packs four decimal digits into four BCD nibbles.
+int d3 = value / 1000;
+int d2 = (value / 100) % 10;
+int d1 = (value / 10)  % 10;
+int d0 = value         % 10;
+return (ushort)((d3 << 12) | (d2 << 8) | (d1 << 4) | d0);
+```
+
+`Decode16` is the reverse, with a `HasBadNibble` guard that throws `FormatException` if any nibble is `>= 0xA`. The Phase-04 rewrite pipeline catches the exception and surfaces it as a `mbproxy.rewrite.invalid_bcd` warning event instead of corrupting the payload.
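+
+The 16-bit decode with its nibble guard, and the CDAB composition for 32-bit values, can be sketched as follows. This is an illustration under the rules described above, not a copy of `BcdCodec.cs`; the `BcdDecodeSketch` class name and the exception message are assumptions.
+
+```csharp
+using System;
+
+public static class BcdDecodeSketch
+{
+    // True if any of the four nibbles is outside the decimal digit range (0xA..0xF).
+    public static bool HasBadNibble(ushort raw)
+    {
+        for (int shift = 0; shift < 16; shift += 4)
+            if (((raw >> shift) & 0xF) >= 0xA) return true;
+        return false;
+    }
+
+    // Unpacks four BCD nibbles into a decimal value in [0, 9999].
+    public static int Decode16(ushort raw)
+    {
+        if (HasBadNibble(raw))
+            throw new FormatException($"Invalid BCD nibble in 0x{raw:X4}.");
+        return ((raw >> 12) & 0xF) * 1000
+             + ((raw >> 8) & 0xF) * 100
+             + ((raw >> 4) & 0xF) * 10
+             + (raw & 0xF);
+    }
+
+    // CDAB: the low word carries the low four digits, the high word the high four.
+    public static int Decode32(ushort low, ushort high)
+        => Decode16(high) * 10_000 + Decode16(low);
+}
+```
+
+With this sketch, `Decode32(0x1234, 0x5678)` reproduces the worked example above (`56_781_234`), and a raw register such as `0x12A4` throws `FormatException`.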
+
+## BCD Tag Configuration Shape
+
+Every BCD register the rewriter handles is described by a `BcdTag` record from [`../../src/Mbproxy/Bcd/BcdTag.cs`](../../src/Mbproxy/Bcd/BcdTag.cs):
+
+```csharp
+public sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs = 0)
+{
+    public bool IsThirtyTwoBit => Width == 32;
+    public ushort HighRegister => /* Address + 1 for 32-bit tags */;
+}
+```
+
+- `Address` is the **Modbus PDU register address** (zero-based, decimal). Configuration must translate from octal V-memory to PDU-decimal before reaching this record: `V2000` octal = decimal 1024 = `0x0400`. The proxy does not perform that translation itself.
+- `Width` is `16` (single register) or `32` (CDAB register pair at `Address` and `Address+1`). `BcdTag.Create` rejects any other width.
+- `CacheTtlMs` is the Phase-11 response-cache opt-in (covered separately in [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md)); it has no effect on rewriter behaviour.
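+
+Because the proxy leaves the octal translation to whoever writes the configuration, a small helper can make the arithmetic explicit. The sketch below assumes the direct octal-to-decimal mapping shown in the `V2000` example; `VMemoryAddressSketch` and `ToPduAddress` are hypothetical names, and the authoritative mapping rules live in the DL205 reference material.
+
+```csharp
+using System;
+
+public static class VMemoryAddressSketch
+{
+    // Converts an octal V-memory label such as "V2000" to a zero-based PDU register address.
+    public static ushort ToPduAddress(string vMemory)
+    {
+        string octal = vMemory.TrimStart('V', 'v');
+        // Convert.ToInt32 with base 8 parses the octal digits; checked() rejects values above 65535.
+        return checked((ushort)Convert.ToInt32(octal, 8));
+    }
+}
+```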
+
+The wire-format options shape lives in [`../../src/Mbproxy/Options/BcdTagOptions.cs`](../../src/Mbproxy/Options/BcdTagOptions.cs) and [`../../src/Mbproxy/Options/BcdTagListOptions.cs`](../../src/Mbproxy/Options/BcdTagListOptions.cs). Configured tags resolve through `BcdTagMapBuilder.Build` (see [`../../src/Mbproxy/Bcd/BcdTagMapBuilder.cs`](../../src/Mbproxy/Bcd/BcdTagMapBuilder.cs)) into an immutable `BcdTagMap` ([`../../src/Mbproxy/Bcd/BcdTagMap.cs`](../../src/Mbproxy/Bcd/BcdTagMap.cs)) per PLC.
+
+Holding-register (FC03) and input-register (FC04) addresses share the **same** configured tag space. The DL205 / DL260 family surfaces V-memory through both tables, so the rewriter applies the configured tag list against both FC03 and FC04 responses.
+
+## Function-Code Scope Table
+
+The rewriter touches payloads only for the function codes below. Every other FC — coils (FC01, FC05, FC15), discrete inputs (FC02), diagnostics, exception responses — passes through byte-for-byte.
+
+| FC | Direction | Action |
+|----|-----------|--------|
+| 03 | Request | Pass through (read; no payload rewrite needed) |
+| 03 | Response | Re-encode covered BCD slots from raw nibbles → binary integer |
+| 04 | Request | Pass through |
+| 04 | Response | Same as FC03 response |
+| 06 | Request | Re-encode binary integer → BCD nibbles before forwarding |
+| 06 | Response | Decode BCD nibbles → binary integer on the echo (NModbus-style clients validate the echo and would throw otherwise) |
+| 16 | Request | Re-encode binary integer → BCD nibbles, per register, over the configured slots |
+| 16 | Response | Pass through (the response carries only start+qty, not values) |
+
+The FC06 response decode is non-obvious: the PLC echoes back the value it actually wrote, which is now BCD-encoded because the proxy rewrote the request on the way in. Clients that validate the echo equals the value they sent (NModbus and similar libraries do this) would throw on the round-trip if the proxy did not decode the echo back.
+
+`BcdPduPipeline.Process` dispatches on direction first, then on FC:
+
+```csharp
+public void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader,
+                    Span<byte> pdu, PduContext context)
+{
+    if (context is not PerPlcContext ctx) return;
+    if (pdu.Length < 1) return;
+
+    byte fc = pdu[0];
+    ctx.Counters.IncrementPdusForwarded();
+    ctx.Counters.IncrementFcCount(fc);
+
+    if (direction == MbapDirection.RequestToBackend)
+        ProcessRequest(fc, pdu, ctx);
+    else
+        ProcessResponse(fc, pdu, ctx);
+}
+```
+
+`PerPlcContext` carries the `BcdTagMap`, the per-PLC `ProxyCounters`, the logger, and the matched `InFlightRequest` from the multiplexer's correlation map. If a caller passes a plain `PduContext` (e.g. a test harness using `NoopPduPipeline` alongside the BCD pipeline), the rewriter returns without touching the PDU.
+
+## Partial-Overlap Policy
+
+A request that touches only **one** register of a configured 32-bit BCD pair cannot be re-encoded correctly. There are two shapes:
+
+1. An FC03 / FC04 read whose range covers the low address but not the high address (`qty=1` at the low address) or vice versa.
+2. An FC06 write to either the low or high address of a 32-bit pair, or an FC16 write whose range covers only one of the two registers.
+
+In every case the rewriter **passes the PDU through raw** and emits a `mbproxy.rewrite.partial_bcd` warning. The `PartialBcdWarnings` counter increments per occurrence.
+
+The proxy never synthesises a Modbus exception for a partial-overlap. Exception response codes are reserved for transport failure (the per-request watchdog manufactures `0x0B` Gateway Target Device Failed To Respond; the PLC itself produces `0x01`–`0x04`). Using an exception code to signal a configuration / client mismatch would conflate "the device or the path failed" with "the client straddled a 32-bit boundary," and operators chasing the exception would look at the wrong layer.
+
+The rationale for warn-plus-passthrough rather than silent rewrite: silently rewriting only the half the client touched would corrupt the value (a 16-bit BCD encode of a 32-bit binary integer is meaningless). A warning-plus-raw passthrough surfaces the misconfiguration loudly while leaving the client to discover the mismatch in its own data path.
+
+The FC16 request path makes the partial-overlap decision per-tag inside its loop over `TryGetForRange` hits:
+
+```csharp
+if (tag.IsThirtyTwoBit)
+{
+    bool lowInRange  = offsetWords >= 0 && offsetWords < qty;
+    bool highInRange = (offsetWords + 1) >= 0 && (offsetWords + 1) < qty;
+
+    if (!lowInRange || !highInRange)
+    {
+        RewriterLogEvents.PartialBcd(ctx.Logger, ctx.PlcName,
+            tag.Address, startAddress, qty);
+        ctx.Counters.IncrementPartialBcd();
+        continue;
+    }
+    // ...both registers in range — reconstruct, encode, write back...
+}
+```
+
+For a 32-bit FC16 write where both registers are in range, the rewriter reconstructs the client's 32-bit binary value from the CDAB pair (`(clientHigh << 16) | clientLow`), runs `BcdCodec.Encode32` to produce the BCD register pair, and writes both registers back to the PDU buffer in place.
+
+## Unsigned Only
+
+DL205 / DL260 BCD is non-negative in the default ladder pattern. `BcdCodec.Encode16` rejects values outside `[0, 9999]`; `BcdCodec.Encode32` rejects values outside `[0, 99_999_999]`. The rewriter does not implement signed BCD; signed conventions vary by site and any value out of range surfaces as `mbproxy.rewrite.invalid_bcd` rather than being silently coerced.
+
+## Exception Pass-Through
+
+Modbus exception responses pass through unchanged. The rewriter detects an exception response by the high bit of the function code (`(fc & 0x80) != 0`), emits a `mbproxy.rewrite.exception_passthrough` event, increments the per-FC exception counter, and returns without touching the payload.
+
+Covered exception codes:
+
+- `0x01` Illegal Function
+- `0x02` Illegal Data Address
+- `0x03` Illegal Data Value
+- `0x04` Server Device Failure
+- `0x0B` Gateway Target Device Failed To Respond — manufactured by the per-request watchdog when a correlation entry ages past `Connection.BackendRequestTimeoutMs`. The rewriter does not distinguish proxy-manufactured from PLC-originated exception codes; both pass through identically.
+
+The rewriter increments `Counters.IncrementBackendException(exceptionCode)` per exception so the four common codes surface on the status page through `ExceptionCounts` (`Code01`, `Code02`, `Code03`, `Code04`). The Gateway-Target `0x0B` is also recorded but is more usefully traced through the watchdog log events rather than the per-code counter slot.
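+
+The detection step can be sketched as below. It relies only on the standard Modbus exception layout (function code with the high bit set, followed by a one-byte exception code); `ExceptionPduSketch` and `TryGetExceptionCode` are hypothetical names, not the rewriter's actual API.
+
+```csharp
+using System;
+
+public static class ExceptionPduSketch
+{
+    // Returns true and the exception code when the PDU is a Modbus exception response.
+    public static bool TryGetExceptionCode(ReadOnlySpan<byte> pdu, out byte exceptionCode)
+    {
+        exceptionCode = 0;
+        if (pdu.Length < 2) return false;
+        if ((pdu[0] & 0x80) == 0) return false; // high bit clear: normal response
+        exceptionCode = pdu[1];
+        return true;
+    }
+}
+```
+
+Note the parentheses around `pdu[0] & 0x80`: in C#, `==` and `!=` bind tighter than `&`, so the unparenthesised form does not compile.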
+
+## Where the Rewriter Runs in the Pipeline
+
+The rewriter is implemented as `BcdPduPipeline` in [`../../src/Mbproxy/Proxy/BcdPduPipeline.cs`](../../src/Mbproxy/Proxy/BcdPduPipeline.cs), registered as the singleton `IPduPipeline` in production. The class is stateless; per-call state arrives via the `PerPlcContext` passed into `Process`, which carries the `BcdTagMap`, the per-PLC counters, the logger, and (on the response path) the matched `InFlightRequest` from the multiplexer's correlation map.
+
+Per-PLC pipeline ordering:
+
+```text
+Upstream request →
+    [cache lookup (Phase 11)] →
+    [coalesce check (Phase 10)] →
+    [BCD rewriter — request path] →
+    backend send
+
+Backend response →
+    [BCD rewriter — response path] →
+    [response-cache populate (Phase 11)] →
+    [fanout to all coalesced parties]
+```
+
+The rewriter runs **once per request** on the multiplexer's outbound path and **once per response** on the inbound path. Per-party MBAP TxId restoration happens after the rewriter on fanout, so the rewriter only ever sees the canonical (shared) PDU buffer.
+
+For Phase-11 cache hits, the response cache stores **POST-rewriter bytes** — the rewriter is bypassed on hits, both as a CPU optimisation and as a correctness guarantee (a future rewriter change does not retroactively re-transform an entry that was decoded against an earlier rewriter version). See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
+
+On the response path, the rewriter cannot infer the original `(StartAddress, Qty)` of an FC03 / FC04 read from the response alone — the response carries only `[fc][byteCount][reg0Hi][reg0Lo]...`. The multiplexer's `CorrelationMap` keys the matched `InFlightRequest` to the response and attaches it to `PerPlcContext.CurrentRequest` before invoking the rewriter, so concurrent responses from different upstream clients each decode against their own request range without cross-talk. If `CurrentRequest` is null (e.g. a unit-test fixture invoking the pipeline directly) the rewriter passes the response bytes through unchanged.
+
+## Hybrid Tag Resolution
+
+For each PLC, the effective BCD tag list is `(Global − Remove) ∪ Add`, resolved by `BcdTagMapBuilder.Build` in this order:
+
+1. Seed the working set from `BcdTagListOptions.Global`.
+2. Apply `PlcBcdOverrides.Remove` — drop every address listed. `Remove` matches by address only; width is irrelevant.
+3. Apply `PlcBcdOverrides.Add` — insert each entry into the working set. If an address already exists from `Global`, the `Add` entry **wins** (this is how a per-PLC width override is expressed: list the same address in `Add` with a different `Width`).
+
+The shapes are declared in [`../../src/Mbproxy/Options/BcdTagListOptions.cs`](../../src/Mbproxy/Options/BcdTagListOptions.cs):
+
+```csharp
+public sealed class BcdTagListOptions
+{
+    public IReadOnlyList<BcdTagOptions> Global { get; init; } = [];
+}
+
+public sealed class PlcBcdOverrides
+{
+    public IReadOnlyList<BcdTagOptions> Add { get; init; } = [];
+    public IReadOnlyList<ushort> Remove { get; init; } = [];
+}
+```
+
+Resolution produces a `ValidationResult` carrying the resolved `BcdTagMap`, a list of `BcdError` entries, and a list of `BcdWarning` entries. Callers treat any non-empty `Errors` list as a fatal configuration problem for that PLC.
+
+The user-facing syntax for `Global` + per-PLC `Add` / `Remove` is documented in [`../Operations/Configuration.md`](../Operations/Configuration.md).
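+
+The three resolution steps can be sketched as follows. This is a simplified illustration: `TagSpec` and `TagResolutionSketch` are hypothetical, and the sketch silently overwrites duplicates instead of reporting them the way the real builder's validation does.
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+
+public sealed record TagSpec(ushort Address, byte Width);
+
+public static class TagResolutionSketch
+{
+    // Applies Global, then Remove, then Add; an Add entry wins on an address collision.
+    public static IReadOnlyList<TagSpec> Resolve(
+        IEnumerable<TagSpec> global, IEnumerable<ushort> remove, IEnumerable<TagSpec> add)
+    {
+        var working = new SortedDictionary<ushort, TagSpec>();
+        foreach (var tag in global) working[tag.Address] = tag;       // 1. seed from Global
+        foreach (var address in remove) working.Remove(address);      // 2. drop by address only
+        foreach (var tag in add) working[tag.Address] = tag;          // 3. insert or override
+        return working.Values.ToList();
+    }
+}
+```
+
+Because `Add` is applied last, listing an address in both `Remove` and `Add` leaves the `Add` entry in the resolved set.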
+
+`BcdTagMap.TryGetForRange` is the hot-path range scan used by both the request and response paths. It returns every `BcdTag` whose register footprint intersects `[startAddress, startAddress + qty)`, each carrying its zero-based word offset `OffsetWords` relative to `startAddress`. A 32-bit tag whose low word starts **before** the range but whose high word lies inside the range returns with a **negative** `OffsetWords`; that is the partial-overlap signal the rewriter consumes when deciding whether to re-encode or warn. The no-hit path returns the empty-list singleton without allocating.
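+
+A simplified version of the range scan, showing how a negative offset arises, might look like this. The sketch uses a plain linear scan and allocates a list on every call, unlike the real hot path; `RangeTag`, `RangeHit`, and `FindInRange` are hypothetical names.
+
+```csharp
+using System.Collections.Generic;
+
+public sealed record RangeTag(ushort Address, byte Width);
+public readonly record struct RangeHit(RangeTag Tag, int OffsetWords);
+
+public static class RangeScanSketch
+{
+    // Returns tags whose footprint intersects [start, start + qty), with their word offset.
+    public static List<RangeHit> FindInRange(IEnumerable<RangeTag> tags, int start, int qty)
+    {
+        var hits = new List<RangeHit>();
+        int end = start + qty;
+        foreach (var tag in tags)
+        {
+            int registers = tag.Width == 32 ? 2 : 1;
+            int tagEnd = tag.Address + registers;
+            if (tag.Address < end && start < tagEnd)
+                // The offset is negative when the tag's low word precedes the range.
+                hits.Add(new RangeHit(tag, tag.Address - start));
+        }
+        return hits;
+    }
+}
+```
+
+For a 32-bit tag at address 100 and a read of `[101, 106)`, the hit carries `OffsetWords = -1`, so the low word is out of range and the pair is treated as a partial overlap.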
+
+## Validation at Startup and Hot-Reload
+
+`BcdTagMapBuilder.Build` runs the same validation pipeline at process start and on every hot-reload of `appsettings.json`. The validation results fall into four buckets (three fatal error kinds and one non-fatal warning), defined in [`../../src/Mbproxy/Bcd/BcdValidationError.cs`](../../src/Mbproxy/Bcd/BcdValidationError.cs):
+
+- `BcdValidationError.DuplicateAddress` — the same address appears more than once in the **resolved** list (after `Remove` and `Add` have been applied). Fatal error; the entry is excluded from the map.
+- `BcdValidationError.OverlappingHighRegister` — a 32-bit entry's high register (`Address+1`) collides with the `Address` of a separate entry in the resolved list. Fatal error.
+- `BcdValidationError.InvalidWidth` — an entry's `Width` is not `16` or `32`. Fatal error; the entry is excluded.
+- `BcdWarning` — a `Remove` entry whose address does not appear in `Global`. Non-fatal, but typically indicates stale configuration (the global entry was removed without cleaning up the per-PLC override).
+
+A successful hot-reload that changes the resolved tag list reseats the per-PLC `BcdTagMap` and, for Phase 11, flushes the entire PLC response cache (see [`./HotReload.md`](./HotReload.md)). In-flight requests already past the rewriter are not retroactively re-rewritten; the next PDU sees the new map. A failed validation rejects the reload as a whole and the previous map stays in effect.
+
+## Counter Accounting
+
+The rewriter feeds two counters that surface on the status page:
+
+- `pdus.rewrittenSlots` — `RewrittenSlots` on `PlcPdusStatus`, incremented per re-encoded register. A 32-bit BCD pair counts as 2 slots; a 16-bit tag counts as 1. The FC06 echo decode is **not** counted to avoid double-counting the FC06 request that already incremented the slot on the way out.
+- `pdus.partialBcdWarnings` — `PartialBcdWarnings` on `PlcPdusStatus`, incremented once per partial-overlap event (request or response path).
+
+An out-of-range value (`< 0` or `> 9999` for 16-bit; `< 0` or `> 99_999_999` for 32-bit) on a write, or a bad nibble (`>= 0xA`) on a read, increments an internal invalid-BCD counter and emits `mbproxy.rewrite.invalid_bcd` at warning. The PDU passes through raw in that case; the rewriter never substitutes a value the client did not send (writes) or the PLC did not return (reads).
+
+Both counters are exposed on the status page; see [`../Operations/StatusPage.md`](../Operations/StatusPage.md). The corresponding log events (`mbproxy.rewrite.partial_bcd`, `mbproxy.rewrite.invalid_bcd`, `mbproxy.rewrite.exception_passthrough`) are catalogued in [`../Reference/LogEvents.md`](../Reference/LogEvents.md). Partial-overlap troubleshooting is covered in [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md).
+
+The `dl205.json` pymodbus simulator profile encodes BCD test fixtures used by the integration test suite; see [`../Testing/Simulator.md`](../Testing/Simulator.md).
+
+A few invariants the rewriter relies on and the test suite enforces:
+
+- The MBAP length field is **never** modified. Every re-encoded slot is the same byte width as the original (16-bit register in, 16-bit register out), so the PDU length is byte-stable.
+- The rewriter is **stateless** at the class level. `BcdPduPipeline` holds no fields; everything per-call arrives via `PerPlcContext`. The same instance is safe to call concurrently from multiple upstream-read tasks and the single backend reader task on a given multiplexer.
+- The rewriter operates on the canonical (shared) PDU buffer. Per-party MBAP TxId restoration on coalesced fanout happens **after** the rewriter, so any per-party byte copy only happens when fanout has more than one party.
+
+## Related Documentation
+
+- [`../Architecture/Overview.md`](../Architecture/Overview.md) — service-wide architecture and per-PLC pipeline shape
+- [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) — Phase-11 response cache; the cache stores post-rewriter bytes and bypasses the rewriter on hits
+- [`./HotReload.md`](./HotReload.md) — hot-reload semantics for BCD tag-list changes
+- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `BcdTags.Global` and per-PLC `Add` / `Remove` syntax
+- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `pdus.rewrittenSlots` and `pdus.partialBcdWarnings` exposure
+- [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) — diagnosing partial-overlap warnings
+- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.rewrite.*` event catalogue
+- [`../Testing/Simulator.md`](../Testing/Simulator.md) — the `dl205.json` simulator profile that encodes BCD test fixtures
+- [`../../DL260/dl205.md`](../../DL260/dl205.md) — DL205 / DL260 BCD encoding, CDAB word order, and V-memory ↔ Modbus translation
diff --git a/mbproxy/docs/Features/HotReload.md b/mbproxy/docs/Features/HotReload.md
new file mode 100644
index 0000000..7df3275
--- /dev/null
+++ b/mbproxy/docs/Features/HotReload.md
@@ -0,0 +1,189 @@
+# Hot Reload
+
+A save to `appsettings.json` propagates to a running `mbproxy` without restarting the service. This document explains the mechanism, the reconcile pipeline, and what each configuration change does to the running state.
+
+## How Reload Works
+
+`Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`. Every consumer reads its options through `IOptionsMonitor<MbproxyOptions>` instead of capturing a one-shot `IOptions<T>` snapshot at construction. When the framework's `FileSystemWatcher` sees the file change, it re-parses the JSON, re-binds the option tree, and notifies subscribers through `IOptionsMonitor.OnChange`.
+
+The chosen mechanism is deliberate. There is no custom file watcher, no IPC channel, no admin-port mutation endpoint, and no SIGHUP-style trigger. An operator edits the file in place (or a deployment tool atomically rewrites it) and the running service catches up. The reload contract is identical whether the service is running interactively or as a Windows Service under the SCM.
+
+The `OnChange` callback can fire multiple times for a single logical save because text editors on Windows commonly use a rename-and-replace pattern that produces two or three `FileSystemWatcher` events. The reconciler debounces these inside its own background loop with a 250 ms quiescent window so a single save produces a single apply.
+
+### Debounce window
+
+The debounce window is held in `ConfigReconciler.DebounceWindow = TimeSpan.FromMilliseconds(250)`. The loop reads from the change channel, then keeps re-arming a linked `CancellationTokenSource` with a 250 ms expiry and waits again. As long as new signals keep arriving inside the window, the loop drains them and keeps waiting. When the window elapses with no new signal the loop falls through and calls `ApplyAsync` against `IOptionsMonitor.CurrentValue`. The window is short enough that operators perceive saves as instant and long enough to absorb every editor save pattern observed in practice (rename-and-replace, write-truncate-write, Notepad, Visual Studio Code, PowerShell `Set-Content`).
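+
+A minimal sketch of the debounce loop described above, assuming a bounded `Channel<bool>` as the change channel. The names `signals`, `debounceWindow`, `DebounceLoopAsync`, and the `applies` counter are illustrative stand-ins, not the actual `ConfigReconciler` members.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Channels;
+using System.Threading.Tasks;
+
+// Illustrative stand-in for the reconciler's change channel.
+var signals = Channel.CreateBounded<bool>(new BoundedChannelOptions(1)
+{
+    FullMode = BoundedChannelFullMode.DropOldest
+});
+var debounceWindow = TimeSpan.FromMilliseconds(250);
+var applies = 0;
+
+async Task DebounceLoopAsync(CancellationToken stop)
+{
+    while (await signals.Reader.WaitToReadAsync(stop))
+    {
+        var quiet = false;
+        while (!quiet)
+        {
+            while (signals.Reader.TryRead(out _)) { }    // drain the burst
+            using var window = CancellationTokenSource.CreateLinkedTokenSource(stop);
+            window.CancelAfter(debounceWindow);          // re-arm the 250 ms window
+            try
+            {
+                // A signal inside the window restarts the wait; a completed channel ends the loop.
+                if (!await signals.Reader.WaitToReadAsync(window.Token)) return;
+            }
+            catch (OperationCanceledException) when (!stop.IsCancellationRequested)
+            {
+                quiet = true;                            // window elapsed with no new signal
+            }
+        }
+        applies++;   // stand-in for ApplyAsync(monitor.CurrentValue)
+    }
+}
+
+// Three signals 20 ms apart model one editor save that fires several watcher events.
+using var cts = new CancellationTokenSource();
+var loop = DebounceLoopAsync(cts.Token);
+for (var i = 0; i < 3; i++)
+{
+    signals.Writer.TryWrite(true);
+    await Task.Delay(20);
+}
+await Task.Delay(600);
+Console.WriteLine($"applies = {applies}");
+cts.Cancel();
+try { await loop; } catch (OperationCanceledException) { }
+```
+
+Under these assumptions the burst of three signals should collapse into a single apply, because each new signal restarts the quiet window before it can elapse.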
+
+## The Reconcile Pipeline
+
+Three types in `src/Mbproxy/Configuration/` carry the reload contract from "framework noticed the file changed" to "the running service matches the new file":
+
+- `ReloadValidator` (`src/Mbproxy/Configuration/ReloadValidator.cs`) — runs cross-PLC and per-PLC checks before the reload is allowed to take effect. The validator is a static gate: `Validate(MbproxyOptions next, out IReadOnlyList<string> errors)` returns `false` and a list of error strings if the snapshot is malformed, and the apply step bails out before touching any state.
+- `ReloadPlan` (`src/Mbproxy/Configuration/ReloadPlan.cs`) — an immutable record produced by the pure function `ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next)`. It buckets PLCs into `ToAdd`, `ToRemove`, `ToRestart` (network identity changed), and `ToReseat` (only the resolved `BcdTagMap` changed). PLC identity is keyed on `Name`, not `ListenPort`, so a port change is still the same PLC and goes to `ToRestart` rather than `ToRemove` + `ToAdd`.
+- `ConfigReconciler` (`src/Mbproxy/Configuration/ConfigReconciler.cs`) — subscribes to `IOptionsMonitor.OnChange`, debounces and serialises change events through a bounded `Channel<bool>` and a `SemaphoreSlim(1, 1)`, then runs the plan: removes go first (concurrent), restarts next (concurrent), reseats apply via `PlcListenerSupervisor.ReplaceContextAsync`, and adds finish last.
+
+The reconciler's `OnChange` handler does not block. It writes to a `Channel<bool>` with `BoundedChannelFullMode.DropOldest` so a busy reload queue never stalls the configuration framework. A dedicated background loop drains the channel, applies the 250 ms debounce, and then calls `ApplyAsync` on the latest snapshot exposed by `IOptionsMonitor.CurrentValue`. The last enqueued change wins.
+
+The apply itself runs under `_applySemaphore` (a `SemaphoreSlim(1, 1)`) so two saves arriving in rapid succession are serialised and never interleave. If a second save lands while the first apply is still running, it queues at the semaphore and runs against whatever `CurrentValue` exposes when its turn comes — which is the freshest options snapshot, not necessarily the one that caused the wake-up.
+
+### Apply order
+
+`ApplyUnderLockAsync` runs the steps in this order against the latest snapshot, which is validated in step 1:
+
+1. **Validate.** If `ReloadValidator.Validate` returns errors, log `mbproxy.config.reload.rejected`, increment the rejected counter, and return without mutating state.
+2. **Compute.** Call `ReloadPlan.Compute(_currentOptions, next)` to bucket PLCs into `ToAdd`, `ToRemove`, `ToRestart`, and `ToReseat`.
+3. **Remove.** Stop every supervisor in `ToRemove` concurrently with a 10-second stop timeout, then dispose.
+4. **Restart.** Stop the old supervisor, build a fresh `PerPlcContext` (which includes a new `ResponseCache` when any resolved tag opts in), and start a new `PlcListenerSupervisor` on the new endpoint. Restarts run concurrently across affected PLCs.
+5. **Reseat.** For each PLC in `ToReseat`, build a new context that preserves the existing `Counters` (so operators see real history across the reseat) and call `PlcListenerSupervisor.ReplaceContextAsync` with a 5-second timeout.
+6. **Add.** Build and start a new supervisor for every PLC in `ToAdd` concurrently.
+7. **Record.** Update `_currentOptions` to `next`, call `ServiceCounters.RecordReloadApplied`, and log `mbproxy.config.reload.applied` with the apply counts and the global tag delta.
+
+If a step throws, the exception is logged at Error and the loop continues with the remaining steps. The validator catches every precondition that can be checked from the configuration alone, so a runtime exception here is a true bug worth surfacing. The host stays up regardless.
+
+## Per-Change-Kind Reconcile Table
+
+| Change in `appsettings.json` | Propagation |
+|------------------------------|-------------|
+| `BcdTags.Global` add / remove / width | Every PLC whose resolved map changes is routed to `ToReseat`. The rewriter reads the tag map from the current `PerPlcContext` at the start of each PDU, so the next PDU after the swap sees the new map. In-flight requests are not retroactively touched. |
+| `Plcs[i].BcdTags.Add` or `Plcs[i].BcdTags.Remove` | Same as above for the affected PLC: next-PDU resolution against the rebuilt map. |
+| New `Plcs[i]` entry | `ConfigReconciler` builds a fresh `PerPlcContext` and `PlcListenerSupervisor`, which binds the new port under the same eager-then-auto-recover policy used at service startup. |
+| `Plcs[i]` removed | The supervisor for that PLC is stopped (10 s stop timeout) and disposed, which closes every upstream client connection bound to that listener. |
+| `Plcs[i].ListenPort`, `Host`, or backend `Port` changed | Routed to `ToRestart` (PLC identity is keyed on `Name`). The effect on clients is equivalent to remove + add: the supervisor stops the old listener, the reconciler rebuilds the context, and a new supervisor starts on the new endpoint. |
+| `Connection.BackendConnectTimeoutMs` and the other `Backend*TimeoutMs` values | The next backend connect or request reads the new value through the monitor. In-flight operations keep their already-applied timeout. |
+| `BcdTags.*.CacheTtlMs` or `Plcs[i].DefaultCacheTtlMs` | A tag-map reseat constructs a fresh `ResponseCache` for that PLC, which drops every cached entry for that PLC. Entries re-populate on demand under the new TTL. Per-tag flush granularity is intentionally not implemented. |
+| `Cache.AllowLongTtl` | Enforced at the next reload validation. A TTL above `60_000` ms is accepted only if `Cache.AllowLongTtl = true` is already in effect or arrives in the same save. |
+| `Cache.MaxEntriesPerPlc` | Applies to subsequent inserts. Existing entries are not pruned. |
+| `Cache.EvictionIntervalMs` | Read by the next eviction loop tick. |
+| `Resilience.ReadCoalescing.Enabled` flipped to `false` | Already-running coalesced entries drain naturally. Subsequent reads bypass coalescing. |
+| `Resilience.ReadCoalescing.MaxParties` | Applies to subsequent attaches. Existing in-flight entries keep their current cap. |
+| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole. The current in-memory config stays in effect. `mbproxy.config.reload.rejected` is logged at Error. |
+
+The "next-PDU" wording is load-bearing for the tag-list rows: the rewriter does not snapshot the tag map at connection accept time. It resolves the map for the active PLC at the start of every request frame, so a hot-reloaded tag list is in effect for the very next request, even on existing TCP connections.
+
+### Reseat vs. restart
+
+The `ReloadPlan` distinguishes two kinds of "PLC is still here but changed":
+
+- **Restart** is triggered when `Host`, `ListenPort`, or backend `Port` differ between the old and new `PlcOptions`. The TCP socket has to close and reopen on a new endpoint, so there is no way to preserve the listener — the supervisor stops and a brand-new one starts.
+- **Reseat** is triggered when only the resolved `BcdTagMap` differs (which `ReloadPlan.Compute` checks structurally through `TagMapsEqual`: same set of `(Address, Width, CacheTtlMs)` triples). The listener socket and the upstream pipes stay open. Only the `PerPlcContext` swaps.
+
+`TagMapsEqual` includes `BcdTag.CacheTtlMs` in the comparison so a per-tag TTL change or a `Plcs[i].DefaultCacheTtlMs` change (which folds into per-tag TTLs through `BcdTagMapBuilder.Build`) also routes to `ToReseat` and so also drops the cache. A `Plcs[i]` whose options are byte-identical to the previous snapshot lands in neither bucket and the supervisor is left alone.
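+
+The restart-versus-reseat decision can be sketched as a small pure function. This is an illustration under simplified assumptions: the `Plc` and `Tag` records, the `Change` enum, and `Planner.Classify` are hypothetical shapes, not the real `PlcOptions`, `BcdTag`, or `ReloadPlan.Compute`.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+var current = new Plc("Line1-Mixer", 5020, "10.0.1.1", 502, new[] { new Tag(1024, 16, null) });
+var moved   = current with { ListenPort = 5021 };
+var retag   = current with { ResolvedTags = new[] { new Tag(1024, 32, null) } };
+
+Console.WriteLine(Planner.Classify(current, moved));    // Restart
+Console.WriteLine(Planner.Classify(current, retag));    // Reseat
+Console.WriteLine(Planner.Classify(current, current));  // Unchanged
+
+enum Change { Unchanged, Reseat, Restart }
+
+// Hypothetical, simplified shapes; records give value equality on the tag triple.
+record Tag(ushort Address, byte Width, int? CacheTtlMs);
+record Plc(string Name, int ListenPort, string Host, int Port, IReadOnlyList<Tag> ResolvedTags);
+
+static class Planner
+{
+    public static Change Classify(Plc current, Plc next)
+    {
+        // A network identity change forces the listener to close and reopen.
+        if (current.Host != next.Host || current.ListenPort != next.ListenPort || current.Port != next.Port)
+            return Change.Restart;
+
+        // Otherwise compare the resolved maps as sets of (Address, Width, CacheTtlMs).
+        var tags = new HashSet<Tag>(current.ResolvedTags);
+        return tags.SetEquals(next.ResolvedTags) ? Change.Unchanged : Change.Reseat;
+    }
+}
+```
+
+Checking network identity first mirrors the rule that a PLC needing a restart gets a fresh context anyway, so its tag-map difference does not need a separate reseat.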
+
+### Tag map resolution
+
+`BcdTagMapBuilder.Build` is the single source of truth for what the resolved tag list looks like for one PLC. The hybrid resolution it implements is:
+
+1. Start with `BcdTags.Global` from the root options.
+2. Remove every address present in `Plcs[i].BcdTags.Remove`.
+3. Merge in `Plcs[i].BcdTags.Add` entries — if an address already exists in the working set, the `Add` entry wins. This is how a per-PLC width override is expressed (the global lists a 16-bit tag at the same address; the per-PLC `Add` overrides it to 32-bit).
+4. Fold `Plcs[i].DefaultCacheTtlMs` into any tag whose explicit `CacheTtlMs` is null.
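+
+The four resolution steps above can be sketched as follows. The `Tag` record and `Resolver.Resolve` are hypothetical simplifications; the real `BcdTagMapBuilder.Build` also reports errors such as duplicate addresses and 32-bit overlaps, which this sketch does not attempt.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+var global = new[] { new Tag(1024, 16, null), new Tag(1056, 32, null), new Tag(1088, 16, 1000) };
+var add    = new[] { new Tag(1024, 32, null), new Tag(1200, 32, null) };
+var remove = new ushort[] { 1056 };
+
+foreach (var tag in Resolver.Resolve(global, add, remove, defaultCacheTtlMs: 500))
+    Console.WriteLine(tag);
+
+// Hypothetical, simplified tag shape.
+record Tag(ushort Address, byte Width, int? CacheTtlMs);
+
+static class Resolver
+{
+    public static IReadOnlyList<Tag> Resolve(
+        IEnumerable<Tag> global, IEnumerable<Tag> add, IEnumerable<ushort> remove, int defaultCacheTtlMs)
+    {
+        // 1. Start with Global (ToDictionary throws on a duplicate address in this sketch).
+        var map = global.ToDictionary(t => t.Address);
+        // 2. Drop every address listed in Remove.
+        foreach (var address in remove) map.Remove(address);
+        // 3. Merge Add; an existing address is overwritten, so Add wins.
+        foreach (var t in add) map[t.Address] = t;
+        // 4. Fold the PLC default into tags with no explicit TTL.
+        return map.Values
+            .Select(t => t with { CacheTtlMs = t.CacheTtlMs ?? defaultCacheTtlMs })
+            .OrderBy(t => t.Address)
+            .ToList();
+    }
+}
+```
+
+With these inputs the resolved list holds three tags: address 1024 promoted to 32-bit by `Add`, address 1088 keeping its explicit 1000 ms TTL, and address 1200 picking up the 500 ms default.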
+
+The same builder runs both at startup and during reload validation, so a configuration that builds cleanly at startup is guaranteed to build cleanly at reload, and vice versa. There is no second validator that could disagree with the first.
+
+## Validation Rules
+
+`ReloadValidator.Validate` is the gate the hot-reload path consults directly. It runs the following checks in order:
+
+1. PLC names are non-empty and unique under ordinal comparison.
+2. Every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the `Plcs` list.
+3. `AdminPort` is in `[1, 65535]` and does not collide with any `ListenPort`.
+4. For each PLC, `BcdTagMapBuilder.Build(next.BcdTags, plc.BcdTags, plc.DefaultCacheTtlMs)` reports no errors. This delegates the per-PLC well-formedness checks — duplicate addresses within a single resolved list, and 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry — to the single source of truth used at startup.
+5. Cache TTL bounds: every `BcdTag.CacheTtlMs` and every `Plcs[i].DefaultCacheTtlMs` must be `>= 0`, and any value above `60_000` ms requires `Cache.AllowLongTtl = true`. `Cache.MaxEntriesPerPlc` and `Cache.EvictionIntervalMs` must be `>= 0`.
+
+A failure at any step appends to the error list, but the validator runs to completion so the operator sees every problem from a single save. If the list is non-empty, the reload is rejected atomically and no state mutates.
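+
+A rough sketch of the accumulate-then-reject pattern, limited to the port checks. `PortChecks.Validate` is a hypothetical helper, and only the duplicate-port wording is taken from the log example in this document; the other messages are invented for illustration.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+var errors = PortChecks.Validate(
+    adminPort: 8080,
+    plcs: new[] { ("plc-01", 5020), ("plc-02", 5020), ("plc-03", 70000) });
+
+Console.WriteLine(errors.Count == 0
+    ? "reload accepted"
+    : "Config reload rejected — Errors=" + string.Join("; ", errors));
+
+static class PortChecks
+{
+    public static List<string> Validate(int adminPort, IEnumerable<(string Name, int ListenPort)> plcs)
+    {
+        var errors = new List<string>();
+        var owners = new Dictionary<int, string>();   // ListenPort -> first PLC that claimed it
+
+        foreach (var (name, port) in plcs)
+        {
+            if (port < 1 || port > 65535)
+                errors.Add($"Plc '{name}': ListenPort {port} is outside [1, 65535].");
+            else if (owners.TryGetValue(port, out var owner))
+                errors.Add($"Plc '{name}': Duplicate ListenPort {port} (already used by '{owner}').");
+            else
+                owners[port] = name;
+        }
+
+        if (adminPort < 1 || adminPort > 65535)
+            errors.Add($"AdminPort {adminPort} is outside [1, 65535].");
+        else if (owners.TryGetValue(adminPort, out var clash))
+            errors.Add($"AdminPort {adminPort} collides with the ListenPort of '{clash}'.");
+
+        return errors;   // the caller rejects the whole reload when this is non-empty
+    }
+}
+```
+
+No check returns early, so one pass reports both the duplicate port and the out-of-range port.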
+
+Schema-level checks — invalid `Width` values on a `BcdTagOptions`, type mismatches, malformed JSON — are also enforced by `MbproxyOptionsValidator` (`IValidateOptions<MbproxyOptions>`) at bind time. The two paths overlap deliberately so both startup and reload reject the same malformed input with the same error wording.
+
+### Rejected-reload example
+
+A duplicate `ListenPort` in the saved file produces an error like the following on the rejected log line:
+
+```text
+Config reload rejected — Errors=Plc 'plc-02': Duplicate ListenPort 5020 (already used by 'plc-01').
+```
+
+When several rules trip on the same save, the validator joins them with `; ` so the operator sees every problem from one file save. The current in-memory configuration is unchanged, every supervisor keeps running on its existing context, and the next valid save will replay the whole apply against the now-current state.
+
+## What Stays vs. What Changes Mid-Flight
+
+The reload contract is built around a simple invariant: a Modbus request that has already started routing keeps the configuration it started with. The next request after the reload picks up the new values.
+
+The rewriter is the clearest example. `BcdPduPipeline` dereferences the tag map at the start of every PDU. A request that is already in the multiplexer's outbound queue is rewritten against the map that was current when it arrived. The very next request on the same TCP connection sees the new map. This avoids a torn behaviour where one PDU is half-rewritten under the old tag list and half under the new — every PDU is fully consistent with exactly one snapshot of the map.
+
+The same principle applies to timeouts. `Connection.BackendConnectTimeoutMs` and the per-operation timeout values are read through `IOptionsMonitor.CurrentValue` at the point the operation starts. A backend connect that has already entered its retry pipeline keeps its already-applied timeout for the remainder of that attempt. The next backend connect reads the new value.
+
+The reseat path is the only place where running state changes mid-connection. A reseat swaps the entire `PerPlcContext` — `TagMap`, `Counters`, `Cache` — via `PlcListenerSupervisor.ReplaceContextAsync`. The listener socket and the existing upstream pipes survive the swap. The brief transition window between the old context and the new is documented in code: any PDU mid-flight at the swap point may observe the boundary, but the rewriter only consults the map at PDU start, so the practical effect is the same next-PDU resolution rule.
+
+Counters are explicitly preserved across a reseat. The reconciler reads `supervisor.CurrentCounters` and passes the same `ProxyCounters` instance into the new context so request counts, rewrite counts, and error counts do not reset to zero every time an operator tweaks a tag. A restart, by contrast, constructs a brand-new `ProxyCounters` because the supervisor itself is brand new.
+
+### Effect on upstream sockets
+
+The fate of an open upstream client socket depends on which bucket its PLC lands in:
+
+- **Reseat.** The socket stays open. The client never notices the reload happened; only its next request frame resolves against the new tag map.
+- **Restart.** The old listener stops, which closes every upstream socket bound to it. The client sees a TCP close and is expected to reconnect (Wonderware DAServer, generic Modbus masters, and the supported gateways all do this automatically). When it reconnects, it lands on the new listener at the new endpoint.
+- **Remove.** Same as a restart from the client's perspective: the listener stops and every connection closes. If the operator also removed the proxy endpoint from the upstream client's configuration, the client stops reconnecting; otherwise its reconnect attempts fail with `ECONNREFUSED` until the PLC entry is added back.
+- **Add.** No effect on any existing socket. The new listener simply starts accepting on its `ListenPort`.
+
+## Cache and Hot-Reload
+
+Any tag-list change that affects a PLC drops the entire `ResponseCache` for that PLC. The reseat path constructs a fresh cache through `ConfigReconciler.BuildCacheIfNeeded`, which inspects the resolved map and returns a new `ResponseCache` when at least one tag opts in, or `null` otherwise. The supervisor disposes the old cache during `ReplaceContextAsync`.
+
+Per-tag granular flush is intentionally not implemented. The reasoning is correctness over micro-optimisation:
+
+- A width change between 16-bit and 32-bit can invalidate cached entries at neighbouring addresses, not just at the changed tag.
+- A tag removal means a cached value is no longer rewritten on the way out, so the cached entry that was valid one millisecond ago is now serving the wrong shape.
+- A TTL change on one tag does not influence neighbouring entries, but the cost of tracking per-entry TTL versions and replaying flushes outweighs the cost of repopulating on demand.
+
+A wholesale drop is the simple correct move. Entries repopulate on demand at the next read against the new TTL, and a 54-PLC fleet with second-scale TTLs warms back to steady state within a handful of poll intervals.
+
+`Cache.MaxEntriesPerPlc` and `Cache.EvictionIntervalMs` deliberately do **not** trigger a reseat. A change to either value is structurally invisible to `TagMapsEqual` (which only inspects the resolved tag triples), so no cache rebuild happens. `MaxEntriesPerPlc` is enforced on subsequent inserts only — existing entries above the new cap stay until natural LRU eviction reaches them. `EvictionIntervalMs` is sampled by each fresh tick of the eviction loop, so a change takes effect at the next tick of the old interval.
+
+## Reload Events
+
+Each reconciler run surfaces one of two events in the structured log, depending on whether the reload was applied or rejected:
+
+```csharp
+[LoggerMessage(EventId = 60, EventName = "mbproxy.config.reload.applied",
+    Level = LogLevel.Information,
+    Message = "Config reload applied — PlcsAdded={PlcsAdded} PlcsRemoved={PlcsRemoved} " +
+              "PlcsRestarted={PlcsRestarted} PlcsReseated={PlcsReseated} GlobalTagDelta={GlobalTagDelta}")]
+private static partial void LogReloadApplied(
+    ILogger logger, int plcsAdded, int plcsRemoved, int plcsRestarted, int plcsReseated, int globalTagDelta);
+
+[LoggerMessage(EventId = 61, EventName = "mbproxy.config.reload.rejected",
+    Level = LogLevel.Error,
+    Message = "Config reload rejected — Errors={Errors}")]
+private static partial void LogReloadRejected(ILogger logger, string errors);
+```
+
+`mbproxy.config.reload.applied` carries the counts from the executed `ReloadPlan` plus a `GlobalTagDelta` computed by `ConfigReconciler.ComputeGlobalTagDelta`, which counts how many global tag entries differ between the old and new options snapshots (added, removed, or width-changed).
+
+`mbproxy.config.reload.rejected` carries the joined error string from `ReloadValidator.Validate`. The reconciler also increments service-wide counters through `ServiceCounters.RecordReloadApplied` and `ServiceCounters.RecordReloadRejected`, which surface on the status page as `config.reloadCount`, `config.reloadRejectedCount`, and `config.lastReloadUtc`. Both event names are catalogued in [`../Reference/LogEvents.md`](../Reference/LogEvents.md).
+
+### Reading the events
+
+A healthy reload looks like this in the log stream:
+
+```text
+INFO mbproxy.config.reload.applied — PlcsAdded=1 PlcsRemoved=0 PlcsRestarted=0 PlcsReseated=2 GlobalTagDelta=3
+```
+
+The properties answer four questions at a glance: how many new listeners were brought up, how many old listeners were torn down, how many existing listeners moved to a new endpoint (and therefore disconnected their clients), and how many existing listeners had their tag maps swapped underneath open connections. `GlobalTagDelta` reports the number of `BcdTags.Global` entries that differ between snapshots; it counts each address once whether the difference is "added", "removed", or "width changed".
+
+A rejected reload looks like this:
+
+```text
+ERROR mbproxy.config.reload.rejected — Errors=Plc 'plc-02': Duplicate ListenPort 5020 (already used by 'plc-01').; Plc 'plc-03': BCD tag map error (DuplicateAddress): Address 1072 appears twice in resolved tag list.
+```
+
+The validator's errors are joined with `; ` so a single rejected event captures every problem. The matching `config.reloadRejectedCount` counter on the status page increments by one per rejected save, not per error inside the save.
+
+## Related Documentation
+
+- [Architecture Overview](../Architecture/Overview.md)
+- [Response Cache](../Architecture/ResponseCache.md)
+- [BCD Rewriting](./BcdRewriting.md)
+- [Configuration Reference](../Operations/Configuration.md)
+- [Log Events](../Reference/LogEvents.md)
+- [Status Page](../Operations/StatusPage.md)
diff --git a/mbproxy/docs/Operations/Configuration.md b/mbproxy/docs/Operations/Configuration.md
new file mode 100644
index 0000000..d5a1907
--- /dev/null
+++ b/mbproxy/docs/Operations/Configuration.md
@@ -0,0 +1,422 @@
+# Configuration Reference
+
+`mbproxy` binds its runtime configuration from `appsettings.json` under the `Mbproxy` section. This document is the full reference for every supported key, its type, default, range, and validation rules.
+
+## File Location
+
+The configuration loader resolves `appsettings.json` relative to the executable.
+
+- **Development run** (`dotnet run`): `src/Mbproxy/appsettings.json` next to the build output.
+- **Single-file publish** (`dotnet publish -c Release -r win-x64`): `appsettings.json` next to `Mbproxy.exe` in the publish folder.
+- **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`. The install script copies the template at `install/mbproxy.config.template.json` to this path the first time only — an existing file is preserved across reinstalls.
+
+The file is loaded with `reloadOnChange: true`. All consumers read through `IOptionsMonitor<MbproxyOptions>`, so a save propagates without restarting the service. See [`../Features/HotReload.md`](../Features/HotReload.md) for per-key propagation semantics.
+
+The .NET configuration provider accepts `//` and `/* */` comments (JSONC) in `appsettings.json` when loaded through `Host.CreateApplicationBuilder`. The install template ships with comments.
+
+Environment variables and command-line arguments are also accepted by the host. Either form can override any `Mbproxy:*` key; for example, `Mbproxy__AdminPort=9090` (double-underscore segment separator) overrides the JSON. Environment overrides are useful for ephemeral diagnostic switches but should not replace the file as the source of truth — `ReloadValidator` runs against the merged configuration on every reload.
+
+## Top-Level Schema
+
+Every supported key under `Mbproxy:*`, populated to a representative default:
+
+```jsonc
+{
+  "Mbproxy": {
+
+    // Global BCD tag list — applies to every PLC unless overridden per-PLC.
+    "BcdTags": {
+      "Global": [
+        { "Address": 1024, "Width": 16 },                       // 16-bit BCD register
+        { "Address": 1056, "Width": 32 },                       // 32-bit BCD pair (CDAB)
+        { "Address": 1088, "Width": 16, "CacheTtlMs": 1000 }    // opt-in cache, 1 s TTL
+      ]
+    },
+
+    // One entry per PLC. Each maps an upstream proxy port to a backend Modbus TCP endpoint.
+    "Plcs": [
+      {
+        "Name": "Line1-Mixer",
+        "ListenPort": 5020,
+        "Host": "10.0.1.1",
+        "Port": 502,
+        "DefaultCacheTtlMs": 0,
+        "BcdTags": {
+          "Add":    [ { "Address": 1200, "Width": 32 } ],
+          "Remove": [ 1056 ]
+        }
+      }
+    ],
+
+    // Read-only HTTP status page. Valid range is [1, 65535].
+    "AdminPort": 8080,
+
+    // Backend connection / request / shutdown timeouts.
+    "Connection": {
+      "BackendConnectTimeoutMs":   3000,
+      "BackendRequestTimeoutMs":   3000,
+      "GracefulShutdownTimeoutMs": 10000
+    },
+
+    // Polly resilience policies.
+    "Resilience": {
+      "BackendConnect": {
+        "MaxAttempts": 3,
+        "BackoffMs":   [ 100, 500, 2000 ]
+      },
+      "ListenerRecovery": {
+        "InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ],
+        "SteadyStateMs":    30000
+      },
+      "ReadCoalescing": {
+        "Enabled":    true,
+        "MaxParties": 32
+      }
+    },
+
+    // Response-cache safety knobs. The cache is off by default per tag.
+    "Cache": {
+      "AllowLongTtl":       false,
+      "MaxEntriesPerPlc":   1000,
+      "EvictionIntervalMs": 5000
+    }
+  }
+}
+```
+
+`Serilog` configuration is documented in [`./Troubleshooting.md`](./Troubleshooting.md) and lives outside the `Mbproxy` section.
+
+## `Mbproxy.AdminPort`
+
+Port for the read-only HTTP status server. Binds to all interfaces on startup.
+
+| Field | Type | Default | Range |
+|-------|------|---------|-------|
+| `AdminPort` | int | `8080` | `[1, 65535]` |
+
+`ReloadValidator` rejects values outside `[1, 65535]` and rejects collisions with any `Plcs[i].ListenPort`. Source: `MbproxyOptions.AdminPort`.
+
+The server exposes `GET /` (auto-refreshing HTML) and `GET /status.json`. See [`./StatusPage.md`](./StatusPage.md) for the schema.
+
+Authentication is assumed at the network layer (trusted internal segment). The endpoint is read-only — there are no `POST` / `PUT` / `DELETE` routes — so the risk surface is limited to status disclosure. Place the admin port behind a firewall rule that allows only operator workstations.
+
+## `Mbproxy.Plcs[]`
+
+One entry per PLC. The array drives the listener supervisor; on reload, entries added here cause new listeners to bind and entries removed here cause listeners to stop. Source: `PlcOptions.cs`.
+
+| Field | Type | Default | Required | Notes |
+|-------|------|---------|----------|-------|
+| `Name` | string | `""` | yes | Non-empty, unique across the array. Shown on the status page and in structured logs as `plc`. |
+| `ListenPort` | int | `0` | yes | Port the proxy listens on. `[1, 65535]`. Unique across the array. Cannot collide with `AdminPort`. |
+| `Host` | string | `""` | yes | PLC IP address or hostname. |
+| `Port` | int | `502` | no | Backend Modbus TCP port on the PLC. |
+| `BcdTags` | object | `null` | no | Per-PLC overrides on top of `Mbproxy.BcdTags.Global`. See below. |
+| `DefaultCacheTtlMs` | int | `0` | no | Fallback TTL in milliseconds for any tag on this PLC whose explicit `CacheTtlMs` is unset (`null`). `0` disables caching by default. |
+
+### `Plcs[i].BcdTags`
+
+Per-PLC override block. Resolution: the effective tag list for a PLC is `(Global − Remove) ∪ Add`, with `Add` winning on width when an address appears in both `Global` and `Add`. Source: `BcdTagListOptions.PlcBcdOverrides`.
+
+| Field | Type | Default | Notes |
+|-------|------|---------|-------|
+| `Add` | `BcdTagOptions[]` | `[]` | Tags to append for this PLC. Can override a `Global` entry's `Width` by repeating the address. Each entry follows the `BcdTagOptions` shape (see next section). |
+| `Remove` | `ushort[]` | `[]` | Addresses to drop from this PLC's effective list. Matches by address. |
+
+The full tag-list resolution algorithm — `Add` width override semantics, overlap detection, and per-PLC tag-map flushing on reload — is documented in [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md).
+
+A subtle case worth pinning down: when an address appears in both `Mbproxy.BcdTags.Global[]` and `Plcs[i].BcdTags.Add[]`, the per-PLC `Add` entry wins on `Width` and `CacheTtlMs`. This is how a per-PLC width override is expressed (for example, a 16-bit tag globally, promoted to 32-bit on the one PLC that uses the wider format). To strip a global tag from a PLC entirely, use `Remove`; do not add a same-address entry with `Width = 0`.
+
+## `Mbproxy.BcdTags.Global[]`
+
+The fleet-wide BCD tag list. Every PLC starts with this set, then applies its per-PLC `Add` / `Remove` overrides. Source: `BcdTagListOptions.Global`, entries of type `BcdTagOptions`.
+
+| Field | Type | Default | Range | Notes |
+|-------|------|---------|-------|-------|
+| `Address` | ushort | `0` | `[0, 65535]` | Modbus PDU address (decimal). Address `0` is valid on DL205/DL260 — do not skip it. Octal V-memory addresses must be converted: `V2000` octal = decimal `1024`. |
+| `Width` | byte | `0` | `{ 16, 32 }` | Bit width. `16` is one register holding 4 BCD digits (`0–9999`). `32` is a CDAB-ordered register pair at `Address` (low word) and `Address+1` (high word). |
+| `CacheTtlMs` | int? | `null` | `>= 0`, `<= 60000` unless `Cache.AllowLongTtl = true` | Optional per-tag opt-in to the response cache. `null` falls back to the PLC's `DefaultCacheTtlMs`. `0` explicitly disables caching for this tag even when the PLC default is non-zero. |
+
+`MbproxyOptionsValidator` rejects any entry whose `Width` is not `16` or `32`. See [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) for the wire encoding rules and the multi-tag-overlap validation that runs in `BcdTagMapBuilder`.
+
+Address conversion examples for operators coming from DirectLOGIC ladder:
+
+| V-memory (octal) | Modbus PDU (decimal) |
+|------------------|----------------------|
+| `V2000` | `1024` |
+| `V2040` | `1056` |
+| `V2100` | `1088` |
+| `V2200` | `1152` |
+
+The proxy expects PDU-decimal addresses. Do not use octal V-memory addresses and do not use 1-based `4xxxx` Modbus references — both will resolve to the wrong register.
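+
+For the V-memory addresses in the table above, the conversion is a plain octal-to-decimal parse. The helper below is a hypothetical convenience for checking a config by hand, not part of `mbproxy`.
+
+```csharp
+using System;
+
+foreach (var v in new[] { "V2000", "V2040", "V2100", "V2200" })
+    Console.WriteLine($"{v} -> {VMemory.ToPduAddress(v)}");
+
+static class VMemory
+{
+    // Strips the leading 'V' and parses the remaining digits as octal.
+    public static ushort ToPduAddress(string vAddress)
+    {
+        if (string.IsNullOrEmpty(vAddress) || char.ToUpperInvariant(vAddress[0]) != 'V')
+            throw new FormatException($"'{vAddress}' is not a V-memory address.");
+
+        // Convert.ToInt32 with base 8 rejects the digits 8 and 9 with a FormatException.
+        return checked((ushort)Convert.ToInt32(vAddress.Substring(1), 8));
+    }
+}
+```
+
+The loop prints the same four pairs as the table, which makes it a quick cross-check before pasting addresses into `BcdTags`.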
+
+## `Mbproxy.Connection`
+
+Backend connection and shutdown timeouts. Source: `ConnectionOptions.cs`.
+
+| Field | Type | Default | Notes |
+|-------|------|---------|-------|
+| `BackendConnectTimeoutMs` | int | `3000` | Max time in milliseconds to wait for one TCP connect to the backend PLC. Each Polly retry attempt is bounded by its own copy of this timeout — total worst-case connect time is `MaxAttempts * BackendConnectTimeoutMs` plus the configured backoffs. |
+| `BackendRequestTimeoutMs` | int | `3000` | Max time in milliseconds to wait for the PLC to respond to a forwarded PDU. On timeout the upstream client is disconnected. FC06 / FC16 writes are not retried because they are non-idempotent on BCD tags; FC03 / FC04 reads are also not retried mid-request (a fresh upstream request takes the full pipeline again). |
+| `GracefulShutdownTimeoutMs` | int | `10000` | Max time in milliseconds the shutdown coordinator waits for in-flight PDUs to drain after a stop signal (`sc.exe stop` or Windows Service stop). After this deadline, remaining work is cancelled. Keep at or below the Service Control Manager wait hint (30 s). |
+
+On hot reload, `BackendConnectTimeoutMs` and `BackendRequestTimeoutMs` apply to the next backend connect or request — in-flight operations keep their already-applied timeout. `GracefulShutdownTimeoutMs` is sampled only at shutdown.
+
+Operational sizing notes:
+
+- The default 3 s connect timeout is appropriate for a local Ethernet segment to a healthy ECOM100. On WAN paths or for devices behind switches with slow MAC-table aging, raise to 5–10 s.
+- A 3 s request timeout is generous compared with typical DL205/DL260 scan times (a few ms to tens of ms for FC03 of 100 registers). The slack absorbs PLC scan-overlap jitter without faulting the upstream client.
+- `GracefulShutdownTimeoutMs` should be less than the Service Control Manager's stop deadline. The default 10 s suits a fleet of 54 PLCs; on a much larger fleet, raise both the SCM wait hint and this value in lockstep.
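+
+A minimal sketch of the first sizing note, assuming a WAN-attached PLC. The `8000` value is a hypothetical pick from the 5–10 s range suggested above; the other two keys keep their defaults.
+
+```jsonc
+// Fragment of the "Mbproxy" section. 8000 is an illustrative value, not a measured one.
+"Connection": {
+  "BackendConnectTimeoutMs":   8000,   // raised from the 3000 default for a slow path
+  "BackendRequestTimeoutMs":   3000,   // default
+  "GracefulShutdownTimeoutMs": 10000   // default, below the 30 s SCM wait hint
+}
+```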
+
+## `Mbproxy.Resilience`
+
+Polly retry pipelines for backend connect, listener bind, and the in-flight read coalescer. Source: `ResilienceOptions.cs`.
+
+### `Mbproxy.Resilience.BackendConnect`
+
+Bounded retries on the backend TCP connect path. Mid-request failures (during a forwarded PDU) are never retried.
+
+| Field | Type | Default | Notes |
+|-------|------|---------|-------|
+| `MaxAttempts` | int | `3` | Total connect tries, including the first. `1` disables retries. |
+| `BackoffMs` | int[] | `[100, 500, 2000]` | Delay in milliseconds between attempts. Must have `MaxAttempts - 1` entries. |
+
+### `Mbproxy.Resilience.ListenerRecovery`
+
+Unbounded retries on the listener bind path. If a PLC's `ListenPort` cannot be bound (port in use, bad interface, transient OS error), the supervisor cycles through `InitialBackoffMs` once, then repeats `SteadyStateMs` forever. The same recovery code path also reacts to a listener that faults at runtime (for example, the underlying socket dies) and to listeners that come online from a hot-reload that adds a new PLC.
+
+| Field | Type | Default | Notes |
+|-------|------|---------|-------|
+| `InitialBackoffMs` | int[] | `[1000, 2000, 5000, 15000, 30000]` | Backoff schedule for the first N retries after a fault. |
+| `SteadyStateMs` | int | `30000` | Backoff for every retry after the initial schedule is exhausted. Runs indefinitely. |
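+
+The default schedule can be read as a timeline. The sketch below annotates it under two assumptions: each delay is waited before the corresponding retry, and the time spent inside each bind attempt is ignored.
+
+```jsonc
+// Fragment of "Mbproxy.Resilience". Timings in comments are approximate.
+"ListenerRecovery": {
+  // Retries 1 to 5 wait 1 s, 2 s, 5 s, 15 s, 30 s in turn,
+  // so the fifth retry fires about 1 + 2 + 5 + 15 + 30 = 53 s after the fault.
+  "InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ],
+  // Every later retry waits another 30 s (sixth retry at about 83 s), indefinitely.
+  "SteadyStateMs": 30000
+}
+```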
+
+### `Mbproxy.Resilience.ReadCoalescing`
+
+In-flight de-duplication of identical FC03 / FC04 reads. When multiple upstream clients issue the same `(unitId, fc, startAddress, qty)` tuple while a matching backend round-trip is already in flight, the late arrivals attach to the existing entry and the single response is fanned out to every party. See [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md).
+
+| Field | Type | Default | Notes |
+|-------|------|---------|-------|
+| `Enabled` | bool | `true` | Master switch. Hot-reloadable; flipping to `false` lets already-coalesced entries drain naturally. |
+| `MaxParties` | int | `32` | Per-entry cap on attached parties. Past this cap, the next identical request opens a fresh in-flight entry. |
+
+Writes (FC06 / FC16) are never coalesced. FC03 and FC04 never share an entry. Different `unitId` bytes never share an entry.
+
+Total FC03 + FC04 request accounting is preserved across the coalescing path: `coalescedHitCount + coalescedMissCount` equals the total reads observed by the multiplexer since startup. `coalescedHitCount` stays at `0` while `Enabled = false`, but every read still increments `coalescedMissCount`. See [`./StatusPage.md`](./StatusPage.md) for the full counter catalogue.
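+
+For reference, a fragment that turns coalescing off might look like the sketch below. Per the accounting rule above, `coalescedHitCount` would then stay at `0` while `coalescedMissCount` keeps counting every read.
+
+```jsonc
+// Fragment of "Mbproxy.Resilience".
+"ReadCoalescing": {
+  "Enabled":    false,  // each FC03 / FC04 read issues its own backend round-trip
+  "MaxParties": 32      // default, left in place
+}
+```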
+
+## `Mbproxy.Cache`
+
+Service-wide safety knobs for the opt-in response cache. The cache is off by default per tag — this section only governs the limits when an operator opts a tag in via `CacheTtlMs` or `DefaultCacheTtlMs`. Source: `CacheOptions` in `MbproxyOptions.cs`.
+
+| Field | Type | Default | Notes |
+|-------|------|---------|-------|
+| `AllowLongTtl` | bool | `false` | Gate for any `CacheTtlMs > 60_000`. When `false`, `ReloadValidator` rejects any tag or PLC default that exceeds 60 s. Set to `true` to opt in explicitly. |
+| `MaxEntriesPerPlc` | int | `1000` | LRU cap on the number of entries per PLC. When full, the next insert evicts the least-recently-used entry. Must be `>= 0`. `0` is accepted but means "evict every insert immediately" — effectively the cache is disabled even for tags with non-zero TTL. |
+| `EvictionIntervalMs` | int | `5000` | Background eviction loop tick in milliseconds. Each tick scans the per-PLC caches and removes entries past their `ExpiresAtUtc`. Must be `>= 0`; values below 100 ms are clamped at 100 ms internally to avoid pathologically tight loops. |
+
+On hot reload, `AllowLongTtl` is enforced by the next reload validation. `MaxEntriesPerPlc` applies to subsequent inserts (existing entries are not pruned). `EvictionIntervalMs` is read by each fresh eviction loop iteration.
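+
+As an illustration of the `AllowLongTtl` gate, the hypothetical fragment below pairs a two-minute per-tag TTL with the explicit opt-in. Without `"AllowLongTtl": true`, the table above implies the reload would be rejected.
+
+```jsonc
+// Fragment of the "Mbproxy" section. The 120000 ms TTL is an example value.
+"BcdTags": {
+  "Global": [
+    { "Address": 1024, "Width": 16, "CacheTtlMs": 120000 }  // above the 60_000 ms gate
+  ]
+},
+"Cache": {
+  "AllowLongTtl":       true,   // explicit opt-in for TTLs above 60 s
+  "MaxEntriesPerPlc":   1000,   // default
+  "EvictionIntervalMs": 5000    // default
+}
+```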
+
+Any tag-list change for a given PLC drops that PLC's entire cache on reload — per-tag flush granularity is intentionally not implemented. New entries re-populate on demand under the new TTL. Process restart wipes every cache; there is no persistence and no last-known-good snapshot.
+
+See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) for the cache contract (lookup order, write-invalidation by address-range overlap, post-rewriter byte storage).
+
+## Per-Tag `CacheTtlMs`
+
+Per-tag opt-in to the cache. The same field appears on every `BcdTagOptions` entry — both `Mbproxy.BcdTags.Global[]` and `Mbproxy.Plcs[i].BcdTags.Add[]`.
+
+| Value | Meaning |
+|-------|---------|
+| `null` (omitted) | Unset. Falls back to `Plcs[i].DefaultCacheTtlMs`. |
+| `0` | Caching explicitly disabled for this tag, even if the PLC default is non-zero. |
+| `1..60000` | Cache enabled with this TTL in milliseconds. |
+| `> 60000` | Rejected at reload unless `Cache.AllowLongTtl = true`. |
+
+TTL resolution order for any single tag: **explicit per-tag value → per-PLC `DefaultCacheTtlMs` → 0 (off)**.
+
+For multi-tag read ranges, the effective TTL is `min(TTLs)` across all configured tags inside the read range. If any tag in the range has `CacheTtlMs = 0`, the entire read is uncached.
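+
+The resolution order and the `min(TTLs)` rule can be traced on a small hypothetical fragment. The PLC name, addresses, and TTL values below are invented for the example.
+
+```jsonc
+// Fragment of the "Mbproxy" section. All names and values are illustrative.
+"BcdTags": {
+  "Global": [
+    { "Address": 1024, "Width": 16 },                      // null: falls back to the PLC default
+    { "Address": 1025, "Width": 16, "CacheTtlMs": 2000 },  // explicit 2000 ms
+    { "Address": 1026, "Width": 16, "CacheTtlMs": 0 }      // explicitly uncached
+  ]
+},
+"Plcs": [
+  { "Name": "Example-Plc", "ListenPort": 5020, "Host": "10.0.1.1", "DefaultCacheTtlMs": 500 }
+]
+// Effective TTL per tag on Example-Plc: 1024 -> 500 ms, 1025 -> 2000 ms, 1026 -> 0 (off).
+// Read of 1024..1025 (qty 2): min(500, 2000) = 500 ms.
+// Read of 1024..1026 (qty 3): includes a 0 ms tag, so the whole read is uncached.
+```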
+
+The cache itself is described in detail in [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md). The properties most relevant to operators setting TTLs:
+
+- **Lookup order is cache → coalesce → backend.** A cache hit short-circuits the read coalescer entirely.
+- **Writes invalidate by address-range overlap.** A successful FC06 / FC16 response invalidates every cached FC03 / FC04 entry whose read range overlaps the write range — not just exact-key matches. Exception responses do not invalidate (the write did not take effect on the PLC).
+- **Cache stores post-rewriter bytes.** Hits never re-invoke the BCD rewriter. Tag-list reloads flush the affected PLC's whole cache so a rewriter-relevant change cannot serve stale post-rewriter bytes from before the change.
+- **Different `unitId` bytes never invalidate each other.** Invalidation is scoped to `(unitId, FC ∈ {3, 4})`.
+
+## Validation Rules
+
+`ReloadValidator.Validate` runs on every config load (startup and hot reload) and rejects the entire snapshot if any rule fails. On rejection at startup, the service exits non-zero. On rejection at runtime, the current in-memory config stays in effect and `mbproxy.config.reload.rejected` is logged at `Error`.
+
+Rules (in order):
+
+1. **PLC names**: every `Plcs[i].Name` is non-empty and unique (ordinal comparison).
+2. **ListenPort**: every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the array.
+3. **AdminPort**: in `[1, 65535]` and does not collide with any `ListenPort`.
+4. **BCD tag map** per PLC, delegated to `BcdTagMapBuilder.Build`, which rejects:
+   - duplicate addresses within a single PLC's resolved tag list
+   - 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry at that address
+5. **Cache TTL bounds**:
+   - any `CacheTtlMs` or `DefaultCacheTtlMs` less than 0 is rejected
+   - any `CacheTtlMs` or `DefaultCacheTtlMs` greater than `60_000` is rejected unless `Cache.AllowLongTtl = true`
+6. **Cache size knobs**: `Cache.MaxEntriesPerPlc >= 0`, `Cache.EvictionIntervalMs >= 0`.
+7. **Width**: every `BcdTagOptions.Width` is `16` or `32`. This check is enforced by `MbproxyOptionsValidator` during options validation, before `ReloadValidator` runs, and is listed here for completeness.
+
+Sample rejection messages (logged at `Error` with the structured property `errors` carrying the full list):
+
+```text
+Plc 'Line1-Mixer': Duplicate ListenPort 5020 (already used by 'Line1-Conveyor').
+AdminPort 5020 collides with ListenPort of PLC 'Line1-Mixer'.
+Plc 'Line1-Mixer': BCD tag map error (DuplicateAddress): address 1024 appears twice.
+BcdTags.Global Address 1024: CacheTtlMs=120000 exceeds 60_000 ms without Cache.AllowLongTtl=true.
+Plcs[2] (Line2-Press): DefaultCacheTtlMs must be >= 0.
+```
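+
+A hypothetical snapshot that would trip three of these rules is sketched below. It is constructed to mirror the sample messages above; the exact wording the validator would emit for it has not been verified.
+
+```jsonc
+{
+  "Mbproxy": {
+    "BcdTags": {
+      "Global": [
+        { "Address": 1024, "Width": 16, "CacheTtlMs": 120000 }  // rule 5: above 60_000 without Cache.AllowLongTtl
+      ]
+    },
+    "Plcs": [
+      { "Name": "Line1-Conveyor", "ListenPort": 5020, "Host": "10.0.1.2" },
+      { "Name": "Line1-Mixer",    "ListenPort": 5020, "Host": "10.0.1.1" }  // rule 2: duplicate ListenPort
+    ],
+    "AdminPort": 5020                                                       // rule 3: collides with a ListenPort
+  }
+}
+```
+
+Because the validator rejects the entire snapshot, none of the three entries would take effect until every rule passes.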
+
+Warning case (not a rejection):
+
+- `Plcs[i].BcdTags.Remove[]` entries that do not match any global tag address are logged as warnings — probably stale config, but the reload proceeds.
+
+Two rejection categories are handled earlier in the pipeline, before `ReloadValidator` runs:
+
+- **Type-mismatched / malformed JSON.** The .NET configuration binder rejects values whose type does not match the bound property (for example, a string in `BackendConnectTimeoutMs`). At startup this aborts the host; on hot reload the binder retains the previous snapshot and the reload never reaches `ReloadValidator`.
+- **Width invalid.** `MbproxyOptionsValidator` rejects any `BcdTagOptions.Width` that is not `16` or `32`. This runs as part of options validation before `ReloadValidator` and surfaces the same way as schema errors.
+
+See [`../Features/HotReload.md`](../Features/HotReload.md) for the full reload-acceptance flow, including the log event names emitted on acceptance (`mbproxy.config.reload.applied`) and rejection (`mbproxy.config.reload.rejected`).
+
+## Two Concrete Examples
+
+The minimal and production examples below are both complete `appsettings.json` snippets — paste either one and the service will start without further edits beyond the addresses and ports.
+
+### Minimal
+
+One PLC, no BCD tags, no cache. The proxy is pure pass-through.
+
+```jsonc
+{
+  "Mbproxy": {
+    "BcdTags": { "Global": [] },
+    "Plcs": [
+      {
+        "Name":       "Line1-Mixer",
+        "ListenPort": 5020,
+        "Host":       "10.0.1.1"
+      }
+    ],
+    "AdminPort": 8080
+  }
+}
+```
+
+Everything else picks up defaults: `Port = 502`, `Connection.BackendConnectTimeoutMs = 3000`, `Connection.BackendRequestTimeoutMs = 3000`, `Connection.GracefulShutdownTimeoutMs = 10000`, `Resilience.BackendConnect.MaxAttempts = 3`, `Resilience.ReadCoalescing.Enabled = true`, `Cache.AllowLongTtl = false`, `Cache.MaxEntriesPerPlc = 1000`, `Cache.EvictionIntervalMs = 5000`, and so on.
+
+Behaviour in this snapshot: every byte passes through unchanged in both directions, FC03 / FC04 reads are still subject to in-flight coalescing (the feature is on by default), and no responses are cached.
+
+### Production
+
+Three PLCs, a global BCD tag list, one PLC with overrides, cache enabled on hot reads.
+
+```jsonc
+{
+  "Mbproxy": {
+    "BcdTags": {
+      "Global": [
+        { "Address": 1024, "Width": 16 },                        // V2000 — 16-bit BCD counter
+        { "Address": 1056, "Width": 32 },                        // V2040 — 32-bit BCD total
+        { "Address": 1088, "Width": 16, "CacheTtlMs": 1000 }     // V2100 — setpoint, 1 s cache
+      ]
+    },
+    "Plcs": [
+      {
+        "Name":              "Line1-Mixer",
+        "ListenPort":        5020,
+        "Host":              "10.0.1.1",
+        "Port":              502,
+        "DefaultCacheTtlMs": 0,
+        "BcdTags": {
+          "Add":    [ { "Address": 1200, "Width": 32 } ],
+          "Remove": [ 1056 ]
+        }
+      },
+      {
+        "Name":       "Line1-Conveyor",
+        "ListenPort": 5021,
+        "Host":       "10.0.1.2"
+      },
+      {
+        "Name":              "Line2-Press",
+        "ListenPort":        5022,
+        "Host":              "10.0.2.1",
+        "DefaultCacheTtlMs": 500
+      }
+    ],
+    "AdminPort": 8080,
+    "Connection": {
+      "BackendConnectTimeoutMs":   3000,
+      "BackendRequestTimeoutMs":   3000,
+      "GracefulShutdownTimeoutMs": 10000
+    },
+    "Resilience": {
+      "BackendConnect":   { "MaxAttempts": 3, "BackoffMs": [ 100, 500, 2000 ] },
+      "ListenerRecovery": { "InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ], "SteadyStateMs": 30000 },
+      "ReadCoalescing":   { "Enabled": true, "MaxParties": 32 }
+    },
+    "Cache": {
+      "AllowLongTtl":       false,
+      "MaxEntriesPerPlc":   1000,
+      "EvictionIntervalMs": 5000
+    }
+  }
+}
+```
+
+In this snapshot, `Line1-Mixer` adds a 32-bit tag at `1200` and removes the global 32-bit tag at `1056`. `Line2-Press` opts every tag whose `CacheTtlMs` is `null` into a 500 ms cache via its `DefaultCacheTtlMs`. The setpoint at `1088` already has an explicit per-tag TTL, and that value wins.
+
+The effective tag map per PLC after resolution:
+
+| PLC | Effective tag list |
+|-----|--------------------|
+| `Line1-Mixer` | `1024` (16-bit), `1088` (16-bit, `CacheTtlMs = 1000`), `1200` (32-bit). `1056` is removed. |
+| `Line1-Conveyor` | `1024` (16-bit), `1056` (32-bit), `1088` (16-bit, `CacheTtlMs = 1000`). |
+| `Line2-Press` | `1024` (16-bit, effective `CacheTtlMs = 500` via PLC default), `1056` (32-bit, effective `CacheTtlMs = 500`), `1088` (16-bit, effective `CacheTtlMs = 1000` from explicit per-tag value). |
+
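+On `Line2-Press`, an FC03 / FC04 read that covers only tag `1088` resolves to the per-tag 1 s TTL. A read that spans tags with different TTLs takes `min(TTLs)` across the range, so a read covering both `1056` and `1088` on that PLC resolves to 500 ms. A read that includes a tag with an effective `CacheTtlMs = 0` is uncached even if every other tag in the range is opted in: on `Line1-Mixer`, where `DefaultCacheTtlMs` is `0`, a read spanning `1024` and `1088` is uncached.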
+
+## Hot-Reload Propagation Summary
+
+A reduced view of [`../Features/HotReload.md`](../Features/HotReload.md), restricted to the keys documented here. Every accepted reload emits `mbproxy.config.reload.applied` at `Information` with a summary of which PLCs were added or removed and the size of the tag-list delta.
+
+| Change | Propagation |
+|--------|-------------|
+| `BcdTags.Global` add / remove / width | Rewriter dereferences `IOptionsMonitor` per PDU. Next PDU sees the new map; in-flight PDUs are not retroactively touched. |
+| `Plcs[i].BcdTags.{Add,Remove}` | Same per-PDU resolution as above, scoped to the affected PLC. |
+| New `Plcs[i]` entry | Listener supervisor binds the new port under `ListenerRecovery`. |
+| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream connections for that PLC. |
+| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
+| `Connection.Backend*TimeoutMs` | Next backend connect or request uses the new value. |
+| `AdminPort` | Requires a service restart — the Kestrel admin host is built once at startup. |
+| `Resilience.ReadCoalescing.Enabled` | Hot-reloadable; in-flight coalesced entries drain naturally. |
+| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` | Tag-map reseat for the affected PLC drops that PLC's entire cache. |
+| `Cache.AllowLongTtl` / `MaxEntriesPerPlc` / `EvictionIntervalMs` | Enforced on next reload validation / next insert / next eviction tick respectively. |
+
+## Where Options Live in Code
+
+| Section | File | Binding class |
+|---------|------|---------------|
+| Root | `src/Mbproxy/Options/MbproxyOptions.cs` | `MbproxyOptions` |
+| `Plcs[]` | `src/Mbproxy/Options/PlcOptions.cs` | `PlcOptions` |
+| `BcdTags.Global[]` entry shape | `src/Mbproxy/Options/BcdTagOptions.cs` | `BcdTagOptions` |
+| `BcdTags.Global` / `Plcs[i].BcdTags` | `src/Mbproxy/Options/BcdTagListOptions.cs` | `BcdTagListOptions`, `PlcBcdOverrides` |
+| `Connection` | `src/Mbproxy/Options/ConnectionOptions.cs` | `ConnectionOptions` |
+| `Resilience` | `src/Mbproxy/Options/ResilienceOptions.cs` | `ResilienceOptions`, `RetryProfile`, `RecoveryProfile`, `ReadCoalescingOptions` |
+| `Cache` | `src/Mbproxy/Options/MbproxyOptions.cs` | `CacheOptions` (declared alongside `MbproxyOptions` in the same file) |
+| Schema validation | `src/Mbproxy/Options/MbproxyOptions.cs` | `MbproxyOptionsValidator` |
+| Reload validation | `src/Mbproxy/Configuration/ReloadValidator.cs` | `ReloadValidator` |
+| Tag-map resolution | `src/Mbproxy/Bcd/BcdTagMapBuilder.cs` | `BcdTagMapBuilder` |
+| Reload reconciliation | `src/Mbproxy/Configuration/ConfigReconciler.cs` | `ConfigReconciler`, `ReloadPlan` |
+
+All option classes are registered through `services.Configure<T>(...)` against the `Mbproxy:*` section in `Program.cs`. `IOptionsMonitor<T>` is the runtime read path; direct `IOptions<T>` injection is not used because it does not propagate reloads.
+
+## Related Documentation
+
+- [`../Features/HotReload.md`](../Features/HotReload.md) — reload acceptance flow and per-key propagation semantics
+- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — tag-list resolution, wire encoding, multi-tag overlap rules
+- [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) — cache contract, lookup order, write invalidation
+- [`./StatusPage.md`](./StatusPage.md) — schema served by `AdminPort`
+- [`./Troubleshooting.md`](./Troubleshooting.md) — Serilog block and common config rejection diagnostics
+- [`../../README.md`](../../README.md) — install and operational entry point
diff --git a/mbproxy/docs/Operations/StatusPage.md b/mbproxy/docs/Operations/StatusPage.md
new file mode 100644
index 0000000..6cd1c5f
--- /dev/null
+++ b/mbproxy/docs/Operations/StatusPage.md
@@ -0,0 +1,334 @@
+# Status Page
+
+The status page is the operator-facing view of the running service: an auto-refreshing HTML dashboard at `GET /` and a JSON twin at `GET /status.json` that monitoring scrapers consume. This document describes the endpoint surface, every wire-level field, and how counters map back to architecture decisions.
+
+## Endpoint Surface
+
+The admin endpoint is owned by `AdminEndpointHost` (see `src/Mbproxy/Admin/AdminEndpointHost.cs`). It exposes exactly two routes:
+
+- `GET /` — a single self-contained HTML document with a `<meta http-equiv="refresh" content="5">` tag. The page refreshes every five seconds by reload, not by JavaScript polling. There is no JS bundle, no external CSS, no remote fonts, and no favicon fetch.
+- `GET /status.json` — the same in-memory snapshot serialized as JSON via the source-generated `StatusJsonContext` (camelCase property names).
+
+The endpoint is **read-only**. No admin actions are exposed: no kick-client, no force-reload, no listener restart, no log download. Reload happens automatically via `IOptionsMonitor`; listener recovery is owned by the supervisor. The endpoint performs no authentication of its own; access control is left to the network layer. The service binds to `IPAddress.Any` on the admin port and assumes the deployment runs in a trusted internal segment behind a firewall.
+
+Both routes call `StatusSnapshotBuilder.Build()` for every request. The builder reads atomic counters directly from the supervisor map and per-PLC `ProxyCounters`; it holds no locks and performs no I/O.
+
+## Port and Configuration
+
+The listen port is read from `Mbproxy.AdminPort` and defaults to `8080`. Configuration semantics for this key live in [`./Configuration.md`](./Configuration.md).
+
+If Kestrel cannot bind the configured port at startup (port already in use, missing permissions on a reserved range, etc.) the host logs `mbproxy.admin.bind.failed` at `Error` level with the underlying reason. The host then sets `_app = null` and returns — the rest of the service keeps running. The Modbus listener supervisors are completely independent of the admin endpoint, so a bind failure here is non-fatal for proxying. See [`../Reference/LogEvents.md`](../Reference/LogEvents.md) for the event-id catalogue.
+
+If `Mbproxy.AdminPort` changes via hot-reload, the currently-running Kestrel app is stopped (2 s deadline) and a new one is started on the new port. Other config changes do not touch the admin endpoint.
+
+## Service-Wide Fields
+
+Top-level fields come from `ServiceFields` and `ListenersAggregate` in `src/Mbproxy/Admin/StatusDto.cs`.
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `service.uptimeSeconds` | `long` | `ServiceFields.UptimeSeconds` | Seconds since process start, computed as `now - ServiceCounters.StartedAtUtc` at snapshot time. |
+| `service.version` | `string` | `ServiceFields.Version` via `AssemblyVersionAccessor` | `AssemblyInformationalVersion` of the running assembly. Useful for confirming a deployment took effect. |
+| `service.configLastReloadUtc` | `DateTimeOffset?` | `ServiceCounters.LastReloadUtc` | Wall-clock time of the most recent **accepted** hot-reload. `null` if no reload has occurred since process start. See [`../Features/HotReload.md`](../Features/HotReload.md). |
+| `service.configReloadCount` | `int` | `ServiceCounters.ReloadAppliedCount` | Number of `appsettings.json` reloads that validated and applied since process start. |
+| `service.configReloadRejectedCount` | `int` | `ServiceCounters.ReloadRejectedCount` | Number of reload attempts rejected by validation. A non-zero value here paired with a stale `configLastReloadUtc` indicates the operator's last edit failed validation and the service is still running the previous config. |
+| `listeners.bound` | `int` | `boundCount` accumulated while iterating `opts.Plcs` | Count of PLC entries whose supervisor currently reports `SupervisorState.Bound`. |
+| `listeners.configured` | `int` | `opts.Plcs.Count` | Total number of PLC entries in the active configuration. |
+
+Operator triggers:
+
+- `listeners.bound < listeners.configured` for more than one refresh cycle indicates one or more listeners are stuck recovering. Drill into the per-PLC `listener.state` and `listener.lastBindError` fields below.
+- `configReloadRejectedCount` rising means edits are reaching the watcher but failing validation — check the live log for `mbproxy.config.reload.rejected`.
+
+## Per-PLC Fields
+
+Each entry in `plcs[]` is a `PlcStatus` (see `src/Mbproxy/Admin/StatusDto.cs`). The builder iterates `opts.Plcs` in configured order, looks up the matching supervisor in `ProxyWorker.Supervisors`, and projects the supervisor's `CurrentCounters.Snapshot()` into wire fields.
+
+### Identity
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `name` | `string` | `PlcOptions.Name` | Stable identifier from `appsettings.json`. Used as the dictionary key for supervisor lookup. |
+| `host` | `string` | `PlcOptions.Host` | Backend PLC host (IP or DNS name) the proxy connects out to. |
+| `listenPort` | `int` | `PlcOptions.ListenPort` | Local TCP port the proxy binds for upstream clients connecting *to* the proxy. |
+
+### Listener state
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `listener.state` | `string` | `SupervisorSnapshot.State` mapped to `"bound"` / `"recovering"` / `"stopped"` | Current supervisor state. `bound` = TCP listener is accepting connections; `recovering` = Polly retry loop is trying to re-bind after a fault; `stopped` = no supervisor entry (typically a PLC that was just added and not yet started). |
+| `listener.lastBindError` | `string?` | `SupervisorSnapshot.LastBindError` | Message from the last bind exception. Populated whenever `state == "recovering"`. Common values: `"Address already in use"`, `"Permission denied"`. |
+| `listener.recoveryAttempts` | `int` | `SupervisorSnapshot.RecoveryAttempts` | Number of bind retries since the supervisor entered recovery. Resets on a successful bind. A monotonically rising value indicates the underlying problem is persistent. |
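+
+For orientation, a `/status.json` excerpt for a listener stuck in recovery might look like the sketch below. The values are invented for illustration, and unrelated fields are omitted.
+
+```jsonc
+{
+  "listeners": { "bound": 1, "configured": 2 },    // one configured listener is not bound
+  "plcs": [
+    {
+      "name": "line1-press",
+      "listener": {
+        "state": "recovering",                      // retry loop is trying to re-bind
+        "lastBindError": "Address already in use",  // populated while recovering
+        "recoveryAttempts": 7                       // resets after a successful bind
+      }
+      // remaining per-PLC fields omitted
+    }
+  ]
+}
+```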
+
+### Client tracking
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
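+| `clients.connected` | `int` | `clientSnapshots.Count` | Number of currently-connected upstream clients on this listener. These clients share the PLC's single multiplexed backend connection (see [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md)), so the count reflects proxy-side sockets and is not bounded by the H2-ECOM100 four-connection ceiling. |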
+| `clients.remoteEndpoints[].remote` | `string` | `UpstreamPipe.RemoteEp` | Upstream TCP endpoint as `ip:port`. |
+| `clients.remoteEndpoints[].connectedAtUtc` | `DateTimeOffset` | `UpstreamPipe.ConnectedAtUtc` | Wall-clock time the upstream socket was accepted. Useful for spotting zombie sockets that survived a network outage. |
+| `clients.remoteEndpoints[].pdusForwarded` | `long` | `UpstreamPipe.PdusForwardedCount` | PDUs forwarded on this specific upstream pipe since it connected. Lets you see which client is responsible for what fraction of fleet traffic. |
+
+### PDU traffic
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `pdus.forwarded` | `long` | `CounterSnapshot.PdusForwarded` | Total PDUs (requests + responses) that traversed the proxy for this PLC since start. Increments once per PDU handed to the rewriter. |
+| `pdus.byFc.fc03` | `long` | `CounterSnapshot.Fc03` | Count of FC03 (read holding registers) requests seen. |
+| `pdus.byFc.fc04` | `long` | `CounterSnapshot.Fc04` | Count of FC04 (read input registers) requests seen. |
+| `pdus.byFc.fc06` | `long` | `CounterSnapshot.Fc06` | Count of FC06 (write single register) requests seen. |
+| `pdus.byFc.fc16` | `long` | `CounterSnapshot.Fc16` | Count of FC16 (write multiple registers) requests seen. |
+| `pdus.byFc.other` | `long` | `CounterSnapshot.FcOther` | All other function codes (FC01/02/05/15, diagnostic codes, etc.) seen. The proxy forwards these untouched. |
+| `pdus.rewrittenSlots` | `long` | `CounterSnapshot.RewrittenSlots` | Number of register slots the BCD rewriter touched, counting reads and writes. Indicates how much of the traffic actually hits BCD-configured addresses. See [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md). |
+| `pdus.partialBcdWarnings` | `long` | `CounterSnapshot.PartialBcdWarnings` | Count of requests whose `[start, start + qty)` range partially overlapped a 32-bit BCD tag without fully covering its CDAB word pair. A rising value here is an operator signal: an upstream client is requesting partial-overlap reads, which the proxy cannot rewrite safely. Review tag-list addresses or fix the client's request shape. |
+
+### Backend health
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
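+| `backend.connectsSuccess` | `long` | `CounterSnapshot.ConnectsSuccess` | Successful backend TCP connects since start. Under the single-backend-connection model (see [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md)), this increments each time the PLC's shared backend socket is established or re-established, not once per upstream client. |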
+| `backend.connectsFailed` | `long` | `CounterSnapshot.ConnectsFailed` | Failed backend TCP connects after the Polly retry budget is exhausted (3 attempts at 100/500/2000 ms). A rising counter means the backend host is unreachable or the PLC is at its connection cap. |
+| `backend.exceptionsByCode.code01` | `long` | `CounterSnapshot.BackendException01` | Count of Modbus exception responses with code 01 (Illegal Function) received from the PLC. Typically indicates a client is sending function codes the PLC does not support. |
+| `backend.exceptionsByCode.code02` | `long` | `CounterSnapshot.BackendException02` | Code 02 (Illegal Data Address) — the requested register range is out of the PLC's V-memory map. |
+| `backend.exceptionsByCode.code03` | `long` | `CounterSnapshot.BackendException03` | Code 03 (Illegal Data Value) — quantity exceeds the PLC's per-FC cap (FC03/04 = 128 registers, FC16 = 100). |
+| `backend.exceptionsByCode.code04` | `long` | `CounterSnapshot.BackendException04` | Code 04 (Server Device Failure) — internal PLC fault, often correlated with the PLC entering STOP mode. |
+| `backend.lastRoundTripMs` | `double` | `CounterSnapshot.LastRoundTripMs` | Exponentially-weighted moving average of recent successful request → response round-trip times in milliseconds. Tracks PLC responsiveness; sustained values above the historical baseline indicate backend latency degradation. |
+
+### Multiplexer state
+
+These five fields describe the per-PLC backend multiplexer. See [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md) for the design rationale and how transaction-id (TxId) reuse and queueing work.
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `backend.inFlight` | `long` | `CounterSnapshot.InFlightCount` | Number of MBAP transactions currently in flight on the backend socket (request sent, response pending). |
+| `backend.maxInFlight` | `long` | `CounterSnapshot.MaxInFlight` | High-water mark of `inFlight` since start. Used to size the queue and to verify the multiplexer is in fact pipelining requests. |
+| `backend.txIdWraps` | `long` | `CounterSnapshot.TxIdWraps` | Times the 16-bit MBAP transaction-id allocator has wrapped through `0xFFFF`. A rising rate quantifies sustained request volume. |
+| `backend.disconnectCascades` | `long` | `CounterSnapshot.BackendDisconnectCascades` | Times a backend disconnect cascaded into closing all upstream pipes that were waiting on in-flight TxIds. Each cascade aborts every queued request bound for that PLC. |
+| `backend.queueDepth` | `long` | `CounterSnapshot.BackendQueueDepth` | Current count of requests queued behind the multiplexer's TxId allocator and write semaphore. A sustained non-zero queue means the multiplexer is the bottleneck (backend slower than upstream demand). |
+
+### Coalescing counters
+
+These fields describe duplicate-read coalescing on FC03/FC04. See [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md) for the matching criteria and lifecycle.
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `backend.coalescedHitCount` | `long` | `CounterSnapshot.CoalescedHitCount` | Reads that attached to an already-in-flight identical read instead of issuing a new backend request. |
+| `backend.coalescedMissCount` | `long` | `CounterSnapshot.CoalescedMissCount` | Reads that did not find a matching in-flight request and issued their own. The dashboard-side ratio is `hit / (hit + miss)`; the wire format intentionally does **not** carry the derived ratio (consumers compute it). |
+| `backend.coalescedResponseToDeadUpstream` | `long` | `CounterSnapshot.CoalescedResponseToDeadUpstream` | Coalesced responses that arrived after their attached upstream pipe had closed. Normal in bursty traffic; sustained growth indicates upstream clients are aborting too quickly. |
+
+### Cache counters
+
+These fields describe the short-TTL response cache for FC03/FC04. See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `backend.cacheHitCount` | `long` | `CounterSnapshot.CacheHitCount` | Reads served from the cache without touching the backend at all. |
+| `backend.cacheMissCount` | `long` | `CounterSnapshot.CacheMissCount` | Cache-eligible reads that fell through to the backend. The derived `cacheHitRatio` is `hit / (hit + miss)`; like coalescing, it is **not** carried on the wire. |
+| `backend.cacheInvalidations` | `long` | `CounterSnapshot.CacheInvalidations` | Times a write (FC06/FC16) invalidated overlapping cache entries on this PLC. A high invalidation rate relative to writes means write coverage is broad and the cache is doing less work. |
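+
+Since neither derived ratio is carried on the wire, a consumer computes both from the raw counters. The excerpt below uses invented counter values and shows the arithmetic in comments.
+
+```jsonc
+// Excerpt of one plcs[] entry. Counter values are illustrative.
+"backend": {
+  "coalescedHitCount":  1500,
+  "coalescedMissCount": 500,    // coalescing ratio = 1500 / (1500 + 500) = 0.75
+  "cacheHitCount":      400,
+  "cacheMissCount":     1600    // cache hit ratio  = 400 / (400 + 1600) = 0.20
+}
+```
+
+A scraper that computes these ratios should guard against `hit + miss = 0`, which is the state immediately after startup.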
+
+### Cache memory-watch
+
+These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is per-PLC; the dashboard aggregates these across the fleet.
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. |
+| `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. |
+
+### Bytes
+
+| JSON path | Type | Source | Meaning |
+|---|---|---|---|
+| `bytes.upstreamIn` | `long` | `CounterSnapshot.BytesUpstreamIn` | Total bytes read from upstream client sockets bound to this PLC since start. |
+| `bytes.upstreamOut` | `long` | `CounterSnapshot.BytesUpstreamOut` | Total bytes written back to upstream client sockets bound to this PLC since start. |
+
+## Counter Atomicity
+
+All counters are `long` fields updated through `System.Threading.Interlocked`. Each read in `StatusSnapshotBuilder.Build()` is atomic per field; no locks are held across the snapshot build, and the build itself does no I/O.
+
+The practical consequence: a single `/status.json` request returns a coherent value for any **one** counter, but the assembled response is **not** a globally consistent snapshot — different per-PLC counters may straddle increments by microseconds. For example, `pdus.forwarded` for PLC A and `pdus.forwarded` for PLC B are not guaranteed to reflect the same instant. This is acceptable for dashboards and rate calculations; do not use these counters for fine-grained accounting.
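+
+The rate calculations mentioned above reduce to differencing two samples of one counter. A minimal Python sketch, assuming two scrapes of the same field and the wall-clock time between them; `counter_rate` is a hypothetical helper, not part of mbproxy, and the reset handling assumes counters restart from zero when the service restarts:
+
+```python
+from typing import Optional
+
+
+def counter_rate(prev: int, curr: int, elapsed_s: float) -> Optional[float]:
+    """Per-second rate between two samples of a monotonically increasing counter.
+
+    Returns None when no rate can be computed: no elapsed time, or the
+    counter went backwards (for example after a service restart).
+    """
+    if elapsed_s <= 0 or curr < prev:
+        return None
+    return (curr - prev) / elapsed_s
+
+
+# Two hypothetical samples of plcs[0].pdus.forwarded taken 5 seconds apart.
+rate = counter_rate(prev=225_221, curr=225_721, elapsed_s=5.0)
+print(rate)  # 100.0
+```
+
+Because each counter is read independently, compute rates per counter from successive scrapes rather than combining counters from different PLCs within one snapshot.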
+
+## Example JSON Response
+
+A representative two-PLC deployment, ~2 hours into a run:
+
+```json
+{
+  "service": {
+    "uptimeSeconds": 7234,
+    "version": "1.0.0",
+    "configLastReloadUtc": "2026-05-13T14:02:11+00:00",
+    "configReloadCount": 2,
+    "configReloadRejectedCount": 0
+  },
+  "listeners": {
+    "bound": 2,
+    "configured": 2
+  },
+  "plcs": [
+    {
+      "name": "line1-press",
+      "host": "10.20.30.41",
+      "listenPort": 5021,
+      "listener": {
+        "state": "bound",
+        "lastBindError": null,
+        "recoveryAttempts": 0
+      },
+      "clients": {
+        "connected": 2,
+        "remoteEndpoints": [
+          {
+            "remote": "10.20.40.10:51223",
+            "connectedAtUtc": "2026-05-13T12:01:55+00:00",
+            "pdusForwarded": 184213
+          },
+          {
+            "remote": "10.20.40.11:53901",
+            "connectedAtUtc": "2026-05-13T13:30:02+00:00",
+            "pdusForwarded": 41008
+          }
+        ]
+      },
+      "pdus": {
+        "forwarded": 225221,
+        "byFc": {
+          "fc03": 218904,
+          "fc04": 0,
+          "fc06": 12,
+          "fc16": 6203,
+          "other": 102
+        },
+        "rewrittenSlots": 1318622,
+        "partialBcdWarnings": 0
+      },
+      "backend": {
+        "connectsSuccess": 2,
+        "connectsFailed": 0,
+        "exceptionsByCode": {
+          "code01": 0,
+          "code02": 14,
+          "code03": 0,
+          "code04": 0
+        },
+        "lastRoundTripMs": 12.4,
+        "inFlight": 1,
+        "maxInFlight": 4,
+        "txIdWraps": 3,
+        "disconnectCascades": 0,
+        "queueDepth": 0,
+        "coalescedHitCount": 41892,
+        "coalescedMissCount": 177012,
+        "coalescedResponseToDeadUpstream": 7,
+        "cacheHitCount": 88321,
+        "cacheMissCount": 88691,
+        "cacheInvalidations": 6203,
+        "cacheEntryCount": 47,
+        "cacheBytes": 18512
+      },
+      "bytes": {
+        "upstreamIn": 4108290,
+        "upstreamOut": 12993021
+      }
+    },
+    {
+      "name": "line2-oven",
+      "host": "10.20.30.42",
+      "listenPort": 5022,
+      "listener": {
+        "state": "recovering",
+        "lastBindError": "Address already in use",
+        "recoveryAttempts": 12
+      },
+      "clients": {
+        "connected": 0,
+        "remoteEndpoints": []
+      },
+      "pdus": {
+        "forwarded": 0,
+        "byFc": { "fc03": 0, "fc04": 0, "fc06": 0, "fc16": 0, "other": 0 },
+        "rewrittenSlots": 0,
+        "partialBcdWarnings": 0
+      },
+      "backend": {
+        "connectsSuccess": 0,
+        "connectsFailed": 0,
+        "exceptionsByCode": { "code01": 0, "code02": 0, "code03": 0, "code04": 0 },
+        "lastRoundTripMs": 0.0,
+        "inFlight": 0,
+        "maxInFlight": 0,
+        "txIdWraps": 0,
+        "disconnectCascades": 0,
+        "queueDepth": 0,
+        "coalescedHitCount": 0,
+        "coalescedMissCount": 0,
+        "coalescedResponseToDeadUpstream": 0,
+        "cacheHitCount": 0,
+        "cacheMissCount": 0,
+        "cacheInvalidations": 0,
+        "cacheEntryCount": 0,
+        "cacheBytes": 0
+      },
+      "bytes": { "upstreamIn": 0, "upstreamOut": 0 }
+    }
+  ]
+}
+```
+
+## HTML Page Layout
+
+The HTML renderer is `StatusHtmlRenderer.Render(StatusResponse)` in `src/Mbproxy/Admin/StatusHtmlRenderer.cs`. The page is one document, inline CSS in a `<style>` block, no external resources of any kind — operators can serve it behind a corporate firewall without whitelisting a CDN.
+
+Structure:
+
+1. **Header summary** — version, formatted uptime (`Nh MMm SSs`), `bound/configured` listener tally, last reload timestamp, reload count with a `(N rejected)` suffix when applicable.
+2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell.
+3. **State cell error detail** — when `state == "recovering"`, the cell also shows `lastBindError` and `(attempt N)` in a small red span.
+
+The coalescing and cache cells each render as `<pct>% (<hits>)`. When a cell's feature has not been exercised (`hit + miss == 0`), that cell renders an em-dash instead, which keeps the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
+
+The page does not depend on JavaScript. Refresh is driven entirely by the `<meta http-equiv="refresh" content="5">` tag, so any browser — including text-mode browsers — sees the same view.
+
+## How to Scrape It
+
+The JSON twin is plain HTTP. Any monitoring system that can curl an endpoint can scrape it.
+
+PowerShell, pulling the cache hit ratio for the first PLC into a variable:
+
+```powershell
+$snap = Invoke-WebRequest -Uri "http://mbproxy-host:8080/status.json" -UseBasicParsing |
+        Select-Object -ExpandProperty Content |
+        ConvertFrom-Json
+
+$plc = $snap.plcs[0]
+$hits  = $plc.backend.cacheHitCount
+$total = $hits + $plc.backend.cacheMissCount
+$ratio = if ($total -gt 0) { [math]::Round(100.0 * $hits / $total, 1) } else { 0.0 }
+
+"PLC $($plc.name): cache hit ratio = $ratio% over $total reads"
+```
+
+Bash with `curl` and `jq`, fanning out across the fleet:
+
+```bash
+curl -s http://mbproxy-host:8080/status.json |
+  jq -r '.plcs[] | "\(.name)\t\(.listener.state)\t\(.backend.lastRoundTripMs)"'
+```
+
+Prometheus-style scrapers should poll `/status.json` directly and translate fields into their own metric names; the service does not expose Prometheus exposition format.
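+
+As an illustration of that translation step, here is a minimal Python sketch that turns a parsed `/status.json` snapshot into Prometheus text-format samples. The metric names and the `to_exposition` helper are hypothetical, and label-value escaping is omitted for brevity; only the JSON field names come from this document:
+
+```python
+import json
+
+
+def to_exposition(snapshot: dict) -> str:
+    """Render a few /status.json counters as Prometheus text-format samples."""
+    lines = []
+    for plc in snapshot["plcs"]:
+        label = '{plc="%s"}' % plc["name"]
+        backend = plc["backend"]
+        # Metric names are hypothetical; pick names that fit your conventions.
+        lines.append(f"mbproxy_pdus_forwarded_total{label} {plc['pdus']['forwarded']}")
+        lines.append(f"mbproxy_cache_hits_total{label} {backend['cacheHitCount']}")
+        lines.append(f"mbproxy_cache_misses_total{label} {backend['cacheMissCount']}")
+    return "\n".join(lines) + "\n"
+
+
+# A trimmed snapshot using values from the example response above.
+sample = json.loads(
+    '{"plcs": [{"name": "line1-press", "pdus": {"forwarded": 225221},'
+    ' "backend": {"cacheHitCount": 88321, "cacheMissCount": 88691}}]}'
+)
+print(to_exposition(sample), end="")
+```
+
+Exporting the raw hit and miss counters, rather than a precomputed ratio, lets the monitoring system derive the ratio over whatever window it needs.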
+
+## Where the KPIs Live
+
+This document covers the **endpoint surface**: what is on the wire and how each field is computed. The **dashboard composition** — which counters roll up into which Grafana panels, alerting thresholds, fleet-aggregate definitions — lives in [`../kpi.md`](../kpi.md). Keep the two documents disjoint: when a new counter is added, list it here; when a new panel or rate calculation is added, add it to `kpi.md`.
+
+## Related Documentation
+
+- [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md) — multiplexer counter meanings (`inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades`).
+- [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md) — coalescing counter meanings and matching criteria.
+- [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) — cache counter meanings, TTL, invalidation rules.
+- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — what increments `rewrittenSlots` and `partialBcdWarnings`.
+- [`../Features/HotReload.md`](../Features/HotReload.md) — what increments `configReloadCount` vs. `configReloadRejectedCount`.
+- [`./Configuration.md`](./Configuration.md) — `Mbproxy.AdminPort` and other option keys.
+- [`./Troubleshooting.md`](./Troubleshooting.md) — using these counters to diagnose specific failure modes.
+- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — event-id catalogue including `mbproxy.admin.bind.failed`.
+- [`../kpi.md`](../kpi.md) — dashboard catalog that consumes these counters.
diff --git a/mbproxy/docs/Operations/Troubleshooting.md b/mbproxy/docs/Operations/Troubleshooting.md
new file mode 100644
index 0000000..8c07d45
--- /dev/null
+++ b/mbproxy/docs/Operations/Troubleshooting.md
@@ -0,0 +1,364 @@
+# Troubleshooting
+
+Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps.
+
+The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the Windows Application Event Log under source `mbproxy`.
+
+## Service Startup Failures
+
+### Listener port already in use
+
+**Symptom.** Service starts but one or more PLCs show `listener.state = "recovering"` on the status page. `plcs[].listener.lastBindError` contains the OS error text (typically `Only one usage of each socket address ... is normally permitted`).
+
+**Where to look.**
+
+- Log event: `mbproxy.startup.bind.failed` (Error) with `Plc`, `Port`, `Reason` properties.
+- Followed periodically by retries; success eventually logs `mbproxy.listener.recovered`.
+- Status fields: `plcs[].listener.state`, `plcs[].listener.lastBindError`, `plcs[].listener.recoveryAttempts`.
+
+**Root causes.**
+
+- Another mbproxy instance is already running against the same `appsettings.json`.
+- A stale `mbproxy` process is holding the port after a non-graceful stop.
+- A different service (a previous Modbus gateway, a leftover test harness) is bound to the configured `ListenPort`.
+
+**Remediation.**
+
+1. Identify the process holding the port:
+
+   ```powershell
+   netstat -ano | findstr :<port>
+   Get-Process -Id <pid>
+   ```
+
+2. Stop the conflicting process, or change `Plcs[].ListenPort` in `appsettings.json` and save. The supervisor retries on a Polly schedule (1s / 2s / 5s / 15s / 30s, then 30s indefinitely) — watch for `mbproxy.listener.recovered` to confirm.
+3. If the listener never recovers, check the Event Log for the underlying reason text and verify the configured IP address is bound on the host (the proxy binds to the host's IPs, not the PLC's).
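+
+The retry schedule quoted in step 2 can be modelled to estimate how long a recovery may lag behind the fix. A minimal Python sketch of the delay sequence only; `listener_retry_delays` is a hypothetical name, and the real schedule is driven by Polly inside the service:
+
+```python
+from itertools import chain, islice, repeat
+
+
+def listener_retry_delays():
+    """Return an iterator of bind-retry delays in seconds: 1, 2, 5, 15, 30, then 30 forever."""
+    return chain([1, 2, 5, 15, 30], repeat(30))
+
+
+# First eight delays, and the total time they span.
+delays = list(islice(listener_retry_delays(), 8))
+print(delays)       # [1, 2, 5, 15, 30, 30, 30, 30]
+print(sum(delays))  # 143
+```
+
+Once the schedule has reached its 30-second plateau, expect up to 30 seconds between freeing the port and seeing `mbproxy.listener.recovered`.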
+
+### Admin endpoint port collision
+
+**Symptom.** Status page is unreachable; Modbus traffic continues to flow normally.
+
+**Where to look.**
+
+- Log event: `mbproxy.admin.bind.failed` (Error) with `Port`, `Reason` properties.
+- The matching success event `mbproxy.admin.started` is absent from the same boot.
+
+**Root causes.**
+
+- Another HTTP service (IIS, a sidecar dashboard, a previous mbproxy instance) is bound to `AdminPort`.
+- A firewall rule is rejecting the bind on the configured port.
+
+**Remediation.**
+
+1. The Modbus proxy continues to forward traffic — only telemetry is affected. There is no urgency from a traffic-flow perspective.
+2. Identify the process holding the admin port with the same `netstat -ano | findstr :<port>` pattern.
+3. Change `Mbproxy.AdminPort` in `appsettings.json` and save. Admin rebinding is hot-reloadable; no service restart is required.
+
+### Malformed appsettings.json at startup
+
+**Symptom.** Service fails to enter the `RUNNING` state. `sc.exe start mbproxy` reports a startup failure.
+
+**Where to look.**
+
+- Rolling log at `C:\ProgramData\mbproxy\logs\` for the most recent date — startup errors include the JSON parse exception with line/column.
+- Windows Event Log under source `mbproxy` for the Error-level entry mirrored from the rolling log.
+
+**Root causes.**
+
+- Trailing comma, unbalanced braces, or stray comment in `appsettings.json`.
+- A required section (`Plcs`, `BcdTags`) is missing or has the wrong shape.
+- A field that must be an integer is quoted as a string.
+
+**Remediation.**
+
+1. Open `C:\ProgramData\mbproxy\appsettings.json` and validate it as JSON (use any editor with JSON linting).
+2. Fix the structural error reported in the log and save.
+3. Start the service with `sc.exe start mbproxy`.
+
+## Connectivity Failures Between Proxy and PLC
+
+### Backend connect refused
+
+**Symptom.** Upstream clients can connect to the proxy, but their reads/writes either return Modbus exception `0x0B` or the proxy closes the client socket. `plcs[].backend.connectsFailed` on the status page rises while `connectsSuccess` stays flat.
+
+**Where to look.**
+
+- Log event: `mbproxy.backend.failed` (Warning) with `Plc`, `Reason`.
+- Status fields: `plcs[].backend.connectsFailed`, `plcs[].backend.connectsSuccess`.
+
+**Root causes.**
+
+- PLC powered off, rebooting, or its ECOM/EBC coprocessor is faulted.
+- Wrong `Host` or `Port` configured for the PLC in `appsettings.json`.
+- A network ACL or firewall change is blocking the proxy host from reaching the PLC on TCP 502.
+- The H2-ECOM100 already has its cap of 4 simultaneous TCP clients in use — the 5th connection is refused at the device.
+
+**Remediation.**
+
+1. Confirm the PLC is reachable from the proxy host:
+
+   ```powershell
+   Test-NetConnection -ComputerName <plc-ip> -Port 502
+   ```
+
+2. Verify the host/port in `appsettings.json` matches the PLC's actual settings (see `DL260/mbtcp_settings.JPG` for the as-deployed values).
+3. If `Test-NetConnection` succeeds but the proxy still fails, inspect the upstream client count for that PLC on the status page — if it is at 4 and a new connect attempt fires, the ECOM cap is the cause.
+4. If the PLC has rebooted, the supervisor retries automatically on the Polly backend-connect pipeline (3 attempts at 100ms / 500ms / 2000ms per upstream request).
+
+### Backend disconnect cascade
+
+**Symptom.** All upstream clients for a single PLC disconnect at the same instant. `plcs[].backend.disconnectCascades` on the status page increments by the number of upstream clients that were attached at the time. Upstream clients reconnect on their next request.
+
+**Where to look.**
+
+- Log event: `mbproxy.multiplex.backend.disconnected` (Warning) with `Plc`, `UpstreamCount`, `InFlightCount`, `Reason`.
+- Status field: `plcs[].backend.disconnectCascades`.
+
+**Root causes.**
+
+- PLC rebooted, reset its ECOM, or dropped the TCP session.
+- A middlebox (switch, firewall) timed out the idle connection. The DL205/DL260 family does not emit TCP keepalive, so idle paths die silently after typically 2–5 minutes.
+- A network event (link flap, switch reset) closed the path.
+
+**Remediation.**
+
+1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds.
+2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair.
+3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause; reduce the upstream client's poll interval below the middlebox idle timeout to keep traffic flowing.
+
+### Request timeout watchdog firing
+
+**Symptom.** Upstream clients receive Modbus exception `0x0B` (Gateway Target Device Failed To Respond) with the original transaction ID preserved. The backend socket stays up — only individual requests time out.
+
+**Where to look.**
+
+- Log event: `mbproxy.multiplex.request.timeout` (Warning) with `Plc`, `ProxyTxId`, `OriginalTxId`, `Fc`, `ElapsedMs`.
+- Status field: `plcs[].backend.lastRoundTripMs` (EWMA over recent successful round-trips — climbs as the PLC slows down).
+
+**Root causes.**
+
+- PLC scan time has grown beyond `Connection.BackendRequestTimeoutMs` (default 3000) under load.
+- A PLC firmware quirk is dropping responses or echoing the wrong MBAP transaction ID.
+- In test environments only, pymodbus 3.13.0's concurrent-multiplexed-request bug delivers the response under a different `OriginalTxId` than was sent — see [`../Testing/Simulator.md`](../Testing/Simulator.md).
+
+**Remediation.**
+
+1. Confirm the PLC is healthy — the EWMA in `plcs[].backend.lastRoundTripMs` should sit well below the configured timeout. If it is creeping up, the PLC itself is overloaded.
+2. If the PLC's scan time legitimately exceeds the default, raise `Connection.BackendRequestTimeoutMs`. The change is hot-reloadable; the next request uses the new value.
+3. The proxy does not retry timed-out FC06 / FC16 requests: they are non-idempotent on BCD tags, and a partially applied multi-register write could leave a 32-bit pair mid-transition. Upstream clients are responsible for their own retry policy.
+
+## Configuration Hot-Reload Failures
+
+### Reload rejected
+
+**Symptom.** A save to `%ProgramData%\mbproxy\appsettings.json` is not picked up. The running config stays in effect. `service.configReloadRejectedCount` on the status page increments by one; `service.configLastReloadUtc` does not advance.
+
+**Where to look.**
+
+- Log event: `mbproxy.config.reload.rejected` (Error) with the joined `Errors` string.
+- Status fields: `service.configReloadCount`, `service.configReloadRejectedCount`, `service.configLastReloadUtc`.
+
+**Root causes.**
+
+- Malformed JSON (parse error).
+- Duplicate `Plcs[].ListenPort` across two PLCs.
+- Duplicate BCD `Address` within one tag list.
+- A 32-bit BCD pair whose high register overlaps with a separate 16-bit entry at the next address.
+- A `CacheTtlMs` (or per-PLC `DefaultCacheTtlMs`) exceeding 60 000 ms without `Cache.AllowLongTtl = true` to opt in.
+- Schema error (a required field is missing or has the wrong type).
+
+**Remediation.**
+
+1. Read the `Errors` property on the rejection log event — it lists every validation failure for the rejected file as a single semicolon-separated string.
+2. Fix the file and save again. The next valid write is accepted and `mbproxy.config.reload.applied` is logged.
+3. Reload is all-or-nothing — there is no partial application. The previous valid config keeps running until the next valid write.
+4. For the full validation rule set, see [`../Features/HotReload.md`](../Features/HotReload.md).
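+
+To make the all-or-nothing behaviour concrete, here is a minimal Python sketch that collects several of the failures listed under root causes and joins them with semicolons. The config shape (per-PLC `Name` and `BcdTags`, a top-level `Cache.AllowLongTtl`) is an assumption for illustration, and `validate` is a hypothetical function; the real rules live in the service and are documented in `HotReload.md`:
+
+```python
+from collections import Counter
+from typing import List
+
+
+def validate(config: dict) -> List[str]:
+    """Collect every validation failure instead of stopping at the first."""
+    errors = []
+    ports = Counter(plc["ListenPort"] for plc in config["Plcs"])
+    errors += [f"duplicate ListenPort {p}" for p, n in ports.items() if n > 1]
+    allow_long = config.get("Cache", {}).get("AllowLongTtl", False)
+    for plc in config["Plcs"]:
+        tags = plc.get("BcdTags", [])
+        addresses = Counter(tag["Address"] for tag in tags)
+        errors += [
+            f"{plc['Name']}: duplicate BCD Address {a}"
+            for a, n in addresses.items() if n > 1
+        ]
+        for tag in tags:
+            ttl = tag.get("CacheTtlMs") or 0
+            if ttl > 60_000 and not allow_long:
+                errors.append(f"{plc['Name']}: CacheTtlMs {ttl} needs Cache.AllowLongTtl")
+    return errors
+
+
+# Hypothetical config with three problems; any one of them rejects the whole file.
+bad = {
+    "Plcs": [
+        {"Name": "a", "ListenPort": 5021, "BcdTags": [{"Address": 100}, {"Address": 100}]},
+        {"Name": "b", "ListenPort": 5021, "BcdTags": [{"Address": 7, "CacheTtlMs": 90_000}]},
+    ]
+}
+print("; ".join(validate(bad)))
+```
+
+Reporting every failure at once means one corrected save can clear the rejection, instead of one save per error.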
+
+## BCD Rewriter Anomalies
+
+### Partial-overlap warnings
+
+**Symptom.** The status page's `plcs[].pdus.partialBcdWarnings` counter rises. Upstream clients see raw nibble values instead of decoded integers for a 32-bit BCD pair.
+
+**Where to look.**
+
+- Log event: `mbproxy.rewrite.partial_bcd` (Warning) with `Plc`, `Address`, `ClientStart`, `ClientQty`.
+- Status field: `plcs[].pdus.partialBcdWarnings`.
+
+**Root causes.**
+
+- An upstream client requested quantity = 1 at the low address of a configured 32-bit BCD pair (the proxy will not rewrite half a pair).
+- An upstream client read or wrote the high address of a 32-bit pair on its own.
+- A client tag definition specifies the wrong word width for the configured BCD address.
+
+**Remediation.**
+
+1. Fix the client's tag definition: 32-bit BCD addresses must be accessed as quantity = 2 starting at the low address.
+2. If the client genuinely intends to read half the pair (rare; not a normal workflow), remove the 32-bit entry from the BCD tag list and replace it with a single 16-bit entry at the address the client uses.
+3. Background reading: [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) covers the rewriter pipeline and the partial-overlap policy.
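+
+The partial-overlap condition can be checked by hand for any client request. A minimal Python sketch, assuming a 32-bit pair occupies the low address and the address after it; `classify_access` is a hypothetical helper, not the rewriter's actual code:
+
+```python
+def classify_access(start: int, qty: int, pair_low: int) -> str:
+    """Classify a request against a 32-bit BCD pair at pair_low and pair_low + 1."""
+    end = start + qty  # exclusive end of the requested range
+    covers_low = start <= pair_low < end
+    covers_high = start <= pair_low + 1 < end
+    if covers_low and covers_high:
+        return "full"
+    if covers_low or covers_high:
+        return "partial"
+    return "none"
+
+
+print(classify_access(start=100, qty=2, pair_low=100))  # full
+print(classify_access(start=100, qty=1, pair_low=100))  # partial
+print(classify_access(start=101, qty=1, pair_low=100))  # partial
+print(classify_access(start=102, qty=4, pair_low=100))  # none
+```
+
+Feeding `ClientStart`, `ClientQty`, and `Address` from a `mbproxy.rewrite.partial_bcd` log event into a check like this shows which half of the pair the client touched.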
+
+### Invalid BCD values
+
+**Symptom.** The status page's `plcs[].pdus` block is normal but `mbproxy.rewrite.invalid_bcd` warnings appear in the log. The affected register passes through raw.
+
+**Where to look.**
+
+- Log event: `mbproxy.rewrite.invalid_bcd` (Warning) with `Plc`, `Address`, `RawValue` (hex), `Direction` (`Read` or `Write`).
+- Status field: there is no dedicated counter for invalid-BCD events. They are folded into `plcs[].pdus.partialBcdWarnings` only when the same access is also a partial overlap, so log search is the primary diagnostic.
+
+**Root causes.**
+
+- The tag is misconfigured as BCD when the PLC is actually storing binary at that address. `0x1A2B` is not valid BCD because the nibbles `A` and `B` are outside 0–9.
+- The PLC ladder program wrote a non-BCD value to a register configured as a BCD tag (operator bug on the PLC side).
+- An upstream client is writing a value the proxy cannot encode into BCD (out-of-range decimal — for example, decimal 9 999 999 into a 16-bit BCD slot whose maximum is 9999).
+
+**Remediation.**
+
+1. Look at the `RawValue` in the log event. If it consistently contains nibbles `A`–`F`, the tag is almost certainly not BCD at the PLC — remove it from the BCD tag list.
+2. If the value is occasional and only appears under certain operator actions, inspect the PLC ladder program for code paths that write to that address.
+3. If the warning fires on writes (`Direction=Write`), the upstream client is sending a binary integer the proxy cannot represent in BCD. Validate the client-side value range.
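+
+Both directions of the check can be reproduced offline when triaging a `RawValue`. A minimal Python sketch for the 16-bit case; `is_valid_bcd16` and `encode_bcd16` are hypothetical helpers written for this page, not the proxy's codec:
+
+```python
+def is_valid_bcd16(raw: int) -> bool:
+    """True when all four nibbles of a 16-bit register are decimal digits."""
+    return all(((raw >> shift) & 0xF) <= 9 for shift in (12, 8, 4, 0))
+
+
+def encode_bcd16(value: int) -> int:
+    """Encode 0..9999 as 16-bit BCD; raise ValueError when out of range."""
+    if not 0 <= value <= 9999:
+        raise ValueError(f"{value} does not fit in a 16-bit BCD slot")
+    # Reading the decimal digits as hex digits yields the BCD bit pattern.
+    return int(str(value), 16)
+
+
+print(is_valid_bcd16(0x1A2B))   # False
+print(is_valid_bcd16(0x1234))   # True
+print(hex(encode_bcd16(1234)))  # 0x1234
+```
+
+The read-side check corresponds to step 1 (nibbles `A`–`F` in `RawValue`); the write-side range check corresponds to step 3.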
+
+## Performance Anomalies
+
+### Backend EWMA latency spiking
+
+**Symptom.** `plcs[].backend.lastRoundTripMs` on the status page climbs from its normal range (typically a few milliseconds on a healthy ECOM) toward `Connection.BackendRequestTimeoutMs`. Eventually some requests start timing out and the request-timeout symptom kicks in.
+
+**Where to look.**
+
+- Status field: `plcs[].backend.lastRoundTripMs` (EWMA with α = 0.2 over recent successful round-trips).
+- If timeouts begin, `mbproxy.multiplex.request.timeout` events appear.
+
+**Root causes.**
+
+- PLC is under unusually heavy ladder load and its Modbus scan slot is starving.
+- Network congestion between the proxy host and the PLC.
+- The PLC is sharing its ECOM module with other Modbus clients (the proxy is not the only consumer).
+
+**Remediation.**
+
+1. Check the PLC's own diagnostics for scan-time growth or watchdog warnings.
+2. Check whether the proxy is the only consumer. If the ECOM is also serving another upstream tool, the two are contending for the same serialised processing slot.
+3. If the EWMA stabilises at a higher-but-still-safe value, consider raising `Connection.BackendRequestTimeoutMs` so individual slow responses do not start timing out.
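+
+To get a feel for how quickly the smoothed value reacts, the update can be simulated. A minimal Python sketch using the textbook EWMA recurrence with the α = 0.2 stated above; how the service seeds the first value is not covered here, and `ewma_update` is a hypothetical name:
+
+```python
+ALPHA = 0.2  # smoothing factor stated for lastRoundTripMs
+
+
+def ewma_update(previous: float, sample_ms: float, alpha: float = ALPHA) -> float:
+    """Blend one new round-trip sample into the running average."""
+    return alpha * sample_ms + (1 - alpha) * previous
+
+
+# A PLC that has been answering in 10 ms, then slows to 100 ms for five requests.
+value = 10.0
+for sample in [100.0] * 5:
+    value = ewma_update(value, sample)
+    print(round(value, 1))  # 28.0, 42.4, 53.9, 63.1, 70.5
+```
+
+After five slow samples the average has covered only about two thirds of the jump, so a rising `lastRoundTripMs` understates how slow the most recent responses are.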
+
+### Multiplexer queue depth growing
+
+**Symptom.** `plcs[].backend.queueDepth` on the status page is non-zero and trending up rather than oscillating near zero. The backend is being asked for more frames per unit time than it can serialise.
+
+**Where to look.**
+
+- Status field: `plcs[].backend.queueDepth` (current depth of the outbound channel feeding the backend writer task).
+- Cross-reference: `plcs[].clients.connected` (upstream demand) and `plcs[].backend.lastRoundTripMs` (backend service rate).
+
+**Root causes.**
+
+- More concurrent upstream clients are issuing reads than the ECOM's serialised loop can satisfy. The DL205/DL260 family processes Modbus requests one at a time.
+- A burst of large FC03/FC04 quantities is consuming wire time per request.
+
+**Remediation.**
+
+1. Enable the Phase-10 read coalescer if it is off: set `Resilience.ReadCoalescing.Enabled = true` in `appsettings.json`. Overlapping FC03/FC04 reads on the same address range fan out from a single backend round-trip — see [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md).
+2. Opt heavy-read tags into the response cache by setting `CacheTtlMs > 0` per tag (or `DefaultCacheTtlMs` per PLC). Cache hits never touch the backend — see [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
+3. Reduce upstream client poll rates against the affected PLC if neither feature is appropriate.
+4. Background reading on the per-PLC connection model and the queue's role: [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md).
+
+### Coalescing dead-upstream count rising
+
+**Symptom.** `plcs[].backend.coalescedResponseToDeadUpstream` rises steadily. Coalesced reads complete on the backend, but the upstream client that asked for the read has already disconnected before the response is fanned out.
+
+**Where to look.**
+
+- Log event: `mbproxy.coalesce.dead_upstream` (Debug) with `Plc`, `UnitId`, `Fc`, `Start`, `Qty`.
+- Status field: `plcs[].backend.coalescedResponseToDeadUpstream`.
+
+**Root causes.**
+
+- Upstream Modbus clients are configured with a read timeout shorter than the backend's actual response time. They disconnect and reconnect before the response arrives.
+- Upstream clients are deliberately short-lived (single-shot pollers that connect, send one request, close).
+- Network instability is killing upstream sockets mid-request.
+
+**Remediation.**
+
+1. This metric is informational and often benign. A small steady rate against short-lived pollers is expected.
+2. If the rate is high enough to matter, verify that the read timeouts of upstream clients (NModbus, generic Modbus clients) exceed `Connection.BackendRequestTimeoutMs` plus a margin for jitter.
+3. Cross-check `plcs[].backend.lastRoundTripMs` — if the backend is genuinely slow, the dead-upstream metric is a follow-on symptom of the latency-spike entry above; address that first.
+
+## Response Cache Anomalies
+
+### Cache hit ratio low when expected high
+
+**Symptom.** `plcs[].backend.cacheHitCount` is not rising even though the tag was opted into the cache. Reads are still hitting the backend.
+
+**Where to look.**
+
+- Status fields: `plcs[].backend.cacheHitCount`, `plcs[].backend.cacheMissCount`, `plcs[].backend.cacheEntryCount`.
+- Log event: `mbproxy.cache.miss` (Debug) — turn the log level up to confirm misses are firing for the expected addresses.
+
+**Root causes.**
+
+- The tag's `CacheTtlMs` is unset (null) and the per-PLC `DefaultCacheTtlMs` is `0` (the default). Cache is opt-in; absent a positive TTL, every read misses.
+- The last config reload was rejected, so the cache TTL change never took effect. Check `service.configReloadRejectedCount`.
+- Writes to the same address range are arriving fast enough to invalidate every cached entry before it is reused. Inspect `cacheInvalidations` next.
+
+**Remediation.**
+
+1. Confirm the configured cache fields appear in `/status.json` for the PLC. If `cacheEntryCount` is `0` after a sustained read load, the cache is not wired for that tag.
+2. Verify the most recent `service.configLastReloadUtc` matches the time you saved the file. If not, the reload was rejected — see the "Reload rejected" entry above.
+3. For the cache wiring rules and per-tag-versus-per-PLC precedence, see [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
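+
+The opt-in rule from the root causes can be written down as a small decision function. A minimal Python sketch, assuming a per-tag `CacheTtlMs` overrides the per-PLC `DefaultCacheTtlMs` whenever it is set, including an explicit `0`; that precedence is an assumption here, and `ResponseCache.md` is the authority:
+
+```python
+from typing import Optional
+
+
+def effective_ttl_ms(tag_ttl_ms: Optional[int], plc_default_ttl_ms: int = 0) -> int:
+    """Resolve the TTL for one tag: a per-tag value (assumed) wins over the PLC default."""
+    return plc_default_ttl_ms if tag_ttl_ms is None else tag_ttl_ms
+
+
+def is_cacheable(tag_ttl_ms: Optional[int], plc_default_ttl_ms: int = 0) -> bool:
+    """Only a positive effective TTL opts a tag into the cache."""
+    return effective_ttl_ms(tag_ttl_ms, plc_default_ttl_ms) > 0
+
+
+print(is_cacheable(None, 0))    # False: nothing opted in, every read misses
+print(is_cacheable(None, 500))  # True: PLC default applies
+print(is_cacheable(250, 0))     # True: per-tag opt-in
+```
+
+The first case is the most common cause of a flat `cacheHitCount`: both settings left at their defaults.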
+
+### Cache invalidations storming
+
+**Symptom.** `plcs[].backend.cacheInvalidations` rises at a rate close to the read rate. Cache hits are happening but each one is followed quickly by an invalidation.
+
+**Where to look.**
+
+- Status fields: `plcs[].backend.cacheInvalidations`, `plcs[].backend.cacheHitCount`, `plcs[].pdus.byFc.fc06`, `plcs[].pdus.byFc.fc16`.
+- Log event: `mbproxy.cache.invalidated` (Debug) with `Plc`, `UnitId`, `WriteStart`, `WriteQty`, `Count`.
+
+**Root causes.**
+
+- Frequent FC06 / FC16 writes target the same address range as the cached reads. The cache invalidates correctly on every overlapping write — but if writes outpace reads, the cache provides little net benefit.
+- The cached read range is larger than necessary and overlaps with unrelated write traffic.
+- The TTL is high relative to the natural data churn rate at the PLC.
+
+**Remediation.**
+
+1. Compare `cacheInvalidations` to `cacheHitCount`. When the invalidation rate approaches the read rate, the cache is working correctly but provides little net benefit.
+2. Lower the TTL on the affected tag (or remove it from the cache entirely).
+3. Verify the cached read range matches only the addresses the upstream actually needs — narrower ranges reduce overlap with write traffic.
+
+## Service Stop / Restart
+
+### Service will not stop cleanly within the graceful drain
+
+**Symptom.** `sc.exe stop mbproxy` takes the full `Connection.GracefulShutdownTimeoutMs` (default 10 000) or longer to return. The shutdown log line indicates non-zero in-flight requests at cancel time.
+
+**Where to look.**
+
+- Log event: `mbproxy.shutdown.complete` (Information) with `InFlightAtCancel`, `ElapsedMs`.
+- Windows Event Log for any Error-level events emitted during the shutdown window.
+
+**Root causes.**
+
+- `Connection.GracefulShutdownTimeoutMs` is shorter than the slowest in-flight request can complete in.
+- An in-flight request is stuck because the backend is unresponsive — the request will never return; only the deadline ends it.
+- The fleet is in a sustained busy state at the moment of stop, with many in-flight requests, and they cannot all complete within the configured budget.
+
+**Remediation.**
+
+1. Inspect `InFlightAtCancel` on the shutdown log line. Zero means the drain succeeded; non-zero means that many requests were cancelled by the deadline.
+2. Raise `Connection.GracefulShutdownTimeoutMs` if a slow-but-healthy backend is the cause. The change applies on the next `ApplicationStopping` event — restart the service to pick it up.
+3. If non-zero `InFlightAtCancel` correlates with `mbproxy.multiplex.request.timeout` events in the same window, the backend was unresponsive at stop time and no timeout extension would have helped — the proxy correctly proceeded with shutdown.
+4. The drain phase cancels remaining work cleanly; the service always reaches `STOPPED`. Persistent failure to reach `STOPPED` indicates a Windows service-control issue, not an mbproxy issue.
+
+## Related Documentation
+
+- [Status page](./StatusPage.md)
+- [Configuration](./Configuration.md)
+- [Log event catalogue](../Reference/LogEvents.md)
+- [Connection model](../Architecture/ConnectionModel.md)
+- [Response cache](../Architecture/ResponseCache.md)
+- [Hot reload validation rules](../Features/HotReload.md)
+- [BCD rewriting](../Features/BcdRewriting.md)
+- [Read coalescing](../Architecture/ReadCoalescing.md)
+- [pymodbus simulator quirks](../Testing/Simulator.md)
diff --git a/mbproxy/docs/Reference/LogEvents.md b/mbproxy/docs/Reference/LogEvents.md
new file mode 100644
index 0000000..05b468b
--- /dev/null
+++ b/mbproxy/docs/Reference/LogEvents.md
@@ -0,0 +1,499 @@
+# Log Events
+
+The stable catalog of every `mbproxy.*` event name the service emits, with its level, structured properties, and operational meaning. Operators grep the rolling log against these names, dashboards filter on them, and alerting rules trigger on them — once shipped, the names do not churn.
+
+## Conventions and Wiring
+
+The service uses [Serilog](https://serilog.net/) wired through the `Microsoft.Extensions.Logging` bridge. Three sinks are configured (see `src/Mbproxy/HostingExtensions.cs`):
+
+- **Console** — written to stdout for interactive `--console` runs and for the SCM stdout capture.
+- **Rolling file** — under `%ProgramData%\mbproxy\logs\` (`mbproxy-<date>.log`).
+- **Windows Event Log** — only when running as a Windows Service, and only for events at `Error` and above (see `src/Mbproxy/Diagnostics/EventLogBridge.cs`).
+
+Every event uses source-generated `[LoggerMessage]` definitions, so the property names below match the message template token-for-token. The default minimum level is `Information`; lower the floor for `Mbproxy.*` categories via the standard `Logging:LogLevel` configuration to surface `Debug` events such as the coalesce and cache traces.
+
+```json
+{
+  "Logging": {
+    "LogLevel": {
+      "Mbproxy": "Debug"
+    }
+  }
+}
+```
+
+Every PLC-scoped event carries a `Plc` property (the configured PLC name from `appsettings.json`), so log lines for one device can be isolated from the fleet stream:
+
+```bash
+grep '"Plc":"Line1-Mixer"' mbproxy-20260514.log
+```
+
+Each H3 below is a stable event identifier; the dotted lowercase casing is part of the operator contract and is preserved verbatim. The numeric `EventId` is documented alongside each event name so Windows Event Viewer filters and Serilog `EventId`-based subscriptions remain stable too.
+
+## Service Lifecycle
+
+### mbproxy.startup.ready
+
+**Level:** Information &middot; **EventId:** 1 &middot; **Source:** `src/Mbproxy/Proxy/ProxyWorker.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `ListenersBound` | `int` | Number of per-PLC `TcpListener` instances successfully bound at startup. |
+| `PlcsConfigured` | `int` | Total PLC entries in the current configuration snapshot. |
+
+Fires once after `ProxyWorker.StartAsync` has spun up every per-PLC supervisor and the admin endpoint. `ListenersBound < PlcsConfigured` means at least one PLC failed its initial bind — the supervisor will keep retrying, but the gap is the operator's first signal.
+
+**Operator action:** if the two counts disagree, search for `mbproxy.startup.bind.failed` entries to identify the missing PLCs.
+
+### mbproxy.startup.bind
+
+**Level:** Information &middot; **EventId:** 20 (`PlcListener`) / 40 (`PlcListenerSupervisor`) &middot; **Source:** `src/Mbproxy/Proxy/PlcListener.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Port` | `int` | Local TCP port the listener is bound to. |
+
+Fires when a per-PLC `TcpListener` successfully binds its configured port. Emitted by both the listener itself and the supervisor wrapper — the two sites share the event name so dashboards filtering on `mbproxy.startup.bind` see both startup binds and post-recovery rebinds.
+
+**Operator action:** none in steady state. Use this event to confirm that a PLC added by hot-reload actually came up.
+
+### mbproxy.startup.bind.failed
+
+**Level:** Error &middot; **EventId:** 21 (`ProxyWorker`) / 41 (`PlcListenerSupervisor`) &middot; **Source:** `src/Mbproxy/Proxy/ProxyWorker.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Port` | `int` | Port that failed to bind. |
+| `Reason` | `string` | Bind exception message (usually `Address already in use` or a permissions error). |
+
+Fires when a listener fails to bind at process startup or after a configuration reload. The supervisor's recovery pipeline will keep retrying with the policy from `Resilience:ListenerRecovery`, so a single occurrence is not necessarily fatal.
+
+**Operator action:** check for port collisions (`netstat -ano | findstr :<port>`). If the conflict is another `mbproxy` instance from a botched uninstall, stop the stray process; if it is a third-party service, change the PLC's `Port` in `appsettings.json` and let hot-reload pick up the change.
+
+### mbproxy.listener.recovered
+
+**Level:** Information &middot; **EventId:** 42 &middot; **Source:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Port` | `int` | Port now bound. |
+| `AttemptCount` | `int` | Number of retry attempts the recovery pipeline executed before success. |
+
+Fires after the supervisor's Polly recovery pipeline successfully rebinds a listener that previously emitted `mbproxy.listener.faulted` or `mbproxy.listener.ended`. `AttemptCount` is cumulative for the current outage — useful for spotting a port that is repeatedly flapping.
+
+**Operator action:** none directly. Correlate with the immediately preceding fault to characterise the outage cause.
+
+### mbproxy.listener.faulted
+
+**Level:** Error (`PlcListener`) / Warning (`PlcListenerSupervisor`) &middot; **EventId:** 22 / 43 &middot; **Source:** `src/Mbproxy/Proxy/PlcListener.cs`, `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Port` | `int` | Port whose listener faulted. |
+| `Reason` | `string` | Top-level exception message. |
+
+Fires when a listener's accept loop throws. The two sources emit at different levels deliberately: the unsupervised `PlcListener` instance logs at `Error` (a terminal condition for that listener), while the supervised emission is `Warning` because Polly will retry. The supervised path attaches the exception object as the `LoggerMessage` exception parameter, so the stack trace is captured.
+
+**Operator action:** if the same `Plc` produces repeated faults inside a few minutes, inspect the network path. A burst of faults paired with `mbproxy.multiplex.backend.disconnected` indicates the PLC itself is unhealthy rather than a proxy issue.
+
+### mbproxy.listener.ended
+
+**Level:** Warning &middot; **EventId:** 44 &middot; **Source:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Port` | `int` | Port whose accept loop terminated. |
+
+Fires when an accept loop returns without throwing — usually because the underlying `TcpListener` was stopped without the supervisor requesting it. Treated as a fault: the recovery pipeline rebinds and a subsequent `mbproxy.listener.recovered` should follow.
+
+**Operator action:** none, unless no `mbproxy.listener.recovered` follows within the configured retry window.
+
+### mbproxy.admin.started
+
+**Level:** Information &middot; **EventId:** 70 &middot; **Source:** `src/Mbproxy/Admin/AdminEndpointHost.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Port` | `int` | TCP port the read-only admin endpoint is listening on. |
+
+Fires when the Kestrel-hosted admin endpoint has bound and is ready to serve `GET /` and `GET /status.json`.
+
+**Operator action:** none. Use this to confirm the configured `AdminPort` is actually serving — `curl http://localhost:<port>/status.json` should return immediately afterwards.
+
+### mbproxy.admin.bind.failed
+
+**Level:** Error &middot; **EventId:** 71 &middot; **Source:** `src/Mbproxy/Admin/AdminEndpointHost.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Port` | `int` | Port that failed to bind. |
+| `Reason` | `string` | Bind exception message. |
+
+Fires when the admin endpoint cannot bind its configured `AdminPort`. The service continues to proxy Modbus traffic — only the status page and `status.json` are unavailable.
+
+**Operator action:** change `Mbproxy:AdminPort` in `appsettings.json` to a free port. Hot-reload picks up the change; the admin endpoint rebinds without a service restart.
+
+### mbproxy.shutdown.complete
+
+**Level:** Information &middot; **EventId:** 80 &middot; **Source:** `src/Mbproxy/Diagnostics/ShutdownCoordinator.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `InFlightAtCancel` | `int` | Aggregate in-flight request count across all multiplexers at the moment the shutdown request was received. |
+| `ElapsedMs` | `long` | Wall-clock time spent draining, capped by `Connection:GracefulShutdownTimeoutMs`. |
+
+Fires once after the graceful drain completes (whether all in-flight requests finished or the timeout fired first). `ElapsedMs` close to the configured drain timeout indicates the drain budget was exhausted — some upstream clients saw forced disconnects.
+
+**Operator action:** if `InFlightAtCancel` is consistently large, consider raising `Connection:GracefulShutdownTimeoutMs` so restarts don't strand in-flight reads/writes.
+
+## Client Sessions
+
+### mbproxy.client.connected
+
+**Level:** Information &middot; **EventId:** 110 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `RemoteEp` | `string` | `IPAddress:port` of the upstream client. |
+
+Fires once per upstream client `accept` on a PLC's listener. The event name is preserved from the legacy 1:1 model so existing operator queries keep working after the multiplexer rewrite.
+
+**Operator action:** none in steady state. A burst of connects from the same `RemoteEp` indicates a reconnect storm — pair with `mbproxy.client.disconnected` to confirm.
+
+### mbproxy.client.disconnected
+
+**Level:** Information &middot; **EventId:** 111 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `RemoteEp` | `string` | `IPAddress:port` of the upstream client. |
+| `Reason` | `string` | Disconnect reason: `clean`, an exception type, or `cascade` when the multiplexer closed the upstream pipe due to a backend failure. |
+
+Fires when an upstream pipe is closed for any reason.
+
+**Operator action:** none unless paired with `mbproxy.multiplex.backend.disconnected`. That combination indicates a disconnect cascade: every upstream client of that PLC was closed because its backend failed.
+
+## Backend Multiplexer
+
+### mbproxy.multiplex.backend.connected
+
+**Level:** Information &middot; **EventId:** 112 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Host` | `string` | Backend host or IP. |
+| `Port` | `int` | Backend TCP port (typically 502). |
+
+Fires when the multiplexer's single backend socket to a PLC is established. Because the multiplexer holds exactly one backend socket per PLC and reuses it across every upstream client, this event is rare in steady state — see it at startup and after each `mbproxy.multiplex.backend.disconnected`.
+
+**Operator action:** none.
+
+### mbproxy.multiplex.backend.disconnected
+
+**Level:** Warning &middot; **EventId:** 113 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `UpstreamCount` | `int` | Number of upstream pipes the multiplexer cascade-closed. |
+| `InFlightCount` | `int` | Number of in-flight requests dropped (each upstream sees a `Gateway Target Device Failed To Respond` exception or a closed socket). |
+| `Reason` | `string` | Underlying disconnect reason (exception type or `"backend EOF"`). |
+
+Fires when the backend socket closes for any reason. Closing the backend is fatal to all attached upstream pipes; the multiplexer cascade-closes them so clients reconnect cleanly through the listener.
+
+**Operator action:** investigate the PLC and the network path. A single transient occurrence is normal — repeated occurrences for the same `Plc` indicate a sick controller, a flaky ECOM100, or an unstable middlebox. Pair with `mbproxy.backend.failed` events for the same PLC to confirm the proxy can't get back in.
+
+### mbproxy.multiplex.saturated
+
+**Level:** Error &middot; **EventId:** 114 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `RemoteEp` | `string` | Upstream client whose request was refused. |
+
+Fires when the TxId allocator refuses to allocate because every slot in the 16-bit MBAP transaction-ID space is currently in flight. The multiplexer responds to the upstream with Modbus exception code `0x04` (`Slave Device Failure`) and frees nothing. The DL205/DL260 family serialises Modbus TCP at roughly 2–10 ms per request, so reaching 65 536 concurrent in-flight requests is a stress-only path; in production this event is alert-worthy because it indicates either a runaway client or a backend that has stopped responding.
+
+**Operator action:** alert. Check the corresponding PLC's in-flight gauge on the status page; if it is also pegged, the backend is wedged and the listener should be restarted via the hot-reload path (toggle the PLC's `Enabled` flag).
+
+### mbproxy.multiplex.request.timeout
+
+**Level:** Warning &middot; **EventId:** 116 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `ProxyTxId` | `ushort` | Internal TxId the multiplexer assigned to the backend request. |
+| `OriginalTxId` | `ushort` | TxId the upstream client sent. |
+| `Fc` | `byte` | Modbus function code of the timed-out request. |
+| `ElapsedMs` | `long` | Wall-clock time the request spent in flight before the watchdog fired. |
+
+Fires when the per-request watchdog times out an in-flight request whose response never arrived within `BackendRequestTimeoutMs`. The upstream client receives Modbus exception code `0x0B` (`Gateway Target Device Failed To Respond`) and the proxy TxId is freed. Common causes: PLC dropped the response, packet loss, or a backend that echoes the wrong MBAP TxId (e.g. pymodbus 3.13.0's concurrent-multiplexed-request bug).
+
+**Operator action:** isolated timeouts are noise. A sustained rate indicates the PLC is overloaded or the backend is misbehaving — correlate with `mbproxy.multiplex.backend.disconnected`.
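+
+To make the timeout reply concrete, the sketch below builds the exception frame a client would receive. `BuildExceptionResponse` is a hypothetical helper, not a name from the mbproxy source; it assumes the reply carries the client's original TxId and follows the standard MBAP layout.
+
+```csharp
+using System;
+using System.Buffers.Binary;
+
+// Hypothetical helper: builds the 9-byte Modbus TCP exception frame.
+static byte[] BuildExceptionResponse(ushort originalTxId, byte unitId, byte fc, byte exceptionCode)
+{
+    var frame = new byte[9];
+    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), originalTxId); // MBAP transaction ID
+    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);            // protocol ID, always 0
+    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), 3);            // length: unit ID + 2 PDU bytes
+    frame[6] = unitId;
+    frame[7] = (byte)(fc | 0x80); // high bit marks an exception response
+    frame[8] = exceptionCode;     // 0x0B = Gateway Target Device Failed To Respond
+    return frame;
+}
+
+byte[] reply = BuildExceptionResponse(originalTxId: 0x0102, unitId: 1, fc: 0x03, exceptionCode: 0x0B);
+Console.WriteLine(Convert.ToHexString(reply)); // 01020000000301830B
+```
+
+The function-code byte `0x83` in the output is the same high-bit-set FC that `mbproxy.exception.passthrough` refers to when the PLC itself returns an exception.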
+
+### mbproxy.backend.failed
+
+**Level:** Warning &middot; **EventId:** 115 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Reason` | `string` | Final connect-pipeline exception (after retries). |
+
+Fires when the backend-connect Polly pipeline (3 attempts at 100 ms / 500 ms / 2000 ms by default) exhausts its retries. The multiplexer cascade-closes any waiting upstream pipes; new clients will trigger a fresh reconnect attempt.
+
+**Operator action:** check the PLC's IP/port reachability. If multiple PLCs share an upstream switch or an EBC100 daughterboard, look for a common network event.
+
+## Read Coalescing
+
+### mbproxy.coalesce.hit
+
+**Level:** Debug &middot; **EventId:** 120 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `UnitId` | `byte` | Modbus unit identifier from the request MBAP. |
+| `Fc` | `byte` | Function code (`0x03` or `0x04`). |
+| `Start` | `ushort` | First register in the request range. |
+| `Qty` | `ushort` | Quantity of registers. |
+| `PartyCount` | `int` | Number of upstream parties now attached to the in-flight peer (after this one joined). |
+
+Fires when an FC03/FC04 request attaches to an in-flight request with the same `(UnitId, Fc, Start, Qty)` key, so only one wire request hits the PLC.
+
+**Operator action:** none. Coalescing is steady-state behaviour; the counters on the status page surface the same data without log volume.
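+
+The coalescing key can be pictured as a value type with four fields. The sketch below is illustrative only: `CoalesceKey` and `Attach` are hypothetical names, and the real `InFlightByKeyMap` also enforces the `MaxParties` cap and handles concurrency, which this example leaves out.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+var inFlight = new Dictionary<CoalesceKey, int>();
+
+// Returns the party count after this request joined; 1 means a fresh wire request.
+int Attach(CoalesceKey key)
+{
+    inFlight.TryGetValue(key, out int parties);
+    inFlight[key] = parties + 1;
+    return parties + 1;
+}
+
+var key = new CoalesceKey(UnitId: 1, Fc: 0x03, Start: 1072, Qty: 2);
+Console.WriteLine(Attach(key));                  // 1: miss, opens the in-flight entry
+Console.WriteLine(Attach(key));                  // 2: hit, joins the existing entry
+Console.WriteLine(Attach(key with { Qty = 4 })); // 1: different key, separate request
+
+// A record struct gives value equality over all four fields.
+readonly record struct CoalesceKey(byte UnitId, byte Fc, ushort Start, ushort Qty);
+```
+
+Because equality covers every field, two reads that differ only in `Qty` never share a wire request, which matches the key described above.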
+
+### mbproxy.coalesce.miss
+
+**Level:** Debug &middot; **EventId:** 121 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `UnitId` | `byte` | Modbus unit identifier. |
+| `Fc` | `byte` | Function code (`0x03` or `0x04`). |
+| `Start` | `ushort` | First register in the request range. |
+| `Qty` | `ushort` | Quantity of registers. |
+
+Fires when an FC03/FC04 request opens a fresh in-flight entry — either no matching peer existed, or the matching peer had reached its `MaxParties` cap and a new entry was opened.
+
+**Operator action:** none.
+
+### mbproxy.coalesce.dead_upstream
+
+**Level:** Debug &middot; **EventId:** 122 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `UnitId` | `byte` | Modbus unit identifier. |
+| `Fc` | `byte` | Function code (`0x03` or `0x04`). |
+| `Start` | `ushort` | First register in the request range. |
+| `Qty` | `ushort` | Quantity of registers. |
+
+Fires when fan-out skips a coalesced party because its upstream pipe was closed (or had pending writes the pipe rejected). A high rate paired with a high `mbproxy.client.disconnected` rate indicates clients are giving up on slow PLCs.
+
+**Operator action:** none directly. Investigate if paired with rising `mbproxy.multiplex.request.timeout` on the same PLC.
+
+## Response Cache
+
+### mbproxy.cache.hit
+
+**Level:** Debug &middot; **EventId:** 140 &middot; **Source:** `src/Mbproxy/Proxy/Cache/CacheLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `UnitId` | `byte` | Modbus unit identifier. |
+| `Fc` | `byte` | Function code (`0x03` or `0x04`). |
+| `Start` | `ushort` | First register in the request range. |
+| `Qty` | `ushort` | Quantity of registers. |
+
+Fires when an FC03/FC04 request is served entirely from the in-process response cache — no backend round-trip occurs and the result is rewritten through the BCD pipeline as usual.
+
+**Operator action:** none. The status-page cache hit-rate gauge tracks the same signal at far lower cost than this event.
+
+### mbproxy.cache.miss
+
+**Level:** Debug &middot; **EventId:** 141 &middot; **Source:** `src/Mbproxy/Proxy/Cache/CacheLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `UnitId` | `byte` | Modbus unit identifier. |
+| `Fc` | `byte` | Function code (`0x03` or `0x04`). |
+| `Start` | `ushort` | First register in the request range. |
+| `Qty` | `ushort` | Quantity of registers. |
+
+Fires when an FC03/FC04 request matches a cacheable tag range but no live cache entry exists; the request falls through to the backend.
+
+**Operator action:** none.
+
+### mbproxy.cache.store
+
+**Level:** Debug &middot; **EventId:** 142 &middot; **Source:** `src/Mbproxy/Proxy/Cache/CacheLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `UnitId` | `byte` | Modbus unit identifier. |
+| `Fc` | `byte` | Function code (`0x03` or `0x04`). |
+| `Start` | `ushort` | First register in the request range. |
+| `Qty` | `ushort` | Quantity of registers. |
+| `TtlMs` | `int` | Cache TTL applied to the entry (in milliseconds). |
+
+Fires when a successful FC03/FC04 response is admitted into the cache. `TtlMs` echoes the per-tag `CacheTtlMs` from `appsettings.json`.
+
+**Operator action:** none.
+
+### mbproxy.cache.invalidated
+
+**Level:** Debug &middot; **EventId:** 143 &middot; **Source:** `src/Mbproxy/Proxy/Cache/CacheLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `UnitId` | `byte` | Modbus unit identifier. |
+| `WriteStart` | `ushort` | First register the write touched. |
+| `WriteQty` | `ushort` | Number of registers the write touched. |
+| `Count` | `int` | Number of cache entries evicted by the write. |
+
+Fires when an FC06/FC16 write invalidates one or more overlapping cache entries. The invalidator deliberately operates at the register level so a write to `V2000` evicts every cached read range that includes register `V2000`.
+
+**Operator action:** none. A `Count` of zero means the write was outside any cached range and is informational only.
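+
+Register-level invalidation amounts to a range-overlap test. The following is a minimal sketch, assuming cached entries are tracked as half-open `[Start, Start + Qty)` ranges; the `Overlaps` helper and the list-based store are hypothetical, and the addresses are arbitrary PDU-decimal values.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Half-open ranges overlap when each one starts before the other ends.
+static bool Overlaps(int aStart, int aQty, int bStart, int bQty) =>
+    aStart < bStart + bQty && bStart < aStart + aQty;
+
+var cached = new List<(ushort Start, ushort Qty)> { (1024, 4), (1026, 8), (2000, 2) };
+ushort writeStart = 1027, writeQty = 1;
+
+// Evict every cached read range that contains a written register.
+int count = cached.RemoveAll(e => Overlaps(e.Start, e.Qty, writeStart, writeQty));
+Console.WriteLine(count); // 2: both ranges containing register 1027 are evicted
+```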
+
+### mbproxy.cache.flushed
+
+**Level:** Information &middot; **EventId:** 144 &middot; **Source:** `src/Mbproxy/Proxy/Cache/CacheLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Reason` | `string` | Why the cache was flushed: `"backend reconnect"`, `"hot-reload"`, or `"shutdown"`. |
+| `Count` | `int` | Number of entries dropped. |
+
+Fires whenever the entire per-PLC cache is wiped at once — primarily after a backend reconnect (the proxy can't reason about staleness across the gap) or after a hot-reload that changed the BCD tag map for the PLC.
+
+**Operator action:** none unless flushes happen in a tight loop, which would indicate that the backend connection itself is unstable.
+
+## BCD Rewriter
+
+### mbproxy.rewrite.partial_bcd
+
+**Level:** Warning &middot; **EventId:** 30 &middot; **Source:** `src/Mbproxy/Proxy/RewriterLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Address` | `ushort` | PDU-decimal address of the BCD tag involved. |
+| `ClientStart` | `ushort` | Start register the client requested. |
+| `ClientQty` | `ushort` | Quantity of registers the client requested. |
+
+Fires when a 32-bit BCD pair is only partially covered by a read or write range (the request straddles the CDAB boundary instead of covering both words). The raw bytes are passed through unchanged, so the client or PLC sees the original nibbles.
+
+**Operator action:** the upstream client is misconfigured — its register map disagrees with the proxy's `BcdTags` list. Reconcile the client's tag definitions against `appsettings.json`.
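+
+The partial-coverage condition can be expressed as "exactly one of the two words is inside the request". This sketch assumes a 32-bit BCD tag occupies registers `Address` and `Address + 1`; `IsPartialOverlap` is a hypothetical name, not the rewriter's actual method.
+
+```csharp
+using System;
+
+// Assumption: a 32-bit BCD tag occupies registers tagAddress and tagAddress + 1.
+static bool IsPartialOverlap(ushort tagAddress, ushort clientStart, ushort clientQty)
+{
+    int end = clientStart + clientQty; // exclusive end of the request
+    bool first = tagAddress >= clientStart && tagAddress < end;
+    bool second = tagAddress + 1 >= clientStart && tagAddress + 1 < end;
+    return first != second;            // exactly one word covered
+}
+
+Console.WriteLine(IsPartialOverlap(1072, 1072, 2)); // False: both words covered
+Console.WriteLine(IsPartialOverlap(1072, 1073, 2)); // True: only the second word
+Console.WriteLine(IsPartialOverlap(1072, 1080, 2)); // False: tag not touched
+```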
+
+### mbproxy.rewrite.invalid_bcd
+
+**Level:** Warning &middot; **EventId:** 31 &middot; **Source:** `src/Mbproxy/Proxy/RewriterLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Address` | `ushort` | PDU-decimal address of the BCD tag involved. |
+| `RawValue` | `ushort` | The raw register value that failed BCD validation (logged as `0x{RawValue:X4}`). |
+| `Direction` | `string` | `"Read"` (response from PLC) or `"Write"` (request from client). |
+
+Fires when a register at a configured BCD address contains a nibble `>= 0xA` — i.e. not a valid BCD digit. The raw bytes are passed through unchanged.
+
+**Operator action:** if `Direction=Read`, the PLC has been written outside BCD discipline (probably by a ladder program that bypassed the proxy). If `Direction=Write`, the client is sending a value `> 9999` that doesn't fit in 4 BCD digits — bound-check upstream.
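+
+For the read direction, the validity rule is a per-nibble check. The sketch below shows one way to decode a 16-bit BCD register and reject a nibble of `0xA` or above; `TryDecodeBcd` is a hypothetical name and the proxy's own codec may be structured differently.
+
+```csharp
+using System;
+
+// Decodes a 16-bit BCD register; fails if any nibble is not a decimal digit.
+static bool TryDecodeBcd(ushort raw, out int value)
+{
+    value = 0;
+    for (int shift = 12; shift >= 0; shift -= 4)
+    {
+        int nibble = (raw >> shift) & 0xF;
+        if (nibble > 9)
+        {
+            value = 0;    // 0xA..0xF is not a BCD digit
+            return false;
+        }
+        value = value * 10 + nibble;
+    }
+    return true;
+}
+
+Console.WriteLine(TryDecodeBcd(0x1234, out int ok) ? ok : -1);   // 1234
+Console.WriteLine(TryDecodeBcd(0x12A4, out int bad) ? bad : -1); // -1
+```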
+
+### mbproxy.exception.passthrough
+
+**Level:** Information &middot; **EventId:** 32 &middot; **Source:** `src/Mbproxy/Proxy/RewriterLogEvents.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Plc` | `string` | Configured PLC name. |
+| `Fc` | `byte` | Original function code (high bit set on the wire; logged as `0x{Fc:X2}`). |
+| `ExceptionCode` | `byte` | Modbus exception code (1=illegal function, 2=illegal data address, 3=illegal data value, 4=slave device failure, etc.). |
+
+Fires when the PLC returns a Modbus exception response (high bit set on the FC byte). The frame is forwarded verbatim to the client.
+
+**Operator action:** none in isolation. A sustained `ExceptionCode=2` rate against a configured BCD address suggests the PLC's V-memory map no longer matches the proxy's tag list.
+
+## Configuration Hot-Reload
+
+### mbproxy.config.reload.applied
+
+**Level:** Information &middot; **EventId:** 60 &middot; **Source:** `src/Mbproxy/Configuration/ConfigReconciler.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `PlcsAdded` | `int` | Number of new PLC entries the supervisor brought online. |
+| `PlcsRemoved` | `int` | Number of PLC entries torn down. |
+| `PlcsRestarted` | `int` | Number of PLCs whose listener was rebound (port or host change). |
+| `PlcsReseated` | `int` | Number of PLCs whose BCD tag map was swapped without restarting the listener. |
+| `GlobalTagDelta` | `int` | Net change in the global tag count across the new snapshot. |
+
+Fires after a debounced `appsettings.json` change passes validation and the reconciler has applied it. The five counters together describe the shape of the change so dashboards can plot churn over time.
+
+**Operator action:** none. Use this event as the audit trail of what changed after a configuration push; the event does not record who made the change.
+
+### mbproxy.config.reload.rejected
+
+**Level:** Error &middot; **EventId:** 61 &middot; **Source:** `src/Mbproxy/Configuration/ConfigReconciler.cs`
+
+| Property | Type | Meaning |
+|----------|------|---------|
+| `Errors` | `string` | Concatenated validation failures (one line per failure). |
+
+Fires when a configuration change fails validation — duplicate PLC names, port collisions inside the new file, malformed BCD tag entries, or a schema-level error. The previous valid configuration remains in effect; no listeners are touched.
+
+**Operator action:** fix the offending entry in `appsettings.json` and save again. The reconciler debounces file events on a 250 ms window, so rapid sequential saves coalesce into one validation pass.
+
+## Conventions
+
+### Naming
+
+All event names follow `mbproxy.<area>.<noun>[.<state>]`:
+
+- `<area>` matches a subsystem (`startup`, `client`, `multiplex`, `coalesce`, `cache`, `rewrite`, `config`, `admin`, `listener`, `shutdown`, `backend`, `exception`).
+- `<noun>` describes the thing the event is about.
+- `<state>` is optional and only used when the event represents a terminal or special outcome (`bind.failed`, `reload.rejected`, `backend.connected`).
+
+Property names are PascalCase, match the `[LoggerMessage]` template tokens exactly, and use `Plc` as the canonical scope key for per-device filtering.
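+
+A naming guard along these lines could be added to a unit test. The pattern is an assumption inferred from the names in this catalog (lowercase letters and underscores, two or three segments after the `mbproxy` prefix); it is not taken from the source.
+
+```csharp
+using System;
+using System.Text.RegularExpressions;
+
+// mbproxy.<area>.<noun>[.<state>]: two or three lowercase segments after the prefix.
+var pattern = new Regex(@"^mbproxy(\.[a-z_]+){2,3}$");
+
+string[] names =
+{
+    "mbproxy.startup.ready",
+    "mbproxy.startup.bind.failed",
+    "mbproxy.coalesce.dead_upstream",
+    "Mbproxy.Startup.Ready", // wrong casing
+    "mbproxy.cache",         // missing <noun>
+};
+
+foreach (string name in names)
+{
+    Console.WriteLine($"{name}: {pattern.IsMatch(name)}");
+}
+```
+
+The first three names match and the last two do not.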
+
+### Stability Promise
+
+Event names and `EventId` values are part of the operator contract. They are not changed in patch or minor releases. A rename or removal requires a major version bump and a migration note in the release notes. New events are additive and can ship in any release.
+
+### Where Events Are Defined
+
+The following subsystems each own a single `*LogEvents.cs` static partial class with `[LoggerMessage]` declarations:
+
+- `src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs` — client sessions and backend multiplexer.
+- `src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs` — read coalescing.
+- `src/Mbproxy/Proxy/Cache/CacheLogEvents.cs` — response cache.
+- `src/Mbproxy/Proxy/RewriterLogEvents.cs` — BCD rewriting and exception passthrough.
+
+Lifecycle events (`startup.*`, `listener.*`, `admin.*`, `shutdown.*`, `config.reload.*`) live as private `[LoggerMessage]` declarations next to the class that emits them — see `ProxyWorker.cs`, `PlcListener.cs`, `PlcListenerSupervisor.cs`, `AdminEndpointHost.cs`, `ShutdownCoordinator.cs`, and `ConfigReconciler.cs`. New subsystems should follow the `*LogEvents.cs` pattern when they accumulate more than two events.
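+
+For orientation, a `*LogEvents.cs` declaration might look like the sketch below. The class name, method name, and message text are hypothetical, and carrying the dotted name in `EventName` is an assumption; only the `[LoggerMessage]` attribute and its properties come from `Microsoft.Extensions.Logging`. The body of the partial method is supplied by the logging source generator at build time.
+
+```csharp
+using Microsoft.Extensions.Logging;
+
+// Hypothetical sketch; the real declarations live in MultiplexerLogEvents.cs.
+internal static partial class ExampleLogEvents
+{
+    [LoggerMessage(
+        EventId = 110,
+        EventName = "mbproxy.client.connected",
+        Level = LogLevel.Information,
+        Message = "Client connected: Plc={Plc} RemoteEp={RemoteEp}")]
+    public static partial void ClientConnected(ILogger logger, string plc, string remoteEp);
+}
+```
+
+The template tokens `{Plc}` and `{RemoteEp}` bind to the method parameters, which is what keeps the property tables in this catalog aligned with the emitted log lines.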
+
+## Related Documentation
+
+- [Troubleshooting](../Operations/Troubleshooting.md) — operator playbook keyed off the event names in this catalog.
+- [Connection Model](../Architecture/ConnectionModel.md) — context for the `mbproxy.multiplex.*` and `mbproxy.client.*` events.
+- [Response Cache](../Architecture/ResponseCache.md) — context for the `mbproxy.cache.*` events.
+- [Status Page](../Operations/StatusPage.md) — counter equivalents for the high-volume Debug-level events.
+- [Read Coalescing](../Architecture/ReadCoalescing.md) — context for the `mbproxy.coalesce.*` events.
+- [BCD Rewriting](../Features/BcdRewriting.md) — context for the `mbproxy.rewrite.*` and `mbproxy.exception.passthrough` events.
+- [Hot Reload](../Features/HotReload.md) — context for the `mbproxy.config.reload.*` events.
diff --git a/mbproxy/docs/Testing/Simulator.md b/mbproxy/docs/Testing/Simulator.md
new file mode 100644
index 0000000..bb84c90
--- /dev/null
+++ b/mbproxy/docs/Testing/Simulator.md
@@ -0,0 +1,235 @@
+# Simulator Harness
+
+The pymodbus DL205 simulator stands in for real DL205/DL260 hardware in the E2E test suite. This document describes the launcher, the xUnit fixture, the skip policy, the per-test timeout discipline, and the pymodbus 3.13.0 framer quirk that the test strategy works around.
+
+## Why a Simulator
+
+`mbproxy` targets a fleet of AutomationDirect DL205/DL260 controllers, hardware that test machines do not have attached. The pymodbus profile at [`../../DL260/dl205.json`](../../DL260/dl205.json) already models the device-side quirks (BCD nibbles at known holding-register addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings) as concrete register seeds. The harness wraps that profile in an xUnit `IAsyncLifetime` fixture so every E2E test class opens against a fresh known-good DL-series target without manual setup.
+
+The device-side rationale for each seed (why HR 1072 is `0x1234`, why FC03 caps at 128, etc.) lives in [`../../DL260/dl205.md`](../../DL260/dl205.md). The harness exists to make that profile addressable from xUnit tests; it does not duplicate the device documentation.
+
+## Harness Layout
+
+Three files form the harness contract:
+
+| Path | Role |
+|------|------|
+| `tests/sim/run-dl205-sim.ps1` | PowerShell launcher. Provisions a Python venv under `tests/sim/.venv` on first run (`python -m venv` + `pip install pymodbus`) and execs `pymodbus.simulator` against `dl205.json` on the requested port. Idempotent — re-runs reuse the venv. |
+| `tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs` | `IAsyncLifetime` fixture that picks a free port, spawns the launcher, polls for TCP readiness, and tears the process tree down on dispose. |
+| `tests/Mbproxy.Tests/Sim/DL205SimulatorCollection.cs` | `[CollectionDefinition(nameof(DL205SimulatorCollection))]` that exposes the fixture as an xUnit `ICollectionFixture<DL205SimulatorFixture>`. |
+
+The launcher is a PowerShell 7+ script; the fixture invokes it via `pwsh -NoProfile -File <script> -Port <picked>`. The script's exit codes (1 = venv failure, 2 = pymodbus launch failure, 3 = profile missing) are reported in the fixture's skip reason, alongside the captured stdout/stderr tail, for diagnosis.
+
+## Fixture Lifecycle
+
+`DL205SimulatorFixture` is sealed and lives in `Mbproxy.Tests.Sim`. The lifecycle is bounded entirely by `InitializeAsync` and `DisposeAsync`.
+
+### InitializeAsync
+
+`InitializeAsync` runs five steps:
+
+1. **Pick a free local port.** Bind a `TcpListener` on `IPAddress.Loopback:0`, capture the OS-assigned port into `Port`, and dispose the listener. The TOCTOU window between dispose and pymodbus binding is documented in the source and considered acceptable for tests; a port-steal would manifest as a connect failure inside the readiness poll, which then surfaces via `SkipReason`.
+2. **Resolve the launcher script.** Walks upward from the test assembly directory (`tests/Mbproxy.Tests/bin/<config>/net10.0/`) looking for `tests/sim/run-dl205-sim.ps1`. If not found, `SkipReason` is set with a "could not locate" message that points at the expected layout.
+3. **Verify `pwsh` is on `PATH`.** Spawns `pwsh -NoProfile -Command exit 0` with a 3 s budget. If `pwsh` is missing or returns non-zero, `SkipReason` is set.
+4. **Spawn the simulator subprocess.** `Process.Start` invokes the launcher with the picked port; stdout and stderr are drained asynchronously into a 50-line ring buffer via `BeginOutputReadLine` / `BeginErrorReadLine` to avoid blocking the child on a full pipe.
+5. **Poll for TCP readiness.** Repeatedly calls `TcpClient.ConnectAsync(Host, Port)` at 100 ms intervals until the connect succeeds, the process exits early, or `ReadinessTimeout` elapses.
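+
+Steps 1 and 5 can be sketched as follows. This is a simplified illustration, not the fixture's code: `PickFreePort` and `WaitForTcpAsync` are hypothetical names, and the early-exit check on the child process is omitted.
+
+```csharp
+using System;
+using System.Net;
+using System.Net.Sockets;
+using System.Threading.Tasks;
+
+// Step 1: let the OS assign a free loopback port, then release it.
+static int PickFreePort()
+{
+    var listener = new TcpListener(IPAddress.Loopback, 0);
+    listener.Start();
+    int port = ((IPEndPoint)listener.LocalEndpoint).Port;
+    listener.Stop(); // the TOCTOU window opens here
+    return port;
+}
+
+// Step 5: retry the connect every 100 ms until it succeeds or the budget runs out.
+static async Task<bool> WaitForTcpAsync(string host, int port, TimeSpan timeout)
+{
+    DateTime deadline = DateTime.UtcNow + timeout;
+    while (DateTime.UtcNow < deadline)
+    {
+        try
+        {
+            using var client = new TcpClient();
+            await client.ConnectAsync(host, port);
+            return true;
+        }
+        catch (SocketException)
+        {
+            await Task.Delay(100);
+        }
+    }
+    return false;
+}
+
+int picked = PickFreePort();
+bool ready = await WaitForTcpAsync("127.0.0.1", picked, TimeSpan.FromSeconds(2));
+Console.WriteLine($"port {picked} ready: {ready}");
+```
+
+Run on its own, nothing is listening on the picked port, so the poll is expected to give up once the 2 s budget elapses; in the fixture the simulator subprocess would be bound there by then.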
+
+`ReadinessTimeout` is 120 s. The spec calls for "up to 10 s" of warm-run server startup; the fixture allows the longer budget because a cold run that has to `pip install pymodbus` can take 30 to 90 s depending on network speed. Warm runs (the common case) succeed in under 2 s. Cold-run provisioning time adds to server startup time inside the same launcher invocation, so the two cannot be budgeted separately without a dedicated pre-provision step.
+
+If readiness never arrives, the fixture distinguishes two failure modes in `SkipReason`:
+
+- **The process exited prematurely.** Likely cause: Python not found, pymodbus not installed, profile path wrong. The exit code and the log tail are included in the skip reason.
+- **The process is still running but never accepted a connection.** Likely cause: port stolen, pymodbus stuck in its own startup, firewall blocking loopback. The log tail is included verbatim.
+
+Either way, the process tree is killed before the skip reason is returned, so no orphan pymodbus survives a failed fixture initialisation.
+
+### DisposeAsync
+
+`DisposeAsync` calls `_process.Kill(entireProcessTree: true)` and then `WaitForExitAsync` with a 5 s cap. The kill is abrupt: .NET offers no portable way to deliver a graceful `SIGINT` on Windows without P/Invoke into the console-attach APIs, so pymodbus's `atexit` handlers may be cut short. The trade-off is documented inline in the fixture and is acceptable for test cleanup because the simulator is stateless across runs.
+
+### Public Surface
+
+The fixture exposes four public properties (the `IAsyncLifetime` methods are omitted from the listing):
+
+```csharp
+public sealed class DL205SimulatorFixture : IAsyncLifetime
+{
+    public string Host { get; } = "127.0.0.1";
+    public int Port { get; private set; }
+    public string? SkipReason { get; private set; }
+    public string LogTail => BuildLogTail();
+}
+```
+
+`Host` is always loopback. `Port` is the OS-picked free port. `SkipReason` is non-null when the simulator could not start. `LogTail` returns the last 50 lines of captured stdout/stderr for diagnosis when a test fails.
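+
+One common way to obtain an OS-picked free port is to bind a listener to port 0 and read the assigned number back. The sketch below assumes that approach; `FreePort.Pick` is a hypothetical name and the fixture's own mechanism may differ.
+
+```csharp
+using System.Net;
+using System.Net.Sockets;
+
+public static class FreePort
+{
+    // Binds to port 0 so the OS assigns a free loopback port,
+    // reads the assigned number, then releases the socket.
+    public static int Pick()
+    {
+        var listener = new TcpListener(IPAddress.Loopback, 0);
+        listener.Start();
+        try
+        {
+            return ((IPEndPoint)listener.LocalEndpoint).Port;
+        }
+        finally
+        {
+            listener.Stop();
+        }
+    }
+}
+```
+
+The port is released before the simulator binds it, so another process can take it in the gap. That window is small on a loopback-only test machine, and it is one way the "port stolen" failure mode can arise.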
+
+## Skip Policy
+
+If Python is missing, `pwsh` is not on `PATH`, or pip provisioning fails, `InitializeAsync` populates `SkipReason` with a human-readable explanation and the fixture proceeds without a live simulator. Every E2E test starts with the same guard:
+
+```csharp
+if (_sim.SkipReason is not null)
+    Assert.Skip(_sim.SkipReason);
+```
+
+The unit-test suite (any test without `[Trait("Category", "E2E")]`) runs without any Python at all. CI machines must have Python 3.10+ and PowerShell 7+; local developers running only unit tests need nothing extra. The phase-01 gate (see [`../plan/README.md`](../plan/README.md)) explicitly verifies that on a machine with Python and pymodbus installed, none of the smoke tests skip — a skip on a properly equipped CI machine is treated as an environment failure, not a test pass.
+
+The skip reasons the fixture produces map cleanly onto the recovery action:
+
+| Skip reason prefix | Cause | Recovery |
+|--------------------|-------|----------|
+| `Could not locate tests/sim/run-dl205-sim.ps1` | Test assembly is too deep, or the script was deleted | Restore the script; verify the upward search starts inside the repo |
+| `pwsh (PowerShell 7+) is not available on PATH` | Windows PowerShell 5.1 is on PATH but not pwsh | Install PowerShell 7+ and ensure `pwsh` resolves |
+| `Failed to spawn pwsh: <message>` | `Process.Start` itself failed | Inspect the inner message; usually a `PATH` or permissions issue |
+| `Simulator process exited prematurely (exit code N)` | Launcher script returned non-zero (Python missing, pymodbus install failed, profile missing) | Read the log tail; exit codes 1/2/3 map to venv / launch / profile failures |
+| `Simulator did not accept a TCP connection on port N within 120 s` | Process is alive but never bound the port | Read the log tail; usually a port-steal or a pymodbus internal hang |
+
+## Adding an E2E Test
+
+An E2E test class declares the collection and the category trait, takes the fixture in its constructor, and guards every test with the skip check. `SimulatorSmokeTests` in `tests/Mbproxy.Tests/Sim/` is the canonical minimal example:
+
+```csharp
+using System.Net.Sockets;
+using NModbus;
+using Xunit;
+
+[Collection(nameof(DL205SimulatorCollection))]
+[Trait("Category", "E2E")]
+public sealed class SimulatorSmokeTests
+{
+    private readonly DL205SimulatorFixture _sim;
+
+    public SimulatorSmokeTests(DL205SimulatorFixture sim) => _sim = sim;
+
+    [Fact(Timeout = 5_000)]
+    public async Task Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234()
+    {
+        if (_sim.SkipReason is not null)
+            Assert.Skip(_sim.SkipReason);
+
+        using var client = new TcpClient();
+        await client.ConnectAsync(_sim.Host, _sim.Port,
+            TestContext.Current.CancellationToken);
+
+        var master = new ModbusFactory().CreateMaster(client);
+        ushort[] regs = master.ReadHoldingRegisters(
+            slaveAddress: 1, startAddress: 1072, numberOfPoints: 1);
+
+        Assert.Equal(0x1234, regs[0]); // raw BCD nibbles, NOT binary 1234
+    }
+}
+```
+
+For a proxy-shaped test, configure an in-process host pointing at `_sim.Host:_sim.Port` as the backend PLC and drive `NModbus` against the proxy's listen port. `MultiplexerE2ETests` in `tests/Mbproxy.Tests/Proxy/Multiplexing/` is the working example with the in-process `Host.CreateApplicationBuilder()` setup.
+
+## Per-Test Timeout Policy
+
+`[Fact(Timeout = 5_000)]` is the default for every E2E test. Raise it per test only when the test genuinely needs longer: concurrent bursts above 100 ops, reload-propagation debounce, graceful-shutdown drain, or Polly-paced backend reconnects. Add a one-line comment on the test explaining the reason whenever the timeout exceeds the 5 s default. `MultiplexerE2ETests.E2E_BackendDisconnect_DuringInflight_CascadesUpstream_AndRecovers` uses `[Fact(Timeout = 8_000)]` and documents the Polly backoff budget inline.
+
+The reason a hard timeout matters: synchronous `NModbus` calls do not honor `TestContext.Current.CancellationToken`. Without `[Fact(Timeout=…)]`, a deadlock anywhere in the proxy hangs the test runner indefinitely. The hang-diagnosis pattern for when this nonetheless happens lives in [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md).
+
+The test-runner backstop is a process-level safety net:
+
+```powershell
+dotnet test --filter Category=E2E --blame-hang-timeout 2m
+```
+
+The `--blame-hang-timeout` flag is mandatory for E2E runs. It catches the rare case where an individual test's `Timeout` does not fire, for example when an unmanaged thread blocks finalization.
+
+## The Pymodbus 3.13.0 Framer Quirk
+
+Pymodbus 3.13.0's `ServerRequestHandler` stores a single `last_pdu` field per TCP connection and schedules the deferred handler via the asyncio event loop's `call_soon`. If two MBAP frames arrive in the same recv buffer, which the multiplexer's shared backend connection produces under truly concurrent upstream reads, the later frame overwrites `last_pdu` before the first scheduled handler runs, and both responses then carry the later request's TxId. The real DL260 ECOM does not exhibit this bug; it echoes per-request TxIds correctly.
+
+This forces a three-part test strategy:
+
+- **Multiplexer correctness under concurrent backend traffic is proven against a stub backend.** `PlcMultiplexerTests` drives the multiplexer with a stub that properly echoes per-request TxIds. That test class is the load-bearing coverage for TxId rewriting; the simulator does not contribute here.
+- **Simulator-backed E2E tests pace requests** to keep pymodbus in known-good single-PDU mode. `MultiplexerE2ETests` enforces serialisation explicitly: each upstream client's request is issued only after the previous client's response has returned. The `<summary>` block at the top of `MultiplexerE2ETests.cs` documents the trade-off and points readers to `PlcMultiplexerTests` for the concurrency proof.
+- **The per-request watchdog defends production.** Configurable via `Connection.BackendRequestTimeoutMs`, the multiplexer cancels any in-flight request whose response does not arrive within the budget and surfaces Modbus exception `0x0B` upstream. The `mbproxy.multiplex.request.timeout` log event (see [`../Reference/LogEvents.md`](../Reference/LogEvents.md)) is the operational signal. The same code path defends against any backend — pymodbus, a misbehaving ECOM, or a network middlebox — that mis-echoes or drops a TxId.
+
+The connection-model rationale for why the multiplexer produces multi-frame recv buffers in the first place is in [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md).
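+
+To make the multi-frame condition concrete, the sketch below splits one recv buffer into MBAP frames and lists their transaction ids. The header layout (TxId, protocol id, length, each two bytes big-endian) comes from the Modbus TCP framing; `MbapSplitter` is a hypothetical helper written for this illustration, not code from the proxy or from pymodbus.
+
+```csharp
+using System;
+using System.Buffers.Binary;
+using System.Collections.Generic;
+
+public static class MbapSplitter
+{
+    // Walks a recv buffer and returns the TxId of every complete frame.
+    // MBAP header: TxId(2) ProtocolId(2) Length(2); Length counts the
+    // unit id plus the PDU that follow the header.
+    public static List<ushort> TransactionIds(ReadOnlySpan<byte> buffer)
+    {
+        var ids = new List<ushort>();
+        int offset = 0;
+        while (buffer.Length - offset >= 6)
+        {
+            ushort txId = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(offset, 2));
+            ushort length = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(offset + 4, 2));
+            int frameSize = 6 + length;
+            if (buffer.Length - offset < frameSize) break; // trailing partial frame
+            ids.Add(txId);
+            offset += frameSize;
+        }
+        return ids;
+    }
+}
+
+public static class MbapExample
+{
+    // Two FC03 reads of HR 1072 (0x0430), TxId 1 and TxId 2, back to back.
+    public static readonly byte[] TwoRequests =
+    {
+        0x00, 0x01, 0x00, 0x00, 0x00, 0x06, 0x01, 0x03, 0x04, 0x30, 0x00, 0x01,
+        0x00, 0x02, 0x00, 0x00, 0x00, 0x06, 0x01, 0x03, 0x04, 0x30, 0x00, 0x01,
+    };
+}
+```
+
+`MbapSplitter.TransactionIds(MbapExample.TwoRequests)` yields ids 1 and 2. A backend that echoes correctly answers with those same two ids; the quirk described above would answer both requests with id 2.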
+
+## Simulator Profile
+
+`DL260/dl205.json` is the pymodbus server config. It seeds the registers the E2E tests assert against:
+
+| Address | Width | Seeded value | Used to prove |
+|---------|-------|--------------|---------------|
+| HR 0 | uint16 | `0xCAFE` | Profile is loaded; register 0 is valid on DL205/DL260 |
+| HR 200..209 | uint16 | scratch range, writable | FC06/FC16 round-trips for BCD-rewriter E2E tests |
+| HR 1072 | uint16 | `0x1234` (raw BCD nibbles) | Single-register FC03 BCD decode through the proxy |
+| HR 1080/1081 | uint16 pair | CDAB-ordered 32-bit BCD | 32-bit BCD decode across the word pair |
+
+The full register map and the device-side rationale for each entry live in [`../../DL260/dl205.md`](../../DL260/dl205.md).
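+
+The difference between raw BCD nibbles and the binary value can be worked through in a few lines. This is a sketch, not the proxy's codec: the `Bcd` class is hypothetical, and the CDAB helper assumes the first register of the pair carries the low four digits.
+
+```csharp
+using System;
+
+public static class Bcd
+{
+    // Decodes a 16-bit BCD register: each nibble is one decimal digit.
+    public static int Decode16(ushort raw)
+    {
+        int value = 0;
+        for (int shift = 12; shift >= 0; shift -= 4)
+        {
+            int digit = (raw >> shift) & 0xF;
+            if (digit > 9) throw new FormatException($"Invalid BCD nibble in 0x{raw:X4}");
+            value = value * 10 + digit;
+        }
+        return value;
+    }
+
+    // Encodes 0..9999 into four BCD nibbles.
+    public static ushort Encode16(int value)
+    {
+        if (value is < 0 or > 9999) throw new ArgumentOutOfRangeException(nameof(value));
+        int raw = 0;
+        for (int shift = 0; shift < 16; shift += 4)
+        {
+            raw |= (value % 10) << shift;
+            value /= 10;
+        }
+        return (ushort)raw;
+    }
+
+    // CDAB pair (assumed reading): first register = low word, second = high word.
+    public static int Decode32Cdab(ushort first, ushort second)
+        => Decode16(second) * 10_000 + Decode16(first);
+}
+```
+
+`Bcd.Decode16(0x1234)` gives 1234, whereas reading the same register as plain binary would give 4660. With illustrative values `0x5678` and `0x1234` for a register pair (not the seeded values of HR 1080/1081, which this page does not list), `Bcd.Decode32Cdab(0x5678, 0x1234)` gives 12345678.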
+
+Two profile-level settings are load-bearing for the harness:
+
+- **`"shared blocks": true`** — matches the DL series memory model where holding registers and input registers share the same backing store. The proxy's tests assume this; switching it off would change which addresses appear via FC03 versus FC04.
+- **`"type exception": false`** — controls whether pymodbus raises an exception when an address is read via a function code that does not match its declared type. The default `false` is the lax behaviour the tests rely on. Flipping it to `true` is an alternate-profile scenario, not a default-profile change.
+
+The `write` block in the JSON controls which ranges accept FC06/FC16. Writes outside the listed ranges return Modbus exception 02 (illegal data address), which is itself a useful condition the proxy must forward correctly. `MultiplexerE2ETests.E2E_RewriterStillWorks_UnderMultiplexedThreeClients` uses addresses 200, 202, and 204 from the writable scratch range for exactly this reason — read-only addresses would return exception 02 on the write step and break the round-trip assertion.
+
+## Alternate Profiles
+
+The `MODBUS_SIM_PROFILE` environment variable selects an alternate profile alongside `dl205.json`. This is the seam for scenario-specific simulators — for example, a profile with `"type exception": true` to verify the proxy does not depend on the default lax pymodbus behaviour, or a profile that seeds a specific partial-overlap test case at a known address. The existing pattern is `DL260/DL205BcdQuirkTests.cs`, which already drives the simulator with profile-driven assertions. When a new scenario needs its own profile, drop the JSON alongside `dl205.json` and select it via the env var rather than swapping the default — the default profile is the contract for the smoke tests and `MultiplexerE2ETests` and should not be silently mutated.
+
+## Running the Simulator Standalone
+
+The launcher is usable outside xUnit for ad-hoc debugging:
+
+```powershell
+pwsh tests/sim/run-dl205-sim.ps1 -Port 5020
+```
+
+The script provisions the venv on first run and execs `pymodbus.simulator`. Output streams to the terminal; Ctrl-C exits cleanly because the pymodbus process is attached to the script's console group. This mode is useful for poking at the profile with an external Modbus client (e.g. ModScan, mbpoll, NModbus from `dotnet fsi`) without running the test harness.
+
+A typical debugging loop:
+
+1. Launch the simulator standalone on a fixed port.
+2. Point a manually built proxy host at it via `appsettings.json` with `Host=127.0.0.1, Port=5020`.
+3. Drive the proxy from a Modbus client and inspect log events at `Verbose` to see the rewriter in action.
+
+The standalone launcher uses the same script the fixture invokes, so behaviour is identical between the test harness and ad-hoc runs.
+
+## End-to-End Test Shape
+
+A full E2E test wires the simulator, an in-process proxy host, and an `NModbus` client into the same loopback stack:
+
+```csharp
+// 1. Simulator runs at _sim.Host:_sim.Port (fixture-managed).
+// 2. Build an in-process proxy host pointing at the simulator as its PLC backend.
+int proxyPort = PickFreePort();
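+var config = new Dictionary<string, string?>
+{
+    ["Mbproxy:AdminPort"] = "0",
+    ["Mbproxy:Plcs:0:Name"]       = "TestPLC",
+    ["Mbproxy:Plcs:0:ListenPort"] = proxyPort.ToString(),
+    ["Mbproxy:Plcs:0:Host"]       = _sim.Host,
+    ["Mbproxy:Plcs:0:Port"]       = _sim.Port.ToString(),
+    ["Mbproxy:BcdTags:Global:0:Address"] = "1072",
+    ["Mbproxy:BcdTags:Global:0:Width"]   = "16",
+    // Assumed addition: HR 200 must also be declared a BCD tag,
+    // otherwise the write in step 4b passes through unmodified.
+    ["Mbproxy:BcdTags:Global:1:Address"] = "200",
+    ["Mbproxy:BcdTags:Global:1:Width"]   = "16",
+};
+
+var host = BuildBcdHost(config);
+await host.StartAsync(startCts.Token);
+
+// 3. Drive NModbus against the proxy port.
+using var client = new TcpClient();
+await client.ConnectAsync("127.0.0.1", proxyPort,
+    TestContext.Current.CancellationToken);
+var master = new ModbusFactory().CreateMaster(client);
+
+// 4a. Read: simulator returns raw 0x1234, proxy rewrites to binary 1234.
+ushort[] regs = master.ReadHoldingRegisters(1, 1072, 1);
+regs[0].ShouldBe((ushort)1234);
+
+// 4b. Write (against a writable scratch address, e.g. HR 200):
+//     client writes binary 1234, proxy re-encodes to 0x1234 BCD nibbles,
+//     simulator stores the nibbles.
+master.WriteSingleRegister(1, 200, 1234);
+
+// 4c. Read HR 200 straight from the simulator, bypassing the proxy,
+//     to confirm the stored value is the BCD encoding.
+using var direct = new TcpClient();
+await direct.ConnectAsync(_sim.Host, _sim.Port,
+    TestContext.Current.CancellationToken);
+ushort[] stored = new ModbusFactory().CreateMaster(direct)
+    .ReadHoldingRegisters(1, 200, 1);
+stored[0].ShouldBe((ushort)0x1234);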
+```
+
+The read direction proves the proxy rewrote the response; the write direction proves the proxy rewrote the request. Exercising both directions against the same simulator instance is the smallest viable end-to-end signal that the BCD rewriter is correctly wired.
+
+## Related Documentation
+
+- [Connection Model](../Architecture/ConnectionModel.md) — why the multiplexer's shared backend connection produces the multi-frame condition that triggers pymodbus's framer quirk
+- [Troubleshooting](../Operations/Troubleshooting.md) — hang-diagnosis pattern for tests that exceed their `[Fact(Timeout)]`
+- [Log Events](../Reference/LogEvents.md) — `mbproxy.multiplex.request.timeout` is the production watchdog against TxId mis-echo
+- [DL205/DL260 device quirks](../../DL260/dl205.md) — device-side rationale for every register the simulator profile seeds
+- [Phase plan README](../plan/README.md) — Test discipline section that codifies the 5 000 ms default and the `--blame-hang-timeout` rule