mbproxy: initial commit through Phase 9 (TxId multiplexing)

Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-14 01:49:35 -04:00
parent 2e937228a0
commit 56eee3c563
105 changed files with 18430 additions and 0 deletions
@@ -0,0 +1,252 @@
# mbproxy — design plan
Architectural design for the `mbproxy` Modbus TCP proxy service: how it fronts ~54 AutomationDirect DirectLOGIC DL205/DL260 controllers, rewrites BCD tags bidirectionally inline, and recovers from listener and backend failures. Settled in a design Q&A on 2026-05-13.
**Status:** plan; no code yet. Each decision below is load-bearing — change deliberately, not by drift.
Context (what the service does and why it exists) lives in [`../CLAUDE.md`](../CLAUDE.md) under "What this is" and "Purpose: bidirectional BCD rewrite". This file is the *how*. Device quirks the design depends on live in [`../DL260/dl205.md`](../DL260/dl205.md).
Runtime shape: **.NET 10 Generic Host** worker service registered as a **Windows Service** via `Microsoft.Extensions.Hosting.WindowsServices`.
## Listener topology — per-PLC port (one port → one PLC)
The host opens **one `TcpListener` per PLC** on a distinct port. Upstream clients reach a specific PLC by connecting to its assigned proxy port; no protocol-level routing is needed.
```
Client A ──┐
Client B ──┼──→ proxy:5020 ──→ PLC #1 (10.0.1.1:502)
├──→ proxy:5021 ──→ PLC #2 (10.0.1.2:502)
│ ...
└──→ proxy:5073 ──→ PLC #54 (10.0.1.54:502)
```
## Connection model — single backend socket per PLC, multiplexed via MBAP TxId rewriting
Each PLC has **one persistent backend TCP socket**, owned by a `PlcMultiplexer`. Many upstream client connections share that single backend socket; the multiplexer distinguishes their in-flight requests by **rewriting the MBAP transaction ID** on each request and restoring each client's original TxId on the matching response. Implemented in [Phase 09](plan/09-txid-multiplexing.md); replaced the prior 1:1 per-upstream-client backend-socket model.
```
Client A ─┐
Client B ─┼─→ proxy:5020 ─[ PlcMultiplexer ]─→ PLC #1 (10.0.1.1:502)
Client C ─┘ │ (one persistent socket)
CorrelationMap[proxyTxId]
TxIdAllocator (16-bit space)
```
- **Upstream → multiplexer**: each accepted upstream socket is wrapped in an `UpstreamPipe` (read loop + bounded response channel). The pipe's read loop hands every parsed MBAP frame to the multiplexer's `OnUpstreamFrameAsync`, which allocates a free 16-bit `proxyTxId`, stores an `InFlightRequest` in a `CorrelationMap` keyed by that proxyTxId, BCD-rewrites the request payload, overwrites the MBAP header's TxId field with `proxyTxId`, and enqueues the frame into the per-PLC outbound channel.
- **Multiplexer → backend**: a single backend writer task drains the outbound channel and sends each frame to the PLC over the shared socket. A single backend reader task reads MBAP frames back, looks each up by `proxyTxId` in the correlation map, BCD-rewrites the response, restores each interested party's original TxId, and routes the frame to that party's `UpstreamPipe._responseChannel`. The single-writer / single-reader invariant on the backend socket eliminates the need for socket-level synchronisation.
- **Per-request timeout watchdog**: a periodic task scans the correlation map at an interval of one quarter of `Connection.BackendRequestTimeoutMs` and times out any in-flight request that has waited longer than that timeout without a response. Timed-out requests get a Modbus exception 0x0B (Gateway Target Device Failed To Respond) delivered to their upstream party and free their allocator slot. Without this watchdog, a single lost or mis-routed response would leak a correlation entry forever and hang the upstream pipe indefinitely.
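The TxId rewrite and restore steps described above can be sketched as follows. This is a minimal, single-threaded illustration under stated assumptions: `TxIdRewriteSketch` and its members are hypothetical names, the map stores only the original TxId rather than a full `InFlightRequest`, and BCD rewriting, channels, and the watchdog are left out.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical sketch: not the shipped TxIdAllocator / CorrelationMap types.
public sealed class TxIdRewriteSketch
{
    // proxyTxId -> the client's original TxId.
    private readonly Dictionary<ushort, ushort> _inFlight = new();
    private ushort _next;

    // Request path: pick a free proxy TxId, remember the original, patch the header.
    public ushort RewriteRequest(Span<byte> mbapFrame)
    {
        // The MBAP transaction ID is the first two bytes, big-endian.
        ushort original = BinaryPrimitives.ReadUInt16BigEndian(mbapFrame);
        if (_inFlight.Count > ushort.MaxValue)
            throw new InvalidOperationException("16-bit TxId space is full");

        while (_inFlight.ContainsKey(_next)) _next++; // ushort wraps 0xFFFF -> 0x0000
        ushort proxyTxId = _next++;
        _inFlight[proxyTxId] = original;
        BinaryPrimitives.WriteUInt16BigEndian(mbapFrame, proxyTxId);
        return proxyTxId;
    }

    // Response path: look up by proxy TxId, restore the original, free the slot.
    public bool TryRestoreResponse(Span<byte> mbapFrame)
    {
        ushort proxyTxId = BinaryPrimitives.ReadUInt16BigEndian(mbapFrame);
        if (!_inFlight.Remove(proxyTxId, out ushort original)) return false; // unknown or timed out
        BinaryPrimitives.WriteUInt16BigEndian(mbapFrame, original);
        return true;
    }
}
```

A response whose TxId is not in the map returns `false` here; what the real multiplexer does with such a frame is not specified in this document.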
**Operational consequence (replaces the prior 4-client warning).** The H2-ECOM100's 4-concurrent-TCP-client cap (see [`../DL260/dl205.md`](../DL260/dl205.md) → Behavioral Oddities) no longer limits upstream-side connection count — the proxy holds exactly one slot per PLC regardless of how many upstream clients are attached. The wire-rate ceiling is unchanged (the ECOM internally serializes requests at ~210 ms per scan); the multiplexer shifts where serialization happens (proxy outbound queue vs PLC accept queue) rather than adding throughput.
> ⚠ **Backend disconnect cascades upstream.** When the backend socket dies (PLC reboot, network partition, middlebox idle drop), the multiplexer closes every attached upstream pipe in the same cycle and increments `BackendDisconnectCascades` by the upstream count. Clients reconnect on their own next request and the multiplexer Polly-reconnects to the backend on the first upstream frame.
> ⚠ **pymodbus 3.13.0 simulator quirk (test-only).** The pymodbus simulator's `ServerRequestHandler` stores a single `last_pdu` per connection and schedules deferred handlers via `asyncio.call_soon`. Two MBAP frames arriving in the same recv buffer (as the multiplexer can produce on its shared backend connection) overwrite `last_pdu` before the first handler runs, and both responses then carry the later request's TxId. The real DL260 ECOM does not suffer this — it echoes per-request TxIds correctly. Multiplexer correctness under truly concurrent backend traffic is therefore proved against a stub backend in `PlcMultiplexerTests`; the E2E suite paces requests to keep pymodbus in known-good single-PDU mode. The per-request watchdog is the production defence against any backend (real or simulated) that mis-echoes a TxId.
## Configuration — single `appsettings.json`
All configuration lives in one file, loaded via `Microsoft.Extensions.Configuration` and bound to typed POCOs. No sidecar YAML/CSV.
```jsonc
{
"Mbproxy": {
"BcdTags": {
"Global": [
{ "Address": 1072, "Width": 16 },
{ "Address": 1080, "Width": 32 }
]
},
"Plcs": [
{
"Name": "Line1-Mixer",
"ListenPort": 5020,
"Host": "10.0.1.1",
"BcdTags": {
"Add": [ { "Address": 1200, "Width": 32 } ],
"Remove": [ 1080 ]
}
},
{ "Name": "Line1-Conveyor", "ListenPort": 5021, "Host": "10.0.1.2" }
// ... 54 PLC rows
],
"AdminPort": 8080,
"Connection": {
"BackendConnectTimeoutMs": 3000,
"BackendRequestTimeoutMs": 3000
},
"Resilience": {
"BackendConnect": { "MaxAttempts": 3, "BackoffMs": [100, 500, 2000] },
"ListenerRecovery": { "InitialBackoffMs": [1000, 2000, 5000, 15000, 30000], "SteadyStateMs": 30000 }
}
}
}
```
**Hybrid tag resolution.** For each PLC, the effective BCD tag list is `Global ∪ Add − Remove`. `Remove` matches by address; if the same address appears in both `Add` and `Global`, the `Add` entry wins (this is how a width override is expressed). Validation at startup must:
- reject duplicate addresses within a single PLC's resolved list
- reject 32-bit entries that would have their high register overlap a separate 16-bit entry
- warn on `Remove` entries that don't match any global tag (probably stale config)
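A minimal sketch of the merge and the 32-bit overlap check, assuming the order Global, then Add, then Remove. `TagResolutionSketch` is a hypothetical name, and the `BcdTag` record is repeated from the BCD tag shape section only so the block compiles on its own. Duplicate detection inside `Add` and the stale-`Remove` warning are not covered.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record BcdTag(ushort Address, byte Width); // Width is 16 or 32

// Hypothetical helper; the merge order is an assumption, not confirmed by the design.
public static class TagResolutionSketch
{
    public static IReadOnlyList<BcdTag> Resolve(
        IEnumerable<BcdTag> global, IEnumerable<BcdTag> add, IEnumerable<ushort> remove)
    {
        var byAddress = new SortedDictionary<ushort, BcdTag>();
        foreach (var tag in global) byAddress[tag.Address] = tag;
        foreach (var tag in add) byAddress[tag.Address] = tag;      // Add wins over Global
        foreach (var address in remove) byAddress.Remove(address);  // Remove matches by address
        var resolved = byAddress.Values.ToList();                   // sorted by address

        // Reject a 32-bit entry whose high register is also a separate entry.
        for (int i = 0; i + 1 < resolved.Count; i++)
            if (resolved[i].Width == 32 && resolved[i + 1].Address == resolved[i].Address + 1)
                throw new InvalidOperationException(
                    $"32-bit tag at {resolved[i].Address} overlaps tag at {resolved[i + 1].Address}");
        return resolved;
    }
}
```

With the sample config above (`Global` = 1072/16 and 1080/32, `Add` = 1200/32, `Remove` = 1080), the resolved list for `Line1-Mixer` would be 1072/16 and 1200/32.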
## Configuration hot-reload
`Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`, and all consumers read via `IOptionsMonitor<MbproxyOptions>` so a save to the config file propagates without restarting the service. Each change kind has explicit reconcile semantics:
| Change in appsettings | Propagation |
|-----------------------|-------------|
| `BcdTags.Global` add/remove/width | Rewriter dereferences the monitor per-PDU. Next PDU sees the new map; in-flight reads/writes are not retroactively touched. |
| `Plcs[i].BcdTags.{Add,Remove}` | Same — next-PDU resolution. |
| New `Plcs[i]` entry | Listener supervisor binds the new port subject to the same eager-then-auto-recover policy. |
| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream client connections for that PLC. |
| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
| `Connection.Backend*TimeoutMs` | Next backend connect/request uses the new value. In-flight operations keep their already-applied timeout. |
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list) | Reload is rejected as a whole; current in-memory config stays in effect; `mbproxy.config.reload.rejected` is logged at Error. |
Every accepted reload emits `mbproxy.config.reload.applied` at Information with a summary of which PLCs were added/removed and the size of the tag-list delta.
## BCD tag shape
```csharp
public sealed record BcdTag(ushort Address, byte Width); // Width ∈ { 16, 32 }
```
- **16-bit BCD** — one register holds 4 BCD digits (0–9999). Wire value `0x1234` decodes to decimal 1234.
- **32-bit BCD** — a CDAB-ordered register pair at `Address` and `Address+1`. The register at `Address` holds the **low 4 digits**; the register at `Address+1` holds the **high 4 digits**. Decoded decimal = `high * 10000 + low`. This follows directly from DirectLOGIC's CDAB word order (see [`../DL260/dl205.md`](../DL260/dl205.md) → Word Order).
- **Unsigned only.** DL205/DL260 BCD is non-negative in the default ladder pattern; the proxy does not implement signed BCD.
- **Holding-register and input-register addresses share the same space.** The rewriter applies the configured tag list against both FC03 and FC04 reads.
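The two encodings above can be illustrated with a small codec. This is a sketch, not the shipped rewriter: `BcdCodecSketch` is a hypothetical name, and throwing on an invalid nibble is an assumption (the design only lists a `mbproxy.rewrite.invalid_bcd` warning without saying what the rewriter does with the value).

```csharp
using System;

// Hypothetical codec for unsigned 16-bit and CDAB-ordered 32-bit BCD.
public static class BcdCodecSketch
{
    // 16-bit BCD: each nibble is one decimal digit, so 0x1234 decodes to 1234.
    public static int Decode16(ushort raw)
    {
        int value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int digit = (raw >> shift) & 0xF;
            if (digit > 9) throw new FormatException($"Invalid BCD nibble in 0x{raw:X4}");
            value = value * 10 + digit;
        }
        return value;
    }

    public static ushort Encode16(int value)
    {
        if (value is < 0 or > 9999) throw new ArgumentOutOfRangeException(nameof(value));
        int raw = 0;
        for (int shift = 0; shift <= 12; shift += 4)
        {
            raw |= (value % 10) << shift; // lowest decimal digit goes in the lowest nibble
            value /= 10;
        }
        return (ushort)raw;
    }

    // 32-bit BCD: the register at Address holds the low 4 digits, Address+1 the high 4.
    public static int Decode32(ushort atAddress, ushort atAddressPlus1)
        => Decode16(atAddressPlus1) * 10000 + Decode16(atAddress);

    public static (ushort AtAddress, ushort AtAddressPlus1) Encode32(int value)
    {
        if (value is < 0 or > 99_999_999) throw new ArgumentOutOfRangeException(nameof(value));
        return (Encode16(value % 10000), Encode16(value / 10000));
    }
}
```

For example, a register pair of `0x5678` at `Address` and `0x1234` at `Address+1` decodes to 12345678.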
## Rewriter — function code scope
The rewriter inspects and rewrites payloads only for these function codes; every other FC (coils, discrete inputs, diagnostics, exception responses) passes through byte-for-byte:
| FC | Direction | Action |
|----|----------------|-----------------------------------------------------------------------|
| 03 | response | Re-encode covered BCD slots from raw nibbles → binary integer |
| 04 | response | Same as FC03 (input-register table also surfaces V-memory) |
| 06 | request | Re-encode binary integer → BCD nibbles before forwarding |
| 06 | response | Decode BCD nibbles → binary integer on the echo (clients validate that the echoed value equals the value they sent; without this, NModbus-style clients throw on the round-trip) |
| 16 | request        | Re-encode binary integer → BCD nibbles per register across the configured slots, then forward |
**Partial-overlap policy.** A request that touches only ONE register of a configured 32-bit BCD pair (qty=1 at the low addr, or any read/write of the high addr alone) **passes through raw** with a `mbproxy.rewrite.partial_bcd` warning. The proxy never synthesises a Modbus exception for a partial-overlap — that response code is reserved for transport failure.
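The partial-overlap decision reduces to interval arithmetic on register ranges. A minimal sketch, assuming hypothetical names (`SlotCoverage`, `OverlapSketch.Classify`) and half-open ranges:

```csharp
using System;

public enum SlotCoverage { None, Full, Partial }

// Hypothetical helper: classifies how a request range covers one configured BCD tag.
public static class OverlapSketch
{
    // Request covers [start, start + qty). A 32-bit tag occupies tagAddress and tagAddress + 1.
    public static SlotCoverage Classify(int start, int qty, int tagAddress, int width)
    {
        int registers = width == 32 ? 2 : 1;
        int end = start + qty;               // exclusive
        int tagEnd = tagAddress + registers; // exclusive
        int covered = Math.Max(0, Math.Min(end, tagEnd) - Math.Max(start, tagAddress));

        if (covered == 0) return SlotCoverage.None;
        return covered == registers ? SlotCoverage.Full : SlotCoverage.Partial;
    }
}
```

Under this sketch, `Partial` corresponds to the raw pass-through with a `mbproxy.rewrite.partial_bcd` warning, and `Full` to a normal rewrite.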
## Failure modes — transparent pass-through with Polly-bounded backend connect
- **PLC returns a Modbus exception (codes 01–04)** → forward verbatim with the original MBAP transaction ID. The client sees the real DL205/DL260 exception.
- **Backend connect refused or initial connect timeout** → retry under a Polly resilience pipeline: 3 attempts at 100ms / 500ms / 2000ms backoff (tuned via `Resilience.BackendConnect`). If all attempts fail, the multiplexer closes the upstream client connection that triggered the connect.
- **Backend mid-stream broken socket** → the multiplexer's reader/writer task throws; the backend tear-down path cancels both tasks, drains the correlation map, and **cascades the disconnect by closing every attached upstream pipe**. The next upstream request to any pipe triggers a fresh backend connect through the Polly pipeline. `BackendDisconnectCascades` counter records the upstream-pipe count at each cascade event.
- **Backend request timeout** → the per-request watchdog times out any correlation entry older than `Connection.BackendRequestTimeoutMs`, delivers Modbus exception 0x0B (Gateway Target Device Failed To Respond) with the original TxId to the upstream party, and frees the proxy TxId. **No mid-request retries** — FC06 / FC16 are non-idempotent on BCD tags (a partial-applied multi-register write could leave a 32-bit BCD tag mid-transition), so every in-flight request is one-shot. The client interprets the 0x0B as a transport failure and reconnects through its normal path.
- **Partial-BCD overlap** → forward raw + warn (see Rewriter section).
- **One slow PLC does not stall the rest of the fleet.** Each PLC has its own `PlcMultiplexer`, with its own backend socket, correlation map, and outbound channel; per-PLC failures are local. A slow or dead backend on one PLC only impacts that PLC's clients.
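For the request-timeout case above, the synthesized 0x0B response is a 9-byte Modbus TCP frame: a 7-byte MBAP header followed by the function code with its high bit set and the exception code. A minimal sketch, assuming a hypothetical `GatewayTimeoutSketch.Build` helper:

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical builder for the exception frame the watchdog delivers upstream.
public static class GatewayTimeoutSketch
{
    public const byte GatewayTargetFailedToRespond = 0x0B;

    public static byte[] Build(ushort originalTxId, byte unitId, byte functionCode)
    {
        var frame = new byte[9];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), originalTxId); // client's own TxId
        // Bytes 2-3 are the protocol ID, which stays 0 for Modbus.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), 3);            // unit ID + 2 PDU bytes
        frame[6] = unitId;
        frame[7] = (byte)(functionCode | 0x80);                                  // exception flag
        frame[8] = GatewayTargetFailedToRespond;
        return frame;
    }
}
```

A timed-out FC03 request with TxId `0x1234` and unit ID 1 would yield `12 34 00 00 00 03 01 83 0B`.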
## Startup posture — eager, continue on per-port failure
At startup the host attempts to bind **all 54 listen sockets up front**. Each failure (port already in use, invalid IP, malformed PLC entry) is logged at Error and handed off to the listener supervisor (next section). The service proceeds with whichever PLCs bound on the first attempt; the rest converge in the background. Monitoring should alert on `mbproxy.startup.bind.failed` so missing PLCs aren't silently dropped, and watch for `mbproxy.listener.recovered` to confirm late binds eventually succeeded.
## Listener auto-recovery (Polly-backed supervisor)
Each PLC's listener runs under a **supervisor task** that owns its bind lifecycle. If a bind fails at startup, or if a listener faults at runtime (port stolen by another process, transient OS network reset), the supervisor reattempts via a Polly retry pipeline: 5 attempts at 1s / 2s / 5s / 15s / 30s backoff, then steady-state retries every 30s indefinitely (tuned via `Resilience.ListenerRecovery`). Each attempt logs at Debug; the bind that finally succeeds emits one `mbproxy.listener.recovered` Information event.
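The delay schedule described above (a fixed initial list, then a constant steady-state interval) can be expressed as a plain function. This sketch deliberately avoids Polly so it stays self-contained; `ListenerBackoffSketch.DelayFor` is a hypothetical name, and how the real supervisor wires the schedule into a Polly pipeline is not shown in this document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps a 1-based attempt number to the wait before that attempt.
public static class ListenerBackoffSketch
{
    public static TimeSpan DelayFor(int attempt, IReadOnlyList<int> initialBackoffMs, int steadyStateMs)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        // Attempts beyond the initial schedule fall back to the steady-state delay.
        int ms = attempt <= initialBackoffMs.Count ? initialBackoffMs[attempt - 1] : steadyStateMs;
        return TimeSpan.FromMilliseconds(ms);
    }
}
```

With the default `Resilience.ListenerRecovery` values, attempt 4 waits 15 s and every attempt after the fifth waits 30 s.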
While a supervisor is between attempts, the corresponding PLC is reported as `listener.state = recovering` on the status page. Hot-reload uses the same supervisor to bring newly-added PLCs online and to tear down removed ones — there is exactly one code path for "bring up a listener" and one for "shut a listener down."
## Logging — Serilog, structured, console + rolling file
Serilog wired through the Microsoft.Extensions.Logging bridge:
- **Console sink** for interactive `--console` runs.
- **Rolling-file sink** under `%ProgramData%\mbproxy\logs\`.
- **Default level** Information. Per-PLC and per-client scopes via `LogContext.PushProperty("Plc", name)` / `("Client", remoteEp)` so log lines are greppable across the fleet.
Stable event names (keep these stable so log queries don't churn):
| Event | Level | Properties |
|--------------------------------------|---------|---------------------------------------------|
| `mbproxy.startup.bind` | Info | `Plc`, `Port` |
| `mbproxy.startup.bind.failed` | Error | `Plc`, `Port`, `Reason` |
| `mbproxy.listener.recovered` | Info | `Plc`, `Port`, `AttemptCount` |
| `mbproxy.client.connected` | Info | `Plc`, `RemoteEp` |
| `mbproxy.client.disconnected` | Info | `Plc`, `RemoteEp`, `Reason` |
| `mbproxy.backend.failed` | Warning | `Plc`, `Reason` |
| `mbproxy.rewrite.partial_bcd` | Warning | `Plc`, `Address`, `ClientStart`, `ClientQty` |
| `mbproxy.rewrite.invalid_bcd` | Warning | `Plc`, `Address`, `RawValue`, `Direction` |
| `mbproxy.exception.passthrough` | Info | `Plc`, `Fc`, `ExceptionCode` |
| `mbproxy.config.reload.applied` | Info | `PlcsAdded`, `PlcsRemoved`, `TagDelta` |
| `mbproxy.config.reload.rejected` | Error | `Reason` |
| `mbproxy.admin.bind.failed` | Error | `Port`, `Reason` |
| `mbproxy.multiplex.backend.connected` | Info | `Plc`, `Host`, `Port` |
| `mbproxy.multiplex.backend.disconnected` | Warning | `Plc`, `UpstreamCount`, `InFlightCount`, `Reason` |
| `mbproxy.multiplex.saturated` | Error | `Plc`, `RemoteEp` (16-bit TxId space full) |
| `mbproxy.multiplex.request.timeout` | Warning | `Plc`, `ProxyTxId`, `OriginalTxId`, `Fc`, `ElapsedMs` |
## Status page — read-only HTTP endpoint
A separate **Kestrel-hosted minimal API** runs on `Mbproxy.AdminPort` (default `8080`, distinct from the Modbus listen ports). The endpoint set is intentionally narrow — read-only telemetry; **no admin actions** (kick client, force reload, restart listener) are exposed:
- `GET /` — single self-contained HTML page rendering a table of all configured PLCs with their state and live counters. Auto-refreshes every 5s via a meta-refresh tag (no JS bundle, no external assets).
- `GET /status.json` — the same data as JSON for monitoring scrapers.
Authentication is assumed to live at the network layer (trusted internal segment behind a firewall). Surface that assumption in deployment docs when they exist.
**Service-wide fields:**
| Field | Meaning |
|-------|---------|
| `service.uptime` | Seconds since service start |
| `service.version` | Assembly informational version |
| `service.config.lastReloadUtc` | Timestamp of last accepted hot-reload (or `null`) |
| `service.config.reloadCount` | Number of reloads accepted since start |
| `service.config.reloadRejectedCount` | Number of reloads rejected since start |
| `listeners.bound` / `listeners.configured` | Bound listener count vs configured PLC count |
**Per-PLC fields** (one row per `Plcs[i]`):
| Field | Meaning |
|-------|---------|
| `name`, `host`, `listenPort` | Identity from config |
| `listener.state` | `bound` / `recovering` / `stopped` |
| `listener.lastBindError` | Most recent bind failure message (when `recovering`) |
| `listener.recoveryAttempts` | Polly retry count since last successful bind |
| `clients.connected` | Currently connected upstream client count |
| `clients.remoteEndpoints` | Array of `{ remote, connectedAtUtc, pdusForwarded }` |
| `pdus.forwarded` | Total PDUs (request+response) forwarded since start |
| `pdus.byFc` | `{ fc03, fc04, fc06, fc16, other }` request counts |
| `pdus.rewrittenSlots` | Count of register slots BCD-rewritten |
| `pdus.partialBcdWarnings` | Count of partial-overlap pass-throughs |
| `backend.connects.success` / `backend.connects.failed` | Polly-final-result counters |
| `backend.exceptions.byCode` | `{ "01": n, "02": n, "03": n, "04": n }` |
| `backend.lastRoundTripMs` | EWMA of recent successful round-trip times |
| `bytes.upstreamIn` / `bytes.upstreamOut` | Bytes forwarded each direction |
Counters are `System.Threading.Interlocked` longs read atomically per request; no locking on the read path.
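A minimal sketch of that counter pattern, assuming a hypothetical `PlcCountersSketch` class with two of the per-PLC counters. Writers use `Interlocked.Increment` / `Interlocked.Add`, and the status endpoint reads with `Interlocked.Read`, so neither side takes a lock.

```csharp
using System.Threading;

// Hypothetical per-PLC counter holder; field names are illustrative.
public sealed class PlcCountersSketch
{
    private long _pdusForwarded;
    private long _rewrittenSlots;

    // Called from the proxy hot path.
    public void OnPduForwarded() => Interlocked.Increment(ref _pdusForwarded);
    public void OnSlotsRewritten(int count) => Interlocked.Add(ref _rewrittenSlots, count);

    // Called from the status endpoint; Interlocked.Read avoids torn 64-bit reads.
    public long PdusForwarded => Interlocked.Read(ref _pdusForwarded);
    public long RewrittenSlots => Interlocked.Read(ref _rewrittenSlots);
}
```

Each counter is read independently, so a single `/status.json` response is not a consistent snapshot across counters.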
## Test simulator — pymodbus DL260/DL205 server
The pymodbus profile at [`../DL260/dl205.json`](../DL260/dl205.json) already models the DL205/DL260 quirks (BCD nibbles at known addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings, etc.) as concrete register seeds. The test infrastructure wraps it as a managed lifecycle so every integration / e2e test gets a fresh known-good DL-series target without needing real hardware.
Harness shape (lives under `tests/sim/`):
- **Launcher script** — `tests/sim/run-dl205-sim.ps1` provisions a Python venv under `tests/sim/.venv` on first run (`python -m venv` + `pip install pymodbus`), then launches `pymodbus.server` with the `dl205.json` profile on a configurable port. Idempotent: re-runs reuse the venv.
- **xUnit fixture** — `Mbproxy.Tests.Sim.DL205SimulatorFixture : IAsyncLifetime` that:
- `InitializeAsync`: spawns the simulator subprocess, polls `TcpClient.ConnectAsync` against the port until success or a 10 s deadline, captures stdout/stderr to test output.
- `DisposeAsync`: signals graceful shutdown (Ctrl-C on the process group on Windows), then `Process.Kill(entireProcessTree: true)` as a safety net.
- Exposes `Host`, `Port`, `LogTail` (last N lines of sim stderr for diagnosis).
- **Test collection** — `[CollectionDefinition(nameof(DL205SimulatorCollection))]` so the fixture is shared across all integration/e2e classes that opt in (cheap startup, expensive process churn).
- **Skip policy** — if Python or pymodbus isn't available and the auto-provision fails (no network, locked-down CI image, etc.), `InitializeAsync` records the reason and tests skip via `Assert.Skip(sim.SkipReason)`. CI must have Python 3.10+ available; local devs running only the rewriter unit tests need nothing extra.
- **Alternate profiles** — additional scenarios (e.g., a profile that seeds a specific partial-overlap test case, or a profile with strict `type exception: true` to verify the proxy doesn't depend on lax pymodbus behaviour) live alongside `dl205.json` and are selected via `MODBUS_SIM_PROFILE` env var, matching the pattern already established by [`../DL260/DL205BcdQuirkTests.cs`](../DL260/DL205BcdQuirkTests.cs).
The simulator IS the proxy's end-to-end test bed. A standard e2e test does:
1. Start the simulator at `127.0.0.1:<simPort>`.
2. Configure the proxy with one PLC entry `Host=127.0.0.1, Port=<simPort>, ListenPort=<proxyPort>`.
3. Start the proxy (in-process via `WebApplicationFactory`-style host construction).
4. Drive a plain Modbus TCP client (`NModbus` or `FluentModbus`) against `127.0.0.1:<proxyPort>`.
5. Assert two directions:
- **Read**: client sees the BCD-decoded integer (proxy rewrote the response).
- **Write**: simulator's register state shows the BCD-encoded nibbles (proxy rewrote the request).
## Testing
- **Unit tests** — drive the BCD rewriter with synthetic Modbus PDU byte arrays. No network, no simulator. Cover every FC03/04/06/16 × {single 16-bit, full 32-bit pair, partial-overlap low, partial-overlap high, mixed-with-non-BCD} cell.
- **Integration tests** — drive the proxy end-to-end against the pymodbus simulator described in the previous section, using a plain Modbus TCP client (`NModbus` or `FluentModbus`) against `proxy:<listenPort>` and asserting the decoded value rather than the raw register bytes.
- **Auto-recovery tests** — bind a `TcpListener` on a target port BEFORE starting the proxy, assert that the supervisor enters `recovering` state, release the port, and assert the next supervisor attempt succeeds and `mbproxy.listener.recovered` fires. Also cover the runtime-fault path by forcing the accept loop to throw and asserting the supervisor reattempts.
- **Hot-reload tests** — write a temp `appsettings.json`, start the host, mutate the file (add a PLC, remove a PLC, change a global tag width), and assert: (a) supervisor adds/removes the affected listener, (b) the rewriter on the next PDU reflects the new tag map, (c) a malformed reload is rejected without breaking the running config. Cover both `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` paths.
- **Status page tests** — start the host, induce known events (connect 2 clients, force a backend exception, trigger a partial-BCD warning), and assert `GET /status.json` returns the expected counters. The HTML page is verified separately as a smoke test that the route returns 200 with `text/html`.
@@ -0,0 +1,397 @@
# mbproxy — Dashboard KPI catalogue
Recommended additions to the `/status.json` and `/` admin endpoint to make a production fleet dashboard genuinely useful, grouped by tier. Today's `/status.json` exposes raw cumulative counters; this doc describes what's typically *also* expected when those counters land in Grafana / Wonderware / a custom HMI.
**Scope.** This is a proposal, not a contract. The endpoint shape settled in [`design.md`](design.md) → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect.
**Reading guide.** Each KPI has:
- **Name** — short identifier matching the proxy's existing camelCase convention.
- **Definition** — what the number means.
- **Source** — where the value comes from (existing counter, new counter, derived).
- **Widget** — typical dashboard visualisation.
- **Alert** — common threshold or anomaly rule (where applicable).
- **Effort** — implementation cost in hours (rough order-of-magnitude).
## What's exposed today (recap)
For context — every recommended addition below is *in addition to* this list. Today's `/status.json` carries:
| Group | Fields |
|-------|--------|
| Service | `uptimeSeconds`, `version`, `configLastReloadUtc`, `configReloadCount`, `configReloadRejectedCount` |
| Listeners | `bound`, `configured` |
| Per-PLC listener | `state`, `lastBindError`, `recoveryAttempts` |
| Per-PLC clients | `connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded) |
| Per-PLC PDUs | `forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings` |
| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs` |
| Per-PLC bytes | `upstreamIn`, `upstreamOut` |
Counters are **cumulative since process start**. A restart resets them.
---
## Tier 1 — strongly recommended for production
These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard."
### 1.1 Rate metrics (per-PLC and fleet-wide)
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `pdus.ratePerSec.last1m` | PDU rate over the last 60 s | New per-PLC ring buffer (60 × 1 s samples) | Sparkline per PLC | None — informational | 4 h |
| `pdus.ratePerSec.last5m` | Same over 5 min | Same buffer at 300 s | Sparkline | None | shared |
| `errors.ratePerMin` | Sum of `exceptionsByCode.*` + `partialBcdWarnings` + `invalidBcdWarnings` per minute | Derived | Stat tile per PLC | > 10/min → page | 2 h |
| `bytes.ratePerSec.up` / `.down` | Bandwidth each direction | Derived from `bytesUpstreamIn/Out` deltas | Stacked area | None — informational | 2 h |
| `fleet.totalPdusPerSec` | Sum of all PLCs' rates | Aggregate | Single number, big | None | 1 h |
**Why this matters.** Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A Grafana panel computing `rate(pdus_forwarded[1m])` on a 54-row fleet is the single most informative widget on the dashboard.
**Implementation note.** Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want them in `/status.json` directly, add a per-PLC `Mbproxy.Proxy.RateTracker` with a fixed-size circular buffer of 60 one-second samples and expose `RatePerSec1m`, `RatePerSec5m`.
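One possible shape for such a tracker is a ring of one-second buckets. The sketch below is an assumption-laden illustration: `RateTrackerSketch` is a hypothetical name, the caller supplies the current time in whole seconds (which keeps it testable), the window length is a constructor parameter so a 5-minute rate would need a 300-bucket instance, and it is not thread-safe.

```csharp
using System;

// Hypothetical ring-buffer rate tracker; one bucket per second.
public sealed class RateTrackerSketch
{
    private readonly long[] _counts;    // events recorded in each bucket
    private readonly long[] _bucketSec; // which second each bucket currently holds

    public RateTrackerSketch(int windowSeconds = 60)
    {
        _counts = new long[windowSeconds];
        _bucketSec = new long[windowSeconds];
        Array.Fill(_bucketSec, -1L);
    }

    // nowSec is a non-negative, monotonically increasing second count.
    public void Record(long nowSec, long events = 1)
    {
        int i = (int)(nowSec % _counts.Length);
        if (_bucketSec[i] != nowSec) { _bucketSec[i] = nowSec; _counts[i] = 0; } // recycle stale bucket
        _counts[i] += events;
    }

    // Average events per second over the window that ends at nowSec (inclusive).
    public double RatePerSec(long nowSec)
    {
        long total = 0;
        for (int i = 0; i < _counts.Length; i++)
            if (_bucketSec[i] > nowSec - _counts.Length && _bucketSec[i] <= nowSec)
                total += _counts[i];
        return (double)total / _counts.Length;
    }
}
```

Buckets older than the window are ignored on read rather than cleared eagerly, so an idle PLC decays to a rate of zero without any background timer.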
### 1.2 Latency percentiles (replacing the bare EWMA)
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.roundTripMs.p50` | Median backend round-trip over last 1 min | New per-PLC reservoir sample (size 256) | Line chart, per-PLC | None | 6 h |
| `backend.roundTripMs.p95` | 95th percentile | Same reservoir | Line chart | > 500 ms sustained 5 min → warn | shared |
| `backend.roundTripMs.p99` | 99th percentile | Same reservoir | Line chart | > 2 s sustained 5 min → page | shared |
| `backend.roundTripMs.max1m` | Slowest single PDU in last 1 min | Same reservoir | Stat tile | > 5 s → page | shared |
**Why this matters.** The existing `lastRoundTripMs` is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently.
**Implementation note.** Use `Mbproxy.Proxy.LatencyReservoir` — a 256-sample reservoir with Vitter's Algorithm R for unbiased sampling under arbitrary throughput. Don't store every sample (a busy PLC at 100 PDU/s × 60 s = 6,000 samples/min × 54 PLCs = 324K samples/min, too much).
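A sketch of what a 256-slot Algorithm R reservoir with a nearest-rank percentile might look like. `LatencyReservoirSketch` is a hypothetical stand-in for the proposed `Mbproxy.Proxy.LatencyReservoir`; the optional seed is there only for reproducible tests, the class is not thread-safe, and it samples over the whole stream, so a "last 1 min" figure would additionally require resetting or rotating the reservoir each window (an assumption, since the note above does not say how the window is enforced).

```csharp
using System;

// Hypothetical reservoir; uniform sample of everything seen since construction.
public sealed class LatencyReservoirSketch
{
    private readonly double[] _samples;
    private readonly Random _rng;
    private long _seen;

    public LatencyReservoirSketch(int capacity = 256, int? seed = null)
    {
        _samples = new double[capacity];
        _rng = seed is int s ? new Random(s) : new Random();
    }

    // Algorithm R: keep the first k samples, then replace a random slot with probability k / n.
    public void Add(double roundTripMs)
    {
        _seen++;
        if (_seen <= _samples.Length) { _samples[_seen - 1] = roundTripMs; return; }
        long j = _rng.NextInt64(_seen); // uniform in [0, _seen)
        if (j < _samples.Length) _samples[j] = roundTripMs;
    }

    // Nearest-rank percentile over the retained samples; p is in (0, 100].
    public double Percentile(double p)
    {
        int n = (int)Math.Min(_seen, _samples.Length);
        if (n == 0) throw new InvalidOperationException("No samples recorded");
        var sorted = new double[n];
        Array.Copy(_samples, sorted, n);
        Array.Sort(sorted);
        int rank = (int)Math.Ceiling(p / 100.0 * n);
        return sorted[Math.Clamp(rank, 1, n) - 1];
    }
}
```

Memory stays fixed at one `double[256]` per PLC regardless of throughput, which is the point of sampling instead of storing every round-trip.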
### 1.3 Per-PLC availability ratio
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `listener.boundRatio.last1h` | Fraction of time in `bound` state over last hour | New per-supervisor state-time tracker | Gauge per PLC | < 0.99 → warn, < 0.95 → page | 4 h |
| `listener.boundRatio.sinceStart` | Fraction over process lifetime | Same tracker | Gauge | < 0.999 → warn | shared |
| `listener.timeInRecoveringMs.last1h` | Total time spent recovering in last hour | Same tracker | Stat tile | > 60s → warn | shared |
**Why this matters.** `recoveryAttempts` tells you how many times something has flapped, but not how *much* downtime that represented. A PLC that recovers in 1 s once an hour is healthy; one that recovers in 90 s every 10 min is degraded. The ratio captures this directly.
**Implementation note.** Each `PlcListenerSupervisor` already has a state machine. Add a `StateDurationTracker` that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window.
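A minimal sketch of the accumulation step, covering only the since-start ratio. `StateDurationTrackerSketch` is a hypothetical name, states are plain strings, the caller passes in the clock, and the sliding one-hour window is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tracker: accumulates total time per state across transitions.
public sealed class StateDurationTrackerSketch
{
    private readonly Dictionary<string, TimeSpan> _totals = new();
    private string _state;
    private DateTimeOffset _since;

    public StateDurationTrackerSketch(string initialState, DateTimeOffset now)
    {
        _state = initialState;
        _since = now;
    }

    public void Transition(string newState, DateTimeOffset now)
    {
        Accumulate(now); // close out the time spent in the previous state
        _state = newState;
    }

    // Fraction of elapsed time spent in the given state, from construction up to now.
    public double Ratio(string state, DateTimeOffset now)
    {
        Accumulate(now);
        double total = 0;
        foreach (var span in _totals.Values) total += span.TotalMilliseconds;
        if (total == 0) return 0;
        return _totals.GetValueOrDefault(state).TotalMilliseconds / total;
    }

    private void Accumulate(DateTimeOffset now)
    {
        _totals[_state] = _totals.GetValueOrDefault(_state) + (now - _since);
        _since = now;
    }
}
```

A listener that spends its first 10 s in `recovering` and the next 90 s in `bound` would report a bound ratio of 0.9 under this sketch.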
### 1.4 Liveness / staleness signals
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `pdus.lastForwardedUtc` | Wall time of the most recent forwarded PDU | New `_lastForwardedTimestamp` per PLC | Stat tile | `now - value > 5 min AND clients.connected > 0` → page | 1 h |
| `clients.lastActivityUtc` | Per-client last-PDU timestamp | Already implicit; expose explicitly | Per-row in remoteEndpoints | None | 1 h |
| `staleClients.count` | Connected clients with no PDUs in last 5 min | Derived | Stat tile | > 0 → informational | 1 h |
**Why this matters.** Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with `clients.connected = 2` but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured.
### 1.5 Service-wide fleet aggregates
These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard.
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `fleet.plcsHealthy` | Count of PLCs in `bound` state with no errors in last 5 min | Aggregate | Big number, green | < `listeners.configured - 2` → warn | 2 h |
| `fleet.plcsRecovering` | Count in `recovering` state | Aggregate | Big number, orange | > 0 → informational | shared |
| `fleet.plcsStopped` | Count in `stopped` state | Aggregate | Big number, grey | > 0 → page | shared |
| `fleet.plcsWithActiveErrors` | Count with `errors.ratePerMin > 0` | Aggregate | Big number, red | > 0 → page | shared |
| `fleet.totalClientsConnected` | Sum of `clients.connected` | Aggregate | Stat tile | None | 1 h |
| `fleet.totalRewrittenSlotsPerSec` | Sum of rewrite rates | Aggregate + derived | Sparkline | None | shared |
**Why this matters.** A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table.
### 1.6 Multiplexer state — **shipped in [Phase 9](plan/09-txid-multiplexing.md)**
The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth.
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.inFlightCount` | Current in-flight Modbus requests on this PLC's backend connection | Phase-9 counter | Sparkline per PLC | Sustained > 100 → investigate (high churn or slow backend) | (in Phase 9 scope) |
| `backend.maxInFlight` | Peak in-flight count observed since process start | Phase-9 counter | Stat tile per PLC | Approaches 65,000 → page (TxId saturation imminent — realistic only under pathological load) | (in Phase 9 scope) |
| `backend.txIdWraps` | Times the TxId allocator has wrapped 0xFFFF → 0x0000 | Phase-9 counter | Stat tile per PLC | Sudden increase rate → very high in-flight churn; investigate fairness | (in Phase 9 scope) |
| `backend.queueDepth` | Current outbound channel depth (frames queued for the backend writer) | Phase-9 counter | Sparkline per PLC | Sustained > 50 → backend is slower than upstream demand; latency rising | (in Phase 9 scope) |
| `backend.disconnectCascades` | Total upstream clients closed due to backend disconnects | Phase-9 counter | Stat tile per PLC | Spike → network instability; correlate with `mbproxy.backend.failed` events | (in Phase 9 scope) |
**Why this matters.** Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's `lastRoundTripMs` measures wire latency only; queue depth reveals proxy-side backlog).
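To make `backend.txIdWraps` concrete, here is a minimal sketch of a 16-bit allocator that counts rollovers. It is an assumption-laden illustration, not the shipped `TxIdAllocator`: the class name is hypothetical, and a real allocator would also have to skip IDs that are still in flight.

```csharp
using System.Threading;

// Hypothetical allocator; the shipped TxIdAllocator may differ.
public sealed class WrappingTxIdAllocator
{
    private long _allocated; // total ids handed out since start

    // Returns 0x0000..0xFFFF in order, then starts again at 0x0000.
    public ushort Next()
    {
        long seq = Interlocked.Increment(ref _allocated) - 1; // 0-based sequence number
        return (ushort)(seq & 0xFFFF);
    }

    // Number of 0xFFFF -> 0x0000 rollovers so far.
    public long Wraps
    {
        get
        {
            long n = Interlocked.Read(ref _allocated);
            return n == 0 ? 0 : (n - 1) >> 16;
        }
    }
}
```

Deriving the wrap count from one 64-bit total avoids a second counter that could drift from the allocation sequence under contention.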
### 1.7 Read coalescing — **[requires Phase 10](plan/10-read-coalescing.md)**
After Phase 10 ships, same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric.
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.coalescedHitCount` | FC03/04 requests attached to an already-in-flight peer | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) |
| `backend.coalescedMissCount` | FC03/04 requests that created a fresh backend round-trip | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) |
| `backend.coalescingRatio` | `Hit / (Hit + Miss)` over the trailing window | Derived (dashboard) | Stat tile per PLC | None; a low ratio just means clients aren't synchronised on the same registers — informational | (in Phase 10 scope) |
| `backend.coalescedResponseToDeadUpstream` | Fan-out responses dropped because the attached upstream disconnected mid-flight | Phase-10 counter | Stat tile per PLC | Spike → client churn during traffic burst; usually not actionable | (in Phase 10 scope) |
**Why this matters.** Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric.
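The ratio itself is simple arithmetic; the only subtlety is the idle window where both counters are unchanged. A small sketch, assuming the dashboard samples the cumulative hit and miss counters at the start and end of each trailing window (`Ratios` is a hypothetical helper):

```csharp
public static class Ratios
{
    // Hit / (Hit + Miss); null when there was no traffic at all.
    public static double? HitRatio(long hits, long misses)
    {
        long total = hits + misses;
        return total == 0 ? (double?)null : (double)hits / total;
    }

    // Same ratio over a trailing window, from two samples of the cumulative counters.
    public static double? WindowedHitRatio(long prevHits, long prevMisses, long curHits, long curMisses)
        => HitRatio(curHits - prevHits, curMisses - prevMisses);
}
```

Returning `null` rather than `0` for an idle window keeps "no reads happened" distinguishable from "every read missed" on the stat tile.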
### 1.8 Response cache — **[requires Phase 11](plan/11-response-cache.md)**
After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries.
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.cacheHitCount` | FC03/04 requests served from the cache | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) |
| `backend.cacheMissCount` | FC03/04 requests that fell through to the backend (or coalescing) | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) |
| `backend.cacheHitRatio` | `Hit / (Hit + Miss)` for cache-eligible reads | Derived (dashboard) | Stat tile per PLC | None; informs whether TTL tuning is worthwhile | (in Phase 11 scope) |
| `backend.cacheInvalidations` | Cache entries invalidated by FC06/FC16 write responses | Phase-11 counter | Stat tile per PLC | High rate → many writes to cached addresses; consider reducing TTL on those tags | (in Phase 11 scope) |
**Why this matters.** Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts.
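"Invalidate overlapping entries" boils down to a half-open range intersection test. The sketch below shows one way it could look; `ReadRange` and `CacheInvalidation` are hypothetical, and the sketch deliberately ignores the FC03 vs FC04 address-space distinction and unit IDs that a real cache key would need.

```csharp
using System.Collections.Generic;

// Hypothetical cache key: one cached read of registers [Start, Start + Quantity).
public readonly record struct ReadRange(ushort Start, ushort Quantity);

public static class CacheInvalidation
{
    // True when a write to [writeStart, writeStart + writeQty) touches the cached range.
    public static bool Overlaps(ReadRange cached, ushort writeStart, ushort writeQty)
    {
        int cachedEnd = cached.Start + cached.Quantity; // exclusive; int avoids ushort overflow
        int writeEnd = writeStart + writeQty;           // exclusive
        return writeStart < cachedEnd && cached.Start < writeEnd;
    }

    // Removes every overlapping entry and returns how many were dropped,
    // i.e. the amount that would be added to backend.cacheInvalidations.
    public static int Invalidate<TValue>(Dictionary<ReadRange, TValue> cache, ushort writeStart, ushort writeQty)
    {
        var doomed = new List<ReadRange>();
        foreach (var key in cache.Keys)
            if (Overlaps(key, writeStart, writeQty)) doomed.Add(key);
        foreach (var key in doomed) cache.Remove(key);
        return doomed.Count;
    }
}
```

An FC06 single-register write maps to `writeQty = 1`; an FC16 write passes its register count.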
---
## Tier 2 — nice-to-have
Reach for these once Tier 1 is solid. They add depth for specific operational scenarios.
### 2.1 Connection-cap saturation warning
> **Status: superseded by [Phase 9](plan/09-txid-multiplexing.md).** This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. After Phase 9 ships, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect — the 4-client cap on the ECOM is no longer reachable from the upstream side. The closest post-Phase-9 equivalent is `backend.inFlightCount` (Tier 1.6) against the 65,535 TxId-allocator ceiling, but that's realistically unreachable under any normal load. **Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.**
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `clients.atCapWarning` | Boolean: `clients.connected >= 3` (1 short of ECOM100's 4-client cap) | Derived | Cell highlight | True → warn | 1 h |
| `clients.atCapBlocked` | Boolean: `clients.connected >= 4` (cap reached) | Derived | Cell highlight | True → page | shared |
**Why this mattered (pre-Phase-9).** The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see [design.md](design.md) → "Connection model" and [DL260/dl205.md](../DL260/dl205.md) → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments.
### 2.2 Error breakdown / heatmap
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `partialBcd.byClient` | Count of partial-BCD warnings grouped by client remote endpoint | New per-client counter | Top-N list | Top-1 > 100/hr → ops should check the client's tag definition | 3 h |
| `invalidBcd.byAddress` | Count of invalid-BCD events grouped by Modbus address | New per-address counter (small map) | Heatmap | Single address with persistent rate → broken PLC logic | 4 h |
| `exceptions.byCodeRate` | Per-exception-code rate over 5 min | Derived from `exceptionsByCode.*` | Stacked bar | Code 04 (Slave Device Failure) spike → PLC in PROGRAM mode? | 2 h |
**Why this matters.** Once you've seen `partialBcdWarnings = 1247`, the next question is *which client* and *which tag*. Without a dimensional breakdown, you have to log on to the proxy host and search the log files to find out.
### 2.3 Hot-reload cadence
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `config.reloadsPerHour` | Reload events per hour | Derived from `configReloadCount` | Sparkline | > 10/hr → unusual; misconfig loop? | 1 h |
| `config.lastReloadDelta` | Summary of what changed on last reload | Already in `mbproxy.config.reload.applied` event; surface here | Text snippet | None — informational | 2 h |
**Why this matters.** Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.
### 2.4 Memory / process health
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `process.workingSetMb` | `Process.GetCurrentProcess().WorkingSet64 / 1MB` | New | Stat tile | > 1024 MB → warn (54 PLCs shouldn't need that much) | 0.5 h |
| `process.gcCollections.gen0/1/2` | GC counts per generation | `GC.CollectionCount(n)` | Sparkline | Gen-2 frequency → memory pressure | 0.5 h |
| `process.threadCount` | `Process.Threads.Count` | New | Stat tile | > 200 → leak? | 0.5 h |
**Why this matters.** A long-running service in a 24/7 plant needs to prove it's not leaking. These three numbers catch most common leak patterns. Each is a single `Process` or `GC` API call with negligible overhead.
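A minimal sketch of reading the three process-health numbers on each snapshot. `ProcessHealth` and `ProcessHealthReader` are hypothetical names; the `Process` and `GC` calls are the standard .NET APIs named in the table.

```csharp
using System;
using System.Diagnostics;

public sealed record ProcessHealth(double WorkingSetMb, int Gen0, int Gen1, int Gen2, int ThreadCount);

public static class ProcessHealthReader
{
    // Reads everything fresh; nothing is cached between snapshots.
    public static ProcessHealth Read()
    {
        using var proc = Process.GetCurrentProcess();
        return new ProcessHealth(
            WorkingSetMb: proc.WorkingSet64 / (1024.0 * 1024.0),
            Gen0: GC.CollectionCount(0),
            Gen1: GC.CollectionCount(1),
            Gen2: GC.CollectionCount(2),
            ThreadCount: proc.Threads.Count);
    }
}
```

The GC counts are cumulative since process start, so the dashboard would plot their deltas rather than the raw values.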
---
## Real-time updates via SignalR
Today's status surface is poll-based: the HTML page uses a 5-second `meta-refresh`, and Prometheus / custom HMI scrapers hit `/status.json` on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a **live fleet dashboard with many panels open**, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping `bound → recovering`) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update.
**The recommendation is additive, not replacement.** Keep `/status.json` for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change.
### Why this is cheap to add
The `Microsoft.AspNetCore.App` framework reference that Phase 07 added to the csproj **already includes `Microsoft.AspNetCore.SignalR`** — no new NuGet, no version pinning, no AOT concerns. The hub mounts on the existing Kestrel server that runs on `Mbproxy.AdminPort`. No additional port, no additional listener supervision, no additional shutdown path.
### Architecture
```
┌─→ Dashboard A (subscribed to "all")
ProxyWorker / Supervisors ──┐ │
ConfigReconciler ───────────┤ │
ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer")
ServiceCounters ────────────┘ (background loop + │
immediate-push paths) └─→ Dashboard C (subscribed to "service")
```
- **`StatusHub : Hub`** — the SignalR endpoint mounted at `/hub/status` on `AdminPort`. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates.
- **`StatusBroadcaster : IHostedService`** — the background pusher. Holds a `Timer` (or `PeriodicTimer`) that ticks at `PushIntervalMs` (default 1000 ms), builds a `StatusResponse` via the existing `StatusSnapshotBuilder`, diffs it against the previous snapshot, and pushes only the changed pieces. Also exposes `PushEventAsync(name, props)` for the immediate-push paths.
- **Immediate-push wiring** — the existing log events (`mbproxy.listener.recovered`, `mbproxy.config.reload.applied`, `mbproxy.backend.failed`, `mbproxy.rewrite.partial_bcd`, etc.) gain a fan-out call to `broadcaster.PushEventAsync(...)` so subscribers see them inside ~10 ms of occurrence rather than at the next poll tick.
### Hub contract
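**Hub URL:** `http://<host>:<AdminPort>/hub/status` (same scheme and port as the status page; `https://` applies only when a reverse proxy terminates TLS in front of it)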
**Hub groups** — clients subscribe to scopes; the server broadcasts to matching groups:
| Group | Receives |
|-------|----------|
| `all` | Every update for every PLC + every service-level event |
| `service` | Service-level events only (`mbproxy.config.*`, `mbproxy.admin.*`, `mbproxy.startup.*`, `mbproxy.shutdown.*`) |
| `plc:<Name>` | One PLC's snapshots + that PLC's events |
**Server-side methods** (client → server):
| Method | Purpose |
|--------|---------|
| `Task SubscribeFleet()` | Join group `all` |
| `Task SubscribeService()` | Join group `service` |
| `Task SubscribePlc(string name)` | Join group `plc:<name>` after validating that `name` exists in current options |
| `Task Unsubscribe()` | Leave every group; the connection stays open but receives nothing |
**Client-side callbacks** (server → client, named `On*` per SignalR convention):
| Callback | Payload | When |
|----------|---------|------|
| `OnSnapshot(StatusResponse snapshot)` | Full snapshot of the relevant scope (`all`, `service`, or a single PLC) | Sent once on subscribe so the dashboard has a baseline; thereafter on re-subscribe after a reconnect and at each periodic re-baseline (`FullSnapshotIntervalMs`, default 60 s) |
| `OnPatch(StatusPatch patch)` | Delta of fields that changed since the last push | Periodic — every `PushIntervalMs` if anything changed; skipped if nothing changed |
| `OnEvent(StatusEvent ev)` | Single discrete event: `{ name, levelString, plc?, propertiesJson, timestampUtc }` | Immediately — fan-out from the existing `[LoggerMessage]` event call sites |
`StatusPatch` carries only the fields that changed since the previous push: it's a `Dictionary<string, JsonElement>` keyed by JSON path (e.g., `"plcs[2].pdus.forwarded"`, `"plcs[2].listener.state"`). Dashboard clients apply these to their local model. Keeps wire traffic tiny when the fleet is idle.
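One way to produce and consume such a patch, sketched under the assumption that both snapshots have already been flattened into "JSON path → value" dictionaries. `PatchBuilder` is a hypothetical helper, and the sketch does not handle removed fields (for example a PLC deleted by hot-reload), which the periodic full snapshot would have to cover.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class PatchBuilder
{
    // Returns only the paths whose value is new or different in 'current'.
    public static Dictionary<string, JsonElement> Diff(
        IReadOnlyDictionary<string, JsonElement> previous,
        IReadOnlyDictionary<string, JsonElement> current)
    {
        var patch = new Dictionary<string, JsonElement>();
        foreach (var (path, value) in current)
        {
            // Raw-text comparison is enough for scalar counters and state strings.
            if (!previous.TryGetValue(path, out var old) || old.GetRawText() != value.GetRawText())
                patch[path] = value;
        }
        return patch;
    }

    // Dashboard side: fold a patch into the local model.
    public static void Apply(IDictionary<string, JsonElement> model,
                             IReadOnlyDictionary<string, JsonElement> patch)
    {
        foreach (var (path, value) in patch) model[path] = value;
    }
}
```

An empty result from `Diff` is the "skipped if nothing changed" case: the broadcaster simply sends nothing for that tick.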
### What gets pushed, and when
| Update kind | Cadence | Volume per PLC | Channel |
|-------------|---------|----------------|---------|
| Counter increments (PDUs, bytes, rewrites) | Every `PushIntervalMs` if changed; coalesced | 1 patch / push tick / subscribed group | `OnPatch` |
| State transitions (`bound ↔ recovering ↔ stopped`) | Immediate | 1 event + 1 patch | `OnEvent` + `OnPatch` |
| Discrete log events at level ≥ Info from the stable vocabulary | Immediate | 1 event per occurrence | `OnEvent` |
| Hot-reload applied / rejected | Immediate | 1 event with `propertiesJson` summary | `OnEvent` |
| Periodic full snapshot | Every 60 s | 1 full snapshot | `OnSnapshot` |
The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth.
### Configuration
Extend `appsettings.json` with:
```jsonc
"Mbproxy": {
// ... existing keys ...
"Admin": {
"SignalR": {
"Enabled": true,
"PushIntervalMs": 1000, // patch cadence
"FullSnapshotIntervalMs": 60000, // periodic re-baseline
"MaxConcurrentClients": 32, // refuse new connections beyond this
"MaxGroupsPerClient": 8 // anti-runaway-subscription guard
}
}
}
```
The feature can be switched off entirely by configuration: if `SignalR.Enabled = false`, the hub is not mapped, the broadcaster is not started, and there is zero runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality; the first version ships as restart-required.
### Implementation outline
1. **Hub class**: `src/Mbproxy/Admin/StatusHub.cs`. Inherits `Hub`. Implements the three `Subscribe*` methods and `Unsubscribe`. `OnConnectedAsync` rejects the connection when the number of tracked connections would exceed `MaxConcurrentClients`; track connections in a static `ConcurrentDictionary<string, byte>` keyed by `ConnectionId`, and remove the entry in `OnDisconnectedAsync`.
2. **Broadcaster**: `src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService`. Constructor takes `IHubContext<StatusHub>`, `StatusSnapshotBuilder`, `IOptionsMonitor<MbproxyOptions>`. The push loop is a `while (!ct.IsCancellationRequested) { await timer.WaitForNextTickAsync(ct); ... }` body on a `PeriodicTimer`, which is preferable to `Timer` for cancellation correctness.
3. **DTOs**: `StatusPatch` and `StatusEvent` records added to `StatusDto.cs`, registered with the source-gen `StatusJsonContext`.
4. **Event fan-out**: the existing `[LoggerMessage]` partial methods stay; add a thin `RealtimeLogEvents` wrapper class that logs AND calls `broadcaster.PushEventAsync(...)`. Call sites in supervisors / pipelines / reconciler swap to the wrapper. This keeps log-only call sites and broadcast-too call sites both readable.
5. **Hub mapping**: `AdminEndpointHost` adds `app.MapHub<StatusHub>("/hub/status")` if `SignalR.Enabled`. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint.
6. **Shutdown**: `StatusBroadcaster.StopAsync` cancels its pump and the hub's `Dispose` chain handles connection teardown. The existing `ShutdownCoordinator` deadline applies.
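The broadcaster's pump from step 2 could be sketched as below. This is an illustration only: `PushLoop` is a hypothetical helper, and `pushAsync` stands in for "build snapshot, diff, send patch", so the sketch needs nothing beyond the base class library (`PeriodicTimer` requires .NET 6 or later).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PushLoop
{
    // Calls pushAsync once per interval until the token is cancelled.
    public static async Task RunAsync(
        TimeSpan interval,
        Func<CancellationToken, Task> pushAsync,
        CancellationToken ct)
    {
        using var timer = new PeriodicTimer(interval);
        try
        {
            // WaitForNextTickAsync throws OperationCanceledException on cancellation.
            while (await timer.WaitForNextTickAsync(ct))
                await pushAsync(ct);
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            // Normal shutdown path: StopAsync cancelled the token.
        }
    }
}
```

Because the tick wait and the push share one token, cancelling it ends the loop promptly without a separate "stopping" flag, which is the cancellation-correctness argument made in step 2.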
### Test approach
Use the **`Microsoft.AspNetCore.SignalR.Client`** package (NuGet) in the test csproj only. Pattern:
```csharp
[Fact]
[Trait("Category", "E2E")]
public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException()
{
// Arrange: start host on a random AdminPort, build a SignalR client.
var connection = new HubConnectionBuilder()
.WithUrl($"http://localhost:{adminPort}/hub/status")
.Build();
var patches = new ConcurrentQueue<StatusPatch>();
connection.On<StatusPatch>("OnPatch", patches.Enqueue);
await connection.StartAsync(TestContext.Current.CancellationToken);
await connection.InvokeAsync("SubscribePlc", "TestPLC", TestContext.Current.CancellationToken);
// Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1).
// ... drive request through proxy ...
// Assert: a patch with backend.connectsFailed != 0 arrives within 500 ms.
var deadline = DateTime.UtcNow.AddMilliseconds(500);
while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed")))
await Task.Delay(20, TestContext.Current.CancellationToken);
patches.ShouldContain(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed"));
}
```
Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly.
Coverage targets for the new tests:
1. `SignalR_Subscribe_DeliversInitialSnapshot`
2. `SignalR_Patch_FiresWithinPushInterval_AfterCounterChange`
3. `SignalR_Event_FiresWithin_100ms_OfListenerRecovered`
4. `SignalR_SubscribePlc_OnlyDeliversThatPlcEvents` — verifies group filtering
5. `SignalR_MaxConcurrentClients_RefusesExcess` — capacity guard
6. `SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs`
### Operational considerations
- **Authentication / authorisation.** Same network-trust assumption as the rest of the admin endpoint — none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy.
- **Transport.** SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport.
- **Backpressure.** `Hub.Clients.Group("all").SendAsync` does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy.
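- **Reconnection.** The .NET and browser SignalR clients reconnect automatically only when built with `WithAutomaticReconnect()` / `withAutomaticReconnect()` (by default four attempts, after 0, 2, 10, and 30 s). A reconnected client gets a new connection ID and SignalR does not carry group membership across, so the dashboard must re-invoke its `Subscribe*` calls in its reconnected handler; the `OnSnapshot` sent on subscribe then re-baselines its local model.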
- **Cardinality at scale.** 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch ≈ 850 KB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The `MaxConcurrentClients` guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy.
- **CORS.** If dashboards run on a different origin (likely), enable CORS on the admin app for `/hub/status` only. Add `AdminCors.AllowedOrigins` to `appsettings.json` as an array of allowed origin strings; an empty array means same-origin only.
- **Logging.** SignalR's internal logs are noisy at Information. In `appsettings.json`, set the `Microsoft.AspNetCore.SignalR` category to `Warning` and `Microsoft.AspNetCore.Http.Connections` to `Warning` so the proxy's own event stream isn't drowned out.
### Effort estimate
| Work | Hours |
|------|-------|
| Hub + DTOs + broadcaster | 6 h |
| Event fan-out wiring (existing log events) | 3 h |
| AdminEndpointHost integration + appsettings binding | 2 h |
| E2E test suite (6 tests using SignalR .NET client) | 4 h |
| Documentation (this section graduates from proposal to fact; design.md update) | 1 h |
| **Total** | **~16 h** |
This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production.
---
## Implementation notes
### Where rates and percentiles should live
Two reasonable answers:
1. **Compute in the proxy, expose pre-computed values in `/status.json`.** Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change.
2. **Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates.** Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar.
**Recommendation:** ship Tier 1 rate metrics computed in-process for the operator who just opens `http://<host>:8080/` in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative.
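For the in-process half of that recommendation, a best-effort windowed rate can be kept with a handful of samples per counter. The sketch below is one possible shape, not the project's implementation: `WindowedRate` is a hypothetical class, it is not thread-safe, and it assumes the snapshot builder feeds it the cumulative counter on each tick.

```csharp
using System;
using System.Collections.Generic;

// Keeps (timestamp, cumulative value) samples and reports the per-second rate
// between the oldest retained sample and the newest one.
public sealed class WindowedRate
{
    private readonly TimeSpan _window;
    private readonly Queue<(DateTime At, long Value)> _samples = new();
    private (DateTime At, long Value) _newest;

    public WindowedRate(TimeSpan window) => _window = window;

    public void Record(DateTime utcNow, long cumulativeValue)
    {
        _newest = (utcNow, cumulativeValue);
        _samples.Enqueue(_newest);
        // Drop samples older than the window, but always keep two so a delta exists.
        while (_samples.Count > 2 && utcNow - _samples.Peek().At > _window)
            _samples.Dequeue();
    }

    public double PerSecond()
    {
        if (_samples.Count < 2) return 0;
        var oldest = _samples.Peek();
        double seconds = (_newest.At - oldest.At).TotalSeconds;
        return seconds <= 0 ? 0 : (_newest.Value - oldest.Value) / seconds;
    }
}
```

The raw cumulative counter stays untouched, so a TSDB scraping `/status.json` still sees the authoritative value.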
### Counter additions vs computed values
A few proposed KPIs require **new counters in `ProxyCounters` or `ServiceCounters`**, not just derivations:
- `pdus.lastForwardedUtc`: new `long _lastForwardedTicks` on `ProxyCounters`, written and read through `Interlocked` (C# does not permit the `volatile` modifier on a `long` field).
- `listener.boundRatio.*` — new `StateDurationTracker` on `PlcListenerSupervisor`.
- `partialBcd.byClient` / `invalidBcd.byAddress` — new `ConcurrentDictionary<string,long>` / `ConcurrentDictionary<ushort,long>` on `PerPlcContext`. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases).
- `process.*` — read fresh on every snapshot from `Process.GetCurrentProcess()` — no stored state.
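A sketch of the `pdus.lastForwardedUtc` counter from the list above, using `Interlocked` on a plain `long` field (C# rejects `volatile` on 64-bit fields). `LastForwardedStamp` is a hypothetical stand-in for the field that would live on `ProxyCounters`.

```csharp
using System;
using System.Threading;

public sealed class LastForwardedStamp
{
    private long _lastForwardedTicks; // 0 = nothing forwarded yet

    // Hot path: called once per forwarded PDU.
    public void Mark(DateTime utcNow) =>
        Interlocked.Exchange(ref _lastForwardedTicks, utcNow.Ticks);

    // Snapshot path: null until the first PDU has been forwarded.
    public DateTime? Read()
    {
        long ticks = Interlocked.Read(ref _lastForwardedTicks);
        return ticks == 0 ? (DateTime?)null : new DateTime(ticks, DateTimeKind.Utc);
    }
}
```

Storing ticks rather than a `DateTime` keeps the write a single atomic 64-bit operation on every platform.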
### Snapshot serialization cost
`StatusResponse` is built per-request to `/status.json`. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., `invalidBcd.byAddress`) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep `/status.json` under a few hundred KB even when something goes badly wrong.
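The "top-50 by count, drop the rest" cap can be applied at serialization time rather than on the hot path. A minimal sketch, with `TopN.Largest` as a hypothetical helper over the per-address counts:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TopN
{
    // Returns the n highest-count entries, highest first.
    // Ties are broken by address so the output is stable between snapshots.
    public static IReadOnlyList<KeyValuePair<ushort, long>> Largest(
        IEnumerable<KeyValuePair<ushort, long>> counts, int n) =>
        counts.OrderByDescending(kv => kv.Value)
              .ThenBy(kv => kv.Key)
              .Take(n)
              .ToList();
}
```

This bounds only the response size; bounding the in-memory map itself is a separate concern that the sketch does not address.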
### Dashboard widget mapping (Grafana-style cheat sheet)
| Widget | Use for |
|--------|---------|
| **Stat (big number)** | Service-wide aggregates, counts, latest timestamps |
| **Gauge** | Ratios (availability, success rate, queue depth) |
| **Sparkline** | Rates, percentiles, time-series trends |
| **Stacked area** | Bandwidth, PDU-by-FC breakdown over time |
| **Heatmap** | Per-address / per-client dimensional breakdowns |
| **Cell-coloured table** | Per-PLC status (54 rows, one per PLC, columns of KPIs) |
### Backwards-compat policy
The fields currently in `/status.json` are **frozen** — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in [`design.md`](design.md) → "Status page" as the contract; new fields ship via PRs that update the contract first.
## Cross-references
- Field tables for what ships today: [`design.md`](design.md) → "Status page".
- Stable log event names (some KPIs are derivable by tailing these): [`design.md`](design.md) → "Logging" event-name table.
- Per-counter wiring lives in `src/Mbproxy/Proxy/ProxyCounters.cs` and `src/Mbproxy/ServiceCounters.cs`.
- The status HTML page is rendered by `src/Mbproxy/Admin/StatusHtmlRenderer.cs`; the JSON DTOs and source-gen context live in `src/Mbproxy/Admin/StatusDto.cs`.
+271
@@ -0,0 +1,271 @@
# mbproxy operations runbook
Day-two operations reference for the mbproxy Windows Service: install, upgrade, configuration, logs, and troubleshooting.
## Install
### Prerequisites
- Windows 10 / Server 2019 or later (64-bit).
- PowerShell 5.1+ run as Administrator (the install script uses `#Requires -RunAsAdministrator`).
- The compiled publish output from `dotnet publish` (see [README.md](../README.md) for the exact command).
- Modbus TCP reachable from the proxy host to the PLCs on port 502.
- Port 8080 (or whatever `AdminPort` is set to) available for the status page.
### Steps
1. Publish the binaries on the build machine:
```powershell
dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true -o C:\build\mbproxy-publish
```
2. Copy the publish output to the target server (or run the install script locally if you built on the server).
3. Open an elevated PowerShell prompt and run the install script:
```powershell
.\install\install.ps1 -PublishOutput C:\build\mbproxy-publish -Start
```
The script:
- Copies binaries to `C:\Program Files\Mbproxy\` (configurable via `-InstallPath`).
- Registers the service with `sc.exe create`.
- Sets failure-recovery: restart after 60 s on first/second failure, no action on third.
- Creates `%ProgramData%\mbproxy\logs\` and sets ACLs if needed.
- Copies `mbproxy.config.template.json` to `%ProgramData%\mbproxy\appsettings.json` **only if no config exists**.
- Registers the Windows Event Log source `mbproxy`.
- With `-Start`, starts the service and waits up to 30 s for `RUNNING` state.
4. Edit `%ProgramData%\mbproxy\appsettings.json` to configure your PLC list and BCD tags. See the template for inline comments on every field.
5. If you edited the config before starting, start the service:
```powershell
sc.exe start mbproxy
```
6. Verify (smoke checklist — see [Smoke checklist](#first-install-smoke-checklist) below).
### Re-running install on an existing installation
The install script is idempotent. Re-running it:
- Stops the service if running.
- Overwrites the binaries.
- Updates the service config via `sc.exe config` (not `sc.exe create`).
- Preserves `%ProgramData%\mbproxy\appsettings.json` (never overwritten on update).
- Skips Event Log source creation if already registered.
## Upgrade procedure
1. Publish new binaries on the build machine (same command as install step 1).
2. Stop the service:
```powershell
sc.exe stop mbproxy
```
Wait for the service to reach `STOPPED` state — graceful shutdown drains in-flight PDUs (up to `Connection.GracefulShutdownTimeoutMs`, default 10 s).
3. Copy new binaries to `C:\Program Files\Mbproxy\` (or run `install.ps1 -PublishOutput ...` to automate steps 2-4):
```powershell
Copy-Item -Path C:\build\mbproxy-publish\* -Destination 'C:\Program Files\Mbproxy\' -Force
```
4. Start the service:
```powershell
sc.exe start mbproxy
```
5. Check the status page to confirm the new version:
```powershell
Invoke-RestMethod http://localhost:8080/status.json | Select-Object -ExpandProperty service
```
The `version` field should show the new build.
## Uninstall
```powershell
.\install\uninstall.ps1
```
Options:
- `-KeepConfig` — preserves `%ProgramData%\mbproxy\appsettings.json` for re-install.
- Log files are **always archived** to `%ProgramData%\mbproxy.archived-<timestamp>\logs\` regardless of `-KeepConfig`. They are never deleted.
## Configuration
The service reads `%ProgramData%\mbproxy\appsettings.json` at startup and watches it for changes while running. Most settings are hot-reloadable; a few require a restart.
### Hot-reload vs. restart
| Setting | Behaviour on file save |
|---|---|
| `BcdTags.Global` add/remove/width | Next PDU uses the new map; in-flight PDUs complete with the old map. |
| `Plcs[].BcdTags.{Add,Remove}` | Same per-PDU propagation. |
| `Plcs[].Name` or `.Host` or `.ListenPort` changed | Treated as remove + add: old listener stops, new one starts. |
| New `Plcs[]` entry | New listener binds immediately (subject to port availability). |
| `Plcs[]` entry removed | Supervisor stops the listener; all connected clients for that PLC are disconnected. |
| `Connection.Backend*TimeoutMs` | Next connect/request uses the new value. |
| `Connection.GracefulShutdownTimeoutMs` | Picked up on the next `ApplicationStopping` event. |
| `AdminPort` | Admin endpoint re-binds on the new port; old port released. |
| Invalid reload (schema error, duplicate ports/addresses) | Rejected as a whole. Current in-memory config stays; `mbproxy.config.reload.rejected` logged at Error. |
For more detail on the hot-reload propagation model, see [`design.md`](design.md) → "Configuration hot-reload".
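The "rejected as a whole" row is an all-or-nothing swap: validate the candidate completely, then either replace the in-memory config or keep the old one. The sketch below illustrates only the duplicate-port check; `PlcEntry` and `ReloadValidator` are hypothetical and much smaller than the real options model, and schema and BCD-address validation are omitted.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down shape of one Plcs[] entry.
public sealed record PlcEntry(string Name, string Host, int ListenPort);

public static class ReloadValidator
{
    // Returns every problem found; an empty list means the candidate may be applied.
    public static IReadOnlyList<string> Validate(IReadOnlyList<PlcEntry> candidate)
    {
        var errors = new List<string>();
        foreach (var g in candidate.GroupBy(p => p.ListenPort).Where(g => g.Count() > 1))
            errors.Add($"Duplicate ListenPort {g.Key}: {string.Join(", ", g.Select(p => p.Name))}");
        return errors;
    }

    // All-or-nothing: the current config survives if the candidate has any error.
    public static IReadOnlyList<PlcEntry> Apply(
        IReadOnlyList<PlcEntry> current,
        IReadOnlyList<PlcEntry> candidate,
        out IReadOnlyList<string> errors)
    {
        errors = Validate(candidate);
        return errors.Count == 0 ? candidate : current;
    }
}
```

Collecting all errors before deciding matches the runbook's "joined error list": one save surfaces every problem instead of one per attempt.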
### Editing appsettings.json
The service picks up changes automatically, so editing the file does not by itself require a restart. Two caveats: a new `Connection.GracefulShutdownTimeoutMs` value only takes effect at the next stop, and updating the binary always requires a restart.
If a reload is rejected (`mbproxy.config.reload.rejected` in the log), the service continues running with the previous config. Fix the JSON error and save again — the next valid file write will be accepted.
## Logs
### Location
Rolling log files live at: `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`
One file per day, retained for 30 days by default (controlled by `retainedFileCountLimit` in the Serilog config section).
### Windows Event Log
When running as a Windows Service, the `EventLogBridge` sink writes events at Error level and above to the Windows Application Event Log under source `mbproxy`. View with:
```powershell
Get-EventLog -LogName Application -Source mbproxy -Newest 20
```
Or open Event Viewer → Windows Logs → Application, filter by source `mbproxy`.
### Log survival after uninstall
`uninstall.ps1` **never deletes log files**. It moves `logs\` to a timestamped archive at `%ProgramData%\mbproxy.archived-<timestamp>\logs\` so post-crash diagnostics remain accessible.
## Status page
**URL:** `http://<proxy-host>:<AdminPort>/`
Default port: 8080. Change with `Mbproxy.AdminPort` in `appsettings.json`.
Routes:
- `GET /` — HTML table, auto-refreshes every 5 s. No external assets.
- `GET /status.json` — same data as JSON for monitoring scrapers.
Key fields on `/status.json`:
| Field | Meaning |
|---|---|
| `service.version` | Assembly informational version (set at publish time). |
| `service.uptimeSeconds` | Seconds since service start. |
| `service.config.lastReloadUtc` | Last accepted hot-reload timestamp. |
| `listeners.bound` / `listeners.configured` | Bound count vs. configured PLC count. |
| `plcs[].listener.state` | `bound` / `recovering` / `stopped`. |
| `plcs[].backend.connectsSuccess` | Successful backend TCP connects since start. |
| `plcs[].backend.connectsFailed` | Failed backend connects (all retries exhausted). |
| `plcs[].pdus.forwarded` | Total PDUs forwarded through this PLC's proxy. |
## Common failure modes
### `mbproxy.startup.bind.failed` — port in use
**Symptom:** The service starts but one or more PLCs show `listener.state = recovering`.
**Cause:** Another process is bound to the configured `ListenPort`.
**Remediation:**
```powershell
netstat -ano | findstr :<port> # find PID holding the port
Get-Process -Id <pid> # identify the process
```
Release the port or change `Plcs[].ListenPort` in `appsettings.json`. The supervisor will retry automatically — watch for `mbproxy.listener.recovered` in the log.
### `mbproxy.listener.recovered` — no action needed
A previously-failing listener successfully bound. The service is self-healing. This is informational.
### `mbproxy.backend.failed` — PLC unreachable
**Symptom:** Upstream clients cannot connect through the proxy, or connections are immediately dropped.
**Cause:** The PLC backend (`Plcs[].Host:Port`) is unreachable — network issue, PLC power cycle, or H2-ECOM100 firmware issue.
**Remediation:** Check network path to the PLC. Verify the PLC Modbus port is responding:
```powershell
Test-NetConnection -ComputerName <plc-ip> -Port 502
```
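Note: the H2-ECOM100 module caps connections at 4 simultaneous TCP clients. Under the pre-Phase-9 1:1 connection model, a fifth upstream client on one PLC port triggers `mbproxy.backend.failed`. With Phase 9 multiplexing the proxy holds a single backend socket per PLC, so the number of upstream clients no longer reaches the cap; other devices that connect to the ECOM directly still count against it.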
### `mbproxy.config.reload.rejected` — bad config
**Symptom:** The log shows a rejection event after a file save; the current config is unchanged.
**Cause:** The saved `appsettings.json` has a schema error, duplicate port, or conflicting BCD address.
**Remediation:** Check the log for the joined error list immediately following the rejection event. Fix the JSON and save again.
### `mbproxy.admin.bind.failed` — admin port in use
**Symptom:** The status page is unreachable.
**Cause:** Another process is using `AdminPort`.
**Remediation:** The proxy continues to forward Modbus traffic — only the status page is affected. Change `AdminPort` in `appsettings.json` (hot-reload applies).
### `mbproxy.rewrite.partial_bcd` — client reading half a 32-bit BCD pair
**Symptom:** Warning in the log; the value passes through raw (no rewrite).
**Cause:** The upstream client is reading only one register of a configured 32-bit BCD pair (e.g., quantity = 1 at the low address, or any read at the high address alone). This is almost always a client-side tag-definition bug.
**Remediation:** Verify the client's tag definition specifies quantity = 2 for 32-bit BCD addresses.
### `mbproxy.rewrite.invalid_bcd` — non-BCD value from PLC
**Symptom:** Warning in the log; the value passes through raw.
**Cause:** The PLC returned a register value that contains non-BCD nibbles (e.g., `0xA123` — the nibble `A` is invalid BCD). This usually indicates the ladder program wrote a non-BCD value to a register configured as a BCD tag.
**Remediation:** Investigate the PLC ladder program. The proxy cannot decode non-BCD data — passing it through is safer than guessing.
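The nibble check behind this event can be sketched as follows. This is an illustration only, not the proxy's actual code; `BcdNibbles.IsValidBcd16` is a hypothetical name, useful when checking a suspect register value by hand.

```csharp
// Hypothetical helper (not part of mbproxy): reports whether every 4-bit
// nibble of a 16-bit register is a decimal digit 0-9.
public static class BcdNibbles
{
    public static bool IsValidBcd16(ushort raw)
    {
        for (int shift = 0; shift < 16; shift += 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9) return false; // 0xA-0xF cannot be a BCD digit
        }
        return true;
    }
}
```

Under this sketch `IsValidBcd16(0xA123)` is false (the top nibble is `A`), while `IsValidBcd16(0x1234)` is true.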
## First-install smoke checklist
Run these commands after `install.ps1 -Start` to verify the deployment:
```powershell
# 1. Service is running
Get-Service mbproxy | Select-Object Status, DisplayName
# 2. Status page is reachable
Invoke-WebRequest http://localhost:8080/ -UseBasicParsing | Select-Object StatusCode
# 3. JSON endpoint returns expected fields
$status = Invoke-RestMethod http://localhost:8080/status.json
$status.service | Select-Object version, uptimeSeconds
$status.listeners
# 4. Log file exists and is recent
Get-Item "C:\ProgramData\mbproxy\logs\mbproxy-*.log" | Sort-Object LastWriteTime -Descending | Select-Object -First 1
# 5. No Error events in the Event Log
Get-EventLog -LogName Application -Source mbproxy -EntryType Error -Newest 5
# 6. Stop the service cleanly (graceful shutdown within 10 s)
$sw = [System.Diagnostics.Stopwatch]::StartNew()
sc.exe stop mbproxy
$deadline = [DateTime]::UtcNow.AddSeconds(15)
do { Start-Sleep 1 } until ((Get-Service mbproxy).Status -eq 'Stopped' -or [DateTime]::UtcNow -gt $deadline)
$sw.Stop()
Write-Host "Stop elapsed: $($sw.ElapsedMilliseconds) ms"
(Get-Service mbproxy).Status # Should be Stopped
```
**Note:** This checklist documents the expected steps. It was not executed on a dedicated clean VM (the proxy was developed and unit/E2E tested in-process). Run this checklist on first deployment to a production host.
# Phase 00 — Bootstrap
Scaffold the .NET 10 Worker Service project and the test project. Wire up Generic Host, Serilog, Windows-Service registration, and `MbproxyOptions` POCOs bound via `IOptionsMonitor`. No proxy logic yet — the service starts, logs "ready", and stops cleanly.
**Depends on:** nothing. Must run alone.
**Parallel-safe with:** nothing. Phase 00 owns the initial `.csproj` and solution; subsequent phases append.
## Goal
Produce a minimal but production-shaped host that all subsequent phases plug into. The host must:
- Target `.NET 10` (`net10.0`), be registered as a Windows Service via `Microsoft.Extensions.Hosting.WindowsServices`, and also run as a console under `dotnet run` for local dev.
- Load `appsettings.json` with `reloadOnChange: true`, bind the `"Mbproxy"` section to typed POCOs, and expose them via `IOptionsMonitor<MbproxyOptions>`.
- Use Serilog with console + rolling-file sinks under `%ProgramData%\mbproxy\logs\` (configurable, but default that location).
- Set `<TreatWarningsAsErrors>true</TreatWarningsAsErrors>` and `<Nullable>enable</Nullable>` in the csproj. These stay set forever.
## Outputs (files created in this phase)
```
Mbproxy.slnx
src/Mbproxy/Mbproxy.csproj
src/Mbproxy/Program.cs
src/Mbproxy/HostingExtensions.cs # AddMbproxyOptions, AddMbproxySerilog
src/Mbproxy/Options/MbproxyOptions.cs
src/Mbproxy/Options/BcdTagOptions.cs
src/Mbproxy/Options/PlcOptions.cs
src/Mbproxy/Options/ConnectionOptions.cs
src/Mbproxy/Options/ResilienceOptions.cs
src/Mbproxy/Options/BcdTagListOptions.cs # the Global + per-PLC Add/Remove DTOs
src/Mbproxy/Workers/HeartbeatWorker.cs # one-line "service alive" worker; deleted by phase 03
src/Mbproxy/appsettings.json # minimal default with empty Plcs array
tests/Mbproxy.Tests/Mbproxy.Tests.csproj
tests/Mbproxy.Tests/HostSmokeTests.cs
tests/Mbproxy.Tests/Options/MbproxyOptionsBindingTests.cs
.gitignore # add bin/, obj/, .vs/, *.user, tests/sim/.venv/, %ProgramData%\mbproxy\
```
No other files. Phase 00 does NOT create:
- BCD codec types (phase 02)
- Proxy types (phase 03)
- Listener supervisor (phase 05)
- Status page (phase 07)
## Tasks
1. **Create `Mbproxy.slnx`** referencing the two csprojs.
2. **`src/Mbproxy/Mbproxy.csproj`** — `<Project Sdk="Microsoft.NET.Sdk.Worker">`, `TargetFramework=net10.0`, `OutputType=Exe`, `Nullable=enable`, `TreatWarningsAsErrors=true`, `ImplicitUsings=enable`. PackageReferences:
- `Microsoft.Extensions.Hosting` (latest stable for .NET 10)
- `Microsoft.Extensions.Hosting.WindowsServices`
- `Serilog.Extensions.Hosting`
- `Serilog.Settings.Configuration`
- `Serilog.Sinks.Console`
- `Serilog.Sinks.File`
- `Polly` (referenced now so phase 04/05 don't have to touch this csproj for the package; usage is deferred)
3. **`Options/MbproxyOptions.cs`** and siblings — typed POCOs that mirror the appsettings schema in [`../design.md`](../design.md) → Configuration. Keep them plain DTOs (`public sealed class` with init-only properties). Use `IValidateOptions<MbproxyOptions>` for cross-field checks at the **schema** level only (no business rules like "duplicate addresses" — those move to phase 06 along with hot-reload).
4. **`HostingExtensions.cs`** — extension methods on `IHostApplicationBuilder` named `AddMbproxyOptions()` and `AddMbproxySerilog()`, matching the public surface declared below; both read from the builder's own `Configuration`. Keep `Program.cs` thin: read config, call the two extensions, register `HeartbeatWorker`, run.
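5. **`Program.cs`** — Generic Host with Windows Service support. Because `Host.CreateApplicationBuilder(args)` returns a `HostApplicationBuilder`, register the service lifetime with `builder.Services.AddWindowsService()` (`.UseWindowsService()` is the equivalent on the older `IHostBuilder` API), then `await builder.Build().RunAsync()`. Honour `--console` as a no-op flag for documentation symmetry with the design (the worker SDK plus the Windows Service lifetime already runs in console mode under `dotnet run`).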
6. **`Workers/HeartbeatWorker.cs`** — `BackgroundService` that logs `mbproxy.startup.ready` once after `Task.Delay(100)` (so Serilog has flushed) and then idles. This worker is deleted in phase 03 when the real listener supervisor takes over; it exists so phase 00's smoke test has something to assert.
7. **`appsettings.json`** — minimal, valid against the POCOs, with `Plcs: []`. Include the full key shape (`BcdTags.Global`, `AdminPort`, `Connection`, `Resilience`) so future phases just fill in values.
8. **`tests/Mbproxy.Tests/Mbproxy.Tests.csproj`** — Microsoft.NET.Sdk, `TargetFramework=net10.0`, same `Nullable`/`TreatWarningsAsErrors`. ProjectReference to `src/Mbproxy/Mbproxy.csproj`. PackageReferences:
- `Microsoft.NET.Test.Sdk`
- `xunit` (v3 if a stable release exists; v2 otherwise — record the decision in the csproj comment)
- `xunit.runner.visualstudio`
- `Shouldly`
9. **`HostSmokeTests.cs`** — build the host with `Host.CreateApplicationBuilder` against a synthetic config, start it on a `CancellationTokenSource` with a short deadline, assert it logged `mbproxy.startup.ready` and shut down without unhandled exceptions.
10. **`MbproxyOptionsBindingTests.cs`** — bind a hand-written `Dictionary<string,string>` config source into `MbproxyOptions`, assert all fields populate correctly (including a `Plcs` entry with `BcdTags.Add` and `BcdTags.Remove`).
## Public surface declared in this phase
```csharp
namespace Mbproxy.Options;
public sealed class MbproxyOptions {
public BcdTagListOptions BcdTags { get; init; } = new();
public IReadOnlyList<PlcOptions> Plcs { get; init; } = [];
public int AdminPort { get; init; } = 8080;
public ConnectionOptions Connection { get; init; } = new();
public ResilienceOptions Resilience { get; init; } = new();
}
public sealed class BcdTagListOptions {
public IReadOnlyList<BcdTagOptions> Global { get; init; } = [];
}
public sealed class BcdTagOptions {
public ushort Address { get; init; }
public byte Width { get; init; } // 16 or 32
}
public sealed class PlcOptions {
public string Name { get; init; } = "";
public int ListenPort { get; init; }
public string Host { get; init; } = "";
public PlcBcdOverrides? BcdTags { get; init; }
}
public sealed class PlcBcdOverrides {
public IReadOnlyList<BcdTagOptions> Add { get; init; } = [];
public IReadOnlyList<ushort> Remove { get; init; } = [];
}
public sealed class ConnectionOptions {
public int BackendConnectTimeoutMs { get; init; } = 3000;
public int BackendRequestTimeoutMs { get; init; } = 3000;
}
public sealed class ResilienceOptions {
public RetryProfile BackendConnect { get; init; } = new() { MaxAttempts = 3, BackoffMs = [100, 500, 2000] };
public RecoveryProfile ListenerRecovery { get; init; } = new() {
InitialBackoffMs = [1000, 2000, 5000, 15000, 30000],
SteadyStateMs = 30000,
};
}
public sealed class RetryProfile {
public int MaxAttempts { get; init; }
public IReadOnlyList<int> BackoffMs { get; init; } = [];
}
public sealed class RecoveryProfile {
public IReadOnlyList<int> InitialBackoffMs { get; init; } = [];
public int SteadyStateMs { get; init; }
}
```
```csharp
namespace Mbproxy;
internal static class HostingExtensions {
public static IHostApplicationBuilder AddMbproxyOptions(this IHostApplicationBuilder b);
public static IHostApplicationBuilder AddMbproxySerilog(this IHostApplicationBuilder b);
}
```
```csharp
namespace Mbproxy.Workers;
internal sealed class HeartbeatWorker : BackgroundService { /* logs mbproxy.startup.ready */ }
```
No other public types in this phase.
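One plausible reading of `RecoveryProfile` is sketched below. The semantics are an assumption, since phase 05 owns the real supervisor: attempts walk `InitialBackoffMs` in order, and every later attempt waits `SteadyStateMs`. `RecoveryBackoff` and `DelayMsForAttempt` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Assumed semantics, not confirmed by the design: attempt 0..N-1 use
// InitialBackoffMs in order; every attempt after that waits SteadyStateMs.
public static class RecoveryBackoff
{
    public static int DelayMsForAttempt(
        IReadOnlyList<int> initialBackoffMs, int steadyStateMs, int attempt)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));
        return attempt < initialBackoffMs.Count
            ? initialBackoffMs[attempt]
            : steadyStateMs;
    }
}
```

With the defaults above, attempt 0 would wait 1000 ms, attempt 3 would wait 15000 ms, and attempt 5 onward would wait 30000 ms.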
## Tests required
### Unit (`Category = Unit`, default)
1. `MbproxyOptionsBinding_BindsGlobalBcdTags_From_appsettings`
2. `MbproxyOptionsBinding_BindsPerPlcAddAndRemove`
3. `MbproxyOptionsBinding_DefaultsAreApplied_WhenSectionMissing` (AdminPort=8080, Resilience defaults)
4. `MbproxyOptionsBinding_RejectsInvalidWidth` — IValidateOptions returns Fail for `Width != 16 && Width != 32`. Schema-level only; address-overlap validation is phase 06.
5. `HostSmoke_StartsAndStops_Cleanly_AndLogs_StartupReady` — uses a Serilog sink that captures events to memory; asserts the `mbproxy.startup.ready` event fired at Information.
6. `HostSmoke_ShutdownIsOrdered` — host responds to `StopAsync` within 2 s.
### E2E (`Category = E2E`)
None in this phase. The simulator harness is phase 01.
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings.
- [ ] `dotnet test --filter Category!=E2E` — all green, ≥6 tests.
- [ ] `dotnet run --project src/Mbproxy` — service starts, logs `mbproxy.startup.ready` to console within 5 s, exits cleanly on Ctrl-C.
- [ ] `appsettings.json` is a valid JSON document and parses into a populated `MbproxyOptions` instance via the test harness.
- [ ] [`../design.md`](../design.md) is unchanged (this phase introduces no new design decisions).
- [ ] Resource index entry for `docs/plan/00-bootstrap.md` is not needed (the plan README routes there).
## Out of scope
- BCD encode/decode logic (phase 02).
- TcpListener / Modbus framing / byte forwarding (phase 03).
- Polly retry pipelines (referenced as a NuGet, used starting in phase 04/05).
- Address-overlap / duplicate-port validation (phase 06).
- AdminPort HTTP endpoint (phase 07).
- Service install / uninstall scripts (phase 08).
## Notes for the subagent
- Do not create `README.md` for the tool root yet — that's a phase 08 deliverable when there's something installable to document.
- If the `xunit` v3 vs v2 question is unclear at implementation time, prefer v3 if available on NuGet — record the choice in a single-line comment at the top of the test csproj. Future phases must not silently switch.
- Use `LoggerMessage`-source-generated logging (`[LoggerMessage]`) for the heartbeat event so phases that add more log events can follow the same pattern. Set `EventId.Name = "mbproxy.startup.ready"`.
# Phase 01 — Simulator harness
Wrap the existing pymodbus profile at [`../../DL260/dl205.json`](../../DL260/dl205.json) as a managed lifecycle for xUnit tests. After this phase, any test class that declares `[Collection(nameof(DL205SimulatorCollection))]` gets a running pymodbus server on a known port, with skip-safe behaviour when Python is unavailable.
**Depends on:** Phase 00 (test project exists).
**Parallel-safe with:** Phase 02, Phase 03. (Touches only `tests/sim/` and `tests/Mbproxy.Tests/Sim/`. Disjoint from codec and proxy work.)
## Goal
Eliminate "did the simulator start?" as a source of flaky tests. Encode the launch / readiness-probe / shutdown / cleanup contract once, in a fixture, so phases 03 / 04 / 05 / 06 / 07 don't each reinvent it. Tests must be able to declare a dependency on the simulator and get a hot port back, OR get a clean skip if the environment can't provide one.
## Outputs
```
tests/sim/run-dl205-sim.ps1 # idempotent launcher; venv-provisioning
tests/sim/README.md # how to run the simulator standalone
tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs
tests/Mbproxy.Tests/Sim/DL205SimulatorCollection.cs
tests/Mbproxy.Tests/Sim/SimulatorSmokeTests.cs # connects, sends FC03, verifies a seeded BCD register
```
Modifications:
- `.gitignore` already has `tests/sim/.venv/` from phase 00 — verify it's present.
- `tests/Mbproxy.Tests/Mbproxy.Tests.csproj` — add `NModbus` PackageReference (chosen for its small footprint and net10.0 compatibility; record the choice as a top-of-csproj comment). This is the Modbus TCP client used by tests against the simulator from this phase forward.
No other files.
## Tasks
1. **`tests/sim/run-dl205-sim.ps1`** — pure PowerShell. Parameters: `-Profile <path>` (default `../DL260/dl205.json` relative to script), `-Port <int>` (default 5020). Behaviour:
- If `tests/sim/.venv` doesn't exist: `python -m venv tests/sim/.venv`, then `tests/sim/.venv/Scripts/pip.exe install "pymodbus[server]"` pinned to a known version (record version in the script + README).
- Activate the venv (`& tests/sim/.venv/Scripts/activate.ps1`).
- Exec `pymodbus.server run --modbus-config-path <Profile> --modbus-server tcp --port <Port>`. Output streams to stdout/stderr; on script termination, the child server dies with it.
- Exit codes: 0 on clean exit, 1 on venv provisioning failure, 2 on pymodbus launch failure, 3 if the profile file is missing.
2. **`DL205SimulatorFixture : IAsyncLifetime`** —
- `InitializeAsync`: pick a free local port (bind/release a `TcpListener` on `IPAddress.Any`, port 0; capture the port; dispose). Spawn `pwsh -NoProfile -File <run-dl205-sim.ps1> -Port <picked>` via `System.Diagnostics.Process` with `RedirectStandardOutput/Error`. Poll `new TcpClient().ConnectAsync("127.0.0.1", port)` at 100 ms intervals for up to 10 s. If the simulator never accepts a connection, capture stderr tail, set `SkipReason`, and dispose the process.
- `DisposeAsync`: stop the simulator and its children with `Process.Kill(entireProcessTree: true)`. A graceful Ctrl-C would be preferable because pymodbus handles SIGTERM cleanly, but Windows lacks proper signals, so the hard kill is the pragmatic choice (document the tradeoff in a comment). Wait up to 5 s for exit.
- Public surface: `string Host { get; }` (always `127.0.0.1`), `int Port { get; }`, `string? SkipReason { get; }`, `string LogTail { get; }` (last ~50 lines of stderr, for diagnosis).
3. **`DL205SimulatorCollection`** —
```csharp
[CollectionDefinition(nameof(DL205SimulatorCollection))]
public sealed class DL205SimulatorCollection : ICollectionFixture<DL205SimulatorFixture> { }
```
Tests that need the fixture declare `[Collection(nameof(DL205SimulatorCollection))]`.
4. **`SimulatorSmokeTests`** — `[Collection(nameof(DL205SimulatorCollection))] [Trait("Category", "E2E")]`. Three tests:
- `Simulator_AcceptsTcpConnection`
- `Simulator_FC03_ReturnsSeededValue_AtHR0_0xCAFE` — reads register 0, expects `0xCAFE` (the seeded marker from `dl205.json`). Uses NModbus directly. This proves the dl205.json profile is in fact loaded.
- `Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234` — reads register 1072, expects raw `0x1234` (= 4660). This is the BCD register the proxy will rewrite later; phase 04's e2e test will read the SAME register through the proxy and assert 1234 instead.
5. **`tests/sim/README.md`** — a few lines: "Run `pwsh ./run-dl205-sim.ps1 -Port 5020` to launch the simulator standalone. Used by xUnit tests via `DL205SimulatorFixture`. Requires Python 3.10+; the script provisions a venv on first run."
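The readiness probe from task 2 can be sketched as a small polling loop. This is a minimal illustration, not the fixture itself; `ReadinessProbe.WaitForPortAsync` is a hypothetical name, and the fixture would call it with a 10 s timeout and a 100 ms interval.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class ReadinessProbe
{
    // Polls until a TCP connect to host:port succeeds or the timeout elapses.
    // Returns true when the port accepted a connection, false on timeout.
    public static async Task<bool> WaitForPortAsync(
        string host, int port, TimeSpan timeout, TimeSpan interval,
        CancellationToken ct = default)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                using var client = new TcpClient();
                await client.ConnectAsync(host, port, ct);
                return true;
            }
            catch (SocketException)
            {
                // Not listening yet: wait one interval and try again.
                await Task.Delay(interval, ct);
            }
        }
        return false;
    }
}
```

A `false` result is the point at which the fixture would capture the stderr tail and set `SkipReason`.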
## Public surface declared in this phase
```csharp
namespace Mbproxy.Tests.Sim;
public sealed class DL205SimulatorFixture : IAsyncLifetime {
public string Host { get; }
public int Port { get; }
public string? SkipReason { get; }
public string LogTail { get; }
public Task InitializeAsync();
public Task DisposeAsync();
}
[CollectionDefinition(nameof(DL205SimulatorCollection))]
public sealed class DL205SimulatorCollection : ICollectionFixture<DL205SimulatorFixture> { }
```
No production code is added in this phase.
## Tests required
### Unit (Category = Unit)
None in this phase. The fixture itself is a test-infrastructure component; its correctness is verified by the e2e smoke tests below.
### E2E (Category = E2E)
1. `Simulator_AcceptsTcpConnection` — open a TCP socket to `fixture.Host:fixture.Port` within the fixture lifetime.
2. `Simulator_FC03_ReturnsSeededValue_AtHR0_0xCAFE` — NModbus FC03, asserts `0xCAFE`.
3. `Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234` — NModbus FC03, asserts raw `0x1234` (4660).
When `SkipReason` is set, all three skip with `Assert.Skip(fixture.SkipReason)`. The phase gate explicitly verifies that on a machine WITH Python+pymodbus, none of them skip — skips are an environment failure, not a test pass.
## Phase gate
- [ ] `pwsh tests/sim/run-dl205-sim.ps1 -Port 5020` standalone — script provisions a venv on first run, server logs "Modbus TCP server listening" within 10 s, Ctrl-C exits cleanly.
- [ ] On second run: venv exists, script skips provisioning, server starts in < 2 s.
- [ ] On a machine WITHOUT Python: `SkipReason` is non-null and tests skip rather than fail.
- [ ] On a machine WITH Python: `SkipReason` is null, all three e2e smoke tests pass.
- [ ] `dotnet test --filter Category=E2E` is green on the dev machine.
- [ ] `dotnet test --filter Category!=E2E` still green (no regression to phase 00's tests).
- [ ] Build zero-warnings.
- [ ] `tests/sim/README.md` documents the manual launch path.
## Out of scope
- Multiple simultaneous simulators (one fixture instance is enough for all e2e tests via `ICollectionFixture`).
- Alternate profiles selected via `MODBUS_SIM_PROFILE` env var — defer until phase 04 actually needs a partial-overlap scenario; add the env-var support then.
- A C# pymodbus replacement / in-process Modbus mock. The pymodbus profile is the source of truth for DL-series quirks and we're not duplicating it.
- pip-mirror or offline-install support. CI is expected to have network or a pre-warmed venv; if a customer site needs offline install, that's a deployment concern (phase 08).
## Notes for the subagent
- Capture the chosen `pymodbus` version pin in both `run-dl205-sim.ps1` and `tests/sim/README.md` so the version isn't lost across re-provisioning.
- The free-port-picker pattern (bind on `:0`, capture port, dispose, then hand the port to the child process) has an inherent TOCTOU race — another process could grab the port between dispose and pymodbus binding. In practice this is rare; acceptable for tests. Note the trade-off in a comment.
- Pymodbus log output is verbose. Pipe it through a line buffer; only the last ~50 lines need to be available via `LogTail` for diagnosis.
- Do not commit the `.venv/` directory.
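
The free-port-picker pattern described in these notes can be sketched as below. `FreePort.Pick` is a hypothetical name, and binding to loopback rather than all interfaces is an assumption made for the sketch.

```csharp
using System.Net;
using System.Net.Sockets;

public static class FreePort
{
    // Asks the OS for an ephemeral port by binding to port 0, then releases it.
    // TOCTOU caveat: another process may claim the port before the child
    // process binds it. Rare in practice; acceptable for tests.
    public static int Pick()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        try
        {
            return ((IPEndPoint)listener.LocalEndpoint).Port;
        }
        finally
        {
            listener.Stop();
        }
    }
}
```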
# Phase 02 — BCD codec
Pure logic for encoding integers as DirectLOGIC BCD nibbles and decoding nibbles back. No I/O, no network, no Modbus framing. The codec exposed by this phase is what phase 04 plugs into the proxy.
**Depends on:** Phase 00 (csproj + options POCOs).
**Parallel-safe with:** Phase 01, Phase 03. (All work lives under `src/Mbproxy/Bcd/` and `tests/Mbproxy.Tests/Bcd/` — disjoint from sim harness and proxy plumbing.)
## Goal
A tiny, allocation-free codec library that:
- Encodes a non-negative `int` (capped at the width's range) to either one 16-bit raw register value or a `(low, high)` register pair for 32-bit BCD per the design's CDAB digit-layout rule.
- Decodes one or two raw register values back to an `int`.
- Resolves `Global + per-PLC Add - per-PLC Remove` into an **immutable per-PLC `BcdTagMap`** that the rewriter looks up by Modbus address in O(1).
The codec is the single source of BCD-encoding correctness in the system. Phase 04 must not reimplement any nibble math.
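The `Global + Add - Remove` resolution can be sketched as follows. It is illustrative only: tags are plain `(Address, Width)` tuples rather than `BcdTag`, `TagResolution.Resolve` is a hypothetical name, and the validation the real builder performs (duplicates, high-register overlap, unmatched `Remove` warnings) is omitted.

```csharp
using System.Collections.Generic;

public static class TagResolution
{
    // Order of operations assumed here: start from Global, drop Remove
    // entries, then apply Add so that an Add at a Global address wins.
    public static Dictionary<ushort, byte> Resolve(
        IEnumerable<(ushort Address, byte Width)> global,
        IEnumerable<(ushort Address, byte Width)> add,
        IEnumerable<ushort> remove)
    {
        var resolved = new Dictionary<ushort, byte>();
        foreach (var tag in global) resolved[tag.Address] = tag.Width;
        foreach (var address in remove) resolved.Remove(address);
        foreach (var tag in add) resolved[tag.Address] = tag.Width;
        return resolved;
    }
}
```

Unlike this sketch, the real builder must report duplicate Global addresses as errors rather than letting the last entry overwrite the first.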
## Outputs
```
src/Mbproxy/Bcd/BcdCodec.cs # static class: Encode16, Decode16, Encode32, Decode32
src/Mbproxy/Bcd/BcdTag.cs # the public record (mirrors design.md exactly)
src/Mbproxy/Bcd/BcdTagMap.cs # immutable, address-keyed lookup; describes per-PLC resolved tags
src/Mbproxy/Bcd/BcdTagMapBuilder.cs # resolves global + Add - Remove into a map; runs validation
src/Mbproxy/Bcd/BcdValidationError.cs # enum + ValidationResult record
tests/Mbproxy.Tests/Bcd/BcdCodecTests.cs
tests/Mbproxy.Tests/Bcd/BcdTagMapBuilderTests.cs
```
No other files. The proxy plumbing layer doesn't exist yet and isn't touched.
## Tasks
1. **`BcdTag.cs`** — `public sealed record BcdTag(ushort Address, byte Width)` with a static factory `Create(ushort, byte)` that throws on `Width != 16 && Width != 32`. This record is the type phases 04 / 06 / 07 will use.
2. **`BcdCodec.cs`** — `internal static class` with four pure methods. Internal because the proxy is the only consumer; nothing else in the assembly should call these.
- `static ushort Encode16(int value)` — value in `[0, 9999]`; produces the 16-bit BCD register, e.g. `1234 → 0x1234`. Throws `ArgumentOutOfRangeException` if value is out of range.
- `static int Decode16(ushort raw)` — inverse. If any nibble is `>= 0xA`, throw `FormatException` with the raw value in the message rather than returning a sentinel such as `int.MinValue`. The rewriter catches this and surfaces a `mbproxy.rewrite.invalid_bcd` event (event name added in phase 04).
- `static (ushort low, ushort high) Encode32(int value)` — value in `[0, 99_999_999]`; produces the CDAB pair, where `low` = low 4 BCD digits (least-significant) and `high` = high 4 BCD digits (most-significant). Decoded decimal = `Decode16(high) * 10000 + Decode16(low)`. Throws `ArgumentOutOfRangeException` if the value is out of range.
- `static int Decode32(ushort low, ushort high)` — inverse. Throws `FormatException` if either word has a bad nibble.
3. **`BcdTagMap.cs`** — `public sealed class BcdTagMap` wrapping a frozen address-keyed dictionary. Methods:
- `static BcdTagMap Empty { get; }`
- `bool TryGet(ushort address, out BcdTag tag)` — O(1) lookup.
- `bool TryGetForRange(ushort startAddress, ushort qty, out IReadOnlyList<RangeHit> hits)` — returns every BCD tag whose register footprint intersects `[startAddress, startAddress+qty)`. Each `RangeHit` offset is relative to `startAddress`. Used by the rewriter to know which slots in a multi-register PDU to touch.
- `int Count { get; }`, `IEnumerable<BcdTag> All { get; }` — for telemetry / status page.
4. **`BcdTagMapBuilder.cs`** — given `BcdTagListOptions Global` and `PlcBcdOverrides? perPlc`, produce a `ValidationResult` that carries the resolved `BcdTagMap` alongside any errors and warnings. Validation rules from design.md:
- Reject duplicate addresses within the resolved list (Add+Global after Remove).
- Reject 32-bit entries whose high register (`Address+1`) collides with any other entry's address (16-bit or 32-bit).
- Warn on `Remove` entries that don't match any address in Global (this is not a failure; the warning rides on `ValidationResult.Warnings`).
- Reject `Width` values other than 16/32 (defensive; phase 00's `IValidateOptions` should already have caught this, but the builder is the last line of defence).
5. **`BcdValidationError.cs`** — `public enum BcdValidationError { DuplicateAddress, OverlappingHighRegister, InvalidWidth }`. `public sealed record ValidationResult(BcdTagMap Map, IReadOnlyList<BcdError> Errors, IReadOnlyList<BcdWarning> Warnings)`. Errors fail the build; warnings ride along.
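The 16-bit half of task 2 can be sketched as below. This is one straightforward way to do the nibble math, not necessarily the final implementation; the class name `Bcd16Sketch` is a placeholder, and it is `public` here only so the snippet stands alone.

```csharp
using System;

public static class Bcd16Sketch
{
    // 1234 -> 0x1234: each decimal digit occupies one nibble.
    public static ushort Encode16(int value)
    {
        if (value < 0 || value > 9999)
            throw new ArgumentOutOfRangeException(nameof(value), value, "Expected 0..9999.");
        int raw = 0;
        for (int shift = 0; shift < 16; shift += 4)
        {
            raw |= (value % 10) << shift; // lowest digit goes in the lowest nibble
            value /= 10;
        }
        return (ushort)raw;
    }

    // 0x1234 -> 1234; rejects any nibble above 9.
    public static int Decode16(ushort raw)
    {
        int value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9)
                throw new FormatException($"Register 0x{raw:X4} contains a non-BCD nibble.");
            value = value * 10 + nibble;
        }
        return value;
    }
}
```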
## Public surface declared in this phase
```csharp
namespace Mbproxy.Bcd;
public sealed record BcdTag(ushort Address, byte Width) {
public static BcdTag Create(ushort address, byte width);
public bool IsThirtyTwoBit => Width == 32;
public ushort HighRegister => (ushort)(Address + 1); // throws if Width != 32
}
public sealed class BcdTagMap {
public static BcdTagMap Empty { get; }
public int Count { get; }
public IEnumerable<BcdTag> All { get; }
public bool TryGet(ushort address, out BcdTag tag);
public bool TryGetForRange(ushort startAddress, ushort qty, out IReadOnlyList<RangeHit> hits);
}
public readonly record struct RangeHit(int OffsetWords, BcdTag Tag);
public static class BcdTagMapBuilder {
public static ValidationResult Build(BcdTagListOptions global, PlcBcdOverrides? perPlc);
}
public sealed record ValidationResult(
BcdTagMap Map,
IReadOnlyList<BcdError> Errors,
IReadOnlyList<BcdWarning> Warnings);
public sealed record BcdError(BcdValidationError Kind, string Message, ushort? Address);
public sealed record BcdWarning(string Message, ushort? Address);
public enum BcdValidationError { DuplicateAddress, OverlappingHighRegister, InvalidWidth }
```
```csharp
namespace Mbproxy.Bcd;
internal static class BcdCodec {
public static ushort Encode16(int value);
public static int Decode16(ushort raw);
public static (ushort low, ushort high) Encode32(int value);
public static int Decode32(ushort low, ushort high);
}
```
## Tests required
### Unit (`Category = Unit`)
`BcdCodecTests` (≥ 16 tests):
1. `Encode16_1234_Returns_0x1234`
2. `Encode16_0_Returns_0x0000`
3. `Encode16_9999_Returns_0x9999`
4. `Encode16_10000_Throws_OutOfRange`
5. `Encode16_Negative_Throws_OutOfRange`
6. `Decode16_0x1234_Returns_1234`
7. `Decode16_0x0000_Returns_0`
8. `Decode16_0x9999_Returns_9999`
9. `Decode16_0x123A_Throws_Format` — bad nibble `A`.
10. `Encode32_12345678_Returns_LowHigh_5678_1234` — verify `low = 0x5678`, `high = 0x1234`.
11. `Encode32_0_Returns_LowHigh_0_0`
12. `Encode32_99999999_Returns_LowHigh_9999_9999`
13. `Encode32_100000000_Throws_OutOfRange`
14. `Decode32_LowHigh_5678_1234_Returns_12345678`
15. `Decode32_BadNibble_InLow_Throws`
16. `Decode32_BadNibble_InHigh_Throws`
17. `RoundTrip16_AllValuesUnder10000`: `[Theory]` with `[InlineData]` for boundary values; for the dense check use `[Theory] [MemberData]` enumerating every 100th value. The codec must satisfy `Decode16(Encode16(v)) == v`.
`BcdTagMapBuilderTests` (≥ 10 tests):
1. `Build_EmptyGlobal_EmptyOverride_ReturnsEmptyMap`
2. `Build_GlobalOnly_PopulatesMap`
3. `Build_PerPlcAdd_AppendsToGlobal`
4. `Build_PerPlcRemove_DropsFromGlobal`
5. `Build_AddOverrideSameAddressAsGlobal_AddWidthWins`
6. `Build_DuplicateAddressInGlobal_ReturnsDuplicateAddressError`
7. `Build_32BitHighRegOverlaps16BitGlobal_ReturnsOverlappingHighRegisterError`
8. `Build_Remove_OfNonExistentAddress_ReturnsWarning_NotError`
9. `Build_InvalidWidth_ReturnsInvalidWidthError`
10. `Map_TryGetForRange_ReturnsAllHits_InOrder` — covers full overlap, partial overlap (low only, high only), and no overlap.
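The overlap cases in test 10 come down to a half-open interval intersection, sketched below. `RangeSketch.Intersects` is a hypothetical helper, and reporting a negative offset when a 32-bit tag starts one register before the read is an assumption, not something the design specifies.

```csharp
public static class RangeSketch
{
    // A tag occupies 1 register (Width 16) or 2 registers (Width 32).
    // Returns true when that footprint intersects [start, start + qty) and
    // reports the tag's offset from start (negative if it begins before start).
    public static bool Intersects(
        ushort tagAddress, byte width, ushort start, ushort qty, out int offsetWords)
    {
        int tagEnd = tagAddress + (width == 32 ? 2 : 1); // exclusive
        int rangeEnd = start + qty;                      // exclusive
        offsetWords = tagAddress - start;
        return tagAddress < rangeEnd && tagEnd > start;
    }
}
```

For a 32-bit tag at 1080: a read of 1080 with quantity 2 is a full overlap, quantity 1 at 1080 touches the low word only, quantity 1 at 1081 touches the high word only, and a read starting at 1082 misses the tag.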
### E2E (Category = E2E)
None. The codec is pure logic.
## Phase gate
- [ ] Zero-warnings build.
- [ ] `dotnet test --filter Category=Unit` — all green, ≥ 26 new tests.
- [ ] `BcdCodec` is `internal`; nothing outside `Mbproxy.Bcd` calls it directly.
- [ ] `BcdTagMap` has zero allocations on `TryGet` and on the hot `TryGetForRange` path (verify via a microbench note in the test file's docstring; no benchmark project added).
- [ ] [`../design.md`](../design.md) → "BCD tag shape" matches the public record exactly; if the spec drifted during implementation, update design.md in this PR.
## Out of scope
- Signed BCD. Design explicitly excludes it.
- Half-byte / "BCD with sign nibble" variants used by some DL-family math instructions. Not in the design's tag shape.
- The actual PDU-byte-level rewriting (FC parsing, MBAP framing). That's phase 04.
- Telemetry counters. The codec exposes nothing to counters; phase 04 instruments the rewrite pipeline that USES the codec.
## Notes for the subagent
- The DirectLOGIC CDAB digit layout is the most-likely-to-confuse part of this phase. Re-read [`../design.md`](../design.md) → "BCD tag shape" and [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Word Order" before implementing `Encode32`/`Decode32`. The seeded marker in `dl205.json` for the float32 case (`HR[1056]=0x0000, HR[1057]=0x3FC0` for IEEE 1.5) confirms low-word-first; the BCD-32 case is the same word order with BCD nibble semantics inside each word.
- `BcdTagMapBuilder` is single-shot — given inputs, produce a map. There is NO `IObservable<BcdTagMap>` here. Phase 06 owns reload-driven rebuilds and just calls `Build` again.
- `TryGetForRange` is on the hot path for FC03/04 responses. Implementation should pre-bucket BCD tags by 256-register window if it makes the lookup faster, but only if a microbench shows a real win. Don't preoptimise.
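
The low-word-first layout can be made concrete with a sketch of the 32-bit pair. The class and helper names are placeholders, and the helpers are inlined only so the snippet stands alone.

```csharp
using System;

public static class Bcd32Sketch
{
    // 12345678 -> low = 0x5678 (least-significant digits), high = 0x1234.
    public static (ushort low, ushort high) Encode32(int value)
    {
        if (value < 0 || value > 99_999_999)
            throw new ArgumentOutOfRangeException(nameof(value), value, "Expected 0..99999999.");
        return (ToBcd(value % 10000), ToBcd(value / 10000));
    }

    public static int Decode32(ushort low, ushort high)
        => FromBcd(high) * 10000 + FromBcd(low);

    // Packs a value in 0..9999 into four BCD nibbles.
    private static ushort ToBcd(int fourDigits)
    {
        int raw = 0;
        for (int shift = 0; shift < 16; shift += 4)
        {
            raw |= (fourDigits % 10) << shift;
            fourDigits /= 10;
        }
        return (ushort)raw;
    }

    // Unpacks four BCD nibbles; rejects any nibble above 9.
    private static int FromBcd(ushort raw)
    {
        int value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9)
                throw new FormatException($"Word 0x{raw:X4} contains a non-BCD nibble.");
            value = value * 10 + nibble;
        }
        return value;
    }
}
```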
# Phase 03 — Proxy plumbing
The minimum-viable proxy: one `TcpListener` per configured PLC, 1:1 upstream-client ↔ backend-socket, byte-for-byte forwarding both directions, transparent MBAP TxId / unit ID. No BCD rewriting yet — that's phase 04. No supervisor / auto-recovery — that's phase 05.
**Depends on:** Phase 00 (host, options).
**Parallel-safe with:** Phase 02 (BCD codec lives under `src/Mbproxy/Bcd/`; this phase lives under `src/Mbproxy/Proxy/`).
## Goal
Stand up the listener-and-forwarder pair so an e2e test can:
1. Configure the proxy with `Plcs: [{ Host: "127.0.0.1", Port: <simPort>, ListenPort: <proxyPort> }]`.
2. Start the host.
3. Drive NModbus against `127.0.0.1:<proxyPort>` and see the SAME bytes the simulator would return on a direct connection.
The proxy is transparent in this phase. The BCD rewrite hook point is reserved but not wired.
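Transparent forwarding of a single frame can be sketched over plain streams. This is an illustration under stated assumptions: `FrameForwarder` is a hypothetical name, buffers are allocated per frame for clarity, and the real `PlcConnectionPair` would also call the pipeline hook between the read and the write.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class FrameForwarder
{
    // Reads one MBAP frame from source and writes it unchanged to destination.
    // Returns false if the source closed cleanly before a frame started.
    public static async Task<bool> ForwardOneFrameAsync(
        Stream source, Stream destination, CancellationToken ct = default)
    {
        var header = new byte[7];
        int read = await source.ReadAtLeastAsync(header, 7, throwOnEndOfStream: false, ct);
        if (read == 0) return false;
        if (read < 7) throw new EndOfStreamException("Truncated MBAP header.");

        // Length field (bytes 4-5, big-endian) counts the unit ID plus the PDU.
        int length = (header[4] << 8) | header[5];
        if (length < 1) throw new InvalidDataException("MBAP length field must be at least 1.");

        var pdu = new byte[length - 1]; // the unit ID was read with the header
        await source.ReadExactlyAsync(pdu, ct);

        await destination.WriteAsync(header, ct);
        await destination.WriteAsync(pdu, ct);
        return true;
    }
}
```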
## Outputs
```
src/Mbproxy/Proxy/PlcListener.cs # owns one TcpListener; accepts loop
src/Mbproxy/Proxy/PlcConnectionPair.cs # one upstream socket + one backend socket; forwarder
src/Mbproxy/Proxy/IPduPipeline.cs # the rewrite hook contract (no-op impl in this phase)
src/Mbproxy/Proxy/NoopPduPipeline.cs # the no-op impl
src/Mbproxy/Proxy/ProxyWorker.cs # BackgroundService that owns all PlcListeners
src/Mbproxy/Proxy/MbapFrame.cs # MBAP header parse helpers (length, txid, unit)
tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs # e2e against the simulator
tests/Mbproxy.Tests/Proxy/MbapFrameTests.cs # unit tests for the MBAP parser
```
Modifications:
- `src/Mbproxy/Program.cs` — register `ProxyWorker` as a hosted service. The `HeartbeatWorker` from phase 00 is DELETED in this phase (its job is replaced by ProxyWorker logging `mbproxy.startup.ready` after all listeners are bound).
- `src/Mbproxy/Workers/HeartbeatWorker.cs` — DELETED.
## Tasks
1. **`MbapFrame.cs`** — pure helpers, no allocations. Static methods:
- `static bool TryParseHeader(ReadOnlySpan<byte> buffer, out ushort txId, out ushort protocolId, out ushort length, out byte unitId)` — returns false if buffer.Length < 7.
- `static int TotalFrameLength(ushort lengthField)`: returns `lengthField + 6` (the 7 header bytes minus the 1-byte unit ID, which the length field already counts).
2. **`IPduPipeline.cs`** — the rewrite hook. Single method:
```csharp
void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context);
```
`MbapDirection` is `RequestToBackend` or `ResponseToClient`. `PduContext` carries the per-pair state (counters, PLC name, configured tag map). In phase 03, the only implementation is `NoopPduPipeline` which does nothing.
3. **`NoopPduPipeline.cs`** — empty `Process` method. Registered as the default `IPduPipeline` in DI for this phase. Phase 04 replaces it with the real rewriter.
4. **`PlcConnectionPair.cs`** — owns the upstream `Socket` (or `TcpClient`) handed to it by `PlcListener.Accept`, opens a fresh backend socket to the configured PLC, and runs two `Task`s:
- **Upstream → backend**: read one full MBAP frame at a time (header → length → rest), call `pipeline.Process(RequestToBackend, header, pdu, ctx)`, write the frame to the backend.
- **Backend → upstream**: same shape, with `ResponseToClient`.
Either task ending (socket closed, exception, cancellation) tears down both sides cleanly. No retry loop; that's phase 05.
Backend connect is wrapped in a `try`/`catch` with the configured `BackendConnectTimeoutMs`. Connect failures close the upstream socket immediately and log `mbproxy.backend.failed`. Polly bounded retries on backend connect are **deferred to phase 05** to keep this phase scope tight — note the deferral in code with `// Phase 05: wrap in Polly pipeline`.
5. **`PlcListener.cs`** — owns one `TcpListener` for one PLC. `StartAsync` binds; on bind failure, throws (caller logs `mbproxy.startup.bind.failed` and decides what to do — phase 05 will introduce the supervisor that turns this into a recoverable state). On each accept, hands the socket to a fresh `PlcConnectionPair` and runs it on the thread-pool.
6. **`ProxyWorker.cs`** — `BackgroundService`. On start: enumerates `MbproxyOptions.Plcs`, instantiates one `PlcListener` per entry, starts them all. Each bind that succeeds logs `mbproxy.startup.bind`; each that fails logs `mbproxy.startup.bind.failed` and continues to the next PLC (matching the design's "eager, continue on per-port failure" posture). After all bind attempts, logs `mbproxy.startup.ready` with `{ ListenersBound, PlcsConfigured }`. On stop: cancels and disposes all listeners and their open pairs.
7. **`Program.cs`** — remove the HeartbeatWorker registration; register `ProxyWorker`. Also register `IPduPipeline` as a singleton `NoopPduPipeline` in DI.
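As a rough illustration of Task 1, the header helpers could look like the sketch below. The signatures follow the task description; the class name `MbapFrameSketch` and the body are assumptions, not the committed implementation.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch of the Task 1 helpers; only the signatures come from the plan.
internal static class MbapFrameSketch
{
    // txId(2) + protocolId(2) + length(2) + unitId(1)
    public const int HeaderLength = 7;

    public static bool TryParseHeader(ReadOnlySpan<byte> buffer,
        out ushort txId, out ushort protocolId, out ushort length, out byte unitId)
    {
        txId = 0;
        protocolId = 0;
        length = 0;
        unitId = 0;
        if (buffer.Length < HeaderLength) return false;

        // All MBAP header fields are big-endian on the wire.
        txId = BinaryPrimitives.ReadUInt16BigEndian(buffer);
        protocolId = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(2));
        length = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(4));
        unitId = buffer[6];
        return true;
    }

    // The length field already counts the unit ID plus the PDU,
    // so only the six bytes in front of it are added.
    public static int TotalFrameLength(ushort lengthField) => lengthField + 6;
}
```

Nothing here allocates, which matches the "pure helpers, no allocations" requirement; range checks on `length` are deliberately left to the caller.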
## Public surface declared in this phase
The proxy types are `internal` because they are not consumed outside this assembly. The only public-shaped surfaces are `IPduPipeline`, `MbapDirection`, and `PduContext` (which the interface signature exposes), so phase 04 can implement its own pipeline cleanly.
```csharp
namespace Mbproxy.Proxy;
public interface IPduPipeline {
void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context);
}
public enum MbapDirection { RequestToBackend, ResponseToClient }
public sealed class PduContext {
public string PlcName { get; init; } = "";
// Phase 04 adds: BcdTagMap, counters, logger
}
internal sealed class NoopPduPipeline : IPduPipeline { /* no-op */ }
internal static class MbapFrame { /* static helpers */ }
internal sealed class PlcListener : IAsyncDisposable { /* ... */ }
internal sealed class PlcConnectionPair : IAsyncDisposable { /* ... */ }
internal sealed class ProxyWorker : BackgroundService { /* ... */ }
```
## Tests required
### Unit (`Category = Unit`)
`MbapFrameTests` (≥ 8 tests):
1. `TryParseHeader_TooShort_ReturnsFalse`
2. `TryParseHeader_ValidFrame_ParsesAllFields`
3. `TryParseHeader_ProtocolId_NotZero_StillParses` — we don't reject non-zero protocol IDs; that's the PLC's job.
4. `TotalFrameLength_LengthField7_Returns13`
5. `TotalFrameLength_LengthFieldMax_Returns_LengthFieldPlus6`
6. Round-trip: parse a known good FC03 frame and assert each field.
7. Round-trip: parse a known good FC16 write-multiple frame.
8. Negative: a frame with `length < 2` still parses and returns the value as-is; rejecting it is the caller's responsibility. Document this in a test.
### E2E (`Category = E2E`)
`ProxyForwardingTests` (≥ 5 tests, `[Collection(nameof(DL205SimulatorCollection))]`):
1. `Forward_FC03_HR0_Returns_SimulatorRawValue_0xCAFE` — proxy is transparent; client sees the raw simulator value.
2. `Forward_FC03_HR1072_Returns_RawBCD_0x1234` — the BCD register is NOT rewritten in phase 03 (NoopPduPipeline). This test will be REPLACED in phase 04 with one that asserts `1234` instead. Document the planned replacement in a comment so phase 04's agent knows what to update.
3. `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips` — proves the write path forwards correctly.
4. `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`.
5. `MbapTxId_IsPreservedEndToEnd` — issue 20 back-to-back FC03 reads with monotonically increasing TxIds; assert every response carries the matching TxId.
6. `BackendConnectFailure_ClosesUpstreamCleanly` — point the proxy at an unreachable backend (`127.0.0.1:1`), assert the client's socket is closed within `BackendConnectTimeoutMs + 200ms`.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00, 02 tests still green.
- [ ] All new unit tests green (≥ 8 in MbapFrameTests).
- [ ] All new e2e tests green when the simulator is available; skip cleanly when it isn't.
- [ ] `dotnet run --project src/Mbproxy` with an appsettings.json pointing at the simulator: NModbus can read/write through the proxy and gets the simulator's raw values.
- [ ] On startup with one bad and one good PLC config, the good one binds and the bad one logs `mbproxy.startup.bind.failed`, and the service does NOT abort. (Hand the supervisor work to phase 05; this phase only proves the "continue on per-port failure" posture.)
- [ ] `mbproxy.startup.ready` is now logged by `ProxyWorker`, not by a heartbeat worker. The heartbeat worker file is deleted.
## Out of scope
- BCD rewriting (phase 04 replaces `NoopPduPipeline`).
- Polly retries on backend connect (phase 05 supervisor wraps this).
- Auto-recovery for failed listener binds (phase 05).
- Counter tracking / per-PLC telemetry (phase 04 starts adding counters via `PduContext`).
- Half-MBAP-frame handling (split TCP packets): rely on `NetworkStream.ReadAsync` returning short reads; loop to fill the header (7 bytes) and then loop to fill the body (`length - 1` more bytes). Test 5 above verifies this stays correct over 20 back-to-back requests.
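The fill-the-header-then-fill-the-body loop described in the last bullet can be sketched as follows. This is a minimal version over a generic `Stream`, assuming a caller-owned 260-byte buffer; `FrameReaderSketch` and the choice to reject lengths outside `[2, 254]` are illustrative assumptions.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: read exactly one MBAP frame, tolerating short reads.
static class FrameReaderSketch
{
    // Loops until 'buffer' is full. Returns false if the peer closed first.
    static async ValueTask<bool> FillAsync(Stream stream, Memory<byte> buffer, CancellationToken ct)
    {
        int filled = 0;
        while (filled < buffer.Length)
        {
            int n = await stream.ReadAsync(buffer.Slice(filled), ct);
            if (n == 0) return false; // EOF, possibly mid-frame
            filled += n;
        }
        return true;
    }

    // Returns the total frame length, or 0 if the peer closed the connection.
    public static async ValueTask<int> ReadFrameAsync(Stream stream, Memory<byte> frame, CancellationToken ct)
    {
        if (frame.Length < 260) throw new ArgumentException("need a 260-byte buffer", nameof(frame));

        // Step 1: the 7-byte header (unit ID included).
        if (!await FillAsync(stream, frame.Slice(0, 7), ct)) return 0;

        // Step 2: the length field counts unit ID + PDU, so length - 1 bytes remain.
        int length = (frame.Span[4] << 8) | frame.Span[5];
        if (length < 2 || length > 254) throw new InvalidDataException($"bad MBAP length {length}");
        if (!await FillAsync(stream, frame.Slice(7, length - 1), ct)) return 0;

        return 6 + length;
    }
}
```

Back-to-back requests on a fast loopback rarely split frames, so a test stream that hands out one byte per read is a cheap way to exercise this path directly.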
## Notes for the subagent
- `Socket` vs `TcpClient`: prefer `Socket` directly so framing reads can receive into a `Memory<byte>` without `NetworkStream` allocation overhead. The performance difference is small but the byte-precise API matches what the rewriter in phase 04 will need.
- Frame reads use a per-pair pooled buffer of 260 bytes (MBAP header 7 + max PDU 253). Don't allocate per-frame.
- The "Phase 04 will replace test 2" pattern is intentional. Leave breadcrumbs so the next phase's agent knows exactly which test to update; do NOT silently make the test pass against a future rewriter.
- Both forwarder tasks run with the same `CancellationTokenSource`. Cancellation propagates from listener stop → pair stop → both task ends → socket dispose.
@@ -0,0 +1,146 @@
# Phase 04 — Rewriter integration
Replace `NoopPduPipeline` with the real BCD rewriter. After this phase, FC03/FC04 responses have their configured BCD slots decoded to binary integers on the way to the client, and FC06/FC16 requests have their configured BCD slots encoded to nibbles on the way to the PLC. Counters and warnings come online here.
**Depends on:** Phase 02 (codec + tag map), Phase 03 (plumbing + `IPduPipeline`).
**Parallel-safe with:** nothing (it integrates two prior phases' outputs).
## Goal
Wire `BcdTagMap` + `BcdCodec` into the proxy at the single hook point `IPduPipeline.Process(...)`. The rewriter is responsible for:
- FC03 / FC04 responses: re-encode every covered slot from raw nibbles into a binary integer.
- FC06 / FC16 requests: re-encode every covered slot from binary integer into raw BCD nibbles.
- Partial-overlap of 32-bit pairs: pass through raw, emit `mbproxy.rewrite.partial_bcd` warning, increment partial-overlap counter.
- Bad BCD nibbles in a PLC response: pass through raw, emit `mbproxy.rewrite.invalid_bcd` (new event in this phase) at Warning, increment invalid-bcd counter. NEVER throw out of the pipeline.
- Increment per-pair counters for `pdus.forwarded`, `pdus.byFc`, `pdus.rewrittenSlots`, `pdus.partialBcdWarnings`, `pdus.invalidBcdWarnings`.
The transparency contract holds: MBAP header bytes are untouched, length field is unchanged (re-encoded slots are the same byte width), TxId / unit ID flow through.
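To make the "same byte width" point concrete, here is a small sketch of the 16-bit conversion in both directions. The real `BcdCodec` comes from phase 02; the `Try*` method names and bool-returning shape below are assumptions chosen so the example never throws.

```csharp
using System;

// Hypothetical stand-in for the phase 02 codec, 16-bit case only.
static class BcdCodecSketch
{
    // Raw nibbles to binary: 0x1234 becomes 1234. Fails if any nibble is above 9.
    public static bool TryDecode16(ushort raw, out ushort value)
    {
        value = 0;
        int result = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9) return false;
            result = result * 10 + nibble;
        }
        value = (ushort)result;
        return true;
    }

    // Binary to raw nibbles: 9876 becomes 0x9876. Fails above 9999.
    public static bool TryEncode16(ushort value, out ushort raw)
    {
        raw = 0;
        if (value > 9999) return false;
        int remaining = value;
        int result = 0;
        for (int shift = 0; shift <= 12; shift += 4)
        {
            result |= (remaining % 10) << shift;
            remaining /= 10;
        }
        raw = (ushort)result;
        return true;
    }
}
```

Both directions map one 16-bit register to one 16-bit register, which is why the MBAP length field never needs to change.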
## Outputs
```
src/Mbproxy/Proxy/BcdPduPipeline.cs # replaces NoopPduPipeline
src/Mbproxy/Proxy/PerPlcContext.cs # the per-PLC context (BcdTagMap + counters + logger)
src/Mbproxy/Proxy/ProxyCounters.cs # System.Threading.Interlocked counters
src/Mbproxy/Proxy/RewriterLogEvents.cs # [LoggerMessage] static partial methods
tests/Mbproxy.Tests/Proxy/BcdPduPipelineTests.cs # unit tests against synthetic PDU bytes
tests/Mbproxy.Tests/Proxy/RewriterE2ETests.cs # e2e against the simulator
```
Modifications:
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — replace `PduContext` (placeholder from phase 03) with `PerPlcContext`. Counters increment inline. The pipeline call site is unchanged in shape; only the context type and pipeline registration differ.
- `src/Mbproxy/Proxy/ProxyWorker.cs` — build one `PerPlcContext` per configured PLC at startup (calls `BcdTagMapBuilder.Build` and wraps the resulting map + a fresh `ProxyCounters` + a per-PLC logger). Stash the contexts in a `Dictionary<string, PerPlcContext>` keyed by PLC name.
- `src/Mbproxy/Program.cs` — register `BcdPduPipeline` as the `IPduPipeline` singleton; remove the `NoopPduPipeline` registration. The phase 03 `NoopPduPipeline.cs` file stays (it's useful in tests as a baseline) but is no longer wired in production.
- `tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs` — update the test `Forward_FC03_HR1072_Returns_RawBCD_0x1234` (which was a phase-03 baseline) to a new test `Forward_FC03_HR1072_Returns_Decoded_1234` that asserts `1234`. The original raw-passthrough behaviour is preserved by configuring a PLC with NO BCD tags.
## Tasks
1. **`ProxyCounters.cs`** — `internal sealed class` holding `long` fields accessed via `Interlocked.Increment` / `Interlocked.Read`. Fields cover the per-PLC counter list from [`../design.md`](../design.md) → Status page → Per-PLC fields. Methods:
- `void IncrementPdusForwarded()`, `void IncrementFcCount(byte fc)`, `void AddRewrittenSlots(int n)`, `void IncrementPartialBcd()`, `void IncrementInvalidBcd()`, `void IncrementBackendException(byte code)`, `void AddBytes(long up, long down)`.
- `CounterSnapshot Snapshot()` — returns an immutable record with all the values; consumed by phase 07's status page.
2. **`PerPlcContext.cs`** — `internal sealed class` holding `string PlcName`, `BcdTagMap TagMap`, `ProxyCounters Counters`, `ILogger Logger`. Constructed once per PLC at startup; lifetime = lifetime of the listener.
3. **`BcdPduPipeline.cs`** — implements `IPduPipeline`. Behaviour per direction:
- **`RequestToBackend`**: inspect the PDU's function code byte (`pdu[0]`):
- FC06: read `(address, value)` from `pdu[1..]`. If `TagMap.TryGet(address)` and Width=16, replace value bytes with `BcdCodec.Encode16(value)`. If Width=32 and this is the LOW address, it's a single-register write to half a 32-bit tag — pass through raw + warn (the design's partial-overlap policy). If `address` is the HIGH register of a 32-bit pair, same partial-pass-through + warn. The PDU length is unchanged.
- FC16: `TryGetForRange(start, qty)`; for each hit, re-encode the relevant register-pair-or-singleton. Partial-overlap warnings emitted per offending slot.
- All other FCs: no-op.
- **`ResponseToClient`**: inspect `pdu[0]`:
- FC03 / FC04: `TryGetForRange(echoedStart, byteCount/2)`. The start address isn't in the response (Modbus FC03 response = `[fc, byteCount, ...data]`), so the rewriter needs the matching request — see Task 4.
- All other FCs: no-op.
- Exceptions from `BcdCodec.Decode*` are caught and turned into `mbproxy.rewrite.invalid_bcd` warnings; the affected slot's bytes are passed through unchanged.
4. **Request → response correlation.** The rewriter on a response needs the original request's start-address and quantity. Since the proxy is 1:1 per-client (no multiplexing), `PlcConnectionPair` keeps the last-issued request's `(fc, address, quantity)` in a per-pair slot. When the response arrives, the rewriter is invoked with that slot's contents as part of `PerPlcContext`. (We do NOT support pipelined multi-PDU requests on one socket in this phase; if a client tries, the slot is overwritten and the second response could mis-decode. Document the limitation; phase 08 may revisit if real clients pipeline.)
5. **`RewriterLogEvents.cs`** — `[LoggerMessage]` source-generated definitions:
- `mbproxy.rewrite.partial_bcd` — Warning, params: PlcName, Address, ClientStart, ClientQty.
- `mbproxy.rewrite.invalid_bcd` — Warning, params: PlcName, Address, RawValue, Direction.
- `mbproxy.exception.passthrough` — Information, params: PlcName, Fc, ExceptionCode. (Moved here from a phase-03 TODO.)
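Tasks 3 and 4 together can be sketched as below for the FC03/FC04 response path with 16-bit tags only. The `LastRequest` record, the plain address set standing in for `BcdTagMap`, and the inlined decoder are all assumptions made to keep the example self-contained.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

static class Fc03RewriteSketch
{
    // Hypothetical per-pair slot: the last read request forwarded to the backend.
    public readonly record struct LastRequest(byte Fc, ushort Start, ushort Quantity);

    // Decodes configured 16-bit BCD registers in place; returns the number of slots rewritten.
    public static int RewriteResponse(Span<byte> pdu, LastRequest request, IReadOnlySet<ushort> bcd16Addresses)
    {
        if (!(request.Fc is 0x03 or 0x04)) return 0;
        // An exception response carries fc | 0x80, so it falls out here untouched.
        if (pdu.Length < 2 || pdu[0] != request.Fc) return 0;
        int byteCount = pdu[1];
        if (byteCount != request.Quantity * 2 || pdu.Length < 2 + byteCount) return 0;

        int rewritten = 0;
        for (int i = 0; i < request.Quantity; i++)
        {
            // The response has no address; it is recovered from the remembered request.
            ushort address = (ushort)(request.Start + i);
            if (!bcd16Addresses.Contains(address)) continue;

            Span<byte> slot = pdu.Slice(2 + i * 2, 2);
            ushort raw = BinaryPrimitives.ReadUInt16BigEndian(slot);
            if (!TryDecodeBcd16(raw, out ushort value)) continue; // invalid nibble: leave raw
            BinaryPrimitives.WriteUInt16BigEndian(slot, value);
            rewritten++;
        }
        return rewritten;
    }

    static bool TryDecodeBcd16(ushort raw, out ushort value)
    {
        value = 0;
        int result = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9) return false;
            result = result * 10 + nibble;
        }
        value = (ushort)result;
        return true;
    }
}
```

A real pipeline would also emit the warnings and bump the counters at the two `continue`/`return` points; those are omitted here.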
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy;
internal sealed class BcdPduPipeline : IPduPipeline { /* full impl */ }
internal sealed class PerPlcContext { public string PlcName; public BcdTagMap TagMap; public ProxyCounters Counters; public ILogger Logger; }
internal sealed class ProxyCounters {
public void IncrementPdusForwarded();
public void IncrementFcCount(byte fc);
public void AddRewrittenSlots(int n);
public void IncrementPartialBcd();
public void IncrementInvalidBcd();
public void IncrementBackendException(byte code);
public void AddBytes(long up, long down);
public CounterSnapshot Snapshot();
}
public sealed record CounterSnapshot(/* mirrors design.md per-PLC status fields */);
```
Nothing else becomes public.
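A trimmed-down illustration of the counter pattern follows. It covers only a few of the listed methods, and the snapshot record's fields are placeholders, since the real field list mirrors design.md.

```csharp
using System.Threading;

// Hypothetical subset of ProxyCounters; field selection is illustrative only.
sealed class ProxyCountersSketch
{
    private long _pdusForwarded;
    private long _rewrittenSlots;
    private long _invalidBcd;
    private readonly long[] _byFc = new long[256]; // one slot per function code byte

    public void IncrementPdusForwarded() => Interlocked.Increment(ref _pdusForwarded);
    public void IncrementFcCount(byte fc) => Interlocked.Increment(ref _byFc[fc]);
    public void AddRewrittenSlots(int n) => Interlocked.Add(ref _rewrittenSlots, n);
    public void IncrementInvalidBcd() => Interlocked.Increment(ref _invalidBcd);

    // Allocates one record; intended for the status page, not the hot path.
    public CounterSnapshotSketch Snapshot() => new(
        Interlocked.Read(ref _pdusForwarded),
        Interlocked.Read(ref _rewrittenSlots),
        Interlocked.Read(ref _invalidBcd),
        Interlocked.Read(ref _byFc[0x03]));
}

sealed record CounterSnapshotSketch(long PdusForwarded, long RewrittenSlots, long InvalidBcd, long Fc03Count);
```

The snapshot is not a single atomic cut across all fields; each value is read atomically on its own, which should be acceptable for status reporting.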
## Tests required
### Unit (`Category = Unit`)
`BcdPduPipelineTests` (≥ 20 tests). Each test builds a synthetic PDU byte array + a `PerPlcContext` with a hand-rolled `BcdTagMap`, calls `pipeline.Process`, and asserts the resulting bytes.
Coverage matrix:
| FC | Tag scenario | Expected | Counter delta |
|----|--------------|----------|---------------|
| 03 response | single 16-bit BCD at the read address | bytes replaced with binary-encoded value | `RewrittenSlots += 1` |
| 03 response | full 32-bit BCD pair within read range | both register-bytes replaced with binary-encoded 32-bit value | `RewrittenSlots += 2` |
| 03 response | partial 32-bit (low only, qty=1 at low addr) | bytes unchanged | `PartialBcd += 1` |
| 03 response | partial 32-bit (high only, qty=1 at high addr) | bytes unchanged | `PartialBcd += 1` |
| 03 response | mixed: 16-bit + non-BCD in same read | only the 16-bit slot rewritten | `RewrittenSlots += 1` |
| 03 response | bad nibble (0x12A4) at a 16-bit BCD slot | bytes unchanged | `InvalidBcd += 1` |
| 04 response | 16-bit BCD at the read address | same as FC03 | `RewrittenSlots += 1` |
| 06 request | write to 16-bit BCD address | binary integer in payload → BCD nibbles | `RewrittenSlots += 1` |
| 06 request | write to the LOW addr of a 32-bit pair (qty=1) | bytes unchanged (partial) | `PartialBcd += 1` |
| 06 request | write to the HIGH addr of a 32-bit pair | bytes unchanged (partial) | `PartialBcd += 1` |
| 06 request | write value outside `[0,9999]` for 16-bit | `mbproxy.rewrite.invalid_bcd` Warning; bytes unchanged | `InvalidBcd += 1` |
| 16 request | write multi covering one 16-bit BCD + 3 non-BCD | only the 16-bit slot re-encoded | `RewrittenSlots += 1` |
| 16 request | write multi covering one full 32-bit pair | both registers re-encoded as the CDAB pair | `RewrittenSlots += 2` |
| 16 request | write multi crossing into one half of a 32-bit pair | partial slot passed through; warn | `PartialBcd += 1` |
| 01 / 02 / 05 / 15 | any | no-op | none |
| 03 exception response | exception 02 returned by PLC | bytes unchanged, no rewriting attempted | `BackendExceptions[2] += 1`, `mbproxy.exception.passthrough` logged |
Additional:
- Counter snapshot reflects increments exactly (no off-by-one).
- Empty `BcdTagMap` produces zero rewrites for any FC.
### E2E (`Category = E2E`, `[Collection(nameof(DL205SimulatorCollection))]`)
`RewriterE2ETests` (≥ 6 tests, all against the dl205.json simulator profile):
1. `Read_HR1072_AsBcd_ReturnsDecoded_1234` — configure the BCD tag at addr 1072 width 16; assert `1234`.
2. `Read_HR1072_AsRaw_WhenNotConfigured_Returns_0x1234` — no BCD tags configured; assert raw `4660`. (Verifies the pipeline is opt-in per tag.)
3. `Write_HR200_AsBcd_StoresEncoded_0x9876` — configure addr 200 width 16. Write decimal 9876 through proxy; read raw from sim, expect `0x9876` (39030).
4. `Read_HR1056_HR1057_AsBcd32_ReturnsDecoded_From_CDAB` — seed an alternate profile (or write via proxy first if the default profile's float32 markers aren't suitable BCD32 fixtures). Verify the CDAB layout end-to-end.
5. `Partial_FC03_OnHighRegisterOf_32BitPair_PassesThroughRaw_AndLogsWarning` — use the in-memory Serilog sink to verify `mbproxy.rewrite.partial_bcd` was logged.
6. `MbapTxId_StillPreserved_AfterRewriting_20Consecutive` — same as phase 03's test 5, but with BCD rewrite in the path. Proves rewriting doesn't tamper with the MBAP header.
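For test 4, the CDAB layout can be pictured with the sketch below. It assumes CDAB means the register at the lower address holds the low four digits and the next register holds the high four digits; confirm against `dl205.md` before relying on it. `Bcd32Sketch` is a hypothetical name.

```csharp
using System;

static class Bcd32Sketch
{
    static bool TryDecode16(ushort raw, out int value)
    {
        value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9) { value = 0; return false; }
            value = value * 10 + nibble;
        }
        return true;
    }

    // Assumed CDAB order: low word first on the wire, high word second.
    public static bool TryDecode32(ushort lowAddressRegister, ushort highAddressRegister, out uint value)
    {
        value = 0;
        if (!TryDecode16(lowAddressRegister, out int lowDigits) ||
            !TryDecode16(highAddressRegister, out int highDigits))
        {
            return false;
        }
        value = (uint)(highDigits * 10000 + lowDigits);
        return true;
    }
}
```

Under that assumption, registers `0x5678` then `0x1234` at consecutive addresses decode to 12345678.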
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-03 tests still green (with the phase-03 placeholder test renamed/repurposed as described).
- [ ] All new unit tests green (≥ 20 in BcdPduPipelineTests, as required above, plus counter snapshot tests).
- [ ] All new e2e tests green when simulator is available.
- [ ] PDU rewriting NEVER changes the MBAP `length` field; verify in a unit test that re-encoded PDUs are exactly the same byte length as the originals.
- [ ] `ProxyCounters` is allocation-free per increment on the hot path. The `Snapshot()` call may allocate (it's used only by the status page, off the hot path).
- [ ] Log event names match [`../design.md`](../design.md) → Logging table exactly (including the new `mbproxy.rewrite.invalid_bcd` event added here — update design.md in this PR to add the row).
## Out of scope
- Auto-recovery of failed listener binds (phase 05).
- Backend-connect retry pipeline (phase 05).
- Counter exposure via HTTP (phase 07).
- Hot-reload of the per-PLC `BcdTagMap` (phase 06).
- Pipelined / multi-PDU-in-flight on a single client socket. The proxy serialises by the design's 1:1 model; if a real client pipelines, document as a known limitation.
## Notes for the subagent
- The Modbus FC03/04 response does NOT carry the start address — only the byte count and the register data. You must remember the last request's `(startAddress, quantity)` per `PlcConnectionPair`. This is fine because the proxy is 1:1 and one client = one in-flight request at a time.
- For FC16 requests, the wire format is `[fc, startHi, startLo, qtyHi, qtyLo, byteCount, ...data]`. The PDU passed to the pipeline starts at `fc`. Compute slot offsets from `startAddress + (offsetInData / 2)`.
- Update [`../design.md`](../design.md) → Logging events table to add the new `mbproxy.rewrite.invalid_bcd` event. Do this in the same PR; the doc and the code stay in sync.
- The `mbproxy.exception.passthrough` event was specified in design.md but not wired in phase 03. This phase wires it. If during phase 03 it was already wired by mistake, leave it and remove the TODO comment.
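The FC16 offset arithmetic from the notes above can be sketched as a small walker that yields each register's address and its byte offset within the PDU. `Fc16LayoutSketch` is a hypothetical helper, not part of the planned surface.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

static class Fc16LayoutSketch
{
    // PDU layout: [fc, startHi, startLo, qtyHi, qtyLo, byteCount, ...data]
    const int DataOffset = 6;

    public static List<(ushort Address, int PduOffset)> EnumerateSlots(ReadOnlySpan<byte> pdu)
    {
        var slots = new List<(ushort Address, int PduOffset)>();
        if (pdu.Length < DataOffset || pdu[0] != 0x10) return slots;

        ushort start = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(1));
        ushort quantity = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(3));
        int byteCount = pdu[5];
        if (byteCount != quantity * 2 || pdu.Length < DataOffset + byteCount) return slots;

        for (int offsetInData = 0; offsetInData < byteCount; offsetInData += 2)
        {
            // Register address = startAddress + (offsetInData / 2).
            slots.Add(((ushort)(start + offsetInData / 2), DataOffset + offsetInData));
        }
        return slots;
    }
}
```

A production rewriter would likely iterate in place rather than build a list; the list is used here only to make the mapping easy to inspect.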
@@ -0,0 +1,125 @@
# Phase 05 — Listener supervisor + auto-recovery
Wrap each `PlcListener` in a Polly-backed supervisor task. Failed binds (at startup or runtime) are retried per the design's recovery profile. Backend-connect Polly retries that were deferred from phase 03 land here too.
**Depends on:** Phase 03 (PlcListener, PlcConnectionPair).
**Parallel-safe with:** nothing (changes ProxyWorker, listener lifecycle, and connection-pair connect path simultaneously).
## Goal
Eliminate "startup race lost a port, service degraded for hours" as a real failure mode. After this phase, a port temporarily in use at boot will bind once it frees; a backend connect transient failure retries within a tight budget instead of immediately dropping the upstream client.
State per listener: `bound` / `recovering` / `stopped`. Reported on the status page (phase 07) via counters and a state field.
## Outputs
```
src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs # owns one PlcListener; retry pipeline
src/Mbproxy/Proxy/Supervision/SupervisorState.cs # enum + state-snapshot record
src/Mbproxy/Proxy/Supervision/PolicyFactory.cs # builds Polly ResiliencePipelines from ResilienceOptions
tests/Mbproxy.Tests/Proxy/Supervision/SupervisorTests.cs # port-conflict recovery, runtime-fault recovery
tests/Mbproxy.Tests/Proxy/Supervision/BackendConnectRetryTests.cs # Polly retry on backend connect
tests/Mbproxy.Tests/Proxy/Supervision/PolicyFactoryTests.cs # unit
```
Modifications:
- `src/Mbproxy/Proxy/ProxyWorker.cs` — owns a `Dictionary<string, PlcListenerSupervisor>` instead of raw `PlcListener` instances. Stop/start of an individual listener now flows through the supervisor.
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — backend connect now goes through a Polly pipeline built from `ResilienceOptions.BackendConnect`. Remove the `// Phase 05: wrap in Polly` TODO from phase 03.
- `src/Mbproxy/Proxy/ProxyCounters.cs` — add `RecoveryAttempts` counter and `LastBindError` (last failure message, up to 256 chars). Update `CounterSnapshot` to include them.
- `src/Mbproxy/Proxy/RewriterLogEvents.cs` (or a sibling `SupervisorLogEvents.cs`) — add `[LoggerMessage]` definitions for `mbproxy.listener.recovered` (Info, `Plc`, `Port`, `AttemptCount`) and `mbproxy.backend.failed` (Warning, `Plc`, `Reason`). The latter event name already exists in design.md.
## Tasks
1. **`PolicyFactory.cs`** — converts `ResilienceOptions.BackendConnect` and `ResilienceOptions.ListenerRecovery` into `Polly.ResiliencePipeline` instances. Pipelines use `RetryStrategyOptions<T>` with `DelayGenerator` reading from the configured `BackoffMs` arrays. Listener recovery uses a 5-step initial backoff then steady-state at `SteadyStateMs` indefinitely (model as a custom delay generator that returns the steady-state value once the attempt index exceeds the initial array length).
2. **`SupervisorState.cs`** — `enum SupervisorState { Bound, Recovering, Stopped }` and a `record SupervisorSnapshot(SupervisorState State, string? LastBindError, int RecoveryAttempts)`.
3. **`PlcListenerSupervisor.cs`** —
- Constructor: takes a `PlcOptions`, a `PerPlcContext`, the recovery `ResiliencePipeline`, and an `IPduPipeline`. Internally instantiates `PlcListener` lazily inside the retry loop.
- `StartAsync(CancellationToken)`: launches a supervisor task. Inside the task: call `_listener.StartAsync()`. On success, transition to `Bound`, log `mbproxy.startup.bind` (first attempt) or `mbproxy.listener.recovered` (subsequent), and `await _listener.RunAsync(ct)`, which returns when the listener's accept loop ends.
- On exception or normal-but-faulted return from the listener: transition to `Recovering`, log `mbproxy.startup.bind.failed`, increment `RecoveryAttempts`, dispose the failed listener, await Polly's next delay, retry.
- `StopAsync`: transition to `Stopped`, cancel the supervisor token, await the supervisor task.
- `Snapshot()`: returns `SupervisorSnapshot` for the status page.
4. **`PlcConnectionPair.cs` backend-connect retry** — wrap `Socket.ConnectAsync(host, port, ct)` in a `ResiliencePipeline.ExecuteAsync` built from `ResilienceOptions.BackendConnect`. After all attempts exhausted, close the upstream socket (as before) and log `mbproxy.backend.failed`. Crucial: backend-connect retries happen ONCE per upstream client connection (not per request); a connect failure terminates the pair.
5. **`ProxyWorker.cs`** — change to owning supervisors instead of raw listeners. Startup creates one supervisor per `PlcOptions`, starts them all in parallel (`await Task.WhenAll(...)` of their start tasks). The "ready" log event now fires after every supervisor has either reached `Bound` or entered `Recovering`. Shutdown stops all supervisors in parallel; clamp the total shutdown time at 5 s.
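The listener-recovery schedule in Task 1 reduces to a small pure function: walk the initial array, then hold at the steady-state value. The sketch below shows only that mapping; the class name and the sample numbers in the usage note are made up, and wiring it into Polly's `DelayGenerator` is left out.

```csharp
using System;

static class RecoveryDelaySketch
{
    // 'attempt' is zero-based: 0 is the delay before the first retry.
    public static TimeSpan DelayFor(int attempt, int[] initialBackoffMs, int steadyStateMs)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));
        int ms = attempt < initialBackoffMs.Length
            ? initialBackoffMs[attempt]
            : steadyStateMs; // past the initial steps: retry forever at the steady rate
        return TimeSpan.FromMilliseconds(ms);
    }
}
```

With an illustrative initial array of `{ 100, 250, 500, 1000, 2000 }` and a steady state of 5000 ms, attempts 0 through 4 use the array and every later attempt waits 5000 ms. Keeping this logic free of Polly types makes it easy to assert the delay sequence directly in `PolicyFactoryTests`.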
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Supervision;
internal sealed class PlcListenerSupervisor : IAsyncDisposable {
public string PlcName { get; }
public Task StartAsync(CancellationToken ct);
public Task StopAsync(CancellationToken ct);
public SupervisorSnapshot Snapshot();
}
public sealed record SupervisorSnapshot(SupervisorState State, string? LastBindError, int RecoveryAttempts);
public enum SupervisorState { Bound, Recovering, Stopped }
internal static class PolicyFactory {
public static ResiliencePipeline BuildBackendConnect(RetryProfile profile, ILogger logger);
public static ResiliencePipeline BuildListenerRecovery(RecoveryProfile profile, ILogger logger);
}
```
`SupervisorSnapshot` and the `SupervisorState` enum it carries are `public` because phase 07 (status page) consumes them. Everything else stays `internal`.
## Tests required
### Unit (`Category = Unit`)
`PolicyFactoryTests` (≥ 4 tests):
1. `BuildBackendConnect_ProducesPipeline_With3Attempts_Default`
2. `BuildBackendConnect_Backoff_MatchesConfig` — fake `TimeProvider`, assert delay sequence.
3. `BuildListenerRecovery_InitialBackoffFollowedBySteadyState` — drive 10 attempts, assert delays match.
4. `BuildBackendConnect_NoRetry_OnNonTransientException`: `SocketException` with WSAECONNREFUSED is retried; `ArgumentException` is not.
### Integration (`Category = Unit`; uses real sockets but no simulator)
`SupervisorTests` (≥ 5 tests):
1. `Supervisor_StartsListener_AndTransitionsToBound`
2. `Supervisor_StartFails_WhenPortInUse_TransitionsToRecovering` — bind a `TcpListener` on a free port first, then start the supervisor on the same port; assert `State == Recovering` and `LastBindError` is populated within 100 ms.
3. `Supervisor_Recovers_WhenPortFrees` — same setup as test 2, then dispose the blocking listener; assert the supervisor transitions to `Bound` and emits `mbproxy.listener.recovered` within `InitialBackoffMs[0] + 500ms`. Use an in-memory Serilog sink to verify the log event.
4. `Supervisor_RuntimeFault_TriggersRecovery` — replace the listener implementation with a faulting fake (or use reflection to force `_listener` to be one) and assert recovery kicks in.
5. `Supervisor_Stop_CleanlyTransitionsTo_Stopped_AndCancelsRetry` — supervisor in `Recovering` state, call `StopAsync`, assert it returns within 1 s without waiting out the next backoff window.
`BackendConnectRetryTests` (≥ 3 tests):
1. `BackendConnect_RetriesPerPipeline_OnConnectionRefused` — point a `PlcConnectionPair` at `127.0.0.1:1`, assert it sees exactly 3 connect attempts with the configured delays.
2. `BackendConnect_Succeeds_OnSecondAttempt_WhenBackendBecomesReachable` — start the pair against a closed port, open a listener on that port mid-backoff, assert connect succeeds and the pair runs.
3. `BackendConnect_AllAttemptsFail_ClosesUpstream` — pair gets a fresh upstream socket, never reaches a backend, the upstream socket is closed within `BackoffMs.Sum() + tolerance`.
### E2E (`Category = E2E`)
`SupervisorE2ETests` (≥ 2 tests, against the simulator):
1. `E2E_Recovery_When_BlockingListenerReleasesPort` — same shape as the unit recovery test, but with the simulator on the backend; confirms the supervisor doesn't disrupt the simulator-facing path during recovery.
2. `E2E_RecoveryAttempts_CounterIncrements_Visible_OnSnapshot` — drives the supervisor into recovery and back, then asserts `counters.RecoveryAttempts > 0`. Phase 07 will surface this on the HTTP endpoint; here we just verify the counter snapshot.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-04 tests still green.
- [ ] All new unit + integration tests green.
- [ ] E2E recovery test green when simulator is available.
- [ ] `mbproxy.listener.recovered` event log includes `AttemptCount` field.
- [ ] No deadlocks under StopAsync while supervisor is mid-backoff (verify by the test above).
- [ ] Backend-connect failures from phase 03 are now wrapped in Polly; the TODO comment from phase 03 is gone.
- [ ] [`../design.md`](../design.md) → "Listener auto-recovery" matches implementation. If during implementation the backoff arrays needed tweaking, update design.md in this PR.
## Out of scope
- Hot-reload-driven add/remove of supervisors (phase 06 owns reconcile).
- HTTP exposure of supervisor state (phase 07).
- Restart-from-crash diagnostics, Windows EventLog integration (phase 08).
- Adaptive backoff (e.g., jitter, exponential beyond the configured array). Stick to the configured schedule.
## Notes for the subagent
- Polly v8 (`Polly.Core`) is the target — `ResiliencePipeline` and `RetryStrategyOptions<T>`, not the v7 `Policy.Handle<>()` fluent API. If the package version pinned in phase 00 turns out to be v7, bump it in this phase and note the bump in the csproj comment.
- The supervisor task uses one `CancellationTokenSource` per supervisor instance. Cancelling it must cancel both the Polly delay AND the inner `_listener.RunAsync` cleanly. Polly's `ResiliencePipeline.ExecuteAsync(ct)` honours the token; double-check the listener does too.
- Do not introduce a generic "task supervisor" abstraction. `PlcListenerSupervisor` is the only thing supervising in this codebase; YAGNI on the framework.
- The supervisor must NOT swallow exceptions from `_listener.RunAsync` other than `OperationCanceledException`. Log them at Warning with the exception, then enter the recovery loop. Operators reading logs need to see WHY a listener died, not just that it was restarted.
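The cancellation requirement in these notes is easiest to see in a stripped-down retry loop. The sketch below uses a bare `TcpListener` and a fixed delay instead of Polly, purely to show that a token-aware delay lets a stop request interrupt the backoff; `BindRetrySketch` is hypothetical.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

static class BindRetrySketch
{
    // Retries the bind until it succeeds or 'ct' is cancelled.
    public static async Task<(TcpListener Listener, int FailedAttempts)> BindWithRetryAsync(
        IPEndPoint endpoint, TimeSpan delay, CancellationToken ct)
    {
        int failed = 0;
        while (true)
        {
            ct.ThrowIfCancellationRequested();
            var listener = new TcpListener(endpoint);
            try
            {
                listener.Start();
                return (listener, failed);
            }
            catch (SocketException)
            {
                // A real supervisor would log the exception here so operators see WHY.
                listener.Stop();
                failed++;
                // Token-aware delay: cancellation ends the wait immediately
                // with an OperationCanceledException instead of sitting out the backoff.
                await Task.Delay(delay, ct);
            }
        }
    }
}
```

The same property is what `Supervisor_Stop_CleanlyTransitionsTo_Stopped_AndCancelsRetry` is meant to verify against the real supervisor.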
@@ -0,0 +1,158 @@
# Phase 06 — Configuration hot-reload
Subscribe to `IOptionsMonitor<MbproxyOptions>.OnChange` and reconcile the running supervisors + per-PLC tag maps + connection settings against the new config — without restarting the host.
**Depends on:** Phase 05 (supervisor lifecycle).
**Parallel-safe with:** nothing (touches the widest cross-cut: supervisors + tag maps + counters + DI options).
## Goal
An `appsettings.json` save propagates per the design's reconcile table:
| Change | Action |
|--------|--------|
| `BcdTags.Global` add/remove/width | Rebuild every PLC's `BcdTagMap`, swap atomically. Next PDU sees it. |
| `Plcs[i].BcdTags.{Add,Remove}` | Rebuild that PLC's `BcdTagMap` only. |
| New `Plcs[i]` | Create supervisor + context, start it. |
| Removed `Plcs[i]` | Stop supervisor, close all client connections to it. |
| Changed `ListenPort` / `Host` | Stop + start the supervisor (remove + add semantics). |
| `Connection.Backend*TimeoutMs` | Take effect on the next backend connect / request. |
| Invalid reload | Reject as a whole; keep current state; log `mbproxy.config.reload.rejected`. |
Validation runs FIRST. A reload that would produce duplicate `ListenPort` values, or a `BcdTagMapBuilder.Build` error for any PLC, is rejected atomically before any state mutates.
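The add / remove / restart rows of the table amount to a diff keyed on PLC name. A minimal sketch, using a cut-down `PlcEntry` record in place of `PlcOptions` and ignoring tag-map changes entirely, might look like this (all type names here are assumptions):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, reduced stand-ins for PlcOptions and ReloadPlan.
sealed record PlcEntry(string Name, string Host, int ListenPort);

sealed record PlanSketch(
    IReadOnlyList<string> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<string> ToRestart);

static class ReloadPlanSketch
{
    public static PlanSketch Compute(IReadOnlyList<PlcEntry> current, IReadOnlyList<PlcEntry> next)
    {
        // Identity is the PLC name; ports and hosts are allowed to change.
        var oldByName = current.ToDictionary(p => p.Name);
        var newByName = next.ToDictionary(p => p.Name);

        var toAdd = next.Where(p => !oldByName.ContainsKey(p.Name)).Select(p => p.Name).ToList();
        var toRemove = current.Where(p => !newByName.ContainsKey(p.Name)).Select(p => p.Name).ToList();
        var toRestart = next
            .Where(p => oldByName.TryGetValue(p.Name, out var prev)
                        && (prev.Host != p.Host || prev.ListenPort != p.ListenPort))
            .Select(p => p.Name)
            .ToList();

        return new PlanSketch(toAdd, toRemove, toRestart);
    }
}
```

`ToDictionary` throws on duplicate names, which is one more reason validation has to run before the diff is computed.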
## Outputs
```
src/Mbproxy/Configuration/ConfigReconciler.cs # OnChange handler; orchestrates the apply
src/Mbproxy/Configuration/ReloadValidator.cs # cross-PLC validation (duplicate ports, etc.)
src/Mbproxy/Configuration/ReloadPlan.cs # immutable diff record between current and new
tests/Mbproxy.Tests/Configuration/ReloadValidatorTests.cs
tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs
tests/Mbproxy.Tests/Configuration/HotReloadE2ETests.cs # real appsettings.json mutation, real host
```
Modifications:
- `src/Mbproxy/Proxy/ProxyWorker.cs` — accept a `ConfigReconciler` and forward `IOptionsMonitor.OnChange` to it; on startup, also seed the reconciler with the initial snapshot.
- `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs` — expose a `Task ReplaceContextAsync(PerPlcContext newCtx, CancellationToken ct)` that atomically swaps the BCD tag map and counters without restarting the listener. Old in-flight connections finish on the old map; new connections use the new map. (Document the brief transition window in comments.)
- Add `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` `[LoggerMessage]` events.
- `src/Mbproxy/Options/MbproxyOptions.cs` — wire `IValidateOptions<MbproxyOptions>` to call the schema-level validator only. Cross-PLC validation (duplicate ports, etc.) is handled by `ReloadValidator` because it requires inspecting multiple `Plcs[i]` together, which `IValidateOptions` doesn't naturally express.
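One way to get the "old connections finish on the old map, new connections see the new one" behaviour is a single reference that is swapped atomically while readers capture it once. The holder below is a generic sketch of that idea, not the planned `ReplaceContextAsync` implementation.

```csharp
using System.Threading;

// Hypothetical holder for an immutable context object.
sealed class ContextHolderSketch<T> where T : class
{
    private T _current;

    public ContextHolderSketch(T initial) => _current = initial;

    // Readers capture the reference once and keep using that snapshot.
    public T Current => Volatile.Read(ref _current);

    // Publishes 'next' and returns the previous context for disposal or logging.
    public T Swap(T next) => Interlocked.Exchange(ref _current, next);
}
```

This only works cleanly if the context itself is immutable after construction; mutable counters would need to be carried over or shared explicitly.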
## Tasks
1. **`ReloadPlan.cs`** — immutable record describing the diff:
```csharp
public sealed record ReloadPlan(
IReadOnlyList<PlcOptions> ToAdd,
IReadOnlyList<string> ToRemove, // PLC names
IReadOnlyList<(string Name, PlcOptions New)> ToRestart, // port or host changed
IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat, // tag map changed
ConnectionOptions Connection);
```
Computed by a pure function `ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next)`; PLC identity is keyed on `Name` (NOT on `ListenPort`, which is mutable).
2. **`ReloadValidator.cs`** — single static method `Validate(MbproxyOptions next, out IReadOnlyList<string> errors)`:
- PLC names are unique and non-empty.
- `ListenPort` values are unique.
- For each PLC, `BcdTagMapBuilder.Build(global, perPlc).Errors` is empty.
- `AdminPort` doesn't collide with any `Plcs[i].ListenPort`.
- All ports are in `[1, 65535]`.
3. **`ConfigReconciler.cs`** — subscribes via constructor-injected `IOptionsMonitor<MbproxyOptions>.OnChange`. On change:
- Snapshot the new options.
- Run `ReloadValidator.Validate`. On failure: log `mbproxy.config.reload.rejected` with the error list; do nothing else.
- Compute `ReloadPlan` against the current snapshot.
- Apply the plan in order:
1. Stop supervisors in `ToRemove` (concurrently).
2. Stop+restart supervisors in `ToRestart` (concurrently).
3. Build new `PerPlcContext` for each `ToReseat` entry and call `supervisor.ReplaceContextAsync(newCtx)`.
4. Build supervisors for `ToAdd`, start them.
- On success: log `mbproxy.config.reload.applied` with summary (`PlcsAdded`, `PlcsRemoved`, `PlcsReseated`, `TagListDelta`). Record `lastReloadUtc` and bump `reloadCount` on a service-wide counter (consumed by phase 07).
- On any step throwing: best-effort log the partial-apply state at Error, then continue. The host stays up. (The validator should have caught most failure modes; a runtime failure here is a true bug.)
4. **`ProxyWorker.cs`** updates — register the reconciler with the host and wire startup to use it for the initial snapshot.
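
The name-keyed diff in task 1 could look roughly like the following. This is a sketch under stated assumptions: `PlcSketch`, `PlanSketch`, and `PlanSketchBuilder` are hypothetical simplifications of `PlcOptions` and `ReloadPlan`, and `TagKey` stands in for whatever comparison the real `BcdTagMap` supports. It also assumes names are already unique, which holds because validation runs first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for PlcOptions.
public sealed record PlcSketch(string Name, string Host, int ListenPort, string TagKey);

public sealed record PlanSketch(
    IReadOnlyList<PlcSketch> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<PlcSketch> ToRestart,
    IReadOnlyList<PlcSketch> ToReseat);

public static class PlanSketchBuilder
{
    public static PlanSketch Compute(IEnumerable<PlcSketch> current, IEnumerable<PlcSketch> next)
    {
        // Identity is the PLC name, never the (mutable) listen port.
        var old = current.ToDictionary(p => p.Name, StringComparer.Ordinal);
        var toAdd = new List<PlcSketch>();
        var toRestart = new List<PlcSketch>();
        var toReseat = new List<PlcSketch>();
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (var p in next)
        {
            seen.Add(p.Name);
            if (!old.TryGetValue(p.Name, out var prev)) { toAdd.Add(p); continue; }

            if (prev.Host != p.Host || prev.ListenPort != p.ListenPort)
                toRestart.Add(p);   // a restart rebuilds the context, tag map included
            else if (prev.TagKey != p.TagKey)
                toReseat.Add(p);    // tag map only: swap in place, no restart
        }

        var toRemove = old.Keys.Where(n => !seen.Contains(n)).ToList();
        return new PlanSketch(toAdd, toRemove, toRestart, toReseat);
    }
}
```

A PLC whose port and tag map both change lands in `ToRestart` only, which is the behaviour `Compute_ChangePort_GoesToToRestart_NotToReseat` expects.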
## Public surface declared in this phase
```csharp
namespace Mbproxy.Configuration;
internal sealed class ConfigReconciler : IDisposable {
public ConfigReconciler(IOptionsMonitor<MbproxyOptions> monitor, /* dependencies */);
public Task ApplyAsync(MbproxyOptions next, CancellationToken ct); // exposed for tests
public void Dispose();
}
public sealed record ReloadPlan(
IReadOnlyList<PlcOptions> ToAdd,
IReadOnlyList<string> ToRemove,
IReadOnlyList<(string Name, PlcOptions New)> ToRestart,
IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat,
ConnectionOptions Connection) {
public static ReloadPlan Compute(MbproxyOptions current, MbproxyOptions next);
}
internal static class ReloadValidator {
public static bool Validate(MbproxyOptions next, out IReadOnlyList<string> errors);
}
```
## Tests required
### Unit (`Category = Unit`)
`ReloadValidatorTests` (≥ 6 tests):
1. `Validate_DuplicatePlcName_Fails`
2. `Validate_DuplicateListenPort_Fails`
3. `Validate_AdminPortCollidesWith_PlcListenPort_Fails`
4. `Validate_PerPlc_BcdMapBuildError_Fails`
5. `Validate_PortOutOfRange_Fails`
6. `Validate_HappyPath_Passes`
`ReloadPlanTests` (≥ 5 tests):
1. `Compute_AddOnePlc_OnlyToAddPopulated`
2. `Compute_RemoveOnePlc_OnlyToRemovePopulated`
3. `Compute_ChangePort_GoesToToRestart_NotToReseat`
4. `Compute_ChangePerPlcTagOverride_GoesToToReseat`
5. `Compute_ChangeGlobalTagList_AllPlcsReseat_NoRestart`
`ConfigReconcilerTests` (≥ 4 tests, using a fake `IOptionsMonitor` + fake supervisor factory):
1. `Apply_HappyPath_StartsAndStopsSupervisors_PerPlan`
2. `Apply_ValidationFails_NoMutationOccurs_AndLogsRejected`
3. `Apply_ReseatTagMap_DoesNotRestartSupervisor`
4. `Apply_ConcurrentReloads_Are_Serialised` — two rapid changes get processed in order, no interleaving.
### E2E (`Category = E2E`)
`HotReloadE2ETests` (≥ 4 tests, using a real `Host.CreateApplicationBuilder` + temp appsettings.json file):
1. `E2E_AddPlcAtRuntime_NewListenerBinds_AndIsReachable` — start the host with one PLC, write a new appsettings adding a second PLC pointing at the simulator on a fresh listen port, drive NModbus against the new proxy port within 2 s.
2. `E2E_RemovePlcAtRuntime_ClosesUpstreamConnections` — start with two PLCs and a connected client, write appsettings removing one; client's socket closes within 1 s.
3. `E2E_ChangeGlobalBcdTagList_RewriteReflectsImmediately` — start with addr 1072 NOT in BCD list, read raw 0x1234. Write appsettings adding it. Read again, get decoded 1234.
4. `E2E_InvalidReload_DoesNotMutateRunningState` — start happy, write a broken appsettings (duplicate ListenPort), assert the host keeps running with the OLD config and `mbproxy.config.reload.rejected` is logged.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-05 tests still green.
- [ ] All new unit tests green.
- [ ] All e2e hot-reload tests green when the simulator is available.
- [ ] `mbproxy.config.reload.applied` / `.rejected` events match the design's properties list.
- [ ] A misconfigured reload (duplicate ports) is rejected atomically; `E2E_InvalidReload_DoesNotMutateRunningState` (E2E test 4) asserts that no partial mutation occurs.
- [ ] The reconciler serializes concurrent `OnChange` notifications (`SemaphoreSlim` or equivalent) so two file saves in quick succession don't race.
- [ ] Counters `service.config.reloadCount` and `service.config.reloadRejectedCount` are bumped correctly.
## Out of scope
- Watching for files OTHER than `appsettings.json` (env files, dotnet user-secrets, etc.). The default config source set established in phase 00 is the contract.
- Reloading Serilog log levels at runtime. Possible but not in this phase.
- A reload audit log file. The accept/reject events are sufficient.
- Online schema migrations (e.g., renaming a key in an older config to a new one). Reject-the-whole-thing is the simpler contract.
## Notes for the subagent
- `IOptionsMonitor.OnChange` can fire MULTIPLE times for a single file save on some platforms (text editors saving via rename-and-replace can trigger 2-3 events). Debounce inside the reconciler — a 250 ms quiescent window after the last `OnChange` before computing the plan. Document the choice in code.
- The reconciler must NOT block the `OnChange` callback thread for I/O (`StopAsync` etc.). Use `Channel<ReloadRequest>` or a `Task.Run`-style hand-off so the callback returns immediately.
- When a supervisor restart is in progress (e.g., port changed), there are two options: briefly reject further reloads and queue a "retry after current applies", or serialise everything via a single semaphore and accept that a backed-up reload queue applies all changes eventually. Pick the simpler option (semaphore); document it.
- `BcdTagMapBuilder.Build` is the validator for tag-list well-formedness; do not duplicate that validation in `ReloadValidator`. The validator just calls `Build` and checks the `Errors` list.
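
The debounce and hand-off notes above can be combined into one small component. The sketch below is an illustration only: `DebouncedReloadQueue` is a hypothetical name, and it uses a single channel reader loop for serialisation, which gives the same "one apply at a time" guarantee the semaphore option would.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class DebouncedReloadQueue
{
    private readonly Channel<int> _signals = Channel.CreateUnbounded<int>();
    private readonly TimeSpan _quietWindow;
    private readonly Func<CancellationToken, Task> _apply;

    public DebouncedReloadQueue(TimeSpan quietWindow, Func<CancellationToken, Task> apply)
    {
        _quietWindow = quietWindow;
        _apply = apply;
    }

    // Safe to call from the OnChange callback: never blocks, never does I/O.
    public void Signal() => _signals.Writer.TryWrite(0);

    // Lets RunAsync finish once pending signals have been handled.
    public void Complete() => _signals.Writer.TryComplete();

    // Single consumer loop, so applies never interleave.
    public async Task RunAsync(CancellationToken ct)
    {
        var reader = _signals.Reader;
        while (await reader.WaitToReadAsync(ct))
        {
            while (reader.TryRead(out _)) { }          // collapse the burst
            while (true)
            {
                await Task.Delay(_quietWindow, ct);    // quiescent window
                if (!reader.TryRead(out _)) break;     // nothing new: go apply
                while (reader.TryRead(out _)) { }      // more arrived: wait again
            }
            await _apply(ct);
        }
    }
}
```

With a 250 ms window, an editor that fires two or three change events per save results in a single apply.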
@@ -0,0 +1,147 @@
# Phase 07 — Status page
Stand up the read-only Kestrel-hosted admin endpoint on `Mbproxy.AdminPort`. Two routes — `GET /` (self-contained HTML, meta-refresh 5 s) and `GET /status.json` (the same data as JSON). No admin actions, no auth.
**Depends on:** Phase 05 (supervisor snapshots), Phase 06 (config reload counters).
**Parallel-safe with:** nothing (touches DI registration + needs counters from both 05 and 06).
## Goal
A single port that an operator can open in a browser and see, at a glance:
- Service uptime, version, last-reload timestamp + counts.
- Every configured PLC's listener state (`bound` / `recovering` / `stopped`), last bind error, currently connected clients and their per-client PDU counts, PDU counts by function code, BCD slots rewritten, partial-overlap warnings, backend exception counts by code, last round-trip ms, bytes upstream/downstream.
Same data is exposed as `/status.json` for scraping (Prometheus textfile, custom Nagios check, etc.).
## Outputs
```
src/Mbproxy/Admin/AdminEndpointHost.cs # owns the Kestrel server lifecycle
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # composes per-PLC + service-wide snapshots
src/Mbproxy/Admin/StatusDto.cs # the wire DTOs for /status.json
src/Mbproxy/Admin/StatusHtmlRenderer.cs # builds the single-page HTML
src/Mbproxy/Admin/AssemblyVersionAccessor.cs # cached version string
tests/Mbproxy.Tests/Admin/StatusSnapshotBuilderTests.cs
tests/Mbproxy.Tests/Admin/AdminEndpointTests.cs # HTTP-level; live Kestrel + HttpClient
```
Modifications:
- `src/Mbproxy/Mbproxy.csproj` — add `Microsoft.AspNetCore.App` framework reference (the Worker SDK doesn't include ASP.NET Core by default).
- `src/Mbproxy/Program.cs` — register `AdminEndpointHost` as a hosted service; wire it through DI alongside the proxy worker. AdminPort comes from `IOptionsMonitor<MbproxyOptions>`.
- `src/Mbproxy/Proxy/ProxyCounters.cs` — extend with per-client counters: `IReadOnlyList<ClientCounterSnapshot> Snapshot()` includes connected clients with `Remote`, `ConnectedAtUtc`, `PdusForwarded`, `LastRoundTripMs`.
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — record connect time, expose `RemoteEndpoint`, track round-trip time per request (EWMA via `LastRoundTripMs` field).
- Service-wide counters introduced here: `ServiceCounters` with `UptimeStartedAtUtc`, `LastReloadUtc`, `ReloadCount`, `ReloadRejectedCount`. Wired into `ConfigReconciler` (bump on apply / reject) and the service start path (set started-at).
## Tasks
1. **`StatusDto.cs`** — record types matching the design's per-PLC + service-wide field tables verbatim. Use `System.Text.Json` source generation (`JsonSerializerContext`) to keep the response allocation-light:
```csharp
[JsonSerializable(typeof(StatusResponse))]
internal partial class StatusJsonContext : JsonSerializerContext;
```
2. **`StatusSnapshotBuilder.cs`** — pulls from injected `ProxyWorker` (or a slim view of it), `ConfigReconciler`, `ServiceCounters`, and each `PlcListenerSupervisor`. Builds a `StatusResponse` record. Pure logic; no I/O. The builder is a `sealed` class constructed once via DI; calling `Build()` is the only operation.
3. **`StatusHtmlRenderer.cs`** — pure function `string Render(StatusResponse status)`. Produces a single HTML document with:
- `<meta http-equiv="refresh" content="5">` for auto-refresh.
- A header line with service version + uptime + last-reload info.
- A table per PLC. Columns match the per-PLC field set; `listener.state` is colour-coded inline (CSS in a `<style>` block — no external assets).
- Total page weight under 50 KB for typical fleets; the design's 54-PLC count puts the table at ~54 rows.
4. **`AssemblyVersionAccessor.cs`** — reads `AssemblyInformationalVersionAttribute` once at startup, caches it as a string. Used for the `service.version` field.
5. **`AdminEndpointHost.cs`** — `IHostedService` that:
- On start: builds a `WebApplication` (Kestrel) configured to listen on `AdminPort`. Maps `GET /` to a handler that calls `StatusSnapshotBuilder.Build()` then `StatusHtmlRenderer.Render()`, returning `text/html`. Maps `GET /status.json` to a handler returning `JsonSerializer.Serialize(snapshot, StatusJsonContext.Default.StatusResponse)`. NO other routes.
- If `AdminPort` is in use at startup: log `mbproxy.admin.bind.failed` (new event) at Error, do not throw. The proxy listeners continue to run; only the admin endpoint is missing. Operators see this in logs.
- On hot-reload of `AdminPort`: stop and restart the Kestrel server bound to the new port.
- On stop: call `StopAsync` on the Kestrel app with a 2 s graceful-shutdown deadline.
6. **`ServiceCounters.cs`** (under `src/Mbproxy/`) — a singleton DI service holding the service-wide counters. `Initialize(DateTimeOffset startedAtUtc)`; `RecordReloadApplied(DateTimeOffset)`; `RecordReloadRejected()`. Snapshot returns an immutable record.
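
One way to shape the `ServiceCounters` described in task 6 is shown below. It is a sketch, not the shipped type: the class and record names are hypothetical, and it assumes that a snapshot which is consistent per field (rather than atomic across all fields) is acceptable for a status page.

```csharp
using System;
using System.Threading;

public sealed record ServiceCountersSnapshot(
    DateTimeOffset StartedAtUtc, DateTimeOffset? LastReloadUtc,
    int ReloadCount, int ReloadRejectedCount);

public sealed class ServiceCountersSketch
{
    private long _startedTicks;
    private long _lastReloadTicks;   // 0 means "never reloaded"
    private int _reloadCount;
    private int _reloadRejectedCount;

    public void Initialize(DateTimeOffset startedAtUtc) =>
        Interlocked.Exchange(ref _startedTicks, startedAtUtc.UtcTicks);

    public void RecordReloadApplied(DateTimeOffset atUtc)
    {
        Interlocked.Exchange(ref _lastReloadTicks, atUtc.UtcTicks);
        Interlocked.Increment(ref _reloadCount);
    }

    public void RecordReloadRejected() =>
        Interlocked.Increment(ref _reloadRejectedCount);

    // The only allocating call; intended for the cold status-page path.
    public ServiceCountersSnapshot Snapshot()
    {
        long last = Interlocked.Read(ref _lastReloadTicks);
        return new ServiceCountersSnapshot(
            new DateTimeOffset(Interlocked.Read(ref _startedTicks), TimeSpan.Zero),
            last == 0 ? (DateTimeOffset?)null : new DateTimeOffset(last, TimeSpan.Zero),
            Volatile.Read(ref _reloadCount),
            Volatile.Read(ref _reloadRejectedCount));
    }
}
```

Storing timestamps as UTC ticks keeps the record paths lock-free while still letting `Snapshot()` return an immutable record.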
## Public surface declared in this phase
```csharp
namespace Mbproxy.Admin;
internal sealed class AdminEndpointHost : IHostedService { /* ... */ }
public sealed record StatusResponse(
ServiceFields Service,
ListenersAggregate Listeners,
IReadOnlyList<PlcStatus> Plcs);
public sealed record ServiceFields(
long UptimeSeconds, string Version,
DateTimeOffset? ConfigLastReloadUtc, int ConfigReloadCount, int ConfigReloadRejectedCount);
public sealed record ListenersAggregate(int Bound, int Configured);
public sealed record PlcStatus(
string Name, string Host, int ListenPort,
PlcListenerStatus Listener,
PlcClientsStatus Clients,
PlcPdusStatus Pdus,
PlcBackendStatus Backend,
PlcBytesStatus Bytes);
public sealed record PlcListenerStatus(string State, string? LastBindError, int RecoveryAttempts);
public sealed record PlcClientsStatus(int Connected, IReadOnlyList<ClientSnapshot> RemoteEndpoints);
public sealed record ClientSnapshot(string Remote, DateTimeOffset ConnectedAtUtc, long PdusForwarded);
public sealed record PlcPdusStatus(long Forwarded, FcCounts ByFc, long RewrittenSlots, long PartialBcdWarnings);
public sealed record FcCounts(long Fc03, long Fc04, long Fc06, long Fc16, long Other);
public sealed record PlcBackendStatus(long ConnectsSuccess, long ConnectsFailed, ExceptionCounts ExceptionsByCode, double LastRoundTripMs);
public sealed record ExceptionCounts(long Code01, long Code02, long Code03, long Code04);
public sealed record PlcBytesStatus(long UpstreamIn, long UpstreamOut);
```
## Tests required
### Unit (`Category = Unit`)
`StatusSnapshotBuilderTests` (≥ 6 tests):
1. `Build_NoPlcsConfigured_ReturnsEmptyPlcList`
2. `Build_OnePlcBound_PopulatesListenerState_Bound`
3. `Build_PlcRecovering_PopulatesLastBindError_AndAttempts`
4. `Build_AggregatesListenersBoundAndConfigured`
5. `Build_PerClientSnapshot_Includes_RemoteAndConnectedAt_AndPduCount`
6. `Build_ServiceFields_IncludeUptime_Version_AndLastReload`
`StatusHtmlRendererTests` (≥ 3 tests):
1. `Render_OnePlc_ProducesValidHtml_WithMetaRefresh`
2. `Render_RecoveringPlc_HighlightsState`
3. `Render_PageWeightUnder50KB_For54Plcs` — assert character length.
### E2E (`Category = E2E`)
`AdminEndpointTests` (≥ 5 tests, against a live in-process Kestrel + simulator):
1. `Get_StatusJson_ReturnsValidShape`
2. `Get_StatusJson_AfterReadFC03_ShowsPduCountIncreased`
3. `Get_StatusJson_AfterPartialBcdWrite_ShowsPartialBcdWarning`
4. `Get_Root_ReturnsHtml_WithMetaRefresh`
5. `AdminPort_BindFailure_ServiceStaysUp_AndLogsBindFailed` — pre-bind the AdminPort, start the service, assert proxy listeners come up and the admin endpoint logs the failure.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-06 tests still green.
- [ ] All new unit + e2e tests green.
- [ ] `/status.json` shape matches the field tables in [`../design.md`](../design.md) → "Status page" exactly (field names, casing, nesting).
- [ ] Counters on the read path (`PdusForwarded`, etc.) remain allocation-free; `Snapshot()` is the only allocating call and it's on the cold path.
- [ ] AdminPort collision is logged but does NOT take down the proxy.
- [ ] Hot-reload of `AdminPort` works (verified by adding a test in this phase or extending one of phase 06's e2e tests).
## Out of scope
- Authentication / authorisation on the admin port. Design explicitly defers to network-layer trust.
- Prometheus exposition format. The `/status.json` shape is the contract; downstream tools can transform.
- WebSocket push of counters. Meta-refresh is good enough at 54 PLCs.
- Historical counter retention (rolling windows, time series). Counters are cumulative since process start; restart resets.
- Per-tag-level telemetry (which BCD addresses got rewritten how often). The per-PLC `RewrittenSlots` total is enough; finer granularity goes in a future phase if needed.
## Notes for the subagent
- Use the minimal-API style for the two endpoints; no controllers. The whole admin endpoint is ~50 lines of map / handler code.
- `System.Text.Json` source generation needs `[JsonSerializable]` on the DTO chain. Don't use reflection-based serialization in this codebase — it is not AOT-safe and is slower for the simple shape.
- For the HTML page, embed CSS in a `<style>` block. Do not link external stylesheets — the admin endpoint must work over a firewalled network with no internet egress.
- Test 3 of `AdminEndpointTests` requires triggering a partial-BCD warning, which means configuring a 32-bit BCD tag and reading only one half of it through the proxy. This is the same scenario phase 04's e2e test 5 exercised; reuse the setup.
- The admin port collision test is important: an operator misconfiguration must not take down the proxy itself. Log Error, continue running.
@@ -0,0 +1,134 @@
# Phase 08 — Windows service hardening
Install / uninstall scripts, graceful shutdown, Windows Event Log integration, and the public-facing `README.md` that the root `wwtools/CLAUDE.md` index points at. This is the "ship it" phase.
**Depends on:** Phase 04 (rewriter), Phase 07 (status page).
**Parallel-safe with:** nothing.
## Goal
After this phase, an operator can:
1. `dotnet publish` the service into a self-contained folder.
2. Run `install.ps1` to register it as a Windows service.
3. See it appear in `services.msc` running as `Local System` (default — overridable to a managed service account).
4. Stop it cleanly via `sc.exe stop mbproxy`; the service finishes all in-flight PDUs and exits within 10 s.
5. Read crash reasons from the Windows Event Log alongside the Serilog rolling-file output.
6. Read [`../../mbproxy/README.md`](../../mbproxy/README.md) to figure all of this out without needing to talk to a developer.
## Outputs
```
mbproxy/README.md # tool-level human entry point (per DOCS-GUIDE Layer 2)
mbproxy/install/install.ps1 # registers the service
mbproxy/install/uninstall.ps1 # removes it
mbproxy/install/mbproxy.config.template.json # commented appsettings.json for ops
mbproxy/docs/operations.md # ops runbook (install, upgrade, troubleshooting)
src/Mbproxy/Diagnostics/ShutdownCoordinator.cs # graceful-shutdown helper
src/Mbproxy/Diagnostics/EventLogBridge.cs # logs critical events to Windows Event Log
tests/Mbproxy.Tests/Diagnostics/ShutdownCoordinatorTests.cs
```
Modifications:
- `src/Mbproxy/Program.cs` — wire `ShutdownCoordinator` into the host-stop signal. Wire `EventLogBridge` as a Serilog sub-sink for events at Error and above when running under Windows Service (`WindowsServiceHelpers.IsWindowsService()` true).
- `src/Mbproxy/Mbproxy.csproj` — `<PublishSingleFile>true</PublishSingleFile>` and `<SelfContained>true</SelfContained>` for the publish profile.
- `../CLAUDE.md` (the root `wwtools/CLAUDE.md`) — update the `mbproxy` index row to point at the new `mbproxy/README.md` (per the maintenance note in `mbproxy/CLAUDE.md`).
- `mbproxy/CLAUDE.md` — update the "Current state" section to reflect the post-implementation state (no longer "no code yet"), and the Maintenance section to note that the README is now the canonical human entry point.
## Tasks
1. **`mbproxy/README.md`** — follows the DOCS-GUIDE Layer-2 template exactly. Required sections in order: one-sentence identification, hard constraints / prerequisites, layout, resource index, build & run, install. Cross-link to `docs/design.md`, `docs/plan/README.md`, `docs/operations.md`, `CLAUDE.md`. No deep prose tutorials; the README routes.
2. **`mbproxy/install/install.ps1`** — parameters: `-InstallPath <path>` (default `C:\Program Files\Mbproxy`), `-ServiceName <name>` (default `mbproxy`), `-DisplayName <text>`, `-Account <managed-service-account>` (default `LocalSystem`). Behaviour:
- Verifies admin rights; fails with a clear message if not elevated.
- Copies the publish output (passed via `-PublishOutput <path>`) to `InstallPath`.
- Runs `sc.exe create <ServiceName> binPath= "<InstallPath>\Mbproxy.exe" start= auto displayName= "<DisplayName>" obj= <Account>`.
- Sets the failure-action policy: restart after 60 s on first/second failure, no restart on subsequent (`sc.exe failure ...`).
- Creates `%ProgramData%\mbproxy\logs\` with appropriate ACLs.
- Copies `mbproxy.config.template.json` to `%ProgramData%\mbproxy\appsettings.json` if no config exists.
- Optionally starts the service if `-Start` flag is passed.
3. **`mbproxy/install/uninstall.ps1`** — stops the service if running, `sc.exe delete <ServiceName>`, removes `InstallPath` (with `-KeepConfig` flag to preserve `%ProgramData%\mbproxy\appsettings.json`).
4. **`mbproxy/install/mbproxy.config.template.json`** — a fully commented `appsettings.json` showing the full schema with example values and inline `//` comments describing every field. (Use `appsettings.jsonc` semantics; the `Microsoft.Extensions.Configuration.Json` provider skips `//` comments when parsing.)
5. **`ShutdownCoordinator.cs`** — orchestrates graceful shutdown on `IHostApplicationLifetime.ApplicationStopping`:
- Stop accepting new upstream connections on all `PlcListenerSupervisor`s.
- Wait for in-flight PDUs to complete with a `10 s` deadline (configurable via `Connection.GracefulShutdownTimeoutMs`, default 10000).
- Stop the admin endpoint.
- Cancel all remaining work. Log `mbproxy.shutdown.complete` with `InFlightAtCancel` count.
6. **`EventLogBridge.cs`** — adds a Serilog sub-sink that writes events with level >= Error to the Windows Event Log under source `mbproxy`. Only enabled when running as a Windows Service. The install script creates the event source.
7. **`mbproxy/docs/operations.md`** — operations runbook:
- Install / uninstall steps (mirror to `README.md`).
- Upgrade procedure (stop service, copy new binaries, start).
- Where logs live, how to roll them, retention defaults.
- Common failure modes (port already in use, PLC unreachable, BCD validation reject) with the relevant log event names and what to check.
- The `services.msc` / `sc.exe` / `Get-Service` commands operators will actually use.
- How to safely edit `appsettings.json` for hot-reload (with the rejection-keeps-old-config promise).
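
The "wait for in-flight PDUs with a deadline, then cancel the rest" step of `ShutdownCoordinator` might be sketched as follows. `DrainSketch` and its parameters are hypothetical; the real coordinator would obtain the in-flight tasks from the supervisors and also handle listener and admin-endpoint ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class DrainSketch
{
    // Returns the number of operations still running when the deadline hit
    // (the value the design logs as InFlightAtCancel); 0 on a clean drain.
    public static async Task<int> DrainAsync(
        IReadOnlyCollection<Task> inFlight,
        TimeSpan deadline,
        CancellationTokenSource cancelRemaining)
    {
        Task all = Task.WhenAll(inFlight);
        Task winner = await Task.WhenAny(all, Task.Delay(deadline));
        if (winner == all) return 0;

        int stillRunning = inFlight.Count(t => !t.IsCompleted);
        cancelRemaining.Cancel();   // tell the stragglers to stop
        return stillRunning;
    }
}
```

Note that `Task.WhenAny` does not throw when a task faults, so a failed in-flight request counts as finished for draining purposes.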
## Public surface declared in this phase
```csharp
namespace Mbproxy.Diagnostics;
internal sealed class ShutdownCoordinator {
public Task ShutdownAsync(int timeoutMs, CancellationToken hostCt);
}
internal sealed class EventLogBridge { /* Serilog sub-sink */ }
```
No additional public types are needed; all surfaces from previous phases remain stable.
## Tests required
### Unit (`Category = Unit`)
`ShutdownCoordinatorTests` (≥ 4 tests):
1. `Shutdown_NoActiveConnections_CompletesImmediately`
2. `Shutdown_OneActiveConnection_WaitsForCompletion`
3. `Shutdown_TimeoutExceeded_CancelsRemainingWork_AndReportsCount`
4. `Shutdown_AdminEndpointStopped_AfterListenersStopped` — ordering test.
### E2E (`Category = E2E`)
`ShutdownE2ETests` (≥ 2 tests, against simulator):
1. `E2E_StopHost_WithConnectedClient_DrainsCleanlyWithin10s` — start host, connect NModbus, issue 5 back-to-back FC03 reads, signal host stop, assert all 5 complete and the client's TCP socket is closed cleanly.
2. `E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout` — same but with a `Connection.BackendRequestTimeoutMs` that exceeds the shutdown deadline; assert shutdown completes within the deadline and the in-flight request was cancelled.
### Manual / smoke
- Install the service via `install.ps1` on a clean test VM; confirm it appears in `services.msc` with `Local System` identity.
- `sc.exe start mbproxy` — service starts, admin endpoint at `http://localhost:8080/` shows the proxy is up.
- Send `sc.exe stop mbproxy` — service stops within 10 s.
- Trigger a crash by killing the process with Task Manager, then confirm an entry appears in the Windows Event Log under source `mbproxy`. (Corrupting `appsettings.json` while running does not work as a crash trigger: the invalid reload is rejected gracefully.)
- `uninstall.ps1` — service removed cleanly; `%ProgramData%\mbproxy\` is preserved only if `-KeepConfig` was passed.
The manual smoke results go into `docs/operations.md` as a "first install" verification checklist.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-07 tests still green.
- [ ] All new unit tests green.
- [ ] All e2e shutdown tests green.
- [ ] `mbproxy/README.md` exists, follows the DOCS-GUIDE Layer-2 template, and routes into deep docs without duplicating their content.
- [ ] Root `wwtools/CLAUDE.md` index row for `mbproxy` points at `mbproxy/README.md` (was previously pointing into the design plan or the bare folder).
- [ ] `install.ps1` and `uninstall.ps1` are idempotent — re-running install when the service already exists is a clean no-op or update, not a hard error.
- [ ] Windows Event Log source is created during install and removed during uninstall.
- [ ] `dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true /p:PublishSingleFile=true` produces a single executable under 50 MB.
- [ ] Manual smoke checklist in `docs/operations.md` has been executed on at least one fresh VM and the result documented.
## Out of scope
- Linux / Docker packaging. The design fixes Windows Service as the deployment target.
- Centralised log aggregation (Splunk forwarder config, Elastic agent, etc.). Document where the logs are; let ops integrate.
- A signed installer (MSI / setup.exe). PowerShell-driven install is the contract; an MSI can be added later if procurement demands it.
- Metric exposition for Prometheus / OpenTelemetry. The status page's `/status.json` is sufficient for the operational needs declared in the design.
## Notes for the subagent
- The Windows Event Log source creation requires admin rights — that's already a precondition for `install.ps1`. Do not try to create the source at runtime from the service itself (it would fail when the service runs as a non-admin account).
- Single-file publish makes `Assembly.GetExecutingAssembly().Location` empty. If `AssemblyVersionAccessor` (phase 07) used that, swap to `Assembly.GetExecutingAssembly().GetCustomAttribute<AssemblyInformationalVersionAttribute>()`.
- The `mbproxy/README.md` is what an operator reads first. Be ruthless about length — aim for under 100 lines. The DOCS-GUIDE says routes, not tutorials.
- After this phase merges, the project is feature-complete against [`../design.md`](../design.md). Any further work belongs in a NEW design revision (dated, in the same doc) and a new phase plan.
@@ -0,0 +1,341 @@
# Phase 09 — MBAP TxId multiplexing (single backend connection per PLC)
Replace the 1:1 upstream-client ↔ backend-socket model with a **single backend connection per PLC**, multiplexed across all upstream clients via MBAP transaction-ID rewriting and a correlation map. After this phase the H2-ECOM100's 4-simultaneous-TCP-client cap is no longer an operational ceiling — the proxy holds exactly one slot per PLC regardless of how many upstream clients are connected.
**Status:** shipped 2026-05-14. Phases 00-08 shipped the production-ready 1:1 model; this phase swapped connection management without changing the transparent-rewrite contract.
## Implementation clarifications discovered during 2026-05-14 ship
These notes capture decisions and surprises that surfaced during the actual implementation. They supplement (not replace) the Tasks section below.
1. **A per-request timeout watchdog is part of Phase 9, not deferred.** The 1:1 model collapsed missing-response handling onto the dedicated backend socket dying. The multiplexed model needs an explicit timer because a single lost or mis-routed response would otherwise leak a correlation entry forever and hang the upstream pipe indefinitely. The watchdog ticks at quarter-`BackendRequestTimeoutMs` (min 100 ms), scans the correlation map, and times out stale requests with **Modbus exception 0x0B (Gateway Target Device Failed To Respond)** delivered to the upstream party with the original TxId restored. Log event `mbproxy.multiplex.request.timeout` (Warning).
2. **PlcListener constructs a multiplexer unconditionally.** The Phase-9 draft had `PlcListener` conditionally construct the multiplexer only when a `PerPlcContext` was supplied; the no-context fallback dropped accepted upstream sockets. Tests (and any pre-Phase-6 startup path that lacked a context) hit a regression. The fix is to construct a minimal default `PerPlcContext` from the `PlcOptions` if the caller didn't supply one, and require `_multiplexer` to be non-null when `RunAsync` runs.
3. **`BackendConnectFailure_ClosesUpstreamCleanly` is now lazy.** The 1:1 model attempted a backend connect at upstream-accept time, so simply opening a TCP connection to a proxy with a bad backend triggered the close. The multiplexed model connects to the backend on the *first upstream frame*, so the test has to send a Modbus request before the proxy attempts the (failing) backend connect that causes the upstream close. Updated in-place.
4. **pymodbus 3.13.0 simulator is broken under multiplexed concurrent requests.** Its `ServerRequestHandler` keeps a single `last_pdu` per connection and schedules `handle_later` via `asyncio.call_soon`; two MBAP frames in one recv buffer overwrite `last_pdu` before the first handler runs, and both responses carry the later TxId. The real DL260 ECOM properly echoes per-request TxIds. Consequence for tests:
- **Mux correctness under truly concurrent backend traffic is proven against the stub backend in `PlcMultiplexerTests`**, which models the DL260's correct TxId-echo behaviour.
- **`MultiplexerE2ETests` paces requests** so pymodbus only ever sees one MBAP frame at a time on the shared backend connection. The headline test (`E2E_FiveSimultaneousClients_AllReadHR1072_AllGetDecoded_1234`) verifies the connection ceiling lift (5 simultaneous upstream connections, where Phase-08's 1:1 model would have refused the 5th) — *not* the under-concurrency multiplexing behaviour.
- **The watchdog is the production defence** if any real backend (or future simulator) ever mis-echoes a TxId: stale entries time out cleanly with exception 0x0B rather than hanging upstream clients.
5. **E2E timeouts.** Per `docs/plan/README.md`'s Test discipline, all E2E tests are 5 s by default. Hot-reload tests that genuinely need 5 s + 3 s of propagation windows carry a 10 s timeout with a one-line comment; `E2E_BackendDisconnect_DuringInflight_CascadesUpstream_AndRecovers` carries 8 s for its sequential connects + Polly-paced reconnect path.
6. **`AsyncHostDispose` deadlock note.** Test fixtures that hold `IHost` via `await using` were originally written with a 5 s shutdown timeout; under Phase 9's drained-channel cleanup that occasionally exceeded the test's own `Timeout = 5000`. Reduced to 2-3 s where it doesn't materially affect the test's drain semantics.
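
For reference, the timeout response from clarification 1 is a 9-byte frame: a 7-byte MBAP header followed by a 2-byte exception PDU. The helper below is an illustrative sketch (`GatewayTimeoutFrame` is a hypothetical name, not a type from this phase); the byte layout follows the Modbus TCP framing, and the tick formula restates "quarter-`BackendRequestTimeoutMs` (min 100 ms)".

```csharp
using System;
using System.Buffers.Binary;

public static class GatewayTimeoutFrame
{
    public const byte GatewayTargetFailedToRespond = 0x0B;

    // Watchdog tick: a quarter of the request timeout, floored at 100 ms.
    public static int WatchdogTickMs(int backendRequestTimeoutMs) =>
        Math.Max(100, backendRequestTimeoutMs / 4);

    // Builds the exception response delivered upstream, with the client's
    // ORIGINAL TxId restored (not the proxy-side one).
    public static byte[] Build(ushort originalTxId, byte unitId, byte functionCode)
    {
        var frame = new byte[9];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), originalTxId);
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);  // protocol id: Modbus
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), 3);  // length: unit id + 2-byte PDU
        frame[6] = unitId;
        frame[7] = (byte)(functionCode | 0x80);                        // exception flag on the FC
        frame[8] = GatewayTargetFailedToRespond;
        return frame;
    }
}
```

For example, a timed-out FC03 request with TxId `0x1234` on unit 1 yields `12 34 00 00 00 03 01 83 0B`.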
**Depends on:** Phase 04 (rewriter), Phase 05 (supervisor + Polly), Phase 07 (status page DTO surface).
**Parallel-safe with:** nothing within itself. **Hard rule.** This phase deletes `PlcConnectionPair` and rewires the supervisor + rewriter correlation path simultaneously; the cross-cut is too broad for safe parallel work. The optional intra-phase slicing (below) is the closest thing to parallel.
## Goal
The H2-ECOM100 accepts 4 concurrent TCP clients per PLC; today's 1:1 model means the 5th upstream client to the same proxy port fails at backend connect. This phase eliminates that ceiling by making **one persistent backend socket per PLC**, with the proxy serving as a connection multiplexer that rewrites MBAP transaction IDs to keep concurrent in-flight requests from different upstream clients distinguishable on the single wire.
The wire-rate ceiling does not change — the H2-ECOM100 internally serializes requests (one per PLC scan, ~2-10 ms scan time) regardless of how many TCP connections it has. We're shifting where serialization happens (proxy outbound queue vs PLC accept queue), not adding throughput. The dashboard pay-off is that "PLC clients connected" can rise into the dozens without the proxy degrading.
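
The core of the TxId rewriting idea fits in a few lines: allocate a proxy-side TxId that is unique on the shared backend socket, remember which client asked and what TxId it used, and restore that on the response. The sketch below merges the allocator and correlation map into one hypothetical class (`TxIdAllocatorSketch`) for brevity; the phase itself keeps `TxIdAllocator`, `CorrelationMap`, and `InFlightRequest` separate, and tracks a list of interested parties rather than a single client id.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class TxIdAllocatorSketch
{
    private readonly ConcurrentDictionary<ushort, (int ClientId, ushort OriginalTxId)> _inFlight = new();
    private readonly object _gate = new();
    private ushort _next;

    public long Wraps { get; private set; }

    // Allocates a proxy-side TxId not currently in flight and records who asked.
    public bool TryAllocate(int clientId, ushort originalTxId, out ushort proxyTxId)
    {
        lock (_gate)
        {
            for (int attempts = 0; attempts <= ushort.MaxValue; attempts++)
            {
                ushort candidate = _next;
                _next = unchecked((ushort)(_next + 1));
                if (_next == 0) Wraps++;                 // 16-bit counter rolled over
                if (_inFlight.TryAdd(candidate, (clientId, originalTxId)))
                {
                    proxyTxId = candidate;
                    return true;
                }
            }
        }
        proxyTxId = 0;
        return false;                                    // all 65,536 ids are in flight
    }

    // Called when the backend response (or the watchdog) resolves a request.
    public bool TryComplete(ushort proxyTxId, out int clientId, out ushort originalTxId)
    {
        if (_inFlight.TryRemove(proxyTxId, out var entry))
        {
            (clientId, originalTxId) = entry;
            return true;
        }
        clientId = 0;
        originalTxId = 0;
        return false;
    }
}
```

Two clients that both happen to send TxId 7 get distinct proxy-side ids, so their responses stay distinguishable on the single backend connection.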
## Intra-phase slicing (the closest thing to parallel-safe within this phase)
The phase is one merge but can be implemented as five small commits in this order:
| Slice | Output | Files touched | Hours | Parallelizable? |
|-------|--------|---------------|-------|-----------------|
| 9.1 | Pure data types (TxIdAllocator, CorrelationMap, InFlightRequest) + their unit tests | new files under `src/Mbproxy/Proxy/Multiplexing/` and `tests/...` | ~5 | Yes — pure logic, disjoint from rest. A second agent can write the E2E test scaffolding (slice 9.5) in parallel. |
| 9.2 | `PlcMultiplexer` + `UpstreamPipe` skeleton with backend reader/writer loops | new files in `Multiplexing/` | ~10 | No — depends on 9.1's data types. |
| 9.3 | Refactor `PlcListener` to own the multiplexer; delete `PlcConnectionPair`; rewire supervisor | modifies existing Proxy + Supervision files | ~8 | No — depends on 9.2. |
| 9.4 | Update `BcdPduPipeline` to use correlation entries (drop `PerPlcContextWithRequest`); counter additions; status DTO + HTML updates | modifies pipeline + admin files | ~6 | No — depends on 9.3. |
| 9.5 | Full E2E test suite + design.md + CLAUDE.md doc updates | new test file + doc edits | ~6 | Test-writing yes (slice 9.5 skeleton can land in parallel with 9.1); the doc edits at the end are sequential after 9.3. |
**Total:** ~35 hours. With one parallel agent producing slice 9.1's data types and another sketching the e2e test fixtures during slice 9.5-prep, calendar time can compress to ~28 hours.
## Outputs (new files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # single backend conn owner; mux logic
src/Mbproxy/Proxy/Multiplexing/UpstreamPipe.cs # per-upstream-client reader/writer
src/Mbproxy/Proxy/Multiplexing/TxIdAllocator.cs # 16-bit allocator with wrap tracking
src/Mbproxy/Proxy/Multiplexing/CorrelationMap.cs # proxyTxId → InFlightRequest
src/Mbproxy/Proxy/Multiplexing/InFlightRequest.cs # the correlation record
src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Multiplexing/TxIdAllocatorTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/CorrelationMapTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/PlcMultiplexerTests.cs # integration, real sockets
tests/Mbproxy.Tests/Proxy/Multiplexing/RewriterCorrelationTests.cs # rewriter w/ multiplexed paths
tests/Mbproxy.Tests/Proxy/Multiplexing/MultiplexerE2ETests.cs # against pymodbus sim
```
## Files modified (existing files in this phase)
```
src/Mbproxy/Proxy/PlcListener.cs # owns PlcMultiplexer; accept loop hands sockets to it
src/Mbproxy/Proxy/PlcConnectionPair.cs # DELETED — replaced by UpstreamPipe + Multiplexer
src/Mbproxy/Proxy/IPduPipeline.cs # PduContext gains in-flight correlation entry
src/Mbproxy/Proxy/PerPlcContext.cs # delete PerPlcContextWithRequest; replaced by InFlightRequest passed per-call
src/Mbproxy/Proxy/BcdPduPipeline.cs # FC03/04 response decodes via InFlightRequest, not last-request slot
src/Mbproxy/Proxy/ProxyCounters.cs # new fields: InFlightCount, MaxInFlight, TxIdWraps, BackendDisconnectCascades, BackendQueueDepth
src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs # supervises mux lifecycle alongside listener
src/Mbproxy/Admin/StatusDto.cs # PlcBackendStatus gains the new mux fields
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate mux fields from counters
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show inFlight/max-in-flight in the per-PLC row
docs/design.md # rewrite Connection model + Failure modes for multiplexed reality
mbproxy/CLAUDE.md # flip Architecture summary's connection-model bullet
docs/kpi.md # update operational notes referring to 4-client cap
```
## Tasks
### 9.1 Data types (pure logic)
1. **`TxIdAllocator`**: `internal sealed class TxIdAllocator`. State: `_inUse` (`bool[65536]` for O(1) lookup; ~64 KB), `_next` (`ushort`), `_inFlightCount` (`int`, matching the `InFlightCount` property; the maximum of 65,536 fits comfortably), `_wrapCount` (`long`). Methods:
- `bool TryAllocate(out ushort id)` — atomic via `lock` (the allocator is per-PLC, contention is low). Scans forward from `_next` for the next free slot; sets `_inUse[id] = true`; bumps `_next`. Returns `false` if `_inFlightCount == 65536` (saturated; emit `mbproxy.multiplex.saturated` Error and let caller decide to drop or queue).
- `void Release(ushort id)` — clears `_inUse[id]`; decrements `_inFlightCount`.
- `int InFlightCount { get; }`, `long WrapCount { get; }` — for telemetry.
- **Wrap counter:** increment whenever `_next` rolls over `0xFFFF → 0x0000`.
2. **`InFlightRequest` + `InterestedParty`** — `InterestedParty` is `internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId)`. `InFlightRequest` is `internal sealed record InFlightRequest(byte UnitId, byte Fc, ushort StartAddress, ushort Qty, IReadOnlyList<InterestedParty> InterestedParties, DateTimeOffset SentAtUtc)`. Carries enough state for: (a) restoring each party's original TxId on the way back, (b) the FC03/04 correlation the rewriter needs (start/qty), (c) routing the response to each interested upstream socket, (d) round-trip-time measurement.
**In Phase 9 `InterestedParties` always contains exactly one element.** The list shape is forward-compat with [Phase 10 — read coalescing](10-read-coalescing.md), which extends the same record to fan-out responses to multiple upstream clients without further refactor of the multiplexer's data model. Resist any reviewer suggestion to simplify it back to a single `UpstreamPipe Upstream` field — the list shape is the load-bearing foundation for Phase 10.
3. **`CorrelationMap`**: wraps a `ConcurrentDictionary<ushort, InFlightRequest>`. Methods: `bool TryAdd(ushort, InFlightRequest)`, `bool TryRemove(ushort, out InFlightRequest)`, `int Count { get; }`, `IReadOnlyCollection<InFlightRequest> Snapshot()` (for diagnostics; allocates a list). Adds come from the per-upstream read tasks (via `OnFrame`) and removes from the single backend reader task; `ConcurrentDictionary` makes that pattern safe without extra locking and keeps it safe if/when we add upstream-side cancellation.
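The allocator described in task 1 can be sketched as follows. This is a minimal illustration under the stated design (lock-guarded, forward scan from `_next`), not the shipped implementation; the `_gate` field and the `Advance` helper are names assumed for the sketch.

```csharp
using System;

// Sketch only: follows the state and method names listed in task 1.
internal sealed class TxIdAllocator
{
    private readonly object _gate = new();          // assumed name for the lock object
    private readonly bool[] _inUse = new bool[65536];
    private ushort _next;
    private int _inFlightCount;
    private long _wrapCount;

    public bool TryAllocate(out ushort id)
    {
        lock (_gate)
        {
            if (_inFlightCount == _inUse.Length)
            {
                id = 0;
                return false; // saturated; the caller decides whether to drop or queue
            }

            // At least one slot is free here, so the scan always terminates.
            while (_inUse[_next])
            {
                Advance();
            }

            id = _next;
            _inUse[id] = true;
            _inFlightCount++;
            Advance();
            return true;
        }
    }

    public void Release(ushort id)
    {
        lock (_gate)
        {
            if (!_inUse[id]) return; // releasing a non-allocated id is a no-op
            _inUse[id] = false;
            _inFlightCount--;
        }
    }

    public int InFlightCount { get { lock (_gate) return _inFlightCount; } }
    public long WrapCount { get { lock (_gate) return _wrapCount; } }

    // Moves the cursor forward and counts each 0xFFFF -> 0x0000 rollover.
    private void Advance()
    {
        if (_next == ushort.MaxValue) { _next = 0; _wrapCount++; }
        else { _next++; }
    }
}
```

Note that a freed id is only handed out again once the cursor comes back around to it, which keeps recently used TxIds off the wire for as long as possible.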
### 9.2 Multiplexer + UpstreamPipe
4. **`UpstreamPipe`** — `internal sealed class UpstreamPipe : IAsyncDisposable`. One instance per accepted upstream socket. Fields: `Socket _upstream`, `Guid _id`, `IPEndPoint _remoteEp`, `DateTimeOffset _connectedAtUtc`, `volatile bool _alive`, `Channel<byte[]> _responseChannel` (capacity 16). Two tasks:
- **Read task**: pumps inbound MBAP frames from `_upstream` to a per-pipe `OnFrame` callback (registered by the multiplexer).
- **Write task**: drains `_responseChannel` and writes each frame back to `_upstream`.
On fault: sets `_alive = false`, closes the socket, the multiplexer notices on next correlation lookup and drops responses bound for this pipe.
5. **`PlcMultiplexer`** — `internal sealed class PlcMultiplexer : IAsyncDisposable`. One instance per PLC. Fields: backend `Socket`, `TxIdAllocator`, `CorrelationMap`, `Channel<byte[]> _outboundChannel` (cap 256), `PerPlcContext _ctx` (tag map + counters + logger), list of attached `UpstreamPipe`s. Two backend tasks plus a fan-in:
- **Backend writer task**: drains `_outboundChannel` → writes to backend socket. Single writer; no synchronization on the socket needed.
- **Backend reader task**: reads MBAP frames from backend → looks up `proxyTxId` in `CorrelationMap` → calls `pipeline.Process(ResponseToClient, header, pdu, ctx with InFlight)` → for each `InterestedParty` in `InFlightRequest.InterestedParties` (always exactly one in Phase 9; list-of-N once Phase 10 ships): writes a copy of the frame with that party's `OriginalTxId` restored in the MBAP header to the party's `UpstreamPipe._responseChannel` (or drops silently for that party if its pipe is `_alive = false`) → `CorrelationMap.TryRemove(proxyTxId)` + `TxIdAllocator.Release(proxyTxId)`.
- **Per-upstream `OnFrame`**: invoked by each `UpstreamPipe`'s read task. Steps:
1. Parse MBAP: original TxId, length, unitId, PDU.
2. `TryAllocate` a proxyTxId. If saturated, write a Modbus exception response (Slave Device Failure, code 04) back to upstream and continue.
3. Build `InFlightRequest` (parse FC/start/qty from PDU if FC03/04 — needed for FC06 too if we want the symmetric correlation later).
4. `TryAdd` to correlation map.
5. Call `pipeline.Process(RequestToBackend, ...)` to apply BCD rewriting.
6. Overwrite MBAP TxId bytes with proxyTxId.
7. Enqueue the modified frame into `_outboundChannel`.
6. **Backend disconnect handling** — when the backend reader/writer task throws (socket closed, network reset, etc.):
- Stop both tasks; close the backend socket.
- Walk the correlation map; for each entry, close that entry's `UpstreamPipe` (cascade). Increment `BackendDisconnectCascades` by the upstream-pipe count.
- Clear correlation map and TxIdAllocator.
- The supervisor's Polly pipeline takes over for backend reconnect — when the next upstream request arrives, the multiplexer attempts a fresh backend connection through the Polly pipeline.
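Steps 1 and 3 of the per-upstream `OnFrame` path (parse the MBAP header, then pull FC/start/qty out of a read PDU) can be sketched as below. The byte layout is standard Modbus TCP; the `MbapHeader` and `MbapFrame` names are hypothetical and do not come from the codebase.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper types for illustration only.
internal readonly record struct MbapHeader(ushort TxId, ushort ProtocolId, ushort Length, byte UnitId);

internal static class MbapFrame
{
    public const int HeaderSize = 7; // TxId(2) + ProtocolId(2) + Length(2) + UnitId(1)

    // Step 1: split one complete frame into header fields and the PDU.
    public static bool TryParse(ReadOnlySpan<byte> frame, out MbapHeader header, out ReadOnlySpan<byte> pdu)
    {
        header = default;
        pdu = default;
        if (frame.Length < HeaderSize + 1) return false; // need at least a function code

        header = new MbapHeader(
            BinaryPrimitives.ReadUInt16BigEndian(frame),
            BinaryPrimitives.ReadUInt16BigEndian(frame.Slice(2)),
            BinaryPrimitives.ReadUInt16BigEndian(frame.Slice(4)),
            frame[6]);

        // The Length field counts the unit id plus the PDU bytes that follow it.
        if (header.Length < 2 || frame.Length != 6 + header.Length) return false;

        pdu = frame.Slice(HeaderSize);
        return true;
    }

    // Step 3: start/qty are extracted for the read function codes (FC03/FC04) only.
    public static bool TryParseRead(ReadOnlySpan<byte> pdu, out byte fc, out ushort start, out ushort qty)
    {
        fc = 0; start = 0; qty = 0;
        if (pdu.Length != 5 || (pdu[0] != 0x03 && pdu[0] != 0x04)) return false;

        fc = pdu[0];
        start = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(1));
        qty = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(3));
        return true;
    }
}
```

All multi-byte fields are big-endian on the wire, which is why the sketch uses `BinaryPrimitives` rather than `BitConverter`.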
### 9.3 Listener + supervisor refactor
7. **`PlcListener.RunAsync`** — accept loop changes:
- One `PlcMultiplexer` per listener (constructed in `PlcListenerSupervisor` and handed in).
- On accept: wrap the socket in `UpstreamPipe`, register with the multiplexer via `mux.Attach(pipe)`.
- On listener stop: dispose the multiplexer (which closes the backend + all attached pipes).
- `ActivePairs` property → renamed `ActiveUpstreams` returning the multiplexer's list of attached `UpstreamPipe`s. Status page consumes this.
8. **Delete `PlcConnectionPair.cs`** — entire file. The replacement is `UpstreamPipe` + `PlcMultiplexer`. No backwards-compat shims; we're moving cleanly.
9. **`PlcListenerSupervisor`** — gains ownership of `PlcMultiplexer` alongside the listener. The Polly listener-recovery pipeline is unchanged; the multiplexer has its own internal Polly backend-connect pipeline (same `ResilienceOptions.BackendConnect` shape as today, just owned by the mux instead of the pair).
### 9.4 Rewriter + counters + status page
10. **`BcdPduPipeline`** — the FC03/04 response path stops reading `PerPlcContextWithRequest.LastRequestStart/Qty`. Instead, the multiplexer attaches an `InFlightRequest` to the `PduContext` for each response call:
```csharp
public sealed class PerPlcContext : PduContext {
    public BcdTagMap TagMap { get; init; }
    public ProxyCounters Counters { get; init; }
    public ILogger Logger { get; init; }
    // NEW: non-null on response, null on request.
    // Declared internal because InFlightRequest is an internal type.
    internal InFlightRequest? CurrentRequest { get; init; }
}
```
Concurrency: each backend response is handled on the backend reader task; the request path is handled by the per-upstream read task. Different `InFlightRequest` instances → no contention.
11. **Drop `PerPlcContextWithRequest`** entirely. The last-request-slot pattern was a 1:1-model workaround; the correlation map subsumes it.
12. **`ProxyCounters` additions:**
- `InFlightCount` (`long` snapshot of `CorrelationMap.Count`)
- `MaxInFlight` (`long`, peak observed; maintained with an `Interlocked.CompareExchange` loop, since .NET has no `Interlocked.Max`)
- `TxIdWraps` (`long` from `TxIdAllocator.WrapCount`)
- `BackendDisconnectCascades` (`long`)
- `BackendQueueDepth` (snapshot of `_outboundChannel.Reader.Count`)
13. **Status page** — `StatusDto.PlcBackendStatus` gains `InFlight`, `MaxInFlight`, `TxIdWraps`, `DisconnectCascades`, `QueueDepth`. `StatusSnapshotBuilder` populates them. `StatusHtmlRenderer` adds a column or compact `[3/256]` indicator per PLC row. The JSON field names land in camelCase per the existing source-gen convention.
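Tracking the `MaxInFlight` peak from several tasks needs an atomic "raise to maximum" operation. A minimal sketch using a compare-and-swap loop follows; the `InterlockedEx` class name is hypothetical.

```csharp
using System.Threading;

internal static class InterlockedEx // hypothetical helper name
{
    // Raises `location` to `candidate` when candidate is larger; returns the resulting peak.
    public static long Max(ref long location, long candidate)
    {
        long seen = Volatile.Read(ref location);
        while (candidate > seen)
        {
            long prior = Interlocked.CompareExchange(ref location, candidate, seen);
            if (prior == seen) return candidate; // our write won
            seen = prior;                         // lost the race; re-check against the newer value
        }
        return seen;
    }
}
```

A caller would invoke something like `InterlockedEx.Max(ref _maxInFlight, currentInFlight)` right after each successful `TryAdd`; the `_maxInFlight` field name is likewise assumed.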
### 9.5 Tests + docs
14. **Unit + integration test suites** (see Tests required below).
15. **`docs/design.md` updates:**
- **Connection model** section: rewrite. The diagram changes from "many clients → many backend sockets" to "many clients → one backend socket per PLC, multiplexed by proxy TxId rewriting." The operational consequence warning flips: instead of "5th client fails," it becomes "if backend disconnects, all attached upstream clients are cascaded closed; they reconnect on their own next request."
- **Failure modes** section: amend to describe the cascade behaviour.
- **Rewriter** section: amend to note the rewriter consumes `InFlightRequest` for response correlation (no architectural change, just an update to the description of how correlation flows).
16. **`mbproxy/CLAUDE.md`** Architecture summary: first bullet flips from "1:1 upstream-client ↔ backend-socket" to "single backend socket per PLC, multiplexed via MBAP TxId rewriting."
17. **`docs/kpi.md`** — the "Tier 2 → Connection-cap saturation warning" KPI loses its meaning (4-client cap no longer relevant on the upstream side). Either remove it or repurpose to track in-flight saturation against the 16-bit TxId space (which never realistically saturates but is the new equivalent ceiling).
## Public surface declared in this phase
All `internal sealed` — the multiplexer types are not consumed outside the assembly.
```csharp
namespace Mbproxy.Proxy.Multiplexing;

internal sealed class TxIdAllocator {
    public bool TryAllocate(out ushort id);
    public void Release(ushort id);
    public int InFlightCount { get; }
    public long WrapCount { get; }
}

internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);

internal sealed record InFlightRequest(
    byte UnitId, byte Fc,
    ushort StartAddress, ushort Qty,
    IReadOnlyList<InterestedParty> InterestedParties,
    DateTimeOffset SentAtUtc);
// Phase 9: InterestedParties.Count is always 1.
// Phase 10 (read coalescing): the same record fans out to N parties without further refactor.

internal sealed class CorrelationMap {
    public bool TryAdd(ushort proxyTxId, InFlightRequest req);
    public bool TryRemove(ushort proxyTxId, out InFlightRequest req);
    public int Count { get; }
    public IReadOnlyCollection<InFlightRequest> Snapshot();
}

internal sealed class UpstreamPipe : IAsyncDisposable {
    public Guid Id { get; }
    public IPEndPoint RemoteEp { get; }
    public DateTimeOffset ConnectedAtUtc { get; }
    public long PdusForwardedCount { get; }
    public bool IsAlive { get; }
    public Task RunReadLoopAsync(Func<byte[], Task> onFrame, CancellationToken ct);
    public ValueTask SendResponseAsync(byte[] frame, CancellationToken ct);
    public ValueTask DisposeAsync();
}

internal sealed class PlcMultiplexer : IAsyncDisposable {
    public void Attach(UpstreamPipe pipe);
    public IReadOnlyCollection<UpstreamPipe> AttachedPipes { get; }
    public Task RunAsync(CancellationToken ct);
    public ValueTask DisposeAsync();
}
```
`PerPlcContext` gains a nullable `CurrentRequest` property. `PerPlcContextWithRequest` is removed (along with its `LastRequest*` slots).
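As an illustration of how thin the `CorrelationMap` surface above can be, here is a sketch that wraps `ConcurrentDictionary` directly. The `InFlightRequest` record in the sketch is a trimmed stand-in (it omits `InterestedParties`) so that the block compiles on its own, and `_byTxId` is an assumed field name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;

// Trimmed stand-in for illustration; the real record also carries InterestedParties.
internal sealed record InFlightRequest(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty, DateTimeOffset SentAtUtc);

internal sealed class CorrelationMap
{
    private readonly ConcurrentDictionary<ushort, InFlightRequest> _byTxId = new();

    // Fails (returns false) if the proxy TxId is already registered.
    public bool TryAdd(ushort proxyTxId, InFlightRequest req) => _byTxId.TryAdd(proxyTxId, req);

    public bool TryRemove(ushort proxyTxId, [MaybeNullWhen(false)] out InFlightRequest req) =>
        _byTxId.TryRemove(proxyTxId, out req);

    public int Count => _byTxId.Count;

    // Diagnostics only: copies the current values into a new list.
    public IReadOnlyCollection<InFlightRequest> Snapshot() =>
        new List<InFlightRequest>(_byTxId.Values);
}
```

`ConcurrentDictionary.Count` takes all internal locks, so it is best read from the status snapshot path rather than on every request.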
## Tests required
### Unit (`Category = Unit`)
**`TxIdAllocatorTests`** (≥ 8 tests):
1. `Allocate_FromEmpty_Returns_NextSequential`
2. `Allocate_AfterRelease_Reuses_FreedId`
3. `Allocate_AllocatesEveryUshort_BeforeWrapping`
4. `Allocate_WrapsCorrectly_After0xFFFF`
5. `Allocate_WhenSaturated_ReturnsFalse_DoesNotThrow`
6. `Release_OfNonAllocated_IsNoOp`
7. `Concurrent_AllocateRelease_NoDuplicateIds_Under_Parallel_Stress` (100 tasks, 1000 ops each)
8. `WrapCount_IncrementsOnEachFullWrap`
**`CorrelationMapTests`** (≥ 5 tests):
1. `TryAdd_Then_TryRemove_RoundTrips`
2. `TryAdd_DuplicateKey_Fails`
3. `TryRemove_OfMissing_ReturnsFalse`
4. `Snapshot_ReflectsCurrentState`
5. `Concurrent_AddRemove_NoDataLoss_Under_Parallel_Stress`
**`PlcMultiplexerTests`** (≥ 7 tests, real sockets, no simulator):
1. `SingleUpstream_RoundTripsFC03_Through_Multiplexer`
2. `SingleUpstream_RoundTripsFC06_Through_Multiplexer`
3. `TwoUpstreams_ConcurrentFC03_BothGetCorrectResponses` — proves TxId rewriting works end-to-end against a stub backend
4. `TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire` — sniff the backend socket; verify per-request TxIds are unique even when upstream TxIds collide
5. `UpstreamDisconnect_DoesNotAffectOtherUpstreams` — drop one client mid-flight; other client's response still arrives
6. `BackendDisconnect_CascadesToAllUpstreams` — kill backend; verify all upstream sockets close within 500 ms, `BackendDisconnectCascades` increments by N
7. `BackendReconnect_AfterCascade_NextUpstreamRequest_Succeeds`
**`RewriterCorrelationTests`** (≥ 4 tests):
1. `FC03Response_DecodedViaInFlightRequest_NotPerPairSlot`
2. `ConcurrentFC03_FromTwoUpstreams_DecodeCorrectly_NoCrossTalk` — set up two `InFlightRequest`s with different start addresses, deliver responses out of order; verify each decodes against its own request
3. `ConcurrentFC06_FromTwoUpstreams_EncodeCorrectly`
4. `ResponseForDeadUpstream_IsDropped_NoExceptionPropagates`
### Integration (`Category = Unit`, no simulator)
These use real `TcpListener` + `Socket` against a stub backend (a `TcpListener` that just echoes or canned-responds). They live in `PlcMultiplexerTests`.
### E2E (`Category = E2E`)
**`MultiplexerE2ETests`** (≥ 5 tests, against pymodbus simulator):
1. `E2E_FiveConcurrentClients_AllReadHR1072_AllGetDecoded_1234` — the headline test. Five NModbus clients connected to the proxy in parallel; pymodbus sim has the BCD register at 1072. All five get `1234`. With Phase 08's 1:1 model, the 5th client would fail at backend connect.
2. `E2E_TwentyConcurrent_FC03_Requests_AcrossThreeClients_AllSucceed`
3. `E2E_BackendDisconnect_DuringInflight_CascadesUpstream_AndRecovers` — kill the sim mid-flight (simulate by closing on its side); verify upstream clients see clean socket close; relaunch sim; new upstream connection succeeds.
4. `E2E_RewriterStillWorks_UnderMultiplexedThreeClients` — three clients each writing different decimal values to different BCD-configured addresses via FC06; verify sim's register state.
5. `E2E_StatusPage_Shows_InFlightAndMaxInFlight` — drive 4 concurrent reads, verify `/status.json` reports `inFlight >= 1` during the burst and `maxInFlight >= 4`.
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All 271 prior tests still green. Specifically: `Forward_FC03_HR1072_Returns_Decoded_1234`, `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`, `MbapTxId_IsPreservedEndToEnd`, and `MbapTxId_StillPreserved_AfterRewriting_20Consecutive` continue to pass against the multiplexed implementation. The MBAP-TxId-preserved tests are the **critical regression guard** — if multiplexing leaks proxy TxIds back to the client, these fail.
- [ ] All new unit tests pass (≥ 24 new across the four unit suites listed under "Tests required": 8 + 5 + 7 + 4).
- [ ] All new E2E tests pass (≥ 5).
- [ ] `Forward_FC03_HR1072_Returns_Decoded_1234` PASSES with 5 concurrent NModbus clients connected to the same proxy port. **This is THE phase test.**
- [ ] `PlcConnectionPair.cs` is gone. Grep for the type name across the solution returns zero hits.
- [ ] `PerPlcContextWithRequest` is gone. Grep returns zero hits.
- [ ] `docs/design.md` "Connection model" section is rewritten; the 1:1 model description is gone or moved into a "Historical: pre-Phase-09 model" footnote.
- [ ] `mbproxy/CLAUDE.md` Architecture summary's connection-model bullet is updated.
- [ ] Backend disconnect with N upstream clients in-flight: all N close within 500 ms; counter `BackendDisconnectCascades += N`.
- [ ] `mbproxy.multiplex.saturated` Error event fires if TxId allocator hits 65,536 in-flight. (Stress-test acceptable; manufacture by holding 65,536 pending responses against a stub backend.)
- [ ] Shutdown semantics still work: `ShutdownCoordinator` drains in-flight requests (now visible via `InFlightCount`, not `IsProcessing`).
- [ ] Status page renders the new fields; HTML page weight remains under 50 KB for 54 PLCs.
- [ ] CounterSnapshot's existing field set is preserved — only **added** fields, no renames or removals. Backwards-compat per the policy in `docs/kpi.md`.
## Out of scope
- **Foundation for future caching, not caching itself.** This phase establishes the chokepoint where any future caching or coalescing layer plugs in, but implements no caching of any kind. `InFlightRequest.InterestedParties` is shaped as a list specifically to make [Phase 10 — read coalescing](10-read-coalescing.md) additive without refactor; do not infer caching behavior from the list shape alone. Tier C-2 (short-TTL response cache) and Tier C-3 (periodic poll + cache) remain explicitly out of scope until their own design discussions and `design.md` updates land.
- **Per-tag read coalescing** — if two clients read the same register at the same time, Phase 9's multiplexer sends both requests. Coalescing them into one backend round-trip is the explicit goal of [Phase 10](10-read-coalescing.md), which plugs into the `InterestedParties` seam created here.
- **Backend keepalive / heartbeat** — the design's current "no keepalive" position stands. An idle backend with no upstream activity will die after middlebox timeouts; the next upstream request triggers a fresh connect via Polly. Multiplexing doesn't change this.
- **TxId fairness scheduling** — FIFO order in the `_outboundChannel` is the contract. No round-robin per upstream, no priority. If a single upstream client floods the channel, others queue behind. This is a stated trade-off and matches the ECOM's internal serialization anyway.
- **Pipelined multi-PDU-in-flight per single upstream client** — still unsupported. One in-flight request per upstream pipe at a time. Multiplexing across DIFFERENT upstream clients works fully; multiplexing across multiple in-flight requests from the SAME upstream client does not. Document the constraint.
- **Linux / cross-platform packaging** — still Windows Service only.
## Subagent briefing
If you're the agent picking up this phase, here's the executive summary you need in your head:
1. **You are deleting `PlcConnectionPair`.** Everything that file did is now split between `UpstreamPipe` (the per-client half) and `PlcMultiplexer` (the per-PLC half). Read `PlcConnectionPair.cs` once before you delete it — every behavior in there has a destination in one of the two new classes.
2. **Single-writer / single-reader on the backend socket.** Two tasks share the backend socket: one writes (drained from `_outboundChannel`), one reads (decodes MBAP frames). No third task touches the socket. This invariant is what makes the channel + dictionary design correct without locks.
3. **The rewriter doesn't know about MBAP framing or correlation.** It still receives `(direction, mbapHeader span, pdu span, PerPlcContext ctx)`. The only addition is `ctx.CurrentRequest` (nullable, non-null on response). The rewriter is otherwise unchanged. Resist refactoring it.
4. **`InFlightRequest.SentAtUtc` powers `lastRoundTripMs` correctly across multiplexed clients.** Today's EWMA is per-pair; under multiplexing, the timestamp moves to per-request. The status counter stays the same.
5. **Cascade-on-backend-disconnect is the most subtle behavior.** Get the test for it right early (`BackendDisconnect_CascadesToAllUpstreams`). It's the difference between "graceful failure" and "leaked upstream sockets that hold connections open until OS timeout."
6. **TxId allocator saturation is a real-world impossibility but a stress-test reality.** Hold 65,536 responses in a stub backend; the allocator must refuse the 65,537th cleanly with an exception response code 04, not crash.
7. **Update the docs in the SAME PR as the code.** `design.md` Connection model, `mbproxy/CLAUDE.md` Architecture summary, and `docs/kpi.md` connection-cap KPI either get rewritten or removed. Doc drift is a gate fail.
8. **Do NOT introduce parallel agents within this phase.** The cross-cut is too broad. If you have spare agent budget, slice 9.1 (data types + their unit tests) can run alongside slice 9.5 (e2e test scaffolding writing against the unchanged outer-shape contract) but the middle slices are sequential.
9. **The 4 critical regression tests** that must stay green:
- `Forward_FC03_HR1072_Returns_Decoded_1234`
- `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`
- `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`
- `MbapTxId_IsPreservedEndToEnd` ← THIS is the one that proves multiplexing is transparent.
10. **When in doubt, re-read `BcdPduPipeline.ProcessResponse`.** The FC03/04 correlation logic there is the most subtle existing code that you're touching. Walk through it with one upstream client in mind first, then mentally replay with two; both must work without code change to the pipeline (only the way `PerPlcContext.CurrentRequest` gets populated changes).
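Item 6 above asks the multiplexer to answer a saturated allocator with exception code 04 rather than crash. A hedged sketch of building that response frame from the request frame follows; `ModbusExceptionFrame` is a hypothetical helper name, while the frame layout (function code with the high bit set, followed by the exception code) is standard Modbus.

```csharp
using System;
using System.Buffers.Binary;

internal static class ModbusExceptionFrame // hypothetical helper name
{
    public const byte SlaveDeviceFailure = 0x04;

    // Builds the 9-byte Modbus TCP exception response that answers `requestFrame`.
    public static byte[] Build(ReadOnlySpan<byte> requestFrame, byte exceptionCode)
    {
        if (requestFrame.Length < 8)
        {
            throw new ArgumentException("Frame too short for an MBAP header plus function code.", nameof(requestFrame));
        }

        var response = new byte[9];
        requestFrame.Slice(0, 4).CopyTo(response);                     // echo TxId + ProtocolId
        BinaryPrimitives.WriteUInt16BigEndian(response.AsSpan(4), 3);  // unit id + fc + exception code
        response[6] = requestFrame[6];                                  // unit id
        response[7] = (byte)(requestFrame[7] | 0x80);                   // function code with the error bit set
        response[8] = exceptionCode;
        return response;
    }
}
```

Because the response echoes the upstream client's own TxId, it can be written straight to that client's pipe without touching the correlation map.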
## Cross-references
- Today's 1:1 model: [`../design.md`](../design.md) → "Connection model" (will be rewritten by this phase).
- DL260 4-client cap source: [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Behavioral Oddities".
- Existing rewriter request→response correlation: `src/Mbproxy/Proxy/BcdPduPipeline.cs` `ProcessResponse` (lines reading `PerPlcContextWithRequest.LastRequest*`).
- Polly pipelines this phase reuses without modification: `src/Mbproxy/Proxy/Supervision/PolicyFactory.cs`.
- Counter-snapshot backwards-compat policy: [`../kpi.md`](../kpi.md) → "Backwards-compat policy".
# Phase 10 — Read coalescing (in-flight only, zero staleness)
When two or more upstream clients send the same FC03/FC04 request to the same PLC while a matching request is already in flight, attach the late arrivals to the existing in-flight entry and fan out the single backend response to all attached clients. Operates entirely within the in-flight window (microseconds to ~10 ms typical) — no post-response caching, no TTL, no staleness contract change.
**Status:** post-1.0 follow-on, depends on Phase 9.
**Depends on:** Phase 09 (multiplexer + `InFlightRequest` with `InterestedParties` list shape).
**Parallel-safe with:** nothing. The phase modifies `PlcMultiplexer.OnFrame` and the backend reader fan-out path; both are tightly coupled.
## Goal
Phase 9's multiplexer routes every upstream request individually, even when two upstream clients are asking for identical data. In a fleet of 54 PLCs where the HMI, historian, and engineering workstation all poll the same screen tags every second, that's up to 3× redundant backend traffic per overlapping read — and the H2-ECOM100's single-request-per-scan internal serialization means redundant traffic compounds into measurable backend latency.
Phase 10 detects same-key reads within the in-flight window and serves them from a single backend response. Coalescing operates entirely between "first request sent to backend" and "response received from backend." Once the response is fanned out, the coalescing entry dies. No values are held past the response arrival; no invalidation logic; no design-doc change to the "not a polling/cache layer" stance.
## Why this is safe — the zero-staleness argument
A coalesced response is a value the backend was going to return to the first request anyway. By the time the second client's request arrives, the first request is already on the wire to the PLC. The PLC's response represents the register values at the moment the PLC serviced the request. Had the second request been sent on its own backend round-trip, the H2-ECOM100's internal serialization would have queued it behind the first and serviced it at most one PLC scan (≈ 2-10 ms) later, so it would have returned the same value unless the register changed inside that one-scan window.
In other words: the only thing Phase 10 changes is whether the proxy sends one or two requests to the PLC. The answer the upstream clients see is the same to within one PLC scan, and the second client receives it sooner than in the "two requests" alternative because it does not wait for a second backend round-trip.
## Outputs (new files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs # readonly record struct
src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs # ConcurrentDictionary wrapper with atomic attach-or-create
src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Multiplexing/CoalescingKeyTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/InFlightByKeyMapTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/ReadCoalescingTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/ReadCoalescingE2ETests.cs
```
## Files modified (existing files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # OnFrame learns coalescing path; reader fans out
src/Mbproxy/Proxy/ProxyCounters.cs # new: CoalescedHitCount, CoalescedMissCount, CoalescedResponseToDeadUpstream
src/Mbproxy/Options/ResilienceOptions.cs # new: ReadCoalescing sub-options
src/Mbproxy/Admin/StatusDto.cs # PlcBackendStatus gains coalescing fields
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate new fields
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show coalescing ratio in per-PLC row
docs/design.md # Rewriter section: note FC03/04 may be coalesced before reaching backend
docs/kpi.md # graduate "coalescing ratio" KPI from future to supported
install/mbproxy.config.template.json # add the new Resilience.ReadCoalescing section with comments
```
`InFlightRequest.cs` does **not** change — the `InterestedParties` list shape was specifically introduced in Phase 9 to make this phase additive.
## Tasks
### 10.1 Data types
1. **`CoalescingKey`** — `readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty)`. Hash key for the in-flight-by-key map. Auto-generated record-struct equality. Verify hashcode distribution is reasonable for typical V-memory address ranges (smoke-test in unit tests).
2. **`InFlightByKeyMap`** — wraps `ConcurrentDictionary<CoalescingKey, InFlightRequest>` plus a small lock for atomic attach-or-create. Methods:
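- `bool TryAttachOrCreate(CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory, int maxParties, out InFlightRequest req, out bool wasNew)`: atomic. If the key exists and `req.InterestedParties.Count < maxParties`, build a new `IReadOnlyList<InterestedParty>` with the party appended (the record is immutable, so a new `InFlightRequest` carrying the extended list replaces the old one in the map) and return `wasNew=false`; otherwise call the factory to build a new entry, store it, and return `wasNew=true`. The backend reader fans out from the record it removes from `CorrelationMap`, so the substituted record must be visible there as well (or the reader must take its record from this map); otherwise late-attached parties never receive the response.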
- `bool TryRemove(CoalescingKey key, out InFlightRequest req)` — called by the backend reader after fan-out completes.
- The "attach to existing" path is the load-bearing concurrency primitive of this phase. The simpler implementation is a small `lock` around the attach branch. The lock-free implementation is a `TryGetValue` / `TryUpdate` retry loop (`TryUpdate` takes the comparison value; `AddOrUpdate` has no comparand overload). Pick the simpler one; document the choice in code.
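The "simpler implementation" mentioned above might look like the sketch below. It is an illustration under assumptions, not the planned code: it guards a plain `Dictionary` with one lock instead of wrapping `ConcurrentDictionary`, it runs the factory while holding that lock, and it always returns `true` because the plan does not say what a `false` return would mean. The four type declarations at the top are stand-ins so the block compiles alone.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;

// Stand-ins for the Phase 9 types, declared here only so the sketch is self-contained.
internal sealed class UpstreamPipe { }
internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);
internal sealed record InFlightRequest(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty,
    IReadOnlyList<InterestedParty> InterestedParties, DateTimeOffset SentAtUtc);
internal readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal sealed class InFlightByKeyMap
{
    private readonly object _gate = new();
    private readonly Dictionary<CoalescingKey, InFlightRequest> _byKey = new();

    public bool TryAttachOrCreate(
        CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory,
        int maxParties, out InFlightRequest req, out bool wasNew)
    {
        lock (_gate)
        {
            if (_byKey.TryGetValue(key, out var existing)
                && existing.InterestedParties.Count < maxParties)
            {
                // The record is immutable: copy the party list, append, and swap the entry.
                var parties = new List<InterestedParty>(existing.InterestedParties) { party };
                req = existing with { InterestedParties = parties };
                _byKey[key] = req;
                wasNew = false;
                return true;
            }

            // No entry, or the entry is full: start a fresh backend request for this key.
            req = factory();
            _byKey[key] = req;
            wasNew = true;
            return true;
        }
    }

    public bool TryRemove(CoalescingKey key, [MaybeNullWhen(false)] out InFlightRequest req)
    {
        lock (_gate) return _byKey.Remove(key, out req);
    }
}
```

Holding the lock across the factory call keeps attach-or-create atomic at the cost of serializing new-request setup per PLC, which is acceptable given that the backend serializes requests anyway.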
### 10.2 Multiplexer integration
3. **Request path** in `PlcMultiplexer.OnFrame`:
```csharp
bool coalesceCandidate = (fc is 0x03 or 0x04)
    && resilienceOptions.CurrentValue.ReadCoalescing.Enabled;

if (coalesceCandidate)
{
    var key = new CoalescingKey(unitId, fc, startAddr, qty);
    var party = new InterestedParty(upstreamPipe, originalTxId);

    InFlightRequest? req;
    bool wasNew;
    inFlightByKey.TryAttachOrCreate(
        key, party,
        factory: () => BuildAndRegisterNew(unitId, fc, startAddr, qty, party),
        maxParties: resilienceOptions.CurrentValue.ReadCoalescing.MaxParties,
        out req, out wasNew);

    if (!wasNew)
    {
        counters.IncrementCoalescedHit();
        return; // do NOT send to backend; the first request will get the response
    }

    counters.IncrementCoalescedMiss();
    // fall through: factory already allocated proxyTxId + added to correlation map + sent
    return;
}

// FC06/FC16 or coalescing disabled: existing Phase 9 path (allocate, register, send).
```
The factory closure does the existing Phase 9 work (TxId allocate, correlation map add, MBAP rewrite, send to outbound channel). The new code only adds the "is this already in-flight?" check before that work.
4. **Response fan-out** in the backend reader task — already shaped correctly by Phase 9; this phase just makes sure the `CoalescingKey` matching the response is also removed from `InFlightByKeyMap` alongside the `CorrelationMap` removal:
```csharp
if (correlationMap.TryRemove(proxyTxId, out var req))
{
    txIdAllocator.Release(proxyTxId);

    // Also clear the coalescing key so a new identical request after this point starts fresh.
    var key = new CoalescingKey(req.UnitId, req.Fc, req.StartAddress, req.Qty);
    inFlightByKey.TryRemove(key, out _);

    // Phase 9's fan-out loop; it already iterates InterestedParties.
    foreach (var party in req.InterestedParties)
    {
        if (!party.Pipe.IsAlive)
        {
            counters.IncrementCoalescedResponseToDeadUpstream();
            continue;
        }

        var partyFrame = WithTxId(responseFrame, party.OriginalTxId);
        // SendResponseAsync is the method declared on UpstreamPipe in Phase 9.
        await party.Pipe.SendResponseAsync(partyFrame, ct);
    }
}
```
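The fan-out loop above calls a `WithTxId` helper that this plan does not define. One plausible shape, assuming it returns a fresh copy per party rather than mutating the shared response buffer, is sketched here; the containing `MbapTxId` class name is hypothetical.

```csharp
using System;
using System.Buffers.Binary;

internal static class MbapTxId // hypothetical home for the helper
{
    // Returns a copy of `frame` whose MBAP transaction id (bytes 0-1, big-endian) is `txId`.
    // Copying matters under coalescing: every interested party gets its own buffer.
    public static byte[] WithTxId(ReadOnlySpan<byte> frame, ushort txId)
    {
        if (frame.Length < 2)
        {
            throw new ArgumentException("Frame too short to carry a transaction id.", nameof(frame));
        }

        byte[] copy = frame.ToArray();
        BinaryPrimitives.WriteUInt16BigEndian(copy, txId);
        return copy;
    }
}
```

With exactly one interested party the copy is avoidable, so an implementation could overwrite the bytes in place on that path and copy only when `InterestedParties.Count > 1`.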
### 10.3 Configuration
5. **Extend `ResilienceOptions`:**
```csharp
public sealed class ReadCoalescingOptions
{
    public bool Enabled { get; init; } = true;
    public int MaxParties { get; init; } = 32;
}

public sealed class ResilienceOptions
{
    public RetryProfile BackendConnect { get; init; } = new();
    public RecoveryProfile ListenerRecovery { get; init; } = new();
    public ReadCoalescingOptions ReadCoalescing { get; init; } = new();  // ← new
}
```
Hot-reloadable via the existing `IOptionsMonitor<MbproxyOptions>` wiring. Disabling `Enabled` at runtime means new requests take the non-coalescing path; existing in-flight coalesced entries drain naturally.
6. **`mbproxy.config.template.json` update** — add a commented `ReadCoalescing` block to the install template under `Resilience` with the two new keys, default values, and a one-paragraph explanation.
### 10.4 Counters and status surfacing
7. **`ProxyCounters` additions:**
```csharp
public void IncrementCoalescedHit();
public void IncrementCoalescedMiss();
public void IncrementCoalescedResponseToDeadUpstream();
```
`CounterSnapshot` gains `CoalescedHitCount`, `CoalescedMissCount`, `CoalescedResponseToDeadUpstream` — all `long`, all Interlocked. The status page derives `coalescingRatio = Hit / (Hit + Miss)` for display; the raw counts are exposed in JSON for downstream tooling.
8. **`/status.json` per-PLC fields** — extend `PlcBackendStatus`:
```csharp
public sealed record PlcBackendStatus(
long ConnectsSuccess, long ConnectsFailed,
ExceptionCounts ExceptionsByCode,
double LastRoundTripMs,
long CoalescedHitCount, // ← new
long CoalescedMissCount, // ← new
long CoalescedResponseToDeadUpstream); // ← new
```
9. **HTML page** — extend the per-PLC row with a compact `Coal: 73%` cell (`hit / (hit+miss) * 100`, rounded). Page-weight assertion (under 50 KB for 54 PLCs) must continue to pass.
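The `hit / (hit+miss) * 100` derivation has one edge case the plan does not spell out: a PLC that has seen no FC03/FC04 reads yet. A small sketch, assuming the renderer shows a placeholder in that case; `CoalescingCounters` and `RatioPercent` are hypothetical names, not the real `ProxyCounters` surface:

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the coalescing slice of ProxyCounters.
internal sealed class CoalescingCounters
{
    private long _hit;
    private long _miss;

    public void IncrementCoalescedHit() => Interlocked.Increment(ref _hit);
    public void IncrementCoalescedMiss() => Interlocked.Increment(ref _miss);

    public (long Hit, long Miss) Snapshot() =>
        (Interlocked.Read(ref _hit), Interlocked.Read(ref _miss));

    // Returns null when nothing has been counted yet, so the caller can
    // render a placeholder instead of dividing by zero.
    public static int? RatioPercent(long hit, long miss)
    {
        long total = hit + miss;
        if (total == 0) return null;
        return (int)Math.Round(100.0 * hit / total, MidpointRounding.AwayFromZero);
    }
}
```

Under the headline stress case (5 identical reads, 2 misses, 3 hits) this yields 60, which clears the 50 % gate.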
### 10.5 Documentation
10. **`docs/design.md` Rewriter section:** add a paragraph clarifying that FC03/FC04 requests may be coalesced with other in-flight requests of the same `(unitId, fc, start, qty)` before reaching the backend. Emphasize that the transparency contract holds — each client sees its own original TxId restored on the response, and the response value is identical to what an uncoalesced request would have returned (within the PLC's scan-time precision).
11. **`docs/kpi.md` Tier 1:** the new `coalescedHitCount`, `coalescedMissCount`, derived `coalescingRatio` graduate from "future" to "supported" Tier 1 fields. Mention the `coalescedResponseToDeadUpstream` counter as a low-priority Tier 2 informational metric.
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Multiplexing;
internal readonly record struct CoalescingKey(
byte UnitId, byte Fc, ushort StartAddress, ushort Qty);
internal sealed class InFlightByKeyMap
{
public bool TryAttachOrCreate(
CoalescingKey key,
InterestedParty party,
Func<InFlightRequest> factory,
int maxParties,
out InFlightRequest req,
out bool wasNew);
public bool TryRemove(CoalescingKey key, out InFlightRequest req);
public int Count { get; }
}
```
```csharp
namespace Mbproxy.Options;
public sealed class ReadCoalescingOptions
{
public bool Enabled { get; init; } = true;
public int MaxParties { get; init; } = 32;
}
// Added field on existing ResilienceOptions:
public ReadCoalescingOptions ReadCoalescing { get; init; } = new();
```
`ProxyCounters` and `CounterSnapshot` gain three new `long` fields. No public-surface removals, no renames.
## Tests required
### Unit (`Category = Unit`)
**`CoalescingKeyTests`** (≥ 4 tests):
1. `Equality_OnIdenticalKeys_ReturnsTrue`
2. `Equality_OnDifferentFc_ReturnsFalse` — FC03 vs FC04 with same start/qty/unit are NOT equal (different Modbus tables).
3. `Equality_OnDifferentUnitId_ReturnsFalse`
4. `HashCode_DistributionSanity` — build 10,000 randomly-generated keys, bucket by `Key.GetHashCode() & 0xFF`, assert no bucket has > 5 % of total (rough uniformity check).
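The bucket check in test 4 could be sketched as follows. This is an illustration under assumptions: the key is redeclared locally so the snippet stands alone, the RNG is seeded for repeatability, and `Qty` is drawn from 1..125 (the Modbus limit for FC03/FC04 reads). `HashBuckets` is a hypothetical helper name:

```csharp
using System;
using System.Linq;

// Local copy of the key shape declared in this phase, so the sketch is self-contained.
internal readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal static class HashBuckets
{
    // Returns the share of keys that landed in the fullest of 256 buckets.
    public static double MaxBucketShare(int keyCount, int seed)
    {
        var rng = new Random(seed);
        var buckets = new int[256];
        for (int i = 0; i < keyCount; i++)
        {
            var key = new CoalescingKey(
                (byte)rng.Next(256),
                (byte)(rng.Next(2) == 0 ? 0x03 : 0x04),
                (ushort)rng.Next(65536),
                (ushort)rng.Next(1, 126));
            buckets[key.GetHashCode() & 0xFF]++;
        }
        return buckets.Max() / (double)keyCount;
    }
}
```

The test would then assert `MaxBucketShare(10_000, seed) <= 0.05`. With 256 buckets the expected share is about 0.4 %, so the 5 % bound only catches gross clustering.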
**`InFlightByKeyMapTests`** (≥ 6 tests):
1. `TryAttachOrCreate_NewKey_CallsFactory_ReturnsTrue_WasNewTrue`
2. `TryAttachOrCreate_ExistingKey_AppendsParty_ReturnsTrue_WasNewFalse`
3. `TryAttachOrCreate_ExistingKey_AtMaxParties_CreatesFreshEntry_NotAppend` — refuses to fan out beyond the cap; preserves backend-load-shedding guarantee.
4. `TryRemove_AfterAttach_AllPartiesPresent_InRetrievedEntry`
5. `TryRemove_OfMissing_ReturnsFalse`
6. `Concurrent_AttachOrCreate_From_Two_Threads_NoLostParties_AndNoDuplicateEntries` — 100 tasks × 1000 ops each.
**`ReadCoalescingTests`** (≥ 7 tests, real sockets, stub backend):
1. `TwoClients_SameRequest_OnlyOneBackendRoundTrip` — stub backend counts received requests; assert 1.
2. `TwoClients_DifferentRequests_BothHitBackend` — different start addresses; assert 2.
3. `FiveClients_SameRequest_OneBackendRoundTrip_FiveResponses` — fan-out works correctly with 5 attached parties.
4. `FC03_And_FC04_SameAddress_NOT_Coalesced` — different tables.
5. `FC06_Write_NeverCoalesced` — writes always allocate their own TxId.
6. `OneClient_DisconnectsMidFlight_OthersStillGetResponse_AndDeadUpstreamCounterIncrements`
7. `AtMaxParties_NextRequest_StartsFreshBackendRoundTrip` — verify the cap behaviour: when `MaxParties = 2` and 3 simultaneous clients send the same request, the third opens a new in-flight entry rather than joining the first.
### E2E (`Category = E2E`)
**`ReadCoalescingE2ETests`** (≥ 5 tests, against pymodbus simulator, `[Collection(nameof(DL205SimulatorCollection))]`):
1. `E2E_FiveConcurrentClients_SameReadHR1072_CoalescedHitCount_AtLeast_3` — five NModbus clients connect to the proxy, simultaneously read HR1072 (BCD-configured). Assert `coalescedHitCount >= 3` (race wiggle room — perfect coalescing would give 4 hits, but the racy first-arrivals can both miss).
2. `E2E_RewriterStillWorks_ForAllCoalescedParties` — same setup, but with BCD tag at 1072. All five clients receive decoded `1234`. Proves the rewriter sees a coalesced response correctly and the TxId restoration doesn't perturb the BCD bytes.
3. `E2E_DifferentRegisters_NotCoalesced_CoalescedHitCount_Zero` — five clients reading five different addresses; assert no coalescing happened.
4. `E2E_StatusPage_Shows_CoalescingRatio` — `/status.json` for the test PLC has populated `coalescedHitCount` and `coalescedMissCount` after the burst.
5. `E2E_DisableViaHotReload_RevertToPhase9Behaviour` — write a temp appsettings with `ReadCoalescing.Enabled = false`, hot-reload, verify subsequent identical reads each hit the backend separately (counter doesn't increment).
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All prior tests still green — specifically the **4 critical Phase-9 regression guards**:
- `Forward_FC03_HR1072_Returns_Decoded_1234`
- `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`
- `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`
- `MbapTxId_IsPreservedEndToEnd`
- [ ] All new unit + e2e tests pass (≥ 17 new).
- [ ] **Headline assertion:** 5 concurrent FC03 reads of the same register through the proxy produce **at most 2** backend round-trips (allowing one race for the initial pair). Verifiable via stub-backend's request counter in `ReadCoalescingTests`.
- [ ] FC04 reads of the same address as a coexisting FC03 stream do NOT coalesce together. Verified by an explicit test.
- [ ] FC06 / FC16 writes are NEVER on the coalescing path. Verified by setting `MaxParties = 1` and confirming write throughput is unaffected.
- [ ] Coalescing-ratio counter ≥ 50 % under the headline stress test (5 simultaneous identical reads).
- [ ] Disabling coalescing via `Mbproxy.Resilience.ReadCoalescing.Enabled = false` hot-reloads cleanly; running coalesced entries drain naturally without errors.
- [ ] `docs/design.md` Rewriter section mentions the coalescing path; `docs/kpi.md` Tier 1 includes the new fields; `install/mbproxy.config.template.json` includes the new commented `Resilience.ReadCoalescing` block.
- [ ] HTML page weight under 50 KB for 54 PLCs (verify with the existing renderer test).
## Out of scope
- **Post-response caching** — no TTL, no staleness window beyond "while the request is in flight." This phase is strictly in-flight. A response-cache phase would be a separate plan (Phase 11+) and would require the design.md "not a cache layer" stance to be revisited and rewritten.
- **Range-overlap coalescing** — request A reading [100..110], request B reading [105..115]. Different keys; no coalescing. Range-overlap detection is a separate optimisation with its own algorithmic complexity (interval trees, etc.) and its own staleness and response-shaping questions (A's response carries registers 100..104, which B never requested, and lacks 111..115, which B needs, so B cannot simply be served from A's frame).
- **Cross-PLC coalescing** — each PLC's multiplexer has its own key map. No optimization across PLCs (their backend connections are independent anyway).
- **Write coalescing / batching** — different problem with non-idempotency concerns. The design doc's "no mid-request retry on writes" principle extends to "no write coalescing."
- **Predictive batching** — combining a single client's likely-next read into the current request. Out of scope; speculative reads are a different optimization category.
- **Adaptive `MaxParties`** — staying at the configured value. Auto-tuning is interesting but speculative.
## Subagent briefing
If you're the agent picking up this phase:
1. **Phase 9's `InterestedParties` list is the seam.** This phase only adds the "look up the key, attach a new party to an existing entry" logic. The fan-out side already iterates the list correctly. If you find yourself rewriting Phase 9's response path, you've drifted out of scope.
2. **`CoalescingKey` includes `UnitId`.** DL260 fleets typically use unit 1, but we don't assume — different unit IDs are different PLC personalities behind the same TCP socket and must not coalesce.
3. **FC03 and FC04 are different tables.** Same register address space in DL series, but Modbus treats them separately. Different `CoalescingKey` for the same address; no coalescing across them.
4. **Coalescing is best-effort under races.** Two simultaneous identical requests can both miss the map and create separate entries — counter just shows a lower ratio. Not a bug; documented behaviour. Do not over-engineer with double-checked locking.
5. **`MaxParties` is the load-shedding safety valve.** If a thousand HMI panels all attach to one in-flight request, the response fan-out cost goes linear with attachment count and stalls the backend reader task. Cap at 32 by default. Past the cap, route through a fresh entry — fan-out cost per entry is bounded.
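6. **The attach-or-create operation MUST be atomic per key.** Two simultaneous arrivals must not both create new entries for the same key (would defeat coalescing). The simpler implementation: `lock(map.SyncRoot)` around the attach branch. The lock-free implementation uses `AddOrUpdate` with the updateFactory checking the count cap; if that is `ConcurrentDictionary.AddOrUpdate`, be aware that it can invoke its factory delegates more than once under contention, so side effects such as TxId allocation and the backend send must not run unguarded inside them. Pick whichever you can write correctly in 30 minutes; document the choice.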
7. **Response fan-out must check `Pipe.IsAlive` per party.** An upstream client that disconnects between attaching and the response arriving — count it as `CoalescedResponseToDeadUpstream` and continue with the others. Do not throw, do not log per-occurrence at Information (would be too noisy under client churn).
8. **Hot-reload of `Enabled` doesn't disrupt in-flight entries.** Disabling the feature mid-flight just means subsequent requests take the non-coalescing path. Existing coalesced entries drain when their response arrives. Don't try to "flush" them on the reload event.
9. **`CoalescedHit + CoalescedMiss = total FC03+FC04 requests`.** The math has to balance per snapshot. Use `Interlocked.Increment` exclusively. Disabling coalescing means every FC03/04 request becomes a Miss (which is fine — the metric still tracks total reads).
10. **Update `design.md` AND `kpi.md` AND the install template in the same PR as the code.** Doc drift is a gate failure. The coalescing-ratio KPI specifically graduates from "future" to "Tier 1 supported" — make that promotion explicit in `kpi.md`.
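Briefing item 6's simpler, lock-based option can be sketched as below. This is an illustration, not the Phase 10 implementation: `InterestedParty` and `InFlightRequest` are hypothetical stand-ins for the Phase 9 types, the factory is assumed to register the first party itself (as `BuildAndRegisterNew` does in task 3), and a private gate object replaces `map.SyncRoot`:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;

internal readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

// Hypothetical stand-ins for the Phase 9 types; the real ones carry more state.
internal sealed record InterestedParty(string PipeName, ushort OriginalTxId);

internal sealed class InFlightRequest
{
    private readonly List<InterestedParty> _parties = new();
    public IReadOnlyList<InterestedParty> InterestedParties => _parties;
    internal void Attach(InterestedParty party) => _parties.Add(party);
}

internal sealed class InFlightByKeyMap
{
    private readonly object _gate = new();
    private readonly Dictionary<CoalescingKey, InFlightRequest> _map = new();

    public bool TryAttachOrCreate(
        CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory,
        int maxParties, out InFlightRequest req, out bool wasNew)
    {
        lock (_gate)
        {
            if (_map.TryGetValue(key, out var existing)
                && existing.InterestedParties.Count < maxParties)
            {
                existing.Attach(party);
                req = existing;
                wasNew = false;
                return true;
            }

            // Key missing, or the existing entry is at the cap: start a fresh entry.
            // Assumption: the factory attaches `party` itself and runs under the lock,
            // so it should only enqueue work, never block on I/O.
            req = factory();
            _map[key] = req; // a full entry is replaced as the attach target
            wasNew = true;
            return true;
        }
    }

    public bool TryRemove(CoalescingKey key, [MaybeNullWhen(false)] out InFlightRequest req)
    {
        lock (_gate) { return _map.Remove(key, out req); }
    }

    public int Count { get { lock (_gate) return _map.Count; } }
}
```

One open point in this sketch: once a full entry has been replaced, a `TryRemove` by key alone drops whichever entry is currently mapped, so a real implementation would probably compare entry identity before removing.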
## Cross-references
- Phase 9's multiplexer is the foundation. The `InterestedParty` and `InterestedParties` types live there: [`09-txid-multiplexing.md`](09-txid-multiplexing.md).
- KPI graduation target: [`../kpi.md`](../kpi.md) → Tier 1 (rates / percentiles / availability — coalescing-ratio joins this tier).
- Modbus unit-ID semantics that make coalescing-key uniqueness load-bearing: [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Function Code Support" and "Coils and Discrete Inputs".
- Counter snapshot backwards-compat policy that this phase respects (additive only): [`../kpi.md`](../kpi.md) → "Backwards-compat policy".
@@ -0,0 +1,374 @@
# Phase 11 — Short-TTL response cache (bounded staleness)
Cache FC03/FC04 responses with a per-tag TTL. Subsequent same-key reads within the TTL window are served from the cache without backend traffic. FC06/FC16 writes invalidate overlapping cache entries on the response side. **This phase is a deliberate design-contract change** — the proxy gains an opt-in cache layer with explicit bounded staleness.
**Status:** post-1.0 follow-on, depends on Phase 10. **Architectural pivot — read the "Design pivot" section below before scoping.**
**Depends on:** Phase 09 (multiplexer chokepoint), Phase 10 (`CoalescingKey` is reused as `CacheKey` — same shape).
**Parallel-safe with:** nothing.
## Design pivot — do NOT skip this section
Phases 09 and 10 were additive performance optimisations that preserved the design's "transparent inline proxy" contract. **Phase 11 is different.** It changes the load-bearing claim in `docs/design.md`:
- **Today's contract** (lines 12-20 of `design.md`): *"The service is not a polling/cache layer. It is a transparent Modbus TCP proxy whose job is to rewrite the configured BCD tags in real time, in both directions, while proxying every other byte of the MBTCP connection untouched."*
- **Post-Phase-11 contract:** the proxy is *optionally* a cache layer within a bounded TTL. The TTL is per-tag, default 0 (no caching), opt-in by operator action.
Implication: **Task 1 of this phase is rewriting the relevant `design.md` sections.** The contract update is a code commit too — review, land first, then build the implementation against the new contract. Shipping cache code while design.md still says "not a cache layer" is a gate failure, not a merge-it-and-fix-later situation.
The cache is **OFF by default**. A fresh post-Phase-11 deployment with no TTL configuration behaves identically to a Phase-10 deployment. The opt-in shape (per-tag `CacheTtlMs` configuration) means a deployment can adopt Phase 11 without changing semantics until an operator explicitly opts a tag in.
## Goal
Reduce backend Modbus traffic for the common SCADA case where many clients poll the same registers at near-identical cadences. Phase 10 already coalesces within the in-flight window (~10 ms). Phase 11 extends the "served without backend traffic" window from the in-flight milliseconds to operator-configurable seconds.
Concretely: with `CacheTtlMs = 1000` on a frequently-read BCD tag, the backend sees at most one read of that tag per second per PLC regardless of how many upstream clients are polling.
## What it does NOT do
- **No active polling.** Cache entries are populated on demand by upstream reads, not by proactive polling. (Active polling is Tier C-3 from the conversation history — a separate phase if ever wanted.)
- **No predictive prefetching.**
- **No SCADA-style subscription/notification model.**
- **No write-back caching.** Writes always go straight through to the backend; cache invalidation happens on the write-response side, not by intercepting the write.
- **No cross-PLC caching.** Each PLC's cache is independent.
- **No persistence.** Process restart wipes the cache. Cache survives backend disconnects (the cached data was fresh when stored; disconnects don't retroactively invalidate it).
## Outputs (new files)
```
src/Mbproxy/Proxy/Cache/CacheKey.cs # reuses CoalescingKey shape; type-aliased or redeclared
src/Mbproxy/Proxy/Cache/CacheEntry.cs # response bytes + expiry + lastFetched
src/Mbproxy/Proxy/Cache/ResponseCache.cs # the cache itself; TTL-based eviction, LRU under cap
src/Mbproxy/Proxy/Cache/CacheInvalidator.cs # address-range-overlap matcher for write invalidation
src/Mbproxy/Proxy/Cache/CacheLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Cache/CacheKeyTests.cs
tests/Mbproxy.Tests/Proxy/Cache/CacheEntryTests.cs
tests/Mbproxy.Tests/Proxy/Cache/ResponseCacheTests.cs
tests/Mbproxy.Tests/Proxy/Cache/CacheInvalidatorTests.cs
tests/Mbproxy.Tests/Proxy/Cache/ResponseCacheE2ETests.cs
```
## Files modified
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # OnFrame: cache check BEFORE coalescing; OnResponse: cache store + write invalidation
src/Mbproxy/Options/BcdTagOptions.cs # add CacheTtlMs (default 0 = no caching)
src/Mbproxy/Options/PlcOptions.cs # add DefaultCacheTtlMs
src/Mbproxy/Options/MbproxyOptions.cs # add Cache section (AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs)
src/Mbproxy/Bcd/BcdTag.cs # carry CacheTtlMs on the record
src/Mbproxy/Bcd/BcdTagMapBuilder.cs # resolve per-tag TTL with per-PLC default fallback
src/Mbproxy/Proxy/ProxyCounters.cs # new: CacheHit, CacheMiss, CacheInvalidations, CacheEntryCount, CacheBytes
src/Mbproxy/Admin/StatusDto.cs # surface cache KPIs in PlcBackendStatus
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show cache-hit ratio per PLC row
src/Mbproxy/Configuration/ReloadValidator.cs # validate CacheTtlMs bounds; require AllowLongTtl=true for > 60s
docs/design.md # SUBSTANTIAL — see Task 1
docs/kpi.md # graduate cache KPIs from future to Tier 1
install/mbproxy.config.template.json # add CacheTtlMs examples + staleness commentary
mbproxy/CLAUDE.md # Architecture summary: add the cache-layer bullet
```
## Tasks
### 11.1 Design contract update — **DO THIS FIRST**
1. **`docs/design.md` updates** (review and land before writing implementation code):
**a. "What this is" section** — add the cache disclosure paragraph:
> As of Phase 11, the proxy gains an *optional* per-tag response cache with a bounded staleness window (`CacheTtlMs`). The cache is OFF by default (`CacheTtlMs = 0`) and must be opt-in per tag. With caching enabled, the proxy is no longer purely transparent — upstream reads may return a value up to `CacheTtlMs` milliseconds old. The 1:1 read-to-backend-request guarantee no longer holds; operators opting tags into caching MUST acknowledge the staleness bound.
**b. New section "Cache contract"** between "Rewriter" and "Failure modes":
- Cache populates on demand only. No polling.
- Cache entries carry their TTL with them. Entries older than their TTL are evicted on access and treated as misses.
- FC06/FC16 successful responses invalidate cache entries whose address range overlaps the write.
- Cache survives backend disconnects (cached data was valid at cache time).
- Cache does NOT survive process restart.
- Multi-tag read range: effective TTL is the minimum of all configured tags in the range. Any tag with TTL = 0 in the range disables caching for the whole read.
- Cache stores POST-rewriter bytes (BCD already decoded). Hits bypass the rewriter entirely.
**c. "Failure modes" section** — add bullet on cache behaviour during backend recovery:
- Cache hits remain valid during a `recovering` listener state. Data was fresh when cached; recovery only affects future requests.
- Invalidations during recovery: writes that arrive cannot reach the backend, so the invalidation never happens. This is consistent — the write didn't take effect either. Cache entries remain valid until their TTL expires.
**d. "Rewriter" section** — clarify that the rewriter runs on the cache-miss path (decode on store), and that cache hits return pre-decoded bytes without re-invoking the rewriter.
Treat (a)-(d) as one atomic change. Get them reviewed, land them, then implement against the new contract.
### 11.2 Cache key
2. **`CacheKey`** — same shape as Phase 10's `CoalescingKey`: `readonly record struct CacheKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty)`. If Phase 10 is already merged, prefer **a `using CacheKey = CoalescingKey;` alias** over a redefinition — same data, same hashing, single source of truth. If the two phases land together (Phase 10 + 11 in a coordinated release), consider renaming `CoalescingKey` → `ReadKey` to make the shared use site neutral.
### 11.3 Cache entry and storage
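3. **`CacheEntry`** — `internal sealed record CacheEntry(byte[] PduBytes, DateTimeOffset CachedAtUtc, DateTimeOffset ExpiresAtUtc, int Length, ushort LastUsedTick)`. `LastUsedTick` is a monotonic counter for LRU ordering (avoids `DateTimeOffset.UtcNow` calls on every cache access). Note that a `ushort` tick wraps after 65,536 increments, at which point LRU ordering breaks; widen it to `long` or handle wrap-around explicitly.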
4. **`ResponseCache`** — `internal sealed class ResponseCache : IDisposable`. Methods:
- `bool TryGet(CacheKey key, out CacheEntry entry)` — returns true ONLY if entry exists and `entry.ExpiresAtUtc > DateTimeOffset.UtcNow`. Updates `LastUsedTick` on hit. Expired entries removed lazily.
- `void Set(CacheKey key, CacheEntry entry)` — replaces any existing entry. If `Count >= MaxEntriesPerPlc`, evict the LRU entry first.
- `int Invalidate(byte unitId, ushort startAddress, ushort qty)` — delegates to `CacheInvalidator`. Returns count invalidated.
- `int Count { get; }`, `long ApproximateBytes { get; }`
- Background eviction loop (started in constructor, stopped in `Dispose`): every `EvictionIntervalMs` (default 5000), scans the map and removes entries past `ExpiresAtUtc`.
5. **`CacheInvalidator`** — pure logic: `static IEnumerable<CacheKey> FindOverlapping(IReadOnlyCollection<CacheKey> haystack, byte unitId, ushort writeStart, ushort writeQty)`. Returns keys whose range `[StartAddress, StartAddress + Qty)` intersects `[writeStart, writeStart + writeQty)`. Limit scope to keys matching `unitId` and `Fc in {3, 4}` (we never cache writes; invalidation only applies to read entries).
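The overlap rule in task 5 reduces to one comparison per key. A minimal sketch, assuming half-open ranges as described above; `CacheKey` is redeclared locally so the snippet stands alone, and the end addresses are widened to `int` because `StartAddress + Qty` can exceed `ushort.MaxValue` at the top of the address space:

```csharp
using System;
using System.Collections.Generic;

// Local copy of the key shape from task 2, so the sketch is self-contained.
internal readonly record struct CacheKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal static class CacheInvalidator
{
    public static IEnumerable<CacheKey> FindOverlapping(
        IReadOnlyCollection<CacheKey> haystack, byte unitId, ushort writeStart, ushort writeQty)
    {
        int writeEnd = writeStart + writeQty; // exclusive end, computed as int to avoid wrap-around
        foreach (var key in haystack)
        {
            if (key.UnitId != unitId) continue;
            if (key.Fc != 0x03 && key.Fc != 0x04) continue; // only read entries are ever cached

            int entryEnd = key.StartAddress + key.Qty; // exclusive end
            // Two half-open ranges intersect exactly when each starts before the other ends.
            if (key.StartAddress < writeEnd && writeStart < entryEnd)
                yield return key;
        }
    }
}
```

Because the method is lazy, `ResponseCache.Invalidate` would need to materialise the result (or snapshot the keys) before removing entries from the map it is iterating.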
### 11.4 Multiplexer integration
6. **Cache lookup in `PlcMultiplexer.OnFrame`** — for FC03/04 requests when the read range has a non-zero resolved TTL:
```csharp
if (fc is 0x03 or 0x04 && resolvedTtlMs > 0) {
var key = new CacheKey(unitId, fc, startAddr, qty);
if (cache.TryGet(key, out var entry)) {
counters.IncrementCacheHit();
// Build a fresh MBAP wrapper for this client and send.
var hitFrame = BuildResponseFrame(entry.PduBytes, originalTxId, unitId);
upstreamPipe.SendResponse(hitFrame);
return; // no coalescing check, no backend round-trip
}
counters.IncrementCacheMiss();
}
// Fall through to Phase 10 coalescing path → Phase 9 send path
```
**Order matters:** cache check FIRST, then coalescing. A cache hit short-circuits everything; only on a miss do we engage Phase 10's coalescing logic.
7. **Cache store on response** — in the backend reader fan-out path, AFTER the rewriter has run on the response:
```csharp
if (req.Fc is 0x03 or 0x04 && req.ResolvedCacheTtlMs > 0) {
var key = new CacheKey(req.UnitId, req.Fc, req.StartAddress, req.Qty);
var now = DateTimeOffset.UtcNow;
var entry = new CacheEntry(
PduBytes: rewrittenPduBytes.ToArray(), // defensive copy
CachedAtUtc: now,
ExpiresAtUtc: now.AddMilliseconds(req.ResolvedCacheTtlMs),
Length: rewrittenPduBytes.Length,
LastUsedTick: NextLruTick());
cache.Set(key, entry);
}
```
Note: `req.ResolvedCacheTtlMs` is computed at request-receive time by walking the BcdTagMap for tags in `[StartAddress, StartAddress + Qty)` and taking `min(CacheTtlMs)`. If any tag has TTL = 0, `ResolvedCacheTtlMs = 0` and the whole read is uncached.
8. **Cache invalidation on write response** — FC06 / FC16 successful response (NOT exception response):
```csharp
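// responseFc: function code byte of the backend response; bit 0x80 set marks a Modbus exception.
if (req.Fc is 0x06 or 0x10 && (responseFc & 0x80) == 0) {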
int invalidated = cache.Invalidate(req.UnitId, req.StartAddress, req.Qty);
if (invalidated > 0) {
counters.AddCacheInvalidations(invalidated);
CacheLogEvents.WriteInvalidatedEntries(logger, req.UnitId,
req.StartAddress, req.Qty, invalidated);
}
}
```
Invalidation is by ADDRESS RANGE OVERLAP, not by exact key match. A write to register 105 invalidates a cached read of [100..110] and a cached read of [105..115] but NOT a cached read of [200..210].
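The `ResolvedCacheTtlMs` rule from task 7 (minimum TTL across tags in the read range, with any zero disabling caching) could look like the sketch below. Two points are assumptions, not statements from this plan: `Width` is treated as the number of 16-bit registers a tag occupies, and a read that touches no configured tag resolves to 0 (uncached). `BcdTag` here is a hypothetical minimal shape and `CacheTtlResolver` a hypothetical name:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal tag shape; the real BcdTag carries more fields.
internal sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs);

internal static class CacheTtlResolver
{
    public static int Resolve(IEnumerable<BcdTag> tags, ushort startAddress, ushort qty)
    {
        int readEnd = startAddress + qty; // exclusive
        int? min = null;
        foreach (var tag in tags)
        {
            // Assumption: Width counts registers, so the tag covers [Address, Address + Width).
            int tagEnd = tag.Address + tag.Width;
            if (tag.Address >= readEnd || startAddress >= tagEnd) continue; // no overlap

            if (tag.CacheTtlMs == 0) return 0; // one uncached tag disables caching for the whole read
            min = min is null ? tag.CacheTtlMs : Math.Min(min.Value, tag.CacheTtlMs);
        }
        // Assumption: no configured tag in range means the read stays uncached.
        return min ?? 0;
    }
}
```

A linear walk is fine for a sketch; if the tag map is sorted by address, the real lookup can stop at the first tag past `readEnd`.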
### 11.5 Per-tag TTL configuration
9. **`BcdTagOptions` extension:**
```csharp
public sealed class BcdTagOptions {
public ushort Address { get; init; }
public byte Width { get; init; }
public int? CacheTtlMs { get; init; } // null = unset (falls back to PlcOptions.DefaultCacheTtlMs); 0 = explicitly no caching
}
```
10. **`PlcOptions.DefaultCacheTtlMs`** — applies to any tag whose explicit `CacheTtlMs` was not set (use a nullable `int?` on `BcdTagOptions` instead of `int = 0` to distinguish "explicitly zero" from "unset"). Default for the PLC default itself is 0.
11. **`MbproxyOptions.Cache` section:**
```csharp
public sealed class CacheOptions {
public bool AllowLongTtl { get; init; } = false; // gate for TTL > 60_000
public int MaxEntriesPerPlc { get; init; } = 1000;
public int EvictionIntervalMs { get; init; } = 5000;
}
```
12. **Validation** in `ReloadValidator`: `CacheTtlMs >= 0` always; `CacheTtlMs > 60_000` requires `Cache.AllowLongTtl = true`. Reject reloads that violate. Prevents "left at 1 hour by accident" deployments.
13. **`BcdTagMapBuilder.Build` resolution**: returns each `BcdTag` with `CacheTtlMs` resolved per fallback rules: explicit per-tag → per-PLC default → 0.
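Tasks 12 and 13 together amount to a null-coalescing fallback plus a bounds check. A compact sketch, assuming the nullable `int?` per-tag value described in task 10; `CacheTtlRules` and its member names are hypothetical:

```csharp
using System;

internal static class CacheTtlRules
{
    public const int LongTtlThresholdMs = 60_000;

    // Fallback chain: explicit per-tag value, then the per-PLC default (which itself defaults to 0).
    // An explicit 0 on the tag wins over a non-zero PLC default.
    public static int ResolveTagTtl(int? tagTtlMs, int plcDefaultTtlMs) =>
        tagTtlMs ?? plcDefaultTtlMs;

    // Validation rule: never negative; above 60 s only with the AllowLongTtl opt-in.
    public static bool IsValid(int cacheTtlMs, bool allowLongTtl) =>
        cacheTtlMs >= 0 && (cacheTtlMs <= LongTtlThresholdMs || allowLongTtl);
}
```

Validating the resolved value (after fallback) rather than the raw per-tag value would also catch an over-long `DefaultCacheTtlMs`; the plan does not say which of the two `ReloadValidator` checks.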
### 11.6 Counters and status surfacing
14. **`ProxyCounters` additions:**
- `CacheHitCount` (Interlocked long)
- `CacheMissCount` (Interlocked long)
- `CacheInvalidations` (Interlocked long)
- `CacheEntryCount` (snapshot from `ResponseCache.Count` — read-time)
- `CacheBytes` (snapshot from `ResponseCache.ApproximateBytes` — read-time)
15. **`StatusDto.PlcBackendStatus` extension:**
```csharp
public sealed record PlcBackendStatus(
long ConnectsSuccess, long ConnectsFailed,
ExceptionCounts ExceptionsByCode,
double LastRoundTripMs,
long CoalescedHitCount, long CoalescedMissCount, long CoalescedResponseToDeadUpstream, // Phase 10
long CacheHitCount, long CacheMissCount, // Phase 11
long CacheInvalidations, long CacheEntryCount, long CacheBytes); // Phase 11
```
16. **HTML page** — add a compact `Cache: 73%` cell per PLC row. Page-weight assertion (under 50 KB for 54 PLCs) must continue to pass.
### 11.7 Documentation and template
17. **`docs/kpi.md`** — graduate cache-hit-ratio KPIs from "deferred / future" to Tier 1 supported. Add `cacheEntryCount` and `cacheBytes` as Tier 2 memory-watch KPIs.
18. **`install/mbproxy.config.template.json`** — add a fully-commented `Mbproxy.Cache` section showing `AllowLongTtl`, `MaxEntriesPerPlc`, `EvictionIntervalMs`. Show example per-tag `CacheTtlMs: 1000` and per-PLC `DefaultCacheTtlMs: 500` entries. Include a prominent comment explaining the staleness contract: "**clients reading these tags will see values up to `CacheTtlMs` milliseconds old**".
19. **`mbproxy/CLAUDE.md` Architecture summary** — add a bullet:
> - **Optional response cache** with per-tag TTL (default 0 = off). Cached FC03/04 responses serve subsequent same-key reads without backend traffic; FC06/FC16 write responses invalidate overlapping entries by address range.
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Cache;
internal readonly record struct CacheKey(
byte UnitId, byte Fc, ushort StartAddress, ushort Qty);
internal sealed record CacheEntry(
byte[] PduBytes,
DateTimeOffset CachedAtUtc, DateTimeOffset ExpiresAtUtc,
int Length, ushort LastUsedTick);
internal sealed class ResponseCache : IDisposable {
public bool TryGet(CacheKey key, out CacheEntry entry);
public void Set(CacheKey key, CacheEntry entry);
public int Invalidate(byte unitId, ushort startAddress, ushort qty);
public int Count { get; }
public long ApproximateBytes { get; }
public void Dispose();
}
internal static class CacheInvalidator {
public static IEnumerable<CacheKey> FindOverlapping(
IReadOnlyCollection<CacheKey> haystack,
byte unitId, ushort writeStart, ushort writeQty);
}
```
```csharp
namespace Mbproxy.Options;
public sealed class CacheOptions {
public bool AllowLongTtl { get; init; } = false;
public int MaxEntriesPerPlc { get; init; } = 1000;
public int EvictionIntervalMs { get; init; } = 5000;
}
// Added field on MbproxyOptions:
public CacheOptions Cache { get; init; } = new();
// Added field on BcdTagOptions (nullable to distinguish "unset" from "explicitly 0"):
public int? CacheTtlMs { get; init; }
// Added field on PlcOptions:
public int DefaultCacheTtlMs { get; init; } = 0;
```
`ProxyCounters` and `CounterSnapshot` gain 5 new long fields. No public-surface removals or renames.
## Tests required
### Unit (`Category = Unit`)
**`CacheKeyTests`** (≥ 3 tests): equality across identical keys; FC03 vs FC04 differs; UnitId differs.
**`CacheEntryTests`** (≥ 3 tests): expired detection at boundary; immutability of `PduBytes`; LRU tick monotonicity.
**`CacheInvalidatorTests`** (≥ 6 tests, range-overlap math):
1. `FullOverlap_WriteCoversEntryRange_Invalidates`
2. `PartialOverlap_WriteStartsBeforeEntry_Invalidates`
3. `PartialOverlap_WriteEndsAfterEntry_Invalidates`
4. `Adjacent_NotOverlapping_DoesNotInvalidate` — write to `[10, 15)` does NOT invalidate cached `[15, 20)` (half-open intervals: `15` is the write's exclusive end, so it is not in the write's range).
5. `NoOverlap_DoesNotInvalidate`
6. `DifferentUnitId_DoesNotInvalidate`
**`ResponseCacheTests`** (≥ 8 tests):
1. `SetThenGet_RoundTrips`
2. `GetExpiredEntry_ReturnsFalse_AndRemoves` — uses a small TTL + `Task.Delay`
3. `Invalidate_OverlappingRange_RemovesMatching` — set 3 entries, invalidate a range overlapping 2 of them, verify Count drops by 2
4. `Invalidate_OnlyAffectsFc03Fc04_KeysWithFcOther_NotTouched` — FC06/FC16 entries should never be in the cache; this is a defensive test.
5. `Set_AtMaxEntries_EvictsLRU`
6. `LRU_TracksAccessOrder_Across_Get_And_Set`
7. `Concurrent_GetSet_NoDataRace` — 100 tasks, 1000 ops each
8. `Dispose_StopsEvictionLoop`
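Tests 2, 5 and 6 are easier to make deterministic if the cache takes its clock as a parameter instead of sleeping through `Task.Delay`. The sketch below illustrates that idea only; it is a deliberately simplified stand-in (single lock, no background eviction loop, `long` tick, byte array instead of `CacheEntry`) and `TinyResponseCache` is a hypothetical name, not the `ResponseCache` surface declared above:

```csharp
using System;
using System.Collections.Generic;

internal readonly record struct CacheKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

// Hypothetical simplified cache used only to illustrate TTL expiry and LRU eviction.
internal sealed class TinyResponseCache
{
    private sealed class Slot
    {
        public byte[] PduBytes = Array.Empty<byte>();
        public DateTimeOffset ExpiresAtUtc;
        public long LastUsedTick;
    }

    private readonly object _gate = new();
    private readonly Dictionary<CacheKey, Slot> _map = new();
    private readonly Func<DateTimeOffset> _utcNow; // injected clock (assumption, for testability)
    private readonly int _maxEntries;
    private long _tick;

    public TinyResponseCache(int maxEntries, Func<DateTimeOffset> utcNow)
    {
        _maxEntries = maxEntries;
        _utcNow = utcNow;
    }

    public int Count { get { lock (_gate) return _map.Count; } }

    public bool TryGet(CacheKey key, out byte[] pduBytes)
    {
        lock (_gate)
        {
            if (_map.TryGetValue(key, out var slot))
            {
                if (slot.ExpiresAtUtc > _utcNow())
                {
                    slot.LastUsedTick = ++_tick; // a hit refreshes the LRU position
                    pduBytes = slot.PduBytes;
                    return true;
                }
                _map.Remove(key); // lazy removal of an expired entry
            }
            pduBytes = Array.Empty<byte>();
            return false;
        }
    }

    public void Set(CacheKey key, byte[] pduBytes, TimeSpan ttl)
    {
        lock (_gate)
        {
            // Only evict when inserting a new key into a full map.
            if (!_map.ContainsKey(key) && _map.Count >= _maxEntries)
                EvictLeastRecentlyUsed();
            _map[key] = new Slot
            {
                PduBytes = pduBytes,
                ExpiresAtUtc = _utcNow() + ttl,
                LastUsedTick = ++_tick,
            };
        }
    }

    private void EvictLeastRecentlyUsed()
    {
        CacheKey victim = default;
        long oldest = long.MaxValue;
        foreach (var (key, slot) in _map)
            if (slot.LastUsedTick < oldest) { oldest = slot.LastUsedTick; victim = key; }
        _map.Remove(victim);
    }
}
```

With a clock the test can advance by hand, the expiry boundary (`ExpiresAtUtc > now`, so an entry is already expired at exactly its expiry instant) can be asserted without timing slack.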
### E2E (`Category = E2E`)
**`ResponseCacheE2ETests`** (≥ 7 tests, against pymodbus simulator):
1. `E2E_CacheHit_AfterFirstRead_NoBackendTraffic` — configure tag at HR1072 with `CacheTtlMs = 5000`; first read goes to backend; second read within 5 s hits cache. Verify via the simulator's HTTP introspection or by timing (cache hits return in roughly a millisecond or less; backend reads take ~10 ms).
2. `E2E_CacheExpires_AfterTtl_NextReadHitsBackend` — short TTL (e.g., 200 ms); after delay, second read goes to backend.
3. `E2E_WriteInvalidatesOverlappingCacheEntries` — read HR1072 (cache it), write to HR1072 with FC06, next read MUST miss cache and re-fetch.
4. `E2E_NonOverlappingWrite_DoesNotInvalidate` — read HR1072 (cache it), write to HR1080, next read of HR1072 still hits cache.
5. `E2E_BcdDecodedBytesAreCached_NotRawBcd` — cache hit returns the decoded `1234`, not `0x1234`. Proves the cache stores post-rewriter bytes.
6. `E2E_DisablingCache_ViaHotReload_FlushesEntries` — set `CacheTtlMs = 1000` on a tag, do a read (cached), hot-reload with `CacheTtlMs = 0`, next read must hit the backend even though the old entry is still within its TTL window.
7. `E2E_MultiTagRead_RangeWithZeroTtlTag_DisablesCaching` — read [100..110] where one tag in the range has `CacheTtlMs = 0`; verify no caching of the whole read.
## Phase gate
- [ ] **`docs/design.md` updates from Task 1 are merged FIRST** (or in the same PR). The contract change is not optional and not deferrable. Gate fail otherwise.
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All prior tests still green — the **4 critical Phase-9 regression guards** + **Phase 10's coalescing tests**.
- [ ] All new unit + e2e tests pass (≥ 25 new).
- [ ] **Default TTL = 0 → no observable behavior change vs Phase 10.** Verify: run the full Phase 10 test suite with the Phase 11 build; everything green.
- [ ] **Headline assertion (E2E):** configure `CacheTtlMs = 1000` on HR1072; issue 10 reads at 100 ms intervals; backend (stub or sim with introspection) sees exactly 1 backend round-trip.
- [ ] Write invalidation correctly handles all 6 range-overlap cases (full, two partial, adjacent, none, different-unit-id).
- [ ] Memory cap enforced: with `MaxEntriesPerPlc = 5`, 6 distinct cache inserts produce 5 entries (one LRU eviction observed).
- [ ] Validation rejects `CacheTtlMs > 60_000` unless `Cache.AllowLongTtl = true`.
- [ ] Hot-reload of `CacheTtlMs` flushes entries for the affected tag (or, simpler: flushes the entire cache for the PLC). Pick the simpler option (PLC-wide flush) and document.
- [ ] HTML page weight under 50 KB for 54 PLCs (verify with the existing renderer test).
- [ ] `docs/kpi.md` Tier 1 includes cache-hit-ratio.
- [ ] `install/mbproxy.config.template.json` includes the new `Mbproxy.Cache` block with the staleness commentary.
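
The memory-cap gate item can be pictured with a small least-recently-used map. This is a minimal sketch under stated assumptions: `LruSketch`, `Put`, `TryGet`, and `Evictions` are hypothetical names, and the real cache would also need TTL handling and thread safety, which are omitted here.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: LruSketch and its members are hypothetical names.
public sealed class LruSketch<TKey, TValue> where TKey : notnull
{
    private readonly int _capacity;
    // Front of the list is the most recently used entry.
    private readonly LinkedList<(TKey Key, TValue Value)> _order = new();
    private readonly Dictionary<TKey, LinkedListNode<(TKey Key, TValue Value)>> _index = new();

    public LruSketch(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    public int Count => _index.Count;
    public long Evictions { get; private set; }

    public void Put(TKey key, TValue value)
    {
        if (_index.Remove(key, out var existing))
        {
            // Overwrite: drop the old node, no eviction needed.
            _order.Remove(existing);
        }
        else if (_index.Count >= _capacity)
        {
            // At capacity: evict the least recently used entry (list tail).
            var oldest = _order.Last!;
            _order.RemoveLast();
            _index.Remove(oldest.Value.Key);
            Evictions++;
        }
        _index[key] = _order.AddFirst((key, value));
    }

    public bool TryGet(TKey key, out TValue value)
    {
        if (_index.TryGetValue(key, out var node))
        {
            // A hit refreshes recency by moving the node to the front.
            _order.Remove(node);
            _order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default!;
        return false;
    }
}
```

With a capacity of 5, six `Put` calls on distinct keys leave `Count` at 5 and `Evictions` at 1, which is the shape of the assertion the gate item describes.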
## Out of scope
- **Active polling** — cache populates on demand only. No background poll loop.
- **Predictive prefetching** — no speculative reads.
- **Range-overlap coalescing of cache entries** — if reads `[100..110]` and `[105..115]` are both cached, no attempt to merge them into one `[100..115]` entry. Same-key only.
- **Cross-PLC caching** — each PLC's cache is independent. No optimisation across PLCs.
- **Persistence** — process restart wipes the cache. No file/Redis backing store.
- **Cache warming** — no pre-populating the cache from a snapshot, last-known-good file, etc.
- **TTL > 60 seconds without explicit `AllowLongTtl` opt-in** — refused at validation.
- **Adaptive TTL** — operator-configured only. No auto-tuning.
## Subagent briefing
If you're the agent picking up this phase:
1. **Task 1 is design.md, not code.** The contract update is the gate. Do not write the cache code until the design changes have been reviewed and merged (or are in the same PR with explicit reviewer attention). A reviewer who lands the code without the design update has failed the gate, and so have you.
2. **Default TTL = 0 means default behavior = Phase 10 unchanged.** Critical for backwards-compat. Every existing test that doesn't set `CacheTtlMs` must continue to pass without modification.
3. **Cache stores POST-rewriter bytes.** The rewriter runs once on the cache-miss path; subsequent hits return cached decoded bytes directly. Do not re-invoke the rewriter on hits — wastes CPU and changes nothing.
4. **Write-invalidation is by ADDRESS RANGE OVERLAP, not by exact key match.** A write to register 105 invalidates a cached read of `[100..110]`. Use half-open interval math: write `[w, w+q)` overlaps entry `[s, s+n)` iff `w < s+n && s < w+q`.
5. **Multi-tag read range: effective TTL is `min(TTLs)`.** If any tag in the read range has TTL = 0, the whole read is uncached. Conservative-by-design.
6. **Cache lookup happens BEFORE coalescing.** Order: cache check → cache miss → coalescing check (Phase 10) → backend send (Phase 9). A cache hit short-circuits everything.
7. **`CacheKey` is structurally identical to `CoalescingKey`.** Prefer aliasing over redefinition. If the two phases land together, rename the shared type to `ReadKey` to make the joint use site neutral.
8. **MBAP TxId restoration on cache-hit responses.** The cache stores the PDU bytes (post-rewriter); on hit, build a fresh MBAP wrapper with the requesting client's `OriginalTxId`. There's no cached MBAP — the per-request TxId is supplied by the upstream pipe's request.
9. **Hot-reload of `CacheTtlMs`: flush the whole PLC cache on any tag-list change.** Tag-level granularity is technically possible but complicates the reload code path. The simple correctness move is "any tag-list change to this PLC → drop all cached entries for this PLC and let them re-populate." Document the choice.
10. **Eviction loop: `PeriodicTimer` + cancellation token.** Not `System.Timers.Timer`. The cache is `IDisposable`; the loop honours `Dispose`.
11. **Update `docs/design.md` AND `docs/kpi.md` AND `mbproxy/CLAUDE.md` AND `install/mbproxy.config.template.json` IN THE SAME PR AS THE CODE.** Doc drift is a gate fail. The architectural pivot must be visible across all reader-facing surfaces.
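
Briefing items 4, 5, and 8 each reduce to a few lines of arithmetic or framing. The sketch below is illustrative only: `CacheBriefingSketch` and its method names are hypothetical, and treating an empty tag list as TTL 0 is an assumption rather than something the plan states. The MBAP layout follows the standard Modbus TCP header (transaction ID, protocol ID, length, unit ID).

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper names for illustration; not the project's actual types.
public static class CacheBriefingSketch
{
    // Item 4: write [w, w+q) overlaps cached read [s, s+n) iff w < s+n && s < w+q.
    // Address math only; unit-ID matching is a separate check.
    public static bool Overlaps(int w, int q, int s, int n) => w < s + n && s < w + q;

    // Item 5: the effective TTL of a multi-tag read is the minimum per-tag TTL.
    // A result of 0 means "do not cache". Empty input maps to 0 (assumption).
    public static int EffectiveTtlMs(IEnumerable<int> tagTtlsMs) =>
        tagTtlsMs.DefaultIfEmpty(0).Min();

    // Item 8: wrap a cached PDU in a fresh MBAP header carrying the requester's TxId.
    // MBAP (big-endian): TxId(2) ProtocolId(2)=0 Length(2) UnitId(1), then the PDU.
    public static byte[] WrapInMbap(ushort originalTxId, byte unitId, ReadOnlySpan<byte> pdu)
    {
        var frame = new byte[7 + pdu.Length];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), originalTxId);
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);
        // Length counts the bytes that follow it: unit ID plus PDU.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), (ushort)(pdu.Length + 1));
        frame[6] = unitId;
        pdu.CopyTo(frame.AsSpan(7));
        return frame;
    }
}
```

For an 11-register cached read starting at 100, `Overlaps(105, 1, 100, 11)` is true and `Overlaps(111, 1, 100, 11)` is false, so an adjacent write leaves the entry alone. Because the header is rebuilt per request, two clients hitting the same cached PDU each receive their own TxId.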
## Cross-references
- Phase 9's multiplexer is the chokepoint that hosts the cache check: [`09-txid-multiplexing.md`](09-txid-multiplexing.md).
- Phase 10's `CoalescingKey` is the same shape as Phase 11's `CacheKey`: [`10-read-coalescing.md`](10-read-coalescing.md).
- The "not a polling/cache layer" stance that this phase pivots away from: [`../design.md`](../design.md) → "What this is" + "Purpose".
- KPI graduation target: [`../kpi.md`](../kpi.md) → Tier 1 (cache-hit-ratio joins this tier).
- Resolution rules for per-tag `CacheTtlMs` (Global / Add / Remove fallback + per-PLC default): [`../design.md`](../design.md) → "Hybrid tag resolution".
+107
@@ -0,0 +1,107 @@
# mbproxy — implementation plan
Phase-by-phase implementation plan for the `mbproxy` service. Each phase is a self-contained work spec with explicit deliverables, tests, and a gate checklist that must be green before the next phase begins. Settled against the design plan in [`../design.md`](../design.md) on 2026-05-13.
**Briefing a subagent for a phase:** hand it exactly three documents — the phase doc, [`../design.md`](../design.md), and [`../../DL260/dl205.md`](../../DL260/dl205.md). Tell it not to read other phase docs unless its own doc lists them under "Cross-references". The phase doc IS the contract.
## Phase graph
| # | Phase | Depends on | Parallel-safe with |
|---|-------|------------|--------------------|
| 00 | [Bootstrap](00-bootstrap.md) — host + DI + Serilog + options POCOs | — | (must run first, alone) |
| 01 | [Simulator harness](01-simulator-harness.md) — pymodbus xUnit fixture | 00 | 02, 03 |
| 02 | [BCD codec](02-bcd-codec.md) — pure encode/decode logic | 00 | 01, 03 |
| 03 | [Proxy plumbing](03-proxy-plumbing.md) — TcpListener + 1:1 byte forwarder | 00 | 01, 02 |
| 04 | [Rewriter integration](04-rewriter-integration.md) — wire codec into proxy | 02, 03 | — |
| 05 | [Listener supervisor](05-listener-supervisor.md) — Polly auto-recovery | 03 | — |
| 06 | [Hot-reload](06-hot-reload.md) — `IOptionsMonitor` reconcile | 05 | — |
| 07 | [Status page](07-status-page.md) — Kestrel admin endpoint | 05, 06 | — |
| 08 | [Service hardening](08-service-hardening.md) — Windows service + shutdown | 04, 07 | — |
| 09 | [TxId multiplexing](09-txid-multiplexing.md) — single backend connection per PLC (post-1.0 follow-on) | 04, 05, 07 | — |
| 10 | [Read coalescing](10-read-coalescing.md) — in-flight FC03/04 dedup (post-1.0 follow-on) | 09 | — |
| 11 | [Response cache](11-response-cache.md) — short-TTL post-response cache, bounded staleness (post-1.0; **design-contract pivot**) | 10 | — |
```
        ┌── 01 (sim) ──────┐
00 ─────┼── 02 (codec) ────┼── 04 ──┐
        └── 03 (plumbing) ─┴── 05 ─── 06 ─── 07 ─── 08
                                    └─────────────────→ 09 ───→ 10 ───→ 11 (post-1.0)
```
**Phases 09, 10, and 11 are post-1.0 follow-ons**, not part of the initial 1.0 release.
- **Phase 09** rewires the connection layer to lift the H2-ECOM100's 4-concurrent-client cap as an operational ceiling. Pick it up only after Phase 08 has shipped and field experience confirms the 4-client cap is a real production problem (not just a theoretical one).
- **Phase 10** plugs into Phase 09's `InterestedParties` seam to coalesce same-key FC03/04 reads within the in-flight window. Zero post-response staleness. Worth doing only if field telemetry shows meaningful read overlap (≥ 2× duplicate-read traffic from concurrent HMIs / historians).
- **Phase 11** extends the "served without backend traffic" window from in-flight microseconds (Phase 10) to operator-configurable seconds via a per-tag TTL response cache. **This is a deliberate design-contract pivot** — the proxy stops being purely transparent and becomes an opt-in cache layer with bounded staleness. The cache is OFF by default; opting tags in is the operator's explicit acknowledgement of the staleness window. Pick up only if Phase 10's coalescing-ratio under real load reveals enough cross-poll overlap to justify staleness as a trade.
## Working with subagents
### Default: one subagent per phase, sequential
Spawn one Agent (Sonnet or Opus) per phase in order. Each agent reads exactly:
- Its own phase doc (under this directory).
- [`../design.md`](../design.md) — architecture, the source of truth.
- [`../../DL260/dl205.md`](../../DL260/dl205.md) — device quirks.
That is sufficient context. The agent must NOT invent scope beyond the phase doc's "Outputs" section. If it discovers a design-affecting issue, it must STOP and surface the issue rather than improvise — designs change in [`../design.md`](../design.md), not silently in code.
### Advanced: parallel subagents within a single phase boundary
Two phases marked "Parallel-safe with" each other can be picked up by independent subagents at the same time. The only safe parallel windows in this plan are:
- **Phase 01 ∥ Phase 02** (sim harness lives in `tests/sim/`, codec lives in `src/Mbproxy/Bcd/` — fully disjoint).
- **Phase 02 ∥ Phase 03** (codec is pure logic in `src/Mbproxy/Bcd/`; plumbing is in `src/Mbproxy/Proxy/` — disjoint).
- **Phase 01 ∥ Phase 02 ∥ Phase 03** (running all three at once is also safe; each touches a different directory).
**Required pattern:**
1. Spawn each parallel agent with `isolation: "worktree"` (Agent tool's worktree mode creates an isolated git checkout).
2. Each agent gets ONE phase doc + design.md + dl205.md.
3. Each agent runs its phase gate locally before its worktree is committed.
4. Merge order: lower phase number first. Resolve conflicts manually if the agents drifted outside their declared output scope (which they shouldn't).
5. After merge, re-run the phase 00 smoke test plus both merged phases' tests to confirm no integration regression.
**Hard rules — anti-patterns that break parallel work:**
- ❌ Any two phases editing the same `.csproj` PackageReference list at the same time. Phase 00 owns the initial csproj; later phases append PackageReferences atomically and a parallel pair must coordinate via separate `<ItemGroup>` blocks or sequential merges.
- ❌ Running phase 04 in parallel with anything (it integrates two prior phases — by definition it touches their outputs).
- ❌ Running phase 06 in parallel with anything (the hot-reload reconcile inspects state from listener supervisor + rewriter + counters; it has the widest cross-cut).
- ❌ Spawning more than 3 concurrent worktree agents (review/merge overhead grows superlinearly and the value disappears).
## Phase gate template
Every phase MUST be green on all of these before its branch is merged:
1. **Build is clean.** `dotnet build src/Mbproxy/Mbproxy.csproj -c Debug` with **zero warnings**. `<TreatWarningsAsErrors>true</TreatWarningsAsErrors>` is set in phase 00 and stays set forever.
2. **All unit tests pass.** `dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter "Category!=E2E&Category!=Stress"` is green. Stress tests are opt-in only, so the filter excludes them along with E2E.
3. **E2E tests pass when the simulator is available.** `dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category=E2E --blame-hang-timeout 2m` is green on a machine with Python + pymodbus installed. The `--blame-hang-timeout` is mandatory — never run E2E without it. Skipped tests (due to missing simulator) don't count as failures, but ANY test added in this phase must NOT skip when the sim IS available, and every E2E test MUST carry a `[Fact(Timeout = …)]` per the Test discipline rules below.
4. **No regressions in any prior phase's tests.** The full suite stays green.
5. **No new public types beyond what the phase doc declares.** Scope creep is a gate fail. If a needed type is missing from the doc, update the doc first.
6. **No `TODO` / `FIXME` / `HACK` comments committed.** Either resolve or file in the [Deferred](#deferred) section below.
7. **Design / docs are in sync.** If a design decision changed during the phase, [`../design.md`](../design.md) is updated in the same PR. Mirror the change to [`../../CLAUDE.md`](../../CLAUDE.md)'s Architecture summary only if it shifts one of the headline bullets.
8. **Phase doc itself is updated** to reflect any clarifications discovered during implementation, so the next subagent picking up the project doesn't relearn what this one learned.
## Test discipline
- **Framework:** xUnit (v3 if available, v2 otherwise) + **Shouldly** for assertions. Never `Assert.Equal(x, y)` — always `y.ShouldBe(x)`. Never `Assert.True(p)` — always `p.ShouldBeTrue("reason")`.
- **Categories:** `[Trait("Category", "Unit")]` (default; no traits needed), `[Trait("Category", "E2E")]` (needs simulator), `[Trait("Category", "Stress")]` (slow / load-bearing — opt-in only).
- **No mocks for code we own.** Exercise our types directly. Mock only at the network/file/process boundary — and prefer a real local socket / real temp file over a mock when feasible.
- **Test naming:** `MethodOrScenario_Condition_ExpectedOutcome`. Example: `BcdCodec_Decode16_Returns1234_For0x1234`.
- **One assertion per test where reasonable.** Multi-assertion tests are acceptable when they assert facets of the same scenario; never when they're really separate tests glued together.
- **Every `[Trait("Category","E2E")]` test MUST declare a hard timeout** via `[Fact(Timeout = N)]` (xUnit v3, milliseconds). **Default: `5_000` ms.** Expand per-test only when the test genuinely needs longer (concurrent bursts > 100 ops, reload-propagation debounce, graceful-shutdown drain) — and add a one-line comment explaining why. Start tight; raise only when a real test fails with a non-deadlock reason. Reason this matters: the existing fixtures use synchronous NModbus calls and stub TCP servers that **do not honor `TestContext.Current.CancellationToken`** — without `[Fact(Timeout=…)]`, a deadlock in the proxy hangs the runner indefinitely. The same rule applies to `[Trait("Category","Stress")]`. Unit tests are exempt unless they touch real sockets or processes.
- **Run E2E with a hang backstop.** The phase gate's E2E command is `dotnet test ... --filter Category=E2E --blame-hang-timeout 2m`. The `--blame-hang-timeout` is a process-level safety net in case a test's individual `Timeout` somehow doesn't fire (e.g. an unmanaged thread blocking finalization).
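
The rules above combine into a test shaped roughly like the following. This is a sketch that assumes xUnit v3 and Shouldly package references; `TestDisciplineSketch` and `ReadRegisterAsync` are hypothetical stand-ins, and a real E2E test would go through the proxy and simulator fixture rather than a canned value.

```csharp
using System.Threading.Tasks;
using Shouldly;
using Xunit;

public class TestDisciplineSketch
{
    // Stand-in for a read through the proxy; hypothetical, for illustration only.
    private static Task<ushort> ReadRegisterAsync(ushort address) =>
        Task.FromResult((ushort)1234);

    // Name follows MethodOrScenario_Condition_ExpectedOutcome.
    [Fact(Timeout = 5_000)]          // hard per-test timeout, in milliseconds
    [Trait("Category", "E2E")]
    public async Task ReadHoldingRegister_BcdTag_ReturnsDecodedValue()
    {
        ushort actual = await ReadRegisterAsync(1072);

        // Shouldly style: actual.ShouldBe(expected), never Assert.Equal.
        actual.ShouldBe((ushort)1234);
    }
}
```

The test body is `async` because xUnit applies `Timeout` to asynchronous tests. A test that needs more than the 5-second default would raise the value and carry a one-line comment saying why.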
## Deferred
A running list of things explicitly NOT done in any current phase. When a phase reveals one, add it here so it isn't forgotten and so the deferral is visible at review time:
- *(none yet)*
## Cross-references
- Architecture and load-bearing decisions: [`../design.md`](../design.md)
- Device quirks the proxy must respect: [`../../DL260/dl205.md`](../../DL260/dl205.md)
- pymodbus simulator profile that backs e2e tests: [`../../DL260/dl205.json`](../../DL260/dl205.json)
- As-deployed PLC parameters (port 502, BCD-by-default, swap bytes, etc.): [`../../DL260/mbtcp_settings.JPG`](../../DL260/mbtcp_settings.JPG)